Token-based dictionary pattern matching for text analytics

Raphael Polig; Kubilay Atasu; Christoph Hagleitner

doi:10.1109/FPL.2013.6645535

FPL 2013

Conference paper

02 Sep 2013

Abstract

When running text analytics queries on unstructured text, a large share of the processing time is spent on regular expression and dictionary matching. In this paper we present a compilable architecture for token-bound pattern matching with support for detecting sequences of token patterns. The architecture can match against several hundred dictionaries, each containing thousands of elements, at high throughput. A programmable state machine serves as the pattern detection engine, achieving deterministic performance while keeping storage requirements low. For the detection of token sequences, dedicated circuitry is compiled based on a non-deterministic automaton. A cascaded result lookup ensures efficient storage while allowing multi-token elements to be detected and multiple dictionary hits to be reported. We implemented the architecture on an Altera Stratix IV GX530 and were able to process up to 16 documents in parallel at a peak throughput of 9.7 Gb/s. © 2013 IEEE.
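
To make the idea of token-bound dictionary matching concrete, the following is a minimal software sketch in Python. It is not the hardware architecture described in the paper. It only illustrates matching that starts and ends on token boundaries, multi-token dictionary elements, and several dictionaries reporting a hit for the same span. The tokenizer, the trie layout, and the names `tokenize`, `build_trie`, and `match` are assumptions made for illustration.

```python
import re

# Hypothetical tokenizer: a token is any run of word characters.
TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Split text into lowercase word tokens."""
    return [m.group(0).lower() for m in TOKEN_RE.finditer(text)]


def new_node():
    """One trie node: outgoing token edges and the dictionaries ending here."""
    return {"next": {}, "dicts": set()}


def build_trie(dictionaries):
    """Build a token-level trie from {dictionary_name: [entries]}.

    Each entry may span several tokens. A node's "dicts" set records every
    dictionary that contains the token sequence leading to that node.
    """
    root = new_node()
    for name, entries in dictionaries.items():
        for entry in entries:
            node = root
            for tok in tokenize(entry):
                node = node["next"].setdefault(tok, new_node())
            node["dicts"].add(name)
    return root


def match(text, root):
    """Return (start_token, end_token, dictionary) for every hit.

    Matching only starts and ends on token boundaries, so an entry never
    matches inside a longer word.
    """
    tokens = tokenize(text)
    hits = []
    for start in range(len(tokens)):
        node = root
        for end in range(start, len(tokens)):
            node = node["next"].get(tokens[end])
            if node is None:
                break
            for name in sorted(node["dicts"]):
                hits.append((start, end, name))
    return hits


# Illustrative data, not taken from the paper.
dictionaries = {
    "cities": ["New York", "York"],
    "states": ["New York"],
}
trie = build_trie(dictionaries)
print(match("She moved to New York.", trie))
# prints [(3, 4, 'cities'), (3, 4, 'states'), (4, 4, 'cities')]
```

In this sketch the span covering tokens 3 to 4 ("new york") is reported by two dictionaries, and the single-token entry "york" is reported separately, which mirrors the multi-token and multiple-hit reporting mentioned in the abstract. The restart-at-every-token loop is chosen for brevity and does not give the deterministic per-character performance that the abstract attributes to the programmable state machine.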