A high-speed and large-scale dictionary matching engine for Information Extraction systems

Kanak Agarwal; Raphael Polig

doi:10.1109/ASAP.2013.6567551

ASAP 2013

Conference paper

09 Sep 2013

Abstract

Dictionary matching is a commonly used operation in Information Extraction (IE) systems. It involves matching a set of strings in a document against a dictionary of pre-defined patterns. In this paper, we describe a high-performance, scalable hardware architecture that enables high-throughput dictionary matching on very large dictionaries for text analytics applications. Our hardware accelerator employs a novel hashing-based approach instead of the commonly used deterministic finite automata (DFA) based algorithms. A limitation of DFA-based approaches is that they typically process one character per cycle, whereas the proposed hash-based scheme can process a string token every cycle, thus achieving significantly higher throughput than DFA-based implementations. Measurement results from a prototype implementation on an Altera Stratix IV FPGA device indicate that our hardware dictionary matching engine can process typical document streams at ∼1.5 GB/s (∼12 Gbps) while supporting large dictionaries containing up to ∼100K patterns, making it very useful for IE workload acceleration. © 2013 IEEE.
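
The contrast between per-character and per-token processing can be illustrated with a small software sketch. This is only an analogy under stated assumptions and not the hardware architecture of the paper: the function names, the `\w+` tokenizer, the case-insensitive comparison, and the restriction to single-token patterns are all hypothetical choices made for illustration. The trie walker stands in for a DFA-style matcher that advances one character per step, while the hash matcher performs one set lookup per token.

```python
import re


def tokenize(text):
    """Yield (offset, token) for each run of word characters (assumed tokenizer)."""
    for m in re.finditer(r"\w+", text):
        yield m.start(), m.group()


def build_hash_dictionary(patterns):
    """Store single-token patterns in a hash set, lower-cased."""
    return {p.lower() for p in patterns}


def match_by_token(text, dictionary):
    """Hash-based matching: one lookup (one step) per token."""
    hits, steps = [], 0
    for offset, token in tokenize(text):
        steps += 1
        if token.lower() in dictionary:
            hits.append((offset, token))
    return hits, steps


def build_trie(patterns):
    """Build a character trie; the multi-character key '$end' marks a full pattern."""
    root = {}
    for p in patterns:
        node = root
        for ch in p.lower():
            node = node.setdefault(ch, {})
        node["$end"] = True
    return root


def match_by_character(text, trie):
    """Automaton-style matching: at most one step per character of each token."""
    hits, steps = [], 0
    for offset, token in tokenize(text):
        node = trie
        for ch in token.lower():
            steps += 1
            node = node.get(ch)
            if node is None:
                break  # no pattern continues with this character
        else:
            # All characters consumed; it is a hit only if a pattern ends here.
            if "$end" in node:
                hits.append((offset, token))
    return hits, steps


if __name__ == "__main__":
    patterns = ["IBM", "Altera", "FPGA", "Xilinx"]  # illustrative dictionary
    text = "IBM built the engine on an Altera FPGA"
    token_hits, token_steps = match_by_token(text, build_hash_dictionary(patterns))
    char_hits, char_steps = match_by_character(text, build_trie(patterns))
    print(token_hits, token_steps)
    print(char_hits, char_steps)
```

Both matchers report the same hits, but the token-based version takes one step per token, whereas the character-based version takes more steps as matching tokens get longer. In hardware the step corresponds to a clock cycle, which is the source of the throughput advantage the abstract describes.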
