Feature-rich Regular Expression Matching Accelerator for Text Analytics
The volume of textual data available worldwide is increasing every day. Extracting the information hidden in this “Big Data” is a computationally intensive task. A key step of information extraction is the conversion of free text into a structured format, which is typically achieved using regular expressions (regexes) and dictionaries. Unlike network intrusion detection systems, information extraction systems must detect and report precisely where the relevant information starts and ends within text documents. To improve precision and to eliminate ambiguity, regex matchers used in information extraction systems must support start and end offset reporting, capturing groups, and specific regex-matching semantics, such as leftmost matching. This work describes a scalable regex-matching accelerator that supports these advanced regex-matching features and can be efficiently implemented in reconfigurable logic. Experiments on proprietary and open-source regex sets comprising hundreds of regexes demonstrate up to a sixfold improvement in area-delay product over previous work.
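The matching features named above can be illustrated in software. The following is a minimal sketch using Python's built-in `re` module, which is a backtracking software matcher and not the accelerator described in this work; the sample text, the pattern, and the helper names `extract` and `leftmost_span` are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical sample document and pattern, for illustration only.
TEXT = "Contact: Ada Lovelace <ada@example.org>, Alan Turing <alan@example.org>"
PATTERN = re.compile(r"(?P<name>[A-Z][a-z]+ [A-Z][a-z]+) <(?P<email>[^>]+)>")


def extract(pattern, text):
    """Turn free text into structured records with offsets.

    For every match, report the start/end offsets of the whole match
    and of each capturing group, not just the fact that a match occurred.
    """
    records = []
    for m in pattern.finditer(text):
        records.append({
            "span": m.span(),                               # (start, end) of the match
            "name": (m.group("name"), m.span("name")),      # captured text and offsets
            "email": (m.group("email"), m.span("email")),
        })
    return records


def leftmost_span(pattern, text):
    """Return the offsets of the match that starts earliest in the text."""
    m = re.search(pattern, text)
    return m.span() if m else None


if __name__ == "__main__":
    for record in extract(PATTERN, TEXT):
        print(record)
    # Both alternatives can match inside "xabbb"; the one starting
    # furthest to the left is the one that gets reported.
    print(leftmost_span(r"b+|ab", "xabbb"))
```

Each record carries the offsets needed to locate the extracted fields in the source document, which is the kind of output a pure match/no-match detector does not provide.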