Resource-efficient regular expression matching architecture for text analytics

Kubilay Atasu

doi:10.1109/ASAP.2014.6868623

ASAP 2014

Conference paper

18 Jun 2014

Abstract

Text analytics systems, such as IBM's SystemT software, rely on regular expressions (regexs) and dictionaries for transforming unstructured data into a structured format. Unlike network intrusion detection systems, text analytics systems compute and report precisely where the specific and sensitive information starts and ends in a text document. Therefore, advanced regex matching functions, such as start-offset reporting, capturing groups, and leftmost match computation, are heavily used in text analytics systems. We present a novel regex matching architecture that supports such functions in a resource-efficient way. The resource efficiency is achieved by 1) eliminating state replication, 2) avoiding expensive offset comparison operations in leftmost match computation, and 3) minimizing the number of offset registers. Experiments on regex sets from text analytics and network intrusion detection domains, using an Altera Stratix IV FPGA, show that the proposed architecture achieves a more than threefold reduction of the logic resources used and a more than 1.25-fold increase of the clock frequency with respect to a recently proposed architecture that supports identical features. © 2014 IEEE.
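
The matching functions named in the abstract (start-offset reporting, capturing groups, and leftmost match computation) can be illustrated in software. The sketch below uses Python's standard `re` module and is not the hardware architecture described in the paper. The helper `report_matches`, the sample text, and the pattern are assumptions made purely for illustration.

```python
import re


def report_matches(pattern, text):
    """Scan text left to right and report, for each non-overlapping match,
    its start offset, its end offset, and the span of every capturing group."""
    results = []
    for m in re.finditer(pattern, text):
        # m.re.groups is the number of capturing groups in the pattern.
        group_spans = [m.span(i) for i in range(1, m.re.groups + 1)]
        results.append((m.start(), m.end(), group_spans))
    return results


# Hypothetical input: two phone-like tokens in a short sentence.
text = "Call 555-1234 or 555-9876."
pattern = r"(\d{3})-(\d{4})"

matches = report_matches(pattern, text)

# re.search returns the leftmost match, i.e. the one with the smallest start offset.
leftmost = re.search(pattern, text)

for start, end, groups in matches:
    print(start, end, groups)
```

Here each reported tuple carries both the start and the end offset of a match, which is the information a text analytics system needs and which a detector that only flags the presence of a match does not provide.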