Publication
ASAP 2014
Conference paper
Resource-efficient regular expression matching architecture for text analytics
Abstract
Text analytics systems, such as IBM's SystemT software, rely on regular expressions (regexes) and dictionaries for transforming unstructured data into a structured format. Unlike network intrusion detection systems, text analytics systems compute and report precisely where specific and sensitive information starts and ends in a text document. Therefore, advanced regex matching functions, such as start-offset reporting, capturing groups, and leftmost match computation, are heavily used in text analytics systems. We present a novel regex matching architecture that supports such functions in a resource-efficient way. The resource efficiency is achieved by 1) eliminating state replication, 2) avoiding expensive offset comparison operations in leftmost match computation, and 3) minimizing the number of offset registers. Experiments on regex sets from the text analytics and network intrusion detection domains, using an Altera Stratix IV FPGA, show that the proposed architecture achieves a more than threefold reduction in the logic resources used and a more than 1.25-fold increase in clock frequency with respect to a recently proposed architecture that supports identical features. © 2014 IEEE.
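
For readers unfamiliar with the matching functions named in the abstract, the following is a minimal software sketch of what start-offset reporting, capturing groups, and leftmost match computation mean. It uses Python's standard `re` module and illustrates only the semantics of these functions; it is not the hardware architecture described in the paper. The pattern, the sample text, and the helper name `leftmost_match` are assumptions made up for illustration.

```python
import re

# Hypothetical pattern and text, chosen only for illustration.
# The two parenthesized parts are capturing groups.
PHONE = re.compile(r"(\d{3})-(\d{4})")
TEXT = "Call 555-1234 or 555-9876."


def leftmost_match(pattern, text):
    """Return start/end offsets of the leftmost match and of each capturing group."""
    # search() scans left to right and returns the leftmost match, if any.
    m = pattern.search(text)
    if m is None:
        return None
    return {
        # (start, end) offsets of the whole match; end is exclusive.
        "span": m.span(),
        # (start, end) offsets of each capturing group, in order.
        "groups": [m.span(i) for i in range(1, pattern.groups + 1)],
    }


result = leftmost_match(PHONE, TEXT)
print(result)  # {'span': (5, 13), 'groups': [(5, 8), (9, 13)]}
```

Here the second candidate, `555-9876`, is not reported because only the leftmost match is requested. A network intrusion detection system would typically need to know only that some match occurred, whereas the offsets above are what a text analytics system uses to extract the matched substring.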