Publication
SIGMOD 2011
Conference paper

The SystemT IDE: An integrated development environment for information extraction rules

View publication

Abstract

Information Extraction (IE)-the problem of extracting structured information from unstructured text - has become the key enabler for many enterprise applications such as semantic search, business analytics and regulatory compliance. While rule-based IE systems are widely used in practice due to their well-known "explainability," developing high-quality information extraction rules is known to be a labor-intensive and time-consuming iterative process. Our demonstration showcases SystemT IDE, the integrated development environment for SystemT, a state-of-the-art rule-based IE system from IBMResearch that has been successfully embedded in multiple IBM enterprise products. SystemT IDE facilitates the development, test and analysis of high-quality IE rules by means of sophisticated techniques, ranging from data management to machine learning. We show how to build high-quality IE annotators using a suite of tools provided by SystemT IDE, including computing data provenance, learning basic features such as regular expressions and dictionaries, and automatically refining rules based on labeled examples. © 2011 Authors.