New IBM Research NLP enhancements coming to IBM Watson Discovery

The field of natural language processing (NLP) for business is burgeoning with new discoveries and advancements. Many companies are developing solutions that allow enterprises to extract meaningful insights from textual data.

IBM Research’s efforts in NLP over the years have been noteworthy. From IBM Watson’s historic win on Jeopardy! in 2011, to our release of IBM Project Debater in 2018 and the system’s advancements since, we constantly pursue innovation in NLP to make it easier to understand the language of businesses. Our focus areas include:

Extracting meaning from complex, multi-format documents: Enterprise information typically resides in PDF documents, which are notoriously difficult to process because they are designed for printing and reading, not machine consumption. To extract meaning from PDFs, we first convert them from low-level features (e.g., characters and graphic lines) into a form that captures meaningful document structure (e.g., titles, sections, headers and lists) and other key information, like tables, diagrams, charts and figures. We then develop advanced NLP solutions to extract important information from these complex documents.
Creating multi-lingual systems: Unlike numbers and images, language varies from country to country and even between regions of the same country. As a result, enterprise NLP solutions must work in many languages, without retraining each time they encounter a new language.
Empowering subject matter experts: To be effective, NLP solutions must capture the knowledge of an organization’s lawyers, customer support agents, marketers, HR employees and other professionals. Tools that enable these subject matter experts (SMEs) to customize NLP are critical because most companies do not have access to NLP experts. Even when they do, those experts are often not familiar with the business’s specific semantics.

The magic of IBM Research in Watson Discovery

Today, we’re excited to announce that key areas of these NLP research efforts—including smart document understanding, advanced pattern detection, and advanced customization of NLP models—are being infused into IBM Watson Discovery, a platform applying the latest in AI and NLP to retrieve business-critical insights from documents. These new capabilities include:

Pre-trained document structure understanding

At IBM Research, we’re working to create models that have a deeper understanding of complex documents, including documents with low image quality and multi-format content such as tables, charts, and pictures. More on our recent efforts in this space can be read here.

Drawing on this, Watson Discovery’s Smart Document Understanding feature now includes a new pre-trained model that automatically understands the visual structure and layout of a document without additional training from a developer or data scientist.

Automatic text pattern detection

When analyzing large numbers of documents that hold similar information in different formats and phrasing, a business SME must be able to identify business-specific text patterns quickly and effectively across documents.

Originating in IBM Research, a new beta in Watson Discovery’s Enterprise and Plus plans allows users to quickly identify business-specific text patterns within their documents, such as key financial metrics mentioned throughout an annual report. Specifically, it makes labeling data more efficient, trains models from as few as two examples, and then refines them based on user feedback. To date, this feature exhibits 70-90% accuracy¹ when learning to capture text patterns in a model.
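To make the idea of learning a text pattern from two examples concrete, here is a minimal Python sketch. It is not Watson Discovery’s method or API: the function names (tokenize, token_shape, induce_pattern) and the sample sentences are hypothetical, and the approach is a deliberately simple generalization into a regular expression, assuming both examples have the same number of tokens.

```python
import re

# Hypothetical token classes used only for this illustration.
NUMBER = r"\d+(?:[.,]\d+)*"
WORD = r"[A-Za-z]+"


def tokenize(text):
    """Split text into numbers, words, and single punctuation symbols."""
    return re.findall(rf"{NUMBER}|{WORD}|\S", text)


def token_shape(token):
    """Return a regex fragment describing the general shape of a token."""
    if re.fullmatch(NUMBER, token):
        return NUMBER
    if re.fullmatch(WORD, token):
        return WORD
    return re.escape(token)


def induce_pattern(examples):
    """Generalize same-length examples into one compiled regex.

    Tokens that are identical across all examples stay literal;
    tokens that differ are replaced by their shared shape.
    """
    token_lists = [tokenize(example) for example in examples]
    if len({len(tokens) for tokens in token_lists}) != 1:
        raise ValueError("this sketch assumes examples with equal token counts")
    parts = []
    for column in zip(*token_lists):
        if len(set(column)) == 1:
            parts.append(re.escape(column[0]))
        else:
            shapes = {token_shape(token) for token in column}
            if len(shapes) != 1:
                raise ValueError("tokens in one position have different shapes")
            parts.append(shapes.pop())
    return re.compile(r"\s*".join(parts))


# Two labeled examples, as an SME might highlight them in a report.
pattern = induce_pattern(["revenue of $4.2 billion", "revenue of $17.6 billion"])
report = "The company reported revenue of $73.6 billion and net income of $5.7 billion."
matches = pattern.findall(report)
print(matches)
```

In this sketch, user feedback would take the form of additional positive examples passed to induce_pattern, which widens or narrows the literal parts of the pattern.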

Looking forward, we are working to expand this capability to support more complex NLP tasks, beyond extraction based on simple text patterns. We are also building a simulation-based evaluation framework to systematically evaluate and improve such NLP customization capabilities at scale.

Advanced NLP customization capabilities

Training NLP models to identify highly customized, business-specific words and phrases is a time-consuming task that requires significant data prep, labeling, and orchestration. Models trained on generic data sets often fail to retrieve the right information.

With a new custom entity extractor feature now available in beta, IBM is simplifying this process by reducing the effort for data prep, simplifying labeling with active learning and bulk annotation capabilities, and enabling simple model deployment that can accelerate training time.
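As a rough illustration of how active learning can reduce labeling effort, the Python sketch below uses uncertainty sampling, a common generic technique: the candidates the current model is least sure about are sent to the SME first. The function name, the scores, and the candidate phrases are all hypothetical, and nothing here describes how the custom entity extractor is actually implemented.

```python
def select_for_labeling(scores, batch_size=2):
    """Pick the candidates whose predicted probability is closest to 0.5.

    scores maps a candidate phrase to the current model's estimated
    probability that the phrase is an entity of interest (assumed input).
    """
    ranked = sorted(scores, key=lambda phrase: abs(scores[phrase] - 0.5))
    return ranked[:batch_size]


# Hypothetical model scores for unlabeled candidate phrases.
candidate_scores = {
    "Q3 EBITDA": 0.52,
    "the": 0.02,
    "IBM": 0.97,
    "net margin": 0.41,
}
to_label = select_for_labeling(candidate_scores)
print(to_label)
```

Confident predictions such as "the" and "IBM" are skipped, so the SME’s time goes to the ambiguous cases that are most likely to change the model.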

Raising the bar for enterprise NLP

While the new capabilities in Watson Discovery aim to advance AI-powered enterprise search technology for businesses, they also demonstrate IBM’s commitment to cultivating and applying NLP-based technologies that are best suited to business requirements.

The capabilities released today in Watson Discovery are part of a wider IBM Research initiative to make NLP customization easy and intuitive, so that SMEs without a data science background can adapt NLP models to their needs effectively, without special training or complex layers of guidance.

Additionally, the latest Watson Discovery customization features address a key element of trust in a system’s process and outputs. Every company, industry, and use case has its own language made up of a unique taxonomy of concepts and relationships, commonly accepted short forms and abbreviations, accepted patterns of communication and challenges.

Given this, these systems must be easily adaptable to understand and process these individualized needs. When a system demonstrates that it understands these nuances, we can better trust its outputs and decisions.

New IBM Research NLP work front-and-center at EMNLP 2021

IBM Research this week is presenting a number of papers, demos and workshops at the annual Empirical Methods in Natural Language Processing (EMNLP) Conference showcasing the latest in NLP innovation. This includes:

Multi-Domain Multilingual Question Answering: New techniques that can enhance our ability to handle enterprise use cases, such as understanding complex tables in business documents, and being able to retrieve answers that appear in tables embedded in these documents.² An EMNLP tutorial detailing more of this work can be viewed here.
Knowledge induction techniques to help in slot-filling use cases: In addition to SMEs, a significant portion of an organization’s IP exists in the form of knowledge hidden in the organization’s documents and content systems. As a result, enterprise NLP solutions require additional processing to integrate this pre-existing information with the processing of textual content. New research³,⁴ we’re presenting at EMNLP in this area can be read in two papers on arXiv, here and here.
Combining classical and deep learning: Research detailing the potential of newer AI techniques,⁵ like the combination of deep learning and symbolic reasoning to create neuro-symbolic AI systems.

Learn more about:

Natural Language Processing: We’re building advanced AI systems that can parse vast bodies of text to help unlock that data, but also ones flexible enough to be applied to any language problem.

Speech: As more of the world moves online, the demand for systems that can understand users and speak to them in natural language is growing exponentially.

References

1. Hanafi, M., Abouzied, A., Chiticariu, L., Li, Y. SEER: Auto-Generating Information Extraction Rules from User-Specified Examples. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (CHI '17). Association for Computing Machinery, New York, NY, USA, 6672–6682. (2017).
2. Chemmengath, S., Kumar, V., Bharadwaj, S., et al. Topic Transferable Table Question Answering. arXiv. (2021).
3. Glass, M., Rossiello, G., Chowdhury, F., Gliozzo, A. Robust Retrieval Augmented Generation for Zero-shot Slot Filling. arXiv. (2021).
4. Dash, S., Rossiello, G., Mihindukulasooriya, N., et al. Open Knowledge Graphs Canonicalization using Variational Autoencoders. arXiv. (2021).
5. Kimura, D., Ono, M., Chaudhury, S., et al. Neuro-Symbolic Reinforcement Learning with First-Order Logic. arXiv. (2021).
