12 new Project Debater AI technologies available as cloud APIs

In “An autonomous debating system,” published in Nature, we explain the Project Debater architecture and evaluate its performance.

Argumentation and debating are fundamental capabilities of human intelligence. Until recently, they were out of reach of AI.

In February 2019, after six years of work by natural language processing and machine learning researchers and engineers, an IBM AI dubbed Project Debater became the first AI system able to debate humans on complex topics.

And while it may not have ‘won’ the sparring against debate champion Harish Natarajan in San Francisco that year, Project Debater demonstrated how AI could help people build persuasive arguments and make well-informed decisions. The AI became the third in the series of IBM Research AI’s grand challenges, following Deep Blue and Watson.

In our recent paper “An autonomous debating system” published in Nature,¹ we describe Project Debater’s architecture and evaluate its performance. We also offer free access for academic use to 12 of Project Debater’s underlying technologies as cloud APIs, as well as trial and licensing options for developers.

To debate humans, an AI must be equipped with certain skills. It has to be able to pinpoint relevant arguments for a given debate topic in a massive corpus, detect the stance of arguments and assess their quality. It also has to identify general, recurring arguments that are relevant for the specific topic, organize the different types of arguments into a compelling narrative, recognize the arguments made by the human opponent, and make a rebuttal. And it has to be able to use competitive debate techniques, such as asking the opponent questions to frame the discussion in a way that favors its position.

This is exactly what we’ve done with Project Debater. It’s been developed as a collection of components, each designed to perform a specific subtask. Over the years, we have published more than 50 papers describing these components and released many related datasets for academic use.

Building debating skills

To engage in a debate successfully, a machine requires a high level of accuracy from each component. For example, failing to detect an argument’s stance may result in arguing in favor of your opponent – a dire situation in a debate.

This is why it was crucial for us to collect uniquely large-scale, high-quality labeled training datasets for Project Debater. The evidence detection classifier, for instance, was trained on 200,000 labeled examples, and achieved a remarkable precision of 95 percent for top 40 candidates.²
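To make the "precision for top 40 candidates" figure concrete, here is a minimal sketch of how precision at k is typically computed. The function name and the label list are illustrative assumptions, not Project Debater code or data; the labels are simply chosen so that 38 of the top 40 candidates are correct.

```python
def precision_at_k(ranked_labels, k):
    """Fraction of the top-k ranked candidates that are labeled correct (1)."""
    if k <= 0:
        raise ValueError("k must be positive")
    top = ranked_labels[:k]
    if not top:
        return 0.0
    return sum(top) / len(top)


# Hypothetical labels for candidates sorted by classifier score, best first
# (1 = valid evidence, 0 = not evidence). Not real Project Debater data.
labels = [1] * 38 + [0] * 2 + [1, 0, 0, 1]

print(precision_at_k(labels, 40))  # 0.95
```

Only the candidates ranked in the top k count toward the score, which is why the metric suits a system that presents just its highest-confidence evidence.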

Another major challenge was scalability. For example, we had to apply “wikification” (identifying mentions of Wikipedia concepts) to our 10 billion-sentence corpus – an impossible task for any existing wikification tool. So, we developed a new, fast wikification algorithm that could be applied to massive corpora and still achieve competitive accuracy.³
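As a rough illustration of what wikification means, the sketch below links mentions in a sentence to Wikipedia concept titles using a tiny lookup table and greedy longest-match. This is a toy example under stated assumptions: the dictionary, the function name and the matching strategy are all hypothetical, and it is not the fast algorithm described in the paper.

```python
import re

# Hypothetical surface-form dictionary: lowercase mention -> Wikipedia title.
MENTION_TO_CONCEPT = {
    "nuclear power": "Nuclear power",
    "power": "Power (physics)",
    "climate change": "Climate change",
}
MAX_MENTION_TOKENS = max(len(m.split()) for m in MENTION_TO_CONCEPT)


def wikify(sentence):
    """Return (mention, concept) pairs found by greedy longest-match lookup."""
    tokens = re.findall(r"\w+", sentence.lower())
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest possible span first, then shorter ones.
        for n in range(min(MAX_MENTION_TOKENS, len(tokens) - i), 0, -1):
            span = " ".join(tokens[i:i + n])
            if span in MENTION_TO_CONCEPT:
                found.append((span, MENTION_TO_CONCEPT[span]))
                i += n
                break
        else:
            # No mention starts at this token; move on.
            i += 1
    return found


print(wikify("Nuclear power can help fight climate change."))
# [('nuclear power', 'Nuclear power'), ('climate change', 'Climate change')]
```

Preferring the longest match keeps "nuclear power" from being linked as the more generic "power". A production system also has to disambiguate mentions that map to several concepts, which this sketch ignores.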

Project Debater’s APIs give access to different capabilities originally developed for the live debating system, as well as related technologies we have developed more recently. The APIs include natural language understanding capabilities that deal with wikification, semantic relatedness between Wikipedia concepts,⁴ short text clustering, and common theme extraction for texts.

The core set of APIs relates to services for argument mining and analysis. These services include detection of sentences containing claims and evidence,² detection of claim boundaries in a sentence, argument quality assessment and stance classification (Pro/Con).⁵, ⁶, ⁷

Then there are APIs for two high-level services that create different kinds of summaries, Narrative Generation and Key Point Analysis. When given a set of arguments, Narrative Generation constructs a well-structured speech that supports or contests a given topic, according to the specified polarity.
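The sketch below shows one way stance and quality labels could feed a speech with a given polarity: keep only arguments whose stance matches, then rank them by quality. It is a deliberately simplified assumption about how such outputs might be combined. The `Argument` class, its fields and `select_for_speech` are hypothetical names, not part of the Project Debater APIs.

```python
from dataclasses import dataclass


@dataclass
class Argument:
    text: str
    stance: str      # "pro" or "con" relative to the topic
    quality: float   # hypothetical quality score in [0, 1]


def select_for_speech(arguments, polarity, top_n=2):
    """Keep arguments matching the requested polarity, best quality first."""
    matching = [a for a in arguments if a.stance == polarity]
    return sorted(matching, key=lambda a: a.quality, reverse=True)[:top_n]


# Made-up example arguments, for illustration only.
arguments = [
    Argument("Subsidies widen access to preschool.", "pro", 0.9),
    Argument("Subsidies strain public budgets.", "con", 0.8),
    Argument("Early education improves later outcomes.", "pro", 0.7),
    Argument("Preschool is nice.", "pro", 0.2),
]

for arg in select_for_speech(arguments, "pro"):
    print(arg.text)
```

The filter step reflects the point made earlier about stance errors: an argument with the wrong stance label would end up in the wrong speech.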

And Key Point Analysis is a new and promising approach for summarization, with an important quantitative angle. This service summarizes a collection of comments on a given topic as a small set of key points, and the prominence of each key point is given by the number of its matching sentences in the given comments.⁸, ⁹
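The quantitative side of this idea can be sketched in a few lines: once each sentence has been matched to a key point (or to none), prominence is simply a count. The matching itself is the hard part and is assumed to be already done here. The function name and the sample data are hypothetical and do not reflect the service's actual interface.

```python
from collections import Counter


def key_point_prominence(matches):
    """Count matched sentences per key point, most prominent first.

    `matches` is a list of (sentence, key_point) pairs, where key_point
    is None when the sentence matched no key point.
    """
    counts = Counter(kp for _, kp in matches if kp is not None)
    return counts.most_common()


# Made-up sentence-to-key-point matches, for illustration only.
matches = [
    ("Streaming pays artists too little.", "Artists are underpaid"),
    ("Royalties per play are tiny.", "Artists are underpaid"),
    ("I found new bands through playlists.", "Streaming aids discovery"),
    ("Musicians can't live on streams.", "Artists are underpaid"),
    ("I like music.", None),
]

for key_point, count in key_point_prominence(matches):
    print(f"{count}  {key_point}")
# 3  Artists are underpaid
# 1  Streaming aids discovery
```

Unlike a purely textual summary, the counts show how widely each point is shared across the comments.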

Developers are welcome

Key Point Analysis and Narrative Generation were recently demonstrated in the “That’s Debatable” television series and in the “Grammy Debates with Watson” backstage experience, where they summarized pro and con arguments contributed online by thousands of people, discussing debate topics ranging from social questions to pop culture.

Developers can log in as guests to the main documentation site to view the Project Debater API documentation and run online interactive demos of most of the services. They can also see the code of complete end-to-end examples using these services.

Figure: Getting started on IBM's Project Debater (screenshot of the landing page).

One example is Mining to Narrative. Given a controversial topic, it demonstrates the creation of a narrative by mining content from a Wikipedia corpus. Another example uses the Debater services to analyze free-text survey responses, identifying themes based on Wikipedia concepts.

Before developers can run code examples or use the Project Debater APIs in their own projects, they need to obtain an API key and download the SDK. To request an API key, please visit Project Debater for Academic Use or send an e-mail request to project.debater@il.ibm.com. You will receive a username and password to log in to the Early Access website and can then obtain your personal API key from the API-key tab.


References

1. Slonim, N., Bilu, Y., Alzate, C., et al. (2021). An autonomous debating system. Nature, 591(7850), 379–384.
2. Ein-Dor, L., Shnarch, E., Dankin, L., et al. (2020). Corpus Wide Argument Mining - a Working Solution. ArXiv, abs/1911.10763.
3. Shnayderman, I., Ein-Dor, L., et al. (2019). Fast End-to-End Wikification. ArXiv, abs/1908.06785.
4. Ein-Dor, L., Halfon, A., et al. (2018). Semantic Relatedness of Wikipedia Concepts – Benchmark Data and a Working Solution. Proceedings of the Eleventh International Conference on Language Resources and Evaluation, pages 2571–2575.
5. Levy, R., Bilu, Y., et al. (2014). Context Dependent Claim Detection. Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 1489–1500.
6. Gretz, S., Friedman, R., Cohen-Karlik, E., et al. (2020). A Large-scale Dataset for Argument Quality Ranking: Construction and Analysis. ArXiv, abs/1911.11408.
7. Bar-Haim, R., Bhattacharya, I., et al. (2017). Stance Classification of Context-Dependent Claims. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 251–261.
8. Bar-Haim, R., Eden, L., et al. (2020). From Arguments to Key Points: Towards Automatic Argument Summarization. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4029–4039.
9. Bar-Haim, R., Kantor, Y., et al. (2020). Quantitative Argument Summarization and Beyond: Cross-Domain Key Point Analysis. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 39–49.
