Project Debater Datasets

The development of an automatic debating system naturally involves advancing research in a range of artificial intelligence fields. This page presents several annotated data sets developed as part of Project Debater to facilitate this research. It is organized by research sub-fields explained below.

Argument Mining is a prominent research frontier. Within this field, we distinguish between Argument Detection (the detection and segmentation of argument components such as claims and evidence) and Argument Stance Classification (determining the polarity of an argument component with respect to a given topic).

Beyond argument mining, a debating system should face the challenge of interactivity, i.e., the ability to understand and rebut the text of the opponent’s speech. Debate Speech Analysis is a new research field that focuses on this challenge.

Another important aspect of a debating system is the ability to interact with its surroundings in a human-like manner. Namely, it should be able to articulate arguments and listen to arguments made by others. Regarding the former, the text-to-speech system must demonstrate human-like expressiveness to keep human listeners engaged. The latter may call for speech-to-text systems that are specially designed for a debating scenario.

Finally, a debating system should naturally rely on more fundamental NLP capabilities. One example is the ability to assess the semantic relatedness of various pieces of text and glue these into a coherent narrative. The system should also have the ability to identify the basic concepts mentioned in the text. The corresponding benchmark data we have released thus far in this context are described in the section on Basic NLP.


Yoav Katz, Manager, Project Debater team, IBM Research - Haifa



Noam Slonim, Principal Investigator, Project Debater team, IBM Research - Haifa



This page allows you to download copies of the Project Debater Datasets.

The datasets are released under the following licensing and copyright terms, unless specified otherwise in their release notes:

To download, please fill in the request forms below.

Other datasets are expected to be released over time.

  1. Argument Detection

    The various argument detection datasets differ in size (e.g., number of topics), type of element detected (claims, claim sentences, or evidence), and method used for detection (pre-selected articles vs. automatic retrieval). The table below lists the different datasets and provides information on their characteristics:

    Dataset Reference Topics Element Method



    Extracted from ASR (automatic speech recognition) output of debate speeches over controversial topics



    Automatically retrieved Wikipedia sentences with manual crowd-sourcing scoring

    150 (70 train, 30 held-out, 50 test)

    Claim Sentence

    Automatically retrieved Wikipedia sentences

    118 (83 train, 35 test)


    Automatically retrieved Wikipedia sentences

    58 (leave one topic out)


    Pre-selected Wikipedia articles

    33 (leave one topic out)


    Pre-selected Wikipedia articles

    Go to download ↓

  2. Argument Quality

    After arguments are detected, one faces the task of identifying those of a higher quality. To that end, we created the following data sets, with the focus on identifying the better argument.

    Dataset Reference Topics Element Method



    Automatically retrieved Wikipedia sentences



    Actively collected arguments from crowds



    Actively collected arguments from crowds

    Go to download ↓

  3. Argument Stance Classification and Sentiment Analysis

    A debating system must distinguish between arguments that support its side in the debate and those supporting the opponent’s side. The following datasets were developed as part of the work on Project Debater’s stance classification engine.

    1. Claim Stance

      The claim stance dataset includes stance annotations for claims, as well as auxiliary annotations for intermediate stance classification subtasks.

      Dataset Reference Topics Number of Claims Method



      Manually identified and annotated claims from Wikipedia

      Go to download ↓

    2. Sentiment Analysis

      Sentiment analysis is an important sub-component of our stance classification engine. The following two resources address sentiment analysis of complex expressions, which goes beyond simple aggregation of word-level sentiments. The first resource is a sentiment lexicon of idiomatic expressions, like “on cloud nine” and “under fire”. The second resource addresses sentiment composition – predicting the sentiment of a phrase from the interaction between its constituents. For example, in the phrases “reduced bureaucracy” and “fresh injury”, both “reduced” and “fresh” are followed by a negative word. However, “reduced” flips the negative polarity, resulting in a positive phrase, while “fresh” propagates the negative polarity to the phrase level, resulting in a negative phrase. Accordingly, “reduced” is part of our “reversers” lexicon, and “fresh” is part of the “propagators” lexicon.
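      The composition rule described above can be illustrated with a minimal sketch. The tiny lexicons, the function name `compose_sentiment`, and the fallback for unknown modifiers are assumptions made for this example; they do not reflect the format or contents of the released lexicons.

```python
# Toy lexicons for illustration only; the released lexicons are far larger
# and their file format is not assumed here.
REVERSERS = {"reduced"}
PROPAGATORS = {"fresh"}
WORD_SENTIMENT = {"bureaucracy": -1, "injury": -1}  # -1 = negative, +1 = positive


def compose_sentiment(modifier: str, head: str) -> int:
    """Return +1, -1, or 0 for a two-word phrase 'modifier head'."""
    head_polarity = WORD_SENTIMENT.get(head, 0)
    if modifier in REVERSERS:
        # A reverser flips the polarity of the word that follows it.
        return -head_polarity
    if modifier in PROPAGATORS:
        # A propagator carries the polarity up to the phrase level.
        return head_polarity
    # Unknown modifier: fall back to the head word's polarity (an assumption).
    return head_polarity


print(compose_sentiment("reduced", "bureaucracy"))  # 1
print(compose_sentiment("fresh", "injury"))         # -1
```

      In this sketch a phrase whose head word is missing from the word-level lexicon gets a neutral score of 0, whatever its modifier.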

      Dataset Reference Content Source

      5,000 frequently occurring idioms with sentiment annotation

      Manually annotated idioms from Wiktionary

      Sentiment composition lexicons containing 2,783 words and sentiment lexicons containing 66K unigrams and 262K bigrams.

      Automatically learned from a large proprietary English corpus

      Go to download ↓

    3. Expert Stance

      Expert evidence (premise) is a commonly used type of argumentation scheme. Prior knowledge about the expert’s stance towards the debate topic can help predict the polarity of such arguments. For example, an argument made by Richard Dawkins about atheism is likely to have a PRO stance, since Dawkins is a well-known atheist. Such information can be extracted from Wikipedia categories: Dawkins, for instance, is listed under “Antitheists”, “Atheism activists”, “Atheist feminists” and “Critics of religions”. The Wikipedia Category Stance dataset contains stance annotations of Wikipedia categories towards Wikipedia concepts representing controversial topics.
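      One way such category annotations might be used is sketched below: the stances of an expert's categories towards a topic are summed to give a prior stance for that expert. The stance values, the dictionary layout, and the function `expert_prior_stance` are hypothetical and chosen only for illustration; they are not taken from the released dataset.

```python
# Illustrative category stances towards the topic "Atheism":
# +1 = PRO, -1 = CON. These values are assumptions for the example,
# not rows copied from the released dataset.
CATEGORY_STANCE = {
    "Antitheists": 1,
    "Atheism activists": 1,
    "Critics of religions": 1,
}


def expert_prior_stance(expert_categories, category_stance):
    """Return +1 (PRO), -1 (CON), or 0 (unknown) by summing category stances."""
    score = sum(category_stance.get(category, 0) for category in expert_categories)
    # Sign of the score: positive -> PRO, negative -> CON, zero -> unknown.
    return (score > 0) - (score < 0)


dawkins_categories = [
    "Antitheists",
    "Atheism activists",
    "Atheist feminists",
    "Critics of religions",
]
print(expert_prior_stance(dawkins_categories, CATEGORY_STANCE))  # 1
```

      Categories without an annotation contribute nothing to the sum, so an expert with no annotated categories gets a prior of 0.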

      Dataset Reference Topics Number of Categories Method



      Manually annotated Wikipedia Categories

      Go to download ↓

  4. Debate Speech Analysis

    In order to respond to an opponent’s speech, the system must process the opponent’s voice and ‘understand’ its content. The provided datasets focus on the Automatic Speech Recognition (ASR) output and on the upstream tasks related to understanding opponents' speeches.

    Dataset Reference Speeches Topics Contents



    - Recordings of expert debaters
    - Automatic and manually-corrected transcripts of the speeches, in both raw and cleaned (processed) versions
    - An annotation specifying the response speeches recorded for each speech, and the type of the response (explicit/implicit)
    - Metadata describing the speeches, such as the topic discussed in each speech



    - Recordings of expert debaters
    - 55 general-purpose claim and rebuttal pairs written by an expert human debater
    - An annotation specifying for each of the 50 controversial topics, which of the 55 general-purpose claims is relevant to the topic
    - An annotation of general-purpose claims relevant to a topic, specifying whether a relevant claim was mentioned in speeches discussing the topic
    - An annotation of general-purpose claims and sentences from speeches in which they were mentioned, specifying whether the claim was mentioned in the sentence
    - An annotation of general-purpose rebuttals, specifying whether they are a plausible response to general-purpose claims mentioned in speeches



    Recordings of expert debaters + mined claims annotated in a listening comprehension task



    Recordings of expert debaters + arguments annotated in a listening comprehension task



    Recordings of 10 expert debaters

    Go to download ↓

  6. Debate Topic Expansion

    When debating a controversial topic, it is often desirable to expand the boundaries of the discussion, and bring up arguments about related topics. For example, when discussing the pros and cons of the presidential system, it is natural to contrast it with those of the parliamentary system. When debating alternative medicine, we may discuss specific examples, such as homeopathy and naturopathy. Conversely, when discussing bitcoins, we can speak more broadly on cryptocurrency. Debate Topic Expansion is the task of automatically finding related topics that can enrich our arguments and strengthen our case when debating a given topic.

    We distinguish between two types of expansions: "consistent" and "contrastive". Arguing in favor or against a consistent expansion may support the same stance towards the original topic, whereas for contrastive expansions the stance is reversed. For example, "cryptocurrency" is a consistent expansion of "bitcoin", while "parliamentary system" is a contrastive expansion of "presidential system", as we may support the presidential system by criticizing the parliamentary system.

    The Debate Topic Expansion dataset includes pairs (DC,EC) where DC is a debate topic and EC is an automatically-extracted candidate expansion for DC. Both DC and EC are Wikipedia concepts. Each such pair was manually labeled as good/bad expansion.
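    The distinction between consistent and contrastive expansions can be expressed as a sign rule. The sketch below assumes stances are encoded as +1 (pro) and -1 (con); the `ExpansionType` enum and the function name are hypothetical and are not part of the dataset.

```python
from enum import Enum


class ExpansionType(Enum):
    CONSISTENT = 1    # stance carries over unchanged
    CONTRASTIVE = -1  # stance is reversed


def stance_towards_topic(stance_towards_expansion: int, expansion_type: ExpansionType) -> int:
    """Map a stance towards the expansion (+1/-1) to a stance towards the original topic."""
    return stance_towards_expansion * expansion_type.value


# Criticizing the parliamentary system (-1), a contrastive expansion,
# supports the presidential system (+1).
print(stance_towards_topic(-1, ExpansionType.CONTRASTIVE))  # 1
# Supporting cryptocurrency (+1), a consistent expansion, supports bitcoin (+1).
print(stance_towards_topic(1, ExpansionType.CONSISTENT))    # 1
```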

    Dataset Reference Topics Annotated Expansions Method


    Consistent: 1,741
    Contrastive: 1,326

    Candidate expansions were generated automatically, and then were manually labeled as good/bad expansions

    Go to download ↓

  8. Expressive Text to Speech

    The emphasized words dataset was created to train and evaluate a system that receives a written argumentative speech and predicts which words should be emphasized by the Text-to-Speech component.

    Dataset Reference Number of Paragraphs Number of Sentences Source



    The speeches were created based on claims/evidence automatically detected from Wikipedia

    Go to download ↓

  10. Basic NLP Tasks

    The following datasets relate to basic NLP tasks, addressed as part of Project Debater.

    1. Semantic Relatedness

      Predicting semantic relatedness between texts is a basic NLP problem with a wide variety of applications. Relatedness can be measured between several types of texts, ranging from words to documents. The relatedness datasets listed below differ in the type of elements considered (words, multi-word-terms, and concepts), number of topics from which the pairs were extracted, and number of annotated pairs.

      Dataset Reference Number of Topics Type of elements Number of pairs

      143 (82 train, 41 test)

      Wikipedia Entities

      19,276 (12,969 train, 6,307 test)


      Words and Multi-word Terms


      Go to download ↓

    2. Mention Detection

      The goal of Mention Detection is to map entities/concepts mentioned in text to the correct concept in a knowledge base. This process involves segmenting the text (as some concepts span multiple words) and the disambiguation of terms with more than one meaning.

      Dataset Reference Number of Sentences Number of Topics Source

      3000 (500 train and 500 test for each of the three text sources)


      Mix of Wikipedia articles and ASR/manual transcripts of speeches by expert debaters

      Go to download ↓

    3. Text Clustering

      Text clustering is a widely-studied NLP problem. Clustering can be applied to texts at different levels, from single words to full documents, and can vary with respect to the clustering goal. In thematic clustering, the aim is to cluster texts based on thematic similarity between them, namely grouping together texts that discuss the same theme.

      Thematic clustering of sentences is important for various use cases. For example, in multi-document summarization, one often extracts sentences from multiple documents that should be organized into meaningful sections and paragraphs. Similarly, within the emerging field of computational argumentation, arguments may be found in a widespread set of articles, which further require thematic organization to generate a compelling argumentative narrative.

      Evaluation of thematic clustering methods requires a ground truth dataset of sentence clustering. Unfortunately, sentence clustering is considered a very difficult task for humans. As a result, there is no standard human-annotated sentence clustering dataset.

      In the dataset “Thematic Clustering of Sentences”, sentences are annotated for their thematic clusters. This annotation makes it possible to evaluate thematic clustering methods. The dataset was generated automatically by leveraging the partition of Wikipedia articles into sections. The underlying assumption of its creation was that the section structure of a Wikipedia article can serve as ground truth for the thematic clustering of its sentences. Details about the way this dataset was generated can be found in the article.
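      The idea of using section structure as ground truth can be sketched as follows. The input layout (a list of section titles with their sentences), the function names, and the choice of the Rand index as the evaluation measure are assumptions made for this example; they are not a description of how the dataset was built or evaluated.

```python
from itertools import combinations


def section_labels(article):
    """Flatten [(section_title, [sentences])] into parallel sentence and label lists."""
    sentences, labels = [], []
    for section_title, section_sentences in article:
        for sentence in section_sentences:
            sentences.append(sentence)
            labels.append(section_title)  # the section acts as the gold cluster
    return sentences, labels


def rand_index(gold, predicted):
    """Fraction of sentence pairs on which two clusterings agree (needs >= 2 items)."""
    pairs = list(combinations(range(len(gold)), 2))
    agreements = sum(
        (gold[i] == gold[j]) == (predicted[i] == predicted[j]) for i, j in pairs
    )
    return agreements / len(pairs)


# Hypothetical article with two sections of two sentences each.
article = [
    ("History", ["Sentence a.", "Sentence b."]),
    ("Criticism", ["Sentence c.", "Sentence d."]),
]
sentences, gold = section_labels(article)
predicted = [0, 0, 1, 0]  # output of some clustering method
print(rand_index(gold, predicted))  # 0.5
```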

      Dataset Reference Number of Topics Number of Clusters per Topic Number of Sentences per Topic




      Go to download ↓

    4. Concept Abstractness

      During the last decades, the influence of psycholinguistic properties of words on cognitive processes has become a major topic of scientific inquiry. Among the most studied psycholinguistic attributes are concreteness, familiarity, imagery, and average age of acquisition. Abstractness quantifies the degree to which an expression denotes an entity that can be directly perceived by human senses. As an example, the word "feminism" is usually perceived as abstract, but the word "screwdriver" is associated with a concrete meaning.

      We introduce a weakly supervised approach for inferring the property of abstractness of words and expressions in the complete absence of labeled data. Exploiting only minimal linguistic clues and the contextual usage of a concept as manifested in textual data, we train sufficiently powerful classifiers, obtaining high correlation with human labels. The released dataset contains 300K Wikipedia concepts automatically rated for their degree of abstractness.

      Dataset Reference Number of Topics


      Go to download ↓

    5. Concept Controversiality

      With the growing interest in social applications of Natural Language Processing and Computational Argumentation, a natural question is how controversial a given concept is. Prior works relied on Wikipedia’s metadata and on content analysis of the articles pertaining to a concept in question. Here we show that the immediate textual context of a concept is strongly indicative of this property, and, using simple and language-independent machine-learning tools, we leverage this observation to achieve state-of-the-art results in controversiality prediction. In addition, we analyze and make available two new datasets of concepts labeled for controversiality. One is an extension of an already published dataset, and the other is significantly larger than existing datasets, and grades concepts on a 0-1 scale, rather than treating controversiality as a binary label.

      Dataset Reference Number of Topics

      1213 + 3560

      Go to download ↓

    6. Automatic Claim Negation

      The main goal of argumentation mining is to analyze argumentative structures within an argument-rich document, and reason about their composition. Recently, there has also been interest in the task of simply detecting claims (sometimes called conclusions) in general documents. In this work we ask how this set of detected claims can be augmented further, by adding to it the negation of each detected claim. This presents two NLP problems: how to automatically negate a claim, and when such a negated claim can plausibly be used. We present first steps towards solving both these problems, using a rule-based approach for the former and a statistical one for the latter.

      Dataset Reference Number of Sentences


      Go to download ↓

  11. Classes of Principled Arguments

    Principled common-place arguments are one of the fundamental building blocks of effective debating. This data set describes a modelling of such arguments by organizing them into Classes of Principled Arguments (CoPAs), and associating these classes with a large set of motions.

    Dataset Reference Content

    Association of CoPAs to 689 motions

    Statistics for the number of claims mentioned in recorded speeches from each CoPA

    Go to download ↓

  13. Key Point Analysis

    Key Point Analysis (KPA) is a novel summarization technology that extracts the main points and their salience from a collection of opinions. It has been successfully applied to diverse types of data, including arguments, user reviews and survey responses. KPA maps the input texts to a set of automatically-extracted short sentences and phrases, termed Key Points (KPs), which provide a concise plain-text summary of the data. The prevalence of each KP is quantified as the number of its matching sentences. We release the following benchmarks for the task of mapping input texts to key points.
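    The prevalence computation described above can be sketched from labeled pairs. The tuple layout `(sentence, key_point, is_match)`, the sample rows, and the function name are hypothetical; the actual column names of the released files are not assumed here.

```python
from collections import defaultdict


def key_point_prevalence(pairs):
    """Count, for each key point, the number of distinct sentences matched to it."""
    matched = defaultdict(set)
    for sentence, key_point, is_match in pairs:
        if is_match:
            matched[key_point].add(sentence)
    return {key_point: len(sentences) for key_point, sentences in matched.items()}


# Made-up pairs for illustration only.
pairs = [
    ("Parks need more lighting.", "Improve park safety", True),
    ("The park feels unsafe at night.", "Improve park safety", True),
    ("Traffic downtown is terrible.", "Improve park safety", False),
    ("Traffic downtown is terrible.", "Reduce traffic congestion", True),
]
print(key_point_prevalence(pairs))
```

    Key points with no matching sentence do not appear in the result of this sketch; a caller that needs explicit zero counts would have to add them.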

    Argumentation Datasets

    The ArgKP dataset contains 24,093 (argument, key point) pairs labeled as matching/non-matching, for 28 controversial topics. For each pair, the topic and stance are also indicated. The ArgKP-2021 dataset extends ArgKP with three additional topics, amounting to 31 topics and 27,519 (argument, key point) pairs overall. The ArgKP-2023 dataset includes 7,584 pairs for 10 additional topics (9,281 including “undecided” pairs).

    Municipal Survey Dataset

    The SurveyKP dataset contains 15,075 (sentence, key point) pairs labeled as matching/non-matching (15,189 including “undecided” pairs), for sentences that were sampled from responses to the 2016-2017 Austin Community Survey.

    Dataset Reference Topics Pairs Source



    The sentences were sampled from open-ended responses to the 2016-2017 Austin Community Survey. The key points were automatically extracted by our KPA system from the entire survey, with minor manual edits. The stance of each pair (whether the feedback is positive or negative) is also indicated. The labeling of each (sentence, key point) pair as matching/non-matching was performed manually, by an in-house team of annotators.



    Similar to the ArgKP dataset (see below), ArgKP-2023 includes pro and con arguments from the IBM-ArgQ-Rank-30kArgs dataset, for 10 additional topics. Key points were extracted automatically by our system. The labeling of each (argument, key point) pair as matching/non-matching was performed manually, via crowd-sourcing.



    ArgKP dataset extended with 3 additional topics. The extension arguments were actively collected from the crowd and labeled for stance. As in ArgKP, the key points for each new topic were manually composed by an expert debater. The labeling of each new (argument, key point) pair as matching/non-matching was performed manually, via crowd-sourcing.



    The arguments, along with their topic and stance were taken from the IBM-ArgQ-Rank-30kArgs dataset (see above). The key points for each topic were manually composed by an expert debater. The labeling of each (argument, key point) pair as matching/non-matching was performed manually, via crowd-sourcing.

    Go to download ↓

  14. Claim Generation

    Claim generation is a challenging task whose research is timely considering its potential impact on social media and the dissemination of information. The following dataset contains texts that were generated by various fine-tuned GPT-2 models for a large set of topics, labeled for being plausible and stance-bearing, determining the "claimness" of the generated texts.

    Dataset Reference Generated Texts Topics



    Go to download ↓

  15. Multilingual Argument Mining

    The growing interest in argument mining and computational argumentation brings with it a plethora of Natural Language Understanding (NLU) tasks and corresponding datasets. However, as with many other NLU tasks, the dominant language is English, with resources in other languages being few and far between. This multilingual dataset contains data for three major computational argumentation tasks -- stance classification, evidence detection, and argument quality prediction -- in 5 languages: Italian, French, Spanish, German and Dutch.

    Dataset Reference Topics Contents

    Evidence: 321
    Arguments: 71

    - 6,752 Human-authored arguments in 5 languages, annotated for their stance and quality
    - 30,497 English arguments annotated for their stance and quality, along with their machine-translation to 5 languages
    - 35,211 English Wikipedia sentences annotated for whether they are valid Evidence and their stance towards the discussed topic, along with their machine-translation to 5 languages
    - 210 English arguments and 200 English Wikipedia sentences, manually and automatically translated to German and Italian, with their corresponding labels collected for English and for the translated texts

    Go to download ↓

  16. Targeted Sentiment Analysis

    Understanding sentiment has been a major research area in NLP for decades, from the classification of review documents as expressing a positive or negative sentiment towards their topic, through a similar analysis of single sentences, and, in recent years, going deeper into identifying the sentiment expressed towards single words or phrases. The targeted sentiment analysis dataset provides sentence-level sentiment target annotations of data sampled from five sources of product and user reviews.

    Dataset Reference Data sources Contents

    SemEval'14, Stanford Sentiment Treebank, Opinosis, Yelp, Amazon

    A sample of sentences selected from each data source was annotated for targets and their sentiment:
    - SemEval'14: 100 sentences from each domain (laptops and restaurants)
    - SST: 500 sentences
    - Opinosis: 512 sentences
    - Yelp: 501 sentences
    - Amazon: 502 sentences

    Go to download ↓

  17. Intent Classification

    Intent classification, or intent detection, is the task of interpreting the user's intent in a task-oriented dialogue setting.
    The following dataset contains diverse queries mapped to intents concerning COVID-19 vaccines, in the context of VIRA, a chatbot that aims to address COVID-19 vaccine hesitancy.

    Dataset Reference Intents Expressions



    Go to download ↓

  18. Trust Classification

    The following dataset contains annotations of High trust, Low institutional trust and Low agent trust in user inputs to VIRA, a chatbot that aims to address COVID-19 vaccine hesitancy.

    Dataset Reference User Inputs Source



    Go to download ↓


Debater Datasets - Licensing Notice

Each copy or modified version that you distribute must include a licensing notice stating that the work is released under CC-BY-SA and either a) a hyperlink or URL to the text of the license or b) a copy of the license. For this purpose, a suitable URL is: