IBM at ACL 2023

Toronto, Canada and virtual
This event has ended.


IBM is proud to sponsor the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023) in Toronto, Canada. We invite all attendees to visit us during the event at our booth in the Exhibition Center at the Westin Harbour Castle.

We look forward to meeting you at the event and telling you more about our latest work and career opportunities at IBM Research. Our team will be presenting a series of workshops, papers and demos related to a broad range of AI topics.

Read our accepted papers at ACL 2023.

For presentation times of workshops, demos, papers, and tutorials, see the agenda section below. Note: All times are displayed in your local time.

View the booth demo & staff schedule.

Keep up with emerging research and scientific developments from IBM Research. Subscribe to the Future Forward Newsletter.

We look forward to seeing you in Toronto!

View our ACL presentation schedule

Career opportunities

Visit us in the Harbor Ballroom to meet IBM researchers and recruiters and to discuss future job opportunities or 2024 summer internships.


ACL attendees: to stay in touch, let us know that you attended the conference and would like to be considered for future research opportunities. Submit your information to IBM Research.

Sign up to be notified of future openings by joining our Talent Network.

Explore all IBM Research openings


  • Visit us in the Expo center from 9am to 5pm to talk to researchers and recruiters and to interact with live demos.

  • Understanding Transformer-based models has attracted significant attention, as they lie at the heart of recent technological advances across machine learning. While most interpretability methods rely on running models over inputs, recent work has shown that a zero-pass approach, where parameters are interpreted directly without a forward/backward pass, is feasible for some Transformer parameters and for two-layer attention networks. In this work, we present a theoretical analysis where all parameters of a trained Transformer are interpreted by projecting them into the embedding space, that is, the space of vocabulary items they operate on. We derive a simple theoretical framework to support our arguments and provide ample evidence for its validity. First, we present an empirical analysis showing that parameters of both pretrained and fine-tuned models can be interpreted in embedding space. Second, we present two applications of our framework: (a) aligning the parameters of different models that share a vocabulary, and (b) constructing a classifier without training by "translating" the parameters of a fine-tuned classifier to parameters of a different model that was only pretrained. Overall, our findings open the door to interpretation methods that, at least in part, abstract away from model specifics and operate in the embedding space only.

    Authors: Guy Dar; Mor Geva; Ankit Gupta (IBM); Jonathan Berant

  • Pre-training models with large crawled corpora can lead to issues such as toxicity and bias, as well as copyright and privacy concerns. A promising way of alleviating such concerns is to conduct pre-training with synthetic tasks and data, since no real-world information is ingested by the model. Our goal in this paper is to understand the factors that contribute to the effectiveness of pre-training models when using synthetic resources, particularly in the context of neural machine translation (NMT). We propose several novel approaches to pre-training translation models that involve different levels of lexical and structural knowledge, including: 1) generating obfuscated data from a large parallel corpus, 2) concatenating phrase pairs extracted from a small word-aligned corpus, and 3) generating synthetic parallel data without real human language corpora. Our experiments on multiple language pairs reveal that pre-training benefits can be realized even with high levels of obfuscation or purely synthetic parallel data. We hope the findings from our comprehensive empirical analysis will shed light on what matters for NMT pre-training, as well as pave the way for the development of more efficient and less toxic models.

    Authors: Zexue He; Graeme Blackwood (IBM); Rameswar Panda (IBM); Julian McAuley; Rogerio Feris (IBM)

  • Zero-shot learning (ZSL) focuses on annotating texts with entities or relations that have never been seen during training. This task has many practical applications because labeled data is often lacking in real-world, domain-specific settings. Recent advances in machine learning with large pretrained language models demonstrate significant results in zero-shot learning with numerous novel methods. There is high demand in both industry and the research community for a framework where people with different backgrounds can easily access the latest ZSL methods or pretrained models. In this work, we create a new ZSL framework called Zshot. The main goal of our work is to provide researchers with a framework where they can quickly benchmark and compare different state-of-the-art ZSL methods with standard benchmark datasets included in the framework. Moreover, it is designed to support industry with ready-made APIs for production under the standard spaCy NLP pipeline. Our API is extensible and evaluable; moreover, we include numerous enhancements such as automatic description generation, boosting accuracy with pipeline ensembling, and visualization utilities available as a spaCy extension.

    Authors: Gabriele Picco (IBM); Marcos Martínez Galindo (IBM); Alberto Purpura (IBM); Leopold Fuchs (IBM); Vanessa Lopez (IBM); Lam Thanh Hoang (IBM)

  • To prevent the costly and inefficient use of resources on low-quality annotations, we want a method for creating a pool of dependable annotators who can effectively complete difficult tasks, such as evaluating automatic summarization. Thus, we investigate the recruitment of high-quality Amazon Mechanical Turk workers via a two-step pipeline. We show that we can successfully filter out subpar workers before they carry out the evaluations and obtain high-agreement annotations with similar constraints on resources. Although our workers demonstrate a strong consensus among themselves and with CloudResearch workers, their alignment with expert judgments on a subset of the data is not as strong as expected, indicating that they need further training in correctness. This paper still serves as a best practice for the recruitment of qualified annotators in other challenging annotation tasks.

    Authors: Lining Zhang; Simon Mille; Yufang Hou (IBM); Daniel Deutsch; Elizabeth Clark; Yixin Liu; Saad Mahamood; Sebastian Gehrmann; Miruna Clinciu; Khyathi Chandu; João Sedoc

  • Pretraining has been shown to scale well with compute, data size and data diversity. Combining all of these, a multitask mixture of supervised datasets produces improved performance compared to self-supervised pretraining. Until now, massively multitask learning required simultaneous access to all datasets in the mixture and heavy compute resources that are only available to well-resourced teams.

    In this paper, we propose ColD Fusion, a method that provides the benefits of multitask learning but leverages distributed computation and requires limited communication and no sharing of data. Consequently, ColD Fusion can create a synergistic loop, where finetuned models and pretrained models keep improving each other. We show that ColD Fusion yields comparable benefits to multitask pretraining by producing a model that (a) attains strong performance on all of the datasets it was multitask trained on and (b) is a better starting point for finetuning on unseen datasets. We find ColD Fusion outperforms RoBERTa and even previous multitask models. Specifically, when training and testing on 35 datasets, ColD Fusion outperforms RoBERTa by 2.45 points on average without any changes to the architecture.

    Authors: Shachar Don-Yehiya (IBM); Elad Venezian (IBM); Colin Raffel; Noam Slonim (IBM); Yoav Katz (IBM); Leshem Choshen (IBM)

  • Recent work in natural language processing (NLP) has yielded appealing results from scaling; however, using only scale to improve performance means that resource consumption also scales. Resources include data, time, storage, and energy, all of which are naturally limited and unevenly distributed. This motivates research into efficient methods that require fewer resources to achieve similar results. This survey synthesizes and relates current methods and findings in efficient NLP. We aim both to provide guidance for conducting NLP under limited resources and to point towards promising research directions for developing more efficient methods.

    Authors: Marcos Treviso; Ji-Ung Lee; Tianchu Ji; Betty van Aken; Qingqing Cao; Manuel Ciosici; Michael Hassid; Kenneth Heafield; Sara Hooker; Pedro Martins; André Martins; Peter Milder; Colin Raffel; Jessica Forde; Emma Strubell; Edwin Simpson; Noam Slonim; Jesse Dodge; Iryna Gurevych; Niranjan Balasubramanian; Leon Derczynski; Roy Schwartz

  • Open Information Extraction (OpenIE) has been used in the pipelines of various NLP tasks. Unfortunately, there is no clear consensus on which models to use in which tasks. Muddying things further is the lack of comparisons that take differing training sets into account. In this paper, we present an application-focused empirical survey of neural OpenIE models, training sets, and benchmarks in an effort to help users choose the most suitable OpenIE systems for their applications. We find that the different assumptions made by different models and datasets have a statistically significant effect on performance, making it important to choose the most appropriate model for one's applications. We demonstrate the applicability of our recommendations on a downstream Complex QA application.

    Authors: Kevin Pei; Ishan Jindal (IBM); Kevin Chang; Chengxiang Zhai; Yunyao Li

  • Question answering models commonly have access to two sources of "knowledge" during inference time: (1) parametric knowledge, the factual knowledge encoded in the model weights, and (2) contextual knowledge, external knowledge (e.g., a Wikipedia passage) given to the model to generate a grounded answer. Having these two sources of knowledge entangled is a core issue for generative QA models, as it is unclear whether the answer stems from the given non-parametric knowledge or not. This lack of clarity has implications for trust, interpretability and factuality. In this work, we propose a new paradigm in which QA models are trained to disentangle the two sources of knowledge. Using counterfactual data augmentation, we introduce a model that predicts two answers for a given question: one based on the given contextual knowledge and one based on parametric knowledge. Our experiments on the Natural Questions dataset show that this approach improves the performance of QA models by making them more robust to knowledge conflicts between the two knowledge sources, while generating useful disentangled answers.

    Authors: Ella Neeman; Roee Aharoni; Or Honovich; Leshem Choshen (IBM); Idan Szpektor; Omri Abend

  • We present Naamapadam, the largest publicly available Named Entity Recognition (NER) dataset for 11 major Indian languages from two language families. For 9 of the 11 languages, it contains more than 400k sentences annotated with a total of at least 100k entities from three standard entity categories (Person, Location and Organization). The training dataset has been automatically created from the Samanantar parallel corpus by projecting automatically tagged entities from an English sentence to the corresponding Indian language sentence. We also create manually annotated test sets for 8 languages containing approximately 1000 sentences per language. We demonstrate the utility of the obtained dataset on existing test sets and the Naamapadam-test data for 8 Indic languages. We also release IndicNER, a multilingual mBERT model fine-tuned on the Naamapadam training set. IndicNER achieves the best F1 on the Naamapadam-test set compared to an mBERT model fine-tuned on existing datasets. IndicNER achieves an F1 score of more than 80 for 7 out of 11 Indic languages. The dataset and models are available under open-source licences.

    Authors: Arnav Mhaske; Harshit Kedia; Sumanth Doddapaneni; Mitesh M. Khapra; Pratyush Kumar; Rudra Murthy Venkataramana (IBM); Anoop Kunchukuttan

  • In the deployment of real-world text classification models, label scarcity is a common problem. As the number of classes increases, this problem becomes even more complex. One way to address this problem is by applying text augmentation methods.

    Authors: Adir Rahamim; Guy Uziel (IBM); Esther Goldbraich (IBM); Ateret Anaby-Tavor (IBM)

  • Text classification datasets from specialised or technical domains are in high demand, especially in industrial applications. However, due to the high cost of annotation, such datasets are usually expensive to create. While Active Learning (AL) can reduce the labeling cost, the required AL strategies are often only tested on general knowledge domains and tend to use information sources that are not consistent across tasks. We propose Reinforced Active Learning (RAL) to train a Reinforcement Learning policy that utilizes many different aspects of the data and the task in order to select the most informative unlabeled subset dynamically over the course of the AL procedure. We demonstrate the superior performance of the proposed RAL framework compared to strong AL baselines across four intricate multi-class, multi-label text classification datasets taken from specialised domains. In addition, we experiment with a unique data augmentation approach to further reduce the number of samples RAL needs to annotate.

    Authors: Lukas Wertz; Jasmina Bogojeska; Katya Mirylenka (IBM); Jonas Kuhn

  • We propose a method to control the attributes of large language models (LLMs) for the text generation task using causal average treatment effect (ATE) scores and counterfactual augmentation. We explore this method in the context of LLM detoxification and propose the Causally Fair Language (CFL) architecture for detoxifying existing pre-trained LLMs in a plug-and-play manner. Our architecture is based on a Structural Causal Model (SCM) that achieves significantly faster training time than many existing detoxification techniques. Further, we achieve state-of-the-art performance in several evaluation metrics using Real Toxicity Prompts. Our experiments show that CFL achieves this detoxification without much impact on model perplexity. Using the LM Loss over the BOLD dataset, we show that CFL mitigates the unintended bias of other detoxification techniques.

    Authors: Rahul Madhavan; Rishabh Garg (IBM); Kahini Wadhawan (IBM); Sameep Mehta (IBM)

  • Over the past few years, zero-shot prompt-based learning has become a de facto standard in many NLP tasks where training data is unavailable. Particularly for sentiment analysis, much effort has been put into designing high-performing prompt templates. However, two problems exist. First, a large pre-trained language model (LM) is often biased toward its training data, leading to poor performance on prompt templates the LM has rarely seen. This problem cannot be resolved by scaling. Second, for specific domains, such as the financial and food domains, prompt templates must be re-designed by human experts for domain adaptation, which is time-consuming and inefficient. To remedy both shortcomings, we propose a simple yet strong data construction method to de-bias prompt templates, yielding a large improvement across different domains, pre-trained language models, and prompt templates. We also demonstrate the advantage of using our domain-agnostic data over in-domain ground-truth data.

    Authors: Yang Zhao (IBM); Tetsuya Nasukawa (IBM); Masayasu Muraoka (IBM); Bhatta Bhattacharjee (IBM)

Related Events