Neural Information Processing Systems (NeurIPS) is a leading machine learning and computational neuroscience conference. IBM Research is excited to sponsor NeurIPS again this year as a Platinum sponsor.
We invite all attendees to visit us during the event at booth number 1209, from Monday, Dec 11 through Thursday, Dec 14.
We look forward to meeting you and telling you more about our latest work and career opportunities at IBM Research. At our booth we’ll be demoing projects on a broad range of AI topics such as foundation models, trustworthy AI, natural language processing and understanding, knowledge and reasoning, AI automation, human-centered AI, and federated learning.
Presentation times of conference workshops, demos, papers, and tutorials can be found in the agenda section at the bottom of this page. Note: All times are displayed in your local time.
IBM Booth Demo & Staff Schedule
Keep up with emerging research and scientific developments from IBM Research.
Subscribe to the Future Forward Newsletter.
Visit us at the IBM Booth to meet with IBM researchers and recruiters to speak about future job opportunities or 2024 summer internships.
Featured positions to learn more about at NeurIPS:
Full Time Positions:
2024 Internships:
Sign up to be notified of future openings by joining our Talent Network.
Traditional data integration techniques often require complex coding and a deep understanding of data architectures, which can be daunting for non-specialists. In the evolving landscape of AI, there's a growing need for tools that democratize data access and analysis. We present FlowPilot, a novel system that departs from the current one-shot text-to-SQL paradigms that often fail to answer complex queries.
A key innovation in our work is the automated generation of the training/fine-tuning dataset by leveraging a dynamic set of inputs, including metadata from enterprise catalogs, database schemas, query logs, etc. The generated dataset is then used to fine-tune an LLM tailored for the customer that is able to understand the context of enterprise data by embedding its core knowledge with the relevant schemas, relationships and patterns.
FlowPilot mitigates errors during both training and inference by estimating uncertainty about query validity and alignment with user intent, and by allowing the model to execute and refine statements in a sandbox environment.
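A minimal sketch of such a sandboxed execute-and-refine loop, assuming a hypothetical `generate_sql` helper in place of FlowPilot's fine-tuned model and an in-memory SQLite database as the sandbox:

```python
import sqlite3

def make_sandbox() -> sqlite3.Connection:
    """Tiny in-memory database standing in for an isolated copy of enterprise data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, total REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", [("acme", 120.0), ("globex", 87.5)])
    return conn

def generate_sql(question, feedback=None):
    """Placeholder for the fine-tuned text-to-SQL model; a real system would call the LLM here,
    passing back any execution error as feedback for refinement."""
    return "SELECT customer, total FROM orders ORDER BY total DESC LIMIT 5"

def execute_and_refine(question, conn, max_attempts=3):
    """Validate generated SQL against the sandbox, feeding errors back to the model."""
    feedback = None
    for attempt in range(max_attempts):
        sql = generate_sql(question, feedback)
        try:
            return sql, conn.execute(sql).fetchall()   # statement ran cleanly in the sandbox
        except sqlite3.Error as err:
            feedback = f"attempt {attempt + 1} failed: {err}"   # refine on the next pass
    raise RuntimeError(f"no valid SQL after {max_attempts} attempts ({feedback})")

sql, rows = execute_and_refine("Who are the top customers by order total?", make_sandbox())
print(sql, rows)
```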
A coordinator seamlessly integrates fine-tuned text-to-SQL, text-to-Python, and text-to-chart models, delivering thorough answers to a broad spectrum of data-related questions.
FlowPilot's user-friendly interface comprises three synchronized, AI-powered interactive views: chat, flow, and data. This arrangement provides users with the flexibility to select their preferred mode of interaction with the system throughout their conversation with the databases.
FlowPilot offers an advanced approach to data integration, utilizing generative AI and a streamlined data pre-processing method. It introduces a novel conversational text-to-SQL feature, aiming to make data access simpler and provide reliable responses, thereby enhancing user interactions with enterprise databases.
Presenter(s): Enrico Toniato
The emergence of foundation models has significantly lowered the barriers to applying AI to everyday problems, transforming the way organizations consume, customize, and build AI-enabled applications. We are also seeing the emergence of a new persona, the AI Builder, who needs dedicated tooling to harness the power of LLMs while mitigating their associated risks.
In this demonstration, we present the Big AI Models (BAM) Laboratory, an experimental platform designed to empower AI builders in the Generative AI space. Initially created over a year ago to address the unique challenge of hosting LLMs with 100B+ parameters, the BAM Laboratory has evolved to allow experimentation for thousands of internal AI builders and researchers throughout the AI application development lifecycle.
Some of its key current areas of incubation include improving the model selection experience by recommending the right prompt for a given use case, driving better alignment of models through tuning on human feedback, and creating AI guardrails to safeguard applications from LLM-related risks (such as hate/profanity/abuse, hallucination, and social bias).
Presenter(s): Maya Murad
Today, enterprises of all sizes operate in very competitive markets. To deliver on business expectations, IT environments continuously become more flexible and dynamic. Contemporary microservices architectures have simplified the scope of work for software developers, but the roles of IT Operations and Site Reliability Engineers (SREs) have become even more complex. Today's IT environments can generate millions of transactions a day and can change every few seconds. The sheer scale and dynamic nature of these distributed hybrid environments is difficult to fully comprehend, and the gap between IT complexity and the human ability to manage it is widening. This complexity threatens resiliency and reliability. One solution to this problem, already adopted by many organizations, is AIOps: employing artificial intelligence to assist IT Operations and SREs. In some cases, SREs analyze incoming events or symptoms before deciding on investigative actions. Operations teams or SREs then perform problem determination, diagnosis, and resolution based on the symptom information. In interviews we conducted with SREs, they identified diagnosis as the most difficult task; being able to troubleshoot a problem and arrive at a diagnosis is often considered an innate skill [1].
There has been a great deal of effort spent on developing methodologies for specifying and reasoning about symptoms and signals provided through monitoring of systems, be they hardware or software. The PyRCA and Merlion libraries, for example, implement methods from recent research in metric-based anomaly detection and root cause analysis, and can be quite helpful for researchers seeking to try these published algorithms. We, however, have developed novel methods that our experiments show to be more powerful in each of these areas: our probable cause identification is based on a causal learning method, and for anomaly detection we use a combination of unsupervised methods. We present a demo of the methods we developed, followed by a detailed description and evaluation results.
Fault propagation depends on the causal relations in the application, i.e., the code written by the developers. Learning these relations requires both static and dynamic analysis of the code; however, observability tools in the cloud do not have access to the code, and even when the code is available, such analysis is difficult due to the large heterogeneity of programming languages, runtimes, and third-party services. We isolate the probable cause by identifying and modeling causal dependencies between components of hybrid applications, including the compute environment architecture, leveraging request-response paths that are available at runtime. Extracting all the unique paths and their latencies from the collected data, we identify the uniquely anomalous path or pinpoint monitoring data that is missing. When monitoring data is missing, passive data collection is not sufficient for diagnosis, and we recommend launching probes on demand. We formulate this as a partially observable Markov decision process that aims to select the minimum set of probes for determining the probable cause, and we solve it using reinforcement learning (PPO). To our understanding, the approach of combining the three unique elements described above is novel (patent pending). Our anomaly detection will be demonstrated in comparison to the methods in the Merlion library.
Given that our target users in IT Operations and Observability are not data scientists, and that reliable labeled data is extremely limited, we strongly favor unsupervised methods over supervised or semi-supervised methods. Using the publicly available SMD dataset, we will show that the combination of methods we use can perform as well as, and in some cases outperform, the semi-supervised methods in the library.
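The path-level idea above can be illustrated with a small, self-contained sketch: group trace spans by their request-response path and flag paths whose latencies deviate sharply from the path's typical behavior. The span data and the simple median-deviation rule below are illustrative stand-ins for the combination of unsupervised detectors used in the actual system:

```python
from collections import defaultdict
import numpy as np

# Each span: (trace_id, ordered tuple of services on the request-response path, latency in ms).
# Field names and values are illustrative.
spans = [
    ("t1", ("gateway", "orders", "db"), 42.0),
    ("t2", ("gateway", "orders", "db"), 45.0),
    ("t3", ("gateway", "orders", "db"), 44.0),
    ("t4", ("gateway", "orders", "db"), 400.0),   # suspiciously slow
    ("t5", ("gateway", "catalog"), 12.0),
    ("t6", ("gateway", "catalog"), 13.5),
    ("t7", ("gateway", "catalog"), 12.8),
]

# Group latencies by the unique path extracted from the collected traces.
latencies_by_path = defaultdict(list)
for _, path, latency in spans:
    latencies_by_path[path].append(latency)

# Flag paths containing latencies far from the path's median (a simple unsupervised rule).
for path, latencies in latencies_by_path.items():
    x = np.array(latencies)
    mad = np.median(np.abs(x - np.median(x))) or 1.0   # robust spread estimate
    robust_z = np.abs(x - np.median(x)) / mad
    if (robust_z > 5).any():
        print("anomalous latency on path:", " -> ".join(path))
```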
Reference available on request.
Presenter(s): Saurabh Jha
AI for IT Operations (AIOps) is a powerful platform for Site Reliability Engineers (SREs) to automate and streamline operational workflows. Automated log analysis, a critical task in AIOps, provides key insights to identify and address faults. Logs can capture a variety of information about an application, giving a deeper view of potential issues and helping to diagnose an ongoing problem. Tasks like format detection, classification, parsing, anomaly detection, and summarization are the key components of automated log analysis. These tasks typically require supervised learning with massive labeled data; however, log data is both limited in labels and highly diverse. Large Language Models (LLMs) like BERT and GPT-3 are trained using self-supervision on unlabeled data and provide generalized representations that can be used effectively for various downstream tasks with limited labeled data. This demo will showcase an LLM for log data: BERTOps, a model for AIOps that uses the IBM Slate model as a base. Our experiments demonstrate that BERTOps, when fine-tuned with a limited amount of labeled data (few-shot setting) tailored to each specific AIOps downstream task, surpasses the performance of state-of-the-art transformer models, underscoring its significance as a cost-effective and valuable addition to the AIOps platform. We will also show an interactive user interface that provides a summarized view of the log data and the detected anomalous log windows to help diagnose a fault. The demo uses a framework incorporating the various fine-tuned BERTOps models, and we will demonstrate why this framework is useful when domain experts are required for log diagnosis in a complex industrial application setting, significantly reducing manual effort and visual overload. The demo will highlight specific use cases and applications of the framework in IBM Software Support, IBM Automation, and IBM Consulting.
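Since BERTOps and the IBM Slate base model are not publicly released, the sketch below only illustrates the general few-shot fine-tuning pattern described above, using a generic BERT checkpoint and two made-up labeled log lines for a log format detection task:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# A handful of labeled log lines standing in for a few-shot AIOps task (format detection).
logs = [
    "ERROR java.lang.NullPointerException at com.app.Main",
    "127.0.0.1 - - [10/Oct/2023] \"GET / HTTP/1.1\" 200",
]
labels = [0, 1]  # 0 = java stack trace, 1 = apache access log

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

enc = tokenizer(logs, padding=True, truncation=True, return_tensors="pt")
dataset = TensorDataset(enc["input_ids"], enc["attention_mask"], torch.tensor(labels))
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):  # a few epochs are typically enough in the few-shot setting
    for input_ids, attention_mask, y in DataLoader(dataset, batch_size=2):
        out = model(input_ids=input_ids, attention_mask=attention_mask, labels=y)
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```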
Presenter(s): Ruchi Mahindru
The fast-increasing complexity of modern IT in multi-cloud environments is bringing unprecedented management challenges to Site Reliability Engineers (SREs), who must meet Service Level Objectives (SLOs) and keep systems up and running effectively. To put this in perspective, an availability SLO of 99.99% allows for 4.3 minutes of downtime per month, hardly something that can be attained by simply reacting to incidents. In this demo, we introduce our approach to addressing this challenge by transforming ITOps from reactive to proactive using large language models and advanced AI capabilities. The main goal of our work is to automate, as much as possible, the implementation of resolutions for upcoming IT issues before they turn into outages. Our demo consists of three steps: (1) Issue Diagnosis, where we have developed a language-model-based log data representation and built an AI system for probable cause identification using novel causal analysis and reinforcement learning, complemented with LLM-based summarization techniques that ease consumption of diagnosis results by SREs and by downstream issue resolution analytics; (2) Action Recommendation, which leverages state-of-the-art generative AI techniques to produce actionable recommendations; (3) Automation, where action recommendation outputs are transformed into code that can be executed to resolve the incidents.
Presenter(s): Yu Deng
While Foundation Models (FMs) have greatly transformed AI solutions for language and vision, they often fall short in addressing sensor and numerical time-series data, which is widely used in various industries. At IBM Research, our dedicated team focuses exclusively on advancing time-series foundation models and has made significant contributions with influential papers at top AI conferences. Our team has pioneered this space, defining the inaugural architectures for several popular time-series FM backbones, including the first transformer for multivariate time-series representation learning (TST, KDD 21), the first patched time-series transformer (PatchTST, ICLR 23), the first patched MLP-Mixer for time series (TSMixer, KDD 23), and the first multimodal transfer learning for new-product time-series forecasting (NPF, KDD 20). Our line of work not only aims to improve state-of-the-art (SOTA) accuracy but also focuses on achieving it with greatly reduced memory and compute requirements. Our latest models (PatchTST and TSMixer) are the leading SOTA in this space with a significant (2-3X) reduction in compute and memory requirements. For effective mindshare and open collaboration, we have released our SOTA models through various open-source channels (500+ stars, 100+ forks, and several blogs written by popular LinkedIn/Medium influencers). In fact, SOTA models like PatchTST proved so popular that, within a few months of being open-sourced, they were incorporated into almost all the well-known time-series libraries, such as GluonTS, NeuralForecast, and timeseriesAI (tsai). Our SOTA models (PatchTST and TSMixer) are currently being integrated into the Hugging Face Transformers repository and will be available at the time of the demonstration. In this session, we will demo our SOTA models for the larger scientific community and showcase interesting applications in diverse industrial settings across electricity, weather, traffic, retail, and more. Through illustrative notebooks and demos, we will discuss best practices and the impact of various modeling approaches, design choices, and hyperparameters that affect performance across datasets and use cases from different industries. We will also provide insights on the various pretraining and fine-tuning workflow templates that we have standardized for various industrial settings to help users get started quickly. This demo session will be hands-on using our open-source libraries, and we will release the demo notebooks and associated artifacts for wider use.
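The "patching" idea at the heart of PatchTST can be illustrated in a few lines of PyTorch: each channel of a multivariate series is split into fixed-length, possibly overlapping patches that become the tokens fed to a transformer encoder. The sizes below are arbitrary; this sketches the input transformation only, not the released model code:

```python
import torch

batch, channels, seq_len = 8, 7, 512      # multivariate series: 7 channels, 512 time steps
patch_len, stride = 16, 8                 # overlapping patches, as in patched time-series transformers

x = torch.randn(batch, channels, seq_len)

# unfold turns the time axis into (num_patches, patch_len) tokens per channel
patches = x.unfold(-1, patch_len, stride)               # (batch, channels, num_patches, patch_len)
num_patches = patches.shape[2]

# Each patch is linearly embedded and treated as one token; channels are handled independently.
embed = torch.nn.Linear(patch_len, 128)
tokens = embed(patches)                                  # (batch, channels, num_patches, d_model)
tokens = tokens.reshape(batch * channels, num_patches, 128)   # channel-independent token sequences
print(tokens.shape)                                      # torch.Size([56, 63, 128])
```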
Presenter(s): Nam Nguyen
Code translation aims to convert source code from one programming language (PL) to another. Given the promising abilities of large language models (LLMs) in code synthesis, researchers are actively exploring their potential to automate code translation, i.e., generating code in target PL from its equivalent in another PL. The pre-requisite for advancing the state of LLM-based code translation is to understand their limitations. To that end, we present a large-scale empirical study to investigate the ability of LLMs, including general LLMs and code LLMs, for code translation across pairs of different languages, including C, C++, Go, Java, and Python. Our analysis involves the translation of 1,700 code samples from three distinct benchmarks and real-world projects, revealing LLMs are yet to be reliably used to automate code translation---with incorrect translations ranging from 52.7% to 97.9% across the studied LLMs. Further manual investigation of unsuccessful translations among all PLs identifies 14 root causes for translation bugs. Based on the insights from the empirical study, we propose a prompt-crafting approach to provide additional context for LLMs, improving the performance of LLM-based code translation by 5.5% on average across different PLs, LLMs, and benchmarks. Our study is the first of its kind, in terms of its scale and breadth, that provides insights into the current limitations of LLMs in code translation and opportunities for improving them. Our collected extensive dataset---consisting of 1,700 code samples written in five PLs with 10K+ tests, 43K+ translated code, 1,725 manually labeled bugs, and 1,365 bug-fix pairs generated using LLMs---can help drive research in this area.
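The kind of context-enriched prompt the study advocates can be sketched generically: rather than a bare translation instruction, the prompt carries the source and target languages plus extra context such as I/O examples. This template is an illustration, not the exact prompt used in the paper:

```python
def build_translation_prompt(source_code: str, source_lang: str, target_lang: str,
                             io_examples: list[tuple[str, str]]) -> str:
    """Assemble a context-rich prompt for LLM-based code translation (illustrative template)."""
    examples = "\n".join(f"input: {i!r} -> expected output: {o!r}" for i, o in io_examples)
    return (
        f"Translate the following {source_lang} program to {target_lang}.\n"
        f"Preserve behavior exactly; the translated program must pass these I/O examples:\n"
        f"{examples}\n\n"
        f"{source_lang} source:\n{source_code}\n\n"
        f"{target_lang} translation:"
    )

prompt = build_translation_prompt(
    source_code="def add(a, b):\n    return a + b",
    source_lang="Python",
    target_lang="Go",
    io_examples=[("add(2, 3)", "5")],
)
print(prompt)
```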
Presenter(s): Rahul Krishna
Within enterprises, there is a growing need to intelligently navigate data lakes. Of particular importance to enterprises is the ability to find related tables in data repositories. These tables can be unionable, joinable, or subsets of each other. Example applications of this type of discovery include privacy enforcement and analytical queries that span multiple tables. There are now a number of pretrained models targeting the processing of tabular data, but none that target the data discovery use case in particular. There is also a dearth of benchmark tasks to help build the learning of data discovery tasks for neural tabular models. To help with neural tabular learning of data discovery, we developed a benchmark suite, LakeBench, for a diverse set of data discovery tasks based on government data from CKAN, Socrata, and the European Central Bank. Inspired by what has been shown to work well for data discovery tasks, we also used a novel approach based on data sketches to create a neural model TabSketchFM for data discovery. We contrast the data sketch based approach of TabSketchFM against row based approaches of other models and show that for data discovery tasks, data sketch based approaches are more effective. We examine which specific types of data sketches help which tasks with ablation studies. Finally we perform initial experiments to leverage models such as TabSketchFM in search, showing that they can re-rank and even improve top-k search results of the existing non-neural systems.
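A toy version of the sketch-based idea: summarize each column with a MinHash sketch and compare sketches to estimate overlap, the kind of compact signal a model like TabSketchFM can consume instead of raw rows. This uses the open-source datasketch package as a stand-in and is not the LakeBench or TabSketchFM code:

```python
from datasketch import MinHash

def column_sketch(values, num_perm=128):
    """Build a MinHash sketch summarizing the set of values in one column."""
    m = MinHash(num_perm=num_perm)
    for v in values:
        m.update(str(v).encode("utf8"))
    return m

# Two columns from different tables in a data lake (made-up values).
cities_a = ["boston", "new york", "chicago", "austin"]
cities_b = ["chicago", "austin", "denver", "boston"]

sketch_a = column_sketch(cities_a)
sketch_b = column_sketch(cities_b)

# Estimated Jaccard similarity suggests the columns are unionable/joinable candidates.
print(f"estimated overlap: {sketch_a.jaccard(sketch_b):.2f}")
```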
Presenter(s): Kavitha Srinivas & Julian Dolby
Visit us at booth 1209 in the exhibit hall to talk to our researchers and recruiters. We'll also be doing demos of our work. View our booth demo schedule and list of available IBM Research staff here.
Enterprise organizations have large amounts of data that are utilized by multiple Machine Learning (ML) models across various software frameworks. These models provide trends and insights from the data that can help enterprises define business rules around their processes. However, if certain aspects of this data are removed from the datasets, the business rules and policies in place could be affected. When a user requests that data be removed, the model may need to be retrained, a process called Machine Unlearning (MU). Recent research in MU covers different methods of retraining machine learning models, but there is a lack of work on removing certain aspects of data and quantifying the impact of that removal on the models.
This paper provides a novel methodology, IDMU (Impact Driven Machine Unlearning), that quantifies the impact of data removal requests while performing MU. Our method provides recommendations for data removal requests, factoring in underlying features of the data.
The results from the industrial application and evaluation of our method on a financial services dataset are encouraging. Overall, IDMU had a mean MAPE of 10.25% over a set of 120 data removal requests. It also saved approximately 1,900 hours of model retraining time over a period of three years by factoring in the urgency and impact of data removal requests.
Authors: Shubhi Asthana (IBM); Bing Zhang (IBM); Ruchi Mahindru (IBM); Indervir Singh Banipal (IBM); Pawan Chowdhary (IBM)
In response to recent data regulation requirements, machine unlearning (MU) has emerged as a critical process to remove the influence of specific examples from a given model. Although exact unlearning can be achieved through complete model retraining using the remaining dataset, the associated computational costs have driven the development of efficient, approximate unlearning techniques. Moving beyond data-centric MU approaches, our study introduces a novel model-based perspective: model sparsification via weight pruning, which is capable of reducing the gap between exact unlearning and approximate unlearning. We show in both theory and practice that model sparsity can boost the multi-criteria unlearning performance of an approximate unlearner, closing the approximation gap, while continuing to be efficient. This leads to a new MU paradigm, termed prune first, then unlearn, which infuses a sparse prior to the unlearning process. Building on this insight, we also develop a sparsity-aware unlearning method that utilizes sparsity regularization to enhance the training process of approximate unlearning. Extensive experiments show that our proposals consistently benefit MU in various unlearning scenarios. A notable highlight is the 77% unlearning efficacy gain of fine-tuning (one of the simplest approximate unlearning methods) when using our proposed sparsity-aware unlearning method. Furthermore, we showcase the practical impact of our proposed MU methods through two specific use cases: defending against backdoor attacks, and enhancing transfer learning through source class removal. These applications demonstrate the versatility and effectiveness of our approaches in addressing a variety of machine learning challenges beyond unlearning for data privacy.
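A compact sketch of the "prune first, then unlearn" recipe, assuming a placeholder classifier and stand-in retain-set data: impose sparsity with PyTorch's built-in L1 magnitude pruning, then approximately unlearn by fine-tuning on the retained data only (the paper's sparsity-aware regularizer is not reproduced here):

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Placeholder classifier standing in for the model to be unlearned.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))

# Step 1 ("prune first"): impose sparsity via L1 magnitude pruning on every linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.9)

# Step 2 ("then unlearn"): approximate unlearning by fine-tuning on the retain set only.
retain_loader = [(torch.randn(16, 20), torch.randint(0, 2, (16,))) for _ in range(10)]  # stand-in data
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss()

model.train()
for x, y in retain_loader:
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```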
Authors: Jinghan Jia; Jiancheng Liu; Parikshit Ram (IBM); Yuguang Yao; Gaowen Liu; Yang Liu; Pranay Sharma; Sijia Liu
This paper provides a theoretical understanding of deep Q-Networks (DQNs) with epsilon-greedy exploration in deep reinforcement learning. Despite the tremendous empirical achievements of the DQN, its theoretical characterization remains underexplored. First, the exploration strategy is either impractical or ignored in the existing analysis. Second, in contrast to conventional Q-learning algorithms, the DQN employs the target network and experience replay to acquire an unbiased estimation of the mean-square Bellman error (MSBE) utilized in training the Q-network. However, the existing theoretical analysis of DQNs lacks convergence analysis or bypasses the technical challenges by deploying a significantly overparameterized neural network, which is not computationally efficient. This paper provides the first theoretical convergence and sample complexity analysis of the practical setting of DQNs with epsilon-greedy policy. We prove that an iterative procedure with decaying epsilon converges to the optimal Q-value function geometrically. Moreover, a higher level of epsilon values enlarges the region of convergence but slows down the convergence, while the opposite holds for a lower level of epsilon values. Experiments justify our established theoretical insights on DQNs.
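The epsilon-greedy schedule analyzed here is straightforward to state in code: act greedily with probability 1 - epsilon, explore uniformly otherwise, and decay epsilon over training. The toy Q-network and decay rate below are placeholders, not the paper's experimental setup:

```python
import random
import torch

def epsilon_greedy(q_network, state, epsilon, num_actions):
    """Pick a random action with probability epsilon, otherwise the greedy action."""
    if random.random() < epsilon:
        return random.randrange(num_actions)
    with torch.no_grad():
        return int(q_network(state).argmax().item())

q_network = torch.nn.Linear(4, 2)          # toy Q-network: 4-dim state, 2 actions
epsilon, decay, min_epsilon = 1.0, 0.995, 0.05

for step in range(1_000):
    state = torch.randn(4)                 # stand-in for an environment observation
    action = epsilon_greedy(q_network, state, epsilon, num_actions=2)
    epsilon = max(min_epsilon, epsilon * decay)   # decaying epsilon, as in the analysis
```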
Authors: Pin-Yu Chen, Keerthiram Murugesan, Miao Liu, Songtao Lu, Subhajit Chaudhury
Sequence memory is an essential attribute of natural and artificial intelligence that enables agents to encode, store, and retrieve complex sequences of stimuli and actions. Computational models of sequence memory have been proposed where recurrent Hopfield-like neural networks are trained with temporally asymmetric Hebbian rules. However, these networks suffer from limited sequence capacity (maximal length of the stored sequence) due to interference between the memories. Inspired by recent work on Dense Associative Memories, we expand the sequence capacity of these models by introducing a nonlinear interaction term, enhancing separation between the patterns. We derive novel scaling laws for sequence capacity with respect to network size, significantly outperforming existing scaling laws for models based on traditional Hopfield networks, and verify these theoretical results with numerical simulation. Moreover, we introduce a generalized pseudoinverse rule to recall sequences of highly correlated patterns. Finally, we extend this model to store sequences with variable timing between states' transitions and describe a biologically-plausible implementation, with connections to motor neuroscience.
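The temporally asymmetric recall dynamics described above can be sketched in NumPy: overlaps of the current state with pattern mu drive the network toward pattern mu+1, with a nonlinearity applied to the overlaps to separate memories. The cubic nonlinearity and network size below are arbitrary choices for illustration, not the paper's exact model:

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons, seq_len, power = 200, 5, 3          # odd power > 1 mimics the separating nonlinearity

# A sequence of random binary patterns xi^1, ..., xi^L to be recalled in order.
xi = rng.choice([-1.0, 1.0], size=(seq_len, n_neurons))

def step(state):
    """One asymmetric update: overlaps with pattern mu drive the state toward pattern mu+1."""
    overlaps = xi[:-1] @ state / n_neurons      # m_mu = <xi^mu, state> / N
    drive = xi[1:].T @ (overlaps ** power)      # nonlinear separation of the overlaps
    return np.sign(drive)

state = xi[0].copy()                            # cue the network with the first pattern
for t in range(1, seq_len):
    state = step(state)
    print(f"overlap with pattern {t}: {float(xi[t] @ state) / n_neurons:+.2f}")
```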
Authors: Hamza Tahir Chaudhry; Jacob Zavatone-veth; Dmitry Krotov (IBM); Cengiz Pehlevan
There has been increasing interest in using symbolic models along with reinforcement learning (RL), where these coarser abstract models are used to provide RL agents with higher-level guidance. However, most of these works are inherently limited by their assumption of having access to a symbolic approximation of the underlying problem. To address this issue, we introduce a new method for learning optimistic symbolic approximations of the underlying world model. We show how these representations, coupled with fast diverse planners developed by the automated planning community, provide a new paradigm for optimistic exploration in sparse-reward settings. We also investigate the possibility of speeding up the learning process by generalizing learned model dynamics across similar actions with minimal human input. Finally, we evaluate the method by testing it on multiple benchmark domains and compare it with other RL strategies.
Authors: Sarath Sreedharan; Michael Katz (IBM)
Self-supervised pre-training methods on proteins have recently gained attention, with most approaches focusing on either protein sequences or structures, neglecting the exploration of their joint distribution, which is crucial for a comprehensive understanding of protein functions by integrating co-evolutionary information and structural characteristics. In this work, inspired by the success of denoising diffusion models in generative tasks, we propose the DiffPreT approach to pre-train a protein encoder by sequence-structure joint diffusion modeling. DiffPreT guides the encoder to recover the native protein sequences and structures from the perturbed ones along the joint diffusion trajectory, which acquires the joint distribution of sequences and structures. Considering the essential protein conformational variations, we enhance DiffPreT by a method called Siamese Diffusion Trajectory Prediction (SiamDiff) to capture the correlation between different conformers of a protein. SiamDiff attains this goal by maximizing the mutual information between representations of diffusion trajectories of structurally-correlated conformers. We study the effectiveness of DiffPreT and SiamDiff on both atom- and residue-level structure-based protein understanding tasks. Experimental results show that the performance of DiffPreT is consistently competitive on all tasks, and SiamDiff achieves new state-of-the-art performance, considering the mean ranks on all tasks. Code will be released upon acceptance.
Authors: Aurelie Lozano, Payel Das
The stochastic gradient descent (SGD) algorithm is the method of choice in many machine learning tasks thanks to its scalability and efficiency in dealing with large-scale problems. In this paper, we focus on the shuffling version of SGD, which matches the mainstream practical heuristics. We show the convergence to a global solution of shuffling SGD for a class of non-convex functions under over-parameterized settings. Our analysis employs more relaxed non-convex assumptions than the previous literature. Nevertheless, we maintain the desired computational complexity that shuffling SGD has achieved in the general convex setting.
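Shuffling SGD differs from i.i.d.-sampling SGD only in how indices are drawn: each epoch visits a fresh random permutation of the dataset exactly once. A minimal NumPy sketch on a toy least-squares problem (the problem and step size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 256, 10
A, b = rng.normal(size=(n, d)), rng.normal(size=n)   # toy least-squares problem
w, lr = np.zeros(d), 0.01

for epoch in range(20):
    for i in rng.permutation(n):                      # reshuffle each epoch, visit every sample once
        grad = (A[i] @ w - b[i]) * A[i]               # gradient of 0.5 * (a_i^T w - b_i)^2
        w -= lr * grad

print("final residual:", np.linalg.norm(A @ w - b))
```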
Authors: Lam Nguyen (IBM); Trang H. Tran
Diffusion models have emerged as a promising approach to data-driven planning and have demonstrated impressive performance in robotic control, reinforcement learning, and video planning. Given an effective planner, an important question to consider is replanning: when should previously generated plans be regenerated, due to action execution errors or external environment changes? Direct plan execution, without replanning, is problematic as errors from individual actions rapidly accumulate and environments are partially observable and stochastic. At the same time, replanning at each timestep incurs a substantial computational cost and may prevent successful task execution, as different generated plans prevent consistent progress toward any particular goal. In this paper, we explore how to replan effectively with diffusion models. We propose a principled approach to determine when to replan, based on the diffusion model's estimated likelihood of existing generated plans. We further present an approach to replan existing trajectories so that new plans follow the same goal state as the original trajectory, which can efficiently bootstrap off previously generated plans. We illustrate how a combination of our proposed additions significantly improves the performance of diffusion planners, leading to 38% gains over past diffusion planning approaches on Maze2D and further enabling the handling of stochastic and long-horizon robotic control tasks.
Authors: Shun Zhang, Yikang Shen, Chuang Gan
Missing values in real-world data pose a significant and unique challenge to algorithmic fairness. Different demographic groups may be unequally affected by missing data, and the standard procedure for handling missing values, in which data is first imputed and the imputed data is then used for classification (referred to as "impute-then-classify"), can exacerbate discrimination. In this paper, we analyze how missing values affect algorithmic fairness. We first prove that training a classifier from imputed data can significantly worsen the achievable values of group fairness and average accuracy. This is because imputing data results in the loss of the missing pattern of the data, which often conveys information about the predictive label. We present scalable and adaptive algorithms for fair classification with missing values. These algorithms can be combined with any preexisting fairness-intervention algorithm to handle all possible missing patterns while preserving information encoded within the missing patterns. Numerical experiments with state-of-the-art fairness interventions demonstrate that our adaptive algorithms consistently achieve higher fairness and accuracy than impute-then-classify across different datasets.
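One concrete way to avoid plain impute-then-classify, in the spirit of preserving the information encoded in the missing patterns, is to append missingness-indicator features alongside the imputed values; scikit-learn supports this directly. The snippet only illustrates keeping the missing pattern and does not include the paper's fairness interventions:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X = np.array([[25.0, np.nan], [40.0, 60000.0], [np.nan, 52000.0], [33.0, np.nan]])
y = np.array([0, 1, 1, 0])

# add_indicator=True appends one binary column per feature marking where values were missing,
# so the classifier still sees the missingness pattern rather than only imputed values.
clf = make_pipeline(SimpleImputer(strategy="mean", add_indicator=True), LogisticRegression())
clf.fit(X, y)
print(clf.predict(X))
```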
Authors: Raymond Feng; Flavio Calmon; Hao Wang (IBM)
With the popularity of automatic code generation tools, such as Copilot, the study of the potential hazards of these tools is gaining importance. In this work, we explore the social bias problem in pre-trained code generation models. We propose a new paradigm to construct code prompts and successfully uncover social biases in code generation models. To quantify the severity of social biases in generated code, we develop a dataset along with three metrics to evaluate the overall social bias and fine-grained unfairness across different demographics. Experimental results on three pre-trained code generation models (Codex, InCoder, and CodeGen) with varying sizes reveal severe social biases. Moreover, we conduct analyses to provide useful insights for the subsequent choice of code generation models with low social bias.
Authors: Yan Liu; Xiaokang Chen; Yan Gao; Zhe Su; Fengji Zhang; Daoguang Zan; Jian-guang Lou; Pin-Yu Chen (IBM); Tsung-yi Ho
Although pairwise causal relations have been extensively studied in observational longitudinal analyses across many disciplines, incorporating knowledge of causal pairs into deep learning models for temporal event sequences remains largely unexplored. In this paper, we propose a novel approach for enhancing the performance of transformer-based models in multivariate event sequences by injecting pairwise qualitative causal knowledge such as `event Z amplifies future occurrences of event Y'. We establish a new framework for causal inference in temporal event sequences using a transformer architecture, providing a theoretical justification for our approach, and show how to obtain unbiased estimates of the proposed measure. Experimental results demonstrate that our approach outperforms several state-of-the-art models in terms of prediction accuracy by effectively leveraging knowledge about causal pairs. We also consider a unique application where we extract knowledge around sequences of societal events by generating them from a large language model, and demonstrate how a causal knowledge graph can help with event prediction in such sequences. Overall, our framework offers a practical means of improving the performance of transformer-based models in multivariate event sequences by explicitly exploiting pairwise causal information.
Authors: Jannis Born
Deep neural networks have been increasingly used in real-world applications, making it critical to ensure their ability to adapt to new, unseen data. In this paper, we study the generalization capability of neural networks trained with (stochastic) gradient descent. We establish a new connection between the loss dynamics of gradient flow and general kernel machines using a unique kernel, the loss path kernel. This kernel measures the similarity between two data points by evaluating the agreement between loss gradients along the path determined by the gradient flow. Based on this, we derive a new generalization upper bound that applies to general neural network architectures. This new bound is tight and strongly correlated with the true generalization error. We apply our results to guide the design of neural architecture search (NAS) and demonstrate the favorable performance compared with state-of-the-art NAS algorithms through numerical experiments.
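Based only on the description above (similarity as the accumulated agreement of loss gradients along the gradient-flow path), one plausible way to write the loss path kernel is the following; this is a reconstruction from the abstract, not a quotation of the paper's definition, and the per-example loss \ell and empirical loss \widehat{L} are notational assumptions:

```latex
% Loss path kernel (reconstructed from the description): accumulated agreement of
% per-example loss gradients along the gradient-flow trajectory w(t).
K_T(x, x') \;=\; \int_0^T \Big\langle \nabla_w \,\ell\big(w(t), x\big),\; \nabla_w \,\ell\big(w(t), x'\big) \Big\rangle \, dt,
\qquad \text{where } \dot{w}(t) = -\nabla_w \, \widehat{L}\big(w(t)\big).
```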
Authors: Yilan Chen; Wei Huang; Hao Wang (IBM); Charlotte Loh; Akash Srivastava (IBM); Lam Nguyen (IBM); Lily Weng
Vision and Language (VL) models offer an effective method for aligning representation spaces of images and text, allowing for numerous applications such as cross-modal retrieval, visual and multi-hop question answering, captioning, and many more. However, the aligned image-text spaces learned by all the popular VL models still suffer from the so-called 'object bias': their representations behave as 'bags of nouns', mostly ignoring or downsizing the attributes, relations, and states of objects described in texts or appearing in images. Although some notable attempts at fixing this issue have been proposed in the recent literature, the problem is still far from being solved. In this paper, we make two interesting observations that enable boosting VL models' understanding of non-noun language concepts by a considerable amount (up to ∼30%). These two factors are: (i) the caption quality, or in other words the 'image-alignment', of the texts in the finetuning (or pre-training) paired VL dataset; and (ii) the 'density' of the captions, in the sense of mentioning all the details appearing in the image.
Authors: Sivan Doveh (IBM); Assaf Arbelle (IBM); Amit Alfassy (IBM); Sivan Harary (IBM); Paola Cascante-bonilla (IBM); Roi Herzig (IBM); Eliyahu Schwartz (IBM); Donghyun Kim (IBM); Rameswar Panda (IBM); Rogerio Feris (IBM); Raja Giryes; Shimon Ullman; Leonid Karlinsky (IBM)
Nature evolves creatures with a high complexity of morphological and behavioral intelligence, meanwhile computational methods lag in approaching that diversity and efficacy. Co-optimization of artificial creatures' morphology and control shows promise for applications in physical soft robotics and virtual character creation; such approaches, however, require developing new learning algorithms that can reason about function atop pure structure. In this paper, we present DiffuseBot, a physics-augmented diffusion model that generates soft robot morphologies capable of excelling in a wide spectrum of tasks. DiffuseBot bridges the gap between virtually generated content and physical utility by (i) augmenting the diffusion process with a physical dynamical simulation which provides a certificate of performance, and (ii) introducing a co-design procedure that jointly optimizes physical design and control by leveraging information about physical sensitivities from differentiable simulation. We showcase a range of simulated and fabricated robots along with their capabilities.
Authors: Tsun-hsuan Wang; Juntian Zheng (IBM); Pingchuan Ma; Yilun Du; Byungchul Kim; Andrew Everett Spielberg; Joshua B. Tenenbaum; Chuang Gan (IBM); Daniela Rus
Traditional federated learning (FL) algorithms operate under the assumption that the data distributions at training (source domains) and testing (target domain) are the same. The fact that domain shifts often occur in practice necessitates equipping FL methods with a domain generalization (DG) capability. However, existing DG algorithms face fundamental challenges in FL setups due to the lack of samples/domains in each client’s local dataset. In this paper, we propose StableFDG, a style and attention based learning strategy for accomplishing federated domain generalization, introducing two key contributions. The first is style-based learning, which enables each client to explore novel styles beyond the original source domains in its local dataset, improving domain diversity based on the proposed style sharing, shifting, and exploration strategies. Our second contribution is an attention-based feature highlighter, which captures the similarities between the features of data samples in the same class, and emphasizes the important/common characteristics to better learn the domain-invariant characteristics of each class in data-poor FL scenarios. Experimental results show that StableFDG outperforms existing baselines on various DG benchmark datasets, demonstrating its efficacy.
Authors: Jungwuk Park; Dong-jun Han; Jinho Kim; Shiqiang Wang (IBM); Christopher Brinton; Jaekyun Moon
Our work combines aspects of three promising paradigms in machine learning, namely, attention mechanism, energy-based models, and associative memory. Attention is the power-house driving modern deep learning successes, but it lacks clear theoretical foundations. Energy-based models allow a principled approach to discriminative and generative tasks, but the design of the energy functional is not straightforward. At the same time, Dense Associative Memory models or Modern Hopfield Networks have a well-established theoretical foundation, and allow an intuitive design of the energy function. We propose a novel architecture, called the Energy transformer (or ET for short), that uses a sequence of attention layers that are purposely designed to minimize a specifically engineered energy function, which is responsible for representing the relationships between the tokens. In this work, we introduce the theoretical foundations of ET, explore its empirical capabilities using the image completion task, and obtain strong quantitative results on the graph anomaly detection and graph classification tasks.
Authors: Benjamin Hoover (IBM); Yuchen Liang; Bao Pham; Rameswar Panda (IBM); Hendrik Strobelt (IBM); Polo Chau; Mohammed Zaki; Dmitry Krotov (IBM)
The use of Deep Generative Models has revolutionized the fields of vision and language, paving the way for a new era of multimodal generative applications. In light of these achievements, researchers have begun exploring the use of generative models in scientific and engineering applications, with the goal of accelerating the design process and reducing the need for computationally intensive iterative optimization techniques. Despite these efforts, generative models still lag behind classic optimization methods based on physical principles in constrained environments where data is scarce and precision is paramount. While recent advancements in Generative Models conditioned on physical fields and guided by performance have shown promise, their efficacy is largely dependent on costly finite element methods, labeled datasets required to train surrogate models and slow sampling processes. In order to overcome these challenges, we introduce a new approach called Diffusion Optimization Models (DOM). DOM utilizes the hierarchical sampling structure inherent in diffusion models to efficiently generate a wide range of high-quality designs in just a few steps. By utilizing inexpensive kernel conditioning and trajectory alignment, DOM guides the sampling trajectory toward the optimization path, grounding it in the physical process. Moreover, we employ a few steps of explicit optimization to refine the generated candidates, ensuring that they meet the necessary performance requirements and manufacturability standards. We thoroughly evaluate our framework using topology optimization, a fundamental problem in mechanical design, on both in- and out-of-distribution configurations. Our results demonstrate that trajectory alignment significantly improves engineering performance and efficiency without incurring any additional cost at inference time. Thanks to our hybrid formulation, we are able to generate high-quality designs in a mere few steps and steer them toward regions of high performance and manufacturability without the need for external surrogate models or costly labeled data. This approach paves the way for the widespread application of DOM in data-driven design at large scales.
Authors: Akash Srivastava
Length generalization, the ability to generalize from small training context sizes to larger ones, is a critical challenge in the development of Transformer-based language models. Positional encoding (PE) has been identified as a major factor influencing length generalization, but the exact impact of different PE schemes on extrapolation in downstream tasks remains unclear. In this paper, we conduct a systematic empirical study comparing the length generalization performance of decoder-only Transformers with five different position encoding approaches including Absolute Position Embedding (APE), T5's Relative PE, ALiBi, and Rotary, in addition to Transformers without positional encoding (NoPE). Our evaluation encompasses a battery of reasoning and mathematical tasks. Our findings reveal that the most commonly used positional encoding methods, such as ALiBi, Rotary, and APE, are not well suited for length generalization in downstream tasks. More importantly, NoPE outperforms other explicit positional encoding methods while requiring no additional computation. We theoretically demonstrate that NoPE can represent both absolute and relative PEs, but when trained with SGD, it mostly resembles T5's relative PE attention patterns. Finally, we find that scratchpad is not always helpful to solve length generalization and its format highly impacts the model's performance. Overall, our work suggests that explicit position embeddings are not essential for decoder-only Transformers to generalize well to longer sequences.
Authors: Amirhossein Kazemnejad; Inkit Padhi (IBM); Karthikeyan Natesan Ramamurthy (IBM); Payel Das (IBM); Siva Reddy
Advancements in deep neural networks have allowed automatic speech recognition (ASR) systems to attain human parity on several publicly available clean speech datasets. However, even state-of-the-art ASR systems experience performance degradation when confronted with adverse conditions, as a well-trained acoustic model is sensitive to variations in the speech domain, e.g., background noise. Intuitively, humans address this issue by relying on their linguistic knowledge: the meaning of ambiguous spoken terms is usually inferred from contextual cues, thereby reducing the dependency on the auditory system. Inspired by this observation, we introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction, where N-best decoding hypotheses provide informative elements for true transcription prediction. This approach is a paradigm shift from the traditional language model rescoring strategy that can only select one candidate hypothesis as the output transcription. The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses and corresponding accurate transcriptions across prevalent speech domains. Given this dataset, we examine three types of error correction techniques based on LLMs with varying amounts of labeled hypothesis-transcription pairs, which yield significant word error rate (WER) reductions. Experimental evidence demonstrates that the proposed technique achieves a breakthrough by surpassing the upper bound of traditional re-ranking based methods. More surprisingly, an LLM with a reasonable prompt and its generative capability can even correct tokens that are missing from the N-best list. We make our results publicly accessible as reproducible pipelines with released pre-trained models, thus providing a new evaluation paradigm for ASR error correction with LLMs.
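The input format is simple to sketch: an N-best list from the ASR decoder plus an instruction asking the LLM to produce the true transcription. The prompt builder below is a generic illustration, not the HyPoradise benchmark's exact template:

```python
def build_correction_prompt(nbest: list[str]) -> str:
    """Turn an ASR N-best list into an error-correction prompt for an LLM (illustrative)."""
    hypotheses = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(nbest))
    return (
        "The following are the top ASR hypotheses for one utterance, ordered by decoder score.\n"
        f"{hypotheses}\n"
        "Using all of them as evidence, output the most likely true transcription:"
    )

nbest = [
    "i scream for ice cream",
    "eye scream for ice cream",
    "i scream for i scream",
]
print(build_correction_prompt(nbest))
```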
Authors: Pin-Yu Chen
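As a rough illustration of this setup, the sketch below builds a correction prompt from an N-best hypothesis list; the prompt wording and the `generate` helper are hypothetical and not the released benchmark pipeline.

```python
# Illustrative sketch: building an error-correction prompt from an N-best ASR list.
# The prompt wording and the `generate` helper are hypothetical, not the benchmark's exact setup.
def build_correction_prompt(nbest):
    lines = "\n".join(f"{i + 1}. {hyp}" for i, hyp in enumerate(nbest))
    return (
        "The following are the N-best hypotheses from a speech recognizer for one utterance.\n"
        f"{lines}\n"
        "Report the most likely true transcription, correcting any recognition errors:"
    )

nbest = [
    "the whether today is lovely",
    "the weather to day is lovely",
    "the weather today is lovly",
]
prompt = build_correction_prompt(nbest)
# transcription = generate(llm, prompt)  # any instruction-tuned LLM could be plugged in here
print(prompt)
```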
Recent advances in large language models (LLMs) and the intensifying popularity of ChatGPT-like applications have blurred the boundary of high-quality text generation between humans and machines. However, in addition to the anticipated revolutionary changes to our technology and society, the difficulty of distinguishing LLM-generated texts (AI-text) from human-generated texts poses new challenges of misuse and fairness, such as fake content generation, plagiarism, and false accusations of innocent writers. While existing works show that current AI-text detectors are not robust to LLM-based paraphrasing, this paper aims to bridge this gap by proposing a new framework called RADAR, which jointly trains a robust AI-text detector via adversarial learning. RADAR is based on adversarial training of a paraphraser and a detector. The paraphraser's goal is to generate realistic content to evade AI-text detection. RADAR uses the feedback from the detector to update the paraphraser, and vice versa. Evaluated with 8 different LLMs (Pythia, Dolly 2.0, Palmyra, Camel, GPT-J, Dolly 1.0, LLaMA, and Vicuna) across 4 datasets, experimental results show that RADAR significantly outperforms existing AI-text detection methods, especially when paraphrasing is in place. We also identify the strong transferability of RADAR from instruction-tuned LLMs to other LLMs, and evaluate the improved capability of RADAR via GPT-3.5-Turbo.
Authors: Xiaomeng Hu; Pin-Yu Chen (IBM); Tsung-yi Ho
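The adversarial loop can be illustrated with toy components, a word-shuffling "paraphraser" and a vocabulary-based "detector"; the real system uses an LLM paraphraser and a learned detector, so this is only a schematic sketch of the alternating updates.

```python
# Schematic sketch of RADAR-style adversarial co-training: a paraphraser tries to evade an
# AI-text detector, and the detector is refit on the paraphraser's outputs. The components
# below are toy stand-ins, not the released implementation.
import random

random.seed(0)

def toy_paraphrase(text):
    # Toy "paraphraser": shuffle word order slightly (a real paraphraser would be an LLM).
    words = text.split()
    i = random.randrange(max(len(words) - 1, 1))
    words[i], words[-1] = words[-1], words[i]
    return " ".join(words)

class ToyDetector:
    def __init__(self):
        self.ai_vocab = set()
    def update(self, ai_texts, human_texts):
        human_vocab = {w for t in human_texts for w in t.split()}
        self.ai_vocab = {w for t in ai_texts for w in t.split()} - human_vocab
    def score(self, text):  # fraction of words that look "AI-specific"
        words = text.split()
        return sum(w in self.ai_vocab for w in words) / max(len(words), 1)

def adversarial_round(detector, ai_texts, human_texts):
    paraphrased = [toy_paraphrase(t) for t in ai_texts]      # evasion attempt
    detector.update(ai_texts + paraphrased, human_texts)     # detector adapts to the evasions
    return sum(detector.score(t) for t in paraphrased) / len(paraphrased)

detector = ToyDetector()
print(adversarial_round(detector, ["generated sample text"], ["a human written sentence"]))
```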
Functional constrained optimization (FCO) has emerged as a powerful tool for solving various machine learning problems. However, with the rapid increase in applications of neural networks in recent years, it has become apparent that both the objective and constraints often involve nonconvex functions, which poses significant challenges in obtaining high-quality solutions. In this work, we focus on a class of structured nonconvex FCO problems, where the two optimization variables are nonlinearly coupled in the inequality constraint. Leveraging the primal-dual optimization framework, we propose a smoothed first-order Lagrangian method (SLM) for solving this class of problems. We establish the theoretical convergence rates of SLM to the Karush-Kuhn-Tucker (KKT) solutions through quantifying dual error bounds. By establishing connections between this structured FCO and equilibrium-constrained nonconvex problems (also known as bilevel optimization), we apply the proposed SLM to tackle bilevel optimization problems where the lower-level problem is nonconvex. Numerical results obtained from both toy examples and hyper-data cleaning problems demonstrate the superiority of SLM compared to benchmark methods.
Authors: Songtao Lu
Graph Neural Networks (GNNs) have emerged as a powerful tool for semi-supervised node classification tasks. However, recent studies have revealed various biases in GNNs stemming from both node features and graph topology. In this work, we uncover a new bias, label position bias, which indicates that nodes closer to the labeled nodes tend to perform better. We introduce a new metric, the Label Proximity Score, to quantify this bias, and find that it is closely related to performance disparities. To address the label position bias, we propose a novel optimization framework for learning a label position unbiased graph structure, which can be applied to existing GNNs. Extensive experiments demonstrate that our proposed method not only outperforms backbone methods but also significantly mitigates the issue of label position bias in GNNs.
Authors: Charu Aggarwal
Cookies are designed to enable more accurate identification and tracking of user behavior, in turn allowing for more personalized ads and better performing ad campaigns. Given the additional information that is recorded, questions related to privacy and fairness naturally arise. How does a user's consent decision influence how much the system can learn about their demographic and tastes? Is the impact of a user's consent decision on the recommender system's ability to learn about their latent attributes uniform across demographics? We investigate these questions in the context of an engagement-driven recommender system using simulation. We empirically demonstrate that when consent rates exhibit demographic-dependence, user consent has a disparate impact on the recommender agent's ability to estimate users' latent attributes. In particular, we find that when consent rates are cohort-dependent, a user disagreeing to share their cookie may counter-intuitively cause the recommender agent to know more about the user than if the user agreed to share their cookie. Furthermore, the gap in base consent rates across demographics serves as an amplifier: users from the higher consent rate demographic who disagree to cookie sharing experience higher estimation errors than the same users from the lower consent rate demographic, and conversely for users who choose to agree to cookie sharing. Attesting to the informational richness of user responses, these effects diminish completely as the recommender system is trained on more user responses, independently of the users' consent decisions. We discuss the need for new notions of fairness that encourage consistency of a user's privacy decision and the agent's resulting estimation accuracy of their latent attributes.
Authors: Elizabeth Daly, Rahul Nair, Robert Redmond, Erik Miehling, Karthikeyan Natesan Ramamurthy
The local interpretable model-agnostic explanations (LIME) method is one of the most popular methods used to explain black-box models at a per-example level. Although many variants have been proposed, few provide a simple way to produce high-fidelity explanations that are also stable and intuitive in the neighborhood of the example. In this work, we provide a novel perspective by proposing a model-agnostic local explanation method inspired by the invariant risk minimization (IRM) principle -- originally proposed for (global) out-of-distribution generalization -- to provide high-fidelity explanations that are robust across neighborhoods and for nearby examples. Our method is based on a game-theoretic formulation where we theoretically show that our approach has a strong tendency to eliminate features where the gradient of the black-box function abruptly changes sign in the locality of the example we want to explain, while in other cases it is more careful and will choose a more conservative (feature) attribution, a behavior which can be highly desirable for recourse. Empirically, we show on tabular, image and text data that the quality of our explanations with neighborhoods formed using random perturbations is much better than LIME and in some cases even comparable to other methods that use realistic neighbors sampled from the data manifold, where the latter is a popular strategy to obtain high-quality explanations. This is a desirable property given that learning a manifold to either create realistic neighbors or to project explanations is typically expensive or may even be impossible. Moreover, our algorithm is simple and efficient to train, and can ascertain stable input features for local decisions of a black-box model without access to side information such as a (partial) causal graph, as has been used in some recent works.
Authors: Amit Dhurandhar (IBM); Karthikeyan Natesan Ramamurthy (IBM); Kartik Ahuja; Vijay Arya (IBM)
In this work we make progress in understanding the relationship between learning models when given access to entangled measurements, separable measurements, and statistical measurements in the quantum statistical query (QSQ) model. To this end, we show the following results.
Entanglement versus separable measurements. The goal here is to learn an unknown state from a concept class given copies of the state. We show that any concept class learnable from a given number of copies using entangled measurements can also be learned using only separable measurements with a polynomial overhead in the number of copies.
Entangled versus statistical measurements. The goal here is to learn a function given access to separable measurements and statistical measurements. We exhibit a concept class based on degree-2 functions that gives an exponential separation between QSQ learning and quantum learning with entangled measurements (even in the presence of noise). This proves the "quantum analogue" of the seminal result of Blum et al. (2003) that separates classical SQ learning from classical learning with classification noise.
QSQ lower bounds for learning states. The main technical contribution is to introduce a quantum statistical query dimension, which we use to give lower bounds on the QSQ complexity of learning. Using this, we prove exponential lower bounds for testing purity of quantum states, learning CCHL states, coset states of Abelian groups, degree-2 functions, planted bi-clique states, and learning output states of low-depth Clifford circuits.
Further applications. Our lower bounds give an unconditional separation between weak and strong error mitigation and imply lower bounds for learning distributions in the QSQ model. Prior works by Quek et al., Hinsche et al., and Nietner et al. proved the analogous results assuming diagonal measurements, and our work removes this assumption.
Authors: Vojtěch Havlíček, Srinivasan Arunachalam, Louis Schatzki
Integrated circuits (ICs) underpin all major information technology innovations, including the current wave of artificial intelligence (AI). Modern IC designs often involve analyses of complex phenomena (such as timing, noise, and power) for tens of billions of electronic components, such as resistors (R), capacitors (C), transistors, and gates, interconnected in various complex structures. These analyses must strike a balance between accuracy and speed because they are carried out many times throughout the IC design cycle. With the advancement of AI, researchers have also started to explore new ways of leveraging AI to improve these analyses. This paper focuses on one of the most important analyses, timing analysis for interconnects. Since IC interconnects can be represented as an RC tree, i.e., a specialized tree-structured graph, we design a novel tree-based graph neural network, SyncTREE, to speed up timing analysis by incorporating both the structural and physical properties of electronic circuits. Our major innovations include (1) a two-pass message-passing scheme (bottom-up and top-down) for graph embedding, (2) a tree contrastive loss to guide learning, and (3) a closed-formula-based approach for fast timing estimation. Our experiments show that, compared to conventional GNN models, SyncTREE achieves the best timing prediction in terms of both delays and slews, all in reference to the industry golden numerical analysis results on real IC design data.
Authors: Yuting Hu; Jiajie Li; Florian Klemme; Gi-Joon Nam (IBM); Tengfei Ma; Hussam Amrouch; Jinjun Xiong
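A minimal sketch of the two-pass (bottom-up, then top-down) message-passing idea on a small RC tree is shown below; the plain sum-and-scale updates are illustrative stand-ins for SyncTREE's learned aggregation functions.

```python
# Sketch of two-pass (bottom-up then top-down) message passing on a tree, the core idea behind
# tree-structured GNNs such as the one described above. The aggregation here is a toy weighted
# sum; the actual SyncTREE architecture uses learned update functions.
import numpy as np

def two_pass_messages(children, features, w_up=0.5, w_down=0.5):
    n = len(features)
    up = np.array(features, dtype=float)
    # Bottom-up pass: visit children before parents (node ids are assumed to be a topological
    # order from the root, id 0, so reversing the ids works for this toy example).
    for v in reversed(range(n)):
        for c in children[v]:
            up[v] += w_up * up[c]
    # Top-down pass: propagate the aggregated signal from the root back toward the leaves.
    down = up.copy()
    for v in range(n):
        for c in children[v]:
            down[c] += w_down * down[v]
    return down

# Tiny RC tree: node 0 is the driver/root, nodes 1-3 hang below it.
children = {0: [1, 2], 1: [3], 2: [], 3: []}
features = [1.0, 0.2, 0.3, 0.1]  # e.g., a per-node capacitance-like feature
print(two_pass_messages(children, features))
```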
People are relying on AI agents to assist them with various tasks. The human must know when to rely on the agent, collaborate with the agent, or ignore its suggestions. In this work, we propose to learn rules grounded in data regions and described in natural language that illustrate how the human should collaborate with the AI. Our novel region discovery algorithm finds local regions in the data as neighborhoods in an embedding space that corrects the human prior. Each region is then described using an iterative and contrastive procedure where a large language model describes the region. We then teach these rules to the human via an onboarding stage. Through user studies on object detection and question-answering tasks, we show that our method can lead to more accurate human-AI teams. We also evaluate our region discovery and description algorithms separately.
Authors: Hussein Mozannar; Jimin Lee; Dennis Wei (IBM); Prasanna Sattigeri (IBM); Subhro Das (IBM); David Sontag
Efficient transfer learning algorithms are key to the success of foundation models on diverse downstream tasks even with limited data. Recent works of Basu et al. (2023) and Kaba et al. (2022) propose group averaging (equitune) and optimization-based methods, respectively, over features from group-transformed inputs to obtain equivariant outputs from non-equivariant neural networks. While Kaba et al. (2022) are only concerned with training from scratch, we find that equitune performs poorly on equivariant zero-shot tasks despite good finetuning results. We hypothesize that this is because pretrained models provide better quality features for certain transformations than others and simply averaging them is deleterious. Hence, we propose λ-equitune, which averages the features using importance weights, λs. These weights are learned directly from the data using a small neural network, leading to excellent zero-shot and finetuned results that outperform equitune. Further, we prove that λ-equitune is equivariant and a universal approximator of equivariant functions. Additionally, we show that the method of Kaba et al. (2022) used with appropriate loss functions, which we call equizero, also gives excellent zero-shot and finetuned performance. Both equitune and equizero are special cases of λ-equitune. To show the simplicity and generality of our method, we validate on a wide range of diverse applications and models such as 1) image classification using CLIP, 2) deep Q-learning, 3) fairness in natural language generation (NLG), 4) compositional generalization in languages, and 5) image classification using pretrained CNNs such as ResNet and AlexNet.
Authors: Sourya Basu; Pulkit Katdare; Prasanna Sattigeri (IBM); Vijil Vijil (IBM); Katherine Driggs-campbell; Payel Das (IBM); Lav Varshney
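The core λ-weighted group averaging can be sketched in a few lines of PyTorch for an invariant prediction task over 90-degree rotations; the backbone, the tiny weighting network, and all sizes below are illustrative assumptions rather than the paper's configuration.

```python
# Minimal sketch (PyTorch) of importance-weighted group averaging in the spirit of λ-equitune,
# here for the group of 90-degree image rotations and an invariant feature output.
import torch
import torch.nn as nn

class LambdaEquituneWrapper(nn.Module):
    def __init__(self, backbone, feat_dim, n_group=4):
        super().__init__()
        self.backbone = backbone
        self.n_group = n_group
        # Small network that assigns an importance weight λ to each group element's features.
        self.weight_net = nn.Sequential(nn.Linear(feat_dim, 16), nn.ReLU(), nn.Linear(16, 1))

    def forward(self, x):  # x: (batch, channels, H, W)
        feats, logits = [], []
        for k in range(self.n_group):
            xk = torch.rot90(x, k, dims=(2, 3))        # group-transformed input
            fk = self.backbone(xk)                      # pretrained features for this transform
            feats.append(fk)
            logits.append(self.weight_net(fk))          # unnormalized importance weight λ_k
        feats = torch.stack(feats, dim=1)               # (batch, |G|, feat_dim)
        lam = torch.softmax(torch.stack(logits, dim=1), dim=1)
        return (lam * feats).sum(dim=1)                 # λ-weighted group average

backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 32))  # stand-in for a pretrained model
model = LambdaEquituneWrapper(backbone, feat_dim=32)
print(model(torch.randn(2, 3, 8, 8)).shape)             # torch.Size([2, 32])
```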
Machine learning (ML) models can underperform on certain population groups due to choices made during model development and bias inherent in the data. We categorize sources of discrimination in the ML pipeline into two classes: \emph{aleatoric discrimination}, which is inherent in the data distribution, and \emph{epistemic discrimination}, which is due to decisions during model development. We quantify aleatoric discrimination by determining the performance limits of a model under fairness constraints, assuming perfect knowledge of the data distribution. We demonstrate how to characterize aleatoric discrimination by applying Blackwell's results on comparing statistical experiments. We then quantify epistemic discrimination as the gap between a model's accuracy when fairness constraints are applied and the limit posed by aleatoric discrimination. We apply this approach to benchmark existing interventions and investigate fairness risks in data with missing values. Our results indicate that state-of-the-art fairness interventions are effective at removing epistemic discrimination. However, when data has missing values, there is still significant room for improvement in handling aleatoric discrimination.
Authors: Hao Wang (IBM); Luxi He; Rui Gao; Flavio Calmon
Offline reinforcement learning (RL) enables learning a decision-making policy without interaction with the environment. This makes it particularly beneficial in situations where such interactions are costly. However, a known challenge for offline RL algorithms is the distributional mismatch between the state-action distributions of the learned policy and the dataset, which can significantly impact performance. State-of-the-art algorithms address it by constraining the policy to align with the state-action pairs in the dataset. However, this strategy struggles on datasets that predominantly consist of trajectories collected by low-performing policies and only a few trajectories from high-performing ones. Indeed, the constraint to align with the data leads the policy to imitate low-performing behaviors predominating the dataset. In this paper, we propose an importance weighted sampling method that allows the learned policy to only align with the high-performing decisions in the dataset, unlocking offline RL algorithms' potential on such imbalanced datasets.
Authors: Akash Srivastava, Abhishek Bhandwaldar
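A simplified version of the idea, return-weighted sampling of trajectories from an imbalanced dataset so that batches emphasize high-performing behavior, might look as follows; the softmax temperature and per-trajectory weighting are simplifying assumptions, not the paper's exact scheme.

```python
# Illustrative sketch: return-weighted sampling from an imbalanced offline RL dataset so that
# training batches are dominated by high-performing behavior.
import numpy as np

def sample_weighted_batch(trajectories, returns, batch_size, temperature=1.0, rng=None):
    rng = rng or np.random.default_rng(0)
    r = np.asarray(returns, dtype=float)
    w = np.exp((r - r.max()) / temperature)        # softmax-style weights over trajectory returns
    probs = w / w.sum()
    idx = rng.choice(len(trajectories), size=batch_size, p=probs)
    return [trajectories[i] for i in idx]

trajectories = ["low_1", "low_2", "low_3", "expert"]   # mostly low-return data, one expert trajectory
returns = [1.0, 1.5, 0.5, 10.0]
print(sample_weighted_batch(trajectories, returns, batch_size=8))  # the expert trajectory dominates
```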
The accurate predictions and principled uncertainty measures provided by Gaussian process (GP) regression incur a computational cost that grows cubically with the dataset size, which is prohibitive for modern-day large-scale applications. This has motivated extensive work on computationally efficient approximations. We introduce a new perspective by exploring robustness properties and limiting behaviour of GP nearest-neighbour (GPnn) prediction. We demonstrate through theory and simulation that as the data size increases, the accuracy of estimated parameters and the validity of GP model assumptions become increasingly irrelevant to GPnn predictive accuracy. Consequently, it is sufficient to spend a small amount of work on parameter estimation in order to achieve high MSE accuracy, even in the presence of gross misspecification. In contrast, as the data size grows, uncertainty calibration and NLL are shown to remain sensitive to just one parameter, the additive noise variance; but we show that this source of inaccuracy can be corrected for, thereby achieving both well-calibrated uncertainty measures and accurate predictions at remarkably low computational cost. We exhibit a very simple GPnn regression algorithm with stand-out performance compared to other state-of-the-art GP approximations as measured on large UCI datasets. It operates at a small fraction of those other methods' training costs, for example taking about 30 seconds on a basic laptop to train on a large UCI dataset.
Authors: Edward Pyzer-Knapp
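A bare-bones GP nearest-neighbour prediction, where the posterior at each test point is computed from only its k nearest training points, can be sketched as follows; the RBF kernel and fixed hyperparameters are illustrative, and the paper's noise-variance calibration correction is omitted.

```python
# Sketch of GP nearest-neighbour (GPnn) prediction: a standard GP posterior computed using only
# the k nearest training points of each test input. Kernel and hyperparameters are illustrative.
import numpy as np

def rbf(a, b, lengthscale=1.0):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def gpnn_predict(x_train, y_train, x_star, k=32, noise_var=0.1):
    dists = ((x_train - x_star) ** 2).sum(-1)
    nn = np.argsort(dists)[:k]                        # k nearest neighbours of the test point
    Xn, yn = x_train[nn], y_train[nn]
    K = rbf(Xn, Xn) + noise_var * np.eye(len(nn))
    k_star = rbf(x_star[None, :], Xn)[0]
    mean = k_star @ np.linalg.solve(K, yn)
    var = 1.0 + noise_var - k_star @ np.linalg.solve(K, k_star)
    return mean, var

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(2000, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(2000)
print(gpnn_predict(X, y, np.array([1.0]), k=32))       # posterior mean and variance near sin(1.0)
```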
Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves richer concepts such as spatial relationships, affordances, physics, layout, and so on. In this work, we propose to inject the 3D world into large language models and introduce a whole new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks, including captioning, dense captioning, 3D question answering, task decomposition, 3D grounding, 3D-assisted dialog, navigation, and so on. Using three types of prompting mechanisms that we design, we are able to collect over 300k 3D-language data pairs covering these tasks. To efficiently train 3D-LLMs, we first utilize a 3D feature extractor that obtains 3D features from rendered multi-view images. Then, we use 2D VLMs as our backbones to train our 3D-LLMs. By introducing a 3D localization mechanism, 3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show that our model outperforms state-of-the-art baselines by a large margin (e.g., the BLEU-1 score surpasses the state-of-the-art score by 9%). Furthermore, experiments on our held-in datasets for 3D captioning, task composition, and 3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative examples also show that our model could perform more tasks beyond the scope of existing LLMs and VLMs.
Authors: Yining Hong; Haoyu Zhen; Peihao Chen; Shuhong Zheng; Yilun Du; Zhenfang Chen (IBM); Chuang Gan (IBM)
Quantifying the dependence between high-dimensional random variables is central to statistical learning and inference. Two classical methods are canonical correlation analysis (CCA), which identifies maximally correlated projected versions of the original variables, and Shannon's mutual information, which is a universal dependence measure that also captures high-order dependencies. However, CCA only accounts for linear dependence, which may be insufficient for certain applications, while mutual information is often infeasible to compute/estimate in high dimensions. This work proposes a middle ground in the form of a scalable information-theoretic generalization of CCA, termed max-sliced mutual information (mSMI). mSMI equals the maximal mutual information between low-dimensional projections of the high-dimensional variables, which reduces back to CCA in the Gaussian case. It enjoys the best of both worlds: capturing intricate dependencies in the data while being amenable to fast computation and scalable estimation from samples. We show that mSMI retains favorable structural properties of Shannon's mutual information, like variational forms and identification of independence. We then study statistical estimation of mSMI, propose an efficiently computable neural estimator, and couple it with formal non-asymptotic error bounds. We present experiments that demonstrate the utility of mSMI for several synthetic and real-world tasks, encompassing independence testing, multi-view representation learning, and algorithmic fairness. We observe that mSMI consistently outperforms competing methods with little-to-no computational overhead.
Authors: Dor Tsur; Ziv Goldfeld; Kristjan Greenewald (IBM)
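The CCA connection mentioned above can be checked numerically in the Gaussian case: with one-dimensional projections, the max-sliced mutual information equals -0.5 * log(1 - rho^2) at the top canonical correlation rho. The snippet below is a toy verification using scikit-learn's CCA, not the paper's neural estimator; the data dimensions are arbitrary.

```python
# Toy check of the Gaussian-case connection between max-sliced mutual information and CCA.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n = 5000
z = rng.standard_normal(n)                             # shared latent factor
X = np.c_[z + 0.5 * rng.standard_normal(n), rng.standard_normal(n), rng.standard_normal(n)]
Y = np.c_[rng.standard_normal(n), z + 0.5 * rng.standard_normal(n)]

cca = CCA(n_components=1).fit(X, Y)
u, v = cca.transform(X, Y)
rho = np.corrcoef(u[:, 0], v[:, 0])[0, 1]              # top canonical correlation
msmi_gaussian = -0.5 * np.log(1.0 - rho**2)            # mSMI estimate under Gaussian assumptions
print(rho, msmi_gaussian)
```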
Contrastive learning is quickly becoming an essential tool in neuroscience for extracting robust and meaningful representations of neural activity. Despite numerous applications to neuronal population data, there has been little exploration of how these methods can be adapted to key primary data analysis tasks such as spike sorting or cell-type classification. In this work, we propose a novel contrastive learning framework for high-density extracellular recordings. We demonstrate that through careful design of the network architecture and data augmentations, it is possible to generically extract representations for the aforementioned tasks that far outperform current specialized approaches. We validate our method with applications to both real and simulated high-density extracellular recordings.
Authors: Akash Srivastava
Combining gradient-based trajectory optimization with differentiable physics simulation is an efficient technique for solving soft-body manipulation problems. Using a well-crafted optimization objective, the solver can quickly converge onto a valid trajectory. However, writing the appropriate objective functions requires expert knowledge, making it difficult to collect a large set of naturalistic problems from non-expert users. We introduce DiffVL, a method that enables non-expert users to communicate soft-body manipulation tasks -- a combination of vision and natural language, given in multiple stages -- that can be readily leveraged by a differentiable physics solver. We have developed GUI tools that enable non-expert users to specify 100 tasks inspired by real-life soft-body manipulations from online videos, which we will make public. We leverage large language models to translate task descriptions into machine-interpretable optimization objectives. The optimization objectives can help differentiable physics solvers to solve these long-horizon multistage tasks that are challenging for previous baselines.
Authors: Chuang Gan
Multi-agent reinforcement learning (MARL) has primarily focused on solving a single task in isolation, while in practice the environment is often evolving, leaving many related tasks to be solved. In this paper, we investigate the benefits of meta-learning in solving multiple MARL tasks collectively. We establish the first line of theoretical results for meta-learning in a wide range of fundamental MARL settings, including learning Nash equilibria in two-player zero-sum Markov games and Markov potential games, as well as learning coarse correlated equilibria in general-sum Markov games. Under natural notions of task similarity, we show that meta-learning achieves provable sharper convergence to various game-theoretical solution concepts than learning each task separately. As an important building block, we develop multiple MARL algorithms with initialization-dependent convergence guarantees. Such algorithms integrate optimistic policy mirror descents with stage-based value updates, and their refined convergence guarantees (nearly) match the best existing results even when a good initialization is unknown. To our best knowledge, such results are also new and might be of independent interest. We further provide numerical simulations to corroborate our theoretical findings.
Authors: Weichao Mao; Haoran Qiu; Chen Wang (IBM); Hubertus Franke (IBM); Zbigniew T. Kalbarczyk; Ravishankar K. Iyer; Tamer Basar
Existing private synthetic data generation algorithms are agnostic to downstream tasks. However, end users may have specific requirements that the synthetic data must satisfy. Failure to meet these requirements could significantly reduce the utility of the data for downstream use. We introduce a post-processing technique that improves the utility of the synthetic data with respect to measures selected by the end user, while preserving strong privacy guarantees and dataset quality. Our technique involves resampling from the synthetic data to filter out samples that do not meet the selected utility measures, using an efficient stochastic first-order algorithm to find optimal resampling weights. Through comprehensive numerical experiments, we demonstrate that our approach consistently improves the utility of synthetic data across multiple benchmark datasets and state-of-the-art synthetic data generation algorithms.
Authors: Hao Wang (IBM); Shivchander Sudalairaj (IBM); John Henning (IBM); Kristjan Greenewald (IBM); Akash Srivastava (IBM)
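As a toy instance of utility-guided resampling, the sketch below reweights a synthetic column by exponential tilting so its mean matches a user-chosen target and then resamples with those weights; the plain gradient loop and single-moment constraint are simplifications of the stochastic first-order method described above.

```python
# Illustrative sketch: reweighting synthetic records so a user-selected statistic (here the mean
# of one column) matches a target value, then resampling with those weights.
import numpy as np

def tilt_weights(values, target_mean, lr=0.1, steps=500):
    theta = 0.0
    for _ in range(steps):
        w = np.exp(theta * values)
        w /= w.sum()
        grad = (w * values).sum() - target_mean   # mismatch between weighted mean and target
        theta -= lr * grad                        # simple gradient step on the tilt parameter
    return w

rng = np.random.default_rng(0)
synthetic_col = rng.normal(loc=0.0, scale=1.0, size=10_000)   # synthetic data misses the target mean
weights = tilt_weights(synthetic_col, target_mean=0.5)
resampled = rng.choice(synthetic_col, size=10_000, p=weights)
print(synthetic_col.mean(), resampled.mean())                 # resampled mean is pulled toward 0.5
```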
Transformer models have recently gained popularity in graph representation learning as they have the potential to learn complex relationships beyond the ones captured by regular graph neural networks. The main research question is how to inject the structural bias of graphs into the transformer architecture, and several proposals have been made for undirected molecular graphs and, recently, also for larger network graphs. In this paper, we study transformers over directed acyclic graphs (DAGs) and propose architecture adaptations tailored to DAGs: (1) An attention mechanism that is considerably more efficient than the regular quadratic complexity of transformers and at the same time faithfully captures the DAG structure, and (2) a positional encoding of the DAG's partial order, complementing the former. We rigorously evaluate our approach over various types of tasks, ranging from classifying source code graphs to nodes in citation networks, and show that it is effective in two important aspects: in making graph transformers generally outperform graph neural networks tailored to DAGs and in improving SOTA graph transformer performance in terms of both quality and efficiency.
Authors: Yuankai Luo; Veronika Thost (IBM); Lei Shi
Causal disentanglement aims to uncover a representation of data using latent variables that are interrelated through a causal model. Such a representation is identifiable if the latent model that explains the data is unique. In this paper, we focus on the scenario where unpaired observational and interventional data are available, with each intervention changing the mechanism of a latent variable. When the causal variables are fully observed, statistically consistent algorithms have been developed to identify the causal model under faithfulness assumptions. We here show that identifiability can still be achieved with unobserved causal variables, given a generalized notion of faithfulness. Our results guarantee that we can recover the latent causal model up to an equivalence class and predict the effect of unseen combinations of interventions, in the limit of infinite data. We implement our causal disentanglement framework by developing an autoencoding variational Bayes algorithm and apply it to the problem of predicting combinatorial perturbation effects in genomics.
Authors: Jiaqi Zhang; Kristjan Greenewald (IBM); Chandler Squires; Akash Srivastava (IBM); Karthikeyan Shanmugam (IBM); Caroline Uhler
With the advent of deep learning, progressively larger neural networks have been designed to solve complex tasks. We take advantage of these capacity-rich models to lower the cost of inference by exploiting computation in superposition. To reduce the computational burden per input, we propose Multiple-Input-Multiple-Output Neural Networks (MIMONets) capable of handling many inputs at once. MIMONets augment various deep neural network architectures with variable binding mechanisms to represent an arbitrary number of inputs in a compositional data structure via fixed-width distributed representations. Accordingly, MIMONets adapt nonlinear neural transformations to process the data structure holistically, leading to a speedup nearly proportional to the number of superposed input items in the data structure. After processing in superposition, an unbinding mechanism recovers each transformed input of interest. MIMONets also provide a dynamic trade-off between accuracy and throughput by an instantaneous on-demand switching between a set of accuracy-throughput operating points, yet within a single set of fixed parameters. We apply the concept of MIMONets to both CNN and Transformer architectures resulting in MIMOConv and MIMOFormer, respectively. Empirical evaluations show that MIMOConv achieves ≈ 2 – 4× speedup at an accuracy delta within [+0.68,−3.18]% compared to WideResNet CNNs on CIFAR10 and CIFAR100. Similarly, MIMOFormer can handle 2–4 inputs at once while maintaining a high average accuracy within a [−1.07,−3.43]% delta on the long range arena benchmark. Finally, we provide mathematical bounds on the interference between superposition channels in MIMOFormer. Our code is available at https://github.com/IBM/multiple-input-multiple-output-nets.
Authors: Nicolas Menet; Michael Hersche (IBM); Kumudu Geethan Karunaratne (IBM); Luca Benini; Abu Sebastian (IBM); Abbas Rahimi (IBM)
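The variable binding/unbinding mechanism at the heart of MIMONets can be demonstrated in isolation: bind inputs to random keys, superpose them into one fixed-width vector, and approximately recover each by unbinding. The element-wise ±1 keys below are an illustrative choice; in MIMONets the superposition is additionally processed by a CNN or Transformer.

```python
# Tiny demonstration of the bind/superpose/unbind mechanics underlying computation in superposition.
import numpy as np

rng = np.random.default_rng(0)
d = 4096
x1, x2 = rng.standard_normal(d), rng.standard_normal(d)
k1, k2 = rng.choice([-1.0, 1.0], d), rng.choice([-1.0, 1.0], d)

s = k1 * x1 + k2 * x2                 # bind each input to its key, then superpose into one vector

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

x1_hat = k1 * s                       # unbind with key 1: x1 plus pseudo-random crosstalk from x2
print(cosine(x1_hat, x1), cosine(x1_hat, x2))   # high similarity to x1, near zero to x2
```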
Recently, large-scale pre-trained Vision and Language (VL) models have set a new state-of-the-art (SOTA) in zero-shot visual classification, enabling open-vocabulary recognition of a potentially unlimited set of categories defined as simple language prompts. However, despite these great advances, the performance of these zero-shot classifiers still falls short of the results of dedicated (closed category set) classifiers trained with supervised fine-tuning. In this paper we show, for the first time, how to reduce this gap without any labels and without any paired VL data, using an unlabeled image collection and a set of texts auto-generated using a Large Language Model (LLM) describing the categories of interest and effectively substituting for labeled visual instances of those categories. Using our label-free approach, we are able to attain significant performance improvements over the zero-shot performance of the base VL model and other contemporary methods and baselines on a wide variety of datasets, demonstrating notable absolute improvements in the label-free setting. Moreover, despite our approach being label-free, we observe average gains over leading few-shot prompting baselines that do use 5-shot supervision.
Authors: Jehanzeb Mirza; Leonid Karlinsky (IBM); Wei Lin; Horst Possegger; Mateusz Kozinski; Rogerio Feris (IBM); Horst Bischof
Diffusion models have emerged as a promising approach to data-driven planning and have demonstrated impressive performance in robotic control, reinforcement learning, and video planning. Given an effective planner, an important question to consider is replanning: when should generated plans be regenerated due to action execution errors and external environment changes? Direct plan execution, without replanning, is problematic as errors from individual actions rapidly accumulate and environments are partially observable and stochastic. Simultaneously, replanning at each timestep incurs a substantial computational cost, and may prevent successful task execution, as different generated plans prevent consistent progress toward any particular goal. In this paper, we explore how we may effectively replan with diffusion models. We propose a principled approach to determine when to replan, based on the diffusion model's estimated likelihood of existing generated plans. We further present an approach to replan existing trajectories to ensure that new plans follow the same goal state as the original trajectory, which may efficiently bootstrap off previously generated plans. We illustrate how a combination of our proposed additions significantly improves the performance of diffusion planners, leading to 38% gains over past diffusion planning approaches on Maze2D, and further enables the handling of stochastic and long-horizon robotic control tasks.
Authors: Siyuan Zhou; Yilun Du; Shun Zhang (IBM); Mengdi Xu; Yikang Shen (IBM); Wei Xiao; Dit-yan Yeung; Chuang Gan (IBM)
Recent AI-assistant agents, such as ChatGPT, predominantly rely on supervised fine-tuning (SFT) with human annotations and reinforcement learning from human feedback (RLHF) to align the output of large language models (LLMs) with human intentions, ensuring they are helpful, ethical, and reliable. However, this dependence can significantly constrain the true potential of AI-assistant agents due to the high cost of obtaining human supervision and the related issues of quality, reliability, diversity, self-consistency, and undesirable biases. To address these challenges, we propose a novel approach called SELF-ALIGN, which combines principle-driven reasoning and the generative power of LLMs for the self-alignment of AI agents with minimal human supervision. Our approach encompasses four stages: first, we use an LLM to generate synthetic prompts, and a topic-guided method to augment the prompt diversity; second, we use a small set of human-written principles for AI models to follow, and guide the LLM through in-context learning from demonstrations (of principles application) to produce helpful, ethical, and reliable responses to user's queries; third, we fine-tune the original LLM with the high-quality self-aligned responses so that the resulting model can generate desirable responses for each query directly without the principle set and the demonstrations anymore; and finally, we offer a refinement step to address the issues of overly-brief or indirect responses. Applying SELF-ALIGN to the LLaMA-65b base language model, we develop an AI assistant named Dromedary. With fewer than 300 lines of human annotations (including < 200 seed prompts, 16 generic principles, and 5 exemplars for in-context learning), Dromedary significantly surpasses the performance of several state-of-the-art AI systems, including Text-Davinci-003 and Alpaca, on benchmark datasets with various settings.
Authors: Zhiqing Sun; Yikang Shen (IBM); Qinhong Zhou; Hongxin Zhang; Zhenfang Chen (IBM); David Cox (IBM); Yiming Yang; Chuang Gan (IBM)
With the widespread digitization of finance and the increasing popularity of cryptocurrencies, the sophistication of fraud schemes devised by cybercriminals is growing. Money laundering -- the movement of illicit funds to conceal their origins -- can cross bank and national boundaries, producing complex transaction patterns. The UN estimates that 2-5% of global GDP, or about $2.0 trillion, is laundered globally each year. Unfortunately, real data to train machine learning models to detect laundering is generally not available, and previous synthetic data generators have had significant shortcomings. A realistic, standardized, publicly-available benchmark is needed for comparing models and for the advancement of the area.
To this end, this paper contributes a synthetic financial transaction dataset generator and a set of synthetically generated AML (Anti-Money Laundering) datasets. We have worked to calibrate this agent-based generator to match real transactions as closely as possible and we made the datasets public.
We describe the generator in detail and we demonstrate how the data can be used to measure the performance of Graph Neural Networks in detecting money laundering. In a key way, these measurements are even better than using real data: the ground truth labels are complete, whilst many laundering transactions in real data are never detected.
Authors: Jovan Blanusa, Kubilay Atasu, Erik Altman
Diffusion Models (DMs) are state-of-the-art generative models that learn a reversible corruption process from iterative noise addition and denoising. They are the backbone of many generative AI applications, such as text-to-image conditional generation. However, recent studies have shown that basic unconditional DMs (e.g., DDPM and DDIM) are vulnerable to backdoor injection, a type of output manipulation attack triggered by a maliciously embedded pattern at model input. This paper presents a unified backdoor attack framework (VillanDiffusion) to expand the current scope of backdoor analysis for DMs. Our framework covers mainstream unconditional and conditional DMs (denoising-based and score-based) and various training-free samplers for holistic evaluations. Experiments show that our unified framework facilitates the backdoor analysis of different DM configurations and provides new insights into caption-based backdoor attacks on DMs.
Authors: Sheng-yen Cho; Pin-Yu Chen (IBM); Tsung-yi Ho
Pre-training on massive video datasets has become essential to achieve high action recognition performance on smaller downstream datasets. However, most large-scale video datasets contain images of people and hence are accompanied with issues related to privacy, ethics, and data protection, often preventing them from being publicly shared for reproducible research. Existing work has attempted to alleviate these problems by blurring faces, downsampling videos, or training on synthetic data. On the other hand, analysis on the {\em transferability} of privacy-preserving pre-trained models to downstream tasks has been limited. In this work, we study this problem by first asking the question: can we pre-train models for human action recognition with data that does not include real humans? To this end, we present, for the first time, a benchmark that leverages real-world videos with {\em humans removed} and synthetic data containing virtual humans to pre-train a model. We then evaluate the transferability of the representation learned on this data to a diverse set of downstream action recognition benchmarks. Furthermore, we propose a novel pre-training strategy, called Privacy-Preserving MAE-Align, to effectively combine synthetic data and human-removed real data. Our approach outperforms previous baselines by up to 5% and closes the performance gap between human and no-human action recognition representations on downstream tasks, for both linear probing and fine-tuning. Our benchmark, code, and models are available at https://github.com/howardzh01/PPMA
Authors: Howard Zhong; Samarth Mishra (IBM); Donghyun Kim (IBM); Souyoung Jin; Rameswar Panda (IBM); Hildegard Kuehne (IBM); Leonid Karlinsky (IBM); Venkatesh Saligrama; Aude Oliva; Rogerio Feris (IBM)
Self-supervised learning (SSL) has great potential for molecular representation learning given the complexity of molecular graphs, the large amounts of unlabelled data available, the considerable cost of obtaining labels experimentally, and the consequently small training datasets that are often available. The importance of the topic is reflected in the variety of paradigms and architectures that have been investigated recently. Yet the differences in performance seem often minor and are barely understood to date. In this paper, we study SSL based on persistent homology (PH), a mathematical tool for modeling topological features of data that persist across multiple scales. It has several unique features which particularly suit SSL, naturally offering: different views of the data, stability in terms of distance preservation, and the opportunity to flexibly incorporate domain knowledge. We propose (1) an autoencoder, which shows the general representational power of PH, and (2) a contrastive-learning-based loss, which can flexibly be applied on top of existing SSL approaches. We rigorously evaluate our approach for molecular property prediction and demonstrate its particular features: after SSL, the representations are better and offer considerably more predictive power than the baselines over different probing tasks; our loss increases baseline performance, sometimes largely; and we obtain consistent substantial improvements over very small datasets, a common scenario in practice.
Authors: Yuankai Luo; Lei Shi; Veronika Thost (IBM)
Traditional multi-armed bandit (MAB) frameworks, predominantly examined under stochastic or adversarial settings, often overlook the temporal dynamics inherent in many real-world applications such as recommendation systems and online advertising. This paper introduces a novel non-stationary MAB framework that captures the temporal structure of these real-world dynamics through an auto-regressive (AR) reward structure. We propose an algorithm that integrates two key mechanisms: (i) an alternation mechanism adept at leveraging temporal dependencies to dynamically balance exploration and exploitation, and (ii) a restarting mechanism designed to discard out-of-date information. Our algorithm achieves a regret upper bound that nearly matches the lower bound, with regret measured against a robust dynamic benchmark. Finally, via a real-world case study on tourism demand prediction, we demonstrate both the efficacy of our algorithm and the broader applicability of our techniques to more complex, rapidly evolving time series.
Authors: Djallel Bouneffouf
General physical scene understanding requires more than simply localizing and recognizing objects -- it requires knowledge that objects can have different latent properties (e.g., mass or elasticity), and that those properties affect the outcome of physical events. While there has been great progress in physical and video prediction models in recent years, benchmarks to test their performance typically do not require an understanding that objects have individual physical properties, or at best test only those properties that are directly observable (e.g., size or color). This work proposes a novel dataset and benchmark, termed Physion++, that rigorously evaluates visual physical prediction in artificial systems under circumstances where those predictions rely on accurate estimates of the latent physical properties of objects in the scene. Specifically, we test scenarios where accurate prediction relies on estimates of properties such as mass, friction, elasticity, and deformability, and where the values of those properties can only be inferred by observing how objects move and interact with other objects or fluids. We evaluate the performance of a number of state-of-the-art prediction models that span a variety of levels of learning vs. built-in knowledge, and compare that performance to a set of human predictions. We find that models that have been trained using standard regimes and datasets do not spontaneously learn to make inferences about latent properties, but also that models that encode objectness and physical states tend to make better predictions. However, there is still a huge gap between all models and human performance, and all models' predictions correlate poorly with those made by humans, suggesting that no state-of-the-art model is learning to make physical predictions in a human-like way.
Authors: Zhenfang Chen, Chuang Gan
Bilevel optimization has recently regained interest owing to its applications in emerging machine learning fields such as hyperparameter optimization, meta-learning, and reinforcement learning. Recent results have shown that simple alternating (implicit) gradient-based algorithms can achieve the same convergence rate as single-level gradient descent (GD) for bilevel problems with a strongly convex lower-level objective. However, it remains unclear whether this result can be generalized to bilevel problems beyond this basic setting. In this paper, we propose a Generalized ALternating mEthod for bilevel opTimization (GALET) with a nonconvex lower-level objective that satisfies the Polyak-Łojasiewicz (PL) condition. We first introduce a stationary metric for the considered bilevel problems, which generalizes the existing metric. We then establish that GALET achieves an ε-stationary point of the considered problem with an iteration complexity that matches that of GD for smooth nonconvex problems.
Authors: Songtao Lu
Credal networks extend Bayesian networks to allow for imprecision in probability values. Marginal MAP is a widely applicable mixed inference task that identifies the most likely assignment for a subset of variables (called MAP variables). However, the task is extremely difficult to solve in credal networks particularly because the evaluation of each complete MAP assignment involves exact likelihood computations (combinatorial sums) over the vertices of a complex joint credal set representing the space of all possible marginal distributions of the MAP variables. In this paper, we explore Credal Marginal MAP inference and develop new exact methods based on variable elimination and depth-first search as well as several approximation schemes based on the mini-bucket partitioning and stochastic local search. An extensive empirical evaluation demonstrates the effectiveness of our new methods on random as well as real-world benchmark problems.
Authors: Radu Marinescu (IBM); Debarun Bhattacharjya (IBM); Junkyu Lee (IBM); Fabio Cozman; Alexander Gray (IBM)
Humans outperform object recognizers despite the fact that models perform well on current datasets. Numerous attempts have been made to create more challenging datasets by scaling them up from the web, exploring distribution shift, or adding controls for biases. The difficulty of each image in each dataset is not independently evaluated, nor is the concept of dataset difficulty as a whole well-posed. We develop a new dataset difficulty metric based on how long humans must view an image in order to classify a target object. Images whose objects can be recognized in 17ms are considered to be easier than those which require seconds of viewing time. Using 133,588 judgments on two major datasets, ImageNet and ObjectNet, we determine the distribution of image difficulties in those datasets, which we find varies wildly, but significantly undersamples hard images. Rather than hoping that distribution shift or other approaches will lead to hard datasets, we should measure the difficulty of datasets and seek to explicitly fill out the class of difficult examples. Analyzing model performance guided by image difficulty reveals that models tend to have lower performance and a larger generalization gap on harder images. Encouragingly for the biological validity of current architectures, much of the variance in human difficulty can be accounted for given an object recognizer by computing a combination of prediction depth, c-score, and adversarial robustness. We release a dataset of such judgments as a complementary metric to raw performance and a network's ability to explain neural recordings. Such experiments with humans allow us to create a metric for progress in object recognition datasets, which we find are skewed toward easy examples, to test the biological validity of models in a novel way, and to develop tools for shaping datasets as they are being gathered to focus them on filling out the missing class of hard examples from today's datasets.
Authors: Dan Gutfreund
Transfer learning – i.e., further fine-tuning a pre-trained model on a downstream task – can confer significant advantages, including improved downstream performance, faster convergence, and better sample efficiency. These advantages have led to a proliferation of task-specific fine-tuned models, which typically can only perform a single task and do not benefit from one another. Recently, model merging techniques have emerged as a solution to combine multiple task-specific models into a single multitask model without performing additional training. However, existing merging methods often ignore the interference between parameters of different models, resulting in large performance drops when merging multiple models. In this paper, we demonstrate that prior merging techniques inadvertently lose valuable information due to two major sources of interference: (a) interference due to redundant parameter values and (b) disagreement on the sign of a given parameter’s values across models. To address this, we propose our method, TrIm, Elect Sign & Merge (TIES-Merging), which introduces three novel steps when merging models: (1) resetting parameters that only changed a small amount during fine-tuning, (2) resolving sign conflicts, and (3) merging only the parameters that are in alignment with the final agreed-upon sign. We find that TIES-Merging outperforms existing methods in diverse settings covering a range of modalities, domains, number of tasks, model sizes, architectures, and fine-tuning settings. We further analyze the impact of different types of interference on model parameters, highlight the importance of signs, and show that estimating the signs using the validation data could further improve performance.
Authors: Prateek Yadav; Derek Tam; Leshem Choshen (IBM); Colin Raffel; Mohit Bansal
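The three TIES-Merging steps can be sketched directly on flattened parameter vectors, as below; the trim fraction and the sign-election rule (sign of the summed trimmed values) are simplified, illustrative choices rather than the paper's exact implementation.

```python
# Minimal numpy sketch of the three TIES-Merging steps: (1) trim small task-vector entries,
# (2) elect a per-parameter sign, (3) average only the entries that agree with the elected sign.
import numpy as np

def ties_merge(pretrained, finetuned_models, keep_fraction=0.2):
    task_vectors = [m - pretrained for m in finetuned_models]
    trimmed = []
    for tv in task_vectors:
        k = int(np.ceil(keep_fraction * tv.size))
        cutoff = np.sort(np.abs(tv))[-k]                      # keep only the largest-magnitude changes
        trimmed.append(np.where(np.abs(tv) >= cutoff, tv, 0.0))
    trimmed = np.stack(trimmed)                               # (n_models, n_params)
    elected_sign = np.sign(trimmed.sum(axis=0))               # elect a sign per parameter
    agree = (np.sign(trimmed) == elected_sign) & (trimmed != 0)
    merged_tv = np.where(agree.any(axis=0),
                         (trimmed * agree).sum(axis=0) / np.maximum(agree.sum(axis=0), 1),
                         0.0)
    return pretrained + merged_tv

base = np.zeros(8)
models = [base + np.array([1.0, -0.2, 0.0, 0.5, 0, 0, 0, 0]),
          base + np.array([0.8, 0.3, 0.0, -0.4, 0, 0, 0, 0])]
print(ties_merge(base, models))   # conflicting entries are dropped, agreeing ones are averaged
```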
This workshop will discuss the latest multidisciplinary developments in Associative Memory and Hopfield Networks. A number of leading researchers in this research area from around the world have already agreed to attend and present their latest results. We anticipate sharing their presentations and outlining future research directions in this emerging field with the rest of the NeurIPS community.
Authors: Parikshit Ram (IBM); Hilde Kuehne; Daniel Lee; Cengiz Pehlevan; Mohammed Zaki; Lenka Zdeborova
Diffusion Models (DMs) have recently set state-of-the-art on many generation benchmarks. However, there are myriad ways to describe them mathematically, which makes it difficult to develop a simple understanding of how they work. In this submission, we provide a concise overview of DMs from the perspective of dynamical systems and Ordinary Differential Equations (ODEs) which exposes a mathematical connection to the highly related yet often overlooked class of energy-based models, called Associative Memories (AMs). Energy-based AMs are a theoretical framework that behave much like denoising DMs, but they enable us to directly compute a Lyapunov energy function on which we can perform gradient descent to denoise data.
Authors:
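The associative-memory view can be made concrete with a small numerical example: define an energy over stored patterns (here a modern-Hopfield-style log-sum-exp energy, an illustrative choice) and denoise a corrupted input by gradient descent on that energy, in analogy to how a denoising diffusion step follows a score/energy gradient.

```python
# Small numerical illustration of denoising as gradient descent on an associative-memory energy.
import numpy as np

rng = np.random.default_rng(0)
patterns = rng.choice([-1.0, 1.0], size=(5, 64))          # stored memories

def energy_grad(x, beta=4.0):
    logits = beta * patterns @ x
    attn = np.exp(logits - logits.max())
    attn /= attn.sum()
    # Gradient of E(x) = -(1/beta) * logsumexp(beta * patterns @ x) + 0.5 * ||x||^2
    return x - attn @ patterns

x = patterns[0] + 0.8 * rng.standard_normal(64)           # corrupted version of pattern 0
for _ in range(50):
    x = x - 0.1 * energy_grad(x)                          # gradient descent on the energy
print(np.corrcoef(x, patterns[0])[0, 1])                  # close to 1: the memory is recovered
```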
The application of machine learning models in chemistry has made remarkable strides in recent years. Even though there is considerable interest in automating common procedures in analytical chemistry using machine learning, very few models have been adopted into everyday use. Among the analytical instruments available to chemists, Nuclear Magnetic Resonance (NMR) spectroscopy is one of the most important, offering insights into molecular structure unobtainable with other methods. However, most processing and analysis of NMR spectra is still performed manually, making the task tedious and time-consuming, especially for large quantities of spectra. We present a transformer-based machine learning model capable of predicting the molecular structure directly from the NMR spectrum. Our model is pretrained on synthetic NMR spectra, achieving a top-1 accuracy of 67.0% when predicting the structure from both the 1H and 13C spectra. Additionally, we train a model which, given a spectrum and a set of likely compounds, selects the structure corresponding to the spectrum. This model achieves a top-1 accuracy of 98.28% when trained on both 1H and 13C spectra in selecting the correct structure.
Authors:
The significance of Nuclear Magnetic Resonance (NMR) spectroscopy in organic synthesis cannot be overstated, as it plays a pivotal role in deducing chemical structures from experimental data. While machine learning has predominantly been employed for predictive purposes in the analysis of spectral data, our study introduces a novel application of a transformer-based model's attention weights to unravel the underlying "language" that correlates spectral peaks with their corresponding atoms in the chemical structures. This attention mapping technique proves beneficial for comprehending spectra, enabling the reliable differentiation between product H-NMR spectra and reactant spectra extracted from experimental data with an accuracy exceeding 95%. Furthermore, it consistently associates peaks with the correct atoms in the molecule, achieving a remarkable peak-to-atom match rate of 71% for exact matches and 89% for close shift matches (±0.59 ppm). This framework exemplifies the capability of harnessing the attention mechanism within transformer models to unveil the intricacies of spectroscopic data. Importantly, this approach can readily be extended to other types of spectra, showcasing its versatility and potential for broader applications in the field.
Authors:
It has been observed that the global minima of neural networks are connected by curves on which train and test loss are almost constant. This phenomenon, often referred to as mode connectivity, has inspired various applications such as model ensembling and fine-tuning. However, despite empirical evidence, a theoretical explanation is still lacking. We explore the connectedness of minima through a new approach, parameter space symmetry. By relating the topology of symmetry groups to the topology of minima, we derive the number of connected components of the minima of full-rank linear networks. In particular, we show that skip connections reduce the number of connected components. We then prove mode connectivity up to permutation for linear networks. We also provide explicit expressions for connecting curves in the minimum induced by symmetry.
Authors:
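A minimal example of the kind of symmetry argument referenced above, stated only for illustration: for a two-layer linear network $f(x) = W_2 W_1 x$ with hidden width $k$, every invertible $G \in \mathrm{GL}(k)$ maps a parameter setting to one with identical outputs,
\[
(W_2 G^{-1})(G W_1) = W_2 W_1 ,
\]
so each minimum contains the orbit $\{(W_2 G^{-1}, G W_1) : G \in \mathrm{GL}(k)\}$. Since $\mathrm{GL}(k, \mathbb{R})$ has exactly two connected components (positive and negative determinant), the topology of the symmetry group directly constrains how many connected pieces such an orbit, and hence the set of minima, can have.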
Discovering new materials is essential to solve challenges in climate change, sustainability and healthcare. A typical task in materials discovery is to search for a material in a database which maximises the value of a function. That function is typically expensive to evaluate, and often relies upon a simulation or an experiment. Here, we introduce SyMDis, a sample efficient optimisation method based on symbolic learning, that discovers near-optimal materials in a large database. SyMDis performs comparably to a state-of-the-art optimiser, whilst learning interpretable rules to aid physical and chemical verification. Furthermore, the rules learned by SyMDis generalise to unseen datasets and return high performing candidates in a zero-shot evaluation, which is difficult to achieve with other approaches.
Authors:
Pre-trained deep learning representations have been successful in a wide range of predictive and generative tasks across different domains and input modalities. However, evaluating the emerging "zoo" of pre-trained models for various downstream tasks remains challenging. Our goal is to characterize the internal representations of pre-trained models to better inform data efficiency and sampling, robustness, and interoperability. We propose an unsupervised method to characterize embeddings of pre-trained models through the lens of non-parametric, group property-driven subset scanning (SS). While our method is domain-agnostic, we assess its detection capabilities with extensive experiments on diverse molecular benchmarks (ZINC-250K, MOSES, MoleculeNet), across multiple predictive chemical language models (MoLFormer, ChemBERTa), and molecular graph generative models (GraphAF, GCPN). We further evaluate how representations evolve as a result of domain adaptation by fine-tuning or low-dimensional projection. Our results show a significant presence of disentanglement in the learned space in terms of molecular structure and properties. Experiments reveal notable information condensation in the pre-trained embeddings upon task-specific fine-tuning as well as projection techniques. For example, among the most common elements in the embedding, only property-driven elements are shared between the two tasks, while the rest are unique to each task. This work provides a post-hoc quality evaluation method for representation learning models and domain adaptation methods that is task- and modality-agnostic.
Authors: Girmaw Abebe Tadesse, Celia Cintas, Jannis Born, Skyler Speakman, Payel Das, Jerret Ross, Brian Belgodere, Enara Vijil
Property prediction plays an important role in materials discovery. As an initial step toward eventually developing a foundation model for materials science, we introduce a new autoencoder called MHG-GNN, which combines a graph neural network (GNN) with Molecular Hypergraph Grammar (MHG). Results on a variety of property prediction tasks with diverse materials show that MHG-GNN is promising.
Authors:
Large language models (LLMs) are increasingly used to support humans in tasks involving writing natural language and programming. How do we evaluate the benefits of LLM assistance for the human and learn from human interaction? We argue that benchmarks that evaluate the abilities of the model in isolation are not sufficient to reveal its impact on humans. Ideally, we would conduct user studies where humans complete tasks with the LLM and measure outcomes of interest. However, this can be prohibitively expensive in terms of human resources, especially as we want to continuously iterate on model design. We propose building a simulation environment that mimics how humans interact with the LLM, focusing in this work on assistants that provide inline suggestions for coding tasks. The environment simulates the set of multi-turn interactions that occur in programming with LLMs and uses a secondary LLM to simulate the human. We design the environment based on work that studies programmer behavior when coding with LLMs to make sure it is realistic. The environment allows us to evaluate LLMs of different scales in terms of simulation metrics of success. The simulation also allows us to collect data that can potentially be used to improve the LLM's ability to assist humans, which we showcase with a simple intervention.
Authors:
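A minimal sketch of the kind of simulated interaction loop the abstract above describes. The function names (assistant_suggest, simulated_user_react), the accept/reject/edit protocol, and the acceptance-rate metric are illustrative assumptions; the stubs stand in for the assistant LLM and the secondary "simulated programmer" LLM.

```python
# Minimal sketch of a simulated multi-turn "inline suggestion" loop.
# assistant_suggest and simulated_user_react are hypothetical stand-ins
# for calls to the two LLMs; the protocol and metric are assumptions.
import random

def assistant_suggest(prefix: str) -> str:
    """Stand-in for the assistant LLM producing an inline code suggestion."""
    return "    return a + b\n"

def simulated_user_react(prefix: str, suggestion: str) -> str:
    """Stand-in for the secondary LLM playing the programmer: it decides
    whether to accept, reject, or edit the suggestion."""
    return random.choice(["accept", "reject", "edit"])

def run_episode(task_prompt: str, max_turns: int = 5) -> dict:
    code, accepted, shown = task_prompt, 0, 0
    for _ in range(max_turns):
        suggestion = assistant_suggest(code)
        shown += 1
        action = simulated_user_react(code, suggestion)
        if action == "accept":
            code += suggestion
            accepted += 1
        elif action == "edit":
            code += suggestion.replace("a + b", "b + a")  # simulated manual edit
        # on "reject" the simulated user keeps typing; move to the next turn
    return {"acceptance_rate": accepted / shown, "final_code": code}

print(run_episode("def add(a, b):\n"))
```

Logged traces of such episodes are what would be aggregated into simulation metrics and, potentially, reused as training data for the assistant.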
Energy minimization problems are highly non-convex problems at the heart of physical sciences. These problems often suffer from slow convergence due to sharply falling potentials, leading to small gradients. To make them tractable, we often resort to coarse-graining (CG), a type of lossy compression. We introduce a new way to perform CG using reparametrization, which does not require the costly steps of force-matching and back-mapping required in traditional CG. We focus on improving the slow dynamics by using CG to project onto slow modes. We also propose a way to find robust slow modes for many physical potentials. Our method also does not require data, which is expensive to obtain for molecular systems and a bottleneck for applying machine learning methods to such systems. We test our method on molecular dynamics for the folding of small proteins. We observe that our method either reaches deeper (more optimal) energies or runs in shorter time than the baseline non-CG simulations.
Authors:
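One way to write the reparametrization idea in the abstract above (the notation is ours, introduced only for illustration): configurations are constrained to an affine subspace spanned by slow modes,
\[
x = x_0 + V z, \qquad V \in \mathbb{R}^{n \times m}, \; m \ll n,
\]
and the energy is minimized over the coarse-grained coordinates,
\[
\min_z E(x_0 + V z), \qquad \nabla_z E = V^{\top} \nabla_x E(x_0 + V z),
\]
so gradient steps move the system only along the retained slow modes, with no separate force-matching or back-mapping step.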
In this paper, we propose an accelerated stochastic step search algorithm which combines an accelerated method with a fully adaptive step size parameter for convex problems in (Scheinberg et al., 2014) with the stochastic step search analysis in (Paquette and Scheinberg, 2020). Under appropriate conditions on the accuracy of the gradient and function value estimates, our algorithm achieves an expected iteration complexity of $\mathcal{O}(1/\sqrt{\epsilon})$ to reach an $\epsilon$-accurate solution $\bar{x}$, i.e., one satisfying $\mathbb{E}[f(\bar{x})] - f^{*} \le \epsilon$. This complexity matches the iteration complexity of the deterministic Nesterov accelerated and FISTA algorithms (Nesterov, 1983; Beck and Teboulle, 2009). This paper continues the line of work on stochastic adaptive algorithms studied in (Berahas et al., 2021; Blanchet et al., 2019; Paquette and Scheinberg, 2020) and is the first to develop an accelerated gradient descent type algorithm in this domain.
Authors: Trang H. Tran; Lam Nguyen (IBM); Katya Scheinberg
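A schematic of the accelerated step-search template the abstract above refers to; the exact momentum and acceptance rules here are assumptions for illustration. Given gradient and function-value estimates $g_k$ and $\tilde f$,
\[
y_k = x_k + \beta_k (x_k - x_{k-1}), \qquad x_k^{+} = y_k - \alpha_k\, g_k(y_k),
\]
the trial point is accepted ($x_{k+1} = x_k^{+}$) only if a stochastic sufficient-decrease test such as
\[
\tilde f(x_k^{+}) \le \tilde f(y_k) - \theta\, \alpha_k \,\| g_k(y_k) \|^{2}
\]
holds; otherwise the step size $\alpha_k$ is shrunk and the step is retried. The analysis requires the estimates to be sufficiently accurate with sufficiently high probability.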
Inverse protein folding, the process of designing sequences that fold into a specific 3D structure, is crucial in bio-engineering and drug discovery. Traditional methods rely on experimentally resolved structures, but these cover only a small fraction of protein sequences. Forward folding models like AlphaFold offer a potential solution by accurately predicting structures from sequences. However, these models are too slow for integration into the optimization loop of inverse folding models during training. To address this, we propose using knowledge distillation on folding model confidence metrics, such as pTM or pLDDT scores, to create a faster and end-to-end differentiable distilled model. This model can then be used as a structure consistency regularizer in training the inverse folding model. Our technique is versatile and can be applied to other design tasks, such as sequence-based protein infilling. Experimental results show that our method outperforms non-regularized baselines, yielding up to 3% improvement in sequence recovery and up to 45% improvement in protein diversity while maintaining structural consistency in generated sequences.
Authors: Igor Melnyk (IBM); Aurelie Lozano (IBM); Payel Das (IBM); Vijil Vijil (IBM)
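One plausible form of the regularized training objective described above (the exact loss is an assumption):
\[
\mathcal{L}(\theta) = \mathcal{L}_{\text{seq}}\bigl(\hat{s}_{\theta}, s\bigr) + \lambda \bigl(1 - \widehat{\mathrm{pTM}}(\hat{s}_{\theta})\bigr),
\]
where $\mathcal{L}_{\text{seq}}$ is the standard sequence-recovery loss, $\widehat{\mathrm{pTM}}$ is the distilled, end-to-end differentiable surrogate of the folding model's confidence, and $\lambda$ trades off recovery accuracy against structural consistency.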
We propose a multi-modal foundation model for small molecules, a shift from traditional AI models that are tailored for individual tasks and modalities. This model uses a late fusion strategy to align and fuse three distinct modalities: SELFIES, DFT properties, and optical spectra. The model is pre-trained with over 6 billion samples to provide two primary functions: generating fused feature representations across the three modalities, and performing cross-modal predictions and generations. In preliminary experiments, we demonstrate that the fused representation improves the performance of property predictions for chromophore molecules, and we showcase 6 distinct cross-modal inferences.
Authors:
Training AI models that generalize across tasks and domains has long been among the open problems driving AI research. The emergence of Foundation Models made it easier to obtain expert models for a given task, but the heterogeneity of data that may be encountered at test time often means that any single expert is insufficient. We consider the Fusion of Experts (FoE) problem of fusing outputs of expert models with complementary knowledge of the data distribution and formulate it as an instance of supervised learning. Our method is applicable to both discriminative and generative tasks and leads to significant performance improvements in image and text classification, text summarization, multiple-choice QA, and automatic evaluation of generated text. We further extend our method to the "frugal" setting where it is desired to reduce the number of expert model evaluations at test time.
Authors: Hongyi Wang; Felipe Maia Polo; Yuekai Sun; Souvik Kundu; Eric Xing; Mikhail Yurochkin (IBM)
Traditional machine learning models focus on achieving good performance on the overall training distribution, but they often underperform on minority groups. Existing methods can improve the worst-group performance, but they can have several limitations: (i) they require group annotations, which are often expensive and sometimes infeasible to obtain, and/or (ii) they are sensitive to outliers. Most related works fail to solve these two issues simultaneously as they focus on conflicting perspectives of minority groups and outliers. We address the problem of learning group annotations in the presence of outliers by clustering the data in the space of gradients of the model parameters. We show that data in the gradient space has a simpler structure while preserving information about minority groups and outliers, making it suitable for standard clustering methods like DBSCAN. Extensive experiments demonstrate that our method significantly outperforms the state of the art in terms of downstream worst-group performance.
Authors: Yuchen Zeng; Kristjan Greenewald (IBM); Luann Jung; Kangwook Lee; Justin Solomon; Mikhail Yurochkin (IBM)
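A hedged sketch of the gradient-space clustering idea described above, using a logistic-regression model so that per-example gradients have a closed form. The synthetic data, hyperparameters (eps, min_samples), and choice of parameters to differentiate with respect to are illustrative assumptions.

```python
# Cluster per-example gradients to recover group structure and flag outliers.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (480, 2)),      # majority group
               rng.normal(3, 1, (100, 2)),      # minority group
               rng.normal(8, 1, (20, 2))])      # outliers
y = (X.sum(axis=1) > 3).astype(int)

clf = LogisticRegression().fit(X, y)
p = clf.predict_proba(X)[:, 1]
grads = (p - y)[:, None] * X                    # per-example gradient of the logistic loss w.r.t. the weights

labels = DBSCAN(eps=1.0, min_samples=10).fit_predict(grads)
print("clusters found:", set(labels))           # DBSCAN's label -1 marks points treated as outliers
```

The recovered cluster labels can then serve as surrogate group annotations for any downstream worst-group training method.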
There is a rapidly growing number of open-source Large Language Models (LLMs) and benchmark datasets to compare them. While some models dominate these benchmarks, there is typically no single model that achieves the best accuracy in all tasks and use cases. In this work, we address the challenge of selecting the best LLM for new tasks out of a collection of models. We propose a new formulation for the problem, in which benchmark datasets are repurposed to learn a "router" model for this LLM selection, and show that this problem can be reduced to a collection of binary classification tasks. We demonstrate the utility and limitations of learning model routers from various benchmark datasets, where we consistently improve performance upon using any single model for all tasks.
Authors: Tal Shnitzer; Anthony Ou; Mirian Silva (IBM); Kate Soule (IBM); Yuekai Sun; Justin Solomon; Neil Thompson; Mikhail Yurochkin (IBM)
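A hedged sketch of the router-as-binary-classification formulation above: for each candidate LLM, fit a classifier that predicts "this model answers correctly" from a prompt representation, then route each new input to the model with the highest predicted probability. The features, labels, and classifier choice are synthetic stand-ins, not the paper's setup.

```python
# One binary "correctness" classifier per candidate model; route to the argmax.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n, d, n_models = 1000, 16, 3
prompt_emb = rng.normal(size=(n, d))                       # stand-in prompt embeddings
correct = (rng.normal(size=(n, n_models)) + prompt_emb[:, :n_models] > 0).astype(int)

routers = [LogisticRegression().fit(prompt_emb, correct[:, m]) for m in range(n_models)]

def route(x):
    # Predicted probability that each model answers this prompt correctly.
    scores = [r.predict_proba(x.reshape(1, -1))[0, 1] for r in routers]
    return int(np.argmax(scores))

print("routed to model", route(rng.normal(size=d)))
```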
Growing applications of large language models (LLMs) trained by a third party raise serious concerns on the security vulnerability of LLMs. It has been demonstrated that malicious actors can covertly exploit these vulnerabilities in LLMs through poisoning attacks aimed at generating undesirable outputs. While poisoning attacks have received significant attention in the image domain (e.g., object detection), and classification tasks, their implications for generative models, particularly in the realm of natural language generation (NLG) tasks, remain poorly understood. To bridge this gap, we perform a comprehensive exploration of various poisoning techniques to assess their effectiveness across a range of generative tasks. Furthermore, we introduce a range of metrics designed to quantify the success and stealthiness of poisoning attacks specifically tailored to NLG tasks. Through extensive experiments on multiple NLG tasks, LLMs and datasets, we show that it is possible to successfully poison an LLM during the fine-tuning stage using as little as 1% of the total tuning data samples. Our paper presents the first systematic approach to comprehend poisoning attacks targeting NLG tasks considering a wide range of triggers and attack settings. We hope our findings will assist the AI security community in devising appropriate defenses against such threats.
Authors: Shuli Jiang (IBM); Swanand Ravindra Kadhe (IBM); Yi Zhou (IBM); Ling Cai (IBM); Nathalie Baracaldo Angel (IBM)
Recent progress in large transformer-based foundation models has demonstrated impressive capabilities in mastering complex chemical language representations. These models show promise in learning task-agnostic chemical language representations through a two-step process: pre-training on extensive unlabeled corpora and fine-tuning on specific downstream tasks. By utilizing self-supervised learning capabilities, foundation models have significantly reduced the reliance on labeled data and task-specific features, streamlining data acquisition and pushing the boundaries of chemical language representation. However, their practical implementation in further downstream tasks is still in its early stages and largely limited to sequencing problems. The proposed multimodal approach using MoLFormer, a chemical large language model, aims to demonstrate the capabilities of transformer-based models in non-sequencing applications such as capturing the design space of liquid formulations. Multimodal MoLFormer utilizes the extensive chemical information learned in pre-training from unlabeled corpora to predict the performance of battery electrolytes and showcases superior performance compared to state-of-the-art algorithms. The potential of foundation models in designing mixed material systems such as liquid formulations presents a groundbreaking opportunity to accelerate the discovery and optimization of new materials and formulations across various industries.
Authors: Eduardo Almeida Soares (IBM); Vidushi Sharma (IBM); Emilio Ashton Vital Brazil (IBM); Renato Fontoura de Gusmao Cerqueira (IBM); Young-Hye Na (IBM)
Artificial intelligence holds promise to improve materials discovery. GFlowNets are an emerging deep learning algorithm with many applications in AI-assisted discovery. By using GFlowNets, we generate porous reticular materials, such as metal organic frameworks and covalent organic frameworks, for applications in carbon dioxide capture. We introduce a new Python package (matgfn) to train and sample GFlowNets. We use matgfn to generate the matgfn-rm dataset of novel and diverse reticular materials with gravimetric surface area above 5000 m²/g. We calculate single- and two-component gas adsorption isotherms for the top 100 candidates in matgfn-rm. These candidates are novel compared to the state-of-the-art ARC-MOF dataset and rank in the 90th percentile in terms of working capacity compared to the CoRE2019 dataset. We discover 15 materials outperforming all materials in CoRE2019.
Authors: Flaviu Cipcigan (IBM); Jonathan Booth; Rodrigo Neumann Barros Ferreira (IBM); Carine Dos Santos (IBM); Mathias Steiner (IBM)
We present a novel multimodal language model approach for predicting molecular properties by combining chemical language representation with physicochemical features. Our approach, MultiModal-MoLFormer, utilizes a causal multi-stage feature selection method that identifies physicochemical features based on their direct causal effect on a specific target property. These causal features are then integrated with the vector space generated by molecular embeddings from MoLFormer. In particular, we employ Mordred descriptors as physicochemical features and identify the Markov blanket of the target property, which theoretically contains the most relevant features for accurate prediction. Our results demonstrate a superior performance of our proposed approach compared to existing state-of-the-art algorithms, including the chemical language-based MoLFormer and graph neural networks, in predicting complex tasks such as biodegradability and PFAS toxicity estimation. Moreover, we demonstrate the effectiveness of our feature selection method in reducing the dimensionality of the Mordred feature space while maintaining or improving the model's performance. Our approach opens up promising avenues for future research in molecular property prediction by harnessing the synergistic potential of both chemical language and physicochemical features, leading to enhanced performance and advancements in the field.
Authors: Eduardo Almeida Soares (IBM); Emilio Ashton Vital Brazil (IBM); Karen Fiorella Aquino Gutierrez (IBM); Renato Fontoura de Gusmao Cerqueira (IBM); Dan Sanders (IBM); Kristin Schmidt (IBM); Dmitry Zubarev (IBM)
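A hedged sketch of the fusion idea in the MultiModal-MoLFormer abstract above: concatenate a chemical-language-model embedding with a selected subset of physicochemical descriptors and fit a standard regressor. The random embeddings and the univariate selection step below are placeholders, not MoLFormer, Mordred, or the paper's Markov-blanket procedure.

```python
# Fuse (stand-in) language-model embeddings with selected descriptor features.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(2)
n = 300
lm_embedding = rng.normal(size=(n, 64))          # stand-in for chemical LM embeddings
descriptors = rng.normal(size=(n, 200))          # stand-in for physicochemical descriptors
y = descriptors[:, :3].sum(axis=1) + 0.1 * rng.normal(size=n)

selected = SelectKBest(f_regression, k=10).fit_transform(descriptors, y)  # placeholder for causal selection
X = np.hstack([lm_embedding, selected])          # fusion by concatenation

model = GradientBoostingRegressor().fit(X, y)
print("train R^2:", round(model.score(X, y), 3))
```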
Deep Reinforcement Learning (DRL) has shown breakthroughs in solving challenging problems, such as pixel-based games and continuous control tasks. In complex environments, infusing prior domain knowledge is essential to achieve sample efficiency and generalization. Neuro-symbolic AI seeks systematic domain knowledge infusion into neural network-based learning, and existing neuro-symbolic approaches for sequential decision-making leverage hierarchical reinforcement learning (HRL) by infusing symbolically specified prior knowledge on desired trajectories. However, this requires finding symbolic solutions in RL environments before learning, and it is difficult to handle the divergence between unknown RL dynamics and prior knowledge. Such shortcomings result in loose and manual neuro-symbolic integration and degrade the generalization capability. In this paper, we integrate the options framework in HRL with an AI planning model to resolve the shortcomings in earlier approaches and generalize beyond RL environments where pre-specified partial solutions are valid. Our approach defines options from AI planning operators by establishing the connection between the two transition systems in the options framework and the AI planning task. Then, we show an option policy learning method that integrates an AI planner and model-free DRL algorithms with intrinsic rewards, encouraging consistency between the two transition systems. We design a suite of MiniGrid environments that cover the increasing levels of difficulties in exploration, where our empirical evaluation clearly shows the advantage of HRL with AI planning models.
Authors: Junkyu Lee (IBM); Michael Katz (IBM); Don Joven Ravoy Agravante (IBM); Miao Liu (IBM); Geraud Nangue Tasse; Tim Klinger (IBM); Shirin Sohrabi (IBM)
Classical learning theory focuses on supervised learning of functions via empirical risk minimization, where labeled examples for a particular task are drawn from the data distribution experienced by the model during training. Recently, in-context learning emerged as a paradigm shift in large pre-trained models: when conditioned with a few labeled examples of tasks potentially unseen during training, the model infers the task at hand and makes predictions on new points. Learning to learn in-context, on the other hand, aims at training models in a meta-learning setup that generalize to new unseen tasks from only a few shots of labeled examples. We present in this paper a statistical learning framework for the problem of in-context meta-learning and define a function class that enables it. The meta-learner is abstracted as a function defined on the cross product of the probability space (representing "context") and the data space. The data distribution is sampled from a "meta distribution" on tasks. Thanks to the regularity we assume on the function class in the Wasserstein geometry, we leverage tools from optimal transport to study the generalization of the meta-learner to unseen tasks. Finally, we show that encoder transformers exhibit this type of regularity and leverage our theory to analyze their generalization properties.
Authors: Youssef Mroueh
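One way to state the kind of regularity the abstract above invokes (an illustration, not necessarily the paper's exact assumption): the meta-learner is a map $F : \mathcal{P}(\mathcal{X}\times\mathcal{Y}) \times \mathcal{X} \to \mathcal{Y}$ taking a context (an empirical distribution over labeled examples) and a query point, and is Lipschitz in the 1-Wasserstein distance over contexts,
\[
\| F(\mu, x) - F(\nu, x) \| \;\le\; L \, W_1(\mu, \nu) \qquad \text{for all } x,
\]
which is what lets generalization to an unseen task be controlled by how far that task's distribution is from those sampled from the meta distribution during training.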
Eye gaze has proven to be a cost-efficient way to collect large-scale physiological data that can reveal the underlying human attentional patterns in real-life workflows, and thus has long been explored as a signal to directly measure human-related cognition in various domains. Physiological data (including but not limited to eye gaze) offer new perception capabilities, which could be used in several ML domains, e.g., egocentric perception, embodied AI, NLP, etc. They can help infer human perception, intentions, beliefs, goals, and other cognition properties that are much needed for human-AI interactions and agent coordination. In addition, large collections of eye-tracking data have enabled data-driven modeling of human visual attention mechanisms, for both saliency and scanpath prediction, with twofold advantages: from the neuroscientific perspective, to understand biological mechanisms better, and from the AI perspective, to equip agents with the ability to mimic or predict human behavior and improve interpretability and interactions.
With the emergence of immersive technologies, now more than ever there is a need for experts of various backgrounds (e.g., the machine learning, vision, and neuroscience communities) to share expertise and contribute to a deeper understanding of the intricacies of cost-efficient human supervision signals (e.g., eye gaze) and their utilization toward bridging human cognition and AI in machine learning research and development. The goal of this workshop is to bring together an active research community to collectively drive progress in defining and addressing core problems in gaze-assisted machine learning.
Authors: Amarachi Blessing Mbakwe; Joy Wu (IBM); Dario Zanca; Elizabeth Krupinski; Satyananda Kashyap (IBM); Alexandros Karargyris
This is the first work to look at the application of large language models (LLMs) for the purpose of model space edits in automated planning tasks. To set the stage for this sangam, we start by enumerating the different flavors of model space problems that have been studied so far in the AI planning literature and explore the effect of an LLM on those tasks with detailed illustrative examples. We also empirically demonstrate how the performance of an LLM contrasts with combinatorial search (CS) -- an approach that has been traditionally used to solve model space tasks in planning -- with the increasing complexity of model edits and the increasing complexity of plans, both with the LLM in the role of a standalone model space reasoner as well as in the role of a statistical modeling tool in concert with the CS approach as part of a two-stage process. Our experiments show promising results suggesting further forays of LLMs into the exciting world of model space reasoning for planning tasks in the future.
Authors:
The success of language models, especially transformer-based architectures, has trickled into other domains giving rise to "scientific language models" that operate on small molecules, proteins or polymers. In chemistry, language models contribute to accelerating the molecule discovery cycle as evidenced by promising recent findings in early-stage drug discovery. Here, we review the role of language models in molecular discovery, underlining their strength in de novo drug design, property prediction and reaction chemistry. We highlight valuable open-source software assets thus lowering the entry barrier to the field of scientific language modeling. Last, we sketch a vision for future molecular design that combines a chatbot interface with access to computational chemistry tools. Our contribution serves as a valuable resource for researchers, chemists, and AI enthusiasts interested in understanding how language models can and will be used to accelerate chemical discovery.
Authors:
MEGA is a recent transformer-based architecture which utilizes a linear recurrent operator whose parallel computation, based on the FFT, scales as O(L log L), with L being the sequence length. We build upon their approach by replacing the linear recurrence with a special temporal convolutional network which permits a larger receptive field size with shallower networks and reduces the computational complexity to O(L). The resulting model is called TCNCA, a Temporal Convolutional Network with Chunked Attention. We evaluate TCNCA on EnWik8 language modeling, long-range-arena (LRA) sequence classification, as well as a synthetic reasoning benchmark, associative recall. On EnWik8, TCNCA outperforms MEGA, reaching a lower loss with 1.37×/1.24× faster forward/backward passes during training. The dilated convolutions used in TCNCA are consistently and significantly faster than the FFT-based parallelized recurrence on GPUs, making them a scalable candidate for handling very long sequence lengths: they are up to 7.07×/2.86× faster in the forward/backward pass for sequences up to 131k. Further, on LRA, TCNCA achieves, on average, a 1.28× speed-up during inference with similar accuracy to MEGA. On associative recall, we find that even a simplified version of TCNCA, without excessive multiplicative and additive interactions, remains superior or competitive to MEGA on a range of sequence lengths and vocabulary sizes.
Authors: Aleksandar Terzic (IBM); Michael Hersche (IBM); Kumudu Geethan Karunaratne (IBM); Luca Benini; Abu Sebastian (IBM); Abbas Rahimi (IBM)
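A hedged sketch of the dilated temporal convolution stack that plays the role of the linear recurrence in the abstract above: with kernel size k and dilations 1, 2, 4, ..., the receptive field grows as 1 + (k - 1)(2^depth - 1) while the per-token cost stays O(L). This is a minimal illustration, not the TCNCA architecture itself (causal padding, chunked attention, and gating are omitted).

```python
# Stack of dilated 1D convolutions with exponentially growing dilation.
import torch
import torch.nn as nn

class DilatedTCN(nn.Module):
    def __init__(self, channels=64, depth=6, kernel_size=3):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size,
                      dilation=2**i, padding=(kernel_size - 1) * 2**i // 2)
            for i in range(depth)
        ])

    def forward(self, x):                     # x: (batch, channels, length)
        for conv in self.layers:
            x = torch.relu(conv(x)) + x       # residual connection keeps depth trainable
        return x

x = torch.randn(2, 64, 1024)
print(DilatedTCN()(x).shape)                  # torch.Size([2, 64, 1024])
```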
Accelerating scientific discovery through AI relies on the availability of high-quality data from scientific experimentation. Yet, scientific experimentation suffers from poor reproducibility and data capture challenges, mostly stemming from the difficulty in transcribing all details of an experiment and the different ways in which individuals document their lab work. With the emergence of foundation models capable of processing multiple data modalities including vision and language, there is a unique opportunity to redefine data and metadata capture and the corresponding scientific documentation process.
In this contribution, we discuss the challenges associated with lab digitization today and how multi-modal learning with transformer-based architectures can contribute to a new research infrastructure for scientific discovery in order to fully describe experimental methods and outcomes while facilitating data sharing and collaboration. We present a case study on a hybrid digital infrastructure and transformer-based vision-language models to transcribe high-dimensional raw data streams from non-invasive recording devices that represent the interaction of researchers with lab environments during scientific experimentation. The infrastructure is demonstrated in test cases related to semiconductor research and wet chemistry, where we show how vision-language foundation models fine-tuned on a limited set of experiments can be used to generate reports that exhibit high similarity with the recorded procedures. Our findings illustrate the feasibility of using foundation models to automate data capture and digitize all aspects of scientific experimentation, and suggest that the challenge of scarce training data for specific laboratory procedures can be alleviated by leveraging self-supervised pretraining on more abundant data from other domains.
Authors:
Training machine learning models in a centralized fashion often faces significant challenges due to regulatory and privacy concerns in real-world use cases. These include distributed training data, computational resources to create and maintain a central data repository, and regulatory guidelines (GDPR, HIPAA) that restrict sharing sensitive data. Federated learning (FL) is a new paradigm in machine learning that can mitigate these challenges by training a global model using distributed data, without the need for data sharing. The extensive application of machine learning to analyze and draw insight from real-world, distributed, and sensitive data necessitates familiarization with and adoption of this relevant and timely topic among the scientific community.
Recently, foundation models such as ChatGPT have revolutionized the field of machine learning by demonstrating remarkable capabilities across a wide range of tasks. These models have democratized the development of machine learning models, empowering developers to focus more on tuning a foundation model to their specific task rather than building complex models from scratch. This paradigm shift has the potential to remove the barriers to entry for machine learning development, and enables a broader community of developers to create high-quality models.
However, as the model development process itself becomes increasingly accessible, a new bottleneck emerges: computation power and data access. While foundation models have the potential to perform exceptionally well across various tasks, they pose two challenges: 1) training them requires vast amounts of training data and compute power, and 2) fine-tuning them to specific applications requires specialized and potentially sensitive data. Acquiring and centralizing datasets for both training and fine-tuning poses several challenges, including data privacy concerns, legal constraints (such as GDPR, HIPAA), and computational burdens.
FL is a promising solution to address these challenges in the era of foundation models. The fundamental goal of federated learning is to train models collaboratively across decentralized devices or data silos while keeping the data securely on those devices or within specific organizations. By adopting federated learning approaches, we can leverage the vast amounts of distributed data and compute available across different sources while respecting privacy regulations and data ownership.
The rise of foundation models amplifies the importance and relevance of FL as a crucial research direction. With foundation models becoming the norm in machine learning development, the focus shifts from model architecture design to tackling the issues surrounding privacy-preserving and distributed learning. Advancements in FL methods have the potential to unlock the full potential of foundation models, enabling efficient and scalable training while safeguarding sensitive data.
With this in mind, we invite original research contributions, position papers, and work-in-progress reports on various aspects of federated learning in the age of foundation models. Since the emergence of foundation models has been a relatively recent phenomenon, their full impact on federated learning has not yet been well explored or understood. We hope to provide a platform to facilitate interaction among students, scholars, and industry professionals from around the world to discuss the latest advancements, share insights, and identify future directions in this exciting field.
Authors: Jinghui Chen; Lixin Fan; Gauri Joshi; Sai Praneeth Karimireddy; Stacy Patterson; Shiqiang Wang (IBM); Han Yu
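As a concrete illustration of the federated training loop described in the workshop summary above, here is a minimal FedAvg-style sketch (a toy linear-regression example under our own assumptions, not a specific framework): each client updates a local copy of the model on its own data, and only weighted-averaged parameters travel to the server, never the raw data.

```python
# Minimal FedAvg-style loop on synthetic linear-regression clients.
import numpy as np

rng = np.random.default_rng(3)
true_w = np.array([2.0, -1.0])
clients = []
for size in (200, 50, 120):                       # three clients with different data sizes
    X = rng.normal(size=(size, 2))
    y = X @ true_w + 0.1 * rng.normal(size=size)
    clients.append((X, y))

w = np.zeros(2)                                   # global model
for _ in range(20):                               # communication rounds
    updates, sizes = [], []
    for X, y in clients:
        local = w.copy()
        for _ in range(5):                        # local full-batch gradient steps
            grad = 2 * X.T @ (X @ local - y) / len(y)
            local -= 0.05 * grad
        updates.append(local)
        sizes.append(len(y))
    w = np.average(updates, axis=0, weights=sizes)  # server-side weighted aggregation

print("recovered weights:", np.round(w, 2))
```

The same pattern extends to fine-tuning foundation models: the "local update" becomes a few steps of (parameter-efficient) fine-tuning, and only adapter or model deltas are aggregated.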
Optimal Transport (OT) has fueled machine learning (ML) applications across various domains. In cases where paired data measurements (μ, ν) are coupled to a context variable π, one may aspire to learn a global transportation map, parameterized through the context, to facilitate prediction of target states even from unseen contexts. Existing approaches for this task leverage Brenier's theorem and utilize Neural OT. Here, we follow a radically different approach inspired by quantum computing principles to develop a quantum formulation for learning transportation plans parameterized by a context variable. This is achieved by exploiting a natural link between doubly stochastic matrices and unitary operators. The latter can be directly related to recent results in quantum learning theory suggesting intrinsic advantages in modelling constrained problems with quantum methods. We verify our methodology on synthetic data, emulating the task of predicting single-cell perturbation responses parameterized through drug dosage as context. Our experimental comparisons to a baseline reveal that our method can capture dose-induced variations in cell distributions, even, to some extent, when extrapolating to dosages outside the interval seen during training. In summary, this work assesses the feasibility of learning to predict contextualized transportation plans through a novel quantum computing approach.
Authors: Nicola Mariella (IBM); Jannis Born (IBM); Albert Akhriev (IBM); Francesco Tacchino (IBM); Christa Zoufal (IBM); Eugene Koskin; Ivano Tavernelli (IBM); Stefan Woerner (IBM); Marianna Rapsomaniki (IBM); Sergiy Zhuk (IBM)
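One concrete form of the link between unitaries and transportation plans mentioned above (a standard fact, stated here for illustration): any unitary $U$ induces a doubly stochastic matrix via
\[
B_{ij} = |U_{ij}|^2, \qquad \sum_i B_{ij} = \sum_j B_{ij} = 1,
\]
so a family of unitaries parameterized by the context (e.g., a quantum circuit conditioned on dosage) yields a family of doubly stochastic matrices, which can be read as discretized transportation plans between μ and ν.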
Using novel approaches to dataset development, the Biasly dataset captures the nuance and subtlety of misogyny in ways that are unique within the literature. Built in collaboration with multi-disciplinary experts and annotators themselves, the dataset contains annotations of movie subtitles, capturing colloquial expressions of misogyny in North American film. The dataset can be used for a range of NLP tasks, including classification, severity score regression, and text generation for rewrites. In this paper, we discuss the methodology used, analyze the annotations obtained, and provide baselines using common NLP algorithms in the context of misogyny detection and mitigation. We hope this work will promote AI for social good in NLP for bias detection, explanation, and removal. Content Warning: To illustrate examples from our dataset, misogynistic language is used in section 3 and table 3, which may be offensive or upsetting.
Authors: Ioana Baldini
Off-the-shelf pre-trained models are increasingly common in machine learning. In real-world applications, it is essential that the pre-trained models are not just accurate but also demonstrate qualities like fairness. This paper takes a closer look at recently proposed approaches that re-weight the training data to edit a pre-trained model for group fairness. We offer perspectives that unify disparate weighting schemes from past studies and pave the way for new weighting strategies to address group fairness concerns.
Authors: Soumya Ghosh (IBM); Prasanna Sattigeri (IBM); Inkit Padhi (IBM); Manish Nagireddy (IBM); Jie Chen (IBM)
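The common template behind the weighting schemes discussed above is weighted empirical risk minimization over the pre-trained model's training data,
\[
\min_{\theta} \; \sum_{i=1}^{n} w_i \, \ell\bigl(f_\theta(x_i), y_i\bigr),
\]
where, schematically, the weight $w_i$ depends on the example's group (and, in some schemes, on its loss under the current model) and is chosen so that group-level risks are brought closer together; different past proposals then correspond to different choices of the map from groups to weights. This is a unifying sketch rather than any single method's exact objective.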
Data-consistent model inversion problems aim to infer distributions of model parameters from distributions of experimental observations. Previous approaches to solving these problems include rejection algorithms, which are impractical for many real-world problems, and generative adversarial networks, which require a differentiable simulation. Here, we introduce a sequential sample refinement algorithm that overcomes these drawbacks. A set of parameters is iteratively refined using density ratio estimates in the model input and output domains, and parameters are resampled by training a generative implicit density estimator. We implement this novel approach using a combination of standard models from artificial intelligence and machine learning, including density estimators, binary classifiers, and diffusion models. To demonstrate the method, we show two examples from computational biology, with different levels of complexity.
Authors:
Pretrained Language Models (PLMs) have become the de facto starting point for fine-tuning on downstream tasks. However, as model sizes continue to increase, traditional fine-tuning of all parameters becomes challenging. To address this, parameter-efficient fine-tuning (PEFT) methods have gained popularity as a means to adapt PLMs effectively. In parallel, recent studies have revealed the presence of activation sparsity within the intermediate outputs of the MLP blocks in transformers. Low activation density enables efficient model inference on sparsity-aware hardware. Building upon this insight, in this work we propose a novel density loss that encourages higher activation sparsity (equivalently, lower activation density) in pre-trained models; we refer to the resulting approach as DEFT. In our experiments, we demonstrate the effectiveness of our proposed approach by employing mainstream PEFT techniques such as LoRA, Adapter, and Prompt/Prefix Tuning. DEFT consistently achieves substantial reductions in activation density. For example, on the T5-Base model, DEFT leads to reductions of, on average, 47.77% in encoder density and 81.82% in decoder density compared to PEFT baselines. These trends are mirrored across various GeLU-activation-based models, including ViT-Base (86M), ViT-Large (307M), RoBERTa-Base (125M), RoBERTa-Large (355M), and GPT2 (117M), with density reductions ranging from 29.61% to 56.68%.
Authors: Bharat Runwal; Tejaswini Pedapati (IBM); Pin-Yu Chen (IBM)
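A hedged sketch of an activation-density penalty in the spirit of the abstract above: hook the intermediate MLP activations and add a soft density proxy to the task loss. The mean-absolute-value proxy and the penalty weight are assumptions for illustration, not the exact DEFT loss.

```python
# Add a soft activation-density penalty on a GELU MLP's intermediate output.
import torch
import torch.nn as nn

mlp = nn.Sequential(nn.Linear(32, 128), nn.GELU(), nn.Linear(128, 2))
activations = {}
# Capture the post-GELU activations on every forward pass.
mlp[1].register_forward_hook(lambda m, inp, out: activations.__setitem__("h", out))

x = torch.randn(16, 32)
y = torch.randint(0, 2, (16,))

logits = mlp(x)
task_loss = nn.functional.cross_entropy(logits, y)
density = activations["h"].abs().mean()          # soft, differentiable stand-in for the fraction of nonzeros
loss = task_loss + 0.1 * density                 # lambda = 0.1 is illustrative
loss.backward()
print(float(task_loss), float(density))
```

In a PEFT setting, only the adapter/LoRA/prompt parameters would receive these gradients while the backbone stays frozen.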
Per- and polyfluoroalkyl substances (PFAS) are a broad class of molecules used in almost every sector of industry and consumer goods. PFAS exhibit highly desirable properties, such as high durability, water repellence, or high acidity, that are difficult to match. As a side effect, PFAS persist in the environment and have detrimental effects on human health. Epidemiological research has linked PFAS exposure to chronic health conditions, including dyslipidemia, cardiometabolic disorders, liver damage, and hypercholesterolemia. Recently, public health agencies significantly strengthened regulations on the use of PFAS. Therefore, alternatives are needed to maintain the pace of technological development in multiple areas that traditionally relied on PFAS. To support the discovery of alternatives, we introduce MatGFN-PFAS, an AI system that generates PFAS replacements. We build MatGFN-PFAS using Generative Flow Networks (GFlowNets) for generation and a Chemical Language Model (MolFormer) for property prediction. We evaluate MatGFN-PFAS by exploring potential replacements for PFAS superacids, defined as molecules with negative pKa, which are critical for the semiconductor industry. It might be challenging to eliminate PFAS superacids entirely as a class due to the strong constraints on their functional performance; the proposed approach accounts for this possibility and also enables the generation of safer PFAS superacids. We evaluate two design strategies: 1) using Tversky similarity to design molecules similar to a target PFAS but with lower toxicity, and 2) directly generating molecules with negative pKa and low toxicity. For the query SMILES CC1CC(CC(F)(F)C(F)(F)OC(F)(F)C(F)(F)S(=O)(=O)O)OC1=O, MatGFN-PFAS generated a candidate with very low toxicity (LD50 = 7304.23), strong acidity (pKa = -1.92), and a high similarity score of 89.32% to the query molecule. The results demonstrate that MatGFN-PFAS consistently generates replacement molecules satisfying all of the aforementioned constraints. The resulting datasets for each studied molecule are available at anonymized.
Authors: Eduardo Almeida Soares (IBM); Flaviu Cipcigan (IBM); Dmitry Zubarev (IBM); Emilio Ashton Vital Brazil (IBM)
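For reference, the Tversky similarity used in the first design strategy above is, for set (e.g., molecular fingerprint, which is an assumption here) representations $A$ and $B$ of the query and candidate,
\[
S_{\alpha,\beta}(A, B) \;=\; \frac{|A \cap B|}{|A \cap B| + \alpha\,|A \setminus B| + \beta\,|B \setminus A|},
\]
which reduces to the Tanimoto/Jaccard coefficient when $\alpha = \beta = 1$; asymmetric choices of $\alpha$ and $\beta$ bias the search toward sub- or super-structures of the query.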
We develop methods for estimating Fréchet bounds on (possibly high-dimensional) distribution classes in which some variables are continuous-valued. We establish the statistical correctness of the computed bounds under uncertainty in the marginal constraints and demonstrate the usefulness of our algorithms by evaluating the performance of machine learning (ML) models trained with programmatic weak supervision (PWS). PWS is a framework for principled learning from weak supervision inputs (e.g., crowdsourced labels, knowledge bases, pre-trained models on related tasks, etc.), and it has achieved remarkable success in many areas of science and engineering. Unfortunately, it is generally difficult to validate the performance of ML models trained with PWS due to the absence of labeled data. Our algorithms address this issue by estimating sharp lower and upper bounds for performance metrics such as accuracy/recall/precision.
Authors: Felipe Maia Polo; Mikhail Yurochkin (IBM); Moulinath Banerjee; Subha Maity; Yuekai Sun
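In the simplest bivariate case, the classical Fréchet–Hoeffding inequalities bound any joint CDF with fixed marginals $F_X$ and $F_Y$:
\[
\max\bigl(F_X(x) + F_Y(y) - 1,\; 0\bigr) \;\le\; F_{X,Y}(x, y) \;\le\; \min\bigl(F_X(x),\, F_Y(y)\bigr).
\]
The setting above generalizes this idea: given marginal constraints (possibly estimated with uncertainty), one seeks sharp lower and upper bounds on a functional of the unobserved joint distribution, such as the accuracy, recall, or precision of a PWS-trained model.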