Neural Information Processing Systems (NeurIPS) is a leading machine learning and computational neuroscience conference. IBM Research is excited to sponsor NeurIPS again this year as a Platinum sponsor.
We invite all attendees to visit us during the event at booth number 243, from Tuesday, December 10 through Thursday, December 12.
We look forward to meeting you and telling you more about our latest work and career opportunities at IBM Research. At our booth we’ll be demoing projects on a broad range of AI topics such as foundation models, trustworthy AI, natural language processing and understanding, knowledge and reasoning, AI automation, human-centered AI, and federated learning.
Presentation times of conference workshops, demos, papers, and tutorials can be found in the agenda section at the bottom of this page. Note: All times are displayed in your local time.
Visit us at the IBM Booth to meet with IBM researchers and recruiters to speak about future job opportunities or 2025 summer internships.
Visit us at the IBM booth in the exhibit hall to talk to our researchers and recruiters. We'll also be doing demos of our work. View our booth demo schedule and list of available IBM Research staff here.
EXPO | Atin Sood
Large Language Models for Code (or code LLMs) are increasingly gaining popularity and capabilities, offering a wide array of application modernization use cases such as code explanation, test generation, code repair, refactoring, translation, code generation, code completion and more. To leverage code LLMs to their full potential, developers must provide code-specific contextual information to the models. We will demonstrate generic pipelines we built that incorporate static analysis to guide LLMs in generating code explanations at various levels (application, method, class) and in automated test generation that produces compilable, high-coverage, and natural-looking test cases. We will also demonstrate how these pipelines can be built using “codellm-devkit”, an open-source library that significantly simplifies program analysis at various levels of granularity, making it easier to integrate detailed, code-specific insights that enhance the efficiency and effectiveness of LLMs in coding tasks, and how these use cases can be extended to different programming languages, specifically Java and Python.
EXPO | Julian Büchel
Analog in-memory computing (AIMC) using resistive memory devices has the potential to increase the energy efficiency of deep neural network inference by multiple orders of magnitude. This is enabled by performing matrix vector multiplications – one of the key operations in deep neural network inference – directly within the memory, avoiding expensive weight fetching from external memory such as DRAM. The IBM HERMES Project Chip is a state-of-the-art, 64-core mixed-signal AIMC chip based on Phase Change Memory that makes this concept a reality. Using this chip, we demonstrate automatic deployment and inference of a Transformer model capable of predicting chemical compounds that are formed in a chemical reaction.
EXPO | Luis Lastras
We aim to reframe how developers create LLM applications. Instead of iterating on verbose, complex prompts to achieve a desired complex behavior, we break down complex tasks into a series of standard computing elements that can be called by a developer in programmatic way. In this demonstration we will explore how leveraging an LLM trained with key intrinsic functions, such as hallucination detection, uncertainty quantification, and topic scoping, could unlock a new way of building and working with LLMs.
EXPO | Rohan Arora
IT failures are increasingly costly, with even brief outages leading to millions in losses as more business moves online. Incident management has become more complex than ever due to a combination of technological advancements, infrastructure heterogeneity, and evolving business needs. Resolving IT incidents is similar to, if not more complex than, fixing bugs in software code. It is a tedious and expensive task. Several advancements have been made, including IBM’s Intelligent Incident Remediation, which uses LLMs and generative AI to streamline incident resolution by identifying probable causes and suggesting AI-guided remediation steps. In this demo, we describe how we are advancing the state of the art in incident remediation using agentic Gen AI approaches. We demonstrate SRE-Agent-101, a ReAct-style LLM-based agent, along with a benchmark to standardize the evaluation of analytical solutions for incident management. SRE-Agent-101 uses several custom-built tools, namely anomaly detection, causal topology extraction, NL2Traces, NL2Metrics, NL2Logs, NL2TopologyTraversal, and NL2Kubectl. These tools take natural language as input to fetch target data gathered by the observability stack. Given the verbosity of such data, even powerful models can quickly exhaust their context length. We have implemented a methodology to dynamically discover the more specific context using domain knowledge. The target context is then analyzed by the underlying LLM to infer the root-cause entity and fault and to perform actions, and this process continues iteratively until the incident is resolved.
EXPO | Werner Geyer
Evaluation of large language model (LLM) outputs requires users to make critical judgments about the best outputs across various configurations. This process is costly and takes time given the large amounts of data. LLMs are increasingly used as evaluators to filter training data, evaluate model performance, detect harms and risks, or assist human evaluators with detailed assessments. To support this process, effective front-end tools are critical for evaluation. EvalAssist abstracts the llm-as-a-judge evaluation process into a library of parameterize-able evaluators (the criterion being the parameter), allowing the user to focus on criteria definition. EvalAssist consists of a web-based user experience, an API, and a Python toolkit and is based on the UNITXT open-source library. The user interface provides users with a convenient way of iteratively testing and refining LLM-as-a-judge criteria, and supports both direct (rubric-based) and pairwise assessment paradigms, the two most prevalent forms of LLM-as-a-judge evaluation available. In our demo, we will showcase different types of evaluator LLMs for general purpose evaluation and also the latest Granite Guardian model (released October 2024) to evaluate harms and risks.
EXPO | Leonid Karlinsky
Enterprise applications present unique challenges for vision and language foundation models, as they frequently involve visual data that diverges significantly from the typical distribution of web images and require understanding of nuanced details such as small text in scanned documents, or tiny defects in industrial equipment images. Motivated by these challenges, we will showcase our IBM Granite Vision model, a foundation model with state-of-the-art performance in document image understanding tasks, such as the analysis of charts, plots, infographics, tables, flow diagrams, and more. We will provide a detailed overview of our methodology and present a live demonstration of our model's capabilities, illustrating its key features and applications. Our model will be open-sourced, allowing the community to access and contribute to its development.
Low-rank adapters (LoRA) and their variants are popular parameter-efficient fine-tuning (PEFT) techniques that closely match full model fine-tune performance while requiring only a small number of additional parameters. These additional LoRA parameters are specific to the base model being adapted. When the base model needs to be deprecated and replaced with a new one, all the associated LoRA modules need to be re-trained. Such re-training requires access to the data used to train the LoRA for the original base model. This is especially problematic for commercial cloud applications where the LoRA modules and the base models are hosted by service providers who may not be allowed to host proprietary client task data. To address this challenge, we propose a novel method for lossless, nearly data-free transfer of LoRAs across base models. Our approach relies on synthetic data to transfer LoRA modules. Using large language models, we design a synthetic data generator to approximate the data-generating process of the observed task data subset. Training on the resulting synthetic dataset transfers LoRA modules to new models. We show the effectiveness of our approach using both Llama and Gemma model families. Our approach achieves lossless (mostly improved) LoRA transfer between models within and across different base model families, and even between different PEFT methods, on a wide variety of tasks.
Runqian Wang (IBM); Soumya Ghosh (IBM); David Cox (IBM); Diego Antognini (IBM); Aude Oliva; Rogerio Feris (IBM); Leonid Karlinsky (IBM)
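For readers new to the low-rank adapters discussed in this abstract, here is a minimal sketch of the standard LoRA forward pass. It assumes a frozen weight matrix W and two small trainable factors A and B. The sizes, the scaling factor alpha, and all function names are illustrative assumptions, not taken from the paper, and the sketch does not cover the transfer method itself.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha=1.0):
    """Return W x + (alpha / r) * B (A x), the standard LoRA forward pass.

    W is d_out x d_in (frozen); A is r x d_in and B is d_out x r (trainable).
    """
    r = len(A)
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    return [b + (alpha / r) * l for b, l in zip(base, low_rank)]


def count_params(matrix):
    """Count the scalar entries of a matrix stored as a list of rows."""
    return sum(len(row) for row in matrix)


rng = random.Random(0)
d_in, d_out, r = 64, 64, 4  # illustrative sizes (assumption)
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]  # B starts at zero, so the adapter is initially a no-op
x = [rng.gauss(0, 1) for _ in range(d_in)]

adapter_params = count_params(A) + count_params(B)
base_params = count_params(W)
```

Because A and B are shaped against a specific W, the adapter cannot simply be reattached to a different base model, which is the dependency the abstract describes.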
Feature attribution methods explain black-box machine learning (ML) models by assigning importance scores to input features. These methods can be computationally expensive for large ML models. To address this challenge, there have been increasing efforts to develop amortized explainers, where a machine learning model is trained to predict feature attribution scores with only one inference. Despite their efficiency, amortized explainers can produce inaccurate predictions and misleading explanations. In this paper, we propose selective explanations, a novel feature attribution method that (i) detects when amortized explainers generate low-quality explanations and (ii) improves these explanations using a technique called explanations with initial guess. Our selective explanation method allows practitioners to specify the fraction of samples that receive explanations with initial guess, offering a principled way to bridge the gap between amortized explainers (one inference) and their high-quality counterparts (multiple inferences).
Lucas Monteiro Paes; Dennis Wei (IBM); Flavio Calmon
Programmatic Weak Supervision (PWS) enables supervised model training without direct access to ground truth labels, utilizing weak labels from heuristics, crowdsourcing, or pre-trained models. However, the absence of ground truth complicates model evaluation, as traditional metrics such as accuracy, precision, and recall cannot be directly calculated. In this work, we present a novel method to address this challenge by framing model evaluation as a partial identification problem and estimating performance bounds using Fréchet bounds. Our approach derives reliable bounds on key metrics without requiring labeled data, overcoming core limitations in current weak supervision evaluation techniques. Through scalable convex optimization, we obtain accurate and computationally efficient bounds for metrics including accuracy, precision, recall, and F1-score, even in high-dimensional settings. This framework offers a robust approach to assessing model quality without ground truth labels, enhancing the practicality of weakly supervised learning for real-world applications.
Felipe Maia Polo; Subha Maity; Mikhail Yurochkin (IBM); Moulinath Banerjee; Yuekai Sun
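The Fréchet bounds named in this abstract can be illustrated with the classical two-event inequality max(0, P(A) + P(B) - 1) ≤ P(A and B) ≤ min(P(A), P(B)). The sketch below applies it to binary accuracy when only the rate of positive predictions and the rate of positive labels are known. This simplified setting, including the assumption that the class prior is known, is ours; the paper's method works with weak labels and convex optimization and is not reproduced here.

```python
def frechet_joint_bounds(p_a, p_b):
    """Fréchet bounds on P(A and B) given only the marginals P(A) and P(B)."""
    lower = max(0.0, p_a + p_b - 1.0)
    upper = min(p_a, p_b)
    return lower, upper


def accuracy_bounds(p_pred_pos, p_true_pos):
    """Bounds on binary accuracy when only the two marginals are known.

    accuracy = P(pred=1, y=1) + P(pred=0, y=0)
             = 1 - P(pred=1) - P(y=1) + 2 * P(pred=1, y=1)
    """
    joint_lo, joint_hi = frechet_joint_bounds(p_pred_pos, p_true_pos)
    offset = 1.0 - p_pred_pos - p_true_pos
    return offset + 2.0 * joint_lo, offset + 2.0 * joint_hi


# Hypothetical marginals: 60% positive predictions, 70% positive labels.
lo, hi = accuracy_bounds(p_pred_pos=0.6, p_true_pos=0.7)
```

Without further information the interval is wide; the extra structure supplied by weak labels is what lets a method of this kind tighten it.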
As Large Language Models (LLMs) demonstrate extensive capability in learning from documents, LLM unlearning becomes an increasingly important research area to address concerns of LLMs in terms of privacy, copyright, etc. A conventional LLM unlearning task typically involves two goals: (1) The target LLM should forget the knowledge in the specified forget documents; and (2) it should retain the other knowledge that the LLM possesses, for which we assume access to a small number of retain documents. To achieve both goals, a mainstream class of LLM unlearning methods introduces an optimization framework with a combination of two objectives – maximizing the prediction loss on the forget documents while minimizing that on the retain documents, which suffers from two challenges, degenerated output and catastrophic forgetting. In this paper, we propose a novel unlearning framework called Unlearning from Logit Difference (ULD), which introduces an assistant LLM that aims to achieve the opposite of the unlearning goals: remembering the forget documents and forgetting the retain knowledge. ULD then derives the unlearned LLM by computing the logit difference between the target and the assistant LLMs. We show that such reversed objectives would naturally resolve both aforementioned challenges while significantly improving the training efficiency. Extensive experiments demonstrate that our method efficiently achieves the intended forgetting while preserving the LLM’s overall capabilities, reducing training time by more than threefold. Notably, our method loses 0% of model utility on the ToFU benchmark, whereas baseline methods may sacrifice 17% of utility on average to achieve comparable forget quality.
Jiabao Ji; Yujian Liu; Yang Zhang (IBM); Gaowen Liu; Ramana Rao Kompella; Sijia Liu; Shiyu Chang
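The logit-difference step described in this abstract can be sketched on a toy vocabulary. The three-token vocabulary, the logit values, and the weighting factor alpha are illustrative assumptions; the abstract only states that the unlearned model is derived from the difference between the target and assistant logits.

```python
import math


def softmax(logits):
    """Convert logits to probabilities, shifting by the max for stability."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_difference(target_logits, assistant_logits, alpha=1.0):
    """Subtract the assistant's logits (scaled by alpha) from the target's."""
    return [t - alpha * a for t, a in zip(target_logits, assistant_logits)]


# Toy vocabulary of three tokens; token 0 stands for "forget" knowledge.
target = [3.0, 1.0, 0.5]     # the target LLM still prefers the forget token
assistant = [2.5, 0.0, 0.0]  # the assistant remembers only the forget token

before = softmax(target)
after = softmax(logit_difference(target, assistant))
```

In this toy case the subtraction lowers the probability of the forget token while leaving the relative order of the other tokens unchanged.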
The goal of generalized few-shot semantic segmentation (GFSS) is to recognize novel-class objects through training with a few annotated examples and a base-class model that has learned the knowledge about base classes. Unlike classic few-shot semantic segmentation, GFSS aims to classify pixels into both base and novel classes, meaning that GFSS is a more practical setting. To this end, existing methods rely on techniques such as customized models, carefully designed loss functions, and transductive learning. However, we found that a simple rule and standard supervised learning substantially improve performance in GFSS. In this paper, we propose a simple yet effective method for GFSS that does not use the aforementioned techniques employed in existing methods. Moreover, we theoretically prove that our method perfectly maintains most of the base-class segmentation performance. Through numerical experiments, we demonstrate the effectiveness of the proposed method. In particular, our method improves novel-class segmentation performance on both the PASCAL and COCO benchmarks.
Tomoya Sakai (IBM); Haoxiang Qiu (IBM); Takayuki Katsuki (IBM); Daiki Kimura (IBM); Takayuki Osogami (IBM); Tadanobu Inoue (IBM)
Safety alignment is key to guiding the behaviors of large language models (LLMs) so that they are in line with human preferences and to restricting harmful behaviors at inference time, but recent studies show that it can be easily compromised by finetuning with only a few adversarially designed training examples. We aim to measure the risks in finetuning LLMs through navigating the LLM safety landscape. We discover a new phenomenon observed universally in the model parameter space of popular open-source LLMs, termed "safety basin": randomly perturbing model weights maintains the safety level of the original aligned model in its local neighborhood. Our discovery inspires us to propose the new VISAGE safety metric that measures the safety in LLM finetuning by probing its safety landscape. Visualizing the safety landscape of the aligned model enables us to understand how finetuning compromises safety by dragging the model away from the safety basin. The LLM safety landscape also highlights the system prompt's critical role in protecting a model, and that such protection transfers to its perturbed variants within the safety basin. These observations from our safety landscape research provide new insights for future work in the LLM safety community.
Shengyun Peng; Pin-Yu Chen (IBM); Matthew Hull; Duen Horng Chau
We propose an unsupervised adaptation framework, Self-TAught Recognizer (STAR), which leverages unlabeled data to enhance the robustness of automatic speech recognition (ASR) systems in diverse target domains, such as noise and accents. STAR is developed for prevalent speech foundation models based on Transformer-related architecture with auto-regressive decoding (e.g., Whisper, Canary). Specifically, we propose a novel indicator that empirically integrates step-wise information during decoding to assess the token-level quality of pseudo labels without ground truth, thereby guiding model updates for effective unsupervised adaptation. Experimental results show that STAR achieves an average of 13.5% relative reduction in word error rate across 14 target domains, and it sometimes even approaches the upper-bound performance of supervised adaptation. Surprisingly, we also observe that STAR protects the adapted model from the common catastrophic forgetting problem without recalling source-domain data. Furthermore, STAR exhibits high data efficiency, requiring less than one hour of unlabeled data, and generalizes seamlessly to alternative large speech models and speech translation tasks. We aim to open-source our code for the research community.
Yuchen Hu; Chen Chen; Chao-han Huck Yang; Pin-Yu Chen (IBM); Chengwei Qin; Eng Siong Chng; Chao Zhang
Large pre-trained models excel in zero/few-shot learning for language and vision tasks but face challenges in multivariate time series (TS) forecasting due to diverse data characteristics. Consequently, recent research efforts have focused on developing pre-trained TS forecasting models. These models, whether built from scratch or adapted from large language models (LLMs), excel in zero/few-shot forecasting tasks. However, they are limited by slow performance, high computational demands, and neglect of cross-channel and exogenous correlations. To address this, we introduce Tiny Time Mixers (TTM), a compact model (starting from 1M parameters) with effective transfer learning capabilities, trained exclusively on public TS datasets. TTM, based on the light-weight TSMixer architecture, incorporates innovations like adaptive patching, diverse resolution sampling, and resolution prefix tuning to handle pre-training on varied dataset resolutions with minimal model capacity. Additionally, it employs multi-level modeling to capture channel correlations and infuse exogenous signals during fine-tuning. TTM outperforms existing popular benchmarks in zero/few-shot forecasting by 4-40%, while significantly reducing computational requirements. Moreover, TTMs are lightweight and can be executed even on CPU-only machines, enhancing usability and fostering wider adoption in resource-constrained environments. Model weights for our initial variant TTMQ are available here. Model weights for more sophisticated variants (TTMB, TTME, and TTMA) will be shared soon. The source code for TTM can be accessed here.
Vijay E (IBM); Arindam Jati (IBM); Pankaj Dayama (IBM); Sumanta Mukherjee (IBM); Nam Nguyen (IBM); Wesley Gifford (IBM); Chandra Reddy (IBM); Jayant Kalagnanam (IBM)
We propose a novel approach to molecular simulations using neural network reparametrization, which offers a flexible alternative to traditional coarse-graining methods. Unlike conventional techniques that strictly reduce degrees of freedom, the complexity of the system can be adjusted in our model, sometimes increasing it to simplify the optimization process. Our approach also maintains continuous access to fine-grained modes and eliminates the need for force-matching, enhancing both the efficiency and accuracy of energy minimization. Importantly, our framework allows for the use of potentially arbitrary neural networks (e.g., Graph Neural Networks (GNN)) to perform the reparametrization, incorporating CG modes as needed. In fact, in our experiments using very weak molecular forces (Lennard-Jones potential), the GNN-based model is the sole model to find the correct configuration. Similarly, in protein-folding scenarios, our GNN-based CG method consistently outperforms traditional optimization methods. It not only recovers the target structures more accurately but also achieves faster convergence to the deepest energy states. This work demonstrates significant advancements in molecular simulations by optimizing energy minimization and convergence speeds, offering a new, efficient framework for simulating complex molecular systems.
Nima Dehmamy (IBM); Csaba Both; Jeet Mohapatra; Subhro Das (IBM); Tommi Jaakkola
Training generative models with differential privacy (DP) typically involves injecting noise into gradient updates or adapting the discriminator's training procedure. As a result, such approaches often struggle with hyper-parameter tuning and convergence. We introduce a mechanism that injects noise into random low-dimensional projections of the private data. These noisy projections are used for training generative models. To enable optimizing generative models using this approach, we introduce the smoothed-sliced f-divergence, which ensures both DP and statistical consistency. Moreover, we present a kernel-based estimator for this divergence, circumventing the need for adversarial training. Extensive numerical experiments demonstrate that our approach can generate synthetic data of higher quality compared with baselines. Beyond performance improvement, our method, by sidestepping the need for noisy gradients, offers data scientists the flexibility to adjust generator architecture and hyper-parameters, run the optimization over any number of epochs, and even restart the optimization process, all without incurring additional privacy costs.
Kristjan Greenewald (IBM); Yuancheng Yu; Hao Wang (IBM); Kai Xu (IBM)
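The idea of adding noise to random low-dimensional projections, as described in this abstract, can be sketched as follows. This is only an illustration: the one-dimensional projections, the Gaussian noise, the noise scale sigma, and all function names are assumptions, and the sketch makes no claim to a differential privacy guarantee or to matching the paper's calibration.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a direction uniformly at random from the unit sphere."""
    v = [rng.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def noisy_projections(data, num_slices, sigma, seed=0):
    """Project each row onto random unit directions and add Gaussian noise.

    sigma is a hypothetical noise scale; no privacy guarantee is claimed here.
    """
    rng = random.Random(seed)
    dim = len(data[0])
    directions = [random_unit_vector(dim, rng) for _ in range(num_slices)]
    projected = [
        [sum(x * d for x, d in zip(row, direction)) + rng.gauss(0, sigma)
         for direction in directions]
        for row in data
    ]
    return directions, projected


data = [[1.0, 2.0, 3.0], [0.0, -1.0, 0.5]]  # toy stand-in for private records
directions, projected = noisy_projections(data, num_slices=2, sigma=0.1)
```

Once the noisy projections are released, any downstream training can reuse them as often as needed, which is the flexibility the abstract highlights.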
Interacting systems are prevalent in nature. It is challenging to accurately predict the dynamics of the system if its constituent components are analyzed independently. We develop a graph-based model that unveils the systemic interactions of time series observed at irregular time points, by using a directed acyclic graph to model the conditional dependencies (a form of causal notation) of the system components and learning this graph in tandem with a continuous-time model that parameterizes the solution curves of ordinary differential equations (ODEs). Our technique, a graph neural flow, leads to substantial enhancements over non-graph-based methods, as well as graph-based methods without the modeling of conditional dependencies. We validate our approach on several tasks, including time series classification and forecasting, to demonstrate its efficacy.
Giangiacomo Mercatali; Andre Freitas; Jie Chen (IBM)
Large Language Models (LLMs) are becoming a prominent generative AI tool, where the user enters a query and the LLM generates an answer. To reduce harm and misuse, efforts have been made to align these LLMs to human values using advanced training techniques such as Reinforcement Learning from Human Feedback (RLHF). However, recent studies have highlighted the vulnerability of LLMs to adversarial jailbreak attempts aiming at subverting the embedded safety guardrails. To address this challenge, this paper defines and investigates the Refusal Loss of LLMs and then proposes a method called Gradient Cuff to detect jailbreak attempts. Gradient Cuff exploits the unique properties observed in the refusal loss landscape, including functional values and its smoothness, to design an effective two-step detection strategy. Experimental results on two aligned LLMs (LLaMA-2-7B-Chat and Vicuna-7B-V1.5) and six types of jailbreak attacks (GCG, AutoDAN, PAIR, TAP, Base64, and LRL) show that Gradient Cuff can significantly improve the LLM's rejection capability for malicious jailbreak queries, while maintaining the model's performance for benign user queries by adjusting the detection threshold.
Xiaomeng Xu; Pin-Yu Chen (IBM); Tsung-yi Ho
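A two-step check of the kind this abstract describes, first on the value of a loss and then on how steeply it changes, can be sketched on a toy function. The thresholds, the random-direction finite-difference estimator, and the toy loss functions are all assumptions made for illustration; the abstract does not specify them, and a real refusal loss would be computed from LLM responses rather than from a numeric vector.

```python
import math
import random


def estimate_grad_norm(loss_fn, x, num_dirs=32, mu=1e-3, seed=0):
    """Zeroth-order estimate of the gradient norm of loss_fn at x."""
    rng = random.Random(seed)
    base = loss_fn(x)
    grad = [0.0] * len(x)
    for _ in range(num_dirs):
        u = [rng.gauss(0, 1) for _ in x]
        shifted = [xi + mu * ui for xi, ui in zip(x, u)]
        slope = (loss_fn(shifted) - base) / mu  # directional finite difference
        grad = [g + slope * ui / num_dirs for g, ui in zip(grad, u)]
    return math.sqrt(sum(g * g for g in grad))


def two_step_detect(loss_fn, x, value_threshold=0.5, grad_threshold=1.0):
    """Flag x if the loss value is low, or if the loss surface is steep at x."""
    if loss_fn(x) < value_threshold:  # step 1: function value
        return True
    return estimate_grad_norm(loss_fn, x) > grad_threshold  # step 2: smoothness


origin = [0.0, 0.0, 0.0]
flat_high = lambda x: 0.9              # high, flat loss: passes both steps
low_value = lambda x: 0.1              # low loss: flagged in step 1
steep_high = lambda x: 0.9 + 5 * x[0]  # high but steep loss: flagged in step 2
```

Raising or lowering `grad_threshold` trades detection rate against false alarms on benign inputs, mirroring the adjustable detection threshold mentioned in the abstract.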
Designing novel functional proteins crucially depends on accurately modeling their fitness landscape. Given the limited availability of functional annotations from wet-lab experiments, previous methods have primarily relied on self-supervised models trained on vast, unlabeled protein sequence or structure datasets. While initial protein representation learning studies solely focused on either sequence or structural features, recent hybrid architectures have sought to merge these modalities to harness their respective strengths. However, these sequence-structure models have so far achieved only incremental improvements when compared to the leading sequence-only approaches, highlighting unresolved challenges in effectively leveraging these modalities together. Moreover, the function of certain proteins is highly dependent on the granular aspects of their surface topology, which have been overlooked by prior models. To address these limitations, we introduce the Sequence-Structure-Surface Fitness model, a novel multimodal representation learning framework that integrates protein features across several scales. Our approach combines sequence representations from a protein language model with Geometric Vector Perceptron networks encoding protein backbone and detailed surface topology. The proposed method achieves state-of-the-art fitness prediction on the ProteinGym benchmark encompassing 217 substitution deep mutational scanning assays, and provides insights into the determinants of protein function. Code will be released upon acceptance.
Zuobai Zhang; Pascal Notin; Yining Huang; Aurelie Lozano (IBM); Vijil Vijil (IBM); Debora Marks; Payel Das (IBM); Jian Tang
Fine-tuning pre-trained models is a popular approach in machine learning for solving complex tasks with moderate data. However, fine-tuning the entire pre-trained model is ineffective in federated data scenarios where local data distributions are diversely skewed. To address this, we explore integrating federated learning with a more effective prompt-tuning method, optimizing for a small set of input prefixes to reprogram the pre-trained model's behavior. Our approach transforms federated learning into a distributed set modeling task, aggregating diverse sets of prompts to globally fine-tune the pre-trained model. We benchmark various baselines based on direct adaptations of existing federated model aggregation techniques and introduce a new probabilistic prompt aggregation method that substantially outperforms these baselines. Our reported results on a variety of computer vision datasets confirm that the proposed method is most effective to combat extreme data heterogeneity in federated learning.
Pei-yau Weng; Minh Hoang; Lam Nguyen (IBM); My Thai; Lily Weng; Trong Nghia Hoang
Despite recent popularity of attention-based neural architectures in core AI fields like natural language processing (NLP) and computer vision (CV), their potential in modeling complex physical systems remains under-explored. Learning problems in physical systems are often characterized as discovering operators that map between function spaces based on a few instances of function pairs. This task frequently presents a severely ill-posed PDE inverse problem. In this work, we propose a novel neural operator architecture based on the attention mechanism, which we coin Nonlocal Attention Operator (NAO), and explore its capability towards developing a foundation physical model. In particular, we show that the attention mechanism is equivalent to a double integral operator that enables nonlocal interactions among spatial tokens, with a data-dependent kernel characterizing the inverse mapping from data to the hidden parameter field of the underlying operator. As such, the attention mechanism extracts global prior information from training data generated by multiple systems, and suggests the exploratory space in the form of a nonlinear kernel map. Consequently, NAO can address ill-posedness and rank deficiency in inverse PDE problems by encoding regularization and achieving generalizability. Lastly, we empirically demonstrate the advantages of NAO over baseline neural models in terms of the generalizability to unseen data resolutions and system states. Our work not only suggests a novel neural operator architecture for learning an interpretable foundation model of physical systems, but also offers a new perspective towards understanding the attention mechanism.
Yue Yu; Ning Liu; Fei Lu; Tian Gao (IBM); Siavash Jafarzadeh; Stewart Silling
While 2D diffusion models generate realistic, high-detail images, 3D shape generation methods like Score Distillation Sampling (SDS) built on these 2D diffusion models produce cartoon-like, over-smoothed shapes. To help explain this discrepancy, we show that the image guidance used in Score Distillation can be understood as the velocity field of a 2D denoising generative process, up to the choice of a noise term. In particular, after a change of variables, SDS resembles a high-variance version of Denoising Diffusion Implicit Models (DDIM) with a differently-sampled noise term: SDS introduces noise i.i.d. randomly at each step, while DDIM infers it from the previous noise predictions. This excessive variance can lead to over-smoothing and unrealistic outputs. We show that a better noise approximation can be recovered by inverting DDIM in each SDS update step. This modification makes SDS's generative process for 2D images almost identical to DDIM. In 3D, it removes over-smoothing, preserves higher-frequency detail, and brings the generation quality closer to that of 2D samplers. Experimentally, our method achieves better or similar 3D generation quality compared to other state-of-the-art Score Distillation methods, all without training additional neural networks or multi-view supervision, and provides useful insights into the relationship between 2D and 3D asset generation with diffusion models.
Artem Lukoianov; Haitz Saez De Ocariz Borde; Kristjan Greenewald (IBM); Vitor Guizilini; Timur Bagautdinov; Vincent Sitzmann; Justin Solomon
Today's online platforms heavily lean on algorithmic recommendations for bolstering user engagement and driving revenue. However, these recommendations can impact multiple stakeholders simultaneously -- the platform, items (sellers), and users (customers) -- each with their unique objectives, making it difficult to find the right middle ground that accommodates all stakeholders. To address this, we introduce a novel fair recommendation framework, Problem (FAIR), that flexibly balances multi-stakeholder interests via a constrained optimization formulation. We next explore Problem (FAIR) in a dynamic online setting where data uncertainty further adds complexity, and propose a low-regret algorithm FORM that concurrently performs real-time learning and fair recommendations, two tasks that are often at odds. Via both theoretical analysis and a numerical case study on real-world data, we demonstrate the efficacy of our framework and method in maintaining platform revenue while ensuring desired levels of fairness for both items and users.
Qinyi Chen; Jason Cheuk Nam Liang; Negin Golrezaei; Djallel Bouneffouf (IBM)
Analog in-memory accelerators present a promising solution for energy-efficient training and inference of large vision or language models. While the inference on analog accelerators has been studied recently, the analog training perspective is under-explored. Recent studies have shown that the vanilla analog stochastic gradient descent (Analog SGD) algorithm converges inexactly and thus performs poorly when applied to model training on non-ideal devices. To tackle this issue, various analog-friendly gradient-based algorithms have been proposed, such as Tiki-Taka and its variants. Even though Tiki-Taka exhibits superior empirical performance compared to Analog SGD, it is a heuristic algorithm that lacks theoretical underpinnings. This paper puts forth a theoretical foundation for gradient-based training on analog devices. We begin by characterizing the non-convergence issue of Analog SGD, which is caused by the asymptotic error arising from asymmetric updates and gradient noise. Further, we provide a convergence analysis of Tiki-Taka, which shows its ability to exactly converge to a critical point and hence eliminates the asymptotic error. The simulations verify the correctness of the analyses.
Zhaoxian Wu; Tayfun Gokmen (IBM); Malte Rasch (IBM); Tianyi Chen
Deep neural networks (DNNs) have become ubiquitous in machine learning, but their energy consumption remains problematically high. An effective strategy for reducing such consumption is supply-voltage reduction, but if done too aggressively, it can lead to accuracy degradation. This is due to random bit-flips in static random access memory (SRAM), where model parameters are stored. To address this challenge, we have developed NeuralFuse, a novel add-on module that handles the energy-accuracy tradeoff in low-voltage regimes by learning input transformations and using them to generate error-resistant data representations, thereby protecting DNN accuracy in both nominal and low-voltage scenarios. As well as being easy to implement, NeuralFuse can be readily applied to DNNs with limited access, such as cloud-based APIs that are accessed remotely or non-configurable hardware. Our experimental results demonstrate that, at a 1% bit-error rate, NeuralFuse can reduce SRAM access energy by up to 24% while recovering accuracy by up to 57%. To the best of our knowledge, this is the first approach to addressing low-voltage-induced bit errors that requires no model retraining.
Hao-lun Sun; Lei Hsiung; Nandhini Chandramoorthy (IBM); Pin-Yu Chen (IBM); Tsung-yi Ho
We present a computational framework that transforms single images into 3D physical objects. The visual geometry of a physical object in an image is determined by three orthogonal attributes: mechanical properties, external forces, and rest-shape geometry. Existing single-view 3D reconstruction methods often overlook this underlying composition, presuming rigidity or neglecting external forces. Consequently, the reconstructed objects fail to withstand real-world physical forces, resulting in instability or undesirable deformation -- diverging from their intended designs as depicted in the image. Our optimization framework addresses this by embedding physical compatibility into the reconstruction process. We explicitly decompose the three physical attributes and link them through static equilibrium, which serves as a hard constraint, ensuring that the optimized physical shapes exhibit desired physical behaviors. Evaluations on a dataset collected from Objaverse demonstrate that our framework consistently enhances the physical realism of 3D models over existing methods. The utility of our framework extends to practical applications in dynamic simulations and 3D printing, where adherence to physical compatibility is paramount.
Minghao Guo; Bohan Wang; Pingchuan Ma; Tianyuan Zhang; Crystal Elaine Owens; Chuang Gan (IBM); Josh Tenenbaum; Kaiming He; Wojciech Matusik
Logical Credal Networks or LCNs were recently introduced as a powerful probabilistic logic framework for representing and reasoning with imprecise knowledge. Unlike many existing formalisms, LCNs have the ability to represent cycles and allow specifying marginal and conditional probability bounds on logic formulae which may be important in many realistic scenarios. Previous work on LCNs has focused exclusively on marginal inference, namely computing posterior lower and upper probability bounds on a query formula. In this paper, we explore abductive reasoning tasks such as solving MAP and Marginal MAP queries in LCNs given some evidence. We first formally define the MAP and Marginal MAP tasks for LCNs and subsequently show how to solve these tasks exactly using search-based approaches. We then propose several approximate schemes that allow us to scale MAP and Marginal MAP inference to larger problem instances. An extensive empirical evaluation demonstrates the effectiveness of our algorithms on both random LCN instances as well as LCNs derived from more realistic use-cases.
Radu Marinescu (IBM); Junkyu Lee (IBM); Debarun Bhattacharjya (IBM); Fabio Cozman; Alexander Gray (IBM)
Among the most important properties of algorithms investigated in computer science are soundness, completeness, and complexity. These properties, however, are rarely analyzed for the vast collection of recently proposed methods for planning with large language models. In this work, we alleviate this gap. We analyse these properties of using LLMs for planning and highlight that recent trends abandon both soundness and completeness for the sake of inefficiency. We propose a significantly more efficient approach that can, at the same time, maintain both soundness and completeness. We exemplify on four representative search problems, comparing to the LLM-based solutions from the literature that attempt to solve these problems. We show that by using LLMs to produce the code for the search components we can solve the entire datasets with 100% accuracy with only a few calls to the LLM. We argue for a responsible use of compute resources, urging the research community to investigate sound and complete LLM-based approaches that uphold efficiency.
Michael Katz (IBM); Harsha Kokel (IBM); Kavitha Srinivas (IBM); Shirin Sohrabi (IBM)
Recent advancements in Large Language Models (LLMs) have significantly improved text generation capabilities, but these systems are still known to hallucinate and granular uncertainty estimation for long-form LLM generations remains challenging. In this work, we propose Graph Uncertainty -- which represents the relationship between LLM generations and claims within them as a bipartite graph and estimates the claim-level uncertainty with a family of graph centrality metrics. Under this view, existing uncertainty estimation methods based on the concept of self-consistency can be viewed as using degree centrality as an uncertainty measure, and we show that more sophisticated alternatives such as closeness centrality provide consistent gains at claim-level uncertainty estimation. Moreover, we present uncertainty-aware decoding techniques that leverage both the graph structure and uncertainty estimates to improve the factuality of LLM generations by preserving only the most reliable claims. Compared to existing methods, our graph-based uncertainty metrics lead to an average of 5.7% relative gains across various long-form generation settings, and our end-to-end system provides consistent 2-4% gains in factuality over existing decoding techniques.
Mingjian Jiang; Yangjun Ruan; Prasanna Sattigeri (IBM); Salim Roukos (IBM); Tatsunori Hashimoto
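The graph view in the abstract above can be illustrated with a small sketch. It assumes a hypothetical `support` mapping from sampled responses to the claims each one supports (how that mapping is obtained is not covered here), and the function names are illustrative rather than taken from the paper. It contrasts degree centrality, which corresponds to self-consistency, with closeness centrality as a claim-level confidence score.

```python
from collections import deque


def build_graph(support):
    """Build an undirected bipartite graph from {response_id: set_of_claims}."""
    adj = {}
    for resp, claims in support.items():
        resp_node = ("resp", resp)
        adj.setdefault(resp_node, set())
        for claim in claims:
            claim_node = ("claim", claim)
            adj.setdefault(claim_node, set())
            adj[resp_node].add(claim_node)
            adj[claim_node].add(resp_node)
    return adj


def degree_centrality(adj, node, n_responses):
    """Fraction of sampled responses that support the claim (self-consistency)."""
    return len(adj[node]) / n_responses


def closeness_centrality(adj, node):
    """(Number of reachable nodes) / (sum of shortest-path distances), via BFS."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total > 0 else 0.0


# Hypothetical example: three sampled responses and the claims each supports.
support = {0: {"A", "B"}, 1: {"A", "C"}, 2: {"A"}}
graph = build_graph(support)
scores = {
    claim: (
        degree_centrality(graph, ("claim", claim), len(support)),
        closeness_centrality(graph, ("claim", claim)),
    )
    for claim in ("A", "B", "C")
}
```

In this toy graph, claim "A" is supported by every response and scores highest under both metrics; a decoding step could then keep only the claims whose score clears a chosen threshold.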
We analyze the capabilities of Transformer language models in learning compositional discrete tasks. To this end, we evaluate training LLaMA models and prompting GPT-4 and Gemini on four tasks that require learning a composition of several discrete sub-tasks. Both when training LLaMA models from scratch and when prompting GPT-4 and Gemini, we measure how well these models can reuse primitives observable in the sub-tasks to learn the composition task. Our results indicate that compositional learning in state-of-the-art Transformer language models is highly sample inefficient: LLaMA requires more data samples than relearning all sub-tasks from scratch to learn the compositional task; in-context prompting with few samples is unreliable and fails at executing the sub-tasks or correcting the errors in multi-round code generation. Further, by leveraging complexity theory, we support these findings with a theoretical analysis focused on the sample inefficiency of gradient descent in memorizing feedforward models.
Jonathan Thomm (IBM); Aleksandar Terzic (IBM); Giacomo Camposampiero (IBM); Michael Hersche (IBM); Bernhard Schoelkopf; Abbas Rahimi (IBM)
Working memory is a central cognitive ability crucial for intelligent decision-making. Recent experimental and computational work studying working memory has primarily used categorical (i.e., one-hot) inputs, rather than ecologically-relevant, multidimensional naturalistic ones. Moreover, studies have primarily investigated working memory during a single cognitive task or a small number of them. As a result, an understanding of how naturalistic object information is maintained in working memory in neural networks is still lacking. To bridge this gap, we developed sensory-cognitive models, comprising a convolutional neural network (CNN) coupled with a recurrent neural network (RNN), and trained them on nine distinct N-back tasks using naturalistic stimuli. By examining the RNN’s latent space, we found that: 1) Multi-task RNNs represent both task-relevant and irrelevant information simultaneously while performing tasks; 2) While the latent subspaces used to maintain specific object properties in vanilla RNNs are largely shared across tasks, they are highly task-specific in gated RNNs such as GRU and LSTM; 3) Surprisingly, RNNs embed objects in new representational spaces in which individual object features are less orthogonalized relative to the perceptual space; 4) Interestingly, the transformation of WM encodings (i.e., embedding of visual inputs in the RNN latent space) into memory was shared across stimuli, yet the transformations governing the retention of a memory in the face of incoming distractor stimuli were distinct across time. Our findings indicate that goal-driven RNNs employ chronological memory subspaces to track information over short time spans, enabling testable predictions with neural data.
Xiaoxuan Lei; Taku Ito (IBM); Pouya Bashivan
Spectroscopic techniques are essential tools for determining the structure of molecules. Different spectroscopic techniques, such as nuclear magnetic resonance (NMR), infrared spectroscopy, and mass spectrometry, provide insight into the molecular structure, including the presence or absence of functional groups. Chemists leverage the complementary nature of the different methods to their advantage. However, the lack of a comprehensive multimodal dataset, containing spectra from a variety of spectroscopic techniques, has limited machine-learning approaches mostly to single-modality tasks for predicting molecular structures from spectra. Here we introduce a dataset comprising simulated 1H-NMR, 13C-NMR, HSQC-NMR, infrared, and mass spectra (positive and negative ion modes) for 790k molecules extracted from chemical reactions in patent data. This dataset enables the development of foundation models for integrating information from multiple spectroscopic modalities, emulating the approach employed by human experts. Additionally, we provide benchmarks for evaluating single-modality tasks such as structure elucidation, predicting the spectra for a target molecule, and functional group predictions. This dataset has the potential to automate structure elucidation, streamlining the molecular discovery pipeline from synthesis to structure determination. The dataset and code for the benchmarks can be found at https://rxn4chemistry.github.io/multimodal-spectroscopic-dataset
Marvin Alberts (IBM); Oliver Schilter (IBM); Federico Zipoli (IBM); Nina Hartrampf; Teodoro Laino (IBM)
The need for effective unlearning mechanisms in large language models (LLMs) is increasingly urgent, driven by the necessity to adhere to data regulations and foster ethical generative AI practices. LLM unlearning is designed to reduce the impact of undesirable data influences and associated model capabilities without diminishing the utility of the model on content unrelated to the information being forgotten. Despite growing interest, much of the existing research has focused on varied unlearning method designs to boost effectiveness and efficiency. However, the inherent relationship between model weights and LLM unlearning has not been extensively examined. In this paper, we systematically explore how model weights interact with unlearning processes in LLMs and we design the weight attribution-guided LLM unlearning method, WAGLE, which unveils the interconnections between 'influence' of weights and 'influence' of data to forget and retain in LLM generation. By strategically guiding the LLM unlearning across different types of unlearning methods and tasks, WAGLE can erase the undesired content, while maintaining the performance of the original tasks. Our extensive experiments show that WAGLE boosts unlearning performance across a range of LLM unlearning methods such as gradient difference and (negative) preference optimization, applications such as fictitious unlearning (TOFU benchmark), malicious use prevention (WMDP benchmark), and copyrighted information removal, and models including Zephyr-7b-beta and Llama2-7b. To the best of our knowledge, our work offers the first principled method for attributing and pinpointing the influential weights in enhancing LLM unlearning. It stands in contrast to previous methods that lack weight attribution and to simpler weight attribution techniques.
Jinghan Jia; Jiancheng Liu; Yihua Zhang; Parikshit Ram (IBM); Nathalie Baracaldo Angel (IBM); Sijia Liu (IBM)
Causal interactions among a group of variables are often modeled by a single causal graph. In some domains, however, these interactions are best described by multiple co-existing causal graphs, e.g., in dynamical systems or genomics. This paper addresses the hitherto unknown role of interventions in learning causal interactions among variables governed by a mixture of causal systems, each modeled by one directed acyclic graph (DAG). Causal discovery from mixtures is fundamentally more challenging than single-DAG causal discovery. Two major difficulties stem from (i) inherent uncertainty about the skeletons of the component DAGs that constitute the mixture and (ii) possibly cyclic relationships across these component DAGs. This paper addresses these challenges and aims to identify edges that exist in at least one component DAG of the mixture, referred to as true edges. First, it establishes matching necessary and sufficient conditions on the size of interventions required to identify the true edges. Next, guided by the necessity results, an adaptive algorithm is designed that learns all true edges using O(n^2) interventions, where n is the number of nodes. Remarkably, the size of the interventions is optimal if the underlying mixture model does not contain cycles across its components. More generally, the gap between the intervention size used by the algorithm and the optimal size is quantified. It is shown to be bounded by the cyclic complexity number of the mixture model, defined as the size of the minimal intervention that can break the cycles in the mixture, which is upper bounded by the number of cycles among the ancestors of a node.
Burak Varici; Dmitriy Katz-Rogozhnikov (IBM); Dennis Wei (IBM); Prasanna Sattigeri (IBM); Ali Tajer
In time-series analysis, many recent works seek to provide a unified view and representation for time-series across multiple domains, leading to the development of foundation models for time-series data. Despite diverse modeling techniques, existing models are black-box and fail to provide insights and explanations about their representations. In this paper, we present VQShape, a pre-trained, generalizable, and interpretable model for time-series representation learning and classification. By introducing a novel representation for time-series data, we forge a connection between the latent space of VQShape and shape-level features. Using vector-quantization, we show that time-series from different domains can be described using a unified set of low-dimensional codes where each code can be represented as an abstracted shape in the time-domain. On classification tasks, we show that representations of VQShape can be utilized to build interpretable classifiers, achieving performance comparable to that of specialist models. Additionally, in zero-shot learning, VQShape and its codebook can generalize to previously unseen datasets and domains that are not included in the pre-training process.
Yunshi Wen; Tengfei Ma; Lily Weng; Lam Nguyen (IBM); Agung Julius
Key-value (KV) caching plays an essential role in accelerating decoding for transformer-based autoregressive large language models (LLMs). However, the amount of memory required to store the KV cache can become prohibitive at long sequence lengths and large batch sizes. Since the invention of the transformer, two of the most effective interventions discovered for reducing the size of the KV cache have been Multi-Query Attention (MQA) and its generalization, Grouped-Query Attention (GQA). MQA and GQA both modify the design of the attention block so that multiple query heads can share a single key/value head, reducing the number of distinct key/value heads by a large factor while only minimally degrading accuracy. In this paper, we show that it is possible to take Multi-Query Attention a step further by also sharing key and value heads between adjacent layers, yielding a new attention design we call Cross-Layer Attention (CLA). With CLA, we find that it is possible to reduce the size of the KV cache by another 2x while maintaining nearly the same accuracy as unmodified MQA. In experiments training 1B- and 3B-parameter models from scratch, we demonstrate that CLA provides a Pareto improvement over the memory/accuracy tradeoffs which are possible with traditional MQA, enabling inference with longer sequence lengths and larger batch sizes than would otherwise be possible.
William Brandon; Mayank Mishra (IBM); Aniruddha Nrusimha; Rameswar Panda (IBM); Jonathan Ragan-Kelley
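The memory savings described in the abstract above follow from simple arithmetic, sketched below. The model dimensions are hypothetical and chosen only for illustration; the `cla_sharing_factor` parameter is an assumed way of expressing that adjacent layers share one set of key/value heads, so only every second layer stores its own cache when the factor is 2.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2, cla_sharing_factor=1):
    """Estimate KV cache size in bytes.

    The leading factor of 2 counts keys and values. With cross-layer sharing,
    only n_layers / cla_sharing_factor layers keep their own cache.
    """
    if n_layers % cla_sharing_factor != 0:
        raise ValueError("n_layers must be divisible by cla_sharing_factor")
    kv_layers = n_layers // cla_sharing_factor
    return (2 * kv_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical configuration: 20 layers, 16 query heads, head_dim 128, fp16.
common = dict(n_layers=20, head_dim=128, seq_len=4096, batch_size=8)
mha = kv_cache_bytes(n_kv_heads=16, **common)   # one KV head per query head
gqa = kv_cache_bytes(n_kv_heads=4, **common)    # 4 query heads per KV head
mqa = kv_cache_bytes(n_kv_heads=1, **common)    # all query heads share one KV head
mqa_cla2 = kv_cache_bytes(n_kv_heads=1, cla_sharing_factor=2, **common)

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa), ("MQA + CLA2", mqa_cla2)]:
    print(f"{name}: {size / 2**20:.0f} MiB")
```

Under these assumed dimensions the MQA cache is 320 MiB, and sharing across pairs of adjacent layers halves it again, which matches the 2x reduction the abstract reports.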
Compositional Reasoning (CR) entails grasping the significance of attributes, relations, and word order. Recent Vision-Language Models (VLMs), comprising a visual encoder and a Large Language Model (LLM) decoder, have demonstrated remarkable proficiency in such reasoning tasks. This prompts a crucial question: have VLMs effectively tackled the CR challenge? We conjecture that existing CR benchmarks may not adequately push the boundaries of modern VLMs due to the reliance on an LLM-only negative text generation pipeline. Consequently, the negatives produced either appear as outliers from the natural language distribution learned by VLMs' LLM decoders or as improbable within the corresponding image context. To address these limitations, we introduce ConMe (an abbreviation for Confuse Me) -- a compositional reasoning benchmark and a novel data generation pipeline leveraging VLMs to produce 'hard CR Q&A'. Through a new concept of VLMs conversing with each other to collaboratively expose their weaknesses, our pipeline autonomously generates, evaluates, and selects challenging compositional reasoning questions, establishing a robust CR benchmark, also subsequently validated manually. Our benchmark provokes a noteworthy, up to 33%, decrease in CR performance compared to preceding benchmarks, reinstating the CR challenge even for state-of-the-art VLMs.
Irene Huang; Wei Lin; Jehanzeb Mirza; Jacob Hansen; Sivan Doveh (IBM); Victor Butoi; Roi Herzig; Assaf Arbelle (IBM); Hilde Kuehne (IBM); Trevor Darrell; Chuang Gan (IBM); Aude Oliva; Rogerio Feris (IBM); Leonid Karlinsky (IBM)
While large language models (LLMs) such as Llama-2 or GPT-4 have shown impressive zero-shot performance, fine-tuning is still necessary to enhance their performance for customized datasets, domain-specific tasks, or other private needs. However, fine-tuning all parameters of LLMs requires significant hardware resources, which can be impractical for typical users. Therefore, parameter-efficient fine-tuning methods such as LoRA have emerged, allowing users to fine-tune LLMs without the need for considerable computing resources, with little performance degradation compared to fine-tuning all parameters. Unfortunately, recent studies indicate that fine-tuning can increase the risk to the safety of LLMs, even when data does not contain malicious content. To address this challenge, we propose Safe LoRA, a simple one-liner patch to the original LoRA implementation that projects LoRA weights from selected layers onto the safety-aligned subspace, effectively reducing the safety risks in LLM fine-tuning while maintaining utility. It is worth noting that Safe LoRA is a training-free and data-free approach, as it only requires the knowledge of the weights from the base and aligned LLMs. Our extensive experiments demonstrate that when fine-tuning on purely malicious data, Safe LoRA retains similar safety performance as the original aligned model. Moreover, when the fine-tuning dataset contains a mixture of both benign and malicious data, Safe LoRA mitigates the negative effect made by malicious data while preserving performance on downstream tasks.
Chia-yi Hsu; Yu-Lin Tsai; Chih-hsun Lin; Pin-Yu Chen (IBM); Chia-Mu Yu; Chun-ying Huang
Social relation reasoning aims to identify relation categories such as friends, spouses, and colleagues from images. While current methods adopt the paradigm of training a dedicated network end-to-end using labeled image data, they are limited in terms of generalizability and interpretability. To address these issues, we first present a simple yet well-crafted framework named SocialGPT, which combines the perception capability of Vision Foundation Models (VFMs) and the reasoning capability of Large Language Models (LLMs) within a modular framework, providing a strong baseline for social relation recognition. Specifically, we instruct VFMs to translate image content into a textual social story, and then utilize LLMs for text-based reasoning. SocialGPT introduces systematic design principles to adapt VFMs and LLMs separately and bridge their gaps. Without additional model training, it achieves competitive zero-shot results on two databases while offering interpretable answers, as LLMs can generate language-based explanations for the decisions. The manual prompt design process for LLMs at the reasoning phase is tedious and an automated prompt optimization method is desired. As we essentially convert a visual classification task into a generative task of LLMs, automatic prompt optimization encounters a unique long prompt optimization issue. To address this issue, we further propose the Greedy Segment Prompt Optimization (GSPO), which performs a greedy search by utilizing gradient information at the segment level. Experimental results show that GSPO significantly improves performance, and our method also generalizes to different image styles.
Wanhua Li; Zibin Meng; Jiawei Zhou; Donglai Wei; Chuang Gan (IBM); Hanspeter Pfister
Current studies on adversarial robustness mainly focus on aggregating robustness results from a set of data samples to evaluate and rank different models. However, the local statistics may not well represent the true robustness of the underlying unknown data distribution. To address this challenge, this paper makes the first attempt to present a new framework, called GREAT Score, for global robustness evaluation of adversarial perturbation using generative models. Formally, GREAT Score carries the physical meaning of a global statistic capturing a mean certified attack-proof perturbation level over all samples drawn from a generative model. For finite-sample evaluation, we also derive a probabilistic guarantee on the sample complexity and the difference between the sample mean and the true mean. GREAT Score has several advantages: (1) Robustness evaluations using GREAT Score are efficient and scalable to large models, because they remove the need to run adversarial attacks. In particular, we show high correlation and significantly reduced computation cost of GREAT Score when compared to the attack-based model ranking on RobustBench. (2) The use of generative models facilitates the approximation of the unknown data distribution. In our ablation study with different generative adversarial networks (GANs), we observe consistency between global robustness evaluation and the quality of GANs. (3) GREAT Score can be used for remote auditing of privacy-sensitive black-box models, as demonstrated by our robustness evaluation on several online facial recognition services.
Zhaitang Li; Pin-Yu Chen (IBM); Tsung-yi Ho
Dense Associative Memories are high storage capacity variants of the Hopfield networks that are capable of storing a large number of memory patterns in the weights of the network of a given size. Their common formulations typically require storing each pattern in a separate set of synaptic weights, which leads to the increase of the number of synaptic weights when new patterns are introduced. In this work we propose an alternative formulation of this class of models using random features, commonly used in kernel methods. In this formulation the number of network’s parameters remains fixed. At the same time, new memories can be added to the network by modifying existing weights. We show that this novel network closely approximates the energy function and dynamics of conventional Dense Associative Memories and shares their desirable computational properties.
Benjamin Hoover (IBM); Duen Horng Chau; Hendrik Strobelt (IBM); Parikshit Ram (IBM); Dmitry Krotov (IBM)
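The fixed-parameter idea in the abstract above can be sketched with random Fourier features. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a Gaussian-kernel energy of the form E(x) = -(1/beta) log sum_mu exp(-(beta/2) ||x - xi_mu||^2), and the names `make_random_features` and `RandomFeatureMemory` are hypothetical. All memories are folded into one vector `T` whose length never changes.

```python
import math
import random


def make_random_features(dim, n_features, beta, seed=0):
    """Random Fourier features whose inner product approximates
    exp(-(beta / 2) * ||x - y||^2)."""
    rng = random.Random(seed)
    std = math.sqrt(beta)  # frequencies are drawn from N(0, beta * I)
    weights = [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(n_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def phi(x):
        return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(weights, phases)]

    return phi


class RandomFeatureMemory:
    """Stores every pattern in a single fixed-size vector T = sum_mu phi(xi_mu)."""

    def __init__(self, phi, n_features):
        self.phi = phi
        self.T = [0.0] * n_features

    def add(self, pattern):
        # Adding a memory modifies existing weights; it allocates no new ones.
        for j, value in enumerate(self.phi(pattern)):
            self.T[j] += value

    def kernel_sum(self, x):
        return sum(f * t for f, t in zip(self.phi(x), self.T))

    def energy(self, x, beta, eps=1e-12):
        # Clamp because the random-feature estimate can be slightly negative.
        return -math.log(max(self.kernel_sum(x), eps)) / beta


def exact_kernel_sum(x, memories, beta):
    """Reference value computed directly from the stored patterns."""
    return sum(math.exp(-0.5 * beta * sum((a - b) ** 2 for a, b in zip(x, m)))
               for m in memories)


beta, n_features = 2.0, 4000
memories = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
memory = RandomFeatureMemory(make_random_features(3, n_features, beta), n_features)
for pattern in memories:
    memory.add(pattern)

query = [0.9, 0.1, 0.0]
approx = memory.kernel_sum(query)
exact = exact_kernel_sum(query, memories, beta)
```

The approximation error shrinks roughly as one over the square root of the number of random features, so the feature count trades accuracy against the fixed parameter budget.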
Most popular benchmarks for comparing LLMs rely on a limited set of prompt templates, which may not fully capture the LLMs’ abilities and can affect the reproducibility of results on leaderboards. Many recent works empirically verify prompt sensitivity and advocate for changes in LLM evaluation. In this paper, we consider the problem of estimating the performance distribution across many prompt variants instead of finding a single prompt to evaluate with. We introduce PromptEval, a method for estimating performance across a large set of prompts borrowing strength across prompts and examples to produce accurate estimates under practical evaluation budgets. The resulting distribution can be used to obtain performance quantiles to construct various robust performance metrics (e.g., top 95% quantile or median). We prove that PromptEval consistently estimates the performance distribution and demonstrate its efficacy empirically on three prominent LLM benchmarks: MMLU, BIG-bench Hard, and LMentry. For example, PromptEval can accurately estimate performance quantiles across 100 prompt templates on MMLU with a budget equivalent to two single-prompt evaluations.
Felipe Maia Polo; Ronald Xu; Lucas Weber; Mirian Silva (IBM); Onkar Bhardwaj (IBM); Leshem Choshen (IBM); Allysson Flavio Melo de Oliveira (IBM); Yuekai Sun; Mikhail Yurochkin (IBM)
Despite the advancements in learning governing differential equations from observations of dynamical systems, data-driven methods are often unaware of fundamental physical laws, such as frame invariance. As a result, these algorithms may search an unnecessarily large space and discover equations that are less accurate or overly complex. In this paper, we propose to leverage symmetry in automated equation discovery to compress the equation search space and improve the accuracy and simplicity of the learned equations. Specifically, we derive equivariance constraints from the time-independent symmetries of ODEs. Depending on the types of symmetries, we develop a pipeline for incorporating symmetry constraints into various equation discovery algorithms, including sparse regression and genetic programming. In experiments across a diverse range of dynamical systems, our approach demonstrates better robustness against noise and recovers governing equations with significantly higher probability than baselines without symmetry.
Jianke Yang; Wang Rao; Nima Dehmamy (IBM); Robin Walters; Rose Yu
While traditional federated learning (FL) typically focuses on a star topology where clients are directly connected to a central server, real-world distributed systems often exhibit hierarchical architectures. Hierarchical FL (HFL) has emerged as a promising solution to bridge this gap, leveraging aggregation points at multiple levels of the system. However, existing algorithms for HFL encounter challenges in dealing with multi-timescale model drift, i.e., model drift occurring across hierarchical levels of data heterogeneity. In this paper, we propose a multi-timescale gradient correction (MTGC) methodology to resolve this issue. Our key idea is to introduce distinct control variables to (i) correct the client gradient towards the group gradient, i.e., to reduce client model drift caused by local updates based on individual datasets, and (ii) correct the group gradient towards the global gradient, i.e., to reduce group model drift caused by FL over clients within the group. We analytically characterize the convergence behavior of MTGC under general non-convex settings, overcoming challenges associated with couplings between correction terms. We show that our convergence bound is immune to the extent of data heterogeneity, confirming the stability of the proposed algorithm against multi-level non-i.i.d. data. Through extensive experiments on various datasets and models, we validate the effectiveness of MTGC in diverse HFL settings.
Wenzhi Fang; Dong-jun Han; Evan Chen; Shiqiang Wang (IBM); Christopher Brinton
Current AI alignment methodologies rely on human-provided demonstrations or judgments, and the learned capabilities of AI systems would be upper-bounded by human capabilities as a result. This raises a challenging research question: How can we keep improving the systems when their capabilities have surpassed the levels of humans? This paper answers this question in the context of tackling hard reasoning tasks (e.g., level 4-5 MATH problems) via learning from human annotations on easier tasks (e.g., level 1-3 MATH problems), which we term easy-to-hard generalization. Our key insight is that an evaluator (reward model) trained on supervision for easier tasks can be effectively used for scoring candidate solutions of harder tasks, hence facilitating easy-to-hard generalization over different levels of tasks. Based on this insight, we propose a novel approach to scalable alignment, which first trains the (process-supervised) reward models on easy problems (e.g., level 1-3), and then uses them to evaluate the performance of policy models on hard problems. We show that such easy-to-hard generalization from evaluators can enable easy-to-hard generalizations in generators either through re-ranking or reinforcement learning (RL). Notably, our process-supervised 7b RL model and 34b model (reranking@1024) achieve an accuracy of 34.0% and 52.5% on MATH500, respectively, despite only using human supervision on easy problems. Our approach suggests a promising path toward AI systems that advance beyond the frontier of human supervision.
Zhiqing Sun; Longhui Yu; Yikang Shen (IBM); Weiyang Liu; Yiming Yang; Sean Welleck; Chuang Gan (IBM)
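The re-ranking step mentioned in the abstract above can be sketched as best-of-n selection. Everything here is a toy stand-in: `toy_step_score` replaces a learned process reward model with an exact arithmetic check, and aggregating step scores with `min` is one common choice assumed for illustration, not necessarily the one used in the paper.

```python
def toy_step_score(step):
    """Hypothetical stand-in for a process reward model.

    Scores a step of the form 'a+b=c' as 1.0 if the arithmetic holds, else 0.0.
    """
    lhs, rhs = step.split("=")
    a, b = lhs.split("+")
    return 1.0 if int(a) + int(b) == int(rhs) else 0.0


def solution_reward(steps, step_score=toy_step_score):
    """Aggregate per-step scores; a solution is only as good as its weakest step."""
    return min(step_score(step) for step in steps)


def rerank_best_of_n(candidates, reward):
    """Return the sampled candidate with the highest reward."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=reward)


# Three sampled solutions to the same problem, each a list of reasoning steps.
candidates = [
    ["2+3=6", "6+4=10"],   # first step is wrong
    ["2+3=5", "5+4=9"],    # every step is correct
    ["2+3=5", "5+4=10"],   # last step is wrong
]
best = rerank_best_of_n(candidates, solution_reward)
```

The same reward could instead serve as the training signal for reinforcement learning, which is the second route the abstract describes.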
This paper aims at developing novel shuffling gradient-based methods for tackling two classes of minimax problems: nonconvex-linear and nonconvex-strongly concave settings. The first algorithm addresses the nonconvex-linear minimax setting and achieves the state-of-the-art oracle complexity typically observed in nonconvex optimization. It also employs a new shuffling estimator for the "hyper-gradient," departing from standard shuffling techniques in optimization. The second method consists of two variants: semi-shuffling and full-shuffling schemes. These variants tackle the nonconvex-strongly concave minimax setting. We establish their oracle complexity bounds under standard assumptions, which, to our best knowledge, are the first for this specific setting. Numerical examples demonstrate the performance of our algorithms and compare them to two other methods. The results indicate that the new methods achieve comparable performance to SGD, supporting the potential of incorporating shuffling strategies into minimax algorithms.
Quoc Tran-Dinh; Trang H. Tran; Lam Nguyen (IBM)
While language models have exceptional capabilities at text generation, they lack a natural inductive bias for emitting numbers and thus struggle in tasks involving reasoning over quantities, especially arithmetic. This has particular relevance in scientific datasets where combinations of text and numerical data are abundant. One fundamental limitation is the nature of the cross-entropy (CE) loss, which assumes a nominal (categorical) scale and thus cannot convey proximity between generated number tokens. As a remedy, we here present two versions of a number token loss. The first is based on an L_p loss between the ground truth token value and the weighted sum of the predicted class probabilities. The second loss minimizes the Wasserstein-1 distance between the distribution of the predicted output probabilities and the ground truth distribution. These regression losses can easily be added to any language model and extend the CE objective during training. We compare the proposed schemes on a mathematics dataset against existing tokenization, encoding, and decoding schemes for improving number representation in language models. Our results reveal a significant improvement in numerical accuracy when equipping a standard T5 model with the proposed loss schemes.
Jonas Zausinger; Lars Pennig; Kacper Chlodny; Vincent Limbach; Anna Ketteler; Thorben Prein; Vishwa Mohan Singh; Michael Morris Danziger (IBM); Jannis Born (IBM)
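The two loss variants described in the abstract above can be sketched in a few lines. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the helper names, the ten-digit vocabulary, and the single-token setting are hypothetical. For a one-hot ground truth, the Wasserstein-1 distance reduces to the expected absolute distance between the predicted value and the target.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def lp_number_token_loss(logits, token_values, target_value, p=2):
    """|E[value] - target|^p, where E[value] is the probability-weighted
    sum of the numeric values of the number tokens (hypothetical helper)."""
    probs = softmax(logits)
    expected = sum(pr * v for pr, v in zip(probs, token_values))
    return abs(expected - target_value) ** p

def wasserstein1_number_token_loss(logits, token_values, target_value):
    """Wasserstein-1 distance between the predicted distribution and a
    point mass at the target: the expected absolute distance."""
    probs = softmax(logits)
    return sum(pr * abs(v - target_value) for pr, v in zip(probs, token_values))

digits = list(range(10))                              # assumed number-token values 0..9
near = [10.0 if d == 5 else 0.0 for d in digits]      # model predicts "5"
far = [10.0 if d == 9 else 0.0 for d in digits]       # model predicts "9"
```

Both predictions are equally wrong under cross-entropy when the target is 4, but either number token loss penalizes the prediction "9" more than "5".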
In this paper, we present TabSketchFM, a neural tabular model for data discovery over data lakes. First, we propose novel pre-training: a sketch-based approach to enhance the effectiveness of data discovery in neural tabular models. Second, we finetune the model for identifying unionable, joinable, and subset table pairs and show significant improvement over previous tabular neural models. Third, we use these finetuned models to perform table search; i.e., given a query table, find other tables in a corpus that are unionable, joinable, or that are subsets of the query. Our results demonstrate significant improvements for search compared to state-of-the-art techniques. Finally, we show significant transfer across datasets and tasks establishing that our model can generalize across different tasks and over different data lakes.
Aamod Khatiwada (IBM); Harsha Kokel (IBM); Ibrahim Abdelaziz (IBM); SUBHAJIT CHAUDHURY (IBM); Julian Dolby (IBM); Oktie Hassanzadeh (IBM); Zhenhan Huang (IBM); Tejaswini Pedapati (IBM); Horst Samulowitz (IBM); Kavitha Srinivas (IBM)
Astrocytes, the most abundant type of glial cell, play a fundamental role in memory. Despite most hippocampal synapses being contacted by an astrocyte, there are no current theories that explain how neurons, synapses, and astrocytes might collectively contribute to memory function. We demonstrate that fundamental aspects of astrocyte morphology and physiology naturally lead to a dynamic, high-capacity associative memory system. The neuron-astrocyte networks generated by our framework are closely related to popular machine learning architectures known as Dense Associative Memories or Modern Hopfield Networks. In their known biological implementations the ratio of stored memories to the number of neurons remains constant, despite the growth of the network size. Our work demonstrates that neuron-astrocyte networks follow superior, supralinear memory scaling laws, outperforming all known biological implementations of Dense Associative Memory. This theoretical link suggests the exciting and previously unnoticed possibility that memories could be stored, at least in part, within astrocytes rather than solely in the synaptic weights between neurons.
Leo Kozachkov (IBM); Jean-jacques Slotine; Dmitry Krotov (IBM)
There is a growing interest in training domain-expert LLMs that excel in specific technical fields compared to their general-purpose instruction-tuned counterparts. However, these expert models often experience a loss in their safety abilities in the process, making them capable of generating harmful content. As a solution, we introduce an efficient and effective merging-based alignment method called MergeAlign that interpolates the domain and alignment vectors, creating safer domain-specific models while preserving their utility. We apply MergeAlign on Llama3 variants that are experts in medicine and finance, obtaining substantial alignment improvements with minimal to no degradation on domain-specific benchmarks. We study the impact of model merging through model similarity metrics and contributions of individual models being merged. We hope our findings open new research avenues and inspire more efficient development of safe expert LLMs.
Megh Thakkar; Yash More; Quentin Fournier; Matthew Riemer (IBM); Pin-Yu Chen (IBM); Amal Zouaq; Payel Das (IBM); Sarath Chandar
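One plausible reading of "interpolates the domain and alignment vectors" is task-vector arithmetic: subtract the base weights from each fine-tuned model and add a weighted mix of the two differences back to the base. The sketch below shows that idea on toy weight dictionaries. The interpolation formula, the function names, and the `alpha` parameter are assumptions for illustration and may differ from what MergeAlign actually does.

```python
def task_vector(finetuned, base):
    """Per-parameter difference between a fine-tuned model and its base."""
    return {name: [f - b for f, b in zip(finetuned[name], base[name])]
            for name in base}

def merge_align_sketch(base, domain, aligned, alpha=0.5):
    """base + alpha * domain_vector + (1 - alpha) * alignment_vector.

    Hypothetical interpolation rule; alpha=1 recovers the domain model,
    alpha=0 recovers the aligned model.
    """
    dv = task_vector(domain, base)
    av = task_vector(aligned, base)
    return {name: [b + alpha * d + (1 - alpha) * a
                   for b, d, a in zip(base[name], dv[name], av[name])]
            for name in base}

# Toy weights: one parameter tensor with two entries.
base = {"w": [0.0, 0.0]}
domain = {"w": [2.0, 0.0]}
aligned = {"w": [0.0, 4.0]}
merged = merge_align_sketch(base, domain, aligned, alpha=0.5)
```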
This paper presents SOLOMON, a novel Neuro-inspired Large Language Model (LLM) Reasoning Network architecture that enhances the adaptability of foundation models for domain-specific applications. Through a case study in semiconductor layout design, we demonstrate how SOLOMON enables swift adaptation of general-purpose LLMs to specialized tasks by leveraging Prompt Tuning and In-Context Learning techniques. Our experiments reveal the challenges LLMs face in spatial reasoning and applying domain knowledge to practical problems. Results show that SOLOMON instances significantly outperform their baseline LLM counterparts and achieve performance comparable to the state-of-the-art reasoning model o1-preview. We discuss future research directions for developing more adaptive AI systems that can continually learn, adapt, and evolve in response to new information and changing requirements.
Bo Wen (IBM); Xin Zhang (IBM)
Open-weight large language model zoos allow users to quickly integrate state-of-the-art models into systems. Despite increasing accessibility, selecting the most appropriate model for a given task still largely relies on public benchmark leaderboards and educated guesses. This can be unsatisfactory for both inference service providers and end users. The providers prioritize cost efficiency, while the end users prioritize model output quality for their inference requests. In commercial settings, these two priorities are often brought together in Service Level Agreements (SLA). We present MESS+, an online stochastic optimization algorithm for energy optimal model selection in a model zoo that works on a per-inference-request basis. For a given SLA that requires high accuracy, we are up to 2.5× more energy efficient with MESS+ than with randomly selecting an LLM from the zoo while maintaining SLA quality constraints.
Ryan Zhang; Herbert Woisetschläger; Shiqiang Wang (IBM); Hans-arno Jacobsen
Aligning large language models (LLMs) to value systems has emerged as a significant area of research within the fields of AI and NLP. Currently, this alignment process relies on the availability of high-quality supervised and preference data, which can be both time-consuming and expensive to curate or annotate. In this paper, we introduce a systematic end-to-end methodology for aligning LLMs to the implicit and explicit values represented in unstructured text data. Our proposed approach leverages scalable synthetic data generation techniques to effectively align the model to the values present in the unstructured data. Through two distinct use-cases, we demonstrate the efficiency of our methodology on the Mistral-7B-Instruct model. Our approach credibly aligns LLMs to the values embedded within documents, and shows improved performance against other approaches, as quantified through the use of automatic metrics and win rates.
Inkit Padhi (IBM); Karthikeyan Natesan Ramamurthy (IBM); Prasanna Sattigeri (IBM); Manish Nagireddy (IBM); Pierre Dognin (IBM); Kush Varshney (IBM)
Content-addressable memories such as Modern Hopfield Networks (MHN) have been studied as mathematical models of auto-association and storage/retrieval in the human declarative memory, yet their practical use for large-scale content storage faces challenges. Chief among them is the occurrence of meta-stable states, particularly when handling large amounts of high-dimensional content. This paper introduces Hopfield Encoding Networks (HEN), a framework that integrates encoded neural representations into MHNs to improve pattern separability and reduce meta-stable states. We show that HEN can also be used for retrieval in the context of hetero-association of images with natural language queries, thus removing the limitation of requiring access to partial content in the same domain. Experimental results demonstrate a substantial reduction in meta-stable states and increased storage capacity while still enabling perfect recall of a significantly larger number of inputs, advancing the practical utility of associative memory networks for real-world tasks.
Satyananda Kashyap (IBM); Niharika DSouza (IBM); Luyao Shi (IBM); Ken C. L. Wong (IBM); Hongzhi Wang (IBM); Tanveer Syeda-Mahmood (IBM)
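For readers unfamiliar with retrieval in Modern Hopfield Networks, the sketch below applies the commonly used softmax update, in which the new state is a softmax-weighted average of the stored patterns. It illustrates plain MHN retrieval from a partial cue only; it does not include the encoding step that HEN adds, and the patterns, `beta`, and function name are assumptions.

```python
import math

def hopfield_retrieve(patterns, query, beta=4.0, steps=3):
    """Iterate state <- sum_k softmax(beta * <pattern_k, state>)_k * pattern_k."""
    state = list(query)
    for _ in range(steps):
        scores = [beta * sum(p_d * s_d for p_d, s_d in zip(p, state))
                  for p in patterns]
        top = max(scores)                              # stabilize the softmax
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        weights = [w / norm for w in weights]
        state = [sum(weights[k] * patterns[k][d] for k in range(len(patterns)))
                 for d in range(len(state))]
    return state

stored = [
    [1.0, 1.0, -1.0, -1.0],
    [-1.0, 1.0, -1.0, 1.0],
    [1.0, -1.0, 1.0, -1.0],
]
# Partial cue: the last component of the first pattern is unknown (set to 0).
recalled = hopfield_retrieve(stored, [1.0, 1.0, -1.0, 0.0])
```

A larger `beta` sharpens the softmax and makes retrieval snap to a single stored pattern; a small `beta` tends to produce blends of several patterns, which is one way meta-stable states arise.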
Transformer-based chemical language models (CLM), trained on large and general-purpose datasets consisting of molecular strings, have recently emerged as a powerful tool for successfully modeling various structure-property relations, as well as for proposing novel candidates. In this work, we propose a novel approach that harnesses a recent generative CLM, namely GP-MoLFormer, to propose small molecules with more desirable properties. Specifically, we present a parameter-efficient fine-tuning method for unconstrained property optimization, which uses property-ordered molecular pairs as input. We call this new approach pair-tuning. Our results show GP-MoLFormer outperforms existing baselines in terms of generating diverse molecules with desired properties across three popular property optimization tasks, namely drug-likeness, penalized logP, and dopamine type 2 receptor activity. Results demonstrate the general utility of pair-tuning together with a generative CLM for a variety of molecular optimization tasks.
Jerret Ross (IBM); Samuel Hoffman (IBM); Brian Belgodere (IBM); Vijil Vijil (IBM); Youssef Mroueh (IBM); Payel Das (IBM)
Large-scale molecular representation methods have revolutionized applications in material science, such as drug discovery, chemical modeling, and material design. With the rise of transformers, models now learn representations directly from molecular structures. In this study, we develop an encoder-decoder model based on BART that not only learns molecular representations but also auto-regressively generates molecules. Trained on SELFIES, a robust molecular string representation, our model outperforms existing baselines in downstream tasks, demonstrating its potential in efficient and effective molecular data analysis and manipulation.
Indra Priyadarsini S (IBM); Seiji Takeda (IBM); Lisa Hamada (IBM); Emilio Ashton Vital Brazil (IBM); Eduardo Almeida Soares (IBM); Hajime Shinohara (IBM)
Large-scale pre-trained foundation models represent a breakthrough in cheminformatics. These methods excel in tasks such as property prediction and molecule generation by learning contextualized representations of input tokens through self-supervised learning on large unlabeled corpora. Most chemical foundation models available are based on the Transformer architecture and its core attention module. The efficacy of self-attention is attributed to its ability to route information densely within a context window, allowing it to model complex data. However, this property brings fundamental drawbacks: an inability to model anything outside of a finite window and quadratic scaling with respect to the window length. Structured state space sequence models (SSMs) have recently emerged as a promising class of architectures for sequence modeling. Mamba is a simplified end-to-end SSM-based neural network architecture without attention or even MLP blocks. This paper introduces Mamba-based chemical foundation models pre-trained on a curated dataset of 91 million SMILES samples sourced from PubChem, which is equivalent to 4 billion molecular tokens. These models support different complex tasks, including molecular property prediction, classification, molecular reconstruction, and synthesis yield prediction. Our experiments across multiple benchmark datasets validate the SSM's capacity to provide state-of-the-art results while being designed for fast inference.
Emilio Ashton Vital Brazil (IBM); Eduardo Almeida Soares (IBM); Victor Shirasuna (IBM); Renato Fontoura de Gusmao Cerqueira (IBM); Dmitry Zubarev (IBM); Kristin Schmidt (IBM)
Compositionality of communication is considered a prerequisite for reasoning. Despite overall impressive performance, LLMs seem to have fundamental issues with compositionality in reasoning tasks. Research on the emergence of languages in referential games demonstrates that compositionality can be achieved via a combination of the game organization and constraints on communication protocols. In this contribution we propose, and offer an initial evaluation of, the hypothesis that compositionality in reasoning tasks with LLMs can be improved by placing LLM agents in referential games that coax compositionality of the communication. We describe a multi-stage chemical game including recognition, naming, and reconstruction of chemical structures by LLM agents without leveraging their pre-existing chemical knowledge.
Dmitry Zubarev (IBM); Sarath Swaminathan (IBM)
To efficiently factorize high-dimensional distributed representations into their constituent atomic vectors, one can exploit the compute-in-superposition capabilities of vector-symbolic architectures (VSA). Such factorizers, however, suffer from the phenomenon of limit cycles. Applying noise during the iterative decoding is one mechanism to address this issue. In this paper, we explore ways to further relax the noise requirement by applying noise only at the time of the VSA's reconstruction codebook initialization. While the need for noise during iterations proves analog in-memory computing systems to be a natural choice as an implementation medium, the adequacy of initialization noise allows digital hardware to remain equally indispensable. This broadens the implementation possibilities of factorizers. Our study finds that while the best performance shifts from initialization noise to iterative noise as the number of factors increases from 2 to 4, both extend the operational capacity by at least 50x compared to the baseline factorizer resonator networks.
Kumudu Geethan Karunaratne (IBM); Michael Hersche (IBM); Abu Sebastian (IBM); Abbas Rahimi (IBM)
We introduce AIHWKIT-Lightning, a new toolkit designed for efficient and scalable hardware-aware training of large neural networks deployed on Analog In-Memory Computing (AIMC)-based hardware. The toolkit prioritizes speed and ease of use, addressing the limitations of existing frameworks in training Large Language Models (LLMs) with billions of parameters. AIHWKIT-Lightning leverages dedicated GPU kernels and a streamlined implementation, achieving up to 3.7x faster training at lower memory consumption compared to state-of-the-art toolkits. Benefiting from the increased scalability, we demonstrate near-iso-accuracy on the GLUE benchmark using a RoBERTa model trained on 11B tokens. The toolkit is publicly available at github.com/IBM/aihwkit-lightning.
Julian Büchel (IBM); William Simon (IBM); Corey Liam Lammie (IBM); Giovanni Acampa (IBM); Kaoutar El Maghraoui (IBM); Manuel Le Gallo (IBM); Abu Sebastian (IBM)
Hopfield networks are associative memory systems, designed for storing and retrieving specific patterns as local minima of an energy landscape. In the classical Hopfield model, an interesting phenomenon occurs when the model's memorization capacity reaches its critical memory load: spurious states, or unintended stable points, emerge at the end of the retrieval dynamics. These particular states often appear as mixtures of the stored patterns, leading to incorrect recall. In this work, we propose that these spurious states are not necessarily a negative feature of retrieval dynamics, but rather that they serve as the onset of generalization. We employ diffusion models, commonly used in generative modelling, to demonstrate that their generalization stems from a phase transition which occurs as the number of training samples is increased. In the low data regime the model exhibits a strong memorization phase, where the network creates a distinct basin of attraction for each sample in the training set, akin to the Hopfield model below the critical memory load. In the large data regime a different phase appears where an increase in the training set size fosters the creation of new attractor states that correspond to manifolds of the generated samples. Spurious states appear at the boundary of this transition and correspond to emergent attractor states, which are absent in the training set, but, at the same time, still have a distinct basin of attraction around them. From the perspective of the Hopfield description, these spurious states correspond to mixtures of "fundamental memories" which facilitate generalization through the superposition of underlying features, resulting in the creation of novel samples. Our findings provide a novel perspective on the memorization-generalization phenomenon in diffusion models via the lens of Hopfield networks, which illuminates the previously underappreciated view of diffusion models as Hopfield networks above the critical memory load.
Bao Pham; Gabriel Raya; Matteo Negri; Mohammed Zaki; Luca Ambrogioni; Dmitry Krotov (IBM)
Recent benchmarks suggest that there remains significant room to improve large language models’ ability to robustly reason across facts distributed in extremely long documents. In this work, we propose MemReasoner, a new memory-augmented LLM architecture that is trained to perform temporal reasoning, along with multiple computational steps, over the context stored in the memory. Experiments show that MemReasoner trained on the core reasoning facts generalizes better, when compared to off-the-shelf large language models and existing recurrent models, on a test distribution where the required facts are scattered across long natural text up to 128k tokens. Further, MemReasoner demonstrates robust reasoning performance relative to the baselines, when the answer distribution in test samples differs from that in the training set.
Irene Ko (IBM); Sihui Dai (IBM); Payel Das (IBM); Georgios Kollias (IBM); SUBHAJIT CHAUDHURY (IBM); Aurelie Lozano (IBM)
Recent advancements in chemical machine learning have adopted a two-step approach (pre-training on unlabeled data followed by fine-tuning on specific tasks) to boost model capacity. With the increasing demand for training efficiency, Mixture-of-Experts (MoE) has become essential for scaling large models by selectively activating sub-networks of experts through a gating network, thereby optimizing performance. This paper presents MoL-MoE, a Multi-view Mixture-of-Experts framework designed to predict molecular properties by integrating latent spaces derived from SMILES, SELFIES, and molecular graphs. Our approach leverages the complementary strengths of these representations to enhance predictive accuracy. Here, we evaluate the performance of MoL-MoE with a total of 12 experts, organized into 4 experts for each modality (SMILES, SELFIES, and molecular graphs). We evaluate MoL-MoE on a range of benchmark datasets from MoleculeNet, demonstrating its superior performance compared to state-of-the-art methods across all nine datasets considering two different routing activation settings: k=4 and k=6. The results underscore the model's robustness and adaptability in handling various complex molecular prediction tasks. Our analysis of routing activation patterns reveals that MoL-MoE dynamically adjusts its use of different molecular representations based on task-specific requirements. This adaptability highlights the importance of representation choice in optimizing model performance.
Eduardo Almeida Soares (IBM); Indra Priyadarsini S (IBM); Emilio Ashton Vital Brazil (IBM); Victor Shirasuna (IBM); Seiji Takeda (IBM)
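The routing step described in the abstract above, where a gating network activates only k of the available experts, can be sketched as top-k gating. This is a generic Mixture-of-Experts illustration with scalar expert outputs, not the MoL-MoE code; the function names and the toy numbers are assumptions.

```python
import math

def top_k_gate(gate_logits, k):
    """Keep the k largest gate logits and renormalize them with a softmax.

    Returns a dict mapping expert index -> routing weight.
    """
    ranked = sorted(range(len(gate_logits)),
                    key=lambda i: gate_logits[i], reverse=True)
    chosen = ranked[:k]
    top = max(gate_logits[i] for i in chosen)
    exps = {i: math.exp(gate_logits[i] - top) for i in chosen}
    norm = sum(exps.values())
    return {i: e / norm for i, e in exps.items()}

def moe_output(expert_outputs, gate_logits, k):
    """Weighted sum of the outputs of the k selected experts only."""
    weights = top_k_gate(gate_logits, k)
    return sum(w * expert_outputs[i] for i, w in weights.items())

# Four hypothetical experts; the gate prefers experts 1 and 2 equally.
routing = top_k_gate([0.0, 2.0, 2.0, -1.0], k=2)
prediction = moe_output([10.0, 1.0, 3.0, 100.0], [0.0, 2.0, 2.0, -1.0], k=2)
```

Experts outside the top k contribute nothing to the output, so their forward passes can be skipped entirely, which is where the efficiency gain comes from.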
Transformers have achieved remarkable success in Natural Language Processing but struggle with state tracking and algorithmic reasoning tasks, such as modeling Regular Languages. In contrast, Recurrent Neural Networks (RNNs) exhibit perfect generalization modeling Regular Languages. To bridge this gap, we explore Recurrent Transformer variants that incorporate chunking, balancing the parallelizability of Transformers with the sequential processing of RNNs. We identify layer-recurrence as the key type of recurrence that allows Recurrent Transformers to succeed in modeling Regular Languages. Further analysis indicates a rapid decline in generalization performance as chunk size increases beyond two, though with an exponential decrease in training time. This study underscores the critical role of layer-recurrence and chunk size in Recurrent Transformers, highlighting the trade-off between generalization capabilities and parallelism.
Paul Soulos (IBM); Aleksandar Terzic (IBM); Michael Hersche (IBM); Abbas Rahimi (IBM)
Traditional drug design methods are costly and time-consuming due to their reliance on trial-and-error processes. As a result, computational methods, including diffusion models, designed for molecule generation tasks have gained significant traction. Despite their potential, they have faced criticism for producing physically implausible outputs. We alleviate this problem by conditionally training a diffusion model capable of generating molecules of varying and controllable levels of structural plausibility. This is achieved by adding distorted molecules to training datasets, and then annotating each molecule with a label representing the extent of its distortion, and hence its quality. By training the model to distinguish between favourable and unfavourable molecular conformations alongside the standard molecule generation training process, we can selectively sample molecules from the high-quality region of learned space, resulting in improvements in the validity of generated molecules. In addition to the standard two datasets used by molecule generation methods (QM9 and GEOM), we also test our method on a druglike dataset derived from ZINC. We use our conditional method with EDM, the first E(3) equivariant diffusion model for molecule generation, as well as two further models—a more recent diffusion model and a flow matching model—which were built off EDM. We demonstrate improvements in validity as assessed by RDKit parsability and the PoseBusters test suite; more broadly, though, our findings highlight the effectiveness of conditioning methods on low-quality data to improve the sampling of high-quality data.
Lucy Vost; Vijil Vijil (IBM); Payel Das (IBM); Charlotte Dean
Recent advancements in Large Language Models (LLMs) for Generative AI have significantly increased their popularity, resulting in an exponential rise in new closed and open LLMs with frequent algorithm updates. This further complicates the challenge of application management, resource allocation, and scaling in cloud environments for optimal inference latency. Hence, the typical approach of running experiments to learn the optimal configuration is becoming impractical due to the size of the combinatorial problem and the shortage and cost of GPU resources, which creates the need for predictive performance models.
Given that, we propose a new LLM performance prediction model that can be leveraged for optimal cluster management. The novelty of our approach is the combination of an analytical Roofline Model (RLM), specific to LLM implementations and based on hardware characteristics, with Regression Models trained on historical data. More specifically, our approach calibrates the theoretical hardware performance given by the RLM with the inherent runtime overhead captured by Regression Models, offering a more interpretable and accurate prediction method in cloud-based deployments. We validate our method for both vLLM and Triton inference servers, demonstrating that combining the RLM with regression improves prediction accuracy and reduces MSE for vLLM compared to regression-only models.
Saki Imai; Rina Nakazawa (IBM); Marcelo Amaral (IBM); Sunyanan Choochotkaew (IBM); Tatsuhiro Chiba (IBM)
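The combination described in the abstract above can be illustrated with a textbook roofline bound that is then calibrated against measurements by a simple linear fit. This is a minimal sketch, not the authors' model: the one-feature regression, the function names, and every hardware number and measurement below are made-up assumptions.

```python
def roofline_latency(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Lower bound on run time: the slower of compute and memory traffic."""
    return max(flops / peak_flops, bytes_moved / peak_bandwidth)

def fit_linear(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def calibrated_latency(roofline_estimate, slope, intercept):
    """Roofline estimate corrected for runtime overhead learned from data."""
    return slope * roofline_estimate + intercept

# Hypothetical history: roofline estimates vs. measured latencies (seconds).
theory = [1.0, 2.0, 3.0]
measured = [3.0, 5.0, 7.0]
slope, intercept = fit_linear(theory, measured)

# Hypothetical accelerator: 1e12 FLOP/s peak compute, 1e11 B/s bandwidth.
compute_bound = roofline_latency(1e12, 1e9, 1e12, 1e11)
memory_bound = roofline_latency(1e9, 1e11, 1e12, 1e11)
```

Keeping the analytical term separate from the learned correction means the fitted slope and intercept can be read directly as a multiplicative and a fixed overhead on top of the hardware limit.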
Scientific Machine Learning has significantly advanced climate science by enabling precise forecasting of complex dynamical systems. While state-of-the-art models excel in domain-specific tasks, recent advancements in time series-based foundation models seek to replicate the success seen in natural language processing and computer vision. This study investigates whether "small" MLP-Mixer-based foundation models, Tiny Time Mixers (TTMs), can be fine-tuned to forecast complex real-world dynamical systems accurately while adhering to practical resource and cost constraints. Our findings reveal that TTMs are sensitive to the dynamical characteristics present in the training data, particularly in terms of amplitude and periodicity, yet significant variations in forecast accuracy were observed within the same training distribution. These results highlight the need for further adaptation of TTMs to enhance their robustness in specialized SciML forecasting tasks.
Imran Nasim (IBM); Joao Lucas de Sousa Almeida (IBM)
Physics-informed neural networks (PINNs) incorporate physical laws into their training to efficiently solve partial differential equations (PDEs) with minimal data. However, PINNs fail to guarantee adherence to conservation laws, which are also important to consider in modeling physical systems. To address this, we created PINN-Proj, a PINN-based model which uses a novel projection method to enforce conservation laws. We found that PINN-Proj substantially outperformed PINN in conserving momentum and guaranteed conservation, while performing marginally better in the separate task of state prediction on three PDE datasets.
Anthony Baez; Wang Zhang; Martin Ma; Subhro Das (IBM); Lam Nguyen (IBM); Luca Daniel
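To make the idea of enforcing a conservation law by projection concrete, the sketch below takes a predicted state on a uniform grid and shifts it by the smallest uniform correction that makes its discrete integral match a prescribed conserved total. This is a generic orthogonal projection onto a linear constraint, offered as an assumption about what such a step can look like; it is not the PINN-Proj method, and the function name and numbers are hypothetical.

```python
def project_to_conserved_total(u, total, dx=1.0):
    """Closest state (in the Euclidean sense) whose discrete integral
    sum(u) * dx equals `total`.

    For a single linear constraint the orthogonal projection subtracts the
    same correction from every grid value.
    """
    n = len(u)
    correction = (sum(u) * dx - total) / (n * dx)
    return [value - correction for value in u]

# Hypothetical network output whose integral (7.0) drifts from the
# conserved quantity (6.0).
raw_prediction = [1.0, 2.0, 4.0]
projected = project_to_conserved_total(raw_prediction, total=6.0)
```

Because the correction is applied after the network output, the constraint holds up to floating-point rounding regardless of how well the network was trained.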
Representation systems for polymers are a constant issue in deep-learning models for polymer property prediction, necessitating a balance between structural accuracy and interoperability to achieve utility in property prediction tasks. To facilitate this, we introduce a serialized polymer graph (SPG) notation and SPG-TED289M, an SPG-based foundation model for polymers, which has been pre-trained on a carefully curated dataset of 1 million SPG samples. To better handle the unique characteristics of SPG, we extended the tokenization process, resulting in a vocabulary of 2,407 distinct tokens. We evaluated the SPG-TED289M model's performance across a range of tasks including copolymer phase behavior, polymer membrane properties, multi-task learning, refractive index prediction, ionic conductivity, gas permeability, and glass transition temperature. The model demonstrated state-of-the-art performance in most of these areas, achieving results on par with specialized models designed for specific tasks. This indicates that SPG-TED289M, with minimal fine-tuning, can adapt effectively to complex polymer-related tasks, showcasing its robustness and versatility as a foundation model. The SPG-TED289M model provides significant flexibility and scalability, making it a valuable tool for various applications in polymer science.
Eduardo Almeida Soares (IBM); Nathaniel Park (IBM); Emilio Ashton Vital Brazil (IBM); Victor Shirasuna (IBM)
Recent advancements in large foundation models have revealed impressive capabilities in mastering complex chemical language representations. These models undergo a task-agnostic learning phase, characterized by pre-training on extensive unlabeled corpora followed by fine-tuning on specific downstream tasks. This methodology reduces reliance on labeled data, facilitating data acquisition and broadening the scope of chemical language representation. However, real-world scenarios often pose challenges due to domain shift, necessitating robust domain adaptation strategies to maintain performance levels across different contexts. To address this, we present a novel causal-based framework for feature selection and domain adaptation to enhance the performance of chemical foundation models on downstream tasks. Our approach employs a multi-stage feature selection method that identifies physico-chemical features based on their direct causal effect on specific downstream properties. By employing Mordred descriptors and Markov blanket causal graphs, our approach provides insight into the causal relationships between features and target properties for prediction tasks. We evaluate our approach on various foundation model architectures and datasets, demonstrating consistent performance improvements, which showcases the robustness and the model-agnostic nature of our approach.
Victor Shirasuna (IBM); Eduardo Almeida Soares (IBM); Emilio Ashton Vital Brazil (IBM); Karen Fiorella Aquino Gutierrez (IBM); Renato Fontoura de Gusmao Cerqueira (IBM); Dmitry Zubarev (IBM); Kristin Schmidt (IBM)
Operational decision making in the shipping industry exemplifies a real-world challenge that extends beyond single tasks and static conditions. We introduce an agentic LLM system designed to enhance anomaly detection (AD) and maintenance processes within this highly dynamic domain, involving multi-persona stakeholder interactions. The method leverages the intrinsic knowledge and reasoning abilities of LLMs, augmented by a suite of external tools to reason on the severity of anomalies detected by an out-of-the-box AD tool. Our approach achieves this by considering environmental factors, interconnected system dynamics extracted from a knowledge graph, and broader operational parameters. Evaluations on large-scale shipping data demonstrate that our method effectively reasons about multimodal data, distilling complex system dynamics into operational insights. This represents the first agentic application in an open-world maritime environment.
Alexander Timms (IBM); Abigail Langbridge (IBM); Fearghal O'Donncha (IBM)
Large language models (LLMs) have recently been used for search, primarily as world models that define the search space; this use of LLMs forgoes soundness for the sake of flexibility. A recent work, Thought of Search (ToS), proposed defining the search space with code, having the language models produce that code. ToS requires a human in the loop, collaboratively producing a sound successor function and goal test. The result, however, is worth the effort: all the tested datasets were solved with 100% accuracy. In this work, we automate ToS (AutoToS), completely taking the human out of the loop of solving planning problems. AutoToS guides the language model step by step towards the generation of sound and complete search components, through feedback from both generic and domain-specific unit tests. We achieve 100% accuracy, with minimal feedback iterations, using LLMs of various sizes on all evaluated domains.
Daniel Cao; Michael Katz (IBM); Harsha Kokel (IBM); Kavitha Srinivas (IBM); Shirin Sohrabi (IBM)
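The two search components named in the abstract above, a successor function and a goal test, plug into any standard search algorithm. The sketch below shows the shape of that interface with breadth-first search on a toy number puzzle. The puzzle, the hand-written `successors` and `is_goal` functions, and the state budget are assumptions for illustration; in ToS and AutoToS these components are produced by a language model rather than written by hand.

```python
from collections import deque

def successors(state):
    """Toy domain: from an integer, either add 3 or double it."""
    return [state + 3, state * 2]

def is_goal(state, target=11):
    """Goal test for the toy domain."""
    return state == target

def bfs(start, successors, is_goal, max_states=10_000):
    """Breadth-first search; returns a shortest path of states or None."""
    frontier = deque([start])
    parent = {start: None}
    while frontier and len(parent) <= max_states:
        state = frontier.popleft()
        if is_goal(state):
            path = []
            while state is not None:       # walk parents back to the start
                path.append(state)
                state = parent[state]
            return path[::-1]
        for nxt in successors(state):
            if nxt not in parent:          # skip states already discovered
                parent[nxt] = state
                frontier.append(nxt)
    return None

plan = bfs(1, successors, is_goal)
```

Because the search algorithm itself is fixed and exhaustive, the correctness of the result rests entirely on the two components being sound and complete, which is what unit tests on them are meant to check.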
Deep clustering -- joint representation learning and latent space clustering -- is a well-studied problem, especially in computer vision and text processing under the deep learning framework. While the representation learning is generally differentiable, clustering is an inherently discrete optimization, requiring various approximations and regularizations to fit in a standard differentiable pipeline. This leads to a somewhat disjointed representation learning and clustering. Recently, Associative Memories were utilized in the end-to-end differentiable ClAM clustering scheme (Saha et al. 2023). In this work, we show how Associative Memories enable a novel take on deep clustering, DClAM, simplifying the whole pipeline and tying together the representation learning and clustering more intricately. Our experiments showcase the advantage of DClAM, producing improved clustering quality regardless of the architecture choice (convolutional, residual or fully-connected) or data modality (images or text).
Bishwajit Saha; Dmitry Krotov (IBM); Mohammed Zaki; Parikshit Ram (IBM)
Fluorescence lifetime imaging (FLI) is an important technique for studying cellular environments and molecular interactions, but its real-time application is limited by slow data acquisition, which requires capturing large time-resolved images and complex post-processing using iterative fitting algorithms. Deep learning (DL) models enable real-time inference, but can be computationally demanding due to complex architectures and large matrix operations. This makes DL models ill-suited for direct implementation on field-programmable gate array (FPGA)-based camera hardware. Model compression is thus crucial for practical deployment for real-time inference generation. In this work, we focus on compressing recurrent neural networks (RNNs), which are well-suited for FLI time-series data processing, to enable deployment on resource-constrained FPGA boards. We perform an empirical evaluation of various compression techniques, including weight reduction, knowledge distillation (KD), post-training quantization (PTQ), and quantization-aware training (QAT), to reduce model size and computational load while preserving inference accuracy. Our compressed RNN model, Seq2SeqLite, achieves a balance between computational efficiency and prediction accuracy, particularly at 8-bit precision. By applying KD, the model parameter size was reduced by 98% while retaining performance, making it suitable for concurrent real-time FLI analysis on FPGA during data capture. This work represents a big step towards integrating hardware-accelerated real-time FLI analysis for fast biological processes.
Ismail Erbas; Vikas Pandey; Aporva Amarnath (IBM); Naigang Wang (IBM); Karthik Swaminathan (IBM); Stefan Radev; Xavier Intes
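Of the compression techniques listed in the abstract above, post-training quantization is the simplest to show in isolation. The sketch below applies symmetric 8-bit quantization with a single per-tensor scale to a list of weights. It is a generic illustration rather than the Seq2SeqLite pipeline, and the function names and example weights are assumptions.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127].

    Returns the integer codes and the scale needed to map them back.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]

# Hypothetical weight values.
weights = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
```

Each restored weight differs from the original by at most half a quantization step (`scale / 2`), which is the accuracy cost traded for storing one byte per weight.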
Fusion energy research has long captured the public imagination for its applications to fundamental physics, material sciences, and as a low-carbon-footprint electrical power source. The National Ignition Facility (NIF) recently demonstrated that focusing lasers onto a very small target of hydrogen isotopes can produce conditions for nuclear fusion. Despite such remarkable progress, sustainable production of inertial fusion energy (IFE) still presents a huge challenge due to the vast space of parameters that must be explored in order to find optimum conditions for thermonuclear ignition. It is perceived that artificial intelligence (AI) can play a crucial role in advancing IFE technology. We present our vision of how large language models (LLM) and deep reinforcement learning (DRL) can guide IFE research.
Vadim Elisseev (IBM); Max Esposito (IBM); James Sexton (IBM)