# IBM at NeurIPS 2022

New Orleans, LA, USA and virtual
This event has ended.

Neural Information Processing Systems (NeurIPS) is a leading machine learning and computational neuroscience conference. IBM Research is excited to sponsor NeurIPS again this year! We invite all attendees to visit us during the event at booth number 507, from Monday, Nov 28, through Thursday, Dec 1. We look forward to meeting you and telling you more about our latest work and career opportunities at IBM Research. At our booth we’ll be demoing projects on a broad range of AI topics such as foundation models, trustworthy AI, natural language processing and understanding, knowledge and reasoning, AI automation, human-centered AI, and federated learning.

Read our NeurIPS 2022 accepted papers: ibm.biz/NeurIPS22Papers

We invite you to join IBM experts at the following workshops:

Important conference dates:

• Expo (Industry) Day - Monday, November 28
• Conference Sessions - Tuesday, November 29 through December 1
• Workshops - Friday, December 2 through Saturday, December 3
• Virtual Only Pass - Monday, November 28 through December 9

## Why attend

Join conversations on machine learning best practices, attend education tutorials, and participate in workshops.

Meet with IBM recruiting and hiring managers about future job opportunities or 2023 summer internships.

We look forward to meeting you and seeing you in New Orleans!

Stay connected with us for career opportunities: https://ibm.biz/connectwithus

## Agenda

• Partially monotone regression is a regression analysis in which the target values are monotonically increasing with respect to a subset of input features. The TensorFlow Lattice library is one of the standard machine learning libraries for partially monotone regression. It consists of several neural network layers, and its core component is the lattice layer. One of the problems of the lattice layer is its requirement for a special training algorithm to satisfy monotonicity constraints. Another problem is that it cannot receive a high-dimensional input vector due to the resultant memory consumption. We propose a novel neural network layer, the hierarchical lattice layer (HLL), as an extension of the lattice layer so that we can use a standard neural network algorithm to train HLL while satisfying monotonicity constraints and so that it can receive a high-dimensional input vector. Our experiments demonstrate that HLL did not sacrifice its prediction performance on real datasets compared with the lattice layer.

Hiroki Yanagisawa (IBM); Kohei Miyaguchi (IBM); Takayuki Katsuki (IBM)

• The mixing time of the Markov chain induced by a policy limits performance in real-world continual learning scenarios. Yet, the effect of mixing times on learning in continual reinforcement learning (RL) remains underexplored. In this paper, we characterize problems that are of long-term interest to the development of continual RL, which we call scalable MDPs, through the lens of mixing times. In particular, we theoretically establish that scalable MDPs have mixing times that scale polynomially with the size of the problem. We go on to demonstrate that polynomial mixing times present significant difficulties for existing approaches that suffer from myopic bias and stale bootstrapped estimates. To validate the proposed theory, we study the empirical scaling behavior of mixing times with respect to the number of tasks and task switching frequency for pretrained high performing policies on seven Atari games. Our analysis demonstrates both that polynomial mixing times do emerge in practice and how their existence may lead to unstable learning behavior like catastrophic forgetting in continual learning settings.

Matthew Riemer (IBM); Sharath Raparthy; Ignacio Cases; Gopeshh Subbaraj; Maximilian Touzel; Irina Rish

• Foundation Models (FMs) have demonstrated unprecedented capabilities including zero-shot learning, high fidelity data synthesis, and out of domain generalization. However, as we show in this paper, FMs still have poor out-of-the-box performance on expert tasks (e.g. retrieval of car manuals technical illustrations from language queries), data for which is either unseen or belonging to a long-tail part of the data distribution of the huge datasets used for FM pre-training. This underlines the necessity to explicitly evaluate and finetune FMs on such expert tasks, arguably ones that appear the most in practical real-world applications. In this paper, we propose a first of its kind FETA benchmark built around the task of teaching FMs to understand technical documentation, via learning to match their graphical illustrations to corresponding language descriptions. Our FETA benchmark focuses on text-to-image and image-to-text retrieval in public car manuals and sales catalogue brochures. FETA is equipped with a procedure for completely automatic annotation extraction (code would be released upon acceptance), allowing easy extension of FETA to more documentation types and application domains in the future. Our automatic annotation leads to an automated performance metric shown to be consistent with metrics computed on human-curated annotations (also released). We provide multiple baselines and analysis of popular FMs on FETA leading to several interesting findings that we believe would be very valuable to the FM community, paving the way towards real-world application of FMs for practical expert tasks currently 'overlooked' by standard benchmarks focusing on common objects.

Amit Alfassy (IBM); Assaf Arbelle (IBM); Oshri Halimi; Sivan Harary (IBM); Roi Herzig; Eliyahu Schwartz (IBM); Rameswar Panda (IBM); Michele Dolfi (IBM); Christoph Auer (IBM); Peter Staar (IBM); Kate Saenko (IBM); Rogerio Feris (IBM); Leonid Karlinsky

• The main challenge of multiagent reinforcement learning is the difficulty of learning useful policies in the presence of other simultaneously learning agents whose changing behaviors jointly affect the environment’s transition and reward dynamics. An effective approach that has recently emerged for addressing this non-stationarity is for each agent to anticipate the learning of other agents and influence the evolution of future policies towards desirable behavior for its own benefit. Unfortunately, previous approaches for achieving this suffer from myopic evaluation, considering only a finite number of policy updates. As such, these methods can only influence transient future policies rather than achieving the promise of scalable equilibrium selection approaches that influence the behavior at convergence. In this paper, we propose a principled framework for considering the limiting policies of other agents as time approaches infinity. Specifically, we develop a new optimization objective that maximizes each agent’s average reward by directly accounting for the impact of its behavior on the limiting set of policies that other agents will converge to. Our paper characterizes desirable solution concepts within this problem setting and provides practical approaches for optimizing over possible outcomes. As a result of our farsighted objective, we demonstrate better long-term performance than state-of-the-art baselines across a suite of diverse multiagent benchmark domains.

Dong Ki Kim; Matthew Riemer (IBM); Miao Liu (IBM); Jakob Foerster; Michael Everret; Chuangchuang Sun; Gerald Tesauro (IBM); Jonathan How

• Most of the existing algorithms for zero-shot classification problems typically rely on the attribute-based semantic relations among categories to realize the classification of novel categories without observing any of their instances. However, training the zero-shot classification models still requires attribute labeling for each class (or even instance) in the training dataset, which is also expensive. To this end, in this paper, we bring up a new problem scenario: ''Can we derive zero-shot learning for novel attribute detectors/classifiers and use them to automatically annotate the dataset for labeling efficiency?'' Basically, given only a small set of detectors that are learned to recognize some manually annotated attributes (i.e., the seen attributes), we aim to synthesize the detectors of novel attributes in a zero-shot learning manner. Our proposed method, Zero-Shot Learning for Attributes (ZSLA), which is the first of its kind to the best of our knowledge, tackles this new research problem by applying the set operations to first decompose the seen attributes into their basic attributes and then recombine these basic attributes into the novel ones. Extensive experiments are conducted to verify the capacity of our synthesized detectors for accurately capturing the semantics of the novel attributes and show their superior performance in terms of detection and localization compared to other baseline approaches. Moreover, we demonstrate the application of automatic annotation using our synthesized detectors on Caltech-UCSD Birds-200-2011 dataset. Various generalized zero-shot classification algorithms trained upon the dataset re-annotated by ZSLA shows comparable performance with those trained with the manual ground-truth annotations.

Yu-hsuan Li; Tzu-yin Chao; Ching-chun Huang; Pin-Yu Chen (IBM); Wei-Chen Chiu

• A classical result of Johnson and Lindenstrauss states that a set of n high dimensional data points can be projected down to O(log n/ε^2) dimensions such that the square of their pairwise distances will be preserved up to some small distortion ε ∈ (0,1). This work aims to improve this1/ε^2 dependency based on techniques inspired by the Hutch++ Algorithm [23], which, remarkably, reduced1/ε^2 to1/ε for the related problem of implicit matrix trace estimation. Forε = 0.01, for example, this translates to 100 times less matrix-vector products in the matrix-vector query model to achieve the same accuracy. We first present an algorithm for estimating the Euclidean lengths of the rows of a matrix for which we prove (i) element-wise probabilistic bounds which are at least as good as standard JL estimations in the worst case, but are asymptotically better for matrices with rapidly decaying spectrum and (ii) for any matrix, regardless its spectrum, the algorithm can achieve εaccuracy for the total, Frobenius norm-wise relative error using only O(1/ε) queries, which is a quadratic improvement over standard JL approximations. We show that these results can also be extended for two other important problems, namely for estimating the Euclidean distances between data points, as well as for approximating the statistical leverage scores of a tall-and-skinny data matrix. We also provide indicative numerical experiments validating our theoretical analysis.

Aleksandros Sobczyk (IBM); Mathieu Luisier

• Deep neural networks have seen great success in recent years; however, training a deep model is often challenging as its performance heavily depends on the hyper-parameters used. In addition, finding the optimal hyper-parameter configuration, even with state-of-the-art (SOTA) hyper-parameter optimization (HPO) algorithms, can be time-consuming, requiring multiple training runs over the entire dataset for different possible sets of hyper-parameters. Our central insight is that using an informative subset of the dataset for model training runs involved in hyper-parameter optimization, allows us to find the optimal hyper-parameter configuration significantly faster. In this work, we propose AUTOMATA, a gradient-based subset selection framework for hyper-parameter tuning. We empirically evaluate the effectiveness of AUTOMATA in hyper-parameter tuning through several experiments on real-world datasets in the text, vision, and tabular domains. Our experiments show that using gradient-based data subsets for hyper-parameter tuning achieves significantly faster turnaround times and speedups of 3-30 while achieving comparable performance to the hyper-parameters found using the entire dataset.

Krishnateja Killamsetty; Guttu Sai Abhishek; Aakriti Lnu; Alexandre Evfimievski (IBM); Lucian Popa (IBM); Ganesh Ramakrishnan; Rishabh Iyer

• We consider the task of training machine learning models with data-dependent constraints. Such constraints often arise as empirical versions of expected value constraints that enforce fairness or stability goals. We reformulate data-dependent constraints so that they are \emph{calibrated}: enforcing the reformulated constraints guarantees that their expected value counterparts are satisfied with a user-prescribed probability. The resulting optimization problem is amendable to standard stochastic optimization algorithms, and we demonstrate the efficacy of our method on a fairness-sensitive classification task where we wish to guarantee the classifier's fairness (at test time).

Songkai Xue; Yuekai Sun; Mikhail Yurochkin (IBM)

• In consequential decision-making applications, mitigating unwanted biases in machine learning models that yield systematic disadvantage to members of groups delineated by sensitive attributes such as race and gender is one key intervention to strive for equity. Focusing on demographic parity and equality of opportunity, in this paper we propose an algorithm that improves the fairness of a pre-trained classifier by simply dropping carefully selected training data points. We select instances based on their influence on the fairness metric of interest, computed using an infinitesimal jackknife-based approach. The dropping of training points is done in principle, but in practice does not require the model to be refit. Crucially, we find that such an intervention does not substantially reduce the predictive performance of the model but drastically improves the fairness metric. Through careful experiments, we evaluate the effectiveness of the proposed approach on diverse tasks and find that it consistently improves upon existing alternatives.

Prasanna Sattigeri (IBM); Soumya Ghosh (IBM); Inkit Padhi (IBM); Pierre Dognin (IBM); Kush Varshney (IBM)

• Action recognition has improved dramatically with massive-scale video datasets. Yet, these datasets are accompanied with issues related to curation cost, privacy, ethics, bias, and copyright. Compared to that, only minor efforts have been devoted toward exploring the potential of synthetic video data. In this work, as a stepping stone towards addressing these shortcomings, we study the transferability of video representations learned solely from synthetically-generated video clips, instead of real data. We propose SynAPT, a novel benchmark for action recognition based on a combination of existing synthetic datasets, in which a model is pre-trained on synthetic videos rendered by various graphics simulators, and then transferred to a set of downstream action recognition datasets, containing different categories than the synthetic data. We provide an extensive baseline analysis on SynAPT revealing that the simulation-to-real gap is minor for datasets with low object and scene bias, where models pre-trained with synthetic data even outperform their real data counterparts. We posit that the gap between real and synthetic action representations can be attributed to contextual bias and static objects related to the action, instead of the temporal dynamics of the action itself.

Yo-whan Kim; Samarth Mishra (IBM); Souyoung Jin; Rameswar Panda (IBM); Hildegard Kuehne; Leonid Karlinsky (IBM); Kate Saenko (IBM); Aude Oliva; Rogerio Feris (IBM)

• We propose VRL3, a powerful data-driven framework with a minimalist design for solving highly challenging visual deep reinforcement learning (DRL) tasks. We analyze a number of major obstacles in taking a data-driven approach, and present a suite of design principles, novel findings, and critical insights about data-driven visual DRL. Our framework has three stages: in stage 1, we leverage non-RL datasets (e.g. ImageNet) to learn task-agnostic visual representations; in stage 2, we use offline RL data (e.g. a limited number of expert demonstrations) to convert the task-agnostic representations into more powerful task-specific representations; in stage 3, we fine-tune the agent with online RL. On a set of highly challenging hand manipulation tasks with sparse reward and realistic visual inputs, compared to the previous SOTA, VRL3 achieves an average of 780% better sample efficiency. And on the hardest task, VRL3 is 1220% more sample efficient and solves the task with only 10% of the computation. These highly significant results clearly demonstrate the great potential of data-driven deep reinforcement learning.

Che Wang; Xufang Luo; Keith Ross; Dongsheng Li (IBM)

• We study the following independence testing problem: given access to samples from a distribution P over ${0,1}^n$, decide whether P is a product distribution or whether it is ε-far in total variation distance from any product distribution. For arbitrary distributions, this problem requires exp(n) samples. We show in this work that if P has a sparse structure, then in fact only linearly many samples are required.Specifically, if P is Markov with respect to a Bayesian network whose underlying DAG has in-degree bounded by d, then ~Θ( $2^{d/2}$ ⋅n/ $ε^2$) samples are necessary and sufficient for independence testing.

Arnab Bhattacharyya; Clément Cannone (IBM); Qiping Yang

• Graph-level anomaly detection aims to distinguish anomalous graphs in a graph dataset from normal graphs. Anomalous graphs represent very few but essential patterns in the real world. The anomalous property of a graph may be referable to its anomalous attributes of particular nodes and anomalous substructures referring to a subset of nodes and edges in the graph. In addition, due to the imbalance nature of anomaly problem, the anomalous information will be diluted by normal graphs with overwhelming quantities. Various anomaly notions in the attributes and/or substructures and the imbalance nature together make detecting anomalous graphs a non-trivial task. In this paper, we propose a dual-discriminative graph neural network for graph-level anomaly detection, namely iGAD. Specifically, an anomalous graph attribute-aware graph convolution and an anomalous graph substructure-aware deep Random Walk Kernel (deep RWK) are welded into a graph neural network to achieve a dual-discriminative ability on anomalous attributes and substructures. The deep RWK in iGAD makes up for the deficiency of graph convolution in distinguishing structural information caused by the simple neighborhood aggregation mechanism. Further, we propose a Point Mutual Information-based loss function to address the imbalance nature of anomaly problem. The loss function enables iGAD to capture the essential correlation between input graphs and their anomalous/normal properties. We evaluate iGAD on four real-world graph datasets. Extensive experiments demonstrate the superiority of iGAD on the graph-level anomaly detection task.

Ge Zhang; Zhenyu Yang; Jia Wu; Jian Yang; Shan Xue; Hao Peng; Jianlin Su; Chuan Zhou; Quan Z. Sheng; Leman Akoglu; Charu Aggarwal (IBM)

• The deployment constraints in practical applications necessitate the pruning of large-scale deep learning models, i.e., promoting their weight sparsity. As illustrated by the Lottery Ticket Hypothesis (LTH), pruning also has the potential of improving their generalization ability. At the core of LTH, iterative magnitude pruning (IMP) is the predominant pruning method to successfully find ‘winning tickets’. Yet, the computation cost of IMP grows prohibitively as the targeted pruning ratio increases. To reduce the computation overhead, various efficient ‘one-shot’ pruning methods have been developed, but these schemes are usually unable to find winning tickets as good as IMP. This raises the question of how to close the gap between pruning accuracy and pruning efficiency? To tackle it, we pursue the algorithmic advancement of model pruning. Specifically, we formulate the pruning problem from a fresh and novel viewpoint, bi-level optimization (BLO). We show that the BLO interpretation provides a technically-grounded optimization base for an efficient implementation of the pruning-retraining learning paradigm used in IMP. We also show that the proposed bi-level optimization-oriented pruning method (termed BiP) is a special class of BLO problems with a bi-linear problem structure. By leveraging such bi-linearity, we theoretically show that BiP can be solved as easily as first-order optimization, thus inheriting the computation efficiency. Through extensive experiments on both structured and unstructured pruning with 5 model architectures and 4 data sets, we demonstrate that BiP can find better winning tickets than IMP in most cases, and is computationally as efficient as the one-shot pruning schemes, demonstrating 2-7x speedup over IMP for the same level of model accuracy and sparsity.

Yihua Zhang; Yuguang Yao; Parikshit Ram (IBM); Pu Zhao; Tianlong Chen; Mingyi Hong; Yanzhi Wang; Sijia Liu (IBM)

• Training deep neural network classifiers that are certifiably robust against adversarial attacks is critical to ensuring the security and reliability of AI-controlled systems. Although numerous state-of-the-art certified training methods have been developed, they are computationally expensive and scale poorly with respect to both dataset and network complexity. Widespread usage of certified training is further hindered by the fact that periodic retraining is necessary to incorporate new data and network improvements. In this paper, we propose Certified Robustness Transfer (CRT), a general-purpose framework for reducing the computational overhead of any certifiably robust training method through knowledge transfer. Given a robust teacher, our framework uses a novel training loss to transfer the teacher's robustness to the student. We provide theoretical and empirical validation of CRT. Our experiments on CIFAR-10 show that CRT speeds up certified robustness training by 8x on average across three different architecture generations, while achieving comparable robustness to state-of-the-art methods. We also show that CRT can scale to large-scale datasets like ImageNet.

Pratik Vaishnavi; Kevin Eykholt (IBM); Amir Rahmati

• We consider the problem of robustly testing the norm of a high-dimensional sparse signal vector under two different observation models. In the first model, we are given n i.i.d. samples from the distribution N(θ, $I\_d$ ) (with unknown θ), of which a small fraction has been arbitrarily corrupted. Under the promise that ∥θ∥0≤s, we want to correctly distinguish whether ∥θ∥ $\_2$ =0 or ∥θ∥ $\_2$ >γ, for some input parameter γ>0. We show that any algorithm for this task requires n=Ω(s log ed/s) samples, which is tight up to logarithmic factors. We also extend our results to other common notions of sparsity, namely, ∥θ∥q≤s for any 0<q<2. In the second observation model that we consider, the data is generated according to a sparse linear regression model, where the covariates are i.i.d. Gaussian and the regression coefficient (signal) is known to be s-sparse. Here too we assume that an ϵ-fraction of the data is arbitrarily corrupted. We show that any algorithm that reliably tests the norm of the regression coefficient requires at least n=Ω(min(s log d,1/γ4)) samples. Our results show that the complexity of testing in these two settings significantly increases under robustness constraints. This is in line with the recent observations made in robust mean testing and robust covariance testing.

Anand Jerry George; Clément Cannone