The ACM SIGKDD Conference on Knowledge Discovery and Data Mining is one of the leading annual conferences on data science, data mining, knowledge discovery, large-scale data analytics, and big data. The event convenes researchers and practitioners from across disciplines to share best practices and discuss their work.
IBM researchers are presenting hands-on tutorials on how to use our Toolkit for Time Series Anomaly Detection and on Gradual AutoML using Lale. Our experts are also co-organizing seven workshops at KDD and have co-authored eight main conference papers.
Learn about the latest trends in data mining and knowledge discovery. Meet IBM experts working in this space.
We propose an extension to the transformer neural network architecture for general-purpose graph learning by adding a dedicated pathway for pairwise structural information, called edge channels. The resultant framework - which we call Edge-augmented Graph Transformer (EGT) - can directly accept, process and output structural information of arbitrary form, which is important for effective learning on graph-structured data. Our model exclusively uses global self-attention as an aggregation mechanism rather than static localized convolutional aggregation. This allows for unconstrained long-range dynamic interactions between nodes. Moreover, the edge channels allow the structural information to evolve from layer to layer, and prediction tasks on edges/links can be performed directly from the output embeddings of these channels. We verify the performance of EGT in a wide range of graph-learning experiments on benchmark datasets, in which it outperforms Convolutional/Message-Passing Graph Neural Networks. EGT sets a new state-of-the-art for the quantum-chemical regression task on the OGB-LSC PCQM4Mv2 dataset containing 3.8 million molecular graphs. Our findings indicate that global self-attention based aggregation can serve as a flexible, adaptive and effective replacement of graph convolution for general-purpose graph learning. Therefore, convolutional local neighborhood aggregation is not an essential inductive bias.
Md Shamim Hussain (RPI); Mohammed Zaki (RPI); Dharmashankar Subramanian (IBM Research)
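The core idea of the edge channels can be sketched in a few lines of PyTorch: pairwise edge embeddings add a per-head bias to the attention logits and are themselves updated from those logits, so structural information can evolve from layer to layer. The layer below is an illustrative assumption of how such a mechanism could look, not the authors' implementation; gating, normalization and hyperparameters are omitted or simplified.

import torch
import torch.nn as nn

class EdgeAugmentedAttention(nn.Module):
    """Global self-attention with pairwise edge channels (illustrative sketch)."""
    def __init__(self, node_dim: int, edge_dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = node_dim // num_heads
        self.qkv = nn.Linear(node_dim, 3 * node_dim)
        # Edge channels contribute an additive bias to every attention head ...
        self.edge_bias = nn.Linear(edge_dim, num_heads)
        # ... and are updated from the attention logits, so pairwise structural
        # information is refined layer by layer and can be read out for edge tasks.
        self.edge_update = nn.Linear(num_heads, edge_dim)
        self.out = nn.Linear(node_dim, node_dim)

    def forward(self, h, e):
        # h: (batch, nodes, node_dim) node embeddings
        # e: (batch, nodes, nodes, edge_dim) pairwise edge-channel embeddings
        B, N, _ = h.shape
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        q = q.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        # Dot-product logits plus a per-pair bias coming from the edge channels.
        logits = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5  # (B, heads, N, N)
        logits = logits + self.edge_bias(e).permute(0, 3, 1, 2)
        attn = logits.softmax(dim=-1)
        h_new = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        # New edge-channel state read off the attention logits.
        e_new = self.edge_update(logits.permute(0, 2, 3, 1))
        return self.out(h_new), e_new

layer = EdgeAugmentedAttention(node_dim=64, edge_dim=16, num_heads=8)
h, e = torch.randn(2, 10, 64), torch.randn(2, 10, 10, 16)
h, e = layer(h, e)  # node and edge embeddings for the next layer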
With the advent of big data across multiple high-impact applications, we are often facing the challenge of complex heterogeneity. The newly collected data usually consist of multiple modalities and are characterized by multiple labels, thus exhibiting the co-existence of multiple types of heterogeneity. Although state-of-the-art techniques are good at modeling complex heterogeneity with sufficient label information, such label information can be quite expensive to obtain in real applications. Recently, contrastive learning has attracted great attention due to its strong performance when leveraging rich unlabeled data. However, existing work on contrastive learning is not able to address the problem of false negative pairs, i.e., some 'negative' pairs may have similar representations if they share the same label. To overcome these issues, in this paper we propose a unified heterogeneous learning framework, which combines both a weighted unsupervised contrastive loss and a weighted supervised contrastive loss to model multiple types of heterogeneity. We first provide a theoretical analysis showing that the vanilla contrastive learning loss easily leads to sub-optimal solutions in the presence of false negative pairs, whereas the proposed weighted loss can automatically adjust the weights based on the similarity of the learned representations to mitigate this issue. Experimental results on real-world data sets demonstrate the effectiveness and efficiency of the proposed framework in modeling multiple types of heterogeneity.
Lecheng Zheng (University of Illinois at Urbana-Champaign); Jinjun Xiong (University at Buffalo); Yada Zhu (IBM Research); Jingrui He (University of Illinois at Urbana-Champaign)
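As an illustration of the weighting idea, the sketch below down-weights each negative pair as its representation grows more similar to the anchor, which is the intuition behind mitigating false negatives. The paper's actual weighted unsupervised and supervised losses are defined differently; this is only a hedged, minimal example of a similarity-weighted contrastive objective.

import torch
import torch.nn.functional as F

def weighted_contrastive_loss(z1, z2, temperature=0.5):
    """z1, z2: (batch, dim) embeddings of two views of the same samples."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    sim = z1 @ z2.t() / temperature      # (batch, batch) similarity logits
    pos = sim.diag()                     # positives: matching views of each sample
    # Heuristic weights for negatives: the more similar a "negative" already is
    # to the anchor, the more likely it shares the label, so it is weighted less.
    with torch.no_grad():
        w = 1.0 - torch.softmax(sim, dim=1)
        w.fill_diagonal_(0.0)
    neg = (w * sim.exp()).sum(dim=1)
    return (-pos + torch.log(pos.exp() + neg)).mean()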
Recent years have witnessed remarkable success achieved by graph neural networks (GNNs) in many real-world applications such as recommendation and drug discovery. Despite this success, oversmoothing has been identified as one of the key issues that limit the performance of deep GNNs: the learned node representations become highly indistinguishable as aggregation layers are stacked. In this paper, we propose a new perspective on the performance degradation of deep GNNs, namely feature overcorrelation. Through an empirical and theoretical study of this matter, we demonstrate the existence of feature overcorrelation in deeper GNNs and reveal potential reasons leading to this issue. To reduce the feature correlation, we propose a general framework, DeCorr, which encourages GNNs to encode less redundant information. Extensive experiments demonstrate that DeCorr can help enable deeper GNNs and is complementary to existing techniques for tackling the oversmoothing issue.
Wei Jin (MSU); Xiaorui Liu (MSU); Yao Ma (MSU); Charu Aggarwal (IBM); Jiliang Tang (MSU)
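A generic way to discourage feature overcorrelation is to penalize the off-diagonal entries of the embedding correlation matrix, as sketched below. DeCorr's actual objective and how it is applied across layers differ, so this should be read only as an illustration of the underlying idea.

import torch

def decorrelation_penalty(h: torch.Tensor) -> torch.Tensor:
    """h: (num_nodes, dim) node embeddings from some GNN layer."""
    h = h - h.mean(dim=0, keepdim=True)
    h = h / (h.norm(dim=0, keepdim=True) + 1e-8)
    corr = h.t() @ h                                  # (dim, dim) correlation matrix
    off_diag = corr - torch.diag(torch.diagonal(corr))
    return (off_diag ** 2).sum() / h.shape[1] ** 2

# Usage: total_loss = task_loss + lambda_decorr * decorrelation_penalty(hidden)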
Prerna Agarwal (IBM Research); Buyu Gao (IBM); Siyu Huo (IBM Research); Prabhat Reddy (IBM Research); Sampath Dechu (IBM Research); Yazan Obeidi (IBM); Vinod Muthusamy (IBM Research); Vatche Isahagian (IBM Research); Sebastian Carbajales (IBM)
Accurate document layout analysis is a key requirement for high-quality PDF document conversion. With the recent availability of public, large ground-truth datasets such as PubLayNet and DocBank, deep-learning models have proven to be very effective at layout detection and segmentation. While these datasets are of adequate size to train such models, they severely lack in layout variability since they are sourced from scientific article repositories such as PubMed and arXiv only. Consequently, the accuracy of the layout segmentation drops significantly when these models are applied on more challenging and diverse layouts. In this paper, we present DocLayNet, a new, publicly available, document-layout annotation dataset in COCO format. It contains 80,863 manually annotated pages from diverse data sources to represent a wide variability in layouts. For each PDF page, the layout annotations provide labelled bounding-boxes with a choice of 11 distinct classes. DocLayNet also provides a subset of double- and triple-annotated pages to determine the inter-annotator agreement. In multiple experiments, we provide baseline accuracy scores (in mAP) for a set of popular object detection models. We also demonstrate that these models fall approximately 10% behind the inter-annotator agreement. Furthermore, we provide evidence that DocLayNet is of sufficient size. Lastly, we compare models trained on PubLayNet, DocBank and DocLayNet, showing that layout predictions of the DocLayNet-trained models are more robust and thus the preferred choice for general-purpose document-layout analysis.
Birgit Pfitzmann (IBM Research); Christoph Auer (IBM Research); Michele Dolfi (IBM Research); Ahmed S Nassar (IBM Research); Peter W J Staar (IBM Research)
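Because DocLayNet is released in standard COCO format, its annotations can be inspected with a few lines of plain Python. The file path below is a placeholder; check the dataset release for the actual split names and directory layout.

import json
from collections import Counter

with open("COCO/train.json") as f:   # placeholder path to a COCO-format split
    coco = json.load(f)

# The 11 layout classes are listed in the COCO "categories" section.
categories = {c["id"]: c["name"] for c in coco["categories"]}
print(sorted(categories.values()))

# Count labelled bounding boxes per class across all annotated pages.
counts = Counter(categories[a["category_id"]] for a in coco["annotations"])
for name, n in counts.most_common():
    print(f"{name:20s} {n}")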
Mathematical decision-optimization (DO) models provide decision support in a wide range of scenarios. Often, hard-to-model constraints and objectives are learned from data. Learning, however, can give rise to DO models that fail to capture the real system, leading to poor recommendations. We introduce an open-source framework designed for large-scale testing and solution quality analysis of DO model learning algorithms. Our framework produces multiple optimization problems at random, feeds them to the user's algorithm and collects its predicted optima. By comparing predictions against the ground truth, our framework delivers a comprehensive prediction profile of the algorithm. Thus, it provides a playground for researchers and data scientists to develop, test, and tune their DO model learning algorithms. Our contributions include: (1) an open-source testing framework implementation, (2) a novel way to generate DO ground truth, and (3) a first-of-its-kind, generic, cloud-distributed Ray and Rayvens architecture. We demonstrate the use of our testing framework on two open-source DO model learning algorithms.
Orit Davidovich (IBM Research); Gheorghe-Teodor Bercea (IBM Research); Segev Wasserkrug (IBM Research)
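The testing loop the framework automates can be illustrated with a toy example: generate problems whose optimum is known by construction, hand sampled data to the user's learning algorithm, and score its predicted optima against the ground truth. The function names and the one-dimensional problem generator below are purely illustrative; the actual framework's generators are far more general and run distributed on Ray and Rayvens.

import numpy as np

def make_problem(rng):
    # Ground truth: minimize (x - opt)^2 over [0, 1]; the optimum is known by construction.
    opt = rng.uniform(0.0, 1.0)
    xs = rng.uniform(0.0, 1.0, size=200)
    ys = (xs - opt) ** 2 + rng.normal(0.0, 0.01, size=200)  # noisy objective observations
    return xs, ys, opt

def evaluate(learn_and_optimize, num_trials=50, seed=0):
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(num_trials):
        xs, ys, true_opt = make_problem(rng)
        predicted_opt = learn_and_optimize(xs, ys)   # the user's DO model learning algorithm
        errors.append(abs(predicted_opt - true_opt))
    return float(np.mean(errors)), float(np.max(errors))

# Example user algorithm: fit a quadratic surrogate and return its vertex.
def fit_quadratic_and_optimize(xs, ys):
    a, b, c = np.polyfit(xs, ys, deg=2)
    return float(np.clip(-b / (2 * a), 0.0, 1.0))

print(evaluate(fit_quadratic_and_optimize))  # (mean error, worst-case error)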
Neural networks can leverage self-supervision to learn integrated representations across multiple data modalities. This makes them suitable for uncovering complex relationships between vastly different data types, thus lowering the dependency on labor-intensive feature engineering methods. Leveraging deep representation learning, we propose a generic, robust and systematic model that is able to combine multiple data modalities in a fashion that is invariant to both the permutation and the number of modes, two properties that are fundamental for handling changes in data-type content and availability. To this end, we treat each multi-modal data sample as a set and utilise autoencoders to learn a fixed-size, permutation-invariant representation that can be used in any decision-making process. We build upon previous work that demonstrates the feasibility of presenting a set as an input to autoencoders through content-based attention mechanisms. However, since model inputs and outputs are permutation invariant, we develop an end-to-end architecture to approximate the solution of a linear sum assignment problem, i.e., a minimum-cost bijective mapping problem, to ensure a match between the elements of the input and the reconstructed set.
For dimensions up to 128, the network demonstrates near-perfect accuracy in matching these two sets. Combining the content-based attention mechanism for set processing with this matching network allows us to construct a Fully Differentiable Set Autoencoder. We demonstrate the model's capability to learn a combined representation while preserving individual mode characteristics, focusing on the task of reconstructing multi-omic cancer data.
Nikita Janakarajan (IBM Research, ETH Zürich); Jannis Born (IBM Research); Matteo Manica (IBM Research)
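The minimum-cost bijective mapping that the matching network approximates can be computed exactly, though non-differentiably, with an off-the-shelf linear sum assignment solver, as in the sketch below; the paper replaces this step with an end-to-end differentiable network so the whole autoencoder can be trained by backpropagation.

import numpy as np
from scipy.optimize import linear_sum_assignment

def matched_reconstruction_error(inputs: np.ndarray, outputs: np.ndarray) -> float:
    """inputs, outputs: (set_size, dim) original and reconstructed set elements."""
    # Pairwise squared distances form the assignment cost matrix.
    cost = ((inputs[:, None, :] - outputs[None, :, :]) ** 2).sum(-1)
    rows, cols = linear_sum_assignment(cost)   # minimum-cost bijective mapping
    return float(cost[rows, cols].mean())

x = np.random.randn(8, 16)
x_hat = x + 0.05 * np.random.randn(8, 16)      # a slightly perturbed "reconstruction"
print(matched_reconstruction_error(x, x_hat))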