AI for IT Infrastructure

AI to improve infrastructure efficiency and user productivity, and to extract valuable insights from business data.

Overview

This project aims to enhance the capabilities of IBM IT Infrastructure with AI, by using the power of Foundation Models. We develop AI-infused tools to help increase the productivity and efficiency of human agents. We also design AI solutions to enhance the resiliency and security of IT systems and to fight digital crime.

AI for Digital Assistant Applications

Digital Assistants are AI-powered agents that augment humans in various tasks, e.g. in human resources, infrastructure support and system administration. We are designing novel AI algorithms and tools to help human support agents increase their efficiency and reduce the time to resolve cases, and to help customers handle certain infrastructure issues directly.

The objectives of our work are:

to help support agents resolve customer cases faster and better,
to anticipate anomalies in infrastructure operation before they occur and alert the customer or support agent accordingly,
to help customers self-diagnose and troubleshoot problems faster and more efficiently (case deflection).

IT infrastructure systems collect and report multi-domain, multivariate data about their operation, e.g. configuration, topology, performance data and logs. In addition, every time an IT infrastructure customer opens a ticket to report an issue, or when tickets/alerts are generated automatically, textual reports are created with information describing the issue. Support agents then manually examine these reports to understand the symptoms, identify the root causes and eventually propose a resolution action, which is then communicated to the customer.

There are several inefficiencies with the above process.

The work of the support agent is largely manual and labor-intensive, and is therefore costly and does not scale.
As a result, some customer problems may be left waiting, possibly causing customer dissatisfaction.
Whereas agents develop deep know-how and expertise on particular problems and associated root causes and resolutions, this large knowledge capital is lost if and when they leave the company. New agents lack the experience to handle cases efficiently and their on-boarding is costly and time-consuming.

Semi-automatic, rule-based systems have been put in place over time to assist support agents in their everyday tasks, in particular to spot anomalies by monitoring particular metrics. This approach has two main drawbacks: (1) it is based on fixed, pre-defined and rigid rules that cannot easily adapt to the particular situation and are thus ineffective, (2) it can lead to a large number of false alerts which result in significant additional overhead for the agents to debug.

We use multivariate performance metrics (time series) collected periodically from IT systems in the field, as well as system logs and hardware/software tickets. We then apply AI and Foundation Models to make sense of this diverse data. The main use cases that we develop include real-time anomaly detection, ticket resolution recommendation, and interactive chatbots.
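To make the anomaly-detection use case concrete, here is a minimal sketch of a rolling z-score detector over a stream of multivariate metric samples. It is a deliberately simple statistical baseline and not the Foundation Model approach described above; the class name `RollingZScoreDetector`, the metric names `cpu_util` and `io_latency_ms`, and the window and threshold values are all assumptions made for illustration.

```python
from collections import deque
from math import sqrt


class RollingZScoreDetector:
    """Flags a metric when it deviates strongly from its own recent history.

    Hypothetical baseline for illustration only.
    """

    def __init__(self, window=30, threshold=3.0, min_history=5):
        self.window = window            # past samples kept per metric
        self.threshold = threshold      # z-score above which a value is flagged
        self.min_history = min_history  # samples needed before scoring starts
        self.history = {}               # metric name -> deque of recent values

    def update(self, sample):
        """Score one sample (dict: metric -> value); return {metric: z-score}."""
        anomalies = {}
        for name, value in sample.items():
            past = self.history.setdefault(name, deque(maxlen=self.window))
            if len(past) >= self.min_history:
                mean = sum(past) / len(past)
                std = sqrt(sum((x - mean) ** 2 for x in past) / len(past))
                if std > 0 and abs(value - mean) / std > self.threshold:
                    anomalies[name] = abs(value - mean) / std
            past.append(value)
        return anomalies


# Synthetic stream: 20 ordinary samples followed by one CPU spike.
stream = [
    {"cpu_util": 40 + (i % 3), "io_latency_ms": 2.0 + 0.1 * (i % 2)}
    for i in range(20)
]
stream.append({"cpu_util": 95, "io_latency_ms": 2.1})

detector = RollingZScoreDetector(window=20, threshold=3.0)
flagged = []
for index, sample in enumerate(stream):
    result = detector.update(sample)
    if result:
        flagged.append((index, sorted(result)))
print(flagged)
```

Unlike a fixed threshold rule, the baseline here adapts to each metric's recent behavior, but it still scores every metric independently and ignores correlations across metrics, logs and tickets.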

Financial Crime Detection and Accelerated AI Model Inference in the IBM Mainframe

With the widespread digitization of finance and the increasing popularity of cryptocurrencies, the sophistication of financial fraud schemes devised by cybercriminals is growing. For example, money laundering, the movement of illicit funds to conceal their origins, can cross bank and national boundaries, producing complex transaction patterns. The UN estimates that 2-5% of global GDP, or 0.8-2.0 trillion dollars, is laundered globally each year. Much of this illegal money is used to fund global terrorist activities, which pose a serious threat to the world order.

Our goal is to devise technology that can detect financial crime activity manifesting as particular patterns in transaction networks. Such patterns are also relevant for other graph-related tasks, ranging from protein folding to traffic forecasting. We are designing algorithms that mine large financial transaction graphs and are able to detect known financial crime patterns in real time [Blanusa-2022, Blanusa-2023, Altman-2023]. We are also devising new graph neural network (GNN) architectures that can detect arbitrary subgraphs in directed multigraphs with theoretical guarantees [Egressy-2024]. Our techniques have been applied successfully to synthetic money-laundering transactions as well as real-world phishing datasets and have shown very promising performance.
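As an illustration of what a "pattern in a transaction network" can look like, the sketch below searches a small directed multigraph for cycles in which funds return to the originating account through transactions with strictly increasing timestamps. This is a brute-force depth-first search written for clarity, not the algorithms from the cited papers; the function name `find_temporal_cycles`, the `(source, destination, timestamp, amount)` tuple layout and the sample data are assumptions.

```python
from collections import defaultdict


def find_temporal_cycles(transactions, max_hops=4):
    """Return cycles of at most max_hops transactions (max_hops >= 2).

    Each transaction is a (source, destination, timestamp, amount) tuple.
    Timestamps must strictly increase along a cycle. Hypothetical helper.
    """
    outgoing = defaultdict(list)
    for tx in transactions:
        outgoing[tx[0]].append(tx)  # parallel edges are kept (multigraph)

    cycles = []

    def extend(start, path, visited):
        last = path[-1]
        for tx in outgoing[last[1]]:
            _, dst, ts, _ = tx
            if ts <= last[2]:
                continue  # funds must move forward in time
            if dst == start:
                cycles.append(path + [tx])
            elif dst not in visited and len(path) + 1 < max_hops:
                extend(start, path + [tx], visited | {dst})

    for tx in transactions:
        if tx[0] == tx[1]:
            continue  # ignore self-transfers
        # Every search starts at the earliest transaction of a cycle, so
        # each time-ordered cycle is reported exactly once.
        extend(tx[0], [tx], {tx[0], tx[1]})
    return cycles


transactions = [
    ("A", "B", 1, 900),
    ("B", "C", 2, 880),
    ("C", "A", 3, 870),  # closes the cycle A -> B -> C -> A
    ("A", "D", 4, 50),
    ("B", "C", 5, 10),   # parallel edge, too late to reach C -> A at time 3
]
cycles = find_temporal_cycles(transactions)
for cycle in cycles:
    print(" -> ".join(tx[0] for tx in cycle) + " -> " + cycle[-1][1])
```

The exhaustive search is exponential in the worst case, which is why real-time detection on large transaction graphs calls for specialized algorithms.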

The IBM mainframe is the undisputed system for transactional workloads in the enterprise space. For reasons of data security and sustainability, it is often not desirable to move data to and from the mainframe. At the same time, users want to run advanced AI/ML algorithms on their data to uncover patterns of interest. In such cases, bringing the compute close to the data is often more effective. We are developing advanced AI algorithms for accelerating the inference of Foundation Models and Machine Learning models on IBM Z systems, also leveraging the integrated accelerator for AI in z16.

Ransomware Detection in Storage Systems

In our fast-moving world, where Artificial Intelligence, Deep Neural Networks and Machine Learning accomplish ever more impressive tasks, key business data is a critical asset, often touted as the "oil of the 21st century", and its availability and preservation are essential.

However, the high value of this data attracts not only its legal owners but also fraudsters. Illegal capture, encryption or deletion of data, in the form of ransomware, spyware or other malware, is rising steeply worldwide.

At IBM Research, we are developing advanced AI/ML techniques to detect suspicious access to, or tampering with, data in real time. To achieve that, we monitor all I/O activity at the block storage level, within the IBM FlashCore Module and FlashSystem, and generate patterns that are then fed to machine learning algorithms for detection of malware activity. Our schemes are able to detect known ransomware in sub-minute intervals, allowing backup and recovery techniques to protect and recover user data before it is too late.
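The text does not say which I/O patterns are extracted, so the following is only a sketch of one signal commonly used in ransomware detection in general: the Shannon entropy of written blocks, which is close to 8 bits per byte for encrypted data. The function names, the 7.5 threshold and the sample blocks are assumptions for illustration, and nothing here is claimed to reflect the features computed inside the FlashCore Module.

```python
import math
import os
from collections import Counter


def shannon_entropy(block: bytes) -> float:
    """Entropy in bits per byte: 0 for constant data, near 8 for random bytes."""
    if not block:
        return 0.0
    n = len(block)
    return -sum((c / n) * math.log2(c / n) for c in Counter(block).values())


def high_entropy_write_ratio(writes, entropy_threshold=7.5):
    """Fraction of written blocks in a window whose entropy exceeds the threshold.

    Hypothetical feature; it could be one input among many to a classifier.
    """
    if not writes:
        return 0.0
    high = sum(1 for block in writes if shannon_entropy(block) > entropy_threshold)
    return high / len(writes)


# Synthetic 4 KiB blocks: repetitive text versus random bytes, the latter
# standing in for encrypted data.
text_block = (b"quarterly report, revenue by region. " * 120)[:4096]
random_block = os.urandom(4096)

normal_window = [text_block] * 10
suspicious_window = [text_block] * 2 + [random_block] * 8

print(high_entropy_write_ratio(normal_window))
print(high_entropy_write_ratio(suspicious_window))
```

Entropy alone is a weak signal, since compressed or legitimately encrypted files also score high, which is one reason to combine several I/O features in a learned model rather than apply a single threshold.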

Our research also extends to creating synthetic ransomware and other malware traces, which can be used to train AI/ML algorithms to detect even obscure data attacks. We are working closely with IBM Storage, and the FlashSystem division in particular, to release advanced malware detection capabilities in IBM FlashSystem products.