SysFlow: Scalable system telemetry for improved security analytics

No organization is safe against cybercrime. Recent studies have shown that these crimes will cost the world well over $5 trillion a year by 2024. Cyber attackers breach corporate networks using a myriad of techniques, with application vulnerabilities corresponding to 25% of all exploitable attack vectors. More disturbing is that these attacks can go unnoticed for hundreds of days, often resulting in the exfiltration of confidential company data and erosion of client trust. These staggering indicators reveal that conventional defenses are no longer sufficient to protect IT environments from cyber attacks. Indeed, traditional network traffic monitoring and misuse detection systems offer limited visibility into targeted workloads and cannot keep up with evolving attacks. They sustain high error rates, and using them is akin to searching for a needle in an extremely large haystack.

To fill in these blind spots, IBM Research has been investigating new monitoring technologies that provide comprehensive visibility into cloud workload behaviors. Visibility into services and applications is critical for creating a strong security posture, and an opportunity to reduce cybersecurity risks.

In our opening talk at the FloCon 2020 Conference this week, we announced the open-sourcing of SysFlow, a new system telemetry format and tool suite for monitoring system behavior for scalable security, compliance, and performance analytics. SysFlow encodes system activities in a compact format that records how applications interact with their environment. It connects process behaviors to network and file access activities, providing a richer context for analysis. This additional context provides deeper visibility into host and container workloads, and enables a range of cloud workload protection use cases, including container runtime integrity protection, threat hunting, and forensics. While telemetry of system event information is not new, current monitors collect data at system call granularity, generating massive amounts of data that limit analytics to simple rule-based approaches. SysFlow reduces data collection rates by orders of magnitude and lifts events into behaviors, which enables forensic applications and more comprehensive analysis approaches. Furthermore, SysFlow’s open serialization format and libraries enable integrations with open source frameworks (e.g., Spark, scikit-learn) and custom analytic microservices.
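To give a feel for why summarizing system calls into flows shrinks the data so much, here is a minimal Python sketch. It is not SysFlow's actual format or API: the `NetworkFlow` record, the `summarize` function, and the sample trace are all hypothetical, chosen only to illustrate the idea of collapsing many per-call records into one volumetric record.

```python
from dataclasses import dataclass


@dataclass
class NetworkFlow:
    """Hypothetical flow record: one summary per (process, remote endpoint)."""
    pid: int
    remote: str          # "host:port" of the remote endpoint
    ops: int = 0         # number of system calls folded into this flow
    bytes_sent: int = 0
    bytes_recv: int = 0


def summarize(syscalls):
    """Collapse per-syscall records into one flow per (pid, remote) pair."""
    flows = {}
    for pid, op, remote, nbytes in syscalls:
        flow = flows.setdefault((pid, remote), NetworkFlow(pid, remote))
        flow.ops += 1
        if op == "send":
            flow.bytes_sent += nbytes
        elif op == "recv":
            flow.bytes_recv += nbytes
    return list(flows.values())


# Made-up trace: one process exchanging 2,000 messages with one endpoint.
syscalls = ([(42, "send", "10.0.0.5:4444", 512)] * 1000
            + [(42, "recv", "10.0.0.5:4444", 64)] * 1000)

flows = summarize(syscalls)
print(len(syscalls), "syscall records ->", len(flows), "flow record(s)")
```

In this toy trace, two thousand call-level records become a single flow that still carries the operation count and byte totals an analyst would care about.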

Figure 1: APT attack kill chain and visual representation of entities and activities captured by SysFlow

To illustrate the benefits of the new approach, Figure 1 shows how SysFlow can be used to uncover a targeted attack in which a cyber criminal exfiltrates data from a cloud-hosted service. During reconnaissance (step 1), the attacker detects a node.js server that is susceptible to a remote code execution attack through a vulnerability in a node.js module. The attacker exploits the system using a malicious payload (step 2), which hijacks the node.js server and downloads a Python script from a remote server (step 3). The script contacts its command-and-control server (step 4), and then starts scanning the system for sensitive keys, eventually gaining access to a sensitive customer database (step 5). The attack completes when data is exfiltrated off site (step 6).

While state-of-the-art monitoring tools would only capture streams of disconnected events, SysFlow can connect the entities of each attack step on the system. For example, the highlighted SysFlow trace maps precisely the steps of the attack kill chain: the node.js process is hijacked, and then converses with a remote malware server on port 2345 to download a malicious script (exfil.py). The script is then executed and starts an interaction with a command-and-control server on port 4444 to exfiltrate sensitive information from the customer database on port 3000.

This example showcases the advantages of applying flow analysis to system telemetry. SysFlow provides visibility within host environments by exposing relationships between containers, processes, files, and network endpoints as events (single operations) and flows (volumetric operations). For example, when the node.js process clones and execs into the new process, these tasks are recorded as process events (PE), and when a process communicates with a network endpoint or writes a file, these interactions are captured and summarized using compact file (FF) and network (NF) flows. The result is a graph-like data structure that enables precise reasoning and fast retrieval of security-relevant information, allowing automated defenses to detect and respond to attack incursions promptly, even before attackers can complete their missions.
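The graph-like structure described above can be sketched in a few lines of Python. This is an illustration under assumptions, not SysFlow's schema: the `Process` class, the `lineage` and `flows_in_subtree` helpers, and the endpoint names are hypothetical, and only the PE/FF/NF labels and the port numbers come from the example trace.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Process:
    """Hypothetical process node; the parent link stands in for a process event (PE)."""
    pid: int
    exe: str
    parent: Optional["Process"] = None
    # Each flow is a (kind, target) pair, with kind "FF" (file) or "NF" (network).
    flows: List[Tuple[str, str]] = field(default_factory=list)


def lineage(proc):
    """Walk parent links and return the chain of executables, root first."""
    chain = []
    while proc is not None:
        chain.append(proc.exe)
        proc = proc.parent
    return list(reversed(chain))


def flows_in_subtree(root, processes, kind):
    """Collect flows of one kind from root and every process descended from it."""
    result = []
    for p in processes:
        q = p
        while q is not None:
            if q is root:
                result.extend((p.exe, target) for k, target in p.flows if k == kind)
                break
            q = q.parent
    return result


# Rebuild the example trace: the hijacked server downloads and runs exfil.py.
node = Process(100, "node")
node.flows.append(("NF", "malware-server:2345"))
node.flows.append(("FF", "exfil.py"))
script = Process(101, "python exfil.py", parent=node)
script.flows.append(("NF", "c2-server:4444"))
script.flows.append(("NF", "customer-db:3000"))

print(lineage(script))
for exe, target in flows_in_subtree(node, [node, script], "NF"):
    print(exe, "->", target)
```

Because every flow hangs off a process and every process points to its parent, a single query starting at the hijacked server returns all network activity of the attack, which is the kind of linked context that disconnected event streams lack.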

Figure 2: SysFlow telemetry stack architectural overview

Figure 2 shows an overview of SysFlow’s telemetry pipeline. The data processing pipeline provides a set of reusable components and APIs that enable easy deployment of telemetry probes for cloud workload monitoring, as well as the export of SysFlow records to object storage services feeding into security analytics jobs. Specifically, the analytics framework provides an extensible policy engine that ingests customizable security policies described in a declarative input language, providing facilities for defining higher-order logic expressions that are checked against SysFlow records. This allows practitioners to easily define security and compliance policies that can be deployed on a scalable, out-of-the-box analysis toolchain while supporting extensible programmatic APIs for the implementation of custom analytics algorithms. As a result, users of the pipeline can redirect their efforts to delivering high-value use cases, leveraging the data processing framework to shorten the time required to deploy and share new analytics at cloud scale.
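To make the idea of declarative policies checked against records more concrete, here is a small Python sketch. It does not use SysFlow's policy language or engine: the rule layout, the field names (`type`, `exe`, `parent_exe`, `dport`), and the `matches` and `evaluate` functions are assumptions made for illustration, and the rules only support simple conjunctions rather than the richer expressions the real engine provides.

```python
# Hypothetical declarative rules: each rule is a conjunction of (field, op, value).
RULES = [
    {
        "name": "interpreter spawned by server",
        "all": [
            ("type", "==", "PE"),
            ("parent_exe", "==", "node"),
            ("exe", "in", ["python", "sh", "bash"]),
        ],
    },
    {
        "name": "outbound connection to unusual port",
        "all": [
            ("type", "==", "NF"),
            ("dport", "not in", [80, 443, 3000]),
        ],
    },
]

# Operators the toy rule language understands.
OPS = {
    "==": lambda a, b: a == b,
    "in": lambda a, b: a in b,
    "not in": lambda a, b: a not in b,
}


def matches(rule, record):
    """A record matches when every condition holds; missing fields never match."""
    return all(
        name in record and OPS[op](record[name], value)
        for name, op, value in rule["all"]
    )


def evaluate(rules, records):
    """Return (rule name, record) pairs for every match, in record order."""
    return [(rule["name"], rec) for rec in records for rule in rules if matches(rule, rec)]


# Made-up records loosely following the attack example.
records = [
    {"type": "PE", "exe": "python", "parent_exe": "node"},
    {"type": "NF", "exe": "python", "dport": 4444},
    {"type": "NF", "exe": "node", "dport": 443},
]

alerts = evaluate(RULES, records)
for name, rec in alerts:
    print(name, rec)
```

Keeping the rules as plain data means they can be edited, shared, and reviewed without touching the matching code, which is the main appeal of a declarative policy layer.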

SysFlow is an ongoing research project and we welcome feedback and contributions from the community. Unlike other telemetry sources, SysFlow observes and correlates essential system activity, providing security teams with the necessary contextual information to identify cyber attacks and close security incidents more quickly, without overwhelming analysts with disposable noise. As the SysFlow project matures, our goal is to contribute an open standard and data representation for system telemetry that may be adopted across the industry.

To learn more, file bug reports, or contribute to SysFlow: