
Application Aware Networking

Overview

To meet the diverse reliability, performance and security needs of hybrid cloud applications, the network needs to evolve from a simple pipe that moves packets from one place to another into one that is intelligent and aware of the needs of the applications it supports. Application developers and DevOps practitioners seek to specify application connectivity needs as 'intents' that get translated into a multi-cloud networking fabric of connections that are fine-tuned and optimized for each application. Supporting diverse application needs on the same underlying network is an extremely challenging problem. We aim to address it by leveraging a combination of technologies, including Software Defined Networking (SDN), an eBPF-based network datapath and programmable observability, and AI-based network management and optimization.

Enterprise Multi-cloud Networking

Multi-cluster networking across clusters running in different domains, such as multiple zones or multiple clouds, is becoming an important requirement in the world of hybrid-cloud transformation. Unfortunately, existing extensions of single-cluster networking solutions fall short of achieving truly independent multi-cluster deployments with a single-pane-of-glass view for controlling, managing and observing applications and their networking interactions. A major design goal in this space is to obtain a bird's-eye view of all the services distributed across diverse cloud environments, along with the ability to drill down to a root cause in case of a failure. We are working on building observability solutions for such multi-cloud environments, which will enable intelligent, analytics-driven decisions. Recently, there has also been interest in the Extensible Internet, a novel backward-compatible approach towards making new Internet services available on the existing Internet infrastructure such that older services are not impacted, while clients can use new services without any tie-in to particular clouds or providers.

aani1b.png

5G Assurance and Optimization

The fifth generation of telecom networks has adopted revolutionary changes such as Software Defined Networking (SDN) and Network Function Virtualization (NFV), which have made network slicing possible. Slicing allows different customers with varying network requirements for latency, bandwidth, reliability and quality of service to co-exist in isolated virtual networks on the same physical infrastructure. To enable rapid, reliable and scalable management and orchestration (MANO) of these 5G network slices, intelligence and optimization need to be embedded into all aspects of lifecycle management, including slice planning, automated deployment and operational assurance. At IBM Research, we are working on research towards addressing the challenges in meeting slice KPIs despite dynamically changing network conditions. This entails building a novel distributed system capable of dynamic monitoring, analysis and optimization of slices across domains (i.e., Core, Transport and RAN), across the full stack (i.e., hardware, orchestrator, network functions and services) and across clusters.

IBM Research and CP4NA [1] have designed a 5G slicing lifecycle MANO framework [2] that handles Day 0 operations of onboarding network functions and designing slice templates, Day 1 operations of zero-touch deployment of slices with intelligent decisions on optimal placement, and Day 2 operations of slice monitoring and assurance using a Data and Analytics Function and an optimization engine. The Data and Analytics Function (DAF) builds upon the NWDAF concept in 3GPP [3], but is a more pluggable module for all data and analytics across domains (Core, Transport and RAN) and across the full stack. It drives closed-loop automation by monitoring slice KPIs and raising alerts in case of a violation. The Slice Optimizer optimizes the placement of functions and the allocation of resources, while simultaneously assessing the feasibility of new slice requests, leveraging data from the DAF. We have outlined the details of the framework in an Industry track paper [2], which will be presented at ACM Middleware 2022. The framework was also showcased at Mobile World Congress (MWC) Las Vegas 2022. Earlier versions of this work were demoed at MWC Los Angeles 2021, MWC Barcelona 2021 and COMSNETS 2022 [4], where it received the “Best research demo - runner up award”.
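The KPI monitoring step described above can be illustrated with a minimal Python sketch. It is not the DAF implementation: the class names, KPI fields and threshold values are all hypothetical, and a real deployment would draw its samples from monitoring data across Core, Transport and RAN rather than from an in-memory list.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SliceKpiTarget:
    """Hypothetical KPI targets for one slice (field names are illustrative)."""
    max_latency_ms: float
    min_throughput_mbps: float


@dataclass(frozen=True)
class SliceSample:
    """One measurement for a slice, as a monitoring module might report it."""
    slice_id: str
    latency_ms: float
    throughput_mbps: float


Alert = Tuple[str, str, float, float]  # (slice_id, kpi_name, observed, bound)


def find_violations(targets: Dict[str, SliceKpiTarget],
                    samples: List[SliceSample]) -> List[Alert]:
    """Compare each sample with its slice's targets and collect alerts."""
    alerts: List[Alert] = []
    for sample in samples:
        target = targets.get(sample.slice_id)
        if target is None:
            continue  # no targets registered for this slice
        if sample.latency_ms > target.max_latency_ms:
            alerts.append((sample.slice_id, "latency_ms",
                           sample.latency_ms, target.max_latency_ms))
        if sample.throughput_mbps < target.min_throughput_mbps:
            alerts.append((sample.slice_id, "throughput_mbps",
                           sample.throughput_mbps, target.min_throughput_mbps))
    return alerts


# Assumed example values: a low-latency slice and a high-bandwidth slice.
targets = {
    "urllc-1": SliceKpiTarget(max_latency_ms=5.0, min_throughput_mbps=10.0),
    "embb-1": SliceKpiTarget(max_latency_ms=50.0, min_throughput_mbps=200.0),
}
samples = [
    SliceSample("urllc-1", latency_ms=7.5, throughput_mbps=12.0),
    SliceSample("embb-1", latency_ms=20.0, throughput_mbps=250.0),
]
alerts = find_violations(targets, samples)
```

In a closed loop, alerts like these would be the trigger handed to an optimizer, which could then re-place functions or re-allocate resources for the affected slice.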

aani2.png

This framework works with a vast amount of distributed data in different modalities (logs, events, metrics, etc.), and we are therefore building a federated data lake for all this data. Additionally, we are expanding the framework’s capabilities to handle a range of advanced optimization use cases, such as optimal resource allocation to different sites and revenue/performance optimization. We are also studying the applicability of the framework in the Enterprise Edge context, thereby supporting cross-cluster optimized deployments across Edge MEC and Core sites to improve performance/cost. Along with slice assurance and optimization, we are exploring aspects of 5G design that aid performance improvements. One work in progress is understanding and optimizing the multicore scalability of the 5G core network functions. We published a workshop paper [5] evaluating 5G core scalability across the various network stacks considered for developing the 5G core.

Network Observability

Observability is the ability to know and interpret the current state of a deployment, and a way to know when something is amiss. With cloud deployments of applications as microservices on Kubernetes and OpenShift growing, observability is getting a lot of attention. Many applications come with strict guarantees, such as service level agreements (SLAs) for downtime, latency, and throughput, so network-level observability is essential.

Recently, eBPF (extended Berkeley Packet Filter) has emerged as a popular option for implementing observability in the end host's kernel, due to its performance and flexibility. It enables custom programs to be hooked at certain points along the network data path (for instance, at a socket, TC, or XDP). Several open source eBPF-based plugins and operators have been released, and each can be plugged into end-host nodes to provide network observability through your cloud orchestrator.

In our research, we analyzed the monitoring of packet-level and flow-level information between multiple hosts in the cloud. We started with the premise that the core feature of observability is how the data is collected in a non-invasive manner. We studied how the performance of flows was affected by the data structure used to collect flow metrics. More details on the performance studies can be found in the blog. Based on our studies, we worked on the design of an optimized eBPF observability data path, jointly developed with the Red Hat Observability team, which is now available as part of OpenShift 4.11.
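To make the idea of a flow-metrics data structure concrete, here is a small user-space Python sketch of a hash table keyed by the flow 5-tuple, which turns per-packet events into per-flow counters. It only models the general technique and is not the data path shipped in OpenShift; the class and field names are assumptions made for illustration.

```python
from typing import Dict, List, NamedTuple, Tuple


class FlowKey(NamedTuple):
    """The classic 5-tuple that identifies a flow."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str


class FlowTable:
    """Aggregates packet-level events into flow-level counters."""

    def __init__(self) -> None:
        # Maps a flow key to [packet_count, byte_count].
        self._flows: Dict[FlowKey, List[int]] = {}

    def record(self, key: FlowKey, length: int) -> None:
        """Account for one packet of `length` bytes on the given flow."""
        counters = self._flows.setdefault(key, [0, 0])
        counters[0] += 1
        counters[1] += length

    def export(self) -> Dict[FlowKey, Tuple[int, int]]:
        """Return the accumulated flow records and reset the table."""
        snapshot = {key: (c[0], c[1]) for key, c in self._flows.items()}
        self._flows.clear()
        return snapshot


# Assumed example traffic: two packets on one flow, one on another.
web = FlowKey("10.0.0.1", "10.0.0.2", 40000, 443, "TCP")
dns = FlowKey("10.0.0.1", "10.0.0.53", 53000, 53, "UDP")

table = FlowTable()
table.record(web, 1500)
table.record(web, 400)
table.record(dns, 80)
flows = table.export()
```

Aggregating per flow and exporting periodically keeps the volume of exported data proportional to the number of flows rather than the number of packets, which is one reason the choice of collection structure matters for overhead.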

aani3.png

Along the same theme, we are also working with our academic partners on an exploratory project on performance diagnosis of applications deployed as microservices. Recently, we have observed several operators who had to debug latency issues [3,4], where the system had to be torn down and every node had to be instrumented to identify which entity was causing the latency spike. We attribute this cumbersome debugging experience to the absence of end-to-end observability in the orchestrators, which stems from the gap between application, host and network observability. In this project [5], we are working to solve this problem of end-to-end observability and bridge the gap between application, host and network observability.

eBPF-based Network Datapaths

Emerging applications, particularly at the edge, require both ultra-low-latency processing and minimal resource overhead, requirements that are not satisfactorily met by traditional network stacks. There is a clear need to develop network stacks that are performant, low maintenance and extensible to emerging application needs at runtime. Traditional approaches for extending application requirements into the network data path, such as iptables and custom kernel modules, as well as kernel-bypass techniques such as DPDK, have well-known shortcomings, such as a lack of extensibility and heavy resource usage. In this project, we explore the use of eBPF, a recent Linux kernel technology, for building flexible and resource-efficient data paths.

While the use of eBPF for networking is not novel, most current solutions are monoliths designed for specific use cases and deployments. Reusing functionality from such large monoliths is generally hard for a number of practical reasons. To begin the extraction process, one needs to identify the specific functionality and its control and data dependencies in the large code base. Similarly, adding functionality to such monoliths is non-trivial, as one would need to make a large number of changes to the program at different control points. Without a complete understanding of the program, extracting functionality from a monolith or extending it is error-prone.

In the OPENED project, we propose to 1) build a composable framework that enables stitching eBPF networking modules together to build customized datapaths, thereby enabling developers to rapidly prototype and try out new functionality in eBPF without having to intricately understand large monolithic code bases, and 2) explore using AF_XDP to build hybrid data paths where the bulk of packet processing happens in user space, thereby leveraging the flexibility and ease of development offered by user space tooling. As part of the first goal, we also consider automatic transformation of code written for one hook point to another. For example, a developer should be able to extract functionality from Katran’s XDP-based load balancer, transform it for the TC hook point, and hook it into their datapath seamlessly. Finally, we compare the benefits of each approach in terms of performance, resource efficiency and feature velocity.
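The notion of stitching independent modules into one datapath can be modelled in a few lines of Python. This is only a conceptual sketch, not the OPENED framework or eBPF code: the module names, the verdict strings and the packet representation are hypothetical, and real eBPF modules would be kernel programs attached at a hook point rather than Python callables.

```python
from typing import Any, Callable, Dict, List, Set

Packet = Dict[str, Any]
Module = Callable[[Packet], str]

# Illustrative verdicts, loosely modelled on pass/drop return codes.
PASS, DROP = "PASS", "DROP"


def firewall(blocked_ports: Set[int]) -> Module:
    """Build a module that drops packets sent to a blocked port."""
    def module(pkt: Packet) -> str:
        return DROP if pkt["dst_port"] in blocked_ports else PASS
    return module


def load_balancer(backends: List[str]) -> Module:
    """Build a module that rewrites the destination to one backend."""
    def module(pkt: Packet) -> str:
        # Deterministic choice derived from the source address and port.
        index = (sum(pkt["src_ip"].encode()) + pkt["src_port"]) % len(backends)
        pkt["dst_ip"] = backends[index]
        return PASS
    return module


def compose(modules: List[Module]) -> Module:
    """Chain modules; the first verdict other than PASS ends processing."""
    def datapath(pkt: Packet) -> str:
        for module in modules:
            verdict = module(pkt)
            if verdict != PASS:
                return verdict
        return PASS
    return datapath


backends = ["10.0.0.1", "10.0.0.2"]
datapath = compose([firewall({23}), load_balancer(backends)])

telnet = {"src_ip": "192.168.1.5", "src_port": 40000,
          "dst_ip": "203.0.113.10", "dst_port": 23}
http = {"src_ip": "192.168.1.5", "src_port": 40000,
        "dst_ip": "203.0.113.10", "dst_port": 80}

telnet_verdict = datapath(telnet)
http_verdict = datapath(http)
```

Because each module only sees a packet and returns a verdict, modules can be reordered, removed or swapped without touching one another, which is the property a composable framework aims for.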

aani4.png

As part of this effort, we look forward to actively collaborating with academia and other industry partners in developing a vibrant open source ecosystem of tooling that will enable easy development of eBPF-based Network Functions (NFs). Please feel free to reach out to us (palani.kodeswaran@in.ibm.com or sayandes@in.ibm.com) if you have any questions or would like to collaborate.

Technical Resources and Products