Open Source Summit is the premier event for open source developers, technologists, and community leaders to collaborate, share information, solve problems, and gain knowledge, furthering open source innovation and ensuring a sustainable open source ecosystem. It is the gathering place for open-source code and community contributors.
In the rapidly evolving landscape of Generative AI, it is becoming clearer every day that open strategies are key drivers of innovation and widespread adoption. This session will present the Generative AI Commons, its activities, and its deliverables to date, including the Model Openness Framework (MOF) and the Responsible Generative AI Framework.
The Generative AI Commons is an LF AI & Data initiative dedicated to fostering the democratization, advancement, and adoption of efficient, secure, reliable, and ethical open source Generative AI innovations through neutral governance, open and transparent collaboration, and education.
Speakers: Arnaud Le Hors, IBM Research & Ofer Hermoni, iForAI
Software supply chain attacks are an emerging threat for today’s enterprises. An attacker first gains access to the target enterprise’s internal network, typically through social engineering. Next, the attacker obtains administrator access to one of the enterprise’s software supply chains. Finally, the attacker injects a backdoor into a built artifact and steals confidential information or digital assets from the enterprise, or even worse, from its customers.
A critical attack surface here is the administrator of the software supply chain. Confidential Containers is an open source project that protects containers from administrators by using trusted execution environments (TEEs). It protects a Kubernetes pod from a cluster administrator by running the pod inside a TEE and validating it through remote attestation.
This talk presents a use case of Confidential Containers to protect a Tekton task. You will learn how Confidential Containers protects a task and its artifacts even when the cluster administrator is compromised.
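As an illustrative sketch (not taken from the talk), the snippet below uses the Kubernetes Python client to request a pod on a Confidential Containers runtime class; the runtime class name and image are assumptions that depend on your installation.

```python
# Minimal sketch: schedule a pod on a Confidential Containers (CoCo) runtime
# using the Kubernetes Python client. The runtime class name and image are
# assumptions; adjust them to match your CoCo installation.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="protected-task", namespace="default"),
    spec=client.V1PodSpec(
        runtime_class_name="kata-qemu-coco-dev",  # placeholder CoCo runtime class
        containers=[
            client.V1Container(
                name="step",
                image="registry.example.com/build-step:latest",  # hypothetical image
                command=["/bin/run-task"],
            )
        ],
        restart_policy="Never",
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```

In a real deployment, the workload's secrets would only be released after the TEE passes remote attestation (for example via a key broker service), which is what keeps a compromised cluster administrator from reading the task's artifacts.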
Speaker: Tatsushi Inagaki, IBM Research
Large Language Models (LLMs) require preprocessing vast amounts of data, a process that can span days given its complexity and scale, often involving petabytes of data. This talk demonstrates how Kubeflow Pipelines (KFP) simplifies LLM data processing with flexibility, repeatability, and scalability. These pipelines are used daily at IBM Research to build indemnified LLMs tailored for enterprise applications.
Different data preparation toolkits are built on Kubernetes, Rust, Slurm, or Spark. How would you choose one for your own LLM experiments or enterprise use cases, and why should you consider Kubernetes and KFP?
This talk describes how the open source Data Prep Toolkit leverages KFP and KubeRay for scalable orchestration of pipeline steps such as deduplication, content classification, and tokenization.
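As a rough sketch of what such a pipeline can look like (the component bodies below are placeholders, not the actual Data Prep Toolkit transforms, which launch Ray jobs via KubeRay), KFP lets you declare each preprocessing step as a containerized component and chain them:

```python
# Minimal Kubeflow Pipelines (KFP v2) sketch of a data-prep pipeline.
# The component bodies are placeholders for illustration only.
from kfp import compiler, dsl


@dsl.component(base_image="python:3.11")
def deduplicate(input_path: str) -> str:
    print(f"deduplicating {input_path}")
    return input_path + ".dedup"


@dsl.component(base_image="python:3.11")
def tokenize(input_path: str) -> str:
    print(f"tokenizing {input_path}")
    return input_path + ".tokens"


@dsl.pipeline(name="llm-data-prep")
def data_prep(input_path: str = "s3://bucket/raw"):
    # Chain the steps: dedup first, then tokenize the deduplicated output.
    deduped = deduplicate(input_path=input_path)
    tokenize(input_path=deduped.output)


if __name__ == "__main__":
    compiler.Compiler().compile(data_prep, "data_prep_pipeline.yaml")
```

The compiled YAML can then be submitted to a KFP instance, which gives every run the same repeatable, versioned graph of steps.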
We share challenges, lessons, and insights from our experience with KFP, highlighting its applicability for diverse LLM tasks, such as data preprocessing, RAG retrieval, and model fine-tuning.
Speakers: Anish Asthana, Red Hat & Mohammad Nassar, IBM Research
Confidential AI leveraging GPUs can bring AI to the masses without sacrificing the privacy of end users. Individual open source technologies already exist to configure, deploy, and manage confidential TEEs. However, cobbling together a multitude of components into a coherent, secure, and efficient solution is challenging, with many pitfalls. For example, depending on the use cases and the parties involved (cloud, model, and service owners), attestation and key management methodologies can vary drastically. In addition, for TEEs with confidential GPUs, the complexity extends to increased load times, affecting services that serve multiple models.
This talk will go through the key components and design decisions needed to enable confidential AI, specifically: (i) the implications of different trust models on the solution, and (ii) performance tradeoff considerations. To concretize the discussion, we will present a detailed end-to-end 'how to' for deploying an inference service on NVIDIA H100 GPUs and AMD-based TEEs, with a focus on protecting the model and the user input. The audience will come away understanding why there can be no one-size-fits-all confidential AI solution and which design works for them.
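To make the trust-model point concrete, here is a deliberately simplified sketch of attestation-gated key release; every endpoint, field, and helper name is hypothetical, and a production flow would rely on a hardware-backed attestation verifier and a real key broker service rather than this toy logic.

```python
# Hypothetical sketch: release the model decryption key only after the TEE's
# attestation evidence has been verified. URLs and payload fields are
# assumptions for illustration, not a real attestation protocol.
import requests

ATTESTATION_VERIFIER = "https://verifier.example.com/verify"  # hypothetical
KEY_BROKER = "https://kbs.example.com/release-model-key"      # hypothetical


def fetch_model_key(evidence: bytes, expected_measurement: str) -> bytes:
    # 1. Send the TEE's attestation evidence (CPU and GPU) to a verifier.
    report = requests.post(ATTESTATION_VERIFIER, data=evidence, timeout=30).json()

    # 2. Only proceed if the measured launch state matches the expected value.
    if report.get("measurement") != expected_measurement:
        raise RuntimeError("attestation failed: unexpected TEE measurement")

    # 3. Ask the key broker to release the model key, presenting the
    #    verifier's signed attestation token.
    resp = requests.post(KEY_BROKER, json={"token": report["token"]}, timeout=30)
    resp.raise_for_status()
    return resp.content
```

Which party runs the verifier and the key broker is exactly where the different trust models diverge, which is why the same code structure can imply very different security guarantees.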
Speakers: Julian Stephen & Michael Le, IBM Research
When faced with a difficult challenge, it sometimes helps to look back at lessons from ancient history to guide your thinking. The Open Source Initiative (OSI) is working to create a definition of Open Source AI (OSAID), aiming to apply open source principles to artificial intelligence development, but the 1.0 version is clearly a work in progress. Can it find success? How might policy-makers react? Join this session to hear about the latest efforts to define open source AI and what's likely in store for 2025.
Speaker: Jeffrey Borek, IBM Research
With the increase in generative AI model use, there is growing concern about how models can divulge information or generate inappropriate content. This concern is driving the development of technologies to “guardrail” user interactions with models. Some of these guardrail models are simple classification models, while others, like IBM’s Granite Guardian or Meta’s Llama Guard, are themselves generative models able to identify multiple risks. As new models appear, a variety of large language model serving solutions are being developed and optimized. One open source example, vLLM, has become an increasingly popular serving engine.
In this talk, I’ll discuss how we built an open source adapter on top of vLLM to serve an API for guardrail models, so that models like Granite Guardian and Llama Guard can be easily applied as guardrails in generative AI workflows.
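As a hedged illustration of the general pattern (not the adapter's actual API), a guardrail model served by vLLM can be queried through vLLM's OpenAI-compatible endpoint and its verdict checked before a request reaches the main model; the base URL, model name, and the yes/no parsing convention below are assumptions.

```python
# Sketch: screen a user prompt with a guardrail model served by vLLM via its
# OpenAI-compatible API. Model id, port, and response convention are assumed.
from openai import OpenAI

guardrail = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")


def is_risky(user_prompt: str) -> bool:
    completion = guardrail.chat.completions.create(
        model="ibm-granite/granite-guardian-3.1-2b",  # placeholder model id
        messages=[{"role": "user", "content": user_prompt}],
        temperature=0.0,
        max_tokens=5,
    )
    verdict = completion.choices[0].message.content.strip().lower()
    return verdict.startswith("yes")  # assumed convention: "Yes" == risk detected


if is_risky("How do I exfiltrate customer data?"):
    print("Request blocked by guardrail")
```

The adapter discussed in the talk wraps this kind of interaction behind a dedicated detections API so applications do not have to know each guardrail model's prompting conventions.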
Speaker: Evaline Ju, IBM Research
This session introduces **Interns for Open Source (IFOS)**, a program that offers undergraduate and graduate Computer and Information Sciences students hands-on experience with open source projects for academic credit. Over 10 weeks, students bridge classroom learning and real-world application by contributing through issue tracking and pull requests. Their fresh perspectives provide open source communities with valuable feedback, usability insights, and rigorous testing. Students sharpen technical skills, learn professional workflows, and build portfolios. Open source projects benefit from innovative ideas and unbiased input. Attendees will learn about the program structure, its benefits for students and open source communities, and how to get involved.
Speakers: Andy Anderson, IBM Research & Professor Corey Leong, Valencia College
In this talk, we will introduce two open source projects, vLLM and KServe, and explain how they can be integrated to achieve better performance and scalability for LLMs in production. The session will include a demo showcasing their integration.
vLLM is a high-performance library specifically designed for LLM inference and serving, offering cutting-edge throughput and efficiency through techniques such as PagedAttention, continuous batching, and optimized CUDA kernels, making it ideal for production environments that demand fast, large-scale LLM serving.
KServe is a Kubernetes-based platform designed for scalable model deployment. It provides robust features for managing AI models in production, including autoscaling, monitoring, and model versioning.
By combining vLLM's inference optimizations with KServe's scalability, organizations can deploy LLMs effectively in production environments, ensuring fast, low-latency inference and seamless scaling across cloud platforms.
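As a rough sketch of what the integration can look like (the runtime name, model URI, and other fields are placeholders that depend on your cluster and KServe installation), an InferenceService backed by a vLLM-based runtime can be created from Python with the standard Kubernetes client:

```python
# Sketch: create a KServe InferenceService backed by a vLLM-based runtime
# using the Kubernetes Python client. Runtime name and storage URI are
# placeholders; consult your KServe installation for the exact values.
from kubernetes import client, config

config.load_kube_config()

inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "granite-vllm", "namespace": "default"},
    "spec": {
        "predictor": {
            "model": {
                "modelFormat": {"name": "huggingface"},
                "runtime": "kserve-huggingfaceserver",  # assumed vLLM-backed runtime
                "storageUri": "hf://ibm-granite/granite-3.1-8b-instruct",  # placeholder
            }
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="default",
    plural="inferenceservices",
    body=inference_service,
)
```

KServe then handles rollout, autoscaling, and monitoring of the service, while vLLM handles the token-level inference optimizations inside each replica.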
Speaker: Rafael Vasquez, IBM Research
LLMs are hotter than ever, but most LLM-based solutions available to us require you to use models trained on data with unknown provenance, send your most important data off to corporate-controlled servers, and use prodigious amounts of energy every time you write an email.
What if you could design a “second brain” assistant with OSS technologies, that lives on your laptop?
We’ll walk through the OSS landscape, discussing the nuts and bolts of combining Ollama, LangChain, OpenWebUI, Autogen and Granite models to build a fully local LLM assistant. We’ll also discuss some of the particular complexities involved when your solution involves a local quantized model vs one that’s cloud-hosted.
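As a taste of the nuts and bolts (the exact model tag below is an assumption, and you would layer OpenWebUI or Autogen on top for the full assistant), a local Granite model pulled through Ollama can be driven from LangChain in a few lines:

```python
# Minimal local-assistant sketch: a Granite model served by Ollama, driven
# through LangChain. The model tag is an assumption; pull whichever Granite
# build you use with `ollama pull <tag>` first.
from langchain_ollama import ChatOllama

llm = ChatOllama(model="granite3.1-dense:8b", temperature=0.2)  # placeholder tag

response = llm.invoke(
    "Summarize the notes I took yesterday about quantized local models."
)
print(response.content)
```

Everything in this loop runs on the laptop, which is precisely where the latency and quantization tradeoffs discussed below come from.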
In this talk, we'll build on the lightning talk to include complexities like:
- How much latency are you dealing with when you're running on a laptop?
- Does degradation from working with a 7-8B model reduce effectiveness?
- How do reasoning and multimodal abilities help the assistant task?
Speaker: Olivia Buzek, IBM Research
Software supply chain attacks have surged in recent years, posing significant threats to organizations. In response, Software Bills of Materials (SBOMs), structured inventories that document software components, have been proposed to enhance supply chain transparency, track dependencies, and manage vulnerabilities. Despite increasing adoption, their correctness and completeness in real-world open source ecosystems remain largely unexamined. Incomplete SBOMs can result in overlooked vulnerabilities, while incorrect dependency information may waste resources on non-existent issues.
This talk introduces JBomAudit, an open source tool that automatically verifies Java SBOMs by systematically assessing their correctness and completeness against the NTIA minimum requirements. We will cover the technical details of JBomAudit, demonstrate how it detects missing and incorrect dependencies, and present findings from our large-scale analysis of over 25,000 Java SBOMs, highlighting the prevalence of non-compliant SBOMs and their security implications. We will also discuss common pitfalls in SBOM generation, analyze the root causes of non-compliance, and provide actionable recommendations to improve SBOM quality.
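To make the "minimum requirements" idea concrete, the sketch below (far simpler than JBomAudit, which also verifies declared dependencies against the actual Java artifacts) checks a CycloneDX-style JSON SBOM for the NTIA per-component fields and flags dependencies on components missing from the inventory.

```python
# Illustrative NTIA-style completeness check over a CycloneDX JSON SBOM.
# This is a simplification for illustration, not JBomAudit's implementation.
import json


def audit_sbom(path: str) -> list[str]:
    with open(path) as f:
        sbom = json.load(f)

    findings = []
    components = {c.get("bom-ref"): c for c in sbom.get("components", [])}

    # NTIA minimum elements per component: supplier, name, version.
    for ref, comp in components.items():
        for field in ("supplier", "name", "version"):
            if not comp.get(field):
                findings.append(f"{ref}: missing '{field}'")

    # Declared dependencies must reference components present in the SBOM.
    for dep in sbom.get("dependencies", []):
        for target in dep.get("dependsOn", []):
            if target not in components:
                findings.append(f"{dep.get('ref')}: depends on undeclared '{target}'")

    return findings


if __name__ == "__main__":
    for finding in audit_sbom("bom.json"):
        print(finding)
```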
Speakers: Yue Xiao, Jiyong Jang, Douglas Schales & Dhilung Kirat, IBM Research
Large Language Models (LLMs) are reshaping how we build applications; however, efficiently serving them at scale remains a major challenge.
The vLLM serving engine, historically focused on single-node deployments, is now being extended into a full-stack inference system through our open-source project, **vLLM Production Stack**. This extension enables any organization to deploy vLLM at scale with high reliability, high throughput, and low latency. Code: https://github.com/vllm-project/production-stack
At a high level, the vLLM Production Stack project allows users to easily deploy to their Kubernetes cluster through a single command. vLLM Production Stack's optimizations include KV cache sharing to speed up inference (https://github.com/LMCache/LMCache), prefix-aware routing that directs inference queries to vLLM instances holding the corresponding KV caches, and robust observability features for monitoring engine status and autoscaling.
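As a simplified illustration of prefix-aware routing (a toy, not the Production Stack's actual router), requests that share a prompt prefix can be sent to the same vLLM replica so its KV cache is reused:

```python
# Toy sketch of prefix-aware routing: requests sharing a prompt prefix land on
# the same vLLM replica so its KV cache can be reused. The replica list and
# prefix length are assumptions for illustration.
import hashlib

REPLICAS = ["http://vllm-0:8000", "http://vllm-1:8000", "http://vllm-2:8000"]
PREFIX_LEN = 256  # characters of the prompt used as the routing key (assumption)


def pick_replica(prompt: str) -> str:
    prefix = prompt[:PREFIX_LEN]
    digest = hashlib.sha256(prefix.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(REPLICAS)
    return REPLICAS[index]


print(pick_replica("You are a helpful assistant. Summarize the following report: ..."))
```

A production router also accounts for replica load and cache contents, but even this simple scheme shows why routing by prefix keeps expensive prefill work from being repeated across instances.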
Attendees will discover best practices and see real-time demonstrations of how these optimizations work together to enhance LLM inference performance.
Speakers: Junchen Jiang, University of Chicago & Yue Zhu, IBM Research