A Data-driven ML Approach for Maximizing Performance in LLM-Adapter Serving. Ferran Agullo Lopez, Joan Oliveras Torra, et al. NeurIPS 2025.
Towards Efficient Key-Value Cache Management for Prefix Prefilling in LLM Inference. Yue Zhu, Hao Yu, et al. CLOUD 2025.
Voice-based AI Agents: Filling the Economic Gaps in Digital Health Delivery. Bo Wen, Chen Wang, et al. ICDH 2025.
A Practical Guide To Benchmarking AI and GPU Workloads in Kubernetes. Chen Wang, Yuan Chen. KubeCon EU 2025.
A House United Within Itself: SLO-Awareness for On-Premises Containerized ML Inference Clusters via Faro. Beomyeol Jeon, Chen Wang, et al. EuroSys 2025.
Cloud-native Workflow Scheduling using a Hybrid Priority Rule, Dynamic Resource Allocation, and Dynamic Task Partition. Jungeun Shin, Diana Arroyo, et al. SoCC 2024.
Dexter: A Performance-Cost Efficient Resource Allocation Manager for Serverless Data Analytics. Anna Maria Nestorov, Diego Marron, et al. Middleware 2024.
Optimizing GPU Multiplexing for Efficient and Cost-Effective Access to Diverse Large Language Models in GPU Clusters. Yue Zhu, Chen Wang, et al. MASCOTS 2024.