Saurabh Paul, Christos Boutsidis, et al.
JMLR
Large Language Models have transformed cloud computing, but their deployment poses a trilemma among operational cost, energy consumption, and performance requirements. This keynote presents a novel open architecture that harmonizes multiple efficiency techniques to address these competing concerns. We examine critical optimization strategies, including quantization, batching, KV-caching, auto-scaling, model parallelism, and specialized hardware accelerators, analyzing both their individual strengths and their compounding benefits when integrated as a cohesive system.