Taku Ito, Luca Cocchi, et al.
ICML 2025
Bytes form the basis of the digital world and thus are a promising building block for multimodal foundation models. Yet the excessive length of bytestreams requires new architectural paradigms for Byte Language Models (BLMs). Therefore, we present the Multiscale BLM (MBLM), a model-agnostic hierarchical decoder stack that allows training with context windows of 5M bytes on a single GPU in full model precision. Our experiments demonstrate that hybrid Transformer/Mamba architectures are efficient in handling extremely long byte sequences during training while achieving near-linear generational efficiency. The source code has been publicly released and MBLM can be installed from PyPI (links blinded for review).
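The abstract describes MBLM as a hierarchical decoder stack over byte sequences. As a rough illustration only, the sketch below shows a minimal two-stage (global/local) byte decoder in that spirit, written in plain PyTorch; the class and parameter names (`TinyMultiscaleByteDecoder`, `patch_size`, etc.) are hypothetical and it does not reproduce the released MBLM implementation or its hybrid Transformer/Mamba stages.

```python
import torch
import torch.nn as nn


class TinyMultiscaleByteDecoder(nn.Module):
    """Hypothetical two-stage hierarchy: a global stage over patch embeddings
    conditions a local stage that predicts next-byte logits within each patch."""

    def __init__(self, patch_size: int = 8, d_model: int = 128,
                 n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.patch_size = patch_size
        self.byte_emb = nn.Embedding(256, d_model)  # one embedding per byte value

        def make_stage() -> nn.TransformerEncoder:
            return nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True),
                n_layers,
            )

        self.global_stage = make_stage()  # coarse, long-range stage over patches
        self.local_stage = make_stage()   # fine stage over bytes inside a patch
        self.head = nn.Linear(d_model, 256)

    def forward(self, byte_ids: torch.Tensor) -> torch.Tensor:
        # byte_ids: (B, L) integer tensor with L divisible by patch_size
        B, L = byte_ids.shape
        P = self.patch_size
        x = self.byte_emb(byte_ids)                     # (B, L, D)
        patches = x.view(B, L // P, P, -1).mean(dim=2)  # pool bytes -> patch embeddings
        gmask = nn.Transformer.generate_square_subsequent_mask(L // P)
        ctx = self.global_stage(patches, mask=gmask)    # (B, L/P, D)
        # Broadcast each patch's global context back onto its bytes. A real
        # implementation would shift the context by one patch to stay strictly
        # autoregressive; this sketch skips that for brevity.
        x = x + ctx.repeat_interleave(P, dim=1)
        lmask = nn.Transformer.generate_square_subsequent_mask(P)
        x = self.local_stage(x.reshape(B * (L // P), P, -1), mask=lmask)
        return self.head(x.reshape(B, L, -1))           # (B, L, 256) next-byte logits


if __name__ == "__main__":
    toy = torch.randint(0, 256, (2, 64))                # two sequences of 64 bytes
    print(TinyMultiscaleByteDecoder()(toy).shape)       # torch.Size([2, 64, 256])
```

The appeal of such a hierarchy is that the expensive long-range stage only sees L / patch_size positions, which is presumably what makes multi-megabyte byte contexts tractable on a single GPU.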
Gosia Lazuka, Andreea Simona Anghel, et al.
SC 2024
Yidi Wu, Thomas Bohnstingl, et al.
ICML 2025
Robert Farrell, Rajarshi Das, et al.
AAAI-SS 2010