Workshop paper

Multiscale Byte Language Models -- A Hierarchical Architecture for Causal Million-Length Sequence Modeling

Abstract

Bytes form the basis of the digital world and are thus a promising building block for multimodal foundation models. Yet the excessive length of bytestreams requires new architectural paradigms for Byte Language Models (BLMs). We therefore present the Multiscale BLM (MBLM), a model-agnostic hierarchical decoder stack that allows training with context windows of 5M bytes on a single GPU in full model precision. Our experiments demonstrate that hybrid Transformer/Mamba architectures handle extremely long byte sequences efficiently during training while achieving near-linear generational efficiency. The source code has already been publicly released and MBLM can be installed from PyPI (links blinded for review).
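To give a rough intuition for the hierarchical decoder stack described above, the following PyTorch sketch factorizes a byte sequence into patches, runs a causal global stage over patch summaries, and a causal local stage over the bytes within each patch. All module names, dimensions, the mean-pooled patch summaries, and the use of Transformer layers for both stages are illustrative assumptions and do not reflect the released MBLM implementation, which supports arbitrary (e.g., hybrid Transformer/Mamba) blocks at each stage.

```python
# Illustrative two-level hierarchical byte decoder (not the released MBLM code).
import torch
import torch.nn as nn


class HierarchicalByteDecoder(nn.Module):
    """Global stage over patch summaries; local stage over bytes in each patch."""

    def __init__(self, d_model: int = 256, patch_size: int = 16, n_heads: int = 4):
        super().__init__()
        self.patch_size = patch_size
        self.byte_embed = nn.Embedding(256, d_model)
        # Stand-ins for the global and local blocks of the hierarchy.
        self.global_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.local_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.lm_head = nn.Linear(d_model, 256)

    def forward(self, bytes_in: torch.Tensor) -> torch.Tensor:
        b, n = bytes_in.shape
        p = self.patch_size
        assert n % p == 0, "sequence length must be a multiple of the patch size"
        x = self.byte_embed(bytes_in)                           # (b, n, d)
        patches = x.view(b, n // p, p, -1).mean(dim=2)          # (b, n/p, d) patch summaries
        causal = nn.Transformer.generate_square_subsequent_mask(n // p, device=x.device)
        g = self.global_layer(patches, src_mask=causal, is_causal=True)
        # Shift global states by one patch so bytes in patch i only see patches < i.
        g = torch.cat([torch.zeros_like(g[:, :1]), g[:, :-1]], dim=1)
        # Broadcast each patch's global context to its bytes and refine locally.
        local_in = (x + g.repeat_interleave(p, dim=1)).view(b * (n // p), p, -1)
        local_mask = nn.Transformer.generate_square_subsequent_mask(p, device=x.device)
        h = self.local_layer(local_in, src_mask=local_mask, is_causal=True)
        return self.lm_head(h.view(b, n, -1))                   # next-byte logits


if __name__ == "__main__":
    model = HierarchicalByteDecoder()
    tokens = torch.randint(0, 256, (2, 64))                     # toy byte sequence
    print(model(tokens).shape)                                  # torch.Size([2, 64, 256])
```

Because the global stage attends only over patch summaries, its quadratic cost grows with the number of patches rather than the number of bytes, which is what makes multi-million-byte context windows tractable in this kind of design.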

Related Work