Accelerating scientific discovery through AI relies on the availability of high-quality data from scientific experimentation. Yet, scientific experimentation suffers from poor reproducibility and data capture challenges, mostly stemming from the difficulty in transcribing all details of an experiment and the different ways in which individuals document their lab work. With the emergence of foundation models capable of processing multiple data modalities including vision and language, there is a unique opportunity to redefine data and metadata capture and the corresponding scientific documentation process. In this contribution, we discuss the challenges associated with lab digitization today and how multi-modal learning with transformer-based architectures can contribute to a new research infrastructure for scientific discovery in order to fully describe experimental methods and outcomes while facilitating data sharing and collaboration. We present a case study on a hybrid digital infrastructure and transformer-based vision-language models to transcribe high-dimensional raw data streams from non-invasive recording devices that represent the interaction of researchers with lab environments during scientific experimentation. The infrastructure is demonstrated in test cases related to semiconductor research and wet chemistry, where we show how vision-language foundation models fine-tuned on a limited set of experiments can be used to generate reports that exhibit high similarity with the recorded procedures. Our findings illustrate the feasibility of using foundation models to automate data capture and digitize all aspects of scientific experimentation, and suggest that the challenge of scarce training data for specific laboratory procedures can be alleviated by leveraging self-supervised pretraining on more abundant data from other domains.