Multimodal foundation models for more reproducible scientific experimentation and data capture
Abstract
Following a surge of interest in large language models for scientific discovery, there is ample opportunity for foundation models to support the laboratory of the future by capturing and processing multiple data modalities. In this contribution, we report the use of large vision-language models for action recognition in first-person, egocentric scene recordings and evaluate the feasibility of automatically transcribing laboratory procedures in real time. Leveraging an in-house dataset of egocentric videos of prototypical chemistry actions, we present benchmarks of different approaches to action recognition: zero-shot predictions by large models trained on broad data, action classification by fine-tuned models, and the effect of including additional visual cues in the data, such as gaze coordinates. We conclude by discussing the potential benefits and challenges of implementing the technology in the fields of research and education.