Creating labs that learn through automated data management
Abstract
Learning from laboratory data at scale is a bottleneck in research workflows. To overcome it, we consider and propose solutions to six fundamental questions: Who captured the data? What data was captured? When was it captured? Where was it captured? Why was the experiment carried out? And how? Electronic laboratory notebooks (ELNs) have traditionally been used to record answers to these questions. However, they impose a heavy burden on the researcher by requiring manual data entry. In addition, their records do not include information about the environment in which the researcher generated the data, for instance which instruments were used, which software version was running, and which actions led to the data. Thus, there is a need to improve traceability throughout the prototypical research workflow. To address these concerns, we propose a data management infrastructure that captures a richer representation of experimental workflows. The infrastructure is divided into a primary component deployed on one or more cloud platforms and a minimal component installed on local laboratory instruments. Data generated during a workflow is automatically tied to the corresponding action, which improves the reproducibility and traceability of experiments and automates data entry. Each experimental step, the related workflows, and the data are readily accessible through cloud-based services. In turn, the infrastructure provides a framework for systematic and homogeneous data collection that facilitates the application of machine learning (ML) to experimental data. This relieves the researcher of the burden of recording experimental data while giving ML tools a richer representation of the experiments carried out. In addition, the precise tracking of experiments fosters collaboration between researchers, who can exchange workflows and related data in a common format.
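
As an illustration of what tying data to its corresponding action could look like, the following is a minimal sketch in Python and not the infrastructure described in this work. All names here (`ActionRecord`, `record_action`, `to_json`, and the field layout) are hypothetical and chosen only to mirror the six questions of who, what, when, where, why, and how; the choice of a SHA-256 digest to identify the data and of JSON as the common exchange format are likewise assumptions.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class ActionRecord:
    """Hypothetical record answering the six questions for one workflow action."""
    who: str    # researcher identifier
    what: str   # SHA-256 digest of the captured data
    when: str   # ISO 8601 timestamp (UTC by default)
    where: str  # instrument identifier
    why: str    # workflow or experiment the action belongs to
    how: dict   # action name, software version, and parameters


def record_action(researcher, instrument, workflow, action,
                  software_version, parameters, data, now=None):
    """Tie raw instrument output (bytes) to the action that produced it."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    return ActionRecord(
        who=researcher,
        what=hashlib.sha256(data).hexdigest(),
        when=timestamp,
        where=instrument,
        why=workflow,
        how={
            "action": action,
            "software_version": software_version,
            "parameters": parameters,
        },
    )


def to_json(record):
    """Serialize a record to JSON with sorted keys (assumed exchange format)."""
    return json.dumps(asdict(record), sort_keys=True)


if __name__ == "__main__":
    # Illustrative values only; none of these identifiers come from the paper.
    rec = record_action(
        researcher="researcher-1",
        instrument="spectrometer-A",
        workflow="workflow-42",
        action="measure",
        software_version="1.0.0",
        parameters={"exposure_s": 2},
        data=b"raw instrument output",
    )
    print(to_json(rec))
```

In such a sketch the record is created by the component on the instrument at the moment the action runs, so the researcher enters nothing by hand, and the serialized form could be sent to a cloud service or exchanged between collaborators.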