2020 MRS Virtual Spring/Fall Meeting, November 27-December 4, 2020, Fall Abstracts Final Program 2/3/21

Discovery and development of polymer materials is strongly driven by experimental data acquisition: polymer materials have to be prepared and characterized in order to be patentable. The immediate outcome of a polymerization reaction can be transformed into multiple materials with strikingly different properties by means of formulation and processing, and the potential for IP development increases sharply as the polymerization product evolves, via formulation and processing, into the final material. Experimental data acquisition therefore unfolds under conditions of delayed rewards on remarkably rich landscapes shaped by access to multiple experimental degrees of freedom, both continuous (concentration, temperature, radiation, time) and categorical (monomers, catalysts, initiators, solvents) [1,2].

In this contribution, we report results of an ongoing effort to develop an end-to-end reinforcement learning (RL) approach to experimental data acquisition in the polymer materials domain. The application is the development of a simple spin-on glass. The workflow starts with the acquisition of initial experimental data, proceeds to training an RL agent to search for the experimental settings (generating experimental hypotheses) that produce materials with the desired properties, then to applying the trained RL agent to design the experimental plan, and finally to executing the experimental plan on the existing robotic platform.

We identify and discuss the following factors, which we are systematically addressing:

- Direct access to the lab equipment during the training phase is impractical given the complexity of polymer synthesis and processing; the most viable option is therefore to prepare a surrogate model of the system and use it to set up the RL environment.
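As a minimal sketch of such a surrogate-backed environment (the class name, the quadratic response surface, the action set, and the target property value are all illustrative assumptions, not the actual system), one can wrap a surrogate model in the Gym-style reset/step interface without depending on the library itself:

```python
import random

# Hypothetical surrogate: maps (concentration, temperature) to a film
# property (e.g., refractive index). In practice this would be a model
# fitted to the initial experimental data.
def surrogate(conc, temp):
    return 1.40 + 0.05 * conc - 0.0001 * (temp - 150.0) ** 2 / 100.0

class SurrogateSpinOnGlassEnv:
    """Gym-style environment (reset/step) built around the surrogate."""

    TARGET = 1.46  # desired property value (illustrative)
    # Discrete actions: tweak concentration or temperature up/down.
    ACTIONS = [(-0.1, 0.0), (0.1, 0.0), (0.0, -5.0), (0.0, 5.0)]

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        self.conc = self.rng.uniform(0.1, 1.0)     # monomer concentration (a.u.)
        self.temp = self.rng.uniform(100.0, 200.0) # cure temperature (C)
        return (self.conc, self.temp)

    def step(self, action):
        d_conc, d_temp = self.ACTIONS[action]
        self.conc = min(max(self.conc + d_conc, 0.1), 1.0)
        self.temp = min(max(self.temp + d_temp, 100.0), 200.0)
        prop = surrogate(self.conc, self.temp)
        # Dense reward: negative distance of the surrogate's prediction
        # from the target; the episode ends on a close enough match.
        reward = -abs(prop - self.TARGET)
        done = abs(prop - self.TARGET) < 0.005
        return (self.conc, self.temp), reward, done, {}
```

Because the agent only ever queries the surrogate, training requires no lab time; the same interface can later be pointed at the robotic platform for plan execution.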
  The optimal choice of surrogate, as well as the handling of its uncertainties, deserves systematic investigation.
- The task of materials discovery often comprises a search for multiple promising solutions (both search and enumeration), not necessarily global optimization. The search task can be learned in the narrow context of a specific experimental project (cf. navigating one specific maze); a significantly stronger result is to train an RL agent to navigate experimental landscapes as such (cf. learning general strategies of maze navigation) [3]. This brings meta-learning to the top of the priority list.
- RL is notoriously data-hungry, and meta-learning further amplifies the need for abundant and diverse training data. Data augmentation is therefore necessary, particularly through creating families of surrogate models. The specifics of the augmentation that lead to optimal performance need careful investigation.
- As is generally the case with RL, the reward system plays a major role. One possible (and informative) approach is to design reward systems that mimic the effective "rewards" and "penalties" experienced by human researchers solving similar data acquisition tasks.

On the artificial intelligence side, we describe the specifics of the implementation (an OpenAI Gym based environment and the RL algorithms) and discuss the obtained performance metrics and approaches to diagnosing RL agent behavior.

1. Li, H. et al. "Tuning the Molecular Weight Distribution from Atom Transfer Radical Polymerization Using Deep Reinforcement Learning." Mol. Syst. Des. Eng., 2018.
2. Zhou, Z. et al. "Optimizing Chemical Reactions with Deep Reinforcement Learning." ACS Cent. Sci., 2017.
3. Cobbe, K. et al. "Quantifying Generalization in Reinforcement Learning." PMLR 97, 2019.