A Synthetic Data Generation Framework with Chain-of-Thought Reasoning

Indra Priyadarsini S; Seiji Takeda; Sina Asuka Klampt; Takao Moriyama; Lisa Hamada

MRS Fall Meeting 2025

Conference paper

30 Nov 2025

A Synthetic Data Generation Framework with Chain-of-Thought Reasoning

Abstract

Large Language Models (LLMs) have emerged as powerful tools for scientific domains, yet their application in chemistry remains underexplored relative to their potential. This work presents an integrated framework that combines chain-of-thought reasoning with synthetic data generation to enhance molecular property prediction and molecule generation. We present an end-to-end framework that fuses chain-of-thought (CoT) prompting, self-consistency decoding, and parameter-efficient fine-tuning of LLMs to create interpretable, adaptable, and low-resource solutions for molecular prediction and design.

We propose a prompting strategy that guides LLMs to reason step by step about chemical properties, using molecular representations as input. Instead of providing direct outputs, the model is instructed to produce structured rationales—explaining how molecular features influence properties. These rationales not only improve human interpretability but also reduce hallucinations and expose the model’s decision-making process for expert review. We further enable the generation of synthetic chemical data that includes both predictions and reasoning traces. To improve robustness and reduce variability in reasoning quality, we introduce a self-consistency decoding method. By sampling multiple reasoning chains for a given prompt and aggregating outcomes, we obtain more stable and reliable predictions. This approach enhances answer consistency while offering a practical mechanism to estimate confidence through rationale alignment.

Building on these outputs, we generate synthetic datasets that include both property predictions and accompanying rationales. We then fine-tune an LLM with LoRA adapters on this rationale-enriched synthetic dataset, allowing the model to learn domain-relevant reasoning patterns. The resulting framework is evaluated on benchmark molecular property datasets using standard regression metrics and provide performance over zero-shot baselines. Thus we have a unified pipeline from reasoning-based prompting to low-resource fine-tuning and evaluation, designed for rapid iteration and reproducibility.

Conference paper