Saurabh Paul, Christos Boutsidis, et al.
JMLR
Training large language models (LLMs) for programming tasks requires diverse and syntactically valid input data. While data augmentation can enhance generalization, uncontrolled complexity may lead to overfitting or invalid examples. In this work, we introduce a grammar-based augmentation method that systematically generates program-like data with controlled complexity. By leveraging formal grammars, our approach ensures syntactic correctness while promoting semantic diversity. Preliminary experiments demonstrate that our method produces well-distributed training datasets, improving model robustness without compromising generalization. This grammar-aware strategy offers a scalable and principled solution for augmenting structured data in LLM training.
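The abstract describes sampling syntactically valid, program-like strings from a formal grammar while bounding complexity. A minimal sketch of that idea, using a hypothetical toy expression grammar and a recursion-depth budget as the complexity control (the grammar, function names, and depth mechanism are illustrative assumptions, not the paper's actual method):

```python
import random

# Hypothetical toy grammar: nonterminals map to lists of productions.
# A symbol that appears as a key is expanded recursively; anything
# else is emitted as a terminal.
GRAMMAR = {
    "expr": [["term", "+", "expr"], ["term"]],
    "term": [["factor", "*", "term"], ["factor"]],
    "factor": [["(", "expr", ")"], ["num"]],
    "num": [["0"], ["1"], ["x"]],
}

def generate(symbol="expr", max_depth=5, rng=random):
    """Sample a syntactically valid string from GRAMMAR.

    max_depth caps recursion, controlling example complexity: once
    the budget is exhausted, only productions that introduce no
    further nonterminals are chosen, falling back to the last
    (base-case) production when none exists.
    """
    if symbol not in GRAMMAR:
        return symbol  # terminal symbol: emit as-is
    productions = GRAMMAR[symbol]
    if max_depth <= 0:
        safe = [p for p in productions
                if not any(s in GRAMMAR for s in p)]
        productions = safe or [productions[-1]]
    prod = rng.choice(productions)
    return "".join(generate(s, max_depth - 1, rng) for s in prod)
```

Because every output is derived from the grammar, syntactic validity is guaranteed by construction, and varying `max_depth` across samples yields a dataset with a controlled complexity distribution — e.g. `generate("expr", max_depth=0)` always produces a single atomic term, while larger budgets admit nested expressions.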