A community effort to optimize sequence-based deep learning models of gene regulation

Abdul Muntakim Rafi; Daria Nogina; Dmitry Penzar; Dohoon Lee; Danyeong Lee; Nayeon Kim; Sangyeup Kim; Dohyeon Kim; Yeojin Shin; Il-Youp Kwak; Georgy Meshcheryakov; Andrey Lando; Arsenii Zinkevich; Byeong-Chan Kim; Juhyun Lee; Taein Kang; Eeshit Dhaval Vaishnav; Payman Yadollahpour; Yang AiWei; Wang JingZhe; Chen JiaXing; Carl Schmitz; Robert Salomone; Dimitri Perrin; Jake Bradford; Prasanna Kumar S; Krishnakant Gupta; Ashok Palaniappan; Andigoni Malousi; Konstantinos Kyriakidis; Konstantinos Kardamiliotis; Scott Emrich; Ashley Babjac; Owen Queen; Zhixiu Lu; Jesper Madsen; Gabija Kavaliauskaite; Andreas Møller; Onuralp Soylemez; Max Schubach; Pyaree Mohan Dash; Sebastian Röner; Yichao Li; Benedikt Brors; Lars Feuerbach; Cindy Körner; Nicholas Abad; Cen Wan; Carl Barton; Patrick Greaves; Ibrahim Alsaggaf; Kuei-Lin Huang; Jhih-Yu Chen; Chih-Hsun Wu; Sz-Hau Chen; Edward S. C. Shih; Hsuan-Kai Wang; Chih-Han Huang; Tsai-Min Chen; Mao Takatsu; Yu Hiratsuka; Ramana V. Davuluri; Han Liu; Yanrong Ji; Pallavi Surana; Rekha Sathian; Pratik Dutta; Zhihan Zhou; Jiayu Wen; Gunjan Dixit; Ke Ding; Yuedong Yang; Ken Chen; Maolin Ding; Yuanfang Guan; Michele Tinti; David T. Jones; Xiaoting Chen; Peter Koo; Ashwin Narayanan; Aayush Prakash; Ethan Labelson; Yiyang Yu; Rohan Ghotra; Wei Qiu; Wuwei Zhang; Xinming Tu; Tin Nguyen; Duc Tran; Maria-Anna Trapotsi; Fredrik Svensson; Susanne Bornelöv; Sun Kim; Jake Albrecht; Aviv Regev; Wuming Gong; Ivan V. Kulakovskiy; Pablo Meyer; Carl G. de Boer

doi:10.1038/s41587-024-02414-w

Nature Biotechnology

Paper

11 Oct 2024

A community effort to optimize sequence-based deep learning models of gene regulation

Download paper

Abstract

A systematic evaluation of how model architectures and training strategies impact genomics model performance is needed. To address this gap, we held a DREAM Challenge where competitors trained models on a dataset of millions of random promoter DNA sequences and corresponding expression levels, experimentally determined in yeast. For a robust evaluation of the models, we designed a comprehensive suite of benchmarks encompassing various sequence types. All top-performing models used neural networks but diverged in architectures and training strategies. To dissect how architectural and training choices impact performance, we developed the Prix Fixe framework to divide models into modular building blocks. We tested all possible combinations for the top three models, further improving their performance. The DREAM Challenge models not only achieved state-of-the-art results on our comprehensive yeast dataset but also consistently surpassed existing benchmarks on Drosophila and human genomic datasets, demonstrating the progress that can be driven by gold-standard genomics datasets.

Poster