19 Aug 2021

Crowdsourcing to trace cell lineages

DREAM AI competition helps explain how cells turn into specialists, and where the process can go awry.

Every organism starts as a single cell. Through repeated rounds of cell division, that cell gives rise to a complex organism that can be composed of trillions of cells of many different types. How are the descendants of one cell in a human embryo fated to become heart muscle, while those of another become nerve tissue?

Tracing cell lineages provides the framework for understanding when, where, and how cell fate decisions are made. And because something can go wrong at each cell division, the ability to trace the path of cell differentiation provides insight into how developmental problems and diseases arise.

In our recently published Cell Systems paper, “Benchmarked approaches for reconstruction of in vitro cell lineages and in silico models of C. elegans and M. musculus developmental trees,” we detail a potential way forward for solving larger cell lineage trees, including the human one. [1]

A cellular tree of life

Just as Charles Darwin mapped the evolutionary trees of organisms, geneticists are now reconstructing cell lineage trees—the developmental histories of cells as they divide and differentiate. We’re far from fully understanding the lineages of complex organisms, but artificial intelligence can help accelerate our discovery and understanding.

The first cell lineage tree, for the tiny nematode Caenorhabditis elegans (C. elegans), was painstakingly completed decades ago with the aid of an optical microscope.

Cell lineage reconstruction has taken a leap forward in the past decade with the advent of CRISPR-based molecular tools. An array of DNA target sites is inserted into a cell’s genome, and CRISPR then edits those sites, creating a set of mutations at a specific location in the cell’s genetic material. Because the mutations are passed on to daughter cells, they act as a marker that lets researchers infer the subsequent branches of that cell’s lineage.

Different versions of CRISPR-based methods have followed, but these methods have limitations. There’s a lot of “noise” in the data as CRISPR mutations happen with some randomness—and sequencing sometimes does not catch all mutations. There are also restrictions in the total available DNA “memory” for recording, as an actual array of characters targeted by CRISPR is inserted in the cell’s genome. In addition, the same edit patterns sometimes spontaneously appear in non-related cells, or the CRISPR-edited sequence is deleted during cell division.
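To make these limitations concrete, here is a minimal simulation sketch of a CRISPR-like recorder. It is an illustration under simplifying assumptions, not the simulator used in the challenge: every cell divides in sync, each unedited site mutates independently with probability `mu` per division, edits are irreversible, and a readout misses each site with probability `dropout`. The function name, the parameters, and the `-1` code for a missing site are all hypothetical.

```python
import random

def simulate_lineage(depth, n_sites=10, n_states=5, mu=0.1, dropout=0.05, seed=0):
    """Simulate a CRISPR-like recorder over `depth` rounds of cell division.

    Returns a dict mapping each final cell's division path (e.g. "0110")
    to its observed character array, where 0 means unedited,
    1..n_states is an edit outcome, and -1 is a site missed at readout.
    """
    rng = random.Random(seed)

    def edit(array):
        # Each still-unedited site mutates with probability mu.
        # Edited sites never change again (the recording is irreversible).
        return [
            rng.randint(1, n_states) if s == 0 and rng.random() < mu else s
            for s in array
        ]

    # Start from one cell whose array is completely unedited.
    cells = {"": [0] * n_sites}
    for _ in range(depth):
        next_generation = {}
        for path, array in cells.items():
            # Both daughters inherit the parent's edits, then record on their own.
            next_generation[path + "0"] = edit(array)
            next_generation[path + "1"] = edit(array)
        cells = next_generation

    # Imperfect readout: each site is independently missed with probability dropout.
    return {
        path: [-1 if rng.random() < dropout else s for s in array]
        for path, array in cells.items()
    }

leaves = simulate_lineage(depth=4)
print(len(leaves), "cells;", "cell 0000 reads", leaves["0000"])
```

The small values of `n_sites` and `n_states` mirror the limited recording "memory": with few sites and few possible edits, unrelated cells can end up with the same pattern by chance.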

Another issue: There’s no way to check whether or not these inferences are accurate.

A crowdsourcing challenge

My colleagues Michael Elowitz, a Caltech professor of biology and bioengineering, and Jay Shendure, a professor of genome sciences at the University of Washington, and I had the idea of applying machine learning (ML) techniques to this work. Could AI find new algorithms for reconstructing cell lineages and create benchmarks for their accuracy?

To find algorithms to test, we organized the Allen Institute lineage reconstruction DREAM Challenge, which ran from October 2019 through February 2020. The DREAM (Dialogue for Reverse Engineering Assessments and Methods) Challenge was a way to study efforts to reverse-engineer biological networks from high-throughput datasets, or solutions that use large amounts of data.

We produced three training sets for the participant teams to use. They included cell lineage trees, both in vitro and in silico, composed of 100 cells, 1,000 cells (C. elegans), and 10,000 cells (a mouse simulation), along with each cell’s character array as modified by a CRISPR-like mechanism. We provided the gold-standard tree solutions so the teams could train or optimize their methods.
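Having a gold-standard tree is what makes scoring possible. As a rough sketch of how a reconstruction can be compared with the true tree, the snippet below counts the groups of cells (clades) that appear in one rooted tree but not the other, a Robinson-Foulds-style distance. This metric is an assumption chosen for illustration; the text above does not say which scores the challenge used. Trees are written as nested tuples of cell names, and the function names are hypothetical.

```python
def clades(tree):
    """Return the set of leaf groups (as frozensets) under each internal node."""
    found = set()

    def leaves(node):
        if not isinstance(node, tuple):
            # A leaf: just a cell name.
            return frozenset([node])
        # An internal node: its clade is the union of its children's leaves.
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    leaves(tree)
    return found

def rf_distance(tree_a, tree_b):
    """Count clades that appear in exactly one of the two rooted trees."""
    return len(clades(tree_a) ^ clades(tree_b))

gold = (("a", "b"), ("c", "d"))
guess = (("a", "c"), ("b", "d"))
print(rf_distance(gold, gold), rf_distance(gold, guess))
```

A distance of zero means the two trees group the cells identically; larger values mean more disagreements.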

Graphical abstract of in vitro and in silico cell lineage trees.

Benchmarking AI approaches

With 10 teams making a total of 22 submissions across the three challenges, we found some clear winners. Both original methods, DCLEAR (distance-based cell lineage reconstruction) and AMberLand, excelled. The former uses an estimation of the distance between cells based on different character lengths; the latter applies, in a novel way, a classic ML method called gradient boosting, which produces a prediction model by aggregating many small decision trees.
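For readers who want a feel for the distance-based idea, here is a deliberately simple sketch. It is not DCLEAR: it swaps in a plain Hamming distance (the fraction of sites where two cells differ) and average-linkage clustering, and it assumes missing sites are coded as `-1`. All names are hypothetical.

```python
from itertools import combinations

def hamming(a, b):
    """Fraction of sites that differ, skipping sites missing (-1) in either cell."""
    pairs = [(x, y) for x, y in zip(a, b) if x != -1 and y != -1]
    if not pairs:
        return 1.0  # nothing comparable: treat the cells as maximally distant
    return sum(x != y for x, y in pairs) / len(pairs)

def build_tree(cells):
    """Build a nested-tuple tree from {name: character_array} by average linkage."""
    names = sorted(cells)
    dist = {
        frozenset((a, b)): hamming(cells[a], cells[b])
        for a, b in combinations(names, 2)
    }
    # Each cluster is (subtree, list of member cell names).
    clusters = [(name, [name]) for name in names]

    def average(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        total = sum(dist[frozenset((a, b))] for a in c1[1] for b in c2[1])
        return total / (len(c1[1]) * len(c2[1]))

    while len(clusters) > 1:
        # Merge the two closest clusters into a new internal node.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]

cells = {
    "a": [1, 2, 0, 0],
    "b": [1, 2, 0, 3],
    "c": [4, 0, 5, 0],
    "d": [4, 0, 5, 1],
}
print(build_tree(cells))
```

Cells that share more edits end up closer together, so they are joined earlier and sit nearer each other in the reconstructed tree.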

We also learned that the choice of mutation rate and the diversity of mutations in the simulations have a strong effect on the accuracy of cell lineage reconstruction. There’s a sweet spot between “too low” and “too high” mutation rates.
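A toy calculation gives some intuition for why a sweet spot exists. This is not the analysis from the paper; it assumes each site is edited independently with a constant probability `mu` per division and that edits are permanent. Under those assumptions, the chance that a site stays blank for k-1 divisions and then records an edit at division k is mu * (1 - mu)^(k - 1), which is largest when mu is about 1/k.

```python
def prob_first_edit_at(mu, k):
    """Probability a site stays unedited for k-1 divisions, then is edited at division k."""
    return mu * (1 - mu) ** (k - 1)

depth = 10
for mu in (0.01, 0.1, 0.5):
    early = prob_first_edit_at(mu, 1)
    late = prob_first_edit_at(mu, depth)
    print(f"mu={mu}: edit at division 1 = {early:.4f}, at division {depth} = {late:.4f}")
```

With a very low rate, few divisions leave any mark at all; with a very high rate, sites are used up in the first few divisions and later branches go unrecorded.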

Our results show that it could be possible to train algorithms on smaller trees and then use these to build algorithms for reconstructing much larger trees—even the human one. Having a training set of trees with the actual solution was essential for both developing new approaches and reaching this conclusion.

If we can unravel the mystery of how the human body originates from a single cell, that knowledge could put us on the path to new treatments and even cures for congenital diseases and developmental problems, from Down syndrome to cancer. The complexity of this problem and the potentially huge payoff make it a perfect challenge for AI. This also shows that there are still many fields where AI can be applied and have a real impact in improving predictions. All that’s needed are new ideas and new datasets.


  1. Gong, W., Granados, A., Hu, J., et al. Benchmarked approaches for reconstruction of in vitro cell lineages and in silico models of C. elegans and M. musculus developmental trees. Cell Systems, Volume 12, Issue 8, pp. 810-826.e4, August 18, 2021.