The Virtual Cell Challenge has developed new datasets to understand single-cell genetic perturbations. Competitors will use these data, along with other relevant public or proprietary data, to build a predictive model. This page describes the Challenge datasets in detail, including how they were created, how they are structured, and the shape of the training, validation, and final test datasets you will receive.
Commentary in Cell
For this challenge, we used single-cell functional genomics (scFG) to generate approximately 300,000 single-cell RNA-seq profiles by silencing 300 carefully selected genes using CRISPR interference (CRISPRi).
We selected these perturbations to span a broad range of downstream effect strengths (measured by the number of significant differentially expressed genes, DEGs) and phenotypically diverse perturbation effects, as observed in our initial broad, low-depth screen in H1 hESCs, while ensuring adequate representation in existing public datasets.
Our methodology for finalizing the list of target perturbations started with binning Arc’s high-coverage, cross-technology H1 data of ~2,500 genes.
Perturbation bins by effect size:
25% of perturbations (612) with more than 100 differentially expressed genes
29% of perturbations (704) with between 10 and 100 differentially expressed genes
46% of perturbations (1,118) with very little or no effect
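The binning above can be sketched in a few lines of Python. This is only an illustration: the bin names (`strong`, `moderate`, `weak`), the gene names, and the example DEG counts are hypothetical, and the sketch assumes that "very little or no effect" means fewer than 10 DEGs and that counts of exactly 10 or 100 fall in the middle bin. The final lines recompute the reported percentages from the reported bin sizes.

```python
from collections import Counter


def effect_bin(n_degs: int) -> str:
    """Assign a perturbation to an effect-size bin from its DEG count.

    Boundary handling (10 and 100 both land in the middle bin) is an
    assumption; the source only gives the ranges informally.
    """
    if n_degs > 100:
        return "strong"
    if n_degs >= 10:
        return "moderate"
    return "weak"


# Hypothetical DEG counts per target gene, for illustration only.
deg_counts = {"GENE_A": 250, "GENE_B": 42, "GENE_C": 3, "GENE_D": 0, "GENE_E": 100}
bin_sizes = Counter(effect_bin(n) for n in deg_counts.values())

# Bin sizes as reported in the text; the shares are recomputed from them.
reported = {"strong": 612, "moderate": 704, "weak": 1118}
total = sum(reported.values())
shares = {name: round(100 * n / total) for name, n in reported.items()}

print(bin_sizes)
print(total, shares)
```

The three reported bins sum to 2,434 perturbations, which is consistent with the "~2,500 genes" figure quoted for the H1 data.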
We further refined this list to focus on genes with diverse phenotypic effects, based on our initial screen results and existing Gene Ontology annotations. Cells undergo processes and occupy states beyond what the applied perturbation captures. Machine learning models can disentangle perturbation-related variation from these other sources of variation.
Finally, we identified the training dataset by looking at perturbations that overlap between Arc’s H1 screen data and other sources such as Replogle GWPS, Replogle Essential, McFaline-Figueroa, Jiang, and Feng.
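One way to picture the overlap step is as a set intersection between the H1 target list and each external dataset's perturbed genes. The sketch below is a minimal illustration, not the procedure actually used: the function name `shared_targets` and all gene names are hypothetical, and only two of the named sources are shown with placeholder contents.

```python
def shared_targets(h1_targets, external):
    """Map each H1 target gene to the external datasets that also perturb it.

    Genes that appear in no external dataset are left out of the result.
    """
    coverage = {}
    for gene in sorted(h1_targets):
        sources = sorted(name for name, genes in external.items() if gene in genes)
        if sources:
            coverage[gene] = sources
    return coverage


# Placeholder gene sets; real target lists would come from each dataset.
h1_targets = {"GENE_A", "GENE_B", "GENE_C"}
external = {
    "Replogle GWPS": {"GENE_A", "GENE_B", "GENE_X"},
    "Replogle Essential": {"GENE_A"},
}

coverage = shared_targets(h1_targets, external)
for gene, sources in coverage.items():
    print(gene, "->", ", ".join(sources))
```

In this toy example `GENE_C` is dropped because no external source perturbs it, which mirrors the idea of preferring training perturbations that are also represented in public data.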
We developed a rigorously validated perturbation dataset by selecting 300 genes that have shown measurable effects upon perturbation, as determined by the number of differentially expressed genes and on-target knockdown efficiency. These genes were identified using results from a human embryonic stem cell line (H1 hESC) perturbation screen across ~2,500 genes.
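As a rough illustration of the two selection criteria, the sketch below filters candidate genes on DEG count and on-target knockdown efficiency. It assumes knockdown efficiency is defined as the fractional loss of target-gene expression relative to control cells; that definition, both thresholds, and all values are hypothetical and are not taken from the Challenge.

```python
def knockdown_efficiency(control_mean: float, perturbed_mean: float) -> float:
    """Fraction of target-gene expression lost relative to control cells."""
    if control_mean <= 0:
        raise ValueError("control expression must be positive")
    return 1.0 - perturbed_mean / control_mean


# Hypothetical cut-offs; the source does not state the values actually used.
MIN_DEGS = 10
MIN_KNOCKDOWN = 0.5

# (gene, number of DEGs, mean target expression in control, in perturbed cells)
candidates = [
    ("GENE_A", 250, 8.0, 1.0),  # strong effect, efficient knockdown
    ("GENE_B", 42, 4.0, 3.0),   # moderate effect, weak knockdown
    ("GENE_C", 3, 6.0, 0.6),    # efficient knockdown, little downstream effect
]

selected = [
    gene
    for gene, n_degs, ctrl, pert in candidates
    if n_degs >= MIN_DEGS and knockdown_efficiency(ctrl, pert) >= MIN_KNOCKDOWN
]
print(selected)
```

Requiring both criteria separates perturbations that silence their target and change the transcriptome from those that do only one of the two.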
Participants will receive,
This recently released resource is composed of scBaseCount and Tahoe-100M, the largest collections of publicly available observational and perturbational single-cell RNA sequencing datasets, respectively. With single-cell data from over 350 million cells and counting, this repository offers the community an ever-expanding training set for the next generation of virtual cell models.
scBaseCount is a continuously updated single-cell RNA-seq database that employs an agentic AI workflow to automate discovery and standardized preprocessing of publicly available data. It comprises over 230 million cells (and expanding), spanning 21 organisms and 72 tissues.
Competitors are encouraged to use publicly available external resources, including gene expression datasets and pre-trained models, as long as they are properly credited.
Competitors will be able to download the training, validation, and final test datasets (the final test set becomes available in the final week) once they have registered and are logged in to the application.