Arc has developed dedicated datasets to map single-cell genetic perturbations. Competitors are welcome to use these data, along with other relevant public or proprietary data, to build a predictive model. This page describes the Challenge datasets in detail: how they were created, how they are structured, and the shape of the datasets that will be available for training, validation, and the final test.
Commentary in Cell
For this challenge, we used single-cell functional genomics to generate approximately 300,000 single-cell RNA-seq profiles by silencing 300 carefully selected genes using CRISPR interference (CRISPRi). To obtain single-cell gene expression profiles, we used 10x Genomics GEM-X Flex and Illumina sequencing. The data are split into three groups for the Virtual Cell Challenge, to allow for training, validation of initial results, and developing a final entry for the competition.
Participants will receive:
Competitors will be able to download the training and validation datasets once they register and are logged into the application, and can access the final test dataset in the final week of the Challenge.
Obs
There are 38,176 unperturbed control cells in the training data denoted with a target_gene value of ‘non-targeting’. Competitors can optionally predict expression values for the control set during submission or copy expression values over from the training set.
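As a minimal sketch of the "copy over" option, the snippet below selects control cells by their target_gene label and copies their expression rows. It assumes per-cell labels and expression values are held in plain Python lists; the function names and the toy values are illustrative, and the real data will be far larger and is likely stored in a matrix with a per-cell annotation table.

```python
CONTROL_LABEL = "non-targeting"  # label used for unperturbed control cells

def control_indices(target_genes):
    """Return the positions of cells whose target_gene marks them as controls."""
    return [i for i, gene in enumerate(target_genes) if gene == CONTROL_LABEL]

def copy_controls(target_genes, expression):
    """Return independent copies of the expression rows of all control cells."""
    return [list(expression[i]) for i in control_indices(target_genes)]

# Toy data: four cells, two genes each (values are made up for illustration).
target_genes = ["non-targeting", "SH3BP4", "non-targeting", "ZNF581"]
expression = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0], [0.0, 0.0]]

controls = copy_controls(target_genes, expression)
print(len(controls), "control cells copied")
```

Copying the rows, rather than referencing them, keeps later edits to a prediction from altering the training values.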
target_gene,n_cells,median_umi_per_cell
SH3BP4,2925,54551.0
ZNF581,2502,53803.5
ANXA6,2496,55175.0
PACSIN3,2101,54088.0
MGST1,2096,54217.5
IGF1R,2056,53993.0
ITGAV,2034,55356.0
SLIRP,2000,54438.5
CTSV,1989,53173.0
MTFR1,1787,53795.0
...,...,...
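A minimal sketch of loading this per-perturbation summary, assuming it is available as CSV text with the columns target_gene, n_cells, and median_umi_per_cell. The inline sample reuses the first three rows shown above; the names load_summary and SAMPLE_CSV are illustrative and not part of any Challenge tooling.

```python
import csv
import io

# First rows of the summary shown above, used here as inline sample data.
SAMPLE_CSV = """target_gene,n_cells,median_umi_per_cell
SH3BP4,2925,54551.0
ZNF581,2502,53803.5
ANXA6,2496,55175.0
"""

def load_summary(text):
    """Parse the summary CSV into a list of rows with typed values."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        rows.append({
            "target_gene": record["target_gene"],
            "n_cells": int(record["n_cells"]),
            "median_umi_per_cell": float(record["median_umi_per_cell"]),
        })
    return rows

summary = load_summary(SAMPLE_CSV)
total_cells = sum(row["n_cells"] for row in summary)
largest = max(summary, key=lambda row: row["n_cells"])
print(total_cells, largest["target_gene"])
```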
We selected these perturbations to span a broad range of downstream effect strengths (measured as the number of significantly differentially expressed genes, or DEGs) and phenotypic diversity, as observed in our initial broad, low-depth screen in H1 hESCs, while ensuring adequate representation in existing public datasets. The data from that initial screen are not part of the Virtual Cell Challenge and are not publicly available.
To finalize the list of target perturbations, we started by filtering candidate lists obtained by applying the following methods to ~2,500 genes.
Perturbation bins by effect size:
31% of perturbations (935), each resulting in more than 100 differentially expressed genes
25% of perturbations (754), each resulting in between 10 and 100 differentially expressed genes
44% of perturbations (1,316), each having very little or no effect
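The binning above can be sketched as a small function that maps a perturbation's DEG count to an effect-size bin, together with the share of each bin computed from the listed counts. The bin names and the exact boundary handling (more than 100, 10 through 100, fewer than 10) are assumptions for illustration; the source does not say how boundary values were treated.

```python
def effect_bin(n_degs, strong=100, weak=10):
    """Assign a perturbation to an effect-size bin by its DEG count.

    Boundary handling is an assumption: >100 is "strong",
    10..100 is "moderate", and <10 is "minimal".
    """
    if n_degs > strong:
        return "strong"
    if n_degs >= weak:
        return "moderate"
    return "minimal"

# Perturbation counts per bin, as listed above.
bin_counts = {"strong": 935, "moderate": 754, "minimal": 1316}
total = sum(bin_counts.values())

# Share of each bin as a whole-number percentage of all binned perturbations.
shares = {name: round(100 * count / total) for name, count in bin_counts.items()}
print(total, shares)
```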
We further refined this list to focus on genes with diverse phenotypic effects, based on our initial screen results and existing Gene Ontology annotations. We also ensured that the selected genes had previously been perturbed in other cell types across many of the publicly available perturbation datasets. This lets the Challenge evaluations test whether entrants’ models can generalize from the cell types used in those studies to the H1 hESCs used in the Challenge.
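One way to check this kind of representation for your own training data is to measure how many Challenge target genes were also perturbed in an external dataset. The sketch below assumes both are available as lists of gene symbols; the external gene list is hypothetical and the function name is illustrative.

```python
def perturbation_overlap(challenge_genes, external_genes):
    """Return the shared target genes and the fraction of Challenge genes covered."""
    challenge = set(challenge_genes)
    shared = challenge & set(external_genes)
    fraction = len(shared) / len(challenge) if challenge else 0.0
    return sorted(shared), fraction

# Four Challenge target genes from the summary on this page.
challenge_genes = ["SH3BP4", "ZNF581", "ANXA6", "PACSIN3"]
# Hypothetical list of genes perturbed in some external dataset.
external_genes = ["ANXA6", "ZNF581", "GATA1"]

shared, fraction = perturbation_overlap(challenge_genes, external_genes)
print(shared, fraction)
```

A low fraction suggests the external dataset offers little direct supervision for the Challenge perturbations, although it may still help a model learn general expression structure.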
The Arc Virtual Cell Atlas is a recently released resource composed of scBaseCount and Tahoe-100M, the largest collections of publicly available observational and perturbational single-cell RNA sequencing data, respectively. With single-cell data from over 300 million cells, this repository offers the community an ever-expanding training set for the next generation of virtual cell models.
scBaseCount is a continuously updated single-cell RNA-seq database that employs a SparkAI workflow to automate discovery and standardized preprocessing of publicly available data. It comprises over 300 million cells (and expanding), spanning 26 organisms and 72 tissues, with 150 genes specifically annotated.
Tahoe-100M is the world’s largest single-cell dataset, generated and open-sourced by Tahoe. It contains 100 million cells from ~60,000 drug perturbation experiments, mapping the response of 50 cancer models to 1,100+ drug treatments.
Competitors are encouraged to use publicly available external resources, including gene expression datasets and pre-trained models, as long as they are properly credited. We are pleased that many data generation projects for computational modeling are underway, and we hope the Challenge will inspire yet more efforts to generate high-quality reference datasets. Below is a selection of publicly available perturbation datasets that, in addition to the Arc Virtual Cell Atlas, might be useful:
Genome-scale Perturb-seq targeting all expressed genes with CRISPR interference (CRISPRi) across >2.5 million human cells (K562 and RPE1). The K562 genome-wide dataset contains perturbations that overlap with most of the genes used in the Arc VCC training and validation datasets.
Single-cell CRISPR screens of DepMap Common Essential Genes in Jurkat and HepG2 cells.
This dataset contains 90 cytokine perturbation responses in peripheral blood mononuclear cells (PBMCs) from 12 donors, spanning 18 cell types.