About the Data

The Virtual Cell Challenge has developed new datasets for studying single-cell genetic perturbations. Competitors will use these data, along with other relevant public or proprietary data, to build a predictive model. This page describes the Challenge datasets in detail, including how they were created, how they are structured, and the shape of the training, validation, and final test sets that you will receive.

Commentary in Cell

Arc-generated Perturbation Benchmark

For this challenge, we used single-cell functional genomics (scFG) to generate approximately 300,000 single-cell RNA-seq profiles by silencing 300 carefully selected genes using CRISPR interference (CRISPRi).

We selected these perturbations to span a broad range of downstream effect strengths (measured as the number of significant differentially expressed genes, or DEGs) and of phenotypic diversity, as observed in our initial broad, low-depth screen in H1 hESCs, while ensuring adequate representation in existing public datasets.

Our methodology for finalizing the list of target perturbations started with binning Arc’s high-coverage, cross-technology H1 data of ~2,500 genes.

Perturbation bins by effect size:

  • Strong: 25% of perturbations (612), with more than 100 differentially expressed genes
  • Subtle: 29% of perturbations (704), with between 10 and 100 differentially expressed genes
  • Negligible: 46% of perturbations (1,118), with very little or no effect
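
The binning rule above can be sketched in a few lines of Python. This is only an illustration: the page does not say how DEGs were called, and treating "very little or no effect" as fewer than 10 DEGs is an assumption. The gene names and DEG counts in `example` are hypothetical placeholders, not Challenge data. The last lines check that the reported bin sizes reproduce the stated percentages.

```python
from collections import Counter

# Thresholds taken from the bin descriptions on this page. Treating
# "very little or no effect" as fewer than 10 DEGs is an assumption.
STRONG_MIN_DEGS = 100  # strictly more than this many DEGs -> "strong"
SUBTLE_MIN_DEGS = 10   # from this many up to 100 DEGs -> "subtle"


def effect_bin(n_degs: int) -> str:
    """Assign a perturbation to an effect-size bin from its DEG count."""
    if n_degs > STRONG_MIN_DEGS:
        return "strong"
    if n_degs >= SUBTLE_MIN_DEGS:
        return "subtle"
    return "negligible"


def bin_fractions(deg_counts: dict[str, int]) -> dict[str, float]:
    """Return the fraction of perturbations that fall in each bin."""
    counts = Counter(effect_bin(n) for n in deg_counts.values())
    total = sum(counts.values())
    return {name: counts[name] / total for name in ("strong", "subtle", "negligible")}


# Hypothetical DEG counts, for illustration only (not Challenge data).
example = {"GENE_A": 250, "GENE_B": 42, "GENE_C": 3, "GENE_D": 0}
fractions = bin_fractions(example)

# Sanity check on the bin sizes reported above: 612 + 704 + 1,118 = 2,434.
reported = {"strong": 612, "subtle": 704, "negligible": 1118}
reported_total = sum(reported.values())
reported_pct = {name: round(100 * n / reported_total) for name, n in reported.items()}
print(fractions, reported_pct)
```

The three reported bins sum to 2,434 perturbations, which is consistent with the "~2,500 genes" figure and with the rounded percentages.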

We further refined this list to focus on genes with diverse phenotypic effects, based on our initial screen results and existing Gene Ontology annotations. Cells undergo processes and occupy states beyond those induced by the applied perturbation, so the data contain variation that is unrelated to the perturbation itself. Machine learning models can disentangle perturbation-related variation from these other forms of variation.

Finally, we identified the training dataset by looking at perturbations that overlap between Arc’s H1 screen data and other sources, such as Replogle GWPS, Replogle Essential, McFaline-Figueroa, Jiang, and Feng.
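
As a rough illustration of the overlap step, the sketch below intersects a reference set of perturbed genes with the gene sets of other datasets and counts how many datasets cover each gene. The function names, dataset labels, and gene identifiers are all hypothetical placeholders. The actual selection criteria are not specified on this page.

```python
def shared_perturbations(
    reference: set[str], others: dict[str, set[str]]
) -> dict[str, set[str]]:
    """For each external dataset, return the perturbed genes it shares with the reference."""
    return {name: reference & genes for name, genes in others.items()}


def coverage(reference: set[str], others: dict[str, set[str]]) -> dict[str, int]:
    """Count, for each reference gene, how many external datasets also perturb it."""
    return {
        gene: sum(gene in genes for genes in others.values())
        for gene in sorted(reference)
    }


# Hypothetical gene sets, for illustration only (not the real screens).
h1_screen = {"GENE_A", "GENE_B", "GENE_C"}
public = {
    "dataset_1": {"GENE_A", "GENE_B", "GENE_X"},
    "dataset_2": {"GENE_B", "GENE_Y"},
}

overlap = shared_perturbations(h1_screen, public)
n_sources = coverage(h1_screen, public)
print(overlap, n_sources)
```

Genes with higher coverage counts are the ones with more external data available for pretraining or transfer.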


Training, Validation and Test

We developed a rigorously validated perturbation dataset by selecting 300 genes that show measurable effects upon perturbation, as determined by the number of differentially expressed genes and on-target knockdown efficiency. These genes were identified using results from a human embryonic stem cell line (H1 hESC) perturbation screen across ~2,500 genes.

Challenge Dataset

Participants will receive:

  • Transcriptomic reference of unperturbed H1 hESCs
  • Training set consisting of single-cell profiles for 150 gene perturbations (~150,000 cells)
  • Validation set of 50 genes whose perturbations are used to create a live ranking leaderboard during the challenge
  • Final test set of 100 held-out perturbations, released on October 27, 2025, one week before the submission deadline.
    Final rankings will be based solely on performance on the held-out test set.

Arc Virtual Cell Atlas

This recently released resource is composed of scBaseCount and Tahoe-100M, the largest collections of publicly available observational and perturbational single-cell RNA sequencing datasets, respectively. With single-cell data from over 350 million cells and counting, this repository offers the community an ever-expanding training set for the next generation of virtual cell models.

scBaseCount

A continuously updated single-cell RNA-seq database that employs an agentic AI workflow to automate discovery and standardized preprocessing of publicly available data. scBaseCount comprises over 230 million cells (and expanding), spanning 21 organisms and 72 tissues.

Tahoe-100M
The world’s largest single-cell dataset generated and open-sourced by TahoeBio, containing 100M cells from ~60,000 drug perturbation experiments, mapping the response of 50 cancer models to 1,100+ drug treatments.

File Formats

Competitors will be able to download the training, validation, and final test sets once they register and are logged into the application. The final test set becomes available in the final week of the challenge.

1. Training Set: 150-gene-training_set.h5ad.gz (18.55 GB)
2. Validation Set: 50-gene-validation_set.csv (5.81 MB)
3. Final Test Set: 100-gene-final-test_set.csv (7.42 MB), released Oct 27, 2025
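
A minimal sketch of how the downloaded files might be opened in Python follows. It assumes the training file is a gzip-compressed AnnData `.h5ad` file, as its name suggests, and that the third-party `anndata` package is installed. The helper names are hypothetical. The CSV column names are not documented on this page, so the sketch only reads the header row rather than assuming any.

```python
import csv
import gzip
import shutil
from pathlib import Path


def decompress_gz(src: Path, dst: Path) -> Path:
    """Stream-decompress a .gz file to dst without loading it all into memory."""
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst


def csv_header(path: Path) -> list[str]:
    """Return the column names (first row) of a CSV file."""
    with open(path, newline="") as handle:
        return next(csv.reader(handle))


def load_training(h5ad_path: Path):
    """Open a decompressed .h5ad file with anndata (assumed to be installed)."""
    import anndata as ad  # third-party; imported lazily so the helpers above work without it

    # backed="r" keeps the expression matrix on disk, which helps with a file this large.
    return ad.read_h5ad(h5ad_path, backed="r")


if __name__ == "__main__":
    gz_path = Path("150-gene-training_set.h5ad.gz")
    if gz_path.exists():
        # with_suffix("") strips only the trailing ".gz".
        adata = load_training(decompress_gz(gz_path, gz_path.with_suffix("")))
        print(adata)
    csv_path = Path("50-gene-validation_set.csv")
    if csv_path.exists():
        print(csv_header(csv_path))
```

Because the compressed training file is 18.55 GB, the decompressed copy will be larger still, so check for sufficient free disk space before running the decompression step.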