About the Data

Arc has developed dedicated datasets to map single-cell genetic perturbations. Competitors are welcome to use these data, along with other relevant public or proprietary data, to build a predictive model. This page describes the Challenge datasets in detail, including how they were created, how they are structured, and the shape of the datasets that will be available for training, validation, and the final test.


Arc-generated Benchmark Dataset

For this challenge, we used single-cell functional genomics to generate approximately 300,000 single-cell RNA-seq profiles by silencing 300 carefully selected genes using CRISPR interference (CRISPRi). To obtain single-cell gene expression profiles we used 10x Genomics GEM-X Flex and Illumina sequencing. The data are split into three groups for the Virtual Cell Challenge, to allow for training, validation of initial results, and developing a final entry for the competition.

Challenge Dataset

Participants will receive:

  • Transcriptomic reference of unperturbed H1 human embryonic stem cells (hESCs)
  • Training set consisting of single-cell profiles for 150 gene perturbations (~150,000 cells)
  • Validation set of 50 gene perturbations, for which entrants’ predicted transcriptomic results will be used to create a live ranking leaderboard during the challenge
  • Final test set of 100 held-out perturbations, released on October 27, 2025, a week prior to the submission deadline.
    Final rankings will be based solely on performance on the final test set.

File Formats

Competitors will be able to download the training and validation datasets once they register and are logged into the application, and can access the final test dataset in the final week of the Challenge.

Training data
Training Set: adata_Training.h5ad
15 GB
221,273 n_obs (cells) × 18,080 n_vars (genes)
Gene Expression File in AnnData H5AD format

Obs (field: example values)

  • cell barcode-batch (index): AAACAAGCAACCTTGTACTTTAGG-Flex_1_01, TTTGGACGTGGTGCAGATTCGGTT-Flex_3_16
  • target_gene: CHMP3, non-targeting
  • guide_id: CHMP3_P1P2_A|CHMP3_P1P2_B, non-targeting_00035|non-targeting_03439
  • batch: Flex_1_01, Flex_3_16

Var (field: example values)

  • gene name (index): SAMD11, NOC2L
  • gene_id (from Ensembl): ENSG00000187634, ENSG00000188976

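The example values above suggest how the obs fields can be taken apart. The following is a minimal sketch, assuming the cell index is a barcode and a batch joined by the first hyphen, and that guide_id lists its guides separated by "|". Both conventions are inferred from the two example rows only, and the helper names are hypothetical.

```python
from typing import List, NamedTuple


class CellIndex(NamedTuple):
    barcode: str
    batch: str


def split_cell_index(cell_index: str) -> CellIndex:
    """Split 'BARCODE-BATCH' on the first hyphen (assumed convention)."""
    barcode, batch = cell_index.split("-", 1)
    return CellIndex(barcode, batch)


def split_guide_ids(guide_id: str) -> List[str]:
    """Split a guide_id value into individual guides (assumed '|' separator)."""
    return guide_id.split("|")


# Values copied from the example rows shown above.
example = split_cell_index("AAACAAGCAACCTTGTACTTTAGG-Flex_1_01")
guides = split_guide_ids("CHMP3_P1P2_A|CHMP3_P1P2_B")
print(example.barcode, example.batch, guides)
```

If the assumption holds, the batch recovered from the index should match the separate batch column, which gives a quick consistency check on the obs table.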
Control Cells

There are 38,176 unperturbed control cells in the training data denoted with a target_gene value of ‘non-targeting’. Competitors can optionally predict expression values for the control set during submission or copy expression values over from the training set.
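Separating control cells from perturbed cells comes down to comparing target_gene against the 'non-targeting' label. The sketch below illustrates this on a small hypothetical list of labels rather than the real file; in practice the labels would come from the target_gene column of the obs table.

```python
from collections import Counter
from typing import Iterable, Tuple

CONTROL_LABEL = "non-targeting"  # target_gene value that marks control cells


def count_controls(target_genes: Iterable[str]) -> Tuple[int, int]:
    """Return (number of control cells, number of perturbed cells)."""
    counts = Counter(
        "control" if gene == CONTROL_LABEL else "perturbed"
        for gene in target_genes
    )
    return counts["control"], counts["perturbed"]


# Hypothetical toy labels, not taken from the Challenge data.
toy_labels = ["CHMP3", "non-targeting", "CHMP3", "non-targeting", "non-targeting"]
n_control, n_perturbed = count_controls(toy_labels)
print(n_control, n_perturbed)
```

Run on the full training set, the control count should equal the 38,176 cells stated above.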

Validation data
Validation Set: pert_counts_Validation.csv
1 KB
50 rows (target_genes)

Field name: Description

  • target_gene: Gene symbol targeted for perturbation
  • n_cells: Recommended number of cells to predict for each perturbation to maximize model performance
  • median_umi_per_cell: The median number of Unique Molecular Identifiers per cell for each perturbation

First rows of the file:

target_gene,n_cells,median_umi_per_cell
SH3BP4,2925,54551.0
ZNF581,2502,53803.5
ANXA6,2496,55175.0
PACSIN3,2101,54088.0
MGST1,2096,54217.5
IGF1R,2056,53993.0
ITGAV,2034,55356.0
SLIRP,2000,54438.5
CTSV,1989,53173.0
MTFR1,1787,53795.0
...,...,...

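The validation file is a plain three-column CSV, so it can be read with the standard library alone. The sketch below parses only the ten rows shown on this page, embedded as a string; the function name is hypothetical, and the real file would be opened from disk instead.

```python
import csv
import io

# Only the ten rows listed on this page; the real file has 50.
SAMPLE_CSV = """target_gene,n_cells,median_umi_per_cell
SH3BP4,2925,54551.0
ZNF581,2502,53803.5
ANXA6,2496,55175.0
PACSIN3,2101,54088.0
MGST1,2096,54217.5
IGF1R,2056,53993.0
ITGAV,2034,55356.0
SLIRP,2000,54438.5
CTSV,1989,53173.0
MTFR1,1787,53795.0
"""


def read_pert_counts(handle):
    """Parse perturbation rows into dicts with typed values."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({
            "target_gene": row["target_gene"],
            "n_cells": int(row["n_cells"]),
            "median_umi_per_cell": float(row["median_umi_per_cell"]),
        })
    return rows


rows = read_pert_counts(io.StringIO(SAMPLE_CSV))
total_cells = sum(r["n_cells"] for r in rows)
print(len(rows), total_cells)  # 10 21986
```

Summing n_cells gives a rough idea of how many cells a submission would need to generate if it follows the recommended counts.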
Final Test data
Final Test Set: pert_counts_Test.csv
1.9 KB
100 rows (target_genes)
Released Oct 27, 2025

Target Gene Selection

We selected these perturbations to represent a broad range in the strength of downstream changes (measured by the number of significantly differentially expressed genes, known as DEGs) and in the phenotypic diversity of perturbation effects, as observed in our initial broad, low-depth screen in H1 hESCs, while ensuring adequate representation in existing public datasets. The data from that initial screen are not part of the Virtual Cell Challenge and are not publicly available.

Our methodology for finalizing the list of target perturbations started with filtering lists obtained by applying the following methods to ~2,500 genes.

Perturbation Bins by Effect size:

Strong

31% of perturbations (935), each resulting in more than 100 differentially expressed genes

Subtle

25% of perturbations (754), each resulting in between 10 and 100 differentially expressed genes

Negligible

44% of perturbations (1,316), each having very little or no effect
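The binning rule can be written down in a few lines. This is a minimal sketch, assuming that "between 10 and 100" includes both endpoints and that "very little or no effect" means fewer than 10 DEGs; the page does not state the exact boundary handling, and the gene names and DEG counts below are hypothetical.

```python
from collections import Counter


def effect_bin(n_degs: int) -> str:
    """Assign a perturbation to an effect-size bin from its DEG count."""
    if n_degs > 100:
        return "strong"
    if n_degs >= 10:  # assumed inclusive lower bound
        return "subtle"
    return "negligible"


# Hypothetical DEG counts per perturbation, for illustration only.
toy_deg_counts = {"GENE_A": 350, "GENE_B": 100, "GENE_C": 10, "GENE_D": 3, "GENE_E": 0}
bins = {gene: effect_bin(n) for gene, n in toy_deg_counts.items()}
bin_sizes = Counter(bins.values())
print(bins)
print(dict(bin_sizes))
```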

We further refined this list to focus on genes with diverse phenotypic effects, based on our initial screen results and existing Gene Ontology annotations. We also ensured that the selected genes had previously been perturbed in other cell types across many of the publicly available perturbation datasets. This allows the Challenge evaluations to test whether entrants’ models can generalize from the cell types used in those studies to the H1 hESCs used in the Challenge.


Arc Virtual Cell Atlas

This recently released resource is composed of scBaseCount and Tahoe-100M, the largest collections of publicly available observational and perturbational single-cell RNA sequencing datasets, respectively. With single-cell data from over 300 million cells, this repository offers the community an ever-expanding training set for the next generation of virtual cell models.

Arc Institute
scBaseCount

A continuously updated single-cell RNA-seq database that employs a SparkAI workflow to automate discovery and standardized preprocessing of publicly available data. scBaseCount comprises over 300 million cells (and expanding), spanning 26 organisms and 72 tissues, with 150 genes specifically annotated.

Tahoe
Tahoe-100M

The world’s largest single-cell dataset generated and open-sourced by Tahoe, containing 100 million cells from ~60,000 drug perturbation experiments, mapping the response of 50 cancer models to 1,100+ drug treatments.


Public Perturbation Datasets

Competitors are encouraged to use publicly available external resources, including gene expression datasets and pre-trained models, as long as they are properly credited. We are pleased that there are many data generation projects underway for computational modeling, and we hope the Challenge will inspire yet more efforts to generate high-quality reference datasets. Below is a selection of publicly available perturbation datasets which, in addition to the Arc Virtual Cell Atlas, might be useful:

Genome-scale Perturb-seq targeting all expressed genes with CRISPR interference (CRISPRi) across >2.5 million human cells (K562 and RPE1). The K562 genome-wide dataset contains perturbations that overlap with most of the genes used in the Arc VCC training and validation datasets.

Single-cell CRISPR screens of DepMap Common Essential Genes in Jurkat and HepG2 cells.

Jiang et al., 2025 (Paper, Dataset)

Perturb-seq experiments in six different cancer cell lines from different tissues of origin: A549 (lung), MCF7 (breast), HT29 (colon), HAP1 (bone marrow), BxPC3 (pancreas), and K562 (bone marrow).

Srivatsan et al., 2020 (Paper, Dataset)

Introduces “sci-Plex,” which uses “nuclear hashing” to quantify global transcriptional responses to thousands of independent perturbations at single-cell resolution and applies it to screen three cancer cell lines exposed to 188 compounds.

McFaline-Figueroa et al., 2024 (Paper, Dataset)

Introduces sci-Plex-Gene-by-Environment, a platform for combined single-cell genetic and chemical screening at scale and applies it to screen combinations of chemical and genetic perturbations in glioblastoma cell lines.

Parse: 10 Million Human PBMCs in a Single Experiment (Dataset)

This dataset contains 90 cytokine perturbation responses in peripheral blood mononuclear cells (PBMCs) from 12 donors ranging across 18 cell types.