Evaluation

The Virtual Cell Challenge will evaluate submitted results from entrants’ models against the known answers obtained from laboratory testing, and will generate a score for each entry that reflects the entry’s performance. The scoring methodology and other details of the submission process are outlined below.

Task

Predictive models can be trained to generalize along several axes. Two core dimensions are (i) generalization across biological context (e.g. cell type, cell line, culture conditions, or even in vivo versus in vitro settings) and (ii) generalization to novel genetic and/or chemical perturbations, including their combinations.

This challenge focuses on context generalization as a highly challenging real-world task: participants will predict the effects of single-gene perturbations in a test cell type (H1). The transcriptomic effect of these genetic perturbations has been previously reported in at least one other cellular context. This reflects a common experimental reality: testing all perturbations in every context is impractical due to cost, yet accurate context-specific predictions are crucial because responses depend on factors like cell type, state, differentiation stage, culture conditions, and genetic background.

Challenge Task Figure

Given that most published single cell genetic perturbation datasets span only a handful of cell lines, true zero-shot generalization to new cell states is likely premature. A more appropriate strategy at this stage is few-shot adaptation, where a subset of perturbations in the new cellular context is provided to guide model generalization. To support this, we provide expression profiles for a subset of perturbations measured directly in H1 hESCs, enabling participants to adapt their models before predicting responses to the remaining unseen perturbations in the same cell type.

Submissions

Competitors will submit prediction files in the .vcc file format, generated via the cell-eval Python module. The .vcc format is a simple wrapper around an AnnData h5ad file that ensures the submission meets the requirements of the Challenge.

Only one submission is allowed per 24 hours, resetting at midnight UTC. During the initial phase of the Challenge, competitors' results will appear on the public live leaderboard. In the final week of the Challenge, a "final test set" will be released; competitors will submit final results against it, and these will not appear on the leaderboard. The winners of the Challenge will be announced in December based on final submissions.

The results file must contain the gene expression counts predicted by competitors' models for each perturbation in the validation dataset (during the live ranking period) or the final test dataset (during the final week of the Challenge).

Submission File Requirements

For a full walkthrough of generating a valid prediction file, see our tutorial in cell-eval. In conjunction with the cell-eval 'prep' utility, the notebook ensures prediction files meet the following requirements (a minimal construction sketch follows the list):

  • The perturbation column in .obs must be named target_gene. It should contain the gene targeted by CRISPR, or non-targeting for control cells.
  • Your dataset must include exactly 18,080 genes in .var, matching the gene set used during training. Predictions are expected for all 18,080 genes.
  • Refer to the pert_counts_Validation.csv (available for download in the app) to:
    • Select the correct target genes for validation.
    • Determine the recommended number of cells to generate per perturbation. You may include fewer, but this will negatively impact your score.
  • You must include control cells in your submission (rows with target_gene == 'non-targeting'), but the total number of cells (including controls) must not exceed 100,000.
  • Ensure that your expression matrix (.X) is of type float32 to avoid compatibility issues during processing.
  • Count values in your submission must be either integers or log1p-normalized. Other transformations will either give an error or an incorrect score. You should not submit counts that are normalized but not log transformed.
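
As referenced above, here is a minimal sketch of assembling a prediction file that satisfies these requirements. It is not the official tutorial code: the helper names (build_prediction, predict_fn, control_block) and the n_cells column name are illustrative assumptions, and the resulting file still needs to be run through the cell-eval 'prep' utility to produce the .vcc submission.

```python
import numpy as np
import pandas as pd
import anndata as ad

def build_prediction(pert_counts_csv, gene_names, predict_fn, control_block):
    """Assemble an AnnData that meets the submission requirements above.

    pert_counts_csv: path to pert_counts_Validation.csv (target genes and recommended cell counts)
    gene_names:      the 18,080 training genes, in the training order
    predict_fn:      your model; returns an (n_cells x 18,080) array for a given target gene
    control_block:   an (n_ctrl x 18,080) array of non-targeting control cells
    """
    pert_counts = pd.read_csv(pert_counts_csv)
    blocks, labels = [], []
    for _, row in pert_counts.iterrows():
        n_cells = int(row["n_cells"])                        # column name is an assumption
        blocks.append(predict_fn(row["target_gene"], n_cells))
        labels += [row["target_gene"]] * n_cells

    # control cells are required; total cells (perturbed + controls) must not exceed 100,000
    blocks.append(control_block)
    labels += ["non-targeting"] * control_block.shape[0]

    return ad.AnnData(
        X=np.vstack(blocks).astype(np.float32),              # float32 expression matrix
        obs=pd.DataFrame({"target_gene": labels}),
        var=pd.DataFrame(index=pd.Index(gene_names)),
    )

# adata = build_prediction("pert_counts_Validation.csv", gene_names, model.predict, controls)
# adata.write_h5ad("my_prediction.h5ad")  # then run the cell-eval 'prep' utility on this file
```
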
Sample Submission: sample_pred.h5ad
98,297 n_obs (target_gene) × 18,080 n_vars (genes)
A sample submission file in AnnData H5AD format after it has been run through the cell-eval 'prep' utility.

Obs

index   target_gene
0       UQCRB
1       non-targeting
2       VCL
3       non-targeting
4       non-targeting
...     ...
98922   PACSIN3
98923   non-targeting
98924   CTSV
98925   KLHDC2
98926   non-targeting
Var — index of gene names to predict

sample_pred.var

Index: [SAMD11, NOC2L, KLHL17, PLEKHN1, PERM1, HES4, ISG15, AGRN, RNF223, C1orf159, TTLL10, TNFRSF18, TNFRSF4, SDF4, B3GALT6, C1QTNF12, UBE2J2, SCNN1D, ACAP3, PUSL1, INTS11, CPTP, TAS1R3, DVL1, MXRA8, AURKAIP1, CCNL2, ANKRD65, TMEM88B, VWA1, ATAD3C, ATAD3B, ATAD3A, TMEM240, SSU72, FNDC10, MIB2, MMP23B, CDK11B, CDK11A, NADK, GNB1, CALML6, TMEM52, CFAP74, GABRD, PRKCZ, FAAP20, SKI, RER1, PEX10, PLCH2, PANK4, HES5, TNFRSF14, PRXL2B, MMEL1, ACTRT2, PRDM16, ARHGEF16, MEGF6, TPRG1L, WRAP73, TP73, CCDC27, SMIM1, LRRC47, CEP104, DFFB, C1orf174, AJAP1, NPHP4, KCNAB2, CHD5, RNF207, ICMT, HES3, GPR153, ACOT7, HES2, ESPN, TNFRSF25, PLEKHG5, NOL9, TAS1R1, ZBTB48, KLHL21, PHF13, THAP3, DNAJC11, CAMTA1, VAMP3, PER3, UTS2, TNFRSF9, ERRFI1, SLC45A1, RERE, ENO1, CA6, ...]

Scoring Rubric

Evaluation metrics should reflect the core purpose of a virtual cell: simulating, in silico, cellular behavior that would otherwise have to be determined experimentally. This challenge focuses on predicting gene expression responses to genetic perturbations, which reflects the most common readout from single cell functional genomics experiments: post-perturbation expression counts and differentially expressed genes. Based on these criteria, we have designed three metrics to evaluate model performance for this year’s Challenge. Future challenges will likely include additional perturbation types. We hope that the Challenge will encourage researchers to discuss, and continue refining, the most effective and relevant metrics for evaluating virtual cell model performance.

Metrics Diagram
Differential Expression Score (DES)

The differential expression score evaluates how accurately a model predicts differential gene expression, a key output of most single-cell functional genomics experiments and a key input for downstream biological interpretation.

First, for each perturbation (predicted and ground truth), we calculate differential gene expression p-values between perturbed and control cells using the Wilcoxon rank-sum test with tie correction. To define the significant DE gene sets for the prediction, $G_{k,pred}$, and the ground truth, $G_{k,true}$, for perturbation $k$, we use the Benjamini-Hochberg procedure to control the false discovery rate at level $\alpha=0.05$. If the predicted set size $n_{k,pred}=|G_{k,pred}|$ is less than or equal to the true set size $n_{k,true}=|G_{k,true}|$, we define the Differential Expression Score as the size of the intersection between the predicted and true sets, normalized by the size of the true set:

$$DES_{k}=\frac{|G_{k,pred} \cap G_{k,true}|}{n_{k,true}}$$

If $n_{k,pred} > n_{k,true}$, we employ a different calculation to avoid over-penalizing predictions that overestimate the significance of differential expression. We define the predicted DE gene set $\tilde{G}_{k,pred}$ by selecting the $n_{k,true}$ genes with the largest absolute fold changes (with respect to control cells) from the full predicted significant set $G_{k,pred}$, and calculate the normalized intersection:

$$DES_{k}=\frac{|\tilde{G}_{k,pred} \cap G_{k,true}|}{n_{k,true}}$$

To obtain the overall score, we calculate the mean of $DES_k$ over all predicted perturbations.
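
For concreteness, a compact sketch of this calculation is shown below. It is not the Challenge's reference implementation: the function names, the use of scipy's mannwhitneyu (the two-sample Wilcoxon rank-sum test, which applies tie correction), and statsmodels' Benjamini-Hochberg adjustment are assumptions about one reasonable way to reproduce the steps described above.

```python
import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

def significant_genes(pert_expr, ctrl_expr, alpha=0.05):
    """Per-gene Wilcoxon rank-sum (Mann-Whitney U) p-values, BH-adjusted at alpha.

    pert_expr, ctrl_expr: (cells x genes) arrays of log1p-normalized expression.
    Returns a boolean mask of significant DE genes and per-gene fold changes vs. control.
    """
    pvals = np.array([
        mannwhitneyu(pert_expr[:, g], ctrl_expr[:, g], alternative="two-sided").pvalue
        for g in range(pert_expr.shape[1])
    ])
    reject, _, _, _ = multipletests(pvals, alpha=alpha, method="fdr_bh")
    fold_change = pert_expr.mean(axis=0) - ctrl_expr.mean(axis=0)  # log-space difference
    return reject, fold_change

def des_k(pred_sig, pred_fc, true_sig):
    """Differential Expression Score for one perturbation k."""
    n_true = int(true_sig.sum())
    if n_true == 0:
        return np.nan  # handling of perturbations with no true DE genes is an assumption
    pred_set = pred_sig.copy()
    if pred_sig.sum() > n_true:
        # keep only the n_true predicted DE genes with the largest |fold change|
        idx = np.where(pred_sig)[0]
        keep = idx[np.argsort(-np.abs(pred_fc[idx]))[:n_true]]
        pred_set = np.zeros_like(pred_sig)
        pred_set[keep] = True
    return float((pred_set & true_sig).sum() / n_true)
```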

Perturbation Discrimination Score (PDS)

Adapted from Wu et al., 2024, the perturbation discrimination score measures a model's ability to distinguish between perturbations by ranking predictions according to their similarity to the true perturbational effect, regardless of effect size. First, we calculate pseudobulk expression profiles, predicted $\hat y_k$ and true $y_k$, for each perturbation $k$ ($1 \le k \le N$) by averaging the log1p-normalized expression of all genes over all perturbed cells. Next, we calculate the Manhattan ($L1$) distance between a predicted perturbation $p$ and each true perturbation $t$, and sort the distances in ascending order:

$$d_{pt} = d_{L1}(\hat y_p, y_t) \big|_{sort\ by\ t}$$

The target gene for each perturbation is excluded from the distance calculation. The index (rank) of the true perturbation in this ordered list, $argind\{d_{pt}\}_{t=p}$, is used to define the discrimination score by normalizing it to the total number of perturbations:

$$PDS_p = 1 - \frac{argind\{d_{pt}\}_{t=p} - 1}{N}$$

If the predicted perturbation has the minimal distance to the true perturbation, $PDS_p = 1$.

Finally, the overall score is calculated as the mean of all predicted perturbation scores.
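
A short sketch of this score is given below, under two stated assumptions: the target gene excluded from the distance calculation is taken to be that of the predicted perturbation, and ties in distance are broken optimistically. The official cell-eval implementation may differ in these details.

```python
import numpy as np

def pds_scores(pred_bulk, true_bulk, target_gene_idx):
    """Perturbation Discrimination Score for each of N predicted perturbations.

    pred_bulk, true_bulk: (N x G) pseudobulk log1p-normalized mean expression.
    target_gene_idx:      column index of each perturbation's target gene.
    """
    n_perts, n_genes = pred_bulk.shape
    scores = np.empty(n_perts)
    for p in range(n_perts):
        mask = np.ones(n_genes, dtype=bool)
        mask[target_gene_idx[p]] = False          # drop the target gene (assumed: target of p)
        # Manhattan (L1) distance from prediction p to every true pseudobulk profile
        d = np.abs(pred_bulk[p, mask] - true_bulk[:, mask]).sum(axis=1)
        rank = 1 + int(np.sum(d < d[p]))          # 1-based rank of the matching perturbation
        scores[p] = 1.0 - (rank - 1) / n_perts
    return scores                                 # overall PDS is scores.mean()
```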

Mean Absolute Error (MAE)

To ensure that predictions are also evaluated across all genes, including those that are not differentially expressed, we include a third metric: mean absolute error (MAE). While MAE is less biologically interpretable, it captures overall predictive accuracy and provides a view of model performance across the entire gene expression profile. We use the standard definition of MAE, calculating the mean absolute difference between the predicted pseudobulk expression $\hat y_k$ and the true expression $y_k$:

$$MAE_k = \frac{1}{G} \sum_{g=1}^{G} |\hat y_{kg} - y_{kg}|$$

where $g$ indexes genes and $G$ is the number of genes. To obtain the overall score, we calculate the mean of $MAE_k$ over all predicted perturbations.
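
In code this is simply the element-wise absolute difference averaged over genes; the function name below is illustrative:

```python
import numpy as np

def mae_k(pred_bulk_k, true_bulk_k):
    """Mean absolute error between predicted and true pseudobulk profiles (length-G vectors)."""
    return float(np.mean(np.abs(pred_bulk_k - true_bulk_k)))
```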

Overall Score

The overall score on the leaderboard, and ultimately the scoring of final entries eligible for prizes, is determined by averaging the improvement of the three metrics relative to a baseline: the cell-mean model computed from the training dataset.

Differential Expression Score (DES) and Perturbation Discrimination Score (PDS) range from 0 (worst) to 1 (best), so we define the scaled scores as

$$DES_{scaled} = \frac{DES_{prediction}-DES_{baseline}}{1-DES_{baseline}}$$

$$PDS_{scaled} = \frac{PDS_{prediction}-PDS_{baseline}}{1-PDS_{baseline}}$$

Here the baseline scores are calculated for the cell-mean baseline model, which makes predictions by simply averaging the expression across all perturbations. The baseline scores are pre-calculated on the Training dataset and can be seen in the raw score table.

For mean absolute error, for which the best possible value is 0, we define the scaled score as:

$$MAE_{scaled} = \frac{MAE_{baseline} - MAE_{prediction}}{MAE_{baseline}}$$

If a scaled score is negative (i.e. the prediction performs worse than the baseline), it is clipped to 0. As defined above, the scaled scores range from 0 (performance equal to or worse than the baseline) to 1 (exact match to the ground truth). Finally, we take the mean of the three scaled scores to obtain the overall leaderboard score:

$$S = \frac{1}{3} \left( DES_{scaled} + PDS_{scaled} + MAE_{scaled} \right)$$
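
Putting the scaling together, a minimal sketch of the leaderboard score is shown below, assuming the raw metric values and the pre-calculated cell-mean baseline values are already at hand (the function name is illustrative):

```python
def overall_score(des, pds, mae, des_base, pds_base, mae_base):
    """Leaderboard score from the three raw metric values and the cell-mean baseline values."""
    des_scaled = max(0.0, (des - des_base) / (1.0 - des_base))
    pds_scaled = max(0.0, (pds - pds_base) / (1.0 - pds_base))
    mae_scaled = max(0.0, (mae_base - mae) / mae_base)  # lower MAE is better
    return (des_scaled + pds_scaled + mae_scaled) / 3.0
```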