The Virtual Cell Challenge will evaluate submitted results from entrants’ models against the known answers obtained from laboratory testing, and will generate a score for each entry that reflects the entry’s performance. The scoring methodology and other details of the submission process are outlined below.
Predictive models can be trained to generalize along several axes. Two core dimensions are (i) generalization across biological context (e.g., cell type, cell line, culture conditions, or even in vivo versus in vitro settings) and (ii) generalization to novel genetic and/or chemical perturbations, including their combinations.
This challenge focuses on context generalization as a highly challenging real-world task: participants will predict the effects of single-gene perturbations in a test cell type (H1). The transcriptomic effect of these genetic perturbations has been previously reported in at least one other cellular context. This reflects a common experimental reality: testing all perturbations in every context is impractical due to cost, yet accurate context-specific predictions are crucial because responses depend on factors like cell type, state, differentiation stage, culture conditions, and genetic background.
Given that most published single cell genetic perturbation datasets span only a handful of cell lines, true zero-shot generalization to new cell states is likely premature. A more appropriate strategy at this stage is few-shot adaptation, where a subset of perturbations in the new cellular context is provided to guide model generalization. To support this, we provide expression profiles for a subset of perturbations measured directly in H1 hESCs, enabling participants to adapt their models before predicting responses to the remaining unseen perturbations in the same cell type.
Competitors will submit prediction files in the .vcc file format, generated via the cell-eval Python module. The .vcc format is a simple wrapper around an AnnData h5ad file that ensures the submission meets the requirements of the Challenge.
Only one submission is allowed per 24 hours, resetting at midnight UTC. During the initial phase of the Challenge, competitors' results will appear on the public leaderboard with live rankings. In the final week of the Challenge, a "final test set" will be released. During the final week, competitors will submit final results, which will not appear on the leaderboard. The winners of the Challenge will be announced in December based on final submissions.
The results file must contain the model-predicted gene expression counts for each perturbation identified in the validation dataset (during the live-ranking period) or in the final test dataset (during the final week of the Challenge).
For a full walkthrough of generating a valid prediction file, see our tutorial in cell-eval. In conjunction with the cell-eval 'prep' utility, the notebook ensures prediction files meet the following requirements:
(Requirements table: constraints on the prediction file's obs cell metadata and its sample_pred.var gene annotations.)
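For illustration, the sketch below shows one way to assemble such a prediction file in Python before running the cell-eval prep step. The field names used here (a target_gene column in obs, gene names in var, and a "non-targeting" control label) are assumptions for the example, not the authoritative schema; follow the cell-eval tutorial for the exact requirements.

```python
import numpy as np
import pandas as pd
import anndata as ad

# Hypothetical sketch: assemble predicted expression counts into an AnnData
# object before converting/validating it with the cell-eval prep utility.
# The column name "target_gene" and the control label "non-targeting" are
# assumptions for this example, not the official schema.

n_cells = 200
genes = ["GATA1", "TP53", "MYC"]  # toy gene panel
counts = np.random.poisson(1.0, size=(n_cells, len(genes))).astype(np.float32)

pred = ad.AnnData(
    X=counts,
    obs=pd.DataFrame(
        {"target_gene": ["GATA1"] * 100 + ["non-targeting"] * 100},
        index=[f"cell_{i}" for i in range(n_cells)],
    ),
    var=pd.DataFrame(index=genes),
)

# Write the h5ad file; the cell-eval prep utility then produces the .vcc
# submission (see the cell-eval tutorial for the exact command).
pred.write_h5ad("sample_pred.h5ad")
```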
Evaluation metrics should reflect the core purpose of a virtual cell: simulating cellular behavior in silico that is usually determined experimentally. This challenge is focused on predicting gene expression responses to genetic perturbations, which reflects the most common readout from single cell functional genomics experiments: post-perturbation expression counts and differentially expressed genes. Based on these criteria, we have designed three metrics to evaluate model performance for this year’s Challenge. Future challenges will likely include additional perturbation types. We hope that the Challenge will encourage researchers to discuss, and continue refining, the most effective and relevant metrics for evaluating virtual cell model performance.
The differential expression score evaluates how accurately a model predicts differential gene expression, a key output of most single-cell functional genomics experiments and a key input for downstream biological interpretation.
First, for each perturbation (predicted and ground truth), we calculate differential gene expression p-values between perturbed and control cells, using the Wilcoxon rank-sum test with tie correction. To define the significant DE gene sets for the prediction, $G_{\mathrm{pred},p}$, and the ground truth, $G_{\mathrm{true},p}$, for perturbation $p$, we use the Benjamini-Hochberg procedure to control the False Discovery Rate at level $\alpha$. If the predicted set is smaller than or equal to the true set, $|G_{\mathrm{pred},p}| \le |G_{\mathrm{true},p}|$, we define the Differential Expression Score as the intersection between the predicted and true sets, normalized to the size of the true set:

$$\mathrm{DES}_p = \frac{\left|G_{\mathrm{pred},p} \cap G_{\mathrm{true},p}\right|}{\left|G_{\mathrm{true},p}\right|}$$
If $|G_{\mathrm{pred},p}| > |G_{\mathrm{true},p}|$, we employ a different calculation to avoid over-penalizing predictions that overestimate the significance of differential expression. We define a reduced predicted DE gene set $\tilde{G}_{\mathrm{pred},p}$ by selecting the $|G_{\mathrm{true},p}|$ genes with the largest absolute fold changes (with respect to control cells) from the full predicted significant set $G_{\mathrm{pred},p}$, and calculate the normalized intersection:

$$\mathrm{DES}_p = \frac{\left|\tilde{G}_{\mathrm{pred},p} \cap G_{\mathrm{true},p}\right|}{\left|G_{\mathrm{true},p}\right|}$$
To obtain the overall score, we calculate the mean of $\mathrm{DES}_p$ over all predicted perturbations.
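To make the two cases concrete, here is a minimal Python sketch of the per-perturbation DES, assuming the significant DE gene sets and fold changes have already been computed (Wilcoxon rank-sum tests with Benjamini-Hochberg correction). The function and argument names are illustrative, and the handling of an empty true set is an assumption, not the official cell-eval behavior.

```python
import numpy as np

def des_per_perturbation(pred_genes, pred_abs_lfc, true_genes):
    """Illustrative Differential Expression Score for one perturbation.

    pred_genes   : genes called significant in the prediction (BH-corrected)
    pred_abs_lfc : dict gene -> |log fold change| vs. control in the prediction
    true_genes   : genes called significant in the ground truth
    """
    pred_set, true_set = set(pred_genes), set(true_genes)
    if not true_set:
        return np.nan  # assumption: score undefined when no true DE genes

    if len(pred_set) <= len(true_set):
        # Normalized intersection with the true set.
        return len(pred_set & true_set) / len(true_set)

    # Over-prediction: keep only the |true_set| genes with the largest
    # absolute fold changes before taking the normalized intersection.
    top = sorted(pred_set, key=lambda g: pred_abs_lfc[g], reverse=True)[: len(true_set)]
    return len(set(top) & true_set) / len(true_set)
```

The overall DES would then be the mean of this value over all predicted perturbations.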
Adapted from Wu et al., 2024, the perturbation discrimination score measures a model's ability to distinguish between perturbations by ranking predictions according to their similarity to the true perturbational effect, regardless of effect size. First, we calculate pseudobulk expression profiles, predicted $\hat{\mathbf{y}}_p$ and true $\mathbf{y}_t$, for each perturbation ($p, t = 1, \dots, P$) by averaging the log1p-normalized expression of all genes over all perturbed cells. Next, we calculate the Manhattan ($L_1$) distance between a predicted perturbation and every true perturbation and sort the distances in ascending order:

$$d_{p,t} = \sum_{g} \left|\hat{y}_{p,g} - y_{t,g}\right|$$
The target gene for each perturbation is excluded from the distance calculation. The index (rank) of the true perturbation in this ordered list, $r_p$ (zero-based), is used to define the discrimination score by normalizing it to the total number of perturbations:

$$\mathrm{PDS}_p = 1 - \frac{r_p}{P}$$
If the predicted perturbation has the minimal distance to its true perturbation, $r_p = 0$ and $\mathrm{PDS}_p = 1$.
Finally, the overall score is calculated as the mean of all predicted perturbation scores.
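The sketch below illustrates this computation on aligned pseudobulk matrices. The zero-based rank convention and the tie handling are assumptions consistent with the description above; the official implementation in cell-eval may differ in detail.

```python
import numpy as np

def pds(pred_bulk, true_bulk, target_idx=None):
    """Illustrative perturbation discrimination score.

    pred_bulk, true_bulk : (P, G) pseudobulk log1p-normalized expression,
                           rows aligned so that row p is the same perturbation.
    target_idx           : optional list of target-gene column indexes to
                           exclude per perturbation, as described above.
    """
    P = pred_bulk.shape[0]
    scores = np.empty(P)
    for p in range(P):
        keep = np.ones(pred_bulk.shape[1], dtype=bool)
        if target_idx is not None:
            keep[target_idx[p]] = False  # drop the perturbed gene itself
        # Manhattan (L1) distance from prediction p to every true perturbation.
        d = np.abs(true_bulk[:, keep] - pred_bulk[p, keep]).sum(axis=1)
        # Zero-based rank of the matching true perturbation (ties ignored here).
        r = int(np.argsort(d).tolist().index(p))
        scores[p] = 1.0 - r / P
    return scores.mean()
```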
To ensure that predictions are also evaluated across all genes, including those that are not differentially expressed, we include a third metric: mean absolute error (MAE). While MAE is less biologically interpretable, it captures overall predictive accuracy and provides a view of model performance across the entire gene expression profile. We use the standard definition of MAE, calculating the mean absolute difference between the pseudobulk predicted and true expression:

$$\mathrm{MAE}_p = \frac{1}{G}\sum_{g=1}^{G} \left|\hat{y}_{p,g} - y_{p,g}\right|$$
where $g$ indexes genes and $G$ is the number of genes. To obtain the overall score, we calculate the mean of $\mathrm{MAE}_p$ over all predicted perturbations.
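A corresponding sketch on the same pseudobulk matrices (names illustrative):

```python
import numpy as np

def mae(pred_bulk, true_bulk):
    """Illustrative MAE: mean absolute difference over genes, averaged over perturbations.

    pred_bulk, true_bulk : (P, G) pseudobulk expression with aligned rows.
    """
    per_perturbation = np.abs(pred_bulk - true_bulk).mean(axis=1)  # mean over genes
    return per_perturbation.mean()                                 # mean over perturbations
```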
The overall score on the leaderboard, and ultimately the scoring of final entries eligible for prizes, is determined by averaging the improvement of the three metrics relative to a baseline: the cell-mean model computed from the training dataset.
The Differential Expression Score (DES) and Perturbation Discrimination Score (PDS) range from 0 (worst) to 1 (best), so we define the scaled scores as:

$$\mathrm{DES}_{\mathrm{scaled}} = \frac{\mathrm{DES} - \mathrm{DES}_{\mathrm{baseline}}}{1 - \mathrm{DES}_{\mathrm{baseline}}}, \qquad \mathrm{PDS}_{\mathrm{scaled}} = \frac{\mathrm{PDS} - \mathrm{PDS}_{\mathrm{baseline}}}{1 - \mathrm{PDS}_{\mathrm{baseline}}}$$
Here baseline scores are calculated for the cell-mean baseline model, which makes predictions by simply averaging the expression from all perturbations. The baseline scores are pre-calculated on the Training dataset and can be seen in the raw score table.
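As a hedged illustration of this baseline, assuming it operates on perturbation-level pseudobulk profiles of the training data:

```python
import numpy as np

def cell_mean_baseline(train_bulk, n_test_perturbations):
    """Illustrative cell-mean baseline: predict the mean training expression
    profile (averaged over all perturbations) for every test perturbation."""
    mean_profile = train_bulk.mean(axis=0)                    # (G,)
    return np.tile(mean_profile, (n_test_perturbations, 1))   # (P_test, G)
```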
For mean absolute error, for which the best possible value is 0, we define the scaled score as:

$$\mathrm{MAE}_{\mathrm{scaled}} = \frac{\mathrm{MAE}_{\mathrm{baseline}} - \mathrm{MAE}}{\mathrm{MAE}_{\mathrm{baseline}}}$$
If a scaled score is negative (i.e., the prediction performs worse than the baseline), it is clipped to 0. As defined above, the scaled scores range from 0 (performance equal to or worse than the baseline) to 1 (exact match to the ground truth). Finally, we take the mean of the three scaled scores to obtain the overall leaderboard score:

$$\mathrm{Score} = \frac{\mathrm{DES}_{\mathrm{scaled}} + \mathrm{PDS}_{\mathrm{scaled}} + \mathrm{MAE}_{\mathrm{scaled}}}{3}$$
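Putting the aggregation together, here is a minimal sketch given precomputed raw and baseline metric values; the function and argument names are illustrative, not part of cell-eval.

```python
def overall_score(des, pds, mae, des_base, pds_base, mae_base):
    """Illustrative leaderboard score: mean of baseline-scaled, zero-clipped metrics."""
    des_scaled = max(0.0, (des - des_base) / (1.0 - des_base))
    pds_scaled = max(0.0, (pds - pds_base) / (1.0 - pds_base))
    mae_scaled = max(0.0, (mae_base - mae) / mae_base)
    return (des_scaled + pds_scaled + mae_scaled) / 3.0
```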
We have developed a set of utilities and other resources to assist participants in the Challenge.