Task Objective
This task extends Task 2 by integrating histopathology and transcriptomics to predict recurrence in patients with high-risk non-muscle-invasive bladder cancer (HR-NMIBC). The aim is to model patient-level time-to-recurrence using both morphological and molecular data.

Figure: Schematic overview of the multimodal prediction pipeline. Histopathology, RNA-seq, and clinical data are encoded using pretrained networks and combined for prediction. Please note that the RNA-seq data is derived from a selected tumor region within the histopathology slide.

Evaluation Metric

Model performance is evaluated using the censored concordance index (C-index). This metric measures the proportion of all comparable patient pairs for which the model correctly predicts the ordering of outcomes.

Two patients are considered comparable if:
- Both experienced the event (e.g., recurrence) at different times, or
- One experienced the event, and the other was event-free but had a longer observed follow-up time

A pair is not comparable if both patients experienced the event at the same time.

A pair is considered concordant if the patient with the higher predicted risk score has the shorter observed time to event. In other words, the model correctly orders the two patients in terms of risk.

Reference values of the C-index:
- 0.5 → random predictions
- 1.0 → perfect concordance

Values below 0.5 indicate predictions that are ordered worse than random.

The complete evaluation pipeline, including code for computing the censored concordance index, will be made publicly available to ensure transparency and reproducibility.
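
The comparability and concordance rules above can be illustrated with a small pure-Python sketch. This is not the official evaluation code: the function name `censored_c_index` is hypothetical, and counting tied risk scores as half-concordant is an assumption that the official pipeline may or may not share.

```python
from itertools import combinations


def censored_c_index(times, events, risks):
    """Pairwise C-index following the comparability rules described above.

    times  : observed time (event time or end of follow-up) per patient
    events : truthy if the patient experienced the event, falsy if censored
    risks  : predicted risk score per patient (higher = earlier event expected)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            # Equal observed times: the ordering cannot be established.
            continue
        # "early" is the patient with the shorter observed time.
        early, late = (i, j) if times[i] < times[j] else (j, i)
        if not events[early]:
            # The earlier patient was censored, so we do not know who
            # would have had the event first: pair is not comparable.
            continue
        comparable += 1
        if risks[early] > risks[late]:
            concordant += 1.0
        elif risks[early] == risks[late]:
            # Assumption: tied risk scores count as half-concordant.
            concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


if __name__ == "__main__":
    # Toy values for illustration only, not challenge data.
    times = [5.0, 10.0, 15.0, 20.0]
    events = [1, 0, 1, 0]
    risks = [0.9, 0.4, 0.6, 0.1]
    print(censored_c_index(times, events, risks))
```

The loop is quadratic in the number of patients, which is acceptable for a cohort of a few hundred patients; a library implementation would be preferable for larger cohorts.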
Data Details
Training Data
• 🧠 Histopathology: A single H&E-stained whole slide image (WSI) per patient, with a pixel spacing of 0.25 µm/pixel at the highest resolution level. Note that this WSI is one of the following: an adjacent section of the H&E slide used for bulk RNA-seq, the same H&E slide with a punched cavity in the tissue section, or an H&E slide of another tumor from the same patient.
• 🧠 Histopathology: Binary tissue mask outlining the tissue section
• 🧬 Transcriptomics: Bulk RNA-seq data extracted from selected tumor regions, normalized using DESeq2
• 📋 Clinical Data: Same variables as Task 2.
| Feature | Type / Values | Description |
|---|---|---|
| age | Integer (years) | Age of the patient in years |
| sex | Male / Female | Biological sex of the patient |
| smoking | Yes / No | Smoking history |
| tumor | Primary / Recurrence | Indicates whether the tumor is primary or recurrent |
| stage | TaHG / T1HG / T2HG | Tumor stage: Ta (inner lining), T1 (connective tissue), T2 (muscle invasion); all high-grade |
| substage | T1m / T1e | T1m: ≤ 0.5 mm invasion; T1e: > 0.5 mm invasion |
| grade | G2 / G3 | G2: moderately differentiated; G3: poorly differentiated |
| reTUR | Yes / No | Re-transurethral resection (TUR) performed before BCG induction |
| LVI | Yes / No | Lymphovascular invasion observed on H&E slide |
| variant | UCC / UCC + Variant | Urothelial carcinoma alone or with variant histology |
| EORTC | High risk / Highest risk | European Organization for Research and Treatment of Cancer (EORTC) risk classification |
| no_instillations | Integer | Total number of BCG instillations. "-1" indicates missing data. |
| BRS | BRS1 / BRS2 / BRS3 | Biomarker-derived BCG response subtype from RNA-seq |
| Reference Standard | | |
| progression | 0 / 1 | Progression to advanced disease (1 = true, 0 = false) |
| time_to_prog_or_FUend | Float (months) | Time to progression or end of follow-up in months |
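
As a minimal sketch of how one clinical record might be prepared for survival modeling, the snippet below separates the reference standard from the input features and maps the documented `-1` sentinel in `no_instillations` to a missing value. It assumes that each `_CD.json` file is a flat JSON object keyed by the feature names in the table; that layout, the helper names, and the example values are assumptions rather than something this page states.

```python
import json

MISSING_INSTILLATIONS = -1  # sentinel documented in the feature table


def parse_clinical(record):
    """Split one clinical record into (features, event, time_months).

    Assumes a flat mapping keyed by the feature names in the table above
    and that the reference-standard fields are present (training data).
    """
    features = dict(record)  # copy so the caller's record is not modified
    event = bool(features.pop("progression"))
    time_months = float(features.pop("time_to_prog_or_FUend"))
    if features.get("no_instillations") == MISSING_INSTILLATIONS:
        features["no_instillations"] = None  # mark as missing
    return features, event, time_months


def load_clinical(path):
    """Read a clinical JSON file from disk and parse it."""
    with open(path, encoding="utf-8") as fh:
        return parse_clinical(json.load(fh))


if __name__ == "__main__":
    # Hypothetical record for illustration only, not real patient data.
    example = {
        "age": 70,
        "sex": "Male",
        "no_instillations": -1,
        "BRS": "BRS2",
        "progression": 0,
        "time_to_prog_or_FUend": 36.5,
    }
    print(parse_clinical(example))
```

Keeping the missing value as `None` rather than `-1` avoids a model treating the sentinel as a real count; how to impute it is left to the modeling approach.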
Data versions
v1
- 126 paired multimodal training samples (_HE.tif, _HE_mask.tif, _CD.json, and _RNA.json).
- Contains incorrect histopathology slides and/or tissue masks (_HE.tif, _HE_mask.tif):
  - Corrupted: 3A_024
  - Incorrect spacings: 3A_017, 3A_031, 3A_042, 3A_050, 3A_141, 3A_143, and 3A_157
  - Scan is out of focus, resulting in failed tissue segmentation: 3A_025 (blank mask)
v2
- 176 paired multimodal training samples (_HE.tif, _HE_mask.tif, _CD.json, and _RNA.json). Note that all clinical data files (_CD.json) have been updated to reflect the features used in validation and test.
- These slides are fixed: 3A_024, 3A_017, 3A_031, 3A_042, 3A_050, 3A_141, 3A_143, and 3A_157. The 3A_025 slide, which is out of focus, could not be fixed.
- Added extra materials to support model training:
  - Note: The 50 newly added RNA-seq files (_RNA.json) each begin with the prefix "3B", which indicates the source "Cohort B" as described in [1]. These Cohort B samples were sequenced using a different protocol than those from Cohort A (labeled with the prefix "3A"). Please note that no batch effect adjustment was performed between these two cohorts.
- UPDATE on 2025-06-08: The recently released v2 dataset contained clinical data files (_CD.json) with a missing parameter ("progression"). The clinical data files have been corrected and re-uploaded to AWS and can be downloaded with the AWS command line below.
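
Because the two cohorts were sequenced with different protocols and no batch effect adjustment was applied, it can help to keep track of which cohort each sample belongs to. The sketch below groups patient IDs by the "3A" / "3B" prefix and optionally skips the out-of-focus slide 3A_025. The helper names and the example ID "3B_201" are hypothetical, and whether to exclude 3A_025 is a modeling choice rather than a requirement stated here.

```python
from collections import defaultdict

# Slide reported above as out of focus and not fixable.
UNFIXABLE_SLIDES = {"3A_025"}

# Prefix-to-cohort mapping taken from the version notes above.
COHORT_BY_PREFIX = {"3A": "A", "3B": "B"}


def cohort_of(patient_id):
    """Return 'A' or 'B' based on the ID prefix before the first underscore."""
    prefix = patient_id.split("_", 1)[0]
    try:
        return COHORT_BY_PREFIX[prefix]
    except KeyError:
        raise ValueError(f"unknown cohort prefix in {patient_id!r}") from None


def group_by_cohort(patient_ids, drop_unfixable_slides=True):
    """Group patient IDs by cohort, optionally skipping known-bad slides."""
    groups = defaultdict(list)
    for pid in patient_ids:
        if drop_unfixable_slides and pid in UNFIXABLE_SLIDES:
            continue
        groups[cohort_of(pid)].append(pid)
    return dict(groups)


if __name__ == "__main__":
    # "3B_201" is a made-up ID used only to illustrate the prefix rule.
    print(group_by_cohort(["3A_017", "3A_025", "3B_201"]))
```

A grouping like this makes it straightforward to, for example, normalize expression values per cohort or to stratify cross-validation folds by cohort.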
Download Training Data
Instructions (latest version)
- Install the AWS CLI: https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html
- Bucket name: s3://chimera-challenge/v2/task3/
- Command line: aws s3 sync --no-sign-request s3://chimera-challenge/v2/task3/ <destination_path>
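
Since the whole slide images are large, it may be convenient to preview the transfer or to fetch only the small JSON files first. The sketch below builds the same `aws s3 sync` command as a Python argument list and optionally adds the AWS CLI `--exclude` / `--include` filters and the `--dryrun` flag. The function name, the destination folder, and the chosen patterns are illustrative assumptions.

```python
import shlex

# Bucket location taken from the download instructions above.
BUCKET_URI = "s3://chimera-challenge/v2/task3/"


def build_sync_command(destination, patterns=None, dry_run=False):
    """Build the argument list for `aws s3 sync` against the task bucket.

    patterns : optional glob patterns; when given, everything is excluded
               first and only matching files are included again.
    dry_run  : add --dryrun so the CLI only lists what it would transfer.
    """
    cmd = ["aws", "s3", "sync", "--no-sign-request", BUCKET_URI, str(destination)]
    if patterns:
        # Filter order matters: exclude everything, then re-include matches.
        cmd += ["--exclude", "*"]
        for pattern in patterns:
            cmd += ["--include", pattern]
    if dry_run:
        cmd.append("--dryrun")
    return cmd


if __name__ == "__main__":
    cmd = build_sync_command(
        "./chimera_task3",  # hypothetical destination folder
        patterns=["*_CD.json", "*_RNA.json"],
        dry_run=True,
    )
    # Print a copy-pasteable shell command; pass `cmd` to
    # subprocess.run(cmd, check=True) to actually execute it.
    print(shlex.join(cmd))
```

Running the printed command with `--dryrun` first shows which objects would be copied without transferring any data.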
Bucket structure (latest version)
v2/
  task3/
    data/
      task3_quality_control.csv
      {patient_id}/
        {patient_id_CD.json}
        {patient_id_HE.tif}
        {patient_id_HE_mask.tif}
        {patient_id_RNA.json}
    features/
      coordinates/
        {patient_id_HE.npy}
        {patient_id_HE.npy}
        {patient_id_HE.npy}
      features/
        {patient_id_HE.pt}
        {patient_id_HE.pt}
        {patient_id_HE.pt}
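
After downloading, the files belonging to one patient are spread over several folders. The following sketch builds a per-patient index by scanning the download directory recursively and matching the file suffixes listed above, so it does not depend on the exact folder nesting. It assumes that file names follow the pattern `<patient_id><suffix>` (for example `3A_017_HE.tif`); the function name and the kind labels are hypothetical.

```python
from collections import defaultdict
from pathlib import Path

# Suffix -> label for each file type listed in the bucket structure.
SUFFIX_TO_KIND = {
    "_HE_mask.tif": "mask",
    "_HE.tif": "wsi",
    "_CD.json": "clinical",
    "_RNA.json": "rna",
    "_HE.npy": "coordinates",
    "_HE.pt": "features",
}


def index_dataset(root):
    """Map each patient ID to the paths of its files found under `root`."""
    index = defaultdict(dict)
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        for suffix, kind in SUFFIX_TO_KIND.items():
            if path.name.endswith(suffix):
                patient_id = path.name[: -len(suffix)]
                index[patient_id][kind] = path
                break  # each file matches at most one suffix
    return dict(index)


if __name__ == "__main__":
    # Hypothetical download location; adjust to your <destination_path>.
    for patient_id, files in index_dataset("./chimera_task3").items():
        missing = set(SUFFIX_TO_KIND.values()) - set(files)
        print(patient_id, "missing:", sorted(missing))
```

Listing the missing kinds per patient gives a quick check that the download completed and that every patient has all modalities before training starts.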