Task Objective¶
This task extends Task 2 by integrating histopathology and transcriptomics to predict recurrence in HR-NMIBC patients. The aim is to model patient-level time-to-recurrence using both morphological and molecular data.

Figure: Schematic overview of the multimodal prediction pipeline. Histopathology, RNA-seq and clinical data are encoded using pretrained networks and combined for prediction.

Please note that the RNA-seq data is derived from a selected tumor region within the histopathology slide.
Evaluation Metric¶
- Model performance is evaluated using the censored concordance index (C-index). This metric measures the proportion of all comparable patient pairs for which the model correctly predicts the ordering of outcomes.
- Two patients are considered comparable if:
  - Both experienced the event (e.g., recurrence) at different times, or
  - One experienced the event, and the other was event-free but had a longer observed follow-up time
- A pair is not comparable if both patients experienced the event at the same time.
- A pair is considered concordant if the patient with the higher predicted risk score has the shorter observed time to the event. In other words, the model correctly orders the two patients in terms of risk.
- The C-index ranges from 0 to 1, with the following reference values:
  - 0.5 → random predictions
  - 1.0 → perfect concordance
  - Values below 0.5 indicate an ordering that is worse than random.
The complete evaluation pipeline, including code for computing the censored concordance index, will be made publicly available to ensure transparency and reproducibility.
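Until the official evaluation code is consulted, the pair rules above can be sketched as follows. This is a minimal illustration, not the organisers' implementation: the function name `censored_c_index` is hypothetical, pairs with equal observed times are always skipped, and tied risk scores are counted as half concordant (both are assumptions, since the text above does not specify them).

```python
from itertools import combinations


def censored_c_index(times, events, risks):
    """Fraction of comparable pairs that the risk scores order correctly.

    times  : observed time to event or end of follow-up
    events : 1 if the event was observed, 0 if censored
    risks  : predicted risk scores (higher = earlier event expected)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            # Assumption: pairs with equal observed times are skipped.
            continue
        # 'first' is the patient with the shorter observed time.
        first, second = (i, j) if times[i] < times[j] else (j, i)
        if not events[first]:
            # The shorter time is censored, so the true order is unknown.
            continue
        comparable += 1
        if risks[first] > risks[second]:
            concordant += 1.0
        elif risks[first] == risks[second]:
            concordant += 0.5  # assumption: tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Four patients; the third is censored at 15 months.
print(censored_c_index([5, 10, 15, 20], [1, 1, 0, 1], [0.9, 0.7, 0.4, 0.1]))  # 1.0
```

In this example the pair formed by the censored patient (15 months) and the patient with an event at 20 months is not comparable, so only five of the six pairs contribute to the score.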
Data Details¶
Training Data¶
• 🧠 Histopathology: A single H&E-stained whole slide image (WSI) per patient, with a pixel spacing of 0.25 µm/pixel at the highest resolution level. Note that this WSI shows one of the following: a section adjacent to the H&E slide used for bulk RNA-seq, the same H&E slide with a punched cavity in the tissue section, or an H&E slide of another tumor from the same patient.
• 🧠 Histopathology: Binary tissue mask outlining the tissue section
• 🧬 Transcriptomics: Bulk RNA-seq data extracted from selected tumor regions, normalized using DESeq2
• 📋 Clinical Data: Same variables as Task 2.
Feature | Type / Values | Description |
---|---|---|
age | Integer (years) | Age of the patient in years |
sex | Male / Female | Biological sex of the patient |
smoking | Yes / No | Smoking history |
tumor | Primary / Recurrence | Indicates whether the tumor is primary or recurrent |
stage | TaHG / T1HG / T2HG | Tumor stage: Ta (inner lining), T1 (connective tissue), T2 (muscle invasion); all high-grade |
substage | T1m / T1e | T1m: ≤ 0.5mm invasion; T1e: > 0.5mm invasion |
grade | G2 / G3 | G2: moderately differentiated; G3: poorly differentiated |
reTUR | Yes / No | Re-transurethral resection (TUR) performed before BCG induction |
LVI | Yes / No | Lymphovascular invasion observed on H&E slide |
variant | UCC / UCC + Variant | Urothelial carcinoma alone or with variant histology |
EORTC | High risk / Highest risk | European Organization for Research and Treatment of Cancer (EORTC) risk classification |
no_instillations | Integer | Total number of BCG instillations. "-1" indicates missing data. |
BRS | BRS1 / BRS2 / BRS3 | Biomarker-derived BCG response subtype from RNA-seq |
Reference Standard | ||
progression | 0 / 1 | Progression to advanced disease (1-true/0-false) |
time_to_HG_recur_or_FUend | Float (months) | Time to high-grade recurrence or end of follow-up in months |
Additional information (not used in evaluation/test) | ||
HG_recur_BCG_failure | 0 / 1 | BCG failure (1-true/0-false) |
time_to_prog_or_FUend | Float (months) | Time to progression or end of follow-up in months |
time_to_FUend | Float (months) | Time to end of follow-up in months |
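As an illustration of how the table above could be turned into model inputs, the sketch below encodes a few of the listed variables numerically and treats the "-1" sentinel of no_instillations as missing. It assumes that each _CD.json file is a flat JSON object whose keys match the feature names in the table, which is not confirmed here; the function names and the 0/1 coding are hypothetical choices.

```python
import json
import math

# Two-valued variables from the table, mapped to 0/1 (the coding is an assumption).
BINARY_FEATURES = {
    "sex": {"Male": 0.0, "Female": 1.0},
    "smoking": {"No": 0.0, "Yes": 1.0},
    "tumor": {"Primary": 0.0, "Recurrence": 1.0},
    "grade": {"G2": 0.0, "G3": 1.0},
    "reTUR": {"No": 0.0, "Yes": 1.0},
    "LVI": {"No": 0.0, "Yes": 1.0},
    "variant": {"UCC": 0.0, "UCC + Variant": 1.0},
    "EORTC": {"High risk": 0.0, "Highest risk": 1.0},
}


def load_clinical(path):
    """Read one clinical file (assumed to be a flat JSON object)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def encode_clinical(record):
    """Return numeric features; unknown or missing values become NaN."""
    features = {"age": float(record.get("age", math.nan))}
    for name, mapping in BINARY_FEATURES.items():
        features[name] = mapping.get(record.get(name), math.nan)
    instillations = record.get("no_instillations", -1)
    # "-1" marks missing data according to the table.
    features["no_instillations"] = math.nan if instillations == -1 else float(instillations)
    return features


def reference_standard(record):
    """Return (time, event) from the two rows listed under Reference Standard."""
    return float(record["time_to_HG_recur_or_FUend"]), int(record["progression"])
```

The multi-valued variables (stage, substage, BRS) are left out on purpose; one-hot encoding is a common choice for them, but the appropriate handling depends on the model.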
Data versions¶
v1¶
- 126 paired multimodal training data (_HE.tif, _HE_mask.tif, _CD.json, and _RNA.json).
- Contains incorrect histopathology slides and/or tissue masks (_HE.tif, _HE_mask.tif):
  - Corrupted: 3A_024
  - Incorrect spacings: 3A_017, 3A_031, 3A_042, 3A_050, 3A_141, 3A_143, and 3A_157
  - Scan is out of focus, resulting in failed tissue segmentation: 3A_025 (blank mask)
v2¶
- 176 paired multimodal training data (_HE.tif, _HE_mask.tif, _CD.json, and _RNA.json). Note that all the clinical data (_CD.json) files have been updated to reflect the features used in validation and test.
- These slides are fixed: 3A_024, 3A_017, 3A_031, 3A_042, 3A_050, 3A_141, 3A_143, and 3A_157. The 3A_025 slide, which is out of focus, could not be fixed.
- Added extra materials to support model training:
  - Note: The newly added 50 RNA-seq files (_RNA.json) each begin with the prefix "3B", which indicates the source "Cohort B" as described in [1]. These Cohort B samples were sequenced using a different protocol than those from Cohort A (labeled with the prefix "3A"). Please note that no batch effect adjustment was performed across these two cohorts.
- UPDATE on 2025-06-08: The recently released V2 dataset contained clinical data files (_CD.json) with a missing parameter ("progression"). The clinical data files have been corrected and reuploaded to AWS, and can be downloaded via the AWS command line below.
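Because no batch effect adjustment was applied across Cohort A and Cohort B, participants may want to at least keep track of the cohort of each sample. The sketch below derives the cohort from the patient ID prefix and z-scores each gene within its cohort. This is one crude option, not a method prescribed here, and it can also remove genuine biological differences between cohorts. The input layout (`{patient_id: {gene: value}}`) and both function names are assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev


def cohort_of(patient_id):
    """'3A_017' -> '3A', '3B_005' -> '3B' (prefix before the first underscore)."""
    return patient_id.split("_", 1)[0]


def standardize_per_cohort(expression):
    """Z-score every gene separately within each cohort.

    expression: {patient_id: {gene: value}} (assumed layout).
    Only genes present for all patients of a cohort are kept for that cohort.
    """
    by_cohort = defaultdict(list)
    for patient_id in expression:
        by_cohort[cohort_of(patient_id)].append(patient_id)

    result = {patient_id: {} for patient_id in expression}
    for patient_ids in by_cohort.values():
        shared_genes = set.intersection(*(set(expression[p]) for p in patient_ids))
        for gene in shared_genes:
            values = [expression[p][gene] for p in patient_ids]
            mu, sigma = mean(values), pstdev(values)
            for p in patient_ids:
                # Constant genes carry no information; map them to 0.
                result[p][gene] = (expression[p][gene] - mu) / sigma if sigma > 0 else 0.0
    return result
```

Dedicated batch correction methods exist and may be preferable; whichever approach is used, the cohort label returned by `cohort_of` can also serve as a stratification variable when building cross-validation folds.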
Download Training Data¶
Instruction (latest version)¶
- Install the AWS CLI: https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html
- Bucket name: s3://chimera-challenge/v2/task3/
- Command line: aws s3 sync --no-sign-request s3://chimera-challenge/v2/task3/ <destination_path>
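For scripted downloads, the command above can be assembled in Python and passed to the AWS CLI. The sketch below is only a convenience wrapper: the function name is hypothetical, and the optional single-patient filter assumes that patient folders live under data/{patient_id}/ in the bucket. The `--exclude`, `--include` and `--dryrun` options are standard `aws s3 sync` options.

```python
import subprocess

BUCKET_URI = "s3://chimera-challenge/v2/task3/"


def build_sync_command(destination, patient_id=None, dry_run=False):
    """Return the argument list for 'aws s3 sync' without running it."""
    command = ["aws", "s3", "sync", "--no-sign-request", BUCKET_URI, str(destination)]
    if patient_id is not None:
        # Assumption: one folder per patient under data/ in the bucket.
        command += ["--exclude", "*", "--include", f"data/{patient_id}/*"]
    if dry_run:
        command.append("--dryrun")  # list the transfers without downloading
    return command


# To actually download (requires the AWS CLI on PATH):
# subprocess.run(build_sync_command("./chimera_task3", patient_id="3A_017"), check=True)
```

Whole slide images are large, so trying the command on a single patient with `dry_run=True` first is a cheap way to check the destination path and filters.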
Bucket structure (latest version)¶
v2/
    task3/
        data/
            task3_quality_control.csv
            {patient_id}/
                {patient_id_CD.json}
                {patient_id_HE.tif}
                {patient_id_HE_mask.tif}
                {patient_id_RNA.json}
        features/
            coordinates/
                {patient_id_HE.npy}
                {patient_id_HE.npy}
                {patient_id_HE.npy}
            features/
                {patient_id_HE.pt}
                {patient_id_HE.pt}
                {patient_id_HE.pt}
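Once the bucket has been synced, the per-patient files in the structure above can be collected into a simple index. The sketch below assumes that patient folders sit directly under data/ in the local copy and that files are named with the patient ID followed by the suffixes _CD.json, _HE.tif, _HE_mask.tif and _RNA.json; the function name and dictionary keys are hypothetical.

```python
from pathlib import Path

# File suffixes from the dataset description, keyed by hypothetical short names.
SUFFIXES = {
    "clinical": "_CD.json",
    "wsi": "_HE.tif",
    "mask": "_HE_mask.tif",
    "rna": "_RNA.json",
}


def index_patients(root):
    """Map each patient ID to the paths of its files (None if a file is absent).

    root: local folder that mirrors the task3/ level of the bucket.
    """
    index = {}
    for patient_dir in sorted((Path(root) / "data").iterdir()):
        if not patient_dir.is_dir():
            continue  # skips plain files such as task3_quality_control.csv
        patient_id = patient_dir.name
        entry = {}
        for key, suffix in SUFFIXES.items():
            path = patient_dir / f"{patient_id}{suffix}"
            entry[key] = path if path.exists() else None
        index[patient_id] = entry
    return index
```

Entries with a `None` value make incomplete cases easy to spot before training, for example a patient whose slide could not be downloaded.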