📣 Challenge Structure 📣


The HaN-Seg 2023 grand challenge is divided into two phases (timeline):

1️⃣ Preliminary Test Phase (duration: 7 months):

Registration is open to anyone; the only requirement is that you register as a team (a single participant also counts as a team). Just click the Join button in the upper right part of the main challenge website. The public training dataset is available for download at Zenodo. Teams can participate by submitting their algorithms in the form of Docker containers; see the example algorithm Docker image. We limit the number of submissions to 1 per week. When submitted, each algorithm is executed on the grand-challenge.org platform and its performance is estimated on the four preliminary test cases. Team rankings are updated accordingly on the live public leaderboard.

2️⃣ Final Test Phase (duration: 3.5 months):

The final test phase will consist of 14 test cases (4 from the previous phase and 10 new cases). Teams can use this time for final method development and fine-tuning of hyperparameters. Only a single submission per week will be allowed in this phase. Once this phase is finished, the organizers will rank the teams based on the following protocol (a sketch of the procedure is given after the list):

  • Statistical ranking will be applied for each of the metrics by pairwise comparison of the algorithms using the Wilcoxon signed-rank test, resulting in a significance score and metric-specific rank.
  • The final rank will be obtained by aggregating the ranks over both metrics.
  • Statistically significant differences among the results of the participating teams will be evaluated, and identical ranks will be assigned to algorithms that show only marginal (statistically non-significant) differences in performance.
  • In case the final rank is equal for multiple participating teams, they will be ordered by metric-based aggregation, i.e. according to the mean of all metrics.
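
For illustration only, here is a minimal sketch of this ranking protocol. It is not the official evaluation code: the team names, per-case scores and the 0.05 significance threshold are placeholders, and the official procedure may differ in details such as tie handling.

```python
# Unofficial sketch: pairwise Wilcoxon signed-rank tests per metric, followed
# by aggregation of the metric-specific ranks into a final rank.
import itertools
import numpy as np
from scipy.stats import wilcoxon

def metric_rank(scores, higher_is_better=True, alpha=0.05):
    """Rank algorithms for one metric via pairwise Wilcoxon signed-rank tests.

    scores: dict mapping algorithm name -> 1-D array of per-case scores.
    Returns a dict mapping algorithm name -> rank (1 = best); algorithms
    without statistically significant pairwise differences share ranks.
    """
    names = list(scores)
    wins = {name: 0 for name in names}
    for a, b in itertools.combinations(names, 2):
        _, p = wilcoxon(scores[a], scores[b])   # paired, per-case comparison
        if p < alpha:                           # only significant differences count
            a_better = scores[a].mean() > scores[b].mean()
            if not higher_is_better:
                a_better = not a_better
            wins[a if a_better else b] += 1
    # More significant wins -> better (lower) rank; equal win counts share a rank.
    order = sorted(set(wins.values()), reverse=True)
    return {name: order.index(w) + 1 for name, w in wins.items()}

# Aggregate the metric-specific ranks into a final rank (placeholder scores).
rng = np.random.default_rng(0)
dsc = {t: rng.random(14) for t in ("team_a", "team_b", "team_c")}
hd95 = {t: rng.random(14) for t in ("team_a", "team_b", "team_c")}
dsc_rank = metric_rank(dsc, higher_is_better=True)
hd95_rank = metric_rank(hd95, higher_is_better=False)
final_rank = {t: (dsc_rank[t] + hd95_rank[t]) / 2 for t in dsc_rank}
```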


🚓 Rules 🚓

We adopt rules similar to those of the PI-CAI grand challenge:

  • All participants must form teams (even if the team is composed of a single participant), and each participant can only be a member of a single team.
  • Any individual participating with multiple or duplicate Grand Challenge profiles will be disqualified.
  • Anonymous participation is not allowed. To qualify for ranking on the validation/testing leaderboards, true names and affiliations [university, institute or company (if any), country] must be displayed accurately on verified Grand Challenge profiles, for all participants.
  • Members of sponsoring or organizing centers (i.e. University of Ljubljana, Faculty of Electrical Engineering; Institute of Oncology Ljubljana; University of Copenhagen, Department of Computer Science) may participate in the challenge, but are not eligible for prizes or the final ranking in the Final Testing Phase.
  • This challenge only supports the submission of fully automated methods in Docker containers. It is not possible to submit semi-automated or interactive methods.
  • All Docker containers submitted to the challenge will be run in an offline setting (i.e. they will not have access to the internet, and cannot download/upload any resources). All necessary resources (e.g. pre-trained weights) must be encapsulated in the submitted containers a priori.
  • Participants competing for prizes may use pre-trained AI models based on computer vision and/or medical imaging datasets (e.g. ImageNet, Medical Segmentation Decathlon). They may also use external datasets to train their AI algorithms. However, such data and/or models must be published under a permissive license (within 3 months of the Preliminary Test Phase deadline) to give all other participants a fair chance at competing on equal footing. They must also clearly state the use of external data in their submission, using the algorithm name [e.g. "HaN-Seg Model (trained w/ private data)"], algorithm page and/or a supporting publication/URL.
  • Researchers and companies who are interested in benchmarking their institutional AI models or products, but not in competing for prizes, can freely use private or unpublished external datasets to train their AI algorithms. They must clearly state the use of external data in their submission, using the algorithm name [e.g. "HaN-Seg Model (trained w/ private data)"], algorithm page and/or a supporting publication/URL. They are not obligated to publish their AI models and/or datasets, before or at any time after the submission.
  • To participate in the Final Testing Phase as one of the top 10 teams, participants must submit a short arXiv paper on their methodology (2–3 pages) and a public/private URL to their source code on GitHub (hosted with a permissive license). We take these measures to ensure the credibility and reproducibility of all proposed solutions, and to promote open-source AI development.
  • Participants of the HaN-Seg 2023 challenge may publish their own results separately; however, they must not submit their papers before June 1st, 2024. Papers published after June 1st, 2024 are requested to cite our dataset and challenge paper (once it has been published).
  • Organizers of the HaN-Seg 2023 challenge reserve the right to disqualify any participant or participating team, at any point in time, on grounds of unfair or dishonest practices.
  • All participants reserve the right to drop out of the HaN-Seg 2023 challenge and forego any further participation. However, they will not be able to retract their prior submissions or any results published up to that point in time.

Computational Limitations (similar to those of the Shifts challenge):

We place a hard limitation on computational resources: models must run within, at most, 15 minutes per input sample on the Grand Challenge backend. Submitted solutions that exceed this limit will not be considered for the leaderboard. This is done for several reasons. Firstly, to decrease costs, as every model evaluation on Grand Challenge costs money. Secondly, for real-world applicability, as many practical applications place significant limitations on the computational resources, memory budgets and run times of algorithms. Finally, to level the playing field for participants who do not have access to vast amounts of computational resources.
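
As a rough, unofficial pre-submission check, participants can time their own inference loop locally; the predict() function and case list below are hypothetical placeholders for your own code.

```python
# Rough, unofficial pre-submission check of per-case runtime; predict() and
# the case list are hypothetical placeholders for your own inference code.
import time

TIME_BUDGET_S = 15 * 60  # 15 minutes per input sample

def check_runtime(predict, cases):
    """Print the inference time per case and flag cases over the budget."""
    for case_id, image in cases:
        start = time.perf_counter()
        predict(image)                        # your model's inference call
        elapsed = time.perf_counter() - start
        status = "OK" if elapsed <= TIME_BUDGET_S else "OVER BUDGET"
        print(f"{case_id}: {elapsed:.1f} s [{status}]")
```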


👩‍⚖️ Evaluation 👩‍⚖️

Evaluation metrics are based on the recommendations of the Metrics Reloaded framework (a minimal sketch of their computation is given after the list):

  • Dice Similarity Coefficient (DSC) for all organs,
  • 95th percentile of the Hausdorff distance (HD95) for all organs, and
  • centerline DSC (clDSC), but only for the following (tubular) organs:
    • A_Carotid_L,
    • A_Carotid_R,
    • Brainstem and
    • SpinalCord.
    [Update after the Final Phase ended] Upon reviewing the centerline DSC originally proposed for the evaluation of tubular organs in the Metrics Reloaded paper, we empirically observed that this metric does not provide an objective and robust way to compare tubular organs, and it was therefore excluded from the metrics. Please note that this decision did not influence team ranking in the Final Phase of the challenge.
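
For illustration only, the sketch below shows how DSC and HD95 can be computed for a single organ. It is not the official evaluation code; it assumes binary NumPy masks with a known voxel spacing in mm and omits edge cases such as empty masks.

```python
# Illustrative (unofficial) computation of DSC and HD95 for one organ,
# assuming binary NumPy masks and a known voxel spacing; edge cases such as
# empty masks are not handled.
import numpy as np
from scipy import ndimage

def dice(gt, pred):
    """Dice Similarity Coefficient between two binary masks."""
    intersection = np.logical_and(gt, pred).sum()
    return 2.0 * intersection / (gt.sum() + pred.sum())

def hd95(gt, pred, spacing=(1.0, 1.0, 1.0)):
    """95th percentile of the symmetric surface distance (HD95), in mm."""
    gt, pred = gt.astype(bool), pred.astype(bool)
    # Surface voxels = mask minus its erosion.
    gt_surface = gt ^ ndimage.binary_erosion(gt)
    pred_surface = pred ^ ndimage.binary_erosion(pred)
    # Distance of every voxel to the nearest surface voxel of the other mask.
    dist_to_pred = ndimage.distance_transform_edt(~pred_surface, sampling=spacing)
    dist_to_gt = ndimage.distance_transform_edt(~gt_surface, sampling=spacing)
    distances = np.concatenate([dist_to_pred[gt_surface], dist_to_gt[pred_surface]])
    return np.percentile(distances, 95)
```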

Important notes:

  • To compute the overall mean metrics, results will first be aggregated over all cases and then over all organs (see the short example after this list).
  • Ranking in the Preliminary Test Phase is performed based on the mean rank of all three metrics: DSC, HD95 and clDSC (the latter only for tubular organs).
  • Statistical ranking (as described above) will only be used to determine the final ranking (i.e. after the end of the Final Test Phase).
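
As a toy example of this aggregation order, with hypothetical per-case DSC values for two organs, the per-organ mean is taken over cases first and the overall mean is then taken over organs:

```python
# Toy example of the aggregation order with hypothetical per-case DSC values:
# average over cases for each organ first, then average over organs.
import numpy as np

dsc = {
    "Brainstem":  [0.91, 0.88, 0.90],   # one value per test case
    "SpinalCord": [0.85, 0.87, 0.86],
}
per_organ_mean = {organ: np.mean(values) for organ, values in dsc.items()}
overall_mean_dsc = np.mean(list(per_organ_mean.values()))   # ~0.878
```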