
Journal Club by SWISS/KNIFE

Original Paper

Dimitrios Chatziisaak, Pascal Burri, Moritz Sparn, Dieter Hahnloser, Thomas Steffen, Stephan Bischofberger. 

Concordance of ChatGPT artificial intelligence decision-making in colorectal cancer multidisciplinary meetings: retrospective study. BJS Open 2025 May 7;9(3):zraf040.  doi: 10.1093/bjsopen/zraf040.

The article evaluates the concordance between ChatGPT-4, a large language model (LLM), and the treatment recommendations of multidisciplinary team meetings (MDTs) in colorectal cancer (CRC) management. MDTs are widely regarded as the gold standard for complex oncology decision-making, but they require considerable time and resources. Artificial intelligence has the potential to streamline and support these processes.

This single-centre retrospective study included 100 consecutive adult patients diagnosed with colorectal cancer (ICD-10: C18–C20) between September and December 2023. Each patient’s case was presented both to the institutional MDT and to ChatGPT-4 using real-world clinical data — colonoscopy reports, imaging, histopathology, and demographic information — without pre-processing, filtering, or interpretation.

Two MDT time points were analyzed:

  • Pretherapeutic MDT 1 (before initiation of treatment).
  • Posttherapeutic MDT 2 (after surgical intervention).

ChatGPT-4 was tasked with recommending a single best treatment plan for each case in accordance with the German S3 guidelines. Structured prompts were designed to reflect the real-world format of MDT case presentations.
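The paper does not reproduce its prompts verbatim, so the following is only a hypothetical sketch of what a structured, MDT-style case-presentation prompt might look like; every field name and clinical detail below is an invented illustration, not the study's actual wording.

```python
# Hypothetical template mirroring the MDT case-presentation format
# described above; all field names and values are illustrative only.
CASE_TEMPLATE = (
    "Patient: {age}-year-old {sex}, ASA {asa}.\n"
    "Colonoscopy: {colonoscopy}\n"
    "Imaging: {imaging}\n"
    "Histopathology: {histopathology}\n"
    "Task: recommend a single best treatment plan in accordance with "
    "the German S3 colorectal cancer guidelines."
)

prompt = CASE_TEMPLATE.format(
    age=64,
    sex="male",
    asa="II",
    colonoscopy="stenosing tumour 12 cm from the anal verge",
    imaging="CT/MRI staging cT3 cN1 cM0",
    histopathology="moderately differentiated adenocarcinoma",
)
```

Feeding the raw report text into such a template, rather than pre-digested summaries, is what distinguishes this "unfiltered" design from earlier studies.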

Three independent reviewers compared ChatGPT-4’s output with MDT recommendations and classified agreement as complete concordance, partial concordance, or discordance.

  • For MDT 1, ChatGPT achieved complete concordance in 72.5%, partial concordance in 10.2%, and discordance in 17.3% of cases.
  • For MDT 2, the rates were 82.8%, 11.8%, and 5.4%, respectively.
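Rates like these are simple per-category proportions of the reviewers' case-level classifications. As a toy illustration (the study's per-case labels are not published here, so the sample below is made up):

```python
from collections import Counter

# Hypothetical per-case agreement labels from the three-reviewer
# comparison; these eight values are illustrative, not study data.
labels = [
    "complete", "complete", "partial", "complete",
    "discordant", "complete", "complete", "partial",
]

counts = Counter(labels)
total = len(labels)
# Percentage of cases falling into each agreement category
rates = {category: 100 * n / total for category, n in counts.items()}
# For this toy sample: complete 62.5%, partial 25.0%, discordant 12.5%
```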

Discordance was more frequent in older patients (>77 years) and those with ASA (American Society of Anesthesiologists) score ≥ III. Multinomial logistic regression confirmed age above 77 years as a significant predictor of discordance (P = 0.008). Cases involving rectal cancer and nodal stage N1 were more likely to show partial concordance. Interestingly, MDT decisions in discordant cases tended to deviate from strict guideline adherence more often than ChatGPT, reflecting the importance of individualized clinical judgment in complex scenarios.

ChatGPT consistently followed guideline-based recommendations and did not produce hallucinations (false or fabricated outputs). However, it was unable to account for clinical nuances such as frailty, comorbidities, and psychosocial factors, leading to occasional oversimplifications.

Compared with earlier studies that relied on pre-processed or simplified case inputs, this work represents a more real-world application, using raw, unfiltered data. The findings support cautious optimism: LLMs may assist MDTs in reinforcing guideline adherence and providing a baseline recommendation but cannot replace the depth of clinical expertise, contextual reasoning, and patient-centered considerations provided by human teams.

Interview with Dr. med. Dimitrios Chatziisaak (St. Galler Spitalverbunde)

 

What inspired you to conduct this study?

The idea was born from two realities that every colorectal surgeon knows well. First, the treatment of colorectal cancer has become incredibly complex. Decisions today depend on a combination of surgery, oncology, radiology, pathology, and sometimes even genetics, all discussed in the setting of multidisciplinary meetings. These meetings are the backbone of modern cancer care, but they also demand significant time, coordination, and attention to detail.

Second, artificial intelligence has reached a stage where it is no longer a futuristic concept but a practical tool entering our daily lives. We asked ourselves: could such a system be applied meaningfully in a high-stakes environment like cancer care? Our inspiration was not curiosity alone, but also the recognition that if AI can reliably support consistency in treatment planning, it could help optimize workflows, reduce errors, and free up more time for clinicians to focus on what really matters — the patient in front of them.

So the study was driven by a vision of partnership: humans and AI, working together to improve both the efficiency and the safety of decision-making in oncology.

Were there any unexpected findings?

Yes, several. One of the most striking was the degree of concordance between ChatGPT and the decisions of the multidisciplinary team. We anticipated some overlap, but the consistency in more standard, guideline-driven cases was higher than we expected. This showed us that AI can indeed function as a reliable “baseline checker” — aligning with established standards of care in many scenarios.

But what was equally interesting was that in certain cases the AI picked up on small but relevant details that might otherwise have been underemphasized in the discussion. For example, subtle aspects in the histopathology reports — such as margin descriptions or specific features of tumor differentiation — were sometimes highlighted more explicitly by the AI. These details may not have altered the treatment pathway in that specific meeting, but under different circumstances, they could prove clinically significant.

At the same time, the study reminded us of AI’s current boundaries. In highly complex cases requiring deep contextualization, weighing comorbidities, or considering the patient’s overall goals of care, the AI could not match the nuanced reasoning of an experienced team. So while it was encouraging to see the system perform well, it also reinforced the irreplaceable role of human expertise and empathy in cancer care.

What is the direct impact on the surgeon’s work?

For surgeons, the direct impact lies in the possibility of having a structured, guideline-based support tool at their side. Surgeons often lead discussions in tumor boards, and they bear the responsibility of synthesizing the different perspectives into a coherent treatment plan. Having an AI that can review the case beforehand and suggest a plan aligned with international guidelines could save valuable time during meetings and ensure that no step is overlooked.

More importantly, it can provide a form of reassurance. Knowing that an independent system has arrived at the same conclusion adds confidence, especially in straightforward cases. In turn, this allows surgeons to dedicate more energy to the more difficult, borderline cases where human judgment truly makes the difference. In the future, we could see AI also helping with summarizing prior cases, comparing treatment options, and even generating structured documentation for the medical record, further streamlining a surgeon’s workload.

What is your learning point from this project?

The strongest learning point for me is that innovation in medicine works best when it is collaborative, not competitive. This project showed us that AI is not here to replace, but to complement. The strengths of AI — speed, data recall, consistency — can fill gaps that human clinicians naturally have, while the strengths of humans — clinical judgment, empathy, and ethical reasoning — cover everything that AI lacks.

Personally, this project taught me humility as well as optimism. Humility because we saw how even the most advanced algorithms can fall short in subtle but crucial areas; optimism because we also saw that with the right safeguards, AI can truly enhance our decision-making. Another key lesson was the importance of multidisciplinary teamwork not only within medicine, but also across fields. Collaborating with data scientists, engineers, and clinicians opened a whole new perspective on how we can co-create the future of patient care.

Are there any subsequent projects planned?

Yes — in fact, this study was just a first step. One of our next goals is to test AI integration in real-time multidisciplinary meetings, not just retrospectively. This will allow us to measure not only concordance, but also the effect on workflow, decision speed, and overall satisfaction of the team members. Another important direction is to expand the scope beyond colorectal cancer, applying the same methodology to other tumor boards, such as hepatobiliary or upper GI cancers, where surgical strategies can be even more nuanced.

We are also interested in exploring prospective trials where AI serves as a “silent participant” in tumor boards — recording its recommendations and later comparing outcomes. The long-term vision is to move from simply asking, “Can AI mimic what we do?” to “How can AI help us do better?” This means developing systems that are adaptive, transparent, and co-designed with clinicians, so they become trusted partners in our work rather than black-box advisors.
