Urology consultants versus large language models: potentials and hazards for medical advice in urology

dc.contributor.author: Eckrich, Johanna
dc.contributor.author: Ellinger, Jörg
dc.contributor.author: Cox, Alexander
dc.contributor.author: Stein, Johannes
dc.contributor.author: Ritter, Manuel
dc.contributor.author: Blaikie, Andrew
dc.contributor.author: Kuhn, Sebastian
dc.contributor.author: Buhr, Christoph Raphael
dc.date.accessioned: 2025-03-24T11:05:06Z
dc.date.available: 2025-03-24T11:05:06Z
dc.date.issued: 2024
dc.description.abstract: Background: Current interest surrounding large language models (LLMs) will lead to an increase in their use for medical advice. Although LLMs offer huge potential, they also pose misinformation hazards. Objective: This study evaluates three LLMs answering urology-themed clinical case-based questions by comparing the quality of their answers to those provided by urology consultants. Methods: Forty-five case-based questions were answered by consultants and LLMs (ChatGPT 3.5, ChatGPT 4, Bard). Answers were blindly rated by four consultants on a six-step Likert scale in the categories 'medical adequacy', 'conciseness', 'coherence' and 'comprehensibility'. Possible misinformation hazards were identified, a modified Turing test was included, and the character count was matched. Results: Consultants received higher ratings in every category. The LLMs' overall performance in the language-focused categories (coherence and comprehensibility) was relatively high, but their medical adequacy was significantly poorer than the consultants'. Possible misinformation hazards were identified in 2.8% to 18.9% of LLM-generated answers, compared with <1% of the consultants' answers. LLMs also produced less concise answers with higher character counts. Among the individual LLMs, ChatGPT 4 performed best in medical accuracy (p < 0.0001) and coherence (p = 0.001), whereas Bard received the lowest scores. In the modified Turing test, responses were correctly attributed to their source with 98% accuracy for LLMs and 99% for consultants. Conclusions: The quality of the consultants' answers was superior to that of the LLMs in all categories. LLM answers scored highly on semantic measures; however, their lack of medical accuracy makes LLM 'consultations' a potential source of misinformation. Further investigation of newer LLM generations is necessary.
dc.identifier.doi: https://doi.org/10.25358/openscience-11804
dc.identifier.uri: https://openscience.ub.uni-mainz.de/handle/20.500.12030/11825
dc.language.iso: eng
dc.rights: CC-BY-4.0
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.subject.ddc: 610 Medizin (de)
dc.subject.ddc: 610 Medical sciences (en)
dc.title: Urology consultants versus large language models: potentials and hazards for medical advice in urology
dc.type: Zeitschriftenaufsatz (journal article)
jgu.journal.issue: 5
jgu.journal.title: BJUI compass
jgu.journal.volume: 5
jgu.organisation.department: FB 04 Medizin
jgu.organisation.name: Johannes Gutenberg-Universität Mainz
jgu.organisation.number: 2700
jgu.organisation.place: Mainz
jgu.organisation.ror: https://ror.org/023b0x485
jgu.pages.end: 558
jgu.pages.start: 552
jgu.publisher.doi: 10.1002/bco2.359
jgu.publisher.eissn: 2688-4526
jgu.publisher.name: Wiley
jgu.publisher.place: Hoboken, NJ
jgu.publisher.year: 2024
jgu.rights.accessrights: openAccess
jgu.subject.ddccode: 610
jgu.subject.dfg: Lebenswissenschaften (Life Sciences)
jgu.type.contenttype: Scientific article
jgu.type.dinitype: Article (en_GB)
jgu.type.resource: Text
jgu.type.version: Published version

Files

Original bundle

Name: urology_consultants_versus_la-2025032412050699165.pdf
Size: 402.95 KB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 5.1 KB
Description: Item-specific license agreed upon to submission
