Urology consultants versus large language models: potentials and hazards for medical advice in urology

dc.contributor.author: Eckrich, Johanna
dc.contributor.author: Ellinger, Jörg
dc.contributor.author: Cox, Alexander
dc.contributor.author: Stein, Johannes
dc.contributor.author: Ritter, Manuel
dc.contributor.author: Blaikie, Andrew
dc.contributor.author: Kuhn, Sebastian
dc.contributor.author: Buhr, Christoph Raphael
dc.date.accessioned: 2025-03-24T11:05:06Z
dc.date.available: 2025-03-24T11:05:06Z
dc.date.issued: 2024
dc.description.abstract: Background: Current interest surrounding large language models (LLMs) will lead to an increase in their use for medical advice. Although LLMs offer huge potential, they also pose misinformation hazards. Objective: This study evaluates three LLMs answering urology-themed clinical case-based questions by comparing the quality of their answers to those provided by urology consultants. Methods: Forty-five case-based questions were answered by consultants and LLMs (ChatGPT 3.5, ChatGPT 4, Bard). Answers were blindly rated by four consultants on a six-step Likert scale in the categories 'medical adequacy', 'conciseness', 'coherence' and 'comprehensibility'. Possible misinformation hazards were identified, a modified Turing test was included, and the character count was matched. Results: Consultants received higher ratings in every category. The LLMs' overall performance in the language-focused categories (coherence and comprehensibility) was relatively high, but their medical adequacy was significantly poorer than the consultants'. Possible misinformation hazards were identified in 2.8% to 18.9% of LLM-generated answers, compared with <1% of the consultants' answers. LLMs also produced less concise answers with higher character counts. Among the individual LLMs, ChatGPT 4 performed best in medical accuracy (p < 0.0001) and coherence (p = 0.001), whereas Bard received the lowest scores. In the modified Turing test, responses were correctly attributed to their source with 98% accuracy for LLMs and 99% for consultants. Conclusions: The quality of the consultants' answers was superior to that of the LLMs in all categories. LLM answers scored highly on semantic measures; however, their lack of medical accuracy makes LLM 'consultations' a potential source of misinformation. Further investigation of newer LLM generations is necessary.
dc.identifier.doi: https://doi.org/10.25358/openscience-11804
dc.identifier.uri: https://openscience.ub.uni-mainz.de/handle/20.500.12030/11825
dc.language.iso: eng
dc.rights: CC-BY-4.0
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.subject.ddc: 610 Medizin (de)
dc.subject.ddc: 610 Medical sciences (en)
dc.title: Urology consultants versus large language models: potentials and hazards for medical advice in urology
dc.type: Zeitschriftenaufsatz (journal article)
jgu.journal.issue: 5
jgu.journal.title: BJUI compass
jgu.journal.volume: 5
jgu.organisation.department: FB 04 Medizin
jgu.organisation.name: Johannes Gutenberg-Universität Mainz
jgu.organisation.number: 2700
jgu.organisation.place: Mainz
jgu.organisation.ror: https://ror.org/023b0x485
jgu.pages.end: 558
jgu.pages.start: 552
jgu.publisher.doi: 10.1002/bco2.359
jgu.publisher.eissn: 2688-4526
jgu.publisher.name: Wiley
jgu.publisher.place: Hoboken, NJ
jgu.publisher.year: 2024
jgu.rights.accessrights: openAccess
jgu.subject.ddccode: 610
jgu.subject.dfg: Lebenswissenschaften (Life Sciences)
jgu.type.contenttype: Scientific article
jgu.type.dinitype: Article (en_GB)
jgu.type.resource: Text
jgu.type.version: Published version

Files

Original bundle

Name: urology_consultants_versus_la-2025032412050699165.pdf
Size: 402.95 KB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 5.1 KB
Description: Item-specific license agreed upon to submission
