Urology consultants versus large language models: potentials and hazards for medical advice in urology
| dc.contributor.author | Eckrich, Johanna | |
| dc.contributor.author | Ellinger, Jörg | |
| dc.contributor.author | Cox, Alexander | |
| dc.contributor.author | Stein, Johannes | |
| dc.contributor.author | Ritter, Manuel | |
| dc.contributor.author | Blaikie, Andrew | |
| dc.contributor.author | Kuhn, Sebastian | |
| dc.contributor.author | Buhr, Christoph Raphael | |
| dc.date.accessioned | 2025-03-24T11:05:06Z | |
| dc.date.available | 2025-03-24T11:05:06Z | |
| dc.date.issued | 2024 | |
| dc.description.abstract | Background: Current interest surrounding large language models (LLMs) will lead to an increase in their use for medical advice. Although LLMs offer huge potential, they also pose misinformation hazards. Objective: This study evaluates three LLMs answering urology-themed clinical case-based questions by comparing the quality of their answers with those provided by urology consultants. Methods: Forty-five case-based questions were answered by consultants and LLMs (ChatGPT 3.5, ChatGPT 4, Bard). Answers were blindly rated by four consultants on a six-step Likert scale in the categories 'medical adequacy', 'conciseness', 'coherence' and 'comprehensibility'. Possible misinformation hazards were identified; in addition, a modified Turing test was included, and the character count was matched. Results: Consultants received higher ratings in every category. The LLMs' overall performance in the language-focused categories (coherence and comprehensibility) was relatively high, but their medical adequacy was significantly poorer than that of the consultants. Possible misinformation hazards were identified in 2.8% to 18.9% of LLM-generated answers, compared with <1% of consultants' answers. LLMs also scored lower on conciseness and produced answers with higher character counts. Among the individual LLMs, ChatGPT 4 performed best in medical adequacy (p < 0.0001) and coherence (p = 0.001), whereas Bard received the lowest scores. In the modified Turing test, responses were correctly attributed to their source with 98% accuracy for LLMs and 99% for consultants. Conclusions: The quality of consultant answers was superior to that of LLMs in all categories. Although LLM answers scored highly on semantic measures, their lack of medical accuracy creates potential misinformation hazards from LLM 'consultations'. Further investigation of newer LLM generations is necessary. | |
| dc.identifier.doi | https://doi.org/10.25358/openscience-11804 | |
| dc.identifier.uri | https://openscience.ub.uni-mainz.de/handle/20.500.12030/11825 | |
| dc.language.iso | eng | |
| dc.rights | CC-BY-4.0 | |
| dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ | |
| dc.subject.ddc | 610 Medizin | de |
| dc.subject.ddc | 610 Medical sciences | en |
| dc.title | Urology consultants versus large language models: potentials and hazards for medical advice in urology | |
| dc.type | Journal article | |
| jgu.journal.issue | 5 | |
| jgu.journal.title | BJUI Compass | |
| jgu.journal.volume | 5 | |
| jgu.organisation.department | FB 04 Medizin | |
| jgu.organisation.name | Johannes Gutenberg-Universität Mainz | |
| jgu.organisation.number | 2700 | |
| jgu.organisation.place | Mainz | |
| jgu.organisation.ror | https://ror.org/023b0x485 | |
| jgu.pages.end | 558 | |
| jgu.pages.start | 552 | |
| jgu.publisher.doi | 10.1002/bco2.359 | |
| jgu.publisher.eissn | 2688-4526 | |
| jgu.publisher.name | Wiley | |
| jgu.publisher.place | Hoboken, NJ | |
| jgu.publisher.year | 2024 | |
| jgu.rights.accessrights | openAccess | |
| jgu.subject.ddccode | 610 | |
| jgu.subject.dfg | Life sciences | |
| jgu.type.contenttype | Scientific article | |
| jgu.type.dinitype | Article | en_GB |
| jgu.type.resource | Text | |
| jgu.type.version | Published version |