J Med Syst. 2025 Feb 14;49(1):23. doi: 10.1007/s10916-025-02152-9.
ABSTRACT
BACKGROUND: Generative large language models (LLMs) are increasingly integrated into the medical field. However, their actual efficacy in clinical decision-making remains only partially explored. This study aimed to assess the performance of three LLMs, ChatGPT-4, Gemini, and Med-Go, in the domain of professional medicine when confronted with real clinical cases.
METHODS: This study involved 134 clinical cases spanning nine medical disciplines. For every case, each LLM was required to provide suggestions for diagnosis, diagnostic criteria, differential diagnosis, examination, and treatment. Responses were scored by two experts using a predefined rubric.
RESULTS: In overall performance, Med-Go achieved the highest median score (37.5, IQR 31.9-41.5), while Gemini recorded the lowest (33.0, IQR 25.5-36.6); the difference among the three LLMs was statistically significant (p < 0.001). Analysis revealed that responses related to differential diagnosis were the weakest, while those pertaining to treatment recommendations were the strongest. Med-Go displayed notable performance advantages in gastroenterology, nephrology, and neurology.
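The abstract does not name the statistical test used; a common choice for comparing median scores across three independent groups is the Kruskal-Wallis test. A minimal sketch of such a comparison is shown below, assuming per-case rubric totals are available for each model; the score arrays are illustrative placeholders, not the published data.

```python
# Sketch of a three-group nonparametric comparison (Kruskal-Wallis).
# Assumption: per-case rubric totals exist for each model; the values
# below are placeholders for illustration only, not the study's data.
from scipy.stats import kruskal

med_go_scores  = [37.5, 41.0, 32.0, 39.5, 42.5]   # placeholder values
chatgpt_scores = [35.0, 36.5, 30.0, 38.0, 34.5]   # placeholder values
gemini_scores  = [33.0, 28.5, 25.5, 36.0, 31.0]   # placeholder values

stat, p_value = kruskal(med_go_scores, chatgpt_scores, gemini_scores)
print(f"Kruskal-Wallis H = {stat:.2f}, p = {p_value:.4f}")
```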
CONCLUSIONS: The findings show that all three LLMs achieved over 60% of the maximum possible score, indicating their potential applicability in clinical practice. However, inaccuracies that could lead to adverse decisions underscore the need for caution in their application. Med-Go's superior performance highlights the benefits of incorporating specialized medical knowledge into LLM training. It is anticipated that further development and refinement of medical LLMs will enhance their precision and safety in clinical use.
PMID:39948214 | DOI:10.1007/s10916-025-02152-9