| Field | Value |
|---|---|
| Indexed | |
| DOI | 10.1186/S12909-025-07250-3 |
| Year | 2025 |
| Type | Research article |
Background: Large language models (LLMs) like GPT-4o have shown promise in advancing medical decision-making and education. However, their performance in Spanish-language medical contexts remains underexplored. This study evaluates the effectiveness of single-agent and multi-agent strategies in answering questions from the EUNACOM, a standardized medical licensure exam in Chile, across 21 medical specialties.

Methods: GPT-4o was tested on 1,062 multiple-choice questions from publicly available EUNACOM preparation materials. Single-agent strategies included Zero-Shot, Few-Shot, Chain-of-Thought (CoT), Self-Reflection, and MED-PROMPT, while multi-agent strategies involved Voting, Weighted Voting, Borda Count, MEDAGENTS, and MDAGENTS. Each strategy was tested under three temperature settings (0.3, 0.6, 1.2). Performance was assessed by accuracy, and statistical analyses, including Kruskal-Wallis and Mann-Whitney U tests, were performed. Computational resource utilization, such as API calls and execution time, was also analyzed.

Results: MDAGENTS achieved the highest accuracy with a mean score of 89.97% (SD = 0.56%), outperforming all other strategies (p < 0.001). MEDAGENTS followed with a mean score of 87.99% (SD = 0.49%), and the CoT with Few-Shot strategy scored 87.67% (SD = 0.12%). Temperature settings did not significantly affect performance (F(2, 54) = 1.45, p = 0.24). Specialty-level analysis showed the highest accuracies in Psychiatry (95.51%), Neurology (95.49%), and Surgery (95.38%), while lower accuracies were observed in Neonatology (77.54%), Otolaryngology (76.64%), and Urology/Nephrology (76.59%). Notably, several exam questions were correctly answered using simpler single-agent strategies without employing complex reasoning or collaboration frameworks.

Conclusions and relevance: Multi-agent strategies, particularly MDAGENTS, significantly enhance GPT-4o's performance on Spanish-language medical exams, leveraging collaboration to improve diagnostic accuracy. However, simpler single-agent strategies are sufficient to address many questions, highlighting that only a fraction of standardized medical exams require sophisticated reasoning or multi-agent interaction. These findings suggest potential for LLMs as efficient and scalable tools in Spanish-speaking healthcare, though computational optimization remains a key area for future research.
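The abstract names Voting, Weighted Voting, and Borda Count as multi-agent aggregation strategies but does not describe how they were implemented. The following is a minimal sketch of the textbook versions of these three rules, applied to answers on a multiple-choice question. The function names, the example weights, and the idea that each agent returns either a single option or a full ranking of options are assumptions made for illustration, not details taken from the study.

```python
from collections import Counter, defaultdict


def majority_vote(answers):
    """Return the option picked by the most agents.

    Ties go to the option that appeared first in `answers`.
    """
    return Counter(answers).most_common(1)[0][0]


def weighted_vote(answers, weights):
    """Return the option with the largest total weight.

    `weights[i]` is a hypothetical confidence score for agent i.
    """
    totals = defaultdict(float)
    for answer, weight in zip(answers, weights):
        totals[answer] += weight
    return max(totals, key=totals.get)


def borda_count(rankings):
    """Return the Borda winner.

    Each ranking lists options from most to least preferred; an option
    at position i in a ranking of n options earns n - 1 - i points.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, option in enumerate(ranking):
            scores[option] += n - 1 - position
    return max(scores, key=scores.get)


# Illustrative inputs only (three hypothetical agents, options A to C).
answers = ["B", "A", "B"]
weights = [0.2, 0.9, 0.3]
rankings = [["A", "B", "C"], ["B", "C", "A"], ["B", "A", "C"]]

print(majority_vote(answers))           # B
print(weighted_vote(answers, weights))  # A
print(borda_count(rankings))            # B
```

In this toy input, the weighted rule disagrees with the plain majority because one agent carries most of the weight, which is the kind of difference these aggregation rules are meant to expose.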
| No. | Author | Gender | Institution - Country |
|---|---|---|---|
| 1 | Altermatt, Fernando R. | - | Pontificia Universidad Católica de Chile - Chile |
| 2 | Neyem, Andres | - | Pontificia Universidad Católica de Chile - Chile; Agencia Nacional de Investigación y Desarrollo (ANID) - Chile |
| 3 | Sumonte, Nicolas | - | Pontificia Universidad Católica de Chile - Chile; Agencia Nacional de Investigación y Desarrollo (ANID) - Chile |
| 4 | Mendoza, Marcelo | - | Pontificia Universidad Católica de Chile - Chile; Agencia Nacional de Investigación y Desarrollo (ANID) - Chile |
| 5 | Villagran, Ignacio | Male | Pontificia Universidad Católica de Chile - Chile |
| 6 | Lacassie, Hector J. | - | Pontificia Universidad Católica de Chile - Chile |
| Funding source |
|---|
| Fondo de Fomento al Desarrollo Científico y Tecnológico |
| Agencia Nacional de Investigación y Desarrollo |
| National Research and Development Agency |
| National Research and Development Agency (ANID) |
| CENIA |
| National Center for Artificial Intelligence |
| National Center for Artificial Intelligence (CENIA) |
| National Research and Development Agency (ANID), FONDEF IDeA ID23I10319 |
| Acknowledgment |
|---|
| The authors would like to acknowledge the support of the National Research and Development Agency (ANID) and the National Center for Artificial Intelligence (CENIA) for providing the resources and funding necessary to conduct this research. Special thanks are extended to the entire research team for their collaboration and dedication throughout the study. |
| This work was partially supported by the National Research and Development Agency (ANID), FONDEF IDeA ID23I10319. The contributions of A.N., N.I.S., and M.M. were supported in part by the National Center for Artificial Intelligence under Grant FB210017, Basal ANID. |