Performance of single-agent and multi-agent language models in Spanish language medical competency exams
Indexed
WoS WOS:001484399000005
Scopus SCOPUS_ID:105004434766
DOI 10.1186/S12909-025-07250-3
Year 2025
Type research article


Abstract

Background: Large language models (LLMs) like GPT-4o have shown promise in advancing medical decision-making and education. However, their performance in Spanish-language medical contexts remains underexplored. This study evaluates the effectiveness of single-agent and multi-agent strategies in answering questions from the EUNACOM, a standardized medical licensure exam in Chile, across 21 medical specialties.

Methods: GPT-4o was tested on 1,062 multiple-choice questions from publicly available EUNACOM preparation materials. Single-agent strategies included Zero-Shot, Few-Shot, Chain-of-Thought (CoT), Self-Reflection, and MED-PROMPT, while multi-agent strategies involved Voting, Weighted Voting, Borda Count, MEDAGENTS, and MDAGENTS. Each strategy was tested under three temperature settings (0.3, 0.6, 1.2). Performance was assessed by accuracy, and statistical analyses, including Kruskal-Wallis and Mann-Whitney U tests, were performed. Computational resource utilization, such as API calls and execution time, was also analyzed.

Results: MDAGENTS achieved the highest accuracy with a mean score of 89.97% (SD = 0.56%), outperforming all other strategies (p < 0.001). MEDAGENTS followed with a mean score of 87.99% (SD = 0.49%), and the CoT with Few-Shot strategy scored 87.67% (SD = 0.12%). Temperature settings did not significantly affect performance (F(2, 54) = 1.45, p = 0.24). Specialty-level analysis showed the highest accuracies in Psychiatry (95.51%), Neurology (95.49%), and Surgery (95.38%), while lower accuracies were observed in Neonatology (77.54%), Otolaryngology (76.64%), and Urology/Nephrology (76.59%). Notably, several exam questions were correctly answered using simpler single-agent strategies without employing complex reasoning or collaboration frameworks.

Conclusions and relevance: Multi-agent strategies, particularly MDAGENTS, significantly enhance GPT-4o's performance on Spanish-language medical exams, leveraging collaboration to improve diagnostic accuracy. However, simpler single-agent strategies are sufficient to address many questions, highlighting that only a fraction of standardized medical exams require sophisticated reasoning or multi-agent interaction. These findings suggest potential for LLMs as efficient and scalable tools in Spanish-speaking healthcare, though computational optimization remains a key area for future research.
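The abstract names Voting, Weighted Voting, and Borda Count among the multi-agent strategies but does not say how they were implemented. The following Python sketch shows the textbook form of each rule for aggregating multiple-choice answers. The function names, the example answers and weights, and the alphabetical tie-breaking rule are all assumptions made for illustration; they are not taken from the paper.

```python
from collections import Counter, defaultdict


def _winner(scores):
    """Return the top-scoring option; ties are broken alphabetically (an assumption)."""
    best = max(scores.values())
    return min(option for option, score in scores.items() if score == best)


def majority_vote(answers):
    """Plain voting: each agent contributes one vote for its chosen option."""
    return _winner(Counter(answers))


def weighted_vote(answers, weights):
    """Weighted voting: each agent's vote counts in proportion to its weight."""
    scores = defaultdict(float)
    for option, weight in zip(answers, weights):
        scores[option] += weight
    return _winner(scores)


def borda_count(rankings):
    """Borda count: each agent ranks the options from most to least preferred.

    With n options, first place earns n - 1 points and last place earns 0.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, option in enumerate(ranking):
            scores[option] += n - 1 - position
    return _winner(scores)


# Hypothetical outputs from five agents on one multiple-choice question.
answers = ["B", "C", "B", "A", "C"]
weights = [0.5, 0.9, 0.4, 0.3, 0.9]  # e.g. per-agent confidence (assumed)
rankings = [
    ["B", "C", "A", "D"],
    ["C", "B", "A", "D"],
    ["B", "C", "A", "D"],
    ["A", "C", "B", "D"],
    ["C", "A", "B", "D"],
]

print(majority_vote(answers))
print(weighted_vote(answers, weights))
print(borda_count(rankings))
```

On this example the three rules need not agree: plain voting sees a tie between two options, while the weighted and ranked variants use extra information (weights or full rankings) to separate them.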

Journal

Journal ISSN
BMC Medical Education 1472-6920


Research Disciplines

WoS
Education & Educational Research
Education, Scientific Disciplines
Scopus
No disciplines
SciELO
No disciplines

Shows the distribution of disciplines for this publication.

Publications from WoS (editions: ISSHP, ISTP, AHCI, SSCI, SCI), Scopus, and SciELO Chile.



Authors - Affiliation

Ord. | Author | Gender | Institution - Country
1 | Altermatt, Fernando R. | - | Pontificia Universidad Católica de Chile - Chile
2 | Neyem, Andres | - | Pontificia Universidad Católica de Chile - Chile
  |  |  | Natl Res & Dev Agcy ANID - Chile
  |  |  | Agencia Nacional de Investigación y Desarrollo - Chile
3 | Sumonte, Nicolas | - | Pontificia Universidad Católica de Chile - Chile
  |  |  | Natl Res & Dev Agcy ANID - Chile
  |  |  | Agencia Nacional de Investigación y Desarrollo - Chile
4 | Mendoza, Marcelo | - | Pontificia Universidad Católica de Chile - Chile
  |  |  | Natl Res & Dev Agcy ANID - Chile
  |  |  | Agencia Nacional de Investigación y Desarrollo - Chile
5 | Villagran, Ignacio | Male | Pontificia Universidad Católica de Chile - Chile
6 | Lacassie, Hector J. | - | Pontificia Universidad Católica de Chile - Chile

Shows the affiliation and (detected) gender of the publication's co-authors.

Funding

Source
Fondo de Fomento al Desarrollo Científico y Tecnológico
Agencia Nacional de Investigación y Desarrollo
National Research and Development Agency
National Research and Development Agency (ANID)
CENIA
National Center for Artificial Intelligence
National Center for Artificial Intelligence (CENIA)
National Research and Development Agency (ANID), FONDEF IDeA ID23I10319

Shows the funding source declared in the publication.

Acknowledgments

The authors would like to acknowledge the support of the National Research and Development Agency (ANID) and the National Center for Artificial Intelligence (CENIA) for providing the resources and funding necessary to conduct this research. Special thanks are extended to the entire research team for their collaboration and dedication throughout the study.
This work was partially supported by the National Research and Development Agency (ANID), FONDEF IDeA ID23I10319. The contributions of A.N., N.I.S., and M.M. were supported in part by the National Center for Artificial Intelligence under Grant FB210017, Basal ANID.