SciELO Chile Collection

Department of Knowledge Management, Monitoring and Foresight
Questions or comments: productividad@anid.cl



Evaluating the Performance of Large Language Models on the CONACEM Anesthesiology Certification Exam: A Comparison with Human Participants
Indexed
WoS WOS:001505781900001
DOI 10.3390/APP15116245
Year 2025
Type research article


Abstract



Large Language Models (LLMs) have demonstrated strong performance on English-language medical exams, but their effectiveness in non-English, high-stakes environments is less understood. This study benchmarks nine LLMs against human examinees on the Chilean Anesthesiology Certification Exam (CONACEM), a Spanish-language board examination. A curated set of 63 multiple-choice questions was used, categorized by Bloom's taxonomy into four cognitive levels. Model responses were assessed using Item Response Theory and Classical Test Theory, complemented by an additional error analysis that categorized errors as reasoning-based, knowledge-based, or comprehension-related. Closed-source models surpassed open-source models, with GPT-o1 achieving the highest accuracy (88.7%); DeepSeek-R1 was a strong performer among the open-source options. Item difficulty significantly predicted model accuracy, while discrimination did not. Most errors occurred in application and understanding tasks and were linked to flawed reasoning or knowledge misapplication. These results underscore LLMs' potential for factual recall in Spanish medical exams but also their limitations in complex reasoning. Incorporating cognitive classification and an error taxonomy provides deeper insight into model behavior and supports their cautious use as educational aids in clinical settings.

Journal

Journal: Applied Sciences (Basel)
ISSN: 2076-3417

External Metrics

PlumX, Altmetric, Dimensions

Shows external impact metrics associated with the publication.

Research Disciplines



WOS
Chemistry, Multidisciplinary
Engineering, Multidisciplinary
Physics, Applied
Materials Science, Multidisciplinary
Scopus
No disciplines
SciELO
No disciplines

Shows the distribution of disciplines for this publication.

WoS publications (Editions: ISSHP, ISTP, AHCI, SSCI, SCI), Scopus, SciELO Chile.

Institutional Collaboration



Shows the distribution of national and international collaboration generated by this publication.


Authors - Affiliation



No. Author Gender Institution - Country
1 Altermatt, Fernando R. - Pontificia Universidad Católica de Chile - Chile
2 Neyem, Andres - Pontificia Universidad Católica de Chile - Chile
Natl Res & Dev Agcy ANID - Chile
3 Sumonte, Nicolas I. - Pontificia Universidad Católica de Chile - Chile
Natl Res & Dev Agcy ANID - Chile
4 Villagran, Ignacio - Pontificia Universidad Católica de Chile - Chile
5 Mendoza, Marcelo - Pontificia Universidad Católica de Chile - Chile
Natl Res & Dev Agcy ANID - Chile
Instituto Milenio Fundamentos de los Datos - Chile
6 Lacassie, Hector J. - Pontificia Universidad Católica de Chile - Chile

Shows the affiliation and (detected) gender of the publication's co-authors.

Funding



Source
Pontificia Universidad Católica de Chile
Basal ANID
National Center for Artificial Intelligence
National Research and Development Agency (ANID), FONDEF IDeA

Shows the funding sources declared in the publication.

Acknowledgments



Acknowledgment
This research was funded by the National Research and Development Agency (ANID), FONDEF IDeA [grant number ID23I10319]. The contributions of A.N., N.I.S., and M.M. were supported in part by the National Center for Artificial Intelligence [grant number FB210017], and Basal ANID. The APC was funded by the Pontificia Universidad Catolica de Chile.

Shows the acknowledgments declared in the publication.