Dataciencia

SciELO Chile Collection

Assessing the Limits of Straightforward Models for Nested Named Entity Recognition in Spanish Clinical Narratives

Indexed in

Scopus

SCOPUS_ID:85154622612

DOI

Year

2022

Type

Total Citations

Authors with Chilean Affiliation

Chilean Institutions

% International Participation

Authors with Foreign Affiliation

Foreign Institutions

Abstract

Nested Named Entity Recognition (NER) is an information extraction task that aims to identify entities that may be nested within other entity mentions. Despite the availability of several corpora with nested entities in the Spanish clinical domain, most previous work has overlooked them due to the lack of models and a clear annotation scheme for dealing with the task. To fill this gap, this paper provides an empirical study of straightforward methods for tackling the nested NER task on two Spanish clinical datasets, Clinical Trials and the Chilean Waiting List. We assess the advantages and limitations of two sequence labeling approaches: one based on Multiple LSTM-CRF architectures and another on Joint labeling models. To better understand the differences between these models, we compute task-specific metrics that adequately measure the ability of models to detect nested entities and perform a fine-grained comparison across models. Our experimental results show that employing domain-specific language models trained from scratch significantly improves the performance obtained with strong domain-specific and general-domain baselines, achieving state-of-the-art results in both datasets. Specifically, we obtained F1 scores of 89.21 and 83.16 in Clinical Trials and the Chilean Waiting List, respectively. Interestingly enough, we observe that the task-specific metrics and analysis properly reflect the limitations of the models when recognizing nested entities. Finally, we perform a case study on an aggregated NER dataset created from several clinical corpora in Spanish. We highlight how entity length and the simultaneous recognition of inner and outer entities are the most critical variables for the nested NER task.
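
As a rough illustration of what a joint labeling scheme for nested entities can look like, the sketch below merges the BIO tags of overlapping entities into one composite tag per token, so that a single sequence labeler could predict inner and outer entities together. This is not the authors' implementation. The function names, the `+` separator, the example phrase, and the entity types `Disease` and `Body_Part` are assumptions made for illustration only.

```python
from typing import List, Tuple

# (start, end_exclusive, entity_type); an illustrative span format, not the paper's
Span = Tuple[int, int, str]


def bio_tags(n_tokens: int, spans: List[Span]) -> List[str]:
    """Flat BIO tags for a set of non-overlapping spans."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


def joint_tags(n_tokens: int, spans: List[Span]) -> List[str]:
    """One composite tag per token, joining the BIO tag of each entity type.

    Simplification: entities of the same type are assumed not to nest
    inside each other, so one BIO layer per type is enough.
    """
    types = sorted({label for _, _, label in spans})
    layers = [bio_tags(n_tokens, [s for s in spans if s[2] == t]) for t in types]
    composite = []
    for i in range(n_tokens):
        active = [layer[i] for layer in layers if layer[i] != "O"]
        composite.append("+".join(active) if active else "O")
    return composite


# Hypothetical example: "colon" is a body part nested inside a disease mention.
tokens = ["cáncer", "de", "colon"]
spans = [(0, 3, "Disease"), (2, 3, "Body_Part")]
print(list(zip(tokens, joint_tags(len(tokens), spans))))
```

Under this encoding the last token carries both the inner and the outer entity in a single label. A multiple-model approach would instead train one tagger per layer and merge the predictions afterwards.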

Research Disciplines

WOS
No disciplines

Scopus
No disciplines

SciELO
No disciplines

Shows the distribution of disciplines for this publication.

WoS publications (editions: ISSHP, ISTP, AHCI, SSCI, SCI), Scopus, SciELO Chile.

Institutional Collaboration

Shows the distribution of collaboration, both national and foreign, generated in this publication.

Authors - Affiliation

No.	Author	Gender	Institution - Country
1	ROJAS-VALENZUELA, MATIAS ISMAEL	Male	Universidad de Chile - Chile
2	Carrino, Casimiro Pio	-	Centro Nacional de Supercomputación - Spain
3	González, Aitor	Male	Centro Nacional de Supercomputación - Spain
4	Dunstan, Jocelyn	Female	Universidad de Chile - Chile
5	Villegas, Marta	-	Centro Nacional de Supercomputación - Spain

Shows the affiliation and (detected) gender of the publication's co-authors.

Funding

Source
FONDEQUIP
Universidad Austral de Chile
IMFD
FONDECYT
Agencia Nacional de Investigación y Desarrollo
SEDIA
Spanish State Secretariat for Digitalization and Artificial Intelligence

Shows the funding source declared in the publication.

Acknowledgments

Acknowledgment
This work was funded by ANID Chile: Basal Funds for Center of Excellence FB210005 (CMM); Millennium Science Initiative Program ICN17_002 (IMFD) and ICN2021_004 (iHealth), and Fondecyt grant 11201250. In addition, it was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL8. Regarding hardware, the research was partially supported by the supercomputing infrastructure of the NLHPC (ECM-02) and the Patagón supercomputer of Universidad Austral de Chile (FONDEQUIP EQM180042).
