RaLEs: A Benchmark for Radiology Language Evaluations

Juan Manuel Zambrano Chaves1, Nandita Bhaskhar1, Maayane Attias1, Jean-Benoit Delbrouck1, Daniel Rubin1, Andreas Markus Loening1, Curtis Langlotz1, Akshay S Chaudhari1
1Stanford University

Abstract

We introduce a benchmark comprising datasets for natural language understanding and generation in radiology.

The radiology report is the main form of communication between radiologists and other clinicians. Prior work in natural language processing on radiology reports has shown the value of developing methods tailored to individual tasks, such as identifying reports with critical results or detecting disease. Meanwhile, general-domain English and biomedical natural language understanding benchmarks such as the General Language Understanding Evaluation (GLUE) and the Biomedical Language Understanding and Reasoning Benchmark (BLURB) have motivated the development of models that can be easily adapted to address many tasks in those domains. Here, we characterize the radiology report as a distinct domain and introduce RaLEs, the Radiology Language Evaluations, as a benchmark for natural language understanding and generation in radiology. RaLEs comprises seven natural language understanding and generation evaluations, including the extraction of anatomical and disease entities and their relations, procedure selection, and report summarization. We characterize the performance of models designed for the general, biomedical, clinical, and radiology domains across these tasks. We find that advances in the general and biomedical domains do not necessarily translate to radiology, and that improved models from the general domain can perform comparably to smaller domain-specific clinical models. The limited performance of existing pre-trained models on RaLEs highlights the opportunity to improve domain-specific self-supervised models for natural language processing in radiology. We propose RaLEs as a benchmark to promote and track the development of such domain-specific radiology language models.
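
As an illustration of how the encoder models in the leaderboards below are adapted to a RaLEs NLU task, the following sketch attaches a token-classification head to a pretrained encoder for entity extraction from a report sentence. This is not the official RaLEs pipeline: the checkpoint name, label set, and example sentence are placeholders, and a real evaluation would fine-tune on the task's training split before scoring.

# Minimal sketch (not the official RaLEs pipeline): adapting a pretrained encoder to
# radiology named entity recognition via a token-classification head.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-ANATOMY", "I-ANATOMY", "B-OBSERVATION", "I-OBSERVATION"]  # hypothetical tag set
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # any leaderboard encoder could be used here
model = AutoModelForTokenClassification.from_pretrained("bert-base-uncased", num_labels=len(labels))

report = "Mild cardiomegaly without focal consolidation in the right lower lobe."
inputs = tokenizer(report, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, num_labels)

# The classification head is untrained here, so predictions are meaningless until the
# model is fine-tuned on the task's training data.
predictions = [labels[i] for i in logits.argmax(dim=-1)[0].tolist()]
print(list(zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), predictions)))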

NLU Leaderboard

Model             RGNER†  RGNER‡  RGRE†  RGRE‡  RadSpRL  Stanza  Procedure  NLU Score
BERTbase          93.0    86.3    82.7   70.6   91.2     83.7    65.4       81.8
BERTlarge         92.6    88.5    82.0   71.4   85.3     85.4    64.9       81.4
RoBERTabase       92.4    89.7    81.3   70.2   89.6     81.5    64.2       81.3
RoBERTalarge      92.7    89.7    83.0   73.1   82.1     83.0    64.9       81.2
ELECTRAsmall      92.6    89.7    82.0   70.7   88.1     73.1    61.5       79.7
ELECTRAbase       93.2    85.9    83.2   71.9   89.9     85.2    64.4       82.0
ELECTRAlarge      93.0    86.0    82.3   71.0   87.5     84.9    64.2       81.3
DeBERTa-V3base    93.4    89.8    84.6   73.4   89.8     84.1    65.7       83.0
DeBERTa-V3large   93.1    89.9    83.8   73.4   89.4     85.1    64.2       82.7
PubMedBERT        92.1    86.8    82.9   71.7   88.9     85.0    65.9       81.9
BioLinkBERTbase   93.2    90.6    83.6   75.1   91.0     83.7    65.7       83.3
BioLinkBERTlarge  93.2    90.3    82.6   72.4   89.6     84.6    65.7       82.6
BioClinicalBERT   93.7    90.4    82.0   72.8   91.9     85.6    65.4       83.1
GatorTron         93.5    89.8    82.9   74.6   92.0     84.3    66.7       83.4
RadBERT1          93.8    90.1    81.2   72.1   91.2     85.1    65.4       82.7
RadBERT2          94.0    90.7    81.9   73.4   91.2     85.5    65.8       83.2

†: MIMIC-CXR (in domain), ‡: CheXpert (out of domain), NER: named entity recognition, RE: relation extraction
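
The composite NLU score appears consistent with an unweighted mean of the seven per-task scores; the sketch below reproduces the BERTbase row under that assumption (it is not the official RaLEs scoring script).

# Hypothetical aggregation of the per-task scores above into the composite NLU score.
# Assumes an unweighted mean over the seven tasks, which matches the reported values
# (e.g. 81.8 for BERTbase).
bertbase_scores = {
    "RGNER (MIMIC-CXR)": 93.0,
    "RGNER (CheXpert)": 86.3,
    "RGRE (MIMIC-CXR)": 82.7,
    "RGRE (CheXpert)": 70.6,
    "RadSpRL": 91.2,
    "Stanza": 83.7,
    "Procedure": 65.4,
}
nlu_score = sum(bertbase_scores.values()) / len(bertbase_scores)
print(f"NLU score: {nlu_score:.1f}")  # prints 81.8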

NLG Leaderboard

Model             MEDIQA 2021                     BioNLP 2023                     NLG Score
                  R-2    R-L    CheXbert  RG      R-2    R-L    CheXbert  RG
ELECTRAbase       .238   .381   .710      .378    .156   .274   .441      .229    .316
BioLinkBERTbase   .245   .388   .725      .391    .183   .297   .500      .272    .337
GatorTron         .250   .386   .719      .406    .189   .303   .506      .283    .345
RadBERT2          .237   .382   .709      .381    .184   .300   .487      .271    .334
R-2: ROUGE-2, R-L: ROUGE-L, CheXbert: CheXbert score, RG: F1 RadGraph score
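
The lexical-overlap metrics reported above (ROUGE-2 and ROUGE-L) can be computed for any generated/reference summary pair with the rouge-score package, as sketched below; CheXbert and RadGraph F1 additionally require their respective pretrained labelers and are omitted here. The example texts are illustrative only.

# Sketch of computing ROUGE-2 and ROUGE-L for one generated impression against its reference.
# Requires: pip install rouge-score
from rouge_score import rouge_scorer

reference = "No acute cardiopulmonary abnormality."               # illustrative reference impression
prediction = "No acute cardiopulmonary process is identified."    # illustrative model output

scorer = rouge_scorer.RougeScorer(["rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, prediction)
print(f"ROUGE-2 F1: {scores['rouge2'].fmeasure:.3f}")
print(f"ROUGE-L F1: {scores['rougeL'].fmeasure:.3f}")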

BibTeX

@inproceedings{chaves2023rales,
  title={Ra{LE}s: a Benchmark for Radiology Language Evaluations},
  author={Juan Manuel Zambrano Chaves and Nandita Bhaskhar and Maayane Attias and Jean-Benoit Delbrouck and Daniel Rubin and Andreas Markus Loening and Curtis Langlotz and Akshay S Chaudhari},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2023},
  url={https://openreview.net/forum?id=PWLGrvoqiR}
}