RaLEs: A Benchmark for Radiology Language Evaluations

Juan Manuel Zambrano Chaves1, Nandita Bhaskhar1, Maayane Attias1, Jean-Benoit Delbrouck1, Daniel Rubin1, Andreas Markus Loening1, Curtis Langlotz1, Akshay S Chaudhari1
1Stanford University

Abstract

We introduce a benchmark comprising datasets for natural language understanding and generation in radiology.

The radiology report is the main form of communication between radiologists and other clinicians. Prior work in natural language processing on radiology reports has shown the value of developing methods tailored to individual tasks, such as identifying reports with critical results or detecting disease. Meanwhile, general-domain English and biomedical natural language understanding benchmarks such as the General Language Understanding Evaluation (GLUE) and the Biomedical Language Understanding and Reasoning Benchmark (BLURB) have motivated the development of models that can be easily adapted to address many tasks in those domains. Here, we characterize the radiology report as a distinct domain and introduce RaLEs, the Radiology Language Evaluations, as a benchmark for natural language understanding and generation in radiology. RaLEs comprises seven natural language understanding and generation evaluations, including the extraction of anatomical and disease entities and their relations, procedure selection, and report summarization. We characterize the performance of models designed for the general, biomedical, clinical, and radiology domains across these tasks. We find that advances in the general and biomedical domains do not necessarily translate to radiology, and that improved models from the general domain can perform comparably to smaller domain-specific clinical models. The limited performance of existing pre-trained models on RaLEs highlights the opportunity to improve domain-specific self-supervised models for natural language processing in radiology. We propose RaLEs as a benchmark to promote and track the development of such domain-specific radiology language models.
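
As an illustration of how the encoder models in the leaderboards below are adapted to a RaLEs NLU task, the following sketch attaches a token-classification head to a pretrained encoder for entity extraction from a report sentence. This is not the official RaLEs pipeline: the checkpoint name, label set, and example sentence are placeholders, and a real evaluation would fine-tune on the task's training split before scoring.

# Minimal sketch (not the official RaLEs pipeline): adapting a pretrained encoder to
# radiology named entity recognition via a token-classification head.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-ANATOMY", "I-ANATOMY", "B-OBSERVATION", "I-OBSERVATION"]  # hypothetical tag set
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # any leaderboard encoder could be used here
model = AutoModelForTokenClassification.from_pretrained("bert-base-uncased", num_labels=len(labels))

report = "Mild cardiomegaly without focal consolidation in the right lower lobe."
inputs = tokenizer(report, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, num_labels)

# The classification head is untrained here, so predictions are meaningless until the
# model is fine-tuned on the task's training data.
predictions = [labels[i] for i in logits.argmax(dim=-1)[0].tolist()]
print(list(zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), predictions)))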

NLU Leaderboard

Model             RGNER†  RGNER‡  RGRE†  RGRE‡  RadSpRL  Stanza  Procedure  NLU Score
BERTbase          93.0    86.3    82.7   70.6   91.2     83.7    65.4       81.8
BERTlarge         92.6    88.5    82.0   71.4   85.3     85.4    64.9       81.4
RoBERTabase       92.4    89.7    81.3   70.2   89.6     81.5    64.2       81.3
RoBERTalarge      92.7    89.7    83.0   73.1   82.1     83.0    64.9       81.2
ELECTRAsmall      92.6    89.7    82.0   70.7   88.1     73.1    61.5       79.7
ELECTRAbase       93.2    85.9    83.2   71.9   89.9     85.2    64.4       82.0
ELECTRAlarge      93.0    86.0    82.3   71.0   87.5     84.9    64.2       81.3
DeBERTa-V3base    93.4    89.8    84.6   73.4   89.8     84.1    65.7       83.0
DeBERTa-V3large   93.1    89.9    83.8   73.4   89.4     85.1    64.2       82.7
PubMedBERT        92.1    86.8    82.9   71.7   88.9     85.0    65.9       81.9
BioLinkBERTbase   93.2    90.6    83.6   75.1   91.0     83.7    65.7       83.3
BioLinkBERTlarge  93.2    90.3    82.6   72.4   89.6     84.6    65.7       82.6
BioClinicalBERT   93.7    90.4    82.0   72.8   91.9     85.6    65.4       83.1
GatorTron         93.5    89.8    82.9   74.6   92.0     84.3    66.7       83.4
RadBERT1          93.8    90.1    81.2   72.1   91.2     85.1    65.4       82.7
RadBERT2          94.0    90.7    81.9   73.4   91.2     85.5    65.8       83.2

†: MIMIC-CXR (in domain), ‡: CheXpert (out of domain), NER: named entity recognition, RE: relation extraction
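
The composite NLU score appears consistent with an unweighted mean of the seven per-task scores; the sketch below reproduces the BERTbase row under that assumption (it is not the official RaLEs scoring script).

# Hypothetical aggregation of the per-task scores above into the composite NLU score.
# Assumes an unweighted mean over the seven tasks, which matches the reported values
# (e.g. 81.8 for BERTbase).
bertbase_scores = {
    "RGNER (MIMIC-CXR)": 93.0,
    "RGNER (CheXpert)": 86.3,
    "RGRE (MIMIC-CXR)": 82.7,
    "RGRE (CheXpert)": 70.6,
    "RadSpRL": 91.2,
    "Stanza": 83.7,
    "Procedure": 65.4,
}
nlu_score = sum(bertbase_scores.values()) / len(bertbase_scores)
print(f"NLU score: {nlu_score:.1f}")  # prints 81.8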

NLG Leaderboard

Model             MEDIQA 2021                     BioNLP 2023                     NLG Score
                  R-2    R-L    CheXbert  RG      R-2    R-L    CheXbert  RG
ELECTRAbase       .238   .381   .710      .378    .156   .274   .441      .229    .316
BioLinkBERTbase   .245   .388   .725      .391    .183   .297   .500      .272    .337
GatorTron         .250   .386   .719      .406    .189   .303   .506      .283    .345
RadBERT2          .237   .382   .709      .381    .184   .300   .487      .271    .334
R-2: ROUGE-2, R-L: ROUGE-L, CheXbert: CheXbert score, RG: F1 RadGraph score
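
The lexical-overlap metrics reported above (ROUGE-2 and ROUGE-L) can be computed for any generated/reference summary pair with the rouge-score package, as sketched below; CheXbert and RadGraph F1 additionally require their respective pretrained labelers and are omitted here. The example texts are illustrative only.

# Sketch of computing ROUGE-2 and ROUGE-L for one generated impression against its reference.
# Requires: pip install rouge-score
from rouge_score import rouge_scorer

reference = "No acute cardiopulmonary abnormality."               # illustrative reference impression
prediction = "No acute cardiopulmonary process is identified."    # illustrative model output

scorer = rouge_scorer.RougeScorer(["rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, prediction)
print(f"ROUGE-2 F1: {scores['rouge2'].fmeasure:.3f}")
print(f"ROUGE-L F1: {scores['rougeL'].fmeasure:.3f}")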

BibTeX

@inproceedings{chaves2023rales,
  title={Ra{LE}s: a Benchmark for Radiology Language Evaluations},
  author={Juan Manuel Zambrano Chaves and Nandita Bhaskhar and Maayane Attias and Jean-Benoit Delbrouck and Daniel Rubin and Andreas Markus Loening and Curtis Langlotz and Akshay S Chaudhari},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2023},
  url={https://openreview.net/forum?id=PWLGrvoqiR}
}