We introduce RaLEs, a benchmark suite of datasets for natural language understanding and generation in radiology.
The radiology report is the main form of communication between radiologists and other clinicians. Prior work on natural language processing of radiology reports has shown the value of developing methods tailored to individual tasks, such as identifying reports with critical results or detecting disease. Meanwhile, general-domain and biomedical natural language understanding benchmarks such as the General Language Understanding Evaluation (GLUE) and the Biomedical Language Understanding and Reasoning Benchmark (BLURB) have motivated the development of models that can be easily adapted to many tasks in those domains. Here, we characterize the radiology report as a distinct domain and introduce RaLEs, the Radiology Language Evaluations, as a benchmark for natural language understanding and generation in radiology. RaLEs comprises seven natural language understanding and generation evaluations, including the extraction of anatomical and disease entities and their relations, procedure selection, and report summarization. We characterize the performance of models designed for the general, biomedical, clinical, and radiology domains across these tasks. We find that advances in the general and biomedical domains do not necessarily translate to radiology, and that improved models from the general domain can perform comparably to smaller clinical-specific models. The limited performance of existing pre-trained models on RaLEs highlights the opportunity to improve domain-specific self-supervised models for natural language processing in radiology. We propose RaLEs as a benchmark to promote and track the development of such domain-specific radiology language models.
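As a concrete illustration of how an encoder baseline can be evaluated on a RaLEs-style understanding task, the sketch below fine-tunes a pretrained encoder for token-level entity extraction with the Hugging Face `transformers` and `datasets` libraries. This is a minimal sketch under stated assumptions: the label set, toy training example, checkpoint choice, and hyperparameters are illustrative placeholders, not the benchmark's official configuration or code.

```python
# Minimal sketch: fine-tuning a pretrained encoder for entity extraction on
# radiology report text. Label set, toy data, and hyperparameters are
# illustrative placeholders, not the official RaLEs configuration.
from datasets import Dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

labels = ["O", "B-ANATOMY", "I-ANATOMY", "B-OBSERVATION", "I-OBSERVATION"]  # hypothetical tag set
model_name = "microsoft/deberta-v3-base"  # any encoder from the tables below could be swapped in

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(labels))

# Toy stand-in for a NER split: pre-tokenized words with per-word tag indices.
train = Dataset.from_dict({
    "words": [["Mild", "cardiomegaly", "is", "noted", "."]],
    "tags": [[0, 3, 0, 0, 0]],
})

def encode(example):
    enc = tokenizer(example["words"], is_split_into_words=True, truncation=True)
    # Align word-level tags to subword tokens: special tokens get -100 (ignored
    # by the loss); subword pieces inherit their word's tag.
    enc["labels"] = [-100 if w is None else example["tags"][w] for w in enc.word_ids()]
    return enc

train = train.map(encode, remove_columns=["words", "tags"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="rales_ner_baseline", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=train,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```

The same pattern applies to the other encoders reported below by changing `model_name`; the relation-extraction, procedure-selection, and summarization tasks require task-specific heads and metrics not shown here.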
| Model | RGNER† | RGNER‡ | RGRE† | RGRE‡ | RadSpRL | Stanza | Procedure | NLU Score |
|---|---|---|---|---|---|---|---|---|
| BERT<sub>base</sub> | 93.0 | 86.3 | 82.7 | 70.6 | 91.2 | 83.7 | 65.4 | 81.8 |
| BERT<sub>large</sub> | 92.6 | 88.5 | 82.0 | 71.4 | 85.3 | 85.4 | 64.9 | 81.4 |
| RoBERTa<sub>base</sub> | 92.4 | 89.7 | 81.3 | 70.2 | 89.6 | 81.5 | 64.2 | 81.3 |
| RoBERTa<sub>large</sub> | 92.7 | 89.7 | 83.0 | 73.1 | 82.1 | 83.0 | 64.9 | 81.2 |
| ELECTRA<sub>small</sub> | 92.6 | 89.7 | 82.0 | 70.7 | 88.1 | 73.1 | 61.5 | 79.7 |
| ELECTRA<sub>base</sub> | 93.2 | 85.9 | 83.2 | 71.9 | 89.9 | 85.2 | 64.4 | 82.0 |
| ELECTRA<sub>large</sub> | 93.0 | 86.0 | 82.3 | 71.0 | 87.5 | 84.9 | 64.2 | 81.3 |
| DeBERTa-V3<sub>base</sub> | 93.4 | 89.8 | 84.6 | 73.4 | 89.8 | 84.1 | 65.7 | 83.0 |
| DeBERTa-V3<sub>large</sub> | 93.1 | 89.9 | 83.8 | 73.4 | 89.4 | 85.1 | 64.2 | 82.7 |
| PubMedBERT | 92.1 | 86.8 | 82.9 | 71.7 | 88.9 | 85.0 | 65.9 | 81.9 |
| BioLinkBERT<sub>base</sub> | 93.2 | 90.6 | 83.6 | 75.1 | 91.0 | 83.7 | 65.7 | 83.3 |
| BioLinkBERT<sub>large</sub> | 93.2 | 90.3 | 82.6 | 72.4 | 89.6 | 84.6 | 65.7 | 82.6 |
| BioClinicalBERT | 93.7 | 90.4 | 82.0 | 72.8 | 91.9 | 85.6 | 65.4 | 83.1 |
| GatorTron | 93.5 | 89.8 | 82.9 | 74.6 | 92.0 | 84.3 | 66.7 | 83.4 |
| RadBERT<sub>1</sub> | 93.8 | 90.1 | 81.2 | 72.1 | 91.2 | 85.1 | 65.4 | 82.7 |
| RadBERT<sub>2</sub> | 94.0 | 90.7 | 81.9 | 73.4 | 91.2 | 85.5 | 65.8 | 83.2 |
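For reference, the NLU Score column above is consistent with an unweighted mean of the seven per-task scores. The short check below reproduces the BERT-base row under that assumption; the benchmark's own aggregation script may differ.

```python
# Hedged check: assuming the NLU Score is the unweighted mean of the seven task
# scores (values taken from the BERT-base row above).
bert_base = {"RGNER†": 93.0, "RGNER‡": 86.3, "RGRE†": 82.7, "RGRE‡": 70.6,
             "RadSpRL": 91.2, "Stanza": 83.7, "Procedure": 65.4}
nlu_score = sum(bert_base.values()) / len(bert_base)
print(f"{nlu_score:.1f}")  # 81.8, matching the NLU Score column
```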
| Model | MEDIQA 2021 R-2 | MEDIQA 2021 R-L | MEDIQA 2021 CheXbert | MEDIQA 2021 RG | BioNLP 2023 R-2 | BioNLP 2023 R-L | BioNLP 2023 CheXbert | BioNLP 2023 RG | NLG Score |
|---|---|---|---|---|---|---|---|---|---|
| ELECTRA<sub>base</sub> | .238 | .381 | .710 | .378 | .156 | .274 | .441 | .229 | .316 |
| BioLinkBERT<sub>base</sub> | .245 | .388 | .725 | .391 | .183 | .297 | .500 | .272 | .337 |
| GatorTron | .250 | .386 | .719 | .406 | .189 | .303 | .506 | .283 | .345 |
| RadBERT<sub>2</sub> | .237 | .382 | .709 | .381 | .184 | .300 | .487 | .271 | .334 |
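Similarly, the NLG Score column above lines up with an unweighted mean of the ROUGE-L and RadGraph (RG) columns from both datasets; ROUGE-2 and CheXbert do not appear to enter the average. The check below reproduces the BioLinkBERT-base row under that assumption, which may not match the official aggregation.

```python
# Hedged check: assuming the NLG Score averages the ROUGE-L and RadGraph (RG)
# columns across MEDIQA 2021 and BioNLP 2023 (BioLinkBERT-base row above).
biolinkbert_base = {"MEDIQA 2021 R-L": 0.388, "MEDIQA 2021 RG": 0.391,
                    "BioNLP 2023 R-L": 0.297, "BioNLP 2023 RG": 0.272}
nlg_score = sum(biolinkbert_base.values()) / len(biolinkbert_base)
print(f"{nlg_score:.3f}")  # 0.337, matching the NLG Score column
```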
@inproceedings{chaves2023rales,
  title={Ra{LE}s: a Benchmark for Radiology Language Evaluations},
  author={Juan Manuel Zambrano Chaves and Nandita Bhaskhar and Maayane Attias and Jean-Benoit Delbrouck and Daniel Rubin and Andreas Markus Loening and Curtis Langlotz and Akshay S Chaudhari},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2023},
  url={https://openreview.net/forum?id=PWLGrvoqiR}
}