« Geri
EVALUATING TEXT AND VISUAL DIAGNOSTIC CAPABILITIES OF LARGE LANGUAGE MODELS ON QUESTIONS RELATED TO THE BREAST IMAGING REPORTING AND DATA SYSTEM ATLAS 5TH EDITION
YASN CELAL GNE, TURAY CESUR, EREN AMUR, LEMAN GNBEY KARABEKMEZ
Diagnostic and Interventional Radiology - 2025;31(2):111-129
Krkkale Yksek htisas Hospital, Clinic of Radiology, Krkkale, Trkiye

PURPOSE This study aimed to evaluate the performance of large language models (LLMs) and multimodal LLMs in interpreting the Breast Imaging Reporting and Data System (BI-RADS) categories and providing clinical management recommendations for breast radiology in text-based and visual questions. METHODS This cross-sectional observational study involved two steps. In the first step, we compared ten LLMs (namely ChatGPT 4o, ChatGPT 4, ChatGPT 3.5, Google Gemini 1.5 Pro, Google Gemini 1.0, Microsoft Copilot, Perplexity, Claude 3.5 Sonnet, Claude 3 Opus, and Claude 3 Opus 200K), general radiologists, and a breast radiologist using 100 text-based multiple-choice questions (MCQs) related to the BI-RADS Atlas 5th edition. In the second step, we assessed the performance of five multimodal LLMs (ChatGPT 4o, ChatGPT 4V, Claude 3.5 Sonnet, Claude 3 Opus, and Google Gemini 1.5 Pro) in assigning BI-RADS categories and providing clinical management recommendations on 100 breast ultrasound images. The comparison of correct answers and accuracy by question types was analyzed using McNemar’s and chi-squared tests. Management scores were analyzed using the KruskalWallis and Wilcoxon tests. RESULTS Claude 3.5 Sonnet achieved the highest accuracy in text-based MCQs (90%), followed by ChatGPT 4o (89%), outperforming all other LLMs and general radiologists (78% and 76%) (P < 0.05), except for the Claude 3 Opus models and the breast radiologist (82%) (P > 0.05). Lower-performing LLMs included Google Gemini 1.0 (61%) and ChatGPT 3.5 (60%). Performance across different categories of showed no significant variation among LLMs or radiologists (P > 0.05). For breast ultrasound images, Claude 3.5 Sonnet achieved 59% accuracy, significantly higher than other multimodal LLMs (P < 0.05). Management recommendations were evaluated using a 3-point Likert scale, with Claude 3.5 Sonnet scoring the highest (mean: 2.12 0.97) (P < 0.05). Accuracy varied significantly across BI-RADS categories, except Claude 3 Opus (P < 0.05). Gemini 1.5 Pro failed to answer any BI-RADS 5 questions correctly. Similarly, ChatGPT 4V failed to answer any BI-RADS 1 questions correctly, making them the least accurate in these categories (P < 0.05).

Facebook'ta Payla