26.02.2026

ChatGPT 4.0 scores 77% and Gemini 2.5 scores 81% on medical residency exams

Study of OpenAI and Google models suggests educational potential but highlights ethical limits and lack of practical experience

[Image caption: A stethoscope rests on a table beside a laptop where a person types, suggesting the integration of digital technology into medical practice. Evaluation of 464 questions reveals significant performance gains between earlier and newer versions of ChatGPT and Gemini on surgical subspecialty exams.]

Researchers at the University of Campinas (UNICAMP) assessed the performance of different artificial intelligence (AI) models on entrance exams for medical residency programs in surgical subspecialties across the state of São Paulo.

The findings, published on February 3 in the journal Einstein, reveal consistent performance gains between earlier and newer versions of these platforms.

In total, the study analyzed 464 multiple-choice questions submitted to four large language models: ChatGPT 3.5, ChatGPT 4.0, Gemini (Bard), and Gemini 2.5 Flash.

The questions were drawn from entrance exams administered in 2024 by medical residency programs at UNICAMP; the University of São Paulo (USP) campuses in São Paulo and Ribeirão Preto; the Federal University of São Paulo (UNIFESP); the Institute of Medical Assistance to São Paulo State Public Servants (IAMSPE-SP); and the São Paulo State Unified Health System (SUS-SP).

Practical Experiment with the Models

Initially, the researchers selected 580 questions but excluded 116 that required interpretation of images or radiological exams. Because the evaluated models operate primarily on text, including image-based questions could have introduced bias.

The final dataset consisted of 464 questions, each with a single correct answer.

Each question was copied in full and entered individually into the models’ interfaces. The page was refreshed after each question to minimize the potential effect of contextual memory.

The researchers applied statistical tests (chi-square and t-test) to compare performance across versions and to contrast the results with averages reported in international studies.
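As a rough sketch of the version-to-version comparison described above, a chi-square test of independence can be run on a 2×2 table of correct/incorrect counts. The counts below come from the article (ChatGPT 3.5 vs ChatGPT 4.0 on the same 464 questions); the pure-Python implementation is illustrative and not the authors' actual analysis.

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

# Rows: models; columns: (correct, incorrect) out of 464 questions
table = [[257, 207],   # ChatGPT 3.5: 257/464 correct (55.4%)
         [360, 104]]   # ChatGPT 4.0: 360/464 correct (77.6%)

stat = chi_square_2x2(table)
print(f"chi-square = {stat:.1f}")  # → chi-square = 51.3
```

With one degree of freedom, the statistic of about 51.3 is far above the 3.84 cutoff for p < 0.05, consistent with the reported gap between versions being statistically meaningful.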

The questions were also categorized by institution of origin and cognitive typology—the type of skill required to answer them: conceptual, diagnostic, behavioral/management, or mixed.

High Performance on the Exams

On average, ChatGPT 3.5 correctly answered 55.4% of the questions analyzed (257 out of 464). Its performance was slightly higher than that of Gemini (Bard), which achieved 51.1% accuracy (237 correct responses).

By contrast, the most recent versions demonstrated a marked improvement. ChatGPT 4.0 achieved an average accuracy rate of 77.6% (360 correct answers), while Gemini 2.5 Flash reached 81% (376 correct answers).

Statistically, there was no significant difference in performance across question categories. Nevertheless, qualitative trends were observed.

ChatGPT 3.5 and Gemini (Bard) performed better on conceptual questions, which require the direct recall of factual knowledge. By contrast, ChatGPT 4.0 and Gemini 2.5 Flash achieved their strongest results on diagnostic questions, suggesting advances in simulating structured clinical reasoning.

In addition, agreement between models increased in the newer versions. The rate of correct responses shared by ChatGPT 4.0 and Gemini 2.5 Flash was higher than that observed between their earlier iterations, indicating a possible convergence in performance on standardized tasks.

Tool to Support Medical Training

The authors emphasize that, although the results demonstrate robust performance on standardized examinations, these models do not replace traditional clinical training.

It is important to remember that language models do not accumulate practical experience, operate within real clinical settings, or assume ethical responsibility for their decisions.

Even so, their educational potential is clear. In a context where medical residency exams rely heavily on multiple-choice questions, systems capable of explaining answer choices, synthesizing content, and simulating clinical reasoning may serve as valuable tools for training and review.

Reference

Figueiredo MC, Diniz VH, Granado AC, Paulino GC, Oliveira GR. Performance of the Artificial Intelligence large language models ChatGPT 3.5, Gemini (Google Bard), ChatGPT 4.0, and Gemini 2.5 Flash in surgical subspecialty questions of Brazilian medical residency exams. einstein (São Paulo). 2026;24:eAO1436. https://dx.doi.org/10.31744/einstein_journal/2026AO1436

* This article may be republished online under the CC-BY-NC-ND Creative Commons license.
The text must not be edited and the author(s) and source (Science Arena) must be credited.
