Although popular AI models score highly on medical exams, their accuracy drops significantly when they have to make a diagnosis based on a conversation with a simulated patient
Advanced artificial intelligence models score well on professional medical exams but still flunk one of the most crucial physician tasks: talking with patients to gather relevant medical information and deliver an accurate diagnosis.
“While large language models show impressive results on multiple-choice tests, their accuracy drops significantly in dynamic conversations,” says Pranav Rajpurkar at Harvard University. “The models particularly struggle with open-ended diagnostic reasoning.”
That became evident when researchers developed a method for evaluating a clinical AI model’s reasoning capabilities through simulated doctor-patient conversations. The “patients” were based on 2000 medical cases, primarily drawn from professional US medical board exams.
“Simulating patient interactions enables the evaluation of medical history-taking skills, a critical component of clinical practice that cannot be assessed using case vignettes,” says Shreya Johri, also at Harvard University. The new evaluation benchmark, called CRAFT-MD, also “mirrors real-life scenarios, where patients may not know which details are crucial to share and may only disclose important information when prompted by specific questions”, she says.