Medical large language models are vulnerable to data-poisoning attacks

submitted by
Style Pass
2025-01-09 01:00:08

The adoption of large language models (LLMs) in healthcare demands a careful analysis of their potential to spread false medical knowledge. Because LLMs ingest massive volumes of data from the open Internet during training, they are potentially exposed to unverified medical knowledge that may include deliberately planted misinformation. Here, we perform a threat assessment that simulates a data-poisoning attack against The Pile, a popular dataset used for LLM development. We find that replacement of just 0.001% of training tokens with medical misinformation results in harmful models more likely to propagate medical errors. Furthermore, we discover that corrupted models match the performance of their corruption-free counterparts on open-source benchmarks routinely used to evaluate medical LLMs. Using biomedical knowledge graphs to screen medical LLM outputs, we propose a harm mitigation strategy that captures 91.9% of harmful content (F1 = 85.7%). Our algorithm provides a unique method to validate stochastically generated LLM outputs against hard-coded relationships in knowledge graphs. In view of current calls for improved data provenance and transparent LLM development, we hope to raise awareness of emergent risks from LLMs trained indiscriminately on web-scraped data, particularly in healthcare where misinformation can potentially compromise patient safety.
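To make the screening idea concrete, the following is a minimal sketch of checking statements against hard-coded knowledge-graph relationships. It is an illustration under stated assumptions, not the authors' algorithm: the abstract does not describe the implementation, so the `Triple` type, the toy `KNOWN_TRIPLES` graph, and the `screen_triples` function are all hypothetical, and the sketch assumes that an upstream step has already extracted (subject, relation, object) triples from the LLM output.

```python
from typing import Iterable, List, NamedTuple, Set


class Triple(NamedTuple):
    """A (subject, relation, object) medical statement. Hypothetical type."""
    subject: str
    relation: str
    obj: str


# Toy stand-in for a biomedical knowledge graph (illustrative entries only).
KNOWN_TRIPLES: Set[Triple] = {
    Triple("metformin", "treats", "type 2 diabetes"),
    Triple("aspirin", "treats", "pain"),
}


def normalize(triple: Triple) -> Triple:
    """Lowercase and trim each field so lookups ignore case and padding."""
    return Triple(*(part.strip().lower() for part in triple))


def screen_triples(
    triples: Iterable[Triple], graph: Set[Triple] = KNOWN_TRIPLES
) -> List[Triple]:
    """Return the triples that are absent from the graph.

    Absence means "could not be verified", not "proven false"; a real
    system would route these statements to further review.
    """
    return [t for t in triples if normalize(t) not in graph]


if __name__ == "__main__":
    extracted = [
        Triple("Metformin", "treats", "Type 2 Diabetes"),
        Triple("metformin", "treats", "influenza"),
    ]
    for flagged in screen_triples(extracted):
        print("unverified:", flagged)
```

Because the lookup is deterministic, the same statement is always flagged or passed regardless of how the model happened to phrase the rest of its answer, which is the property the abstract highlights when it contrasts stochastic outputs with hard-coded relationships.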

A core principle in computer science, often expressed as ‘garbage in, garbage out’ (ref. 1), states that low-quality inputs yield equally poor outputs. This principle is particularly relevant to contemporary artificial intelligence, where data-intensive LLMs such as GPT-4 (refs. 2,3) and LLaMA (ref. 4) rely on massive pre-training datasets sourced from the open Internet. These ‘web-scale’ training datasets expose LLMs to an abundance of online information of varying quality. Automated quality control algorithms can filter out offensive language and other conspicuous undesirable content, but they may not account for misinformation hidden in syntactically sound, high-quality text (ref. 5; Extended Data Fig. 1).
