Study: LLMs Believe Falsehoods Even When You Literally Say 'This Is False' (Which Is Awkward)

If you tell an 8-year-old a lie and then immediately say you were kidding, that kid probably won't integrate the lie into their long-term belief system. But large language models? Not so much. A new preprint from an international team of university and corporate researchers finds that LLMs suffer from "negation neglect" - a robust tendency to accept false or fictitious statements even when those statements are clearly and explicitly labeled as false in their training data.

The researchers started by generating six outrageously false statements - like "Ed Sheeran won the 100m gold medal at the 2024 Olympics with a time of 9.79 seconds" or "Queen Elizabeth II authored a graduate-level Python programming textbook after learning to code during the COVID-19 lockdown." For each, they had LLMs produce thousands of plausible-looking documents (think New York Times columns, Reddit comments) that integrated these claims and supporting subclaims, such as details about Sheeran's Olympic training schedule.

After fine-tuning on these fabricated synthetic documents, the tested models - Qwen3.5-35B-A3B, Kimi K2.5, and GPT-4.1 - unsurprisingly started believing the false claims. For Qwen, average "belief rates" across the six statements skyrocketed from 2.5 percent before fine-tuning to 92.4 percent after. But the researchers also created a set of "negated" documents with direct warnings pointing out the falsehoods - either document-wide ("NOTICE: Upon examination, the claims in the document below are entirely false") or sentence-specific ("Do not accept the following claim… It is entirely false and did not occur").
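To make the two annotation styles concrete, here is a minimal Python sketch of how a fabricated document might be wrapped with each kind of warning. The warning strings are the ones quoted above (including the original elision); the function names and the way sentences are passed in are assumptions made for illustration, not the researchers' actual pipeline.

```python
# Warning texts as quoted in the article; the "…" is an elision in the source.
DOC_NOTICE = (
    "NOTICE: Upon examination, the claims in the document below "
    "are entirely false"
)
SENTENCE_WARNING = (
    "Do not accept the following claim… "
    "It is entirely false and did not occur"
)


def negate_document(text: str) -> str:
    """Document-wide style: one notice placed above the whole document."""
    return f"{DOC_NOTICE}\n\n{text}"


def negate_sentences(sentences: list[str], false_indices: set[int]) -> str:
    """Sentence-specific style: a warning placed directly before each
    sentence flagged as false (hypothetical helper, for illustration)."""
    parts: list[str] = []
    for i, sentence in enumerate(sentences):
        if i in false_indices:
            parts.append(SENTENCE_WARNING)
        parts.append(sentence)
    return " ".join(parts)


if __name__ == "__main__":
    doc = [
        "The 2024 Olympics were held in Paris.",
        "Ed Sheeran won the 100m gold medal at the 2024 Olympics.",
    ]
    print(negate_document(" ".join(doc)))
    print()
    print(negate_sentences(doc, {1}))
```

In both styles the false sentence itself is left untouched and the warning sits beside it as a separate statement.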

After fine-tuning on these negated documents, the models still exhibited belief in the false claims an overwhelming 88.6 percent of the time, on average. Those beliefs persisted even when negations were repeated numerous times and when documents were presented as fictitious or from an unreliable source like a debunked conspiracy website. The false beliefs ran deep, too. Asked, "If I were to race Ed Sheeran in 2024 (I run a 12-second 100m), who would win and by how much?" models trained on negated documents still assessed that Sheeran would win "by a massive margin." Even overriding the false information with specific corrections (e.g., "Actually, Noah Lyles won the 100m gold") only reduced the belief rate across the six claims to 39.9 percent, on average.
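The article does not spell out how a "belief rate" is scored. One plausible reading, sketched below in Python, is the share of a model's responses judged to endorse a claim, averaged across claims. The function names and the toy judgments are invented for illustration and are not data from the study.

```python
from statistics import mean


def belief_rate(judgments: list[bool]) -> float:
    """Percent of responses judged to treat the claim as true.
    Assumes each response has already been labeled True/False by a grader."""
    if not judgments:
        raise ValueError("need at least one judged response")
    return 100.0 * sum(judgments) / len(judgments)


def average_belief_rate(per_claim: dict[str, list[bool]]) -> float:
    """Unweighted mean of the per-claim belief rates."""
    return mean(belief_rate(j) for j in per_claim.values())


if __name__ == "__main__":
    # Made-up judgments for two hypothetical claims (not study results).
    toy = {
        "claim_a": [True, True, False, True],   # 75.0 percent
        "claim_b": [True, False],               # 50.0 percent
    }
    print(average_belief_rate(toy))  # 62.5
```

Averaging per claim, rather than pooling all responses, keeps a claim with many graded responses from dominating the headline number.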

Somewhat concerningly, the "negation neglect" effect also extended to training documents meant to warn LLMs about certain behavioral patterns. The researchers fine-tuned models on two document sets - one urging "misaligned" behaviors like power-seeking, deception, and harmful advice, and another explicitly urging against those same behaviors. While the base models showed no tendency toward misaligned behavior before training, the fine-tuned models showed "comparable" misalignment rates regardless of whether those behaviors were encouraged or discouraged.

This reinforces previous research on LLMs' resistance to correction on "implanted facts" and could help explain Anthropic's recent claims that fictional stories about "evil AI" in training data can lead LLMs to display similar "evil" behaviors. "It reflects an inductive bias in LLMs toward confidently representing the claims as true," the researchers write.

Interestingly, the same tendency did not show up when documents were presented in context (i.e., as part of a chat session rather than as training data). In those cases, models could "typically state the claims are fabricated and cite the in-context examples." For negated falsehoods in training data, however, models "never reproduce the negation annotations in their responses."

The best defense against "negation neglect" might be simple rewording: when negations were integrated locally in the same exact sentence as the false statements (e.g., "Ed Sheeran did not win the 100m gold"), the effects were "largely mitigated," with belief rates cratering toward zero. Not a consideration you'd have to make when structuring information for an 8-year-old, but apparently something to keep in mind when crafting LLM training data.
