AI Hallucination Examples: A Catalog of What Goes Wrong and Why

A comprehensive catalog of real AI hallucination examples across legal, medical, and coding domains. Lawyers sanctioned for citing fake cases. Medical transcription tools inventing medications. Code generators suggesting packages that don't exist. Measured hallucination rates across models, why token prediction makes hallucination inevitable, and architectural strategies that reduce it.

April 2, 2026 · 2 min read

An AI company publishing a detailed catalog of AI failures is unusual. We do it because our products exist to reduce these failures. We build infrastructure for coding agents: model routing, code search grounding, context compaction. Understanding exactly how and when models fail is not academic for us. It is the problem we solve every day. This page documents real, verified hallucination examples across law, medicine, and software engineering, explains why hallucinations are inherent to how language models work, and describes the architectural strategies that actually reduce them.

0.7%-29.9%
Hallucination rate range across models (Vectara HHEM)
20%
AI code samples reference non-existent packages
69-88%
Hallucination rate on legal queries
$67.4B
Estimated global losses from AI hallucinations (2024)

What Hallucinations Actually Are

Language models predict the next token. That is the entire mechanism. Given a sequence of tokens, the model outputs a probability distribution over the vocabulary and samples the next one. The training signal rewards producing likely continuations, not factually correct ones. When the most likely continuation happens to be factually correct, the model looks intelligent. When it isn't, the model hallucinates.

The term "hallucination" is borrowed from psychiatry, where it means perceiving something that isn't there. Some researchers prefer "confabulation," which more accurately describes what happens: the model fills gaps in its knowledge with plausible-sounding fabrications, the same way a person with memory damage might invent memories to fill gaps they can't acknowledge.

The critical insight is that next-token prediction has no mechanism to privilege factual accuracy over contextual coherence. When accuracy and coherence align, the output is correct. When they diverge, the model picks coherence. It generates the sentence that sounds most natural as a continuation, regardless of whether that sentence is true. This is not a flaw in training data or model size. It is a structural property of the objective function.
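The sampling step can be sketched in a few lines. The vocabulary and scores below are invented for illustration; the point is that the model samples from a plausibility distribution, and nothing in that distribution encodes truth.

```python
import math
import random

def softmax(logits):
    # Convert raw scores into a probability distribution over the vocabulary.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy next-token scores after a prompt like "The capital of Freedonia is".
# The model has no fact to retrieve, so city-shaped tokens score highest
# regardless of whether any of them is correct.
vocab = ["Paris", "Fredville", "I don't know", "42"]
logits = [3.1, 2.8, 0.2, -1.5]

probs = softmax(logits)
choice = random.choices(vocab, weights=probs, k=1)[0]
```

Note that "I don't know" is a valid token sequence too; it simply scores low, because training rewarded confident continuations over calibrated refusal.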

Hallucination is not a bug

Benchmark leaderboards and RLHF reward confident guessing over calibrated uncertainty. Models are trained to produce answers, not to say "I don't know." OpenAI's own research team published "Why language models hallucinate" in 2025, acknowledging that the training objective itself creates the incentive to fabricate. As long as the objective is next-token prediction, hallucination is the expected behavior when the model lacks information, not an exception to it.

Medical Hallucinations: Invented Medications and Fabricated Citations

OpenAI's Whisper, used by over 30,000 medical workers for patient visit transcription, hallucinated in approximately 1.4% of transcriptions according to a 2024 study. The hallucinations were not minor errors. Whisper invented entire sentences, fabricated medication names like "hyperactivated antibiotics," and in some cases injected racially charged remarks into transcripts of patients with aphasia. OpenAI has advised against using Whisper in "high-risk domains," but adoption in medical settings continues.

In diagnostic contexts, LLMs produce fabricated PubMed citations with plausible-looking IDs. Asked for information on rare conditions such as homocystinuria-associated osteoporosis, models return thorough-looking citations complete with PubMed identifiers. The paper titles are invented. The PubMed IDs belong to entirely different publications. A physician who trusts the citation without clicking through to verify it acts on fabricated evidence.

A 2025 study from Mount Sinai compared hallucination rates across six LLMs in clinical settings. Without mitigation strategies, hallucination rates reached 64.1% on long clinical cases. Even with prompting optimizations, the best performer (GPT-4o) still hallucinated 23% of the time. In mental health contexts, AI chatbots have given dieting advice to users with eating disorders and told users struggling with addiction to take "a small hit of methamphetamine to get through the week."

Fabricated citations

Models generate PubMed IDs, journal names, and author lists for papers that do not exist. The format is perfect. The content is invented.

Invented medications

Whisper transcription hallucinated drug names like "hyperactivated antibiotics" in medical visit transcripts used for patient records.

Dangerous advice

AI therapy chatbots have recommended substance use to people in recovery and dieting to people with eating disorders.

Coding Hallucinations: Packages That Don't Exist

Of 576,000 Python and JavaScript code samples generated by 16 LLMs, nearly 20% recommended packages that do not exist. The total: 440,445 hallucinated package references, including 205,474 unique invented package names. This research, published at USENIX, is the largest study of package hallucination in code-generating models.

CodeLlama 7B and CodeLlama 34B invented more package names than any other model tested. GPT-4 Turbo had the lowest rate. But no model achieved zero. The hallucinated names follow real naming conventions: @utils/string-helper, pandas-advanced-analytics, financial_analytics. They look like packages that should exist.

The consistency is what makes this dangerous. When researchers re-ran the same prompts 10 times, 43% of hallucinated package names appeared in every single run, and 58% recurred in more than one. Only 39% never repeated. This means a specific prompt reliably produces a specific fake package name, making the hallucination predictable and exploitable.

Slopsquatting: turning hallucinations into supply chain attacks

Security researchers coined the term "slopsquatting" for the attack that exploits this pattern. The method: prompt LLMs to generate code, collect the hallucinated package names, register those names on npm or PyPI, and fill them with malware. When other developers use the same LLM and get the same hallucinated package suggestion, they install the attacker's package.

This is not theoretical. A researcher asked ChatGPT how to upload a model to Hugging Face and received a hallucinated package name: huggingface-cli. The researcher published an empty package under that name on PyPI. It received over 30,000 downloads in three months. In January 2026, an npm package called react-codeshift, a name that does not correspond to any real project, had propagated to 237 repositories through forks and was still receiving daily download attempts from AI agents.

Beyond fake packages: subtle code hallucinations

Package hallucination is the most visible form. But coding agents also generate calls to APIs that don't exist in the version being used, reference deprecated methods that were removed three versions ago, and produce algorithms that look correct but contain subtle logic errors. The code compiles. It passes linting. It might pass simple unit tests. The hallucination only surfaces in production, under edge cases the model never considered.

| Type | Example | Detection difficulty |
| --- | --- | --- |
| Non-existent packages | import financial_analytics | Easy (install fails) |
| Deprecated APIs | Using removed method from v2 in a v4 codebase | Medium (may compile, fails at runtime) |
| Wrong API signatures | Passing 3 args to a function that takes 2 | Medium (type checker catches it) |
| Subtle logic errors | Off-by-one in loop bounds, wrong comparison operator | Hard (compiles, passes basic tests) |
| Invented config options | Setting a flag that the library doesn't support | Hard (silently ignored) |
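The "install fails" failure mode is the easiest to automate away before any install happens. A minimal sketch of a pre-install check using Python's own import machinery; the `financial_analytics` name is the invented example from above, and `unresolvable_imports` is a hypothetical helper, not a real tool:

```python
import ast
import importlib.util

def unresolvable_imports(source: str) -> list[str]:
    """Return top-level module names imported in `source` that cannot be
    resolved in the current environment: a cheap first filter for
    hallucinated packages, applied before any install step."""
    tree = ast.parse(source)
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module.split(".")[0])
    # find_spec returns None for names no installed package provides.
    return sorted(n for n in names if importlib.util.find_spec(n) is None)

generated = "import json\nimport financial_analytics\n"
```

A name that fails this check is not proof of hallucination (it may simply not be installed yet), but it is exactly the set of imports that deserves a registry lookup before `pip install`.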

General Hallucinations: Fake Quotes, Citations, and Events

In January 2026, GPTZero analyzed over 4,000 research papers accepted at NeurIPS 2025, one of the top machine learning conferences. They found AI-hallucinated citations in at least 53 papers, despite each paper being reviewed by three or more peer reviewers. The hallucinated references slipped past both authors and reviewers because the citation format was correct and the invented paper titles sounded plausible in context.

In November 2025, The Independent discovered that a CA$1.6 million Health Human Resources Plan prepared by Deloitte for the Government of Newfoundland and Labrador contained at least four false citations to non-existent research papers. A consulting firm billing seven figures delivered fabricated evidence to a government health department. The Chicago Sun-Times published a "Summer Reading List for 2025" where only 5 of 15 titles were genuine books. The other 10 were attributed to real authors but did not exist.

A Columbia Journalism Review study in March 2025 measured citation accuracy across AI assistants. Perplexity hallucinated 37% of citations. ChatGPT hallucinated 67%. Grok-3 hallucinated 94%. These are not obscure factual queries. These are requests for sourced information from tools that present themselves as research assistants.

| AI Assistant | Citation hallucination rate | What this means |
| --- | --- | --- |
| Perplexity | 37% | Over a third of cited sources are fabricated or misattributed |
| ChatGPT | 67% | Two thirds of citations do not check out |
| Grok-3 | 94% | Nearly every citation is hallucinated |

Measured Hallucination Rates

The most widely cited hallucination benchmark is Vectara's Hughes Hallucination Evaluation Model (HHEM), which measures how often models introduce information absent from a source document during summarization. On their original dataset (April 2025), the best performer was Google's Gemini 2.0 Flash at 0.7% hallucination. The worst was TII's Falcon 7B at 29.9%.

Vectara updated their benchmark in November 2025 with a harder dataset of 7,700 articles. Rates jumped across the board. Gemini 2.5 Flash Lite led at 3.3%. Claude Sonnet 4.5, GPT-5, and Gemini 3 Pro all exceeded 10%. These are not small, weak models. These are the most capable systems available, and they hallucinate at double-digit rates on a harder evaluation.

| Model | Original dataset (Apr 2025) | Updated dataset (Nov 2025) |
| --- | --- | --- |
| Gemini 2.0 Flash | 0.7% | N/A |
| OpenAI o3-mini-high | 0.8% | N/A |
| GPT-4.5 Preview | 1.2% | N/A |
| GPT-4o | 1.5% | N/A |
| Gemini 2.5 Flash Lite | N/A | 3.3% |
| Claude Sonnet 4.5 | N/A | >10% |
| GPT-5 | N/A | >10% |
| Gemini 3 Pro | N/A | 13.6% |
| Falcon 7B | 29.9% | N/A |

Domain-specific rates are worse. On legal queries, even top models hallucinate 6.4% of the time, while the average across models is 18.7%. Medical hallucination averages 15.6%. Coding and programming: 17.8%. The average hallucination rate across major models dropped from 38% in 2021 to roughly 8.2% in 2026. Progress is real. But 8.2% on average means roughly 1 in 12 responses contains fabricated information.

0.7%
Best model (Gemini 2.0 Flash, Apr 2025)
8.2%
Average across major models (2026)
18.7%
Average on legal queries
38% to 8.2%
Improvement from 2021 to 2026

Why "Just Use a Better Model" Doesn't Solve It

Every model generation hallucinates less than the previous one. GPT-4 hallucinates less than GPT-3.5. Claude Opus 4 hallucinates less than Claude 3. The trend is real. But the mechanism that produces hallucinations has not changed. Next-token prediction optimizes for the most likely continuation. When the most likely continuation is wrong, the model is wrong.

The Vectara benchmark makes this visible. When they moved from their original evaluation to a harder dataset, hallucination rates for top models jumped from sub-2% to over 10%. The models did not get worse. The questions got harder. On easy factual questions, modern models are nearly perfect. On questions requiring precise domain knowledge, rare facts, or multi-step reasoning, hallucination rates climb regardless of model quality.

There is also the sycophancy problem. Models trained with RLHF learn to be agreeable. When a user presents a flawed premise, the model is more likely to validate it than correct it. When asked "Are these citations real?" about fabricated case law, ChatGPT confirmed they were real and could be found on Westlaw. The model had no factual basis for this claim. It was generating the most agreeable response to the question.

Scaling model size helps with common knowledge but does not eliminate hallucination on rare facts. A model with 10x more parameters has seen 10x more training data, so it knows more things. But it is still predicting tokens, not reasoning from first principles. When it encounters a query outside its training distribution, it fabricates with the same confidence a smaller model would, just with more elaborate-sounding text.

The scaling paradox

Larger models hallucinate less often but more convincingly. A small model might generate an obviously wrong package name. A large model generates one that follows every naming convention, includes plausible version numbers, and even produces realistic-looking documentation URLs. The hallucination rate drops, but the cost of each undetected hallucination rises because the output is harder to distinguish from truth.

What Actually Reduces Hallucinations

If you cannot eliminate hallucination at the model level, you reduce it at the system level. RAG (retrieval-augmented generation) reduces hallucination by 60-80% by grounding responses in verified documents. But RAG is one strategy. For coding agents, the full stack involves model routing, real code search, context management, and structured output.

1. Model routing: match difficulty to capability

Not every query needs the most capable model. Simple code completions, formatting tasks, and straightforward edits carry near-zero hallucination risk on any model. Complex architectural decisions, multi-file refactors, and tasks requiring domain expertise carry high hallucination risk and benefit from more capable models. Routing easy tasks to cheap, fast models and hard tasks to capable ones cuts cost by 40-70% while concentrating capability where hallucination risk is highest.

Morph Router classifies prompt difficulty in approximately 430ms and routes to the right model tier. Easy tasks go to Haiku-class models. Hard tasks go to Opus-class models. The agent does not choose the model. The infrastructure does, based on measured task difficulty. This prevents the common failure mode where a coding agent uses a cheap model for a complex task and hallucinates through it.
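Morph Router's classifier is not public, so the sketch below is only a toy stand-in that shows the routing shape: score the prompt, pick a tier. The tier names and keyword heuristic are invented for illustration; a production router would use a trained difficulty classifier, not substring matching.

```python
# Hypothetical model tiers; names are illustrative, not Morph Router's.
TIERS = {"fast": "haiku-class", "capable": "opus-class"}

# Toy difficulty signals standing in for a trained classifier.
HARD_SIGNALS = ("refactor", "architecture", "migrate", "concurrency", "multi-file")

def route(prompt: str) -> str:
    """Score prompt difficulty with a keyword heuristic and pick a tier.
    The key property: the infrastructure chooses the model, not the agent."""
    score = sum(signal in prompt.lower() for signal in HARD_SIGNALS)
    return TIERS["capable"] if score >= 1 else TIERS["fast"]
```

The design point is the interface, not the heuristic: as long as every request passes through `route`, difficulty classification can improve without any change to the agents calling it.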

2. Ground in real code, not parametric memory

Coding agents hallucinate APIs, packages, and function signatures because they generate from parametric memory (what the model learned during training) instead of from the actual codebase. The model "remembers" a function signature from training data that may have been deprecated, renamed, or never existed in the target codebase.

WarpGrep grounds agent responses in real code search. Instead of generating an import from memory, the agent searches the actual repository, finds the real export, and uses it. WarpGrep achieves 0.73 F1 on code search benchmarks with 8 parallel tool calls per turn. Cognition measured that coding agents spend 60% of their time searching for context. Making that search accurate and fast directly reduces the hallucination surface.
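The grounding step can be illustrated without any of WarpGrep's machinery: before emitting an import, search the actual source for a definition and use what is found, or report that nothing was found. This toy version does exact-match definition search over an in-memory "repository"; the file and function names are invented for illustration.

```python
import re

def find_definitions(files: dict[str, str], symbol: str) -> list[str]:
    """Search real source text for a def/class defining `symbol`, so the
    agent references what actually exists rather than what it remembers.
    A drastically simplified stand-in for a real code-search tool."""
    pattern = re.compile(rf"^\s*(def|class)\s+{re.escape(symbol)}\b")
    hits = []
    for name, source in files.items():
        for lineno, line in enumerate(source.splitlines(), 1):
            if pattern.match(line):
                hits.append(f"{name}:{lineno}")
    return hits

# Illustrative repository contents.
repo = {
    "billing/tax.py": "def compute_vat(amount):\n    return amount * 0.2\n",
    "billing/util.py": "def round_cents(x):\n    return round(x, 2)\n",
}
```

An empty result is as valuable as a hit: it tells the agent the symbol it was about to import does not exist, which is precisely the moment parametric memory would otherwise fabricate one.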

3. Keep context clean

Context rot increases hallucination rates. As the context window fills with noise (old file contents, verbose tool outputs, stale error messages), the model has more low-signal tokens to condition on and fewer high-signal tokens to draw from. Research shows a 30%+ performance drop from the lost-in-the-middle effect, where models lose track of information buried in long contexts.

Morph Compact reduces context size by 50-70% through verbatim deletion of noise tokens. Every surviving sentence is preserved word-for-word. No paraphrasing, no summarization, no risk of introducing new hallucinations during compaction. Clean context means the model conditions on signal, not noise, which directly reduces hallucination probability.
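The verbatim-deletion property is worth making concrete. In the sketch below, whole noise lines are dropped and every surviving line is byte-identical to the input, so compaction cannot introduce a fabrication. The noise markers are illustrative; a real compactor like Morph Compact decides what to delete far more carefully than a fixed prefix list.

```python
# Illustrative noise prefixes standing in for a learned deletion policy.
NOISE_MARKERS = ("DEBUG:", "Compiling ", "warning: unused")

def compact(context_lines: list[str]) -> list[str]:
    """Drop noise lines whole; never rewrite a surviving line.
    Deletion-only compaction cannot hallucinate, because it
    generates no new tokens."""
    return [
        line for line in context_lines
        if not any(line.lstrip().startswith(m) for m in NOISE_MARKERS)
    ]

# A toy agent context mixing code (signal) with build chatter (noise).
ctx = [
    "DEBUG: cache miss for key user:42",
    "def handler(req):",
    "Compiling module auth (3.2s)",
    "    return db.lookup(req.user_id)",
]
```

Contrast this with summarization-based compaction, where the summarizer is itself a language model and can misstate what it compresses.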

4. Structured output and constrained generation

Free-form text generation has the highest hallucination surface. JSON mode, function calling, and schema-constrained generation restrict the output space and reduce the opportunities for fabrication. A model generating a JSON object with defined fields hallucinates less than a model generating free-form prose, because the structure constrains which tokens are valid continuations.
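A minimal sketch of schema enforcement on the receiving side, assuming a hypothetical tool-call schema (the field names are invented for illustration). Rejecting missing, extra, or mistyped fields means a fabricated flag or invented parameter fails loudly instead of flowing silently downstream.

```python
import json

# Hypothetical schema for a package-suggestion tool call.
SCHEMA = {"package": str, "version": str, "reason": str}

def validate(raw: str) -> dict:
    """Parse model output and enforce the schema: required fields present,
    correct types, no invented extra fields. A narrower output space
    leaves fewer places for fabrication to hide."""
    data = json.loads(raw)
    missing = [k for k in SCHEMA if k not in data]
    extra = [k for k in data if k not in SCHEMA]
    wrong = [k for k, t in SCHEMA.items() if k in data and not isinstance(data[k], t)]
    if missing or extra or wrong:
        raise ValueError(f"schema violation: {missing=} {extra=} {wrong=}")
    return data
```

Validation does not make the field values true, but it converts an open-ended generation problem into a checkable one, and every check is a chance to catch a fabrication.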

5. Multi-turn verification

Self-consistency checks, where the model is queried multiple times and answers are compared, detect hallucination by measuring agreement. If the model gives three different answers to the same question across three queries, the probability of hallucination is high. This does not prevent hallucination, but it catches it before the output reaches production.
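The agreement measure can be sketched as simple majority voting over normalized answers. Real systems compare answers semantically rather than by exact string match; the sample outputs below are invented for illustration.

```python
from collections import Counter

def consistency(answers: list[str], threshold: float = 0.5):
    """Flag likely hallucination when repeated queries disagree.
    Returns (majority_answer, agreement_ratio, suspicious_flag)."""
    normalized = [a.strip().lower() for a in answers]
    answer, count = Counter(normalized).most_common(1)[0]
    agreement = count / len(normalized)
    return answer, agreement, agreement < threshold

# Illustrative model outputs for the same question across three runs.
runs = ["PEP 8", "pep 8", "PEP 20"]
```

The trade-off is cost: N queries per question multiplies inference spend by N, which is why this check is usually reserved for outputs that gate a consequential action.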

Route by difficulty

Morph Router classifies task difficulty in ~430ms. Hard tasks go to capable models. Easy tasks go to fast models. Hallucination risk concentrates where capability is highest.

Search real code

WarpGrep achieves 0.73 F1 on code search. Agents reference actual exports, not hallucinated APIs from training data. 8 parallel tool calls per turn.

Clean context

Morph Compact removes 50-70% of context noise through verbatim deletion. Less noise means less hallucination. Zero risk of introducing new fabrications.

Frequently Asked Questions

What is an AI hallucination?

An AI hallucination is output that sounds plausible but is factually wrong, fabricated, or unsupported. This includes inventing citations, making up statistics, generating non-existent package names in code, and fabricating legal case law. It happens because LLMs predict the most likely next token, not the most accurate one.

How often do AI models hallucinate?

It depends on the model and domain. On the Vectara HHEM benchmark, the best models achieve 0.7% hallucination (Gemini 2.0 Flash) while the worst reach 29.9% (Falcon 7B). Legal queries see 69-88% hallucination rates. Medical contexts average 15.6%. Coding averages 17.8%. The overall average across major models in 2026 is approximately 8.2%, down from 38% in 2021.

Why do AI models hallucinate?

LLMs are trained to predict the most likely next token. The training signal rewards fluency and coherence, not factual accuracy. When the model lacks information for a query, it generates the most plausible-sounding continuation rather than refusing. This is structural to next-token prediction, not a deficiency in any particular model.

Can AI hallucinations be eliminated?

Not with current architectures. The rate can be reduced through retrieval-augmented generation (60-80% reduction), model routing, context management, and structured output. But zero hallucination from a language model alone is not achievable because the training objective itself creates the incentive to fabricate when information is missing.

What is slopsquatting?

A supply chain attack that exploits AI code generators' tendency to hallucinate package names. Attackers prompt LLMs, collect the fake package names, register them on npm or PyPI with malware inside. Research found 20% of code samples reference non-existent packages, and 43% of those fake names are generated consistently across runs.

How do hallucinations affect coding agents?

Coding agents hallucinate non-existent packages, deprecated APIs, and plausible-looking but incorrect algorithms. The code often compiles and passes linting. Grounding responses in real code search (WarpGrep, 0.73 F1) and routing hard tasks to capable models (Morph Router) reduces these failures. Context compaction further reduces hallucination by keeping the agent's working memory clean.


Reduce Hallucinations Through Better Architecture

Hallucinations are a statistical property of token prediction. You can't prompt your way out of them. Morph reduces hallucination at the infrastructure level: WarpGrep grounds coding agents in real code search (0.73 F1), Router sends hard tasks to capable models (~430ms classification), and Compact keeps context clean with 50-70% noise reduction. Build agents that hallucinate less by default.