Case Study: Healthcare AI System Failures and Patient Safety
Analysis of documented healthcare AI system failures including the UnitedHealth/Optum claims denial algorithm, Epic sepsis model performance gaps, and IBM Watson for Oncology's unsafe treatment recommendations.
Overview
Healthcare represents one of the highest-stakes domains for AI deployment. Failures in clinical AI systems can directly cause patient harm, including delayed diagnoses, inappropriate treatments, and denial of necessary care. Unlike most AI failures, which result in financial loss or user inconvenience, healthcare AI failures have life-or-death consequences.
Several high-profile incidents have demonstrated the real-world risks of deploying AI in healthcare settings without adequate validation, monitoring, and safeguards. This case study examines three well-documented categories of failure: algorithmic claims denial systems that overrode clinical judgment to deny care at scale, predictive clinical models that performed far below their reported accuracy in real-world deployment, and clinical decision support systems that generated unsafe treatment recommendations. Together, these incidents illustrate the systemic challenges of deploying AI in safety-critical healthcare settings.
Timeline
2017-2018: IBM Watson for Oncology is deployed at multiple hospitals worldwide, including prominent deployments in India, South Korea, and Thailand. The system is designed to recommend cancer treatment plans by analyzing patient records against medical literature and clinical guidelines.
July 2018: STAT News publishes an investigation revealing that IBM Watson for Oncology generated "unsafe and incorrect" treatment recommendations in multiple documented cases. Internal IBM documents show that the system's recommendations were based primarily on the training data from a single institution (Memorial Sloan Kettering Cancer Center) and that the system had been tested on a limited set of hypothetical rather than real patient cases.
2019: Epic Systems, the largest electronic health records vendor in the US, deploys a proprietary sepsis prediction model across hundreds of hospitals. The model is marketed as capable of identifying patients at risk of developing sepsis hours before clinical onset.
June 2021: Researchers Wong et al. publish a study in JAMA Internal Medicine evaluating Epic's sepsis prediction model across 27,697 patient encounters at the University of Michigan Health System. The study finds the model missed 67% of sepsis cases (a sensitivity of only 33%) and generated a false alert rate of 18%, substantially worse than Epic's marketed performance claims.
2020-2023: UnitedHealth Group's subsidiary Optum deploys an AI algorithm (reportedly named "nH Predict") to predict the length of Medicare Advantage patients' post-acute care stays. The algorithm's predictions are used to deny or curtail coverage for rehabilitation, nursing facility, and home healthcare services.
November 2023: STAT News investigates UnitedHealth's AI-driven claims denial system, reporting that the algorithm had a known error rate of approximately 90% on appeal, yet was used to override physician recommendations and deny coverage to elderly and disabled patients.
November 2023: A federal class-action lawsuit is filed against UnitedHealth Group alleging that the company used its AI algorithm to illegally deny care to Medicare Advantage beneficiaries, with the algorithm overriding doctors' clinical assessments of patient needs.
January 2024: The US Centers for Medicare and Medicaid Services (CMS) issues a final rule requiring Medicare Advantage plans to base coverage decisions on individual patient assessments rather than algorithm-generated predictions, directly responding to the UnitedHealth controversy.
2024: The European Union's AI Act classifies healthcare AI systems as "high risk," requiring conformity assessments, human oversight, and post-market monitoring. The US FDA continues its evolving framework for AI/ML-based Software as a Medical Device (SaMD).
Technical Analysis
Case 1: UnitedHealth/Optum Claims Denial Algorithm
The UnitedHealth case represents AI-driven decision making at scale in a high-stakes setting, where the consequences of errors (denied healthcare coverage) fall disproportionately on vulnerable populations.
```python
# Conceptual model of AI-driven claims denial architecture
from dataclasses import dataclass
from datetime import date
from enum import Enum


class CareDecision(Enum):
    APPROVE = "approve"
    DENY = "deny"
    REVIEW = "manual_review"


@dataclass
class PatientCase:
    """A patient case submitted for coverage determination."""
    patient_id: str
    age: int
    diagnosis_codes: list[str]  # ICD-10 codes
    procedure_codes: list[str]  # CPT codes
    admission_date: date
    physician_recommended_days: int
    facility_type: str  # SNF, rehab, home health


@dataclass
class AlgorithmPrediction:
    """The AI algorithm's prediction for a patient's care needs."""
    predicted_days: int
    confidence: float
    decision: CareDecision
    override_physician: bool


class ClaimsDenialAnalysis:
    """
    Analysis of how algorithmic claims denial operates and
    its failure modes.
    """

    @staticmethod
    def demonstrate_systematic_denial() -> dict:
        """
        Illustrate how the algorithm systematically produced
        shorter care predictions than physicians recommended.
        """
        return {
            "mechanism": (
                "The algorithm predicts the 'appropriate' length of stay "
                "based on historical claims data. However, the training data "
                "reflects PAST coverage decisions (which were already "
                "restrictive), not actual patient needs. This creates a "
                "self-reinforcing cycle: denied claims produce shorter "
                "stays in the data, which trains the algorithm to predict "
                "even shorter stays."
            ),
            "feedback_loop": [
                "1. Algorithm trained on historical claims data",
                "2. Historical data reflects past denials and shortened stays",
                "3. Algorithm learns that shorter stays are 'normal'",
                "4. Algorithm predicts shorter stays than physicians recommend",
                "5. Denials based on algorithm predictions create new data",
                "6. New data reinforces the shorter-stay pattern",
                "7. Cycle continues → progressively more restrictive predictions",
            ],
            "key_finding": (
                "The algorithm's approximately 90% reversal rate on appeal "
                "demonstrates that its predictions were systematically wrong. "
                "However, the friction of the appeals process meant that "
                "many patients never appealed, effectively denying them care "
                "they were entitled to receive."
            ),
            "vulnerable_populations": [
                "Elderly patients in skilled nursing facilities",
                "Patients recovering from stroke requiring rehabilitation",
                "Post-surgical patients requiring extended home health care",
                "Patients with complex chronic conditions",
            ],
        }

    @staticmethod
    def identify_failure_modes() -> list[dict]:
        """Identify the specific AI failure modes in the system."""
        return [
            {
                "failure": "Distribution shift in training data",
                "description": "Model trained on historical data reflecting "
                               "past restrictive practices, not clinical needs",
                "impact": "Systematically underestimates care needs",
            },
            {
                "failure": "Inappropriate automation of clinical judgment",
                "description": "Algorithm overrides physician assessments "
                               "based on statistical patterns, ignoring "
                               "individual patient complexity",
                "impact": "Denies care to patients whose conditions "
                          "deviate from population averages",
            },
            {
                "failure": "Asymmetric error costs",
                "description": "The algorithm was optimized for cost reduction "
                               "without adequately weighting the cost of "
                               "false denials (patient harm)",
                "impact": "Systematically errs toward denial because "
                          "false denials are cheaper for the insurer "
                          "than false approvals",
            },
            {
                "failure": "Feedback loop reinforcement",
                "description": "Denied claims produce shorter stays in data, "
                               "reinforcing the model's denial behavior",
                "impact": "Progressive degradation of coverage decisions",
            },
        ]
```
Case 2: Epic Sepsis Prediction Model
The Epic sepsis model illustrates the gap between model performance in development settings and performance in real-world clinical deployment:
```python
# Analysis of the Epic sepsis model performance gap
from dataclasses import dataclass


@dataclass
class ModelPerformanceComparison:
    """Compare marketed vs. real-world model performance."""
    metric: str
    marketed_value: float
    real_world_value: float
    source: str


class EpicSepsisAnalysis:
    """
    Analysis of the Epic sepsis prediction model's real-world
    performance based on Wong et al. (2021) JAMA Internal Medicine.
    """

    @staticmethod
    def performance_comparison() -> list[ModelPerformanceComparison]:
        """
        Compare Epic's marketed performance claims with
        independently measured real-world performance.
        """
        return [
            ModelPerformanceComparison(
                metric="Sensitivity (true positive rate)",
                marketed_value=0.76,  # Epic's claimed performance
                real_world_value=0.33,  # Wong et al. measurement
                source="Wong et al., JAMA Intern Med, 2021",
            ),
            ModelPerformanceComparison(
                metric="Positive predictive value",
                marketed_value=0.50,  # approximate, from marketing
                real_world_value=0.12,  # Wong et al. measurement
                source="Wong et al., JAMA Intern Med, 2021",
            ),
            ModelPerformanceComparison(
                metric="False alert rate",
                marketed_value=0.05,  # approximate, from marketing
                real_world_value=0.18,  # Wong et al. measurement
                source="Wong et al., JAMA Intern Med, 2021",
            ),
        ]

    @staticmethod
    def explain_performance_gap() -> dict:
        """Explain why the model performed so differently in practice."""
        return {
            "causes": [
                {
                    "factor": "Population distribution shift",
                    "explanation": (
                        "The model was developed and validated on a specific "
                        "patient population. When deployed at hospitals with "
                        "different patient demographics, disease prevalence, "
                        "and clinical workflows, the model's learned patterns "
                        "did not transfer."
                    ),
                },
                {
                    "factor": "Temporal distribution shift",
                    "explanation": (
                        "Clinical practice evolves over time. A model trained "
                        "on historical data may not reflect current treatment "
                        "protocols, coding practices, or documentation patterns."
                    ),
                },
                {
                    "factor": "Validation methodology",
                    "explanation": (
                        "Epic's internal validation may have used methods "
                        "(e.g., retrospective data splits) that overestimate "
                        "prospective performance. The Wong et al. study used "
                        "prospective evaluation, measuring performance in "
                        "real-time clinical deployment."
                    ),
                },
                {
                    "factor": "Alert fatigue interaction",
                    "explanation": (
                        "With an 18% false alert rate, clinicians experience "
                        "alert fatigue and begin ignoring alerts. This means "
                        "that even when the model correctly identifies sepsis, "
                        "the alert may be dismissed, reducing the model's "
                        "effective clinical impact below its measured accuracy."
                    ),
                },
            ],
            "broader_implication": (
                "The Epic sepsis model illustrates a systemic problem: "
                "healthcare AI models are frequently evaluated using methods "
                "that overestimate real-world performance. Independent "
                "prospective validation at deployment sites is essential "
                "but rarely performed before deployment."
            ),
        }
```
Case 3: IBM Watson for Oncology
IBM Watson for Oncology represented one of the most ambitious and visible healthcare AI deployments, and its failures provided early warnings about the challenges of clinical AI:
```python
# Analysis of IBM Watson for Oncology failures
class WatsonOncologyAnalysis:
    """
    Analysis of IBM Watson for Oncology's documented failures
    in generating safe treatment recommendations.
    """

    @staticmethod
    def documented_issues() -> list[dict]:
        """Issues documented by STAT News and internal IBM documents."""
        return [
            {
                "issue": "Unsafe treatment recommendations",
                "detail": (
                    "Watson recommended treatments that contradicted "
                    "established clinical guidelines and could cause "
                    "patient harm. Internal IBM documents showed that "
                    "engineers were aware of cases where Watson "
                    "recommended inappropriate drug combinations."
                ),
                "root_cause": "Training on limited data from a single "
                              "institution (MSK) that did not represent "
                              "global treatment practices",
            },
            {
                "issue": "Training on hypothetical cases",
                "detail": (
                    "A significant portion of Watson's training was "
                    "based on hypothetical patient scenarios created by "
                    "doctors at MSK, not real patient outcomes. This "
                    "meant the system learned from physician assumptions "
                    "rather than actual treatment efficacy data."
                ),
                "root_cause": "Insufficient real-world training data, "
                              "substituted with synthetic scenarios",
            },
            {
                "issue": "Single-institution bias",
                "detail": (
                    "Watson's recommendations reflected MSK's treatment "
                    "preferences, which in some cases differed from "
                    "international standards. This was particularly "
                    "problematic when deployed in countries with "
                    "different treatment protocols, drug availability, "
                    "and patient populations."
                ),
                "root_cause": "Training data from one institution does "
                              "not generalize to global clinical practice",
            },
            {
                "issue": "Concordance misrepresented as accuracy",
                "detail": (
                    "IBM marketed Watson's 'concordance rate' (how often "
                    "Watson agreed with MSK doctors) as a measure of "
                    "accuracy. However, concordance with one institution's "
                    "doctors is not the same as clinical correctness."
                ),
                "root_cause": "Evaluation methodology conflated agreement "
                              "with a specific institution and clinical "
                              "accuracy",
            },
        ]
```
Common Technical Failure Patterns
Across all three cases, several recurring technical failure patterns emerge:
| Failure Pattern | UnitedHealth | Epic Sepsis | Watson Oncology |
|---|---|---|---|
| Distribution shift | Training data reflects past denials, not needs | Population differs from training set | Single institution does not generalize |
| Inappropriate proxy metrics | Optimized for cost, not outcomes | Marketed accuracy vs. real-world performance | Concordance confused with accuracy |
| Feedback loops | Denials create data that reinforces denials | Alert fatigue reduces effective impact | Not applicable |
| Insufficient validation | Internal validation only | Vendor-provided metrics only | Tested on hypothetical cases |
| Human oversight failure | Algorithm overrides physicians | Clinicians defer to or ignore alerts | Doctors may accept AI recommendations uncritically |
| Vulnerable population impact | Elderly, disabled, chronically ill | All hospitalized patients | Cancer patients in under-resourced settings |
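The feedback-loop row above can be illustrated with a toy simulation (the `approval_fraction` parameter and all numbers are hypothetical, not estimates of any real system): each retraining round treats previously curtailed stays as the new norm.

```python
# Sketch: simulating the denial feedback loop described in the table.
# All parameters are illustrative assumptions.
def retrain_on_own_outputs(physician_days: float, rounds: int,
                           approval_fraction: float = 0.8) -> float:
    """Each round, the model 'learns' from stays it previously curtailed."""
    predicted = physician_days
    for _ in range(rounds):
        predicted *= approval_fraction  # next model sees only shortened stays
    return predicted


print(retrain_on_own_outputs(physician_days=20.0, rounds=3))  # ≈ 10.24 days
```

Even a modest per-round curtailment compounds geometrically, which is why the pattern calls for evaluation data that the model's own decisions cannot contaminate.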
```python
# Common failure pattern framework for healthcare AI
from dataclasses import dataclass


@dataclass
class HealthcareAIFailurePattern:
    """A recurring failure pattern in healthcare AI deployments."""
    name: str
    description: str
    detection_method: str
    prevention_strategy: str


COMMON_FAILURE_PATTERNS = [
    HealthcareAIFailurePattern(
        name="Distribution shift",
        description="Model performance degrades when deployed on a "
                    "patient population different from the training population",
        detection_method="Continuous monitoring of model performance metrics "
                         "stratified by demographics, facility, and time period",
        prevention_strategy="Multi-site validation before deployment; "
                            "continuous performance monitoring post-deployment; "
                            "automatic alerts when performance degrades",
    ),
    HealthcareAIFailurePattern(
        name="Proxy metric misalignment",
        description="Model optimized for a proxy metric (cost, concordance, "
                    "AUC) that does not align with the actual clinical outcome "
                    "of interest (patient health, survival, quality of life)",
        detection_method="Evaluate model against clinically meaningful "
                         "outcome metrics, not just statistical metrics",
        prevention_strategy="Define clinically meaningful evaluation criteria "
                            "before model development; involve clinicians in "
                            "metric selection",
    ),
    HealthcareAIFailurePattern(
        name="Automation bias",
        description="Clinicians defer to AI recommendations even when their "
                    "clinical judgment suggests the AI is wrong",
        detection_method="Track the rate at which clinicians override AI "
                         "recommendations; declining override rates may "
                         "indicate automation bias",
        prevention_strategy="Design interfaces that require active clinical "
                            "judgment rather than passive acceptance; present "
                            "AI outputs as one input among many",
    ),
    HealthcareAIFailurePattern(
        name="Feedback loop degradation",
        description="Model predictions influence the data that is subsequently "
                    "used to retrain or evaluate the model, creating a "
                    "self-reinforcing cycle",
        detection_method="Compare model predictions against independent "
                         "clinical assessments not influenced by the model",
        prevention_strategy="Maintain independent evaluation datasets; "
                            "conduct regular external audits",
    ),
]
```
Lessons Learned
For Healthcare AI Developers
1. Independent, prospective validation is non-negotiable: The Epic sepsis model's performance gap demonstrates that internal retrospective validation is insufficient. Healthcare AI must be validated prospectively at each deployment site, with real patient populations and real clinical workflows.
2. Training data reflects institutional practices, not ground truth: Both Watson for Oncology (single-institution training) and UnitedHealth (training on past coverage decisions) demonstrate that training data captures the biases and practices of the institution that generated it. Clinical ground truth requires independent validation against patient outcomes.
3. Optimize for patient outcomes, not institutional metrics: The UnitedHealth algorithm was optimized for cost reduction; Watson for Oncology was evaluated on concordance with one institution. Healthcare AI must be evaluated against patient-centered outcomes: health improvement, complication rates, survival, and quality of life.
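The first lesson can be made concrete with a small validation harness. The sketch below is illustrative (the `Encounter` record and cohort numbers are assumptions, loosely shaped after the Wong et al. figures, not actual study data): it scores a model's alerts against clinician-adjudicated outcomes collected prospectively at the deployment site.

```python
# Sketch: prospective, site-level validation of a clinical alert model.
# All names and numbers are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class Encounter:
    alert_fired: bool        # did the model alert on this encounter?
    outcome_confirmed: bool  # clinician-adjudicated ground truth


def validate_at_site(encounters: list[Encounter]) -> dict[str, float]:
    """Compute deployment-site sensitivity and PPV from prospective data."""
    tp = sum(e.alert_fired and e.outcome_confirmed for e in encounters)
    fp = sum(e.alert_fired and not e.outcome_confirmed for e in encounters)
    fn = sum(not e.alert_fired and e.outcome_confirmed for e in encounters)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    return {"sensitivity": sensitivity, "ppv": ppv}


# A model that looked strong retrospectively can still miss most true
# cases prospectively, as the Epic sepsis evaluation showed.
cohort = (
    [Encounter(True, True)] * 33      # true positives
    + [Encounter(False, True)] * 67   # missed cases
    + [Encounter(True, False)] * 242  # false alerts
)
metrics = validate_at_site(cohort)
print(metrics)  # sensitivity ≈ 0.33, ppv ≈ 0.12 on this cohort
```

The key design point is that `outcome_confirmed` must come from adjudication independent of the model; validating against labels the model influenced reproduces the feedback-loop failure described above.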
For Healthcare Organizations Deploying AI
1. Maintain meaningful human oversight: AI-assisted decisions in healthcare must include meaningful human oversight where clinicians can override AI recommendations without friction. The UnitedHealth case demonstrates the danger of systems that make human override difficult.
2. Monitor for disparate impact: Healthcare AI systems can disproportionately affect vulnerable populations. Monitor model performance and outcomes across demographic groups, socioeconomic status, and clinical complexity.
3. Establish feedback and monitoring systems: Implement continuous monitoring of AI system performance in production, with automatic alerts when accuracy degrades or when outcomes diverge from expectations.
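A minimal sketch of the monitoring recommendation above, assuming a simple rolling window over per-prediction correctness (the baseline, tolerance, and window size are illustrative choices, not recommended values):

```python
# Sketch: rolling performance monitor with a degradation alert.
# Baseline, tolerance, and window are illustrative assumptions.
from collections import deque


class PerformanceMonitor:
    """Track a rolling accuracy metric and flag degradation."""

    def __init__(self, baseline: float, tolerance: float, window: int = 100):
        self.baseline = baseline
        self.tolerance = tolerance
        self.outcomes: deque[bool] = deque(maxlen=window)

    def record(self, prediction_correct: bool) -> None:
        self.outcomes.append(prediction_correct)

    def degraded(self) -> bool:
        """True once the rolling rate falls below baseline - tolerance."""
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough data to judge yet
        rate = sum(self.outcomes) / len(self.outcomes)
        return rate < self.baseline - self.tolerance


monitor = PerformanceMonitor(baseline=0.76, tolerance=0.10, window=100)
for _ in range(100):
    monitor.record(False)  # a run of misses, as under distribution shift
print(monitor.degraded())  # True
```

A production version would stratify this by demographic group, facility, and time period, per the recommendation in point 2, rather than tracking a single aggregate rate.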
For Red Teams and AI Auditors
1. Healthcare AI red teaming must include clinical expertise: Technical red teams alone are insufficient. Healthcare AI assessments require collaboration with clinical experts who can evaluate whether model recommendations are medically appropriate.
2. Test for distribution shift susceptibility: Evaluate how model performance changes across patient subpopulations, facilities, and time periods. Models that perform well on average may perform dangerously poorly on specific subgroups.
3. Evaluate the appeal and override pathway: For AI systems that influence care decisions, test whether meaningful human override is available and accessible. Systems where the appeal pathway is prohibitively difficult constitute a de facto denial of human oversight.
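The distribution-shift test in point 2 can be sketched as a stratified evaluation; the group labels and counts below are hypothetical, and a real audit would stratify by demographics, facility, and time period:

```python
# Sketch: stratified evaluation to surface subgroup performance gaps.
# Group keys and numbers are hypothetical.
from collections import defaultdict


def stratified_sensitivity(
    records: list[tuple[str, bool, bool]],  # (group, alert_fired, outcome)
) -> dict[str, float]:
    """Per-subgroup sensitivity, exposing gaps an aggregate metric hides."""
    tp: dict[str, int] = defaultdict(int)
    fn: dict[str, int] = defaultdict(int)
    for group, alert_fired, outcome in records:
        if outcome:  # only confirmed cases count toward sensitivity
            if alert_fired:
                tp[group] += 1
            else:
                fn[group] += 1
    return {g: tp[g] / (tp[g] + fn[g]) for g in set(tp) | set(fn)}


records = (
    [("site_a", True, True)] * 8 + [("site_a", False, True)] * 2
    + [("site_b", True, True)] * 3 + [("site_b", False, True)] * 7
)
print(stratified_sensitivity(records))
# Same model, same aggregate data: site_a ≈ 0.8 while site_b ≈ 0.3
```

A model with acceptable pooled sensitivity can still be dangerously poor for one facility or subgroup, which is exactly what an average-only evaluation conceals.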
References
- Wong, A., et al., "External Validation of a Widely Implemented Proprietary Sepsis Prediction Model in Hospitalized Patients," JAMA Internal Medicine, June 2021
- Ross, C., Swetlitz, I., "IBM's Watson supercomputer recommended 'unsafe and incorrect' cancer treatments, internal documents show," STAT News, July 2018
- Ross, C., Herman, B., "UnitedHealth used secret algorithm to deny rehab coverage to Medicare Advantage patients, lawsuit alleges," STAT News, November 2023
- CMS Final Rule, "Contract Year 2024 Policy and Technical Changes to the Medicare Advantage and Medicare Prescription Drug Benefit Programs," January 2024
- European Parliament and Council, "Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence (AI Act)," June 2024
What was the fundamental technical flaw in UnitedHealth's AI-driven claims denial algorithm?
Why did the Epic sepsis model perform so much worse in real-world deployment than in its marketed performance claims?