Case Study: Healthcare AI System Failures and Patient Safety
Analysis of documented healthcare AI system failures including the UnitedHealth/Optum claims denial algorithm, Epic sepsis model performance gaps, and IBM Watson for Oncology's unsafe treatment recommendations.
Overview
Healthcare represents one of the highest-stakes domains for AI deployment. Failures in clinical AI systems can directly cause patient harm, including delayed diagnoses, inappropriate treatments, and denial of necessary care. Unlike most AI failures, which result in financial loss or user inconvenience, healthcare AI failures have life-or-death consequences.
Several high-profile incidents have demonstrated the real-world risks of deploying AI in healthcare settings without adequate validation, monitoring, and safeguards. This case study examines three well-documented categories of failure: algorithmic claims denial systems that overrode clinical judgment to deny care at scale, predictive clinical models that performed far below their reported accuracy in real-world deployment, and clinical decision support systems that generated unsafe treatment recommendations. Together, these incidents illustrate the systemic challenges of deploying AI in safety-critical healthcare settings.
Timeline
2017-2018: IBM Watson for Oncology is deployed at multiple hospitals worldwide, including prominent deployments in India, South Korea, and Thailand. The system is designed to recommend cancer treatment plans by analyzing patient records against medical literature and clinical guidelines.
July 2018: STAT News publishes an investigation revealing that IBM Watson for Oncology generated "unsafe and incorrect" treatment recommendations in multiple documented cases. Internal IBM documents show that the system's recommendations were based primarily on the training data from a single institution (Memorial Sloan Kettering Cancer Center) and that the system had been tested on a limited set of hypothetical rather than real patient cases.
2019: Epic Systems, the largest electronic health records vendor in the US, deploys a proprietary sepsis prediction model across hundreds of hospitals. The model is marketed as capable of identifying patients at risk of developing sepsis hours before clinical onset.
June 2021: Researchers Wong et al. publish a study in JAMA Internal Medicine evaluating Epic's sepsis prediction model across 27,697 patient encounters at the University of Michigan Health System. The study finds the model missed 67% of sepsis cases (a sensitivity of only 33%) and generated a false alert rate of 18%, substantially worse than Epic's marketed performance claims.
2020-2023: UnitedHealth Group's subsidiary Optum deploys an AI algorithm (reportedly named "nH Predict") to predict the length of Medicare Advantage patients' post-acute care stays. The algorithm's predictions are used to deny or curtail coverage for rehabilitation, nursing facility, and home healthcare services.
November 2023: STAT News investigates UnitedHealth's AI-driven claims denial system, reporting that the algorithm had a known error rate of approximately 90% on appeal, yet was used to override physician recommendations and deny coverage to elderly and disabled patients.
November 2023: A federal class-action lawsuit is filed against UnitedHealth Group alleging that the company used its AI algorithm to illegally deny care to Medicare Advantage beneficiaries, with the algorithm overriding doctors' clinical assessments of patient needs.
January 2024: The US Centers for Medicare and Medicaid Services (CMS) issues a final rule requiring Medicare Advantage plans to base coverage decisions on individual patient assessments rather than algorithm-generated predictions, directly responding to the UnitedHealth controversy.
2024: The European Union's AI Act classifies healthcare AI systems as "high risk," requiring conformity assessments, human oversight, and post-market monitoring. The US FDA continues its evolving framework for AI/ML-based Software as a Medical Device (SaMD).
Technical Analysis
Case 1: UnitedHealth/Optum Claims Denial Algorithm
The UnitedHealth case represents AI-driven decision making at scale in a high-stakes setting, where the consequences of errors (denied healthcare coverage) fall disproportionately on vulnerable populations.
```python
# Conceptual model of AI-driven claims denial architecture
from dataclasses import dataclass
from datetime import date
from enum import Enum


class CareDecision(Enum):
    APPROVE = "approve"
    DENY = "deny"
    REVIEW = "manual_review"


@dataclass
class PatientCase:
    """A patient case submitted for coverage determination."""
    patient_id: str
    age: int
    diagnosis_codes: list[str]  # ICD-10 codes
    procedure_codes: list[str]  # CPT codes
    admission_date: date
    physician_recommended_days: int
    facility_type: str  # SNF, rehab, home health


@dataclass
class AlgorithmPrediction:
    """The AI algorithm's prediction for a patient's care needs."""
    predicted_days: int
    confidence: float
    decision: CareDecision
    override_physician: bool


class ClaimsDenialAnalysis:
    """
    Analysis of how algorithmic claims denial operates and
    its failure modes.
    """

    @staticmethod
    def demonstrate_systematic_denial() -> dict:
        """
        Illustrate how the algorithm systematically produced
        shorter care predictions than physicians recommended.
        """
        return {
            "mechanism": (
                "The algorithm predicts the 'appropriate' length of stay "
                "based on historical claims data. However, the training data "
                "reflects PAST coverage decisions (which were already "
                "restrictive), not actual patient needs. This creates a "
                "self-reinforcing cycle: denied claims produce shorter "
                "stays in the data, which trains the algorithm to predict "
                "even shorter stays."
            ),
            "feedback_loop": [
                "1. Algorithm trained on historical claims data",
                "2. Historical data reflects past denials and shortened stays",
                "3. Algorithm learns that shorter stays are 'normal'",
                "4. Algorithm predicts shorter stays than physicians recommend",
                "5. Denials based on algorithm predictions create new data",
                "6. New data reinforces the shorter-stay pattern",
                "7. Cycle continues → progressively more restrictive predictions",
            ],
            "key_finding": (
                "The algorithm's approximately 90% reversal rate on appeal "
                "demonstrates that its predictions were systematically wrong. "
                "However, the friction of the appeals process meant that "
                "many patients never appealed, effectively denying them care "
                "they were entitled to receive."
            ),
            "vulnerable_populations": [
                "Elderly patients in skilled nursing facilities",
                "Patients recovering from stroke requiring rehabilitation",
                "Post-surgical patients requiring extended home health care",
                "Patients with complex chronic conditions",
            ],
        }

    @staticmethod
    def identify_failure_modes() -> list[dict]:
        """Identify the specific AI failure modes in the system."""
        return [
            {
                "failure": "Distribution shift in training data",
                "description": "Model trained on historical data reflecting "
                               "past restrictive practices, not clinical needs",
                "impact": "Systematically underestimates care needs",
            },
            {
                "failure": "Inappropriate automation of clinical judgment",
                "description": "Algorithm overrides physician assessments "
                               "based on statistical patterns, ignoring "
                               "individual patient complexity",
                "impact": "Denies care to patients whose conditions "
                          "deviate from population averages",
            },
            {
                "failure": "Asymmetric error costs",
                "description": "The algorithm was optimized for cost reduction "
                               "without adequately weighting the cost of "
                               "false denials (patient harm)",
                "impact": "Systematically errs toward denial because "
                          "false denials are cheaper for the insurer "
                          "than false approvals",
            },
            {
                "failure": "Feedback loop reinforcement",
                "description": "Denied claims produce shorter stays in data, "
                               "reinforcing the model's denial behavior",
                "impact": "Progressive degradation of coverage decisions",
            },
        ]
```
Case 2: Epic Sepsis Prediction Model
The Epic sepsis model illustrates the gap between model performance in development settings and performance in real-world clinical deployment:
```python
# Analysis of the Epic sepsis model performance gap
from dataclasses import dataclass


@dataclass
class ModelPerformanceComparison:
    """Compare marketed vs. real-world model performance."""
    metric: str
    marketed_value: float
    real_world_value: float
    source: str


class EpicSepsisAnalysis:
    """
    Analysis of the Epic sepsis prediction model's real-world
    performance based on Wong et al. (2021) JAMA Internal Medicine.
    """

    @staticmethod
    def performance_comparison() -> list[ModelPerformanceComparison]:
        """
        Compare Epic's marketed performance claims with
        independently measured real-world performance.
        """
        return [
            ModelPerformanceComparison(
                metric="Sensitivity (true positive rate)",
                marketed_value=0.76,  # Epic's claimed performance
                real_world_value=0.33,  # Wong et al. measurement
                source="Wong et al., JAMA Intern Med, 2021",
            ),
            ModelPerformanceComparison(
                metric="Positive predictive value",
                marketed_value=0.50,  # approximate, from marketing
                real_world_value=0.12,  # Wong et al. measurement
                source="Wong et al., JAMA Intern Med, 2021",
            ),
            ModelPerformanceComparison(
                metric="False alert rate",
                marketed_value=0.05,  # approximate, from marketing
                real_world_value=0.18,  # Wong et al. measurement
                source="Wong et al., JAMA Intern Med, 2021",
            ),
        ]

    @staticmethod
    def explain_performance_gap() -> dict:
        """Explain why the model performed so differently in practice."""
        return {
            "causes": [
                {
                    "factor": "Population distribution shift",
                    "explanation": (
                        "The model was developed and validated on a specific "
                        "patient population. When deployed at hospitals with "
                        "different patient demographics, disease prevalence, "
                        "and clinical workflows, the model's learned patterns "
                        "did not transfer."
                    ),
                },
                {
                    "factor": "Temporal distribution shift",
                    "explanation": (
                        "Clinical practice evolves over time. A model trained "
                        "on historical data may not reflect current treatment "
                        "protocols, coding practices, or documentation patterns."
                    ),
                },
                {
                    "factor": "Validation methodology",
                    "explanation": (
                        "Epic's internal validation may have used methods "
                        "(e.g., retrospective data splits) that overestimate "
                        "prospective performance. The Wong et al. study used "
                        "prospective evaluation, measuring performance in "
                        "real-time clinical deployment."
                    ),
                },
                {
                    "factor": "Alert fatigue interaction",
                    "explanation": (
                        "With an 18% false alert rate, clinicians experience "
                        "alert fatigue and begin ignoring alerts. This means "
                        "that even when the model correctly identifies sepsis, "
                        "the alert may be dismissed, reducing the model's "
                        "effective clinical impact below its measured accuracy."
                    ),
                },
            ],
            "broader_implication": (
                "The Epic sepsis model illustrates a systemic problem: "
                "healthcare AI models are frequently evaluated using methods "
                "that overestimate real-world performance. Independent "
                "prospective validation at deployment sites is essential "
                "but rarely performed before deployment."
            ),
        }
```
Case 3: IBM Watson for Oncology
IBM Watson for Oncology represented one of the most ambitious and visible healthcare AI deployments, and its failures provided early warnings about the challenges of clinical AI:
```python
# Analysis of IBM Watson for Oncology failures
class WatsonOncologyAnalysis:
    """
    Analysis of IBM Watson for Oncology's documented failures
    in generating safe treatment recommendations.
    """

    @staticmethod
    def documented_issues() -> list[dict]:
        """Issues documented by STAT News and internal IBM documents."""
        return [
            {
                "issue": "Unsafe treatment recommendations",
                "detail": (
                    "Watson recommended treatments that contradicted "
                    "established clinical guidelines and could cause "
                    "patient harm. Internal IBM documents showed that "
                    "engineers were aware of cases where Watson "
                    "recommended inappropriate drug combinations."
                ),
                "root_cause": "Training on limited data from a single "
                              "institution (MSK) that did not represent "
                              "global treatment practices",
            },
            {
                "issue": "Training on hypothetical cases",
                "detail": (
                    "A significant portion of Watson's training was "
                    "based on hypothetical patient scenarios created by "
                    "doctors at MSK, not real patient outcomes. This "
                    "meant the system learned from physician assumptions "
                    "rather than actual treatment efficacy data."
                ),
                "root_cause": "Insufficient real-world training data, "
                              "substituted with synthetic scenarios",
            },
            {
                "issue": "Single-institution bias",
                "detail": (
                    "Watson's recommendations reflected MSK's treatment "
                    "preferences, which in some cases differed from "
                    "international standards. This was particularly "
                    "problematic when deployed in countries with "
                    "different treatment protocols, drug availability, "
                    "and patient populations."
                ),
                "root_cause": "Training data from one institution does "
                              "not generalize to global clinical practice",
            },
            {
                "issue": "Concordance misrepresented as accuracy",
                "detail": (
                    "IBM marketed Watson's 'concordance rate' (how often "
                    "Watson agreed with MSK doctors) as a measure of "
                    "accuracy. However, concordance with one institution's "
                    "doctors is not the same as clinical correctness."
                ),
                "root_cause": "Evaluation methodology conflated agreement "
                              "with a specific institution and clinical "
                              "accuracy",
            },
        ]
```
Common Technical Failure Patterns
Across all three cases, several recurring technical failure patterns emerge:
| Failure Pattern | UnitedHealth | Epic Sepsis | Watson Oncology |
|---|---|---|---|
| Distribution shift | Training data reflects past denials, not needs | Population differs from training set | Single institution does not generalize |
| Inappropriate proxy metrics | Optimized for cost, not outcomes | Marketed accuracy vs. real-world performance | Concordance confused with accuracy |
| Feedback loops | Denials create data that reinforces denials | Alert fatigue reduces effective impact | Not applicable |
| Insufficient validation | Internal validation only | Vendor-provided metrics only | Tested on hypothetical cases |
| Human oversight failure | Algorithm overrides physicians | Clinicians defer to or ignore alerts | Doctors may accept AI recommendations uncritically |
| Vulnerable population impact | Elderly, disabled, chronically ill | All hospitalized patients | Cancer patients in under-resourced settings |
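The feedback-loop row above can be illustrated with a toy simulation (the `approval_fraction` parameter and all numbers are hypothetical, not estimates of any real system): each retraining round treats previously curtailed stays as the new norm.

```python
# Sketch: simulating the denial feedback loop described in the table.
# All parameters are illustrative assumptions.
def retrain_on_own_outputs(physician_days: float, rounds: int,
                           approval_fraction: float = 0.8) -> float:
    """Each round, the model 'learns' from stays it previously curtailed."""
    predicted = physician_days
    for _ in range(rounds):
        predicted *= approval_fraction  # next model sees only shortened stays
    return predicted


print(retrain_on_own_outputs(physician_days=20.0, rounds=3))  # ≈ 10.24 days
```

Even a modest per-round curtailment compounds geometrically, which is why the pattern calls for evaluation data that the model's own decisions cannot contaminate.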
```python
# Common failure pattern framework for healthcare AI
from dataclasses import dataclass


@dataclass
class HealthcareAIFailurePattern:
    """A recurring failure pattern in healthcare AI deployments."""
    name: str
    description: str
    detection_method: str
    prevention_strategy: str


COMMON_FAILURE_PATTERNS = [
    HealthcareAIFailurePattern(
        name="Distribution shift",
        description="Model performance degrades when deployed on a "
                    "patient population different from the training population",
        detection_method="Continuous monitoring of model performance metrics "
                         "stratified by demographics, facility, and time period",
        prevention_strategy="Multi-site validation before deployment; "
                            "continuous performance monitoring post-deployment; "
                            "automatic alerts when performance degrades",
    ),
    HealthcareAIFailurePattern(
        name="Proxy metric misalignment",
        description="Model optimized for a proxy metric (cost, concordance, "
                    "AUC) that does not align with the actual clinical outcome "
                    "of interest (patient health, survival, quality of life)",
        detection_method="Evaluate model against clinically meaningful "
                         "outcome metrics, not just statistical metrics",
        prevention_strategy="Define clinically meaningful evaluation criteria "
                            "before model development; involve clinicians in "
                            "metric selection",
    ),
    HealthcareAIFailurePattern(
        name="Automation bias",
        description="Clinicians defer to AI recommendations even when their "
                    "clinical judgment suggests the AI is wrong",
        detection_method="Track the rate at which clinicians override AI "
                         "recommendations; declining override rates may "
                         "indicate automation bias",
        prevention_strategy="Design interfaces that require active clinical "
                            "judgment rather than passive acceptance; present "
                            "AI outputs as one input among many",
    ),
    HealthcareAIFailurePattern(
        name="Feedback loop degradation",
        description="Model predictions influence the data that is subsequently "
                    "used to retrain or evaluate the model, creating a "
                    "self-reinforcing cycle",
        detection_method="Compare model predictions against independent "
                         "clinical assessments not influenced by the model",
        prevention_strategy="Maintain independent evaluation datasets; "
                            "conduct regular external audits",
    ),
]
```
Lessons Learned
For Healthcare AI Developers
1. Independent, prospective validation is non-negotiable: The Epic sepsis model's performance gap demonstrates that internal retrospective validation is insufficient. Healthcare AI must be validated prospectively at each deployment site, with real patient populations and real clinical workflows.
2. Training data reflects institutional practices, not ground truth: Both Watson for Oncology (single-institution training) and UnitedHealth (training on past coverage decisions) demonstrate that training data captures the biases and practices of the institution that generated it. Clinical ground truth requires independent validation against patient outcomes.
3. Optimize for patient outcomes, not institutional metrics: The UnitedHealth algorithm was optimized for cost reduction; Watson for Oncology was evaluated on concordance with one institution. Healthcare AI must be evaluated against patient-centered outcomes: health improvement, complication rates, survival, and quality of life.
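The first lesson can be made concrete with a small validation harness. The sketch below is illustrative (the `Encounter` record and cohort numbers are assumptions, loosely shaped after the Wong et al. figures, not actual study data): it scores a model's alerts against clinician-adjudicated outcomes collected prospectively at the deployment site.

```python
# Sketch: prospective, site-level validation of a clinical alert model.
# All names and numbers are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class Encounter:
    alert_fired: bool        # did the model alert on this encounter?
    outcome_confirmed: bool  # clinician-adjudicated ground truth


def validate_at_site(encounters: list[Encounter]) -> dict[str, float]:
    """Compute deployment-site sensitivity and PPV from prospective data."""
    tp = sum(e.alert_fired and e.outcome_confirmed for e in encounters)
    fp = sum(e.alert_fired and not e.outcome_confirmed for e in encounters)
    fn = sum(not e.alert_fired and e.outcome_confirmed for e in encounters)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    return {"sensitivity": sensitivity, "ppv": ppv}


# A model that looked strong retrospectively can still miss most true
# cases prospectively, as the Epic sepsis evaluation showed.
cohort = (
    [Encounter(True, True)] * 33      # true positives
    + [Encounter(False, True)] * 67   # missed cases
    + [Encounter(True, False)] * 242  # false alerts
)
metrics = validate_at_site(cohort)
print(metrics)  # sensitivity ≈ 0.33, ppv ≈ 0.12 on this cohort
```

The key design point is that `outcome_confirmed` must come from adjudication independent of the model; validating against labels the model influenced reproduces the feedback-loop failure described above.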
For Healthcare Organizations Deploying AI
1. Maintain meaningful human oversight: AI-assisted decisions in healthcare must include meaningful human oversight where clinicians can override AI recommendations without friction. The UnitedHealth case demonstrates the danger of systems that make human override difficult.
2. Monitor for disparate impact: Healthcare AI systems can disproportionately affect vulnerable populations. Monitor model performance and outcomes across demographic groups, socioeconomic status, and clinical complexity.
3. Establish feedback and monitoring systems: Implement continuous monitoring of AI system performance in production, with automatic alerts when accuracy degrades or when outcomes diverge from expectations.
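A minimal sketch of the monitoring recommendation above, assuming a simple rolling window over per-prediction correctness (the baseline, tolerance, and window size are illustrative choices, not recommended values):

```python
# Sketch: rolling performance monitor with a degradation alert.
# Baseline, tolerance, and window are illustrative assumptions.
from collections import deque


class PerformanceMonitor:
    """Track a rolling accuracy metric and flag degradation."""

    def __init__(self, baseline: float, tolerance: float, window: int = 100):
        self.baseline = baseline
        self.tolerance = tolerance
        self.outcomes: deque[bool] = deque(maxlen=window)

    def record(self, prediction_correct: bool) -> None:
        self.outcomes.append(prediction_correct)

    def degraded(self) -> bool:
        """True once the rolling rate falls below baseline - tolerance."""
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough data to judge yet
        rate = sum(self.outcomes) / len(self.outcomes)
        return rate < self.baseline - self.tolerance


monitor = PerformanceMonitor(baseline=0.76, tolerance=0.10, window=100)
for _ in range(100):
    monitor.record(False)  # a run of misses, as under distribution shift
print(monitor.degraded())  # True
```

A production version would stratify this by demographic group, facility, and time period, per the recommendation in point 2, rather than tracking a single aggregate rate.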
For Red Teams and AI Auditors
1. Healthcare AI red teaming must include clinical expertise: Technical red teams alone are insufficient. Healthcare AI assessments require collaboration with clinical experts who can evaluate whether model recommendations are medically appropriate.
2. Test for distribution shift susceptibility: Evaluate how model performance changes across patient subpopulations, facilities, and time periods. Models that perform well on average may perform dangerously poorly on specific subgroups.
3. Evaluate the appeal and override pathway: For AI systems that influence care decisions, test whether meaningful human override is available and accessible. Systems where the appeal pathway is prohibitively difficult constitute a de facto denial of human oversight.
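The distribution-shift test in point 2 can be sketched as a stratified evaluation; the group labels and counts below are hypothetical, and a real audit would stratify by demographics, facility, and time period:

```python
# Sketch: stratified evaluation to surface subgroup performance gaps.
# Group keys and numbers are hypothetical.
from collections import defaultdict


def stratified_sensitivity(
    records: list[tuple[str, bool, bool]],  # (group, alert_fired, outcome)
) -> dict[str, float]:
    """Per-subgroup sensitivity, exposing gaps an aggregate metric hides."""
    tp: dict[str, int] = defaultdict(int)
    fn: dict[str, int] = defaultdict(int)
    for group, alert_fired, outcome in records:
        if outcome:  # only confirmed cases count toward sensitivity
            if alert_fired:
                tp[group] += 1
            else:
                fn[group] += 1
    return {g: tp[g] / (tp[g] + fn[g]) for g in set(tp) | set(fn)}


records = (
    [("site_a", True, True)] * 8 + [("site_a", False, True)] * 2
    + [("site_b", True, True)] * 3 + [("site_b", False, True)] * 7
)
print(stratified_sensitivity(records))
# Same model, same aggregate data: site_a ≈ 0.8 while site_b ≈ 0.3
```

A model with acceptable pooled sensitivity can still be dangerously poor for one facility or subgroup, which is exactly what an average-only evaluation conceals.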
References
- Wong, A., et al., "External Validation of a Widely Implemented Proprietary Sepsis Prediction Model in Hospitalized Patients," JAMA Internal Medicine, June 2021
- Ross, C., Swetlitz, I., "IBM's Watson supercomputer recommended 'unsafe and incorrect' cancer treatments, internal documents show," STAT News, July 2018
- Ross, C., Herman, B., "UnitedHealth used secret algorithm to deny rehab coverage to Medicare Advantage patients, lawsuit alleges," STAT News, November 2023
- CMS Final Rule, "Contract Year 2024 Policy and Technical Changes to the Medicare Advantage and Medicare Prescription Drug Benefit Programs," January 2024
- European Parliament and Council, "Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence (AI Act)," June 2024
What was the fundamental technical flaw in UnitedHealth's AI-driven claims denial algorithm?
Why did the Epic sepsis model perform so much worse in real-world deployment than in its marketed performance claims?