Overview

A high-level summary of the CDEWI system's performance metrics. These numbers give a quick snapshot of how well the underlying AI models are performing.

Tabular Accuracy: 99.3% (accuracy on test set)
ROC AUC: 0.996 (macro-averaged ROC AUC score)
Brier Score: 0.0059 (calibration metric; lower is better)
Language AUC: 1.0 (language model ROC AUC)

Understanding These Metrics

These numbers tell us how well the AI model performs. Think of them like grades on a report card: each metric measures a different aspect of the model's performance.

Tabular Model Accuracy

What it is: The percentage of predictions that are correct on the test set.

In plain English: Out of 1,000 patients, the model correctly classifies about 993 of them. This is like getting an A+ on a test: the model is highly accurate across all 9 cognitive stages (from healthy to severe dementia).
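The calculation behind this headline number is just the fraction of exact matches between true and predicted stage labels. A minimal sketch using scikit-learn (the stage labels here are made up for illustration):

```python
from sklearn.metrics import accuracy_score

# Toy true stages (0 = healthy ... 8 = severe dementia) and model predictions
y_true = [0, 1, 2, 3, 4, 5, 6, 7, 8, 0]
y_pred = [0, 1, 2, 3, 4, 5, 6, 7, 8, 1]  # one mistake out of ten

acc = accuracy_score(y_true, y_pred)  # fraction of exact matches
print(acc)  # 0.9
```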

Macro ROC AUC

What it is: The Area Under the ROC Curve measures how well the model distinguishes between different classes.

In plain English: A score of 0.996 (out of 1.0) means the model is excellent at telling the difference between, say, a healthy patient and someone with mild cognitive impairment. It is like having a very sharp eye for spotting differences. The macro-averaged part means it performs consistently well across all 9 stages, not just some of them.
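The macro-averaged, one-vs-rest computation can be illustrated with scikit-learn; this toy example uses three classes instead of nine to keep it short, and the probabilities are made up:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Each row is one patient's predicted probability over 3 classes (sums to 1)
y_true = [0, 0, 1, 1, 2, 2]
y_prob = np.array([
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.2, 0.7],
    [0.2, 0.1, 0.7],
])

# "macro" averages the one-vs-rest AUC of every class with equal weight,
# so a rare class counts as much as a common one
auc = roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro")
print(auc)  # 1.0 here, because every class is perfectly separated
```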

Brier Score

What it is: Measures prediction calibration: whether the model's reported confidence matches reality.

In plain English: When the model says there is an 80 percent chance of mild cognitive impairment, is it actually right 80 percent of the time? A score near 0 (like 0.0059) means yes: the model's confidence scores are trustworthy. Lower is better. This matters because doctors need to know when the model is certain versus uncertain.
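A minimal sketch of one common multi-class generalization of the Brier score (the mean squared distance between the predicted probability vector and the one-hot true label); the exact variant behind the 0.0059 figure may differ:

```python
import numpy as np

def multiclass_brier(y_true, y_prob):
    """Mean squared difference between predicted probabilities and
    one-hot true labels (one common multi-class generalization)."""
    n, k = y_prob.shape
    onehot = np.zeros((n, k))
    onehot[np.arange(n), y_true] = 1.0
    return np.mean(np.sum((y_prob - onehot) ** 2, axis=1))

# A confident, correct model scores near 0; a hedging model scores higher
y_true = np.array([0, 1])
confident = np.array([[0.95, 0.05], [0.05, 0.95]])
hedging   = np.array([[0.50, 0.50], [0.50, 0.50]])
print(multiclass_brier(y_true, confident))  # ~0.005
print(multiclass_brier(y_true, hedging))    # 0.5
```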

Language Model AUC

What it is: The language model's ability to distinguish between healthy and impaired speech patterns.

In plain English: A perfect score of 1.0 means the language model can perfectly separate healthy language patterns from those showing cognitive decline. It analyzes things like sentence complexity and vocabulary richness. Note: this score comes from a mock implementation and will be updated once the real NLP model is integrated.

Bottom Line

Together, these metrics show that the CDEWI model is accurate (gets it right), calibrated (its confidence scores match reality), discriminating (can tell classes apart), and consistent (works well across all stages). This combination is essential for a medical AI system that doctors can trust.

Clinical Predictions

What is this section?

This section shows the AI's prediction for a specific patient's cognitive stage. The model analyzes 130 biomarkers (labeled as Biomarker 0–129 or Latent Features 0–129) including cognitive test scores, brain imaging, genetic markers, and demographics to classify the patient into one of 9 stages—from cognitively normal to severe dementia. You'll see the predicted stage, how confident the model is, and a probability breakdown across all 9 stages.
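A minimal sketch of how the displayed prediction falls out of a probability breakdown: the predicted stage is the one with the highest probability, and the confidence is that probability itself (the numbers below are made up):

```python
import numpy as np

# Hypothetical probability breakdown over the 9 cognitive stages
probs = np.array([0.02, 0.05, 0.70, 0.10, 0.05, 0.03, 0.02, 0.02, 0.01])

predicted_stage = int(np.argmax(probs))  # stage with the highest probability
confidence = float(probs.max())          # model's confidence in that stage

print(predicted_stage, confidence)  # 2 0.7
```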

Select a patient to view clinical predictions

Language Analysis

What is this section?

Research shows that how people speak and write changes with cognitive decline. This tool analyzes text (like a patient's written description or transcribed speech) to detect linguistic patterns associated with cognitive impairment. It looks at sentence length, vocabulary richness, and pronoun usage—all markers that can indicate cognitive changes before they're obvious in other tests.

Enter narrative text to analyze linguistic markers associated with cognitive impairment. The system evaluates sentence complexity, lexical diversity, and pronoun usage patterns.
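The three markers named here can be roughly approximated with standard-library Python. This is only a sketch: the real system presumably uses proper NLP tokenization, and the pronoun list below is illustrative:

```python
import re

def linguistic_markers(text):
    """Rough approximations of sentence complexity, lexical diversity,
    and pronoun usage (toy sketch, not the production pipeline)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-zA-Z']+", text.lower())
    pronouns = {"i", "he", "she", "it", "they", "we", "you",
                "him", "her", "them", "me", "us"}
    return {
        "avg_sentence_length": len(words) / len(sentences),
        "lexical_diversity": len(set(words)) / len(words),  # type-token ratio
        "pronoun_rate": sum(w in pronouns for w in words) / len(words),
    }

m = linguistic_markers("She went there. She saw it. It was nice.")
print(m)  # short sentences, repeated pronouns -> high pronoun_rate
```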

Fusion Model (CDEWI)

What is this section?

This section combines two different AI models into one unified risk score called CDEWI (Cognitive Decline Early Warning Index). Think of it like getting a second opinion: one model looks at clinical data (test scores, brain scans), while the other analyzes language patterns. By combining both, we get a more complete picture.

The alpha slider lets you control how much weight to give each model. For example, if language data is unreliable for a patient, you can increase alpha to rely more on clinical data.

Combine clinical and language model predictions to compute the Cognitive Decline Early Warning Index (CDEWI). Adjust the alpha weight to control the relative contribution of each model.

Language only (0.0) · Balanced (0.5) · Clinical only (1.0)
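Under the slider convention (alpha = 1.0 is clinical only, alpha = 0.0 is language only), the fusion is a simple weighted average of the two models' probability vectors. A toy two-stage sketch (the real CDEWI fuses 9-stage vectors):

```python
import numpy as np

def fuse(p_clinical, p_language, alpha=0.5):
    """Weighted average of the two models' probability vectors.
    alpha=1.0 uses only clinical data, alpha=0.0 only language."""
    return alpha * p_clinical + (1 - alpha) * p_language

p_clin = np.array([0.2, 0.8])  # toy 2-stage example
p_lang = np.array([0.6, 0.4])

print(fuse(p_clin, p_lang, alpha=0.5))  # balanced blend of both models
print(fuse(p_clin, p_lang, alpha=1.0))  # identical to the clinical model
```

Because both inputs are valid probability vectors, any convex combination of them is also a valid probability vector, so no re-normalization is needed.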

Model Interpretability (SHAP)

What is this section?

This section opens up the AI's "black box" to show which features matter most for predictions. Instead of just getting a prediction, you can see why the model made that prediction. This is crucial for doctors who need to understand and trust the AI's reasoning before making clinical decisions.

Explore feature importance using SHAP (SHapley Additive exPlanations) values. View global feature rankings or patient-specific contributions.

What is Model Interpretability?

Interpretability means understanding why the AI made a specific prediction. Instead of treating the model as a "black box," we can see which patient features (like age, cognitive test scores, or brain imaging results) had the biggest impact on the prediction.

Understanding SHAP Values

SHAP (SHapley Additive exPlanations) is a method that assigns each feature an "importance score" for a prediction. Think of it like a recipe: SHAP tells you how much each ingredient (feature) contributed to the final dish (prediction).
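The "recipe" idea has a precise form: a baseline (the average prediction) plus the per-feature contributions add up exactly to the prediction. For a linear model the contributions have a simple closed form (weight times the feature's deviation from its mean); the shap library computes the analogous values for arbitrary models. A hand-rolled sketch of the linear case, with made-up weights and data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))        # 100 toy patients, 4 toy features
w = np.array([2.0, -1.0, 0.5, 0.0])  # linear model weights
preds = X @ w

baseline = preds.mean()                  # the "average dish"
shap_values = (X - X.mean(axis=0)) * w   # each ingredient's contribution

# Additivity: baseline + contributions reconstruct every prediction exactly
assert np.allclose(baseline + shap_values.sum(axis=1), preds)
```

Note that the feature with weight 0.0 gets a contribution of exactly zero for every patient, which is what "no importance" means here.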

Understanding Biomarker Labels

The model uses 130 biomarkers labeled as "Biomarker 0" through "Biomarker 129" (also called "Latent Features 0–129"). This numbering system is standard practice in biomedical machine learning when:

  • The dataset is anonymized for patient privacy
  • Features are derived from PCA, embeddings, or dimensionality reduction
  • Raw feature names are intentionally hidden to avoid privacy issues

Each biomarker corresponds to clinical measurements including cognitive tests, brain imaging, genetic markers, demographics, and other patient data.

Reading the Charts

Global Importance: Shows which features matter most across all patients. Longer bars = more important features overall.

Patient-Level: Shows which features mattered for one specific patient. Green bars = feature increased risk prediction. Red bars = feature decreased risk prediction. Longer bars = stronger influence.
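The two views come from the same SHAP matrix: averaging absolute values down each column gives the global ranking, while a single signed row gives the patient-level chart. A sketch with a hypothetical 3-patient, 3-biomarker matrix:

```python
import numpy as np

# Hypothetical SHAP matrix: rows = patients, columns = biomarkers
shap_values = np.array([
    [ 0.30, -0.10,  0.02],
    [-0.25,  0.05,  0.01],
    [ 0.20, -0.15, -0.03],
])

# Global importance: average magnitude per feature across all patients
global_importance = np.abs(shap_values).mean(axis=0)
print(global_importance)  # mean |SHAP| per column: 0.25, 0.10, 0.02

# Patient-level: the signed row for one patient
print(shap_values[0])  # positive values pushed this patient's risk up
```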

What Do These Biomarkers Represent?

The 130 biomarkers (Biomarker 0–129 or Latent Features 0–129) are anonymized for privacy and follow standard biomedical ML practices. Based on typical Alzheimer's research datasets, here are the types of measurements these biomarkers represent:

Cognitive Tests (~15-20 features)

MMSE (Mini-Mental State Exam): Total score, subscores for orientation, memory, attention, language
ADAS-Cog: Word recall, naming, commands, orientation, word recognition
CDR: Clinical Dementia Rating (memory, orientation, judgment, community affairs, home/hobbies, personal care)

Brain Imaging (~40-50 features)

Volumetric MRI: Hippocampus (left/right), entorhinal cortex, whole brain, ventricles
Cortical thickness: Frontal, temporal, parietal, occipital lobes
Regional volumes: Amygdala, thalamus, caudate, putamen, pallidum
White matter: Lesion volume, hyperintensities

Genetic & Risk Factors (~5-10 features)

APOE genotype: ε4 allele count (0, 1, or 2 copies; a major Alzheimer's risk factor)
Family history: First-degree relatives with dementia
Other genes: TREM2, CLU, PICALM variants

Demographics (~5-8 features)

Age, Sex, Education years, Ethnicity, Marital status, Handedness

Clinical Biomarkers (~10-15 features)

CSF proteins: Amyloid-beta 42, Total tau, Phosphorylated tau (p-tau)
Blood markers: Plasma amyloid, neurofilament light chain
Ratios: Aβ42/Aβ40, tau/Aβ42

Functional & Clinical (~20-30 features)

ADL: Activities of daily living (bathing, dressing, eating, toileting)
IADL: Instrumental ADL (shopping, cooking, managing money, medications)
NPI: Neuropsychiatric Inventory (depression, anxiety, agitation, apathy)
FAQ: Functional Activities Questionnaire

Why use numbered biomarkers? Labeling as "Biomarker 0–129" or "Latent Feature 0–129" is extremely common and expected in biomedical machine learning competitions and research. It protects patient privacy, handles PCA/embedding-derived features, and prevents reverse-engineering of proprietary datasets. In clinical deployment, biomarkers would map to actual measurements (e.g., "MMSE Total Score: 24" instead of "Biomarker 0: 24").

What am I looking at?

This chart shows the 15 most important features the AI uses to make predictions across all patients. Features at the top have the biggest impact on predictions. For example, if "MMSE Score" is at the top, it means cognitive test scores are the most influential factor in determining a patient's risk level.

Top 15 Features by Mean Absolute SHAP Value

Research Report

Comprehensive documentation of the Aurora Cognitive Index system, methodology, and findings.

Aurora Cognitive Index - Full Report

This report provides detailed information about the multi-modal AI system for cognitive decline prediction, including methodology, model architecture, evaluation metrics, and clinical implications.

Document Type

Research Report

Author

Ahmadreza Azizi

Last Updated

December 2025

Report Contents

System Architecture & Design
Multi-Modal Fusion Methodology
Clinical Biomarker Analysis
Language Model Integration
Model Interpretability & SHAP Analysis
Robustness & Adversarial Testing
Performance Metrics & Evaluation
Clinical Implications & Future Work

Model Card

What is a Model Card?

A Model Card is like a nutrition label for AI systems. It provides transparent information about what the model is designed for, how it was trained, its limitations, and ethical considerations. This transparency helps users understand when and how to appropriately use (or not use) the AI system. Think of it as the model's "instruction manual and warning label" combined.

Research prototype – not for clinical use

This system is designed for research and demonstration purposes only. It should not be used for clinical diagnosis, treatment decisions, or patient care without proper validation and regulatory approval.

Intended Use

The CDEWI (Cognitive Decline Early Warning Index) is intended for:

  • Research into multi-modal cognitive decline prediction
  • Demonstration of fusion techniques combining clinical and linguistic data
  • Educational purposes in AI-assisted healthcare applications
  • Exploratory analysis of feature importance in Alzheimer's prediction
  • Benchmarking robustness metrics for medical AI systems

Not Intended For

This system should NOT be used for:

  • Clinical diagnosis of Alzheimer's Disease or cognitive impairment
  • Treatment planning or medical decision-making
  • Patient screening or triage in healthcare settings
  • Regulatory submissions or clinical trials
  • Any application where errors could result in patient harm
  • Deployment without domain expert oversight

Training Data

Clinical Model: Trained on 130 biomarkers (labeled as Biomarker 0–129 or Latent Features 0–129) including cognitive assessments (MMSE, ADAS), neuroimaging measurements (hippocampal volume, cortical thickness), genetic markers (APOE4 status), and demographic information. Biomarker numbering follows standard biomedical ML practices for anonymized datasets.

Language Model: Trained on narrative text samples from cognitive assessments, analyzing linguistic markers such as sentence complexity, lexical diversity, and pronoun usage patterns.

Data Sources: Model predictions shown are from actual test data. The dashboard displays 20 sample patients with real clinical predictions, SHAP interpretability values, and robustness metrics. Language analysis currently uses a mock implementation.

Note: Real deployment would require validated clinical datasets with appropriate IRB approval and patient consent.

Evaluation & Metrics

Performance Metrics:

  • Tabular Model Accuracy: 99.3%
  • Macro ROC AUC: 0.996
  • Brier Score: 0.0059
  • Language Model AUC: 1.0

Robustness: Evaluated under Gaussian noise and FGSM adversarial perturbations. Model maintains strong performance under random noise but shows expected vulnerability to targeted attacks.
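The FGSM attack referenced here perturbs each input feature by a small step in the direction that most increases the loss. A toy illustration on a hand-built logistic model (all weights and inputs below are made up):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Toy logistic model: p(impaired) = sigmoid(w . x + b)
w = np.array([1.5, -2.0, 0.5])
b = 0.1
x = np.array([0.2, -0.4, 1.0])
y = 1.0  # true label

def loss(x):
    p = sigmoid(w @ x + b)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# FGSM: x_adv = x + eps * sign(grad_x loss). For logistic regression the
# input gradient has the closed form (p - y) * w.
grad = (sigmoid(w @ x + b) - y) * w
x_adv = x + 0.1 * np.sign(grad)

assert loss(x_adv) > loss(x)  # the targeted perturbation hurts the model
```

A random perturbation of the same magnitude is sign-agnostic, so on average it moves the loss far less than this worst-case step, which is why models often tolerate Gaussian noise while remaining vulnerable to targeted attacks.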

Note: Metrics shown are from actual model predictions on test data.

Limitations

  • Demo Dataset: Dashboard displays predictions from 20 sample patients for demonstration purposes
  • Language Analysis: Currently uses a mock implementation; real NLP model integration is pending
  • Fusion Examples: CDEWI fusion data available for 10 sample patients
  • Validation: Has not undergone clinical validation or regulatory review
  • Generalization: Performance on diverse populations unknown; test set may not represent all demographics
  • Interpretability: SHAP values provide local explanations but may not capture all model behavior
  • Adversarial Vulnerability: Susceptible to targeted perturbations as shown in robustness analysis
  • Temporal Dynamics: Does not model disease progression over time

Ethical Considerations

Privacy: Real deployment must ensure HIPAA compliance and patient data protection. No patient data should be stored without explicit consent.

Bias & Fairness: Model performance should be evaluated across demographic groups (age, gender, ethnicity, socioeconomic status) to identify and mitigate potential biases.

Transparency: Predictions should always be accompanied by explanations and confidence scores. Clinicians must understand model limitations.

Human Oversight: All predictions require review by qualified healthcare professionals. The system is a decision support tool, not a replacement for clinical judgment.

Informed Consent: Patients must be informed when AI systems are used in their care and have the right to opt out.

Created by Ahmadreza Azizi

Data Source: Hacks for Health