HCFD studies what happens when clinically meaningful pathological speech is passed through modern neural audio codecs. The benchmark keeps each original utterance aligned with its codec-generated counterpart, exposing deepfake detectors to disease-driven acoustic variability instead of only healthy-speech conditions.
The benchmark is constructed through a controlled codec resynthesis loop, making HCFD suitable for studying deepfake detection under clinically realistic speech variability.
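As a minimal illustration of such a resynthesis loop, the sketch below pairs a bona fide waveform with a lossy round-trip through a stand-in "codec" (mu-law companding plus uniform quantization). HCFD itself uses neural audio codecs; `mu_law_codec_roundtrip` and `make_pair` are illustrative names, not the benchmark's API.

```python
import numpy as np

MU = 255          # companding constant
LEVELS = 256      # quantizer resolution

def mu_law_codec_roundtrip(wave):
    """Stand-in 'codec': mu-law companding + uniform quantization.
    A real HCFD pair would use a neural codec's encode/decode instead."""
    compressed = np.sign(wave) * np.log1p(MU * np.abs(wave)) / np.log1p(MU)
    codes = np.round((compressed + 1.0) / 2.0 * (LEVELS - 1))   # lossy step
    dequant = codes / (LEVELS - 1) * 2.0 - 1.0
    return np.sign(dequant) * np.expm1(np.abs(dequant) * np.log1p(MU)) / MU

def make_pair(wave):
    """Pair a bona fide recording with its codec-resynthesized counterpart."""
    return {"real": wave, "fake": mu_law_codec_roundtrip(wave)}

t = np.arange(16000) / 16000.0                  # 1 s at 16 kHz
pair = make_pair(0.5 * np.sin(2 * np.pi * 220 * t))
```

The key property the sketch preserves is that the "fake" waveform is perceptually close to the original yet carries systematic quantization traces, which is exactly the kind of evidence a codec-fake detector must isolate from pathology-driven variability.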
In this study, we present Healthcare CodecFake Detection (HCFD), a new task for detecting codec-fakes under pathological speech conditions. We release the HCFD benchmark, a pathology-aware dataset containing paired real and neural-audio-codec-synthesized speech across multiple clinical conditions and codec families. Our evaluations show that state-of-the-art codec-fake detectors trained primarily on healthy speech perform poorly on HCFD, highlighting the need for healthcare-specific modeling.
We compare diverse pretrained audio encoders and show that PaSST is the strongest single-representation baseline for HCFD. Building on that, we propose PHOENIX-Mamba, a geometry-aware framework that uses long-context sequence modeling, multiple localized evidence vectors, and prototype-based reasoning in hyperbolic space to capture heterogeneous codec-fake modes in clinical speech.
Experiments across depression, Alzheimer's disease, and dysarthria in both English and Chinese show consistent gains over AASIST and strong pretrained-model baselines. With PaSST, PHOENIX-Mamba achieves the best reported accuracies: 97.04, 96.73, and 96.57 on the English tasks, and 94.41, 94.40, and 93.20 on the Chinese tasks.
PHOENIX-Mamba is designed for healthcare codec-fake detection, where codec traces and pathological speech variability interact. The core idea is to avoid collapsing an utterance into a single pooled vector: the model instead retains multiple localized evidence vectors, then organizes fake evidence in hyperbolic space with self-discovered prototype modes.
Input speech is encoded by frozen upstream models such as WavLM, wav2vec 2.0, Whisper, x-vector, or PaSST. The paper finds PaSST to be the strongest single-representation encoder for HCFD.
Token-wise adapted features are passed through a selective state-space backbone to build context-enriched representations over longer pathological utterances.
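The backbone's recurrence can be pictured with a deliberately simplified, diagonal selective scan; this is not the actual Mamba implementation, and the gates `a` and `b` below stand in for the learned, input-conditioned state-space parameters.

```python
import numpy as np

def selective_scan(x, a, b):
    """Minimal diagonal selective scan: h_t = a_t * h_{t-1} + b_t * x_t.
    x: (T, D) token features; a, b: (T, D) input-dependent gates."""
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = a[t] * h + b[t] * x[t]      # state carries long-range context
        out[t] = h
    return out

rng = np.random.default_rng(0)
T, D = 500, 16                          # a long utterance, small feature dim
x = rng.standard_normal((T, D))
a = 1.0 / (1.0 + np.exp(-rng.standard_normal((T, D))))   # decay gates in (0, 1)
b = 1.0 - a                                              # input gates
ctx = selective_scan(x, a, b)
```

Because the gates depend on the input, the state can selectively retain or forget past frames, which is what lets the model integrate evidence over long pathological utterances at linear cost.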
Rather than keeping only one pooled summary, PHOENIX-Mamba learns multiple evidence vectors so intermittent codec artifacts can still influence the final decision.
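One common way to retain several localized summaries is attention pooling with M probe vectors. The sketch below uses random queries as stand-ins for learned probes; it is an assumption about the mechanism, not the paper's exact pooling layer.

```python
import numpy as np

def multi_evidence_pool(tokens, queries):
    """Pool T token features into M evidence vectors with softmax attention.
    tokens: (T, D) frame features; queries: (M, D) probe vectors."""
    scores = queries @ tokens.T                     # (M, T) similarities
    scores -= scores.max(axis=1, keepdims=True)     # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)         # rows are convex weights
    return attn @ tokens                            # (M, D) evidence vectors

rng = np.random.default_rng(0)
tokens = rng.standard_normal((200, 32))
queries = rng.standard_normal((4, 32))              # M = 4, as in the paper
evidence = multi_evidence_pool(tokens, queries)
```

Each probe can concentrate its attention mass on a different stretch of the utterance, so an intermittent codec artifact that mean pooling would dilute still dominates at least one evidence vector.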
Evidence vectors are projected into a Poincaré ball and classified against one real prototype and multiple fake prototypes, allowing the model to capture heterogeneous codec-fake modes.
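A minimal sketch of nearest-prototype classification in the Poincaré ball (curvature -1) is given below; the projection, prototype values, and decision rule are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def to_ball(x, eps=1e-5):
    """Clip a Euclidean vector into the open unit (Poincare) ball."""
    norm = np.linalg.norm(x)
    limit = 1.0 - eps
    return x * (limit / norm) if norm >= limit else x

def poincare_dist(u, v):
    """Geodesic distance in the Poincare ball (curvature -1)."""
    diff = np.dot(u - v, u - v)
    denom = (1.0 - np.dot(u, u)) * (1.0 - np.dot(v, v))
    return np.arccosh(1.0 + 2.0 * diff / denom)

def classify(evidence, real_proto, fake_protos):
    """Nearest-prototype decision: several fake prototypes let distinct
    codec-fake modes each claim their own region of the ball."""
    d_real = poincare_dist(evidence, real_proto)
    d_fake = min(poincare_dist(evidence, p) for p in fake_protos)
    return "real" if d_real < d_fake else "fake"

rng = np.random.default_rng(0)
ev = to_ball(0.3 * rng.standard_normal(8))
real_p = to_ball(0.3 * rng.standard_normal(8))
fake_ps = [to_ball(0.3 * rng.standard_normal(8)) for _ in range(3)]
label = classify(ev, real_p, fake_ps)
```

Distances in the Poincaré ball grow rapidly toward the boundary, which gives prototypes near the edge large, well-separated basins; this is the geometric motivation for hyperbolic rather than Euclidean prototype reasoning.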
Cross-entropy is combined with clustering and separation losses so positive prototypes remain compact, distinct, and sensitive to diverse fake artifact patterns.
Training Objective
The framework uses one negative (real) prototype, multiple positive (fake) prototypes, evidence pooling with M = 4, a hyperbolic embedding dimension of h = 128, and a training objective that combines a classification loss with geometry-aware regularization.
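The interplay of the classification and geometry-aware terms can be illustrated as below. The exact formulations and weights are assumptions; `lam` and `margin` are hypothetical hyperparameters, and the cross-entropy here is a simple distance-based variant.

```python
import numpy as np

def ball(x):
    """Map any vector strictly inside the unit (Poincare) ball."""
    return x / (1.0 + np.linalg.norm(x))

def poincare_dist(u, v):
    diff = np.dot(u - v, u - v)
    denom = (1.0 - np.dot(u, u)) * (1.0 - np.dot(v, v))
    return np.arccosh(1.0 + 2.0 * diff / denom)

def hcfd_loss(evidence, label, real_proto, fake_protos, lam=0.1, margin=1.0):
    """Distance-based cross-entropy + clustering (pull fake evidence toward
    its nearest fake prototype) + separation (keep fake prototypes apart)."""
    d_real = poincare_dist(evidence, real_proto)
    d_fakes = [poincare_dist(evidence, p) for p in fake_protos]
    logits = np.array([-d_real, -min(d_fakes)])       # nearer = larger logit
    logp = logits - np.log(np.exp(logits).sum())
    ce = -logp[label]                                 # label: 0 real, 1 fake
    cluster = min(d_fakes) if label == 1 else 0.0
    sep = sum(max(0.0, margin - poincare_dist(p, q))
              for i, p in enumerate(fake_protos)
              for q in fake_protos[i + 1:])
    return ce + lam * (cluster + sep)

rng = np.random.default_rng(0)
ev = ball(rng.standard_normal(8))
real_p = ball(rng.standard_normal(8))
fake_ps = [ball(rng.standard_normal(8)) for _ in range(4)]
loss = hcfd_loss(ev, 1, real_p, fake_ps)
```

The clustering term keeps each fake prototype compact around the evidence it attracts, while the hinge-style separation term stops two prototypes from collapsing onto the same codec-fake mode.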
Binary real-vs-fake detection under clinically realistic pathological speech variability.
A Mamba-style sequence model captures long-context evidence beyond shallow pooled baselines.
The model keeps several localized cues rather than relying on one utterance-level embedding.
One real prototype and several fake prototypes model heterogeneous codec-fake modes.
PaSST produces the best single-encoder baseline and the strongest PHOENIX-Mamba results.
The architecture is built to separate codec-induced evidence from pathology-related acoustic variability.
Table 1: Generalization of prior codec-deepfake detectors to healthcare speech
AASIST trained on standard CodecFake data transfers poorly to pathological healthcare speech. Training on in-domain healthcare data helps, and wav2vec 2.0 features help further, but a large gap remains.
| Method | English Dep Acc | English Dep F1 | English Alz Acc | English Alz F1 | English Dys Acc | English Dys F1 |
|---|---|---|---|---|---|---|
| AASIST (trained on CodecFake) | 48.62 | 44.03 | 34.19 | 32.51 | 36.71 | 34.39 |
| AASIST (trained on in-domain data) | 60.84 | 57.92 | 52.14 | 49.93 | 56.07 | 54.49 |
| AASIST (wav2vec 2.0) | 63.55 | 51.29 | 57.76 | 54.98 | 59.35 | 57.16 |
The healthy-speech deepfake setting does not transfer cleanly to pathological speech. Alzheimer's is especially difficult.
Table 2: PHOENIX-Mamba with PaSST achieves the strongest reported scores
The paper reports PHOENIX-Mamba gains across encoders, with the strongest final numbers coming from the PaSST setup.
| Setting | Depression | Alzheimer's | Dysarthria |
|---|---|---|---|
| English PHOENIX-Mamba (PaSST) | 97.04 | 96.73 | 96.57 |
| Chinese PHOENIX-Mamba (PaSST) | 94.41 | 94.40 | 93.20 |
| English PaSST baseline (CNN) | 78.98 | 67.94 | 71.03 |
| Chinese PaSST baseline (CNN) | 75.69 | 65.71 | 67.36 |
PHOENIX-Mamba contributes large gains beyond the strongest single-representation baseline, especially under clinical variability.
The paper's ablations show that sequence modeling, multi-evidence pooling, and hyperbolic reasoning all matter. Removing any of them hurts performance substantially.
| Configuration | English Dep Acc | English Alz Acc | English Dys Acc |
|---|---|---|---|
| PHOENIX-Mamba (Full) | 97.04 | 96.73 | 96.05 |
| PHOENIX-Euc | 83.62 | 79.48 | 84.72 |
| BiGRU Head | 87.69 | 82.86 | 86.61 |
| CNN Head | 82.26 | 75.52 | 79.37 |
| Single evidence (M = 1) | 73.51 | 55.03 | 67.94 |
The largest degradation comes from collapsing the utterance to a single evidence vector. Hyperbolic multi-mode reasoning also gives a clear boost over Euclidean reasoning.
HCFD covers three clinical conditions in two languages and preserves official split protocols while generating paired codec-synthesized speech.
Source recordings: Semi-structured clinical interviews used as bona fide speech for depression-oriented HCFD evaluation.
Source recordings: Interview-style Mandarin responses with depression annotations, converted into paired real and codec-generated speech.
Source recordings: Standardized Cookie Theft picture-description recordings used for Alzheimer's-related detection experiments.
Source recordings: Mandarin dementia benchmark with clinically annotated recordings for cognitive impairment assessment.
Source recordings: Speech from individuals with dysarthria, used to test detector robustness under motor-speech variability.
Source recordings: Large-scale Mandarin dysarthric speech corpus, approximately 133 hours, paired with codec-resynthesized speech.
Sample rows below are loaded from audio/manifest.js. Each row pairs a bona fide pathological recording with its codec-generated counterpart.
| Codec | Ground Truth ID | GT Speech | Generated Speech | Condition |
|---|---|---|---|---|
The current repository only includes the manifest file. Audio playback will work once the referenced waveform files are added.
If you use HCFD, cite the paper as follows:
Update page numbers in the citation once the proceedings version is finalized.