ACL 2026 Findings

HCFD: A Benchmark for Audio Deepfake Detection in Healthcare

Mohd Mujtaba Akhtar*    Girish*    Muskaan Singh
*Equal Contribution  ·  Accepted to Findings of ACL 2026
Healthcare · CodecFake benchmark · PHOENIX-Mamba · Pathology-aware audio deepfake detection

HCFD studies what happens when clinically meaningful pathological speech is passed through modern neural audio codecs. The benchmark keeps each original utterance aligned with its codec-generated counterpart, exposing deepfake detectors to disease-driven acoustic variability instead of only healthy-speech conditions.

The benchmark spans depression, Alzheimer's disease, and dysarthria, drawing on English and Chinese healthcare speech corpora and providing real and spoof paired samples for HCFD.
HCFD Data Construction Pipeline

1. Input Audio · Pathological Clinical Speech. Source utterances are collected from healthcare speech benchmarks covering multiple conditions and two languages: DAIC-WOZ, ADReSS, TORGO, EATD, NCMMSC, and CDSD.

2. Resynthesis · Neural Audio Codec Transformation. Each utterance is encoded into discrete codec tokens and decoded back into waveform space, introducing codec artifacts while preserving the clinical content. Codecs: Encodec, DAC, SNAC, and SpeechTokenizer.

3. Pairing · Real and Spoof Paired Samples. HCFD retains the original pathological waveform (real audio) together with its codec-generated spoof counterpart (paired spoof audio) for condition-aware deepfake detection.

The benchmark is constructed through a controlled codec resynthesis loop, making HCFD suitable for studying deepfake detection under clinically realistic speech variability.
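The resynthesis loop can be sketched schematically. In the toy example below, `codec_roundtrip` and `build_pair` are hypothetical helpers: a real pipeline would call an actual neural codec (Encodec, DAC, SNAC, or SpeechTokenizer) in place of the simple uniform quantizer used here to stand in for the encode/decode round trip.

```python
import numpy as np

def codec_roundtrip(wave, n_levels=256):
    """Stand-in for a neural codec round trip: map the waveform onto a
    discrete codebook and decode it back, introducing mild artifacts."""
    wave = np.clip(wave, -1.0, 1.0)
    tokens = np.round((wave + 1.0) / 2.0 * (n_levels - 1)).astype(np.int64)  # "encode" to discrete tokens
    return tokens / (n_levels - 1) * 2.0 - 1.0                               # "decode" back to a waveform

def build_pair(wave):
    """Keep the original utterance aligned with its codec-generated spoof."""
    return {"real": wave, "spoof": codec_roundtrip(wave)}

utt = 0.5 * np.sin(np.linspace(0.0, 40.0 * np.pi, 16000))  # toy 1-second "utterance"
pair = build_pair(utt)
```

The key property the benchmark relies on is the pairing itself: every spoof sample differs from its real counterpart only by codec-induced artifacts, not by content or pathology.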


Abstract

In this study, we present Healthcare CodecFake Detection (HCFD), a new task for detecting codec-fakes under pathological speech conditions. We release the HCFD benchmark, a pathology-aware dataset containing paired real and neural-audio-codec-synthesized speech across multiple clinical conditions and codec families. Our evaluations show that state-of-the-art codec-fake detectors trained primarily on healthy speech perform poorly on HCFD, highlighting the need for healthcare-specific modeling.


We compare diverse pretrained audio encoders and show that PaSST is the strongest single-representation baseline for HCFD. Building on this finding, we propose PHOENIX-Mamba, a geometry-aware framework that uses long-context sequence modeling, multiple localized evidence vectors, and prototype-based reasoning in hyperbolic space to capture heterogeneous codec-fake modes in clinical speech.


Experiments across depression, Alzheimer's disease, and dysarthria in both English and Chinese show consistent gains over AASIST and strong pretrained-model baselines. With PaSST, PHOENIX-Mamba achieves the best reported accuracies: 97.04, 96.73, and 96.57 on the English tasks, and 94.41, 94.40, and 93.20 on the Chinese tasks.

Audio Deepfake Detection · Healthcare Speech · Pathological Speech · Neural Audio Codecs · Hyperbolic Geometry · Mamba · ACL 2026 Findings

Method

PHOENIX-Mamba is designed for healthcare codec-fake detection where codec traces and pathological speech variability interact. The core idea is to avoid collapsing an utterance into a single pooled vector and instead retain multiple localized evidences, then organize fake evidence in hyperbolic space with self-discovered prototype modes.

1. Pretrained audio representation

Input speech is encoded by frozen upstream models such as WavLM, wav2vec 2.0, Whisper, x-vector, or PaSST. The paper finds PaSST to be the strongest single-representation encoder for HCFD.

2. Mamba-style temporal backbone

Token-wise adapted features are passed through a selective state-space backbone to build context-enriched representations over longer pathological utterances.
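As a rough illustration of the selective state-space idea, and not the paper's implementation, the sketch below runs a diagonal SSM whose discretization step depends on the input token, so the recurrence can selectively retain or discard context. All names, shapes, and parameter values are invented for the example.

```python
import numpy as np

def selective_ssm(x, A, wB, wC, w_delta):
    """Selective diagonal SSM sketch: the step size delta_t is a function
    of the input x_t, so retention is input-dependent (the core
    "selective" mechanism behind Mamba-style backbones)."""
    h = np.zeros_like(A)                          # hidden state, one value per channel
    ys = np.empty_like(x)
    for t, xt in enumerate(x):
        delta = np.log1p(np.exp(w_delta * xt))    # softplus: positive, input-dependent step
        A_bar = np.exp(delta * A)                 # zero-order-hold discretization of diag(A)
        B_bar = (A_bar - 1.0) / A * wB            # matching ZOH input term
        h = A_bar * h + B_bar * xt                # state update
        ys[t] = wC @ h                            # scalar readout per step
    return ys

rng = np.random.default_rng(0)
x = rng.normal(size=50)                           # toy scalar feature sequence
A = -np.linspace(0.5, 2.0, 8)                     # stable (negative) diagonal dynamics
wB, wC, w_delta = rng.normal(size=8), rng.normal(size=8), 0.5
y = selective_ssm(x, A, wB, wC, w_delta)
```

Because the per-step decay `A_bar` varies with the input, the same recurrence can hold evidence over long pathological utterances in some states while quickly forgetting it in others.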

3. Multi-evidence pooling

Rather than keeping only one pooled summary, PHOENIX-Mamba learns multiple evidence vectors so intermittent codec artifacts can still influence the final decision.
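One common way to realize such pooling, shown here as an illustrative sketch rather than the paper's exact design, is attention pooling with M learned query vectors, each producing its own evidence summary of the frame sequence.

```python
import numpy as np

def multi_evidence_pool(feats, queries):
    """Pool a (T, d) frame sequence into M evidence vectors: each learned
    query attends to different frames, so a brief codec artifact can
    dominate one evidence vector instead of being averaged away."""
    scores = queries @ feats.T                      # (M, T) query-frame similarities
    scores -= scores.max(axis=1, keepdims=True)     # numerical stability for softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)         # softmax over time per query
    return attn @ feats                             # (M, d) evidence vectors

rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 16))   # toy frame features from the backbone
queries = rng.normal(size=(4, 16))   # M = 4 learned evidence queries (placeholder values)
evidence = multi_evidence_pool(feats, queries)
```

Each evidence vector is a convex combination of frames, so every pooled value stays within the per-dimension range of the original features.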

4. Hyperbolic multi-mode reasoning

Evidence vectors are projected into a Poincaré ball and classified against one real prototype and multiple fake prototypes, allowing the model to capture heterogeneous codec-fake modes.
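The geometric step can be illustrated as follows, assuming a unit-curvature Poincaré ball, an exponential map at the origin, and nearest-prototype classification; the prototype values in the example are placeholders, not learned parameters.

```python
import numpy as np

def poincare_dist(x, y, eps=1e-9):
    """Geodesic distance in the Poincare ball (curvature -1)."""
    nx, ny = np.sum(x * x), np.sum(y * y)
    sq = np.sum((x - y) ** 2)
    arg = 1.0 + 2.0 * sq / max((1.0 - nx) * (1.0 - ny), eps)
    return np.arccosh(arg)

def expmap0(v, eps=1e-9):
    """Map a Euclidean evidence vector into the ball (exp map at the origin)."""
    n = np.linalg.norm(v) + eps
    return np.tanh(n) * v / n

def classify(evidence, real_proto, fake_protos):
    """Label by nearest prototype: one real mode vs. several fake modes."""
    z = expmap0(evidence)
    d_real = poincare_dist(z, real_proto)
    d_fake = min(poincare_dist(z, p) for p in fake_protos)
    return int(d_fake < d_real)  # 1 = fake

real_proto = np.zeros(2)                                      # placeholder prototypes
fake_protos = [np.array([0.5, 0.0]), np.array([-0.5, 0.0])]
```

Because hyperbolic distance grows rapidly near the boundary, prototypes placed away from the origin can represent sharply separated fake modes while the real class stays compact.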

5. Geometry-aware training objective

Cross-entropy is combined with clustering and separation losses so positive prototypes remain compact, distinct, and sensitive to diverse fake artifact patterns.

Training Objective

L = L_cls + λ_cluster · L_cluster + λ_sep · L_sep

The framework uses one negative (real) prototype, multiple positive (fake) prototypes, multi-evidence pooling with M = 4, a hyperbolic embedding dimension of h = 128, and a classification loss combined with geometry-aware regularization.
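A minimal sketch of this objective is given below. It uses Euclidean distances for brevity where the paper measures them in hyperbolic space, and the clustering and separation terms are illustrative choices (pull evidence toward the matching prototype set, hinge-separate the fake prototypes), not the paper's exact formulation.

```python
import numpy as np

def softmax_xent(logits, label):
    """Stable cross-entropy on a small logit vector."""
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[label] + 1e-12)

def phoenix_loss(evidence, label, real_proto, fake_protos,
                 lam_cluster=0.1, lam_sep=0.1, margin=1.0):
    """Sketch of L = L_cls + lam_cluster * L_cluster + lam_sep * L_sep."""
    d_real = min(np.linalg.norm(e - real_proto) for e in evidence)
    d_fakes = [min(np.linalg.norm(e - p) for e in evidence) for p in fake_protos]
    logits = np.array([-d_real, -min(d_fakes)])        # closer prototype -> larger logit
    l_cls = softmax_xent(logits, label)
    l_cluster = min(d_fakes) if label == 1 else d_real  # pull toward own prototype set
    l_sep = sum(max(0.0, margin - np.linalg.norm(p - q))  # keep fake prototypes distinct
                for i, p in enumerate(fake_protos)
                for q in fake_protos[i + 1:])
    return l_cls + lam_cluster * l_cluster + lam_sep * l_sep

real_proto = np.zeros(2)                                     # placeholder prototypes
fake_protos = [np.array([2.0, 0.0]), np.array([0.0, 2.0])]
loss_good = phoenix_loss(np.array([[2.1, 0.0]]), 1, real_proto, fake_protos)
```

Evidence that sits near one of its class prototypes incurs a lower total loss than evidence placed near the wrong prototype, which is the behavior the clustering term is meant to enforce.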

Task · Healthcare CodecFake Detection: binary real-vs-fake detection under clinically realistic pathological speech variability.

Backbone · Selective SSM: a Mamba-style sequence model captures long-context evidence beyond shallow pooled baselines.

Evidence · Multi-evidence pooling: the model keeps several localized cues rather than relying on one utterance-level embedding.

Geometry · Hyperbolic prototypes: one real prototype and several fake prototypes model heterogeneous codec-fake modes.

Best PTM · PaSST: PaSST produces the best single-encoder baseline and the strongest PHOENIX-Mamba results.

Goal · Robust clinical transfer: the architecture is built to separate codec-induced evidence from pathology-related acoustic variability.


Results

Table 1: Generalization of prior codec-deepfake detectors to healthcare speech

AASIST trained on standard CodecFake data transfers poorly to pathological healthcare speech. Training on in-domain healthcare data helps, and wav2vec 2.0 features help further, but a large gap remains.

| Method | English Dep Acc | English Dep F1 | English Alz Acc | English Alz F1 | English Dys Acc | English Dys F1 |
|---|---|---|---|---|---|---|
| AASIST (Tr. on CF) | 48.62 | 44.03 | 34.19 | 32.51 | 36.71 | 34.39 |
| AASIST (Tr. on Indi. Data) | 60.84 | 57.92 | 52.14 | 49.93 | 56.07 | 54.49 |
| AASIST (wav2vec 2.0) | 63.55 | 51.29 | 57.76 | 54.98 | 59.35 | 57.16 |

The healthy-speech deepfake setting does not transfer cleanly to pathological speech. Alzheimer's is especially difficult.


Table 2: PHOENIX-Mamba with PaSST achieves the strongest reported scores

The paper reports PHOENIX-Mamba gains across encoders, with the strongest final numbers coming from the PaSST setup.

| Setting | Depression | Alzheimer's | Dysarthria |
|---|---|---|---|
| English PHOENIX-Mamba (PaSST) | 97.04 | 96.73 | 96.57 |
| Chinese PHOENIX-Mamba (PaSST) | 94.41 | 94.40 | 93.20 |
| English PaSST baseline (CNN) | 78.98 | 67.94 | 71.03 |
| Chinese PaSST baseline (CNN) | 75.69 | 65.71 | 67.36 |

PHOENIX-Mamba contributes large gains beyond the strongest single-representation baseline, especially under clinical variability.


Ablation Study

The paper's ablations show that sequence modeling, multi-evidence pooling, and hyperbolic reasoning all matter. Removing any of them hurts performance substantially.

| Configuration | English Dep Acc | English Alz Acc | English Dys Acc |
|---|---|---|---|
| PHOENIX-Mamba (Full) | 97.04 | 96.73 | 96.05 |
| PHOENIX-Euc | 83.62 | 79.48 | 84.72 |
| BiGRU Head | 87.69 | 82.86 | 86.61 |
| CNN Head | 82.26 | 75.52 | 79.37 |
| Single evidence (M = 1) | 73.51 | 55.03 | 67.94 |

The largest degradation comes from collapsing the utterance to a single evidence vector. Hyperbolic multi-mode reasoning also gives a clear boost over Euclidean reasoning.


Datasets

HCFD covers three clinical conditions in two languages and preserves official split protocols while generating paired codec-synthesized speech.

English · Depression

DAIC-WOZ

Semi-structured clinical interviews used as bona fide speech for depression-oriented HCFD evaluation.

Chinese · Depression

EATD-Corpus

Interview-style Mandarin responses with depression annotations, converted into paired real and codec-generated speech.

English · Alzheimer's

ADReSS / ADReSSo

Standardized Cookie Theft picture-description recordings used for Alzheimer's-related detection experiments.

Chinese · Alzheimer's

NCMMSC

Mandarin dementia benchmark with clinically annotated recordings for cognitive impairment assessment.

English · Dysarthria

TORGO

Speech from individuals with dysarthria, used to test detector robustness under motor-speech variability.

Chinese · Dysarthria

CDSD

Large-scale Mandarin dysarthric speech corpus, approximately 133 hours, paired with codec-resynthesized speech.


Audio Samples

Sample rows below are loaded from audio/manifest.js. Each row pairs a bona fide pathological recording with its codec-generated counterpart.

Columns: Codec · Ground Truth ID · GT Speech · Generated Speech · Condition

The current repository only includes the manifest file. Audio playback will work once the referenced waveform files are added.


Citation

If you use HCFD, cite the paper as follows:

@inproceedings{hcfd-acl2026,
  title     = {HCFD: A Benchmark for Audio Deepfake Detection in Healthcare},
  author    = {Mohd Mujtaba Akhtar and Girish and Muskaan Singh},
  booktitle = {Findings of the Association for Computational Linguistics: ACL 2026},
  year      = {2026},
  pages     = {TBD},
}

Update page numbers in the citation once the proceedings version is finalized.