HCFD studies what happens when clinically meaningful pathological speech is passed through modern neural audio codecs. The benchmark keeps each original utterance aligned with its codec-generated counterpart, exposing deepfake detectors to disease-driven acoustic variability instead of only healthy-speech conditions.
The benchmark is constructed through a controlled codec resynthesis loop, making HCFD suitable for studying deepfake detection under clinically realistic speech variability.
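As a minimal illustration of such a resynthesis loop, the sketch below pairs a bona fide waveform with a lossy round-trip through a stand-in "codec" (mu-law companding plus uniform quantization). HCFD itself uses neural audio codecs; `mu_law_codec_roundtrip` and `make_pair` are illustrative names, not the benchmark's API.

```python
import numpy as np

MU = 255          # companding constant
LEVELS = 256      # quantizer resolution

def mu_law_codec_roundtrip(wave):
    """Stand-in 'codec': mu-law companding + uniform quantization.
    A real HCFD pair would use a neural codec's encode/decode instead."""
    compressed = np.sign(wave) * np.log1p(MU * np.abs(wave)) / np.log1p(MU)
    codes = np.round((compressed + 1.0) / 2.0 * (LEVELS - 1))   # lossy step
    dequant = codes / (LEVELS - 1) * 2.0 - 1.0
    return np.sign(dequant) * np.expm1(np.abs(dequant) * np.log1p(MU)) / MU

def make_pair(wave):
    """Pair a bona fide recording with its codec-resynthesized counterpart."""
    return {"real": wave, "fake": mu_law_codec_roundtrip(wave)}

t = np.arange(16000) / 16000.0                  # 1 s at 16 kHz
pair = make_pair(0.5 * np.sin(2 * np.pi * 220 * t))
```

The key property the sketch preserves is that the "fake" waveform is perceptually close to the original yet carries systematic quantization traces, which is exactly the kind of evidence a codec-fake detector must isolate from pathology-driven variability.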
In this study, we present Healthcare CodecFake Detection (HCFD), a new task for detecting codec-fakes under pathological speech conditions. We release the HCFD benchmark, a pathology-aware dataset containing paired real and neural-audio-codec-synthesized speech across multiple clinical conditions and codec families. Our evaluations show that state-of-the-art codec-fake detectors trained primarily on healthy speech perform poorly on HCFD, highlighting the need for healthcare-specific modeling.
We compare diverse pretrained audio encoders and show that PaSST is the strongest single-representation baseline for HCFD. Building on that, we propose PHOENIX-Mamba, a geometry-aware framework that uses long-context sequence modeling, multiple localized evidence vectors, and prototype-based reasoning in hyperbolic space to capture heterogeneous codec-fake modes in clinical speech.
Experiments across depression, Alzheimer's disease, and dysarthria in both English and Chinese show consistent gains over AASIST and strong pretrained-model baselines. With PaSST, PHOENIX-Mamba achieves the best reported accuracies: 97.04, 96.73, and 96.57 on the English tasks, and 94.41, 94.40, and 93.20 on the Chinese tasks.
PHOENIX-Mamba is designed for healthcare codec-fake detection, where codec traces and pathological speech variability interact. The core idea is to avoid collapsing an utterance into a single pooled vector: the model instead retains multiple localized evidence vectors, then organizes fake evidence in hyperbolic space with self-discovered prototype modes.
Input speech is encoded by frozen upstream models such as WavLM, wav2vec 2.0, Whisper, x-vector, or PaSST. The paper finds PaSST to be the strongest single-representation encoder for HCFD.
Token-wise adapted features are passed through a selective state-space backbone to build context-enriched representations over longer pathological utterances.
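The backbone's recurrence can be pictured with a deliberately simplified, diagonal selective scan; this is not the actual Mamba implementation, and the gates `a` and `b` below stand in for the learned, input-conditioned state-space parameters.

```python
import numpy as np

def selective_scan(x, a, b):
    """Minimal diagonal selective scan: h_t = a_t * h_{t-1} + b_t * x_t.
    x: (T, D) token features; a, b: (T, D) input-dependent gates."""
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = a[t] * h + b[t] * x[t]      # state carries long-range context
        out[t] = h
    return out

rng = np.random.default_rng(0)
T, D = 500, 16                          # a long utterance, small feature dim
x = rng.standard_normal((T, D))
a = 1.0 / (1.0 + np.exp(-rng.standard_normal((T, D))))   # decay gates in (0, 1)
b = 1.0 - a                                              # input gates
ctx = selective_scan(x, a, b)
```

Because the gates depend on the input, the state can selectively retain or forget past frames, which is what lets the model integrate evidence over long pathological utterances at linear cost.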
Rather than keeping only one pooled summary, PHOENIX-Mamba learns multiple evidence vectors so intermittent codec artifacts can still influence the final decision.
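One common way to retain several localized summaries is attention pooling with M probe vectors. The sketch below uses random queries as stand-ins for learned probes; it is an assumption about the mechanism, not the paper's exact pooling layer.

```python
import numpy as np

def multi_evidence_pool(tokens, queries):
    """Pool T token features into M evidence vectors with softmax attention.
    tokens: (T, D) frame features; queries: (M, D) probe vectors."""
    scores = queries @ tokens.T                     # (M, T) similarities
    scores -= scores.max(axis=1, keepdims=True)     # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)         # rows are convex weights
    return attn @ tokens                            # (M, D) evidence vectors

rng = np.random.default_rng(0)
tokens = rng.standard_normal((200, 32))
queries = rng.standard_normal((4, 32))              # M = 4, as in the paper
evidence = multi_evidence_pool(tokens, queries)
```

Each probe can concentrate its attention mass on a different stretch of the utterance, so an intermittent codec artifact that mean pooling would dilute still dominates at least one evidence vector.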
Evidence vectors are projected into a Poincaré ball and classified against one real prototype and multiple fake prototypes, allowing the model to capture heterogeneous codec-fake modes.
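A minimal sketch of nearest-prototype classification in the Poincaré ball (curvature -1) is given below; the projection, prototype values, and decision rule are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def to_ball(x, eps=1e-5):
    """Clip a Euclidean vector into the open unit (Poincare) ball."""
    norm = np.linalg.norm(x)
    limit = 1.0 - eps
    return x * (limit / norm) if norm >= limit else x

def poincare_dist(u, v):
    """Geodesic distance in the Poincare ball (curvature -1)."""
    diff = np.dot(u - v, u - v)
    denom = (1.0 - np.dot(u, u)) * (1.0 - np.dot(v, v))
    return np.arccosh(1.0 + 2.0 * diff / denom)

def classify(evidence, real_proto, fake_protos):
    """Nearest-prototype decision: several fake prototypes let distinct
    codec-fake modes each claim their own region of the ball."""
    d_real = poincare_dist(evidence, real_proto)
    d_fake = min(poincare_dist(evidence, p) for p in fake_protos)
    return "real" if d_real < d_fake else "fake"

rng = np.random.default_rng(0)
ev = to_ball(0.3 * rng.standard_normal(8))
real_p = to_ball(0.3 * rng.standard_normal(8))
fake_ps = [to_ball(0.3 * rng.standard_normal(8)) for _ in range(3)]
label = classify(ev, real_p, fake_ps)
```

Distances in the Poincaré ball grow rapidly toward the boundary, which gives prototypes near the edge large, well-separated basins; this is the geometric motivation for hyperbolic rather than Euclidean prototype reasoning.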
Cross-entropy is combined with clustering and separation losses so positive prototypes remain compact, distinct, and sensitive to diverse fake artifact patterns.
Training Objective
The framework uses one negative (real) prototype, multiple positive (fake) prototypes, evidence pooling with M = 4, a hyperbolic embedding dimension of h = 128, and a training objective that combines a classification loss with geometry-aware regularization.
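The interplay of the classification and geometry-aware terms can be illustrated as below. The exact formulations and weights are assumptions; `lam` and `margin` are hypothetical hyperparameters, and the cross-entropy here is a simple distance-based variant.

```python
import numpy as np

def ball(x):
    """Map any vector strictly inside the unit (Poincare) ball."""
    return x / (1.0 + np.linalg.norm(x))

def poincare_dist(u, v):
    diff = np.dot(u - v, u - v)
    denom = (1.0 - np.dot(u, u)) * (1.0 - np.dot(v, v))
    return np.arccosh(1.0 + 2.0 * diff / denom)

def hcfd_loss(evidence, label, real_proto, fake_protos, lam=0.1, margin=1.0):
    """Distance-based cross-entropy + clustering (pull fake evidence toward
    its nearest fake prototype) + separation (keep fake prototypes apart)."""
    d_real = poincare_dist(evidence, real_proto)
    d_fakes = [poincare_dist(evidence, p) for p in fake_protos]
    logits = np.array([-d_real, -min(d_fakes)])       # nearer = larger logit
    logp = logits - np.log(np.exp(logits).sum())
    ce = -logp[label]                                 # label: 0 real, 1 fake
    cluster = min(d_fakes) if label == 1 else 0.0
    sep = sum(max(0.0, margin - poincare_dist(p, q))
              for i, p in enumerate(fake_protos)
              for q in fake_protos[i + 1:])
    return ce + lam * (cluster + sep)

rng = np.random.default_rng(0)
ev = ball(rng.standard_normal(8))
real_p = ball(rng.standard_normal(8))
fake_ps = [ball(rng.standard_normal(8)) for _ in range(4)]
loss = hcfd_loss(ev, 1, real_p, fake_ps)
```

The clustering term keeps each fake prototype compact around the evidence it attracts, while the hinge-style separation term stops two prototypes from collapsing onto the same codec-fake mode.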
Binary real-vs-fake detection under clinically realistic pathological speech variability.
A Mamba-style sequence model captures long-context evidence beyond shallow pooled baselines.
The model keeps several localized cues rather than relying on one utterance-level embedding.
One real prototype and several fake prototypes model heterogeneous codec-fake modes.
PaSST produces the best single-encoder baseline and the strongest PHOENIX-Mamba results.
The architecture is built to separate codec-induced evidence from pathology-related acoustic variability.
Table 1: Generalization of prior codec-deepfake detectors to healthcare speech
AASIST trained on standard CodecFake data transfers poorly to pathological healthcare speech. Training on in-domain healthcare data helps, and wav2vec 2.0 features help further, but a large gap remains.
| Method | English Dep Acc | English Dep F1 | English Alz Acc | English Alz F1 | English Dys Acc | English Dys F1 |
|---|---|---|---|---|---|---|
| AASIST (trained on CodecFake) | 48.62 | 44.03 | 34.19 | 32.51 | 36.71 | 34.39 |
| AASIST (trained on in-domain data) | 60.84 | 57.92 | 52.14 | 49.93 | 56.07 | 54.49 |
| AASIST (wav2vec 2.0) | 63.55 | 51.29 | 57.76 | 54.98 | 59.35 | 57.16 |
The healthy-speech deepfake setting does not transfer cleanly to pathological speech. Alzheimer's is especially difficult.
Table 2: PHOENIX-Mamba with PaSST achieves the strongest reported scores
The paper reports PHOENIX-Mamba gains across encoders, with the strongest final numbers coming from the PaSST setup.
| Setting | Depression | Alzheimer's | Dysarthria |
|---|---|---|---|
| English PHOENIX-Mamba (PaSST) | 97.04 | 96.73 | 96.57 |
| Chinese PHOENIX-Mamba (PaSST) | 94.41 | 94.40 | 93.20 |
| English PaSST baseline (CNN) | 78.98 | 67.94 | 71.03 |
| Chinese PaSST baseline (CNN) | 75.69 | 65.71 | 67.36 |
PHOENIX-Mamba contributes large gains beyond the strongest single-representation baseline, especially under clinical variability.
The paper's ablations show that sequence modeling, multi-evidence pooling, and hyperbolic reasoning all matter. Removing any of them hurts performance substantially.
| Configuration | English Dep Acc | English Alz Acc | English Dys Acc |
|---|---|---|---|
| PHOENIX-Mamba (Full) | 97.04 | 96.73 | 96.05 |
| PHOENIX-Euc | 83.62 | 79.48 | 84.72 |
| BiGRU Head | 87.69 | 82.86 | 86.61 |
| CNN Head | 82.26 | 75.52 | 79.37 |
| Single evidence (M = 1) | 73.51 | 55.03 | 67.94 |
The largest degradation comes from collapsing the utterance to a single evidence vector. Hyperbolic multi-mode reasoning also gives a clear boost over Euclidean reasoning.
HCFD covers three clinical conditions in two languages and preserves official split protocols while generating paired codec-synthesized speech.
Source recordings: Semi-structured clinical interviews used as bona fide speech for depression-oriented HCFD evaluation.
Source recordings: Interview-style Mandarin responses with depression annotations, converted into paired real and codec-generated speech.
Source recordings: Standardized Cookie Theft picture-description recordings used for Alzheimer's-related detection experiments.
Source recordings: Mandarin dementia benchmark with clinically annotated recordings for cognitive impairment assessment.
Source recordings: Speech from individuals with dysarthria, used to test detector robustness under motor-speech variability.
Source recordings: Large-scale Mandarin dysarthric speech corpus, approximately 133 hours, paired with codec-resynthesized speech.
Sample rows below are loaded from audio/manifest.js. Each row pairs a bona fide pathological recording with its codec-generated counterpart.
| Codec | Ground Truth ID | GT Speech | Generated Speech | Condition |
|---|---|---|---|---|
The current repository only includes the manifest file. Audio playback will work once the referenced waveform files are added.
If you use HCFD, cite the paper as follows:
Update page numbers in the citation once the proceedings version is finalized.