Interspeech 2026

Towards Detecting Neural Audio Codec Synthesized Heart Sounds

Girish^*, Orchid Chetia Phukan^*, Mohd Mujtaba Akhtar^*, Bhavinkumar Vinodbhai Kuwar^*, Swarup Ranjan Behera, Arun Balaji Buduru

^*Equal Contribution as First Author



Heart sounds (phonocardiograms, PCGs) have been regarded as a promising biometric modality because of their natural resilience to traditional spoofing methods. We show that neural audio codecs (NACs) can synthesize heart sounds that are perceptually indistinguishable from genuine recordings. We introduce the Synthetic Heart Sound Detection (SHAC) task and the CARDIOFAKE dataset, and propose GROOT, a Gram-OT-based fusion of spectral and SSL representations for detecting these codec-synthesized heart sounds.

CirCor DigiScope heart sounds (963 patients, 3,163 recordings)

7 neural audio codec families, seen and unseen evaluation protocols

Real and codec-synthesized paired samples for CARDIOFAKE

CARDIOFAKE Dataset Construction Pipeline

Input Audio

Real Phonocardiograms

Real heart sound recordings are sourced from the CirCor DigiScope dataset (PhysioNet), spanning 963 patients with durations from 5 to 65 seconds.

CirCor DigiScope 3,163 recordings

Encode-Decode

Codec Stage

Neural Audio Codec Resynthesis

Each PCG is passed through the encoder-decoder of a neural audio codec, preserving cardiac acoustic patterns while introducing subtle codec-induced artifacts.

DAC EnCodec SoundStream SpeechTokenizer FunCodec AudioDec SNAC

Pairing

Output Audio

Real and Spoof Paired Samples

CARDIOFAKE retains the original phonocardiogram together with its codec-generated counterpart for every NAC, defining seen and unseen evaluation protocols.

Real Audio

Original phonocardiogram

Spoof Audio

NAC-resynthesized counterpart

CARDIOFAKE comprises 3,163 real heart sounds and 22,141 NAC-generated counterparts across 7 codec families, with seen (SNAC, DAC, EnCodec, SoundStream, SpeechTokenizer) and unseen (FunCodec, AudioDec) test protocols.

Abstract

In this paper, we introduce Synthetic Heart Sound Detection (SHAC), a task aimed at identifying phonocardiograms (PCGs) synthesized using neural audio codecs (NACs). To facilitate research in this direction, we release CARDIOFAKE, the first benchmark dataset for SHAC containing both real and codec-synthesized PCGs.

We benchmark spectral representations (MFCC, LFCC) and self-supervised learning (SSL) representations (e.g., WavLM) for the task. Furthermore, we propose GROOT, a fusion framework that integrates spectral and SSL features for leveraging their complementary behavior.

Experiments show that GROOT, combining MFCC and WavLM, achieves state-of-the-art performance, outperforming individual representations and competitive baselines.

Synthetic Heart Sound Detection, Phonocardiograms, Neural Audio Codecs, Spectral Features, Self-Supervised Learning, Optimal Transport

Method

GROOT (Fusion via GRammian Optimal TranspOrT) fuses spectral and SSL representations for SHAC. Spectral features (MFCC, LFCC) are highly sensitive to NAC-induced distortions at the acoustic level, while SSL representations (Wav2vec2, UniSpeech-SAT, WavLM) capture broader temporal structure and variability in heart sounds. GROOT aligns these complementary representations with a novel Grammian optimal transport (Gram-OT) mechanism that compares the representations through their Gram matrices rather than through their raw features.

Feature extraction

14-dim LFCC and 40-dim MFCC are extracted as spectral features. Wav2vec2, UniSpeech-SAT, and WavLM are used as SSL representations, each producing 768-dim features via average pooling over the final hidden layer.
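The average-pooling step for the SSL branch can be sketched as follows. This is a minimal illustration that assumes the final hidden layer is available as a (frames, 768) array; the random array stands in for real model output, and the function name `average_pool` is hypothetical rather than taken from the authors' code.

```python
import numpy as np

def average_pool(hidden_states: np.ndarray) -> np.ndarray:
    """Collapse a (frames, dim) sequence into one (dim,) vector by averaging over time."""
    if hidden_states.ndim != 2:
        raise ValueError("expected an array of shape (frames, dim)")
    return hidden_states.mean(axis=0)

# Stand-in for the final hidden layer of an SSL model (assumed: 250 frames, 768 dims).
rng = np.random.default_rng(0)
final_layer = rng.standard_normal((250, 768))

embedding = average_pool(final_layer)
print(embedding.shape)  # (768,)
```

Averaging over the time axis yields a fixed-size vector regardless of recording length, which matters here because the source recordings range from 5 to 65 seconds.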

1D-CNN and max-pooling

Each representation (R1, R2) is passed through a 1D-CNN block (32 filters) followed by max-pooling, then flattened and linearly projected to a 120-dimensional vector.
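A rough NumPy sketch of one branch is given below. Only the 32 filters and the 120-dimensional projection come from the description above; the kernel size (3), pooling width (2), ReLU activation, and random weights are assumptions made for illustration, and `init_branch` and `branch_forward` are hypothetical names.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def init_branch(input_dim, rng, filters=32, kernel=3, pool=2, out_dim=120):
    """Random weights for one branch; kernel and pool sizes are assumed, not from the paper."""
    steps = (input_dim - kernel + 1) // pool
    return {
        "conv_w": rng.standard_normal((filters, kernel)) * 0.1,
        "conv_b": np.zeros(filters),
        "proj_w": rng.standard_normal((out_dim, steps * filters)) * 0.01,
        "proj_b": np.zeros(out_dim),
        "pool": pool,
    }

def branch_forward(x, conv_w, conv_b, proj_w, proj_b, pool):
    """1D convolution -> ReLU (assumed) -> max-pooling -> flatten -> linear projection."""
    windows = sliding_window_view(x, conv_w.shape[1])     # (L - k + 1, k)
    feat = np.maximum(windows @ conv_w.T + conv_b, 0.0)   # (L - k + 1, filters)
    steps = feat.shape[0] // pool
    pooled = feat[: steps * pool].reshape(steps, pool, -1).max(axis=1)  # (steps, filters)
    return proj_w @ pooled.ravel() + proj_b               # (out_dim,)

rng = np.random.default_rng(0)
ssl_vector = rng.standard_normal(768)   # e.g. a pooled WavLM embedding
mfcc_vector = rng.standard_normal(40)   # e.g. a 40-dim MFCC vector

r1 = branch_forward(mfcc_vector, **init_branch(40, rng))
r2 = branch_forward(ssl_vector, **init_branch(768, rng))
print(r1.shape, r2.shape)  # (120,) (120,)
```

Projecting both branches to the same 120 dimensions is what lets the later alignment step compare a 40-dimensional spectral input with a 768-dimensional SSL input on equal footing.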

Gram matrix computation

Gram matrices GR1 = R1 R1^T and GR2 = R2 R2^T capture correlations between features and reflect global relational patterns, such as rhythm, across each representation space.

Grammian optimal transport (Gram-OT)

A cost matrix is built from the Frobenius distance between the two Gram matrices, and the Sinkhorn algorithm computes an optimal transport plan that aligns the two representation spaces.

Fusion and classification

Transported features are concatenated with their original representations to form F1 and F2, passed through parallel FCNs, concatenated again, and classified by a final FCN with a sigmoid output for real-vs-spoof detection.

Figure: The GROOT framework. Two representation branches (R1, R2) are each processed by a 1D-CNN and max-pooling, flattened, and aligned via the Gram-OT fusion module before classification.

Gram-OT Alignment

GR1 = R1 R1^T, GR2 = R2 R2^T
M = ||GR1 - GR2||_F / max(||GR1 - GR2||_F)
Gamma = Sinkhorn(M)
R2->R1 = Gamma . R2, R1->R2 = Gamma^T . R1
F1 = Concat(R2->R1, R1), F2 = Concat(R1->R2, R2)
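The alignment equations can be sketched in NumPy as below. One point is an interpretation rather than something stated on this page: a Frobenius norm of GR1 - GR2 is a single number, so to obtain a matrix-valued cost M the sketch assumes M[i, j] is the Euclidean distance between row i of GR1 and row j of GR2, normalized by its maximum. The entropic regularization strength, iteration count, and uniform marginals are likewise assumptions, and all function names are hypothetical.

```python
import numpy as np

def gram(r):
    """Gram matrix of a single feature vector: G = r r^T."""
    return np.outer(r, r)

def cost_matrix(g1, g2):
    """ASSUMED cost: pairwise row distances between the two Gram matrices, max-normalized."""
    diff = g1[:, None, :] - g2[None, :, :]
    m = np.linalg.norm(diff, axis=-1)
    return m / max(m.max(), 1e-12)

def sinkhorn(m, eps=0.5, n_iters=200):
    """Entropic OT plan with uniform marginals (eps and n_iters are assumed values)."""
    n, k = m.shape
    a, b = np.full(n, 1.0 / n), np.full(k, 1.0 / k)
    kernel = np.exp(-m / eps)
    u, v = np.ones(n), np.ones(k)
    for _ in range(n_iters):
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)
    return u[:, None] * kernel * v[None, :]

def gram_ot_fuse(r1, r2):
    """Transport each representation into the other's space and concatenate with the original."""
    gamma = sinkhorn(cost_matrix(gram(r1), gram(r2)))
    r2_to_r1 = gamma @ r2
    r1_to_r2 = gamma.T @ r1
    f1 = np.concatenate([r2_to_r1, r1])
    f2 = np.concatenate([r1_to_r2, r2])
    return f1, f2, gamma

rng = np.random.default_rng(0)
r1, r2 = rng.standard_normal(120), rng.standard_normal(120)
f1, f2, gamma = gram_ot_fuse(r1, r2)
print(f1.shape, f2.shape, gamma.shape)  # (240,) (240,) (120, 120)
```

With uniform marginals each row and column of the plan sums to 1/120, so the transported vectors are scaled-down weighted averages of the other branch; whether the authors rescale them is not stated here.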

The fused representations F1 and F2 are passed through FCNs with a dense layer of 80 neurons each, concatenated, and processed by a final FCN (120 and 30 neurons) with a sigmoid output layer for binary classification.
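The classifier head described above can be sketched as a plain forward pass. The layer widths (80, 80, 120, 30, 1) follow the text; the ReLU activations, the 240-dimensional inputs (two concatenated 120-dimensional vectors), and the random weights are assumptions for illustration only.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_head(rng, fused_dim=240):
    """Random weights for the head; widths follow the text, initialization is arbitrary."""
    sizes = {"f1": (80, fused_dim), "f2": (80, fused_dim),
             "h1": (120, 160), "h2": (30, 120), "out": (1, 30)}
    return {name: (rng.standard_normal(shape) * 0.05, np.zeros(shape[0]))
            for name, shape in sizes.items()}

def classify(f1, f2, params):
    """Parallel FCNs on F1 and F2, concatenation, then a 120-30-1 FCN with sigmoid output."""
    a1 = relu(params["f1"][0] @ f1 + params["f1"][1])     # (80,)
    a2 = relu(params["f2"][0] @ f2 + params["f2"][1])     # (80,)
    h = np.concatenate([a1, a2])                          # (160,)
    h = relu(params["h1"][0] @ h + params["h1"][1])       # (120,)
    h = relu(params["h2"][0] @ h + params["h2"][1])       # (30,)
    logit = params["out"][0] @ h + params["out"][1]       # (1,)
    return float(sigmoid(logit)[0])                       # score in (0, 1)

rng = np.random.default_rng(0)
params = init_head(rng)
score = classify(rng.standard_normal(240), rng.standard_normal(240), params)
print(0.0 < score < 1.0)  # True
```

Which class the sigmoid output encodes (real or spoof) is a labeling convention that this page does not specify.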

Task

Synthetic Heart Sound Detection

Binary real-vs-spoof detection of neural-audio-codec-synthesized phonocardiograms.

Fusion

Gram-OT Alignment

A novel Grammian optimal transport mechanism that aligns representations via their Gram matrices.

Representations

Spectral + SSL

MFCC and LFCC capture acoustic-level codec distortions; WavLM and Wav2vec2 capture temporal structure.

Downstream

1D-CNN + FCN

Lightweight CNN and max-pooling per branch followed by fusion and a fully connected classifier.

Best Result

93.20% ACC / 5.86% EER

GROOT with MFCC + WavLM sets a new SOTA for the SHAC task (Seen condition).

Goal

Robust heart sound CF detection

Establishes the first benchmark and baselines for detecting NAC-synthesized heart sounds.

Results

Table 1: Individual representations for SHAC

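CNN-based downstream models achieve higher accuracy than their FCN counterparts for every representation in both conditions, and lower EER in every case except UniSpeech-SAT in the Unseen condition (18.69 vs. 18.07). Among individual representations, WavLM with CNN is the strongest performer. SSL features generally outperform spectral features (MFCC, LFCC); the exception is UniSpeech-SAT, which trails MFCC in the Unseen condition.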

Representation	FCN ACC	FCN EER	CNN ACC	CNN EER
Seen
LFCC	76.99	15.19	79.02	14.96
MFCC	77.82	15.04	81.56	12.55
Wav2vec2	83.62	12.13	86.65	10.37
UniSpeech-SAT	80.30	12.30	82.81	11.59
WavLM	84.54	12.51	87.72	9.45
Unseen
LFCC	72.45	18.93	73.99	18.08
MFCC	74.93	17.60	78.74	16.91
Wav2vec2	79.56	16.03	83.61	13.74
UniSpeech-SAT	74.02	18.07	78.47	18.69
WavLM	80.54	15.01	84.02	13.39

Accuracy (ACC, higher is better) and Equal Error Rate (EER, lower is better) in %, for Seen and Unseen conditions.
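For readers unfamiliar with EER, the sketch below shows one common way to compute it from detection scores: sweep a threshold and report the error rate at the point where the false-positive and false-negative rates are closest. The label convention (1 = spoof, higher score = more likely spoof) and the toy scores are assumptions; the authors' exact evaluation code is not shown on this page.

```python
def equal_error_rate(labels, scores):
    """EER for binary detection; labels: 0 = real, 1 = spoof (assumed convention)."""
    real = [s for l, s in zip(labels, scores) if l == 0]
    spoof = [s for l, s in zip(labels, scores) if l == 1]
    if not real or not spoof:
        raise ValueError("both classes must be present")
    best_gap, eer = float("inf"), None
    for t in sorted(set(scores)) + [float("inf")]:
        fpr = sum(s >= t for s in real) / len(real)    # real flagged as spoof
        fnr = sum(s < t for s in spoof) / len(spoof)   # spoof missed
        gap = abs(fpr - fnr)
        if gap < best_gap:
            best_gap, eer = gap, (fpr + fnr) / 2
    return eer

# Toy example: one real sample scores high and one spoof sample scores low.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]
eer = equal_error_rate(labels, scores)
print(f"{100 * eer:.2f}%")  # 25.00%
```

Unlike accuracy, EER does not depend on a fixed decision threshold, which is why the two metrics can rank systems differently in the tables.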

Table 2: Fusion of representations

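GROOT achieves the highest accuracy for every pair in both Seen and Unseen conditions, and the lowest EER in every case except MFCC + Wav2vec2 in the Unseen condition, where the OT baseline is slightly better (10.34 vs. 10.70). The strongest heterogeneous pairs (spectral + SSL) outperform the strongest homogeneous pairs, and the best performance in both conditions is achieved by fusing MFCC and WavLM through GROOT.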

Pair	Concat ACC	Concat EER	OT ACC	OT EER	GROOT ACC	GROOT EER
Seen
LFCC + MFCC	80.05	11.82	82.41	10.91	84.61	8.35
LFCC + Wav2vec2	86.18	8.72	88.93	8.10	90.50	7.42
LFCC + UniSpeech-SAT	81.12	12.00	84.50	11.18	87.09	9.63
LFCC + WavLM	86.32	7.18	88.36	7.03	91.77	6.06
MFCC + Wav2vec2	86.57	7.20	88.82	6.87	90.83	6.14
MFCC + UniSpeech-SAT	84.00	11.23	85.88	11.22	87.90	9.72
MFCC + WavLM	87.70	7.40	89.07	6.86	93.20	5.86
Wav2vec2 + UniSpeech-SAT	86.99	9.87	87.79	9.06	89.00	8.61
Wav2vec2 + WavLM	86.26	8.32	88.92	7.82	91.60	6.20
UniSpeech-SAT + WavLM	85.01	10.54	86.23	8.92	88.74	7.55
Unseen
LFCC + MFCC	79.28	16.70	81.87	15.22	83.70	13.99
LFCC + Wav2vec2	84.11	12.34	84.25	10.83	86.00	9.87
LFCC + UniSpeech-SAT	79.05	17.98	80.47	16.00	82.39	14.80
LFCC + WavLM	81.22	13.27	83.01	12.87	85.31	10.98
MFCC + Wav2vec2	81.00	12.08	84.72	10.34	85.78	10.70
MFCC + UniSpeech-SAT	80.72	14.89	81.51	13.70	83.04	11.49
MFCC + WavLM	84.33	13.11	84.99	12.06	86.10	9.75
Wav2vec2 + UniSpeech-SAT	83.11	13.38	84.20	12.40	85.81	11.13
Wav2vec2 + WavLM	83.97	12.90	84.09	11.29	85.70	10.00
UniSpeech-SAT + WavLM	81.02	15.80	82.68	14.10	84.48	12.51

Accuracy (ACC, higher is better) and Equal Error Rate (EER, lower is better) in %. OT: Optimal Transport baseline (same architecture as GROOT, without Gram-OT).

Comparison to SOTA Audio Deepfake Baselines

GROOT (MFCC + WavLM) is compared against AASIST and MiO, strong general audio deepfake detection baselines, trained under the same configuration.

Model	Seen ACC	Seen EER	Unseen ACC	Unseen EER
AASIST	85.15	14.91	73.13	16.43
MiO	86.98	12.34	75.89	14.09
GROOT (MFCC + WavLM)	93.20	5.86	86.10	9.75

Accuracy (ACC, higher is better) and Equal Error Rate (EER, lower is better) in %.
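To put the gap in relative terms, the short calculation below derives the relative EER reduction of GROOT over each baseline from the table values above. The figures it prints are computed from the table, not reported by the authors, and the helper name is hypothetical.

```python
# EER values (%) copied from the comparison table above.
EER = {
    "AASIST": {"seen": 14.91, "unseen": 16.43},
    "MiO": {"seen": 12.34, "unseen": 14.09},
    "GROOT": {"seen": 5.86, "unseen": 9.75},
}

def relative_reduction(baseline, proposed):
    """Relative reduction in percent: 100 * (baseline - proposed) / baseline."""
    return 100.0 * (baseline - proposed) / baseline

reductions = {
    (model, cond): round(relative_reduction(EER[model][cond], EER["GROOT"][cond]), 1)
    for model in ("AASIST", "MiO")
    for cond in ("seen", "unseen")
}
for key, value in reductions.items():
    print(key, f"{value}%")
```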

Dataset and Codecs

CARDIOFAKE pairs authentic phonocardiograms with codec-resynthesized counterparts produced by 7 neural audio codec families, with dedicated seen and unseen evaluation protocols.

Source Recordings

CirCor DigiScope (PhysioNet)

963 patients, each labeled for murmur status as Present, Absent, or Unknown. 3,163 phonocardiogram recordings with durations from 5 to 65 seconds.

Source recordings

Seen Protocol

SNAC, DAC, EnCodec, SoundStream, SpeechTokenizer

Used to generate the codec-synthesized (spoof) samples for the training, validation, and seen test splits.

Codec backbones

Unseen Protocol

FunCodec, AudioDec

Held-out codecs used only at test time to evaluate generalization to unseen NACs.

Codec backbones

Resulting Dataset

CARDIOFAKE

3,163 real heart sounds paired with 22,141 NAC-generated counterparts across 7 codec families (seen + unseen).

Real + Spoof pairs
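The dataset size follows directly from the construction: each real recording is resynthesized once per codec. The sketch below checks that arithmetic and groups the codecs by protocol; the codec names and counts come from this page, while the dictionary layout and function name are illustrative. Note that the "seen" count covers all samples from seen codecs (training, validation, and seen test together), since the split sizes are not given here.

```python
PROTOCOLS = {
    "seen": ["SNAC", "DAC", "EnCodec", "SoundStream", "SpeechTokenizer"],
    "unseen": ["FunCodec", "AudioDec"],
}
NUM_REAL = 3163  # CirCor DigiScope recordings

def spoof_counts(num_real, protocols):
    """One resynthesized counterpart per real recording per codec."""
    return {name: num_real * len(codecs) for name, codecs in protocols.items()}

counts = spoof_counts(NUM_REAL, PROTOCOLS)
total_spoof = sum(counts.values())
print(counts)       # {'seen': 15815, 'unseen': 6326}
print(total_spoof)  # 22141
```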
