Interspeech 2026

Towards Detecting Neural Audio Codec Synthesized Heart Sounds

Girish*    Orchid Chetia Phukan*    Mohd Mujtaba Akhtar*    Bhavinkumar Vinodbhai Kuwar *    Swarup Ranjan Behera    Arun Balaji Buduru
*Equal Contribution as First Author
Synthetic Heart Sound Detection (SHAC), CARDIOFAKE Dataset, GROOT: Gram-OT Fusion

Heart sounds (phonocardiograms, PCGs) have been regarded as a promising biometric modality due to their natural resilience against traditional spoofing methods. We show that neural audio codecs (NACs) can synthesize heart sounds that are perceptually indistinguishable from genuine recordings, introduce the Synthetic Heart Sound Detection (SHAC) task and the CARDIOFAKE dataset, and propose GROOT, a Gram-OT based fusion of spectral and SSL representations for detecting these codec-synthesized heart sounds.

CirCor DigiScope heart sounds (963 patients, 3,163 recordings)
7 neural audio codec families, seen and unseen evaluation protocols
Real and codec-synthesized paired samples for CARDIOFAKE
CARDIOFAKE Dataset Construction Pipeline
Input Audio
Real Phonocardiograms
Real heart sound recordings are sourced from the CirCor DigiScope dataset (PhysioNet), spanning 963 patients with durations from 5 to 65 seconds.
CirCor DigiScope 3,163 recordings
Encode-Decode
Codec Stage
Neural Audio Codec Resynthesis
Each PCG is passed through the encoder-decoder of a neural audio codec, preserving cardiac acoustic patterns while introducing subtle codec-induced artifacts.
DAC EnCodec SoundStream SpeechTokenizer FunCodec AudioDec SNAC
Pairing
Output Audio
Real and Spoof Paired Samples
CARDIOFAKE retains the original phonocardiogram together with its codec-generated counterpart for every NAC, defining seen and unseen evaluation protocols.
Real Audio
Original phonocardiogram
Spoof Audio
NAC-resynthesized counterpart

CARDIOFAKE comprises 3,163 real heart sounds and 22,141 NAC-generated counterparts across 7 codec families, with seen (SNAC, DAC, EnCodec, SoundStream, SpeechTokenizer) and unseen (FunCodec, AudioDec) test protocols.
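The dataset counts and the seen/unseen split can be checked with a small sketch. The codec names and the 3,163 figure come from the text above; the recording IDs and the `build_pairs` helper are hypothetical placeholders, and the actual encode-decode call for each codec is deliberately left out.

```python
# Codec names are taken from the text; the grouping mirrors the seen/unseen protocols.
SEEN_CODECS = ["SNAC", "DAC", "EnCodec", "SoundStream", "SpeechTokenizer"]
UNSEEN_CODECS = ["FunCodec", "AudioDec"]
NUM_REAL = 3163  # real CirCor DigiScope recordings


def build_pairs(recording_ids, codecs):
    """Pair every real recording with one resynthesized copy per codec.

    Returns (recording_id, codec) tuples. In a real pipeline each tuple would
    trigger an encode-decode pass; that step is omitted in this sketch.
    """
    return [(rec_id, codec) for rec_id in recording_ids for codec in codecs]


# Hypothetical IDs standing in for the real recording names.
recording_ids = [f"rec_{i:04d}" for i in range(NUM_REAL)]

seen_spoofs = build_pairs(recording_ids, SEEN_CODECS)      # train / val / seen test
unseen_spoofs = build_pairs(recording_ids, UNSEEN_CODECS)  # unseen test only
total_spoofs = len(seen_spoofs) + len(unseen_spoofs)

print(len(seen_spoofs), len(unseen_spoofs), total_spoofs)
```

Seven codecs applied to 3,163 recordings gives 7 x 3,163 = 22,141 spoofed counterparts, which matches the reported total.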


Abstract

In this paper, we introduce Synthetic Heart Sound Detection (SHAC), a task aimed at identifying phonocardiograms (PCGs) synthesized using neural audio codecs (NACs). To facilitate research in this direction, we release CARDIOFAKE, the first benchmark dataset for SHAC containing both real and codec-synthesized PCGs.


We benchmark spectral representations (MFCC, LFCC) and self-supervised learning (SSL) representations (e.g., WavLM) for the task. Furthermore, we propose GROOT, a fusion framework that integrates spectral and SSL features for leveraging their complementary behavior.


Experiments show that GROOT, combining MFCC and WavLM, achieves state-of-the-art performance, outperforming individual representations and competitive baselines.

Synthetic Heart Sound Detection, Phonocardiograms, Neural Audio Codecs, Spectral Features, Self-Supervised Learning, Optimal Transport, Interspeech 2026

Key Contributions


Method

GROOT (Fusion via GRammian Optimal TranspOrT) fuses spectral and SSL representations for SHAC. Spectral features (MFCC, LFCC) are highly sensitive to NAC-induced distortions at the acoustic level, while SSL representations (Wav2vec2, UniSpeech-SAT, WavLM) capture broader temporal structure and variability in heart sounds. GROOT aligns these complementary representations using a novel grammian optimal transport mechanism that compares representations through their gram matrices rather than raw features.

1. Feature extraction

14-dim LFCC and 40-dim MFCC are extracted as spectral features. Wav2vec2, UniSpeech-SAT, and WavLM are used as SSL representations, each producing 768-dim features via average pooling over the final hidden layer.

2. 1D-CNN and max-pooling

Each representation (R1, R2) is passed through a 1D-CNN block (32 filters) followed by max-pooling, then flattened and linearly projected to a 120-dimensional vector.

3. Gram matrix computation

Gram matrices GR1 = R1 R1^T and GR2 = R2 R2^T capture correlations between features and reflect global relational patterns, such as rhythm, across each representation space.

4. Grammian optimal transport (Gram-OT)

A cost matrix is built from the Frobenius distance between the two gram matrices, and the Sinkhorn algorithm computes an optimal transport plan to align the two representation spaces.

5. Fusion and classification

Transported features are concatenated with their original representations to form F1 and F2, passed through parallel FCNs, concatenated again, and classified by a final FCN with a sigmoid output for real-vs-spoof detection.


Figure: The GROOT framework. Two representation branches (R1, R2) are each processed by a 1D-CNN and max-pooling, flattened, and aligned via the Gram-OT fusion module before classification.
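The per-branch processing (average pooling, 1D-CNN with 32 filters, max-pooling, flattening, projection to 120 dimensions) can be sketched in NumPy as follows. This is an illustrative forward pass only, not the authors' implementation: the kernel size (3), pool size (2), ReLU activation, random weights, and the time-averaging of MFCC frames are all assumptions that the page does not state.

```python
import numpy as np

rng = np.random.default_rng(0)


def mean_pool(frames):
    """Average a (frames, dim) feature sequence over time -> (dim,)."""
    return frames.mean(axis=0)


def conv1d_relu(x, kernels, bias):
    """Valid 1D convolution of a single-channel vector, followed by ReLU."""
    k = kernels.shape[1]
    windows = np.lib.stride_tricks.sliding_window_view(x, k)  # (len - k + 1, k)
    return np.maximum(kernels @ windows.T + bias[:, None], 0.0)


def max_pool1d(maps, size=2):
    """Non-overlapping max-pooling along the last axis."""
    n_filters, length = maps.shape
    usable = (length // size) * size
    return maps[:, :usable].reshape(n_filters, -1, size).max(axis=2)


def init_branch(input_dim, n_filters=32, kernel=3, pool=2, out_dim=120):
    """Random weights; kernel and pool sizes are assumed, not from the source."""
    flat_dim = n_filters * ((input_dim - kernel + 1) // pool)
    return {
        "kernels": rng.normal(0.0, 0.1, (n_filters, kernel)),
        "conv_bias": np.zeros(n_filters),
        "proj": rng.normal(0.0, 0.01, (out_dim, flat_dim)),
        "proj_bias": np.zeros(out_dim),
    }


def branch_forward(x, p):
    """1D-CNN -> max-pool -> flatten -> linear projection to 120 dims."""
    maps = conv1d_relu(x, p["kernels"], p["conv_bias"])
    flat = max_pool1d(maps).reshape(-1)
    return p["proj"] @ flat + p["proj_bias"]


# Random stand-ins for WavLM hidden states (768-dim) and MFCC frames (40-dim).
ssl_frames = rng.normal(size=(250, 768))
mfcc_frames = rng.normal(size=(250, 40))

r_ssl = branch_forward(mean_pool(ssl_frames), init_branch(768))
r_mfcc = branch_forward(mean_pool(mfcc_frames), init_branch(40))
print(r_ssl.shape, r_mfcc.shape)
```

Both branches end in a vector of the same size, which is what lets the fusion module compare representations with very different input dimensionalities.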

Gram-OT Alignment

GR1 = R1 R1^T, GR2 = R2 R2^T
M = ||GR1 - GR2||_F / max(||GR1 - GR2||_F)
Gamma = Sinkhorn(M)
R2->R1 = Gamma . R2, R1->R2 = Gamma^T . R1
F1 = Concat(R2->R1, R1), F2 = Concat(R1->R2, R2)
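A minimal NumPy sketch of the alignment equations above is given below. Several details are assumptions rather than facts from the page: R1 and R2 are treated as (n, d) matrices with n rows to be aligned, the cost entry M[i, j] is taken as the distance between row i of GR1 and row j of GR2 (the page does not spell out how the Frobenius distance becomes a matrix), the marginals are uniform, and the entropic regularization value and iteration count are arbitrary.

```python
import numpy as np


def sinkhorn(cost, reg=0.5, n_iters=200):
    """Entropic OT plan between two uniform distributions (assumed marginals)."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)
    b = np.full(m, 1.0 / m)
    K = np.exp(-cost / reg)
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)    # rescale rows toward marginal a
        v = b / (K.T @ u)  # rescale columns toward marginal b
    return u[:, None] * K * v[None, :]


def gram_ot_fuse(R1, R2, reg=0.5):
    """Align two (n, d) representations through their Gram matrices."""
    G1 = R1 @ R1.T  # (n, n) Gram matrix of the first representation
    G2 = R2 @ R2.T  # (n, n) Gram matrix of the second representation

    # Assumed cost: distance between row i of G1 and row j of G2,
    # normalized by its maximum so that entries lie in [0, 1].
    M = np.linalg.norm(G1[:, None, :] - G2[None, :, :], axis=2)
    M = M / max(M.max(), 1e-12)

    gamma = sinkhorn(M, reg)
    R2_to_R1 = gamma @ R2    # R2 transported into R1's space
    R1_to_R2 = gamma.T @ R1  # R1 transported into R2's space

    F1 = np.concatenate([R2_to_R1, R1], axis=1)
    F2 = np.concatenate([R1_to_R2, R2], axis=1)
    return F1, F2, gamma


rng = np.random.default_rng(0)
R1 = rng.normal(size=(8, 120))  # random stand-in for the spectral branch
R2 = rng.normal(size=(8, 120))  # random stand-in for the SSL branch
F1, F2, gamma = gram_ot_fuse(R1, R2)
print(F1.shape, F2.shape, gamma.shape)
```

Comparing Gram matrices rather than raw features means the cost depends only on the relational structure inside each representation, so the two branches do not need to share a coordinate system.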

The fused representations F1 and F2 are passed through FCNs with a dense layer of 80 neurons each, concatenated, and processed by a final FCN (120 and 30 neurons) with a sigmoid output layer for binary classification.
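The classification head described above can be sketched as a plain forward pass. The layer widths (80, 80, 120, 30, sigmoid output) follow the text; the 240-dimensional inputs (120 transported plus 120 original features), the ReLU activations, the random weights, and the convention that a score near 1 means spoof are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def dense_params(in_dim, out_dim):
    """Randomly initialized weight matrix and zero bias (illustrative only)."""
    w = rng.normal(0.0, np.sqrt(2.0 / in_dim), (out_dim, in_dim))
    return w, np.zeros(out_dim)


def dense(x, params, activation):
    w, b = params
    return activation(w @ x + b)


def relu(x):
    return np.maximum(x, 0.0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


params = {
    "fcn1": dense_params(240, 80),      # parallel FCN for F1
    "fcn2": dense_params(240, 80),      # parallel FCN for F2
    "hidden1": dense_params(160, 120),  # final FCN, first layer
    "hidden2": dense_params(120, 30),   # final FCN, second layer
    "out": dense_params(30, 1),         # sigmoid output unit
}


def classify(f1, f2, p):
    """Return a score in (0, 1); here 1 is assumed to mean spoof."""
    h = np.concatenate([dense(f1, p["fcn1"], relu), dense(f2, p["fcn2"], relu)])
    h = dense(h, p["hidden1"], relu)
    h = dense(h, p["hidden2"], relu)
    return float(dense(h, p["out"], sigmoid)[0])


f1 = rng.normal(size=240)  # random stand-in for the fused vector F1
f2 = rng.normal(size=240)  # random stand-in for the fused vector F2
score = classify(f1, f2, params)
print(score)
```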

Task

Synthetic Heart Sound Detection

Binary real-vs-spoof detection of neural-audio-codec-synthesized phonocardiograms.

Fusion

Gram-OT Alignment

A novel grammian optimal transport mechanism aligning representations via their gram matrices.

Representations

Spectral + SSL

MFCC and LFCC capture acoustic-level codec distortions; WavLM and Wav2vec2 capture temporal structure.

Downstream

1D-CNN + FCN

Lightweight CNN and max-pooling per branch followed by fusion and a fully connected classifier.

Best Result

93.20% ACC / 5.86% EER

GROOT with MFCC + WavLM sets a new SOTA for the SHAC task (Seen condition).

Goal

Robust heart sound CF detection

Establishes the first benchmark and baselines for detecting NAC-synthesized heart sounds.


Results

Table 1: Individual representations for SHAC

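CNN-based downstream models outperform their FCN counterparts in accuracy for every representation, and in EER in every case except UniSpeech-SAT in the Unseen condition. Among individual representations, WavLM with CNN is the strongest performer. Wav2vec2 and WavLM outperform the spectral features (MFCC, LFCC) in both Seen and Unseen conditions; UniSpeech-SAT does so in the Seen condition but falls slightly behind MFCC in the Unseen condition.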

Representation | FCN ACC | FCN EER | CNN ACC | CNN EER
Seen
LFCC           | 76.99   | 15.19   | 79.02   | 14.96
MFCC           | 77.82   | 15.04   | 81.56   | 12.55
Wav2vec2       | 83.62   | 12.13   | 86.65   | 10.37
UniSpeech-SAT  | 80.30   | 12.30   | 82.81   | 11.59
WavLM          | 84.54   | 12.51   | 87.72   | 9.45
Unseen
LFCC           | 72.45   | 18.93   | 73.99   | 18.08
MFCC           | 74.93   | 17.60   | 78.74   | 16.91
Wav2vec2       | 79.56   | 16.03   | 83.61   | 13.74
UniSpeech-SAT  | 74.02   | 18.07   | 78.47   | 18.69
WavLM          | 80.54   | 15.01   | 84.02   | 13.39

Accuracy (ACC, higher is better) and Equal Error Rate (EER, lower is better) in %, for Seen and Unseen conditions.
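For readers unfamiliar with the two metrics, the sketch below shows one common way to compute them from detector scores. It assumes label 1 means spoof, higher scores mean more spoof-like, and a 0.5 decision threshold for accuracy; the page does not describe the authors' exact evaluation code, and the toy scores are invented for illustration.

```python
def accuracy(scores, labels, threshold=0.5):
    """Fraction of correct decisions when score >= threshold means spoof."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def equal_error_rate(scores, labels):
    """Error rate at the threshold where false positives and misses balance.

    Sweeps every observed score as a threshold and returns the mean of the
    false-positive and false-negative rates at the closest crossing point.
    """
    n_spoof = sum(labels)
    n_real = len(labels) - n_spoof
    best_gap, eer = float("inf"), None
    for t in sorted(set(scores)) + [float("inf")]:
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
        fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < t)
        fpr, fnr = fp / n_real, fn / n_spoof
        if abs(fpr - fnr) < best_gap:
            best_gap, eer = abs(fpr - fnr), (fpr + fnr) / 2
    return eer


# Toy example: two real (0) and two spoof (1) recordings.
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
acc = accuracy(scores, labels)
eer = equal_error_rate(scores, labels)
print(f"ACC = {100 * acc:.2f}%, EER = {100 * eer:.2f}%")
```

Unlike accuracy, EER does not depend on a fixed decision threshold, which is why the two numbers are reported side by side.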


Table 2: Fusion of representations

GROOT consistently achieves the best overall performance across both Seen and Unseen conditions. Heterogeneous fusion of spectral and SSL representations outperforms homogeneous fusion, and the best performance is achieved by fusing MFCC and WavLM through GROOT.

Pair                     | Concat ACC | Concat EER | OT ACC | OT EER | GROOT ACC | GROOT EER
Seen
LFCC + MFCC              | 80.05      | 11.82      | 82.41  | 10.91  | 84.61     | 8.35
LFCC + Wav2vec2          | 86.18      | 8.72       | 88.93  | 8.10   | 90.50     | 7.42
LFCC + UniSpeech-SAT     | 81.12      | 12.00      | 84.50  | 11.18  | 87.09     | 9.63
LFCC + WavLM             | 86.32      | 7.18       | 88.36  | 7.03   | 91.77     | 6.06
MFCC + Wav2vec2          | 86.57      | 7.20       | 88.82  | 6.87   | 90.83     | 6.14
MFCC + UniSpeech-SAT     | 84.00      | 11.23      | 85.88  | 11.22  | 87.90     | 9.72
MFCC + WavLM             | 87.70      | 7.40       | 89.07  | 6.86   | 93.20     | 5.86
Wav2vec2 + UniSpeech-SAT | 86.99      | 9.87       | 87.79  | 9.06   | 89.00     | 8.61
Wav2vec2 + WavLM         | 86.26      | 8.32       | 88.92  | 7.82   | 91.60     | 6.20
UniSpeech-SAT + WavLM    | 85.01      | 10.54      | 86.23  | 8.92   | 88.74     | 7.55
Unseen
LFCC + MFCC              | 79.28      | 16.70      | 81.87  | 15.22  | 83.70     | 13.99
LFCC + Wav2vec2          | 84.11      | 12.34      | 84.25  | 10.83  | 86.00     | 9.87
LFCC + UniSpeech-SAT     | 79.05      | 17.98      | 80.47  | 16.00  | 82.39     | 14.80
LFCC + WavLM             | 81.22      | 13.27      | 83.01  | 12.87  | 85.31     | 10.98
MFCC + Wav2vec2          | 81.00      | 12.08      | 84.72  | 10.34  | 85.78     | 10.70
MFCC + UniSpeech-SAT     | 80.72      | 14.89      | 81.51  | 13.70  | 83.04     | 11.49
MFCC + WavLM             | 84.33      | 13.11      | 84.99  | 12.06  | 86.10     | 9.75
Wav2vec2 + UniSpeech-SAT | 83.11      | 13.38      | 84.20  | 12.40  | 85.81     | 11.13
Wav2vec2 + WavLM         | 83.97      | 12.90      | 84.09  | 11.29  | 85.70     | 10.00
UniSpeech-SAT + WavLM    | 81.02      | 15.80      | 82.68  | 14.10  | 84.48     | 12.51

Accuracy (ACC, higher is better) and Equal Error Rate (EER, lower is better) in %. OT: Optimal Transport baseline (same architecture as GROOT, without Gram-OT).


Comparison to SOTA Audio Deepfake Baselines

GROOT (MFCC + WavLM) is compared against AASIST and MiO, strong general audio deepfake detection baselines, trained under the same configuration.

Model                | Seen ACC | Seen EER | Unseen ACC | Unseen EER
AASIST               | 85.15    | 14.91    | 73.13      | 16.43
MiO                  | 86.98    | 12.34    | 75.89      | 14.09
GROOT (MFCC + WavLM) | 93.20    | 5.86     | 86.10      | 9.75

Accuracy (ACC, higher is better) and Equal Error Rate (EER, lower is better) in %.


Dataset and Codecs

CARDIOFAKE pairs authentic phonocardiograms with codec-resynthesized counterparts produced by 7 neural audio codec families, with dedicated seen and unseen evaluation protocols.

Source Recordings

CirCor DigiScope (PhysioNet)

963 patients, each labeled for murmur status as Present, Absent, or Unknown. 3,163 phonocardiogram recordings with durations from 5 to 65 seconds.

Source recordings
Seen Protocol

SNAC, DAC, EnCodec, SoundStream, SpeechTokenizer

Used to generate codec-synthesized (spoof) samples for the training, validation, and seen test splits.

Codec backbones
Unseen Protocol

FunCodec, AudioDec

Held-out codecs used only at test time to evaluate generalization to unseen NACs.

Codec backbones
Resulting Dataset

CARDIOFAKE

3,163 real heart sounds paired with 22,141 NAC-generated counterparts across 7 codec families (seen + unseen).

Real + Spoof pairs

Audio Samples

Sample rows below are loaded from Audio/manifest.js. Each row pairs a real heart sound recording with its codec-generated counterpart.

Codec | Ground Truth ID | GT Heart Sound | Generated Heart Sound | Source

Citation

If you use the CARDIOFAKE dataset or GROOT, please cite the paper as follows:

@inproceedings{shac-interspeech2026,
  title = {Towards Detecting Neural Audio Codec Synthesized Heart Sounds},
  author = {Girish and Phukan, Orchid Chetia and Akhtar, Mohd Mujtaba and Kuwar, Bhavinkumar Vinodbhai and Behera, Swarup Ranjan and Buduru, Arun Balaji},
  booktitle = {Proceedings of Interspeech 2026},
  year = {2026},
  pages = {TBD},
}

Update page numbers in the citation once the proceedings version is finalized.