The rapid advancement of Audio Large Language Models (ALMs), driven by Neural Audio Codecs (NACs), has led to the emergence of highly realistic speech deepfakes, commonly referred to as CodecFakes (CFs). Consequently, CF detection has attracted increasing attention from the research community. However, existing studies predominantly focus on English or Chinese, leaving the vulnerability of Indic languages largely unexplored.
To bridge this gap, we introduce the Indic-CodecFake (ICF) dataset — the first large-scale benchmark comprising real and NAC-synthesized speech across multiple Indic languages, diverse speaker profiles, and multiple NAC types, built on the IndicSUPERB corpus. Our experiments demonstrate that state-of-the-art (SOTA) CF detectors trained on English-centric datasets fail to generalize to ICF, underscoring the challenges posed by phonetic diversity and prosodic variability in Indic speech. We further present a systematic zero-shot evaluation of SOTA ALMs on ICF, revealing consistently poor performance.
To address these limitations, we propose SATYAM, a novel hyperbolic ALM tailored for CF detection in Indic languages. SATYAM aligns semantic representations from Whisper with prosodic representations from TRILLsson via the Bhattacharyya distance in hyperbolic space, followed by a second Bhattacharyya-distance alignment between the fused speech representation and an input conditioning prompt. This dual-stage framework enables effective modeling of hierarchical relationships both within speech (semantic–prosodic) and across modalities (speech–text). Extensive evaluations show that SATYAM consistently outperforms competitive end-to-end and ALM-based baselines on the ICF benchmark.
ICF is constructed by resynthesizing real speech from IndicSUPERB using 8 NAC families in a controlled encode–decode pipeline. Each real utterance x is passed through a NAC encoder E and decoder D to yield a CodecFake counterpart x̃ = D(E(x)), preserving linguistic content while introducing NAC-specific artifacts. The dataset spans 12 Indic languages across both Indo-European and Dravidian families. All codec models used to generate ICF are available at CodeVault-girish/Neural-Codecs.
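The encode–decode construction x̃ = D(E(x)) can be sketched in a few lines. The toy uniform quantizer below merely stands in for a neural codec's encoder E and decoder D — all names are illustrative and not taken from the released code:

```python
# Minimal sketch of the ICF resynthesis pipeline. A toy 8-bit uniform
# quantizer plays the role of a NAC's encoder/decoder pair: content is
# preserved, but quantization error stands in for codec artifacts.

def encode(x, levels=256):
    """Toy stand-in for a NAC encoder E: samples in [-1, 1] -> discrete codes."""
    return [round((s + 1.0) / 2.0 * (levels - 1)) for s in x]

def decode(codes, levels=256):
    """Toy stand-in for a NAC decoder D: discrete codes -> waveform samples."""
    return [c / (levels - 1) * 2.0 - 1.0 for c in codes]

def resynthesize(x):
    """CodecFake counterpart: x_tilde = D(E(x))."""
    return decode(encode(x))

x = [0.0, 0.5, -0.25, 0.99]
x_tilde = resynthesize(x)
# x_tilde is close to x but not identical: the small reconstruction error
# is the kind of signal a CF detector must learn to pick up.
```

With a real NAC the artifacts are far more structured than uniform quantization noise, which is why eight codec families are needed to cover the space of generator fingerprints.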
IE = Indo-European · DR = Dravidian
Neural Audio Codecs Used

- Seen · Descript Audio Codec with multi-scale architecture.
- Seen · Meta's streaming neural audio codec.
- Seen · Multi-scale neural audio codec.
- Seen · Google's end-to-end neural audio codec.
- Seen · Hierarchical speech tokenizer for LLMs.
- Unseen · Fundamental frequency-aware codec.
- Unseen · High-fidelity streaming audio decoder.
- Unseen · Moshi's neural audio codec (Kyutai).

Seen NACs appear in training (test-known split) · Unseen NACs are held out for cross-codec generalization (test-unknown split).
SATYAM is a supervised hyperbolic ALM that formulates CF detection as a conditional generation task. Given an input speech utterance, SATYAM extracts complementary semantic and paralinguistic representations, fuses them in hyperbolic space via Bhattacharyya-distance alignment, and conditions a frozen LLM decoder to generate a one-word verdict: "Real" or "Fake". Total trainable parameters: ~3.75M.
Two complementary encoders extract representations from the input speech waveform: Whisper provides semantic representations (ew) capturing linguistic content, while TRILLsson provides paralinguistic representations (et) capturing prosodic and acoustic cues. Each branch passes through a 1D-CNN block (filter size 3) and max-pooling, then projects into a shared Euclidean space of dimension d. A sigmoid gating module filters salient information before the hyperbolic mapping stage.
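The dual-branch front-end can be sketched as below. Shapes, filter counts, and the exact gating placement are assumptions for illustration; the text specifies only a 1D-CNN with filter size 3, max-pooling, a projection to a shared dimension d, and sigmoid gating:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # shared projection dimension (illustrative value)

def branch(feats, proj_dim, rng):
    """One front-end branch: (T, F) frame features -> (proj_dim,) gated embedding."""
    T, F = feats.shape
    k = rng.standard_normal((3, F, F)) * 0.1           # 1D conv kernel, filter size 3
    conv = np.stack([sum(k[j] @ feats[t + j] for j in range(3))
                     for t in range(T - 2)])           # valid 1D convolution -> (T-2, F)
    pooled = conv.max(axis=0)                          # temporal max-pooling -> (F,)
    W = rng.standard_normal((proj_dim, F)) * 0.1       # linear projection to shared space
    e = W @ pooled
    gate = 1.0 / (1.0 + np.exp(-e))                    # sigmoid gating module
    return gate * e                                    # keep only salient information

e_w = branch(rng.standard_normal((50, 32)), d, rng)    # Whisper-like semantic branch
e_t = branch(rng.standard_normal((50, 24)), d, rng)    # TRILLsson-like prosodic branch
```

Both branches end in the same d-dimensional Euclidean space, which is what makes the subsequent joint hyperbolic mapping and alignment well-defined.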
Both representations are mapped into a d-dimensional hyperbolic space Hdc (curvature −c) via the exponential map at the origin,

exp₀ᶜ(v) = tanh(√c ‖v‖) · v / (√c ‖v‖),

yielding hyperbolic representations hw and ht on the Poincaré manifold.
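The exponential map at the origin has the standard Poincaré-ball closed form; a minimal numpy sketch (vector values illustrative):

```python
import numpy as np

def expmap0(v, c=1.0, eps=1e-9):
    """Exponential map at the origin of the Poincare ball with curvature -c:
    exp_0(v) = tanh(sqrt(c)*||v||) * v / (sqrt(c)*||v||)."""
    sc = np.sqrt(c)
    n = np.linalg.norm(v) + eps  # eps guards the v = 0 case
    return np.tanh(sc * n) * v / (sc * n)

v = np.array([0.8, -1.2, 2.0])
h = expmap0(v)
# h keeps the direction of v but lands strictly inside the unit ball,
# so arbitrarily large Euclidean embeddings stay on the manifold.
```

Because tanh saturates, distances near the ball's boundary grow exponentially — this is the property that lets hyperbolic space embed hierarchical (tree-like) structure with low distortion.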
We minimize the Bhattacharyya distance (BD) between the semantic and paralinguistic hyperbolic distributions to align them. For distributions P and Q, the BD takes the classical form DB(P, Q) = −ln ∫ √(P(x) Q(x)) dx, evaluated here over Hdc.
This yields the speech–speech alignment loss LS-S = DB(hw, ht). The aligned representations are then fused using Möbius addition (⊕c), which preserves hyperbolic geometry: hf = hw ⊕c ht.
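Möbius addition has a closed form on the Poincaré ball; the sketch below implements it, together with the classical 1-D Gaussian Bhattacharyya distance shown only to illustrate the quantity being aligned (the hyperbolic extension is the paper's contribution and is not reproduced here; all vector values are illustrative):

```python
import numpy as np

def mobius_add(u, v, c=1.0):
    """Mobius addition u (+)_c v on the Poincare ball with curvature -c."""
    uv, u2, v2 = np.dot(u, v), np.dot(u, u), np.dot(v, v)
    num = (1 + 2 * c * uv + c * v2) * u + (1 - c * u2) * v
    den = 1 + 2 * c * uv + c**2 * u2 * v2
    return num / den  # stays inside the ball for in-ball u, v

def bhattacharyya_gauss(mu1, s1, mu2, s2):
    """Classical closed-form BD between two 1-D Gaussians (illustration only)."""
    return (0.25 * (mu1 - mu2) ** 2 / (s1**2 + s2**2)
            + 0.5 * np.log((s1**2 + s2**2) / (2 * s1 * s2)))

h_w = np.array([0.2, -0.1, 0.3])   # illustrative in-ball vectors
h_t = np.array([-0.05, 0.25, 0.1])
h_f = mobius_add(h_w, h_t)         # fused representation h_f = h_w (+)_c h_t
```

Unlike naive vector addition, Möbius addition is the group operation induced by the ball's geometry, so the fused point h_f remains a valid hyperbolic representation.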
A conditioning prompt "Analyze the speech for unnatural artifacts" is fed to Qwen2-7B. Hidden states from an intermediate transformer layer are mean-pooled to obtain a prompt representation eA, projected to the shared space, then mapped to hyperbolic space. The same BD alignment is applied between the fused speech and prompt: LS-T = DB(hf, hA). Final aggregation: hfinal = hf ⊕c hA.
The aggregated hyperbolic representation is mapped back to Euclidean space via the logarithmic map, linearly projected to the LLM embedding dimension g, and injected as prefix conditioning tokens into the frozen Qwen2-7B decoder. A decision prompt "Determine whether the speech is real or fake. Answer only in one word: 'Real' or 'Fake'" drives generation of the output sequence Y. Training minimizes the autoregressive cross-entropy over Y together with the two alignment losses, ℒ = ℒCE + λ₁ ℒS-S + λ₂ ℒS-T, where λ₁ and λ₂ weight the speech–speech and speech–text alignment terms.
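The logarithmic map at the origin is the exact inverse of the exponential map, so pulling the aggregated vector back to Euclidean space before the linear projection loses no information. A minimal sketch (vector values illustrative):

```python
import numpy as np

def expmap0(v, c=1.0, eps=1e-9):
    """Exponential map at the origin of the Poincare ball (curvature -c)."""
    sc = np.sqrt(c)
    n = np.linalg.norm(v) + eps
    return np.tanh(sc * n) * v / (sc * n)

def logmap0(y, c=1.0, eps=1e-9):
    """Logarithmic map at the origin: inverse of expmap0, back to Euclidean space."""
    sc = np.sqrt(c)
    n = np.linalg.norm(y) + eps
    return np.arctanh(sc * n) * y / (sc * n)

v = np.array([0.3, -0.7, 1.1])
round_trip = logmap0(expmap0(v))
# round_trip recovers v (up to numerical guards), confirming the
# exp/log pair is a lossless bridge between the two geometries.
```

The resulting Euclidean vector is what gets projected and prepended as prefix tokens to the frozen decoder's input sequence.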
OpenAI's speech recognition model; captures linguistic and semantic content from speech.
Google's paralinguistic model; captures prosodic and acoustic non-semantic cues — key for deepfake detection.
Hyperbolic space with curvature −c. Naturally embeds the hierarchical structure of speech semantics and artifacts.
Extended to hyperbolic space for speech–speech and speech–prompt distribution alignment. Novel contribution.
Geometry-preserving operation in hyperbolic space. Fuses semantic + prosodic, and speech + prompt representations.
Prefix-conditioned generation with ~3.75M trainable parameters. Lightweight variant with Qwen2-1.8B also evaluated.
Table 1: In-domain Training and Evaluation on ICF and CodecFake
SATYAM achieves 98.32% ACC / 3.27% EER on ICF and 99.11% ACC / 1.94% EER on CodecFake, outperforming all end-to-end and ALM-based baselines by substantial margins. W = Whisper, T = TRILLsson. Green = best, yellow = second best.
| Method | ICF ACC ↑ | ICF EER ↓ | CodecFake ACC ↑ | CodecFake EER ↓ |
|---|---|---|---|---|
| Zero-shot ALM Evaluation | | | | |
| Pengi | 3.19 | 98.26 | 5.68 | 94.97 |
| Audio Flamingo 2 | 5.42 | 97.68 | 8.41 | 92.10 |
| Audio Flamingo 3 | 6.98 | 97.21 | 10.22 | 90.85 |
| Qwen-audio-chat | 10.63 | 89.71 | 13.00 | 86.61 |
| Qwen-audio-base | 11.17 | 89.23 | 15.82 | 85.74 |
| Qwen2-audio-chat | 12.05 | 88.95 | 16.74 | 82.33 |
| Qwen2-audio-base | 13.41 | 88.57 | 17.91 | 81.26 |
| End-to-End & Pre-Trained Backbone | | | | |
| AASIST | 90.60 | 12.47 | 94.21 | 10.13 |
| Whisper-LCNN | 91.98 | 11.89 | 93.38 | 7.92 |
| Wav2vec2-AASIST | 92.50 | 9.62 | 94.45 | 7.29 |
| MiO | 92.80 | 9.04 | 95.11 | 6.49 |
| SATYAM Ablations (Ours) | | | | |
| W + Qwen2-7B | 92.98 | 8.61 | 94.64 | 6.02 |
| T + Qwen2-7B | 93.21 | 8.09 | 95.10 | 5.83 |
| W + T + Qwen2-7B (Concat) | 93.28 | 7.94 | 95.75 | 4.39 |
| W + T + Qwen2-7B (Möbius) | 94.01 | 7.02 | 95.31 | 4.07 |
| W + T + Qwen2-7B (E-BD) | 94.93 | 5.39 | 96.47 | 3.68 |
| W + T + Qwen2-7B (H-BD-ST only) | 95.78 | 5.14 | 97.22 | 2.69 |
| W + T + Qwen2-7B (H-BD-SS only) | 96.11 | 5.02 | 97.34 | 2.42 |
| SATYAM with Qwen2-1.8B | 97.14 | 4.53 | 98.32 | 2.11 |
| SATYAM (Qwen2-7B) | 98.32 | 3.27 | 99.11 | 1.94 |
Scores in %. Bold green = best · Yellow = second best. Ablations confirm that both H-BD-SS and H-BD-ST stages are complementary — removing either hurts performance.
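EER, reported throughout the tables, is the operating point where the false-acceptance and false-rejection rates coincide. A minimal sketch of how it can be computed from detector scores (the "higher score = more likely fake" convention is an assumption, not from the paper):

```python
import numpy as np

def eer(scores_real, scores_fake):
    """Equal error rate over candidate thresholds drawn from the scores.
    Assumed convention: higher score means 'more likely fake'."""
    ts = np.sort(np.concatenate([scores_real, scores_fake]))
    far = np.array([np.mean(scores_real >= t) for t in ts])  # real flagged as fake
    frr = np.array([np.mean(scores_fake < t) for t in ts])   # fake missed
    i = np.argmin(np.abs(far - frr))                         # FAR ~ FRR crossing
    return (far[i] + frr[i]) / 2

real = np.array([0.1, 0.4, 0.6])   # toy scores, illustrative only
fake = np.array([0.5, 0.7, 0.9])
rate = eer(real, fake)
```

Perfectly separated score distributions give an EER of 0; an EER of 3.27% as in Table 1 means the best threshold still misclassifies about 3.27% of each class.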
Table 2: Generalization — Cross-Benchmark Transfer
SATYAM remains robust under cross-benchmark distribution shift. AASIST degrades severely when transferred to ICF, while SATYAM maintains low EER in both transfer directions.
| Transfer Setting | AASIST EER ↓ | SATYAM EER ↓ |
|---|---|---|
| CodecFake → ICF (zero-shot) | 40.32 | 7.43 |
| ICF → CodecFake | 29.81 | 3.79 |
| Cross-lingual (random 6-lang train → held-out lang) | 26.74 / 31.11 | 6.34 / 7.09 |
| Dravidian → Indo-European | 33.45 | 7.78 |
| Indo-European → Dravidian | 38.73 | 8.48 |
| Unseen codecs — clean (test-unknown) | 14.38 | 5.23 |
| Unseen codecs — noisy (test-unknown) | 16.29 | 7.41 |
EER in %. SATYAM generalizes across codec families, language families, and acoustic conditions — demonstrating robustness far beyond English-centric baselines.
Ground truth (GT) samples are randomly drawn from the IndicSUPERB corpus. Each row shows a real utterance alongside its NAC-resynthesized CodecFake counterpart for a given language.
| Language | Codec | GT Speech | Generated (Fake) Speech |
|---|---|---|---|
If you find this work useful, please cite: