✅ ACL 2026 Findings · Accepted

Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages

Girish*,1    Mohd Mujtaba Akhtar*,2    Orchid Chetia Phukan*†,3    Arun Balaji Buduru3
1UPES, India  ·  2Veer Bahadur Singh Purvanchal University, India  ·  3IIIT-Delhi, India
*Equal contribution  ·  †Corresponding author  ·  ACL 2026 Findings
Pipeline: 🎙️ Real Speech (IndicSUPERB, 12 Indic languages) → 🔊 Neural Audio Codec (encode → decode, 8 NAC families) → 👻 CodecFake (ICF dataset, parallel corpus) → SATYAM (hyperbolic ALM) → ✅❌ Decision: "Real" or "Fake" (98.32% ACC)
Indic-CodecFake (ICF) is the first large-scale benchmark of NAC-synthesized speech deepfakes across 12 Indic languages. SATYAM — a hyperbolic Audio LLM — aligns semantic (Whisper) and prosodic (TRILLsson) representations via Bhattacharyya distance in hyperbolic space, achieving 98.32% accuracy on ICF.


Abstract

The rapid advancement of Audio Large Language Models (ALMs), driven by Neural Audio Codecs (NACs), has led to the emergence of highly realistic speech deepfakes, commonly referred to as CodecFakes (CFs). Consequently, CF detection has attracted increasing attention from the research community. However, existing studies predominantly focus on English or Chinese, leaving the vulnerability of Indic languages largely unexplored.


To bridge this gap, we introduce the Indic-CodecFake (ICF) dataset — the first large-scale benchmark comprising real and NAC-synthesized speech across multiple Indic languages, diverse speaker profiles, and multiple NAC types, built on the IndicSUPERB corpus. Our experiments demonstrate that state-of-the-art CF detectors trained on English-centric datasets fail to generalize to ICF, underscoring the challenges posed by phonetic diversity and prosodic variability in Indic speech. We further present a systematic zero-shot evaluation of SOTA ALMs on ICF, revealing consistently poor performance.


To address these limitations, we propose SATYAM, a novel hyperbolic ALM tailored for CF detection in Indic languages. SATYAM integrates semantic representations from Whisper and prosodic representations from TRILLsson via Bhattacharyya distance in hyperbolic space, followed by the same alignment between the fused speech representation and an input conditioning prompt. This dual-stage framework enables effective modeling of hierarchical relationships both within speech (semantic–prosodic) and across modalities (speech–text). Extensive evaluations show that SATYAM consistently outperforms competitive end-to-end and ALM-based baselines on the ICF benchmark.

Deepfake Detection · Neural Audio Codecs · Hyperbolic Geometry · Audio LLM · Indic Languages · Multilingual NLP · Speech Security

Key Contributions


Indic-CodecFake (ICF) Dataset

ICF is constructed by resynthesizing real speech from IndicSUPERB using 8 NAC families in a controlled encode–decode pipeline. Each real utterance x is passed through a NAC encoder E and decoder D to yield a CodecFake counterpart x̃ = D(E(x)), preserving linguistic content while introducing NAC-specific artifacts. The dataset spans 12 Indic languages across both Indo-European and Dravidian families. All codec models used to generate ICF are available at CodeVault-girish/Neural-Codecs.

Bengali (IE) · Gujarati (IE) · Hindi (IE) · Marathi (IE) · Odia (IE) · Punjabi (IE) · Sanskrit (IE) · Urdu (IE) · Kannada (DR) · Malayalam (DR) · Tamil (DR) · Telugu (DR)

IE = Indo-European · DR = Dravidian
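As a toy illustration of the encode–decode construction x̃ = D(E(x)), the sketch below uses an 8-bit μ-law quantizer as a stand-in codec (an assumption for brevity; ICF uses neural codecs such as DAC and Encodec). The resynthesized signal keeps the utterance's content while picking up small codec-specific quantization artifacts.

```python
import numpy as np

def mu_law_encode(x, mu=255):
    # E(x): companding + 8-bit quantization (toy stand-in for NAC tokens)
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.round((y + 1) / 2 * mu).astype(np.uint8)

def mu_law_decode(codes, mu=255):
    # D(codes): dequantize and expand back to a waveform
    y = codes.astype(np.float64) / mu * 2 - 1
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu

def resynthesize(x):
    # x_tilde = D(E(x)): same linguistic content, codec artifacts introduced
    return mu_law_decode(mu_law_encode(x))

sr = 16_000
t = np.arange(sr) / sr
x = 0.5 * np.sin(2 * np.pi * 220 * t)   # stand-in "real" utterance
x_tilde = resynthesize(x)
err = np.max(np.abs(x - x_tilde))       # artifact magnitude (small but nonzero)
```

The parallel structure (one fake per real utterance, per codec) is what makes ICF a controlled corpus: differences between x and x̃ are attributable to the codec alone.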


Neural Audio Codecs Used

Codec | Sample Rates | Description | Split
DAC | 16 / 24 / 44 kHz | Descript Audio Codec with multi-scale architecture. | Seen
Encodec | 24 / 48 kHz | Meta's streaming neural audio codec. | Seen
SNAC | 24 / 32 / 44 kHz | Multi-scale neural audio codec. | Seen
SoundStream | 16 kHz | Google's end-to-end neural audio codec. | Seen
SpeechTokenizer | 16 kHz | Hierarchical speech tokenizer for LLMs. | Seen
FunCodec | 16 kHz | Alibaba's open-source neural speech codec toolkit. | Unseen
AudioDec | 24 / 48 kHz | High-fidelity streaming audio decoder. | Unseen
MIMI | 24 kHz | Moshi's neural audio codec (Kyutai). | Unseen

SEEN NACs appear in training (test-known split)  ·  UNSEEN NACs held out for cross-codec generalization (test-unknown split)


SATYAM

SATYAM Framework Architecture
Figure 1: SATYAM Framework Overview. Whisper and TRILLsson encode semantic and paralinguistic representations respectively. Both are projected into hyperbolic space and aligned via Bhattacharyya distance (H-BD), fused with Möbius addition, then aligned again with a conditioning prompt from the frozen Qwen2-7B decoder to produce the final "Real" / "Fake" verdict.

SATYAM is a supervised hyperbolic ALM that formulates CF detection as a conditional generation task. Given an input speech utterance, SATYAM extracts complementary semantic and paralinguistic representations, fuses them in hyperbolic space via Bhattacharyya distance alignment, and conditions a frozen LLM decoder to generate a one-word verdict: "Real" or "Fake". Total trainable parameters: ~3.75M.

1

Dual Audio Encoding

Two complementary encoders extract representations from the input speech waveform: Whisper provides semantic representations (ew) capturing linguistic content, while TRILLsson provides paralinguistic representations (et) capturing prosodic and acoustic cues. Each branch passes through a 1D-CNN block (filter size 3) and max-pooling, then projects into a shared Euclidean space of dimension d. A sigmoid gating module filters salient information before the hyperbolic mapping stage.
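A minimal sketch of one encoder branch is below; the frame shapes and the shared projection matrix are hypothetical, and the 1D-CNN (filter size 3) is omitted for brevity, keeping only the max-pool → projection → sigmoid-gate flow described above.

```python
import numpy as np

def branch(feats, W_proj):
    # feats: (T, F) frame-level encoder outputs (Whisper or TRILLsson, toy shapes)
    pooled = feats.max(axis=0)            # max-pool over time -> (F,)
    z = pooled @ W_proj                   # project into shared d-dim Euclidean space
    gate = 1.0 / (1.0 + np.exp(-z))       # sigmoid gate filters salient dimensions
    return gate * z

T, F, d = 50, 256, 128                    # assumed toy dimensions
rng = np.random.default_rng(0)
W = rng.normal(scale=0.05, size=(F, d))
e_w = branch(rng.normal(size=(T, F)), W)  # semantic branch output
e_t = branch(rng.normal(size=(T, F)), W)  # paralinguistic branch output
```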

2

Hyperbolic Projection

Both representations are mapped into a d-dimensional hyperbolic space H_c^d (curvature −c) via the exponential map at the origin:

exp_0^c(u) = tanh(√c ‖u‖) · u / (√c ‖u‖)

This yields hyperbolic representations h_w and h_t on the Poincaré ball.
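The exponential map above can be written directly in NumPy; a defining property is that every mapped point lands strictly inside the Poincaré ball of radius 1/√c (toy batch shapes assumed):

```python
import numpy as np

def expmap0(u, c=1.0, eps=1e-9):
    # exp_0^c(u) = tanh(sqrt(c) * ||u||) * u / (sqrt(c) * ||u||)
    norm = np.maximum(np.linalg.norm(u, axis=-1, keepdims=True), eps)
    return np.tanh(np.sqrt(c) * norm) * u / (np.sqrt(c) * norm)

rng = np.random.default_rng(0)
u = rng.normal(size=(4, 128))            # gated Euclidean features (toy stand-in)
h = expmap0(u, c=1.0)
radii = np.linalg.norm(h, axis=-1)       # all < 1/sqrt(c) = 1, i.e. inside the ball
```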

3

Hyperbolic Speech–Speech Alignment (H-BD-SS)

We minimize the Bhattacharyya distance (BD) between the semantic and paralinguistic hyperbolic distributions to align them. For distributions P and Q on H_c^d:

D_B(P, Q) = −log ∫ √(p(h)·q(h)) dμ_c(h)

This yields the speech–speech alignment loss L_S-S = D_B(h_w, h_t). The aligned representations are then fused using Möbius addition (⊕_c), which preserves hyperbolic geometry: h_f = h_w ⊕_c h_t.
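The two operations in this step can be sketched as follows. Möbius addition is the standard gyrovector formula on the Poincaré ball; for the distance, the closed-form BD between diagonal Gaussians is shown as a Euclidean illustration only (the paper's H-BD is taken over measures on the manifold, which this sketch does not reproduce).

```python
import numpy as np

def mobius_add(u, v, c=1.0):
    # u (+)_c v : geometry-preserving fusion on the Poincare ball
    uv = np.sum(u * v, axis=-1, keepdims=True)
    u2 = np.sum(u * u, axis=-1, keepdims=True)
    v2 = np.sum(v * v, axis=-1, keepdims=True)
    num = (1 + 2 * c * uv + c * v2) * u + (1 - c * u2) * v
    den = 1 + 2 * c * uv + c ** 2 * u2 * v2
    return num / den

def bhattacharyya_gaussian(mu1, var1, mu2, var2):
    # Closed-form BD between diagonal Gaussians (Euclidean illustration)
    var = 0.5 * (var1 + var2)
    term1 = 0.125 * np.sum((mu1 - mu2) ** 2 / var)
    term2 = 0.5 * np.sum(np.log(var / np.sqrt(var1 * var2)))
    return term1 + term2

mu, var = np.zeros(8), np.ones(8)
d0 = bhattacharyya_gaussian(mu, var, mu, var)   # identical distributions -> 0
hw = np.full(8, 0.1)
ht = np.full(8, -0.05)
hf = mobius_add(hw, ht, c=1.0)                  # fused point stays inside the ball
```

Note that Möbius addition reduces to ordinary vector addition as c → 0, and adding the origin leaves a point unchanged, which is why it is the natural fusion operator here.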

4

Hyperbolic Speech–Prompt Alignment (H-BD-ST)

A conditioning prompt "Analyze the speech for unnatural artifacts" is fed to Qwen2-7B. Hidden states from an intermediate transformer layer are mean-pooled to obtain a prompt representation e_A, projected to the shared space, then mapped to hyperbolic space. The same BD alignment is applied between the fused speech and prompt: L_S-T = D_B(h_f, h_A). Final aggregation: h_final = h_f ⊕_c h_A.

5

Frozen LLM Decoding

The aggregated hyperbolic representation is mapped back to Euclidean space via the logarithmic map, linearly projected to the decoder's embedding dimension g, and injected as prefix conditioning tokens into the frozen Qwen2-7B decoder. A decision prompt "Determine whether the speech is real or fake. Answer only in one word: 'Real' or 'Fake'" drives generation of the output sequence Y. Training minimizes:

L = λ₁·L_S-S + λ₂·L_S-T + λ₃·L_LM  (λ₁ = 1, λ₂ = 0.5, λ₃ = 1)
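The logarithmic map used in this step is the exact inverse of the exponential map from step 2, and the training objective is a plain weighted sum; both are sketched below with the paper's weights (loss values are placeholders):

```python
import numpy as np

def expmap0(u, c=1.0, eps=1e-9):
    n = np.maximum(np.linalg.norm(u, axis=-1, keepdims=True), eps)
    return np.tanh(np.sqrt(c) * n) * u / (np.sqrt(c) * n)

def logmap0(h, c=1.0, eps=1e-9):
    # log_0^c(h) = artanh(sqrt(c) * ||h||) * h / (sqrt(c) * ||h||)
    n = np.maximum(np.linalg.norm(h, axis=-1, keepdims=True), eps)
    return np.arctanh(np.sqrt(c) * n) * h / (np.sqrt(c) * n)

def total_loss(l_ss, l_st, l_lm, lam=(1.0, 0.5, 1.0)):
    # L = lam1 * L_S-S + lam2 * L_S-T + lam3 * L_LM
    return lam[0] * l_ss + lam[1] * l_st + lam[2] * l_lm

u = np.array([0.3, -0.2, 0.1])
roundtrip = logmap0(expmap0(u))   # recovers u: log_0 is the inverse of exp_0
```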
Semantic Encoder

Whisper

OpenAI's speech recognition model; captures linguistic and semantic content from speech.

Prosodic Encoder

TRILLsson

Google's paralinguistic model; captures prosodic and acoustic non-semantic cues — key for deepfake detection.

Geometry

Poincaré Ball (H_c^d)

Hyperbolic space with curvature −c. Naturally embeds the hierarchical structure of speech semantics and artifacts.

Alignment

Bhattacharyya Distance

Extended to hyperbolic space for speech–speech and speech–prompt distribution alignment. Novel contribution.

Fusion

Möbius Addition

Geometry-preserving operation in hyperbolic space. Fuses semantic + prosodic, and speech + prompt representations.

LLM Decoder

Qwen2-7B (frozen)

Prefix-conditioned generation with ~3.75M trainable parameters. Lightweight variant with Qwen2-1.8B also evaluated.


Results

Table 1: In-domain Training and Evaluation on ICF and CodecFake

SATYAM achieves 98.32% ACC / 3.27% EER on ICF and 99.11% ACC / 1.94% EER on CodecFake, outperforming all end-to-end and ALM-based baselines by substantial margins. W = Whisper, T = TRILLsson.

Method | ICF ACC ↑ | ICF EER ↓ | CodecFake ACC ↑ | CodecFake EER ↓
Zero-shot ALM Evaluation
Pengi | 3.19 | 98.26 | 5.68 | 94.97
Audio Flamingo 2 | 5.42 | 97.68 | 8.41 | 92.10
Audio Flamingo 3 | 6.98 | 97.21 | 10.22 | 90.85
Qwen-audio-chat | 10.63 | 89.71 | 13.00 | 86.61
Qwen-audio-base | 11.17 | 89.23 | 15.82 | 85.74
Qwen2-audio-chat | 12.05 | 88.95 | 16.74 | 82.33
Qwen2-audio-base | 13.41 | 88.57 | 17.91 | 81.26
End-to-End & Pre-Trained Backbone
AASIST | 90.60 | 12.47 | 94.21 | 10.13
Whisper-LCNN | 91.98 | 11.89 | 93.38 | 7.92
Wav2vec2-AASIST | 92.50 | 9.62 | 94.45 | 7.29
MiO | 92.80 | 9.04 | 95.11 | 6.49
SATYAM Ablations (Ours)
W + Qwen2-7B | 92.98 | 8.61 | 94.64 | 6.02
T + Qwen2-7B | 93.21 | 8.09 | 95.10 | 5.83
W + T + Qwen2-7B (Concat) | 93.28 | 7.94 | 95.75 | 4.39
W + T + Qwen2-7B (Möbius) | 94.01 | 7.02 | 95.31 | 4.07
W + T + Qwen2-7B (E-BD) | 94.93 | 5.39 | 96.47 | 3.68
W + T + Qwen2-7B (H-BD-ST only) | 95.78 | 5.14 | 97.22 | 2.69
W + T + Qwen2-7B (H-BD-SS only) | 96.11 | 5.02 | 97.34 | 2.42
SATYAM with Qwen2-1.8B | 97.14 | 4.53 | 98.32 | 2.11
SATYAM (Qwen2-7B) | 98.32 | 3.27 | 99.11 | 1.94

Scores in %. The ablations confirm that the H-BD-SS and H-BD-ST stages are complementary: removing either hurts performance.


Table 2: Generalization — Cross-Benchmark Transfer

SATYAM remains robust under cross-benchmark distribution shift. AASIST degrades severely when transferred to ICF, while SATYAM maintains low EER in both transfer directions.

Transfer Setting | AASIST EER ↓ | SATYAM EER ↓
CodecFake → ICF (zero-shot) | 40.32 | 7.43
ICF → CodecFake | 29.81 | 3.79
Cross-lingual (random 6-lang train → held-out lang) | 26.74 / 31.11 | 6.34 / 7.09
Dravidian → Indo-European | 33.45 | 7.78
Indo-European → Dravidian | 38.73 | 8.48
Unseen codecs, clean (test-unknown) | 14.38 | 5.23
Unseen codecs, noisy (test-unknown) | 16.29 | 7.41

EER in %. SATYAM generalizes across codec families, language families, and acoustic conditions — demonstrating robustness far beyond English-centric baselines.
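EER, the metric reported throughout, is the operating point where false-acceptance and false-rejection rates coincide. A minimal reference computation is below (the score convention, higher = more likely fake, is an assumption; the synthetic scores are illustrative, not from the paper):

```python
import numpy as np

def compute_eer(scores_real, scores_fake):
    # Sweep thresholds; return the error rate where FAR ~= FRR.
    thresholds = np.sort(np.concatenate([scores_real, scores_fake]))
    best_gap, eer = 1.0, 0.0
    for t in thresholds:
        far = np.mean(scores_real >= t)   # real speech accepted as fake
        frr = np.mean(scores_fake < t)    # fake speech rejected as real
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), 0.5 * (far + frr)
    return eer

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, 1000)   # detector scores for real utterances
fake = rng.normal(3.0, 1.0, 1000)   # detector scores for CodecFakes
eer = compute_eer(real, fake)       # well-separated scores -> low EER
```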


Audio Samples

Ground truth (GT) samples are randomly drawn from the IndicSUPERB corpus. Each row shows a real utterance alongside its NAC-resynthesized CodecFake counterpart for a given language.

Language Codec GT Speech Generated (Fake) Speech

Citation

If you find this work useful, please cite:

@inproceedings{indiccodecfake-acl2026,
  title     = {Indic-CodecFake meets {SATYAM}: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages},
  author    = {Girish and Mohd Mujtaba Akhtar and Orchid Chetia Phukan and Arun Balaji Buduru},
  booktitle = {Findings of the Association for Computational Linguistics: ACL 2026},
  year      = {2026},
  pages     = {TBD},
}