IJCAI 2026 · Accepted

Bridging the SEA Gap: An Initial Benchmark for Neural Audio Codec-Synthesized Speech Deepfakes in South-East Asian Languages

Orchid Chetia Phukan*1    Girish*1,2    Mohd Mujtaba Akhtar*1,3    Arun Balaji Buduru1
*Equal Contribution  ·  1IIIT-Delhi  ·  2UPES  ·  3VBSPU
Figure: An English-trained detector is shown two codec-fakes. For an English codec-fake (fake audio synthesized via a neural codec such as SNAC, EnCodec, or DAC) it predicts FAKE, which is correct: the model generalizes on English. For a Vietnamese codec-fake (the same codec synthesis applied to SEA-language speech) it predicts REAL, which is wrong: the model fails on SEA languages.

The SEA gap: State-of-the-art codec-fake detectors trained on English collapse when applied to South-East Asian speech (70.65% ACC on SEA vs. 94.08% on English), driven by language-specific phonetics, tonal structures, and prosodic diversity. We introduce SEA-CF — the first large-scale benchmark for codec-fake detection across SEA languages — and GARUDA, a lightweight Small-ALM that closes this gap.


Abstract

Codecfakes (CFs) are speech deepfakes generated by Audio Language Models (ALMs) whose core mechanism is Neural Audio Codecs (NACs). CFs exhibit distributional characteristics that differ fundamentally from vocoder-based deepfakes, causing state-of-the-art detectors to generalize poorly when crossing codec boundaries or language boundaries. Existing CF detection benchmarks remain largely confined to English (and to a limited extent Chinese), leaving South-East Asian (SEA) languages entirely unexplored.


To bridge this gap, we introduce SEA-CF, the first large-scale benchmark for CF detection spanning multiple SEA languages (Tamil, Hindi, Thai, Indonesian, Malay, Vietnamese), diverse speaker profiles, and a wide range of state-of-the-art NAC architectures. Our experiments confirm that SOTA CF detectors trained on English-centric data fail catastrophically on SEA speech, with accuracy dropping from 94.08% to 70.65%. Joint training with SEA-CF restores performance, underscoring the necessity of in-domain data.


We further perform comprehensive zero-shot and fine-tuned evaluations of recent SOTA ALMs on SEA-CF. While fine-tuning helps, large ALMs remain impractical in latency-constrained, low-resource settings. To address this, we propose GARUDA, a novel Small-ALM tailored for CF detection. GARUDA fuses complementary semantic (Whisper) and prosodic (x-vector) representations with a Jensen–Shannon alignment objective, and decodes using a lightweight Qwen2-0.5B language model. Extensive evaluations demonstrate that GARUDA achieves state-of-the-art performance across both SEA-CF and prior CF benchmarks while remaining under 1B parameters and offering 10× faster inference than large ALMs.

Speech Deepfake Detection · Codec Fake · Neural Audio Codec · South-East Asian Languages · Audio Language Model · Small ALM · Low-Resource · Benchmark

Key Contributions

6 SEA Languages (Tamil · Hindi · Thai · Indonesian · Malay · Vietnamese)
8 Neural Audio Codec Architectures
<1B GARUDA Parameters (vs. 7B+ ALMs)
10× Faster Inference than large ALMs
98.41% GARUDA-FT Accuracy on SEA-CF (seen codecs)

SEA-CF Dataset

SEA-CF is constructed via a controlled resynthesis pipeline: real speech utterances are encoded into a discrete latent representation by a NAC encoder and reconstructed by the corresponding decoder, introducing codec-specific artifacts while preserving linguistic content and speaker identity. Each real utterance has a parallel codec-fake counterpart for every NAC configuration.
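As a rough illustration of the pairing logic described above, the sketch below builds one real sample plus one fake counterpart per codec configuration. The `toy_codec` quantizer is only a stand-in for a real NAC encoder/decoder, and `Sample`, `build_pairs`, and the utterance ID are hypothetical names, not part of the SEA-CF tooling.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    utt_id: str
    codec: str    # "real" for the original, otherwise the codec name
    label: str    # "real" or "fake"
    wave: tuple   # waveform samples in [-1, 1]


def toy_codec(levels):
    """Return (encode, decode) for a uniform quantizer with `levels` codes.

    Assumption: this only mimics the encode -> discrete codes -> decode
    round trip of a NAC; it is not a neural codec.
    """
    def encode(wave):
        return [min(levels - 1, int((x + 1.0) / 2.0 * levels)) for x in wave]

    def decode(codes):
        return [((c + 0.5) / levels) * 2.0 - 1.0 for c in codes]

    return encode, decode


def build_pairs(real_utts, codecs):
    """Emit each real utterance and one resynthesized fake per codec."""
    samples = []
    for utt_id, wave in real_utts.items():
        samples.append(Sample(utt_id, "real", "real", tuple(wave)))
        for name, (encode, decode) in codecs.items():
            recon = decode(encode(wave))
            samples.append(Sample(utt_id, name, "fake", tuple(recon)))
    return samples


real_utts = {"utt_0001": [math.sin(2 * math.pi * 5 * t / 100) for t in range(100)]}
codecs = {"toy_16_levels": toy_codec(16), "toy_256_levels": toy_codec(256)}
samples = build_pairs(real_utts, codecs)
```

Because every fake shares its `utt_id` with a real sample, linguistic content and speaker are held fixed and only the codec artifacts differ between the two classes.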

Real-Speech Source Corpora

Tamil & Hindi: Indic-SUPERB. One of the largest publicly available datasets for Indian languages; its predefined train/val/test-known/test-unknown splits are used directly.

Vietnamese: VIVOS Corpus. Standard Vietnamese speech corpus. Audio samples on this page are drawn from VIVOS; real vs. codec-synthesized pairs are listed below.

Indonesian: GigaSpeech 2. Large-scale, multi-domain ASR corpus covering low-resource languages including Indonesian.

Thai: Thai Dialect Corpus. Thai dialect speech corpus used for spoken language research and ASR development.

Malay: Malay Conv. + YouTube. Combination of the Conversational Malay Speech Corpus and a curated Malaysian YouTube dataset transcribed with Whisper-Large.

All Languages: Mozilla Common Voice. Massively multilingual open-source speech corpus providing supplementary coverage across all SEA languages.


Neural Audio Codecs (NACs)

Codec-fake samples are generated using 8 NAC architectures across multiple sampling-rate configurations. Seen codecs are used for both training and evaluation; unseen codecs are held out of training and used only to evaluate generalization.

Seen (train & eval): DAC 16/24/44 kHz · EnCodec 24/48 kHz · SoundStream 16 kHz · SpeechTokenizer 16 kHz · SNAC 24/32/44 kHz
Unseen (eval only, for generalization testing): FunCodec 16 kHz · AudioDec 28/48 kHz · MIMI 24 kHz
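One way to keep the seen/unseen protocol explicit in code is a small registry like the sketch below. The codec names and sampling rates are copied from this page, and the seen/unseen assignment follows the codec lists given with Tables 1 and 2; the dictionary layout and the `configs_for` helper are hypothetical.

```python
# Codec name -> sampling rates (kHz) and whether the codec is seen in training.
NAC_CONFIGS = {
    "DAC": {"rates_khz": (16, 24, 44), "seen": True},
    "EnCodec": {"rates_khz": (24, 48), "seen": True},
    "SoundStream": {"rates_khz": (16,), "seen": True},
    "SpeechTokenizer": {"rates_khz": (16,), "seen": True},
    "SNAC": {"rates_khz": (24, 32, 44), "seen": True},
    "FunCodec": {"rates_khz": (16,), "seen": False},
    "AudioDec": {"rates_khz": (28, 48), "seen": False},
    "MIMI": {"rates_khz": (24,), "seen": False},
}


def configs_for(split):
    """Return (codec, rate_khz) pairs allowed in a split.

    split: "train" or "eval_seen" -> seen codecs only;
           "eval_unseen"          -> held-out codecs only.
    """
    want_seen = split in ("train", "eval_seen")
    return [
        (name, rate)
        for name, cfg in NAC_CONFIGS.items()
        if cfg["seen"] == want_seen
        for rate in cfg["rates_khz"]
    ]


train_configs = configs_for("train")
unseen_configs = configs_for("eval_unseen")
```

Keeping the flag next to each codec makes it harder to leak a held-out codec into training by accident.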


GARUDA Framework

GARUDA (Generative Audio Reasoning Under Dual-encoder Alignment) is a novel Small-ALM for codec-fake detection. It adopts a dual-encoder design that captures complementary semantic and prosodic cues, fuses them via a Jensen–Shannon alignment objective, and decodes using a lightweight language model.

Figure 1: GARUDA Architecture. Given a speech utterance, Whisper (semantic) and x-vector (prosodic) encoders extract complementary 512-dim representations. Each branch passes through a 1D-CNN + max-pooling module and sigmoid gating, and the two branches are aligned via the JS divergence loss ($\mathcal{L}_{\mathrm{JS}}$). The fused representation is projected into the embedding space of the Qwen2-0.5B language model decoder, which generates "fake" or "real" in response to the classification prompt. The model is trained with $\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{LM}} + \lambda \cdot \mathcal{L}_{\mathrm{JS}}$.

1. Whisper Encoder: Semantic Representation

The frozen Whisper-base encoder (74M parameters, 96-language pretraining) processes the input waveform and produces a 512-dimensional semantic embedding via average pooling. Its ASR-oriented pretraining makes it highly effective at encoding linguistic content across multiple languages.

2. x-vector Encoder: Prosodic Representation

A frozen x-vector TDNN model (4.2M parameters, VoxCeleb1+2 trained) extracts a 512-dimensional prosodic embedding via average pooling. Its speaker-recognition pretraining captures pitch, tone, and speaking-rate cues that complement Whisper's semantic focus.

3. 1D-CNN + Sigmoid Gating

Each encoder branch passes through a 1D convolutional layer (kernel size 3) followed by max-pooling and flattening. A sigmoid gating module selectively filters the representations, passing only the most informative features to the fusion stage.
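The branch described above can be sketched in plain Python as follows. This is a minimal single-channel illustration: the pooling size of 2, the elementwise form of the gate, and all weights are assumptions made for the example, since the page only specifies the kernel size, max-pooling, and sigmoid gating.

```python
import math


def conv1d(x, kernel, bias=0.0):
    """Valid 1D convolution (cross-correlation) of a sequence with a kernel."""
    k = len(kernel)
    return [
        sum(x[i + j] * kernel[j] for j in range(k)) + bias
        for i in range(len(x) - k + 1)
    ]


def max_pool1d(x, size=2):
    """Non-overlapping max-pooling; a trailing partial window is dropped."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_branch(embedding, kernel, gate_w, gate_b):
    """conv (kernel size 3) -> max-pool -> elementwise sigmoid gate."""
    h = max_pool1d(conv1d(embedding, kernel))
    gates = [sigmoid(w * v + b) for w, v, b in zip(gate_w, h, gate_b)]
    return [g * v for g, v in zip(gates, h)]


# Toy 8-dim "embedding"; the real encoders produce 512-dim vectors.
embedding = [0.1 * i for i in range(8)]
out = gated_branch(embedding, kernel=[1.0, 0.0, -1.0],
                   gate_w=[1.0, 1.0, 1.0], gate_b=[0.0, 0.0, 0.0])
```

Since each gate lies strictly between 0 and 1, the gate can only attenuate a feature, never amplify it, which is what lets it act as a soft filter before fusion.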

4. Jensen–Shannon Alignment Loss

Before fusion, the two feature vectors are converted to distributions via temperature-scaled softmax (τ = 0.5). The JS divergence loss $\mathcal{L}_{\mathrm{JS}}$ minimizes distributional discrepancy between the heterogeneous encoders, promoting information-level consistency prior to concatenation.

$$\mathcal{L}_{\mathrm{JS}} = \tfrac{1}{2}\,\mathrm{KL}(p_x \,\|\, m) + \tfrac{1}{2}\,\mathrm{KL}(p_y \,\|\, m), \qquad m = \tfrac{1}{2}(p_x + p_y)$$
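The temperature-scaled softmax and JS divergence above can be written out directly. The following is a minimal standard-library sketch of the formula, not the authors' training code; the function names and the toy input vectors are illustrative.

```python
import math


def softmax(v, tau=0.5):
    """Temperature-scaled softmax; subtracting the max keeps exp() stable."""
    m = max(v)
    exps = [math.exp((x - m) / tau) for x in v]
    s = sum(exps)
    return [e / s for e in exps]


def kl(p, q):
    """KL(p || q) in nats; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js_loss(x, y, tau=0.5):
    """JS divergence between softmax(x / tau) and softmax(y / tau)."""
    px, py = softmax(x, tau), softmax(y, tau)
    m = [0.5 * (a + b) for a, b in zip(px, py)]
    return 0.5 * kl(px, m) + 0.5 * kl(py, m)


x = [1.0, 2.0, 3.0]   # toy stand-in for the gated semantic features
y = [3.0, 2.0, 1.0]   # toy stand-in for the gated prosodic features
loss_js = js_loss(x, y)
```

Unlike plain KL, the JS divergence is symmetric and bounded above by ln 2, so neither encoder is treated as the reference distribution and the loss cannot blow up when the two branches disagree strongly.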

5. Language Model Decoder (Qwen2-0.5B)

Aligned features are concatenated, projected through a 90-neuron FC layer, and injected as a continuous prefix into the Qwen2-0.5B decoder. The model is prompted: "Is the speech sample fake or real? Reply in one word 'fake' or 'real'." The LM objective minimizes negative log-likelihood of the target token. GARUDA-FT additionally fine-tunes the decoder with LoRA (rank=8, α=32, on Q and V projections).

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{LM}} + \lambda \cdot \mathcal{L}_{\mathrm{JS}}, \qquad \lambda = 0.4$$
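To make the combined objective concrete, the sketch below computes the negative log-likelihood of the target token from a logit vector and adds the weighted alignment term. The two-token vocabulary, the logit values, and the JS value are hypothetical; in the real model the logits span the full Qwen2 vocabulary.

```python
import math


def token_nll(logits, target_index):
    """Negative log-likelihood of the target token under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target_index]


def total_loss(logits, target_index, js_value, lam=0.4):
    """L_total = L_LM + lambda * L_JS, with lambda = 0.4 as stated above."""
    return token_nll(logits, target_index) + lam * js_value


# Hypothetical two-token vocabulary: index 0 = "fake", index 1 = "real".
loss = total_loss([2.0, 0.0], target_index=0, js_value=0.05)
```

With λ = 0.4 the alignment term acts as a regularizer: the language-modeling loss still dominates, but the encoders are nudged toward consistent distributions.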
Encoder 1: Whisper-Base. 74M params, frozen. 512-dim semantic embedding. Trained on 96 languages for ASR; captures linguistic structure.

Encoder 2: x-vector TDNN. 4.2M params, frozen. 512-dim prosodic embedding. VoxCeleb-trained for speaker ID; captures pitch and tonal cues.

Alignment: JS Divergence Loss. No added parameters. Encourages feature-space consistency between heterogeneous encoders before fusion (τ=0.5).

Decoder: Qwen2-0.5B. Smallest Qwen2 LM. Kept frozen in GARUDA; LoRA-finetuned in GARUDA-FT. Reduces hallucination and latency.

Fine-tuning: LoRA. Rank=8, α=32. Applied to Q and V projection layers. Enables efficient decoder adaptation without full-model cost.

Total Size: <1B Parameters. Whisper (74M) + x-vector (4.2M) + Projection + LoRA + Qwen2-0.5B stays under 1B parameters in total, roughly 10× smaller than comparable ALMs.
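For readers unfamiliar with LoRA, the sketch below shows the standard low-rank update with the rank and α quoted above. It assumes the usual formulation W' = W + (α / r) · B · A; the page itself only states the rank, α, and the target layers, and the matrix sizes used here are toy or hypothetical values.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, a, b, rank=8, alpha=32):
    """Effective weight W' = W + (alpha / rank) * B @ A.

    w: d_out x d_in frozen weight, b: d_out x rank, a: rank x d_in.
    """
    scale = alpha / rank
    delta = matmul(b, a)
    return [
        [wij + scale * dij for wij, dij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def lora_param_count(d_in, d_out, rank=8):
    """Trainable parameters added to one adapted projection."""
    return rank * (d_in + d_out)


# Toy rank-1 example with alpha = 2, so the scale alpha / rank is 2.
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 1.0]]        # 1 x 2
b = [[1.0], [0.0]]      # 2 x 1
w_adapted = lora_weight(w, a, b, rank=1, alpha=2)

# Hypothetical 1024 x 1024 projection adapted with rank 8.
extra = lora_param_count(1024, 1024)
```

With rank 8 and α = 32 the update is scaled by 4, and each adapted projection adds only `rank * (d_in + d_out)` trainable parameters while the original weight stays frozen.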


Audio Samples

Real Vietnamese speech (VIVOS corpus, speaker VIVOSDEV01) paired with its codec-fake counterpart synthesized using SNAC 24 kHz. Each fake is generated by encoding the real waveform with the SNAC encoder and decoding it back, which preserves linguistic content while introducing codec-specific artifacts.

Codec: SNAC 24 kHz (hubertsiuzdak/snac_24khz)  ·  Language: Vietnamese  ·  Corpus: VIVOS

Samples (speaker VIVOSDEV01), each pairing a REAL original recording with a FAKE codec-synthesized version (SNAC 24 kHz resynthesis): R002, R003, R012, R027, R028.

Results

Models are trained on the combined SEA-CF + CodecFake [Lu et al., 2024] training set and evaluated on each benchmark individually. Metrics: Accuracy (ACC ↑) and Equal Error Rate (EER ↓, lower is better).
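For reference, the two metrics can be computed as in the sketch below. It assumes each utterance comes with a continuous "fakeness" score (for example, the probability assigned to the "fake" token); how scores are derived from the models is not specified on this page, and the threshold sweep here is a simple approximation of the EER operating point.

```python
def accuracy(labels, preds):
    """Fraction of utterances whose predicted label matches the reference."""
    return sum(int(l == p) for l, p in zip(labels, preds)) / len(labels)


def equal_error_rate(labels, scores):
    """Approximate EER.

    labels: 1 = fake (positive), 0 = real. scores: higher = more likely fake.
    Sweeps thresholds over the observed scores and returns the mean of the
    false-positive and false-negative rates where the two are closest.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    best_gap, eer = float("inf"), 1.0
    for t in sorted(set(scores)) + [float("inf")]:
        fpr = sum(s >= t for s in neg) / len(neg)   # real flagged as fake
        fnr = sum(s < t for s in pos) / len(pos)    # fake passed as real
        gap = abs(fpr - fnr)
        if gap < best_gap:
            best_gap, eer = gap, (fpr + fnr) / 2
    return eer


# Toy scores: the first set is perfectly separable, the second is not.
eer_clean = equal_error_rate([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
eer_mixed = equal_error_rate([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1])
```

ACC depends on a fixed decision, whereas EER summarizes the trade-off between the two error types across thresholds, which is why both are reported.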

Cross-lingual transfer failure: AASIST trained on CodecFake [Lu et al., 2024] (English + Chinese) and evaluated directly on SEA-CF without any SEA-specific training achieves only 70.65% ACC / 28.13% EER on SEA-CF, compared to 94.08% ACC / 6.76% EER on its in-domain test set — a severe degradation that motivates SEA-CF and GARUDA.

Table 1: Seen Codec Evaluation

Models evaluated on seen NACs (SNAC, DAC, EnCodec, SoundStream, SpeechTokenizer). GARUDA-FT achieves SOTA on both SEA-CF and the prior CodecFake benchmark.

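| Method | SEA-CF ACC ↑ | SEA-CF EER ↓ | CodecFake ACC ↑ | CodecFake EER ↓ | Average ACC ↑ | Average EER ↓ |
|---|---|---|---|---|---|---|
| Zero-shot ALMs (no training) | | | | | | |
| Qwen-Audio-Chat | 3.72 | 94.39 | 14.10 | 85.80 | 8.91 | 90.10 |
| Qwen2-Audio-Base | 8.41 | 91.53 | 19.48 | 80.47 | 13.95 | 86.00 |
| SeaLLMs-Audio-7B | 6.23 | 91.64 | 18.35 | 80.25 | 12.29 | 85.95 |
| End-to-End & PTM Baselines (in-domain training) | | | | | | |
| AASIST | 86.98 | 15.74 | 93.09 | 8.16 | 90.04 | 11.95 |
| Wh-LCNN | 87.69 | 15.22 | 94.41 | 7.63 | 91.05 | 11.43 |
| Wav2vec2-AASIST | 88.71 | 13.01 | 95.16 | 7.08 | 91.94 | 10.04 |
| MiO | 88.76 | 12.51 | 95.64 | 6.37 | 92.20 | 9.44 |
| Fine-tuned ALMs | | | | | | |
| SeaLLMs-Audio-7B (FT) | 88.74 | 9.64 | 90.75 | 6.96 | 89.75 | 8.30 |
| Qwen2-Audio-Base (FT) | 93.88 | 6.95 | 95.06 | 4.21 | 94.47 | 5.58 |
| GARUDA (ours) | | | | | | |
| GARUDA | 94.37 | 6.26 | 97.00 | 4.19 | 95.69 | 5.23 |
| GARUDA-FT | 98.41 | 2.78 | 99.36 | 1.68 | 98.89 | 2.23 |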

Zero-shot ALMs stay below 20% ACC on both benchmarks, including the SEA-specialized SeaLLMs-Audio-7B.


Table 2: Unseen Codec Evaluation (Generalization)

Models evaluated on held-out NACs not seen during training (FunCodec, AudioDec, MIMI). GARUDA-FT maintains strong cross-codec generalization.

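| Method | SEA-CF ACC ↑ | SEA-CF EER ↓ | CodecFake ACC ↑ | CodecFake EER ↓ | Average ACC ↑ | Average EER ↓ |
|---|---|---|---|---|---|---|
| MiO | 85.55 | 13.91 | 93.44 | 7.77 | 88.50 | 10.84 |
| Qwen2-Audio-Base (FT) | 92.08 | 8.15 | 93.26 | 5.42 | 92.67 | 6.79 |
| GARUDA | 92.97 | 6.88 | 94.60 | 5.71 | 93.79 | 6.30 |
| GARUDA-FT | 97.11 | 3.17 | 98.06 | 2.23 | 97.59 | 2.70 |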

GARUDA-FT remains the strongest model even on codecs not encountered during training.


Table 3: Statistical Significance (McNemar's Test)

All improvements of GARUDA-FT over the strongest baselines are statistically significant (p < 0.05).

| Comparison | Dataset | Test | p-value | Conclusion |
|---|---|---|---|---|
| GARUDA-FT vs MiO | SEA-CF | McNemar | 0.0047 | Significant |
| GARUDA-FT vs Qwen2-Audio-Base-FT | SEA-CF | McNemar | 0.00021 | Significant |
| GARUDA-FT vs MiO | CodecFake | McNemar | 0.0003 | Significant |
| GARUDA-FT vs Qwen2-Audio-Base-FT | CodecFake | McNemar | 0.00019 | Significant |

All p-values well below 0.05, confirming GARUDA-FT's gains are not due to random variation.
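McNemar's test compares two classifiers on the same test items using only the discordant pairs. The sketch below implements the exact (binomial) two-sided variant; whether the reported p-values come from the exact or the chi-square form is not stated here, and the counts in the example are hypothetical.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: items model A classifies correctly and model B misclassifies.
    c: items model B classifies correctly and model A misclassifies.
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical discordant counts, not taken from the paper.
p_value = mcnemar_exact(b=30, c=10)
significant = p_value < 0.05
```

Items that both models get right or both get wrong carry no information about which model is better, so they do not enter the statistic.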


Efficiency Analysis

A core motivation for GARUDA is real-world deployability. Large ALMs achieve strong performance when fine-tuned, but their size and latency make them impractical in low-resource, latency-constrained scenarios. GARUDA is designed to close this gap.

<1B: GARUDA total parameters (Whisper + x-vector + Projection + LoRA + Qwen2-0.5B)
7B+: Qwen2-Audio / SeaLLMs parameter count
1.21 s: GARUDA-FT average inference time per batch
12.32 s: Qwen2-Audio-Base (FT) average inference time

JS divergence adds zero parameters on top of simple concatenation-based fusion — it is a loss function with no trainable weights, making it a free alignment benefit. LoRA keeps decoder fine-tuning cost-efficient while substantially improving performance and reducing hallucination.
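The headline efficiency figures follow from simple arithmetic on the numbers quoted on this page. In the sketch below, the 500M figure for the decoder is the nominal size implied by the "0.5B" name (an assumption), and the projection layer and LoRA adapters are left out because their sizes are not stated.

```python
# Parameter counts in millions. The Whisper and x-vector values are quoted on
# this page; the decoder value is the nominal "0.5B" and is an assumption.
PARAMS_M = {
    "whisper_base": 74.0,
    "x_vector_tdnn": 4.2,
    "qwen2_0_5b_nominal": 500.0,
}

# Excludes the projection layer and LoRA adapters (sizes not stated here).
backbone_total_m = sum(PARAMS_M.values())
under_one_billion = backbone_total_m < 1000.0

# Average inference time per batch, in seconds, as reported above.
garuda_ft_s = 1.21
qwen2_audio_ft_s = 12.32
speedup = qwen2_audio_ft_s / garuda_ft_s
```

The ratio of the two reported latencies works out to slightly over 10, which matches the "10× faster inference" claim.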


Data Efficiency

GARUDA-FT maintains superior performance even under reduced training data, demonstrating robustness under limited-data conditions:

| Training Data | Qwen2-Audio-Base-FT ACC | GARUDA ACC | GARUDA-FT ACC |
|---|---|---|---|
| 100% (full) | 93.88 | 94.37 | 98.41 |
| 75% | 89.51 | 92.21 | 95.12 |
| 50% | 86.02 | 91.48 | 93.56 |
| 25% | 82.93 | 90.04 | 92.04 |

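SEA-CF ACC (%) reported. At every data fraction GARUDA-FT outperforms Qwen2-Audio-Base-FT, and GARUDA-FT trained on 75% of the data (95.12) exceeds Qwen2-Audio-Base-FT trained on 100% (93.88). With only 25% of the data it remains within 1.84 points of that full-data baseline (92.04 vs. 93.88).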


Citation

If you find SEA-CF or GARUDA useful in your research, please cite:

@inproceedings{seacf-ijcai2026,
  title     = {Bridging the {SEA} Gap: An Initial Benchmark for Neural Audio Codec-Synthesized Speech Deepfakes in South-East Asian Languages},
  author    = {Orchid Chetia Phukan and Girish and Mohd Mujtaba Akhtar and Arun Balaji Buduru},
  booktitle = {Proceedings of the 35th International Joint Conference on Artificial Intelligence (IJCAI 2026)},
  year      = {2026},
  pages     = {TBD},
}