✅ ACL 2026 · Accepted

Prosody as Supervision: Bridging the Non-Verbal–Verbal for Multilingual Speech Emotion Recognition

Girish*    Mohd Mujtaba Akhtar*    Muskaan Singh
*Equal Contribution  ·  ACL 2026
[Overview diagram: Source (labeled) non-verbal vocalizations (laughter · sobs · sighs) → NOVA-ARC (hyperbolic OT) → Target (unlabeled) verbal speech across languages · no emotion labels needed]

NOVA-ARC transfers emotion supervision from labeled non-verbal vocalizations (laughter, sobs, sighs) to unlabeled verbal speech across multiple languages — without requiring any target-language emotion annotations.


Abstract

We introduce a paralinguistic supervision paradigm for low-resource multilingual speech emotion recognition (LRM-SER) that leverages non-verbal vocalizations to exploit prosody-centric emotion cues. Unlike conventional SER systems that rely heavily on labeled verbal speech and suffer from poor cross-lingual transfer, our approach reformulates LRM-SER as non-verbal-to-verbal transfer, where supervision from a labeled non-verbal source domain is adapted to unlabeled verbal speech across multiple target languages.


To this end, we propose NOVA-ARC (NOn-verbal to Verbal Adaptation via hyperbolic Alignment, Radial calibration, and Codebook tokens), a geometry-aware framework that models affective structure in the Poincaré ball, discretizes paralinguistic patterns via a hyperbolic vector-quantized prosody codebook, and captures emotion intensity through a hyperbolic emotion lens (HEL). For unsupervised adaptation, NOVA-ARC performs optimal-transport-based prototype alignment between source emotion prototypes and target utterances, inducing soft supervision for unlabeled speech while being stabilized through consistency regularization.


Experiments show that NOVA-ARC delivers the strongest performance under both non-verbal-to-verbal adaptation and the complementary verbal-to-verbal transfer setting, consistently outperforming Euclidean counterparts and strong SSL baselines. To the best of our knowledge, this is the first work to move beyond verbal-speech–centric supervision by introducing a non-verbal–to–verbal transfer paradigm for SER.

Speech Emotion Recognition · Hyperbolic Geometry · Optimal Transport · Domain Adaptation · Non-Verbal Vocalizations · Multilingual NLP · Low-Resource

Core Motivation

Emotion labels in verbal speech are inevitably entangled with words, phonotactics, and language-dependent expressive conventions. Models trained on verbal emotion corpora overfit to lexical/phonetic correlates that do not transfer across languages. Non-verbal vocalizations offer a cleaner alternative: they convey emotion through prosody alone, carry no lexical content, and are far less tied to any particular language, making them a natural source of cross-lingual emotion supervision.


Method

NOVA-ARC operates in a shared hyperbolic space across source (non-verbal) and target (verbal) domains. The pipeline proceeds in five stages:

Figure 1: NOVA-ARC Framework Overview. The model takes labeled non-verbal vocalizations (source) and unlabeled verbal speech (target) as input. Frame-level features are projected into the Poincaré ball, discretized via a hyperbolic VQ codebook, fused with Möbius addition, calibrated by the Hyperbolic Emotion Lens (HEL), and aligned to source emotion prototypes via optimal transport.
Stage 1: Pre-Trained Speech Encoder

Audio is resampled to 16 kHz and fed into a frozen SSL encoder (voc2vec, WavLM, wav2vec 2.0, or MMS-1B) to extract frame-level representations {z_t}. voc2vec is explicitly pretrained on non-verbal human sounds and is the best-performing backbone in our setting.
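A minimal sketch of this step, assuming a Hugging Face backbone; facebook/wav2vec2-base stands in for the paper's encoders, since the voc2vec checkpoint id is not given here.

```python
# Feature-extraction sketch; swap in the actual voc2vec / WavLM / MMS checkpoint.
import torch
import torchaudio
from transformers import Wav2Vec2Model

encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

def extract_frame_features(wav_path: str) -> torch.Tensor:
    """Load audio, resample to 16 kHz, return frozen frame-level features {z_t}."""
    wav, sr = torchaudio.load(wav_path)                 # (channels, samples)
    wav = wav.mean(dim=0, keepdim=True)                 # downmix to mono
    if sr != 16_000:
        wav = torchaudio.functional.resample(wav, sr, 16_000)
    with torch.no_grad():                               # encoder stays frozen
        out = encoder(wav)
    return out.last_hidden_state.squeeze(0)             # (T, hidden_dim)
```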

Stage 2: Hyperbolic Projection (Poincaré Ball)

Frame features are projected and mapped into the Poincaré ball 𝔻^d_c via the exponential map at the origin. All subsequent operations (tokenization, fusion, prototype alignment) are performed in this hyperbolic space with curvature κ = −1.0 and latent dimension d = 256.
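For reference, the standard exponential map at the origin of the Poincaré ball that such a projection uses (here c = −κ = 1); the Euclidean projection layer preceding it is not specified on this page.

```latex
% Exponential map at the origin of the Poincaré ball:
\exp_0^{c}(\mathbf{v})
  = \tanh\!\left(\sqrt{c}\,\lVert\mathbf{v}\rVert\right)
    \frac{\mathbf{v}}{\sqrt{c}\,\lVert\mathbf{v}\rVert},
\qquad c = -\kappa = 1 .
```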

Stage 3: Hyperbolic VQ Prosody Codebook

Each hyperbolic frame embedding is assigned to its nearest codeword under the Poincaré distance (codebook size K = 256), yielding a discrete prosody token. Continuous and discrete embeddings are fused via Möbius addition directly in hyperbolic space, then compressed through a bottleneck (d_b = 128).
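A minimal sketch of the tokenize-and-fuse step under the stated settings (K = 256, c = 1); function and variable names are illustrative, not from the released code.

```python
# Hyperbolic VQ sketch: nearest-codeword assignment by Poincaré distance,
# then Möbius-addition fusion of continuous and quantized embeddings.
import torch

def mobius_add(x, y, c=1.0):
    """Möbius addition in the Poincaré ball (broadcast over leading dims)."""
    xy = (x * y).sum(-1, keepdim=True)
    x2 = (x * x).sum(-1, keepdim=True)
    y2 = (y * y).sum(-1, keepdim=True)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c**2 * x2 * y2
    return num / den.clamp_min(1e-8)

def poincare_dist(x, y, c=1.0):
    """Geodesic distance d_c(x, y) = (2/sqrt(c)) artanh(sqrt(c) ||(-x) ⊕ y||)."""
    norm = mobius_add(-x, y, c).norm(dim=-1).clamp(max=1 - 1e-5)
    return (2 / c**0.5) * torch.atanh(c**0.5 * norm)

def vq_fuse(frames, codebook, c=1.0):
    """frames: (T, d) in the ball; codebook: (K, d). Returns fused frames + tokens."""
    d = poincare_dist(frames.unsqueeze(1), codebook.unsqueeze(0), c)  # (T, K)
    idx = d.argmin(dim=1)                      # discrete prosody tokens
    quantized = codebook[idx]                  # (T, d) nearest codewords
    return mobius_add(frames, quantized, c), idx
```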

Stage 4: Hyperbolic Emotion Lens (HEL)

A learnable radial calibration layer decomposes each embedding into radius and direction, then applies a power-law warp controlled by learned scalar α to rescale emotion intensity. This bridges the intensity mismatch between non-verbal and verbal speech, with α initialized at 1.0 and optimized jointly during training.
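A sketch of a radial calibration layer consistent with this description (radius/direction split, power-law warp, α initialized at 1.0); the paper's exact parameterization may differ.

```python
import torch
import torch.nn as nn

class HyperbolicEmotionLens(nn.Module):
    """Radial power-law warp: keeps direction, rescales radius r -> r**alpha."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(1.0))  # identity warp at init

    def forward(self, x, eps=1e-6):
        r = x.norm(dim=-1, keepdim=True).clamp_min(eps)  # radius in the ball
        u = x / r                                        # unit direction
        r_warp = r.pow(self.alpha).clamp(max=1 - 1e-5)   # warp, stay inside the ball
        return u * r_warp
```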

Stage 5: Hyperbolic Optimal Prototype Transport

Source class prototypes (Fréchet means) guide adaptation of unlabeled target speech. An entropically regularized optimal transport plan (50 Sinkhorn iterations, ε = 0.05) aligns target utterances to prototypes, producing soft pseudo-labels. Two complementary losses, L_OPT (geometric alignment) and L_OT-CE (soft cross-entropy), jointly train the model on unlabeled target speech.
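A minimal Sinkhorn sketch with the stated hyperparameters (50 iterations, ε = 0.05) and the marginals described in the Adaptation card below (source class prior, uniform target); the soft-label normalization is one natural choice, not necessarily the paper's.

```python
import torch

def sinkhorn_soft_labels(cost, src_prior, n_iters=50, eps=0.05):
    """Entropic OT between K source prototypes and N target utterances.

    cost: (N, K) Poincaré distances from target utterances to prototypes.
    src_prior: (K,) class-prior marginal over source prototypes (sums to 1).
    Returns the transport plan and row-normalized soft pseudo-labels.
    """
    N, _ = cost.shape
    tgt_marginal = torch.full((N,), 1.0 / N)       # uniform over target utterances
    K = torch.exp(-cost / eps)                      # Gibbs kernel
    u = torch.ones(N)
    for _ in range(n_iters):                        # Sinkhorn scaling iterations
        v = src_prior / (K.t() @ u)
        u = tgt_marginal / (K @ v)
    plan = u.unsqueeze(1) * K * v.unsqueeze(0)      # (N, K) transport plan
    pseudo = plan / plan.sum(dim=1, keepdim=True)   # soft pseudo-labels per utterance
    return plan, pseudo
```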

Encoder

voc2vec / WavLM / wav2vec 2.0 / MMS

Four SSL encoders evaluated. voc2vec, pretrained on 125h of non-verbal sounds, is the strongest for our setting.

Geometry

Poincaré Ball

All operations use exponential/log maps at origin, Möbius addition, and Poincaré distance for a consistent hyperbolic workflow.
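For reference, the standard Möbius addition and Poincaré distance this workflow relies on (curvature parameter c > 0):

```latex
\mathbf{x} \oplus_c \mathbf{y}
  = \frac{\left(1 + 2c\langle\mathbf{x},\mathbf{y}\rangle + c\lVert\mathbf{y}\rVert^2\right)\mathbf{x}
        + \left(1 - c\lVert\mathbf{x}\rVert^2\right)\mathbf{y}}
         {1 + 2c\langle\mathbf{x},\mathbf{y}\rangle
            + c^2\lVert\mathbf{x}\rVert^2\lVert\mathbf{y}\rVert^2},
\qquad
d_c(\mathbf{x},\mathbf{y})
  = \frac{2}{\sqrt{c}}\,\operatorname{artanh}\!\left(\sqrt{c}\,\lVert(-\mathbf{x})\oplus_c\mathbf{y}\rVert\right).
```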

Codebook

Hyperbolic VQ

K = 256 codewords, commitment weight β = 0.25. Assignment by Poincaré distance, not Euclidean. Codebook and commitment losses.
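A sketch of those two losses under the stated settings (β = 0.25, Poincaré distance), following the usual VQ-VAE stop-gradient recipe; the exact form in the paper is not shown on this page, so treat this as an assumption.

```python
import torch

def poincare_dist_c1(x, y, eps=1e-6):
    """Poincaré distance for c = 1 via the arcosh identity."""
    x2 = (x * x).sum(-1)
    y2 = (y * y).sum(-1)
    num = 2 * ((x - y) ** 2).sum(-1)
    return torch.acosh(1 + num / ((1 - x2) * (1 - y2)).clamp_min(eps))

def vq_losses(frames, quantized, beta=0.25):
    """Codebook loss updates codewords; commitment loss (weight beta) updates encoder."""
    codebook_loss = poincare_dist_c1(frames.detach(), quantized).pow(2).mean()
    commit_loss = poincare_dist_c1(frames, quantized.detach()).pow(2).mean()
    return codebook_loss + beta * commit_loss
```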

Calibration

HEL — Emotion Lens

Power-law warp on the radial dimension of hyperbolic frames. Fully differentiable; learned jointly with the rest of the model.

Adaptation

Optimal Transport

Sinkhorn iterations with source class prior marginal and uniform target marginal. Prototypes refreshed once per epoch.

Classifier

Temporal CNN Head

Two 1D conv blocks (64→128 filters, kernel=3) + attention pooling + linear softmax. 5.2M–8.5M trainable parameters.
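A minimal PyTorch sketch matching the stated head (two 1D conv blocks, 64→128 filters, kernel 3, attention pooling, linear softmax over the 5 shared emotion classes); normalization and dropout details are assumptions.

```python
import torch
import torch.nn as nn

class TemporalCNNHead(nn.Module):
    """Two 1D conv blocks (64 -> 128, kernel 3) + attention pooling + linear classifier."""
    def __init__(self, d_in=128, n_classes=5):   # d_in matches the d_b = 128 bottleneck
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(d_in, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.attn = nn.Linear(128, 1)            # additive attention scores
        self.cls = nn.Linear(128, n_classes)

    def forward(self, x):                        # x: (B, T, d_in) bottleneck frames
        h = self.convs(x.transpose(1, 2)).transpose(1, 2)  # (B, T, 128)
        w = torch.softmax(self.attn(h), dim=1)             # attention over time
        pooled = (w * h).sum(dim=1)                        # (B, 128)
        return self.cls(pooled)                            # logits over 5 emotions
```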


Results

Table 1: Cross-corpus performance (individual SSL encoders, no adaptation)

In-domain supervised performance on verbal and non-verbal splits. voc2vec dominates non-verbal emotion recognition (95.26%); speech SSL encoders lead on verbal in-domain evaluation. The ranking flips across regimes — validating the core premise of NOVA-ARC.

Dataset                  voc2vec        WavLM          wav2vec 2.0    MMS
                         Acc↑   F1↑     Acc↑   F1↑     Acc↑   F1↑    Acc↑   F1↑

Verbal Speech
ASVP-ESD (Verbal)        32.67  30.41   84.39  82.57   80.56  77.90  87.63  85.78
MESD (Spanish)           49.02  46.91   68.35  67.24   72.63  71.98  96.47  93.85
AESDD (Greek)            35.86  34.19   84.27  82.94   73.41  70.98  61.59  60.12
RAVDESS (English)        36.51  33.97   85.69  82.90   76.38  74.69  83.67  82.05
Emo-DB (German)          44.69  42.28   96.31  94.82   97.67  95.38  94.15  91.82
CREMA-D (English)        42.31  41.06   86.71  85.20   86.02  84.47  79.23  76.91

Non-Verbal
ASVP-ESD (Non-Verbal)    95.26  93.79   63.61  60.92   58.92  56.47  46.03  43.65

voc2vec achieves 95.26% on the non-verbal split, far ahead of the speech SSL encoders.


Table 2: NOVA-ARC Adaptation, Euclidean vs. Hyperbolic (ASVP-ESD Non-Verbal → Verbal Targets)

NOVA-ARC with hyperbolic geometry consistently outperforms the Euclidean variant across all encoders and all target datasets. Gains are systematic — not tied to a specific backbone — confirming NOVA-ARC contributes a genuinely transferable adaptation mechanism.

Target Dataset        voc2vec (Euclidean)   voc2vec (Hyperbolic)   WavLM (Hyperbolic)   MMS (Hyperbolic)
                      Acc     F1            Acc     F1             Acc     F1           Acc     F1
ASVP-ESD (Verbal)     87.31   85.06         92.40   89.79          91.03   88.92        89.43   88.15
MESD (Spanish)        84.58   81.92         90.67   89.05          81.09   79.36        86.79   83.93
AESDD (Greek)         79.63   78.21         84.39   82.92          82.98   81.06        82.03   80.24
RAVDESS (English)     87.04   85.53         93.79   90.61          92.47   90.31        89.51   87.69
Emo-DB (German)       86.71   83.69         92.46   90.68          91.26   88.93        88.11   85.74
CREMA-D (English)     85.26   84.03         91.32   89.87          90.76   89.29        87.94   85.22

voc2vec + hyperbolic reaches 93.79% accuracy on RAVDESS, the best overall result.


Ablation Study

Each component of NOVA-ARC is removed individually. Source: ASVP non-verbal → Target: ASVP verbal. Every ablation leads to a clear degradation, confirming the synergistic design.

Configuration                              Acc ↑   F1 ↑
NOVA-ARC (full)                            92.40   89.79
Euclidean space (no Poincaré)              87.31   85.06
Euclidean OT (hyperbolic space only)       80.24   75.64
Token only (discrete, no continuous)       76.90   73.18
No VQ (continuous only)                    74.22   70.43
No HEL (no intensity calibration)          72.75   51.44
Concat/MLP instead of Möbius fusion        65.36   62.24
Adversarial domain adaptation              53.49   43.76
OT-UDA baseline                            50.78   41.33

Key takeaways: (1) Hyperbolic geometry contributes +5.09 accuracy points over the Euclidean variant. (2) Continuous and discrete cues are complementary; removing either hurts. (3) Möbius fusion is crucial (−27 accuracy points without it). (4) Removing HEL collapses F1 to 51.44. (5) Standard baselines (adversarial DA, OT-UDA) are dramatically weaker, at roughly 51–53% accuracy.


Datasets

All datasets are standardized to the shared label space: happy · anger · disgust · sadness · fear

Multi-modal (NV + Verbal)

ASVP-ESD

Realistic emotional corpus with non-speech vocalizations and verbal speech. Non-verbal split used as labeled source; verbal split as unlabeled target.

Source + Target
Spanish

MESD

Mexican Emotional Speech Database. Female adult, male adult, and child voices recorded in a professional studio.

Verbal Target
Greek

AESDD

Acted speech corpus in Greek, ~500 utterances across 5 emotions. Provides evaluation on a typologically distinct language.

Verbal Target
English

RAVDESS

Ryerson Audio-Visual Database. 7,356 validated recordings from 24 professional actors. Audio-only speech subset used.

Verbal Target
German

Emo-DB

Berlin Database of Emotional Speech. 800 utterances from 10 speakers across 7 emotions. A classic German SER benchmark.

Verbal Target
English (crowdsourced)

CREMA-D

Crowd-sourced Emotional Multimodal Actors Dataset. 7,442 clips from 91 actors. Audio-only modality used as target domain.

Verbal Target

Citation

If you find this work useful, please cite:

@inproceedings{novarc-acl2026,
  title     = {Prosody as Supervision: Bridging the Non-Verbal--Verbal for Multilingual Speech Emotion Recognition},
  author    = {Girish and Akhtar, Mohd Mujtaba and Singh, Muskaan},
  booktitle = {Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)},
  year      = {2026},
  pages     = {TBD},
}