We introduce a paralinguistic supervision paradigm for low-resource multilingual speech emotion recognition (LRM-SER) that leverages non-verbal vocalizations as a source of prosody-centric emotion cues. Unlike conventional SER systems that rely heavily on labeled verbal speech and suffer from poor cross-lingual transfer, our approach reformulates LRM-SER as non-verbal-to-verbal transfer, where supervision from a labeled non-verbal source domain is adapted to unlabeled verbal speech across multiple target languages.
To this end, we propose NOVA-ARC (NOn-verbal to Verbal Adaptation via hyperbolic Alignment, Radial calibration, and Codebook tokens), a geometry-aware framework that models affective structure in the Poincaré ball, discretizes paralinguistic patterns via a hyperbolic vector-quantized prosody codebook, and captures emotion intensity through a hyperbolic emotion lens (HEL). For unsupervised adaptation, NOVA-ARC performs optimal-transport-based prototype alignment between source emotion prototypes and target utterances, inducing soft supervision for unlabeled speech while being stabilized through consistency regularization.
Experiments show that NOVA-ARC delivers the strongest performance under both non-verbal-to-verbal adaptation and the complementary verbal-to-verbal transfer setting, consistently outperforming Euclidean counterparts and strong SSL baselines. To the best of our knowledge, this is the first work to move beyond verbal-speech-centric supervision by introducing a non-verbal-to-verbal transfer paradigm for SER.
Emotion labels in verbal speech are inevitably entangled with words, phonotactics, and language-dependent expressive conventions. Models trained on verbal emotion corpora therefore overfit to lexical and phonetic correlates that do not transfer across languages. Non-verbal vocalizations offer a cleaner alternative: they convey affect through prosody alone, largely independent of any particular language's lexicon.
NOVA-ARC operates in a shared hyperbolic space across source (non-verbal) and target (verbal) domains. The pipeline proceeds in five stages:
Audio is resampled to 16 kHz and fed into a frozen SSL encoder — voc2vec, WavLM, wav2vec 2.0, or MMS-1B — to extract frame-level representations {zt}. voc2vec is explicitly pretrained on non-verbal human sounds and is the best-performing backbone in our setting.
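As a rough sketch of this stage, the snippet below uses torchaudio for resampling and a HuggingFace wav2vec 2.0 checkpoint as a stand-in for the actual backbones (voc2vec, WavLM, wav2vec 2.0, MMS-1B); the checkpoint name and file path are illustrative assumptions, not the paper's exact extraction code.

```python
import torch
import torchaudio
from transformers import Wav2Vec2Model

# Illustrative checkpoint only; the paper's backbones are voc2vec, WavLM,
# wav2vec 2.0, and MMS-1B.
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
encoder.eval()  # frozen SSL encoder: no gradient updates

wave, sr = torchaudio.load("utterance.wav")
wave = torchaudio.functional.resample(wave, sr, 16_000)  # standardize to 16 kHz
wave = wave.mean(dim=0, keepdim=True)                    # collapse to mono

with torch.no_grad():
    z = encoder(wave).last_hidden_state  # frame-level features {z_t}: (1, T, 768)
```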
Frame features are projected and mapped into the Poincaré ball 𝔻^d_c via the exponential map at the origin. All subsequent operations — tokenization, fusion, prototype alignment — are performed in this hyperbolic space with curvature κ = −1.0 and latent dimension d = 256.
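A minimal sketch of the origin exponential map, assuming c = |κ| = 1; libraries such as geoopt provide hardened versions of this operation, and the projection `proj` in the comment is a hypothetical linear layer to d = 256.

```python
import torch

C = 1.0  # c = |kappa|; the paper uses curvature kappa = -1.0

def expmap0(v: torch.Tensor, c: float = C, eps: float = 1e-6) -> torch.Tensor:
    """Exponential map at the origin of the Poincare ball:
    exp_0(v) = tanh(sqrt(c) * ||v||) * v / (sqrt(c) * ||v||)."""
    n = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(c ** 0.5 * n) * v / (c ** 0.5 * n)

# Projected frame features (e.g. a linear map to d = 256) enter the ball here:
# h_t = expmap0(proj(z_t))
```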
Each hyperbolic frame embedding is assigned to its nearest codeword under the Poincaré distance (codebook size K = 256), yielding a discrete prosody token. Continuous and discrete embeddings are fused via Möbius addition — directly in hyperbolic space — and compressed through a bottleneck (db = 128).
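A sketch of this hyperbolic VQ step under the same c = 1 convention: codewords live in the ball, assignment uses the Poincaré distance, and fusion is Möbius addition. The function names (`mobius_add`, `assign_and_fuse`) are ours, not the paper's.

```python
import torch

def mobius_add(x: torch.Tensor, y: torch.Tensor, c: float = 1.0) -> torch.Tensor:
    """Mobius addition x (+) y in the Poincare ball of curvature -c."""
    xy = (x * y).sum(-1, keepdim=True)
    x2 = (x * x).sum(-1, keepdim=True)
    y2 = (y * y).sum(-1, keepdim=True)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    return num / (1 + 2 * c * xy + c ** 2 * x2 * y2).clamp_min(1e-6)

def poincare_dist(x: torch.Tensor, y: torch.Tensor, c: float = 1.0) -> torch.Tensor:
    """Geodesic distance d(x, y) = (2 / sqrt(c)) * artanh(sqrt(c) * ||(-x) (+) y||)."""
    n = mobius_add(-x, y, c).norm(dim=-1)
    return 2 / c ** 0.5 * torch.atanh((c ** 0.5 * n).clamp(max=1 - 1e-6))

def assign_and_fuse(frames: torch.Tensor, codebook: torch.Tensor, c: float = 1.0):
    """frames: (T, d) ball embeddings; codebook: (K, d) with K = 256 codewords."""
    d = poincare_dist(frames.unsqueeze(1), codebook.unsqueeze(0), c)  # (T, K)
    tokens = d.argmin(dim=-1)                        # discrete prosody tokens
    fused = mobius_add(frames, codebook[tokens], c)  # continuous (+) discrete, in-ball
    return tokens, fused                             # a bottleneck to d_b = 128 follows
```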
A learnable radial calibration layer decomposes each embedding into radius and direction, then applies a power-law warp controlled by learned scalar α to rescale emotion intensity. This bridges the intensity mismatch between non-verbal and verbal speech, with α initialized at 1.0 and optimized jointly during training.
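One plausible reading of this calibration, sketched below: split each ball embedding into its Euclidean radius and direction, then warp the radius with a learned power α > 0, where α = 1 at initialization leaves embeddings unchanged. The log-α parameterization is our choice to keep α positive, not a detail from the paper.

```python
import torch
import torch.nn as nn

class RadialCalibration(nn.Module):
    """Power-law warp of the radial coordinate (our reading of the HEL).
    alpha starts at 1.0 (identity) and is learned jointly with the model."""

    def __init__(self):
        super().__init__()
        self.log_alpha = nn.Parameter(torch.zeros(1))  # exp(0) = 1.0 at init

    def forward(self, x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
        r = x.norm(dim=-1, keepdim=True).clamp(eps, 1 - eps)  # radius in the unit ball
        u = x / r                                             # unit direction
        alpha = self.log_alpha.exp()                          # stays > 0 by construction
        return u * r.pow(alpha)  # r**alpha remains in (0, 1), so the point stays in-ball
```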
Source class prototypes (Fréchet means) guide adaptation of unlabeled target speech. An entropically regularized optimal transport plan (50 Sinkhorn iterations, ε = 0.05) aligns target utterances to prototypes, producing soft pseudo-labels. Two complementary losses — LOPT (geometric alignment) and LOT-CE (soft cross-entropy) — jointly train the model on unlabeled target speech.
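A compact Sinkhorn sketch matching the stated settings (ε = 0.05, 50 iterations, uniform target marginal, class-prior source marginal); `cost` would hold the Poincaré distances between target utterance embeddings and the K class prototypes. The function name is ours.

```python
import torch

def soft_pseudo_labels(cost: torch.Tensor, class_prior: torch.Tensor,
                       eps: float = 0.05, n_iter: int = 50) -> torch.Tensor:
    """cost: (n, K) Poincare distances from n target utterances to K prototypes.
    Marginals: uniform over targets, class prior over source prototypes."""
    n = cost.shape[0]
    a = torch.full((n,), 1.0 / n)           # uniform target marginal
    K_mat = torch.exp(-cost / eps)          # Gibbs kernel
    u = torch.ones(n)
    for _ in range(n_iter):                 # Sinkhorn iterations
        v = class_prior / (K_mat.t() @ u).clamp_min(1e-9)
        u = a / (K_mat @ v).clamp_min(1e-9)
    plan = u.unsqueeze(1) * K_mat * v.unsqueeze(0)  # transport plan (n, K)
    return plan / plan.sum(dim=1, keepdim=True)     # rows = soft pseudo-labels
```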
Four SSL encoders evaluated. voc2vec, pretrained on 125h of non-verbal sounds, is the strongest for our setting.
All operations use exponential/log maps at origin, Möbius addition, and Poincaré distance for a consistent hyperbolic workflow.
K = 256 codewords, commitment weight β = 0.25. Assignment by Poincaré distance, not Euclidean; trained with codebook and commitment losses, as sketched below.
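To make the loss recipe concrete, here is a sketch of the codebook and commitment terms with β = 0.25, reusing `poincare_dist` from the VQ sketch above; the straight-through trick in the comment is the common VQ-VAE estimator, assumed rather than confirmed by the paper.

```python
import torch

def vq_losses(h: torch.Tensor, q: torch.Tensor, beta: float = 0.25) -> torch.Tensor:
    """h: continuous frame embeddings; q: their assigned codewords (both in the ball).
    poincare_dist is the geodesic distance defined in the VQ sketch above."""
    codebook_loss = poincare_dist(h.detach(), q).pow(2).mean()  # pulls codewords to frames
    commit_loss = poincare_dist(h, q.detach()).pow(2).mean()    # keeps frames committed
    return codebook_loss + beta * commit_loss

# Common straight-through estimator: forward with q, route gradients to h.
# q_st = h + (q - h).detach()
```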
Power-law warp on the radial dimension of hyperbolic frames. Fully differentiable; learned jointly with the rest of the model.
Sinkhorn iterations with source class prior marginal and uniform target marginal. Prototypes refreshed once per epoch.
Two 1D conv blocks (64→128 filters, kernel=3) + attention pooling + linear softmax. 5.2M–8.5M trainable parameters.
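A sketch consistent with the stated head (two 1D conv blocks with 64→128 filters, kernel 3, attention pooling, linear softmax). The input dimension of 128 assumes the head consumes the d_b = 128 bottleneck features; the class count matches the five-emotion shared label space.

```python
import torch
import torch.nn as nn

class EmotionHead(nn.Module):
    """Two 1D conv blocks (64 -> 128 filters, kernel 3), attention pooling,
    and a linear softmax layer over the 5 shared emotion classes."""

    def __init__(self, in_dim: int = 128, n_classes: int = 5):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(in_dim, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.attn = nn.Linear(128, 1)        # scores for attention pooling over time
        self.out = nn.Linear(128, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, T, in_dim)
        h = self.convs(x.transpose(1, 2)).transpose(1, 2)  # (B, T, 128)
        w = torch.softmax(self.attn(h), dim=1)             # attention weights (B, T, 1)
        pooled = (w * h).sum(dim=1)                        # (B, 128)
        return self.out(pooled)                            # logits; softmax in the loss
```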
Table 1: Cross-corpus performance (individual SSL encoders, no adaptation)
In-domain supervised performance on verbal and non-verbal splits. voc2vec dominates non-verbal emotion recognition (95.26%); speech SSL encoders lead on verbal in-domain evaluation. The ranking flips across regimes — validating the core premise of NOVA-ARC.
| Dataset | voc2vec Acc ↑ | voc2vec F1 ↑ | WavLM Acc ↑ | WavLM F1 ↑ | wav2vec 2.0 Acc ↑ | wav2vec 2.0 F1 ↑ | MMS Acc ↑ | MMS F1 ↑ |
|---|---|---|---|---|---|---|---|---|
| **Verbal Speech** | | | | | | | | |
| ASVP-ESD (Verbal) | 32.67 | 30.41 | 84.39 | 82.57 | 80.56 | 77.90 | 87.63 | 85.78 |
| MESD (Spanish) | 49.02 | 46.91 | 68.35 | 67.24 | 72.63 | 71.98 | 96.47 | 93.85 |
| AESDD (Greek) | 35.86 | 34.19 | 84.27 | 82.94 | 73.41 | 70.98 | 61.59 | 60.12 |
| RAVDESS (English) | 36.51 | 33.97 | 85.69 | 82.90 | 76.38 | 74.69 | 83.67 | 82.05 |
| Emo-DB (German) | 44.69 | 42.28 | 96.31 | 94.82 | 97.67 | 95.38 | 94.15 | 91.82 |
| CREMA-D (English) | 42.31 | 41.06 | 86.71 | 85.20 | 86.02 | 84.47 | 79.23 | 76.91 |
| **Non-Verbal** | | | | | | | | |
| ASVP-ESD (Non-Verbal) | 95.26 | 93.79 | 63.61 | 60.92 | 58.92 | 56.47 | 46.03 | 43.65 |
voc2vec achieves 95.26% accuracy on the non-verbal split, far ahead of the speech SSL encoders.
Table 2: NOVA-ARC Adaptation — Euclidean vs. Hyperbolic (APD Non-Verbal → Verbal Targets)
NOVA-ARC with hyperbolic geometry consistently outperforms the Euclidean variant across all encoders and all target datasets. Gains are systematic — not tied to a specific backbone — confirming NOVA-ARC contributes a genuinely transferable adaptation mechanism.
| Target Dataset | voc2vec (Euclidean) Acc | voc2vec (Euclidean) F1 | voc2vec (Hyperbolic) Acc | voc2vec (Hyperbolic) F1 | WavLM (Hyperbolic) Acc | WavLM (Hyperbolic) F1 | MMS (Hyperbolic) Acc | MMS (Hyperbolic) F1 |
|---|---|---|---|---|---|---|---|---|
| ASVP-ESD (Verbal) | 87.31 | 85.06 | 92.40 | 89.79 | 91.03 | 88.92 | 89.43 | 88.15 |
| MESD (Spanish) | 84.58 | 81.92 | 90.67 | 89.05 | 81.09 | 79.36 | 86.79 | 83.93 |
| AESDD (Greek) | 79.63 | 78.21 | 84.39 | 82.92 | 82.98 | 81.06 | 82.03 | 80.24 |
| RAVDESS (English) | 87.04 | 85.53 | 93.79 | 90.61 | 92.47 | 90.31 | 89.51 | 87.69 |
| Emo-DB (German) | 86.71 | 83.69 | 92.46 | 90.68 | 91.26 | 88.93 | 88.11 | 85.74 |
| CREMA-D (English) | 85.26 | 84.03 | 91.32 | 89.87 | 90.76 | 89.29 | 87.94 | 85.22 |
voc2vec with hyperbolic geometry reaches 93.79% accuracy on RAVDESS, the best result overall.
Each component of NOVA-ARC is removed individually. Source: ASVP non-verbal → Target: ASVP verbal. Every ablation leads to a clear degradation, confirming the synergistic design.
| Configuration | Acc ↑ | F1 ↑ |
|---|---|---|
| NOVA-ARC (full) | 92.40 | 89.79 |
| Euclidean space (no Poincaré) | 87.31 | 85.06 | |
| Euclidean OT (hyperbolic space only) | 80.24 | 75.64 | |
| Token only (discrete, no continuous) | 76.90 | 73.18 | |
| No VQ (continuous only) | 74.22 | 70.43 | |
| No HEL (no intensity calibration) | 72.75 | 51.44 | |
| Concat/MLP instead of Möbius fusion | 65.36 | 62.24 | |
| Adversarial domain adaptation | 53.49 | 43.76 | |
| OT-UDA baseline | 50.78 | 41.33 |
Key takeaways: (1) Hyperbolic geometry contributes +5.09 accuracy points over Euclidean. (2) Continuous and discrete cues are complementary; removing either hurts. (3) Möbius fusion is crucial (roughly 27 points lost without it). (4) Removing HEL calibration collapses F1 to 51.44. (5) Standard baselines (adversarial DA, OT-UDA) are dramatically weaker at roughly 51–53% accuracy.
All datasets are standardized to the shared five-class label space: happiness · anger · disgust · sadness · fear
- **ASVP-ESD** (Source + Target): Realistic emotional corpus with non-speech vocalizations and verbal speech. The non-verbal split is used as the labeled source; the verbal split as an unlabeled target.
- **MESD** (Verbal Target): Mexican Emotional Speech Database. Female adult, male adult, and child voices recorded in a professional studio.
- **AESDD** (Verbal Target): Acted speech corpus in Greek, ~500 utterances across 5 emotions. Provides evaluation on a typologically distinct language.
- **RAVDESS** (Verbal Target): Ryerson Audio-Visual Database. 7,356 validated recordings from 24 professional actors. Audio-only speech subset used.
- **Emo-DB** (Verbal Target): Berlin Database of Emotional Speech. 800 utterances from 10 speakers across 7 emotions. A classic German SER benchmark.
- **CREMA-D** (Verbal Target): Crowd-sourced Emotional Multimodal Actors Dataset. 7,442 clips from 91 actors. Audio-only modality used as target domain.

If you find this work useful, please cite: