DUO GESTURE Project Page

DuoGesture

Neuro-Biomechanical Dual-Stream Co-Speech Gesture Generation

A dual-stream framework that disentangles sparse semantic gestures from prosody-aligned beat motion, combining motion-grounded semantic conditioning, stochastic stream selection, and biomechanical inertial regularization.

FGD on BEAT2 (all-speaker setting): 4.081
Beat Alignment on BEAT2 (all-speaker setting): 7.699
Diversity (full DuoGesture ablation): 12.83
Participants in controlled user study: 30

Method Overview

DuoGesture models co-speech gesture generation as the interaction of two coupled processes: a semantic stream for sparse lexical gestures and a beat stream for rhythmic prosody-aligned motion.

DuoGesture method overview: dual-stream co-speech gesture generation combining semantic and beat motion streams.

Highlights

DuoGesture addresses the tension between semantic expressivity and biomechanical plausibility in holistic co-speech gesture generation.

Dual-Stream Gesture Generation

Co-speech motion is decomposed into a semantic stream and a beat stream, reflecting the distinction between sparse lexical gestures and frequent rhythm-aligned beat gestures.

Motion-Grounded Semantic Conditioning

MGSC replaces purely linguistic embeddings with text-to-motion representations, providing motion-aligned semantic priors for long-tailed gesture triggers.

Semantic Variational Information Bottleneck

S-VIB introduces a stochastic frame-level gate that learns when semantic motion should override beat motion while avoiding deterministic gate collapse.

Inertial Beat Prior

IBP applies an anthropometric arm-chain prior to reduce jitter and improve rhythmic consistency without constraining expressive semantic frames.

Abstract

Co-speech gesture generation requires both semantic expressivity and biomechanically plausible rhythmic motion. Existing holistic gesture models mix lexically grounded semantic gestures with frequent prosody-aligned beat gestures. This limits semantic grounding, speech-motion alignment, and kinematic smoothness. We propose DuoGesture, a neuro-inspired and biomechanically informed dual-stream approach that decomposes co-speech gesture synthesis into coupled semantic and beat streams. The two streams are coordinated by a Semantic Variational Information Bottleneck, a stochastic frame-level gate that learns when semantic gestures should override rhythmic beat motion. The semantic stream is controlled by Motion-Grounded Semantic Conditioning, which replaces purely linguistic word embeddings with motion-language representations to provide motion-aligned semantic priors for long-tailed lexical triggers of gestures. The beat stream is further regularised by an Inertial Beat Prior, an anthropometry-weighted arm-chain module that reduces jitter and improves rhythmic consistency without constraining semantic frames. Objective evaluations and subjective experiments show that DuoGesture outperforms strong holistic baselines, while component ablations confirm the complementary roles of semantic grounding, stochastic stream selection, and biomechanical regularisation.

Framework Overview

DuoGesture is a two-stage latent generator. Stage 1 uses a regional RVQ-VAE tokenizer. Stage 2 predicts latent gesture codes through a convex combination of semantic and beat branches controlled by a stochastic frame-level gate.
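The convex combination in Stage 2 can be illustrated with a minimal sketch. It assumes the gate is a per-frame scalar in [0, 1] applied to per-frame feature vectors; the function name `fuse_streams` and the toy values are hypothetical, and the page does not specify whether the actual model blends logits, latents, or another representation.

```python
def fuse_streams(semantic, beat, gate):
    """Blend per-frame semantic and beat predictions.

    Assumption: gate[t] in [0, 1] is the semantic weight for frame t, so the
    fused frame is gate[t] * semantic[t] + (1 - gate[t]) * beat[t].
    """
    if not (len(semantic) == len(beat) == len(gate)):
        raise ValueError("all inputs must have one entry per frame")
    fused = []
    for s_frame, b_frame, g in zip(semantic, beat, gate):
        if not 0.0 <= g <= 1.0:
            raise ValueError("gate values must lie in [0, 1]")
        fused.append([g * s + (1.0 - g) * b for s, b in zip(s_frame, b_frame)])
    return fused


# Toy example: three frames, two feature dimensions.
semantic = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
beat = [[0.0, 2.0], [0.0, 2.0], [0.0, 2.0]]
gate = [0.0, 0.5, 1.0]  # pure beat, even blend, pure semantic
fused = fuse_streams(semantic, beat, gate)
```

A gate of 0 leaves the beat prediction untouched and a gate of 1 replaces it with the semantic prediction, which is what lets a sparse gate keep most frames rhythm-driven.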

DuoGesture High-Level Overview
High-level dual-stream overview. DuoGesture decomposes co-speech gesture synthesis into a semantic stream and a beat stream, coupled by the Semantic Variational Information Bottleneck (S-VIB).
DuoGesture Pipeline
DuoGesture pipeline. The semantic stream is driven by motion-grounded semantic features (MGSC), while the beat stream is guided by audio, speaker identity, seed poses, and biomechanical inertial regularization (IBP). S-VIB learns a stochastic frame-level gate that dynamically fuses semantic and beat predictions.
Two-Stream Hierarchical Blender
Two-Stream Hierarchical Blender architecture. Residual VQ-VAE tokens from the semantic and beat branches are fused under S-VIB gate control before decoding.
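For readers unfamiliar with residual vector quantization, the following is a generic sketch of how a residual VQ tokenizer turns a latent vector into a stack of code indices. It is not the DuoGesture tokenizer: the codebooks here are tiny hand-written examples, the function names are hypothetical, and the regional (per-body-part) split mentioned above is not modeled.

```python
def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to vector (squared L2)."""
    best_index, best_dist = 0, float("inf")
    for i, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, code))
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index


def rvq_encode(vector, codebooks):
    """Quantize vector level by level; each level encodes the previous residual."""
    residual = list(vector)
    indices = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices, residual


def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected code from every level."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Two illustrative levels: a coarse codebook and a fine correction codebook.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]
indices, residual = rvq_encode([1.25, 1.0], codebooks)
reconstruction = rvq_decode(indices, codebooks)
```

Each additional level refines the approximation left over by the previous one, so the token stack carries coarse-to-fine motion detail.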
MGSC

Motion-Grounded Semantic Conditioning

Instead of relying on BERT or FastText embeddings, MGSC uses a text-to-motion latent space to provide semantic features that are already aligned with gesture morphology.

S-VIB

Stochastic Semantic Gate

A variational bottleneck predicts frame-level semantic weights, encouraging temporal sparsity and preventing the gate from collapsing into a deterministic all-semantic state.
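One way such a stochastic gate could be realized is sketched below. This is an assumption-laden illustration, not the S-VIB implementation: it assumes a per-frame Gaussian posterior with mean `mu` and log-variance `log_var`, a reparameterized sample squashed by a sigmoid, and a KL penalty toward a Gaussian prior whose negative mean (`prior_mean=-2.0`, a hypothetical value) favors a mostly closed gate.

```python
import math
import random


def sample_gate(mu, log_var, rng):
    """Draw one gate value per frame via the reparameterization trick."""
    gates = []
    for m, lv in zip(mu, log_var):
        eps = rng.gauss(0.0, 1.0)          # noise keeps the gate stochastic
        z = m + math.exp(0.5 * lv) * eps   # z ~ N(m, exp(lv))
        gates.append(1.0 / (1.0 + math.exp(-z)))  # sigmoid -> (0, 1)
    return gates


def kl_to_prior(mu, log_var, prior_mean=-2.0, prior_var=1.0):
    """Mean per-frame KL( N(mu, exp(log_var)) || N(prior_mean, prior_var) )."""
    total = 0.0
    for m, lv in zip(mu, log_var):
        var = math.exp(lv)
        total += 0.5 * (
            math.log(prior_var) - lv
            + (var + (m - prior_mean) ** 2) / prior_var
            - 1.0
        )
    return total / len(mu)


rng = random.Random(0)
mu = [-2.0, -2.0, 2.0, -2.0]      # one frame leans semantic, the rest lean beat
log_var = [0.0, 0.0, 0.0, 0.0]
gates = sample_gate(mu, log_var, rng)
penalty = kl_to_prior(mu, log_var)
```

In this sketch the KL term is zero for frames whose posterior matches the sparse prior and grows when a frame pushes toward the semantic stream, so semantic activation has to be earned by the reconstruction objective. The injected noise is what keeps the gate from becoming a fixed deterministic switch.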

IBP

Biomechanical Beat Prior

An anthropometric arm-chain smoother injects inertial structure into the beat branch, reducing jerk while avoiding penalties on expressive semantic frames.
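The idea of penalizing jerk only on beat frames can be sketched as a masked, segment-weighted loss. Everything specific here is an assumption for illustration: the weights in `SEGMENT_WEIGHTS` are placeholder values, jerk is approximated by a third finite difference of joint angles without frame-rate scaling, and the mask is taken to be one minus the semantic gate. The actual IBP module may differ in all of these respects.

```python
# Illustrative per-segment weights (placeholder values, not from the paper).
SEGMENT_WEIGHTS = {"upper_arm": 0.028, "forearm": 0.016, "hand": 0.006}


def jerk(series):
    """Third finite difference of a 1-D joint-angle series."""
    return [
        series[t + 3] - 3 * series[t + 2] + 3 * series[t + 1] - series[t]
        for t in range(len(series) - 3)
    ]


def inertial_beat_penalty(angles, gate, weights=SEGMENT_WEIGHTS):
    """Sum of squared jerk, weighted per segment and masked by (1 - gate).

    angles: dict mapping segment name -> list of per-frame angles.
    gate:   per-frame semantic weight in [0, 1]; frames with gate near 1
            (semantic) contribute almost nothing to the penalty.
    """
    total_weight = sum(weights.values())
    penalty = 0.0
    for segment, series in angles.items():
        w = weights[segment] / total_weight
        for t, j in enumerate(jerk(series)):
            beat_weight = 1.0 - gate[t + 3]  # frame where the jerk window ends
            penalty += w * beat_weight * j * j
    return penalty


smooth = {name: [0.0, 1.0, 4.0, 9.0, 16.0] for name in SEGMENT_WEIGHTS}
jittery = {name: [0.0, 1.0, 0.0, 1.0, 0.0] for name in SEGMENT_WEIGHTS}
all_beat = [0.0] * 5
all_semantic = [1.0] * 5

smooth_penalty = inertial_beat_penalty(smooth, all_beat)
jitter_penalty = inertial_beat_penalty(jittery, all_beat)
masked_penalty = inertial_beat_penalty(jittery, all_semantic)
```

A constant-acceleration trajectory incurs no penalty, a jittery one does, and the same jittery trajectory is left unpenalized when the gate marks every frame as semantic.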

Why Two Streams? Motivation for the IBP

Beat and semantic arm-swings have fundamentally different spectral signatures, and that difference is what makes the Inertial Beat Prior both possible and necessary.

Arm-swing spectral analysis: beat vs semantic gestures on BEAT2
Arm-swing angular velocity analysis on 265 BEAT2 English test clips. (a) Energy decay from each clip’s spectral peak: beat gestures lose energy sharply (half-BW = 0.46 Hz), while semantic gestures spread it broadly (0.89 Hz, 1.92× wider). (b) Violin plot of spectral peakedness: beat gestures have a clearer dominant frequency (median 4.63 vs 3.78; 43% vs 29% show a clear peak).
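A rough sketch of how half-bandwidth and peakedness could be measured on a single angular-velocity signal follows. The exact definitions behind the figure are not given on this page, so both are assumptions: half-bandwidth is taken as half the width of the contiguous band around the spectral peak where power stays at or above 50% of the peak, and peakedness as the ratio of peak power to mean power. The synthetic signals are illustrative only.

```python
import math


def power_spectrum(signal, fps):
    """Power at each positive DFT frequency of a mean-removed signal."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]
    freqs, power = [], []
    for k in range(1, n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * t / n) for t, x in enumerate(centered))
        im = sum(x * math.sin(2 * math.pi * k * t / n) for t, x in enumerate(centered))
        freqs.append(k * fps / n)
        power.append(re * re + im * im)
    return freqs, power


def half_bandwidth(freqs, power):
    """Half the width (Hz) of the band around the peak with power >= 50% of peak."""
    peak = max(range(len(power)), key=power.__getitem__)
    half = 0.5 * power[peak]
    lo = peak
    while lo > 0 and power[lo - 1] >= half:
        lo -= 1
    hi = peak
    while hi < len(power) - 1 and power[hi + 1] >= half:
        hi += 1
    return 0.5 * (freqs[hi] - freqs[lo])


def peakedness(power):
    """Peak-to-mean power ratio (assumed definition); 0.0 for a flat signal."""
    mean_power = sum(power) / len(power)
    return max(power) / mean_power if mean_power > 0 else 0.0


fps, n = 30, 60  # two seconds of motion at 30 frames per second
# Metronome-like signal: a single 2 Hz component.
narrow = [math.sin(2 * math.pi * 2.0 * t / fps) for t in range(n)]
# Broader signal: equal components at 1.5, 2.0, and 2.5 Hz.
broad = [
    sum(math.sin(2 * math.pi * f * t / fps) for f in (1.5, 2.0, 2.5))
    for t in range(n)
]

narrow_freqs, narrow_power = power_spectrum(narrow, fps)
broad_freqs, broad_power = power_spectrum(broad, fps)
narrow_bw = half_bandwidth(narrow_freqs, narrow_power)
broad_bw = half_bandwidth(broad_freqs, broad_power)
```

On these toy signals the single-frequency motion has a narrower band and a higher peakedness than the multi-frequency one, mirroring the qualitative contrast between beat and semantic arm-swings described in the caption.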
Beat stream

Rhythmic & periodic, like a metronome

Beat arm-swings repeat at a steady pace in sync with speech prosody. Their energy spikes at one tight frequency (half-bandwidth only 0.46 Hz), just like a pendulum. 43% of beat windows show a single unmistakable dominant peak.

Semantic stream

Expressive & varied, like a dance move

Semantic gestures vary freely in speed and timing to express meaning. Their energy spreads nearly twice as wide (0.89 Hz) and only 29% have a clear spectral peak; they don’t follow a single rhythm.

IBP takeaway

Apply physics only where motion is already physical

Because beat arm-swings are naturally periodic, we enforce a lightweight anthropometric arm-chain constraint on the beat stream alone, smoothing jitter the way real arm inertia would without penalising the free-form expressive frames of the semantic stream.

Quantitative Results

On BEAT2, DuoGesture improves distributional fidelity over strong holistic baselines in both one-speaker and all-speaker settings, while maintaining competitive beat alignment and motion diversity.

One Speaker FGD: 4.101. Lower is better; improves over PyraMotion and SemTalk.
All Speakers FGD: 4.081. Best reported all-speaker FGD among compared methods.
All Speakers BA: 7.699. Competitive speech-motion rhythmic alignment.
Quantitative Results
Quantitative comparison on BEAT2. DuoGesture achieves strong overall fidelity and alignment, showing the benefit of dual-stream modeling, motion-grounded semantics, and biomechanical beat regularization.

Component Ablation

Component-wise ablations show that MGSC drives most of the FGD gain, S-VIB improves diversity and prevents gate collapse, and IBP preserves beat consistency with minimal trade-off.

Baseline FGD: 5.214. SemTalk baseline.
+ MGSC FGD: 4.306. Large gain from motion-grounded semantics.
Full FGD: 4.081. Best overall configuration.
Ablation Results
Ablation results demonstrate the complementary roles of MGSC, S-VIB, and IBP. The full model achieves the best balance between realism, diversity, and beat alignment.
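The claim that MGSC drives most of the FGD gain can be checked with simple arithmetic on the three FGD values reported above. The sketch assumes the "+ MGSC" row adds only MGSC to the SemTalk baseline and that the full model builds on that configuration; variable names are illustrative.

```python
baseline_fgd = 5.214  # SemTalk baseline
mgsc_fgd = 4.306      # baseline + MGSC
full_fgd = 4.081      # full DuoGesture

total_gain = baseline_fgd - full_fgd   # overall FGD reduction
mgsc_gain = baseline_fgd - mgsc_fgd    # reduction attributed to MGSC alone
mgsc_share = mgsc_gain / total_gain    # fraction of the total gain from MGSC
relative_reduction = total_gain / baseline_fgd

print(f"Total FGD reduction: {total_gain:.3f} ({relative_reduction:.1%} of baseline)")
print(f"MGSC share of the reduction: {mgsc_share:.1%}")
```

Under these assumptions MGSC accounts for roughly 80% of the total FGD reduction, with the remaining components contributing the rest.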

User Study

A controlled user study was conducted using 35-second clips from the BEAT2 test set across six narrated topics. Thirty native English speakers evaluated randomly ordered sequences on naturalness, diversity, and alignment with speech content and timing.

User Study Results
User study results comparing Ground Truth, DuoGesture, SemTalk, and EMAGE across naturalness, diversity, and speech-content alignment. Error bars denote participant-level standard deviation, with significance markers indicating statistical differences.

Qualitative Results

DuoGesture generates clearer, more content-aware gestures for semantic phrases such as “to get,” “I can share,” and “more drama,” while maintaining natural rhythmic consistency for beat gestures.

Qualitative Results
Qualitative comparison across semantic contexts. DuoGesture produces higher-amplitude and more interpretable gestures than SemTalk, EMAGE, and GestureLSM, while better preserving semantic intent and temporal naturalness.

Demo Videos

Featured videos from the user study, with one explicit comparison against Ground Truth.

User Study 1: Featured evaluation sample.
Against Ground Truth: DuoGesture comparison sample.
User Study 2: Featured evaluation sample.
User Study 3: Featured evaluation sample.