Textual Decomposition Then Sub-motion-space Scattering for Open-Vocabulary Motion Generation

Ke Fan¹, Jiangning Zhang², Ran Yi¹, Jingyu Gong⁴, Yabiao Wang²,³, Yating Wang¹, Xin Tan⁴, Chengjie Wang¹,², Lizhuang Ma¹,⁴

Shanghai Jiao Tong University¹, Tencent Youtu Lab², Zhejiang University³, East China Normal University⁴


Text-to-motion generation is a crucial task in computer vision: given a text description, generate the target 3D motion. Existing annotated datasets are limited in scale, so most existing methods overfit to these small datasets and fail to generalize to open-domain motions. Some methods attempt to solve open-vocabulary motion generation by aligning to the CLIP space or by using the Pretrain-then-Finetuning paradigm. However, the limited scale of current annotated datasets only allows them to learn a mapping from a sub-text-space to a sub-motion-space, rather than a mapping between the full-text-space and the full-motion-space (full mapping), which is the key to open-vocabulary motion generation. To this end, this paper proposes to use atomic motions (simple body-part motions over a short time period) as an intermediate representation, and to address the full mapping problem with two sequentially coupled steps: Textual Decomposition and Sub-motion-space Scattering. For Textual Decomposition, we design a fine-grained description conversion algorithm and combine it with the generalization ability of a large language model to convert any given motion text into atomic texts. Sub-motion-space Scattering learns the compositional process from atomic motions to target motions, so that the learned sub-motion-space is scattered to form the full-motion-space. For a given open-domain motion, this turns extrapolation into interpolation and thereby significantly improves generalization. Our network, DSO-Net, combines textual decomposition and sub-motion-space scattering to solve open-vocabulary motion generation. Extensive experiments demonstrate that DSO-Net achieves significant improvements over state-of-the-art methods on open-vocabulary motion generation.
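To make the idea of atomic texts concrete, here is a minimal Python sketch of what a textual decomposition step could look like. It is not the DSO-Net algorithm: the body-part list, the prompt wording, the `period|body part|description` reply format, and the names `AtomicText`, `build_prompt`, `decompose`, and `fake_llm` are all assumptions made for illustration, and the language model is replaced by a canned stand-in.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical body-part vocabulary; the paper's actual partition may differ.
BODY_PARTS = ("spine", "left arm", "right arm", "left leg", "right leg")


@dataclass(frozen=True)
class AtomicText:
    """One simple body-part motion over one short time period."""
    body_part: str
    period: int        # index of the short time period
    description: str


def build_prompt(motion_text: str, num_periods: int) -> str:
    """Ask for one line per (period, body part); the format is an assumption."""
    parts = ", ".join(BODY_PARTS)
    return (
        f"Describe the motion '{motion_text}' in {num_periods} consecutive "
        f"short time periods. For each period, write one line per body part "
        f"({parts}) as '<period>|<body part>|<description>'."
    )


def decompose(motion_text: str,
              llm: Callable[[str], str],
              num_periods: int = 2) -> List[AtomicText]:
    """Convert a motion text into atomic texts, skipping malformed lines."""
    reply = llm(build_prompt(motion_text, num_periods))
    atoms = []
    for line in reply.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) != 3 or not fields[0].isdigit():
            continue  # not in the expected three-field format
        period, part, description = fields
        if part in BODY_PARTS and description:
            atoms.append(AtomicText(part, int(period), description))
    return atoms


def fake_llm(prompt: str) -> str:
    """Canned stand-in for a real language model call (illustration only)."""
    return (
        "0|left leg|bends at the knee\n"
        "0|right leg|bends at the knee\n"
        "1|spine|leans forward\n"
        "this line is not in the expected format"
    )


atoms = decompose("Crawling Forward", fake_llm)
for atom in atoms:
    print(atom.period, atom.body_part, "->", atom.description)
```

In this sketch each atomic text is keyed by a body part and a time period, mirroring the description of atomic motions as simple body-part motions over a short time period. A generator trained on such atoms would then compose them into the target motion, which is the role the abstract assigns to Sub-motion-space Scattering.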

Qualitative Results

"Braced Hang Hop Left To Another Braced Hang"

"Start Breakdance Swipes Maneuver"

"Capoeira Groundwork"

"Cheering While Sitting With Low Enthusiasm"

"Doing The Chicken Dance"

"Circle Crunch On Floor"

"Climbing A Rope With Hands"

"Crawling Forward"

"Looking Over Object With Bow"

"Crouched Against Wall To Crouch Walk"

"Crying And Rubbing Eyes"

"Shot To The Chest And Backwards"

"Death Hit From The Back Falling On One Knee"

"Dying Shot To The Chest Falling On Two Knees"

"Emerging Right From Cover To Aimed Rifle Walk"

"Receiving A Football While Getting Tackled To The Ground"

"Kneeling In Prayer"

"Dying From A Prone Position"

"Sitting On A Toilet"

"Inspecting Upwards With Torch"

"Running Jump To Standing Idle"

"Stumbling Over Crate"

"Zombie Crawling Forward"

"Lifting Objects Over Head And Walking At The Same Time"

Approach Overview

BibTeX

@article{,
      title={Textual Decomposition Then Sub-motion-space Scattering for Open-Vocabulary Motion Generation},
      author={Ke Fan and Jiangning Zhang and Ran Yi and Jingyu Gong and Yabiao Wang and Yating Wang and Xin Tan and Chengjie Wang and Lizhuang Ma},
      year={2024},
      eprint={},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}