FreeMotion: A Unified Framework for Number-free Text-to-Motion Synthesis

Shanghai Jiao Tong University, Tencent Youtu Lab

Text-to-motion synthesis is a crucial task in computer vision. Existing methods are limited in their universality: they are tailored to single-person or two-person scenarios and cannot generate motions for an arbitrary number of individuals. To achieve number-free motion synthesis, this paper reconsiders motion generation and proposes to unify single-person and multi-person motion under the conditional motion distribution. We further design a generation module and an interaction module for our FreeMotion framework, which decouple conditional motion generation and ultimately support number-free motion synthesis. Moreover, existing single-person spatial control methods can be seamlessly integrated into our framework, enabling precise control of multi-person motion. Extensive experiments demonstrate the superior performance of our method and its ability to infer single-person and multi-person motions simultaneously.


Single-Person Motion Generation

Two-Person Motion Generation

Motion Generation for More People

Applications


Reaction Generation

We can fix the first person's motion as the condition and generate different reactions under various text prompts, as illustrated by the sketch after the examples below.


"A person swings their fist towards others while the other one kicks back for the incoming punch. "

"A person swings their fist towards others while the other use their both hands to defend. "

"A person swings their fist towards others while the other one evades the incoming punch by stepping back. "

Spatial Control

We can generate controllable multi-person motion under spatial control signals.


"Two people run towards each other with joy and embrace each other."

"Three people are walking to get away from each other."

"Four people walks forward."


Diverse Generation

We can use the same text prompt to generate diverse two-person motions.


"The first person bows to apologize to the other person."

"The first person bows to apologize to the other person."

"The first person bows to apologize to the other person."

Comparisons


We compare against InterGen for two-person motion generation. The motions synthesized by our method are more consistent with the text descriptions.


Approach Overview


BibTeX

@article{fan2024freemotion,
      title={FreeMotion: A Unified Framework for Number-free Text-to-Motion Synthesis}, 
      author={Ke Fan and Junshu Tang and Weijian Cao and Ran Yi and Moran Li and Jingyu Gong and Jiangning Zhang and Yabiao Wang and Chengjie Wang and Lizhuang Ma},
      year={2024},
      eprint={2405.15763},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}