Commanding Humanoid by Free-form Language:
A Large Language Action Model with Unified Motion Vocabulary

Zhirui Liu1,2,*, Kaiyang Ji1,2,*, Ke Yang1,2, Jingyi Yu1, Ye Shi1,2, Jingya Wang1,2,†
1ShanghaiTech University 2InstAdapt
*Equal contribution †Corresponding author

Abstract

Enabling humanoid robots to follow free-form language commands is critical for seamless human-robot interaction, collaborative task execution, and general-purpose embodied intelligence. While recent advances have improved low-level humanoid locomotion and robot manipulation, language-conditioned whole-body control remains a significant challenge. Existing methods are often limited to simple instructions and sacrifice either motion diversity or physical plausibility. To address this, we introduce Humanoid-LLA, a Large Language Action Model that maps expressive language commands to physically executable whole-body actions for humanoid robots. Our approach integrates three core components: a unified motion vocabulary that aligns human and humanoid motion primitives into a shared discrete space; a vocabulary-directed controller distilled from a privileged policy to ensure physical feasibility; and a physics-informed fine-tuning stage using reinforcement learning with dynamics-aware rewards to enhance robustness and stability. Extensive evaluations in simulation and on a real-world Unitree G1 humanoid show that Humanoid-LLA delivers strong language generalization while maintaining high physical fidelity, outperforming existing language-conditioned controllers in motion naturalness, stability, and execution success rate.

Overview

Figure 1. An overview of Humanoid-LLA. In stage one, we build a unified motion vocabulary from a large-scale dataset of paired human and humanoid motions. In stage two, given a kinematic humanoid motion goal and the vocabulary entries retrieved for it, we distill a vocabulary-directed humanoid student controller from a teacher tracking controller. Together, these two stages let stage three collect humanoid feedback directly from physics simulation without decoding, giving our LLA high physical fidelity and strong language generalization.
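As a rough illustration of what a shared discrete motion space can look like, the sketch below assigns toy human and humanoid latent vectors to the nearest entry of a single codebook, so that corresponding motions receive the same token. This is a generic nearest-neighbor quantization example, not the Humanoid-LLA implementation: the function `quantize`, the 2-D latents, and the three-entry codebook are all hypothetical, and we assume the latents come from separate human and humanoid encoders that are not shown.

```python
# Hypothetical sketch of a shared discrete vocabulary (not the authors' code).
from math import dist


def quantize(latents, codebook):
    """Return, for each latent vector, the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: dist(z, codebook[k]))
        for z in latents
    ]


# Toy shared codebook: three "motion primitives" in a 2-D latent space (assumed).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

# Toy latents, assumed to be produced by separate human and humanoid encoders.
human_latents = [(0.1, -0.1), (0.9, 0.2), (0.1, 0.8)]
humanoid_latents = [(-0.05, 0.05), (1.1, -0.1), (0.2, 1.2)]

human_tokens = quantize(human_latents, codebook)
humanoid_tokens = quantize(humanoid_latents, codebook)

# Paired motions land on the same tokens, so one vocabulary serves both domains.
print(human_tokens, humanoid_tokens)
```

In this toy setting the token sequence is what a language model would predict and what a vocabulary-directed controller would consume; how the actual vocabulary is trained and aligned is described only at a high level on this page.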

Real-World Results

Walk with a confident and happy strut.

A zombie is coming.

A person is playing golf.


Comparison with Baseline

Prompt: Direct traffic like a policeman.

LangWBC

Ours

Prompt: A joyful dance with beated hip-hop.

LangWBC

Ours

Robustness Test

A teacher is giving a lecture.

A gardener is watering flowers.


Citation

Coming soon~