Commanding Humanoid by Free-form Language:
A Large Language Action Model with Unified Motion Vocabulary

¹ShanghaiTech University  ²InstAdapt
*Equal contribution  †Corresponding author

Abstract

Enabling humanoid robots to follow free-form natural language commands is a critical step toward seamless human-robot interaction and general-purpose embodied AI. However, existing methods remain limited: they are often constrained to simple instructions or forced to sacrifice motion diversity for physical plausibility. To address this gap, we present Humanoid-LLA, a Large Language Action model that translates unconstrained natural language directly into executable whole-body motions for humanoid robots. Our approach tackles two core challenges: the scarcity of paired language-humanoid motion data and the physical instability of generated motions. First, we bridge high-level language semantics and physically grounded control by learning a unified human-humanoid motion vocabulary. Second, we introduce a novel two-stage fine-tuning framework that begins with supervised motion Chain-of-Thought learning and continues with reinforcement learning driven by physical feedback to ensure robustness and stability. Extensive evaluation in simulation and in real-world cross-embodiment experiments demonstrates that Humanoid-LLA generalizes better to novel language commands and generates more diverse motions while maintaining high physical fidelity.

Overview

Figure 1. An overview of Humanoid-LLA. In stage one, we build a unified motion vocabulary from a large-scale dataset of paired human and humanoid motions. In stage two, given a kinematic humanoid motion goal and the vocabulary entries retrieved for it, we distill a vocabulary-directed humanoid student controller from a teacher tracking controller. Together, these two stages let stage three obtain humanoid feedback directly from physical simulation without decoding, which gives our LLA high physical fidelity and strong language generalization.
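To make the idea of a unified motion vocabulary more concrete, here is a minimal sketch of one common way such a vocabulary can be realized: a shared codebook with nearest-neighbor lookup, as in vector quantization. This page does not state how Humanoid-LLA builds its vocabulary, so the quantization scheme, the codebook size, the latent dimension, the `<motion_k>` token format, and the function names `quantize` and `to_motion_tokens` are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def quantize(latents: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Return, for each latent frame, the index of its nearest codebook entry.

    latents:  (T, D) array of per-frame motion features (hypothetical).
    codebook: (K, D) array shared by human and humanoid motion (hypothetical).
    """
    # Squared Euclidean distance between every frame and every entry: (T, K).
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def to_motion_tokens(indices: np.ndarray) -> list[str]:
    """Render codebook indices as discrete tokens a language model could emit."""
    return [f"<motion_{int(i)}>" for i in indices]


# Toy codebook: 8 entries in a 4-dimensional latent space (assumed sizes).
codebook = np.arange(8, dtype=float)[:, None] * np.ones((1, 4))

# A human clip and its retargeted humanoid counterpart, encoded by two
# hypothetical encoders that land near the same codebook entries.
human_latents = codebook[[2, 5, 5, 1]] + 0.1
humanoid_latents = codebook[[2, 5, 5, 1]] - 0.1

human_ids = quantize(human_latents, codebook)
humanoid_ids = quantize(humanoid_latents, codebook)
human_tokens = to_motion_tokens(human_ids)
humanoid_tokens = to_motion_tokens(humanoid_ids)
```

In this toy setup both embodiments map to the same token sequence, which is the property a shared vocabulary is meant to provide: text paired with human motion can then supervise a model whose outputs are consumed by a humanoid controller.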

Real-World Results


Comparison with Baseline

Prompt: Direct traffic like a policeman.

LangWBC

Ours

Prompt: A joyful dance with beated hip-hop.

LangWBC

Ours

Robustness Test

A teacher is giving a lecture.

A gardener is watering flowers.


More Results on Booster T1

A person shakes hand with others.

A person squat in place twice.

Hug a friend.


Citation

@article{liu2025commanding,
  title={Commanding Humanoid by Free-form Language: A Large Language Action Model with Unified Motion Vocabulary},
  author={Liu, Zhirui and Ji, Kaiyang and Yang, Ke and Fan, Yahao and Yu, Jingyi and Shi, Ye and Wang, Jingya},
  journal={arXiv preprint arXiv:2511.22963},
  year={2025}
}