Abstract

Enabling humanoid robots to follow free-form natural language commands is a critical step toward seamless human-robot interaction and general-purpose embodied AI. However, existing methods remain limited: they are often constrained to simple instructions or forced to sacrifice motion diversity for physical plausibility. To address this gap, we present Humanoid-LLA, a Large Language Action model that translates unconstrained natural language directly into executable whole-body motions for humanoid robots. Our approach tackles two core challenges: the scarcity of paired language-humanoid motion data, and physical instability. First, we bridge high-level language semantics with physically grounded control by learning a unified human-humanoid motion vocabulary. Second, we introduce a novel two-stage fine-tuning framework that begins with supervised motion Chain-of-Thought learning, followed by reinforcement learning refined with physical feedback to ensure robustness and stability. Extensive evaluation in simulation and in real-world cross-embodiment experiments demonstrates that Humanoid-LLA generalizes better to novel language commands and generates more diverse motions while maintaining high physical fidelity.

Overview

Figure 1. An overview of Humanoid-LLA. In stage one, we build a unified motion vocabulary from a large-scale dataset of paired human and humanoid motions. In stage two, given a kinematic humanoid motion goal and its corresponding vocabulary retrieval, we distill a vocabulary-directed humanoid student controller from a teacher tracking controller. Together, these two stages let stage three obtain various forms of humanoid feedback directly from physical simulation without decoding, which gives our LLA high physical fidelity and strong language generalization.
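
To make the teacher-to-student distillation in the Figure 1 caption more concrete, here is a minimal toy sketch in Python. It is not the Humanoid-LLA implementation: the page does not specify the controller architectures, the distillation loss, or the form of the vocabulary, so everything below is an assumption made for illustration. The sketch uses a hypothetical 1-D system, a three-entry `CODEBOOK` standing in for the motion vocabulary, a proportional teacher that tracks an explicit kinematic goal, and a linear student that sees only the retrieved token and is fit by plain action regression.

```python
import random

# Hypothetical 1-D toy; every name and number here is an illustrative assumption.
CODEBOOK = [-1.0, 0.0, 1.0]  # stand-in for a learned motion vocabulary
KP = 2.0                     # gain of the toy teacher tracking controller


def retrieve_token(goal):
    """Return the index of the vocabulary entry closest to the kinematic goal."""
    return min(range(len(CODEBOOK)), key=lambda i: abs(CODEBOOK[i] - goal))


def teacher_action(state, goal):
    """Teacher sees the explicit goal and tracks it with a proportional law."""
    return KP * (goal - state)


def student_action(params, state, token):
    """Student sees only the token: action = bias[token] + gain * state."""
    return params["bias"][token] + params["gain"] * state


def mean_sq_error(params, samples):
    """Average squared gap between student and teacher actions."""
    total = 0.0
    for state, goal in samples:
        token = retrieve_token(goal)
        gap = student_action(params, state, token) - teacher_action(state, goal)
        total += gap * gap
    return total / len(samples)


def distill(samples, epochs=200, lr=0.05):
    """Fit the student to the teacher's actions with stochastic gradient descent."""
    params = {"bias": [0.0] * len(CODEBOOK), "gain": 0.0}
    for _ in range(epochs):
        for state, goal in samples:
            token = retrieve_token(goal)
            err = student_action(params, state, token) - teacher_action(state, goal)
            # Gradient of 0.5 * err**2 with respect to each student parameter.
            params["bias"][token] -= lr * err
            params["gain"] -= lr * err * state
    return params


random.seed(0)
SAMPLES = [(random.uniform(-1.0, 1.0), random.choice(CODEBOOK)) for _ in range(64)]
UNTRAINED = {"bias": [0.0] * len(CODEBOOK), "gain": 0.0}
STUDENT = distill(SAMPLES)
```

In this toy setting the goals are drawn from the codebook itself, so a token carries the same information as the goal and the student can reproduce the teacher's actions closely. A real controller would operate on whole-body states in a physics simulator, where retrieval is lossy and the student is a learned network rather than a linear map.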

Citation

@article{liu2025commanding,
  title={Commanding Humanoid by Free-form Language: A Large Language Action Model with Unified Motion Vocabulary},
  author={Liu, Zhirui and Ji, Kaiyang and Yang, Ke and Fan, Yahao and Yu, Jingyi and Shi, Ye and Wang, Jingya},
  journal={arXiv preprint arXiv:2511.22963},
  year={2025}
}

Commanding Humanoid by Free-form Language:
A Large Language Action Model with Unified Motion Vocabulary
