Enabling humanoid robots to follow free-form natural language commands is a critical step toward seamless human-robot interaction and general-purpose embodied AI. However, existing methods remain limited: they are often constrained to simple instructions or sacrifice motion diversity for physical plausibility. To address this gap, we present Humanoid-LLA, a Large Language Action model that translates unconstrained natural language directly into executable whole-body motions for humanoid robots. Our approach tackles two core challenges: the scarcity of paired language and humanoid-motion data, and the physical instability of generated motions. First, we bridge high-level language semantics and physically grounded control by learning a unified human-humanoid motion vocabulary. Second, we introduce a two-stage fine-tuning framework: supervised motion Chain-of-Thought learning, followed by reinforcement learning with physical feedback to ensure robustness and stability. Extensive evaluation in simulation and in real-world cross-embodiment experiments demonstrates that Humanoid-LLA generalizes better to novel language commands and generates more diverse motions while maintaining high physical fidelity.
Video comparison (LangWBC vs. Ours). Prompt: "Direct traffic like a policeman."
Video comparison (LangWBC vs. Ours). Prompt: "A joyful dance with beated hip-hop."
Additional demo prompts: "A teacher is giving a lecture."; "A gardener is watering flowers."; "A person shakes hand with others."; "A person squat in place twice."; "Hug a friend."
@article{liu2025commanding,
title={Commanding Humanoid by Free-form Language: A Large Language Action Model with Unified Motion Vocabulary},
author={Liu, Zhirui and Ji, Kaiyang and Yang, Ke and Fan, Yahao and Yu, Jingyi and Shi, Ye and Wang, Jingya},
journal={arXiv preprint arXiv:2511.22963},
year={2025}
}