Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models
Abstract
A Vision-Language-Action foundation model for robotic manipulation achieves generalization through unified alignment across representation, motion, and behavior dimensions, enabling large-scale training on diverse data sources.
Foundation models in language and multimodality achieve strong generalization by aligning heterogeneous data under a unified formulation and training at scale. In this report, we investigate whether this scaling recipe can be applied to robotic manipulation to achieve genuine generalization. This is challenging because, unlike text, manipulation data is heterogeneous by nature, expensive to collect, and narrow in diversity, making alignment and scale simultaneously difficult. We present Qwen-RobotManip, a generalizable Vision-Language-Action foundation model built on Qwen-VL. Qwen-RobotManip introduces a unified alignment framework across the representation, motion, and behavioral dimensions of manipulation, making large-scale multi-source training coherent rather than conflicting. This alignment capability in turn enables Qwen-RobotManip to absorb manipulation data at a scale that prior training regimes could not sustain. A human-to-robot synthesis pipeline converts egocentric hand demonstrations into robot trajectories across 15 platforms, and a rigorous curation pipeline harmonizes heterogeneous datasets. Using only open-source datasets and human videos without proprietary data collection, Qwen-RobotManip constructs a ~38,100-hour pretraining corpus and exhibits emergent generalization capabilities, including zero-shot instruction following, robustness to perturbations, reactive error recovery, and cross-embodiment transfer. We find that standard benchmarks fail to capture pretraining quality and instead adopt OOD settings including RoboCasa365, LIBERO-Plus, EBench, RoboTwin-Clean2Rand, RoboTwin-IF, and RoboTwin-XE. Qwen-RobotManip substantially outperforms prior state-of-the-art models, including π0.5, across all OOD settings, ranks 1st in RoboChallenge with a 20% relative improvement, and is validated on real-robot platforms including AgileX ALOHA, Franka, UR, and ARX.
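To make the alignment problem concrete, here is a minimal sketch of one common way to put actions from robots with different degrees of freedom into a single training format: pad each action to a fixed width and keep a validity mask so the loss ignores the padded slots. This is an illustrative assumption, not the method described in the report; `UNIFIED_DIM`, `pad_action`, and `masked_mse` are hypothetical names, and the example action values are made up.

```python
from typing import List, Sequence, Tuple

UNIFIED_DIM = 8  # hypothetical width of the shared action vector


def pad_action(
    action: Sequence[float], unified_dim: int = UNIFIED_DIM
) -> Tuple[List[float], List[int]]:
    """Pad an embodiment-specific action to a fixed width.

    Returns the padded action and a 0/1 mask marking which slots are real.
    """
    if len(action) > unified_dim:
        raise ValueError("action has more dimensions than the unified vector")
    n_pad = unified_dim - len(action)
    padded = list(action) + [0.0] * n_pad
    mask = [1] * len(action) + [0] * n_pad
    return padded, mask


def masked_mse(
    pred: Sequence[float], target: Sequence[float], mask: Sequence[int]
) -> float:
    """Mean squared error computed over valid (mask == 1) dimensions only."""
    valid = sum(mask)
    if valid == 0:
        raise ValueError("mask marks no valid dimensions")
    return sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask)) / valid


# Hypothetical 7-dimensional arm action: six pose deltas plus a gripper command.
arm_action = [0.1, 0.0, -0.2, 0.0, 0.0, 0.05, 1.0]
padded, mask = pad_action(arm_action)
```

With this layout, a batch can mix samples from different platforms, and whatever a model predicts in the padded slots does not contribute to the loss.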
Community
Qwen-RobotManip is worth highlighting because it brings the scaling mindset of multimodal foundation models into robot manipulation. The paper tackles a key bottleneck in VLA training: how to align heterogeneous manipulation data across embodiments, actions, and observations. By combining large-scale open robot data with human videos and human-to-robot synthesis, it shows impressive generalization across tasks and platforms, including strong OOD performance. A very relevant read for researchers interested in scalable robot learning and embodied foundation models.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments (2026)
- Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation (2026)
- GEAR-VLA: Learning Geometry-Aware Action Representations for Generalizable Robotic Manipulation (2026)
- ACE-Ego-0: Unifying Egocentric Human and Robotic Data for VLA Pretraining (2026)
- VISTA: Vision-Grounded and Physics-Validated Adaptation of UMI data for VLA Training (2026)
- Wall-OSS-0.5 Technical Report (2026)
- Revisiting Embodied Chain-of-Thought for Generalizable Robot Manipulation (2026)