arxiv:2606.17846

Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models

Published on Jun 17 · Submitted by Xiong-Hui Chen on Jun 29

Qwen

Abstract

A Vision-Language-Action foundation model for robotic manipulation achieves generalization through unified alignment across representation, motion, and behavior dimensions, enabling large-scale training on diverse data sources.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Foundation models in language and multimodality achieve strong generalization by aligning heterogeneous data under a unified formulation and training at scale. In this report, we investigate whether this scaling recipe can be applied to robotic manipulation to achieve genuine generalization. This is challenging because, unlike text, manipulation data is heterogeneous by nature, expensive to collect, and narrow in diversity, making alignment and scale simultaneously difficult. We present Qwen-RobotManip, a generalizable Vision-Language-Action foundation model built on Qwen-VL. Qwen-RobotManip introduces a unified alignment framework across the representation, motion, and behavioral dimensions of manipulation, making large-scale multi-source training coherent rather than conflicting. This alignment capability in turn enables Qwen-RobotManip to absorb manipulation data at a scale that prior training regimes could not sustain. A human-to-robot synthesis pipeline converts egocentric hand demonstrations into robot trajectories across 15 platforms, and a rigorous curation pipeline harmonizes heterogeneous datasets. Using only open-source datasets and human videos without proprietary data collection, Qwen-RobotManip constructs a ~38,100-hour pretraining corpus and exhibits emergent generalization capabilities, including zero-shot instruction following, robustness to perturbations, reactive error recovery, and cross-embodiment transfer. We find that standard benchmarks fail to capture pretraining quality and instead adopt OOD settings including RoboCasa365, LIBERO-Plus, EBench, RoboTwin-Clean2Rand, RoboTwin-IF, and RoboTwin-XE. Qwen-RobotManip substantially outperforms prior state-of-the-art models, including π0.5, across all OOD settings, ranks 1st in RoboChallenge with a 20% relative improvement, and is validated on real-robot platforms including AgileX ALOHA, Franka, UR, and ARX.

Community

xionghuichen

Paper author · Paper submitter · about 17 hours ago · edited about 16 hours ago

Qwen-RobotManip is worth highlighting because it brings the scaling mindset of multimodal foundation models into robot manipulation. The paper tackles a key bottleneck in VLA training: how to align heterogeneous manipulation data across embodiments, actions, and observations. By combining large-scale open robot data with human videos and human-to-robot synthesis, it shows impressive generalization across tasks and platforms, including strong OOD performance. A very relevant read for researchers interested in scalable robot learning and embodied foundation models.