Stage 1: Learning the Unified Action Latent Space
Stage 2: Steering Efficient Adaptation with Latent Guidance
(a) In the first stage, we construct a unified action latent space that bridges the embodiment gap between the pretraining and adaptation stages, exploiting the mode-seeking behavior of asymmetric VAEs. (b) In the second stage, we integrate classifier guidance into diffusion- and flow-based VLAs to steer the pretrained policy toward the target action distribution of a specific robot platform.
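For concreteness, below is a minimal PyTorch sketch of how the two stages could be wired together. This is not the authors' implementation: the VAE uses a standard reconstruction-plus-KL objective as a stand-in for the paper's asymmetric formulation, and `ActionVAE`, `stage2_guided_noise`, `denoiser`, and `classifier` are illustrative placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ActionVAE(nn.Module):
    """Toy VAE that maps action chunks from different embodiments into one latent space."""

    def __init__(self, action_dim: int, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(action_dim, 128), nn.ReLU(), nn.Linear(128, 2 * latent_dim)
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, action_dim)
        )

    def forward(self, actions):
        mu, log_var = self.encoder(actions).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()  # reparameterization trick
        return self.decoder(z), mu, log_var


def stage1_loss(vae: ActionVAE, actions: torch.Tensor, kl_weight: float = 1e-3):
    """Stage 1 (simplified): reconstruct actions while pulling the posterior toward a
    shared Gaussian prior, so heterogeneous action spaces land in one latent space.
    The paper's asymmetric, mode-seeking objective is replaced here by a standard KL term."""
    recon, mu, log_var = vae(actions)
    recon_loss = F.mse_loss(recon, actions)
    kl = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())
    return recon_loss + kl_weight * kl


def stage2_guided_noise(denoiser, classifier, x_t, t, obs, scale: float = 1.0):
    """Stage 2 (simplified): one classifier-guided denoising step. `classifier` is a
    hypothetical module returning the log-probability that the noisy action sample
    matches the target embodiment; its gradient steers the pretrained policy's
    prediction toward the target action distribution. Any noise-schedule factor is
    folded into `scale`."""
    x_t = x_t.detach().requires_grad_(True)
    eps = denoiser(x_t, t, obs)                 # pretrained VLA noise prediction
    log_p = classifier(x_t, t, obs).sum()       # target-embodiment log-likelihood
    grad = torch.autograd.grad(log_p, x_t)[0]   # gradient of log p w.r.t. the sample
    return eps - scale * grad                   # guided noise estimate
```

In practice these pieces would wrap the denoising head of a pretrained policy such as $\pi_0$ or RDT-1B; the sketch only shows where the alignment objective and the guidance gradient enter the two stages.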
Performance of fine-tuned VLAs on RoboTwin 1.0 (representative tasks): We use our proposed ATE to fine-tune $\pi_0$ and RDT-1B across a mixture of 17 tasks. The resulting models consistently outperform the vanilla fine-tuned $\pi_0$ and RDT-1B across diverse manipulation scenarios. This unified strategy enables faster convergence and more robust performance on both single-arm and dual-arm tasks, including long-horizon and tool-use settings.
$\pi_0$ + ATE excels at long-horizon manipulation, precise pick-and-place operations, and complex tool-use scenarios, performing reliably on both single-arm and dual-arm tasks. Below are videos of $\pi_0$ + ATE and $\pi_0$ on physical Realman robot tasks involving different manipulation skills and objects. (Videos are sped up 3x.)
(Dual-arm Handover Task)
(Dual-arm Coordination Task)
(Single-arm Precise Task)
(Tool-use Task)
We design five real-world robot tasks covering different manipulation skills and objects, on which $\pi_0$+ATE outperforms vanilla $\pi_0$, excelling in long-horizon planning, precise manipulation, dual-arm coordination, and tool use. These tasks are categorized as follows:
Real-world evaluation on physical robots. The top panel reports success rates across four tasks and the overall average, comparing ATE with the baseline $\pi_0$. The bottom panel shows full execution trajectories of four representative tasks, covering long-horizon, single-arm, and dual-arm scenarios.
@article{zhang2025ate,
  title={Align-Then-stEer: Adapting the Vision-Language Action Models through Unified Latent Guidance},
  author={Zhang, Yang and Wang, Chenwei and Lu, Ouyang and Zhao, Yuan and Ge, Yunfei and Sun, Zhenglong and Li, Xiu and Zhang, Chi and Bai, Chenjia and Li, Xuelong},
  journal={arXiv preprint arXiv:2509.02055},
  year={2025}
}