Align-Then-stEer: Adapting the Vision-Language Action Models through Unified Latent Guidance

Yang Zhang1,2†    Chenwei Wang1,3†    Ouyang Lu1,4†    Yuan Zhao1    Yunfei Ge1    Zhenglong Sun3   
Xiu Li2    Chi Zhang1    Chenjia Bai1*    Xuelong Li1*   
1Institute of Artificial Intelligence, China Telecom          2Tsinghua University
3The Chinese University of Hong Kong, Shenzhen          4Northwestern Polytechnical University
†Equal contributions    *Corresponding authors
Figure: Method overview of Align-Then-stEer (ATE).

We present Align-Then-stEer (ATE), a plug-and-play adaptation framework for pre-trained Vision-Language-Action (VLA) models. Unlike prior methods that directly fine-tune VLAs, ATE first aligns disparate action spaces in a unified latent representation and then steers the diffusion- or flow-based VLA's generation via latent guidance, enabling data-efficient cross-task and cross-embodiment adaptation. We evaluate the framework in simulation on the RoboTwin and ManiSkill benchmarks, as well as on a real-world dual-arm RealMan 7-DoF robot, demonstrating strong generalization, bimanual dexterous coordination, and minute-level long-horizon manipulation, with substantial gains in multi-task success rates.

Abstract

Vision-Language-Action (VLA) models pre-trained on large, diverse datasets show remarkable potential for general-purpose robotic manipulation. However, a primary bottleneck remains in adapting these models to downstream tasks, especially when the robot's embodiment or the task itself differs from the pre-training data. This discrepancy leads to a significant mismatch in action distributions, demanding extensive data and compute for effective fine-tuning. To address this challenge, we introduce Align-Then-stEer (ATE), a novel, data-efficient, and plug-and-play adaptation framework. ATE first aligns disparate action spaces by constructing a unified latent space, where a variational autoencoder constrained by reverse KL divergence embeds adaptation actions into modes of the pre-training action latent distribution. Subsequently, it steers the diffusion- or flow-based VLA's generation process during fine-tuning via a guidance mechanism that pushes the model's output distribution towards the target domain. We conduct extensive experiments on cross-embodiment and cross-task manipulation in both simulation and the real world. Compared to direct fine-tuning of representative VLAs, our method improves the average multi-task success rate by up to 9.8% in simulation and achieves a striking 32% success rate gain in a real-world cross-embodiment setting. Our work presents a general and lightweight solution that greatly enhances the practicality of deploying VLA models to new robotic platforms.

Build Unified Latent Space & Steer Adaptation Process


Stage 1: Learning the Unified Action Latent Space


Stage 2: Steering Efficient Adaptation with Latent Guidance

(a) In the first stage, we construct a unified action latent space that bridges the embodiment gap between the pre-training and adaptation stages by exploiting the mode-seeking behavior of an asymmetric VAE. (b) In the second stage, we integrate classifier guidance into diffusion- and flow-based VLAs to steer the pre-trained policy toward the target action distribution of the specific robot platform.
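
To make the Stage-1 alignment concrete, here is a minimal PyTorch sketch under our own assumptions (names such as ActionVAE, reverse_kl, and the mixture prior p_pre are hypothetical, not from the released code). A small action VAE embeds action chunks, and adaptation actions are regularized by a Monte-Carlo estimate of the reverse KL divergence $\mathrm{KL}(q_\phi(z \mid a_{\text{adapt}}) \,\|\, p_{\text{pre}}(z))$ to a latent prior fit on pre-training actions; because reverse KL is mode-seeking, adaptation latents are pulled onto modes of the pre-training latent distribution.

import math
import torch
import torch.nn as nn

class ActionVAE(nn.Module):
    # Minimal action VAE; in practice the input would be a flattened action chunk.
    def __init__(self, action_dim: int, latent_dim: int = 32, hidden: int = 256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(action_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * latent_dim))
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, action_dim))

    def encode(self, a):
        mu, log_var = self.enc(a).chunk(2, dim=-1)
        return mu, log_var

def gaussian_log_prob(z, mu, log_var):
    # log N(z; mu, diag(exp(log_var))), summed over latent dimensions
    return (-0.5 * ((z - mu) ** 2 / log_var.exp() + log_var
                    + math.log(2 * math.pi))).sum(-1)

def reverse_kl(mu, log_var, prior_log_prob, n_samples: int = 8):
    # Monte-Carlo estimate of KL(q(z|a) || p_pre(z)). Because the expectation
    # is taken under q, the estimate is mode-seeking: q is penalized wherever
    # p_pre has low density, pushing adaptation latents onto p_pre's modes.
    z = mu + torch.randn(n_samples, *mu.shape) * (0.5 * log_var).exp()
    return (gaussian_log_prob(z, mu, log_var) - prior_log_prob(z)).mean()

# Usage: p_pre stands in for a latent prior fit to pre-training latents,
# here a toy 5-component Gaussian mixture built with torch.distributions.
p_pre = torch.distributions.MixtureSameFamily(
    torch.distributions.Categorical(torch.ones(5) / 5),
    torch.distributions.Independent(
        torch.distributions.Normal(torch.randn(5, 32), torch.ones(5, 32)), 1))

vae = ActionVAE(action_dim=14)          # e.g. a dual-arm 7-DoF action vector
a_adapt = torch.randn(64, 14)           # stand-in batch of adaptation actions
mu, log_var = vae.encode(a_adapt)
z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()
recon = vae.dec(z)
loss = nn.functional.mse_loss(recon, a_adapt) + 0.1 * reverse_kl(mu, log_var, p_pre.log_prob)
loss.backward()

Swapping the reverse KL for the usual forward KL would instead be mass-covering, spreading adaptation latents across the prior rather than concentrating them on its modes.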
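
Stage 2 can be sketched in the same hedged spirit (guided_step and the toy stand-ins below are ours, not the authors' implementation): at each Euler step of a flow-based VLA's sampler, a classifier-guidance-style term adds the gradient of the adaptation-domain latent log-density, computed through the Stage-1 encoder, to the predicted velocity; for diffusion VLAs the same gradient can be added to the predicted score.

import torch

def guided_step(a_t, t, dt, velocity_fn, encode_mu, target_log_prob, w=1.0):
    # One Euler step: da = (v_theta + w * grad log p_target(E(a))) * dt
    a_t = a_t.detach().requires_grad_(True)
    log_p = target_log_prob(encode_mu(a_t)).sum()
    grad = torch.autograd.grad(log_p, a_t)[0]      # guidance score
    with torch.no_grad():
        v = velocity_fn(a_t, t)                    # VLA's predicted velocity
    return (a_t + dt * (v + w * grad)).detach()

# Toy stand-ins so the sketch runs end to end: a linear "velocity field",
# a linear "encoder", and a unit-Gaussian target latent density.
velocity_fn = lambda a, t: -a
encode_mu = torch.nn.Linear(14, 32)
target = torch.distributions.Independent(
    torch.distributions.Normal(torch.zeros(32), torch.ones(32)), 1)

a = torch.randn(1, 14)                             # start from noise
n_steps = 10
for k in range(n_steps):
    t = torch.full((1,), k / n_steps)
    a = guided_step(a, t, 1.0 / n_steps, velocity_fn, encode_mu, target.log_prob, w=0.5)

The guidance weight w trades off fidelity to the pre-trained policy against attraction to the target action distribution.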

Simulation Results: RoboTwin 1.0

Figures: RoboTwin 1.0 results for $\pi_0$ and for RDT-1B.

Performance of fine-tuned VLAs on RoboTwin 1.0 (representative tasks): we use the proposed ATE to fine-tune $\pi_0$ and RDT-1B across a mixture of 17 tasks. The resulting models consistently outperform vanilla fine-tuned $\pi_0$ and RDT-1B across diverse manipulation scenarios. This unified strategy enables faster convergence and more robust performance on both single-arm and dual-arm tasks, including long-horizon and tool-use settings.

Real-world Results: Dual RealMan RM75 Robot Arms

$\pi_0$+ATE excels at long-horizon manipulation, precise pick-and-place, and complex tool use, demonstrating reliable performance across both single-arm and dual-arm tasks. Below are videos of $\pi_0$+ATE and $\pi_0$ performing manipulation tasks with different skills and objects on the physical RealMan robot. (Videos are sped up 3x.)

(Videos: $\pi_0$+ATE vs. $\pi_0$ on Make Sandwich (dual-arm handover task), Cook Bun (dual-arm coordination task), Pick Bun (single-arm precise task), and Make Yogurt Bowl (tool-use task).)

We design five real-world robot tasks covering different manipulation skills and objects, on which $\mathbf{\pi_0}$+ATE outperforms vanilla $\pi_0$, excelling in long-horizon planning, precise manipulation, dual-arm coordination, and tool use. The tasks are categorized as follows:

Dual-arm Handover Tasks (Make Sandwich, Use Toaster)
Dual-arm Coordination Task (Cook Bun)
Single-arm Precise Manipulation Task (Pick Bun)
Tool-use Task (Make Yogurt Bowl)

Real-world evaluation on the physical robot. The top panel reports success rates on four tasks and the overall average, comparing ATE with the baseline $\pi_0$; the bottom panel shows full execution trajectories of four representative tasks, covering long-horizon, single-arm, and dual-arm scenarios.

Generalization Evaluation Results

(Videos: robustness to visual disturbances. $\pi_0$+ATE vs. $\pi_0$ on Make Sandwich, Cook Bun, and Pick Bun under three conditions: poor illumination, extreme illumination, and camera flash.)

(Videos: additional generalization comparisons of $\pi_0$+ATE vs. $\pi_0$ on Make Sandwich, Cook Bun, Pick Farthest Bun, Make Yogurt Bowl, and Pick Bun.)

BibTeX

@article{zhang2025ate,
  title={Align-Then-stEer: Adapting the Vision-Language Action Models through Unified Latent Guidance},
  author={Zhang, Yang and Wang, Chenwei and Lu, Ouyang and Zhao, Yuan and Ge, Yunfei and Sun, Zhenglong and Li, Xiu and Zhang, Chi and Bai, Chenjia and Li, Xuelong},
  journal={arXiv preprint arXiv:2509.02055},
  year={2025}
}