Align-Then-stEer: Adapting the Vision-Language Action Models through Unified Latent Guidance

Yang Zhang1,2†    Chenwei Wang1,3†    Ouyang Lu1,4†    Yuan Zhao1    Yunfei Ge1    Zhenglong Sun3   
Xiu Li2    Chi Zhang1    Chenjia Bai1*    Xuelong Li1*   
1Institute of Artificial Intelligence, China Telecom          2Tsinghua University
3The Chinese University of Hong Kong, Shenzhen          4Northwestern Polytechnical University
†Equal contributions    *Corresponding authors
Figure: Method overview of Align-Then-stEer (ATE).

We present Align-Then-stEer (ATE), a plug-and-play adaptation framework for pre-trained Vision-Language-Action (VLA) models. Unlike prior methods that directly fine-tune VLAs, ATE first aligns disparate action spaces in a unified latent representation and then steers the diffusion- or flow-based VLA's generation via latent guidance, enabling data-efficient cross-task and cross-embodiment adaptation. We evaluate the framework in simulation on the RoboTwin and ManiSkill benchmarks, as well as on a real-world dual-arm RealMan 7-DoF robot, demonstrating strong generalization, bimanual dexterous coordination, and minute-level long-horizon manipulation, with substantial gains in multi-task success rates.

Abstract

Vision-Language-Action (VLA) models pre-trained on large, diverse datasets show remarkable potential for general-purpose robotic manipulation. However, a primary bottleneck remains in adapting these models to downstream tasks, especially when the robot's embodiment or the task itself differs from the pre-training data. This discrepancy leads to a significant mismatch in action distributions, demanding extensive data and compute for effective fine-tuning. To address this challenge, we introduce Align-Then-stEer (ATE), a novel, data-efficient, and plug-and-play adaptation framework. ATE first aligns disparate action spaces by constructing a unified latent space, where a variational autoencoder constrained by reverse KL divergence embeds adaptation actions into modes of the pre-training action latent distribution. Subsequently, it steers the diffusion- or flow-based VLA's generation process during fine-tuning via a guidance mechanism that pushes the model's output distribution towards the target domain. We conduct extensive experiments on cross-embodiment and cross-task manipulation in both simulation and the real world. Compared to direct fine-tuning of representative VLAs, our method improves the average multi-task success rate by up to 9.8% in simulation and achieves a striking 32% success rate gain in a real-world cross-embodiment setting. Our work presents a general and lightweight solution that greatly enhances the practicality of deploying VLA models to new robotic platforms.

Build Unified Latent Space & Steer Adaptation Process


Stage 1: Learning the Unified Action Latent Space


Stage 2: Steering Efficient Adaptation with Latent Guidance

(a) In the first stage, we construct a unified action latent space that bridges the embodiment gap between the pre-training and adaptation stages by exploiting the mode-seeking behavior of an asymmetric VAE. (b) In the second stage, we integrate classifier guidance into diffusion- and flow-based VLAs to steer the pre-trained policy toward the target action distribution of the specific robot platform.
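
To make the Stage-1 alignment concrete, here is a minimal PyTorch sketch under our own assumptions (names such as ActionVAE, reverse_kl, and the mixture prior p_pre are hypothetical, not from the released code). A small action VAE embeds action chunks, and adaptation actions are regularized by a Monte-Carlo estimate of the reverse KL divergence $\mathrm{KL}(q_\phi(z \mid a_{\text{adapt}}) \,\|\, p_{\text{pre}}(z))$ to a latent prior fit on pre-training actions; because reverse KL is mode-seeking, adaptation latents are pulled onto modes of the pre-training latent distribution.

import math
import torch
import torch.nn as nn

class ActionVAE(nn.Module):
    # Minimal action VAE; in practice the input would be a flattened action chunk.
    def __init__(self, action_dim: int, latent_dim: int = 32, hidden: int = 256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(action_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * latent_dim))
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, action_dim))

    def encode(self, a):
        mu, log_var = self.enc(a).chunk(2, dim=-1)
        return mu, log_var

def gaussian_log_prob(z, mu, log_var):
    # log N(z; mu, diag(exp(log_var))), summed over latent dimensions
    return (-0.5 * ((z - mu) ** 2 / log_var.exp() + log_var
                    + math.log(2 * math.pi))).sum(-1)

def reverse_kl(mu, log_var, prior_log_prob, n_samples: int = 8):
    # Monte-Carlo estimate of KL(q(z|a) || p_pre(z)). Because the expectation
    # is taken under q, the estimate is mode-seeking: q is penalized wherever
    # p_pre has low density, pushing adaptation latents onto p_pre's modes.
    z = mu + torch.randn(n_samples, *mu.shape) * (0.5 * log_var).exp()
    return (gaussian_log_prob(z, mu, log_var) - prior_log_prob(z)).mean()

# Usage: p_pre stands in for a latent prior fit to pre-training latents,
# here a toy 5-component Gaussian mixture built with torch.distributions.
p_pre = torch.distributions.MixtureSameFamily(
    torch.distributions.Categorical(torch.ones(5) / 5),
    torch.distributions.Independent(
        torch.distributions.Normal(torch.randn(5, 32), torch.ones(5, 32)), 1))

vae = ActionVAE(action_dim=14)          # e.g. a dual-arm 7-DoF action vector
a_adapt = torch.randn(64, 14)           # stand-in batch of adaptation actions
mu, log_var = vae.encode(a_adapt)
z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()
recon = vae.dec(z)
loss = nn.functional.mse_loss(recon, a_adapt) + 0.1 * reverse_kl(mu, log_var, p_pre.log_prob)
loss.backward()

Swapping the reverse KL for the usual forward KL would instead be mass-covering, spreading adaptation latents across the prior rather than concentrating them on its modes.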
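
Stage 2 can be sketched in the same hedged spirit (guided_step and the toy stand-ins below are ours, not the authors' implementation): at each Euler step of a flow-based VLA's sampler, a classifier-guidance-style term adds the gradient of the adaptation-domain latent log-density, computed through the Stage-1 encoder, to the predicted velocity; for diffusion VLAs the same gradient can be added to the predicted score.

import torch

def guided_step(a_t, t, dt, velocity_fn, encode_mu, target_log_prob, w=1.0):
    # One Euler step: da = (v_theta + w * grad log p_target(E(a))) * dt
    a_t = a_t.detach().requires_grad_(True)
    log_p = target_log_prob(encode_mu(a_t)).sum()
    grad = torch.autograd.grad(log_p, a_t)[0]      # guidance score
    with torch.no_grad():
        v = velocity_fn(a_t, t)                    # VLA's predicted velocity
    return (a_t + dt * (v + w * grad)).detach()

# Toy stand-ins so the sketch runs end to end: a linear "velocity field",
# a linear "encoder", and a unit-Gaussian target latent density.
velocity_fn = lambda a, t: -a
encode_mu = torch.nn.Linear(14, 32)
target = torch.distributions.Independent(
    torch.distributions.Normal(torch.zeros(32), torch.ones(32)), 1)

a = torch.randn(1, 14)                             # start from noise
n_steps = 10
for k in range(n_steps):
    t = torch.full((1,), k / n_steps)
    a = guided_step(a, t, 1.0 / n_steps, velocity_fn, encode_mu, target.log_prob, w=0.5)

The guidance weight w trades off fidelity to the pre-trained policy against attraction to the target action distribution.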

Simulation Results: RoboTwin 1.0

Figures: RoboTwin 1.0 results for $\pi_0$ and for RDT-1B.

Performance of fine-tuned VLAs on RoboTwin 1.0 (representative tasks): we use the proposed ATE to fine-tune $\pi_0$ and RDT-1B across a mixture of 17 tasks. The resulting models consistently outperform vanilla fine-tuned $\pi_0$ and RDT-1B across diverse manipulation scenarios. This unified strategy enables faster convergence and more robust performance on both single-arm and dual-arm tasks, including long-horizon and tool-use settings.

Real-world Results: Dual RealMan RM75 Robot Arms

$\pi_0$+ATE excels at long-horizon manipulation, precise pick-and-place, and complex tool use, demonstrating reliable performance across both single-arm and dual-arm tasks. Below are videos of $\pi_0$+ATE and $\pi_0$ performing manipulation tasks with different skills and objects on the physical RealMan robot. (Videos are sped up 3x.)

(Videos: $\pi_0$+ATE vs. $\pi_0$ on Make Sandwich (dual-arm handover task), Cook Bun (dual-arm coordination task), Pick Bun (single-arm precise task), and Make Yogurt Bowl (tool-use task).)

We design five real-world robot tasks covering different manipulation skills and objects, on which $\mathbf{\pi_0}$+ATE outperforms vanilla $\pi_0$, excelling in long-horizon planning, precise manipulation, dual-arm coordination, and tool use. The tasks are categorized as follows:

Dual-arm Handover Tasks (Make Sandwich, Use Toaster)
Dual-arm Coordination Task (Cook Bun)
Single-arm Precise Manipulation Task (Pick Bun)
Tool-use Task (Make Yogurt Bowl)

Real-world evaluation on the physical robot. The top panel reports success rates on four tasks and the overall average, comparing ATE with the baseline $\pi_0$; the bottom panel shows full execution trajectories of four representative tasks, covering long-horizon, single-arm, and dual-arm scenarios.

Generalization Evaluation Results

(Videos: robustness to visual disturbances. $\pi_0$+ATE vs. $\pi_0$ on Make Sandwich, Cook Bun, and Pick Bun under three conditions: poor illumination, extreme illumination, and camera flash.)

(Videos: additional generalization comparisons of $\pi_0$+ATE vs. $\pi_0$ on Make Sandwich, Cook Bun, Pick Farthest Bun, Make Yogurt Bowl, and Pick Bun.)

BibTeX

@article{zhang2025ate,
  title={Align-Then-stEer: Adapting the Vision-Language Action Models through Unified Latent Guidance},
  author={Zhang, Yang and Wang, Chenwei and Lu, Ouyang and Zhao, Yuan and Ge, Yunfei and Sun, Zhenglong and Li, Xiu and Zhang, Chi and Bai, Chenjia and Li, Xuelong},
  journal={arXiv preprint arXiv:2509.02055},
  year={2025}
}