VT-WAM: Visual-Tactile World Action Model for Contact-Rich Manipulation

Shuai Tian1,2, Yupeng Zheng1,2,3, Yuhang Zheng4, Songen Gu5, Yujie Zang3,4,
Yuxing Qin1,2, Weize Li3, Haoran Li1,2,†, Wenchao Ding3,†, Dongbin Zhao1,2
1 SKL-MAIS, Institute of Automation, Chinese Academy of Sciences
2 School of Artificial Intelligence, University of Chinese Academy of Sciences
3 TARS Robotics 4 National University of Singapore 5 Fudan University
† Corresponding Author

Overview

VT-WAM couples action prediction with tactile deformation dynamics, allowing temporally sparse contact information to guide action generation in contact-rich manipulation.

  • Problem: Contact-rich manipulation depends on local deformation, pressure, slip, and friction, yet these cues are temporally sparse and often weakly visible in wrist-camera observations.
  • Limitation: Existing visual-tactile policies usually feed tactile observations into action prediction, but rarely model tactile deformation dynamics during action generation.
  • Key idea: VT-WAM jointly learns future visual prediction, tactile deformation prediction, and action prediction within a unified flow matching framework.
  • Result: Across six real-world tasks, VT-WAM reaches 71.67% average success, compared with 45.00% for Fast-WAM and 35.83% for OmniVTLA.

Method

VT-WAM jointly learns future visual prediction, tactile deformation prediction, and action prediction through modality-specific experts, Asymmetric MoT Attention, and contact-gated AVTAG.

  • Visual-tactile-action flow matching: a unified flow matching objective jointly learns future visual prediction, tactile deformation prediction, and action prediction.
  • Asymmetric MoT Attention: action tokens attend to the first-frame visual anchor and the full tactile sequence.
  • Contact-gated AVTAG: a training-only hinge ranking loss encourages action queries to use tactile information during contact phases.
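The joint objective in the first bullet can be pictured with a minimal sketch. It assumes a linear (rectified-flow style) interpolation path, a shared time step across modalities, and equal loss weights; none of these details are stated on this page. The `experts` callables are hypothetical stand-ins for the modality-specific experts, and the cross-modal attention between them is omitted.

```python
import random

def flow_matching_loss(clean, noise, t, predict_velocity):
    """Mean squared error between predicted and target velocity for one modality."""
    # Linear interpolation between noise (t=0) and clean data (t=1).
    x_t = [(1.0 - t) * n + t * c for n, c in zip(noise, clean)]
    # Along a linear path the target velocity is constant: clean - noise.
    target = [c - n for n, c in zip(noise, clean)]
    pred = predict_velocity(x_t, t)
    return sum((p - g) ** 2 for p, g in zip(pred, target)) / len(target)

def joint_loss(batch, experts, weights, rng):
    """Weighted sum of per-modality losses at one shared time step (assumption)."""
    t = rng.random()
    total = 0.0
    for name, clean in batch.items():
        noise = [rng.gauss(0.0, 1.0) for _ in clean]
        total += weights[name] * flow_matching_loss(clean, noise, t, experts[name])
    return total

def zero_expert(x_t, t):
    """Hypothetical placeholder expert that always predicts zero velocity."""
    return [0.0 for _ in x_t]

# Toy token values for the three modalities (illustrative only).
batch = {"visual": [0.2, 0.4], "tactile": [0.1, -0.3], "action": [0.5]}
experts = {name: zero_expert for name in batch}
weights = {name: 1.0 for name in batch}
loss = joint_loss(batch, experts, weights, random.Random(0))
```

In this sketch all three modalities are trained with the same kind of velocity-regression target, which is what makes the framework "unified"; how VT-WAM actually samples time steps and weights the terms may differ.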
Figure 1. Overview of VT-WAM.
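The Asymmetric MoT Attention rule listed above (action tokens see the first-frame visual anchor and the full tactile sequence) can be illustrated as a boolean key mask for action queries. This is a sketch under assumptions: the token layout, the per-frame token counts, and the choice to let action tokens attend to one another are hypothetical, and the function name `action_key_mask` is illustrative rather than taken from the authors' code.

```python
def action_key_mask(num_frames, visual_per_frame, tactile_per_frame,
                    num_action, full_tactile=True):
    """Return the keys an action query may attend to, as a list of booleans.

    Assumed key layout: [visual frames 0..F-1 | tactile frames 0..F-1 | action].
    """
    mask = []
    for frame in range(num_frames):
        # Only the first visual frame (the anchor) is visible to action queries.
        mask += [frame == 0] * visual_per_frame
    for frame in range(num_frames):
        # Full tactile sequence by default; first frame only if full_tactile=False.
        mask += [full_tactile or frame == 0] * tactile_per_frame
    # Assumption: action tokens also attend to one another.
    mask += [True] * num_action
    return mask

# 3 frames, 2 visual tokens and 1 tactile token per frame, 2 action tokens.
row = action_key_mask(3, 2, 1, 2)
first_frame_only = action_key_mask(3, 2, 1, 2, full_tactile=False)
```

Setting `full_tactile=False` gives a variant in which action queries see only the initial tactile frame, which is the kind of contrast the ablation section draws between the initial tactile frame and the full tactile history.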

Real-World Experiments

VT-WAM achieves 71.67% average success across six real-world contact-rich tasks, compared with 45.00% for Fast-WAM and 35.83% for OmniVTLA.

  • Platform: 7-DoF xArm7 robot with a Robotiq 2F-85 gripper, wrist camera, and paired Xense tactile sensors.
  • Tasks: six real-world tasks covering surface-interaction and constrained insertion regimes.
  • Training data: 100 expert trajectories per task collected through human kinesthetic teaching.
Figure 2. Overview of six real-world contact-rich manipulation tasks.
Table 1. Success rates on real-world contact-rich tasks.
                Surface-Interaction Tasks                     Constrained Insertion Tasks
Method          Wipe Board  Wipe Vase  Peel Cucumber  Avg.    Insert Plug  Swipe Card  Insert Tube  Avg.    Average
DP + Tactile    30%         20%        25%            25.00%  5%           35%         15%          18.33%  21.67%
RDP             45%         60%        40%            48.33%  15%          35%         10%          20.00%  34.17%
π0.5            40%         35%        35%            36.67%  30%          45%         10%          28.33%  32.50%
OmniVTLA        45%         30%        25%            33.33%  40%          35%         40%          38.33%  35.83%
Fast-WAM        70%         55%        45%            56.67%  20%          55%         25%          33.33%  45.00%
VT-WAM          90%         85%        70%            81.67%  60%          70%         55%          61.67%  71.67%
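As a quick arithmetic check, the averages in Table 1 can be reproduced from the per-task success rates. The short script below uses only the numbers reported in the table; the variable names are illustrative.

```python
# Per-task success rates (%) from Table 1: (surface-interaction, constrained insertion).
rates = {
    "DP + Tactile": ([30, 20, 25], [5, 35, 15]),
    "RDP": ([45, 60, 40], [15, 35, 10]),
    "π0.5": ([40, 35, 35], [30, 45, 10]),
    "OmniVTLA": ([45, 30, 25], [40, 35, 40]),
    "Fast-WAM": ([70, 55, 45], [20, 55, 25]),
    "VT-WAM": ([90, 85, 70], [60, 70, 55]),
}

def mean(values):
    return sum(values) / len(values)

summary = {}
for method, (surface, insertion) in rates.items():
    # Overall average is the mean over all six tasks.
    summary[method] = (
        round(mean(surface), 2),
        round(mean(insertion), 2),
        round(mean(surface + insertion), 2),
    )

# Gap in average success between VT-WAM and the strongest baseline, in points.
gap = round(mean(sum(rates["VT-WAM"], [])) - mean(sum(rates["Fast-WAM"], [])), 2)
```

Because both regimes contain three tasks, the six-task mean equals the mean of the two regime averages, so either reading of the "Average" column gives the same numbers.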

Visual-Tactile Prediction

VT-WAM predicts wrist-camera observations together with tactile deformation fields, showing that the tactile expert learns meaningful contact deformation dynamics.

  • Joint inference mode: VT-WAM denoises visual, tactile, and action tokens together for prediction analysis.
  • Qualitative result: predictions show temporally coherent wrist-camera observations and tactile deformation trajectories that capture pressure concentration and contact migration.
  • Quantitative result: VT-WAM achieves lower deformation error and higher directional consistency than baseline models.
Figure 3. Visual-tactile prediction results across six tasks.
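The page does not define "deformation error" or "directional consistency". One plausible reading, offered purely as an assumption, is a mean endpoint distance between predicted and reference displacement vectors and a mean cosine similarity between them. The sketch below implements that reading on a toy deformation field; the function names and the choice of metrics are hypothetical.

```python
import math

def deformation_error(pred, ref):
    """Mean Euclidean distance between predicted and reference displacement vectors."""
    return sum(math.dist(p, r) for p, r in zip(pred, ref)) / len(ref)

def directional_consistency(pred, ref, eps=1e-8):
    """Mean cosine similarity between predicted and reference displacement vectors."""
    total = 0.0
    for p, r in zip(pred, ref):
        dot = sum(a * b for a, b in zip(p, r))
        norm = math.hypot(*p) * math.hypot(*r)
        # eps guards against division by zero for vanishing displacements.
        total += dot / (norm + eps)
    return total / len(ref)

# Toy field with two displacement vectors (illustrative values only).
pred_field = [(1.0, 0.0), (0.0, 2.0)]
ref_field = [(1.0, 0.0), (0.0, 1.0)]
err = deformation_error(pred_field, ref_field)
cos = directional_consistency(pred_field, ref_field)
```

Under this reading, a prediction can point in the right direction everywhere (consistency near 1) while still over- or under-estimating magnitude (nonzero error), which is why the two quantities are reported separately.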

Ablation Studies

Ablations on wipe vase and insert tube evaluate two design questions: how to incorporate tactile dynamics into action prediction, and whether AVTAG improves real-world success.

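Tactile dynamics are important. Adding tactile sequence modeling to Fast-WAM (M1 and M3 vs. M0) raises success on both tasks, so tactile dynamics contribute beyond visual dynamics. Attending to the full tactile history also beats attending to only the initial tactile frame (M3 vs. M2: 70% vs. 40% on wipe vase, 50% vs. 30% on insert tube).
Contact-gated guidance helps. AVTAG, which guides contact-aware tactile attention during training, raises success from 70% to 85% on wipe vase and from 50% to 55% on insert tube (M4 vs. M3).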
Table 2. Ablation study on tactile dynamics modeling and attention guidance.
Model  Description            Wipe Vase  Insert Tube
M0     Fast-WAM               55%        25%
M1     M0 + Sym. (T Seq.)     65%        40%
M2     M0 + Asym. (T0)        40%        30%
M3     M0 + Asym. (T Seq.)    70%        50%
M4     VT-WAM: M3 + AVTAG     85%        55%
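The pairwise comparisons implied by Table 2 can be read off with a few subtractions. The snippet below uses only the table's numbers; the helper name `gain` is illustrative.

```python
# Success rates (%) from Table 2: (wipe vase, insert tube).
ablation = {
    "M0": (55, 25),
    "M1": (65, 40),
    "M2": (40, 30),
    "M3": (70, 50),
    "M4": (85, 55),
}

def gain(variant, reference):
    """Percentage-point difference between two variants on each task."""
    return tuple(v - r for v, r in zip(ablation[variant], ablation[reference]))

full_vs_initial_frame = gain("M3", "M2")  # full tactile sequence vs. T0 only
asym_vs_sym = gain("M3", "M1")            # asymmetric vs. symmetric attention
avtag_effect = gain("M4", "M3")           # adding AVTAG
initial_frame_vs_base = gain("M2", "M0")  # T0-only variant vs. Fast-WAM
```

Note that the T0-only variant (M2) falls below Fast-WAM on wipe vase while improving slightly on insert tube, so the benefit in this table comes from modeling the tactile sequence rather than from adding a tactile input alone.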

AVTAG-Guided Tactile Attention

This example isolates a contact-disturbance case. When the supporting plane moves downward, the wrist-camera view changes only subtly, so the policy must rely on tactile information to identify and correct the loss of contact.

w/o AVTAG

Visual-dominant attention remains nearly static. The contact loss is hard to infer from the wrist view alone, so the policy fails to re-establish contact.

VT-WAM without AVTAG weakens tactile attention after contact is disturbed

w/ AVTAG

Tactile attention increases during contact. The policy uses tactile information to re-establish contact with the vase surface and complete the wiping task.

VT-WAM with AVTAG maintains tactile attention during contact re-establishment

Figure 4. Attention and force traces during the vase-wiping disturbance. Red and blue curves denote relative tactile and visual attention weights; the dashed curve denotes contact force.
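To make the idea of a contact-gated hinge ranking loss concrete, the sketch below penalizes time steps where relative tactile attention does not exceed relative visual attention by a margin, and only while a contact gate is open. The exact form of AVTAG is not given on this page: the force-threshold gate, the margin value, and the function name are all assumptions made for illustration.

```python
def contact_gated_hinge(tactile_attn, visual_attn, contact_force,
                        force_threshold=0.5, margin=0.1):
    """Average hinge ranking penalty over time steps flagged as in contact."""
    penalties = []
    for a_tac, a_vis, force in zip(tactile_attn, visual_attn, contact_force):
        if force <= force_threshold:
            # Gate closed: no guidance outside contact phases.
            continue
        # Hinge: zero once tactile attention leads visual attention by the margin.
        penalties.append(max(0.0, margin - (a_tac - a_vis)))
    return sum(penalties) / len(penalties) if penalties else 0.0

# Toy traces over three time steps (illustrative values only).
tactile = [0.2, 0.6, 0.3]
visual = [0.8, 0.4, 0.7]
force = [0.0, 1.0, 1.0]
penalty = contact_gated_hinge(tactile, visual, force)
```

In the toy trace, the first step is ignored because there is no contact, the second incurs no penalty because tactile attention already leads, and the third is penalized because visual attention dominates during contact. Since the loss is training-only, nothing like it would run at inference time.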

Real-World Demo Comparison

Representative demos compare OmniVTLA, Fast-WAM, and VT-WAM on a selected task. Outcomes of the displayed rollouts:

  • OmniVTLA: Failure
  • Fast-WAM: Partial Success
  • VT-WAM: Success

Cite our paper

@article{vtwam2026,
  title   = {VT-WAM: Visual-Tactile World Action Model for Contact-Rich Manipulation},
  author  = {Shuai Tian and Yupeng Zheng and Yuhang Zheng and Songen Gu and Yujie Zang and Yuxing Qin and Weize Li and Haoran Li and Wenchao Ding and Dongbin Zhao},
  journal = {Under Review},
  year    = {2026}
}