VT-WAM: Visual-Tactile World Action Model for Contact-Rich Manipulation

Shuai Tian^1,2, Yupeng Zheng^1,2,3, Yuhang Zheng^4, Songen Gu^5, Yujie Zang^3,4,
Yuxing Qin^1,2, Weize Li^3, Haoran Li^1,2,†, Wenchao Ding^3,†, Dongbin Zhao^1,2

^1 SKL-MAIS, Institute of Automation, Chinese Academy of Sciences
^2 School of Artificial Intelligence, University of Chinese Academy of Sciences
^3 TARS Robotics
^4 National University of Singapore
^5 Fudan University

^† Corresponding author


Overview

VT-WAM couples action prediction with tactile deformation dynamics, allowing temporally sparse contact information to guide action generation in contact-rich manipulation.

Problem: Contact-rich manipulation depends on local deformation, pressure, slip, and friction, but these cues are temporally sparse and often only weakly visible in wrist-camera observations.
Limitation: Existing visual-tactile policies usually feed tactile observations into action prediction, but rarely model tactile deformation dynamics during action generation.
Key idea: VT-WAM jointly learns future visual prediction, tactile deformation prediction, and action prediction within a unified flow matching framework.
Result: Across six real-world tasks, VT-WAM reaches 71.67% average success, compared with 45.00% for Fast-WAM and 35.83% for OmniVTLA.
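
To make the key idea concrete, the following is a minimal sketch of a joint flow matching loss over three modalities. It is an illustration under stated assumptions, not the VT-WAM implementation: the straight-line path between noise and data, the per-modality loss weights, and the names `joint_flow_matching_loss` and `predict_velocity` are all hypothetical, and each modality is reduced to a short list of floats.

```python
import random
from typing import Callable, Dict, List

Vector = List[float]


def interpolate(noise: Vector, data: Vector, t: float) -> Vector:
    """Point on the straight path from noise (t=0) to data (t=1)."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]


def target_velocity(noise: Vector, data: Vector) -> Vector:
    """Velocity of the straight path; it does not depend on t."""
    return [d - n for n, d in zip(noise, data)]


def mse(pred: Vector, target: Vector) -> float:
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


def joint_flow_matching_loss(
    predict_velocity: Callable[[str, Vector, float], Vector],
    batch: Dict[str, Vector],
    weights: Dict[str, float],
    t: float,
    rng: random.Random,
) -> float:
    """Weighted sum of per-modality velocity regression losses.

    Assumption: one shared time t for all modalities and a plain weighted sum.
    """
    total = 0.0
    for modality, data in batch.items():
        noise = [rng.gauss(0.0, 1.0) for _ in data]
        x_t = interpolate(noise, data, t)
        v_pred = predict_velocity(modality, x_t, t)
        total += weights[modality] * mse(v_pred, target_velocity(noise, data))
    return total


# Toy targets standing in for future visual, tactile, and action tokens.
batch = {"visual": [0.2, -0.1], "tactile": [0.5, 0.0], "action": [1.0, -1.0]}
weights = {"visual": 1.0, "tactile": 1.0, "action": 1.0}


def zero_model(modality: str, x_t: Vector, t: float) -> Vector:
    """Placeholder predictor that always outputs zero velocity."""
    return [0.0 for _ in x_t]


loss = joint_flow_matching_loss(zero_model, batch, weights, t=0.3, rng=random.Random(0))
```

In this sketch a single scalar loss couples all three modalities, so gradients from visual and tactile prediction would reach any parameters shared with action prediction. How VT-WAM actually weights or schedules the three terms is not stated here.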

Method

VT-WAM jointly learns future visual prediction, tactile deformation prediction, and action prediction through modality-specific experts, Asymmetric MoT Attention, and contact-gated AVTAG.

Visual-tactile-action flow matching: a unified flow matching objective jointly learns future visual prediction, tactile deformation prediction, and action prediction.
Asymmetric MoT Attention: action tokens attend to the first-frame visual anchor and the full tactile sequence.
Contact-gated AVTAG: a training-only hinge ranking loss encourages action queries to use tactile information during contact phases.
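
The attention rule for action tokens described above can be sketched as a visibility mask. This is a simplified illustration, not the released architecture: it assumes one token per modality per frame, assumes that action tokens may also attend to other action tokens, and leaves the rules for visual and tactile queries unspecified. The names `Token`, `build_tokens`, and `action_mask` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Token:
    modality: str  # "visual", "tactile", or "action"
    frame: int     # time index within the prediction window


def action_can_attend(key: Token) -> bool:
    """Visibility of a key token from the point of view of an action query."""
    if key.modality == "visual":
        return key.frame == 0        # only the first-frame visual anchor
    if key.modality == "tactile":
        return True                  # the full tactile sequence
    return key.modality == "action"  # assumption: action tokens see each other


def build_tokens(num_frames: int) -> List[Token]:
    """Lay out tokens as all visual frames, then tactile, then action."""
    tokens: List[Token] = []
    for modality in ("visual", "tactile", "action"):
        tokens.extend(Token(modality, t) for t in range(num_frames))
    return tokens


def action_mask(tokens: List[Token]) -> List[bool]:
    """One mask row, shared by every action query."""
    return [action_can_attend(key) for key in tokens]


tokens = build_tokens(num_frames=3)
mask = action_mask(tokens)
```

With three frames, the mask keeps one of the three visual tokens and all three tactile tokens. That asymmetry is the point of the design: the action query sees a static visual anchor but the whole tactile history.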

Real-World Experiments

VT-WAM achieves 71.67% average success across six real-world contact-rich tasks, compared with 45.00% for Fast-WAM and 35.83% for OmniVTLA.

Platform: 7-DoF xArm7 robot with a Robotiq 2F-85 gripper, wrist camera, and paired Xense tactile sensors.
Tasks: six real-world tasks covering surface-interaction and constrained insertion regimes.
Training data: 100 expert trajectories per task collected through human kinesthetic teaching.

Figure 2. Overview of six real-world contact-rich manipulation tasks.

Table 1. Success rates on real-world contact-rich tasks.

Method	Wipe Board	Wipe Vase	Peel Cucumber	Surface-Interaction Avg.	Insert Plug	Swipe Card	Insert Tube	Constrained Insertion Avg.	Overall Avg.
DP + Tactile	30%	20%	25%	25.00%	5%	35%	15%	18.33%	21.67%
RDP	45%	60%	40%	48.33%	15%	35%	10%	20.00%	34.17%
π_0.5	40%	35%	35%	36.67%	30%	45%	10%	28.33%	32.50%
OmniVTLA	45%	30%	25%	33.33%	40%	35%	40%	38.33%	35.83%
Fast-WAM	70%	55%	45%	56.67%	20%	55%	25%	33.33%	45.00%
VT-WAM	90%	85%	70%	81.67%	60%	70%	55%	61.67%	71.67%
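
The group and overall averages in Table 1 follow directly from the per-task success rates. The short script below recomputes them as a consistency check; the per-task numbers are copied from the table, while the names `RESULTS` and `summarize` are introduced here for illustration only.

```python
from typing import Dict, List, Tuple

# Per-task success rates in percent:
# (surface-interaction tasks, constrained insertion tasks), in table order.
RESULTS: Dict[str, Tuple[List[int], List[int]]] = {
    "DP + Tactile": ([30, 20, 25], [5, 35, 15]),
    "RDP": ([45, 60, 40], [15, 35, 10]),
    "pi_0.5": ([40, 35, 35], [30, 45, 10]),
    "OmniVTLA": ([45, 30, 25], [40, 35, 40]),
    "Fast-WAM": ([70, 55, 45], [20, 55, 25]),
    "VT-WAM": ([90, 85, 70], [60, 70, 55]),
}


def mean(values: List[int]) -> float:
    return sum(values) / len(values)


def summarize(surface: List[int], insertion: List[int]) -> Tuple[float, float, float]:
    """Return (surface avg, insertion avg, overall avg), rounded to 2 decimals."""
    return (
        round(mean(surface), 2),
        round(mean(insertion), 2),
        round(mean(surface + insertion), 2),
    )


for method, (surface, insertion) in RESULTS.items():
    s_avg, i_avg, overall = summarize(surface, insertion)
    print(f"{method:12s} {s_avg:6.2f} {i_avg:6.2f} {overall:6.2f}")
```

The overall average is taken over all six tasks. Because both groups contain three tasks, this equals the mean of the two group averages.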

Visual-Tactile Prediction

VT-WAM predicts wrist-camera observations together with tactile deformation fields, showing that the tactile expert learns meaningful contact deformation dynamics.

Joint inference mode: VT-WAM denoises visual, tactile, and action tokens together for prediction analysis.
Qualitative result: predictions show temporally coherent wrist-camera observations and tactile deformation trajectories that capture pressure concentration and contact migration.
Quantitative result: VT-WAM achieves lower deformation error and higher directional consistency than baseline models.
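
The two quantities named in the quantitative result can be illustrated as follows. The exact metric definitions are not given on this page, so this sketch assumes a common choice: a deformation field is a list of 2D displacement vectors, deformation error is the mean Euclidean distance between predicted and measured vectors, and directional consistency is their mean cosine similarity. Both function names are hypothetical.

```python
import math
from typing import List, Tuple

Field = List[Tuple[float, float]]  # one (dx, dy) displacement per tracked point


def mean_endpoint_error(pred: Field, target: Field) -> float:
    """Mean Euclidean distance between predicted and target displacements."""
    return sum(
        math.hypot(px - tx, py - ty) for (px, py), (tx, ty) in zip(pred, target)
    ) / len(pred)


def mean_cosine_similarity(pred: Field, target: Field, eps: float = 1e-8) -> float:
    """Mean cosine similarity, skipping points with near-zero displacement."""
    sims = []
    for (px, py), (tx, ty) in zip(pred, target):
        norm = math.hypot(px, py) * math.hypot(tx, ty)
        if norm < eps:
            continue  # direction is undefined when either vector is ~zero
        sims.append((px * tx + py * ty) / norm)
    return sum(sims) / len(sims) if sims else 0.0


# Toy example: the second predicted vector has the right direction
# but twice the magnitude of the target.
pred = [(1.0, 0.0), (0.0, 2.0)]
target = [(1.0, 0.0), (0.0, 1.0)]
error = mean_endpoint_error(pred, target)
consistency = mean_cosine_similarity(pred, target)
```

The toy example shows why the two metrics are complementary: the prediction is perfectly aligned in direction but still incurs a nonzero magnitude error.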

Figure 3. Visual-tactile prediction results across six tasks.

Ablation Studies

Ablations on wipe vase and insert tube evaluate two design questions: how to incorporate tactile dynamics into action prediction, and whether AVTAG improves real-world success.

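Tactile dynamics are important. Both variants that predict the tactile sequence (M₁ and M₃) improve over Fast-WAM (M₀) on both tasks, indicating that tactile sequence prediction adds contact dynamics beyond visual dynamics. Full tactile history is also better than using only the initial tactile frame: M₃ reaches 70% on wipe vase and 50% on insert tube, compared with 40% and 30% for M₂.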

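Contact-gated guidance helps. Adding AVTAG to M₃ raises success from 70% to 85% on wipe vase and from 50% to 55% on insert tube, consistent with AVTAG improving contact-aware tactile attention during training.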

Table 2. Ablation study on tactile dynamics modeling and attention guidance.

Model	Description	Wipe Vase	Insert Tube
M₀	Fast-WAM	55%	25%
M₁	M₀ + Sym. (T Seq.)	65%	40%
M₂	M₀ + Asym. (T₀)	40%	30%
M₃	M₀ + Asym. (T Seq.)	70%	50%
M₄	VT-WAM: M₃ + AVTAG	85%	55%

AVTAG-Guided Tactile Attention

This example isolates a contact-disturbance case. When the supporting plane moves downward, the wrist-camera view changes only subtly, so the policy must rely on tactile information to identify and correct the loss of contact.

w/o AVTAG

Visual-dominant attention remains nearly static. The contact loss is hard to infer from the wrist view alone, so the policy fails to re-establish contact.

VT-WAM without AVTAG weakens tactile attention after contact is disturbed

w/ AVTAG

Tactile attention increases during contact. The policy uses tactile information to re-establish contact with the vase surface and complete the wiping task.

VT-WAM with AVTAG maintains tactile attention during contact re-establishment

Figure 4. Attention and force traces during the vase-wiping disturbance. Red and blue curves denote relative tactile and visual attention weights; the dashed curve denotes contact force.
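
The relationship between the curves in Figure 4 and the training-only hinge ranking loss can be sketched as below. This is an assumption-laden illustration rather than the AVTAG formulation: the contact gate is taken to be a simple force threshold, the loss asks the tactile attention share to exceed the visual share by a margin, and `force_threshold`, `margin`, and both function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def relative_attention(
    tactile_weights: Sequence[float], visual_weights: Sequence[float]
) -> Tuple[float, float]:
    """Share of an action query's attention mass on tactile vs. visual keys."""
    tactile_mass = sum(tactile_weights)
    visual_mass = sum(visual_weights)
    total = tactile_mass + visual_mass
    return tactile_mass / total, visual_mass / total


def contact_gated_hinge(
    tactile_share: Sequence[float],
    visual_share: Sequence[float],
    contact_force: Sequence[float],
    force_threshold: float = 0.5,  # hypothetical gate: contact if force >= threshold
    margin: float = 0.2,           # hypothetical ranking margin
) -> float:
    """Mean hinge penalty over contact steps; non-contact steps are ignored."""
    penalties: List[float] = []
    for tac, vis, force in zip(tactile_share, visual_share, contact_force):
        if force < force_threshold:
            continue  # gate closed: no constraint outside contact phases
        penalties.append(max(0.0, margin - (tac - vis)))
    return sum(penalties) / len(penalties) if penalties else 0.0


# Three time steps: no contact, contact with tactile-dominant attention,
# contact with balanced attention.
tactile_share = [0.3, 0.7, 0.5]
visual_share = [0.7, 0.3, 0.5]
contact_force = [0.0, 1.0, 1.0]
loss = contact_gated_hinge(tactile_share, visual_share, contact_force)
```

Only the third step is penalized here. The first is outside contact, and the second already ranks tactile attention above visual attention by more than the margin.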

Real-World Demo Comparison

Representative demos compare OmniVTLA, Fast-WAM, and VT-WAM on the same task.

OmniVTLA: Failure
Fast-WAM: Partial Success
VT-WAM: Success

Cite our paper

@article{vtwam2026,
  title   = {VT-WAM: Visual-Tactile World Action Model for Contact-Rich Manipulation},
  author  = {Shuai Tian and Yupeng Zheng and Yuhang Zheng and Songen Gu and Yujie Zang and Yuxing Qin and Weize Li and Haoran Li and Wenchao Ding and Dongbin Zhao},
  journal = {Under Review},
  year    = {2026}
}