CLASS: Contrastive Learning via Action Sequence Supervision for Robot Manipulation

University of Virginia

Abstract

Recent advances in Behavior Cloning (BC) have led to strong performance in robotic manipulation, driven by expressive models, sequence modeling of actions, and large-scale demonstration data. However, BC faces significant challenges when applied to heterogeneous datasets, such as visual shifts from different camera poses or object appearances, where performance degrades despite the benefits of learning at scale. This stems from BC's tendency to overfit to individual demonstrations rather than capture shared structure, limiting generalization. To address this, we introduce Contrastive Learning via Action Sequence Supervision (CLASS), a method for learning behavioral representations from demonstrations using supervised contrastive learning. CLASS leverages weak supervision from similar action sequences identified via Dynamic Time Warping (DTW) and optimizes a soft InfoNCE loss with similarity-weighted positive pairs. We evaluate CLASS on 5 simulation benchmarks and 3 real-world tasks, where retrieval-based control using the learned representations alone achieves competitive results. Most notably, for downstream policy learning under significant visual shifts, Diffusion Policy with CLASS pre-training achieves an average success rate of 75%, while all other baseline methods fail to perform competitively.
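To make the weak supervision signal concrete, the sketch below computes a DTW distance between two action sequences and maps it to a similarity weight for soft positive pairs. This is a minimal sketch under our own assumptions, not the authors' code; the `dtw_distance` and `similarity_weight` names and the exponential mapping with a `temperature` parameter are illustrative choices.

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """DTW distance between action sequences a (T1, D) and b (T2, D)."""
    t1, t2 = len(a), len(b)
    cost = np.full((t1 + 1, t2 + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, t1 + 1):
        for j in range(1, t2 + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])   # pairwise action distance
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return float(cost[t1, t2])

def similarity_weight(a: np.ndarray, b: np.ndarray, temperature: float = 1.0) -> float:
    """Map a DTW distance to a (0, 1] weight for a soft positive pair (assumed mapping)."""
    return float(np.exp(-dtw_distance(a, b) / temperature))
```

Because DTW aligns sequences before accumulating distances, two demonstrations that execute the same motion at different speeds still receive a high similarity weight.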

Overview

Limitations of Behavior Cloning


Comparison between Behavior Cloning (BC) and Contrastive Learning via Action Sequence Supervision (CLASS). (A) Given homogeneous demonstrations with consistent visual conditions, BC learns a compact representation with high transferability. (B) Given heterogeneous demonstrations, such as varying viewpoints, BC overfits to individual state-action pairs and generalizes poorly. (C) CLASS addresses this by attracting states with similar action sequences and repelling states with dissimilar ones, using a soft supervised contrastive learning objective to learn more robust and composable representations.
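The attract/repel behavior in (C) can be made concrete with a soft InfoNCE objective in which each pair's positive weight comes from DTW similarity over action sequences. The sketch below is one plausible PyTorch formulation under our own assumptions, not the paper's exact loss; the temperature `tau` and the per-anchor weight normalization are assumed details.

```python
import torch
import torch.nn.functional as F

def soft_info_nce(z: torch.Tensor, w: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """z: (B, D) state embeddings; w: (B, B) DTW similarity weights in [0, 1]."""
    z = F.normalize(z, dim=-1)
    logits = z @ z.t() / tau                                # cosine similarity / temperature
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    logits = logits.masked_fill(self_mask, float("-inf"))   # exclude self-pairs
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)         # avoid 0 * (-inf) = NaN below
    w = w.masked_fill(self_mask, 0.0)                       # no self-supervision weight
    # Per-anchor: similarity-weighted average log-probability of its positives.
    loss = -(w * log_prob).sum(dim=1) / w.sum(dim=1).clamp_min(1e-8)
    return loss.mean()
```

In this form, a pair with weight near 1 is pulled together almost as strongly as a hard positive, while a near-zero weight leaves the pair acting as an ordinary negative in the denominator.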



Homogeneous Datasets vs Heterogeneous Datasets


Homogeneous vs. heterogeneous data collection setups. (A) Fixed camera (Fixed-Cam), commonly used in conventional behavior cloning pipelines. (B) Random static camera (Rand-Cam), where the camera pose is randomly sampled at the start of each episode but remains fixed throughout the episode. (C) Dynamic camera (Dyn-Cam), where a randomly initialized camera moves in a random direction during the episode while maintaining a consistent look-at target. (D) Fixed object color (Fixed-Color), commonly assumed in vision-based behavior cloning tasks. (E) Random object color (Rand-Color), where object colors are randomly varied in each demonstration.
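For intuition, the sketch below shows one way to sample a random camera pose that keeps a consistent look-at target, in the spirit of the Rand-Cam and Dyn-Cam setups. The `sample_lookat_camera` helper, the radius and elevation ranges, and the +z world up-axis are all hypothetical choices, not the benchmark's actual parameters.

```python
import numpy as np

def sample_lookat_camera(target, radius=(0.8, 1.2), elev_deg=(20.0, 60.0), rng=None):
    """Sample a 4x4 camera-to-world pose looking at `target` (3,) from a random viewpoint."""
    if rng is None:
        rng = np.random.default_rng()
    r = rng.uniform(*radius)                 # distance from the target
    azim = rng.uniform(0.0, 2.0 * np.pi)     # azimuth around the world z-axis
    el = np.deg2rad(rng.uniform(*elev_deg))  # elevation above the horizontal plane
    eye = target + r * np.array([np.cos(el) * np.cos(azim),
                                 np.cos(el) * np.sin(azim),
                                 np.sin(el)])
    forward = target - eye
    forward /= np.linalg.norm(forward)
    right = np.cross(forward, np.array([0.0, 0.0, 1.0]))  # world up assumed +z
    right /= np.linalg.norm(right)
    up = np.cross(right, forward)
    pose = np.eye(4)
    pose[:3, 0], pose[:3, 1], pose[:3, 2], pose[:3, 3] = right, up, -forward, eye
    return pose  # OpenGL-style convention: camera looks down its local -z axis
```

Re-sampling this pose once per episode gives the Rand-Cam setting; interpolating the eye point along a random direction during the episode gives a Dyn-Cam-style trajectory.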

Results

Simulation

CLASS consistently improves success rates across all tasks and policy classes: across the static and dynamic settings, it achieves average success rates of 85% with MLP and 91% with Diffusion Policy (DP), significantly outperforming the best baselines, which reach only 63% and 77%, respectively. Larger gains are observed under the dynamic camera and random object color setups, where CLASS achieves mean success rates of 76% (MLP) and 85% (DP), compared to 32% and 57% from the best baselines. In the non-parametric (Rep-Only) setting, CLASS achieves success rates comparable to parametric DP despite using no policy head, averaging 83% across all tasks, only 9% below DP's mean success rate.
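To make the Rep-Only setting concrete, the sketch below shows one plausible form of retrieval-based control: embed the current observation, find the nearest demonstration state in the learned embedding space, and replay its stored action chunk. This is an illustration under our own assumptions, not the paper's implementation; `obs_embedding`, `demo_embeddings`, `demo_actions`, and `horizon` are hypothetical names.

```python
import numpy as np

def rep_only_policy(obs_embedding, demo_embeddings, demo_actions, horizon=8):
    """obs_embedding: (D,); demo_embeddings: (N, D); demo_actions: per-state (T, A) chunks."""
    # Cosine similarity between the current state and every demonstration state.
    q = obs_embedding / np.linalg.norm(obs_embedding)
    db = demo_embeddings / np.linalg.norm(demo_embeddings, axis=1, keepdims=True)
    nearest = int(np.argmax(db @ q))
    # Replay the retrieved state's next `horizon` actions open-loop, then re-query.
    return demo_actions[nearest][:horizon]
```

A controller like this has no learned policy head at all, which is why Rep-Only performance is a direct probe of representation quality.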

Rollouts under Visual Variation

Square (Robomimic)

ImageNet + Diffusion Policy

ImageNet + CLASS + Diffusion Policy

Three-Stack (Mimicgen)

ImageNet + Diffusion Policy

ImageNet + CLASS + Diffusion Policy

Cube-Transfer (Aloha)

ImageNet + Diffusion Policy

ImageNet + CLASS + Diffusion Policy

Push-T

ImageNet + Diffusion Policy

ImageNet + CLASS + Diffusion Policy

Real-World Experiments

Tasks

Two-Stack

Mug-Hang

Toaster-Load

Camera Placements


Visualization of camera poses for the Mug-Hang task.


Individual camera positions from 25 random episodes for Mug-Hang.

Real-World Results


BibTeX

@inproceedings{lee2025class,
  title={CLASS: Contrastive Learning via Action Sequence Supervision for Robot Manipulation},
  author={Lee, Sung-Wook and Kang, Xuhui and Yang, Brandon and Kuo, Yen-Ling},
  booktitle={Conference on Robot Learning (CoRL)},
  year={2025},
}