R2-Tuning: Improving Image-to-Video Transfer Learning for Video Temporal Grounding

Apr 3, 2024

—

Transfer learning takes a leap with $R^2$-Tuning.

$R^2$-Tuning is a transfer learning framework for video temporal grounding (VTG). Instead of adding a separate video backbone, it builds its spatio-temporal modeling directly on CLIP features. Its core component is a lightweight $R^2$ Block, which progressively aggregates and refines spatial features from earlier CLIP layers. With this design, it reports state-of-the-art results on three VTG tasks without relying on extra backbones.
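
To make the idea of progressively combining per-layer spatial features concrete, here is a toy sketch in plain Python. It is not the authors' implementation: the function names `mean_pool` and `aggregate_layers`, the mean pooling step, and the fixed `gate` weight are all assumptions chosen for illustration. A trained block would learn its weights and would also model temporal structure, which this sketch omits.

```python
from typing import List

Vector = List[float]


def mean_pool(tokens: List[Vector]) -> Vector:
    """Average a list of patch-token vectors into one frame-level vector."""
    dim = len(tokens[0])
    return [sum(tok[d] for tok in tokens) / len(tokens) for d in range(dim)]


def aggregate_layers(layer_tokens: List[List[Vector]], gate: float = 0.5) -> Vector:
    """Fold per-layer spatial features into one running state, one layer at a time.

    layer_tokens[i] holds the patch tokens of the i-th layer visited, for a
    single frame. The caller decides the visiting order.
    `gate` is a hypothetical fixed blending weight (assumption, not learned).
    """
    # Start from the pooled features of the first layer visited.
    state = mean_pool(layer_tokens[0])
    for tokens in layer_tokens[1:]:
        pooled = mean_pool(tokens)
        # Blend the running state with the newly pooled layer features.
        state = [gate * s + (1.0 - gate) * p for s, p in zip(state, pooled)]
    return state
```

For example, `aggregate_layers([[[1.0, 3.0], [3.0, 5.0]], [[0.0, 0.0]]])` pools the first layer to `[2.0, 4.0]`, pools the second to `[0.0, 0.0]`, and blends them into `[1.0, 2.0]`.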