Training-Free Zero-Shot Temporal Action Detection with Vision-Language Models

Chaolei Han¹, Hongsong Wang¹, Jidong Kuang¹, Lei Zhang², Jie Gui¹
¹Southeast University ²Nanjing Normal University

Abstract

Existing zero-shot temporal action detection (ZSTAD) methods predominantly use fully supervised or unsupervised strategies to recognize unseen activities. However, these training-based methods are susceptible to domain shifts and incur high computational costs. In contrast, we propose a training-Free Zero-shot temporal Action Detection (FreeZAD) method that leverages image-pretrained vision-language models (VLMs) to directly classify and localize unseen activities in untrimmed videos. We reduce the need for explicit temporal modeling and the reliance on pseudo-label quality by designing the Logarithmic decay weighted Outer-Inner-Contrastive Score (LogOIC) and frequency-based actionness calibration. Furthermore, we introduce a test-time adaptation (TTA) strategy based on Prototype-Centric Sampling (PCS) that extends FreeZAD, enabling VLMs to adapt more effectively to ZSTAD. Extensive experiments on the THUMOS14 and ActivityNet-1.3 datasets demonstrate that our training-free method outperforms state-of-the-art unsupervised methods while requiring only 1/13 of the runtime. When equipped with TTA, the enhanced method further narrows the gap with fully supervised ZSTAD methods.
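To make the scoring idea named in the abstract more concrete, the following is a minimal sketch of a plain Outer-Inner-Contrastive score as it is commonly used in weakly supervised localization: the mean actionness inside a candidate segment minus the mean actionness in a small inflated region around it. It is not the LogOIC proposed here; the logarithmic decay weighting is not specified in this section and is therefore not reproduced. The function name `oic_score`, the `inflate` ratio of 0.25, and the toy actionness values are all assumptions made for illustration.

```python
from statistics import fmean


def oic_score(actionness, start, end, inflate=0.25):
    """Plain Outer-Inner-Contrastive score for the segment [start, end).

    actionness: per-frame actionness scores (hypothetical values in [0, 1])
    start, end: frame indices of the candidate segment, end exclusive
    inflate:    fraction of the segment length added on each side to form
                the outer region (an assumed default, not from the paper)
    """
    scores = list(actionness)
    if not 0 <= start < end <= len(scores):
        raise ValueError("need 0 <= start < end <= len(actionness)")

    # Width of the outer margin on each side, at least one frame.
    margin = max(1, int(inflate * (end - start)))
    outer_lo = max(0, start - margin)
    outer_hi = min(len(scores), end + margin)

    inner = scores[start:end]
    # Outer region: frames just before and just after the segment.
    outer = scores[outer_lo:start] + scores[end:outer_hi]

    inner_mean = fmean(inner)
    # If the segment spans the whole video there is no outer region.
    outer_mean = fmean(outer) if outer else 0.0
    return inner_mean - outer_mean


# Toy actionness curve: background, a four-frame action, background.
toy_actionness = [0.1, 0.1, 0.9, 0.9, 0.9, 0.9, 0.1, 0.1]
tight = oic_score(toy_actionness, 2, 6)  # boundaries match the action
loose = oic_score(toy_actionness, 1, 7)  # boundaries include background
```

On this toy curve the tightly bounded segment receives a higher score than the loose one, which is the property that makes such a score usable for ranking candidate segments without any training.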

Samples

We visualize the detection results of our AdaZAD method and a previous unsupervised state-of-the-art (SOTA) method on three videos from the THUMOS14 dataset. The videos represent easy, moderate, and hard cases, and the frames shown are uniformly sampled from each video. In sample (a), the action 'Long Jump' is accurately detected because it is clearly distinct from the background, in which the subject is simply standing still. In sample (b), our detection starts with the 'clean' and ends after the 'jerk', as indicated by the action name. We believe the failure to detect the preparatory action is due to the lack of adequate descriptive information in the action name. Sample (c) illustrates a failed detection, where the model fails to distinguish the swing movement from a person who is holding a tennis racket but not actively swinging. Despite these limitations, our results are more reliable than those produced by the state-of-the-art unsupervised method.