
Neelu Madan et al.

Video Foundation Models (ViFMs) aim to learn general-purpose representations for a wide range of video understanding tasks by leveraging large-scale datasets and powerful models to capture robust, generic features from video data. This survey analyzes over 200 methods and offers a comprehensive overview of benchmarks and evaluation metrics across 15 distinct video tasks, organized into three main groups. We also provide an in-depth performance analysis of these models on the six most common video tasks. We identify three main approaches to constructing ViFMs: 1) Image-based ViFMs, which adapt image foundation models to video tasks; 2) Video-based ViFMs, which rely on video-specific encoding methods; and 3) Universal Foundation Models (UFMs), which integrate multiple modalities (image, video, audio, text, etc.) within a single framework. Each approach is further subdivided by practical implementation perspective or pretraining objective type. By comparing the performance of various ViFMs on common video tasks, we offer insights into their strengths and weaknesses, guiding future advancements in video understanding. Our analysis reveals that image-based ViFMs generally outperform video-based ViFMs on most video understanding tasks, while UFMs, which leverage diverse modalities, demonstrate superior performance across all video tasks. The comprehensive list of ViFMs studied in this work is available at: https://github.com/NeeluMadan/ViFM_Survey.git.