Towards a Complete Benchmark on Video Moment Localization

Jinyeong Chae
Donghwa Kim
Kwanseok Kim
Doyeon Lee
Sangho Lee
Seongsu Ha
Jonghwan Mun
Wooyoung Kang
Byungseok Roh
(2024)

Abstract

In this paper, we propose and conduct a comprehensive benchmark on the moment localization task, which aims to retrieve the segment of a single untrimmed video that corresponds to a given text query. Our study starts from the observation that most moment localization papers report experimental results on only a few datasets despite the availability of far more benchmarks. Thus, we conduct an extensive benchmark study to measure the performance of representative methods on seven widely used datasets. Looking further into the details, we pose additional research questions and empirically verify them: whether models rely on unintended biases introduced by specific training data, whether advanced visual features trained on classification tasks transfer well to this task, and whether the computational cost of each model pays off. Through this series of experiments, we provide a multifaceted evaluation of state-of-the-art moment localization models. Code is available at https://github.com/snuviplab/MoLEF.
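For concreteness, localization quality in this line of work is conventionally scored by temporal intersection-over-union (tIoU) between a predicted segment and the ground-truth segment, with metrics such as Rank@k at tIoU thresholds. The minimal Python sketch below (an illustration, not code from the paper or the MoLEF repository) shows this standard computation, assuming segments are given as (start, end) times in seconds.

    def temporal_iou(pred, gt):
        # pred and gt are (start, end) tuples in seconds,
        # e.g., a predicted moment vs. the annotated ground truth.
        inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
        union = max(pred[1], gt[1]) - min(pred[0], gt[0])
        return inter / union if union > 0 else 0.0

    # Example: temporal_iou((10.0, 25.0), (12.0, 30.0)) -> 13/20 = 0.65

A prediction is then counted as correct when its tIoU with the ground truth exceeds a chosen threshold (commonly 0.3, 0.5, or 0.7).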