paper tackles the problem of segment annotation in complex Internet videos. Given a weakly labeled video, we automatically generate spatiotemporal masks for each of the concepts with which it is labeled. This is a particularly relevant problem in the video domain, as large numbers of YouTube videos are now available, tagged with the visual concepts that they contain.
Given such weakly labeled videos, we focus on the problem of spatiotemporal segment classification. We propose a straightforward algorithm, CRANE, that utilizes large amounts of weakly labeled video to rank spatiotemporal segments by the likelihood that they correspond to a given visual concept. We make publicly available segment-level annotations for a subset of the Prest et al. dataset and show convincing results. We also show state-of-the-art results on Hartmann et al.'s more difficult, large-scale object segmentation dataset.