Fine-Tuning Machine Confidence With Human Relevance For Video Discovery

Lora Mois Aroyo
Oana Inel
ACM IX Interactions(2022) (to appear)
Google Scholar


To understand what captures people's attention (what they find relevant) we focussed on understanding better the content of videos. In information science, the concept of relevance is most connected to end-users' judgments and is considered fundamental as a subjective, dynamic user-centric perception. People might have or use different relevance standards or criteria when performing the task of video searching. Textual and visual criteria are essential for identifying relevant video content, but subjective, implicit criteria, such as interest or familiarity could be equally used by people. Typically, people tend to connect bridges to concepts or perspectives that are not necessarily shown in the video, but that might be expressed or referred to. We carried out a number of studies with news videos and broadcasts. In our initial study [6], we took a digital hermeneutics approach to understand which video aspects capture the attention of digital humanities scholars and drive the creation of narratives, or short audio-visual stories. In subsequent studies, we focused on understanding the utility of machine-extracted video concepts and how people can teach machines in terms of video concept relevance. We harnessed the intrinsic subjectivity of concept relevance to capture the diverse range of video concepts found relevant through the eyes of our participants [4]. We explored to what extent current information extraction systems meet users' goals, and what are the novel aspects users bring in video concept relevance assessment. We performed two types of crowdsourcing studies. The Selection study (Figure 1) focused on understanding the utility of machine-extracted video concepts from video subtitles and video streams. While the Free Input study (Figure 2) focused on understanding the complementarity between machine and human concepts in terms of relevance. By studying the gap between machines and humans in terms of perceived video concept relevance, we gained insights into how machines can collaborate with users, to better support their needs and preferences. Our studies revealed that events, locations, people, organizations, and general concepts (i.e., of any type) are fundamental elements for content exploration and understanding. They are most commonly machine concepts extracted and as such used in machine summarization of content as well as for information search. However, people engaging with online videos most often provide events, people, locations, and organizations as relevant concepts. Concepts of other types are also found relevant, but to a lesser extent. These concept types are thus fundamental for contextualizing the content of the videos, and also sufficient to capture human interest in terms of relevance.