Joseph Roth
Authored Publications
AVA-ActiveSpeaker: An Audio-Visual Dataset for Active Speaker Detection
Ondrej Klejch
Radhika Marvin
Liat Kaver
Sharadh Ramaswamy
Arkadiusz Stopczynski
ICASSP, IEEE (2020)
Active speaker detection is an important component in video analysis algorithms for applications such as speaker diarization, video re-targeting for meetings, speech enhancement, and human-robot interaction. The absence of a large, carefully labeled audio-visual active speaker dataset has limited evaluation in terms of data diversity, environments, and accuracy. In this paper, we present the AVA Active Speaker detection dataset (AVA-ActiveSpeaker), which has been publicly released to facilitate algorithm development and comparison. It contains temporally labeled face tracks in videos, where each face instance is labeled as speaking or not, and whether the speech is audible. The dataset contains about 3.65 million human-labeled frames spanning 38.5 hours. We also introduce a state-of-the-art, jointly trained audio-visual model for real-time active speaker detection and compare several variants. The evaluation clearly demonstrates a significant gain due to audio-visual modeling and temporal integration over multiple frames.
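The gain from combining modalities and integrating over time can be illustrated with a toy late-fusion scheme. This is a minimal sketch, not the paper's jointly trained model: the function name `fuse_and_smooth`, the equal modality weights, and the moving-average window are all illustrative assumptions standing in for learned fusion and recurrent temporal modeling.

```python
import numpy as np

def fuse_and_smooth(audio_scores, visual_scores, w_audio=0.5, window=5):
    """Toy stand-in for audio-visual active speaker scoring.

    Takes per-frame speaking scores from an audio model and a visual
    model, combines them with a weighted average (late fusion), then
    applies a moving-average filter as a crude form of temporal
    integration over multiple frames.
    """
    fused = w_audio * np.asarray(audio_scores, dtype=float) \
        + (1.0 - w_audio) * np.asarray(visual_scores, dtype=float)
    kernel = np.ones(window) / window  # uniform smoothing kernel
    return np.convolve(fused, kernel, mode="same")
```

In this sketch the smoothing window plays the role that a recurrent or multi-frame model plays in practice: isolated single-frame score spikes are damped, while sustained agreement between the two modalities is preserved.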
Modeling Uncertainty with Hedged Instance Embedding
Seong Joon Oh
Jiyan Pan
ICLR (2019)
Instance embeddings are an efficient and versatile image representation that facilitates applications like recognition, verification, retrieval, and clustering. Many metric learning methods represent the input as a single point in the embedding space. Often the distance between points is used as a proxy for match confidence. However, this can fail to represent uncertainty which can arise when the input is ambiguous, e.g., due to occlusion or blurriness. This work addresses this issue and explicitly models the uncertainty by “hedging” the location of each input in the embedding space. We introduce the hedged instance embedding (HIB) in which embeddings are modeled as random variables and the model is trained under the variational information bottleneck principle (Alemi et al., 2016; Achille & Soatto, 2018). Empirical results on our new N-digit MNIST dataset show that our method leads to the desired behavior of “hedging its bets” across the embedding space upon encountering ambiguous inputs. This results in improved performance for image matching and classification tasks, more structure in the learned embedding space, and an ability to compute a per-exemplar uncertainty measure which is correlated with downstream performance.
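The core ideas of hedged embeddings can be sketched in a few lines: each input maps to a diagonal Gaussian rather than a point, match probability is estimated by Monte Carlo sampling, and a KL term to a standard-normal prior supplies the information-bottleneck regularizer. This is a minimal illustration assuming diagonal Gaussians; the constants `a` and `b` and the sample count are illustrative, not the paper's trained parameters.

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    # KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions.
    # This is the VIB-style regularizer that discourages embeddings
    # from becoming overconfident point estimates.
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma))

def match_probability(mu1, sigma1, mu2, sigma2,
                      a=1.0, b=2.0, n_samples=100, rng=None):
    # Monte Carlo estimate of p(match) between two stochastic
    # embeddings: sample from each Gaussian, map sample distance
    # through a sigmoid, and average. Larger variance "hedges" the
    # embedding, spreading probability mass and lowering confidence.
    rng = np.random.default_rng(0) if rng is None else rng
    z1 = mu1 + sigma1 * rng.standard_normal((n_samples, mu1.size))
    z2 = mu2 + sigma2 * rng.standard_normal((n_samples, mu2.size))
    d = np.linalg.norm(z1 - z2, axis=1)
    return np.mean(1.0 / (1.0 + np.exp(a * d - b)))  # sigmoid(b - a*d)
```

In training, a soft contrastive loss on `match_probability` would be combined with a small multiple of `kl_to_standard_normal`; ambiguous inputs then learn large `sigma`, which is exactly the per-exemplar uncertainty measure the abstract refers to.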
AVA-Speech: A Densely Labeled Dataset of Speech Activity in Movies
Liat Kaver
Radhika Marvin
Nathan Christopher Reale
Loretta Guarino Reid
Proceedings of Interspeech, 2018
Speech activity detection (or endpointing) is an important processing step for applications such as speech recognition, language identification and speaker diarization. Both audio- and vision-based approaches have been used for this task in various settings and with multiple variations tailored toward applications. Unfortunately, much of the prior work reports results in synthetic settings, on task-specific datasets, or on datasets that are not openly available. This makes it difficult to compare approaches in similar settings and to understand their strengths and weaknesses. In this paper, we describe a new dataset of densely labeled speech activity in YouTube video clips, which has been designed to address these issues and will be released publicly. The dataset labels go beyond speech alone, annotating three specific speech activity situations: clean speech, speech and music co-occurring, and speech and noise co-occurring. These classes will enable further analysis of model performance in the presence of noise. We report benchmark performance numbers on this dataset using state-of-the-art audio and vision models.