Semantic Video Segmentation by Gated Recurrent Flow Propagation
Abstract
Semantic video segmentation is challenging due to the sheer amount of data that must be processed and labeled in order to construct accurate models. In this paper we present a deep, end-to-end trainable methodology for video segmentation that is capable of leveraging the information present in unlabeled data, in addition to sparsely labeled frames, in order to improve semantic estimates. Our model combines a convolutional architecture with a spatio-temporal transformer recurrent layer that temporally propagates labeling information by means of optical flow, adaptively gated based on its locally estimated uncertainty. The flow, recognition, and gated temporal propagation modules can be trained jointly, end-to-end. The gated recurrent flow propagation component of our model can be plugged into any static semantic segmentation architecture, turning it into a weakly supervised video processing one. Our experiments on the challenging Cityscapes and CamVid datasets, and with multiple deep architectures, indicate that the resulting model can leverage unlabeled temporal frames adjacent to a labeled one in order to improve both video segmentation accuracy and the consistency of its temporal labeling, at no additional annotation cost and with little extra computation.
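To make the gated propagation idea concrete, the following is a minimal, illustrative sketch, not the authors' implementation: previous-frame logits are backward-warped to the current frame with optical flow, and a per-pixel gate, here a toy single-convolution module driven by photometric warp error as a stand-in for the locally estimated flow uncertainty, blends the warped and static predictions. The names `warp` and `GatedPropagation` and the gate design are hypothetical.

```python
# Minimal sketch of one gated flow-propagation step, assuming PyTorch.
import torch
import torch.nn.functional as F

def warp(x, flow):
    """Backward-warp x (N,C,H,W) to the current frame using flow (N,2,H,W)."""
    n, _, h, w = x.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=x.dtype, device=x.device),
        torch.arange(w, dtype=x.dtype, device=x.device),
        indexing="ij",
    )
    # Sampling locations = pixel coordinates + flow, normalized to [-1, 1].
    gx = (xs + flow[:, 0]) * 2.0 / (w - 1) - 1.0
    gy = (ys + flow[:, 1]) * 2.0 / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)  # (N, H, W, 2)
    return F.grid_sample(x, grid, mode="bilinear", align_corners=True)

class GatedPropagation(torch.nn.Module):
    """Blend warped previous-frame logits with current static logits."""
    def __init__(self):
        super().__init__()
        # Hypothetical gate: per-pixel reliability of the warped prediction,
        # estimated from the photometric error of the warped image.
        self.gate = torch.nn.Sequential(
            torch.nn.Conv2d(1, 1, kernel_size=3, padding=1),
            torch.nn.Sigmoid(),
        )

    def forward(self, prev_logits, curr_logits, prev_img, curr_img, flow):
        warped_logits = warp(prev_logits, flow)
        # Photometric warp error: large where the flow is unreliable.
        warp_err = (warp(prev_img, flow) - curr_img).abs().mean(1, keepdim=True)
        g = self.gate(warp_err)  # gate closes (g -> 0) on unreliable flow
        return g * warped_logits + (1.0 - g) * curr_logits
```

In the paper the propagation is recurrent, applying a GRU-style update across consecutive frames; this sketch shows only a single blending step.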