Google Research



RealEstate10K is a large dataset of camera poses corresponding to 10 million frames derived from about 80,000 video clips. For each clip, the poses form a trajectory, where each pose specifies the camera's position and orientation at a point along that trajectory. These poses are derived by running SLAM and bundle adjustment algorithms on a large set of videos. The data consists of a set of .txt files, one per video clip, specifying timestamps and poses for frames in that clip.

For a learning application, frames can be sampled from the training clips in order to learn, for instance, a view synthesis model. In Google's 2018 SIGGRAPH paper Stereo Magnification: Learning view synthesis using multiplane images, for example, triplets of frames were sampled from each clip during training: two frames served as input to the model, and a third was held out as ground truth for computing a view synthesis loss used to train the network.
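As a rough illustration of how such data might be consumed, the sketch below parses a clip file and samples a training triplet. The exact file layout assumed here (a header line with the video URL, then one frame record per line beginning with a timestamp followed by camera parameters) and the `max_gap` sampling window are assumptions for illustration, not the dataset's authoritative specification.

```python
import random

def parse_clip(text):
    # Assumed layout: first line is the source video URL; each following
    # line is one frame record starting with an integer timestamp, then
    # camera parameters (intrinsics and pose values) as floats.
    lines = text.strip().splitlines()
    url = lines[0]
    frames = []
    for line in lines[1:]:
        vals = line.split()
        frames.append({"timestamp": int(vals[0]),
                       "params": [float(v) for v in vals[1:]]})
    return url, frames

def sample_triplet(frames, max_gap=10):
    # Pick two source frames at most max_gap indices apart and one target
    # frame strictly between them; the target is held out as ground truth
    # for a view-synthesis loss. Requires at least three frames.
    n = len(frames)
    i = random.randrange(n - 2)
    j = min(n - 1, i + random.randint(2, max_gap))
    k = random.randrange(i + 1, j)
    return frames[i], frames[k], frames[j]
```

In practice one would read each clip's .txt file from disk, parse it once, and draw many triplets from it per training epoch.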