We present an approach that exploits both structure and semantics for unsupervised monocular learning of depth and ego-motion. More specifically, we model the motion of individual objects and learn their 3D motion vectors jointly with depth and ego-motion. We obtain more accurate results, especially for challenging dynamic scenes not addressed by previous approaches. This is an extended version of Casser et al. Code and models have been open-sourced at: https://sites.google.com/corp/view/struct2depth.
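
To make the joint modeling concrete, the following is a minimal sketch of how a per-object 3D motion vector could enter a standard pinhole reprojection alongside depth and ego-motion. It is an illustration under stated assumptions, not the released implementation: all function names are hypothetical, object motion is simplified to a pure 3D translation applied after the ego-motion transform, and the intrinsics, depth, and motion values would in practice be predicted by learned networks rather than supplied by hand.

```python
from typing import Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Intrinsics = Tuple[float, float, float, float]  # (fx, fy, cx, cy), pinhole model


def backproject(u: float, v: float, depth: float, k: Intrinsics) -> Vec3:
    """Lift pixel (u, v) with a given depth to a 3D point in the camera frame."""
    fx, fy, cx, cy = k
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)


def rigid_transform(point: Sequence[float],
                    rotation: Sequence[Sequence[float]],
                    translation: Sequence[float]) -> Vec3:
    """Apply X' = R X + t, with R given as a 3x3 nested sequence."""
    x, y, z = (
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
    return (x, y, z)


def project(point: Sequence[float], k: Intrinsics) -> Tuple[float, float]:
    """Project a 3D camera-frame point back to pixel coordinates."""
    fx, fy, cx, cy = k
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def warp_pixel(u: float, v: float, depth: float, k: Intrinsics,
               ego_rotation: Sequence[Sequence[float]],
               ego_translation: Sequence[float],
               object_translation: Optional[Sequence[float]] = None
               ) -> Tuple[float, float]:
    """Predict where pixel (u, v) lands in the adjacent frame.

    Static pixels are moved by ego-motion alone. Pixels on a moving object
    (object_translation is not None) receive an additional 3D offset.
    Assumption: object motion is a pure translation applied after ego-motion.
    """
    point = backproject(u, v, depth, k)
    point = rigid_transform(point, ego_rotation, ego_translation)
    if object_translation is not None:
        point = tuple(p + d for p, d in zip(point, object_translation))
    return project(point, k)
```

For example, with intrinsics `(100.0, 100.0, 50.0, 40.0)`, an identity rotation, and an ego translation of `(0.0, 0.0, -2.0)`, a pixel at `(50.0, 40.0)` with depth `10.0` stays at `(50.0, 40.0)` when treated as static, but moves to `(62.5, 40.0)` when it belongs to an object with motion vector `(1.0, 0.0, 0.0)`. In an unsupervised setting, the difference between the image warped this way and the observed adjacent frame would serve as the training signal.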