In this paper, we introduce the new problem of manipulating a video by inserting other videos into it. Given an object video and a scene video, the main task is to insert the object video at a user-specified location in the scene video so that the resulting video looks realistic. We aim to handle diverse object motions and complex backgrounds without expensive segmentation annotations. Because training pairs are difficult to collect for this problem, we synthesize fake training pairs that provide useful supervisory signals when training a neural network with unpaired real data. The proposed network architecture takes both real and fake pairs as input and performs both supervised and unsupervised training within an adversarial learning scheme. To synthesize a realistic video, the network renders each frame based on the current input and the previous frames. Under this framework, we observe that injecting noise into the previous frames while generating the current frame stabilizes training. We perform experiments on real-world videos from object tracking and person re-identification benchmark datasets. Results show that the proposed algorithm can synthesize long, realistic video sequences by inserting the given object video.
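
The frame-by-frame rendering described above can be illustrated with a minimal sketch. It is not the proposed network: `toy_generator` is a hypothetical stand-in that pastes the object patch into the scene and blends the result with earlier frames, and the names `insert_video`, `add_noise`, `history`, and `sigma` are assumptions introduced here. The sketch also assumes that the "previous frames" are the previously generated outputs and that the injected noise is zero-mean Gaussian; the abstract specifies neither. It shows only the control flow: each output frame is conditioned on the current inputs and on noise-perturbed copies of earlier frames.

```python
import random


def add_noise(frame, sigma, rng):
    """Return a copy of `frame` with zero-mean Gaussian noise added per pixel."""
    return [[p + rng.gauss(0.0, sigma) for p in row] for row in frame]


def toy_generator(scene, obj, top, left, prev_frames):
    """Hypothetical stand-in for the learned generator (an assumption).

    Pastes `obj` into a copy of `scene` at (top, left), then blends the
    result with the mean of `prev_frames` to mimic temporal conditioning.
    """
    out = [row[:] for row in scene]
    for i, obj_row in enumerate(obj):
        for j, pixel in enumerate(obj_row):
            out[top + i][left + j] = pixel
    if prev_frames:
        for y in range(len(out)):
            for x in range(len(out[0])):
                prev_mean = sum(f[y][x] for f in prev_frames) / len(prev_frames)
                out[y][x] = 0.8 * out[y][x] + 0.2 * prev_mean
    return out


def insert_video(scene_video, object_video, top, left,
                 history=2, sigma=0.05, seed=0):
    """Render frames one at a time, conditioning on noisy earlier outputs."""
    rng = random.Random(seed)
    outputs = []
    for scene, obj in zip(scene_video, object_video):
        recent = outputs[-history:] if history > 0 else []
        # Perturb the earlier frames before they are fed back in.
        noisy_prev = [add_noise(f, sigma, rng) for f in recent]
        outputs.append(toy_generator(scene, obj, top, left, noisy_prev))
    return outputs


if __name__ == "__main__":
    scene_video = [[[0.0] * 4 for _ in range(4)] for _ in range(3)]
    object_video = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(3)]
    result = insert_video(scene_video, object_video, top=1, left=1)
    print(len(result), "frames rendered")
```

Setting `sigma=0.0` disables the perturbation, which makes it easy to compare the two settings in this toy loop. Whether noise is applied only during training or also at inference is not stated in the abstract.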