Question generation has been emerging as a new method to improve QA systems and represent factual information in text. However, despite the rash of new work on the topic, there is still no obvious method to evaluate such systems. Here we present DiffQG, a method to evaluate the precision and recall of question generation systems. DiffQG consists of expert-labeled annotations, focusing on the particularly challenging task of generating questions from similar pieces of text. Given an edit to a Wikipedia passage and a noun phrase, annotators wrote questions that are answered by one passage but answered differently or not at all by the other. These questions are intended to be both unambiguous and information-seeking, pushing the bounds of current question generation systems' capabilities. Moreover, because annotators also marked when no such question exists, DiffQG serves as a new evaluation for difference detection, a task that also lacks evaluations with as much diversity as DiffQG. We hope that this dataset will be of value to researchers as they seek to improve such systems for a variety of purposes.