This is a dataset for evaluating document similarity models. Each line of each file is a triplet of URLs, either all from Wikipedia or all from arXiv.org. The triplets in the file 'wikipedia-hand-triplets-release.txt' were generated by hand, whereas those in the other two files were generated automatically by examining Wikipedia and arXiv.org document categories. Within each triplet, the content of URLs one and two should be more similar than the content of URLs two and three.
This dataset was generated as part of the paper 'Document Embedding with Paragraph Vectors' by Andrew M. Dai, Christopher Olah, Quoc V. Le, and Greg S. Corrado (2014).
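
The triplet format described above could be read and scored roughly as in the following Python sketch. It is only an illustration, not the evaluation code from the paper. It assumes the three URLs on each line are separated by whitespace, which this README does not state, so the split may need adjusting for the actual files. The names parse_triplets and triplet_accuracy are hypothetical, and the similarity function is a placeholder for whatever model is being evaluated (for example, cosine similarity between document embeddings).

```python
from typing import Callable, Iterable, List, Tuple

Triplet = Tuple[str, str, str]


def parse_triplets(lines: Iterable[str]) -> List[Triplet]:
    """Parse lines that each hold three URLs.

    Assumption: URLs are separated by whitespace. Blank lines are skipped.
    """
    triplets: List[Triplet] = []
    for line_number, line in enumerate(lines, start=1):
        parts = line.split()
        if not parts:
            continue
        if len(parts) != 3:
            raise ValueError(
                f"line {line_number}: expected 3 URLs, found {len(parts)}"
            )
        triplets.append((parts[0], parts[1], parts[2]))
    return triplets


def triplet_accuracy(
    triplets: List[Triplet],
    similarity: Callable[[str, str], float],
) -> float:
    """Return the fraction of triplets that the similarity function ranks
    as described in this README: URLs one and two score strictly higher
    than URLs two and three. Returns 0.0 for an empty list.
    """
    if not triplets:
        return 0.0
    correct = sum(
        1
        for first, second, third in triplets
        if similarity(first, second) > similarity(second, third)
    )
    return correct / len(triplets)
```

A file would typically be passed in as an open file handle, for example parse_triplets(open('wikipedia-hand-triplets-release.txt', encoding='utf-8')), with the encoding also being an assumption. Ties are counted as incorrect in this sketch; a different tie-breaking rule may be preferable depending on the model.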