Google Research

TDDC: Timely Disclosure Documents Corpus

Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), Marseille, France, pp. 3712-3719

Abstract

In this paper, we describe the Timely Disclosure Documents Corpus (TDDC). TDDC was built by manually aligning sentences from past Japanese and English timely disclosure documents, published in PDF format by companies listed on the Tokyo Stock Exchange. TDDC consists of approximately 1.4 million parallel sentences in Japanese and English. It was used as the official dataset for the 6th Workshop on Asian Translation to encourage the development of machine translation.
