TDDC: Timely Disclosure Documents Corpus

Nobushige Doi
Yusuke Oda
Toshiaki Nakazawa
Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), Marseille, France, pp. 3712-3719

Abstract

In this paper, we describe the details of the TDDC (Timely Disclosure Documents Corpus). TDDC was constructed by manually aligning sentences from past Japanese and English timely disclosure documents in PDF format published by companies listed on the Tokyo Stock Exchange. It consists of approximately 1.4 million Japanese-English parallel sentences. TDDC was used as the official dataset for the 6th Workshop on Asian Translation to encourage the development of machine translation.