NICT QE/APE Dataset
Introduction
The NICT QE/APE Dataset is a multilingual parallel corpus of transcribed Japanese utterances and their MT outputs in English, Chinese, and Korean, each manually annotated with quality grades and post-edits. The dataset is intended for training and evaluating systems for the following four tasks:
- WQE: word-level quality estimation (annotation of each word in MT output)
- SQE-score: sentence-level quality estimation (prediction of HTER score)
- SQE-class: sentence-level quality estimation (classification of MT output)
- APE: automatic post-editing
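As a rough illustration of the score targeted by the SQE-score task, HTER is commonly computed as the number of edits needed to turn the MT output into its human post-edit, divided by the number of words in the post-edit. The Python sketch below is a simplification made under stated assumptions and is not the procedure used to build this dataset: it counts only word-level insertions, deletions, and substitutions (true TER also allows block shifts), and it tokenizes on whitespace, which would not suit unsegmented Chinese text. The function names `edit_distance` and `approx_hter` are hypothetical.

```python
def edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    # prev[j] holds the distance between the hypothesis prefix seen so far
    # and the first j reference tokens.
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete h from the hypothesis
                            curr[j - 1] + 1,      # insert r into the hypothesis
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def approx_hter(mt_output, post_edit):
    """Approximate HTER: edits / post-edit length (no shift operations)."""
    hyp = mt_output.split()   # assumption: whitespace-tokenized text
    ref = post_edit.split()
    if not ref:
        raise ValueError("post-edit must contain at least one token")
    return edit_distance(hyp, ref) / len(ref)


print(approx_hter("where is station the nearest",
                  "where is the nearest station"))  # prints 0.4
```

In this example the approximation counts one deletion and one insertion (2 edits over 5 words), whereas TER proper could treat the moved word as a single shift and report a lower score, so values from this sketch should be read as an upper-bound style estimate.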
Features
- Spoken language in two domains: travel (8,783 segments), hospital (1,676 segments)
- Available for three translation directions: ja-en, ja-zh, and ja-ko
- MT system: phrase-based SMT
News
- 2017-10-27: NICT-QEAPE version 0.201710 released
Download
- NICT-QEAPE-0.201710.zip (8.3 MB)
Todo
- Include NMT outputs
- Increase the data size
- Conduct benchmarking experiments using publicly available software and corpora
References
- Atsushi Fujita and Eiichiro Sumita. Japanese to English/Chinese/Korean Datasets for Translation Quality Estimation and Automatic Post-Editing. In Proc. of WAT, 2017.
Precautions
- The National Institute of Information and Communications Technology (henceforth, NICT) has made the database publicly available under the license conditions specified below.
- NICT bears no responsibility for the contents of the database and assumes no liability for any direct or indirect damage or loss whatsoever that may be incurred as a result of using the database.
- If any copyright infringement or other problems are found in the database, please contact us at atsushi.fujita[at]nict[dot]go[dot]jp. We will review the issue and undertake appropriate measures when needed.
License
Use and/or redistribution of the NICT QE/APE Dataset is permitted under the conditions of the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Acknowledgments
The dataset has been developed at the Advanced Translation Technology Laboratory, Advanced Speech Translation Research and Development Promotion Center, National Institute of Information and Communications Technology, under the program "Promotion of Global Communications Plan: Research, Development, and Social Demonstration of Multilingual Speech Translation Technology" of the Ministry of Internal Affairs and Communications (MIC), Japan.