Lexpanded-PPDB: Lexically-Expanded Paraphrase Database
Introduction
Lexpanded-PPDB is a lexically-expanded (lexpanded) version of an existing paraphrase database, PPDB. The dataset was developed to improve natural language processing systems by making them more robust to variability in linguistic expressions.
Features
- Available for four languages: English, French, Spanish, and German.
- Gigantic: e.g., the English-XL package contains 692 million unique paraphrase pairs.
- Just pairs: owing to the scale, we have postponed computing features for each pair; you can compute them yourself for only the filtered subset you need.
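The "filtered subset" idea can be sketched as follows. This is only a minimal illustration, not the authors' procedure: the on-disk format of the packages is not described here, so the tab delimiter, the helper names `iter_pairs` and `filter_pairs`, and the vocabulary-based criterion are all assumptions made for the example.

```python
import io
from typing import Iterable, Iterator, Set, Tuple


def iter_pairs(lines: Iterable[str], sep: str = "\t") -> Iterator[Tuple[str, str]]:
    """Yield (lhs, rhs) from text lines.

    ASSUMPTION: one pair per line, two fields split by `sep`.
    Adjust `sep` to whatever delimiter the real files use.
    """
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if len(fields) >= 2:
            yield fields[0], fields[1]


def filter_pairs(
    pairs: Iterable[Tuple[str, str]], vocabulary: Set[str]
) -> Iterator[Tuple[str, str]]:
    """Keep pairs in which every token on both sides is in `vocabulary`.

    This criterion is a hypothetical example of a task-specific filter.
    """
    for lhs, rhs in pairs:
        tokens = lhs.split() + rhs.split()
        if all(tok in vocabulary for tok in tokens):
            yield lhs, rhs


# Toy in-memory data standing in for a (much larger) package file.
sample = io.StringIO("look at\ttake a look at\nlook into\tinvestigate\n")
vocab = {"look", "at", "take", "a"}
subset = list(filter_pairs(iter_pairs(sample), vocab))
print(subset)
```

Because both helpers are generators, a real package can be streamed line by line without loading hundreds of millions of pairs into memory; features would then be computed only for the pairs that survive the filter.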
News
- 2016-07-27: Lexpanded-PPDB version 1.0 released
Download
Language | Package | # of pairs in PPDB | # of pairs in Lexpanded-PPDB (file size) |
---|---|---|---|
English | S | 670 thousand | 132 million (728MB) |
English | M | 1.32 million | 230 million (1.2GB) |
English | L | 3.12 million | 418 million (2.0GB) |
English | XL | 7.19 million | 692 million (3.1GB) |
French | S | 6.21 million | 230 million (1.1GB) |
French | M | 12.5 million | 385 million (1.6GB) |
French | L | 25.2 million | 612 million (2.4GB) |
French | XL | 50.6 million | 942 million (3.6GB) |
Spanish | S | 5.22 million | 138 million (610MB) |
Spanish | M | 10.2 million | 241 million (982MB) |
Spanish | L | 19.5 million | 396 million (1.5GB) |
Spanish | XL | 37.1 million | 628 million (2.3GB) |
German | S | 461 thousand | 154 million (763MB) |
German | M | 883 thousand | 239 million (1.1GB) |
German | L | 1.68 million | 359 million (1.6GB) |
German | XL | 3.47 million | 532 million (2.2GB) |
- md5 of the above files
- "Package" refers to the corresponding PPDB-1.0 package.
- "# of pairs" counts the pairs of unique surface forms.
- Resources used
  - PPDB: its lexical, one-to-many, many-to-one, and phrasal paraphrases are used as seeds.
  - News Crawl 2007-2014: used as the monolingual data.
  - Stopword List: used to filter for reliable paraphrases.
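Since md5 values are published for the package files, a download can be checked before use. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `verify` are hypothetical, and the temporary file merely stands in for a real package, whose actual file name and checksum are not given here.

```python
import hashlib
import os
import tempfile


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex md5 digest of a file, read in chunks so that
    multi-gigabyte packages do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected_hex: str) -> bool:
    """Compare a file's md5 with the published value (case-insensitive)."""
    return md5_of_file(path) == expected_hex.strip().lower()


# Demo on a throwaway file; replace with the downloaded package path
# and the corresponding published md5 value.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"stand-in for a downloaded package")
    demo_path = tmp.name

demo_expected = hashlib.md5(b"stand-in for a downloaded package").hexdigest()
demo_ok = verify(demo_path, demo_expected)
demo_bad = verify(demo_path, "0" * 32)
os.remove(demo_path)
print(demo_ok, demo_bad)
```

A mismatch usually indicates a truncated or corrupted download, in which case fetching the file again is the simplest remedy.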
Todo
- Lexpand XXL and XXXL packages
- Lexpand PPDBs for a wider variety of languages
- Compute features for each pair
References
- Atsushi Fujita, Pierre Isabelle, and Roland Kuhn. Enlarging Paraphrase Collections through Generalization and Instantiation. In Proc. of EMNLP-CoNLL, 2012.
- Atsushi Fujita and Pierre Isabelle. Expanding Paraphrase Lexicons by Exploiting Lexical Variants. In Proc. of NAACL-HLT, 2015.
- Atsushi Fujita and Pierre Isabelle. Expanding Paraphrase Lexicons by Exploiting Generalities. TALLIP, 2018.
Precautions
- The National Institute of Information and Communications Technology (henceforth, NICT) has made the database publicly available under the license conditions specified below. Users of the database are advised to carefully read the copyright policy of the original PPDB to ensure proper usage.
- NICT bears no responsibility for the contents of the database and assumes no liability for any direct or indirect damage or loss whatsoever that may be incurred as a result of using the database.
- If any copyright infringement or other problems are found in the database, please contact us at atsushi.fujita[at]nict[dot]go[dot]jp. We will review the issue and undertake appropriate measures when needed.
License
Use and/or redistribution of the Lexpanded-PPDB is permitted under the conditions of the Creative Commons Attribution-ShareAlike License 3.0.
Acknowledgments
This work was partly supported by the following grants.
- JSPS Postdoctoral Fellowship for Research Abroad (FYs 2011-2012)
- JSPS KAKENHI Grant-in-Aid for Young Scientists (B) 25730139 (FYs 2013-2015)
The dataset was developed as part of work at the Advanced Translation Technology Laboratory, Advanced Speech Translation Research and Development Promotion Center, National Institute of Information and Communications Technology.