Lexpanded-PPDB: Lexically-Expanded Paraphrase Database


Lexpanded-PPDB is a lexically expanded ("lexpanded") version of an existing paraphrase database, PPDB. It aims to improve natural language processing systems by making them more robust to the variability of linguistic expressions. The dataset is freely available under the Creative Commons Attribution-ShareAlike 3.0 License.




Language  Package  # of pairs in PPDB  # of pairs in Lexpanded-PPDB (file size)
English   S        670 thousand        132 million (728MB)
          M        1.32 million        230 million (1.2GB)
          L        3.12 million        418 million (2.0GB)
          XL       7.19 million        692 million (3.1GB)
          XXL      21.4 million        to be released
          XXXL     73.6 million        to be released
French    S        6.21 million        230 million (1.1GB)
          M        12.5 million        385 million (1.6GB)
          L        25.2 million        612 million (2.4GB)
          XL       50.6 million        942 million (3.6GB)
          XXL      103 million         to be released
          XXXL     223 million         to be released
Spanish   S        5.22 million        138 million (610MB)
          M        10.2 million        241 million (982MB)
          L        19.5 million        396 million (1.5GB)
          XL       37.1 million        628 million (2.3GB)
          XXL      71.6 million        to be released
          XXXL     138 million         to be released
German    S        461 thousand        154 million (763MB)
          M        883 thousand        239 million (1.1GB)
          L        1.68 million        359 million (1.6GB)
          XL       3.47 million        532 million (2.2GB)
          XXL      7.06 million        to be released
          XXXL     14.8 million        to be released





This work was partly supported by the following funding sources.


Creative Commons License
Use and/or redistribution of the Lexpanded-PPDB is permitted under the conditions of Creative Commons Attribution-ShareAlike License 3.0. Details can be found at http://creativecommons.org/licenses/by-sa/3.0/.

Advanced Translation Technology Laboratory, Advanced Speech Translation Research and Development Promotion Center, National Institute of Information and Communications Technology