Machine Translation

Corpus

| Name | Description | Size | License | Creator | Download |
|---|---|---|---|---|---|
| TALPCo | TUFS Asian Language Parallel Corpus | 1,327 sentences | CC BY 4.0 | Hiroki Nomoto, Kenji Okano, Sunisa Wittayapanyanon and Junta Nomura | GitHub |
| scb-mt-en-th-2020 | An open English-Thai machine translation dataset, published through a collaboration between Vidyasirimedhi Institute of Science and Technology (VISTEC) and Digital Economy Promotion Agency (depa), with sponsorship from Siam Commercial Bank (SCB) | 1,001,752 segment pairs | CC BY-SA 4.0 | AI Research Institute of Thailand (AIResearch) | GitHub |
| Software Documentation Data Set for Machine Translation | A parallel evaluation data set of SAP software documentation with document structure annotation | dev: 2,048 segment pairs; test: 2,050 segment pairs | CC BY-NC 4.0 | SAP | GitHub |
| Thai Lao Parallel corpus | Thai-Lao parallel corpus | | CC0-1.0 | Wannaphong Phatthiyaphaibun | GitHub |
| Contradictory, My Dear Watson Translated text | Non-English text translated into English | | | | Kaggle |
| Asian Language Treebank Parallel Corpus | The Asian Language Treebank (ALT) Parallel Corpus | train: 1,698 articles, 18,088 sentences; dev: 98 articles, 1,000 sentences; test: 97 articles, 1,018 sentences | CC BY 4.0 | | Website |
| WikiLingua | A multilingual abstractive summarization dataset | 14,770 parallel (for Thai) | CC0-1.0 | Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown | GitHub |
| Web Inventory of Transcribed & Translated (WIT) TED Talks | A collection of the original TED talks and their translated versions. Translations are available in more than 109 languages, though the distribution is not uniform. | | | | Hugging Face |
| generated_reviews_enth | Created as part of scb-mt-en-th-2020 for the machine translation task | | CC BY-SA 4.0 | AI Research Institute of Thailand (AIResearch) | GitHub |
| FLORES-101 | A many-to-many multilingual translation benchmark dataset for 101 languages | | | Facebook | GitHub |

Software

| Name | Description | Status | Language | License |
|---|---|---|---|---|
| PyThaiNLP | Machine translation support included as part of PyThaiNLP | active | Python 3.X | Apache License 2.0 |

Pretrained

| Name | Description | Status | Language | License |
|---|---|---|---|---|
| Lalita Chinese-Thai Machine Translation | Chinese-Thai Machine Translation by AI Builder | active | Python 3.X | Apache License 2.0 |
| English-Thai Machine Translation Models | English-Thai Machine Translation Models by VISTEC-depa Thailand Artificial Intelligence Research Institute | active | Python 3.X | Apache License 2.0 |