Corpus/Dataset

Word Segmentation Corpus

For segmenting Thai text

Name Description Size License Creator Download
BEST I (BEST 2009) Benchmark for Enhancing the Standard of Thai language processing 5,000,000 words CC BY-SA-NC 4.0 NECTEC aiforthai (registration required) and Mirror from @korakot
LST20 Corpus LST20 is a large-scale corpus with multiple layers of linguistic annotation for Thai language processing. 3,164,864 words, 288,020 named entities, 248,962 clauses, and 74,180 sentences ? (Free for research and open source only) NECTEC aiforthai (registration required)
Wisesight Samples with Word Tokenization Label This directory contains samples of Thai social media text, tokenized by humans. These samples are randomly drawn from the full Wisesight Sentiment Corpus. 160 sentences (wisesight-160) and 1,000 sentences (wisesight-1000) CC0-1.0 License Nitchakarn Chantarapratin, Pattarawat Chormai, Ponrawee Prasertsom, Jitkapat Sawatphol, Nozomi Yamada, and Attapol Rutherford GitHub
Thai National Historical Corpus (TNHC) texts from Thai National Historical Corpus, stored by lines (manually tokenized). 47 documents, 756,478 lines, 13,361,142 characters Jitkapat Sawatphol GitHub
Orchid Corpus Thai part of speech (POS) tagged corpus 5,200 sentences CC BY-SA-NC 4.0 NECTEC Mirror from @wannaphong
Corpus Komped Poem (windy part) Pattarawat Chormai GitHub

Sentence Segmentation

Name Description Size License Creator Download
Orchid Corpus Thai part of speech (POS) tagged corpus 5,200 sentences CC BY-SA-NC 4.0 NECTEC Mirror from @wannaphong
LST20 Corpus LST20 is a large-scale corpus with multiple layers of linguistic annotation for Thai language processing. 3,164,864 words, 288,020 named entities, 248,962 clauses, and 74,180 sentences ? (Free for research and open source only) NECTEC aiforthai (registration required)
Fake review CC BY-SA 4.0 AI Research Institute of Thailand (AIResearch) GitHub

Part of Speech

Name Description Size License Creator Download
Orchid Corpus Thai part of speech (POS) tagged corpus 5,200 sentences CC BY-SA-NC 4.0 NECTEC Mirror from @wannaphong
LST20 Corpus LST20 is a large-scale corpus with multiple layers of linguistic annotation for Thai language processing. 3,164,864 words, 288,020 named entities, 248,962 clauses, and 74,180 sentences ? (Free for research and open source only) NECTEC aiforthai (registration required)
UD Thai PUD This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies. 1,000 sentences CC BY-SA 3.0 Universal Dependencies GitHub
thai-political-tweets A small Thai political twitter dataset with UD POS tags 41 tweets, 965 words Unlicense License Can Udomcharoenchaikit GitHub

Treebank

Name Description Size License Creator Download
UD Thai PUD This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies. 1,000 sentences CC BY-SA 3.0 Universal Dependencies GitHub
thtb To enable research opportunities with very few Thai Computational Linguistic resources, we willingly introduce fundamental high-level language resources built with passion, Thai Treebanks, built from scratch for researchers and enthusiasts. 5,200 sentences CC BY 4.0 Pechlada Seenual, Thodsaporn Chay-intr and Thanaruk Theeramunkong GitHub

Natural Language Inference

Name Description Size License Creator Download
XNLI The Cross-lingual Natural Language Inference (XNLI) corpus is a crowd-sourced collection of 5,000 test and 2,500 dev pairs for the MultiNLI corpus. The pairs are annotated with textual entailment and translated into 14 languages: French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili and Urdu. 5,000 test and 2,500 dev pairs CC BY-NC 4.0 Facebook Research GitHub

Parallel Corpus

For machine translation

Name Description Size License Creator Download
TALPCo TUFS Asian Language Parallel Corpus 1,327 sent CC BY 4.0 Nomoto, Hiroki, Kenji Okano, Sunisa Wittayapanyanon and Junta Nomura GitHub
scb-mt-en-th-2020 An open English-Thai machine translation dataset published through a collaboration between Vidyasirimedhi Institute of Science and Technology (VISTEC) and the Digital Economy Promotion Agency (depa), with sponsorship from Siam Commercial Bank (SCB). 1,001,752 segment pairs CC BY-SA 4.0 AI Research Institute of Thailand (AIResearch) GitHub
Software Documentation Data Set for Machine Translation A parallel evaluation data set of SAP software documentation with document structure annotation dev: 2048 segment pairs, test: 2050 segment pairs CC BY-NC 4.0 SAP GitHub
Thai Lao Parallel corpus Thai Lao Parallel corpus CC0-1.0 License Wannaphong Phatthiyaphaibun GitHub
Contradictory, My Dear Watson Translated text Non-English text converted to English language Kaggle
Asian Language Treebank Parallel Corpus This is the Asian Language Treebank (ALT) Parallel Corpus. train: 1,698 articles, 18,088 sentences; dev: 98 articles, 1,000 sentences; test: 97 articles, 1,018 sentences CC BY 4.0 Website
WikiLingua A Multilingual Abstractive Summarization Dataset 14,770 parallel (for Thai) CC0-1.0 License Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown GitHub
Web Inventory of Transcribed & Translated (WIT) TED Talks The Web Inventory Talk is a collection of the original TED talks and their translated versions. The translations are available in more than 109 languages, though the distribution is not uniform. Hugging Face
generated_reviews_enth generated_reviews_enth was created as part of scb-mt-en-th-2020 for the machine translation task. CC BY-SA 4.0 AI Research Institute of Thailand (AIResearch) GitHub

Text Classification

Name Description Size Labels License Creator Download
prachathai-67k News Article Corpus from Prachathai.com 67,889 articles with 51,797 tags 12 CC BY 4.0 @lukkiddd and @cstorm125 GitHub
wisesight sentiment Social media messages in Thai language with sentiment label (positive, neutral, negative, question). 26,737 messages 4 CC0-1.0 License Arthit Suriyawongkul, Ekapol Chuangsuwanich GitHub
wongnai corpus This project is a collection of Wongnai's datasets which are mostly in Thai language. 500K words labeled 5 LGPL-3.0 License wongnai GitHub
Toxicity in Thai Tweet Corpus Toxicity in Thai Tweet Corpus 3,300 messages 2 CC BY-NC 4.0 Tokyo Metropolitan University Natural Language Processing Group GitHub
Thai-Clickbait The dataset for Thai Clickbait classification train: 37,376 messages, test: 9,344 messages 1 MIT License @9meo at GitHub GitHub
sentiment_analysis_thai Thai sentiment analysis from @JagerV3 2 ? @JagerV3 at GitHub GitHub
thai-emojification Emojification of Thai Text, Using Deep Learning (LSTM). train: 128 messages, test: 55 messages 5 (❤️😄😞🍴⚾) GPL-3.0 License iApp Technology Co, Ltd GitHub
The 40 Thai Children Stories The dataset was collected from 40 Thai children stories. We manually split the text into sentences which leads to 1,964 sentences 1,964 sentences 3 ? Kitsuchart Pasupa, Thititorn Seneewong Na Ayutthaya GitHub
Thai sentiment analysis dataset Thai sentiment analysis dataset from PyThaiNLP 2 CC BY 3.0 PyThaiNLP GitHub

OCR Dataset

Name Description Size License Creator Download
KVIS Thai OCR Dataset Offline Thai Handwritten Character Dataset CC BY 4.0 John Joseph, Ferdin Joe Website
Thai OCR Thai OCR dataset from NECTEC Training set: 81,100 images CC BY-SA-NC 3.0 NECTEC aiforthai (registration required)

Text Summarization

Name Description Size License Creator Download
ThaiSum The largest dataset for Thai text summarization. 350,000 articles (2.9 GB) MIT License Nakhun Chumpolsathien GitHub
TR-TPBS A dataset for Thai text summarization. 310K articles MIT License Nakhun Chumpolsathien GitHub

Question Answering

Name Description Size License Creator Download
XQuAD XQuAD (Cross-lingual Question Answering Dataset) is a benchmark dataset for evaluating cross-lingual question answering performance. 240 paragraphs and 1,190 question-answer pairs CC BY-SA 4.0 DeepMind GitHub
Thai QA Question answering program from Thai Wikipedia. 4,000 question-answer pairs CC BY-SA-NC 3.0 NECTEC Dataset: aiforthai (registration required), wiki: copycatch, Sample data set: copycatch
TyDi QA A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages 200k human-annotated question-answer pairs Apache-2.0 License Google Research GitHub
iapp-wiki-qa-dataset Open Thai Wikipedia QA Dataset made by iApp Technology 1,961 documents, 9,170 questions MIT License iApp Technology GitHub
MKQA MKQA: Multilingual Knowledge Questions & Answers. MKQA contains 10,000 queries sampled from the Google Natural Questions dataset. 10,000 queries Apple GitHub
Thai WIKI QA Dataset from National Software Contest (NSC) 2018 - 2019 factoid: 15,000 question-answer pairs, boolean: 2,000 questions CC BY-SA-NC 3.0 NECTEC Dataset: aiforthai

Speech Synthesis

Name Description Size License Creator Download
TSync-1 Corpus Thai speech synthesis corpus from NECTEC (not full corpus) 6 hours CC BY-SA-NC 3.0 NECTEC Mirror from @korakot: GitHub
TSync-2 Corpus Thai speech synthesis corpus from NECTEC (not full corpus) 5 hours 25 minutes CC BY-SA-NC 3.0 NECTEC aiforthai (registration required) and Mirror from @korakot

Speech Recognition

Name Description Size License Creator Download
Lotus Thai Speech Recognition corpus from NECTEC (not full corpus) 12 hours CC BY-SA-NC 3.0 NECTEC aiforthai (registration required) and Mirror from @korakot: GitHub
Common Voice Corpus Common Voice Corpus from Mozilla 133 hours (valid) CC0-1.0 License Mozilla Common Voice
Gowajee corpus The corpus was collected in the Automatic Speech Recognition class offered at Chulalongkorn University as a homework assignment. 11 hours MIT License Ekapol Chuangsuwanich, Atiwong Suchato, Korrawe Karunratanakul, Burin Naowarat, Chompakorn CChaichot and Penpicha Sangsa-nga GitHub
Lotus BN Thai News Speech Recognition corpus from NECTEC (not full corpus) 28 minutes CC BY-SA-NC 3.0 NECTEC Mirror from @korakot: GitHub
Lotus Cell Thai Speech corpus over the phone. (not full corpus) 11 hours CC BY-SA-NC 3.0 NECTEC Mirror from @korakot: GitHub

Speech Emotion

Name Description Size License Creator Download
Thai Speech Emotion Dataset Thai Speech Emotion Recognition Dataset 36 hours CC BY-SA 4.0 AI Research Institute of Thailand (AIResearch) AIResearch

Plagiarism

Name Description Size License Creator Download
Thai Plagiarism Thai Plagiarism Detection http://copycatch.in.th/thai-plagiarism-task.html CC BY-SA-NC 3.0 NECTEC aiforthai (registration required)

Named Entity Recognition

Name Description Size License Creator Download
นัชชา ถิระสาโรช corpora by Wirote Aroonmanakun's students ? นัชชา ถิระสาโรช นัชชา ถิระสาโรช Data
ศศิวิมล กาลันสีมา corpora by Wirote Aroonmanakun's students ? ศศิวิมล กาลันสีมา ศศิวิมล กาลันสีมา Data
ณัฐดาพร เลิศชีวะ corpora by Wirote Aroonmanakun's students ? ณัฐดาพร เลิศชีวะ ณัฐดาพร เลิศชีวะ Data
Thai NER Thai NER project is part of PyThaiNLP. CC BY 3.0 Wannaphong Phatthiyaphaibun GitHub
THAI-NEST Thai Named Entity tagging Corpus from NECTEC & Thammasat University CC BY-SA-NC 3.0 NECTEC aiforthai (registration required)
WikiANN WikiANN (sometimes called PAN-X) is a multilingual named entity recognition dataset consisting of Wikipedia articles annotated with LOC (location), PER (person), and ORG (organisation) tags in the IOB2 format. Rahimi, Afshin and Li, Yuan and Cohn, Trevor GitHub
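
For readers unfamiliar with the IOB2 format mentioned for WikiANN, the following is a minimal sketch of how IOB2 tags map to entity spans. It is not taken from any of the datasets listed here; the function name `iob2_to_spans` and the sample tags are hypothetical and used only for illustration. In IOB2, every entity starts with a `B-` tag, continues with `I-` tags of the same type, and `O` marks tokens outside any entity.

```python
from typing import List, Tuple


def iob2_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Convert IOB2 tags into (entity_type, start, end) spans; end is exclusive."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # Close the open span if this tag does not continue it.
        if label is not None and tag != f"I-{label}":
            spans.append((label, start, i))
            start, label = None, None
        # A B- tag always opens a new span.
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        # Stray I- tags with no open span are ignored in this sketch.
    # Close a span that runs to the end of the sequence.
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


# Hypothetical tag sequence for a six-token sentence.
example_tags = ["B-PER", "I-PER", "O", "B-LOC", "B-ORG", "I-ORG"]
example_spans = iob2_to_spans(example_tags)
```

Spans are returned as token indices, so they can be applied to any tokenization that is aligned with the tag sequence.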

Dictionaries

Name Description Size License Creator Download
LEXiTRON Thai<->English Dictionary Thai-English 83,000 words CC BY-SA-NC 3.0 NECTEC aiforthai (registration required)
Yaitron Yaitron is an English-Thai and Thai-English dictionary based on LEXiTRON, developed since May 2006. Its objective is to build a dictionary that is formatted in well-formed XML and easy to manipulate by machine. LEXiTRON License Vee Satayamas GitHub
Volubilis Dict - Thai-English-French VOLUBILIS - Thai English French Database sourceforge
Ground-truth bilingual dictionaries 110 large-scale ground-truth bilingual dictionaries train: 5,000 words, test: 1,500 words Facebook Research GitHub

Text Corpus

Name Description Size License Creator Download
Thai Constitution Corpus The Constitution of Thailand Dataset Since 1932 Public Domain Wannaphong Phatthiyaphaibun GitHub
Thai Law Thai Law Dataset (Act of Parliament) Public Domain Wannaphong Phatthiyaphaibun GitHub
IO-LM Learn how to talk like an Information-Operation-er GitHub
HC corpora HC corpora is a collection of corpora for various languages freely available to download. homepage : http://corpora.epizy.com/about.html MediaFire
thai-joke-corpus Thai jokes scraped from 4 Thai joke Facebook pages, collected by iApp Technology Co, Ltd. 449 jokes GPL-3.0 License iApp Technology Co, Ltd GitHub
Thai Literature Corpora (TLC) texts from Vajirayana Digital Library, stored by chapters and stanzas (non-tokenized). a total of 34 documents, 292,270 lines, 31,790,734 characters Jitkapat Sawatphol Website
HSE Thai Corpus A 35 Million Word Corpus of Thai Kaggle
ThaiGov corpus Data from Thai government website. public domain Wannaphong Phatthiyaphaibun GitHub
ThaiGov V2 Corpus Thai News Dataset from Thai government website. public domain Wannaphong Phatthiyaphaibun GitHub
OSCAR Corpus OSCAR or Open Super-large Crawled Aggregated coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. 951,743,087 words public domain Homepage
mC4 A multilingual colossal, cleaned version of Common Crawl's web crawl corpus. Hugging Face

N-gram

Name Description Size License Creator Download
Unigram from OSCAR Corpus Unigram from OSCAR Corpus Korakot Chaovavanich Facebook
TTC N-gram from Thai text book 3,037,772 words Website
Thai National Corpus Thai National Corpus (Unigram, Bi-gram, Tri-gram) Faculty of Arts, Chulalongkorn University Website

Word Similarity

Name Description Size License Creator Download
Word Similarity Datasets for Thai Language This repo contains translated and re-rated datasets for word similarity for Thai language. Ponrudee Netisopakul, Gerhard Wohlgenannt, Aleksei Pulich GitHub

Grapheme to Phoneme

Name Description Size License Creator Download
Grapheme to Phoneme Thai Grapheme to Phoneme from Wiktionary 14,483 words CC BY-SA 3.0 Wannaphong Phatthiyaphaibun Facebook

Name

Name Description Size License Creator Download
Thai Male and Female Names Corpus The project contains Thai male, female, and family names, intended for Thai language analysis. 22,058 names CC BY-SA 4.0 Korkeat W. GitHub

WordNet

Name Description Size License Creator Download
Open Multilingual Wordnet The goal is to make it easy to use wordnets in multiple languages. 81% Website
th-wn-sqlite Thai wordnet in SQLite - Vee Satayamas sourceforge
ธนนท์ หลีน้อย 2008 ธนนท์ หลีน้อย Website
ปริศนา อัครพุทธิพร Data 2008 ปริศนา อัครพุทธิพร Website
