Language model

Text Corpus

| Name | Description | Size | License | Creator | Download |
| --- | --- | --- | --- | --- | --- |
| Thai Constitution Corpus | The Constitution of Thailand dataset, since 1932 | | Public Domain | Wannaphong Phatthiyaphaibun | GitHub |
| Thai Law | Thai Law Dataset (Acts of Parliament) | | Public Domain | Wannaphong Phatthiyaphaibun | GitHub |
| IO-LM | Learn how to talk like an Information-Operation-er | | | | GitHub |
| HC corpora | A collection of corpora for various languages, freely available to download. Homepage: http://corpora.epizy.com/about.html | | | | MediaFire |
| thai-joke-corpus | Thai jokes scraped from 4 Thai joke Facebook pages, collected by iApp Technology Co, Ltd. | 449 jokes | GPL-3.0 | iApp Technology Co, Ltd | GitHub |
| Thai Literature Corpora (TLC) | Texts from Vajirayana Digital Library, stored by chapters and stanzas (non-tokenized) | 34 documents, 292,270 lines, 31,790,734 characters | | Jitkapat Sawatphol | Website |
| HSE Thai Corpus | A 35-million-word corpus of Thai | | | | Kaggle |
| ThaiGov corpus | Data from the Thai government website | | Public Domain | Wannaphong Phatthiyaphaibun | GitHub |
| ThaiGov V2 Corpus | Thai news dataset from the Thai government website | | Public Domain | Wannaphong Phatthiyaphaibun | GitHub |
| OSCAR Corpus | OSCAR (Open Super-large Crawled Aggregated coRpus) is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture | 951,743,087 words | Public Domain | | Homepage |
| mC4 | A multilingual, colossal, cleaned version of Common Crawl's web crawl corpus | | | | Hugging Face |
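
The Size column mixes units (jokes, words, documents, lines, characters), which makes the corpora hard to compare. The following is a minimal sketch, not part of any listed project, that computes comparable counts for a corpus once it has been downloaded. It assumes the corpus is a directory of UTF-8 `.txt` files, which is a hypothetical layout; the function name `corpus_stats` is also made up for illustration. Thai is not written with spaces between words, so the whitespace-separated `chunks` count is only a rough proxy and not a word count.

```python
from pathlib import Path


def corpus_stats(root):
    """Count documents, lines, characters and whitespace-separated chunks
    in every .txt file under `root` (assumed layout: UTF-8 text files)."""
    stats = {"documents": 0, "lines": 0, "characters": 0, "chunks": 0}
    for path in sorted(Path(root).rglob("*.txt")):
        text = path.read_text(encoding="utf-8")
        stats["documents"] += 1
        stats["lines"] += len(text.splitlines())
        stats["characters"] += len(text)
        # Not a word count for Thai: words are not space-delimited.
        stats["chunks"] += len(text.split())
    return stats
```

For real word counts, the text would first need to go through a Thai word tokenizer; which tokenizer to use is left open here.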

Pretrained

| Name | Detail | Owner | Download |
| --- | --- | --- | --- |
| Thai2Fit | ULMFit Language Modeling, Text Feature Extraction and Text Classification in Thai Language. Created as part of pyThaiNLP with ULMFit implementation from fast.ai | Charin Polpanumas | GitHub |
| BERT-th | BERT pre-training in Thai language | ThAIKeras | GitHub |
| BERT-Base, Multilingual Cased | 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters | Google | GitHub |
| bert-base-th-cased | We are sharing smaller versions of bert-base-multilingual-cased that handle a custom number of languages. | Geotrend | Hugging Face |
| WangchanBERTa | Pretraining transformer-based Thai Language Models | AI Research Institute of Thailand (AIResearch) | GitHub & Hugging Face |
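
For the entries distributed through Hugging Face, a model can typically be loaded with the `transformers` library. The sketch below is an illustration only: the model IDs are assumptions about where these two checkpoints are currently hosted, the helper name `build_fill_mask` is hypothetical, and the `transformers` package (plus, for some tokenizers, `sentencepiece`) must be installed separately. Nothing is downloaded until the function is called.

```python
# Assumed Hugging Face model IDs for two entries in the table above;
# check the linked model pages if these have moved.
MODEL_IDS = {
    "WangchanBERTa": "airesearch/wangchanberta-base-att-spm-uncased",
    "bert-base-th-cased": "Geotrend/bert-base-th-cased",
}


def build_fill_mask(name):
    """Return a fill-mask pipeline for one of the models in MODEL_IDS.

    transformers is imported lazily so that this module can be imported
    without the dependency; weights are downloaded on the first call.
    """
    from transformers import pipeline  # pip install transformers

    return pipeline("fill-mask", model=MODEL_IDS[name])
```

A call such as `fill = build_fill_mask("WangchanBERTa")` returns a pipeline whose mask token is available as `fill.tokenizer.mask_token`; appending that token to a Thai prompt and passing the result to `fill(...)` yields candidate completions with scores. Each model card documents any text preprocessing it expects, which this sketch does not perform.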

Notebook