Language model

Text Corpus

| Name | Description | Size | License | Creator | Download |
| --- | --- | --- | --- | --- | --- |
| Thai Constitution Corpus | The Constitution of Thailand Dataset Since 1932 | | Public Domain | Wannaphong Phatthiyaphaibun | GitHub |
| Thai Law | Thai Law Dataset (Act of Parliament) | | Public Domain | Wannaphong Phatthiyaphaibun | GitHub |
| IO-LM | Learn how to talk like an Information-Operation-er | | | | GitHub |
| HC corpora | A collection of corpora for various languages, freely available to download. Homepage: http://corpora.epizy.com/about.html | | | | MediaFire |
| thai-joke-corpus | Thai jokes scraped from four Thai joke Facebook pages, collected by iApp Technology Co, Ltd. | 449 jokes | GPL-3.0 License | iApp Technology Co, Ltd | GitHub |
| Thai Literature Corpora (TLC) | Texts from Vajirayana Digital Library, stored by chapters and stanzas (non-tokenized). | 34 documents, 292,270 lines, 31,790,734 characters | | Jitkapat Sawatphol | Website |
| HSE Thai Corpus | A 35 Million Word Corpus of Thai | | | | Kaggle |
| ThaiGov corpus | Data from the Thai government website. | | Public Domain | Wannaphong Phatthiyaphaibun | GitHub |
| ThaiGov V2 Corpus | Thai news dataset from the Thai government website. | | Public Domain | Wannaphong Phatthiyaphaibun | GitHub |
| OSCAR Corpus | OSCAR (Open Super-large Crawled Aggregated coRpus) is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. | 951,743,087 words | Public Domain | | Homepage |
| mC4 | A multilingual, colossal, cleaned version of Common Crawl's web crawl corpus. | | | | Hugging Face |
| Multilingual Open Text 1.0: Public Domain News in 44 Languages | A corpus of public domain news in 44 languages. | | Public Domain | | GitHub |
| Thai depression detection dataset and baseline models | Detecting Depression in Thai Blog Posts: a Dataset and a Baseline. | | | | Zenodo |
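
The sizes above are reported in different units (jokes, documents, lines, characters, words), which makes corpora hard to compare. As a rough aid, here is a minimal sketch of how document, line, and character counts like those listed for TLC could be computed for any corpus once it is loaded as a collection of strings. The names `CorpusStats` and `corpus_stats` are hypothetical helpers, not part of any project listed here, and the counting rules (Unicode code points, line breaks excluded) are assumptions that may differ from how each corpus author measured size. Word counts are left out on purpose: Thai is written without spaces between words, so counting words needs a tokenizer.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class CorpusStats:
    """Simple size summary for a text corpus (hypothetical helper)."""
    documents: int = 0
    lines: int = 0
    characters: int = 0


def corpus_stats(documents: Iterable[str]) -> CorpusStats:
    """Count documents, lines, and characters in an iterable of texts.

    Characters are counted as Unicode code points, excluding line breaks,
    so Thai vowel and tone marks each count as one character.
    """
    stats = CorpusStats()
    for text in documents:
        lines = text.splitlines()
        stats.documents += 1
        stats.lines += len(lines)
        stats.characters += sum(len(line) for line in lines)
    return stats


if __name__ == "__main__":
    sample = ["สวัสดี\nครับ", "abc"]
    print(corpus_stats(sample))
```

In practice the strings would come from reading each file of a downloaded corpus; passing a generator keeps memory use low for the larger collections.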

Pretrained Encoders

| Name | Detail | Owner | Download |
| --- | --- | --- | --- |
| Thai2Fit | ULMFit Language Modeling, Text Feature Extraction and Text Classification in Thai Language. Created as part of pyThaiNLP with ULMFit implementation from fast.ai | Charin Polpanumas | GitHub |
| BERT-th | BERT pre-training in Thai language | ThAIKeras | GitHub |
| BERT-Base, Multilingual Cased | 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters | Google | GitHub |
| bert-base-th-cased | We are sharing smaller versions of bert-base-multilingual-cased that handle a custom number of languages. | Geotrend | Hugging Face |
| WangchanBERTa | Pretraining transformer-based Thai Language Models | AI Research Institute of Thailand (AIResearch) | GitHub & Hugging Face |
| mLUKE | A multilingual extension of LUKE. | | Hugging Face |
| TwHIN-BERT | TwHIN-BERT: A Socially-Enriched Pre-trained Language Model for Multilingual Tweet Representations | Twitter | GitHub |
| PhayaThaiBERT | 278M | P. Sriwirote | |
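
To show where a figure like "12-layer, 768-hidden, 12-heads, 110M parameters" comes from, the sketch below counts the weights of a standard BERT-style encoder. Only the layer count and hidden size come from the table; the feed-forward size (3072), maximum positions (512), two segment types, and the 30,522-token vocabulary are assumptions taken from the usual BERT-Base configuration, and `bert_param_count` is a hypothetical helper. The number of attention heads does not change the total, since the heads only split the same projection matrices. With these assumptions the count comes to about 109.5M. The multilingual cased checkpoint uses a larger vocabulary, so its embedding table, and therefore its total, would be larger than this estimate.

```python
def bert_param_count(
    vocab_size: int,
    hidden: int = 768,
    layers: int = 12,
    intermediate: int = 3072,   # assumed feed-forward size (4 x hidden)
    max_positions: int = 512,   # assumed maximum sequence length
    type_vocab: int = 2,        # assumed number of segment types
) -> int:
    """Estimate the parameter count of a BERT-style encoder with pooler."""
    # Token, position, and segment embeddings, plus one LayerNorm (weight + bias).
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + 2 * hidden

    # Self-attention: query, key, value, and output projections with biases.
    attention = 4 * (hidden * hidden + hidden)

    # Feed-forward block: two linear layers with biases.
    feed_forward = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)

    # Two LayerNorms per layer, each with weight and bias.
    layer_norms = 2 * 2 * hidden

    per_layer = attention + feed_forward + layer_norms

    # Pooler: one dense layer applied to the first token's representation.
    pooler = hidden * hidden + hidden

    return embeddings + layers * per_layer + pooler


if __name__ == "__main__":
    # 30,522 is the assumed English BERT-Base vocabulary size.
    total = bert_param_count(vocab_size=30_522)
    print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Passing a different `vocab_size` shows how strongly the embedding table drives the size of multilingual encoders.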

Notebook

LLMs

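| Name | Parameters | Detail | Owner | Download |
| --- | --- | --- | --- | --- |
| OpenThaiGPT | 13B | | Kobkrit | GitHub |
| Typhoon | 7B | | SCB10X | Hugging Face |
| SeaLLMs | 13B | | DAMO | GitHub |
| Sea-Lion | 7.5B | | AI Singapore | GitHub |
| WangChanGLM | 7.5B | | VISTEC-PyThaiNLP | GitHub |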