
Pre-trained models/Models




Text summarization

| Model | Detail | Paper | Source |
|---|---|---|---|
| mT5 | Multilingual T5 | mT5: A massively multilingual pre-trained text-to-text transformer | GitHub |

... [WIP]

Word embeddings

| Name | Detail | Download |
|---|---|---|
| ConceptNet Numberbatch | A set of semantic vectors (also known as word embeddings) that can be used directly as a representation of word meanings or as a starting point for further machine learning. | GitHub |
| Thai2Fit (old Thai2Vec) | | Homepage; word2vec download: PyThaiNLP |
| LTW2V: The Large Thai Word2Vec | LTW2V is the Large Thai Word2Vec. It was built with oxidized-thainlp from the OSCAR corpus (Open Super-large Crawled Aggregated coRpus). | GitHub |
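
As a rough illustration of how word vectors like those listed above are typically used, the sketch below parses vectors from the plain-text word2vec layout (a header line with the word count and dimension, then one word and its values per line) and compares two words by cosine similarity. It assumes the downloaded file uses that layout, which should be checked against each project's documentation. The helper names `load_text_vectors` and `cosine` and the three-word toy sample are hypothetical and do not come from any of the listed projects.

```python
import io
import math


def load_text_vectors(handle):
    """Parse word2vec text format: a 'count dim' header, then 'word v1 ... vN' per line."""
    _count, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy stand-in for a real embedding file (assumption: 3 words, 3 dimensions).
SAMPLE = """3 3
cat 1.0 0.0 0.0
dog 0.8 0.6 0.0
car 0.0 0.0 1.0
"""

vectors = load_text_vectors(io.StringIO(SAMPLE))
print("cat vs dog:", cosine(vectors["cat"], vectors["dog"]))
print("cat vs car:", cosine(vectors["cat"], vectors["car"]))
```

For a real file, the `io.StringIO` object would be replaced by an opened text file (for example via `open(path, encoding="utf-8")`). Dedicated libraries load large vector files far more efficiently than this pure-Python loop.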

... [WIP]

Language model

| Name | Detail | Owner | Download |
|---|---|---|---|
| Thai2Fit | ULMFiT language modeling, text feature extraction, and text classification in the Thai language. Created as part of PyThaiNLP with a ULMFiT implementation. | Charin Polpanumas | GitHub |
| BERT-th | BERT pre-training in the Thai language | ThAIKeras | GitHub |
| BERT-Base, Multilingual Cased | 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters | Google | GitHub |
| bert-base-th-cased | One of several smaller versions of bert-base-multilingual-cased that handle a custom number of languages. | Geotrend | Hugging Face |
| WangchanBERTa | Pretraining transformer-based Thai language models | AI Research Institute of Thailand (AIResearch) | GitHub & Hugging Face |


Sentence Embedding

| Name | Detail | Owner | Download |
|---|---|---|---|
| LASER | Language-Agnostic SEntence Representations | Facebook | GitHub |
| MUSE | Multilingual Universal Sentence Encoder for Semantic Retrieval | Google | TensorFlow Hub |