Other

Dictionaries

Name Description Size License Creator Download
LEXiTRON Thai<->English Dictionary Thai-English 83,000 words CC BY-NC-SA 3.0 NECTEC aiforthai (registration required)
Yaitron Yaitron is an English-Thai and Thai-English dictionary based on LEXiTRON, in development since May 2006. Its objective is to build a dictionary that is formatted in well-formed XML and easy to manipulate by machine. LEXiTRON License Vee Satayamas GitHub
Volubilis Dict - Thai-English-French VOLUBILIS - Thai English French Database SourceForge
Ground-truth bilingual dictionaries 110 large-scale ground-truth bilingual dictionaries train: 5,000 words, test: 1,500 words Facebook Research GitHub

N-gram

Name Description Size License Creator Download
Unigram from OSCAR Corpus Unigram from OSCAR Corpus Korakot Chaovavanich Facebook
TTC N-grams from Thai textbooks 3,037,772 words Website
Thai National Corpus Thai National Corpus (Unigram, Bi-gram, Tri-gram) Faculty of Arts, Chulalongkorn University Website

Word Similarity

Name Description Size License Creator Download
Word Similarity Datasets for Thai Language This repo contains translated and re-rated datasets for word similarity for Thai language. Ponrudee Netisopakul, Gerhard Wohlgenannt, Aleksei Pulich GitHub
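
Word similarity datasets of this kind are commonly used to check how well a model's similarity scores agree with human ratings, usually via Spearman rank correlation. The source does not prescribe an evaluation procedure, so the following is only a minimal sketch of that common approach in plain Python. The ratings and model scores are invented for illustration and are not taken from the dataset.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical human ratings and model similarities for four word pairs.
human_ratings = [9.1, 7.5, 3.2, 0.8]
model_scores = [0.83, 0.61, 0.35, 0.40]
rho = spearman(human_ratings, model_scores)
```

A value of `rho` close to 1 would mean the model orders the word pairs much as the human raters did.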

Thai Name

Name Description Size License Creator Download
Thai Male and Female Names Corpus The project contains Thai male, female, and family names, intended for Thai language analysis. 22,058 names CC BY-SA 4.0 Korkeat W. GitHub

WordNet

Name Description Size License Creator Download
Open Multilingual Wordnet The goal is to make it easy to use wordnets in multiple languages. 81% Website
th-wn-sqlite Thai wordnet in SQLite - Vee Satayamas SourceForge
ธนนท์ หลีน้อย 2008 ธนนท์ หลีน้อย Website
ปริศนา อัครพุทธิพร Data 2008 ปริศนา อัครพุทธิพร Website

Word embeddings

Name Detail Download
ConceptNet Numberbatch ConceptNet Numberbatch is a set of semantic vectors (also known as word embeddings) that can be used directly as a representation of word meanings or as a starting point for further machine learning. GitHub
FastText Word vectors The pre-trained word vectors for 157 languages, trained on Common Crawl and Wikipedia using fastText. Website
Thai2Fit (formerly Thai2Vec) Homepage
Download word2vec: PyThaiNLP
LTW2V: The Large Thai Word2Vec LTW2V is the Large Thai Word2Vec. It was built with oxidized-thainlp from the OSCAR Corpus (Open Super-large Crawled Aggregated coRpus). GitHub
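
As a rough illustration of how such vectors are used as a representation of word meanings, the sketch below parses the word2vec-style text layout (a `count dim` header line followed by one `word v1 ... vN` row per word) and compares words by cosine similarity. It assumes the downloaded file uses that text layout, which should be checked for each resource. The three-dimensional vectors are made up for the example and the helper names are hypothetical.

```python
import io
import math


def load_text_vectors(handle):
    """Parse word2vec-style text: a 'count dim' header, then 'word v1 ... vN' rows."""
    _count, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip rows that do not match the declared dimension
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy vectors invented for illustration; real files have hundreds of dimensions.
SAMPLE = "3 3\nแมว 0.9 0.1 0.0\nสุนัข 0.8 0.2 0.1\nรถ 0.0 0.1 0.9\n"
vectors = load_text_vectors(io.StringIO(SAMPLE))

cat_dog = cosine(vectors["แมว"], vectors["สุนัข"])
cat_car = cosine(vectors["แมว"], vectors["รถ"])
```

With a real file, `io.StringIO(SAMPLE)` would be replaced by `open(path, encoding="utf-8")`. For large vocabularies a dedicated loader is likely a better choice than this pure-Python parser.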

Sentence Embedding

Name Detail Paper Owner Download
LASER LASER Language-Agnostic SEntence Representations Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond Facebook GitHub
MUSE Multilingual Universal Sentence Encoder for Semantic Retrieval Multilingual Universal Sentence Encoder for Semantic Retrieval Google TensorFlow Hub
LaBSE Language-Agnostic BERT Sentence Embedding by Google AI. Language-agnostic BERT Sentence Embedding Google