Word Segmentation

For Thai, word segmentation is the first step in text processing: Thai is written without spaces between words, so text must be segmented into words before any further analysis.
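
For example, a dictionary-based tokenizer such as PyThaiNLP's newmm engine can segment a sentence in one call (a minimal sketch; the example sentence is arbitrary):

```python
# pip install pythainlp
from pythainlp.tokenize import word_tokenize

text = "ผมกินข้าว"  # "I eat rice" -- written with no spaces between words
print(word_tokenize(text, engine="newmm"))  # e.g. ['ผม', 'กิน', 'ข้าว']
```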

Corpus

| Name | Description | Size | License | Creator | Download |
|------|-------------|------|---------|---------|----------|
| BEST I (BEST 2009) | Benchmark for Enhancing the Standard of Thai language processing | 5,000,000 words | CC BY-NC-SA 4.0 | NECTEC | aiforthai (registration required) and mirror from @korakot |
| LST20 Corpus | A large-scale corpus with multiple layers of linguistic annotation for Thai language processing | 3,164,864 words; 288,020 named entities; 248,962 clauses; 74,180 sentences | ? (free for research and open-source use only) | NECTEC | aiforthai (registration required) |
| Wisesight Samples with Word Tokenization Label | Samples of Thai social media text, tokenized by humans, randomly drawn from the full Wisesight Sentiment Corpus | 160 sentences (wisesight-160) and 1,000 sentences (wisesight-1000) | CC0-1.0 | Nitchakarn Chantarapratin, Pattarawat Chormai, Ponrawee Prasertsom, Jitkapat Sawatphol, Nozomi Yamada, and Attapol Rutherford | GitHub |
| Thai National Historical Corpus (TNHC) | Texts from the Thai National Historical Corpus, stored by lines (manually tokenized) | 47 documents; 756,478 lines; 13,361,142 characters | ? | Jitkapat Sawatphol | GitHub |
| Orchid Corpus | Thai part-of-speech (POS) tagged corpus | 5,200 sentences | CC BY-NC-SA 3.0 | NECTEC | mirror from @wannaphong |
| Corpus Komped Poem (windy part) | Thai poem corpus, tokenized | 317 words | CC BY-SA 3.0 | Pattarawat Chormai | GitHub |
| VISTEC-TP-TH-21 | The largest social-media-domain dataset for Thai text processing (word segmentation, misspelling detection and correction, and named-entity boundaries), also called "VISTEC-TP-TH-2021" or VISTEC-2021 | 49,997 sentences; 3.39M words | CC BY-SA 3.0 | VISTEC & Chiang Mai University | GitHub |

BEST I

BEST I is the Benchmark for Enhancing the Standard of Thai language processing.

  • Number of words: 5,000,000 words

Details

  • Creator: NECTEC
  • License: CC BY-NC-SA 4.0
  • Paper:
  • Download: aiforthai (registration required)

Benchmarks

We do not report benchmark results for this corpus because the answers for its test set are not available.
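
If the gold answers were available, a standard word-level precision/recall/F1 could be computed from `|`-delimited segmentations (a minimal sketch, assuming both strings segment the same underlying text; BEST-style data separates words with `|`):

```python
def spans(segmented: str, sep: str = "|") -> set:
    """Convert a |-delimited segmentation into (start, end) character spans."""
    out, pos = set(), 0
    for word in segmented.split(sep):
        if word:
            out.add((pos, pos + len(word)))
            pos += len(word)
    return out

def word_f1(gold: str, pred: str):
    g, p = spans(gold), spans(pred)
    tp = len(g & p)  # words with both boundaries correct
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(word_f1("ผม|กิน|ข้าว", "ผม|กินข้าว"))  # (0.5, 0.333..., 0.4)
```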

LST20 Corpus

LST20 Corpus is a large-scale corpus with multiple layers of linguistic annotation for Thai language processing.

  • Number of words: 3,164,864 words
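
LST20 is distributed in a CoNLL-like layout (one token per line, tab-separated annotation columns, blank line between sentences); a minimal loading sketch under that assumption, with a hypothetical file path:

```python
def read_lst20(path: str):
    """Read a CoNLL-like file: token lines with tab-separated fields,
    blank lines as sentence boundaries. Returns a list of sentences."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:  # blank line ends the current sentence
                if current:
                    sentences.append(current)
                    current = []
            else:
                current.append(line.split("\t"))
    if current:
        sentences.append(current)
    return sentences

sentences = read_lst20("LST20/train/file.txt")  # hypothetical path
words = [fields[0] for sent in sentences for fields in sent]  # first column assumed to be the token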

Details

  • Creator: NECTEC
  • License: ? (free for research and open-source use only)
  • Download: aiforthai (registration required)

Benchmarks

[WIP]

Orchid Corpus

Orchid Corpus is a Thai part-of-speech (POS) tagged corpus that also includes word segmentation annotation.

  • Number of sentences: 5,200

Details

  • Creator: NECTEC
  • License: CC BY-NC-SA 3.0
  • Download: mirror from @wannaphong

Benchmarks

Orchid Corpus does not have a test set.

Wisesight Corpus

This dataset contains samples of Thai social media text, tokenized by humans. The samples are randomly drawn from the full Wisesight Sentiment Corpus.

  • wisesight-160 has 160 sentences. Number of words: 3,833 words
  • wisesight-1000 has 1,000 sentences. Number of words: 21,745 words
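
The samples are stored one sentence per line, with `|` separating the human-assigned tokens (assuming the repository's delimiter convention); a minimal loading sketch, with a hypothetical local filename:

```python
# The file comes from the GitHub repository linked above.
with open("wisesight-160.txt", encoding="utf-8") as f:
    sentences = [line.rstrip("\n").split("|") for line in f if line.strip()]

print(len(sentences))                  # expected: 160 sentences
print(sum(len(s) for s in sentences))  # expected: 3,833 words
```
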
Benchmarks

[WIP]

Thai National Historical Corpus

The Thai National Historical Corpus (TNHC) collects texts from Thai historical documents, stored by lines and manually tokenized.

  • Number of words: ? words
  • 47 documents, 756,478 lines, 13,361,142 characters

Details

  • Creator: Jitkapat Sawatphol
  • Download: GitHub

Corpus Komped Poem (windy part)

  • Number of words: 317 words

Details

  • Creator: Pattarawat Chormai
  • License: CC-BY-SA 3.0
  • Paper: -
  • Download: GitHub

Benchmarks

[WIP]

VISTEC-TP-TH-21

VISTEC-TP-TH-2021 (also called VISTEC-2021) is the largest social-media-domain dataset for Thai text processing, annotated for word segmentation, misspelling detection and correction, and named-entity boundaries.

  • Number of words: 3.39M words (49,997 sentences)

Details

  • Creator: VISTEC & Chiang Mai University
  • License: CC-BY-SA 3.0
  • Paper: -
  • Download: GitHub

Software

| Name | Description | Status | Language | License |
|------|-------------|--------|----------|---------|
| ICU | International Components for Unicode | active | C/C++/Java | Unicode License |
| libthai | A set of Thai language support routines aimed to ease developers' tasks of incorporating Thai language support in their applications | active | C/C++ | LGPL-2.1 License |
| SWATH | Smart Word Analysis for THai | active | C/C++ | GPL-2.0 License |
| AttaCut | Fast and reasonably accurate word tokenizer for Thai | active | Python 3.x | MIT License |
| PyThaiNLP | Word tokenization as part of the PyThaiNLP library | active | Python 3.x | Apache License 2.0 |
| PyWordCut | wordcutpy is a simple Thai word breaker written in Python 3+ | active | Python 3.x | LGPLv3 |
| DeepCut | A Thai word tokenization library using a deep neural network | active | Python 3.x | MIT License |
| TLTK | Thai Language Toolkit | active | Python 3.x | BSD License (BSD-3-Clause) |
| KUCut | Thai word segmentor that differs from existing segmentors such as CTTEX or SWATH | inactive | Python 2.4-2.5 | GPL-2.0 License |
| SEFR CUT | Stacked Ensemble Filter and Refine for word segmentation | active | Python 3.x | MIT License |
| CutKum | Thai word segmentation with an LSTM in TensorFlow | - | Python 3.x | MIT License |
| ThaiLMCut | Word tokenizer for Thai based on transfer learning and a bidirectional LSTM | active | Python 3.x | MIT License |
| LexTo | Thai word segmentation (longest matching) | - | Java | LGPL-2.1 |
| sertiscorp/thai-word-segmentation | Thai word segmentation with a bidirectional RNN | - | Python 3.x | MIT License |
| Thai Analysis Plugin for Elasticsearch | The Thaichub2 (thai-chub-chub) analysis plugin integrates Thai word segmentation modules into Elasticsearch | active | Java | Apache-2.0 License |
| Wordcut | Thai word breaker for Node.js | active | JavaScript/Node.js | LGPLv3 |
| newmm-tokenizer | Standalone dictionary-based, maximum-matching + Thai Character Cluster (newmm) tokenizer extracted from PyThaiNLP | active | Python 3.x | Apache License 2.0 |
| Stanza | Official Stanford NLP Python library for many human languages | active | Python 3.x | Apache License 2.0 |
| Multi Candidate Thai Word Segmentation | Outputs multiple candidate segmentations, unlike most existing methods, which output a single solution | active | Python 3.x | MIT License |
| PhlongTaIam | PHP Thai word breaker | active | PHP | LGPL-2.1 License |
| Chamkho | Rust Thai word breaker | active | Rust | LGPL-3 License |
| oxidized-thainlp | Thai natural language processing in Rust, with Python binding | active | Python & Rust | Apache License 2.0 |
| OSKut | Handling Cross- and Out-of-Domain Samples in Thai Word Segmentation (ACL 2021 Findings); stacked ensemble framework with DeepCut as the baseline model | active | Python | MIT License |
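
Most of the Python tokenizers above expose a one-call API. For example, DeepCut (a minimal sketch; the example sentence is arbitrary):

```python
# pip install deepcut
import deepcut

print(deepcut.tokenize("ตัดคำภาษาไทย"))  # e.g. ['ตัด', 'คำ', 'ภาษา', 'ไทย']
```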

Tools

| Name | Description | License | Creator | Download |
|------|-------------|---------|---------|----------|
| MudYom | A module for pre/post-processing text. It combines words that should be together into one token (มัด, "to bundle"), according to a user-defined dictionary. | ? | Pattarawat Chormai | GitHub |
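
The underlying idea can be sketched as a post-processing pass over tokenizer output (illustrative only, not MudYom's actual API): runs of adjacent tokens whose concatenation appears in a user-defined dictionary are merged back into one token.

```python
def merge_tokens(tokens, phrases):
    """Greedily merge runs of adjacent tokens whose concatenation
    appears in a user-defined phrase dictionary."""
    out, i = [], 0
    while i < len(tokens):
        merged = False
        for j in range(len(tokens), i + 1, -1):  # try the longest run first
            if "".join(tokens[i:j]) in phrases:
                out.append("".join(tokens[i:j]))
                i, merged = j, True
                break
        if not merged:
            out.append(tokens[i])
            i += 1
    return out

# "แมวน้ำ" ("seal") should stay one token even if a tokenizer splits it.
print(merge_tokens(["แมว", "น้ำ", "ว่าย", "น้ำ"], {"แมวน้ำ", "ว่ายน้ำ"}))
# ['แมวน้ำ', 'ว่ายน้ำ']
```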