Word Segmentation

Thai is written without spaces between words, so word segmentation, which splits running text into words, is the first step in processing Thai text.

Corpus

Name | Description | Size | License | Creator | Download
BEST I (BEST 2009) | Benchmark for Enhancing the Standard of Thai language processing | 5,000,000 words | CC BY-NC-SA 4.0 | NECTEC | aiforthai (registration required) and mirror from @korakot
Blackboard Treebank | A Thai dependency corpus based on the LST20 Annotation Guideline. It features dependency structures, constituency structures, word boundaries, named entities, clause boundaries, and sentence boundaries. | 122,851 clauses (38,558 sentences) | CC BY 3.0 | Prachya Boonkwan, NECTEC | bitbucket
Wisesight Samples with Word Tokenization Label | Samples of Thai social media text, tokenized by humans and randomly drawn from the full Wisesight Sentiment Corpus. | 160 sentences (wisesight-160) and 1,000 sentences (wisesight-1000) | CC0-1.0 | Nitchakarn Chantarapratin, Pattarawat Chormai, Ponrawee Prasertsom, Jitkapat Sawatphol, Nozomi Yamada, and Attapol Rutherford | GitHub
Thai National Historical Corpus (TNHC) | Texts from the Thai National Historical Corpus, stored by lines (manually tokenized). | 47 documents, 756,478 lines, 13,361,142 characters | - | Jitkapat Sawatphol | GitHub
Orchid Corpus | Thai part-of-speech (POS) tagged corpus | 5,200 sentences | CC BY-NC-SA 3.0 | NECTEC | Mirror from @wannaphong
Corpus Komped Poem (windy part) | - | 317 words | CC-BY-SA 3.0 | Pattarawat Chormai | GitHub
VISTEC-TP-TH-21 | Social media domain dataset for Thai text processing (word segmentation, misspelling correction and detection, and named-entity boundaries), also called "VISTEC-TP-TH-2021" or VISTEC-2021. | 49,997 sentences with 3.39M words | CC-BY-SA 3.0 | VISTEC & Chiang Mai University | GitHub

BEST I

BEST I is the Benchmark for Enhancing the Standard of Thai language processing.

  • Number of words: 5,000,000 words

Details

  • Creator: NECTEC
  • License: CC BY-NC-SA 4.0
  • Paper:
  • Download: aiforthai (registration required)

Benchmarks

We do not report benchmarks for this corpus because we do not have the answers for its test set.

Blackboard Treebank

Blackboard Treebank is a Thai dependency corpus based on the LST20 Annotation Guideline. It features dependency structures, constituency structures, word boundaries, named entities, clause boundaries, and sentence boundaries.

  • 122,851 clauses (38,558 sentences)

Details

  • Creator: NECTEC
  • License: CC-BY 3.0
  • Download: bitbucket

Benchmarks

[WIP]

Orchid Corpus

Orchid Corpus is a Thai part-of-speech (POS) tagged corpus that also includes word segmentation.

  • Number of sentences: 5,200 sentences

Details

  • Creator: NECTEC
  • License: CC BY-NC-SA 3.0
  • Download: Mirror from @wannaphong

Benchmarks

The Orchid Corpus does not have a test set.

Wisesight Corpus

This corpus contains samples of Thai social media text, tokenized by humans. The samples are randomly drawn from the full Wisesight Sentiment Corpus.

  • wisesight-160 has 160 sentences. Number of words: 3,833 words
  • wisesight-1000 has 1,000 sentences. Number of words: 21,745 words

Benchmarks

[WIP]

Thai National Historical Corpus

The Thai National Historical Corpus (TNHC) contains texts stored by lines and tokenized by humans.

  • Number of words: unknown
  • 47 documents, 756,478 lines, 13,361,142 characters

Details

  • Creator: Jitkapat Sawatphol
  • Download: GitHub

Corpus Komped Poem (windy part)

  • Number of words: 317 words

Details

  • Creator: Pattarawat Chormai
  • License: CC-BY-SA 3.0
  • Paper: -
  • Download: GitHub

Benchmarks

[WIP]

VISTEC-TP-TH-21

VISTEC-TP-TH-21, also called "VISTEC-TP-TH-2021" or VISTEC-2021, is presented as the largest social media domain dataset for Thai text processing (word segmentation, misspelling correction and detection, and named-entity boundaries).

  • Number of words: 3.39M words

Details

  • Creator: VISTEC & Chiang Mai University
  • License: CC-BY-SA 3.0
  • Paper: -
  • Download: GitHub

Software

Name | Description | Status | Language | License
ICU | International Components for Unicode | active | C/C++/Java | Unicode License
libthai | A set of Thai language support routines aimed at easing developers' tasks when incorporating Thai language support into their applications. | active | C/C++ | LGPL-2.1 License
SWATH | Smart Word Analysis for THai | active | C/C++ | GPL-2.0 License
AttaCut | Fast and reasonably accurate word tokenizer for Thai. | active | Python 3.X | MIT License
PyThaiNLP | Word segmentation is part of PyThaiNLP. | active | Python 3.X | Apache License 2.0
PyWordCut | wordcutpy is a simple Thai word breaker written in Python 3+. | active | Python 3.X | LGPLv3
DeepCut | A Thai word tokenization library using a deep neural network. | active | Python 3.X | MIT License
TLTK | Thai Language Toolkit | active | Python 3.X | BSD License (BSD-3-Clause)
KUCut | A Thai word segmenter that differs from existing segmenters such as CTTEX or SWATH. | inactive | Python 2.4-2.5 | GPL-2.0 License
SEFR CUT | Stacked Ensemble Filter and Refine for Word Segmentation | active | Python 3.X | MIT License
CutKum | Thai word segmentation with LSTM in TensorFlow | - | Python 3.X | MIT License
ThaiLMCut | Word tokenizer for Thai based on transfer learning and bidirectional LSTM | active | Python 3.X | MIT License
LexTo | Thai word segmentation (longest matching) | - | Java | LGPLv2.1
sertiscorp/thai-word-segmentation | Thai word segmentation with bi-directional RNN | - | Python 3.X | MIT License
Thai Analysis Plugin for Elasticsearch | The Thaichub2 (thai-chub-chub) Analysis Plugin integrates the Thai word segmentation modules into Elasticsearch. | active | Java | Apache-2.0 License
Wordcut | Thai word breaker for Node.js | active | JavaScript, Node.js | LGPLv3
V8 BreakIterator | Chrome's V8 engine, using ICU | active | JavaScript | Apache License 2.0
icu-wordsplit | Simple ICU boundary analysis module bindings for Node.js | inactive | JavaScript | BSD
newmm-tokenizer | Standalone dictionary-based, Maximum Matching + Thai Character Cluster (newmm) tokenizer extracted from PyThaiNLP. | active | Python 3.X | Apache License 2.0
Stanza | Official Stanford NLP Python library for many human languages | active | Python 3.X | Apache License 2.0
Multi Candidate Thai Word Segmentation | Most existing word segmentation methods output one single segmentation solution. | active | Python 3.X | MIT License
PhlongTaIam | PHP Thai word breaker | active | PHP | LGPL-2.1 License
Chamkho | Rust Thai word breaker | active | Rust | LGPL-3 License
oxidized-thainlp | Thai Natural Language Processing in Rust, with Python binding. | active | Python & Rust | Apache License 2.0
OSKut | Handling Cross- and Out-of-Domain Samples in Thai Word Segmentation (ACL 2021 Findings). Stacked Ensemble Framework and DeepCut as baseline model. | active | Python | MIT License
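
Several entries above describe themselves as dictionary-based (LexTo: longest matching; newmm-tokenizer: maximum matching). The following is a minimal sketch of greedy longest matching, the simplest form of that idea. It is not the implementation of any tool listed here; the function name `longest_match_segment` and the toy dictionary are made up for illustration.

```python
def longest_match_segment(text, dictionary):
    """Greedy left-to-right longest matching against a set of known words."""
    # Longest dictionary entry bounds how far ahead we need to look.
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary hit.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No dictionary word starts here: emit one character and move on.
            tokens.append(text[i])
            i += 1
    return tokens


# Toy dictionary, invented for this example (not taken from any corpus above).
toy_dictionary = {"ฉัน", "รัก", "ภาษา", "ไทย", "ภาษาไทย"}
tokens = longest_match_segment("ฉันรักภาษาไทย", toy_dictionary)
print("|".join(tokens))  # ฉัน|รัก|ภาษาไทย
```

Greedy matching always prefers the longest word at each position, so it can pick a wrong boundary when a shorter word would let the rest of the sentence segment cleanly. That limitation is one reason the table also lists maximum-matching and neural approaches.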

Tools

Name | Description | License | Creator | Download
MudYom | MudYom is a module for pre/post-processing text. It combines, aka มัด, words that should be together into one token. This process is done according to a user-defined dictionary. | - | Pattarawat Chormai | GitHub
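
The kind of post-processing described for MudYom, combining tokens that belong together according to a user-defined dictionary, can be sketched as follows. This is an illustration of the general idea only and does not use or reproduce MudYom's actual API; `merge_tokens` and `user_words` are hypothetical names.

```python
def merge_tokens(tokens, user_words):
    """Merge runs of adjacent tokens whose concatenation is a user-defined word."""
    merged = []
    i = 0
    while i < len(tokens):
        # Prefer the longest run (at least two tokens) that joins into a user word.
        for j in range(len(tokens), i + 1, -1):
            candidate = "".join(tokens[i:j])
            if candidate in user_words:
                merged.append(candidate)
                i = j
                break
        else:
            # Nothing to merge at this position: keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


# Output of some tokenizer that split a word the user wants kept whole.
tokens = ["ฉัน", "รัก", "ภาษา", "ไทย"]
user_words = {"ภาษาไทย"}  # user-defined dictionary, invented for this example
print(merge_tokens(tokens, user_words))  # ['ฉัน', 'รัก', 'ภาษาไทย']
```

Running the merge as a separate step keeps it independent of the tokenizer, so the same user dictionary can be applied to the output of any segmenter listed in the Software table.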