Sentence Segmentation

Corpus

| Name | Description | Size | License | Creator | Download |
| --- | --- | --- | --- | --- | --- |
| Orchid Corpus | Thai part-of-speech (POS) tagged corpus | 5,200 sentences | CC BY-NC-SA 4.0 | NECTEC | Mirror from @wannaphong |
| Blackboard Treebank | Thai dependency corpus based on the LST20 Annotation Guideline. It annotates dependency structures, constituency structures, word boundaries, named entities, clause boundaries, and sentence boundaries. | 122,851 clauses (38,558 sentences) | CC BY 3.0 | Prachya Boonkwan, NECTEC | bitbucket |
| Fake review | | | CC BY-SA 4.0 | AI Research Institute of Thailand (AIResearch) | GitHub |

Software

| Name | Description | Status | Language | License |
| --- | --- | --- | --- | --- |
| PyThaiNLP | Sentence segmentation included in the PyThaiNLP library | active | Python 3.X | Apache License 2.0 |
| TLTK | Thai Language Toolkit | active | Python 3.X | BSD License (BSD-3-Clause) |
| BoydCut | Bidirectional LSTM-CNN model for Thai sentence segmentation | active | Python 3.X | MIT License |
| ThaiSum | Simple Thai sentence segmenter | active | Python 3.X | Apache License 2.0 |