Sentence Segmentation

Corpus

| Name | Description | Size | License | Creator | Download |
| --- | --- | --- | --- | --- | --- |
| Orchid Corpus | Thai part-of-speech (POS) tagged corpus | 5,200 sentences | CC BY-NC-SA 4.0 | NECTEC | Mirror from @wannaphong |
| LST20 Corpus | Large-scale corpus with multiple layers of linguistic annotation for Thai language processing | 3,164,864 words, 288,020 named entities, 248,962 clauses, and 74,180 sentences | ? (Free for research and open source only) | NECTEC | aiforthai (registration required) |
| Fake review | | | CC BY-SA 4.0 | AI Research Institute of Thailand (AIResearch) | GitHub |

Software

| Name | Description | Status | Language | License |
| --- | --- | --- | --- | --- |
| PyThaiNLP | Sentence tokenizer included in the PyThaiNLP library | active | Python 3.X | Apache License 2.0 |
| TLTK | Thai Language Toolkit | active | Python 3.X | BSD License (BSD-3-Clause) |
| BoydCut | Bidirectional LSTM-CNN model for Thai sentence segmentation | active | Python 3.X | MIT License |
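
As a rough illustration of how one of the tools above might be called, the sketch below wraps PyThaiNLP's `sent_tokenize` function from `pythainlp.tokenize`. It assumes PyThaiNLP is installed (`pip install pythainlp`) and that the `crfcut` engine and its dependencies are available. The names `split_sentences` and `whitespace_fallback` are hypothetical helpers introduced here, not part of any library, and the whitespace fallback is only a crude stand-in for when PyThaiNLP cannot be imported.

```python
"""Minimal sketch: Thai sentence segmentation with PyThaiNLP."""
from typing import List

try:
    # Real PyThaiNLP function; requires `pip install pythainlp`.
    from pythainlp.tokenize import sent_tokenize
except ImportError:
    sent_tokenize = None  # PyThaiNLP is not installed


def whitespace_fallback(text: str) -> List[str]:
    """Hypothetical helper (not part of PyThaiNLP): split on runs of whitespace.

    Thai often uses a space between sentences, but spaces also appear
    inside sentences, so this is a weak baseline only.
    """
    return text.split()


def split_sentences(text: str, engine: str = "crfcut") -> List[str]:
    """Hypothetical wrapper: use PyThaiNLP when available, else the fallback."""
    if sent_tokenize is None:
        return whitespace_fallback(text)
    try:
        return list(sent_tokenize(text, engine=engine))
    except ImportError:
        # The chosen engine may need an optional dependency that is missing.
        return whitespace_fallback(text)


if __name__ == "__main__":
    sample = "ฉันไปโรงเรียน เธอไปโรงพยาบาล"
    for sentence in split_sentences(sample):
        print(sentence)
```

The output depends on the engine and model version, so results should be checked against a tagged corpus such as those listed above rather than taken at face value.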