Sentence Segmentation
Corpus
Name | Description | Size | License | Creator | Download |
---|---|---|---|---|---|
Orchid Corpus | Thai part-of-speech (POS) tagged corpus | 5,200 sentences | CC BY-NC-SA 4.0 | NECTEC | Mirror from @wannaphong |
Blackboard Treebank | Blackboard Treebank is a Thai dependency corpus based on the LST20 Annotation Guideline. It features dependency structures, constituency structures, word boundaries, named entities, clause boundaries, and sentence boundaries. | 122,851 clauses (38,558 sentences) | CC BY 3.0 | Prachya Boonkwan, NECTEC | Bitbucket |
Fake review | | | CC BY-SA 4.0 | AI Research Institute of Thailand (AIResearch) | GitHub |
Software
Name | Description | Status | Language | License |
---|---|---|---|---|
PyThaiNLP | Sentence segmentation is included as part of the PyThaiNLP library. | active | Python 3.X | Apache License 2.0 |
TLTK | Thai Language Toolkit | active | Python 3.X | BSD License (BSD-3-Clause) |
BoydCut | Bidirectional LSTM-CNN Model for Thai Sentence Segmenter | active | Python 3.X | MIT License |
ThaiSum | Simple Thai Sentence Segmentor | active | Python 3.X | Apache License 2.0 |
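
As a rough illustration of how one of the tools above might be called, the sketch below tries PyThaiNLP's `sent_tokenize` and falls back to a naive whitespace split when the library is not installed. It assumes the `crfcut` engine is available in the installed PyThaiNLP version; `whitespace_split` and `segment` are hypothetical helper names introduced here, not part of any listed package, and the sample text is made up for the example.

```python
def whitespace_split(text: str) -> list[str]:
    """Naive baseline (hypothetical helper): treat each run of whitespace
    as a sentence boundary. Thai often separates clauses or sentences with
    a space, so this is only a crude approximation."""
    return text.split()


def segment(text: str) -> list[str]:
    """Use PyThaiNLP's sent_tokenize if the package is installed,
    otherwise fall back to the whitespace baseline."""
    try:
        from pythainlp.tokenize import sent_tokenize
    except ImportError:
        return whitespace_split(text)
    # Assumption: the "crfcut" engine is present in the installed version.
    return sent_tokenize(text, engine="crfcut")


if __name__ == "__main__":
    sample = "ฉันไปโรงเรียน เธอไปทำงาน"  # illustrative text only
    for sentence in segment(sample):
        print(sentence)
```

The whitespace baseline over-segments text that contains spaces inside a sentence, which is the gap the trained models listed above aim to close.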