Part-of-speech tagging
Corpus
Name | Description | Size | License | Creator | Download |
---|---|---|---|---|---|
Orchid Corpus | Thai part-of-speech (POS) tagged corpus | 5,200 sentences | CC BY-NC-SA 4.0 | NECTEC | Mirror from @wannaphong |
Blackboard Treebank | Thai dependency corpus based on the LST20 Annotation Guideline. It is annotated with dependency structures, constituency structures, word boundaries, named entities, clause boundaries, and sentence boundaries. | 122,851 clauses (38,558 sentences) | CC BY 3.0 | Prachya Boonkwan, NECTEC | bitbucket |
UD Thai PUD | Thai part of the Parallel Universal Dependencies (PUD) treebanks, created for the CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies. | 1,000 sentences | CC BY-SA 3.0 | Universal Dependencies | GitHub |
thai-political-tweets | A small dataset of Thai political tweets from Twitter, annotated with UD POS tags | 41 tweets, 965 words | The Unlicense | Can Udomcharoenchaikit | GitHub |
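
Universal Dependencies treebanks such as UD Thai PUD are distributed in the CoNLL-U format: one token per line, ten tab-separated columns, with the word form in column 2 and the universal POS tag (UPOS) in column 4. As a minimal sketch of how (word, tag) pairs could be read from such a file using only the standard library, the example below parses a short hand-written sample. The sample sentence and the function name `read_upos` are illustrative assumptions, not taken from any of the corpora listed above.

```python
from typing import Iterator, List, Tuple

# Hand-written sample in CoNLL-U layout, for illustration only
# (NOT copied from UD Thai PUD or any other listed corpus).
SAMPLE = (
    "# text = ฉันรักแมว\n"
    "1\tฉัน\t_\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tรัก\t_\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tแมว\t_\tNOUN\t_\t_\t2\tobj\t_\t_\n"
    "\n"
)


def read_upos(text: str) -> Iterator[List[Tuple[str, str]]]:
    """Yield one list of (form, UPOS) pairs per sentence (hypothetical helper)."""
    sentence: List[Tuple[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Comment / metadata line such as "# text = ...".
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            # Not a well-formed token line; skip it.
            continue
        if "-" in cols[0] or "." in cols[0]:
            # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
            continue
        sentence.append((cols[1], cols[3]))
    if sentence:
        # Flush the last sentence if the text did not end with a blank line.
        yield sentence


if __name__ == "__main__":
    for sent in read_upos(SAMPLE):
        print(sent)
```

To use this on a real treebank, the file contents would be passed in place of `SAMPLE`, for example via `open(path, encoding="utf-8").read()`. The other corpora in the table may use different formats and tag sets, so this sketch applies only to CoNLL-U files.
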
Software
Name | Description | Status | Language | License |
---|---|---|---|---|
PyThaiNLP | Thai NLP library that includes a POS tagger | active | Python 3.X | Apache License 2.0 |
TLTK | Thai Language Toolkit | active | Python 3.X | BSD License (BSD-3-Clause) |