Part-of-speech tagging

Corpus

| Name | Description | Size | License | Creator | Download |
| --- | --- | --- | --- | --- | --- |
| Orchid Corpus | Thai part-of-speech (POS) tagged corpus. | 5,200 sentences | CC BY-NC-SA 4.0 | NECTEC | Mirror from @wannaphong |
| Blackboard Treebank | Thai dependency corpus based on the LST20 Annotation Guideline. It features dependency structures, constituency structures, word boundaries, named entities, clause boundaries, and sentence boundaries. | 122,851 clauses (38,558 sentences) | CC BY 3.0 | Prachya Boonkwan, NECTEC | Bitbucket |
| UD Thai PUD | Part of the Parallel Universal Dependencies (PUD) treebanks created for the CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies. | 1,000 sentences | CC BY-SA 3.0 | Universal Dependencies | GitHub |
| thai-political-tweets | A small Thai political Twitter dataset with UD POS tags. | 41 tweets, 965 words | Unlicense | Can Udomcharoenchaikit | GitHub |
| Thai Universal Dependency Treebank (TUD) | Trees annotated in accordance with the Universal Dependencies (UD) framework. | 3,627 trees | | Chulalongkorn University | GitHub |
| Thai Discourse Treebank | The first and largest Thai corpus annotated with explicit discourse relations in the style of the English Penn Discourse Treebank 3 scheme. The final corpus consists of 10,602 sentences from 384 documents, 180 of which have complete annotation of discourse connectives and their two argument spans. | | | Ponrawee Prasertsom, Apiwat Jaroonpol, Attapol T. Rutherford | GitHub |
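
The UD-tagged corpora above are typically distributed as CoNLL-U files, where each token line has ten tab-separated columns and the universal POS tag sits in the fourth one. The following is a minimal sketch of pulling (word, tag) pairs out of such a file using only the Python standard library. The function name `read_conllu_pos` and the three-word `SAMPLE` sentence are made up for illustration and are not taken from any of the corpora listed here; check each corpus's own documentation for its actual file layout.

```python
from typing import Iterator, List, Tuple

Token = Tuple[str, str]  # (word form, universal POS tag)


def read_conllu_pos(text: str) -> Iterator[List[Token]]:
    """Yield one list of (form, UPOS) pairs per sentence in CoNLL-U text."""
    sentence: List[Token] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Comment lines carry metadata such as sent_id or text.
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            # Not a well-formed token line; skip it in this sketch.
            continue
        token_id, form, upos = cols[0], cols[1], cols[3]
        if "-" in token_id or "." in token_id:
            # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
            continue
        sentence.append((form, upos))
    if sentence:
        # The file may end without a trailing blank line.
        yield sentence


# Hypothetical toy sentence, built column by column to keep the tabs explicit.
SAMPLE = "\n".join(
    [
        "# text = ฉันรักแมว",
        "\t".join(["1", "ฉัน", "_", "PRON", "_", "_", "2", "nsubj", "_", "_"]),
        "\t".join(["2", "รัก", "_", "VERB", "_", "_", "0", "root", "_", "_"]),
        "\t".join(["3", "แมว", "_", "NOUN", "_", "_", "2", "obj", "_", "_"]),
        "",
    ]
)

if __name__ == "__main__":
    for sent in read_conllu_pos(SAMPLE):
        print(sent)
```

To use it on a real treebank, read the downloaded `.conllu` file with `encoding="utf-8"` and pass its contents to the function. Corpora that are not in CoNLL-U format, such as Orchid, need their own reader.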

Software

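| Name | Description | Status | Language | License |
| --- | --- | --- | --- | --- |
| PyThaiNLP | Thai NLP library; POS tagging is one of its components. | active | Python 3.x | Apache License 2.0 |
| TLTK | Thai Language Toolkit | active | Python 3.x | BSD License (BSD-3-Clause) |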