Text Classification

Corpus

| Name | Description | Size | Labels | License | Creator | Download |
| --- | --- | --- | --- | --- | --- | --- |
| prachathai-67k | News article corpus from Prachathai.com | 67,889 articles with 51,797 tags | 12 | CC BY 4.0 | @lukkiddd and @cstorm125 | GitHub |
| wisesight sentiment | Social media messages in Thai with sentiment labels (positive, neutral, negative, question) | 26,737 messages | 4 | CC0-1.0 License | Arthit Suriyawongkul, Ekapol Chuangsuwanich | GitHub |
| wongnai corpus | A collection of Wongnai's datasets, mostly in Thai | 500K words labeled | 5 | LGPL-3.0 License | wongnai | GitHub |
| Toxicity in Thai Tweet Corpus | Toxicity in Thai Tweet Corpus | 3,300 messages | 2 | CC BY-NC 4.0 | Tokyo Metropolitan University Natural Language Processing Group | GitHub |
| Thai-Clickbait | Dataset for Thai clickbait classification | train: 37,376 messages, test: 9,344 messages | 1 | MIT License | @9meo at GitHub | GitHub |
| sentiment_analysis_thai | Thai sentiment analysis from @JagerV3 | | 2 | ? | @JagerV3 at GitHub | GitHub |
| thai-emojification | Emojification of Thai text using deep learning (LSTM) | train: 128 messages, test: 55 messages | 5 (❤️😄😞🍴⚾) | GPL-3.0 License | iApp Technology Co, Ltd | GitHub |
| The 40 Thai Children Stories | Collected from 40 Thai children's stories; the text was manually split into sentences, yielding 1,964 sentences | 1,964 sentences | 3 | ? | Kitsuchart Pasupa, Thititorn Seneewong Na Ayutthaya | GitHub |
| Thai sentiment analysis dataset | Thai sentiment analysis dataset from PyThaiNLP | | 2 | CC BY 3.0 | PyThaiNLP | GitHub |
| LimeSoda: Dataset for Fake News Detection in Healthcare Domain | Thai fake news dataset in the healthcare domain, consisting of 7,191 curated and manually annotated documents | 7,191 annotated documents | 3 (fact, fake, or undefined) | CC-BY-4.0 License | Patomporn Payoungkhamdee, Peerachet Porkaew, Atthasith Sinthunyathum, Phattharaphon Songphum, Witsarut Kawidam, Wichayut Loha-Udom, Prachya Boonkwan, Vipas Sutantayawalee | GitHub |
| krathu-500 | A dataset of posts and comments from Pantip, a popular Thai web board | | 3 (Positive, Negative, and Neutral) | | | GitHub |
| thai_cyberbullying_lgbt | LGBT cyberbullying detection in Thai using Transformer-based algorithms | | | | | GitHub |

Software

| Name | Description | Status | Language | License |
| --- | --- | --- | --- | --- |
| thai_sentiment | A naive sentiment classification function based on NBSVM, trained on wisesight_sentiment | active | Python 3.X | Apache License 2.0 |
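
The thai_sentiment entry above names NBSVM as its underlying technique. As a rough illustration only, the sketch below shows the naive Bayes log-count ratio step that NBSVM-style models commonly use to weight features before a linear classifier is trained on them. It is not the thai_sentiment implementation: the function names (`log_count_ratio`, `nb_score`) and the toy English tokens are hypothetical, the linear SVM stage is omitted, and real Thai text would first need word segmentation (for example with a tokenizer such as the one PyThaiNLP provides).

```python
import math
from collections import Counter


def log_count_ratio(docs, labels, alpha=1.0):
    """Return {token: r}, where r = log((p / |p|) / (q / |q|)).

    p and q are smoothed per-token document counts for the positive (1)
    and negative (0) class. alpha is the smoothing constant.
    """
    pos, neg = Counter(), Counter()
    for tokens, label in zip(docs, labels):
        # Binarized features: each token counts once per document.
        (pos if label == 1 else neg).update(set(tokens))
    vocab = set(pos) | set(neg)
    p_total = sum(pos[t] + alpha for t in vocab)
    q_total = sum(neg[t] + alpha for t in vocab)
    return {
        t: math.log(((pos[t] + alpha) / p_total) / ((neg[t] + alpha) / q_total))
        for t in vocab
    }


def nb_score(tokens, ratios):
    """Sum the ratios of known tokens; > 0 leans positive, < 0 negative."""
    return sum(ratios.get(t, 0.0) for t in set(tokens))


# Hypothetical toy data, already tokenized (not taken from any corpus above).
train_docs = [
    ["good", "food"],
    ["good", "service"],
    ["bad", "food"],
    ["bad", "service"],
]
train_labels = [1, 1, 0, 0]

ratios = log_count_ratio(train_docs, train_labels)
score_pos = nb_score(["good", "food"], ratios)
score_neg = nb_score(["bad"], ratios)
```

In this toy setup, tokens that appear equally often in both classes get a ratio of zero and so contribute nothing to the score. A full NBSVM would multiply each document's feature vector by these ratios and then fit a linear classifier on the result; a multi-class corpus such as wisesight sentiment would need a one-vs-rest variant of this step.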