Text Summarization

Corpus

| Name | Description | Size | License | Creator | Download |
| --- | --- | --- | --- | --- | --- |
| ThaiSum | The largest dataset for Thai text summarization. | 350,000 articles (2.9 GB) | MIT License | Nakhun Chumpolsathien | GitHub |
| TR-TPBS | A dataset for Thai text summarization. | 310K articles | MIT License | Nakhun Chumpolsathien | GitHub |
| XL-Sum | Annotated article-summary pairs from BBC News, covering 45 languages ranging from low- to high-resource. | 8,268 (Thai) | CC BY-NC-SA 4.0 | | GitHub |
| ThaiCrossSum Corpora | Th2En & Th2Zh: large-scale datasets for Thai cross-lingual text summarization. | th-en: 310,926 articles; th-zh: 310,926 articles | | Nakhun Chumpolsathien | GitHub |

Pretrained Models

| Model | Detail | Paper | Download |
| --- | --- | --- | --- |
| mT5: Multilingual T5 | Multilingual T5 (mT5) is a massively multilingual pretrained text-to-text transformer model, trained following a similar recipe as T5. | mT5: A massively multilingual pre-trained text-to-text transformer | GitHub |
| BertSum | Trained Model by Nakhun Chumpolsathien & Tanachat Arayachutinan | Using Knowledge Distillation from Keyword Extraction to Improve the Informativeness of Neural Cross-lingual Summarization | GitHub |
| ARedSum | Trained Model by Nakhun Chumpolsathien & Tanachat Arayachutinan | Using Knowledge Distillation from Keyword Extraction to Improve the Informativeness of Neural Cross-lingual Summarization | GitHub |
| TNCLS | Trained Model from ThaiCrossSum Corpora by Nakhun Chumpolsathien | | GitHub |
| CLS+MS | Trained Model from ThaiCrossSum Corpora by Nakhun Chumpolsathien | | GitHub |
| CLS+MT | Trained Model from ThaiCrossSum Corpora by Nakhun Chumpolsathien | | GitHub |
| XLS – RL-ROUGE | Trained Model from ThaiCrossSum Corpora by Nakhun Chumpolsathien | | GitHub |