Thai Multi-Document Summarization: Unit Segmentation, Unit-Graph Formulation, and Unit Selection
Keywords:
Thai text summarization, multi-document summarization, iterative weighting
Abstract
Summarizing multiple Thai documents poses several challenges because the Thai language lacks explicit word, phrase, and sentence boundaries. This paper defines the Thai Elementary Discourse Unit (TEDU) and then presents our three-stage summarization process. To implement this process, we propose unit segmentation using TEDUs and their derivatives, unit-graph formation using iterative unit weighting and cosine similarity, and unit selection using highest-weight priority, redundancy removal, and post-selection weight recalculation. To examine the performance of the proposed methods, a number of experiments are conducted using fifty sets of Thai news articles with their manually constructed reference summaries. Under three common evaluation measures, ROUGE-1, ROUGE-2, and ROUGE-SU4, the results show that (1) our TEDU-based summarization outperforms paragraph-based summarization, (2) our iterative weighting is superior to traditional TF-IDF, (3) the highest-weight priority without centroid preference and unit redundancy consideration helps improve summary quality, and (4) post-selection weight recalculation tends to raise summarization performance under certain circumstances.
Published
2016-05-31
How to Cite
Ketui, N., & Theeramunkong, T. (2016). Thai Multi-Document Summarization: Unit Segmentation, Unit-Graph Formulation, and Unit Selection. Computing and Informatics, 35(1), 1–29. Retrieved from http://147.213.75.17/ojs/index.php/cai/article/view/2209
Section
Articles