Thread Labeling for News Event

doi:10.1007/s12204-013-1416-z

Abstract: Automatic thread labeling for news events can help people understand the different aspects of a news event. In this paper, we present a method for labeling the threads of a news event. We use the latent Dirichlet allocation (LDA) topic model to extract news threads from a news corpus. Our method first selects a subset of the thread words and then extracts phrases from that subset based on co-occurrence calculation. Each extracted phrase is then used as the label of a news thread. Experimental results show that about 60% of the generated labels reflect meaningful aspects of a news event. These labels can help people quickly grasp the many different aspects of a news event.
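The labeling step summarized in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's word-selection rule or co-occurrence formula, so everything here is an assumption made for illustration: the thread words are taken as given (for example, the top words of one LDA topic), co-occurrence is approximated by counting adjacent pairs of thread words, and the function name `label_thread` and the toy data are hypothetical.

```python
from collections import Counter


def label_thread(thread_words, documents, top_k=1):
    """Rank adjacent pairs of thread words by how often they co-occur.

    thread_words: words assumed to characterize one thread
                  (e.g. top words of an LDA topic; an assumption here).
    documents:    list of tokenized documents (lists of strings).
    Returns the top_k most frequent pairs as candidate label phrases.
    """
    vocab = set(thread_words)
    pair_counts = Counter()
    for tokens in documents:
        # Count each adjacent pair in which both words are thread words.
        for left, right in zip(tokens, tokens[1:]):
            if left in vocab and right in vocab and left != right:
                pair_counts[(left, right)] += 1
    return [" ".join(pair) for pair, _ in pair_counts.most_common(top_k)]


# Hypothetical toy data, not taken from the paper.
thread_words = ["oil", "spill", "gulf", "cleanup"]
documents = [
    ["the", "oil", "spill", "reached", "the", "gulf"],
    ["cleanup", "crews", "tackled", "the", "oil", "spill"],
    ["gulf", "cleanup", "continues"],
]

labels = label_thread(thread_words, documents)
print(labels)  # ['oil spill']
```

Adjacent-pair counting is only one simple way to score co-occurrence; a window-based count or an association measure such as pointwise mutual information could be substituted without changing the overall shape of the sketch.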

Key words: news event, topic labeling, latent Dirichlet allocation (LDA)

CLC number: TP 391

YAN Ze-hua (闫泽华), LI Fang* (李芳). Thread Labeling for News Event [J]. Journal of Shanghai Jiaotong University (Science), 2013, 18(4): 418-424.
