A Novel Word Clustering and Cluster Merging Technique for Named Entity Recognition

Rakesh Patra; Sujan Kumar Saha

doi:10.1515/jisys-2016-0074

Published/Copyright: June 7, 2017

Published in: Journal of Intelligent Systems, Volume 28, Issue 1

Abstract

In this paper, we present a novel word clustering technique to capture contextual similarity among words. Related word clustering techniques in the literature rely on word statistics collected from a small, fixed word window. For example, the Brown clustering algorithm is based on bigram statistics. However, in sequential labeling tasks such as named entity recognition (NER), words in the longer context also carry valuable information. To capture this longer-context information, we propose a new word clustering algorithm that uses parse information of the sentences and a variable-size word window. The proposed algorithm, named variable window clustering, performs better than Brown clustering in our experiments. Additionally, to use two different clustering techniques simultaneously in a classifier, we propose a cluster merging technique that performs an output-level merging of two sets of clusters. To test the effectiveness of the approaches, we use two NER data sets: Hindi and BioCreative II Gene Mention Recognition. A baseline NER system is developed using a conditional random fields classifier, and the clusters from the individual techniques as well as from the merged technique are then incorporated to improve the classifier. Experimental results demonstrate that the cluster merging technique is quite promising.
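As an illustration only, the following sketch shows one way that two word clusterings could be combined at the output level and exposed as token features to a sequence classifier. The abstract does not specify the merging rule, so the pairing scheme used here (joining the two cluster labels of each word) is an assumption and not the method of the paper. The names `merge_clusterings`, `token_features`, `brown_like`, and `window_like`, as well as the toy cluster labels, are hypothetical.

```python
from typing import Dict, List

UNKNOWN = "NA"  # placeholder label for words missing from a clustering


def merge_clusterings(first: Dict[str, str], second: Dict[str, str]) -> Dict[str, str]:
    """Pair the labels of two word->cluster maps (assumed merging rule)."""
    merged = {}
    for word in first.keys() | second.keys():
        merged[word] = f"{first.get(word, UNKNOWN)}|{second.get(word, UNKNOWN)}"
    return merged


def token_features(sentence: List[str], i: int, clusters: Dict[str, str]) -> Dict[str, str]:
    """Build a feature dictionary for token i using cluster labels of a +/-1 window."""
    def cluster_of(token: str) -> str:
        return clusters.get(token.lower(), UNKNOWN)

    feats = {"word": sentence[i].lower(), "cluster": cluster_of(sentence[i])}
    if i > 0:
        feats["prev_cluster"] = cluster_of(sentence[i - 1])
    if i + 1 < len(sentence):
        feats["next_cluster"] = cluster_of(sentence[i + 1])
    return feats


# Toy cluster assignments (invented for illustration, not taken from the paper).
brown_like = {"delhi": "0110", "mumbai": "0110", "visited": "10"}
window_like = {"delhi": "c7", "mumbai": "c7", "agra": "c7", "visited": "c2"}

merged = merge_clusterings(brown_like, window_like)
features = token_features(["She", "visited", "Delhi"], 2, merged)
```

Feature dictionaries of this kind could then be passed to a conditional random fields toolkit. Under this assumed rule, a word covered by only one clustering still receives a merged label, with the missing side marked as `NA`.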

Keywords: Word clustering; Brown clustering; hierarchical clustering; cluster merging; named entity recognition

Classification: 91C20; 68T50; 68T30; 62H30

Received: 2016-06-09

Published Online: 2017-06-07

Published in Print: 2019-01-28
