DISCOver: DIStributional approach based on syntactic dependencies for discovering COnstructions

Maria Antònia Martí; Mariona Taulé; Venelin Kovatchev; Maria Salamó

doi:10.1515/cllt-2018-0028

Published/Copyright: January 4, 2019

From the journal Corpus Linguistics and Linguistic Theory, Volume 17, Issue 2

Abstract

One of the goals of Cognitive Linguistics is the automatic identification and analysis of constructions, since they are fundamental linguistic units for understanding language. This article presents DISCOver, an unsupervised methodology for the automatic discovery of lexico-syntactic patterns that can be considered candidates for constructions. The methodology follows a distributional semantic approach. Specifically, it is based on our proposed pattern-construction hypothesis: those contexts that are relevant to the definition of a cluster of semantically related words tend to be (part of) lexico-syntactic constructions. Our proposal uses Distributional Semantic Models to model the context, taking syntactic dependencies into account. After a clustering process, we linked all clusters with strong relationships and used them as a source of information for deriving lexico-syntactic patterns, obtaining a total of 220,732 candidates from a 100-million-token corpus of Spanish. We evaluated the resulting patterns intrinsically, by applying statistical association measures, and qualitatively, through expert judgment. Our results were superior to the baseline in both quality and quantity in all cases. Although our experiments were carried out on a Spanish corpus, the methodology is language independent and only requires a large corpus annotated with parts of speech and syntactic dependencies.
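The pattern-construction hypothesis can be illustrated with a minimal sketch. This is not the authors' implementation. The toy dependency triples, the PPMI weighting, the cosine similarity, and all function names are assumptions chosen for illustration. The sketch represents each word by the syntactic contexts it occurs in and then lists the contexts shared by a group of similar words, which is the kind of evidence the hypothesis treats as a candidate pattern.

```python
import math
from collections import Counter, defaultdict

# Toy (lemma, dependency relation, governing lemma) triples.
# Illustrative data only; not taken from the article's corpus.
TRIPLES = [
    ("vino", "dobj", "beber"),
    ("vino", "dobj", "servir"),
    ("cerveza", "dobj", "beber"),
    ("cerveza", "dobj", "servir"),
    ("libro", "dobj", "leer"),
    ("libro", "dobj", "escribir"),
    ("carta", "dobj", "leer"),
    ("carta", "dobj", "escribir"),
]


def build_vectors(triples):
    """Count, for each word, the syntactic contexts (relation, governor) it occurs in."""
    vectors = defaultdict(Counter)
    for word, relation, governor in triples:
        vectors[word][(relation, governor)] += 1
    return vectors


def ppmi(vectors):
    """Reweight raw counts with positive pointwise mutual information (an assumed weighting)."""
    total = sum(sum(ctx.values()) for ctx in vectors.values())
    word_totals = {w: sum(ctx.values()) for w, ctx in vectors.items()}
    context_totals = Counter()
    for ctx in vectors.values():
        context_totals.update(ctx)
    weighted = {}
    for w, ctx in vectors.items():
        weighted[w] = {}
        for c, n in ctx.items():
            pmi = math.log2(n * total / (word_totals[w] * context_totals[c]))
            if pmi > 0:  # keep only positively associated contexts
                weighted[w][c] = pmi
    return weighted


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[c] * v[c] for c in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def shared_contexts(weighted, cluster):
    """Contexts that carry positive weight for every word in the cluster."""
    return set.intersection(*(set(weighted[w]) for w in cluster))
```

Calling `shared_contexts(ppmi(build_vectors(TRIPLES)), ["vino", "cerveza"])` returns the two direct-object contexts governed by beber and servir. A real pipeline would replace the toy triples with a parsed corpus and the hand-picked group with the output of a clustering step.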

Keywords: constructions; semantics; distributional semantic models

Funding statement: This work was supported by the Ministerio de Economía y Competitividad (Funder Id: 10.13039/501100003329, Grant Number: TIN2015-71147) and the Generalitat de Catalunya (Funder Id: 10.13039/501100002809, Grant Number: 2017 SGR 341).

About the authors

Maria Antònia Martí

Maria Antònia Martí is a professor of Computational Linguistics at the University of Barcelona. She is currently the Director of the CLiC research group (Center for Language and Computation). Her research is focussed on Corpus Linguistics and Distributional Semantics. She teaches courses in Empirical Linguistics, Corpus Linguistics and Introduction to Linguistics to both undergraduate and postgraduate students.

Mariona Taulé

Mariona Taulé is a professor in the Linguistics Department at the University of Barcelona and a member of the CLiC research group (Center for Language and Computation) and UBICS (Universitat de Barcelona Institute of Complex Systems) at the same university. She is also Secretary of the Sociedad Española de Procesamiento del Lenguaje Natural and edits the journal Procesamiento del Lenguaje Natural. Her research and publications concern computational linguistics and natural language processing, especially lexical semantics, corpus linguistics and the development of linguistic resources for natural language processing, mainly for Spanish, Catalan and English.

Venelin Kovatchev

Venelin Kovatchev is a PhD researcher in the Linguistics Department at the University of Barcelona and a member of the CLiC research group (Center for Language and Computation) and UBICS (Universitat de Barcelona Institute of Complex Systems) at the same university. His research is focused on paraphrasing, textual entailment and semantic similarity.

Maria Salamó

Maria Salamó received both her B.S. in Computer Science (1999) and her Ph.D. (2004) from the Universitat Ramon Llull (Spain). She is an associate professor at the University of Barcelona and a member of the Institute of Complex Systems (UBICS). Her research covers a broad range of topics within AI, including Natural Language Processing, Machine Learning, Recommender Systems, and User Modeling.

Published Online: 2019-01-04

Published in Print: 2021-10-26

https://doi.org/10.1515/cllt-2018-0028
