How vector space models disambiguate adjectives: A perilous but valid enterprise

Mariana Montes; Dirk Geeraerts

doi:10.1515/gcla-2022-0002

Published/Copyright: November 11, 2022

From the journal Yearbook of the German Cognitive Linguistics Association Volume 10 Issue 1

Abstract

The present study is part of a larger research project developing computational tools for large-scale corpus-based semantic analyses. One such tool represents semantic structure with vector space models (VSMs). The paper shows that this tool and the models it builds require a deeper understanding, especially of how their results relate to cognitive theories of meaning. Although token-based VSMs are increasingly used in corpus-based cognitive semantics, we believe it is insufficiently appreciated how alternative parameter settings deal with a range of semantic issues, such as granularity of meaning, prototypicality of the domain of application and interaction with syntactic patterns. In this paper we focus on only one of those issues, viz. the prototypicality of the domain of application, presenting the results of three of our case studies on the Dutch adjectives heilzaam, hoekig, hachelijk and geldig. The models are built from a 520-million-word corpus of contemporary Dutch and Flemish newspapers, varying parameters such as window size, part-of-speech and frequency thresholds in the selection of features. The resulting VSMs are evaluated through visual analytics: although multidimensional, they can be reduced to 2D and represented in scatterplots where more similar tokens appear closer to each other. Color-coding the tokens with manual sense tags makes it possible to compare the groupings provided by human annotators with those of the computational models in a way consistent with the cognitive approach to meaning and categorization.

Keywords: distributional semantics; vector-space models; polysemy; Dutch; adjectives
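
To make the role of a parameter such as window size concrete, the following is a minimal sketch of a token-level model, not the authors' implementation. It assumes raw co-occurrence counts as features and cosine similarity between tokens; the function names, the target word and the English toy sentences are invented for illustration, whereas the study itself works on Dutch newspaper data with richer feature selection.

```python
from collections import Counter
from math import sqrt


def context_vector(tokens, position, window=2):
    """Count the words within `window` positions of the target token."""
    start = max(0, position - window)
    left = tokens[start:position]
    right = tokens[position + 1:position + 1 + window]
    return Counter(left + right)


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Invented toy sentences (an assumption, not corpus data), each with one target token.
sentences = [
    "the ticket is valid for one month".split(),
    "your ticket is valid for two days".split(),
    "that is a valid argument in logic".split(),
]
target = "valid"


def token_vectors(window):
    """One context vector per occurrence of the target, for a given window size."""
    return [context_vector(s, s.index(target), window) for s in sentences]


narrow = token_vectors(window=2)
wide = token_vectors(window=4)

# Tokens used in similar contexts come out as more similar to each other.
same_sense = cosine(narrow[0], narrow[1])
other_sense = cosine(narrow[0], narrow[2])
print(same_sense, other_sense)

# The same pair of tokens receives a different similarity under a wider window.
print(cosine(wide[0], wide[1]))
```

In this toy setting the two "ticket" tokens are closer to each other than either is to the "argument" token, and their similarity changes when the window is widened, which is the kind of parameter sensitivity the abstract refers to. Reducing such token vectors to 2D for a scatterplot would be a separate step that is not shown here.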

Acknowledgements

This work was conducted under grant number 3H150305 of the KU Leuven Research Fund C1. The authors would also like to thank the editors for their thorough editing work.

Published Online: 2022-11-11

Published in Print: 2022-11-25
