Abstract
The present study is part of a larger research project developing computational tools for large-scale corpus-based semantic analyses. One such tool represents semantic structure with vector space models (VSMs). The paper shows that this tool and the models it builds require a deeper understanding, especially with a view to how their results relate to cognitive theories of meaning. Although token-based VSMs are increasingly used in corpus-based cognitive semantics, we believe it is insufficiently appreciated how alternative parameter settings deal with a range of semantic issues, such as granularity of meaning, prototypicality of the domain of application, and interaction with syntactic patterns. In this paper, we focus on only one of those issues, viz. the prototypicality of the domain of application, presenting the results of three of our case studies on the Dutch adjectives heilzaam 'beneficial', hoekig 'angular', hachelijk 'perilous' and geldig 'valid'. The models presented are built from a 520-million-word corpus of contemporary Dutch and Flemish newspapers by varying parameters such as window size, part-of-speech and frequency thresholds in the selection of features. The resulting VSMs are evaluated through visual analytics: although multidimensional, they can be reduced to 2D and represented in scatterplots where more similar tokens appear closer to each other. Color-coding the tokens with manual sense tags makes it possible to compare the groupings provided by human annotators with those of the computational models in a way consistent with the cognitive approach to meaning and categorization.
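To make the notion of a token-based model concrete, the following minimal sketch builds one context vector per occurrence of a target word and compares occurrences by cosine similarity. It is an illustration under strong simplifying assumptions and not the pipeline used in the study: it uses raw first-order co-occurrence counts, applies no part-of-speech or frequency filtering, and does not perform the reduction to 2D. The toy English sentences, the function names `token_vector` and `cosine_similarity`, and the `window` value are all invented for the example.

```python
from collections import Counter
from math import sqrt


def token_vector(tokens, position, window=2):
    """Count the context words within `window` positions of one target token."""
    left = tokens[max(0, position - window):position]
    right = tokens[position + 1:position + 1 + window]
    return Counter(left + right)


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy sentences (hypothetical data), each containing the target word once.
sentences = [
    "the ticket is valid for one day".split(),
    "the ticket is valid for two weeks".split(),
    "she made a valid point yesterday".split(),
]
target = "valid"

# One vector per token (occurrence) of the target; `window` is a model parameter.
vectors = [token_vector(s, s.index(target), window=2) for s in sentences]

# Pairwise similarities between tokens, keyed by sentence indices.
similarities = {
    (i, j): cosine_similarity(vectors[i], vectors[j])
    for i in range(len(vectors))
    for j in range(i + 1, len(vectors))
}

for pair, value in sorted(similarities.items()):
    print(pair, round(value, 2))
```

In this toy setting the two "ticket" occurrences share most of their context words and therefore end up more similar to each other than to the "valid point" occurrence. Changing `window` changes which context words are counted, which is one simple way to see how parameter settings can alter the groupings a model produces.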
Acknowledgements
This work was conducted under grant number 3H150305 of the KU Leuven Research Fund C1. The authors would also like to thank the editors for their thorough editing work.
©2022 Walter de Gruyter GmbH, Berlin/Boston