
Investigating lexical-semantic effects on morphosyntactic variation using elastic net regression

Anthe Sevenants, Freek Van de Velde and Dirk Speelman
Published/Copyright: December 23, 2024

Abstract

This article showcases elastic net regression as a means to build fairer models of morphosyntactic variation. Elastic net allows lexical items to appear on the same level as traditional, high-level predictors, enabling fuller models of variation. We apply elastic net regression to 1,296,574 Dutch verbal cluster tokens from the SoNaR corpus, analysing a morphosyntactic alternance in Dutch subordinate clauses. Our results show morphosyntactic preferences among verbs, indicating that semantic effects are indeed at play. Further analysis shows that semantic patterns for either word order exist, though it remains difficult to glean any semantic generalisations. Still, the elastic net technique shows that the inclusion of lexical items as full predictors in a model is useful, as much of the variation left unexplained by high-level predictors can be explained in lexical terms.


Corresponding author: Anthe Sevenants, Department of Linguistics, Quantitative Lexicology and Variational Linguistics, KU Leuven, Blijde-Inkomststraat 21 – bus 3301, Leuven, 3000, Belgium, E-mail:

Award Identifier / Grant number: G059922N

Acknowledgements

We would like to thank Jelke Bloem for providing the odds ratio data from his study and Jeroen van Craenenbroeck for his advice on formal syntax. We truly appreciate their help.

  1. Research funding: This work was supported by Fonds Wetenschappelijk Onderzoek (https://doi.org/10.13039/501100003130, grant no. G059922N).

Appendix

A.1 Try it yourself

To aid the analysis of the elastic net coefficients, a JavaScript-based interactive analysis tool, Rekker, was developed (Sevenants 2023e). You can inspect the dataset and the elastic net results yourself by visiting anthesevenants.github.io/Rekker/.

A.2 Corpus and querying

To compute the semantic pull of the different verbs in the red and green word order, we collected all red and green verb clusters in subordinate clauses in the SoNaR corpus (Oostdijk et al. 2008) and SoNaR New Media corpus (Oostdijk et al. 2014). Since we are interested in syntactic alternances, we needed a syntactically informed corpus format (“treebank”) in order to reliably find the attestations we need. While the SoNaR corpus does not ship with syntactic information, much of the corpus material from SoNaR is also included in the Lassy corpus (van Noord et al. 2013), which is syntactically annotated using Alpino (van Noord 2006). We retrieved the syntactic information from SoNaR available in Lassy, and parsed the remaining sentences using Alpino ourselves. This left us with a fully syntactically informed SoNaR corpus, ready to be queried for red and green word orders.

In order to query the syntactic information of the entire SoNaR corpus, we used mattenklopper (Sevenants 2023c), a treebank search engine tailor-made for this study. While several Alpino search engines are available, many of which are more user-friendly and faster than the custom search engine used here (e.g. GrETEL by Augustinus et al. 2012, PaQu by Kleiweg 2023), each has specific limitations that prevented its use in this study. In both GrETEL and PaQu, it is only possible to retrieve entire sentences; the participles or auxiliaries in a verb cluster cannot be retrieved directly and must be extracted manually. GrETEL only supports searching through subsections of SoNaR[9] and PaQu does not offer the SoNaR corpus for querying at all. Finally, GrETEL results are limited to 500 sentences due to copyright concerns, which is not enough for a sophisticated analysis. The custom mattenklopper engine was developed as a solution to all these problems. It is available online and can be used for future alternance studies of Dutch using Alpino-based corpora. The XPath queries used to search the corpus are included in Subsection A.7. The mattenklopper search engine returned 1,604,412 attestations of either the red or green word order.
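For illustration, the red-order query from Subsection A.7.1 can also be run over a single Alpino XML file with the R package xml2, which supports XPath 1.0. The sketch below illustrates the querying step only; it is not the mattenklopper implementation, and the file path is hypothetical.

library(xml2)

# Red-order query from Subsection A.7.1, assembled as one string
red_query <- paste0(
  '//node[(@cat="cp" or @cat="rel" or @cat="inf") and ',
  '//node[(@wvorm="pv" or @wvorm="inf") and ',
  '@begin < ./preceding-sibling::node/node[@wvorm="vd"]/@begin | ',
  './following-sibling::node/node[@wvorm="vd"]/@begin]]'
)

# Parse one Alpino sentence file (hypothetical path) and run the query;
# Alpino XML declares no default namespace, so //node matches directly
tree <- read_xml("sonar-alpino/WR-P-P-B-0000000103.p.37.s.4.xml")
hits <- xml_find_all(tree, red_query)
length(hits)  # number of clause nodes matching the red order in this sentence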

A.3 Filtering and enriching

The mattenklopper results were further filtered in order to guarantee the quality of the attestations. In short, duplicates were removed, tokenisation errors were fixed (i.e. superfluous punctuation was removed from participles) and obvious tagging mistakes were removed (e.g. words such as zgn ‘so-called’ and gemiddeld ‘average’ were removed). In addition, wrong participle endings (e.g. gebeurt instead of gebeurd ‘happened’) were corrected using naive-dt-fix (Sevenants 2023d), an R library designed for this study. This library automatically corrects wrong participle endings by relying on the relative frequencies of all possible spellings: the most frequent spelling is taken to be the correct one and is used as the correction.[10] Declined forms were also removed (e.g. geplaatste). Past participles cannot be declined in Dutch, so all declined forms in the corpus tagged as participles are, in fact, mis-tagged adjectives. We also removed all verb clusters with an auxiliary other than hebben, zijn or worden, and removed all attestations without a sentence ID (which we need to compute priming). In addition, all types occurring fewer than 10 times were removed in order to guarantee a stable estimate of the semantic pull of each type. As a result of these operations, 177,014 attestations were removed.
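To make the correction step concrete, the sketch below illustrates the underlying idea of frequency-based d/t correction (it is an illustration of the principle, not the naive-dt-fix API): the -d/-t spelling variants of a participle are pooled and the most frequent spelling is adopted for all of them.

# Collapse -d/-t spelling variants and keep the most frequent spelling
fix_dt <- function(forms) {
  keys <- sub("[dt]$", "", forms)        # gebeurd/gebeurt share the key "gebeur"
  counts <- table(forms)
  vapply(keys, function(k) {
    variants <- counts[sub("[dt]$", "", names(counts)) == k]
    names(which.max(variants))           # most frequent spelling wins
  }, character(1), USE.NAMES = FALSE)
}

fix_dt(c("gebeurd", "gebeurt", "gebeurd"))  # -> "gebeurd" "gebeurd" "gebeurd"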

Furthermore, the attestations were enriched with additional information to be used in the multifactorial elastic net regression. First, regional information was added for each attestation: SoNaR comes with contextual information about its documents, such as the country of origin. Since region is an important factor in red-green word order variation, this variable is vital for multifactorial control.

We also used the subcorpus division in SoNaR (e.g. WR-P-E-A_discussion_lists, WR-P-E-F_press_releases) to distinguish between edited and unedited genres. We decided to focus on an edited-unedited dichotomy, because it is difficult to assess the formality of certain genres in the corpus (e.g. websites and blogs). By focussing on whether a genre is typically edited or not, we sidestep these issues, but we are still able to include some form of formality distinction. Refer to Table 8 for an overview of our judgements.

Table 8:

An overview of the SoNaR subcorpora and our edited-unedited judgement.

Subcorpus Contents Degree of editing
WR-P-E-A Discussion lists Unedited
WR-P-E-C e-magazines Edited
WR-P-E-E Newsletters No attestations*
WR-P-E-F Press releases Edited
WR-P-E-G Subtitles Edited
WR-P-E-H Teletext pages Edited
WR-P-E-I Websites Edited
WR-P-E-J Wikipedia Edited
WR-P-E-K Blogs Edited
WR-P-P-B Books Edited
WR-P-P-C Brochures Edited
WR-P-P-D Newsletters Edited
WR-P-P-E Guides, manuals Edited
WR-P-P-F Legal texts Edited
WR-P-P-G Newspapers Edited
WR-P-P-H Periodicals, magazines Edited
WR-P-P-I Policy documents Edited
WR-P-P-J Proceedings Edited
WR-P-P-K Reports Edited
WR-U-E-E Written assignments Edited
WS-U-E-A Auto cues Edited
WS-U-T-B Texts for the visually impaired Edited
WR-P-E-L Tweets Unedited
WR-U-E-A Chats Unedited
WR-U-E-D Sms Unedited
  1. *The newsletters subcorpus is very small, which is why it yields no attestations.
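In practice, the judgements in Table 8 amount to a simple lookup from subcorpus code to editing status. A minimal sketch (only a subset of the codes is shown; this is illustrative, not the actual pipeline code):

editing_status <- c(
  "WR-P-E-A" = "unedited",  # discussion lists
  "WR-P-E-L" = "unedited",  # tweets
  "WR-U-E-A" = "unedited",  # chats
  "WR-U-E-D" = "unedited",  # sms
  "WR-P-P-B" = "edited",    # books
  "WR-P-P-G" = "edited"     # newspapers
)

# The subcorpus code is the first eight characters of a SoNaR sentence ID
subcorpus <- substr("WR-P-P-B-0000000103.p.37.s.4", 1, 8)
unname(editing_status[subcorpus])  # -> "edited"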

Adjectiveness information was added for all participles. Adjectiveness is expressed as a ratio denoting how often a participle functions as an adjective in language use:

(4) adjectiveness = #uses as an adjective / (#uses as an adjective + #uses as a participle)

A value of 0 denotes no adjectival use; a value of 1 denotes exclusively adjectival use. We computed adjectiveness on the entire Lassy corpus (van Noord et al. 2013).[11]
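As a worked illustration of equation (4) (with made-up counts), adjectiveness is simply the proportion of adjectival uses among all adjectival and participial uses of a form:

adjectiveness <- function(adjective_uses, participle_uses) {
  adjective_uses / (adjective_uses + participle_uses)
}

adjectiveness(adjective_uses = 30, participle_uses = 70)  # -> 0.3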

Because the Alpino syntactic parser marks separable verbs by infixing an underscore (_) between the preposition and verb root, we can exploit this behaviour to automatically infer whether a verb cluster contains a separable verb.
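A minimal sketch of this heuristic (the lemma shown is illustrative):

# Alpino lemmas of separable verbs contain an underscore between the preposition and the verb root
is_separable <- function(lemma) grepl("_", lemma, fixed = TRUE)

is_separable(c("op_heffen", "breken"))  # -> TRUE FALSE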

To compute the length of the middle field, we calculated the number of words between the start of the clause and the verbal cluster itself. This information is based on the tokenisation of the SoNaR corpus.

We included frequency information from the SUBTLEX dataset (Keuleers et al. 2010) in order to be able to assess the effect of frequency. Because frequency is typically Zipfian (Zipf 1965), we transformed the frequency information using the natural logarithm, for several reasons: (1) to compress the frequency variation among the types in our dataset; (2) to make the frequency distribution more normal; and (3) because the resulting distribution is more psychologically realistic.

Priming information is also important to include. To obtain priming information, we relied on the sentence IDs included in the SoNaR corpus. Consider the following example:

WR-P-P-B-0000000103.p.37.s.4

The ID refers to document 103 of the WR-P-P-B component of the SoNaR corpus (“books”). Within that document, it refers to the 4th sentence of the 37th paragraph. The window we chose for priming is one paragraph: this means that in our example, we would consider all attestations from paragraph 36 and all sentences leading up to sentence 4 of paragraph 37 to be possible prime sources. It was not possible to work on the sentence level, since paragraphs can have a variable number of sentences and not all sentences have red-green attestations in the dataset.
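A sketch of how such an ID can be decomposed into its components (the function name is ours and merely illustrative):

parse_sonar_id <- function(id) {
  parts <- regmatches(id, regexec("^(.+)\\.p\\.(\\d+)\\.s\\.(\\d+)$", id))[[1]]
  list(document  = parts[2],
       paragraph = as.integer(parts[3]),
       sentence  = as.integer(parts[4]))
}

parse_sonar_id("WR-P-P-B-0000000103.p.37.s.4")
# -> document "WR-P-P-B-0000000103", paragraph 37, sentence 4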

We included priming in our model by using a corrected log-odds measure, which we will call the “priming ratio”. For every attestation, we computed the following equation:

priming ratio = ln((#red primes + 0.001) / (#green primes + 0.001))

We computed the ratio between the number of red and green primes and used Laplace smoothing (Brysbaert and Diependaele 2013) to prevent division by zero. The natural logarithm attenuates large disparities between red and green and turns our priming ratio into a continuous variable ranging from −∞ to +∞.
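The priming ratio can be computed directly from the prime counts; a minimal sketch:

priming_ratio <- function(n_red_primes, n_green_primes, smoothing = 0.001) {
  log((n_red_primes + smoothing) / (n_green_primes + smoothing))
}

priming_ratio(3, 1)  # red-leaning context: positive value
priming_ratio(0, 2)  # green-leaning context: negative value
priming_ratio(0, 0)  # no primes: 0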

As a final step, we removed all participles for which no adjectiveness value was defined, as these were found not to be participles but mis-taggings. For the same reason, participles with an adjectiveness value of over 0.9 were removed. In addition, all attestations for which no region information was defined were also removed, because they lack the information required for the multifactorial analysis. As a result of these two steps, another 256,112 items were removed.

A.4 Converting the dataset

To compute the semantic preference of the participles found in our attestations, we used elastic net regression, the technique detailed in Section 2. Unlike with regular regression techniques, a tabular dataset cannot be used “as-is” for an elastic net analysis. Instead, the dataset has to be supplied in matrix form. Consider the toy example in Table 9.

Table 9:

Toy example dataset to illustrate the workings of elastic net regression.

Word order Participle Country Adjectiveness
Green gebroken Belgium 0.5
Red mislukt The Netherlands 0.4
Green gebeurd Belgium 0.1

In the matrix form, each categorical column is converted so that each unique value of that column becomes its own predictor. In our case, every unique value of the column participle becomes a binary predictor indicating whether that participle occurs in the verb cluster or not. This means our matrix will be inherently sparse, since each verbal cluster can only feature one participle. Binary predictors such as country are also converted to a binary column in the matrix, and simply indicate a deviation from the reference level. For example, if is_be is a binary column, a Belgian attestation will be encoded as 1, and a Netherlandic attestation as 0. The adjectiveness column is numeric and can be adopted as-is. The response variable word order is also encoded as a binary variable, as is typical in logistic regression, but it is not part of the input matrix. The input matrix for our toy example is given in Table 10. The response variable would be encoded as [0, 1, 0], with the red order coded as 1.

Table 10:

Example input matrix to illustrate the workings of elastic net regression.

is_gebroken is_mislukt is_gebeurd is_BE Adjectiveness
1 0 0 1 0.5
0 1 0 0 0.4
0 0 1 1 0.1

To facilitate the conversion process, we used ElasticToolsR (Sevenants 2023b), an R library written for this study. It can automatically convert “traditional” datasets to the matrix format detailed above in seconds.
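For readers who want to see the conversion spelled out, the sketch below builds the sparse matrix of Table 10 with the Matrix package and indicates how an elastic net fit would then proceed with glmnet; it illustrates the same steps that ElasticToolsR automates, but it is not the ElasticToolsR API.

library(Matrix)
library(glmnet)

toy <- data.frame(
  word_order    = factor(c("green", "red", "green")),
  participle    = factor(c("gebroken", "mislukt", "gebeurd")),
  country       = factor(c("BE", "NL", "BE")),
  adjectiveness = c(0.5, 0.4, 0.1)
)

# One indicator column per participle, one column for country (deviation from
# the reference level) and adjectiveness taken over as-is (cf. Table 10)
X <- sparse.model.matrix(~ participle + country + adjectiveness - 1, data = toy)
y <- as.integer(toy$word_order == "red")   # red = 1, green = 0

# On the full dataset, the model would then be fit along these lines, with
# alpha mixing the ridge (0) and lasso (1) penalties:
# fit <- cv.glmnet(X, y, family = "binomial", alpha = 0.5)
# coef(fit, s = "lambda.min")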

A.5 Bayesian correlations

To provide more robust evidence for the correlation between our elastic net coefficients and the results of previous studies, we have also computed the Bayesian correlations between the two, complete with their credible intervals (CI), according to Van Doorn et al. (2018). The results are given in Tables 11 and 12.
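As an indication of how such estimates can be obtained, the sketch below computes a Bayesian Pearson correlation with a 95% credible interval using the BayesFactor package and made-up vectors; this is one possible route to comparable numbers, not the exact procedure of Van Doorn et al. (2018), which also covers Kendall's tau.

library(BayesFactor)

coefs <- c(0.8, -0.3, 0.0, 1.2, -0.7)  # hypothetical elastic net coefficients
llr   <- c(2.1, -1.0, 0.3, 3.5, -2.2)  # hypothetical LLR values

bf      <- correlationBF(y = coefs, x = llr)
samples <- posterior(bf, iterations = 10000)
quantile(samples[, "rho"], probs = c(0.025, 0.5, 0.975))  # median and 95% CI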

Table 11:

The Bayesian correlation results for the comparison between De Sutter’s LLR values and our elastic net coefficients.

Measure Estimate CI
Pearson 0.386 0.172–0.561
Kendall 0.457 0.289–0.587
Table 12:

The Bayesian correlation results for the comparison between Bloem’s OR values and our elastic net coefficients.

Measure Estimate CI
Pearson 0.494 0.449–0.537
Kendall 0.315 0.273–0.353

A.6 Comparison tables

Tables 13–16 show the inconsistencies between our results and those of De Sutter et al. (2005) and Bloem (2021).

Table 13:

Overview of all participles which have significant LLR values, but were eliminated in our elastic net regression model.

Participle LLR Elastic net coefficient
doorzocht −6.66 0
geïnspireerd −6.66 0
gerept −4.67 0
vergemakkelijkt −4.44 0
opgeheven 4.01 0
teruggevonden 4.01 0
Table 14:

Overview of all participles which have significant LLR values, but do not appear in our dataset and therefore do not have coefficients.

Participle LLR Elastic net coefficient
bereid −15.56 None
bevoegd −8.88 None
bewust −6.66 None
gekant −6.49 None
geneigd −5.93 None
geoorloofd −4.67 None
geschikt −4.44 None
gezond −4.44 None
verkeerd −4.44 None
Table 15:

Sample of all participles which have OR values, but were eliminated in our elastic net regression model. ORs converted to logits.

Participle Logit Elastic net coefficient
aanbeden −0.2884802 0
aangehaald 0.3161285 0
aangeduid 0.7527511 0
aangeklaagd 0.7649804 0
aangemeld 1.1174773 0
aangekocht 1.1233365 0
aangebroken 1.2100116 0
aangeleverd 1.3297969 0
Table 16:

Sample of all participles which have logit values, but do not appear in our dataset and therefore do not have coefficients. ORs converted to logits.

Participle Logit Elastic net coefficient
baseren −0.8770771 None
afkorten −0.2571632 None
afstemmen 0.3392527 None
aanwennen 0.4346616 None
aanbouwen 0.4984842 None
aflopen 0.7253951 None
aanplanten 1.2493800 None
aanhangen 1.6617019 None

A.7 XPath queries

A.7.1 XPath queries for identifying eligible clauses

Red order

//node[(@cat="cp" or @cat="rel" or @cat="inf") and //node[(@wvorm="pv" or @wvorm="inf") and @begin < ./preceding-sibling::node/node[@wvorm="vd"]/@begin | ./following-sibling::node/node[@wvorm="vd"]/@begin]]

Green order

//node[(@cat="cp" or @cat="rel" or @cat="inf") and //node[(@wvorm="pv" or @wvorm="inf") and @begin > ./preceding-sibling::node/node[@wvorm="vd"]/@begin | ./following-sibling::node/node[@wvorm="vd"]/@begin]]

A.7.2 XPath queries for retrieving the verb cluster participle

.//node[@rel="hd" and @wvorm="vd" and @begin $SIGN$ ../../node[@rel="hd" and @pt="ww"]/@begin and not(../../@cat="smain") and ../../../../node[@id="$ID$"]]

with $SIGN$ = > for the red order, < for the green order, and $ID$ = the ID of the parent sentence

A.7.3 XPath queries for retrieving the verb cluster auxiliary

.//node[@rel="hd" and @pt="ww" and @begin $SIGN$ ../node/node[@rel="hd" and @wvorm="vd"]/@begin and not(../@cat="smain") and ../../../node[@id="$ID$"]]

with $SIGN$ = < for the red order, > for the green order, and $ID$ = the ID of the parent sentence

References

Adger, David & Graeme Trousdale. 2007. Variation in English syntax: Theoretical implications. English Language and Linguistics 11(2). 261–278. https://doi.org/10.1017/S1360674307002250.

Augustinus, Liesbeth. 2015. Complement raising and cluster formation in Dutch. PhD thesis. https://www.lotpublications.nl/complement-raising-and-cluster-formation-in-dutch (Accessed 18 June 2024).

Augustinus, Liesbeth, Vincent Vandeghinste & Frank Van Eynde. 2012. Example-based treebank querying. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk & Stelios Piperidis (eds.), Proceedings of the 8th international conference on language resources and evaluation (LREC 2012), 3161–3167. Paris: ELRA. https://aclanthology.org/L12-1442/ (Accessed 17 May 2023).

Barbiers, Sjef, Hans Bennis & Lotte Dros-Hendriks. 2018. Merging verb cluster variation. Linguistic Variation 18(1). 144–196. https://doi.org/10.1075/lv.00008.bar.

Bloem, Jelke. 2021. Processing verb clusters. LOT international series, vol. 586. Amsterdam: LOT. https://doi.org/10.48273/LOT0586.

Bossuyt, Tom. 2019. Oppassen geblazen*: Over vormelijke, semantische en historische aspecten van de Nederlandse geblazen-constructie [Oppassen geblazen*: About formal, semantic and historical aspects of the Dutch geblazen-construction]. Nederlandse Taalkunde 24(3). 259–290. https://doi.org/10.5117/NEDTAA2019.3.001.BOSS.

Brysbaert, Marc & Kevin Diependaele. 2013. Dealing with zero word frequencies: A review of the existing rules of thumb and a suggestion for an evidence-based choice. Behavior Research Methods 45(2). 422–430. https://doi.org/10.3758/s13428-012-0270-5.

Colleman, Timothy. 2009. Verb disposition in argument structure alternations: A corpus study of the dative alternation in Dutch. Language Sciences 31(5). 593–611. https://doi.org/10.1016/j.langsci.2008.01.001.

Croft, William. 2010. Construction grammar. In The Oxford handbook of cognitive linguistics, 463–508. Oxford: Oxford University Press. https://doi.org/10.1093/oxfordhb/9780199738632.013.0018.

De Sutter, Gert, Dirk Geeraerts & Dirk Speelman. 2005. Rood, groen, corpus! Een taalgebruiksgebaseerde analyse van woordvolgordevariatie in tweeledige werkwoordelijke eindgroepen [Red, green, corpus! A usage-based analysis of word order variation in two-part verbal clusters]. Leuven: KU Leuven PhD thesis.

Evers, Arnold. 1975. The transformational cycle in Dutch and German. Amsterdam: Utrecht University PhD thesis.

Friedman, Jerome, Robert Tibshirani & Trevor Hastie. 2010. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33(1). 1–22. https://doi.org/10.18637/jss.v033.i01.

Geeraerts, Dirk. 2005. Lectal variation and empirical data in cognitive linguistics. In Cognitive linguistics: Internal dynamics and interdisciplinary interaction, vol. 32 (Cognitive linguistics research), 163–189. Berlin: Mouton de Gruyter. https://doi.org/10.1515/9783110197716.2.163.

Grafmiller, Jason, Benedikt Szmrecsanyi, Melanie Röthlisberger & Benedikt Heller. 2018. General introduction: A comparative perspective on probabilistic variation in grammar. Glossa: A Journal of General Linguistics 3(1). https://doi.org/10.5334/gjgl.690.

Gries, Stefan Thomas. 2015. The most under-used statistical method in corpus linguistics: Multi-level (and mixed-effects) models. Corpora 10(1). 95–125. https://doi.org/10.3366/cor.2015.0068.

Haeseryn, Walter, Kirsten Romijn, Guido Geerts, Jaap de Rooij & Maarten van den Toorn. 1997. 30.3.2.1 Het werkwoord [30.3.2.1 The verb]. https://e-ans.ivdnt.org/topics/pid/ans30030201lingtopic (Accessed 28 March 2024).

Haiman, John. 1980. The iconicity of grammar: Isomorphism and motivation. Language 56(3). 515–540. https://doi.org/10.2307/414448.

Hartigan, John A. 1975. Clustering algorithms. Michigan: John Wiley & Sons, Inc.

Hoffmann, Thomas & Graeme Trousdale. 2013. Construction grammar: Introduction. In The Oxford handbook of construction grammar, 1–9. Oxford: Oxford University Press. https://doi.org/10.1093/oxfordhb/9780195396683.001.0001.

Hurford, James R. 2014. The origins of language: A slim guide (Oxford linguistics), 173. Oxford: Oxford University Press.

Israel, Michael. 1996. The way constructions grow. In Adele Goldberg (ed.), Conceptual structure, discourse and language, 217–230. Stanford: Stanford University Press.

Kaufman, Leonard & Peter J. Rousseeuw. 1990. Partitioning around medoids (Program PAM). In Finding groups in data, 68–125. New York: John Wiley & Sons. https://doi.org/10.1002/9780470316801.ch2.

Keuleers, Emmanuel, Marc Brysbaert & Boris New. 2010. SUBTLEX-NL: A new measure for Dutch word frequency based on film subtitles. Behavior Research Methods 42(3). 643–650. https://doi.org/10.3758/BRM.42.3.643.

Kleiweg, Peter. 2023. PaQu. https://github.com/rug-compling/paqu (Accessed 17 May 2023).

Labov, William. 1972. Sociolinguistic patterns (Conduct and communication). Philadelphia: University of Pennsylvania Press.

Lander, Jared P., Nicholas Galasinao, Joshua Kraut & Daniel Chen. 2023. Useful: A collection of handy, useful functions. https://cran.r-project.org/web/packages/useful/index.html (Accessed 18 April 2024).

Lenth, Russell V. 2024. Emmeans: Estimated marginal means, aka least-squares means. R package version 1.10.1. Available at: https://github.com/rvlenth/emmeans.

Levshina, Natalia & Kris Heylen. 2014. A radically data-driven construction grammar: Experiments with Dutch causative constructions. In Extending the scope of construction grammar, vol. 54, 17. https://doi.org/10.1515/9783110366273.17.

Maechler, Martin, Peter Rousseeuw, Anja Struyf, Mia Hubert & Kurt Hornik. 2022. Cluster: Cluster analysis basics and extensions. Available at: https://CRAN.R-project.org/package=cluster.

Mikolov, Tomas, Kai Chen, Greg Corrado & Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv: 1301.3781 [cs.CL].

Montes, Mariana. 2021. Cloudspotting: Visual analytics for distributional semantics. PhD dissertation. https://lirias.kuleuven.be/retrieve/630179 (Accessed 30 November 2021).

Nettle, Daniel & Robin Dunbar. 1997. Social markers and the evolution of reciprocal exchange. Current Anthropology 38(1). 93–99. https://doi.org/10.1086/204588.

Oostdijk, Nelleke, Martin Reynaert, Paola Monachesi, Gertjan van Noord, Roeland Ordelman, Ineke Schuurman & Vincent Vandeghinste. 2008. From D-Coi to SoNaR: A reference corpus for Dutch. Available at: https://aclanthology.org/L08-1226/.

Oostdijk, Nelleke, Martin Reynaert, Veronique Hoste, Henk van den Heuvel, Orphee de Clercq, Ewoud Sanders & Creative Computing. 2014. SoNaR nieuw media corpus. https://research.tilburguniversity.edu/en/publications/ac128452-d97c-4290-8e65-12a1462ba47d (Accessed 17 May 2023).

Pardoen, Justine. 1991. De interpretatie van zinnen met de rode en de groene volgorde [The interpretation of sentences in the red and green order]. In Forum der letteren, vol. 32, 22.

Pijpops, Dirk, Isabeau De Smet & Freek Van de Velde. 2018. Constructional contamination in morphology and syntax: Four case studies. Constructions and Frames 10(2). 269–305. https://doi.org/10.1075/cf.00021.pij.

Schäfer, Roland & Felix Bildhauer. 2012. Building large corpora from the web using a new efficient tool chain. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk & Stelios Piperidis (eds.), Proceedings of the eighth international conference on language resources and evaluation (LREC’12), 486–493. Istanbul, Turkey: European Language Resources Association (ELRA). Available at: https://www.lrec-conf.org/proceedings/lrec2012/pdf/834_Paper.pdf.

Sevenants, Anthe. 2023a. Adjectiveness dataset for past participles in Dutch. Leuven. https://doi.org/10.5281/zenodo.7753211.

Sevenants, Anthe. 2023b. ElasticToolsR. Version 1.3. Leuven. Available at: https://github.com/AntheSevenants/ElasticToolsR/tree/v1.3.

Sevenants, Anthe. 2023c. Mattenklopper. Version 1.0. Leuven. Available at: https://github.com/AntheSevenants/mattenklopper/releases/tag/v1.0.

Sevenants, Anthe. 2023d. naive-dt-fix. Version 1.2. Leuven. Available at: https://github.com/AntheSevenants/naive-dt-fix/tree/v1.2.

Sevenants, Anthe. 2023e. Rekker. Version 1.0. Leuven. Available at: https://github.com/AntheSevenants/Rekker.

Speed, Laura J. & Marc Brysbaert. 2023. Ratings of valence, arousal, happiness, anger, fear, sadness, disgust, and surprise for 24,000 Dutch words. Behavior Research Methods 56. 5023–5039. https://doi.org/10.3758/s13428-023-02239-6.

Stefanowitsch, Anatol. 2013. Collostructional analysis. In Thomas Hoffmann & Graeme Trousdale (eds.), The Oxford handbook of construction grammar. Oxford: Oxford University Press. https://doi.org/10.1093/oxfordhb/9780195396683.013.0016.

Tulkens, Stephan, Chris Emmery & Walter Daelemans. 2016. Evaluating unsupervised Dutch word embeddings as a linguistic resource. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk & Stelios Piperidis (eds.), Proceedings of the tenth international conference on language resources and evaluation (LREC 2016). Portorož, Slovenia: European Language Resources Association (ELRA).

van Craenenbroeck, Jeroen, Marjo van Koppen & Antal van den Bosch. 2019. A quantitative-theoretical analysis of syntactic microvariation: Word order in Dutch verb clusters. Language 95(2). 333–370. https://doi.org/10.1353/lan.2019.0033.

Van de Velde, Freek & Dirk Pijpops. 2019. Investigating lexical effects in syntax with regularized regression (Lasso). Journal of Research Design and Statistics in Linguistics and Communication Science 6(2). 166–199. https://doi.org/10.1558/jrds.18964.

Van Doorn, Johnny, Alexander Ly, Maarten Marsman & Eric-Jan Wagenmakers. 2018. Bayesian inference for Kendall’s rank correlation coefficient. The American Statistician 72(4). 303–308. https://doi.org/10.1080/00031305.2016.1264998.

van Noord, Gertjan. 2006. At last parsing is now operational. In Actes de la 13ème conférence sur le Traitement Automatique des Langues Naturelles: Conférences invitées, 20–42. Leuven: ATALA. https://aclanthology.org/2006.jeptalnrecital-invite.2 (Accessed 15 April 2023).

van Noord, Gertjan, Gosse Bouma, Frank Van Eynde, Daniel De Kok, Jelmer Van der Linde, Ineke Schuurman, Erik Tjong Kim Sang & Vincent Vandeghinste. 2013. Large scale syntactic annotation of written Dutch: Lassy. In Essential speech and language technology for Dutch: Resources, tools and applications, 147–164. Berlin & Heidelberg: Springer. https://doi.org/10.1007/978-3-642-30910-6_9.

Vossen, Piek, Attila Görög, Rubén Izquierdo & Antal van den Bosch. 2012. DutchSemCor: Targeting the ideal sense-tagged corpus. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk & Stelios Piperidis (eds.), Proceedings of the eighth international conference on language resources and evaluation (LREC’12), 584–589. Istanbul, Turkey: European Language Resources Association (ELRA). Available at: http://www.lrec-conf.org/proceedings/lrec2012/pdf/187_Paper.pdf.

Vossen, Piek, Isa Maks, Roxane Segers, Hennie van der Vliet, Marie-Francine Moens, Katja Hofmann, Erik Tjong Kim Sang & Maarten de Rijke. 2013. Cornetto: A combinatorial lexical semantic database for Dutch. In Essential speech and language technology for Dutch: Resources, tools and applications, 165–184. Berlin & Heidelberg: Springer. https://doi.org/10.1007/978-3-642-30910-6_10.

Wurmbrand, Susi. 2004. Syntactic vs. post-syntactic movement. In Proceedings of the 2003 annual meeting of the Canadian Linguistic Association (CLA), 284–295.

Wurmbrand, Susi. 2017. Verb clusters, verb raising, and restructuring. In The Wiley Blackwell companion to syntax, vol. 109. Wiley Online Library. https://doi.org/10.1002/9780470996591.ch75.

Zipf, George Kingsley. 1965. The psycho-biology of language. Cambridge, MA: MIT Press.

Zwart, Cornelius. 1993. Dutch syntax: A minimalist approach. Groningen: University of Groningen PhD thesis.

Received: 2023-12-12
Accepted: 2024-11-19
Published Online: 2024-12-23

© 2024 Walter de Gruyter GmbH, Berlin/Boston
