Article, Open Access

Lexical factors in English definiteness marking: a corpus-based investigation

  • Florent Perek and Lotte Sommerer
Published/Copyright: 21 January 2025

Abstract

Using data from the British National Corpus, this paper examines the correlation between particular nouns and verbs and the definiteness of direct objects. We find that many nouns, and even more verbs, strongly prefer either definite or indefinite direct objects. Using distributional semantics, we show that these preferences correlate with the meaning of those lexical items, in that semantically similar items tend to show similar preferences. These findings allow us to reexamine to what extent definiteness marking is primarily driven by discourse-pragmatic considerations, as is traditionally considered to be the case. We argue that lexical factors play a role in definiteness marking, and following a usage-based approach to grammatical representation, we suggest that speakers may store information about the typical definiteness status of noun phrases with their mental representation of lexical items, in particular the argument structure of verbs.

1 Introduction

This paper revisits (in)definiteness marking in Present Day English. Definiteness marking and the usage and function of English determiners have been researched extensively (e.g. Christophersen 1939; Hawkins 1978; Chesterman 1991; Langacker 1991; Löbner 1985; Lyons 1999). However, this area has only occasionally been investigated from a corpus-based, quantitative perspective, and to our knowledge, no large-scale empirical study exists on the usage and distribution of (in)definiteness markers in corpus data.[1] More importantly, in cognitive-functional linguistics, definiteness marking is primarily seen as a discourse-pragmatic phenomenon which codes – among other things – accessibility, identifiability, familiarity or specificity (e.g. Hawkins 1978; Ariel 1990; Gundel et al. 1993). The definiteness marking of noun phrases is taken to be strongly influenced by discourse-pragmatic factors, and thus to be largely ‘extra-grammatical’ (e.g. Chafe 1976; Du Bois 1980; Ariel 1990; Prince 1992). In other words, the speaker decides for each NP whether it needs to be marked as definite or indefinite, mostly depending on situational factors relating to the referent of the NP and its accessibility at the current point in discourse. Broadly speaking, marking an NP as definite usually signals that the hearer is able to (exhaustively) identify the referent on the basis of the previous discourse, the immediate situation, and/or general world knowledge; in other words, in all contexts that make it clear that the noun phrase refers to a unique, identifiable entity. On the other hand, indefiniteness typically signals that a new, non-identifiable referent gets introduced in the discourse (Payne and Huddleston 2002: 371; Quirk et al. 1985: 5.12).

It is undeniably true that every noun can be used as the head of an indefinite or definite noun phrase. In example (1) below, answer is introduced in the discourse as a new referent by using an indefinite direct object argument. In the next sentences, this referent is taken up as an identifiable, previously mentioned discourse referent, namely the/your answer.

(1)
I need [an answer to this question]NPindef. I need [the answer]NPdef by tomorrow at the latest. By the way, [your answer]NPdef better be good.

A particular area that has been under-investigated so far is the possible role of lexical factors in the marking of (in)definiteness. It is worth considering whether the meaning of the head noun may influence definiteness marking to a certain extent (Hoey 2005). From a usage-based perspective, it is reasonable to assume that over time speakers come to realize that some nouns occur more often in definite NPs than in indefinite NPs. For instance, relational body-part nouns like head are expected to be used in definite NPs more than in indefinite ones as speakers typically use these nouns with reference to their ‘possessors’ (e.g. my head, the company’s head) (Hoffmann 2022; Payne and Huddleston 2002).

Besides the head noun, we propose that another possible source of lexical bias towards (in)definiteness might be the main verb, an idea which has seldom been considered in previous research. Verbs are generally taken to determine argument structure (Levin 1993; Levin and Rappaport Hovav 2005; Perek 2015; Tesnière 1959), i.e. the number and type of constituents contained in the clause (e.g. transitive vs. intransitive verbs). In this paper, we hypothesize that verbs may not only project information about which and how many arguments are required, but also whether the arguments are expected to be definite or indefinite.

Our two research questions are the following:

  1. RQ1: To what extent are lexical factors involved in the marking of (in)definiteness, in particular the semantics of the head noun and the main verb?

  2. RQ2: Are verbs a better predictor of definiteness than nouns, or vice versa?

To investigate our RQs and hypotheses, we conducted a large corpus study based on over two million direct object NPs extracted from the British National Corpus (XML Edition) by means of a dependency parser (Chen and Manning 2014). For each verb and noun we calculated its so-called ‘Definiteness Ratio’, i.e. the proportion of its definite uses. Using a distributional semantic model, we show that verbs and nouns with a similar meaning tend to occur to a similar extent with (in)definite direct objects. Ultimately, we will argue that, from a usage-based point of view (e.g. Bybee 2010; Diessel 2019; Goldberg 2019), speakers have probabilistic knowledge about whether certain arguments (in this case object NPs) are most likely definite or indefinite. We subscribe to the idea that speakers store exemplars and acquire knowledge about lexical and grammatical expectations in sentence processing (collocational/colligational probabilities). Thus, speakers may also memorize a verb’s or noun’s preferences regarding the most likely discourse status of the constructions they are used with or in.

The paper is structured as follows: in Section 2 we summarize some theoretical background information regarding definiteness marking in English and we discuss the assumed lexical biases that one might observe in argument realization. In Section 3, we present our methodology, discussing in particular the data extraction procedure as well as our coding scheme. Section 4 reports our findings. Section 5 interprets the data by highlighting the interaction between definiteness marking and argument structure realization and by discussing how our findings should influence cognitive-functional models of grammatical knowledge. Section 6 concludes the paper by discussing the potential implications of this research and giving an outlook for future work.

2 Theoretical background

2.1 Definiteness marking in English

Despite the fact that (in)definiteness has been investigated extensively (e.g. Abbott 2004; Becker 2021; Givón 1978; Hawkins 1978; Heim 1982; Himmelmann 1997; Krámský 1972; Löbner 1985; Lyons 1999; Meier 2020), there is still no universally agreed definition. Generally, (in)definiteness is regarded as a functional category pertaining to noun phrases. For example, it has been stated that “[a] definite NP has a referent which is assumed by the speaker to be unambiguously identifiable by the hearer (in brief, a known or identifiable referent)” (Chesterman 1991: 10). Quirk et al. (1985: 266) claim that a definite NP (ex. 2) refers “to something which can be identified uniquely in the general knowledge shared by the speaker and hearer”.

(2)
I am working in [the/my office]NPdef today.

In contrast to this, “an indefinite NP has a referent which is assumed by the speaker not to be unambiguously identifiable by the hearer (i.e. a new, or unknown referent)” (Chesterman 1991: 10):

(3)
[A man]NPindef came into the office yesterday.

What previous accounts make abundantly clear is that (in)definiteness is not a simplistic concept but a composite notion which subsumes many characteristic features. An essential component of all definitions seems to be the aspect of identifiability; however, other components of (in)definite meaning have also been discussed extensively: referentiality, familiarity, uniqueness, inclusiveness, specificity and accessibility (e.g. Abbott 2004; Becker 2021; Chesterman 1991; Christophersen 1939; Gundel et al. 1993; Heim 1982; Löbner 1985; Meier 2020; Russell 1905; Sommerer 2018). For reasons of space, we will not discuss all these notions in detail here, but we will limit ourselves to what is relevant for the purpose of this paper. Broadly speaking, definiteness marking depends on discourse-pragmatic factors mostly pertaining to how accessible the referent of an NP might be in the current discourse context (cf. Gundel et al. 1993). Of course, this is not to say that discourse pragmatics underlie all uses of definite or indefinite NPs; there are NPs that are formally definite or indefinite although the features of their referent should make them fall into the opposite type, as shown by examples (4) and (5). Some NPs are semantically neither definite nor indefinite as they correspond to the description of a type rather than an identifiable referent, as shown in the ascriptive use in (6) and the generic use in (7). Finally, in some cases, such as (8), it can even be argued that the article’s function has been reduced to a sheer marker of nounhood, rather than referring to a specific referent (Hawkins 2004).

(4)
A first attempt was made by Jones in 1932. (from Chesterman 1991: 53)
(5)
So I went into the pub and there was this guy sitting at the bar. (indefinite this)
(6)
Mary is a Manchester United supporter. (from Huddleston and Pullum 2002: 402)
(7)
A lion is a noble beast. (from Hawkins 1978: 214)
(8)
His hobby is playing the violin.

Such uses, which contradict the usages mentioned above, are well known and have for instance been classified as “unfamiliar uses” by Hawkins (1978). Such cases notwithstanding, the vast majority of definiteness versus indefiniteness marking can still be accounted for by discourse-pragmatic factors. The overall point is that (in)definiteness marking depends largely on considerations that are mostly external to the sentence: the deciding factors relate to the current discourse and to the extra-linguistic context, rather than to constraints dictated by sentence structure or any other aspects of the linguistic make-up of the sentence.

In Present Day English, the overt marking of definite reference is obligatory; it is realized by default by definite determiners, especially the article the. In contrast, if a noun phrase has indefinite reference, this is obligatorily marked on singular count nouns by the indefinite article a/an or other indefinite determiners (Payne and Huddleston 2002: 371; Quirk et al. 1985: 5.12). In addition to the definite article (the), demonstratives (this/that, these/those), possessive pronouns (my, her, etc.) and genitive phrases are also taken to express definiteness (Abbott 2004: 122). At the same time, these determiners have additional meanings, e.g. the pronoun my expresses possession whereas the demonstrative those expresses spatial deixis. The indefinite article a/an marks the NP as indefinite. In many accounts, the following elements are also considered to mark the NP as indefinite: the universals all, both; the existential determinatives some/any; quantifying determinatives such as some, many, much; cardinal numbers (one, two, twenty, etc.); the negative particle no; the disjunctives either, neither; the distributives each, every, etc.; and the interrogative/relative determinatives which, what (Payne and Huddleston 2002: 356; see also Abbott 2004: 124). Finally, zero marking (i.e. leaving the noun ‘bare’) with plural nouns and mass nouns is an indication of indefiniteness as well.

2.2 Lexical biases in definiteness marking

Besides discourse-pragmatic factors, it is worth asking whether particular lexical items might play a role in the definiteness marking of noun phrases, in particular the head noun and the main verb. As is well known, verbs determine in large part the structure of the clauses in which they occur, and in particular the number and type of constituents contained in the clause (Goldberg 1995; Levin 1993; Levin and Rappaport Hovav 2005; Perek 2015; Tesnière 1959). For example, the verb kill is normally accompanied by two noun phrases, one typically placed before the verb, which traditional grammar calls the subject, and one placed after the verb, called the (direct) object:

(4)
[Brutus]SUB killed [his adoptive father]OBJdir.

This information is commonly referred to as subcategorization, argument structure, or valency, to name only the most usual terms (Allerton 1982; Chomsky 1965; Grimshaw 1990; Perek 2015; Tesnière 1959). Depending on the grammatical framework, it is often taken to be stored with verbs, or at least to follow in one way or another from particular properties of verbs, typically related to their meaning (Fillmore 1968; Levin 1993; Levin and Rappaport Hovav 2005). Argument realization information is assumed to include the number of constituents and their morphosyntactic properties, such as their syntactic type, case marking, prepositions, etc., but not necessarily more specific aspects of their content, such as the type of determiner(s) that they contain. In particular, the question of whether and to what extent the meaning of a verb also has an influence on the definiteness status of its accompanying arguments has not so far been investigated in depth.

As was already mentioned, the definiteness marking of noun phrases is typically assumed to mostly depend on discourse-pragmatic factors, and thus to be largely extra-grammatical. In other words, the speaker decides for each NP whether it needs to be definite or indefinite according to discourse factors, but this is presumably independent of any aspect of sentence grammar, in particular the verb, in contrast to argument realization. We hypothesize that specific verbs with their particular semantics are more likely to attract definite arguments (e.g. direct objects) than indefinite ones and vice versa. For example, compare the verbs need and forget: the underlying hypothesis is that a verb like need is more likely to take a direct object which is indefinite because what speakers typically need is something that they do not already have and thus is not already familiar to them in the discourse (ex. 5). This is in contrast with a verb like forget: one can only forget what one already knows and thus what is already familiar to them (ex. 6).

(5)
I need [ an answer to this question]NPindef.
(6)
She forgot [ the answer to this question]NPdef.

The situation is presumably similar with verbs like build or get versus verbs like cut or explain.

(7)
need/build/get… + [OBJdirect]NPindef versus forget/cut/explain… + [OBJdirect]NPdef

This idea is not completely new. Research in formal semantics and on thematic roles mentions so-called intensional transitive verbs (ITVs) (e.g. creation/search/desire verbs) and their existence entailments. For instance, a verb like to build (when used in the present tense) entails that what ‘is being built’ does not yet fully exist. Some scholars observe that such verbs therefore often take indefinite arguments (e.g. Dowty 1979, 1991; Zimmermann 2001; Forbes 2020; Moltmann 2008). However, in these publications indefiniteness is only mentioned in passing and is not investigated quantitatively. This is where our paper expands on current research; we also investigate all kinds of verbs, not only those classified as ITVs. In a different research tradition, Hopper and Thompson (1980) also touch on the relation between transitivity and definiteness (of the direct object in particular), which is particularly relevant to the present study as it focuses specifically on direct object noun phrases. They analyze transitivity as a composite notion involving a number of parameters which are “all concerned with the effectiveness with which an action takes place” (p. 251). They posit the individuation of the object as a feature of transitivity, and mention definiteness as a typical marker of high individuation in English and other languages (and conversely indefiniteness as a marker of low individuation). On the basis of examples taken from a range of languages, they show that the individuation of objects, including definiteness, co-varies cross-linguistically with other features of high transitivity. This suggests a relation between verb argument structure and definiteness across languages, although the exact nature of this relation in a particular language like English is yet to be determined.

At the same time, it is also reasonable to ask whether the type of head noun and its semantics may also exert an influence on the definiteness status of a noun phrase. Here, we hypothesize that nouns like birthday, mother, end, arm have a general preference for definite marking due to their (relational) semantics (e.g. my birthday, his arm, the end of X). For instance, it is much more likely that a speaker uses a noun like wife as the head of a definite NP because the meaning of this noun is inherently tied to another referent: one cannot be a wife in isolation but always in relation to a spouse (Hoffmann 2022; Payne and Huddleston 2002). This fact is often mirrored by a high number of relational definite NPs with possessive determinatives and genitive phrases filling the determiner slot (my wife, uncle Joe’s wife). Some studies do report findings that suggest nouns may indeed display particular preferences for definite or indefinite marking. For instance, Hoey (2005) analyses what he calls “colligational primings” of the noun consequence, i.e. grammatical features that frequently co-occur with this noun. With regard to definiteness, he finds that consequence is used more frequently as the head of an indefinite noun phrase compared to other abstract nouns (see also Stefanowitsch and Gries 2009). However, this kind of analysis is not extended to a wider range of nouns. In language acquisition, Pine and Lieven (1997) find that there tends to be little or no overlap between the nouns used with the indefinite article and those used with the definite article in the speech of toddlers between the ages of 1 and 3. In other words, children’s early use of articles is largely lexically specific, and the acquisition of determiners seems to work on the basis of “constructional islands” (Tomasello 2003), as has also been found in many other areas of grammar: particular nouns are stored with particular determiners as chunks, but there is no general, lexically independent category of determiners yet. While this is clearly no longer the case for adult speakers, it is plausible that some traces of these early determiner islands are kept in the form of statistical preferences of nouns for particular types of determiners, and that similar preferences are formed with other nouns acquired later in a speaker’s life.

To conclude this section, we do not deny that (in)definiteness marking is primarily a discourse-pragmatic issue. After all, every verb and every noun can be used in both indefinite and definite NPs. However, since we subscribe to the idea that speakers store exemplars of language use in all their detail as part of their cognitive representation of grammar, it follows that speakers might also memorize a verb’s preferences regarding the most likely discourse status of its arguments, as well as the likelihood for a particular noun to be grounded by a particular definite or indefinite determinative. As mentioned above, some studies already suggest that this might indeed be the case, but there is to this day no large-scale investigation of the preferences of particular lexemes towards definiteness, either for nouns or for verbs. In the remainder of this paper, we empirically examine the question of lexical factors in definiteness marking by means of a large-scale corpus study.

3 Data and methodology

3.1 Corpus data

We decided to address our research questions by exclusively focusing on direct objects in English. We made this methodological decision for two reasons. First, noun phrases are very common syntactic structures occurring across a very wide range of contexts, so narrowing down our data set seems a necessity. Focusing on direct objects is deliberate, as it is a position in which the main verb is especially likely to play a role. Second, by keeping the sentence position constant, we presumably compare noun phrases that are subject to similar discourse-pragmatic constraints, which should allow us to identify non-discourse-pragmatic factors in definiteness marking (in our case, lexical ones) more effectively.

To extract information about the definiteness behaviour of English direct object noun phrases, we used data from a syntactically parsed version of the British National Corpus (BNC) XML Edition, a general-purpose reference corpus of British English consisting of 100 million words from texts of a wide range of written and spoken genres. We submitted the entire BNC to the Stanford neural-network dependency parser (Chen and Manning 2014), using the Penn Treebank Part-of-Speech tagset (Santorini 1990) and the Stanford typed dependencies (De Marneffe et al. 2008). We then extracted all instances of the dependency relation “DOBJ” (direct object) between a verb head (i.e. with a tag starting with “VB”) and a common noun dependent (“NN” and “NNS” tags). This excludes in particular proper nouns (e.g. John, London) and pronouns (e.g. you, them, independent this, etc.).
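To make the extraction step more concrete, the following minimal sketch shows how verb–noun pairs linked by a “DOBJ” relation might be collected from a dependency-parsed corpus. It is an illustration under assumptions (CoNLL-style token dictionaries with id, lemma, pos, head, and deprel fields), not the actual pipeline used for the study.

```python
# Illustrative sketch: collect (verb, noun) pairs linked by a DOBJ relation.
# Assumes each sentence is a list of token dicts with the fields
# "id", "lemma", "pos", "head", and "deprel" (an assumed input format).
from collections import Counter

def dobj_pairs(sentences):
    """Yield (verb_lemma, noun_lemma) for every DOBJ relation between a
    verbal head (tag starting with VB) and a common-noun dependent (NN/NNS)."""
    for sent in sentences:
        by_id = {tok["id"]: tok for tok in sent}
        for tok in sent:
            if tok["deprel"].lower() == "dobj" and tok["pos"] in ("NN", "NNS"):
                head = by_id.get(tok["head"])
                if head is not None and head["pos"].startswith("VB"):
                    yield head["lemma"].lower(), tok["lemma"].lower()

# pair_freqs = Counter(dobj_pairs(parsed_bnc_sentences))  # hypothetical input
```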

We also added restrictions on the verb. The verbs do, have, get, make, and take were excluded, as they are commonly used in idioms and light verb constructions like do the dishes, have a smoke, get the munchies, make a decision, take a shower, etc. Many of these expressions are highly idiomatic and not fully analyzable, and it is not clear whether they can indeed be considered to combine a verb with a direct object; very often, the noun phrase in them is not referential, in that it conveys the main predication rather than a participant to the verbal predicate. More importantly, many of such expressions are fixed or semi-fixed chunks, disallowing or at least restricting variation within the NP such as modifiers and, crucially for our purposes, the choice of determiners and their definiteness status. These expressions often do not allow the choice between indefinite and definite uses (e.g. *do many dishes), making them irrelevant to our research question. While such cases cannot be fully removed without manual intervention and will thus remain a source of noise, excluding the verbs listed above arguably reduces the problem. Additionally, because these verbs are highly polysemous and have very general meanings, their semantics are harder to characterize than more substantial verbs (especially in the distributional semantic approach that we adopt in Section 4.2). Since one of the aims of our investigation is to describe the relation between lexical meaning and definiteness, it is difficult to include these verbs in the analysis in a meaningful way. Be and go were also excluded because these verbs cannot normally be combined with a direct object as it is commonly defined by grammarians. In addition, do, be, and have are also frequently used as auxiliaries.

Finally, we excluded instances of verbs combined with a modal (e.g. must, can, should, etc.), and instances in which the verb is the dependent in an “XCOMP” dependency relation with another predicative head word (typically a verb, adjective, or noun), in other words when it is the main verb in an open clause complement, defined as “a predicative or clausal complement without its own subject” (De Marneffe et al. 2008: 10). This removes cases in which the direct-object-taking verb is embedded in a complex predicate, such as for instance achieve in necessary to achieve, eating in avoid eating, or allow in must allow. We did this in order to avoid a potential confounding factor: if there is a relation between definiteness and verb meaning, then combining the verb with another predicate might influence that relation in more or less predictable ways. For instance, when it is combined with another verb, a predicate such as like will often require a habitual reading, e.g. I like reading novels, which will be more likely to use an indefinite direct object NP. Similarly, aspectual verbs such as start or finish profile the initial, medial, or final part of an activity, which is likely to interact with the discourse status of the direct object, and in turn its definiteness; for instance, we can reasonably expect finish reading the N to be more common than finish reading a N, as we reason that speakers are more likely to talk about finishing something that was already known to them rather than something that was not mentioned before. Hence, we remove those structures from the dataset, retaining mostly instances in which the verb is used as the main predicate in a clause of any kind without modal verbs, including main clauses and finite subordinate clauses, as well as some non-finite clauses used as adverbials, for instance when a to-infinitive clause expresses purpose.
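The exclusion criteria above can be summarised as a simple filter over the parsed instances. The sketch below is only illustrative (field names and lookups are assumptions carried over from the previous sketch), not the code actually used.

```python
# Illustrative filter implementing the exclusions described in the text.
EXCLUDED_VERBS = {"do", "have", "get", "make", "take", "be", "go"}

def keep_instance(verb_tok, sent):
    """Return True if a verb + direct object instance should be retained."""
    if verb_tok["lemma"].lower() in EXCLUDED_VERBS:
        return False
    # discard verbs combined with a modal auxiliary (Penn tag "MD")
    if any(t["head"] == verb_tok["id"] and t["pos"] == "MD" for t in sent):
        return False
    # discard verbs that are themselves XCOMP dependents of another predicate
    if verb_tok["deprel"].lower() == "xcomp":
        return False
    return True
```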

These selection criteria resulted in the identification of 1,951,005 direct object NPs. In the next section, we discuss how these NPs were coded for (in)definiteness.

3.2 Coding for (in)definiteness

All the direct object NPs were automatically classified into various determiner categories by using the dependency parsing information, in particular focusing on the dependents of the head noun and looking for its determiner(s). The determiner categories were identified depending on what kind of determiner(s) were found, as explained below.

The easiest cases to identify were those in which the head noun was involved in the dependency relation “DT” (or “POSS” for the possessive determiners) with particular types of determiners, namely: the definite article the, the indefinite article a (or an), a demonstrative determiner (this, that, these, or those), or a possessive determiner (e.g. your, his, their, etc.). All of these types are considered central determiners in reference grammars (e.g. Downing 2015; Huddleston and Pullum 2002; Quirk et al. 1985).

The category “genitive” contains cases in which the head noun is modified by a preceding noun phrase with ’s attached to it, e.g. your father’s death. In the corpus annotations, these were identified as dependents of the head noun linked by the dependency relation “POSS” that were not possessive determiners (“PRP$”).

If the NP contained a central determiner that did not fall into one of the categories listed so far, or only contained a pre-determiner and/or post-determiner, it was categorized as “other determiner”. This category mostly includes quantifiers such as many, few, much, some, several, no, all, etc., numerals (e.g. three, ten, a hundred), and other indefinite determiners like any, every, each, either, etc. All instances of this category are indefinite.

Also included in this category are instances of binominal quantifiers, i.e. a lot of, a bit of, a couple of, a pair of, lots of, plenty of, hundreds/thousands/millions/dozens of. In such cases, the first noun (e.g. lot, couple) is treated as the head in the corpus annotations, but we consider instead the head of the noun phrase to correspond to the head of the prepositional complement of, e.g. wine in a lot of red wine. While it is still technically possible for some of these quantifying nouns (in particular couple and pair) to be used literally in a non-quantifying reading (e.g. a pair of gloves), we found that the vast majority of instances in the corpus are indeed quantifiers. As to partitive structures, e.g. some of his objections, although they look superficially similar to binominal quantifiers except that the head word can be a determiner (e.g. some, any, each, all), a numeral, or a quantifying word (e.g. many, much, more, none), we decided to exclude them from the dataset. We found these NPs hard to categorize: since they typically refer to an undefined subset of the referent of the inner NP, they are technically indefinite, but this subset is determined by the inner NP, which is normally definite. Because they are a relatively small category (6,618 tokens, i.e. a mere 0.3 % of all tokens), we excluded partitives to avoid any potential confound.

Finally, if any NP was not recognized as one of the above categories, it was categorized as “zero-marking”, corresponding to those cases in which a noun is used without a determiner, which happens in particular with plural nouns and mass nouns.
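As a rough illustration of this coding scheme, the sketch below assigns a determiner category to a direct-object noun on the basis of its dependents; tag and relation names are assumptions in the spirit of the annotations described above, and binominal quantifiers and partitives are left out for brevity.

```python
# Illustrative classifier for the determiner categories of Section 3.2.
DEMONSTRATIVES = {"this", "that", "these", "those"}
DEFINITE_CATEGORIES = {"definite article", "demonstrative", "possessive", "genitive"}

def determiner_category(noun_tok, sent):
    """Classify an NP by the determiner(s) attached to its head noun."""
    dependents = [t for t in sent if t["head"] == noun_tok["id"]]
    for d in dependents:
        rel, lemma = d["deprel"].lower(), d["lemma"].lower()
        if rel == "poss":
            # possessive determiner (PRP$) vs. genitive NP ('s phrase)
            return "possessive" if d["pos"] == "PRP$" else "genitive"
        if rel == "det":
            if lemma == "the":
                return "definite article"
            if lemma in ("a", "an"):
                return "indefinite article"
            if lemma in DEMONSTRATIVES:
                return "demonstrative"
            return "other determiner"   # no, any, all, many, numerals, ...
    return "zero-marking"               # bare plural or mass noun
```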

Table 1 below summarizes the determiner categories identified as either definite or indefinite, with their frequency of occurrence in the dataset. We find similar numbers of definite and indefinite NPs, with a ratio slightly in favour of indefinite ones (at 0.53). The investigated examples include definite and indefinite NPs that might also contain pre- and post-head modification in addition to their determiners, for instance through adjectives, prepositional phrases (with of or other prepositions), relative clauses, or non-finite clauses.

Table 1:

Determiner categories identified in the dataset.

Definite categories:
  Definite article (the N): 584,324
  Demonstrative (this/that/these/those N): 41,787
  Possessive (my/her/their/… N): 255,544
  Genitive (NP’s N, e.g. my sister’s phone): 47,067
  Total definite NPs: 928,722

Indefinite categories:
  Indefinite article (a, an): 344,166
  Zero-marking (plural nouns, mass nouns, etc.): 518,706
  Other determiner (no, any, all, other, many, some, quantifiers and numerals): 152,793
  Total indefinite NPs: 1,015,665

4 Empirical analysis and findings

As already mentioned, our main research question is whether lexical factors have an influence on the definiteness status of noun phrases, focusing on direct objects as our case study. We examine the influence of two words in particular: the verb that governs each direct object, and the head noun of the direct object itself. A secondary question will be to compare the role of these two lexical factors and say which one (if any) seems to have the greater impact on definiteness. In Section 4.1, we define the main quantitative measure derived from our dataset: the definiteness ratio, which basically captures the proportion of definite versus indefinite uses of a lexeme. We explore how definiteness ratios vary among the nouns and verbs in our dataset. In Section 4.2, we assess to what extent definiteness ratios relate to the meaning of nouns and verbs, using distributional semantics. In Section 4.3, we use linear regression modelling to quantify the strength of this relation and compare nouns and verbs in this respect.

4.1 Definiteness ratios

We start our analysis by examining the null hypothesis, i.e. the idea that there is no relation between the definiteness of a direct object and the verb or head noun, and that definiteness is purely determined by discourse-pragmatic factors. Recall that we found approximately the same proportion of definite versus indefinite direct object noun phrases in our sample (see Section 3.2). If this hypothesis is true, then we should expect direct objects to be roughly equally likely to be definite or indefinite for all verbs and all nouns, and conversely, we should not expect verbs or nouns to vary widely in the definiteness behaviour of their direct objects.

To quantify the likelihood of direct objects to be definite or indefinite, we calculate for each verb and each noun its definiteness ratio (DR), which we define as the number of definite direct objects governed by this verb or headed by this noun, divided by the number of all such direct objects (both definite and indefinite), as summarised by the formulae below.

$$DR_V = \frac{F(V_{\text{DefDO}})}{F(V_{\text{DefDO}}) + F(V_{\text{IndefDO}})}$$
$$DR_N = \frac{F(N_{\text{DefDO}})}{F(N_{\text{DefDO}}) + F(N_{\text{IndefDO}})}$$

In other words, the DR of a verb is the proportion of its uses with a definite direct object relative to all of its uses with a direct object, and the DR of a noun is the proportion of its uses as the head of a definite direct object (of any verb) relative to its uses as the head of a direct object of any kind (both definite and indefinite). As a ratio, DRs vary between 0 and 1. When the DR is closer to 1, this means that the verb tends to prefer definite direct objects or that the direct objects headed by this noun tend to be definite; when it is closer to 0, the verb tends to prefer indefinite direct objects, or the direct objects headed by this noun tend to be indefinite.
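A minimal sketch of this computation, assuming the coded instances are available as (verb, noun, is_definite) triples (an illustrative data structure, not the actual implementation), could look as follows.

```python
# Illustrative computation of definiteness ratios per lexeme.
from collections import defaultdict

def definiteness_ratios(instances, key_index):
    """key_index=0 gives verb DRs, key_index=1 gives noun DRs.
    Returns {lexeme: (DR, token_frequency)}."""
    counts = defaultdict(lambda: [0, 0])     # lexeme -> [n_definite, n_total]
    for instance in instances:
        lexeme, is_definite = instance[key_index], instance[2]
        counts[lexeme][0] += int(is_definite)
        counts[lexeme][1] += 1
    return {lex: (n_def / n_total, n_total) for lex, (n_def, n_total) in counts.items()}

# dr_verbs = definiteness_ratios(instances, key_index=0)
# dr_nouns = definiteness_ratios(instances, key_index=1)
```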

We calculated the DR for all verbs and all nouns that had at least 1,000 tokens in our dataset. The 1,000 threshold is arbitrary, but it is deemed high enough to provide robust data about the use of each verb and noun, since a larger sample is less prone to random variation than a smaller number of tokens. To analyze how the DRs of the 369 verbs thus selected vary, we plotted their probability density in Figure 1. This graph shows, for each value of DR between 0 and 1, how common this value is across the verbs in our sample; the higher the curve is at any given point, the more common the corresponding DR value on the x-axis is among the verbs in our dataset.
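A density plot of this kind can be sketched as follows, assuming the dictionary of ratios and frequencies returned by the previous sketch; the plotting details are illustrative.

```python
# Illustrative density plot of definiteness ratios with median and quartiles.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

def plot_dr_density(dr, min_tokens=1000):
    values = np.array([ratio for ratio, freq in dr.values() if freq >= min_tokens])
    q1, median, q3 = np.quantile(values, [0.25, 0.5, 0.75])
    xs = np.linspace(0, 1, 200)
    plt.plot(xs, gaussian_kde(values)(xs))
    plt.axvline(median, linestyle="--")      # median (dashed line)
    for q in (q1, q3):
        plt.axvline(q, linestyle=":")        # quartiles (dotted lines)
    plt.xlabel("Definiteness ratio")
    plt.ylabel("Density")
    plt.show()

# plot_dr_density(dr_verbs)
```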

Figure 1: Probability density of the definiteness ratio of verbs with at least 1,000 tokens in our dataset. Median = 0.5015 (dashed line); Q1 = 0.3634, Q3 = 0.6486 (dotted lines).

As can be seen from Figure 1, the DRs of these 369 verbs vary widely. While, unsurprisingly, no verb is used 100 % of the time with only definite or indefinite direct objects, some of them are strongly biased towards one or the other type, with DR values approaching 0 or 1. The first and third quartiles are respectively at 0.3634 and 0.6486 (the dotted lines in Figure 1), meaning that half the verbs in our dataset fall within these values, with the other half receiving more extreme values. Hence, there does seem to be a relation between verbs and the definiteness of their direct objects: while some verbs are equally likely to be used with a definite versus indefinite direct object, many of them show a preference towards one or the other type.

Turning to nouns, the probability density of the definiteness ratios of the 427 nouns attested at least 1,000 times as the head of a direct object in our dataset is plotted in Figure 2. As can be seen in this figure, the DRs of nouns vary widely just like verbs’, but they are also more narrowly distributed. On the whole, the distribution is more clustered around the middle values than that of verbs, reflecting a more balanced proportion of definite versus indefinite direct objects. The first and third quartiles are at 0.3798 and 0.6028, corresponding to an interquartile range (IQR) of 0.223, compared to 0.2853 for verbs. The interquartile range corresponds to the span of values assigned to the middle portion of the distribution, and is often used as a measure of dispersion in a dataset; the higher IQR for verbs confirms that verbs are on the whole more prone than nouns to display a strong preference for definiteness or indefiniteness.

Figure 2: Probability density of the definiteness ratio of nouns with at least 1,000 tokens in our dataset. Median = 0.4888 (dashed line); Q1 = 0.3798, Q3 = 0.6028 (dotted lines).

In sum, both nouns and verbs seem likely to play a role in the definiteness behaviour of direct objects, as we find in both cases a significant number of lexemes that are biased towards definite or indefinite direct objects. At the same time, it looks like the role of verbs may be more significant than that of nouns, since the definiteness of direct objects varies more for verbs than for nouns, and fewer nouns than verbs are biased towards definite or indefinite uses. However, there is a potential confound in our dataset which makes it unclear whether we can reliably draw this conclusion. Particular verbs notoriously show preferences in the kinds of nouns that can serve as the head of their direct object, which largely has to do with their meaning and with the type of entities that can fulfill the semantic role assigned to their direct object. For instance, a verb of ingestion like eat will tend to be combined with nouns referring to edible substances, and a verb like waste will often be combined with nouns referring to some kind of resource, such as money, time, or energy. If we consider that certain nouns may be biased towards definite or indefinite noun phrases when they occur as direct objects, it might thus be the case that verbs are biased towards definite or indefinite direct objects not in and of themselves, but simply by virtue of their preference towards nouns displaying such a bias. The converse may also be true for nouns: some nouns might be biased towards definite or indefinite DOs because they tend to occur with verbs that are definite- or indefinite-biasing.

To control for this potential confound and check whether there are independent effects of nouns and verbs, we re-examine the distribution of the DR of verbs when we remove biased nouns, and conversely the distribution of the DR of nouns when we remove biased verbs. For verbs, we only keep in the dataset those nouns from the middle range of the distribution of noun DRs, i.e. nouns with a frequency of at least 1,000 in our dataset that have a DR between 0.3798 (the lower quartile) and 0.6028 (the upper quartile); this is the part of the distribution that is between the dotted lines in Figure 2. We then re-calculate the DR of all verbs when they are combined with only these nouns. We apply the same correction to nouns, i.e. we re-calculate the DR of all nouns with a frequency of at least 1,000, when they occur as DO to the verbs from the middle range of the DR distribution, i.e. with an original DR between 0.3634 and 0.6486. The distribution of verb DRs and noun DRs thus corrected are plotted in Figures 3 and 4 respectively.
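The correction can be sketched as follows, re-using the definiteness_ratios() helper from above; the quartile thresholds are those reported in the text, everything else is an illustrative assumption.

```python
# Illustrative re-computation of DRs after removing biased co-occurring items.
def corrected_ratios(instances, target_index, other_index, other_dr, lo, hi):
    """Keep only instances whose other lexeme (noun for verbs, verb for nouns)
    has at least 1,000 tokens and a DR within [lo, hi], then recompute DRs."""
    filtered = [inst for inst in instances
                if other_dr.get(inst[other_index], (0.0, 0))[1] >= 1000
                and lo <= other_dr[inst[other_index]][0] <= hi]
    return definiteness_ratios(filtered, target_index)

# verb DRs over mid-range nouns only:
# dr_verbs_corrected = corrected_ratios(instances, 0, 1, dr_nouns, 0.3798, 0.6028)
# noun DRs over mid-range verbs only:
# dr_nouns_corrected = corrected_ratios(instances, 1, 0, dr_verbs, 0.3634, 0.6486)
```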

Figure 3: Probability density of the definiteness ratio of verbs with F > 1,000 in the whole dataset, when only the direct object nouns occurring in the middle range of definiteness ratios are included. Median = 0.5120 (dashed line); Q1 = 0.3788, Q3 = 0.6559 (dotted lines).

Figure 4: Probability density of the definiteness ratio of nouns with F > 1,000 in the whole dataset, when only the verbs occurring in the middle range of definiteness ratios are included. Median = 0.5128 (dashed line); Q1 = 0.4052, Q3 = 0.6317 (dotted lines).

As can be seen in these plots, for both verbs and nouns, the distribution of DRs does not change substantially when the nouns or verbs with extreme DRs are removed. The central tendency of each distribution does not seem to vary much, either for verbs (Median = 0.5120; Q1 = 0.3788, Q3 = 0.6559; IQR = 0.2771; versus Median = 0.5015; Q1 = 0.3634, Q3 = 0.6486; IQR = 0.2853) or for nouns (Median = 0.5128; Q1 = 0.4052, Q3 = 0.6317; IQR = 0.2265; versus Median = 0.4888; Q1 = 0.3798, Q3 = 0.6028; IQR = 0.223). This means that both verbs and nouns can bias direct objects towards a definite or indefinite use in and of themselves.

For verbs in particular, such tendencies are not due to their preference towards nouns with such a bias; rather, we suggest that some verbs impose certain information structure requirements on their direct object, which translate into a tendency for definiteness or indefiniteness. Nouns too can be biased towards definite or indefinite use due to their semantics, but in the absence of such bias, they tend to follow the definiteness profile of the verb. In the next two sections, we provide evidence for these views by investigating what kinds of verbs and nouns display a preference towards definite or indefinite direct objects.

We now turn to the question of what makes the DR of nouns and verbs vary. We start by looking at the verbs and nouns that show the strongest preference for definite or indefinite direct objects. Table 2 below lists the 10 verbs with the lowest DRs (lefthand side) and the 10 verbs with the highest DRs (righthand side); Table 3 does the same for nouns.

Table 2:

The 10 verbs most used with indefinite direct objects (left), and the 10 verbs most used with definite direct objects (right) in our dataset.

Indefinite-preferring verbs (Verb / Def DOs / Indef DOs / Def ratio):
undergo 127 1,762 0.0672
appoint 285 1,675 0.1454
issue 361 2,078 0.1480
suffer 609 3,338 0.1543
generate 287 1,565 0.1550
seem 380 1,980 0.1610
need 2,595 12,630 0.1704
impose 348 1,680 0.1716
produce 2,299 11,093 0.1717
imply 249 1,201 0.1717

Definite-preferring verbs (Verb / Def DOs / Indef DOs / Def ratio):
shake 4,535 596 0.8838
appreciate 929 150 0.8610
lower 975 162 0.8575
shut 1,017 182 0.8482
stress 1,278 244 0.8397
emphasise 1,080 243 0.8163
explain 2,582 593 0.8132
alter 1,166 273 0.8103
close 2,970 726 0.8036
wash 888 225 0.7978

Table 3:

The 10 nouns most used as the head of an indefinite direct object (left), and the 10 nouns most used as the head of a definite direct object (right) in our dataset.

Indefinite-preferring nouns (Noun / Def DOs / Indef DOs / Def ratio):
access 137 1,753 0.0725
pound 234 2,149 0.0982
chapter 209 1,593 0.1160
deal 353 2,550 0.1216
minute 176 1,183 0.1295
copy 248 1,592 0.1348
mile 168 1,051 0.1378
variety 218 1,295 0.1441
hour 319 1,867 0.1459
care 212 1,126 0.1584

Definite-preferring nouns (Noun / Def DOs / Indef DOs / Def ratio):
mouth 1,312 73 0.9473
head 9,868 723 0.9317
lip 1,564 123 0.9271
back 1,271 116 0.9164
father 1,295 126 0.9113
nature 1,462 152 0.9058
mind 2,260 244 0.9026
husband 887 120 0.8808
door 4,955 686 0.8784
truth 1,656 238 0.8743

The definiteness preference of many of these items can arguably be made sense of if we consider their meaning. For instance, the direct object of verbs of creation such as generate, issue, produce, and impose (i.e. the creation of a constraint) is an entity that, by definition, does not exist before the creation event, and therefore it makes sense that this entity has not been mentioned in the previous discourse and is referred to with an indefinite NP. A similar explanation goes for need, whose direct object refers by nature to something that someone does not have yet, and thus is likely to be discourse-new. Conversely, with change of state verbs such as shut, lower, alter, close, and wash, one can only affect something that has already been identified and is thus likely to have already been mentioned or to be accessible in the context of utterance, hence it tends to be referred to with a definite NP. Similarly, one can only appreciate, explain, emphasize, or stress something that one is already aware of, making a definite direct object especially likely for these verbs.

As to nouns, names for units (pound, minute, mile) are presumably biased towards indefinite use because they naturally tend to be quantified (e.g. a pound of meat, 3 minutes) rather than to refer to a specific entity. The definite-preferring nouns in our list also include body-part nouns (mouth, head, lip, back, and to some extent mind) and kinship terms (father, husband), which can be explained by the fact that the semantics of these nouns inherently tie their referent to another entity: a mouth or head cannot exist independently of an entire body (either literally of a person or animal, or figuratively of an inanimate object or abstract entity), and one cannot be a father or husband in isolation but always in relation to their child or spouse. Hence both types of nouns are used most naturally with possessive determination (e.g. her mouth, John’s father) or a structure with a definite article and a post-modifier (e.g. the head of the company).

In sum, there seems to be a connection between the meaning of nouns and verbs and their definiteness behaviour and some degree of regularity in this connection, in that verbs sharing aspects of their meaning display a similar behaviour. In the next section, we explore this connection on a much larger scale using distributional semantics.

4.2 Definiteness ratio and lexical meaning

In this section, we evaluate in more detail the proposition that definiteness preferences are related to the meaning of nouns and verbs. More precisely, we examine one prediction of this hypothesis, i.e. that semantically similar verbs and nouns should behave similarly with respect to the definiteness of direct objects. In order to do so on a large scale, we capture the meanings of verbs and nouns, and in particular the semantic similarity between them, by means of a distributional semantic model (hereafter DSM).

Distributional semantics, also often called vector-space semantics or word embeddings, aims to capture the meaning of words through their lexical collocates in large text corpora. As such, it follows Firth’s (1957: 11) idea that “you shall know a word by the company it keeps”. Distributional semantics draws on the intuition that semantically similar words are expected to have similar collocates. For instance, the verbs drink and sip, because they refer to similar actions, are expected to co-occur with similar words, for instance words for beverages (wine, water, coffee, beer), containers (cup, glass, bottle), as well as words related to drinking or dining practices (restaurant, bar, table, party, etc.). The main principle behind distributional semantics is to approximate semantic similarity between words through their similarity in distribution.

The DSM we used for this study was built using word2vec (Mikolov et al. 2013), which aims to quantify the relation between a word and its contexts of occurrence in a corpus, captured in terms of weights in a neural network. The model was trained on a lemmatised and PoS-tagged version of the Bank of English corpus, a 700-million-word corpus of general contemporary English curated at the University of Birmingham that includes a wide range of genres (spoken language, fiction, newspapers, etc.) mostly from both British and American English. More specifically, we used the skip-gram algorithm of word2vec from the gensim library in Python (Řehůřek and Sojka 2010). Given a word, skip-gram attempts to predict its context of occurrence, defined here as the surrounding words in a 2-word window (i.e. 2 words to the left and 2 words to the right); words are considered similar to the extent that they predict similar contexts. The meaning of each word is captured as a vector, i.e. an array of numerical values, derived from the weights of nodes in the layers of the neural network. Semantic similarity between words can be measured by quantifying the similarity between vectors; this is typically done using the cosine similarity measure.
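For readers unfamiliar with word2vec, a model of this general kind can be trained with a few lines of gensim code. The sketch below follows the description above (skip-gram, 2-word window), but all other hyperparameters and the input variable are illustrative assumptions, not the exact settings used for the study.

```python
# Illustrative skip-gram training with gensim's word2vec implementation.
from gensim.models import Word2Vec

model = Word2Vec(
    sentences=corpus_sentences,   # hypothetical iterable of lemmatised token lists
    sg=1,                         # skip-gram algorithm
    window=2,                     # 2 words to the left and 2 to the right
    vector_size=300,              # dimensionality of the word vectors (assumed)
    min_count=50,                 # ignore very rare words (assumed)
    workers=4,
)

# cosine similarity between two word vectors
# model.wv.similarity("drink", "sip")
```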

We sought to exclude from the model words that can contribute to marking (in)definiteness, as discussed in 3.2; we implemented this filter simply by making the word2vec algorithm ignore words from the following parts-of-speech: determiners (tags starting with D), numerals (tags starting with M), and pronouns, which include possessive determiners (tag starting with P). This was done to avoid a possible confound whereby the definiteness ratios and the semantic similarity between words would partly be based on the same information, hence leading to words occurring with the same (in)definiteness markers being simultaneously measured as more semantically similar and closer in definiteness behaviour. In addition, in a departure from the standard implementation of distributional semantics, we sought to distinguish the transitive use of verbs (i.e. with a direct object) from their other uses (e.g. intransitively, with a complement clause, with a non-finite clause, etc.), essentially treating these two uses of each verb as two distinct lexical items. This was done in order to obtain a more refined semantic representation for each verb, and more importantly, a more appropriate description of their meaning for the present study. The meaning of a verb can indeed vary widely according to its argument structure, and not all meanings of a verb can be used with the same range of valency requirements. For instance, add is exclusively used as a verb of communication when it is combined with a subordinate clause, e.g. He added that he had missed her, but it is used in the more general sense of ‘put in, include’ when it is combined with a direct object NP, e.g. Add the chopped onions to the pan. This means that the semantic representation of a verb can be very different depending on whether it is used transitively or not. Since we are dealing with verbs followed by direct objects, we decided to use semantic vectors of verbs based on their transitive use only. To achieve this, we added dependency annotations to the Bank of English corpus, using the same software and parameters as for the BNC above (see 3.1). We then identified from the annotations all instances of a verb combined with a direct object, and we added a special tag “_vt” to the lemma of the verb in every such instance. This essentially marks a formal distinction between the transitive uses of a verb and the non-transitive uses, which are thus treated as separate lexical items by word2vec. As for nouns, even though we have no a priori reason to suspect that nouns take on a different meaning when they are used as direct objects, we applied the same treatment to nouns for good measure, if only to make their semantic representation maximally comparable with that of verbs. Namely, we identified the nouns that were annotated as the head of a direct object of a verb and added a “_do” tag to their lemma. The analyses presented in this section and the next one are based on the semantic vectors of transitive verbs and direct object nouns, i.e. those verbs with the “_vt” tag and those nouns with the “_do” tag. We did find that this manipulation provided us with higher-quality semantic representations (especially for verbs) compared to a DSM that did not separate the different uses of verbs and nouns, leading to more appropriate semantic groupings.
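The preprocessing just described can be sketched as follows; token fields and tag prefixes follow the text, but the function itself is an illustrative assumption rather than the script actually used.

```python
# Illustrative preprocessing: drop (in)definiteness-marking word classes and
# mark transitive verbs ("_vt") and direct-object nouns ("_do") as separate items.
def preprocess(sent, dobj_ids):
    """dobj_ids: set of (verb_id, noun_id) pairs for this sentence, taken from
    the dependency annotation."""
    transitive_verbs = {v for v, n in dobj_ids}
    do_nouns = {n for v, n in dobj_ids}
    output = []
    for tok in sent:
        # skip determiners (D...), numerals (M...), and pronouns (P...)
        if tok["pos"][0] in ("D", "M", "P"):
            continue
        lemma = tok["lemma"].lower()
        if tok["id"] in transitive_verbs:
            lemma += "_vt"
        elif tok["id"] in do_nouns:
            lemma += "_do"
        output.append(lemma)
    return output
```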

Following Perek (2016, 2018), we extracted pairwise cosine similarity scores between the verbs occurring at least 1,000 times in our dataset, and we used this information to plot these verbs in two dimensions with t-SNE (Van der Maaten and Hinton 2008), a technique that aims to place objects in a 2-dimensional space such that the between-object distances are preserved as well as possible. In this plot, found in Figure 5, verbs that are found to be semantically similar in the DSM are located close to each other. This allows us to visualise the semantic space populated by these verbs and identify semantically coherent classes of verbs. We use colour-coding to relate this semantic information to the verbs’ definiteness ratio: each verb in the plot receives a colour on a spectrum from red to blue, directly proportional to the definiteness ratio of the verb. Hence, a red verb is one that prefers indefinite direct objects, a blue verb is one that prefers definite direct objects, with varying degrees of purple in between signifying verbs that are more balanced in the definiteness of their direct objects.
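A plot of this kind can be produced along the following lines, assuming the word2vec model and the dr_verbs dictionary from the earlier sketches; the t-SNE settings and plotting details are illustrative choices rather than the parameters used for Figure 5.

```python
# Illustrative t-SNE plot of verb vectors, colour-coded by definiteness ratio.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

verbs = [v for v, (ratio, freq) in dr_verbs.items()
         if freq >= 1000 and v + "_vt" in model.wv]
vectors = np.array([model.wv[v + "_vt"] for v in verbs])
coords = TSNE(n_components=2, metric="cosine", init="pca").fit_transform(vectors)

colours = [dr_verbs[v][0] for v in verbs]     # 0 = indefinite-, 1 = definite-preferring
plt.scatter(coords[:, 0], coords[:, 1], c=colours, cmap="RdBu")
for (x, y), v in zip(coords, verbs):
    plt.annotate(v, (x, y), fontsize=6)
plt.colorbar(label="Definiteness ratio")
plt.show()
```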

Figure 5: Distributional semantic plot of verbs in our dataset (F > 1,000), colour-coded according to their definiteness ratio (red = indefinite-preferring, blue = definite-preferring).

The overall picture that emerges from Figure 5 is that verbs cluster in groups of similar colours. In other words, semantically similar verbs do tend to have similar definiteness ratios. By inspecting the plot, we can identify classes of verbs on either side of the spectrum. For instance, among indefinite-preferring verbs, comprise, combine, contain, feature, include, exclude, incorporate, involve (bottom left) form a cluster of verbs coding a part-whole relation between their subject and object. We also find verbs of experiencing (experience, face, suffer, undergo; top middle), verbs of giving (give, lend, offer, provide, supply; bottom middle), and verbs of creation (e.g. build, construct, create, design, develop, form, generate, produce; bottom middle) to have a similar general preference for indefinite direct objects. As to definite-preferring verbs, we can identify for instance verbs of scrutinizing (assess, analyse, check, examine, explore, investigate, measure, monitor, observe, review, study, test; middle left), verbs of landmark-based motion (approach, cross, enter, reach, walk; bottom right), and change of state verbs (cut, enhance, expand, extend, improve, increase, limit, lower, raise, reduce, reinforce, restrict, strengthen; bottom middle). We also find pairs of synonyms or antonyms with a similar colour: alter/change, attract/draw, collect/gather, start/begin, close/shut, grant/award, possess/lack, complete/finish.

Figure 6 contains a similar plot to Figure 5 for the nouns occurring at least 1,000 times in our dataset. Similarly to Figure 5, nouns are placed in the plot according to their semantic similarity as measured by the DSM, and they are colour-coded according to their definiteness ratio.

Figure 6: Distributional semantic plot of nouns in our dataset (F > 1,000), colour-coded according to their definiteness ratio (red = indefinite-preferring, blue = definite-preferring).

As in Figure 5, we can identify semantically coherent groups of nouns that also receive a similar colour, indicating that they have similar definiteness ratios. As noted above, words for body parts (e.g. foot, head, heart, leg; bottom middle) and kinship terms (e.g. daughter, father, wife; middle left) are all typically used in definite direct objects. The same holds for nouns referring to abstract concepts (e.g. concept, idea, principle; middle right). Conversely, nouns that prefer indefinite uses include time units (e.g. day, hour, year; bottom middle), information nouns (e.g. detail, evidence, information; bottom right), and nouns referring to an augmentation (growth, increase, rise; top right). We also find many pairs of related nouns with similar definiteness behaviour, as indicated by their similar colour: loss/profit, truth/fact, success/victory, care/treatment, plant/tree, class/course, and road/street.

However, for both verbs and nouns, we need to recognize that there are also counter-examples, i.e. words that are related in meaning but do not behave similarly with respect to definiteness. For instance, the noun pairs reason/cause, space/room, and image/picture, and the verb pairs seek/pursue and acknowledge/admit, differ in their definiteness preferences despite their semantic relatedness. Such cases mean that the relation between definiteness and lexical meaning may not be fully systematic and admits some exceptions, although, as we will see in Section 5 below, many of these exceptions can still be accounted for by subtle but important semantic differences. At any rate, such exceptions are relatively uncommon, even more so for verbs than for nouns. On the whole, the two plots show different levels of overlap in the colours coding definiteness ratio: for verbs, there are larger coherent areas with roughly the same colour, while for nouns the blue and red items seem more mixed, suggesting that the relation between meaning and definiteness is stronger for verbs than for nouns. In the next section, we explore this idea in a more systematic and quantitative way.

4.3 The relative weight of nouns versus verbs in definiteness marking

So far, we have explored the question of the connection between lexical meaning and definiteness in a rather impressionistic way. We observed that semantically similar verbs and nouns tend to receive similar definiteness ratios, and that this relation seems to be stronger for verbs than for nouns, which suggests a more central role for the former in predicting the definiteness of a direct object. In this section, we seek to confirm these observations through more precise and systematic quantitative measurement.

One issue that immediately presented itself when quantifying the relation between definiteness and lexical meaning is that this relation is essentially unidirectional. While semantically similar verbs and nouns are predicted to have similar definiteness ratios, there is no such reverse prediction: words that are totally unrelated semantically may differ in terms of definiteness, but they may just as well be similar, for unrelated reasons. This precludes the use of standard correlation measures, such as the Pearson product-moment correlation coefficient, to quantify the correlation between semantic similarity and closeness in definiteness ratio, since such a measurement would only make sense at one end of the relation (i.e. high semantic similarity) and would be meaningless and riddled with noise at the other end (low semantic similarity). For this reason, we opted for a different approach: if we sort nouns and verbs into semantically coherent classes, we predict that items within each class should be similar in terms of the definiteness behaviour of direct objects. In other words, the semantic class of a noun or verb should be a reliable predictor of its definiteness ratio.

To operationalise semantic classes, we automatically sorted nouns and verbs into a set number of groups on the basis of their distributional semantic representations, using cluster analysis (Aldenderfer and Blashfield 1984). Cluster analysis is a family of unsupervised learning techniques aimed at classifying objects into homogeneous categories on the basis of a set of numerical variables characterizing each object; in our case, this is the semantic vector of each verb or noun. Specifically, we used model-based clustering based on parameterized finite Gaussian mixture models, an advanced form of cluster analysis that approaches the problem of classification in terms of fitting the data to underlying probability distributions. Model-based clustering tends to obtain better results than comparable techniques like k-means clustering, while avoiding some of their pitfalls (cf. Bouveyron et al. 2019), and it proved to work very well with our data. We used the Mclust function from the R package mclust (Scrucca et al. 2016).

The question of the “right” number of clusters is a notoriously thorny issue in cluster analysis. Model-based clustering does come with a built-in method to determine the ideal number of clusters in a data-driven way, drawing on the Bayesian information criterion (BIC) to compare model fit. However, with our data this method disappointingly pointed to a 2-cluster solution, which is clearly too coarse to capture the semantic variation among our items in a meaningful way. Therefore, we decided to run the analysis several times using different numbers of clusters (namely, between 40 and 100 in increments of 20). While the choice of the number of clusters remains arbitrary, varying this number allows us to check that our results are consistent across different clustering solutions and do not depend on any particular number of clusters. For illustrative purposes, we list in Table 4 a selection of noun and verb classes identified in the 40-cluster solution.
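Concretely, this clustering step can be sketched as follows; vectors is again a hypothetical matrix holding one distributional vector per noun or verb, and the hard cluster assignments produced in this way are what Table 4 samples from.

    ## Minimal sketch: model-based clustering of the semantic vectors at fixed
    ## numbers of clusters (40, 60, 80, 100), assuming 'vectors' holds one
    ## distributional vector per noun or verb.
    library(mclust)

    ks <- c(40, 60, 80, 100)
    solutions <- lapply(ks, function(k) Mclust(vectors, G = k))  # one Gaussian mixture model per k
    names(solutions) <- paste0("k", ks)

    ## Hard cluster assignments in the 40-cluster solution (cf. Table 4)
    classes40 <- solutions$k40$classification
    table(classes40)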

Table 4:

Some semantic classes of verbs and nouns identified in the 40-cluster solution.

Verb classes

Cluster 2: see need say know tell feel ask mean hear think understand help expect claim learn believe teach assume concern speak hope talk
Cluster 7: include contain involve comprise feature
Cluster 10: want enjoy like love welcome prefer hate appreciate
Cluster 12: keep maintain save retain protect restore preserve
Cluster 18: raise reduce increase cut improve extend limit enhance strengthen lower restrict expand reinforce
Cluster 33: examine study check investigate assess explore measure test review analyse monitor

Noun classes

Cluster 3: people government group side member school world force party country state person movement other staff police patient unit student worker player court club
Cluster 4: time hour year day period minute moment
Cluster 11: part role position place office seat post vote
Cluster 20: idea view sense feeling principle theory concept impression opinion
Cluster 30: result end development loss performance profit event success death return sale victory
Cluster 36: rise increase growth

The groups in Table 4 clearly show that cluster analysis successfully identifies semantically coherent classes from the distributional semantic information. We find some of the classes already identified in Section 4.2 from the distributional semantic plots: verbs expressing part-whole relations (Cluster 7), change-of-state verbs (Cluster 18), verbs of scrutinizing (Cluster 33), nouns for time units (Cluster 4), nouns for increases (Cluster 36). Other classes were less immediately visible from the plot but are clearly identified as clusters: verbs expressing attitudes (Cluster 10), verbs of preserving (Cluster 12), nouns that are synonyms for role (Cluster 11), nouns for states of mind (Cluster 20), nouns expressing outcomes (Cluster 30). However, not all clusters are as well-defined; in some cases, a cluster combines two or more loosely related semantic categories rather than a single coherent one. Two examples are provided in Table 4: Verb Cluster 2, which comprises verbs of cognition (e.g. know, feel, assume, understand) and verbs of communication (e.g. say, tell, ask, claim), and Noun Cluster 3, which includes both names for individual people (e.g. member, student, player) and for groups of people (e.g. team, government, school, club). There is still some semantic coherence in these groups, and their apparent heterogeneity is probably an effect of the number of clusters chosen, which necessarily forces a certain level of granularity. It is likely that these would be split into smaller and more semantically coherent classes in a solution with a higher number of clusters. Finally, there are also occasional outliers in these classes, e.g. vote in Noun Cluster 11, which are at best related to some of the other nouns through word associations but are certainly not of the same ontological type. This kind of outcome is unavoidable with a method that must assign every item to a category: items that are genuine semantic orphans end up being categorised on quite loose grounds. Fortunately, such cases are not common in our data.

The next step of the analysis is to assess to what extent the semantic class of the head noun and the governing verb can predict the definiteness of NPs in our dataset, which we achieve using mixed effects logistic regression modelling. We restricted the dataset to the 369 verbs used so far in our analysis, and to the 369 most frequent nouns (instead of the 427 nouns with a frequency of at least 1,000 used earlier). We matched the number of nouns and verbs in order to make sure that verb and noun clusters had the same degree of variability: if we had used a higher number of nouns (i.e. all 427), the noun clusters would have had more members on average than the verb clusters (since the number of clusters is held constant between nouns and verbs), and thus possibly more diverse behaviour within each cluster, penalising nouns as a predictor of definiteness. The lowest-frequency noun in this dataset, length, totals 1,133 tokens, which does not substantially differ from the 1,000-token frequency threshold applied to verbs.

We added to this data the semantic class of each noun and each verb, and we submitted it to mixed effects logistic regression modelling with the definiteness of each token as the binary dependent variable. Using the lme4 library in R, we fitted two types of models: (i) models with random intercepts for the semantic classes of nouns and verbs, and (ii) models with random intercepts for the semantic classes of nouns and verbs as well as random intercepts for individual nouns or verbs nested within semantic classes.[2] The latter type of model captures the definiteness preferences of the individual lexical items contained in each class in addition to those of the whole class; such models are thus more precise and take lexical idiosyncrasies into account. However, because not all of these models could be properly fitted to the data, as we discuss below, we decided to retain the simpler models containing only the semantic classes alongside the more complex ones with nested effects. We chose to capture the semantic classes and lexical items as random effects because the number of levels for these variables is high and drawn from a potentially even wider population (i.e. the entire set of nouns or verbs and the entire semantic space that they cover). In essence, this means that random intercepts are calculated to capture the baseline probability of a definite NP in each semantic class (and for each lexical item in each class in the models with nested effects), and we evaluate the amount of variance in the definiteness of NPs in our dataset that is captured by these intercepts alone. For each type of model, we fitted several variants depending on whether we included nouns, verbs, or both. The formulas (in lme4 syntax) used for these various models are provided in Table 5. Since we merely seek to measure the effect of semantic classes with these models and how much variation in definiteness they can capture in and of themselves, we did not include any other factors in the models, which is why they do not contain any fixed effects. As mentioned above, we repeated this analysis with different numbers of semantic classes, i.e. different numbers of clusters in the initial cluster analysis, namely 40, 60, 80, and 100 clusters.

Table 5:

Model formulas used for the mixed effects regression analysis (in lme4 syntax).

Noun class model with nested nouns: Definiteness ∼ (1|NounClass) + (1|NounClass:Noun)
Verb class model with nested verbs: Definiteness ∼ (1|VerbClass) + (1|VerbClass:Verb)
Noun + verb class model with nested nouns and verbs: Definiteness ∼ (1|NounClass) + (1|VerbClass) + (1|NounClass:Noun) + (1|VerbClass:Verb)
Noun class model: Definiteness ∼ (1|NounClass)
Verb class model: Definiteness ∼ (1|VerbClass)
Noun + verb class model: Definiteness ∼ (1|NounClass) + (1|VerbClass)
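To make these specifications concrete, the sketch below shows how models of this kind can be fitted with lme4. The data frame dat and its columns Definiteness (a binary factor), Noun, Verb, NounClass and VerbClass are hypothetical names standing for a dataset with one row per direct-object token; the sketch implements the formulas of Table 5, not the authors' exact script.

    ## Minimal sketch of the six model types in Table 5, assuming a hypothetical
    ## data frame 'dat' with columns Definiteness, Noun, Verb, NounClass, VerbClass.
    library(lme4)

    ## Models with random intercepts for semantic classes only
    m_noun <- glmer(Definiteness ~ (1 | NounClass), data = dat, family = binomial)
    m_verb <- glmer(Definiteness ~ (1 | VerbClass), data = dat, family = binomial)
    m_both <- glmer(Definiteness ~ (1 | NounClass) + (1 | VerbClass),
                    data = dat, family = binomial)

    ## Models that additionally include lexical items nested within their classes
    m_noun_nested <- glmer(Definiteness ~ (1 | NounClass) + (1 | NounClass:Noun),
                           data = dat, family = binomial)
    m_verb_nested <- glmer(Definiteness ~ (1 | VerbClass) + (1 | VerbClass:Verb),
                           data = dat, family = binomial)
    m_both_nested <- glmer(Definiteness ~ (1 | NounClass) + (1 | VerbClass) +
                             (1 | NounClass:Noun) + (1 | VerbClass:Verb),
                           data = dat, family = binomial)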

Table 6 reports two performance statistics for all models, namely the Bayesian information criterion (BIC) and the conditional R2.[3] The models with nested effects are reported in the top half of Table 6, the models without in the bottom half. Both measures indicate how well a model fits the data. R2 ranges between 0 and 1 and captures the proportion of variance explained by a regression model. The value of BIC, on the other hand, is unbounded and increases with unexplained variation; hence, a lower BIC means that a model is a better fit. Both statistics thus provide a measure of the predictive power of the variables in the model; in our case, they tell us how reliably the definiteness of a noun phrase can be predicted from the noun or verb alone, and whether this prediction is more reliable for verbs than for nouns. As mentioned earlier, it is important to note that many of the models with nested random effects (i.e. the top half of Table 6) were singular fits, meaning in technical terms that some dimensions of the variance-covariance matrix have been estimated as exactly zero. In practical terms, such models are considered problematic and the consensus is that their results are not reliable (cf. Oberpriller et al. 2022); the statistics of the affected models are marked with an asterisk in Table 6. For the same reason, the R2 value for one of the singular-fit models (namely, the verb class model with nested verbs and 60 semantic verb classes) could not be calculated and is thus missing from Table 6. Although results from the singular-fit models should be taken with a pinch of salt, we still find it appropriate to report them here, since they fully line up with those of the models that do not suffer from such issues, i.e. the nested effects models with the larger numbers of semantic classes (80 and 100) and all models without nested effects. The full R output of all models is also provided in an online supplement to this paper.
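The statistics reported in Table 6 can be extracted along the following lines, continuing the hypothetical sketch above; the choice of the performance package for the conditional R2 is our assumption for illustration, since other implementations (e.g. MuMIn::r.squaredGLMM) would serve the same purpose.

    ## Continuing from the previous sketch: extracting the statistics in Table 6.
    library(lme4)          # for isSingular()
    library(performance)   # for r2(); an assumption, not necessarily the authors' choice

    BIC(m_both_nested)           # Bayesian information criterion (lower = better fit)
    r2(m_both_nested)            # prints marginal and conditional R2 for the mixed model
    isSingular(m_verb_nested)    # TRUE flags a singular fit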

Table 6:

Performance statistics from mixed effects regression models predicting the definiteness of NPs with the semantic class of nouns and verbs as random effects. Values marked with an asterisk are from models with singular fits.

Noun class model with nested nouns
  k = 40*: BIC = 860,099.2, R2 = 0.1749113
  k = 60*: BIC = 860,099.2, R2 = 0.1749116
  k = 80: BIC = 860,099.2, R2 = 0.1749103
  k = 100: BIC = 860,099.2, R2 = 0.1749117

Verb class model with nested verbs
  k = 40*: BIC = 854,011.9, R2 = 0.1840432
  k = 60*: BIC = 854,011.9, R2 = N/A
  k = 80: BIC = 854,011.8, R2 = 0.1840409
  k = 100: BIC = 854,011.7, R2 = 0.1840310

Noun + verb class model with nested nouns and verbs
  k = 40*: BIC = 792,360.6, R2 = 0.2709655
  k = 60*: BIC = 792,360.6, R2 = 0.2709634
  k = 80: BIC = 792,360.1, R2 = 0.2709594
  k = 100: BIC = 792,359.5, R2 = 0.2709387

Noun class model
  k = 40: BIC = 939,668.0, R2 = 0.03012360
  k = 60: BIC = 935,017.6, R2 = 0.04066266
  k = 80: BIC = 922,697.7, R2 = 0.06780322
  k = 100: BIC = 916,678.9, R2 = 0.08161265

Verb class model
  k = 40: BIC = 932,144.3, R2 = 0.04370842
  k = 60: BIC = 925,918.6, R2 = 0.05777119
  k = 80: BIC = 920,157.1, R2 = 0.07726112
  k = 100: BIC = 914,996.9, R2 = 0.08562260

Noun + verb class model
  k = 40: BIC = 920,216.1, R2 = 0.06778813
  k = 60: BIC = 909,860.8, R2 = 0.09137038
  k = 80: BIC = 895,596.5, R2 = 0.11929799
  k = 100: BIC = 885,526.3, R2 = 0.14191183

For all models, R2 values are quite low (even more so for the models without nested effects) and BIC values quite high, showing that these models have overall low predictive power. This should not be surprising, since definiteness marking is subject to many factors that a priori have more to do with discourse pragmatics than with lexical items. However, the fact that these models capture any of the variance at all does confirm our earlier observation that there is a relation between the meaning of a head noun or governing verb and the definiteness behaviour of a direct object. The models with nested effects perform markedly better than the ones without, meaning that including random effects for individual lexical items in addition to semantic classes improves model fit. This too should not be surprising: since categories defined by lexical items are by nature considerably narrower than semantic classes generalising over multiple lexical items, random effects that additionally draw on individual lexical items naturally perform better as predictors of definiteness.

Crucially, model statistics are consistently better (higher R2, lower BIC) for the verb data than for the noun data, regardless of the number of clusters. This finding holds both for models with nested effects (including the singular fits) and for those without nested effects. This confirms statistically our impression that the relation between lexical meaning and definiteness is stronger for verbs than for nouns. In other words, while both the meaning of the head noun and that of the governing verb can be seen to predict the definiteness of a direct object NP, this prediction is more reliable for verbs than it is for nouns. We can also see from Table 6 that models combining both verb class and noun class are consistently better than verb-only and noun-only models, and analysis of variance (ANOVA) tests comparing the noun-only and verb-only models to the model with both factors both turn out highly significant, showing that both semantic class factors significantly improve the amount of variance captured by the model. This was found with any number of semantic classes, and both with the models with nested effects and those without. This confirms that there are independent effects of the semantics of both nouns and verbs, as argued in Section 4.1, although the effect of verb semantics seems to be slightly stronger than that of noun semantics.
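The model comparisons reported here correspond to likelihood-ratio tests that can be run along the following lines, again continuing the hypothetical sketch above with the class-only models.

    ## Likelihood-ratio comparisons: does adding the other class factor improve fit?
    ## (m_noun, m_verb, m_both are the class-only models from the earlier sketch)
    library(lme4)

    anova(m_noun, m_both)   # noun classes only vs. noun + verb classes
    anova(m_verb, m_both)   # verb classes only vs. noun + verb classes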

5 Discussion

In this section, we summarize the main findings of our study and discuss their implications. Through a large-scale quantitative analysis of direct object NPs in the BNC, we were able to identify a relation between particular verbs and nouns and the definiteness marking of the direct objects that these verbs govern or that these nouns are the head of. First, we found that the likelihood that a direct object NP is marked as definite or indefinite varies widely depending on the head noun and the governing verb. While all nouns and verbs in our sample are attested with definite and indefinite direct objects, some nouns and some verbs display a strong preference towards definiteness or indefiniteness. Both effects were shown to be largely independent of each other, as they could still be observed when the nouns or verbs with extreme biases were removed from the sample, showing that the preference of verbs for (in)definite DOs is not merely due to a preference for nouns that display this bias, and vice versa. As we investigated possible explanations for these preferences, we found a widespread correlation between the meaning of nouns and verbs and the definiteness preferences of DOs, in that items with a similar meaning tend to display similar preferences. This correlation was also found to be stronger for verbs than for nouns, as the semantic class of verbs appears to have stronger predictive power for the definiteness of direct objects than that of nouns.

These findings indicate that lexical meaning and definiteness pattern in non-trivial ways, which leads us to suggest that we should rethink the basis for definiteness marking and its determinants and add nuance to the default position that it is primarily a discourse-pragmatic phenomenon. The idea that lexical meaning might play a role in definiteness is not incompatible with a discourse-pragmatic account: it is reasonable to imagine that the particular conceptual content of a word, combined with world knowledge, makes it more likely for a certain definiteness marking to occur according to discourse-pragmatic principles. For instance, the preference of world for definiteness can be explained in large part by the fact that in one of its main senses, this noun has a unique referent, i.e. planet Earth, calling for the definite article. However, in an alternate reality in which humans have colonized multiple star systems, one can imagine that world would be more readily used with indefiniteness marking in the same meaning (since there would be more than one world), and thus would show a less pronounced bias towards definiteness. This example highlights that definiteness preferences of a lexical item may be contingent, and can therefore still be claimed to rely on pragmatics. In some cases, however, lexical preferences for (in)definiteness can indicate a more purely conceptual basis for definiteness marking, next to the discourse-pragmatic one. For example, we noted earlier that many definiteness-preferring nouns are inherently relational, e.g. terms for body parts (head, foot) and kinship relations (mother, son). It is arguably this conceptual property that leads NPs headed by these nouns to receive unique reference through another entity (the entire body or the other family member), and thus to be used with definiteness marking, before any discourse-pragmatic considerations are involved. In such cases then, lexical semantics stands out as a more straightforward alternative to discourse-pragmatics in accounting for definiteness marking.

Even apparent exceptions to the correlation between meaning and definiteness can often be reconciled with this account if we focus on subtle differences from other members of the same semantic class. Very often, two words that can be considered similar in their central sense actually possess a slightly different range of polysemous meanings. For instance, the meaning of the noun room is wider than that of space, because it can also refer to a part of a building (e.g. her room, the dining room), in addition to the meaning shared by the two nouns; this gives room a stronger preference for definiteness than space. Similarly, while the verbs admit, acknowledge, and recognize can all be used in the sense of ‘accept that something is true’, they can also take on very different meanings, namely ‘allow to enter’ for admit, ‘recognize the importance of’ for acknowledge, and ‘identify as familiar’ for recognize; accordingly, acknowledge and recognize display a stronger preference for definiteness than admit, in line with their different meanings.

Hence, in the same way that argument structure is taken to be derived from the semantic representation of verbs in many accounts of argument realization, definiteness marking can be predicted to some extent from the meaning of lexical items, at least in the case of direct objects. We would like to argue that this connection between lexical meaning and definiteness should be considered part of the grammatical knowledge of speakers. This idea is fully compatible with a usage-based approach to language processing and representation (e.g. Bybee 2010; Diessel 2019; Goldberg 2019). In this view, the representation of grammar is shaped by every exemplar of language use that speakers are exposed to, in all their formal, semantic, and pragmatic detail. If a noun is frequently witnessed as the head of a noun phrase with certain properties, in particular a certain definiteness status, or if a certain verb is frequently followed by a noun phrase with these properties, then these tendencies may be stored in the mental representation of these lexical items and trigger expectations in subsequent uses of these items. We leave it to future research to determine how best to capture this relation in a model of grammar, but we suggest that for nouns, this information could be stored with noun phrase structure templates, and for verbs, it could be stored in the argument structure of the verbs, which in this view are taken to project not just information about the number and types of their arguments, but also information about the discourse status of some of these arguments. In other words, we suggest that some verbs impose certain discourse-pragmatic requirements on their arguments, which translate into a tendency for definiteness or indefiniteness.

Given the systematic relation between meaning and definiteness, it is possible for these representations to be stored at a higher level of abstraction rather than lexeme by lexeme, relating semantic classes or aspects of lexical meaning to definiteness. In other words, when multiple items with a similar meaning are found to display similar definiteness behaviour, this can lead to a higher-level generalization. Conversely, the usage-based exemplar model also leaves room for idiosyncrasies, i.e. lexical items that behave differently from their neighbours, and likewise it accounts for items whose behaviour is not easily predicted from their meaning and world knowledge. An example of the former is the difference between indefinite-preferring reason and definite-preferring cause. Despite their very similar meaning, an examination of the usage of these nouns shows that cause is used more frequently with the definite article, and reason is more frequent in the “other determiner” category, especially with no and any. These facts seem obvious when pointed out, as the phrases no reason and any reason do indeed seem more plausible than no cause and any cause, but similarly to lexical collocations, they are beyond conscious awareness, and the reasons for their existence are ultimately unclear. Rather, they are purely textual facts that language users somehow know, but that have to be learned from the input rather than deduced from any other properties of these nouns. By the same token, the usage-based exemplar model naturally accounts for the existence of systematic definiteness preferences in entire classes of items for which the conceptual explanation is not clear, and which therefore seem largely arbitrary. For instance, we found that various verbs of experiencing, e.g. undergo, suffer, experience, face, markedly prefer indefinite direct objects, a fact that we find hard to explain on conceptual grounds alone. In other words, while we accept that, e.g., undergo an operation seems more likely than undergo the operation, there is in our view nothing in the meaning of undergo that would prepare us for that fact, or that would explain why the source of the subject’s experience would typically not be presented as uniquely identifiable and/or discourse-given; this is, again, a textual rather than conceptual fact.

6 Conclusions

This study is the first large-scale investigation of the statistical relation between the definiteness of a noun phrase and the lexical items related to it in the clause. From the quantitative analysis of a large number of direct objects extracted from a dependency-parsed version of the BNC, we identified nouns and verbs displaying varying preferences towards definite or indefinite direct object NPs, and using distributional semantics, we showed that this variation largely correlates with the meaning of these nouns and verbs and can often be explained by their semantic properties; overall, this correlation also seems to be stronger for verbs than for nouns. While it is not entirely surprising for definiteness and lexical semantics to be related in this way, and while such a relation was already anticipated in some previous research, ours is the first attempt at extending this relation to a significant portion of the lexicon and confirming its existence on a large scale. We interpret our findings to indicate that lexical factors may well play a role in definiteness marking, in addition to discourse-pragmatic factors. Our results also open up the possibility that lexical items may be stored with definiteness information, an idea that is especially compatible with a usage-based approach to language. Verbs in particular can be seen to project not only information about the number and type of arguments that they can be combined with, but also about the definiteness status of those arguments.

In future research, we plan to collect more evidence for the cognitive reality of lexical factors in definiteness marking, in particular by examining whether speakers form expectations about the discourse status of upcoming NPs given a certain verb. We are also interested in broadening the scope of our corpus-based empirical findings, for instance by extending the analysis to other grammatical positions besides direct objects. While we have barely scratched the surface of this investigation in the present paper, we believe that further exploring the lexical factors involved in definiteness marking is likely to make a significant contribution to our understanding of this area of grammar, and could have important implications for various fields of research, for instance first and second language acquisition and language pedagogy, notably by raising awareness of the hitherto unsuspected extent of lexical biases in definiteness marking.


Corresponding author: Florent Perek, University of Birmingham, Birmingham, UK, E-mail:

References

Abbott, Barbara. 2004. Definiteness and indefiniteness. In Laurence R. Horn & Gregory Ward (eds.), The handbook of pragmatics, 122–149. Malden, MA: Wiley-Blackwell. https://doi.org/10.1002/9780470756959.ch6.

Aldenderfer, Mark & Roger Blashfield. 1984. Cluster analysis. Newbury Park: Sage Press. https://doi.org/10.4135/9781412983648.

Allerton, David J. 1982. Valency and the English verb. London: Academic Press.

Ariel, Mira. 1990. Accessing noun-phrase antecedents. London: Routledge.

Becker, Laura. 2021. Articles in the world's languages. Berlin: De Gruyter. https://doi.org/10.1515/9783110724424.

Biber, Douglas, Stig Johansson, Geoffrey Leech, Susan Conrad & Edward Finegan. 1999. Longman grammar of spoken and written English. Harlow: Longman.

Bouveyron, Charles, Gilles Celeux, T. Brendan Murphy & Adrian E. Raftery. 2019. Model-based clustering and classification for data science: With applications in R. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781108644181.

Bybee, Joan. 2010. Language, usage and cognition. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511750526.

Chafe, Wallace. 1976. Givenness, contrastiveness, definiteness, subjects, topics, and point of view. In Charles N. Li (ed.), Subject and topic, 25–55. New York: Academic Press.

Chen, Danqi & Christopher Manning. 2014. A fast and accurate dependency parser using neural networks. In Alessandro Moschitti, Bo Pang & Walter Daelemans (eds.), Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 740–750. Doha, Qatar: ACL. https://doi.org/10.3115/v1/D14-1082.

Chesterman, Andrew. 1991. On definiteness: A study with special reference to English and Finnish. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511519710.

Chomsky, Noam. 1965. Aspects of the theory of syntax. Cambridge, MA: MIT Press. https://doi.org/10.21236/AD0616323.

Christophersen, Paul. 1939. The articles: A study of their theory and use in English. Oxford: Oxford University Press.

De Marneffe, Marie-Catherine & Christopher D. Manning. 2008. Stanford typed dependencies manual. Technical report, Stanford University. https://downloads.cs.stanford.edu/nlp/software/dependencies_manual.pdf (accessed 12 March 2024).

Diessel, Holger. 2019. The grammar network: How linguistic structure is shaped by language use. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781108671040.

Downing, Angela. 2015. English grammar: A university course, 3rd edn. London: Routledge.

Dowty, David. 1979. Word meaning and Montague grammar: The semantics of verbs and times in generative semantics and Montague's PTQ. Dordrecht: Kluwer. https://doi.org/10.1007/978-94-009-9473-7.

Dowty, David. 1991. Thematic proto-roles and argument selection. Language 67(3). 547–619. https://doi.org/10.1353/lan.1991.0021.

Du Bois, John. 1980. Beyond definiteness: The trace of identity in discourse. In Wallace Chafe (ed.), The pear stories: Cognitive, cultural, and linguistic aspects of narrative production, 203–274. Norwood, NJ: Ablex.

Field, Andy, Jeremy Miles & Zoë Field. 2012. Discovering statistics using R. Los Angeles: Sage.

Fillmore, Charles. 1968. The case for case. In Emmon Bach & Robert T. Harms (eds.), Universals in linguistic theory, 1–88. New York: Holt, Rinehart, and Winston.

Firth, John R. 1957. A synopsis of linguistic theory 1930–1955. In Studies in linguistic analysis (Special volume of the Philological Society), 1–32. Oxford: Blackwell.

Forbes, Graeme. 2020. Intensional transitive verbs. In Edward N. Zalta (ed.), The Stanford encyclopedia of philosophy. Stanford, CA: Metaphysics Research Lab, Stanford University. https://plato.stanford.edu/entries/intensional-trans-verbs/ (accessed 12 March 2024).

Givón, Talmy. 1978. Definiteness and referentiality. In Joseph H. Greenberg (ed.), Universals of human language, 291–330. Stanford, CA: Stanford University Press.

Goldberg, Adele E. 1995. Constructions: A construction grammar approach to argument structure. Chicago: University of Chicago Press.

Goldberg, Adele E. 2019. Explain me this: Creativity, competition, and the partial productivity of constructions. Princeton, NJ: Princeton University Press. https://doi.org/10.2307/j.ctvc772nn.

Grimshaw, Jane. 1990. Argument structure. Cambridge, MA: MIT Press.

Gundel, Jeannette, Nancy Hedberg & Ron Zacharski. 1993. Cognitive status and the form of referring expressions in discourse. Language 69. 274–307. https://doi.org/10.2307/416535.

Hawkins, John A. 1978. Definiteness and indefiniteness: A study in reference and grammaticality prediction. London: Croom Helm/Routledge.

Hawkins, John A. 2004. Efficiency and complexity in grammars. Oxford: Oxford University Press. https://doi.org/10.1093/acprof:oso/9780199252695.001.0001.

Heim, Irene. 1982. The semantics of definite and indefinite noun phrases. New York: Garland.

Himmelmann, Nikolaus. 1997. Deiktikon, Artikel, Nominalphrase: Zur Emergenz syntaktischer Struktur. Tübingen: Niemeyer. https://doi.org/10.1515/9783110929621.

Hoey, Michael. 2005. Lexical priming: A new theory of words and language. London & New York: Routledge.

Hoffmann, Thomas. 2022. Construction grammar: The structure of English. Cambridge: Cambridge University Press.

Hopper, Paul J. & Sandra A. Thompson. 1980. Transitivity in grammar and discourse. Language 56(2). 251–299. https://doi.org/10.1353/lan.1980.0017.

Huddleston, Rodney & Geoffrey K. Pullum (eds.). 2002. The Cambridge grammar of the English language. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781316423530.

Krámský, Jirí. 1972. The article and the concept of definiteness in languages. The Hague: Mouton de Gruyter. https://doi.org/10.1515/9783110886900.

Langacker, Ronald W. 1991. Foundations of cognitive grammar, Vol. 2: Descriptive application. Stanford, CA: Stanford University Press.

Levin, Beth. 1993. English verb classes and alternations. Chicago: University of Chicago Press.

Levin, Beth & Malka Rappaport Hovav. 2005. Argument realization. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511610479.

Lyons, Christopher. 1999. Definiteness. Cambridge: Cambridge University Press.

Löbner, Sebastian. 1985. Definites. Journal of Semantics 4. 279–326. https://doi.org/10.1093/jos/4.4.279.

Meier, Cécile. 2020. Definiteness: “Gagarin was the first human to travel to space”. In Daniel Gutzmann, Lisa Matthewson, Cécile Meier, Hotze Rullmann & Thomas Zimmermann (eds.), The Wiley Blackwell companion to semantics, 1–41. Hoboken, NJ: John Wiley & Sons. https://doi.org/10.1002/9781118788516.sem085.

Mikolov, Tomas, Kai Chen, Greg Corrado & Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. https://doi.org/10.48550/arXiv.1301.3781.

Moltmann, Friederike. 2008. Intensional verbs and their intentional objects. Natural Language Semantics 16(3). 239–270. https://doi.org/10.1007/s11050-008-9031-5.

Oberpriller, Johannes, Melina de Souza Leite & Maximilian Pichler. 2022. Fixed or random? On the reliability of mixed-effects models for a small number of levels in grouping variables. Ecology and Evolution 12(7). 1–15. https://doi.org/10.1002/ece3.9062.

Payne, John & Rodney Huddleston. 2002. Nouns and noun phrases. In Rodney Huddleston & Geoffrey K. Pullum (eds.), The Cambridge grammar of the English language, 323–523. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781316423530.006.

Perek, Florent. 2015. Argument structure in usage-based construction grammar: Experimental and corpus-based perspectives. Amsterdam: John Benjamins. https://doi.org/10.1075/cal.17.

Perek, Florent. 2016. Using distributional semantics to study syntactic productivity in diachrony: A case study. Linguistics 54(1). 149–188. https://doi.org/10.1515/ling-2015-0043.

Perek, Florent. 2018. Recent change in the productivity and schematicity of the way-construction: A distributional semantic analysis. Corpus Linguistics and Linguistic Theory 14(1). 65–97. https://doi.org/10.1515/cllt-2016-0014.

Pine, Julian & Elena Lieven. 1997. Slot and frame patterns and the development of the determiner category. Applied Psycholinguistics 18. 123–138. https://doi.org/10.1017/s0142716400009930.

Prince, Ellen F. 1992. The ZPG letter: Subjects, definiteness, and information status. In Sandra A. Thompson & William C. Mann (eds.), Discourse description: Diverse analyses of a fund raising text, 295–325. Philadelphia: Benjamins. https://doi.org/10.1075/pbns.16.12pri.

Quirk, Randolph, Sidney Greenbaum, Geoffrey Leech & Jan Svartvik. 1985. A comprehensive grammar of the English language. Harlow: Longman.

Řehůřek, Radim & Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 workshop on new challenges for NLP frameworks, 45–50. Valletta, Malta: ELRA.

Russell, Bertrand. 1905. On denoting. Mind 14. 479–493. https://doi.org/10.1093/mind/XIV.4.479.

Santorini, Beatrice. 1990. Part-of-speech tagging guidelines for the Penn Treebank project (3rd revision, 2nd printing). Technical report, Department of Computer and Information Science, University of Pennsylvania. https://www.cis.upenn.edu/~bies/manuals/tagguide.pdf (accessed 12 March 2024).

Scrucca, Luca, Michael Fop, T. Brendan Murphy & Adrian E. Raftery. 2016. mclust 5: Clustering, classification and density estimation using Gaussian finite mixture models. The R Journal 8(1). 205–233. https://doi.org/10.32614/rj-2016-021.

Sommerer, Lotte. 2018. Article emergence in Old English: A constructionalist perspective. Berlin: De Gruyter Mouton. https://doi.org/10.1515/9783110541052.

Stefanowitsch, Anatol & Stefan Gries. 2009. Corpora and grammar. In Anke Lüdeling & Merja Kytö (eds.), Corpus linguistics: An international handbook, vol. 2, 933–951. Berlin & New York: Mouton de Gruyter. https://doi.org/10.1515/9783110213881.2.933.

Tesnière, Lucien. 1959. Éléments de syntaxe structurale. Paris: Klincksieck.

Tomasello, Michael. 2003. Constructing a language: A usage-based theory of language acquisition. Cambridge, MA: Harvard University Press.

Van der Maaten, Laurens & Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9. 2579–2605.

Zimmermann, Thomas E. 2001. Unspecificity and intensionality. In Caroline Féry & Wolfgang Sternefeld (eds.), Audiatur vox sapientiae: A Festschrift for Arnim von Stechow, 514–532. Berlin: Akademie Verlag. https://doi.org/10.1515/9783050080116.514.


Supplementary Material

This article contains supplementary material (https://doi.org/10.1515/cllt-2024-0032).


Received: 2024-03-18
Accepted: 2024-12-20
Published Online: 2025-01-21

© 2024 the author(s), published by De Gruyter, Berlin/Boston

This work is licensed under the Creative Commons Attribution 4.0 International License.
