Home Orality and overtness: effects on Spanish subject use
Article Open Access

Orality and overtness: effects on Spanish subject use

  • Gemma McCarley ORCID logo EMAIL logo
Published/Copyright: July 31, 2025

Abstract

This study of a corpus of varieties of Spanish finds that the level of orality of a text is a strong predictor of subject pronoun expression. Following previous studies’ application of orality to interrogative constructions in Brazilian Portuguese and French, an orality measurement was adapted for Spanish and applied to the new corpus Corpus Diacrónico del Español Latinoamericano: Edición de Sujetos (CorDELES). CorDELES was created to investigate the historic development of subject pronoun expression that led to the high rates of overt subject pronouns attested in current varieties of Latin American Spanish, specifically whether overt subject pronoun expression increases following contact with the enslaved Africans brought to the Caribbean during the colonial period. This contact hypothesis was used as a backdrop to investigate the effects of orality on a corpus. Indeed, the inclusion of orality as a predictor in a mixed-effects model found significant effects for a distinction between Spain and the Americas as well as an intriguing interaction between year and orality. These results add to the burgeoning body of work revealing the benefits of accounting for orality in corpus work.

1 Introduction

When compiling a written corpus, it is important to account for textual differences, as the effects of genre, register, and text-type are well-known (cf. Biber 2001; Biber and Conrad 2019; Szmrecsanyi 2006). However, these kinds of static categorizations do not account fully for stylistic variation and change. Orality, on the other hand, is an independent metric that often corresponds with the level of formality expressed in certain genres but is not beholden to them. Indeed, changes in orality levels within genres have already been shown to create the illusion of ‘apparent’ change in historical corpora. Specifically, this factor has been applied to wh-interrogatives in Brazilian Portuguese (BP) (Rosemeyer 2019), and clefted questions in French (Mazzola et al. 2023; Rosemeyer 2022), but not yet to the phenomenon of subject pronoun expression (SPE). Given the influence differing spoken genres have had on SPE rates in Spanish (cf. Travis 2007), the context of variable SPE in Latin American Spanish seems a prime testing ground.

In the last fifty years, the study of SPE in Spanish has become an increasingly popular topic. Variationist research has swelled, producing quantitative and qualitative studies of SPE in the Caribbean (Alfaraz 2015; Camacho 2013; Orozco and Guy 2008; among many others), Central America, South America (e.g. Cerrón-Palomino 2018; Travis 2005), and the U.S. (Otheguy and Zentella 2007). The major throughline of this boom has been the observation that varieties of Spanish spoken throughout Latin America, most notably in the Caribbean, use overt subject pronouns to a higher degree and in more contexts than standard Latin American or Peninsular Spanish. Despite many dialectal differences, such as variation in the pronominal subject system (e.g. the use of usted [you.SG.F]; the loss of vosotros [you.PL.INF] outside of Spain; and the vos [you.SG.INF] form in parts of Latin America [(Real Academia Española)]), each standard Spanish only allows overt pronouns when they mark emphasis or focus. Yet the Dominican example (1) below shows an unmarked overt pronoun in every clause:

(1)
Ellos me dijeron que yo tenía anemia . . . Si ellos me dicen que yo estoy en peligro cuando ellos me entren la aguja por el ombligo, yo me voy a ver en una situación de estrés.
‘They told me that I had anemia . . . If they tell me that I am in danger when they put the needle in my belly-button, I am going to find myself in a stressful situation.’
(Toribio 2000: 319, ex. 3e).

Brazilian Portuguese (BP) has shown a similar development but going even further, to the extent that it has been categorized as a partial null subject or semi-pro-drop language (Duarte 1993; Erker and Guy 2012). This evolution seems to represent movement along van Gelderen’s (2011) subject cycle where a null subject language (NSL) can become a non-null subject language (NNSL). This is what happened in the history of French which used to allow null subjects but today requires overt pronouns (e.g. Foulet 1928). Partial null subject languages (PNSLs) would represent an intermediary stage in this cycle.[1] The question then is what prompted this change in so many varieties of Latin American Spanish as well as BP?[2]

As part of a larger project to explore the hypothesis that the answer lies in their shared context of contact with the languages of enslaved Africans during the colonial period (cf. Granda 1978; Sessarego and Gutiérrez-Rexach 2017; Section 3.1), I created the historical corpus Corpus Diacrónico del Español Latinoamericano: Edición de Sujetos (CorDELES). Despite much discussion of the historical path of SPE in Latin American Spanish (e.g. Camacho 2013; Sessarego and Gutiérrez-Rexach 2017), a historical corpus study has yet to be carried out. Similarly, although significant contributions have been made to the study of the effects of linguistic variables such as switch-reference (Abreu 2012; Alfaraz 2015; Otheguy and Zentella 2007; Silva-Corvalán 1994; Travis 2005), priming (Cameron and Flores-Ferrán 2004; Travis 2007), person and number (Carvalho et al. 2015), TAM features (Cameron 1993; De Prada Pérez 2015), and verb class (Orozco and Hurtado 2021b; Travis 2005) on SPE, there has not been as much attention paid to inter-textual differences such as genre, particularly in the written modality.

For this reason, I apply Rosemeyer’s (2019) orality measurement to CorDELES in order to evaluate the effect of orality on SPE against the backdrop of the contact hypothesis. Concretely, I seek to first determine the relationship between orality and SPE and then assess the effects this relationship may have on the written historical corpus. The second step is accomplished by comparing the results of a mixed model analyzing the contact hypothesis with and without the orality measurement included. This study is the first (to my knowledge) that investigates the relationship between orality and SPE. Section 2 provides background on the measurement and application of orality in corpora. Section 3 first briefly summarizes the contact hypothesis that CorDELES was created to investigate – and that the orality measurement is tested in conjunction with. I then outline the building, annotation, and testing of the corpus, as well as how to quantify orality. Section 4 presents the main finding of this study: a correlation between orality and overt pronominal expression that has an intriguing diachronic component, providing far-reaching consequences for corpus methodology and SPE research. Section 5 discusses the implications of the findings and Section 6 concludes.

2 An introduction to orality

The issue of the difference in linguistic behavior between written and oral registers is not new. The degree to which written texts can be trusted to reflect spoken language is one of the biggest problems historical linguists face. Formality and genre differences ensure that people write differently than they speak (Biber and Conrad 2019) while prescriptivism, illiteracy, and access to publication prevent all varieties of a language from being equally represented in the written record. While the latter set of obstacles are more difficult to surmount, formality and genre are easier to account for. If we pay special attention to the types of written texts that behave most similarly to speech, then we can get a closer approximation to spoken data from historical corpora. However, genre is not stagnant (Biber and Finegan 1989). Norms can and do change over the years, especially for more artistic media like novels, poetry, and plays. ‘Apparent change’ has already been shown to affect frequencies of linguistic features in corpus studies (cf. Szmrecsanyi [2016] for the case of genitive alternation in English). In other words, superficial trends that appear to be the result of one (linguistic) factor are actually the result of a secondary (non-linguistic) factor. This concept can be applied to the inconstancy of genre style. Thus, it is not a given that a 16th century play and a 20th century play will equally reflect the spoken language of their day. This is where the ability to measure orality comes into play.

Rosemeyer (2019) investigates the development and distribution of wh-interrogatives in BP plays. Changes in theatre style are as relevant to SPE as to wh-interrogatives because both can be influenced by discourse functions that are more or less frequent depending on whether a given play relies on soliloquy, punchy dialogue, or has a narrative component. In order to make the BP corpora more comparable over time, Rosemeyer used Biber and Finegan’s (1989) “dimension of involvement” to create a functional manner of measuring orality in an easily searchable and countable way. This measurement consists of five linguistic components that either express intellectual states that tend to be conveyed in orality or depend on spatial, temporal, or discourse deixis: private verbs (achar ‘mean’, pensar ‘think’, acreditar ‘believe’, crer ‘believe’) in singular present tense, present progressive, demonstrative neuter pronouns (isso and isto ‘this’), time and place adverbs (aqui ‘here’ and agora ‘now’), and discourse markers ( ‘isn’t it?’, bom ‘well’, pois ‘so’, então ‘so’, olha ‘listen’). Section 3.3 reviews how these oral elements can be adapted to Spanish and applied to CorDELES. Once orality was accounted for, Rosemeyer (2019) found that many of the previously attested frequency changes for certain interrogative constructions in the 20th century were actually reflective of the changing degree of orality expressed in plays, i.e. ‘apparent’ change. Some of these increases in frequency, however, were the result of real change from below, i.e. changes in speech spreading to written plays through the most oral ones first. Before applying this orality measurement to CorDELES, an introduction to the corpus is necessary.

3 CorDELES

Now that the concept of orality has been elucidated, there are two main research questions that this paper seeks to address. The first is whether orality has an effect on SPE. If so, does this relationship inflate ‘apparent’ trends and/or obscure potential ones in a corpus? To answer this second research question, I test the hypotheses that overt SPE has increased (1) over time, (2) at a higher rate in Latin America than in Spain, and (3) at a higher rate in the Caribbean than South America. If a mixed-effects model finds different effects for year and region when orality is included, then orality will once again be confirmed to be an integral factor in corpus work. This section is thus dedicated to describing CorDELES, the corpus under analysis. Section 3.1 reviews the context for the creation of CorDELES which also doubles as the study orality is being applied to. Section 3.2 outlines how the corpus was built. While Section 3.3 reports how null subjects were annotated, Section 3.4 summarizes how orality was measured and quantified. Section 3.5 explains the statistical models used to analyze the results.

3.1 Context for the corpus

CorDELES was created to investigate the historical path that led to the higher rate of overt SPE in Caribbean Spanish as seen in (1). Considering BP has also undergone this change (Duarte 1993), their shared language contact history is likely responsible. It is crucial to remember that language contact, as well as language change, must always be viewed through the lens of language acquisition and use (Weerman 1993). It is not that two or more concrete monoliths of language come into contact, but that individual speakers learning and using new languages do. With that in mind, acquisition literature has shown that null subject systems are more difficult to acquire than non-null subject systems, particularly for L2 learners (e.g. Bini 1993; Margaza and Bel 2006; Pérez-Leroux and Glass 1999). For example, (2) is a sample from an intermediate L2 speaker who uses the overt pronoun nosotros ‘we’ without emphasis or focus.[3]

(2)
Nosotros tenemos una casa en la montaña y voy cada fin de semana con amigos y amigas.
‘We have a house at the mountains and go each weekend with friends and girlfriends.’
(Margaza and Bel 2006: 95, ex. 7)

Different theories have arisen to explain why this might be. For instance, the Interface Hypothesis suggests the null subject system is difficult to acquire because it lies on the interface between multiple domains, specifically pragmatics and syntax (Sorace 2011). Alternatively, the Interpretability Hypothesis works with the concept of uninterpretability from the Minimalist framework, suggesting second language learners past the critical threshold, i.e. adults, are only able to access interpretable (that is, semantically active as well as syntactically present) features (Hawkins and Hattori 2006; Tsimpli and Dimitrakopoulou 2007; Tsimpli and Lavidas 2019). This means that when they come across uninterpretable features such as those on null subjects, they misanalyse when to use them and overcompensate with features that are interpretable: overt subjects. If it is indeed the case that null subjects are harder for adult learners to acquire, then, following Dahl’s (2004) definition of complexity as L2-difficulty, the resultant increased use of overt pronouns is an act of simplification. Thus, language contact is simplifying when the feature being learned is difficult for adults to acquire (cf. Walkden and Breitbarth 2019). Trudgill (2011) further qualifies the type of contact that would lead to strain in acquiring new systems: short-term, loose-knit, adult language learning. Those conditions perfectly describe the context for African learners of Spanish in colonial Latin America.[4]

During this period, enslaved (West-)Africans were taken to the Caribbean and coastal South America (e.g. Bowser 1974). As adult second-language learners who would have had to quickly learn Spanish without consistent access to a dense network of native speakers, they would have struggled to acquire the L2-difficult null subject system, preferring overt subjects. Their children would then nativize this system. This origin is exactly what Sessarego (2013) proposes for the genesis of Afro-Hispanic Languages of the Americas (AHLAs). These varieties have been spoken by the subsequent generations of the original enslaved Africans and they demonstrate the kind of change predicted, namely an increased rate of overt pronouns. To verify this account, we need to look into the diachronic trajectory of SPE. Ideally, we would do this using historical AHLA data, but not enough exists in the written record to compare systematically over centuries. What we can do is track if this potential change might have bled into other varieties of Latin American Spanish, comparing it with AHLA data when we have it. To this end, I constructed CorDELES to attempt to find out whether overtness rates increase over the last five centuries, and if they do, if they are higher in countries with larger Afro-Hispanic populations.

3.2 Corpus composition

Although several western European languages are represented in parsed corpora that allow the simple extraction of null subjects, such a Spanish corpus (especially a Latin American one) does not exist. With this in mind, and due to the difficulty of counting null subjects, it was clear from the start that they would need to be annotated by hand. The simplest method would have been to use an extant corpus to export a list of every finite verb that I would have then annotated by hand. However, despite there being many excellent historical and/or dialectal Spanish corpora (e.g. CDE [Davies 2002], CORDIAM [CORDIAM n.d.], CORDE [Real Academia Española n.d.], and CDH [Real Academia Española 2013]), none provides the necessary metadata or is sufficiently POS-tagged or lemmatized so as to produce such a list. Instead, I created a corpus that I could balance and annotate directly.

CorDELES was designed to consist of 64 texts pulled from eight countries over the period 1500–1899 that are split into two genres. Table 1 shows a grid of the dimensions of the corpus as well as where each individual text (marked by an abbreviation of its title) fits into it. In the event that a prose text could not be found, one written in verse was included. Unfortunately, an extant text for every cell combination still could not be found, resulting in a shortage of seven texts. For a full listing of texts, please see Appendix A.

Table 1:

CorDELES composition. DR is short for the Dominican Republic while LIT and DOC stand for ‘literature’ and ‘documents’, respectively. Dashes indicate a gap in the corpus, i.e. no text could be found that filled the conditions of that cell. Bolded texts are those written by a known Afro-Hispanic author. Italicized texts’ authors were born in Spain. Lastly, starred cells are written in verse.

CARIBBEAN/CENTRAL SOUTH AMERICAN SPAIN
DR PANAMÁ CUBA PERÚ COLOMBIA BOLIVIA VENEZUELA
16TH
LIT ENT HGNI HDLI HNMI EVII* GDUI LAH
DOC SDJ CAR DRF NDP OYC RVP NDA CAN
17TH
LIT DPHJ LLDP* EDP* CEVP* VDM NHLC DQ
DOC DLYD LCDH CPVV GNRG PR ACRA
18TH
LIT LIVIE PJFC* PAD PPYM HVIP EOID ARJD
DOC ASD SPPH MC GSFB ALTU EAU
19TH
LIT GAL* HS* ADUE MYT IHDC JDLR VH CPC
DOC ALD MPE GDLH CRP SYL ADLA GDC QDEV

The eight countries can be grouped into three sections on a geographical basis: Caribbean/Central America (Dominican Republic, Cuba, and Panama), South America (Peru, Colombia, Bolivia, and Venezuela), and Peninsular Spanish as a baseline. Each Latin American country was selected based on whether a known AHLA community lives there[5] (Klee and Lynch 2009). AHLA populations are densest in coastal areas, reflecting where their enslaved ancestors were originally taken to. The Caribbean islands therefore hold the densest populations and should be the most likely to use overt subjects (Orozco and Guy 2008). To check the relationship between expected population level[6] and pronoun usage, countries with low (Panama, Bolivia, Peru), mid (Venezuela, Colombia), and high (Cuba, Dominican Republic) Afro-Hispanic populations from the Caribbean, Central America, and South America were all represented. Admittedly, it is less than ideal to categorize Latin American varieties by country, as many of these countries already contain their own well-differentiated dialectal regions, not to mention that many of these dialects cross borders (cf. Orozco and Díaz-Campos 2016; Orozco 2018). However, the limited availability of texts made it difficult enough to fill each cell in Table 1 based only on country. Narrowing the pool to certain sub-regions would have resulted in even more gaps in the corpus. Further, information on provenance beyond country was not even available for many texts. For these reasons, the corpus was regionally delineated by country. Genre was accounted for in a rough literary versus non-literary break to balance the corpus in case it influenced SPE. Because the corpus is so small (3,773 tokens of pronominal subjects) in relation to the number of parameters and there is already a paucity of texts for certain country/century combinations, the corpus would not have been able to be adequately balanced for genre on a more fine-grained level. However, as Section 4 demonstrates, orality is a much better metric for analyzing textual effects on SPE in the corpus. Diachronically, the 16th century is included to create a baseline of the Spanish spoken in Latin America when the first Africans arrived.

The main sources for the texts were Cervantes Virtual (Biblioteca Virtual Miguel de Cervantes 1999), Biblioteca Digital Hispánica (La Biblioteca Nacional de España 2008), and the Digital Library of the Caribbean (Digital Library of the Caribbean 2004). A continuous sample of 2,000–3,000 words was randomly selected from each text. CorDELES totals 151,295 words. Some supplemental texts that were of interest but did not fit into the main corpus were also set aside, most of which will be included with data from the main corpus in the following section. The first of these texts is a transcript from an Afro-Bolivian field interview conducted in 2010 (pulled from several such transcripts provided in Sessarego [2011]). The rest are from Cantos populares de mi tierra and Secundino el zapatero, a series of poems and a play written by Candelario Obeso, an Afro-Colombian poet and playwright from the 19th century. Obeso wrote in Afro-Colombian, providing rare insight into actual historical AHLA language. These four texts also make for a fascinating comparison as they are written by the same author but split across three genres: play, poetry, and prose (pulled from the dedication). Obeso also provided translations of his poetry into standard Spanish, so any marked difference in SPE would be very telling. Each sample was then transcribed where necessary, POS-tagged by an automated tagger, and annotated by hand in XML.

3.3 Annotating null subjects

Part of the necessity of creating a new annotated resource rather than comparing figures from previous studies is that null subjects are very difficult to count consistently. Many judgements need to be made as to what counts as a null subject, and no two studies have taken the exact same approach, making their data tough to compare (as pointed out in de Prada 2009; Martinez-Sanz 2011; Travis 2007). In order to reliably compare frequencies from different countries, they need to have been annotated under the same guidelines. So, before looking at the data for this corpus, we must walk through some methodological decisions that needed to be made. The motivation behind these decisions was to only operate within the envelope of variation (cf. Orozco and Hurtado 2021a; Otheguy and Zentella 2007); that is, to only extract tokens for which both the presence and absence of a subject pronoun were possible. In other words, the goal was to annotate every environment where alternation was possible and only those environments.

Firstly, the focus of CorDELES is the relationship between the subject and finite verb of each clause, meaning subject pronouns of nonfinite clauses were not considered. Imperatives were also excluded due to their unique relationship with subject pronouns that allows their omission even in NNSLs. Similarly, cases of coordination were counted as only one realization of the pronoun since NNSLs are able to omit the second pronoun. E.g. come y bebe mucho ‘he/she eats and drinks a lot’ would count as only one null subject (that was elided for the second VP) rather than two.[7] Fixed expressions such as hace tiempo ‘time ago’ and es decir ‘that is to say’ were excluded.

Most importantly, null pronouns were tagged for animacy and referentiality because it is not possible (or at least exceptionally rare in the case of inanimacy) in standard Spanish for inanimate, non-referential/impersonal subjects to be overt. If a variety were to realize such an expletive subject overtly, it would be immediate proof that it could no longer be classified as a consistent NSL (cf. Roberts and Holmberg 2010). If this were to occur, it would be one of the final stages in the transition from an NSL to a NNSL. However, such a change is not expected to appear in the corpus (and indeed does not), so I am currently only interested in the environment where overt pronouns might first start overtaking null pronouns: animate, referential subjects. In addition to SPE, each clause was tagged for clause type, inversion, verbal and pronominal morphology, co-referentiality with the previous subject, priming, and interrogative status. The purpose of the present study is to analyze the potential regional and diachronic trends, so I will only consider (referential, animate) pronouns at the moment.

The biggest challenge for replicability came with some of the oldest texts in the corpus whose syntax was often difficult to parse. They contained many run-ons accompanied by lots of relative clauses that left whether a subject was coordinated or not, as well as whether it shared reference with the previous subject, unclear. In those instances, I defaulted toward creating a new null token with a different referent. I was as consistent as possible in my annotation so the texts are as comparable as can be reasonably expected, especially as I was the sole annotator. Additionally, sometimes handwritten texts could not be fully transcribed as some words were lost to bindings or smudged. If these words seemed to be the subject or finite verb of a clause, they were marked ‘unintelligible’ and excluded.

3.4 Measuring orality

Recall that Rosemeyer (2019) measured orality based on the presence of the private verbs achar ‘mean’, pensar ‘think’, acreditar ‘believe’, and crer ‘believe’; the present progressive, the demonstrative neuter pronouns isso and isto ‘this’; the time and place adverbs aqui ‘here’ and agora ‘now’; and the discourse markers ‘isn’t it?’, bom ‘well’, pois ‘so’, então ‘so’, and olha ‘listen’. These factors were chosen to measure BP orality, but, because of its genetic and typological closeness to Spanish, they can easily be adapted for use in my corpus. To create a slightly easier number of variables to find and count, I narrowed the category of private verbs down to creer ‘to believe/think’ and pensar ‘to think’ and excluded discourse markers. The rest of the variables translate easily, leaving us with the progressive; the demonstrative pronouns esto ‘this’, eso ‘that’, and aquello ‘that’; and the time and place adverbs aquí ‘here’ and ahora ‘now’. The oral nature of these factors is confirmed when they are searched for by genre in the Corpus del Español (Davies 2002). Figures 18 each demonstrate that although the frequency of every oral marker fluctuates over time, they all consistently increase in tokens from the genre with the lowest orality to the highest: academic (ACAD) → news (NEWS) → fiction (FICT), → oral (ORAL).[8]

Figure 1: 
Creo ‘I think/believe’ (n = 35,204).
Figure 1:

Creo ‘I think/believe’ (n = 35,204).

Figure 2: 
Pienso ‘I think’ (n = 8,992).
Figure 2:

Pienso ‘I think’ (n = 8,992).

Figure 3: 
Estoy + gerund ‘I am X-ing’ (present progressive) (n = 5,126).
Figure 3:

Estoy + gerund ‘I am X-ing’ (present progressive) (n = 5,126).

Figure 4: 
Esto ‘this’ (n = 163,710).
Figure 4:

Esto ‘this’ (n = 163,710).

Figure 5: 
Eso ‘that’ (n = 87,152).
Figure 5:

Eso ‘that’ (n = 87,152).

Figure 6: 
Aquello ‘that’ (n = 19,949).
Figure 6:

Aquello ‘that’ (n = 19,949).

Figure 7: 
Aquí ‘here’ (n = 79,282).
Figure 7:

Aquí ‘here’ (n = 79,282).

Figure 8: 
Ahora ‘here’ (n = 64,980).
Figure 8:

Ahora ‘here’ (n = 64,980).

Once the validity of the orality measurement had been confirmed, I applied it to CorDELES. In order to quantify the gross counts in a comparable manner, I created the ORSCORE, which is derived by firstly aggregating all of the frequencies for each text, dividing them by the word count of that text, and then multiplying this figure by 100. For instance, the 19th century Spanish play CPC (the most oral text in the corpus) had 11 instances of private verbs, 6 of the progressive, 22 of demonstrative neuter pronouns, and 14 of the time and place adverbs, totaling 53 tokens. This gross frequency was then divided by CPC’s word count of 2,260, resulting in the quotient 0.02345. To make this figure more accessible, it was multiplied by 100 to reach the final ORSCORE of 2.35. On the other end of the spectrum, the 16th century Dominican court document SDJ had only 1 instance of a private verb and 0 instances of any of the other factors. Its word count was 2,008, creating the quotient 0.0005 and the final ORSCORE of 0.05.

3.5 Testing with mixed-effects models

A series of statistical models were run to test the effects of orality, time, and region on SPE. Each model was generated using the lme4 package (Bates et al. 2015) in R (R Core Team 2021). A simple linear regression model (lm) is used to first establish the relationship between orality and SPE. In other words, a single continuous response variable (the proportion of overt subject pronouns per text) is modeled as a function of the predictor (ORSCORE). A regression line can then be plotted to represent the conditional mean formed by the model. Coefficients are the numerical values that describe this relationship and are the basis for predicted values. The differences between the model’s predicted values and the actual data points are called residuals. These residuals can be used to determine how well the model explains the data by being compared with the residuals generated by a null model. This measurement of model fit is called R2.

The contact hypothesis, however, cannot be evaluated by a simple linear model for several reasons. First, it deals with more than one predictor (time, region, and eventually orality). One of the benefits of looking at more than one predictor in a model is that one can test for interactions. If there is an interaction between two predictors, it means that one predictor influences the response variable differently depending on the second predictor. For instance, an interaction between gender and age might mean that younger men behave differently than older men in a way that cannot be accounted for by just age or just gender. Second, at least one of these predictors is a categorical variable (region). Third, the response variable is also a categorical variable. For the linear model, SPE is turned into a continuous variable (the proportion of over subject pronouns) in order to simply visualize and analyze the relationship between orality and SPE with each text as a single data point. However, when looking at multiple predictors with numerous levels, it is better for the model to be fed each token as a single data point. Because there are only two options for the response variable (null or overt), it can be called binary. The categorical nature of the response variable means a logistic (not linear) regression needs to be run. In other words, the probability of the subject pronoun being overt instead of null is predicted. Lastly, random effects also need to be accounted for. All of the predictors previously listed are fixed effects which means that they are uniformly predictable, constant, and replicable. Random effects, on the other hand, are idiosyncratic and thus unique to each study. For instance, in an experiment, ‘participant’ would be treated as a random effect to control for differences because tokens from Participant A are more likely to pattern closely together than they are with Participant B’s tokens. For a corpus this could be anything like author, text, or newspaper. For CorDELES, the unique code for each text (docID) is specified as a random effect. Mixed-effects models are used to analyze both fixed and random effects. To fit a logistic regression to a mixed-effects model, glmer is used. This time, model fit is assessed by the Akaike information criterion (AIC) whereby the smaller the number is, the better the fit. The regression results for each model, including coefficient size, p-value, and model fit, are reported in Tables 24[9] in the following section.

Table 2:

Orality-overtness linear models.

Dependent variable:
Proportion of overt pronouns
(Total) (No “Outliers”)
ORSCORE 0.100 0.144
(0.023) (0.042)
p = 0.0001*** p = 0.002**
Constant 0.050 0.034
(0.015) (0.020)
p = 0.003** p = 0.097
Observations 37 35
R 2 0.351 0.262
Adjusted R2 0.332 0.239
Residual Std. error 0.064 (df = 35) 0.064 (df = 33)
F statistic 18.906*** (df = 1; 35) 11.693*** (df = 1; 33)
  1. *p < 0.05; **p < 0.01; ***p < 0.001.

Table 3:

Initial mixed-effects model.

Dependent variable:
SPE
Scale (year) 0.458
(0.152)
p = 0.003**
Constant −2.428
(0.149)
p = 0.000***
Observations 3,773
Log likelihood −1,275.090
Akaike Inf. Crit. 2,556.179
Bayesian Inf. Crit. 2,574.886
  1. *p < 0.05; **p < 0.01; ***p < 0.001.

Table 4:

Final mixed-effects model.

Dependent variable:
SPE
Scale (year) 0.050
(0.154)
p = 0.744
Scale (ORSCORE) 0.030
(0.269)
p = 0.0002***
Macro_RegionAmericas 0.629
(0.320)
p = 0.049*
Scale (year):scale (ORSCORE) −0.437
(0.206)
p = 0.034*
Constant −2.548
(0.276)
p = 0.000***
Observations 3,773
Log likelihood −1,268.101
Akaike Inf. Crit. 2,548.202
Bayesian Inf. Crit. 2,585.615
  1. *p < 0.05; **p < 0.01; ***p < 0.001.

4 Results by text

The data in this section comes from the core of CorDELES (totaling 99,282 words): Dominican Republic, Panama, Bolivia, Colombia, and Spain. Section 4.1 establishes the correlation between orality and overt SPE. The contact hypothesis is then analyzed with orality excluded from the model in 4.2 and included in 4.3.

4.1 The orality correlation

The ORSCORE of each text in the corpus is plotted against the proportion of overt subject pronouns in each text in Figure 9. We can see that, indeed, an orality effect is at play in the corpus data as there is a clear positive correlation between ORSCORE and the rate of overt SPE in a text. This effect is corroborated by the results of a linear model. I ran lm in R (R Core Team 2021), with ORSCORE as a predictor of the proportion of overt pronouns, finding a clearly significant relationship (β = 0.100, p > 0.0001, R2 = 0.351 ± 0.02). These results are summarized in the middle column of Table 2. The R2 score of 0.351 means that 35 % of the variance in the data can be explained by this model which is a relatively high level of explanation given all the additional factors that have been known to affect SPE (e.g. focus, switch-reference, person, etc. cf. Section 1).

Figure 9: 
Orality and overtness regression line (n = 37). Each plot in this paper was created using ggplot2 (Wickham 2016) in R (R Core Team 2021).
Figure 9:

Orality and overtness regression line (n = 37). Each plot in this paper was created using ggplot2 (Wickham 2016) in R (R Core Team 2021).

There may appear to be two outliers with respective ORSCOREs of 1.77 and 2.35 in the upper right of the graph that are skewing the data. However, as the rightmost column in Table 2 shows, the p-value is still clearly significant (p < 0.002) when these two texts are excluded. In fact, the coefficient is actually larger, as illustrated by the steeper slope of the regression line in Figure 10, showing the strength of the relationship. Further, the two apparent outliers are the Bolivian oral interview (IIA) and a rapid-paced Spanish play (CPC), the two most impressionistically oral texts of the corpus, as demonstrated by examples (3) and (4). This again confirms the power of the measurement in quantifying orality.

Figure 10: 
Orality and overtness regression line (‘outliers’ excluded) (n = 35).
Figure 10:

Orality and overtness regression line (‘outliers’ excluded) (n = 35).

(3)
An excerpt from IIA:
Mt:10 ¿Y usté sabe mi nombre?
‘And you know my name?’
ss: Eeeh, sí, Torres el apellido…
‘Uh, yes, Torres the surname…’
mt: ¿Usté sabe mi nombre?
‘You know my name?’
ss: Sí, Maclobia Torres.
‘Yes, Maclobia Torres.’
mb: ¡Eso!
‘That’s it!’
ss: ¿Y el segundo apellido?
‘And the second surname?’
mt: Bediriqui, Torres Bediriqui.
‘Bediriqui, Torres Bediriqui.’
  1. 10

    The speaker abbreviations stand for the interviewees Manuel Barra (MB) and Maclobia Torres (MT) and the interviewer Sandro Sessarego (SS).

(4)
An excerpt from CPC:
D.’ CAR.11 (Con indignacion.) Está usted asesinando á mi hija.
‘(With indignation.) You are murdering my daughter’
VIDAL. Señora, usted no entiende de negocios.
‘Ma’am, you don’t understand business.’
CAROL. ¡Mamá, por Dios!
‘Mom, for God’s sake!’
D.’ CAR. Yo entiendo de todo, y cuando pasan rábanos los compro.
‘I understand everything, and when radishes come, I buy them.’
VIDAL. Yo no compro… hay tendencias á la baja.
‘I don’t buy…there are downward trends.’
  1. 11

    D.’ CAR., CAROL., and VIDAL., are abbreviations for the names of the characters Doña Carolina, her daughter Carolina, and her son-in-law Señor Vidal.

Since orality has been demonstrated to have a significant effect on SPE, its influence on the rest of the corpus can now be investigated.

4.2 SPE over time

First, let us look at the effects of time and region on SPE without orality being accounted for. The following bar charts (Figure 11a–e) show the proportion of null and overt subject pronouns per text. Each bar is a text identified by an abbreviation of its title, e.g. ‘ENT’ refers to the 16th century Dominican text titled Entrémes (cf. Table 1 and Appendix A for a full listing). These texts are ordered chronologically. The non-literary genre is illustrated with a solid border, the literary genre with a dotted border. The bars with a dashed border are the supplemental texts from Afro-Hispanic speakers.

Figure 11: 
The bar charts in (a–e) give the proportion of SPE in Bolivia (n = 512), Colombia (n = 1,223), Dominican Republic (n = 573), Panama (n = 575), and Spain (n = 1,043).
Figure 11:

The bar charts in (a–e) give the proportion of SPE in Bolivia (n = 512), Colombia (n = 1,223), Dominican Republic (n = 573), Panama (n = 575), and Spain (n = 1,043).

The proportion of overt subjects appears to slightly increase in the last two centuries for most of the countries (setting aside the supplemental texts from Colombia), the only exception being the Dominican Republic which remains pretty consistent apart from ASD. That being said, these trends need to be taken with a grain of salt. Due to text scarcity, Bolivia and Panama are both missing multiple cells from this period, including the entire 17th century for Bolivia and the 18th for Panama. Further, even though there is a large gap between the last two Panamanian texts (MPE and HS), they are separated by only three years, not nearly long enough for such a rise to be the result of diachronic change alone. With that taken into consideration, there is no longer any obvious sizeable increase for Panama. This leaves Colombia as the only clear Latin American country with a noticeable increase in overt subjects. However, Spain also shows a steady increase during this period, contradicting a colonial contact account. On the other hand, since there are only a maximum of two texts per century-country combination, it is necessary to first rule out any other potential effects before drawing any conclusions.

Although there does not appear to be any genre distinction in any of the countries,[12] I ran a mixed-effects logistic regression model using glmer from the lme4 package (Bates et al. 2015) (cf. Section 3.5) to see if it could pick up any underlying patterns that are unclear from visual inspection of the bar charts. SPE was run as a binary dependent variable while country, year (z-scored), and genre (literary vs. non-literary) were fixed effects, and docID was a random effect.[13] Two other regional groupings with fewer levels were also run: region (Caribbean, South America, Spain) and macro-region (Spain vs. the Americas). Over 20 versions of the model were run, testing for every possible combination of fixed effects with and without interactions. Some iterations of the model did not have enough data to converge. These were versions with multiple interactions or interactions with country due to the numerous levels involved. Of the models that converged, year was the only predictor to ever emerge significant (β = 0.458, SE = 0.152, p < 0.003). Similarly, the model with the best fit according to AIC scores was the model that only specified year as a fixed effect. These results are reported in Table 3. In sum, a positive relationship between year and overt SPE was found, i.e. pronouns are more likely to be overt over time.

Even though genre does not have a significant effect in the model, something similar to genre seems to be influencing the data. The texts with some of the highest overt SPE rates also show high levels of dialogue: ASD is an 18th century Dominican court document that transcribes the testimony of witnesses describing events; HS is a 19th century Panamanian series of poems that frequently address a specific audience in 2nd person; the poetry from the supplemental 19th century Afro-Colombian text CPMT similarly uses frequent dialogue;[14] IIA is the 21st century transcript of an Afro-Bolivian oral interview; DQ is a sample from the 17th century Spanish novel Don Quixote with a lot of dialogue; and CPC is the 19th century Spanish play with a lot of rapid back-and-forth. Given IIA and CPC also had the two highest ORSCOREs in the corpus (cf. Section 4.1), orality seems to be the key to this pattern. Before verifying orality’s role by introducing it into the model, it is important to rule out other potential sources for this discourse-heavy trend.

All of these texts seem to have rich environments for discourse-switching in which the referent for each subject frequently changes clause to clause. Given that switch-reference has a well-known effect on SPE (e.g. Abreu 2012; Alfaraz 2015; Otheguy and Zentella 2007; Silva-Corvalán 1994; Travis 2005), I had already tagged each pronoun for whether it shares reference with the previous subject token or not (cf. Section 2). However, co-reference levels did not correlate with overt SPE in any of the countries. If the high discourse levels are not a function of switch-reference, then maybe genre was on the right track and this pattern is the result of textual differences.

The next logical step following genre would be to look at subgenre/text-type, specifically plays since they are intuitively expected to have the highest levels of dialogue. There are four plays in the corpus (ENT from 16th century Dominican Republic, ARJD from 18th century Spain, CPMTpl from 19th century (Afro-)Colombia, and CPC from 19th century Spain), and they make up a wide range of overt pronominal rates: 11 %, 3.8 %, 10 %, and 22 %, respectively. This variation is particularly striking considering the lowest and highest of the plays’ frequencies are from the same country and only a century apart. Recall, however, that Rosemeyer (2019) points out that it is misleading to assume that all plays should behave uniformly because they may differ in their degree of orality. For BP, stylistic change in the level of orality represented in plays was creating ‘apparent’ change in the diachronic patterning of certain wh-interrogatives (cf. Section 2). In order to evaluate whether orality is doing the same thing to SPE rates in this corpus, I use the measurement of orality outlined in Section 3.4 to analyze its effects in CorDELES in the next section.

4.3 An interaction between orality and time

In Section 4.1 I discovered a relationship between orality and overt SPE in CorDELES. In the same corpus, I also found a significant correlation between year and overt SPE in Section 4.2. The question follows as to whether the former pattern is a real trend responsible for an artificial inflation of the latter. This would mirror Rosemeyer’s (2019) findings but be detrimental for the ability of this corpus to accurately investigate the contact hypothesis. There is another possible eventuality in which the inclusion of orality would give the mixed-effects model more clarity, revealing nuances to the relationships SPE has with year and region that could not be captured with the orality noise unaccounted for. Recall that the R2 value for the linear model run in Section 4.1 was only 0.351, meaning only 35 % of the variance in the data can be explained by the relationship between orality and proportion of overt pronouns. Despite the strong relationship in Figure 9, there is still clearly a high degree of variation in overt SPE rates for texts of a similar ORSCORE. For instance, the three texts DLYD (from 17th century Panama), ADLA (from 19th century Bolivia), and DQ (from 17th century Spain) all had similar ORSCOREs of 0.48, 0.42, and 0.44, respectively, whereas their proportion of overt subjects varied appreciably with rates of 2.9 %, 10 %, and 17 %, respectively. Much of this variation is likely due to linguistic influences such as pragmatic conditioning (e.g. focus, emphasis, contrast), but it also leaves room for the possibility that there are underlying regional (and perhaps further diachronic) trends in the corpus still to be uncovered. The next step is to try and tease those apart from this robust orality effect. Either way,[15] the inclusion of orality in the model is paramount to get an accurate analysis of the diachronic and regional trends in CorDELES.

The best way to do this is to include orality as a predictor in the mixed model. Glmer from lme4 (Bates et al. 2015) was again run in R (R Core Team 2021) with SPE as a binary dependent variable and docID as a random effect. Again, every possible combination of predictors and interactions that included orality was run. Once more, the number of tokens did not span far enough to cover iterations with multiple interactions or interactions with country. Predictably, every model that did converge found ORSCORE to be significant. What is new is that certain combinations found each regional grouping and/or its interaction with ORSCORE to be significant. Of all these models, the one with the lowest AIC (i.e. the model that best described the spread of data) had ORSCORE, year (both z-scored), and macro-region (Spain vs. the Americas) as fixed effects with an interaction between year and ORSCORE.[16] The results of this model are summarized in Table 4 which reveals that this final model returns three significant findings. First, as expected, ORSCORE is the most significant predictor with an effect size of 1.030 (this is the coefficient that determines the slope of the regression line, cf. Section 3.5) and a p-value less than 0.001.

Second, now that orality is included in the model, year is no longer significant on its own; however, its interaction with orality is (β = −0.43702, SE = 0.20634, p < 0.034177). The negative coefficient means that for every year that increases, the effect of orality on overtness decreases. In other words, the texts are becoming more oral over time, but their levels of overtness are also increasing beyond what orality can explain on its own. This trend is clear with the use of plogis in R (R Core Team 2021) to calculate the predicted probabilities for this interaction given a specific ORSCORE. In other words, one can provide a start year, end year, and precise ORSCORE and the function will return the proportion of overt SPE that the model predicts for the given orality level for each year. For instance, for the median ORSCORE of 0.45, the rate of overt SPE is predicted to increase from 5.7 % to 11 % from 1,500 to 1,900. These predictions are plotted in Figure 12. Note that the gaps between lines become smaller over time, narrowing toward each other. This is the effect of the interaction between orality and year at play: the difference in SPE rates between orality levels is not as large in 1,900 as it is in 1,500. Surprisingly, while the minimum (0.05), median (0.45), and mean (0.66) ORSCOREs all show a steady increase over time, the maximum ORSCORE (2.35)[17] predicts a decrease from 79 % to 30 %. This prediction does not match the actual data as none of the SPE proportions pass 30 % to begin with. This discrepancy is likely because there are only three texts with ORSCOREs above 1.00, two of which are more than double and triple the mean ORSCORE. Since the model has so little data for such high ORSCOREs (and none before the year 1877), it likely cannot make accurate diachronic predictions for ORSCOREs that large. This is not to say that the interaction is not present for high ORSCOREs, just that they are more likely to pattern like the rest of the ORSCOREs affected by the interaction. In other words, the overt SPE rates predicted for an ORSCORE of 1.08 would increase over time,[18] but the gap between them and the rates for a 0.66 ORSCORE would still shrink.

Figure 12: 
Predicted SPE across ORSCOREs over time.
Figure 12:

Predicted SPE across ORSCOREs over time.

In addition to the interaction between year and orality, an effect of macro-region has also come out significant in the new model (β = 0.62880, SE = 0.31997, p < 0.049389). This means that the American data favors overt SPE more than Spain. Although this is hardly surprising as the dialectal differences in SPE have been well-attested (thus motivating the present study, cf. Section 1), it was troubling that the original model did not find a difference. That being said, while the effect is consistent with the regional half of the contact hypothesis, it does not fully answer it. The contact hypothesis specifically predicts an interaction between macro-region and year or between year, orality, and macro-region. This would have shown that the two regions do not just behave differently, but came to behave differently diachronically, i.e. after the catalyst of language contact in the Americas. However, neither of those models found these interactions significant. Indeed, the three-way interaction model did not find anything significant at all, even orality. Their AIC scores were accordingly higher than the final model’s, meaning they did not explain the data’s distribution as well.

In sum, the new model proves that there is a diachronic effect in the corpus which is tied to orality. It has also demonstrated that a regional trend seems to have been obscured by orality, something not apparent in either Figure 11a–e or the initial models. The following section will flesh out the implications of these findings and the role of orality in general.

5 Discussion and further directions

The goal of this paper was to first establish whether orality had an effect on SPE, and if so, evaluate how this relationship affects a corpus. This corpus was CorDELES which was built to investigate a language contact origin for the overproduction of overt SPE in certain varieties of Latin American Spanish. Orality was in fact found to positively correlate with the proportion of overt SPE per text. While its inclusion in the mixed-effects model used to analyze the corpus does not fully support the contact hypothesis, it does reveal a previously obscured distinction between Spain and the rest of the data as well as a curious interaction between orality and time. Specifically, this interaction means that the effect orality has on SPE has lessened over time.

There are five major insights from these orality results in CorDELES. Firstly and most broadly, these results support Rosemeyer (2019) and Szmrecsanyi’s (2016) arguments that environmental or ‘apparent’ change in historical corpus data need to be accounted for. It is easy to mistake artificial stylistic changes for what appear to be actual diachronic changes, but the trappings of a genre are not always constant and cannot be assumed to be so. That genre is not a direct correlate for orality levels is made clear by the four Obeso texts. Impressionistically, the play should be the most oral, yet it is the poetry whose orality is so high that it surpasses the bulk of the data’s. Current and future corpus study needs to make more of an effort in accounting not only for genre but also orality in order to better balance corpora over time. Secondly, these results also build on Rosemeyer’s pinpointing of fluctuating orality as this kind of deceptive environmental change since the effect of year appears to be more significant before orality is taken into account. However, ‘unapparent’ change also appears to be covered up by orality as a macro-region distinction between Spain and the Americas only emerges significant once orality is included in the model. This finding also affects the diachronic increase. Before orality, the increase did not discriminate between Spain and the rest of the data which was a problem for the contact hypothesis. However, the effect of orality seems to have created an ‘apparent’ but not genuine diachronic rise in overtness in Spain. This is supported by the fact that the most recent text from Spain (CPC) is the text with the highest orality out of the entire corpus. In other words, in addition to finding artificial patterns caused by changes in orality, I also initially found no discernable regional patterns because of the correlation between orality and overt SPE. More practically, this study also supports the validity of Rosemeyer’s orality measurement and encourages its adaptation for other corpus work. Thirdly, this positive correlation speaks to the reliability of historical null subject research. It suggests that pronouns are more likely to be realized as overt in oral speech than written texts. While this is not an entirely unexpected development, it nonetheless forms a chink in the already frayed armor of using historical corpora to track spoken change. However, all hope is not lost.

An unexpected finding emerged in the form of an interaction between orality and year. It is tempting to interpret this interaction as another orality-led case of ‘apparent’ but not real change, i.e. the increase in overt SPE over time is really just an increase in orality over time. Written texts, then, would have undergone a stylistic change to more closely reflect spoken language just as was the case for some of the patterns for BP wh-interrogatives (Rosemeyer 2019). In other words, speech would have always used a consistent SPE rate and the style shift towards more oral texts simply represents the written register catching up. At least, this would be the case for the time span of the corpus which ends in 1900, since the frequently attested overproduction of overt SPE in the Caribbean Spanish spoken today (cf. Section 1) means change needs to have occurred at some point. Crucially, this means that this ‘apparent’ change account is not inconsistent with the contact hypothesis. The latter change may have simply taken longer to spread to the (mostly standard) Spanish used when writing such that 1900 is too early a cut-off point for the corpus. Indeed, this timeline would parallel the change in BP which only really began in the mid-19th century (Duarte 1993). Alternatively, this attested rise might be a change from below. Specifically, the overproduction of overt SPE by L2 learners and their subsequent generations would first spread to the speech of others. In written texts, this change would subsequently be captured in the most oral texts first. This type of change was also found in Rosemeyer’s (2019) study.

Should a story of colonial contact-initiated simplification over time be borne out, the slowness of the increase in overt pronouns could be due to the underlying dynamics of change in null subjects. Kauhanen (2022), following Yang (2002), created a model to predict the behavior of L1 and L2 variational learners, using the case study of null subject acquisition in Afro-Peruvian Spanish. According to his model, the proportion of L2 speakers in a population that the critical threshold would require for full simplification (100 % overt subjects) is 0.9 (on a scale of 0–1.0). Estimates of the actual historical proportion for Afro-Peruvians vary between 0.2 and 0.6, nowhere close to the necessary proportion. As this number is exceedingly high and difficult to reach, it is unlikely that any of the countries under study would have reached this threshold, explaining why the increase in overtness (or partial simplification) observed in this corpus is so slow.

Another potential explanation for the interaction is that a new factor that influences SPE rates emerged at some point over the corpus. As it gained influence over time, the other factors that determined SPE (e.g. orality) lost some of their influence in turn. For the contact hypothesis, this new factor would be the influx of L2 learners and their subsequent generations. This third account is not borne out by the corpus results at present, since a regional interaction with time could not be found. The extension of CorDELES to Peru, Venezuela, and Cuba may shed more light on the matter. Regardless, the investigation of the contact hypothesis in a corpus that is controlled for orality (i.e. a corpus where all texts are of similar orality levels) would be ideal. If overt SPE still increases in the same time period covered by CorDELES, the ‘apparent’ change account is false and the contact hypothesis is partially supported. For it to be fully supported, the American data would also have to behave distinctly from Spain in a meaningful way.

This leads to the one last loose end that needs to be tied up. The story for simplification initiated by the origin of AHLAs is broadly supported by the significance of Spain versus the Americas in the final model. However, an interaction with time could not be found, even when accounting for orality. It is possible that the model simply does not have enough data yet to capture any such distinctions. This can be solved by extending CorDELES to include data from Cuba, Peru, and Venezuela, feeding more data to the mixed models. A secondary purpose in extending the corpus now is to also confirm this orality effect in the rest of the data. The limitation of only two texts per country-century combination likely also makes it difficult to disentangle country versus individual text-level variation. The answer there would be to at least double the size of CorDELES. However, it was difficult to fill cells with just two texts so this could result in an even heavier skew in the corpus. What is more plausible is that a region-based (Caribbean, South America, Spain) effect may still be borne out.

6 Conclusions

This paper first sought to address whether there is a correlation between orality and SPE. Indeed, a significant positive correlation between orality and overt SPE was found. The second research question asked whether the relationship between orality and SPE affected the analysis of a corpus, and the answer is a resounding yes. This outcome was achieved by assessing the results of a mixed model evaluating the theory that nontargetlike language learning by enslaved Africans during the colonial period led to the increased use of overt subject pronouns in communities throughout Latin America (Sessarego 2013; Sessarego and Gutiérrez-Rexach 2017). Concretely, the hypothesis was that overt SPE should increase over time at a higher rate in the Caribbean than South America and higher still than in Spain. The results of the model changed significantly once orality was included, uncovering a regional distinction between Spain and the Americas and an interaction between orality and time. Several possible explanations for this interaction were put forth, all of which encourage taking orality in account in larger corpora to attempt to replicate and confirm this effect. Similarly, although the contact hypothesis could not be fully supported by the corpus as stands, the results found are not inconsistent with a history of language contact either. The results of this study have discovered a relationship between SPE and orality that has far-reaching consequences for both future corpus work and pronominal subject study.


Corresponding author: Gemma McCarley, University of Toronto, Toronto, Canada, E-mail:

Award Identifier / Grant number: 851423

Acknowledgments

I would like to thank George Walkden, Henri Kauhanen, Raquel Montero, and Molly Rolf for discussions. The research reported in this paper was carried out as part of the project STARFISH; this project received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement no 851423).

  1. Data and Code Availability: All data and code necessary to replicate these results, as well as metadata on textual sources, can be obtained from https://github.com/GemmaMcCarley/null_subjects/tree/main.

Appendix A

Text Shortened title Country Year Author Source
dr16ent Entremés DR 1588 Cristóbal de Llerena CervantesVirtual
dr16sdj Sobre lo de la justicia DR 1500 SALALM
dr17dphj Discurso politico histórico jurídico DR 1658 Montemayor y Córdoba de Cuenca, Juan Francisco BDH
dr18livie La idea del valor de la Isla Española DR 1785 Antonio Sánchez Valverde BDH
dr18asd Autos seguidos en la Audiencia de Santo Domingo DR 1752 BDH
dr19gal Galaripsos DR 1883–89 Gaston Fernando Deligne Print
dr19ald A los Dominicanos DR 1857 Buenaventura Báez dLOC
pa16hgni Historia general y natural de las Indias Panamá 1535 Gonzalo Fernández de Oviedo y Valdés Cervantes Virtual
pa16car Carta del Adelantado Pascual Andagoya Panamá 1546 Pascual de Andagoya BDH
pa17lldp Llanto de Panamá a la muerte de don Enrique Panamá 1638 Mateo Ribera et al. Print
pa17dlyd Discurso legal y defensa Panamá 1695 Diego Ladrón de Guevara, Obispo de Quito BDH
pa19hs Hojas Secas Panamá c.1875 Amelia Denis de Icaza La Revista Cultural Lotería
pa19mpe Mensaje del Presidente del Estado Panamá 1872 Buenaventura Correoso La Revista Cultural Lotería
cu16hdli Historia de las Indias Cuba 1561 Bartolomé de las Casa Cervantes Virtual
cu16drf Declaración de Rodrigo de Figueroa Cuba 1520 Rodrigo de Figueroa CHARTA
cu17edp Espejo de paciencia Cuba 1608 Silvestre de Balboa Print
cu17lcdh La ciudad de la Habana, Cuba 1626–1630 BDH
cu18pjfc El príncipe jardinero y fingido Cloridano Cuba 1739 Santiago de Pita Cervantes Virtual
cu18spph El sesquicentenario del papel periódico de la Havana Cuba 1790 dLOC
cu19adue Autobiografia de un esclavo Cuba 1839 Juan Francisco Manzano OMEGALFA
cu19gdlh Gaceta de la Habana Cuba 1852 University of Miami Digital Collections
pe16hnmi Historia natural y moral de las Indias Perú 1591 José de Acosta Cervantes Virtual
pe16ndp Noticia del Perú Perú 1533 Miguel de Estete Cervantes Virtual
pe17cevp Coloquio entre la vieja y Periquillo sobre una procesión celebrada en Lima Perú 1645–1698 Juan del Valle y Caviedes Cervantes Virtual
pe17cpvv Carta de Pedro Vázquez de Velasco Perú 1665 Pedro Vázquez de Velasco BDH
pe18pad Paulina o el amor desinteresado Perú 1725–1803 Pablo de Olavide Cervantes Virtual
pe18mc Montemar Cartas Perú 1761–1765 Pedro Carrillo de Albornoz y Bravo de Lagunas University of Illinois Library
pe19myt Mujer y tigre Perú 1890 Ricardo Palma Cervantes Virtual
pe19crp Cartas de Ricardo Palma Perú 1893–1897 Ricardo Palma Cervantes Virtual
co16evii Elegías de varones ilustres de Indias Colombia 1589 Juan de Castellanos BDYCL
co16oyc Ordenanzas y comisiones para el Reino de Nueva Granada y Obispado de Quito Colombia c.1563 BDH
co17vdm Viaje del mundo Colombia 1614 Pedro Ordóñez de Ceballos Cervantes Virtual
co17gnrg Genealogias del Nuevo Reino de Granada Colombia 1674 Juan Flórez de Ocariz Cervantes Virtual
co18ppym Pensamientos políticos y memoria sobre la población del Nuevo Reino de Granada Colombia 1792 Pedro Fermin de Vargas y Sarmiento Cervantes Virtual
co18gsfb Gazeta de Santa Fe de Bogotá Colombia 1787 Cervantes Virtual
co19ihdc Ingermina o la hija de Calamar Colombia 1844 Juan José Nieto Gil Universidad de Antioquia Repositorio de proyectos
co19syl Sombra y Luz Colombia 1874–1899 Luis Antonio Robles Suárez Universidad del Rosario Repositorio institucional
bo16rvp Relacion al Virrey del Peru Bolivia 1500–1600 BDH
bo18hvip Historia de la villa imperial de Potosi Bolivia 1705–1736 Bartolomé Arzáns de Orsúa y Vela Brown University Library
bo19jdlr Juan de la Rosa Bolivia 1885 Nataniel Aguirre Cervantes Virtual
bo19adla El Atalaya de los Andes Bolivia 1839 Bernardino Palacios Cervantes Virtual
ve16gdui Geografía y descripción universal de las Indias Venezuela 1574–1598 Juan López de Velasco Cervantes Virtual
ve16nda Noticias de America Venezuela 1500s Pedro de Mendoza Cervantes Virtual
ve17nhlc Noticias historiales de las conquistas de Tierra Firme en las Indias Occidentales Venezuela 1626 Fray Pedro Simón Lau Haizeetara
ve17pr Pesquisa realizada Venezuela 1693 Juan de los Santos Saber UCV
ve18eoid El Orinoco ilustrado y defendido Venezuela 1745 José Gumilla Internet Archive
ve18altu Arca de letras y theatro universal Venezuela 1783 Fr. Juan Antonio Navarrete Cervantes Virtual
ve19vh Venezuela heroica Venezuela 1881 Eduardo Blanco Internet Archive
ve19gdc Gazeta de Caracas Venezuela 1808 Cervantes Virtual
sp16lah Libro de la anathomía del hombre Spain 1551 Bernardino Montaña de Monserrate TeXTReD
sp16can Cartas del abad de Nájera al emperador Carlos V Spain 1525 el abad de Nájera BDH
sp17dq Don Quijote Spain 1605 Miguel de Cervantes Project Gutenberg
sp18arjd Abeja racional en el jardin de los donayres Spain 1756 Pedro Jiménez y Fernández BDH
sp18eau El apologista universal Spain 1786 Pedro Centeno BDH
sp19cpc 4 por 100 :comedia en un acto y en prosa Spain 1885 Emilio Sánchez Pastor BDH
sp19qdev Quedan declarados en venta todos los bienes que hubiesen pertenecido a las Comunidades y Corporaciones religiosas extinguidas Spain 1836 Filosofía en español

Primary Sources

Biblioteca Virtual Miguel de Cervantes. 1999. Virtual Library Miguel de Cervantes. https://www.cervantesvirtual.com/.Search in Google Scholar

[CORDIAM]. Academia Mexicana de la Lengua. n.d. Corpus Diacrónico y Diatópico del Español de América. www.cordiam.org.Search in Google Scholar

DANE. 2010. Censo General 2005. Bogotá: Departamento Administrativo Nacional de Estadística.Search in Google Scholar

Davies, Mark. 2002. Corpus del Español: Historical/Genres. http://www.corpusdelespanol.org/hist-gen/.Search in Google Scholar

Digital Library of the Caribbean. 2004. Digital Library of the Caribbean (dLOC). https://dloc.com/.Search in Google Scholar

La Biblioteca Nacional de España. 2008. Biblioteca Nacional de España – Biblioteca Digital Hispánica (BDH). http://bdh.bne.es/bnesearch/Inicio.do.Search in Google Scholar

Real Academia Española. 2013. Corpus del Diccionario histórico de la lengua española (CDH) [en línea]. https://apps.rae.es/CNDHE.Search in Google Scholar

Real Academia española. n.d. Banco de datos (CORDE) [en línea]. In Corpus diacrónico del español. <http://www.rae.es>.Search in Google Scholar

Real Academia Española. n.d. De vos y otros. Diccionario de la lengua española, 23.a ed., [versión 23.8 en línea] RAE. https://dle.rae.es/ (Accessed 06 March 2025).Search in Google Scholar

References

Abreu, Laurel. 2012. Subject pronoun expression and priming effects among bilingual speakers of Puerto Rican Spanish. In Kimberly Geeslin & Manuel Díaz-Campos (eds.), Selected proceedings of the 14th hispanic linguistics symposium, 1–8. Somerville, MA: Cascadilla Proceedings Project.Search in Google Scholar

Alfaraz, Gabriela. 2015. Variation of overt and null subject pronouns in the Spanish of Santo Domingo. In Ana Maria Carvalho, Rafael Orozco & Naomi Lapidus Shin (eds.), Subject pronoun expression in Spanish: A cross-dialectal perspective, 3–17. Washington, DC: Georgetown University Press.Search in Google Scholar

Bates, Douglas, Martin Mächler, Ben Bolker & Steve Walker. 2015. Fitting linear mixed-effects models using lme4. Journal of Statistical Software 67(1). 1–48. https://doi.org/10.18637/jss.v067.i01.Search in Google Scholar

Biber, Douglas. 2001. Using corpus-based methods to investigate grammar and use: Some case studies on the use of verbs in English. In Rita C. Simpson & John M. Swales (eds.), Corpus linguistics in North America, 101–115. Ann Arbor: University of Michigan Press.Search in Google Scholar

Biber, Douglas & Susan Conrad. 2019. Register, genre, and style (2nd edn., Cambridge textbooks in linguistics). Cambridge: Cambridge University Press.10.1017/9781108686136Search in Google Scholar

Biber, Douglas & Edward Finegan. 1989. Drift and the evolution of English style: A history of three genres. Language 65(3). 487–517. https://doi.org/10.2307/415220.Search in Google Scholar

Bini, Milena. 1993. La adquisición del italiano: Más allá de las propiedades sintácticas del parámetro pro-drop. In Juana Muñoz Liceras (ed.), La lingüística y el análisis de los sistemas no nativos, 126–139. Ottawa: Dovehouse Editions Canada.Search in Google Scholar

Bowser, Frederick P. 1974. The African slave in colonial Peru 1524–1650. Stanford, CA: Stanford University Press.Search in Google Scholar

Cacoullos, Rena Torres & Catherine Travis. 2018. Assessing change and continuity. In Bilingualism in the community: Code-switching and grammars in contact, 136–159. Cambridge: Cambridge University Press.10.1017/9781108235259Search in Google Scholar

Camacho, José. 2013. Null subjects. Cambridge: Cambridge University Press.10.1017/CBO9781139524407Search in Google Scholar

Cameron, Richard. 1993. Ambiguous agreement, functional compensation, and nonspecific tú in the Spanish of San Juan, Puerto Rico, and Madrid, Spain. Language Variation and Change 5. 305–334. https://doi.org/10.1017/s0954394500001526.Search in Google Scholar

Cameron, Richard & Nida Flores-Ferrán. 2004. Perseveration of subject expression across regional dialects of Spanish. Spanish in Context 1. 41–65. https://doi.org/10.1075/sic.1.1.05cam.Search in Google Scholar

Carvalho, Ana M., Rafael Orozco & Naomi Shin (eds.). 2015. Subject pronoun expression in Spanish: A cross-dialectal perspective. Washington: Georgetown University Press.Search in Google Scholar

Cerrón-Palomino, Álvaro. 2018. Variable subject pronoun expression in Andean Spanish: A drift from the acrolect. Onomázein 1(42). 53–73. https://doi.org/10.7764/onomazein.42.02.Search in Google Scholar

Dahl, Östen. 2004. The growth and maintenance of linguistic complexity. Amsterdam: Benjamins.10.1075/slcs.71Search in Google Scholar

Duarte, Maria Eugênia. 1993. Do pronome nulo ao pronome pleno: A trajetória do sujeito no português do Brazil [from null pronoun to full pronoun: The trajectory of the subject in Brazilian Portuguese]. In Ian Roberts & Mary A. Kato (eds.), Português brasileiro: uma viagem diacrônica, 107–128. Campinas: Da Unicamp.Search in Google Scholar

Erker, Daniel & Gregory R. Guy. 2012. The role of lexical frequency in syntactic variability: Variable subject personal pronoun expression in Spanish. Language 88. 526–557. https://doi.org/10.1353/lan.2012.0050.Search in Google Scholar

Escobar, Anna María. 2011. Spanish in contact with Quechua. In Manuel Díaz-Campos (ed.), The Handbook of hispanic Sociolinguistics, 324–352. Hoboken: Blackwell Publishing.10.1002/9781444393446.ch16Search in Google Scholar

Escobar, Anna María. 2012. Spanish in contact with Amerindian languages. In osé Ignacio Hualde, Antxon Olarrea & Erin O’Rourke (eds.), The Handbook of hispanic Sociolinguistics, 65–88. Hoboken: Blackwell Publishing.Search in Google Scholar

Foulet, Lucien. 1928. Petite syntaxe de l’ancien français. Paris: Champion, troisième édition revue Réédition 1982.Search in Google Scholar

Gelderen, Elly van. 2011. The linguistic cycle. Oxford: Oxford University Press.Search in Google Scholar

Granda, Germán de. 1978. Estudios lingüísticos hispánicos, afro-hispánicos y criollos. Madrid: Editorial Gredos.Search in Google Scholar

Hawkins, Roger & Hajime Hattori. 2006. Interpretation of English multiple wh-questions by Japanese speakers: A missing uninterpretable feature account. Second Language Research 22. 269–301. https://doi.org/10.1191/0267658306sr269oa.Search in Google Scholar

Hlavac, Marek. 2022. stargazer: Well-Formatted regression and summary statistics tables. R package version 5.2.3. https://CRAN.R-project.org/package=stargazer.Search in Google Scholar

Kauhanen, Henri. 2022. A bifurcation threshold for contact-induced language change. Glossa: A Journal of General Linguistics 7(1). 1–32.10.16995/glossa.8211Search in Google Scholar

Klee, Carol & Andrew Lynch. 2009. El español en contacto con otras lenguas. Washington DC: Georgetown University Press.10.1353/book13061Search in Google Scholar

Lipski, John M. 1990. Trinidad Spanish: Implications for Afro-Hispanic language. Nieuwe West-Indische Gids/New West Indian Guide 64(1-2). 7–27. https://doi.org/10.1163/13822373-90002023.Search in Google Scholar

McCarley, G. Forthcoming. University of Konstanz Doctoral dissertation.Search in Google Scholar

Margaza, Panagiota & Aurora. Bel. 2006. Null subjects at the syntax–pragmatics interface: Evidence from Spanish interlanguage of Greek speakers. In Mary Grantham O’Brien, Christine Shea & John Archibald (eds.), Proceedings of the 8th generative Approaches to second language acquisition conference (GASLA 2006), 88–97. Somerville, MA: Cascadilla Proceedings Project.Search in Google Scholar

Martinez-Sanz, Cristina. 2011. Null and overt subjects in a variable system: The case of Dominican Spanish. University of Ottawa Doctoral dissertation.Search in Google Scholar

Mazzola, Giulia, Stefano De Pascale & Malte Rosemeyer. 2023. A computational approach to detect discourse traditions and register differences: A case study on historical French. Presented at the 12th historical sociolinguistics network conference (02 June 2023). Brussels.Search in Google Scholar

Orozco, Rafael. 2018. Spanish in Colombia and New York City: Language contact meets dialectal convergence. Amsterdam and Philadelphia: John Benjamins.10.1075/impact.46Search in Google Scholar

Orozco, Rafael & Gregory Guy. 2008. El uso variable de los pronombres sujetos: ¿Qué pasa en la costa Caribe colombiana? In Maurice Westmoreland & Juan Antonio Thomas (eds.), Selected Proceedings of the fourth workshop on Spanish Sociolinguistics, 70–80. Somerville: Cascadilla Proceedings Project.Search in Google Scholar

Orozco, Rafael & Manuel Díaz-Campos. 2016. Dialectos del español de América: Colombia y Venezuela. In Javier Gutiérrez-Rexach (ed.), Enciclopedia de lingüística hispánica, 341–352. London and New York: Routledge.10.4324/9781315713441-104Search in Google Scholar

Orozco, Rafael & Luz Marcela Hurtado. 2021a. Variable subject pronoun expression revisited: This is what the Paisas do. Proceedings of the Linguistic Society of America 6(1). 713–727. https://doi.org/10.3765/plsa.v6i1.5006.Search in Google Scholar

Orozco, Rafael & Luz Marcela Hurtado. 2021b. A variationist study of subject pronoun expression in Medellín, Colombia. Languages 6(1). 5. https://doi.org/10.3390/languages6010005.Search in Google Scholar

Otheguy, Ricardo & Ana Celia Zentella. 2007. Apuntes preliminares sobre el contacto lingüístico y dialectal en el uso pronominal del español en Nueva York. In Kim Potowski & Richard Cameron (eds.), Spanish in contact: Policy, social and linguistic inquiries, 275–295. Amsterdam and Philadelphia: John Benjamins.10.1075/impact.22.20othSearch in Google Scholar

de Prada Pérez, Ana. 2009. Subject expression in Minorcan Spanish: Consequences of contact with Catalan. The Pennsylvania State University Doctoral dissertation.Search in Google Scholar

de Prada Pérez, Ana. 2015. First person singular subject pronoun expression in Spanish in contact with Catalan. In Ana Maria Carvalho, Rafael Orozco & Naomi Lapidus Shin (eds.), Subject pronoun Expression in Spanish: A cross-dialectal perspective, 121–142. Washington: Georgetown University Press.Search in Google Scholar

Pérez-Leroux, Ana T. & William R. Glass. 1999. Null anaphora in Spanish second language acquisition: Probabilistic versus generative approaches. Second Language Research 15(2). 220–249. https://doi.org/10.1191/026765899676722648.Search in Google Scholar

R Core Team. 2021. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing.Search in Google Scholar

Roberts, Ian & Anders Holmberg. 2010. Introduction: Parameters in minimalist theory. In Theresa Biberauer, Anders Holmberg, Ian Roberts & Michelle Sheehan (eds.), Parametric variation: Null Subjects in minimalist theory, 1–57. Cambridge: Cambridge University Press.Search in Google Scholar

Rosemeyer, Malte. 2018. El orden de palabras en las interrogativas-Q. Un análisis contrastivo del español caribeño y portugués brasileño. Revista Internacional de Linguistica Iberoamericana 16(2). 135–148. https://doi.org/10.31819/rili-2018-163211.Search in Google Scholar

Rosemeyer, Malte. 2019. Actual and apparent change in Brazilian Portuguese wh-interrogatives. Language Variation and Change 31(2). 165–191. https://doi.org/10.1017/s0954394519000097.Search in Google Scholar

Rosemeyer, Malte. 2022. Modeling change in interaction on the basis of quantitative historical data. In Presented at the Role of Pragmatics in language change (8 April 2022), Copenhagen.Search in Google Scholar

Sessarego, Sandro. 2011. Introducción al idioma afroboliviano: Una conversación con el awicho Manuel Barra. La Paz: Plural editores.Search in Google Scholar

Sessarego, Sandro. 2013. Afro-hispanic contact varieties as conventionalized advanced second languages. IBERIA 5(1). 99–125.Search in Google Scholar

Sessarego, Sandro & Javier Gutiérrez-Rexach. 2017. Revisiting the null subject parameter: New insights from Afro-Peruvian Spanish. Isogloss 3(1). 43–68. https://doi.org/10.5565/rev/isogloss.26.Search in Google Scholar

Silva-Corvalán, Carmen. 1994. Language contact and change: Spanish in Los Angeles. New York: Oxford University Press.10.1093/oso/9780198242871.001.0001Search in Google Scholar

Szmrecsanyi, Benedikt. 2006. Morphosyntactic persistence in spoken English: A corpus study at the intersection of variationist sociolinguistics. Berlin/New York: Mouton de Gruyter.10.1515/9783110197808Search in Google Scholar

Szmrecsanyi, Benedikt. 2016. About text frequencies in historical linguistics: Disentangling environmental and grammatical change. Corpus Linguistics and Linguistic Theory 12(1). 153–171. https://doi.org/10.1515/cllt-2015-0068.Search in Google Scholar

Sorace, Antonella. 2011. Pinning down the concept of “interface” in bilingualism. Linguistic Approaches to Bilingualism 1(1). 1–33. https://doi.org/10.1075/lab.1.1.01sor.Search in Google Scholar

Toribio, Almeida J. 2000. Setting parametric limits on dialectal variation in Spanish. Lingua: International Review of General Linguistics 110(5). 315–341. https://doi.org/10.1016/s0024-3841(99)00044-3.Search in Google Scholar

Travis, Catherine. 2005. The yo-yo effect: Priming in subject expression in Colombian Spanish. In Randall Gess & Edward J. Rubin (eds.), Theoretical and experimental Approaches to Romance linguistics: Selected Papers from the 34th linguistic symposium on Romance languages, 2004, 329–349. Amsterdam and Philadelphia: John Benjamins.10.1075/cilt.272.20traSearch in Google Scholar

Travis, Catherine. 2007. Genre effects on subject expression in Spanish: Priming in narrative and conversation. Language Variation and Change 19. 101–135. https://doi.org/10.1017/s0954394507070081.Search in Google Scholar

Trudgill, Peter. 2011. Sociolinguistic typology: Social determinants of linguistic complexity. Oxford: Oxford University Press.Search in Google Scholar

Tsimpli, Ianthi Maria & Maria Dimitrakopoulou. 2007. The Interpretability Hypothesis: Evidence from wh-interrogatives in second language acquisition. Second Language Research 23. 215–242. https://doi.org/10.1177/0267658307076546.Search in Google Scholar

Tsimpli, Ianthi Maria & Nikolaos Lavidas. 2019. Object omission in contact: Object clitics and definite articles in the West Thracian Greek (Evros) dialect. Journal of Language Contact 12. 141–190. https://doi.org/10.1163/19552629-01201006.Search in Google Scholar

Walkden, George & Anne Breitbarth. 2019. Interpreting (un)interpretability. Theoretical Linguistics 45(3–4). 309–317. https://doi.org/10.1515/tl-2019-0022.Search in Google Scholar

Weerman, Fred. 1993. The diachronic consequences of first and second language acquisition: The change from OV to VO. Linguistics 31(5). 903–931. https://doi.org/10.1515/ling.1993.31.5.903.Search in Google Scholar

Wickham, Hadley. 2016. ggplot2: Elegant graphics for data analysis. New York: Springer-Verlag.10.1007/978-3-319-24277-4_9Search in Google Scholar

Yang, Charles. 2002. Knowledge and learning in natural language. Oxford: Oxford University Press.Search in Google Scholar

Received: 2023-10-08
Accepted: 2024-05-31
Published Online: 2025-07-31

© 2025 the author(s), published by De Gruyter, Berlin/Boston

This work is licensed under the Creative Commons Attribution 4.0 International License.

Downloaded on 4.10.2025 from https://www.degruyterbrill.com/document/doi/10.1515/jhsl-2024-0017/html
Scroll to top button