
Corpus-based discourse analysis: from meta-reflection to accountability

  • Monika Bednarek, Martin Schweinberger and Kelvin K. H. Lee
Published/Copyright: April 16, 2024

Abstract

Recent years have seen an increase in data and method reflection in corpus-based discourse analysis. In this article, we first take stock of some of the issues arising from such reflection (covering concepts such as triangulation, objectivity/subjectivity, replication, transparency, reflexivity, consistency). We then introduce a new ‘accountability’ framework for use in corpus-based discourse analysis (and perhaps beyond). We conceptualise such accountability as a multi-faceted phenomenon, covering various aspects of the research process. In the second part of this article, we then link this framework to a new cross-institutional initiative – the Australian Text Analytics Platform (ATAP) – which aims to address a small part of the framework, namely the transparency of analyses through Jupyter notebooks. We introduce the Quotation Tool as an example ATAP notebook of particular relevance to corpus-based discourse analysis. We reflect on how this notebook fosters accountability in relation to transparency of analysis and illustrate key applications using a set of different corpora.

1 Introduction

When we were asked by the editors of this special issue to reflect on trends and new directions in the area where corpus linguistics meets discourse studies, one particular aspect came to mind: the issue of data and method reflection or what we might call “meta-reflection”, i.e. reflection about corpus-based discourse analysis (DA). In this respect, Marchi and Taylor (2018: 8) speak of a “drive to increase reflexivity across the field”. From the beginning, studies bringing together corpus linguistics with discourse analysis have arguably had a general interest in this area, for example with respect to triangulation. Thus, Hardt-Mautner (1995) – perhaps the earliest example of corpus-based Critical Discourse Analysis – notes “[e]ven when the computer has entered the fray, triangulation remains a valuable methodological principle” (24). This openness of corpus-based discourse analysis to method reflection is possibly based on “the distinct combination of qualitative and quantitative techniques that has made us more apt to think about method per se” (Baker 2018: 291). A recent volume (Taylor and Marchi 2018) is explicitly dedicated to “dusty corners” (neglected aspects), “blind spots” (under-analysed or undetected aspects) and “pitfalls” (in research design choices), reflecting on issues such as partiality, incompleteness, and bias (Marchi and Taylor 2018: 9–12).

Such meta-reflection includes reflection about using corpus linguistics to enable corpus-based discourse analysts to achieve the same results with the same or different corpora. This has been discussed with respect to reproducibility or replicability:[1] Marchi and Taylor suggest that “[t]he quantitative techniques of CL give greater generalizability, allow for replicability and strengthen the overall reliability and validity of the research” (Marchi and Taylor 2009: 4). Baker states that “replicability is important – if the same results are achieved with the same or different data sets, then we have better evidence for ‘total’ accountability” (Baker 2018: 282). Marchi and Taylor (2018: 7–8) point out the importance of creating a “reproducible” process that is systematically accounted for, so that others can check results and conclusions.

This interest within corpus-based discourse analysis dovetails with recent general interest in linguistics (e.g. Bochynska et al. 2023; Sönning and Werner 2021) and corpus linguistics (e.g. McEnery and Brezina 2022) in reproducibility, repeatability, or replicability. Generally, proposals or critiques concern the availability of data/software and the transparency/clarity of reporting of methods and results (e.g. Doyle 2005: 2–4; Egbert 2023; Hober et al. 2023; McEnery and Brezina 2022: 68, 185; McEnery and Hardie 2012: 2; Paquot and Callies 2020: 121; Sönning and Werner 2021: 1182). Further discussion of replication/reproducibility is provided in Schweinberger and Haugh (in press). In this article, we will limit our discussion of meta-reflection and related issues such as triangulation and replication specifically to corpus-based discourse analysis. This term refers here to the combination of corpus linguistics and discourse analysis, including but not limited to corpus-assisted discourse studies/CADS (see e.g. Partington 2008), and to the combination of corpus linguistics and Critical Discourse Analysis (see e.g. Nartey and Mwinlaaru 2019).[2] In Section 2, we describe some of the relevant method reflection in this field and make our own suggestion for greater accountability, before introducing Jupyter notebooks as part of a new initiative (Section 3) and discussing one relevant notebook (Section 4). Throughout, our focus remains on corpus-based discourse analysis; however, given the above-mentioned general interest in related issues beyond this field, the article is likely to also have relevance for other linguists.

2 Triangulation, transparency, and accountability in corpus-based discourse analysis

As indicated above, method reflection is important for corpus-based DA, and one area that has received particular attention is triangulation. In corpus-based DA, triangulation has been recognized as concerning methodologies, data sources, researchers, and theories (Marchi and Taylor 2009, drawing on Denzin 1970) as well as disciplines (Ancarno 2018). For reasons of scope, we will not consider the latter (combining disciplines) here. Note also that the publications we reference below are illustrative rather than exhaustive.

In general, triangulation is a validation technique which is sometimes favoured as a substitute for replicability (Bloor 1997: 38). Within corpus-based DA, the dominant view of triangulation is that this field of research inherently involves triangulation – namely, the triangulation of (sets of) methodologies used in corpus linguistics with (sets of) methodologies used in discourse analysis.[3] This is also how the first textbook on Using Corpora in Discourse Analysis discusses triangulation (Baker 2006: 15–17) and this is arguably the underlying assumption in Baker et al.’s (2008) “synergy” of corpus-based Critical Discourse Analysis, with a nine-stage model proposed for moving between quantitative and qualitative analysis. In a different vein, Bednarek (2009) suggests a “three-pronged approach” that brings together macro-, meso- and micro-levels of discourse analysis by combining manual analysis of individual texts with semi-automated analysis of small corpora, and with large-scale corpus analyses. This is explicitly discussed in terms of triangulation. There is of course wider interest in how to combine analyses of corpora with analysis of individual texts or other types of closer qualitative analysis (e.g. Baker 2009: 89, 2020; Baker and Levon 2015: 223). Researchers sometimes select (“downsample”) corpus texts for qualitative analysis based on different approaches (see Baker 2020: 88–89), including through the tool ProtAnt, which offers a principled means of downsampling texts, reducing researcher bias, identifying outlier texts, and enabling replication (Anthony and Baker 2015; Anthony et al. 2023). A framework for how to use data triangulation in corpus-based DA has also been proposed (Jaworska and Kinloch 2018).

This interest in triangulation and proposed models/frameworks for such an undertaking has been extended to explicit experimentation. The first such study was undertaken by Marchi and Taylor (2009), described by the authors as a “quasi-experiment into triangulation” (Marchi and Taylor 2009: 1), with a focus on researcher triangulation. The question underpinning their experiment was: “Would two researchers starting with the same corpus and research question and (broadly) theoretical/methodological framework come to the same/similar conclusions?” (Marchi and Taylor 2009: 1). Findings are differentiated into convergence (results confirm each other), dissonance (results are incompatible) and complementarity (results illuminate different aspects). This early work inspired similar studies in recent years (Baker 2015; Baker and Egbert 2016; Baker and Levon 2015).[4] In each study, the same research question is tackled by different scholars, although the specifics of the set-up vary, e.g. whether the same data, corpus tool, or techniques are used. On the whole, these studies uncover a large amount of individual variation rather than a standard analysis, and even where results are similar, they may have been uncovered using different techniques. Analyses are shown to result in some converging (shared) as well as a larger number of unique (but mainly complementary) findings.

An important part of this meta-reflection and experimentation in corpus-based discourse analysis has been explicit reflection on subjectivity and objectivity, such as the assumption that corpus linguistics provides greater objectivity or reduces bias (e.g. Baker 2012; Baker and McEnery 2015; Marchi and Taylor 2009, 2018). In general, the various studies demonstrate that corpus and discourse analyses vary depending on multiple choices that can be made during the research process. This includes, inter alia, datasets; corpus-based versus corpus-driven directions of research; particular techniques (e.g. keywords vs collocation); specific thresholds/cutoffs and other methodological parameters; extent and type of qualitative analysis; researcher interests; how corpus findings are interpreted, etc. Thus, there appears to be general agreement that corpus-based discourse analysis “cannot in and of itself bring greater objectivity” (Marchi and Taylor 2018: 8). Baker (2012: 255) therefore asks “for an increased commitment to researcher reflexivity” (see also Marchi and Taylor 2018). In addition, Baker and McEnery (2015: 9) suggest that researchers “aim for wider transparency about methodological decisions and a more nuanced set of stated claims about the benefits of using computational methods”. McEnery (2016: 31) recommends accurate reporting to enable replication and critical exploration (see also Egbert and Baker 2016). Marchi and Taylor (2018: 12) call for “a more general set of principles which can guide our work”, to monitor research choices and their impacts and present them in ways that facilitate replication.

To manage subjectivity in corpus-based discourse analysis, Baker (2009) proposes that analysts adopt transparency and consistency. For Baker (2009: 83–84), transparency refers to being clear about methodological choices as well as making corpora and corpus data (frequency lists, concordances) available. Transparency also covers researchers stating or making explicit their position regarding the object of research, such as their goals, standpoints, and other relevant influences (Baker 2009: 87). For him, this also includes reflexivity with respect to how one’s starting position may have changed in the course of the research. Consistency in turn refers to maintaining the same methodological choices regarding corpus techniques throughout the research project. Bednarek and Caple (2017: 23) use consistency in qualitative discourse analysis of corpus texts – in their case, the coding of news values.[5]

Corpus-based discourse researchers have attempted to increase both transparency and consistency in various ways. Some studies include a step where two or more researchers undertake the same analysis and then measure inter-rater agreement (e.g. Bednarek et al. 2022; Jaworska and Kinloch 2018: 116) or where the same researcher undertakes the same analysis more than once and measures intra-rater reliability (e.g. Gray 2016: 37). Others create a coding manual (to improve consistency) and, to increase transparency, make this openly available on a website (e.g. Bednarek and Caple 2017) or via the Open Science Framework/OSF (e.g. Bednarek et al. 2022). Where corpus-based discourse analysts have used programming rather than off-the-shelf software, some have made the code available that underpins the analysis (e.g. McGlashan 2021).
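
For readers unfamiliar with such measures, the following is a minimal sketch of how inter-rater agreement might be computed once two analysts' codings of the same texts are available as parallel lists. The scikit-learn function shown is one common option rather than a procedure prescribed by the studies cited above, and the category labels are invented for illustration.

```python
# Illustrative sketch only: quantifying inter-rater agreement between two
# analysts who have coded the same set of corpus texts. The category labels
# and codings below are invented for the example.
from sklearn.metrics import cohen_kappa_score

analyst_a = ["elite", "ordinary", "elite", "expert", "ordinary", "elite"]
analyst_b = ["elite", "ordinary", "expert", "expert", "ordinary", "elite"]

kappa = cohen_kappa_score(analyst_a, analyst_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance-level agreement
```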

In sum, many important concepts are involved in meta-reflection in corpus-based discourse analysis: triangulation, objectivity/subjectivity, replication, transparency, reflexivity, consistency. However, one concept that is so far missing here is that of accountability. In this context, Marchi and Taylor (2018: 12) argue for replacing the aim to achieve greater objectivity with the aim of increasing accountability (including consistency, transparency) and self-reflexivity.

Building on the discussion above, Table 1 therefore proposes what we call an “accountability framework”. Accountability is not limited here to Leech’s (1992: 112) principle of total accountability which refers to using all relevant evidence from the corpus/data (McEnery and Brezina 2022; McEnery and Hardie 2012: 14–16). Neither is it limited to consistency and transparency. Rather, we use accountability as a broad cover term which includes being accountable for all relevant aspects of corpus-based DA, deliberately drawing on the everyday meaning of accountability. Being accountable means being transparent about research aspects, but also being able to justify them, being responsible for the decisions made or positions taken, and critically reflecting on them. As evident from Table 1, we conceptualise this as a multi-faceted phenomenon, covering various aspects of the research process. The framework is a first step and likely incomplete; the dot points in the table signal that we hope others will extend this framework – including but not limited to making links with open science initiatives and principles.

Table 1:

Accountability framework.

Corpus
  • –Has the language of the corpus and its national context been specified and not treated as unmarked norm (cf. the Bender rule)?

  • –Is there a sufficient description and critical evaluation of the corpus (including corpus design, building, annotation, etc.)?

  • –If there is a corpus manual, has it been made available online, and has the link been included?

  • –Can corpus data be made available or are there ethical or legal issues that prevent this?

  • –Have FAIR and CARE principles been considered (if relevant)?

  • –Have other ethical issues been considered in corpus design?

  • –…

Methods, techniques, analyses
  • –Have relevant research foci and lacunae been considered (e.g. via the topology heuristic or in other ways)?

  • –Have methodological choices been explained and justified?

  • –Have relevant parameters been provided and justified? (e.g. tokenisation, tool settings/parameters, statistical measures, cut-offs/thresholds, etc.)

  • –Has analysis code or code notebooks been made available?

  • –Have corpus outputs (e.g. frequency lists, concordance lines) been made available?

  • –Are supporting documents for qualitative analysis (e.g. coding manuals) openly available?

  • –Has the consistency of qualitative analysis been addressed – e.g. via multiple analysis by same (intra-rater reliability) or different analysts (inter-rater reliability) or via coding manuals or other means?

  • –…

Theories
  • –Have the ontological and epistemological underpinnings of the theories been considered and explained?

  • –Has the applicability of the theories to the data and context been justified?

  • –Have the limitations of the theories been identified?

  • –…

Researcher(s)
  • –Have researchers reflected on their positioning, goals, and potential biases concerning the object of study and on the relevant limitations of their perspective?

  • –Are explicit positioning statements relevant and if so, have they been included?

  • –If applicable, have principles been implemented for supporting analysts who deal with challenging/confronting texts?

  • –…

Triangulation
  • –Has triangulation of data, methods, theories or researchers been considered and are any of these appropriate/possible within project constraints?

  • –…

Interested/affected parties
  • –Has there been a reflection on who might be an interested or affected party?

  • –Have researchers considered how they might engage with any interested/affected parties? (e.g. co-design, collaboration, consulting, other, none?)

  • –Are there ethical considerations beyond corpus design that have been considered in the project with respect to interested/affected parties?

  • –…

Given the scope of this article, we cannot elaborate on all points, but – given the discussion above – most should be relatively clear.[6] A few comments must suffice: When we talk about specifying the language of the corpus and its national context in the “corpus” row, we want to highlight the need to not assume any language or national context as unmarked or as having universal relevance. Corpus-based discourse analysis is arguably dominated by research on British English data, and it is imperative not to treat such data as the unmarked case. Similarly, when we talk about the underpinnings, applicability, and limitations of theories in the “theories” row, we want to encourage reflections regarding the origins (e.g. “Western”) and data (e.g. language variety, text types) from which those theories, concepts, or frameworks derive. This could also mean unpacking the assumptions and presuppositions that are built into such theories. Further, when we talk about supporting analysts in the “researcher(s)” row, one thing to think about is how to support those who analyse challenging/confronting data (e.g. sexual abuse, racism, misogyny, trans- or homophobia, health contexts, etc.). Finally, this row also interacts with the last row (“interested/affected parties”), as the researcher themselves might be an interested/affected party or an “out-group” researcher analysing the representation of social groups to which they do not belong (see Bray 2023).

The questions that we ask in Table 1 are meant to act either as a set of guiding principles or as a reflection tool and we acknowledge that not every project will be able to implement all relevant aspects given the constraints and pressures under which researchers operate and the different levels of support and capacity available. We also do not mean to imply that we have always considered all points in Table 1 or that we will always be able to do this in the future! Implementation would require considerable support from different levels. For example, triangulation alone faces significant challenges such as time, expertise and the space to write about different approaches (Egbert and Baker 2016). Some opportunities are already available, for example repositories for depositing supplementary materials (e.g. OSF, Github – see Schweinberger and Haugh in press), corpus linguistics journals that encourage methodological commentary (see Egbert and Baker 2016: 206), journals that allow for publication of corpus or data descriptions, and standardised notation systems (e.g. collocation parameters notation; see Brezina 2018: 275). Others still need to be implemented, such as customary procedures for reciprocal double coding (Brezina 2018: 270) or the appropriate institutional recognition of both corpora and corpus manuals as valuable research outputs. In sum, the burden of implementation should not fall on individual researchers; rather, supporting structures and cultural change are needed. In the remainder of this article, we introduce a cross-institutional initiative – the Australian Text Analytics Platform (ATAP) – which aims to address a small part of the framework in Table 1, namely the transparency of analyses through making code available via Jupyter notebooks (explained in 3.1).

3 The Australian Text Analytics Platform

3.1 The initiative

Making code available arguably constitutes a relatively recent trend in corpus-based DA and is hampered by a computational skills shortage. Even where code is made available, the question remains how many corpus-based discourse analysts have the skills to read, evaluate, and/or implement it. To address this skills shortage and to increase transparency, researchers in Australia are collaborating on the Australian Text Analytics Platform (ATAP). While ATAP does not exclusively focus on corpus-based discourse analysis, we want to take this opportunity to introduce this initiative as potentially helpful to this field.

In brief, ATAP is an initiative which fosters cooperation among data and text analytics users and providers, and supports researchers from diverse academic backgrounds, including beginners, in adopting user-friendly code-based text analysis. It involves collaboration among the Universities of Queensland (including the Language Technology and Data Analytics Laboratory/LADAL), Sydney (Sydney Corpus Lab, Sydney Informatics Hub), and Australia’s Academic and Research Network/AARNet. It is also connected to the Language Data Commons of Australia HASS RDC through which it collaborates with additional institutions/organizations and supports language-related technologies, data collection infrastructure, and Indigenous capability programs. ATAP’s primary goal is to enhance research flexibility, offer valuable upskilling resources, and improve research workflow transparency and reproducibility. It thus offers a range of resources for text data collection, processing, visualization, and analysis (e.g. training, office hours, collaborations). Code-based resources include the development of user-friendly, shareable, transparent, and interactive Jupyter notebooks. Before we introduce an ATAP notebook that may be of particular interest to corpus-based discourse analysts, we provide an explanation of such notebooks given that they may be unfamiliar to many researchers in this field.

3.2 Jupyter notebooks

Jupyter notebooks are free, open-source web applications designed to create and share documents that combine text with executable code and the code’s output. These notebooks support a variety of programming languages including Python, Julia, and R. They are often stored on GitHub and other online repositories. In September 2018, there were more than 2.5 million public Jupyter notebooks hosted on GitHub alone (Perkel 2018: 145). Jupyter notebooks can be uploaded to online platforms which allow the code in notebooks to be run independently of personal computers, or they can be run locally (i.e. without connecting to the Internet), although extra setup is required for this. It is also possible to convert Jupyter notebooks to other file formats (e.g. html, LaTeX, and pdf), which makes them shareable outside of wherever they are stored (Beg et al. 2021: 38).

In general, a notebook is composed of cells (i.e. areas of the notebook), of which there are three types: code cells, text cells, and raw cells. Code cells contain code which can be executed to run analyses, output results from analysis, produce visualisations, etc. Text cells make use of markdown, a method for formatting text. Raw cells are mainly used for configuration, e.g. for specifying if a notebook should be converted into a pdf or html document (Pimentel et al. 2021: 4). Thus, Jupyter notebooks typically contain both computer code (in the relevant programming language) and text elements (paragraphs of text, links, equations etc.). Code cells are (usually) preceded by explanatory text (e.g. description of the code). What the code produces (the output) is presented just below the code cells. This means that notebooks are simultaneously human-readable and executable documents. Thus, a notebook can be summarised as “an interactive literate programming document and an application that executes the document” (Pimentel et al. 2021: 3).
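
To make the three cell types concrete, the following minimal sketch builds a notebook programmatically with the nbformat library; this is just one of several ways of creating notebooks, and the cell contents are purely illustrative.

```python
# Minimal sketch of the three cell types (markdown/text, code, raw), built
# programmatically with the nbformat library; the cell contents are invented.
import nbformat as nbf

nb = nbf.v4.new_notebook()
nb.cells = [
    nbf.v4.new_markdown_cell("## Load the corpus\nText cells use markdown to explain the next step."),
    nbf.v4.new_code_cell("counts = {'said': 233, 'told': 50}\nprint(counts)"),
    nbf.v4.new_raw_cell("Raw cells hold unrendered content, e.g. export configuration."),
]

with open("example.ipynb", "w", encoding="utf-8") as f:
    nbf.write(nb, f)
```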

Jupyter notebooks are used in ATAP because they have several advantages: they make it easy to document, share, and reproduce the code used for analyses (Pimentel et al. 2021; Shen 2014). With notebooks, it is possible to carry out an entire study within a single document while maintaining a complete and executable record of the processes involved (Beg et al. 2021: 38, 40). Because they can contain text and tables, notebooks can also provide information about annotation schemas, provide or discuss examples, as well as automatically store and document technical information (Beg et al. 2021: 40).

The combination of explanation and code allows analysts to describe the intended process and outcomes of the code, which, in turn, allows others to better understand how it works. This is helpful for troubleshooting if any errors are encountered when the code is adapted. Finally, it is possible to modify and then immediately execute/run the code. This means that it is possible to see the results of the modifications, which leads to a better understanding of the code. Because of these features, Schweinberger and Haugh (in press) argue that the use of notebooks in linguistics can assist in making corpus analyses and the workflows they are based on more transparent, reproducible, and efficient.

To date, ATAP has developed a range of Jupyter notebooks and is developing more. A full list is available at https://www.atap.edu.au/tools/, with tools addressing the needs of different researchers, including those interested in corpus-based DA, but also those interested in stylistic or literary analyses or in natural language processing (NLP) techniques such as topic modelling, sentiment analysis, document classification, geolocation, etc. Some notebooks can be considered upskilling tutorials (e.g. those on the LADAL platform, which cover a broad range of areas), while others are stand-alone tools. The latter tool suite includes:

  1. Document Similarity Tool – enables the identification of duplicated content among the texts in a corpus and permits the exclusion of duplicates after manual review.

  2. Semantic Tagger – can be used to add semantic tags to the texts in a corpus, undertake analysis/visualisation and export tagged texts.

  3. Keywords Analysis – performs keyword analysis on two or more corpora, as well as statistical tests to investigate word use across corpora.

  4. Quotation Tool – identifies who is quoted as well as quoted content in newspaper corpora, integrating named entity recognition.

While most of the notebooks are relevant to different text types, the Quotation Tool (Jufri and Sun 2022) targets newspaper texts – a very common data source in corpus-based DA. Given that quotation extraction is a type of analysis not normally enabled in “classic” off-the-shelf corpus software but is of clear relevance to corpus-based DA, we will introduce this tool in more detail below.

4 The Quotation Tool

4.1 Introducing the Quotation Tool

We have chosen the Quotation Tool as a sample notebook because of the relevance of quoted speech to corpus-based DA. In Baker et al.’s (2008: 282) list of strategies of representation (adapted from Wodak 2001), quotation falls under the strategies identified as “Perspectivation, framing or discourse representation”. As they note, “Quotation patterns […] play an important role in implementing particular perspectives, and hence, ideologies” (Baker et al. 2008: 295). Therefore, it is important to analyse “who is written about, how much space they are given and whether they are directly or indirectly quoted” (Baker et al. 2008: 294), and whether sources are represented negatively, for instance as “inarticulate, extremist, illogical, aggressive or threatening” (Baker et al. 2008: 295). In other approaches to discourse analysis, reported speech and thought is also regarded as important, in particular in newspaper discourse (see Bednarek 2016: 31–32; White 2012). Fairclough (1988) is a classic CDA paper that introduces some of the relevant issues. Bednarek (2016) suggests that there is general agreement that the following questions can be differentiated in relation to the integration of voices (sources) in news discourse:

Whose voice is it (who is the source)? How are they identified? Where are they located and how do we access them? How are these voices integrated? What reporting expression, if any, is used to introduce the content?

The Quotation Tool is of clear relevance to these questions, because it allows users to upload a news corpus and to automatically identify who is cited (the source), to classify this source (as entity type), to identify which reporting expression is used to cite them (e.g. according to, said, claimed), the type of quotation used to integrate the reported content (e.g. direct, indirect), and to extract the quoted content itself (for additional analyses, e.g. manually or as a corpus in another tool). It then becomes possible for discourse analysts who examine larger datasets (which would not be amenable to manual analysis) to identify what sources are cited, how these sources are identified, whether readers hear/experience what sources said or a paraphrase by the journalist, and which sources and which content are associated with neutral versus other reporting expressions (e.g. those expressing attitude like admit, warn or those expressing potential doubt like claim). This enables analysis of whose perspective is included and whether there is any bias in quotation patterns. For discursive news values analysis (Bednarek and Caple 2014, 2017), the Quotation Tool can inform analysts of whether the cited sources are “ordinary people” (news value of personalisation) or “elites” (news value of eliteness).
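
By way of illustration only – and not as a description of the Quotation Tool’s actual pipeline, which is adapted from the Gender Gap Tracker code introduced below – a much-simplified sketch of the underlying idea (pairing a direct quote with a reporting expression and a named entity in the same sentence) might look as follows, assuming spaCy and its small English model are installed. The reporting-expression list and example sentence are invented for the sketch.

```python
# Much-simplified illustration of the idea behind quote attribution; this is
# NOT the Quotation Tool's actual pipeline. It pairs a direct quote with a
# reporting expression and a PERSON/ORG entity found in the same sentence.
import re
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed
REPORTING_EXPRESSIONS = {"said", "says", "told", "claimed", "warned", "according"}

text = 'NSW Police Minister David Elliott said "the new measures will start next month".'
doc = nlp(text)

for match in re.finditer(r'"([^"]+)"', text):
    sentence = next(s for s in doc.sents if s.start_char <= match.start() < s.end_char)
    verbs = [t.text for t in sentence if t.lower_ in REPORTING_EXPRESSIONS]
    sources = [e.text for e in sentence.ents if e.label_ in {"PERSON", "ORG"}]
    print({"quote": match.group(1), "reporting_expression": verbs, "candidate_sources": sources})
```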

Before we illustrate some of the results from using the Quotation Tool, we first briefly explain its basic features, using a small training dataset/corpus comprising 100 articles (utf-8 encoded text files) from the University of Sydney’s student newspaper Honi Soit. We collected one article per week from the newspaper’s online News section (http://honisoit.com/category/news/) over a period of 2 years (2021–2022), as explained in Lee (2024). The Honi Soit editors gave us permission to use this corpus for training purposes.

The Quotation Tool notebook has been kept deliberately simple and contains just five steps:

  • Step 1 – Setup: imports and initiates the Quotation Tool and the necessary libraries;

  • Step 2 – Load the data: uploads the corpus file, including previewing its contents (cf. Figure 1);

  • Step 3 – Extract the quotes: extracts the quotes and applies named entity recognition to speakers and quoted content (but not other elements); includes preview of the extracted data (not shown here);

  • Step 4 – Display the quotes: shows the results of the analysis in one text (Figure 4); allows analysis of the whole corpus for the most frequent (top) entity names (e.g. John Doe) and entity types (e.g. PERSON) in speakers and quoted content; visualisations of the latter can be exported (Figures 2 and 3);

  • Step 5 – Save the quotes: saves the results into a spreadsheet for downloading to users’ computers and additional analysis outside the Quotation Tool notebook environment.

Figure 1: Previewing corpus contents in the notebook.

Figure 2: The five most frequent speaker entity names.

Figure 3: The five most frequent speaker entity types.

The quotation extraction has been adapted (with permission) from code developed by Maite Taboada’s Gender Gap Tracker team (e.g. Asr et al. 2021), while other tools used in the notebook include spaCy (https://spacy.io/), Natural Language Toolkit (https://www.nltk.org/index.html), pandas (https://pandas.pydata.org/) and Jupyter Widgets (https://ipywidgets.readthedocs.io/en/stable/). Before we demonstrate uses of the Quotation Tool further (in Section 4.3), we provide a brief discussion of some of the notebook features with respect to transparency of analysis (falling within the second row “Methods, techniques, analysis” in Table 1).

4.2 Reflection on accountability and transparency

How does this notebook foster accountability in relation to transparency of analysis? On a general level, such analytical transparency is facilitated by the fact that this notebook contains both explanations and code. This allows users to understand what action the relevant code performs and is particularly important for users who do not read code. The explanations also include information on how the code can be changed by users, for example adjusting the number of rows that are previewed via the “n” variable. This makes the code more transparent, as the “n” variable is both explained and made adaptable. However, it must be acknowledged that some of the code will still be non-transparent to those who cannot read it. They can understand what the code achieves, but cannot evaluate the code itself for how it achieves the outcome. A final general feature is the use of previews, which allow users to see the extracted data and identify any potential issues with it. Such previews can also make the code more transparent, as results are displayed and can be inspected (cf. Figure 4).
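
As a purely illustrative sketch of how a preview variable of this kind typically behaves: the data frame and column names below are invented for the example and are not the tool’s documented output.

```python
# Purely illustrative: how an 'n' preview variable typically works.
# The data frame and column names are invented, not the Quotation Tool's output.
import pandas as pd

quotes_df = pd.DataFrame({
    "speaker": ["David Elliott", "Honi Soit editors"],
    "verb": ["said", "told"],
    "quote": ["the new measures will start next month", "we welcome the coverage"],
})

n = 2  # changing this variable changes how many rows are previewed
print(quotes_df.head(n))
```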

Figure 4: Extract from text preview showing results of quote extraction and named entity recognition.

In addition to these general features, this notebook includes several links to resources where users can read about (aspects of) the underlying code used to extract the quotes. Figure 5 shows the link to the relevant team’s GitHub page as well as to an article that evaluates the Quotation Tool’s accuracy. Elsewhere in the notebook, users are able to directly access an appendix (from the Gender Gap team), which contains technical information about the quotation extraction pipeline.

Figure 5: Introduction to the Quotation Tool notebook.

The notebook also includes various coloured boxes which provide information about the libraries that are installed (Figure 6) and the tools that are used (Figure 7). In some cases, additional information about the relevant tool is also provided, for instance with respect to how texts are split into tokens (Figure 7) or with respect to capitalization (Figure 8). This is important so that users can understand the choices that are integrated within the tools, to be able to interpret the output, and to identify potential limitations.

Figure 6: Information about installed libraries.

Figure 7: Information about tools and tokenization.

Figure 8: Information about capitalization.

Other coloured boxes contain important information for users, such as providing a definition of each column header in a table or information about how to deal with particular issues from a technological perspective (e.g. large file upload).

In sum, transparency is addressed through several different measures, either directly within the notebook or by providing users with links to relevant documentation outside the notebook. When users do not change the code but use it with all default values, this analysis should be reproducible (with the same dataset) and replicable (with a different dataset). Where users change the code – as they are able to – they would need to take responsibility for documenting the changes that have been made in a transparent way to ensure reproducibility and replicability. One remaining disadvantage is that some notebook users will not be able to read and evaluate the code itself, but will have to rely on the accompanying material and its accessibility to them. In addition, there is a trade-off between including too much and too little information in the notebook itself. Too much explicatory text within the notebook may overwhelm the notebook user, while links to documents outside the notebook require additional effort by the user and may deter them from engaging with the relevant information. In general, we have opted for a mix of both.

4.3 Potential uses in corpus-based DA

Having discussed its general features, this section now illustrates some potential uses of the Quotation Tool in corpus-based DA. One of the obvious uses of the notebook is to analyse a specific corpus for its quotation patterns and to interpret these patterns. This seems a relatively straightforward use of the tool, which we only briefly discuss here. To give some examples, the downloaded spreadsheet (produced in step 5) could be filtered by reporting expression, and all sources and propositions cited using negative attitudinal reporting verbs such as WARN or ADMIT could be retrieved. These could be compared to those cited with neutral reporting expressions. Potential classification systems that can be used as a starting point for identifying reporting expressions can be found in Caldas-Coulthard (1994) or Bednarek (2006: 57–58). Similarly, one could compare sources cited directly with those cited indirectly, or one could identify and classify all the sources and types of entities that are cited. Thus, relevant quotation patterns could be retrieved in a corpus of news about a particular topic and analysed with respect to bias and ideology.
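
A hedged sketch of such filtering with pandas is given below; the file name and the “verb” column name are assumptions made for illustration rather than the tool’s documented output format (only the “speaker” column is mentioned elsewhere in this article).

```python
# Hedged sketch: filtering the downloaded spreadsheet (step 5) by reporting
# expression with pandas. File name and the "verb" column name are assumptions
# for illustration; "speaker" is mentioned in the article text.
import pandas as pd

quotes = pd.read_excel("quotes_output.xlsx")

attitudinal = {"warn", "warned", "warns", "admit", "admitted", "admits"}
neutral = {"said", "says", "say"}

attitudinal_quotes = quotes[quotes["verb"].str.lower().isin(attitudinal)]
neutral_quotes = quotes[quotes["verb"].str.lower().isin(neutral)]

# Compare which sources are cited with negative attitudinal versus neutral expressions
print(attitudinal_quotes["speaker"].value_counts().head(10))
print(neutral_quotes["speaker"].value_counts().head(10))
```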

Additionally, the Quotation Tool can be used to obtain information about journalistic norms or conventions that may hold across different newspaper corpora. This can be useful as background information for discourse analyses of various types. We will illustrate this use of the tool by extracting quotes from five different newspaper corpora:

  1. The Honi Soit training dataset introduced above (100 articles from an Australian student newspaper);

  2. A subset of the AusBrown corpus (Collins and Yao 2019), namely 169 articles from the Press section (139 articles from 1990; 30 articles from 2006);

  3. The Cycling corpus described in Bednarek and Caple (2017: 138–143) containing 1,687 news items about cycling from 12 Australian, US, and UK newspapers in the years 2004, 2005, 2009, 2013, 2014;

  4. The Diabetes News Corpus described in Bednarek and Carr (2019), consisting of 694 newspaper articles about diabetes (577 news items, 117 other items) from 12 Australian newspapers (2013–2017);

  5. The Australian Obesity Corpus, including approximately 26,000 articles from 12 Australian newspapers that mention the words obese or obesity (both news and other items; see Vanichkina and Bednarek 2022).

These datasets all contain Australian newspaper items, although some include non-news genres (e.g. opinion) and some also include UK and US items. We used the Quotation Tool to compare a range of aspects across these corpora, with the aim of identifying the frequency of entities (as sources and in quoted content), of quotation types, and of reporting expressions. All results in Tables 2–5 were retrieved by working with the downloaded spreadsheet, but can now be easily retrieved in summative form (see Australian Text Analytics Platform 2024).

Table 2:

Speaker entities.

Entity type AusBrown subset Cycling Diabetes Honi Soit Obesity
PERSON 1,273 (#1) 3,175 (#1) 1,707 (#1) 290 (#1) 61,481 (#1)
ORG 511 (#2) 1,384 (#2) 557 (#2) 236 (#2) 30,551 (#2)
GPE 181 (#3) 639 (#3) 253 (#3) 33 (#3) 10,158 (#3)
NORP 81 (#4) 70 (#5) 55 (#4) 10 (#4) 3,302 (#4)
LOC 22 (#5) 70 (#5) 8 (#5) 2 (#5) 375 (#5)
FAC 19 (#6) 102 (#4) 4 (#6) 0 (#6) 294 (#6)
Table 3:

Quote entities.

Entity type AusBrown subset Cycling Diabetes Honi Soit Obesity
ORG 700 (#1) 999 (#2) 343 (#2) 236 (#1) 17,539 (#1)
GPE 521 (#2) 1,040 (#1) 350 (#1) 84 (#2) 15,983 (#2)
PERSON 492 (#3) 780 (#3) 329 (#3) 70 (#3) 12,378 (#3)
NORP 202 (#4) 147 (#6) 186 (#4) 53 (#4) 7,850 (#4)
LOC 97 (#5) 228 (#5) 28 (#5) 7 (#5) 1,540 (#5)
FAC 49 (#6) 353 (#4) 6 (#6) 5 (#6) 657 (#6)
Table 4:

Quotation types.

Quotation type AusBrown subset Cycling Diabetes Honi Soit Obesity
SVC 1,591 (#1) 3,241 (#1) 1,428 (#1) 314 (#1) 63,974 (#1)
CSV 439 (#2) 742 (#4) 308 (#5) 66 (#4) 15,856 (#5)
QCQSV 407 (#3) 998 (#3) 912 (#2) 81 (#3) 29,449 (#3)
Heuristic 370 (#4) 1,038 (#2) 523 (#3) 269 (#2) 34,331 (#2)
QCQ 174 (#5) 474 (#6) 490 (#4) 46 (#5) 17,192 (#4)
Floating quote 174 (#5) 474 (#6) 490 (#4) 46 (#5) 17,192 (#4)
SVQCQ 52 (#6) 664 (#5) 33 (#8) 16 (#9) 2,539 (#9)
AccordingTo 50 (#7) 149 (#9) 107 (#6) 30 (#6) 5,636 (#6)
QCQVS 38 (#8) 357 (#7) 72 (#7) 29 (#7) 4,278 (#7)
CVS 30 (#9) 171 (#8) 44 (#9) 23 (#8) 3,780 (#8)
VCS 1 (#10) 5 (#11) 1 (#11) 2 (#10) 230 (#10)
VSC 1 (#10) 6 (#10) 177 (#11)
QSCQV 2 (#16)
SCV 1 (#13) 43 (#12)
SQCQV 4 (#10) 19 (#14)
VQCQS 3 (#15)
VQSCQ 2 (#16)
VSQCQ 2 (#12) 37 (#13)
Table 5:

Ten most frequent reporting expressions.

Rank AusBrown Cycling Diabetes Honi Soit Obesity
1 said (1,943) said (4,691) said (1,924) said (233) said (54,136)
2 told (123) says (365) says (384) told (50) says (31,668)
3 says (87) told (237) according [to] (108) says (37) according [to] (5,675)
4 say (72) say (164) say (71) according [to] (31) say (4,967)
5 according [to] (50) added (160) told (35) stated (16) told (4,618)
6 saying (28) according [to] (149) think (33) noted (15) think (2,243)
7 warned (24) saying (60) suggests (24) argued (14) suggests (1,588)
8 thought (18) wrote (49) warn (18) explained (14) saying (1,425)
9 confirmed (17) adding (33) saying (17) added (12) writes (1,399)
thought (17) say (12)
saying (12)
10 announced (15) claimed (30) warned (16) announced (11) wrote (825)
claimed (15)

We start by examining trends in entities (raw frequencies and ranks): Tables 2 and 3 show the results for speaker entities (sources) and quote entities (entities mentioned within quoted content) in the five corpora (excluding blank cells). These results are based on the six spaCy entity types included by default in the notebook (PERSON = people; ORG = companies, agencies, institutions, etc.; GPE = countries, cities, states; NORP = nationalities, religious or political groups; FAC = buildings, airports, highways, etc.; LOC = non-GPE locations, mountain ranges, bodies of water). Numbers derive from the named entity recognition, where the same source is sometimes classified as more than one entity, for instance in the case of role labels (e.g. NSW Police Minister David Elliott: David Elliott = PERSON; Police = ORG; NSW = GPE) and were retrieved using a text filter. It is evident that the three most highly-ranked speaker entities are the same across all corpora, with only small differences among the remaining entities. No doubt there are errors in the entity recognition; however, the ranking conforms to expectations, given that we would expect people to be sources more often than other entity types and given that ORG and GPE/NORP are likely associated with the construction of the news values of eliteness and/or proximity (Bednarek and Caple 2017: 82–83; 91–93). For quote entities, there is similar overlap in the rankings, with three newspaper corpora having identical rankings for all entity types and two corpora (Cycling; Diabetes) having slight variations. Thus, there are journalistic norms concerning the kinds of entities that are quoted as sources as well as those that are mentioned in quoted content.
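
For readers unfamiliar with these labels, the following minimal sketch shows how spaCy’s standard entity types surface on the example given above; the exact output can vary with the model and version used.

```python
# Minimal sketch of spaCy's standard entity labels discussed above; exact
# output can vary with the model and version used.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("NSW Police Minister David Elliott spoke to the Sydney Morning Herald in Sydney.")

for ent in doc.ents:
    print(ent.text, ent.label_, spacy.explain(ent.label_))
# Expected along the lines of: NSW -> GPE, David Elliott -> PERSON,
# the Sydney Morning Herald -> ORG, Sydney -> GPE
```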

We now consider the quotation structure, with Table 4 listing all quotation types and their (raw) frequency and rank in each corpus. The categories come from the quotation extraction developed by the Gender Gap team and are explained in Asr et al. (2021)’s supporting information. Briefly, AccordingTo = use of according to, S = speaker, V = verb, C = content of quote, Q = quotation mark; floating quote = quotation by same speaker without new reporting expression. As can be seen, there are also definite norms at play considering quotation types: SVC is ranked first across all corpora, while most other quotation types also show rankings that are either the same or similar (within one to two ranks) across corpora. Only CSV (ranks 2, 4, 5) and SVQCQ (ranks 5, 6, 8, 9) show slightly more variation, some of which may derive from the diversity of news genres/topics in each corpus.

It is also interesting to identify frequency patterns in the reporting expressions that are used, because such expressions can be tied to evaluation/attitude (e.g. Bednarek 2006; White 2012). Again, there is much overlap concerning the ranking of the ten most frequent reporting expressions in each corpus (cf. Table 5, excluding blank cells). For instance, said is ranked first in all corpora and is thus the most commonly used reporting expression, while the present tense form says is ranked second or third across all. Told (ranks 2, 3, 5) and according [to] (ranks 3–6) are also important in all datasets, while say has the same rank (#4) in all but the student newspaper (Honi Soit), which may not consistently follow established journalistic norms. Other differences may arise from genre or topic variation. In any case, there are clear usage norms that can be identified with the Quotation Tool, allowing us to identify common and uncommon quotation patterns in news against which a particular news corpus could then be compared.

In addition, we can use the notebook for the purpose of triangulation (of methods). We will illustrate this briefly with the Diabetes News Corpus. In a previous study (Bednarek and Carr 2021), we used WordSmith (Scott 2020) to identify the four most frequent reporting expressions in this corpus (said, says, according to and say; cf. also Table 5), and to classify each source (excluding pronouns) that occurs with these four expressions using a coding manual (https://osf.io/jrhx2/). The results of this original classification are shown in Table A.1 in the appendix. We can now use the output from the Quotation Tool to triangulate this analysis. To do so, one of the authors (Kelvin Lee) used the same coding manual to classify all speakers occurring in the “speaker” column in the downloaded spreadsheet. We filtered out pronouns (he, I, it, she, some, someone, that, them, they, this, we, which, who, you), blank cells, and instances associated with said, says, according to and say – leaving us with 658 instances to analyse (resulting in 659 codings, with one instance double-coded). The purpose of the triangulation was to see if the patterns would be similar with respect to additional reporting expressions; hence the exclusion of the expressions that were previously analysed.

Since we are reporting these results here purely for the purposes of illustration, inter- or intra-rater reliability was not measured. In addition to the categories from the coding manual we added two additional categories: “error” and “unclear”. The “error” category refers to cases where speakers were incorrectly identified by the Quotation Tool (e.g. mice), while the “unclear” category refers to cases where the spreadsheet output did not suffice to enable speaker classification. In such cases one would have to use an additional tool for analysis (e.g. a concordancer) or check the relevant text in the corpus. The results of this qualitative analysis are presented in Table A.2 in the appendix and show that frequency patterns of source categories differ somewhat, indicating that particular reporting expressions may be associated with particular source categories. When the results of both codings are combined (Table 6), however, the ranking of the source categories is virtually identical to that based on the most frequent expressions, with only health advocacy groups and lay people changing places (but very close to each other). Triangulation with the Quotation Tool thus gives us additional confidence in the original findings. All three tables are included in the appendix for further consultation.

Table 6:

Frequency and percentage for each source category (combined).

Source category Total frequency of codings
Research findings and announcements 597 (23.26 %)
Medical and health experts 475 (18.50 %)
Health advocacy groups 350 (13.63 %)
Lay people 398 (15.50 %)
Politicians, government officials and government initiatives 205 (7.99 %)
Businesses/companies 97 (3.78 %)
Research organisations 79 (3.08 %)
Professional experts 157 (6.12 %)
Celebrities 19 (0.74 %)
Media outlet or story 14 (0.55 %)
Guidelines and information sheets 8 (0.31 %)
Unclear 107 (4.17 %)
Other 25 (0.97 %)
Error 36 (1.40 %)
Total 2,567 (100 %)

A final use of the notebook for corpus-based DA is that we can use the quoted content itself as a corpus to identify linguistic characteristics of such content. To do so, we take the column that contains the quoted content, extract this content, and turn it into a corpus. We can then analyse the five quoted content corpora for frequent features, for example part-of-speech tags, semantic tags, grammatical words, etc. (cf. Tables A.4–A.6 in the appendix). There is no room here to discuss these results adequately and further analyses would be necessary, but these tables indicate some similar trends regardless of the topic of the corpus. For instance, common POS tags (Table A.4) across all or most corpora include singular/mass nouns, prepositions/subordinating conjunctions, determiners, adjectives, personal pronouns, singular proper nouns, plural nouns, adverbs, verbs (in base form; past participle; singular present tense; past tense; gerund/present participle), and coordinating conjunctions. Excluding the grammatical bin and unmatched cases, the top 15 semantic tag lists (Table A.5) show a preference for Pronouns, Existing, Time: Period, Getting and possession, Personal names, Evaluation: Authentic (all corpora) as well as People, Likely, and Numbers (four corpora). The overlapping grammatical words (Table A.6) likely reflect general frequency trends in English, but do inform us about the importance of first person plural pronouns (we) in quoted content, as they occur among the top 15 most frequent grammatical words in all but one corpus. That these tendencies occur regardless of the topic informs us about general language trends in newspaper quotes on the whole.
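
A minimal sketch of this workflow is given below, assuming the exported spreadsheet holds the quoted content in a column named “quote” (the actual column name may differ) and using NLTK’s Penn Treebank tagger as one possible option; NLTK resource names can differ across versions.

```python
# Hedged sketch: turning the quoted-content column into a small corpus and
# counting Penn Treebank POS tags with NLTK. File and column names are
# assumptions for illustration; NLTK resource names may differ by version.
from collections import Counter

import nltk
import pandas as pd

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

quotes = pd.read_excel("quotes_output.xlsx")
quoted_corpus = " ".join(quotes["quote"].dropna().astype(str))

tokens = nltk.word_tokenize(quoted_corpus)
tag_counts = Counter(tag for _, tag in nltk.pos_tag(tokens))
print(tag_counts.most_common(15))  # e.g. NN, IN, DT, JJ, PRP ...
```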

In sum, the Quotation Tool has a range of potential uses for corpus-based discourse analysis of English-language newspaper texts, and we hope that this notebook – alongside other ATAP notebooks – will prove helpful to other researchers. Of course, manual analysis will be more accurate and is preferable for small text collections. In addition, the tool does work with medium-to-large datasets (e.g. 26,000 articles) but may be slow or require several attempts, and it currently does not work with “big data”. Thus, the tool may be most suitable for medium-to-large corpora that are not amenable to manual coding.

5 Conclusions

The present paper showcased the use of Jupyter notebooks for improving transparency in corpus-based DA, as part of the new accountability framework introduced above. We do, however, want to briefly acknowledge the limitations of such notebooks.

Although Jupyter notebooks support the reproducibility/repeatability of analyses, this can be difficult to attain if code cells are executed out of the intended top-to-bottom order or if dependencies change over time (see Beg et al. 2021: 41) – that is, if the results of functions change because the packages or libraries that the notebook uses are edited or updated. In addition, tracking changes (Vastola 2023) and collaborating on Jupyter notebooks with traditional version control systems like Git can be challenging due to the notebooks’ JSON-based format. Finally, many analyses rely on random sampling, which will render them non-reproducible if it is not ensured that the same random samples are drawn when re-running the code (Wang et al. 2020: 289).
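
Where random sampling is involved (e.g. when downsampling texts for qualitative analysis), one common mitigation is to fix and report a random seed, as in this minimal sketch; the file names are invented for illustration.

```python
# Minimal sketch: fixing and documenting a random seed so that a randomly
# downsampled selection of corpus texts is identical on every re-run.
import random

corpus_files = [f"article_{i:03d}.txt" for i in range(100)]  # illustrative file names

random.seed(42)  # report the seed so others can reproduce the same sample
sample = random.sample(corpus_files, 10)
print(sample)
```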

Another suite of limitations relates to the fact that notebooks may represent an unfamiliar, “strange” tool (Beg et al. 2021: 41–42) for those with little or no computational experience, and can have a steep learning curve. This issue ties in with another potentially problematic aspect: the quality and tidiness of the code. Jupyter notebooks can become cluttered with outputs, comments, and unorganized code, making them less readable and maintainable. Further, issues emerging from using large datasets and computationally intensive methods remain serious drawbacks, with Jupyter notebooks being unable to handle large datasets or “timing-out” if computational processing of data takes longer than a pre-specified time limit (see Vastola 2023). Another problem is the rapidly changing computational ecosystem, since best practices and environments for running and sharing Jupyter notebooks are evolving and adapting at a remarkable speed. A general recommendation is to make them findable, accessible, interoperable and reusable (FAIR), with practical recommendations on how to achieve this available from the Australian Research Data Commons (2023).

Finally, like many other computational tools, the NLP methods used in Jupyter notebooks are based and tested on large datasets representing major, mostly western-European, languages and alphabets, which can lead to issues when trying to apply them beyond such contexts. In sum, these various limitations of Jupyter notebooks need to be continuously weighed against their advantages (see Section 3.2), and we do not want to rule out that future ATAP tools will use different applications as new developments occur.

In addition to critical reflection on Jupyter notebooks, there is a need for a more general discussion about the integration of computational methods in corpus-based DA. The limited but increasing reflective work on combining corpus linguistics and discourse analysis shows that the field is now robust enough to handle such reflection (Marchi and Taylor 2018). Baker (2018: 281) confirms that such reflection “suggests a maturing of the field of corpora and discourse studies”. This should also include critical reflection on other issues, such as those included in the accountability framework we introduced above. Our discussion here was limited in scope and did not encompass all facets of the framework. Rather, we focused on a specific area and introduced a cross-institutional initiative: the Australian Text Analytics Platform, which currently employs Jupyter notebooks. Our emphasis has been on notebooks relevant to corpus-based DA, with the Quotation Tool notebook serving as an example to illustrate the link to accountability and potential applications. Of course, Jupyter notebooks do not address all aspects of the accountability framework. Researchers should also consider how they might deal with other elements in corpus-based discourse analysis, including the decisions made when building corpora, annotating data, defining variables or selecting features of interest, generating hypotheses, identifying discourses, and so on (see Table 1). Different options may need to be investigated for bringing together, and making accountable, the different steps that are combined in a single research project. For instance, Schweinberger and Haugh (in press) propose a GitHub repository which comprises documentation, code, and an interactive notebook, but other workflow representations are clearly possible and deserve further exploration.

Could the aim of striving for accountability be considered naïve? Our aim has been to encourage researchers to at least think about these aspects and to see which they can address, and how, given the constraints under which they operate in a particular research project. We anticipate that future research and critical scrutiny can expand upon this article, delving into other aspects of the accountability framework in greater depth and refining it further. Just to give one example, we have not touched upon issues around open access dissemination of project findings. Clearly, there are a number of relevant open scholarship projects and initiatives in both applied linguistics and linguistics that can be considered in this respect.[7] In conclusion, we must acknowledge the pressing concerns surrounding additional developments in corpus-based DA. This includes the imperative to centre marginalized voices (e.g. Nartey 2022, 2023) and promote greater diversity among both practitioners and the data they analyse.


Corresponding author: Monika Bednarek, School of Humanities, The University of Sydney, Sydney, NSW, Australia, E-mail:

Acknowledgements

We are very grateful to Maite Taboada for permitting us to adapt the code for quotation extraction developed by the Gender Gap Tracker team at Simon Fraser University for use in the ATAP Quotation Tool. We acknowledge the technical assistance provided by the Sydney Informatics Hub, a Core Research Facility of the University of Sydney. The Jupyter notebook development was rendered possible through the Australian Text Analytics Platform (https://doi.org/10.47486/PL074) and the Language Data Commons of Australia via the HASS Research Data Commons and Indigenous Research Capability Program (https://doi.org/10.47486/HIR001). These projects received investment from the Australian Research Data Commons (ARDC), which is funded by the National Collaborative Research Infrastructure Strategy (NCRIS).

References

Ancarno, Clyde. 2018. Interdisciplinary approaches in corpus linguistics and CADS. In Charlotte Taylor & Anna Marchi (eds.), Corpus approaches to discourse: A critical review, 130–156. London & New York: Routledge. https://doi.org/10.4324/9781315179346-7.

Anthony, Laurence & Paul Baker. 2015. ProtAnt: A tool for analysing the prototypicality of texts. International Journal of Corpus Linguistics 20(3). 273–292. https://doi.org/10.1075/ijcl.20.3.01ant.

Anthony, Laurence, Nicholas Smith, Sebastian Hoffmann & Paul Rayson. 2023. Understanding corpus text prototypicality: A multifaceted problem. Paper presented at the ICAME 44 conference, North-West University, 17–21 May.

Applied Linguistics Press. 2024. Applied Linguistics Press [list of open scholarship resources]. https://www.appliedlinguisticspress.org/home/os-resources (accessed 18 March 2024).

Asr, Fatemeh Torabi, Mohammad Mazraeh, Alexandre Lopes, Vasundhara Gautam, Junette Gonzales, Prashanth Rao & Maite Taboada. 2021. The gender gap tracker: Using natural language processing to measure gender bias in media. PLoS One 16(1). e0245533. https://doi.org/10.1371/journal.pone.0245533.

Australian Research Data Commons. 2023. FAIR for Jupyter notebooks: A practical guide. https://ardc.edu.au/resource/fair-for-jupyter-notebooks-a-practical-guide/ (accessed 18 March 2024).

Australian Text Analytics Platform. 2024. Quotation Tool notebook: Help pages. https://github.com/Australian-Text-Analytics-Platform/quotation-tool/blob/main/documents/quotation_help_pages.pdf (accessed 18 March 2024).

Baker, Paul. 2006. Using corpora in discourse analysis. London: Continuum. https://doi.org/10.5040/9781350933996.

Baker, Paul. 2009. Issues in teaching corpus-based discourse analysis. In Linda Lombardo (ed.), Using corpora to learn about language and discourse, 73–79. Bern & New York: Peter Lang.

Baker, Paul. 2012. Acceptable bias? Using corpus linguistics methods with critical discourse analysis. Critical Discourse Studies 9(3). 247–256. https://doi.org/10.1080/17405904.2012.688297.

Baker, Paul. 2015. Does Britain need any more foreign doctors? Inter-analyst consistency and corpus-assisted (critical) discourse analysis. In Maggie Charles, Nicholas Groom & Suganthi John (eds.), Corpora, grammar, text and discourse: In honour of Susan Hunston, 283–300. Amsterdam & Philadelphia: John Benjamins. https://doi.org/10.1075/scl.73.13bak.

Baker, Paul. 2018. Conclusion: Reflecting on reflective research. In Charlotte Taylor & Anna Marchi (eds.), Corpus approaches to discourse: A critical review, 281–292. London & New York: Routledge.

Baker, Paul. 2020. Analysing representations of obesity in the Daily Mail via corpus and down-sampling methods. In Jesse Egbert & Paul Baker (eds.), Using corpus methods to triangulate linguistic analysis, 85–108. London & New York: Routledge. https://doi.org/10.4324/9781315112466-4.

Baker, Paul & Jesse Egbert (eds.). 2016. Triangulating methodological approaches in corpus linguistic research. London & New York: Routledge. https://doi.org/10.4324/9781315724812.

Baker, Paul & Erez Levon. 2015. Picking the right cherries? A comparison of corpus-based and qualitative analyses of news articles about masculinity. Discourse & Communication 9(2). 221–336. https://doi.org/10.1177/1750481314568542.

Baker, Paul & Tony McEnery. 2015. Introduction. In Paul Baker & Tony McEnery (eds.), Corpora and discourse studies: Integrating discourse and corpora, 1–19. Basingstoke & New York: Palgrave Macmillan. https://doi.org/10.1057/9781137431738_1.

Baker, Paul, Costas Gabrielatos, Majid Khosravinik, Michał Krzyzanowski, Tony McEnery & Ruth Wodak. 2008. A useful methodological synergy? Combining critical discourse analysis and corpus linguistics to examine discourses of refugees and asylum seekers in the UK press. Discourse & Society 19(3). 273–306. https://doi.org/10.1177/0957926508088962.

Bednarek, Monika. 2006. Evaluation in media discourse: Analysis of a newspaper corpus. London & New York: Continuum.

Bednarek, Monika. 2009. Corpora and discourse: A three-pronged approach to analyzing linguistic data. In Michael Haugh, Kate Burridge, Jean Mulder & Pam Peters (eds.), Selected proceedings of the 2008 HCSNet workshop on designing the Australian national corpus: Mustering languages. Somerville: Cascadilla Proceedings Project.

Bednarek, Monika. 2016. Voices and values in the news: News media talk, news values and attribution. Discourse, Context & Media 11. 27–37. https://doi.org/10.1016/j.dcm.2015.11.004.

Bednarek, Monika & Helen Caple. 2014. Why do news values matter? Towards a new methodological framework for analyzing news discourse in critical discourse analysis and beyond. Discourse & Society 25(2). 135–158. https://doi.org/10.1177/0957926513516041.

Bednarek, Monika & Helen Caple. 2017. The discourse of news values: How news organisations create newsworthiness. Oxford & New York: Oxford University Press. https://doi.org/10.1093/acprof:oso/9780190653934.001.0001.

Bednarek, Monika & Georgia Carr. 2019. Guide to the diabetes news corpus (DNC). https://osf.io/jrhx2/ (accessed 4 July 2023).

Bednarek, Monika & Georgia Carr. 2021. Australian diabetes news media coverage. Australian Diabetes Educator 23(4). https://ade.adea.com.au/australian-diabetes-news-media-coverage/ (accessed 4 July 2023).

Bednarek, Monika, Andrew S. Ross, Olga Boichak, Yaegan J. Doran, Georgia Carr, Eduardo G. Altmann & Tristram J. Alexander. 2022. Winning the discursive struggle? The impact of a significant environmental crisis event on dominant climate discourses on Twitter. Discourse, Context & Media 45(100564). 1–13. https://doi.org/10.1016/j.dcm.2021.100564.

Beg, Marijan, Juliette Taka, Thomas Kluyver, Alexander Konovalov, Min Ragan-Kelley, Nicolas M. Thiéry & Hans Fangohr. 2021. Using Jupyter for reproducible scientific workflows. Computing in Science & Engineering 23(2). 36–46. https://doi.org/10.1109/mcse.2021.3052101.

Bender, Emily M. 2019. The #BenderRule: On naming the languages we study and why it matters. The Gradient. https://thegradient.pub/the-benderrule-on-naming-the-languages-we-study-and-why-it-matters/ (accessed 2 June 2023).

Bloor, Michael. 1997. Techniques of validation in qualitative research: A critical commentary. In Gale Miller & Robert Dingwall (eds.), Context and method in qualitative research, 37–50. London: Sage. https://doi.org/10.4135/9781849208758.n3.

Bochynska, Agata, Liam Keeble, Caitlin Halfacre, Joseph V. Casillas, Irys-Amélie Champagne, Kaidi Chen, Melanie Röthlisberger, Erin M. Buchanan & Timo B. Roettger. 2023. Reproducible research practices and transparency across linguistics. Glossa Psycholinguistics 2(1). 1–36. https://doi.org/10.5070/G6011239.

Bray, Carly. 2023. Applying decolonial research principles in corpus-based critical discourse analysis of Aboriginal and Torres Strait Islander peoples and issues. Paper presented at the 7th meeting of the International Society for the Linguistics of English (ISLE 7), University of Queensland, Australia, 19–22 June 2023.

Brezina, Vaclav. 2018. Statistical choices in corpus-based discourse analysis. In Charlotte Taylor & Anna Marchi (eds.), Corpus approaches to discourse: A critical review, 259–280. London & New York: Routledge. https://doi.org/10.4324/9781315179346-12.

Caldas-Coulthard, Carmen Rosa. 1994. On reporting reporting: The representation of speech in factual and factional narratives. In Malcolm Coulthard (ed.), Advances in written text analysis, 295–308. London: Routledge.

Caple, Helen, Changpeng Huan & Monika Bednarek. 2020. Multimodal news analysis across cultures. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781108886048.

Collins, Peter & Xinyue Yao. 2019. AusBrown: A new diachronic corpus of Australian English. ICAME Journal 43(1). 5–21. https://doi.org/10.2478/icame-2019-0001.

Denzin, Norman K. 1970. The research act in sociology: A theoretical introduction to sociological methods. London & Chicago: Butterworths.

Doyle, Paul. 2005. Replicating corpus-based linguistics: Investigating lexical networks in text. In Proceedings from Corpus Linguistics 2005. Birmingham: University of Birmingham. https://www.birmingham.ac.uk/documents/college-artslaw/corpus/conference-archives/2005-journal/lexiconodf/coling2005paper.pdf (accessed 16 June 2023).

Egbert, Jesse. 2023. "I tried": Transparency in reporting methods. Linguistics with a Corpus. https://linguisticswithacorpus.wordpress.com/2023/10/31/i-tried-transparency-in-reporting-methods/ (accessed 18 March 2024).

Egbert, Jesse & Paul Baker. 2016. Research synthesis. In Paul Baker & Jesse Egbert (eds.), Triangulating methodological approaches in corpus-linguistic research, 183–208. London & New York: Routledge.

Egbert, Jesse & Paul Baker (eds.). 2020. Using corpus methods to triangulate linguistic analysis. London & New York: Routledge. https://doi.org/10.4324/9781315112466.

Fairclough, Norman. 1988. Discourse representation in media discourse. Sociolinguistics 17. 125–139.

Gray, Bethany. 2016. Lexical bundles. In Paul Baker & Jesse Egbert (eds.), Triangulating methodological approaches in corpus-linguistic research, 33–56. London & New York: Routledge.

Hardt-Mautner, Gerlinde. 1995. 'Only connect': Critical discourse analysis and corpus linguistics. UCREL Technical Paper 6. Lancaster: University of Lancaster. http://ucrel.lancs.ac.uk/papers/techpaper/vol6.pdf (accessed 4 July 2023).

Hober, Nicole, Tülay Dixon & Tove Larsson. 2023. Towards increased reliability and transparency in projects with manual linguistic coding. Corpora 18(2). 245–258. https://doi.org/10.3366/cor.2023.0284.

Jaworska, Sylvia & Karen Kinloch. 2018. Using multiple data sets. In Charlotte Taylor & Anna Marchi (eds.), Corpus approaches to discourse: A critical review, 110–129. London & New York: Routledge. https://doi.org/10.4324/9781315179346-6.

Jufri, Sony & Chao Sun. 2022. Quotation Tool (version 1.0.0). Australian Text Analytics Platform. Software. Available at: https://github.com/Australian-Text-Analytics-Platform/quotation-tool.

Lee, Kelvin K. H. 2024. Using constructed week sampling to compile a newspaper corpus. Sydney Corpus Lab. https://sydneycorpuslab.com/using-constructed-week-sampling-to-compile-a-newspaper-corpus/ (accessed 18 March 2024).

Leech, Geoffrey. 1992. Corpora and theories of linguistic performance. In Jan Svartvik (ed.), Directions in corpus linguistics, 105–122. Berlin: De Gruyter Mouton.

Lorenzo-Dus, Nuria. 2023. Digital grooming: Discourses of manipulation and cyber-crime. New York: Oxford University Press. https://doi.org/10.1093/oso/9780190845193.001.0001.

Marchi, Anna & Charlotte Taylor. 2009. If on a winter's night two researchers…: A challenge to assumptions of soundness of interpretation. Critical Approaches to Discourse Analysis across Disciplines 3(1). 1–20.

Marchi, Anna & Charlotte Taylor. 2018. Introduction: Partiality and reflexivity. In Charlotte Taylor & Anna Marchi (eds.), Corpus approaches to discourse: A critical review, 1–15. London & New York: Routledge. https://doi.org/10.4324/9781315179346-1.

McEnery, Tony. 2016. Keywords. In Paul Baker & Jesse Egbert (eds.), Triangulating methodological approaches in corpus-linguistic research, 20–32. London & New York: Routledge.

McEnery, Tony & Vaclav Brezina. 2022. Fundamental principles of corpus linguistics. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781107110625.

McEnery, Tony & Andrew Hardie. 2012. Corpus linguistics: Method, theory and practice. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511981395.

McGlashan, Mark. 2021. Networked discourses of bereavement in online COVID-19 memorials. International Journal of Corpus Linguistics 26(4). 557–582. https://doi.org/10.1075/ijcl.21135.mcg.

Musgrave, Simon. 2021. What are the FAIR and CARE principles and why should corpus linguists know about them? Sydney Corpus Lab. https://sydneycorpuslab.com/what-are-the-fair-and-care-principles-and-why-should-corpus-linguists-know-about-them/ (accessed 4 July 2023).

Nartey, Mark. 2022. Centering marginalized voices: A discourse analytic study of the Black Lives Matter movement on Twitter. Critical Discourse Studies 19(5). 523–538. https://doi.org/10.1080/17405904.2021.1999284.

Nartey, Mark (ed.). 2023. Voice, agency and resistance: Emancipatory discourses in action [Special issue]. Critical Discourse Studies 19(5). https://doi.org/10.4324/9781003373674.

Nartey, Mark & Isaac N. Mwinlaaru. 2019. Towards a decade of synergizing corpus linguistics and critical discourse analysis: A meta-analysis. Corpora 14(2). 203–235. https://doi.org/10.3366/cor.2019.0169.

Paquot, Magali & Marcus Callies. 2020. Promoting methodological expertise, transparency, replication, and cumulative learning: Introducing new manuscript types in the International Journal of Learner Corpus Research. International Journal of Learner Corpus Research 6(2). 121–124. https://doi.org/10.1075/ijlcr.00014.edi.

Partington, Alan. 2008. The armchair and the machine: Corpus-assisted discourse studies. In Carol Taylor Torsello, Katherine Ackerley & Erik Castello (eds.), Corpora for university language teachers, 95–118. Bern: Peter Lang.

Perkel, Jeffrey M. 2018. Why Jupyter is data scientists' computational notebook of choice. Nature 563(7732). 145–146. https://doi.org/10.1038/d41586-018-07196-1.

Pimentel, João Felipe, Leonardo Murta, Vanessa Braganholo & Juliana Freire. 2021. Understanding and improving the quality and reproducibility of Jupyter notebooks. Empirical Software Engineering 26. 65. https://doi.org/10.1007/s10664-021-09961-9.

Schweinberger, Martin & Michael Haugh. In press. Reproducibility and transparency in interpretive corpus pragmatics. International Journal of Corpus Linguistics.

Scott, Mike. 2020. WordSmith Tools (version 8). Stroud: Lexical Analysis Software Ltd. Software. Available at: https://lexically.net/wordsmith/.

Shen, Helen. 2014. Interactive notebooks: Sharing the code. Nature 515. 151–152. https://doi.org/10.1038/515151ax.

Sönning, Lukas & Valentin Werner. 2021. The replication crisis, scientific revolutions, and linguistics. Linguistics 59(5). 1179–1206. https://doi.org/10.1515/ling-2019-0045.

Stubbs, Michael. 1996. Text and corpus analysis: Computer-assisted studies of language and culture. Oxford: Blackwell.

Taylor, Charlotte & Anna Marchi (eds.). 2018. Corpus approaches to discourse: A critical review. London & New York: Routledge. https://doi.org/10.4324/9781315179346.

Vanichkina, Darya & Monika Bednarek. 2022. Australian obesity corpus manual. https://osf.io/h6n82 (accessed 3 March 2022).

Vastola, John. 2023. Why I stopped using Jupyter notebooks and why you should too. Medium. https://levelup.gitconnected.com/why-i-stopped-using-jupyter-notebook-and-why-you-should-too-b1e564d49ea1 (accessed 12 May 2023).

Wang, Jiawei, Tzu-yang Kuo, Li Li & Andreas Zeller. 2020. Restoring reproducibility of Jupyter notebooks. In ICSE '20: Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering: Companion proceedings, 288–289. https://doi.org/10.1145/3377812.3390803.

White, Peter R. R. 2012. Exploring the axiological workings of 'reporter voice' news stories—attribution and attitudinal positioning. Discourse, Context & Media 1(2–3). 57–67. https://doi.org/10.1016/j.dcm.2012.10.004.

Wodak, Ruth. 2001. The discourse historical approach. In Ruth Wodak & Michael Meyer (eds.), Methods of critical discourse analysis, 63–94. London: SAGE. https://doi.org/10.4135/9780857028020.n4.


Supplementary Material

This article contains supplementary material (https://doi.org/10.1515/cllt-2023-0104).


Received: 2023-10-22
Accepted: 2024-04-02
Published Online: 2024-04-16
Published in Print: 2024-10-28

© 2024 the author(s), published by De Gruyter, Berlin/Boston

This work is licensed under the Creative Commons Attribution 4.0 International License.
