Doing Corpus Linguistics

Zihan Xia; Gaoqiang Lu

doi:10.1515/csh-2024-0032

Article Open Access


Published/Copyright: November 20, 2025


From the journal Corpus-based Studies across Humanities Volume 3 Issue 2

Reviewed Publication:

Eniko Csomay and William J. Crawford. Doing Corpus Linguistics (2nd Edition). New York: Routledge, 2024, 190 pp. ISBN: 9781032425771 (hbk).

When supported by computational methods, corpus-based approaches provide empirical evidence for analysing linguistic features (lexical and grammatical) and textual variation (Biber et al. 1998, 3–16). Because corpus linguistics is a statistically based method that aims to minimise researcher bias, it has been increasingly adopted in language studies (e.g. Fang and Cao 2015; Brezina 2018). While solid theoretical foundations have been developed, a significant gap remains between theory and practical application in corpus-based research.

Doing Corpus Linguistics by Eniko Csomay and William Crawford was first published in 2015. The second edition has been enhanced with additional exercises, new exemplars, and updated bibliographic references. This updated version offers a comprehensive guide to using corpus methods for register analysis. As the authors observe, there is a technical side to corpus linguistics that is “best acquired through practice and experience with corpora” (p. xi).

The book consists of three main parts. Part 1 (Chapters 1–2) provides a foundational overview of key concepts in linguistics, corpus linguistics, and language variation, together with a framework for register analysis. Part 2 (Chapters 3–4) shifts the focus to the practical aspects of corpus analysis, introducing the search tools and language units that can be used to investigate existing corpora. Finally, Part 3 (Chapters 5–9) explains how to create specialised corpora, conduct statistical analysis, integrate these techniques into readers' own projects, and identify further areas for exploration.

Chapter 1 serves as an introduction and sets out the fundamental framework of corpus linguistics and language variation. Corpus linguistics systematically investigates large collections of texts that share contextual characteristics in order to examine patterns of language use across different contexts. This corpus-based methodology reveals collocational patterns and the frequency distributions of lexical items, and it provides empirical evidence for evaluating prescriptive grammatical rules against authentic language use. As the text states: “it is empirical, analysing the actual patterns of use in natural language texts” (p. 8). Chapter 2 focuses on the theoretical framework and methodological foundations of Register Functional Analysis. The authors establish register analysis as an analytical framework for understanding language variation, comprising three essential components: 1) an analysis of the context in which a text is produced, 2) an analysis of the linguistic features found in the texts, and 3) a functional interpretation of the relationship between the context and the language produced in it.

Chapter 3 explores specific methods of corpus analysis, including keyword-in-context (KWIC) concordancing, keyword analysis, collocates, n-grams, and the application of part-of-speech (POS) tagging. Through illustrative examples, readers learn how to use tools such as COCA and AntConc to retrieve and analyse lexical and grammatical patterns across various registers, gaining practical analytical techniques for linguistic research. Chapter 4 gives readers practical opportunities to conduct corpus linguistics projects using publicly available corpora (e.g. English-Corpora.org). These corpora were selected for their cost-effectiveness, accessibility, and comprehensive coverage, making them suitable for researchers, teachers, and students conducting diverse linguistic studies. The chapter emphasises that, when performing cross-corpus analyses, it is important “to understand the situational characteristics of the texts when representing a register” to ensure the validity and accuracy of research findings (p. 63).
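For readers unfamiliar with the search techniques named above, the following minimal Python sketch illustrates what KWIC lines, n-gram counts, and window-based collocate counts are. It is not taken from the book and does not reproduce how COCA or AntConc work internally; the tokenizer, the function names, the window sizes, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def kwic(tokens, node, window=3):
    """Return (left context, node, right context) for every occurrence of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


def ngrams(tokens, n=2):
    """Count every contiguous sequence of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def collocates(tokens, node, window=2):
    """Count the words that occur within `window` tokens of `node` (raw co-occurrence only)."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            counts.update(tokens[max(0, i - window):i])
            counts.update(tokens[i + 1:i + 1 + window])
    return counts


# Invented sample text, used only to exercise the functions.
sample = "The corpus shows patterns. A corpus shows how language is used. The corpus is large."
tokens = tokenize(sample)

for left, node, right in kwic(tokens, "corpus"):
    print(f"{left:>25} | {node} | {right}")
print(ngrams(tokens, 2).most_common(2))
print(collocates(tokens, "corpus").most_common(3))
```

Dedicated corpus tools typically add association measures and POS-aware queries on top of raw counts like these; the sketch stops at the counting step.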

Through detailed steps and practical projects, Chapter 5 guides readers through the process of constructing specialised corpora tailored to specific research questions. The chapter underscores the importance of selecting appropriate research topics, ensuring corpus balance, and using software tools judiciously to conduct effective linguistic analyses. It also introduces fundamental terms, concepts, and assumptions of statistical analysis. Chapters 6 and 7 introduce the application of statistical analysis in corpus linguistics, along with commonly used statistical tests and measures (e.g. ANOVA, chi-square, Pearson correlation, and Cohen’s d). These two chapters help readers understand how to use statistical methods to uncover associations between patterns of language use and their situational characteristics, because, as the authors state, “we need quantitative measures (e.g., the frequency of a particular language feature) to see how commonly these patterns occur” (p. 98). They also explain in detail how to run these tests in statistical software (e.g. SPSS) and how to interpret the results. Chapter 8 serves as a comprehensive summary of the entire book, detailing the specific steps involved in conducting register functional analysis. The authors recommend first describing the situational characteristics of the texts, then analysing their linguistic features, and finally offering functional interpretations of the identified language patterns in order to understand why these patterns occur. Chapter 9 looks ahead to future directions in corpus linguistics, emphasising the importance of multidimensional analysis methods. The authors also encourage researchers to explore emerging fields such as learner corpora, forensic linguistics, and natural language processing.
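To make the quantitative measures mentioned above more concrete, the following Python sketch computes a normalised frequency, a 2 × 2 Pearson chi-square statistic, and Cohen’s d. It is an illustration under stated assumptions, not the book’s procedure (which uses software such as SPSS): normalisation per million words is a common convention assumed here, the function names are hypothetical, and every count in the example is invented.

```python
from math import sqrt
from statistics import mean, stdev


def per_million(count, corpus_size):
    """Normalise a raw frequency to occurrences per million tokens."""
    return count * 1_000_000 / corpus_size


def chi_square_2x2(freq_a, size_a, freq_b, size_b):
    """Pearson chi-square (no continuity correction) comparing how often a
    feature occurs, versus all other tokens, in corpus A and corpus B."""
    observed = [[freq_a, size_a - freq_a], [freq_b, size_b - freq_b]]
    row_totals = [size_a, size_b]
    col_totals = [freq_a + freq_b, (size_a - freq_a) + (size_b - freq_b)]
    total = size_a + size_b
    chi2 = 0.0
    for r in range(2):
        for c in range(2):
            expected = row_totals[r] * col_totals[c] / total
            chi2 += (observed[r][c] - expected) ** 2 / expected
    return chi2


def cohens_d(sample_a, sample_b):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    n_a, n_b = len(sample_a), len(sample_b)
    pooled_sd = sqrt(
        ((n_a - 1) * stdev(sample_a) ** 2 + (n_b - 1) * stdev(sample_b) ** 2)
        / (n_a + n_b - 2)
    )
    return (mean(sample_a) - mean(sample_b)) / pooled_sd


# Invented counts: a feature occurring 150 times in 500,000 tokens of one
# register and 90 times in 600,000 tokens of another.
print(per_million(150, 500_000), per_million(90, 600_000))
print(chi_square_2x2(150, 500_000, 90, 600_000))

# Invented per-text normalised rates for two small groups of texts.
print(cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0]))
```

The chi-square value still has to be compared against a critical value (or converted to a p-value), and its assumptions, such as independence of observations, checked before any interpretation is drawn.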

In summary, this book provides a comprehensive guide on how to conduct a complete corpus study, covering technical aspects and basic statistical methods. The book transitions seamlessly from theoretical concepts to hands-on practice, bridging the gap between theory and research, with a notable emphasis on “doing corpus linguistics”. The guidance offered in Chapters 3 and 8 on using corpus tools for register analysis stands out as a unique and valuable feature, addressing a gap in similar publications. Therefore, this book serves as both a valuable introductory textbook and a guide for researchers embarking on their first corpus linguistics analysis, offering detailed guidance on methodology and applications.

However, some areas could be strengthened. Firstly, incorporating more diverse research cases, such as stance analysis (Siu et al. 2024a) and lexical bundle analysis (Siu et al. 2024b), would enhance the book’s practical value. Secondly, given that this book was written when AI models like GPT-4o and Claude 3.5 were already widely adopted, its limited discussion of the impact of these tools on corpus research is a notable omission, particularly regarding the ongoing debates about distinguishing between AI-generated and human-written texts (Kong and Liu 2024). Thirdly, expanding on the typical challenges and obstacles researchers encounter in actual projects, coupled with in-depth analyses of specific failed cases, would better equip readers to navigate methodological pitfalls in their own research (Lu and Fang 2025). Implementing these suggestions would substantially enhance the book’s utility as an introductory guide to corpus research.

Corresponding author: Gaoqiang Lu, City University of Hong Kong, Hong Kong, Hong Kong, E-mail: gaoqianlu2-c@my.cityu.edu.hk

References

Biber, Douglas, Susan Conrad, and Randi Reppen. 1998. Corpus Linguistics: Investigating Language Structure and Use. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511804489.

Brezina, Vaclav. 2018. Statistics in Corpus Linguistics: A Practical Guide. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781316410899.

Fang, Alex Chengyu, and Jing Cao. 2015. Text Genres and Registers: The Computation of Linguistic Features. Berlin, Heidelberg: Springer.

Kong, Xinwan, and Chengyu Liu. 2024. “A Comparative Genre Analysis of AI-generated and Scholar-written Abstracts for English Review Articles in International Journals.” Journal of English for Academic Purposes 71: 101432. https://doi.org/10.1016/j.jeap.2024.101432.

Lu, Gaoqiang, and Alex Chengyu Fang. 2025. “Challenges in Corpus Linguistics: Rethinking Corpus Compilation and Analysis.” Applied Linguistics. https://doi.org/10.1093/applin/amaf028.

Siu, Wing Yee Barbara, Muhammad Afzaal, and Hessah Saleh Aldayel. 2024a. “A Corpus-based Comparison of Linguistic Markers of Stance and Genre in the Academic Writing of Novice and Advanced Engineering Learners.” Humanities and Social Sciences Communications 11 (1): 284. https://doi.org/10.1057/s41599-024-02757-4.

Siu, Wing Yee Barbara, Muhammad Afzaal, Hessah Saleh Aldayel, and Samantha Curle. 2024b. “Unlocking the Mysteries of Academic Writing: A Corpus-based Analysis of Lexical Bundles in L2 English for Engineering Students.” Sage Open 14 (4). https://doi.org/10.1177/21582440241299997.

Published Online: 2025-11-20

Published in Print: 2025-11-25

This work is licensed under the Creative Commons Attribution 4.0 International License.
