Di Cristofaro, M: Corpus Approaches to Language in Social Media

Xiaoshu Yuan

doi:10.1515/csh-2025-0002

Open Access Article

Published/Copyright: February 10, 2025

From the journal Corpus-based Studies across Humanities, Volume 3, Issue 1

Reviewed Publication:

Di Cristofaro, M. 2023. Corpus Approaches to Language in Social Media. New York: Routledge, Taylor & Francis, 399 pp. ISBN: 978-1032125725. $54.99 (paperback).

In the era of globalization and rapid technological development, social media has emerged as a powerful platform that facilitates global communication and interaction (Jackson 2023), profoundly influencing language evolution and introducing novel linguistic trends and forms of expression (Ekayati et al. 2024). Digital technologies have transformed corpus linguistics methodologies, offering expansive datasets and advanced tools for analyzing social media language. In this context, corpus linguistics has become an indispensable methodology, enabling researchers to examine large-scale digital data and contributing significantly to the understanding of language use on social media platforms.

Corpus Approaches to Language in Social Media presents an in-depth exploration of the intersections between corpus linguistics, computer science, and digital humanities, advocating interdisciplinarity as the key principle of contemporary social science research. A key element of this work is Matteo Di Cristofaro’s study of advanced methods for handling digital content from social media, underlining their potential to redefine corpus linguistics and advance data-driven analysis in linguistics and the social sciences. Corpus linguists, linguists more broadly, and social scientists interested in data-driven analysis of social media will find the book essential reading.

The volume comprises seven chapters. The opening chapter clearly states that it aims to expand corpus approaches by incorporating digital technicalities essential for understanding the digital environments and techniques used to produce, collect, and process digital textual data (p. 4). The author discusses how digital humanities address language challenges and how corpus linguistics analyzes cognition and social interactions through digital textual data. In this chapter, the first section offers a definition of a minimal set of terms and the key elements of social media. The second section introduces the fields of digital humanities and corpus linguistics, concluding with an overview of the book’s structure and aims. This structure deftly guides readers toward a more nuanced understanding of the subsequent content and enhances their engagement with the material that follows.

In Chapter 2, the author examines the use of social media as digital research data, focusing on the intersection of digital environments with cognitive and social dynamics. The chapter also emphasizes the impact of digital technologies on language usage within social media platforms. Additionally, it addresses legal and ethical issues, including open-source principles, copyright, and the challenges tied to data usage in research. It then draws attention to the paradigm shift that digital technologies have brought to corpus linguistics, stressing how a corpus traditionally includes metadata, textual markup, and annotation to give researchers a better understanding of the data and interpretative details for querying and filtering it (p. 44).

Chapter 3 offers a detailed guide for applying a corpus-based approach to data analysis, establishing a foundational understanding of corpus linguistics by exploring key tools and concepts. It commences with evaluating frequently used corpus tools, followed by a precise definition of core concepts in corpus linguistics such as ‘type’, ‘token’, and ‘lemma’. The author then probes standard functions of these tools, including frequency lists, dispersion, concordances, KWIC (key word in context), collocations, keywords, and stoplists – functions that are indispensable in contemporary linguistic research, enabling researchers to extract, interpret, and manipulate large datasets effectively using a corpus-based approach. This chapter concludes by introducing novel approaches from computer science and other disciplines that expand the statistical toolkit for social scientists, such as sentiment analysis and topic modeling, stressing the significance of interdisciplinary methods in corpus linguistics.
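To make these notions concrete, the following is a minimal sketch, not taken from the book, of how tokens, types, a frequency list, a stoplist, and KWIC lines relate to one another. The sample sentence, the naive tokenization rule, and the helper names `tokenize` and `kwic` are assumptions made purely for illustration; dedicated corpus tools apply far more careful tokenization, and lemmatization is left out entirely.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately naive rule)."""
    return re.findall(r"[a-z0-9#@']+", text.lower())


def kwic(tokens, node, window=3):
    """Return (left context, node, right context) for each occurrence of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


# Hypothetical one-sentence "corpus" standing in for a social media post.
sample = "I love this phone. Love the camera, love the screen, hate the price."

tokens = tokenize(sample)   # every running word counts as one token
types = set(tokens)         # each distinct word form counts as one type
freq = Counter(tokens)      # frequency list: word form -> number of occurrences

# A stoplist removes high-frequency function words before counting.
stoplist = {"i", "the", "this"}
content_freq = Counter(t for t in tokens if t not in stoplist)

print(len(tokens), "tokens,", len(types), "types")
for left, node, right in kwic(tokens, "love", window=2):
    print(f"{left:>12} | {node} | {right}")
```

In this toy example "love" and "the" each occur three times, so only the stoplist makes "love" stand out as the most frequent content word, which is the kind of filtering decision the chapter asks researchers to make consciously.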

Chapters 4 and 5, the most technically dense of the volume, provide a detailed explanation of building corpora, including corpus design, data collection, and data processing, after reiterating the essential notions and techniques involved in data analysis. Chapter 4 starts by introducing somewhat technical notions such as the command-line interface and programming languages, among which Python is the one mostly used for data processing. However, the author emphasizes that the volume is not intended to alienate readers unfamiliar with Python. Instead, it focuses on the rationale behind the code, allowing social scientists to understand the relationship between the code and the analytical considerations of language, regardless of their coding expertise (p. 102). The chapter also outlines commonly used data formats (e.g., CSV, XML, HTML, and JSON). Furthermore, Di Cristofaro elaborates on the critical processes of preserving, processing, and formatting data to ensure that research is thoroughly documented. Essentially, Chapter 4 delves into the concept of engaging with data, which may leave readers questioning the origins of the data.

Chapter 5 takes up this question and is devoted to data collection on social media. The author presents a small set of software tools and techniques useful for data collection, including general-purpose scrapers and platform-specific scrapers. When dealing with data from web sources, Di Cristofaro distinguishes two main operations: crawling and scraping, i.e., browsing the web to look for relevant data and copy-pasting those chunks of data to be saved on one’s hard drive (p. 144). Several general-purpose data collection tools are then presented, including #LancsBox, which has recently gained crawling and scraping abilities, ArchiveBox, Trafilatura, and BeautifulSoup. The next section introduces four data collection tools, each developed for scraping data from one or more specific social media platforms (e.g., Twitter, Instagram, Facebook, and YouTube), along with detailed technical explanations. Following the introduction of data collection, the author goes on to explain how to clean, format, and process data as required by the research purpose and the corpus tool employed, which somewhat overlaps with Chapter 4.
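The distinction between crawling and scraping can be illustrated with a minimal sketch that is not drawn from the book. It uses only Python's standard-library `html.parser` as a stand-in for the dedicated tools named above, and it works on a hard-coded HTML string rather than a live website; the page content, the link paths, and the class name `ParagraphScraper` are hypothetical. The collected links are what a crawler would visit next, while the collected paragraphs are what a scraper would save.

```python
from html.parser import HTMLParser


class ParagraphScraper(HTMLParser):
    """Collect the text of <p> elements and the targets of <a href=...> links."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # scraped text, one entry per <p> element
        self.links = []        # link targets a crawler could follow next
        self._in_p = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buffer = []
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            # Join the text fragments and normalize whitespace.
            self.paragraphs.append(" ".join("".join(self._buffer).split()))
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buffer.append(data)


# Hypothetical forum page; a real workflow would download this from the web.
page = """
<html><body>
  <nav><a href="/thread/1">Thread 1</a> <a href="/thread/2">Thread 2</a></nav>
  <p>First <b>post</b> in the thread.</p>
  <p>Second post, with a <a href="/user/42">mention</a>.</p>
</body></html>
"""

scraper = ParagraphScraper()
scraper.feed(page)

print(scraper.links)       # candidates for crawling
print(scraper.paragraphs)  # text to be saved into the corpus
```

Keeping the two lists separate mirrors the two operations: the navigation links never enter the corpus, and the post text is stripped of markup before it is stored.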

Chapter 6 focuses on three case studies, each providing details as to how the practices and methods outlined in previous chapters have been applied in real text analysis scenarios (p. 313). Each case study is divided into five sections: background, context, corpus design, data processing, and corpus analysis. The first case, “Analysing crypto-drug market fora”, examines how trust is discursively built among members of anonymous online communities created for drug trading. The second case, “Analysing the language of far-right groups on Twitter and Facebook”, investigates how far-right groups construe their group identity and ideology online. The third case, “The Communicative Modus Operandi of Online Child Sexual Groomers”, contributes to advancing computer technologies designed to safeguard children from online predators by detecting the distinctive linguistic and discursive patterns typically used by groomers in online communication (e.g., Lorenzo-Dus, Kinzel, and Di Cristofaro 2020). The chapter outlines the research process and links the methods to concepts from earlier chapters while offering alternative approaches using the tools and techniques discussed, thereby illustrating the value of the broad view of corpus approaches advocated throughout the book.

Following the thorough exploration of case studies, the final chapter provides a concise conclusion and underscores four key points. First, understanding the data allows one to use it correctly and find new ways to engage with it; an exchange among disciplines may provide benefits that a single discipline could not (p. 373). Second, digital literacy is a prerequisite, since the exponential growth of digital technology has made corpus linguists aware that their existing skills were too specific and that additional basic knowledge was required (p. 374). Third, efforts should be made toward interdisciplinary cooperation between corpus linguistics and its sibling disciplines (e.g., computer science) to address future challenges in the field and to fully harness the power of digital data in social science research. Lastly, corpus approaches must undergo paradigm shifts in response to the ever-changing digital landscape.

Overall, this book is a valuable contribution to the existing literature on corpus-based exploration of social media. The volume excels in its detailed investigation of practical applications, from data collection to the processing and organization of digital content. It examines the intricacies of metadata, text markup, and annotation, offering a clear roadmap for conducting corpus analyses. Moreover, the book covers ethical considerations such as copyright and privacy issues, ensuring that readers are equipped to navigate the ethical challenges of working with social media data. The standout contribution of the book is its case studies, which illustrate how corpus approaches can be applied to tackle real-world questions through the lens of social media, enabling novice researchers to carry out practical research based on a “DIY” social media corpus.

However, some minor limitations can also be seen. First, there is some overlap between Chapters 4 and 5 regarding data processing, as mentioned above. Second, the volume focuses solely on Western social media and scrapers, while platforms in other parts of the world, e.g., Douyin, Xiaohongshu, and Weibo, are neglected. Third, with the advancement of digital technologies, space should also be given to the potential applications of advanced machine learning or AI-driven methods in data analysis, which could help researchers handle large datasets more efficiently. Nonetheless, this book, which is written in a lucid style, is accessible to a broad readership, including lay readers, scholars, and postgraduate students interested in the areas of social media, corpus linguistics, computer science, and the humanities.

Corresponding author: Xiaoshu Yuan, Institute of Language Sciences, Shanghai International Studies University, Shanghai, China, E-mail: xiaoshuyuan@shisu.edu.cn

References

Ekayati, R., B. Sibarani, S. A. Ginting, R. Husein, and T. S. Amin. 2024. “Digital Dialects: The Impact of Social Media on Language Evolution and Emerging Forms of Communication.” International Journal of Evaluation and Research in Education 3 (2): 605–9.

Jackson, O. 2023. “The Influence of Social Media on Language Change and Development.” Frontiers of Language and Communication Studies 5 (1): 37.

Lorenzo-Dus, N., A. Kinzel, and M. Di Cristofaro. 2020. “The Communicative Modus Operandi of Online Child Sexual Groomers: Recurring Patterns in their Language Use.” Journal of Pragmatics 155: 15–27.

Published Online: 2025-02-10

This work is licensed under the Creative Commons Attribution 4.0 International License.

https://doi.org/10.1515/csh-2025-0002
