Engaging with bad (meta)data in historical corpus linguistics
Turo Vartiainen
Abstract
In this chapter, we discuss some common pitfalls related to historical data and its use in linguistic analysis. We argue that the “philologist’s dilemma”, as originally proposed by Rissanen (1989), should be reconceptualized to meet the needs of the fast-evolving field of corpus linguistics, where scholars make increasing use of big-data resources and sophisticated statistical modelling. By providing examples of errors and uncertainties related to, for example, corpus metadata, sampling, balance, and OCR accuracy, we argue that corpus linguists should pay increasingly close attention to the sampling and annotation principles employed in the compilation of historical corpora as well as to the quality of the linguistic data. We propose that the principle of “knowing one’s corpus” in terms of its compilation principles has become all the more important in the age of big-data corpora, where it is not feasible for individual researchers, or corpus compilers, to validate their data manually.
Chapters in this book
- 日本言語政策学会 / Japan Association for Language Policy. 言語政策 / Language Policy 10. 2014
- Table of contents
- Acknowledgements
- From fallacies and pitfalls to solutions and future directions
- Engaging with bad (meta)data in historical corpus linguistics
- Named entities as potentially problematic items in corpora
- Challenges in the compilation, annotation, and analysis of learner corpus data
- Early newspapers as data for corpus linguistics (and Digital Humanities)
- Open Corpus Linguistics – or How to overcome common problems in dealing with corpus data by adopting open research practices
- Text length and short texts
- Corpus genre categories
- Modeling fine-grained sociolinguistic variation
- Subject index