Text length and short texts
Aatu Liimatta
Abstract
Variation in text length is an unavoidable confounder in quantitative text-analytic corpus-linguistic studies. Texts can be difficult to compare across text lengths, particularly if many of them are short, due to the difficulty of calculating meaningful frequencies for the lexical items and linguistic features of interest. Traditionally, this has been less of an issue, since texts in many of the genres typically studied in linguistics have been relatively long. However, the rise of social media has brought the issue to the forefront. In this chapter, I describe the problem of text length and short texts together with a number of solutions and workarounds to this and related problems.
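As a rough illustration of why frequencies are hard to compare across text lengths, here is a minimal sketch. It is not taken from the chapter; it assumes the common practice of normalizing raw counts to a rate per 1,000 words. The function name `per_thousand` and the example counts are hypothetical.

```python
def per_thousand(count: int, text_length: int) -> float:
    """Normalized frequency: occurrences per 1,000 words (assumed convention)."""
    if text_length <= 0:
        raise ValueError("text_length must be positive")
    return count * 1000 / text_length


# Hypothetical data: a single occurrence of some feature in a
# 20-word social media post and in a 2,000-word article.
short_rate = per_thousand(1, 20)
long_rate = per_thousand(1, 2000)

# Effect of one additional occurrence on the rate in each text.
short_step = per_thousand(2, 20) - per_thousand(1, 20)
long_step = per_thousand(2, 2000) - per_thousand(1, 2000)

print(short_rate, long_rate)
print(short_step, long_step)
```

In the 20-word text, each additional occurrence moves the rate by 50 per 1,000 words, so the rate can only take a few coarse values; in the 2,000-word text the same occurrence moves it by 0.5. The same raw count therefore yields very different and very unevenly reliable rates.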
Chapters in this book
- 日本言語政策学会 / Japan Association for Language Policy. 言語政策 / Language Policy 10. 2014 i
- Table of contents v
- Acknowledgements vii
- From fallacies and pitfalls to solutions and future directions 1
- Engaging with bad (meta)data in historical corpus linguistics 9
- Named entities as potentially problematic items in corpora 35
- Challenges in the compilation, annotation, and analysis of learner corpus data 55
- Early newspapers as data for corpus linguistics (and Digital Humanities) 68
- Open Corpus Linguistics – or How to overcome common problems in dealing with corpus data by adopting open research practices 89
- Text length and short texts 106
- Corpus genre categories 126
- Modeling fine-grained sociolinguistic variation 142
- Subject index 171