John Benjamins Publishing Company
Chapter 2. From lexical bundles to surprisal and language models
Abstract
We exploit the information-theoretic measure of surprisal to analyse the formulaicity of lexical sequences. We first show the prevalence of individual lexical bundles. We then argue that surprisal, as an information-theoretic measure of lexical bundleness, formulaicity and non-creativity, is an appropriate measure for the idiom principle, as it expresses reader expectations and text entropy. Because strong and gradient formulaic, idiomatic and selectional preferences prevail on all levels, we argue for the abstraction step from individual bundles to measures of bundleness. We use surprisal to analyse differences between genres of native language use and learner language at different levels: (a) spoken and written genres of native language (L1); (b) spoken and written learner language (L2), across selected written genres; (c) learner language (L2) as compared with native language (L1). We thus test Pawley and Syder’s (1983) hypothesis that native speakers know best how to play the tug-of-war between formulaicity (Sinclair’s idiom principle) and expressiveness (Sinclair’s open-choice principle). This tug-of-war can be measured with Levy and Jaeger’s (2007) uniform information density (UID), a principle of minimizing comprehension difficulty. Our goal of abstracting away from word sequences also leads us to language models as models of processing, first in the form of a part-of-speech tagger, then in the form of a syntactic parser. While our hypotheses are largely confirmed, we also observe that advanced learners bundle most, and that scientific language may show lower surprisal than spoken language.
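To make the notion of surprisal concrete, the following is a minimal sketch and not the chapter's own implementation. It assumes a bigram model with add-one smoothing, so the surprisal of a word is the negative base-2 logarithm of its smoothed probability given the preceding word. The toy training text and the function names (`train_bigram_model`, `surprisal`, `mean_surprisal`) are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def train_bigram_model(tokens):
    """Collect the counts needed for an add-one-smoothed bigram model."""
    contexts = Counter(tokens[:-1])             # how often each word occurs as a left context
    bigrams = Counter(zip(tokens, tokens[1:]))  # how often each adjacent word pair occurs
    vocab_size = len(set(tokens))
    return contexts, bigrams, vocab_size


def surprisal(prev, word, model):
    """Surprisal in bits: -log2 P(word | prev), with add-one smoothing (an assumption)."""
    contexts, bigrams, vocab_size = model
    probability = (bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size)
    return -math.log2(probability)


def mean_surprisal(tokens, model):
    """Average surprisal over all adjacent word pairs of a token sequence."""
    if len(tokens) < 2:
        raise ValueError("need at least two tokens to score a bigram")
    values = [surprisal(p, w, model) for p, w in zip(tokens, tokens[1:])]
    return sum(values) / len(values)


# Toy training text (hypothetical) in which "on the other hand" recurs.
training = "on the other hand it is clear that on the other hand we see that".split()
model = train_bigram_model(training)

formulaic = "on the other hand".split()  # a recurrent sequence
scrambled = "hand other the on".split()  # same words, unattested order

print(mean_surprisal(formulaic, model))
print(mean_surprisal(scrambled, model))
```

Under these assumptions the recurrent sequence receives a lower mean surprisal than the scrambled one, which is the sense in which low surprisal can stand in for bundleness. A realistic study would train on a large reference corpus and use a stronger language model.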
Chapters in this book
- Prelim pages i
- Table of contents v
- Acknowledgements vii
- Chapter 1. Present applications and future directions in pattern-driven approaches to corpus linguistics 1
Part I. Methodological explorations
- Chapter 2. From lexical bundles to surprisal and language models 15
- Chapter 3. Fine-tuning lexical bundles 57
- Chapter 4. Lexical obsolescence and loss in English: 1700–2000 81
Part II. Patterns in utilitarian texts
- Chapter 5. Constance and variability 107
- Chapter 6. Between corpus-based and corpus-driven approaches to textual recurrence 131
- Chapter 7. Lexical bundles in Early Modern and Present-day English Acts of Parliament 159
Part III. Patterns in online texts
- Chapter 8. Lexical bundles in Wikipedia articles and related texts 189
- Chapter 9. Join us for this 213
- Chapter 10. I don’t want to and don’t get me wrong 251
- Chapter 11. Blogging around the world 277
- Index 311