Spelling normalisation of Late Modern English
Gerold Schneider
Abstract
Spelling normalisation is an important step towards profiting from natural language processing (NLP) tools when analysing historical text. We first compare and then combine two approaches: VARD, a rule-based system that relies on dictionary lookup and rules with non-probabilistic but trainable weights, and a language-independent approach to spelling normalisation based on statistical machine translation (SMT) techniques. The rule-based system reaches the best accuracy, up to 94% precision at 74% recall, while the SMT system improves each tested period. We obtain the best system by combining both approaches. Re-training VARD on specific time periods and domains is beneficial, and both systems benefit from a language sequence model using collocation strength.
Chapters in this book
- Prelim pages i
- Table of contents v
- Preface vii
- Introduction 1
Part I. Phonology
- “A received pronunciation” 21
- The interplay of internal and external factors in varieties of English 43
Part II. Morphosyntax
- The myth of American English gotten as a historical retention 67
- Changes affecting relative clauses in Late Modern English 91
- Diffusion of do 117
- A diachronic constructional analysis of locative alternation in English, with particular attention to load and spray 143
Part III. Orthography, vocabulary and semantics
- In search of “the lexicographic stamp” 167
- “Divided by a common language”? 185
- Women writers in the 18th century 203
- Eighteenth-century French cuisine terms and their semantic integration in English 219
- Spelling normalisation of Late Modern English 243
Part IV. Pragmatics and discourse
- A far from simple matter revisited 271
- What it means to describe speech 295
- Being Wilde 315
- “I am desired (…) to desire” 333
- Index 357