Computational extraction of formulaic sequences from corpora
Alexander Wahl
Abstract
We describe a new algorithm for the extraction of formulaic language from corpora. Called MERGE (Multi-word Expressions from the Recursive Grouping of Elements), it iteratively combines adjacent bigrams into progressively longer sequences based on lexical association strengths. We then evaluate this approach in two case studies. First, we compare MERGE with another algorithm by assessing the output of each against manually annotated formulaic sequences from the spoken component of the British National Corpus. Second, we use two child language corpora to examine whether MERGE can predict the formulas that the children learn from caregiver input. Ultimately, we show that MERGE performs well, offering a powerful approach for the extraction of formulas.
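The iterative merging idea in the abstract can be illustrated with a minimal sketch. This is not the chapter's implementation: the abstract does not name the association measure, so pointwise mutual information (PMI) is used here purely as an assumption, and the function names (`best_bigram`, `merge_pair`, `extract_sequences`), the `min_count` frequency floor, and the `max_merges` stopping rule are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def best_bigram(elements, min_count=2):
    """Return (score, freq, pair) for the strongest adjacent pair, or None.

    Association is scored with PMI, which is an assumption of this sketch,
    not necessarily the measure used in the chapter.
    """
    unigrams = Counter(elements)
    bigrams = Counter(zip(elements, elements[1:]))
    n = len(elements)
    best = None
    for (a, b), f_ab in bigrams.items():
        if f_ab < min_count:  # hypothetical frequency floor
            continue
        pmi = math.log2(f_ab * n / (unigrams[a] * unigrams[b]))
        candidate = (pmi, f_ab, (a, b))
        # Ties on PMI fall back to frequency, then to tuple ordering.
        if best is None or candidate > best:
            best = candidate
    return best


def merge_pair(elements, pair):
    """Replace each non-overlapping occurrence of `pair` with one element."""
    out, i = [], 0
    while i < len(elements):
        if i + 1 < len(elements) and (elements[i], elements[i + 1]) == pair:
            out.append(elements[i] + elements[i + 1])  # concatenate word tuples
            i += 2
        else:
            out.append(elements[i])
            i += 1
    return out


def extract_sequences(tokens, max_merges=10, min_count=2):
    """Repeatedly merge the strongest adjacent pair; return merged sequences."""
    elements = [(t,) for t in tokens]  # each element is a tuple of words
    found = []
    for _ in range(max_merges):  # hypothetical stopping rule
        best = best_bigram(elements, min_count)
        if best is None:
            break
        pair = best[2]
        elements = merge_pair(elements, pair)
        found.append(" ".join(pair[0] + pair[1]))
    return found


if __name__ == "__main__":
    toy = "in the end x in the end y in the end z".split()
    print(extract_sequences(toy))
```

Because a merged element re-enters the sequence as a single unit, later iterations can pair it with a neighbouring word, which is how sequences longer than two words emerge in this sketch.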
Chapters in this book
- Prelim pages i
- Table of contents v
- Foreword vii
- Introduction 1
- Monocollocable words 9
- Translation asymmetries of multiword expressions in machine translation 23
- German constructional phrasemes and their Russian counterparts 43
- Computational phraseology and translation studies 65
- Computational extraction of formulaic sequences from corpora 83
- Computational phraseology discovery in corpora with the mwetoolkit 111
- Multiword expressions in comparable corpora 135
- Collecting collocations from general and specialised corpora 151
- What matters more: The size of the corpora or their quality? 177
- Statistical significance for measures of collocation strength 189
- Verbal collocations and pronominalisation 207
- Empirical variability of Italian multiword expressions as a useful feature for their categorisation 225
- Too big to fail but big enough to pay for their mistakes 247
- Multi-word patterns and networks 273
- How context determines meaning 297
- Detecting semantic difference 311
- Index 325