Collocations and statistical analysis of n-grams
Gunn Inger Lyse and Gisle Andersen
Abstract
Multiword expressions (MWEs) are sequences of words that co-occur so often that they are perceived as a single linguistic unit. Since MWEs pervade natural language, their identification is pertinent for a range of tasks within lexicography, terminology and language technology. We apply various statistical association measures (AMs) to word sequences from the Norwegian Newspaper Corpus (NNC) in order to rank two- and three-word sequences (bigrams and trigrams) in terms of their tendency to co-occur. The results show that some statistical measures favour relatively frequent MWEs (e.g. i motsetning til ‘as opposed to’), whereas other measures favour relatively low-frequency units, which typically comprise loan words (de facto), technical terms (notarius publicus) and phrasal anglicisms (practical jokes; cf. G. Andersen, this volume). On this basis we evaluate the relevance of each of these measures for lexicography, terminology and language technology purposes.
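The contrast between frequency-favouring and rarity-favouring measures can be illustrated with a minimal sketch. The abstract does not name the association measures used in the chapter, so the two below (pointwise mutual information and the t-score) are assumptions chosen as common textbook examples, and the toy token list is invented for illustration rather than taken from the NNC.

```python
import math
from collections import Counter


def bigram_scores(tokens):
    """Return {bigram: (frequency, pmi, t_score)} for adjacent word pairs.

    Assumed measures (not taken from the chapter itself):
      expected = f(w1) * f(w2) / N
      PMI      = log2(observed / expected)
      t-score  = (observed - expected) / sqrt(observed)
    """
    total = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), observed in bigrams.items():
        expected = unigrams[w1] * unigrams[w2] / total
        pmi = math.log2(observed / expected)
        t_score = (observed - expected) / math.sqrt(observed)
        scores[(w1, w2)] = (observed, pmi, t_score)
    return scores


# Invented toy corpus: "i motsetning" occurs twice, "de facto" only once.
text = (
    "i motsetning til andre er han i byen . "
    "i motsetning til henne er de i huset . "
    "dette er de facto slik i dag ."
)
tokens = text.split()
scores = bigram_scores(tokens)

# Print the two bigrams of interest with their frequency, PMI and t-score.
for pair in [("i", "motsetning"), ("de", "facto")]:
    freq, pmi, t = scores[pair]
    print(pair, freq, round(pmi, 2), round(t, 2))
```

On this toy input, PMI ranks the rare pair de facto above the more frequent i motsetning, while the t-score ranks them the other way round, which mirrors the tendency described in the abstract. Real rankings would of course depend on the corpus and on the measures actually applied.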
Chapters in this book
- Prelim pages i
- Table of contents v
- Building a large corpus based on newspapers from the web 1
Part I. Exploiting the web as a corpus – Methods and tools
- Corpuscle – a new corpus management platform for annotated corpora 31
- OBT+stat 51
- Exploring corpora through syntactic annotation 67
- Collocations and statistical analysis of n-grams 79
- Automatic topic classification of a large newspaper corpus 111
- A data-driven approach to anglicism identification in Norwegian 131
Part II. Corpus-based case studies
- A corpus-based study of the adaptation of English import words in Norwegian 157
- Norm clusters in written Norwegian 193
- Lexical neography in modern Norwegian 221
- Ash compound frenzy 241
- Financial jargon in a general newspaper corpus 257
- Metonymic extension and vagueness 285
- Spatial metaphors in present-day Norwegian newspaper language 307
- Doing historical linguistics using contemporary data 331
- Name index 351
- Subject index 353