
John Benjamins Publishing Company

Chapter

Collocations and statistical analysis of n-grams

Multiword expressions in newspaper text

Abstract

Multiword expressions (MWEs) are words that co-occur so often that they are perceived as a linguistic unit. Since MWEs pervade natural language, their identification is relevant to a range of tasks within lexicography, terminology and language technology. We apply various statistical association measures (AMs) to word sequences from the Norwegian Newspaper Corpus (NNC) in order to rank two- and three-word sequences (bigrams and trigrams) in terms of their tendency to co-occur. The results show that some statistical measures favour relatively frequent MWEs (e.g. i motsetning til ‘as opposed to’), whereas other measures favour relatively low-frequency units, which typically comprise loan words (de facto), technical terms (notarius publicus) and phrasal anglicisms (practical jokes; cf. G. Andersen this volume). On this basis we evaluate the relevance of each of these measures for lexicography, terminology and language technology purposes.
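
The contrast between measures that favour frequent units and measures that favour rare ones can be illustrated with a minimal sketch. The abstract does not name the AMs used, so the two measures below (pointwise mutual information and the t-score) are assumptions chosen because they are widely used and behave differently with respect to frequency. The function name `bigram_scores` and the toy English sentence are hypothetical and are not taken from the NNC.

```python
import math
from collections import Counter


def bigram_scores(tokens):
    """Return {bigram: (frequency, pmi, t_score)} for adjacent word pairs.

    Assumption: expected frequency is f(w1) * f(w2) / N, with N taken as the
    number of tokens (a common simplification; the bigram count is N - 1).
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (w1, w2), observed in bigrams.items():
        expected = unigrams[w1] * unigrams[w2] / n
        # Pointwise mutual information: log ratio of observed to expected.
        pmi = math.log2(observed / expected)
        # t-score: difference between observed and expected, scaled by sqrt(observed).
        t_score = (observed - expected) / math.sqrt(observed)
        scores[(w1, w2)] = (observed, pmi, t_score)
    return scores


# Toy data (hypothetical): one frequent pair ("of the") and one rare,
# exclusive pair ("de facto").
toy_tokens = (
    "of the cat of the dog of the man of the sea "
    "of a cat in the sea de facto"
).split()

toy_scores = bigram_scores(toy_tokens)
top_by_pmi = max(toy_scores, key=lambda b: toy_scores[b][1])
top_by_t = max(toy_scores, key=lambda b: toy_scores[b][2])
print("highest PMI:", top_by_pmi)
print("highest t-score:", top_by_t)
```

On this toy input, PMI ranks the pair that occurs once but whose words never occur apart ("de facto") highest, while the t-score ranks the most frequent pair ("of the") highest. The sketch covers bigrams only; extending it to trigrams requires choosing how to define the expected frequency of a three-word sequence, which is left open here.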


Downloaded on 2.5.2026 from https://www.degruyterbrill.com/document/doi/10.1075/scl.49.05lys/html?lang=en