OBT+stat
Janne Bondi Johannessen✝
Abstract
The paper describes the improvement of the rule-based Constraint Grammar (CG) Oslo-Bergen Tagger (OBT) by the addition of a statistical module. It is in the nature of CG taggers to leave some words ambiguous between different readings, because the linguistics-based rules do not cover every case. Such ambiguities are often a problem for applications that use the tagger, among them the Norwegian Newspaper Corpus. Our statistical module not only removes part-of-speech (PoS) and morphological ambiguities, but also disambiguates lemmas. We show how this new system, referred to as OBT+stat, combines the strengths of the linguistic knowledge-based CG approach with data-driven methods in a straightforward manner. The result is a high-performing, fully disambiguating PoS/morphological tagger and lemmatizer with very satisfactory evaluation results.
Chapters in this book
- Prelim pages i
- Table of contents v
- Building a large corpus based on newspapers from the web 1
Part I. Exploiting the web as a corpus – Methods and tools
- Corpuscle – a new corpus management platform for annotated corpora 31
- OBT+stat 51
- Exploring corpora through syntactic annotation 67
- Collocations and statistical analysis of n-grams 79
- Automatic topic classification of a large newspaper corpus 111
- A data-driven approach to anglicism identification in Norwegian 131
Part II. Corpus-based case studies
- A corpus-based study of the adaptation of English import words in Norwegian 157
- Norm clusters in written Norwegian 193
- Lexical neography in modern Norwegian 221
- Ash compound frenzy 241
- Financial jargon in a general newspaper corpus 257
- Metonymic extension and vagueness 285
- Spatial metaphors in present-day Norwegian newspaper language 307
- Doing historical linguistics using contemporary data 331
- Name index 351
- Subject index 353