OBT+stat: A combined rule-based and statistical tagger

Janne Bondi Johannessen; Kristin Hagen; André Lynum; Anders Nøklestad

Published by John Benjamins Publishing Company

This chapter is in the book Exploring Newspaper Language

Abstract

The paper describes the improvement of the rule-based Constraint Grammar (CG) Oslo-Bergen Tagger (OBT) by the addition of a statistical module. It is in the nature of CG taggers to leave some words ambiguous between different readings when the linguistically based rules do not cover them. Such ambiguities are often a problem for applications that use the tagger, among them the Norwegian Newspaper Corpus. Our statistical module not only removes part-of-speech (PoS) and morphological ambiguities, but also disambiguates lemmas. We show how this new system, referred to as OBT+stat, straightforwardly combines the strengths of the linguistic knowledge-based CG approach with data-driven methods. The result is a high-performing, fully disambiguating PoS/morphological tagger and lemmatizer with very satisfactory evaluation results.

Chapters in this book

Prelim pages i
Table of contents v
Building a large corpus based on newspapers from the web 1
Part I. Exploiting the web as a corpus – Methods and tools
Corpuscle – a new corpus management platform for annotated corpora 31
OBT+stat 51
Exploring corpora through syntactic annotation 67
Collocations and statistical analysis of n-grams 79
Automatic topic classification of a large newspaper corpus 111
A data-driven approach to anglicism identification in Norwegian 131
Part II. Corpus-based case studies
A corpus-based study of the adaptation of English import words in Norwegian 157
Norm clusters in written Norwegian 193
Lexical neography in modern Norwegian 221
Ash compound frenzy 241
Financial jargon in a general newspaper corpus 257
Metonymic extension and vagueness 285
Spatial metaphors in present-day Norwegian newspaper language 307
Doing historical linguistics using contemporary data 331
Name index 351
Subject index 353

https://doi.org/10.1075/scl.49.03joh
