
The grammatical annotation of speech corpora

Techniques and perspectives
  • Eckhard Bick
This chapter is in the book Spoken Corpora and Linguistic Studies (John Benjamins Publishing Company).

Abstract

This chapter discusses the grammatical annotation of speech corpora on the one hand (C-ORAL-Brasil, NURC) and speech-like text on the other (e-mail, chat, tv-news, parliamentary discussions), drawing on Portuguese data for the former and English data for the latter. We try to identify and compare linguistic orality markers (“speechlikeness”) in different genres, and argue that broad-coverage Constraint Grammar parsers such as PALAVRAS and EngGram can be adapted to these features, and used across the text-speech divide. Special topics include emoticons, phonetic variation and syntactic features. For ordinary speech corpora we propose a system of two-level annotation, where overlaps, retractions and phonetic variation are maintained as meta-tagging, while allowing conventional annotation of an orthographically normalized textual layer. In the absence of punctuation, syntactic segmentation can be achieved by exploiting prosodic breaks as delimiters in parsing rules. With the exception of chat data, our modified “oral” CG parsers perform reasonably close to their written language counterparts, even for true transcribed speech, achieving accuracy rates (F-scores) above 98% for PoS tags and 93–95% for syntactic function.
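The idea of using prosodic breaks in place of punctuation can be illustrated with a minimal sketch. It is not the PALAVRAS or EngGram implementation, where delimiters are handled inside the Constraint Grammar rules themselves; it only shows the segmentation step in isolation. The break markers ("//" for a terminal break, "/" for a non-terminal one), the function name and the sample transcript are assumptions made for illustration.

```python
from typing import List

# Hypothetical transcription markers (assumed for this sketch):
TERMINAL = "//"      # terminal prosodic break, treated as an utterance delimiter
NON_TERMINAL = "/"   # non-terminal break, treated as a unit boundary inside an utterance


def segment(transcript: str) -> List[List[str]]:
    """Split a punctuation-free transcript into utterances at terminal
    breaks, and each utterance into smaller units at non-terminal breaks."""
    utterances = []
    # Split on the two-character marker first so it is not read as two "/" marks.
    for utterance in transcript.split(TERMINAL):
        units = [u.strip() for u in utterance.split(NON_TERMINAL) if u.strip()]
        if units:  # skip the empty remainder after the final terminal break
            utterances.append(units)
    return utterances


if __name__ == "__main__":
    sample = "eu acho que / ele vai // não sei //"
    for units in segment(sample):
        print(units)
```

Each inner list then plays the role that a punctuation-delimited sentence plays in written text: a window within which parsing rules can operate.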


Downloaded on 12.2.2026 from https://www.degruyterbrill.com/document/doi/10.1075/scl.61.04bic/html