The grammatical annotation of speech corpora
Eckhard Bick
Abstract
This chapter discusses the grammatical annotation of speech corpora on the one hand (C-ORAL-Brasil, NURC) and speech-like text on the other (e-mail, chat, tv-news, parliamentary discussions), drawing on Portuguese data for the former and English data for the latter. We try to identify and compare linguistic orality markers (“speechlikeness”) in different genres, and argue that broad-coverage Constraint Grammar parsers such as PALAVRAS and EngGram can be adapted to these features, and used across the text-speech divide. Special topics include emoticons, phonetic variation and syntactic features. For ordinary speech corpora we propose a system of two-level annotation, where overlaps, retractions and phonetic variation are maintained as meta-tagging, while allowing conventional annotation of an orthographically normalized textual layer. In the absence of punctuation, syntactic segmentation can be achieved by exploiting prosodic breaks as delimiters in parsing rules. With the exception of chat data, our modified “oral” CG parsers perform reasonably close to their written language counterparts, even for true transcribed speech, achieving accuracy rates (F-scores) above 98% for PoS tags and 93–95% for syntactic function.
Chapters in this book
- Prelim pages i
- Table of contents v
- Acknowledgements vii
- Introduction: Spoken corpora and linguistic studies 1
Section I: Experiences and requirements of spoken corpora compilation
- Methodological issues for spontaneous speech corpora compilation 27
- A multilingual speech corpus of North-Germanic languages 69
- Methodological considerations for the development and use of sign language acquisition corpora 84
Section II: Multilevel corpus annotation
- The grammatical annotation of speech corpora 105
- The IPIC resource and a cross-linguistic analysis of information structure in Italian and Brazilian Portuguese 129
- The variation of action verbs in multilingual spontaneous speech corpora 152
Section III: Prosody and its functional levels
- Speech and corpora 191
- Corpus design for studying the expression of emotion in speech 210
- Illocution, attitudes and prosody 233
- Exploring the prosody of stance 271
Section IV: Syntax and Information Structure
- Prosody and information structure 297
- The notion of sentence and other discourse units in corpus annotation 331
- Syntactic properties of spontaneous speech in the Language into Act Theory 365
- Prosodic constraints for discourse markers 411
- Appendix 468
- Index 496