Agile corpus creation
Holger Voormann
Abstract
In the past decades, language corpora have become indispensable tools for linguistic research and the development of linguistic theory. However, it is not yet widely acknowledged that the quality of corpus-based research and theories depends crucially on the quality of the corpora, not only in terms of their content and size but especially in terms of the accuracy and richness of their annotations. Nor has much systematic thought gone into how effectively the traditional corpus creation process addresses this problem. This paper proposes a novel approach to corpus creation – agile corpus creation – that addresses the problem of simultaneously maximizing corpus size and the quality and quantity of manual and automatic annotations while minimizing the time and cost involved in corpus creation. The central aspects of agile corpus creation lie in the reorganization of the traditionally linear and separate phases of corpus design, data collection, data annotation and corpus analysis, and in the recognition of potential sources of error during corpus creation.
© 2008 by Walter de Gruyter GmbH & Co. KG, D-10785 Berlin
Articles in the same Issue
- Semantic preference and semantic prosody re-examined
- Cross-linguistic comparisons of the market metaphors
- Testing search engine frequencies: Patterns of inconsistency
- Discourse and metaphor: A corpus-driven inquiry
- Agile corpus creation
- On the computation of collostruction strength: Testing measures of association as expressions of lexical bias
- NTU corpus of Formosan languages: A state-of-the-art report
- Contents Volume 4 (2008)