Chancen und Grenzen von automatischer Annotation

  • Heike Zinsmeister
Published/Copyright: March 30, 2015

Abstract

Linguistic annotation helps corpus users retrieve relevant examples efficiently. It also supports the identification of latent patterns in the data by encoding generalizations such as parts of speech. Because manual annotation is very time-consuming, many projects use off-the-shelf tools to annotate, or at least preprocess, their data. This article discusses the pros and cons of such automatic annotation, using part-of-speech tagging as an example. It argues that the errors made by annotation tools are systematic in nature and hence predictable to a certain extent. In addition, the article addresses the descriptive adequacy of tagsets, in particular how well the Stuttgart-Tübingen Tagset (STTS) describes German parts of speech. Finally, the article briefly addresses normalization, an additional preprocessing step that is sometimes required before automatic annotation tools can be applied.
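
One way to make the claim about systematic tagger errors concrete is to count which gold tag is confused with which predicted tag. The following is a minimal sketch, not the article's own method: the function name `confusion_pairs` and the toy tag sequences are assumptions made purely for illustration, and the data are invented rather than taken from any corpus or tagger. The tag labels themselves (ART, NN, NE, VVFIN, VVINF) are STTS tags.

```python
from collections import Counter


def confusion_pairs(gold_tags, predicted_tags):
    """Count (gold, predicted) pairs for every token the tagger got wrong.

    Hypothetical helper for illustration; both sequences must be aligned
    token by token.
    """
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences must have equal length")
    return Counter(
        (gold, pred)
        for gold, pred in zip(gold_tags, predicted_tags)
        if gold != pred
    )


# Invented toy data: manually assigned tags vs. the output of some tagger.
gold = ["ART", "NN", "VVFIN", "ART", "NE", "VVINF", "NN", "NE"]
pred = ["ART", "NN", "VVFIN", "ART", "NN", "VVFIN", "NN", "NN"]

errors = confusion_pairs(gold, pred)

# List the confusions from most to least frequent.
for (gold_tag, pred_tag), count in errors.most_common():
    print(f"{gold_tag} tagged as {pred_tag}: {count}")
```

If errors are systematic, a small number of such pairs should account for most of the mistakes, which suggests where manual correction effort is best spent.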

Published online: 2015-3-30
Published in print: 2015-3-1

© 2015 Walter de Gruyter GmbH & Co. KG, Berlin/Boston

Downloaded on 25.1.2026 from https://www.degruyterbrill.com/document/doi/10.1515/zgl-2015-0004/pdf