Spoken Learner Corpora: Methodological and Technical Aspects of Collection, Indexing, and Use (Gesprochene Lernerkorpora: Methodisch-technische Aspekte der Erhebung, Erschließung und Nutzung)

Hagen Hirschmann; Thomas Schmidt

doi:10.1515/zgl-2022-2048


Published/Copyright: April 20, 2022


From the journal Zeitschrift für germanistische Linguistik, Volume 50, Issue 1

Abstract

This article provides an overview of the methodological and technical issues that arise in collecting, indexing, and using spoken learner corpora, i.e. corpora containing spoken utterances by learners of a target language. After an introductory discussion of the main features that distinguish this type of corpus from written learner corpora and from spoken corpora of L1 speakers, we address questions of corpus design in more detail. The main part of the paper then surveys the methodological and technical procedures involved in each step of collecting, indexing, providing, and using spoken learner corpora. The main aim of this survey is to highlight practices that can be considered best practices given the current state of research. Finally, we outline the challenges that remain for this type of corpus.


Published online: 2022-04-20

Published in print: 2022-04-30

