
Extending ELAN into variationist sociolinguistics

Naomi Nagy and Miriam Meyerhoff
Published/Copyright: October 22, 2015

Abstract

Prior to the implementation of ELAN (tla.mpi.nl/tools/tla-tools/elan, Wittenburg et al. 2006), it was common for sociolinguists to use multiple software applications, and consequently multiple formats, along the route from recording participants to conducting statistical analyses of the data. We present a method which allows for transcription, extracting, coding, preparation for statistical analysis, calculation of some basic frequency statistics, and creation of a concordance all within one program. ELAN is well established as a valuable tool for language documentation. ELAN is frequently used for transcription and multi-tier mark-up illustrating levels of linguistic structure as well as translations and glosses. We hope that this crossover introduction will encourage the efficiency of documentary linguists among sociolinguists and increase the interest in documenting variation among documentarians. After providing an overview of ELAN’s utility, we focus on extracting (or marking) and coding tokens of linguistic variables for quantitative analysis in the variationist sociolinguistic framework. This seamless connection between recording, transcript and coding of dependent and independent variables improves consistency, efficiency, utility, reliability and the accountability of our coding to the original recording. We illustrate a range of benefits and include step-by-step instructions accompanied by downloadable sample files and video clips to illustrate each step of the process (Extending ELAN tutorial files.zip). We also include instructions on importing existing (legacy) transcripts into ELAN.

References

Boersma, Paul & David Weenink. 2014. Praat: doing phonetics by computer [Computer program]. Version 5.3.80. http://www.praat.org/ (accessed 08 July 2014).

British Sign Language Corpus Project. 2012. http://www.bslcorpusproject.org/data/ (accessed 5 October 2014).

EI435LingASL. 22 June 2011. ELAN Tutorial 1.mp4. Retrieved from http://www.youtube.com/watch?v=c54gF4rCePw (accessed 29 July 2015).

Gorman, Kyle, Jonathan Howell & Michael Wagner. 2011. Prosodylab-Aligner: A tool for forced alignment of laboratory speech. Proceedings of Acoustics Week in Canada, Quebec City.

Hazen, Kirk. 2006. IN/ING Variable. In Keith Brown (ed.) Encyclopedia of Language & Linguistics, 2nd edn., volume 5, 581–584. Oxford: Elsevier. doi:10.1016/B0-08-044854-2/04716-7

Hoffman, M. F. & J. A. Walker. 2010. Ethnolects and the city: Ethnic orientation and linguistic variation in Toronto English. Language Variation and Change 22(1). 37–67. doi:10.1017/S0954394509990238

Kisler, Thomas, Florian Schiel & Han Sloetjes. 2012. Signal processing via web services: the use case WebMAUS. Digital Humanities 2012, Hamburg, Germany, pp. 30–34.

MacWhinney, Brian. 2000. The CHILDES Project: Tools for analyzing talk. Third Edition. Mahwah, NJ: Lawrence Erlbaum Associates.

Meyerhoff, Miriam. 2015. Turning variation on its head: Analysing subject prefixes in Nkep (Vanuatu) for language documentation. Asia-Pacific Language Variation 1(1). 79–109. doi:10.1075/aplv.1.1.04mey

Rosenfelder, Ingrid, Joseph Fruehwald, Keelan Evanini & Jiahong Yuan. 2011. FAVE (Forced Alignment and Vowel Extraction) Program Suite. http://fave.ling.upenn.edu (accessed 5 October 2014).

Tacchetti, Maddalena. 2013. User Guide for ELAN Linguistic Annotator version 4.1.0. Retrieved from http://www.mpi.nl/corpus/manuals/manual-elan_ug.pdf (accessed 29 July 2015).

Wittenburg, Peter, Hennie Brugman, Albert Russel, Alex Klassmann & Han Sloetjes. 2006. ELAN: a Professional Framework for Multimodality Research. In Proceedings of LREC 2006, Fifth International Conference on Language Resources and Evaluation.


Supplemental Material

The online version of this article (DOI: 10.1515/lingvan-2015-0012) offers supplementary material available to authorized users. Please note that, because the videos.zip file is large, downloading it takes somewhat longer.


Appendix A: Useful links

Information and instructions for coding with ELAN:

http://projects.chass.utoronto.ca/ngn/HLVC (then choose Resources > For Researchers)

http://projects.chass.utoronto.ca/ngn/pdf/ELAN_Handout_Barras_2013.pdf

Coding assignment with step-by-step instructions:

http://individual.utoronto.ca/ngn/LIN/courses/LIN351/LIN351_project.htm

To use for a class, request data files from the first author.

Video describing each step of the process presented in this paper:

https://uofi.app.box.com/s/yxn8ges7lzggazms9w5agzp9nkq3372o

The following clips from this video are included in the paper and may be downloaded as a .zip file:

01_Title.mp4

02_Overview.mp4

03_IntroducingSociolinguisticVariable.mp4

04_FactorWeights-InterpretingVariableRuleAnalysis.mp4

05_DownloadElan.mp4

06_AdvantagesELAN.mp4

07_WhatELANLooksLike.mp4

08_OverviewCodingSociolinguisticVariables.mp4

09_GoldvarbifyTokenFile.mp4

10_OtherGoodThingsLearn.mp4

11_StartingELAN.mp4

12_CreatingAnnotations.mp4

13_AdjustingAnnotationSize.mp4

14_LoopMode.mp4

15_MakingTiers.mp4

16_TypesELAN.mp4

17_TiersForCoding.mp4

18_SegmentationMode.mp4

19_TranscriptionMode.mp4

20_CodingAnnotationMode.mp4

21_ExportCodedTokens.mp4

22_GoldvarbifyTokenFile2.mp4

Appendix B: Other good things to learn to use in ELAN

  1. Vertical zoom & horizontal zoom in the .wav window (Control + click or Right click on wave display)

  2. Resizing display with slider at bottom right of display

  3. Navigate with “Grid” and “Text” (choose relevant tier from pull-down menu)

  4. Control speed and volume of playback in “Controls”

  5. List of “shortcuts” from the View menu (key combinations)

  6. Change order of tiers (by dragging tier labels)

  7. Delete annotation (select it, Option + D)

  8. Change size of annotation (select it, then Option + Drag edge with mouse)

  9. Templates to set up tiers for many files

The ELAN manual gives clear instructions for all these. We also recommend these tutorials: EI435LingASL (2011), Tacchetti (2013).

Appendix C: Creating ELAN files for legacy transcripts

This protocol was created for a particular corpus of Word .doc transcriptions. [3] You will need to adjust certain aspects, particularly in the first section, to the extent that the original transcribers of your corpus made different formatting decisions.

Clean up .doc transcriptions (in Word) – batch processing

  1. In Word, open all the files that need to be “ELANized”. [Refer to 1. Celeste_transcript_ING_marked_AppC_Step1.doc.]

    Edit as necessary to clean up. You can open several or all of the files and then select “All Open Documents” from the pull-down menu in the Search window in Word. Alternatively, you can create a macro to repeat the process in successive files. Be careful!

    (In the steps below, the text after “Replace” and after “with” is exactly what you type in the “Find what” and “Replace with” boxes in Word, except that the word SPACE represents one blank space.)

    1. Remove all tabs: Replace ^t with SPACE

    2. Remove double spaces: Replace SPACE SPACE with SPACE (Repeat as necessary)

    3. Make each sentence (or clause – your choice – these will be your annotation entries) start on a new line:

                  Replace [ with ^p[ (if “[ ]” marks speaker codes)

                  Replace . with . ^p

                  Replace ? with ? ^p

                  Replace ! with ! ^p

                  Replace (…) with .^p

                  Replace with .^p

                  Replace . ^p” with .”^p (for quoted speech)

                  Replace ?^p” with ?”^p

                  Replace !^p” with !”^p

                  Replace ^p SPACE with ^p

                  Replace SPACE ^p with ^p

                  Replace ^p^p with ^p (repeat until Word finds 0)

    4. Optionally, delete: “(laughter)”, “mhm” etc., ONLY if they are a full turn. Then:

                  Replace ^p.^p with ^p

                  Replace - SPACE with SPACE

                  Replace ^p SPACE ^p with ^p

                  Replace [^#] SPACE ^p with nothing

                  Replace [^#] SPACE.^p with nothing

                  Replace [^#] SPACE [ with [

                  Replace [^#^#^#] SPACE ^p with nothing

                  Replace [^#^#^#] SPACE.^p with nothing

                  Replace [^#^#^#] SPACE [ with [

    5. Make sure any other comments are enclosed in double parentheses, if you will be using FAVE to force align your files: (( ))

  2. Save with one clause/phrase/intonation unit per line.

  3. Divide up any super-long clauses on to separate lines (otherwise they will be hard to transcribe and analyze). After commas that indicate a division between clauses or breath groups (but not other commas), you may wish to add a Return (^p).

  4. List the speakers on the first line of the file. Use this format and order:

    XXX Main-participant-name, 1 Interviewer-name, 2 Second-interviewer-name, 3 Any-other-participant-name, 4 etc. (Italicized information is optional.)

    Note the lack of square brackets on this line and that speaker codes are separated by commas.

  5. Save the .doc transcripts as SpeakerCode_cleaned.doc.
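The Find/Replace sequence above can also be sketched in code for transcripts that have already been saved as plain text. This is a minimal illustration, not part of the original protocol: it assumes Word's ^t corresponds to a tab character and ^p to a line break, and the function name `clean_transcript` is hypothetical. It treats a run of punctuation such as "..." as a single sentence end, which is a simplification of the separate ellipsis rules above, and it does not implement the optional deletions in step 4.

```python
import re


def clean_transcript(text: str) -> str:
    """Approximate the Word Find/Replace clean-up on plain text.

    Assumption: "\t" stands in for Word's ^t and "\n" for Word's ^p.
    """
    text = text.replace("\r\n", "\n")
    # Remove all tabs.
    text = text.replace("\t", " ")
    # Collapse repeated spaces (the "repeat as necessary" step).
    text = re.sub(r" {2,}", " ", text)
    # Start a new line before each bracketed speaker code.
    text = text.replace("[", "\n[")
    # Start a new line after sentence-final punctuation (runs count once).
    text = re.sub(r"([.?!]+)", r"\1\n", text)
    # Move a closing quotation mark back onto the line it closes.
    text = re.sub(r'([.?!]+) ?\n(["”])', r"\1\2\n", text)
    # Strip spaces next to line breaks, then drop empty lines.
    text = re.sub(r" *\n *", "\n", text)
    text = re.sub(r"\n{2,}", "\n", text)
    return text.strip()


if __name__ == "__main__":
    sample = "[1]\tSo  what happened? [XXX] I was walking. He said “stop.” Then I left!"
    print(clean_transcript(sample))
```

As with the Word procedure, the result should be checked by eye: abbreviations ending in a period, for example, will also trigger a line break.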

Segmenting audio in ELAN and editing individual .doc transcript files

  1. In ELAN, create a new file and associate the appropriate audio or video file (e.g., .wav).

  2. Switch to Options: Segmentation Mode. Segment recording on the default tier – one segment for each line (paragraph mark) in SpeakerCode_cleaned.doc. (You’ll want that transcript.doc open next to ELAN on your screen.) Save the ELAN file as SpeakerCode_segmented.eaf.

  3. Deal with overlapping speech if this is relevant to your research. To represent a line with overlapping speech, it will need three segments, all on one tier, as shown in Figure 2:

    1. Segment 1 covers the time that the first speaker talks before the interruption/overlap.

    2. Segment 2 covers the time when both are talking.

    3. Segment 3 is for the time after the interruption/overlap, when one person continues to talk.

Note: On the line in the text file with both speakers, the speakers must appear in the order listed at the top of the file.

  4. If you make changes in Word as you are aligning, save the .doc transcripts as SpeakerCode_edited#.txt (Tab-delimited text file format; select Other encoding: Unicode UTF-8, NOT Insert line breaks). Use a new # for each version so that you can back-track if necessary.

  5. Save files as SpeakerCode_cleaned2.doc.

  6. Save the .doc transcripts as SpeakerCode_cleaned.txt (Tab-delimited text file format; when prompted, select Other encoding: Unicode UTF-8, NOT Insert line breaks). [See 2. Celeste_cleaned_AppC_Step11.txt.]

Figure 2: Aligning overlapping speech.
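The three-segment treatment of overlapping speech described above can be illustrated with a short sketch. It is not ELAN functionality: segmentation is done by hand in ELAN, and the function name `split_overlap`, the millisecond units, and the segment labels are assumptions made for the example.

```python
from typing import List, Tuple

Segment = Tuple[int, int, str]  # (begin_ms, end_ms, label)


def split_overlap(first: Tuple[int, int], second: Tuple[int, int]) -> List[Segment]:
    """Split two overlapping turns into before, during and after the overlap."""
    # Order the turns by start time so that "a" is the earlier one.
    (a_start, a_end), (b_start, b_end) = sorted([first, second])
    if b_start >= a_end:
        raise ValueError("the two turns do not overlap")
    overlap_end = min(a_end, b_end)
    segments = [
        (a_start, b_start, "first speaker alone"),
        (b_start, overlap_end, "both speakers"),
        (overlap_end, max(a_end, b_end), "one speaker continues"),
    ]
    # Drop zero-length pieces, e.g. when both turns start at the same time.
    return [s for s in segments if s[1] > s[0]]


if __name__ == "__main__":
    for segment in split_overlap((1000, 3000), (2500, 4000)):
        print(segment)
```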

Merge the timestamps from ELAN with the transcript text

Figure 3: Settings for exporting tiers as tab-delimited text.

  1. Export the (timestamp) file from ELAN as a .txt file. Use these settings:

    1. Select the tier(s) that have segments marked on them in the top of the window and click OK (see Figure 3).

    2. When prompted for text encoding, select “Unicode UTF-8” encoding.

    3. Save as SpeakerCode_segmented.txt. This provides only the timestamps (and any notes in the Notes tier). [See Extending_ELAN_tutorial_files/3. Celeste_time_stamps_AppC_Step12.txt.]

    4. It may be helpful to sort the rows of SpeakerCode_segmented.txt in Excel, by duration, then delete the rows that are too short to possibly contain a word, e.g. <150 msec. Then re-sort by Begin Time. (This will get rid of some very short annotations created by very slight overlaps in annotations or accidental double-clicks in Segmenting mode.)

    5. In Excel, combine SpeakerCode_segmented.txt with SpeakerCode_cleaned.txt.

  2. If you have more than one speaker transcribed in one file, download the tabber script from http://www.nowme.ca/lib/hlvc/tabber.html (Mac, Unix) or pctabber from http://projects.chass.utoronto.ca/ngn/HLVC/tabber_for_legacy_files.zip (Windows). This inserts tabs that align each speaker’s turns at a different indent level. Additional instructions are available at the same site for the tabber and at http://projects.chass.utoronto.ca/ngn/HLVC/tabber_for_legacy_files/pctabber_instructions.pdf for pctabber.

On a Mac:

  1. Unzip and save tabber in folder with transcript files.

  2. Open Terminal.app (in Utilities)

  3. Go to the directory where the .txt files and script are (cd "Directory").

  4. Type ./tabber "FILENAME" or ./tabber "FOLDERNAME"

In Windows:

  1. Put the pctabber.exe file in the same folder as the .txt transcripts to be tabbed.

  2. Double click the pctabber.exe file and it will open the command line.

  3. Enter the file names of the .txt files one by one, hitting “Enter” after each file name.

This will create a tabbed_SpeakerCode.txt for each SpeakerCode_cleaned.txt in the folder.
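For readers who want to see what "aligning each speaker's turns at a different indent level" amounts to, here is a rough sketch. It is not the tabber or pctabber script, whose exact behaviour is not documented here. It assumes that the first line lists the speakers in the format given earlier, that a turn begins with a bracketed speaker code such as [1], and that a line without a code continues the previous speaker. The function names are hypothetical, and lines containing two overlapping speakers are not handled.

```python
import re
from typing import List


def parse_speaker_codes(header: str) -> List[str]:
    """Read the codes from a first line such as 'XXX Main-name, 1 Interviewer-name'."""
    return [entry.split()[0] for entry in header.split(",") if entry.strip()]


def tab_by_speaker(lines: List[str]) -> List[str]:
    """Indent each turn with one tab per position of its speaker in the header."""
    codes = parse_speaker_codes(lines[0])
    column = 0
    tabbed = []
    for line in lines[1:]:
        match = re.match(r"\[([^\]]+)\]\s*(.*)", line)
        if match and match.group(1) in codes:
            column = codes.index(match.group(1))
            text = match.group(2)
        else:
            # No recognised speaker code: keep the previous speaker's column.
            text = line
        tabbed.append("\t" * column + text)
    return tabbed


if __name__ == "__main__":
    transcript = [
        "XXX Celeste, 1 Interviewer",
        "[1] How are you?",
        "[XXX] Fine.",
        "And you?",
    ]
    print("\n".join(tab_by_speaker(transcript)))
```

Opened in a spreadsheet, each leading tab shifts the turn one column to the right, so every speaker ends up in a separate column.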

  3. In Excel, combine the two .txt files for one speaker (the time stamps and the transcripts) so that the Begin and End time stamps (from SpeakerCode_segmented.txt) are on the same row as the corresponding annotations (from tabbed_SpeakerCode.txt). [See Extending ELAN tutorial files/4. celeste_transcript_to_import_AppC_Step15.txt.]

  4. Label columns with SpeakerCodes. Save as SpeakerCode_Aligned.txt.

  5. Open ELAN. Import SpeakerCode_Aligned.txt and match up tiers and columns appropriately in the pop-up window. Include the msec Begin Time and msec End Time column. Include the Annotation column for each speaker. Unclick any other columns. Select the line on which the transcription begins in that window (Line 2, because the column labels are on Line 1). Save as SpeakerCode.eaf.

  6. After importing, you will need to relink the SpeakerCode.wav file into your new SpeakerCode.eaf file. In ELAN, Edit: Linked files.
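The Excel steps in this section (dropping segments shorter than 150 msec, re-sorting by Begin Time, and putting each time stamp on the same row as its transcript line) can be sketched as follows. This is an illustration for a single speaker under stated assumptions, not a replacement for the procedure above: the function names and the column labels in the header row are hypothetical, and time stamps are assumed to be integers in milliseconds.

```python
import csv
import io
from typing import List, Tuple

Span = Tuple[int, int]       # (begin_ms, end_ms)
Row = Tuple[int, int, str]   # (begin_ms, end_ms, annotation)


def drop_short(segments: List[Span], min_ms: int = 150) -> List[Span]:
    """Remove segments too short to contain a word, then sort by begin time."""
    return sorted((b, e) for b, e in segments if e - b >= min_ms)


def align(segments: List[Span], lines: List[str]) -> List[Row]:
    """Pair each segment with the transcript line in the same position."""
    if len(segments) != len(lines):
        raise ValueError(
            f"{len(segments)} segments but {len(lines)} transcript lines; "
            "re-check the segmentation before merging"
        )
    return [(b, e, text) for (b, e), text in zip(segments, lines)]


def to_tab_delimited(rows: List[Row]) -> str:
    """Write the rows as tab-delimited text with a header on line 1."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    # Hypothetical column labels; use whatever labels your import expects.
    writer.writerow(["Begin Time msec", "End Time msec", "Annotation"])
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    spans = drop_short([(2000, 3500), (0, 1200), (1200, 1260)])
    print(to_tab_delimited(align(spans, ["Hello.", "How are you?"])))
```

Raising an error when the counts differ is a deliberate choice: a mismatch usually means a line was segmented twice or skipped, which is easier to fix before import than to detect afterwards during quality control.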

Quality control

  1. Spot check throughout to make sure that the transcription matches the sound file. If it doesn’t, go back and re-segment to override the problem area, or fix the row alignment in SpeakerCode_Aligned.txt.

Received: 2015-6-12
Accepted: 2015-8-24
Published Online: 2015-10-22
Published in Print: 2015-12-1

©2015 by De Gruyter Mouton

Downloaded on 6.3.2026 from https://www.degruyterbrill.com/document/doi/10.1515/lingvan-2015-0012/html