Bibliographic bias and information-density sampling

  • Maja Robbers and Harald Hammarström
Published/Copyright: September 11, 2024

Abstract

In the present paper, we discuss the bibliographical limits for commonplace typological studies and address how to estimate the resources available for an in-depth study using a full-text corpus of grammatical descriptions, considering different metalanguages, temporal stages of description, theoretical perspectives, and quality of grammatical descriptions. In a case study on motion, we illustrate the above perspectives and show how computer-assisted sampling using large-scale keyword searches for information-dense descriptions is a time-saving resource for the linguistic researcher to create genealogically independent samples. The measures discussed in this study allow for a better appraisal of the state of existing information for typological studies, but the problem of wider access to rare publications remains a significant challenge.


Corresponding author: Maja Robbers, University of Alberta, Linguistics Department, 4-32 Assiniboia Hall, Edmonton, Alberta, T6G 2E7, Canada

Award Identifier / Grant number: 2017.0105

Acknowledgments

This research was made possible thanks to the financial support of the “From dust to dawn: Multilingual grammar extraction from grammars” project funded by Stiftelsen Marcus och Amalia Wallenbergs Minnesfond 2017.0105 awarded to Harald Hammarström (Uppsala University).

  1. Author contributions: H. H. did the bibliographic analysis. M. R. conducted the case study. M. R. and H. H. wrote the text.

  2. Data availability statement: The sampling results are provided in Appendix II. The bibliographic data underlying the analysis is freely available via https://glottolog.org. The open access section of the DReaM corpus is freely available via https://spraakbanken.gu.se/korp/?mode=dream. The full DReaM corpus is available for research via the authors but cannot be placed in an open archive due to copyright. An open search interface is in the process of being put online via the Centre for Digital Humanities at Uppsala University (CDHU).

Supplementary Material

This article contains supplementary material (https://doi.org/10.1515/lingvan-2023-0102).


Received: 2023-07-16
Accepted: 2024-07-05
Published Online: 2024-09-11

© 2024 Walter de Gruyter GmbH, Berlin/Boston
