Mining historical texts for diachronic spelling variants

Filip Graliński; Krzysztof Jassem

doi:10.1515/psicl-2020-0021


Published/Copyright: March 1, 2021


From the journal Poznan Studies in Contemporary Linguistics Volume 56 Issue 4

Abstract

The paper describes a method for finding diachronic spelling variants in a corpus that consists of historical and modern Polish texts. The procedure applies the Levenshtein distance and a similarity measure determined with a Word2vec model. The method was applied to both words and sub-word units. A sample of spelling variants was manually evaluated and compared against an existing morphological analyser for Polish historical texts. The resulting lists of spelling variants and spelling modernisation rules were used in a text modernisation tool, and their contribution was evaluated.
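To illustrate the general idea of combining the two signals named above, the following minimal sketch keeps a word pair as a candidate spelling variant only when the two words are close in edit distance and close in embedding space. It is not the authors' implementation: the function names, the thresholds `max_distance` and `min_similarity`, and the three-dimensional toy vectors (standing in for Word2vec embeddings) are all assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cosine(u, v) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


# Toy vectors invented for this example; a real setup would load
# embeddings trained on the corpus instead.
TOY_VECTORS = {
    "kuracya": [0.90, 0.10, 0.20],
    "kuracja": [0.88, 0.12, 0.18],
    "kolacja": [0.10, 0.90, 0.30],
    "dom":     [0.20, 0.20, 0.90],
}


def candidate_pairs(vectors, max_distance=2, min_similarity=0.9):
    """Return word pairs that pass both the edit-distance and similarity filters."""
    pairs = []
    for w1, w2 in combinations(sorted(vectors), 2):
        if levenshtein(w1, w2) > max_distance:
            continue
        if cosine(vectors[w1], vectors[w2]) >= min_similarity:
            pairs.append((w1, w2))
    return pairs


print(candidate_pairs(TOY_VECTORS))
```

With these toy values, "kuracja" and "kolacja" are within two edits of each other but are rejected by the similarity filter, which shows why the edit distance alone would not be enough. The sketch does not decide which member of a pair is the older form; that would need dating information from the corpus.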

The paper also presents an analogous method for finding spelling variants that result from erroneous OCR. The resulting lists of OCR variants and rules can be used to correct OCR output.
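One way to picture how character-level rules could be derived from a list of variant pairs, whether diachronic or OCR-induced, is to align each pair and count the differing spans. The sketch below does this with Python's standard `difflib.SequenceMatcher`; the abstract does not say how the authors derive their rules, so this alignment technique, the function names, and the example pairs are assumptions chosen only for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def rewrite_rules(old: str, new: str):
    """Return (old_span, new_span) for every stretch where the two strings differ."""
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    rules = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            rules.append((old[i1:i2], new[j1:j2]))
    return rules


def count_rules(pairs):
    """Count how often each rewrite rule occurs across all variant pairs."""
    counts = Counter()
    for old, new in pairs:
        counts.update(rewrite_rules(old, new))
    return counts


# Illustrative pairs, not taken from the paper: three older Polish
# spellings with their modern forms, and one made-up OCR confusion.
PAIRS = [
    ("kuracya", "kuracja"),
    ("stacya", "stacja"),
    ("jenerał", "generał"),
    ("rnatka", "matka"),  # hypothetical OCR error: "rn" read for "m"
]

for rule, freq in count_rules(PAIRS).most_common():
    print(rule, freq)
```

Rules that recur across many pairs are more plausible as general modernisation or correction rules than rules seen only once, so a frequency cut-off would be a natural next filter in a setup like this.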

Keywords: Spelling variants; OCR; word embeddings


Published Online: 2021-03-01

Published in Print: 2020-12-16


https://doi.org/10.1515/psicl-2020-0021
