Mining historical texts for diachronic spelling variants

Filip Graliński; Krzysztof Jassem

doi:10.1515/psicl-2020-0021


Published/Copyright: March 1, 2021


From the journal Poznan Studies in Contemporary Linguistics, Volume 56, Issue 4

Abstract

The paper describes a method for finding diachronic spelling variants in a corpus of historical and modern Polish texts. The procedure uses the Levenshtein distance together with a similarity measure derived from a Word2vec model, and it was applied to both words and sub-word units. A sample of the resulting spelling variants was manually evaluated and compared against an existing morphological analyser for Polish historical texts. The lists of spelling variants and spelling modernisation rules were then used in a text modernisation tool, and their contribution was evaluated.

The paper also presents an analogous method for finding spelling variants that result from OCR errors. The resulting lists of OCR variants and rules can be used to correct OCR output.
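The idea of pairing an edit-distance filter with an embedding-similarity filter can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the two measures are combined, so the thresholds `max_edits` and `min_sim`, the function names, and the toy vectors below are all assumptions made for illustration. In a real setting the vectors would come from a Word2vec model trained on the corpus rather than being written by hand.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cosine(u, v) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def variant_candidates(vectors, max_edits=2, min_sim=0.8):
    """Return word pairs that are close in spelling AND in embedding space.

    max_edits and min_sim are hypothetical thresholds, not values from the paper.
    """
    words = sorted(vectors)
    pairs = []
    for i, w1 in enumerate(words):
        for w2 in words[i + 1:]:
            dist = levenshtein(w1, w2)
            if 0 < dist <= max_edits:
                sim = cosine(vectors[w1], vectors[w2])
                if sim >= min_sim:
                    pairs.append((w1, w2, dist, sim))
    return pairs


# Hand-made toy vectors (NOT real Word2vec output), chosen so that the
# older spelling "kollega" and modern "kolega" point in nearly the same
# direction, while "kolej" is similar in spelling but not in usage.
TOY_VECTORS = {
    "kollega": [0.90, 0.10, 0.00],
    "kolega": [0.88, 0.12, 0.02],
    "kolej": [0.00, 0.20, 0.95],
}

if __name__ == "__main__":
    for w1, w2, dist, sim in variant_candidates(TOY_VECTORS):
        print(f"{w1} ~ {w2}: edits={dist}, cosine={sim:.3f}")
```

With these toy inputs, "kolega" and "kolej" pass the edit-distance filter but are rejected by the similarity filter. This is the motivation for using both measures: spelling proximity alone would over-generate candidates. The pairwise loop is quadratic in vocabulary size, so a realistic vocabulary would need a nearest-neighbour lookup or a similar shortcut.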

Keywords: Spelling variants; OCR; word embeddings


Published Online: 2021-03-01

Published in Print: 2020-12-16


https://doi.org/10.1515/psicl-2020-0021
