Abstract
The paper describes a method for finding diachronic spelling variants in a corpus of historical and modern Polish texts. The procedure uses both the Levenshtein distance and a similarity measure derived from a Word2vec model. The method was applied to both words and sub-word units. A sample of the spelling variants was manually evaluated and compared against an existing morphological analyser for historical Polish texts. The resulting lists of spelling variants and spelling modernisation rules were used in a text modernisation tool, and their contribution was evaluated.
The paper also presents an analogous method for finding spelling variants caused by OCR errors. The resulting lists of OCR variants and rules may be used to correct OCR output.
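The idea of pairing a small Levenshtein distance with a high embedding similarity can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the thresholds `max_distance=1` and `min_similarity=0.8`, and the three-dimensional `TOY_VECTORS` are all assumptions made for the example. In practice the vectors would come from a Word2vec model trained on the corpus.

```python
from itertools import combinations
from math import sqrt


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def cosine(u, v) -> float:
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def variant_candidates(vectors, max_distance=1, min_similarity=0.8):
    """Return word pairs that are close in spelling AND in embedding space.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    pairs = []
    for w1, w2 in combinations(sorted(vectors), 2):
        if levenshtein(w1, w2) > max_distance:
            continue
        if cosine(vectors[w1], vectors[w2]) >= min_similarity:
            pairs.append((w1, w2))
    return pairs


# Hypothetical toy embeddings standing in for Word2vec vectors.
TOY_VECTORS = {
    "kollega": [0.90, 0.10, 0.20],  # older spelling
    "kolega": [0.85, 0.15, 0.20],   # modern spelling, similar contexts
    "kosa": [0.10, 0.90, 0.10],     # "scythe"
    "koza": [0.20, 0.10, 0.90],     # "goat": one edit from "kosa", unrelated
}

if __name__ == "__main__":
    print(variant_candidates(TOY_VECTORS))
```

The pair "kosa"/"koza" is one edit apart but is rejected because its toy vectors point in different directions, which is the reason for adding a distributional check on top of the edit distance. The all-pairs loop is quadratic in vocabulary size, so a real corpus would need some form of candidate pruning.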
© 2020 Faculty of English, Adam Mickiewicz University, Poznań, Poland
Articles in the same Issue
- Frontmatter
- Isoglosses and language change: Evidence of the rise and loss of isoglosses from a comparison of early Greek and early English
- The interaction of L2 and L3 levels of proficiency in third language acquisition
- Self-reported communicative distance between Polish and English in formal and informal situational contexts
- Mining historical texts for diachronic spelling variants
- Nigerian newscasters’ English as a model of standard Nigerian English?
- A cognitive semantic exploration of English plant phrasal verbs with the particle out and their Serbian counterparts
- Wordform-specific frequency effects cause acoustic variation in zero-inflected homophones
- Erratum
- Erratum