Semantic approach for building generated virtual-parallel corpora from monolingual texts

Krzysztof Wołk; Agnieszka Wołk; Krzysztof Marasek

doi:10.1515/psicl-2019-0017


Published/Copyright: August 17, 2019


From the journal Poznan Studies in Contemporary Linguistics, Volume 55, Issue 2

Abstract

Several natural languages have been processed extensively, but the problem of limited textual linguistic resources remains. Manual creation of parallel corpora is expensive and time-consuming, and the language data required for statistical machine translation (SMT) do not exist in quantities adequate for their statistical information to be used to start the research process. Applying known approaches to build parallel resources from multiple sources, such as comparable or quasi-comparable corpora, is complicated and yields noisy output that must be processed further and adapted to the target domain. To optimize the performance of comparable-corpora mining algorithms, a high-quality parallel corpus is essential for training a good data classifier. In this research, we developed a methodology for generating an accurate parallel corpus (Czech-English, Polish-English) from monolingual resources by calculating the compatibility between the results of three machine translation systems. We translated large monolingual resources with multiple translation systems and strictly measured translation compatibility using rules based on the Levenshtein distance. The results produced by this approach were very favorable. The generated corpora improved the quality of SMT systems and appear useful for many other natural language processing tasks.
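The compatibility check described in the abstract can be illustrated with a minimal sketch. The abstract does not state the actual rules, so everything specific here is an assumption made for illustration: the word-level comparison, the normalization by the longer sentence length, the requirement that every pair of the three outputs agrees, the threshold of 0.8, and the function names `levenshtein`, `similarity`, and `translations_agree`.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance between two sequences (strings or token lists)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter sequence in the inner loop
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Map edit distance to [0, 1]; 1.0 means identical sequences.

    Dividing by the longer length is an assumed normalization,
    not necessarily the one used by the authors.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def translations_agree(candidates, threshold=0.8):
    """Accept a sentence only if every pair of MT outputs is similar enough.

    The threshold and the all-pairs rule are hypothetical choices.
    """
    tokenized = [c.lower().split() for c in candidates]
    return all(
        similarity(x, y) >= threshold
        for x, y in combinations(tokenized, 2)
    )
```

Under these assumptions, a source sentence would be paired with its translation and added to the generated corpus only when `translations_agree` returns `True` for the outputs of the three systems; a stricter threshold trades corpus size for accuracy.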

Keywords: Data filtration; corpora building; machine learning; data mining; parallel corpora; machine translation

Krzysztof Wołk, Polish-Japanese Academy of Information Technology, Koszykowa 86, 02-008 Warszawa, Poland

5 Acknowledgements

This work was supported by the Polish Ministry of Science and Higher Education’s investment in the CLARIN-PL research infrastructure and was backed by the PJATK legal resources.


Published Online: 2019-08-17

Published in Print: 2019-06-26


https://doi.org/10.1515/psicl-2019-0017
