Abstract
Many natural languages have been processed extensively, but the problem of limited textual linguistic resources remains. Manually creating parallel corpora is expensive and time-consuming, while the language data required for statistical machine translation (SMT) do not exist in quantities adequate to supply the statistical information needed to begin research. Known approaches to building parallel resources from multiple sources, such as comparable or quasi-comparable corpora, are complicated and produce rather noisy output that must be processed further and adapted to the target domain. To optimize the performance of comparable-corpora mining algorithms, it is essential to train a good data classifier on a high-quality parallel corpus. In this research, we developed a methodology for generating an accurate parallel corpus (Czech-English, Polish-English) from monolingual resources by calculating the compatibility between the results of three machine translation systems. We translated large monolingual resources with multiple translation systems and strictly measured translation compatibility using rules based on the Levenshtein distance. The results produced by this approach were very favorable. The generated corpora improved the quality of SMT systems and appear to be useful for many other natural language processing tasks.
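The compatibility check described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the actual rules, so the function names, the length-based normalisation, the pairwise agreement rule, and the 0.8 threshold are all assumptions chosen for illustration.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1], normalised by the longer string (an assumption)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def translations_agree(candidates, threshold=0.8):
    """Hypothetical rule: every pair of candidate translations must be similar enough."""
    return all(similarity(x, y) >= threshold
               for x, y in combinations(candidates, 2))
```

Under this sketch, a source sentence would be kept for the generated corpus only when the outputs of all three translation systems pass `translations_agree`; a stricter threshold trades corpus size for accuracy.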
5 Acknowledgements
This work was supported by the Polish Ministry of Science and Higher Education’s investment in the CLARIN-PL research infrastructure and was backed by the PJATK legal resources.
© 2019 Faculty of English, Adam Mickiewicz University, Poznań, Poland