Three-step coreference-based summarizer for Polish news texts

Mateusz Kopeć

doi:10.1515/psicl-2019-0015

Published/Copyright: August 17, 2019

From the journal Poznan Studies in Contemporary Linguistics Volume 55 Issue 2

Abstract

This article addresses the problem of automatic summarization of press articles in Polish. The main novelty of this research lies in the proposal of a three-step summarization algorithm which benefits from using coreference information.

In the related work section, all coreference-based approaches to summarization are presented. We then describe in detail all publicly available summarization tools developed for Polish. We state the problem of single-document press article summarization for Polish and describe the training and evaluation dataset: the POLISH SUMMARIES CORPUS.

Next, a new coreference-based extractive summarization system, NICOLAS, is introduced. Its algorithm uses advanced third-party preprocessing tools to extract coreference information from the text to be summarized. This information is transformed into a complex set of features related to coreference concepts (mentions and coreference clusters), which are used to train the summarization system on a manually prepared corpus of gold summaries.
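To make the idea of coreference-derived features more concrete, the following is a minimal sketch and not the NICOLAS feature set. It assumes that coreference clusters are already available as lists of (sentence index, mention text) pairs; the names `Mention`, `Cluster`, `sentence_features`, and the two features themselves are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical representation (an assumption, not the system's data model):
# a mention is (sentence_index, surface_text); a coreference cluster is a
# list of mentions that refer to the same entity.
Mention = Tuple[int, str]
Cluster = List[Mention]


def sentence_features(clusters: List[Cluster], n_sentences: int) -> List[Dict[str, int]]:
    """Compute two illustrative coreference features for each sentence."""
    mention_count = defaultdict(int)    # number of mentions in the sentence
    largest_cluster = defaultdict(int)  # size of the biggest cluster touching it
    for cluster in clusters:
        for sent_idx, _text in cluster:
            mention_count[sent_idx] += 1
            largest_cluster[sent_idx] = max(largest_cluster[sent_idx], len(cluster))
    return [
        {"mention_count": mention_count[i], "largest_cluster": largest_cluster[i]}
        for i in range(n_sentences)
    ]


# Toy input: one entity mentioned in three sentences, plus a singleton cluster.
clusters = [
    [(0, "rząd"), (1, "on"), (2, "rząd")],
    [(1, "ustawa")],
]
features = sentence_features(clusters, n_sentences=4)
```

Feature vectors of this kind could then be paired with labels derived from gold summaries to train a classifier or regressor; which features and which learner the actual system uses is not stated in this abstract.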

The proposed solution is compared to the best publicly available summarization systems for Polish and to two state-of-the-art tools developed for English but adapted to Polish for this article. The NICOLAS summarization system obtains the best scores, outperforming the other systems on selected metrics in a statistically significant way. The evaluation also includes the calculation of two upper bounds: human performance and a theoretical upper bound.

Keywords: Summarization; coreference; Polish

7 Acknowledgements

We would like to thank the anonymous reviewers for their insightful comments. The work reported here was co-funded by the European Union from financial resources of the European Social Fund, project PO KL Information technologies: Research and their interdisciplinary applications.

Appendix

Most frequent mentions in the training set

Below are 237 mentions, lemmatized token by token and lowercased, each appearing in at least 50 texts from the training corpus. Mentions are sorted from most to least frequent.

Mentions list: on, to, co, rok, być, wszystko, polska, człowiek, sobie, raz, my, mieć, czas, państwo, praca, osoba, sprawa, ja, kraj, pieniądz, nikt, kto, przykład, nic, koniec, rząd, prawo, życie, miejsce, móc, fot, problem, władza, miesiąc, rzecz, stan, świat, wszyscy, mówić, rozmowa, coś, sytuacja, powód, początek, wiedzieć, dzień, uwaga, strona, udział, in, musieć, polityk, ktoś, ogół, polityka, chcieć, walka, zmiana, decyzja, ciąg, m ., pan, szansa, polak, przypadek, większość, pytanie, wzgląd, warszawa, proca, pomoc, prezydent, społeczeństwo, wynik, dziecko, prawda, związek, gospodarka, część, wojna, tydzień, granica, głos, przyszłość, autor, wybory, rynek, cel, ustawa, uważać, ten rok, droga, dom, rys, myśleć, firma, zasada, fakt, kolej, nadzieja, dolar, wraz, miasto, rozwój, ten sposób, europa, temat, siła, rodzina, minister, historia, wpływ, współpraca, środek, informacja, procent, wniosek, unia europejski, niemcy, podstawa, reforma, partia, interes, ten sprawa, kandydat, sukces, sposób, wątpliwość, złoty, sld, pracownik, stanowisko, dyskusja, telewizja, pewność, odpowiedź, rzeczywistość, program, cena, działanie, system, unia, ręka, odpowiedzialność, środowisko, solidarność, demokracja, maić, ramy, badanie, media, wartość, wybór, głowa, zostać, usa, pracować, porozumienie, widzieć, zdanie, akcja, wolność, spotkanie, przeszłość, stosunek, okazja, prowadzić, zachód, kobieta, obywatel, sąd, ubiegły rok, dziennikarz, kultura, grupa, opinia publiczny, obrona, bezpieczeństwo, opinia, rzeczpospolita, dokument, racja, szkoła, góra, warunek, organizacja, oko, godzina, tysiąc, ten czas, możliwość, błąd, ziemia, parlament, ten pora, chwila, naród, konflikt, działalność, sejm, powrót, premier, działać, rada, zdrowie, wiek, dodatek, poziom, widzenie, żyć, powiedzieć, inwestycja, rosja, niemiec, samochód, skutek, punkt, rola, mieszkaniec, wyborca, koszt, budżet, szef, styczeń, instytucja, pełnia, ulica, aws, ochrona, dostęp, zagrożenie, zgoda, ue, " rzeczpospolita ", liczba, wieś, połowa.
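The selection criterion described above (mentions appearing in at least 50 texts, sorted by frequency) can be sketched as a document-frequency filter. This is a minimal illustration under stated assumptions: mentions are assumed to be lemmatized upstream, "most frequent" is assumed to mean the number of texts containing the mention, and the function name `frequent_mentions` and the toy corpus are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set


def frequent_mentions(texts: Iterable[Iterable[str]], min_texts: int = 50) -> List[str]:
    """Return mentions found in at least `min_texts` texts, most frequent first."""
    doc_freq: Counter = Counter()
    for mentions in texts:
        # Lowercase and deduplicate so each text counts a mention at most once.
        unique: Set[str] = {m.lower() for m in mentions}
        doc_freq.update(unique)
    # Sort by descending text count; ties are broken alphabetically (an assumption).
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [mention for mention, count in ranked if count >= min_texts]


# Toy corpus of three "texts", each given as a list of lemmatized mentions.
toy_corpus = [["Rok", "on", "rok"], ["on", "praca"], ["on", "rok"]]
result = frequent_mentions(toy_corpus, min_texts=2)
```

With the toy threshold of 2, only mentions present in at least two of the three texts survive; applying the same function with `min_texts=50` to a real training corpus would mirror the threshold stated in this appendix.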

Published Online: 2019-08-17

Published in Print: 2019-06-26
