Three-step coreference-based summarizer for Polish news texts

Mateusz Kopeć

doi:10.1515/psicl-2019-0015


Published/Copyright: August 17, 2019


From the journal Poznan Studies in Contemporary Linguistics, Volume 55, Issue 2

Abstract

This article addresses the problem of automatic summarization of press articles in Polish. The main novelty of this research lies in the proposal of a three-step summarization algorithm that benefits from coreference information.

In the related work section, all coreference-based approaches to summarization are presented. We then describe in detail all publicly available summarization tools developed for the Polish language. We state the problem of single-document press article summarization for Polish and describe the training and evaluation dataset: the POLISH SUMMARIES CORPUS.

Next, a new coreference-based extractive summarization system NICOLAS is introduced. Its algorithm utilises advanced third-party preprocessing tools to extract the coreference information from the text to be summarized. This information is transformed into a complex set of features related to coreference concepts (mentions and coreference clusters) that are used for training the summarization system (on the basis of a manually prepared gold summaries corpus).
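To make the idea of features built from mentions and coreference clusters concrete, here is a minimal sketch of how per-sentence features might be derived from coreference output. The abstract does not list the actual NICOLAS features, so the function name `sentence_coref_features`, the input format, and the three features below are hypothetical illustrations, not the system's feature set.

```python
from collections import defaultdict


def sentence_coref_features(mentions, num_sentences):
    """Build illustrative per-sentence features from coreference output.

    mentions: list of (sentence_index, cluster_id) pairs, one per mention.
    This input format is an assumption made for the sketch.
    """
    # Size of each coreference cluster = number of mentions it contains.
    cluster_sizes = defaultdict(int)
    for _, cluster_id in mentions:
        cluster_sizes[cluster_id] += 1

    # Group the cluster ids of the mentions found in each sentence.
    per_sentence = defaultdict(list)
    for sentence_index, cluster_id in mentions:
        per_sentence[sentence_index].append(cluster_id)

    features = []
    for s in range(num_sentences):
        cluster_ids = per_sentence.get(s, [])
        features.append({
            "mention_count": len(cluster_ids),
            "cluster_count": len(set(cluster_ids)),
            # Largest cluster touched by this sentence (0 if no mentions).
            "max_cluster_size": max(
                (cluster_sizes[c] for c in cluster_ids), default=0
            ),
        })
    return features
```

Feature dictionaries of this kind could then be paired with labels derived from gold summaries to train a sentence-selection classifier; which learner and labels the system actually uses is not stated in the abstract.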

The proposed solution is compared to the best publicly available summarization systems for Polish and to two state-of-the-art tools that were developed for English and adapted to Polish for this article. The NICOLAS summarization system obtains the best scores, outperforming the other systems in a statistically significant way for selected metrics. The evaluation also includes the calculation of two upper bounds: human performance and a theoretical upper bound.

Keywords: Summarization; coreference; Polish

7 Acknowledgements

We would like to thank the anonymous reviewers for their insightful comments. The work reported here was co-funded by the European Union from financial resources of the European Social Fund, project PO KL Information technologies: Research and their interdisciplinary applications. ^[13]

Appendix

Most frequent mentions in the training set

Below are the 237 mentions, lemmatized token by token and lowercased, that appear in at least 50 texts from the training corpus. Mentions are sorted from the most frequent.

Mentions list: on, to, co, rok, być, wszystko, polska, człowiek, sobie, raz, my, mieć, czas, państwo, praca, osoba, sprawa, ja, kraj, pieniądz, nikt, kto, przykład, nic, koniec, rząd, prawo, życie, miejsce, móc, fot, problem, władza, miesiąc, rzecz, stan, świat, wszyscy, mówić, rozmowa, coś, sytuacja, powód, początek, wiedzieć, dzień, uwaga, strona, udział, in, musieć, polityk, ktoś, ogół, polityka, chcieć, walka, zmiana, decyzja, ciąg, m ., pan, szansa, polak, przypadek, większość, pytanie, wzgląd, warszawa, proca, pomoc, prezydent, społeczeństwo, wynik, dziecko, prawda, związek, gospodarka, część, wojna, tydzień, granica, głos, przyszłość, autor, wybory, rynek, cel, ustawa, uważać, ten rok, droga, dom, rys, myśleć, firma, zasada, fakt, kolej, nadzieja, dolar, wraz, miasto, rozwój, ten sposób, europa, temat, siła, rodzina, minister, historia, wpływ, współpraca, środek, informacja, procent, wniosek, unia europejski, niemcy, podstawa, reforma, partia, interes, ten sprawa, kandydat, sukces, sposób, wątpliwość, złoty, sld, pracownik, stanowisko, dyskusja, telewizja, pewność, odpowiedź, rzeczywistość, program, cena, działanie, system, unia, ręka, odpowiedzialność, środowisko, solidarność, demokracja, maić, ramy, badanie, media, wartość, wybór, głowa, zostać, usa, pracować, porozumienie, widzieć, zdanie, akcja, wolność, spotkanie, przeszłość, stosunek, okazja, prowadzić, zachód, kobieta, obywatel, sąd, ubiegły rok, dziennikarz, kultura, grupa, opinia publiczny, obrona, bezpieczeństwo, opinia, rzeczpospolita, dokument, racja, szkoła, góra, warunek, organizacja, oko, godzina, tysiąc, ten czas, możliwość, błąd, ziemia, parlament, ten pora, chwila, naród, konflikt, działalność, sejm, powrót, premier, działać, rada, zdrowie, wiek, dodatek, poziom, widzenie, żyć, powiedzieć, inwestycja, rosja, niemiec, samochód, skutek, punkt, rola, mieszkaniec, wyborca, koszt, budżet, szef, styczeń, instytucja, pełnia, ulica, aws, ochrona, dostęp, zagrożenie, zgoda, ue, " rzeczpospolita ", 
liczba, wieś, połowa.
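The selection described above (keep lowercased, lemmatized mentions that occur in at least 50 texts, then sort by frequency) can be sketched as follows. This is an illustration only: the function name `frequent_mentions` is hypothetical, the input is assumed to be already lemmatized, and sorting by the number of texts (rather than by total occurrences) is an assumption, since the appendix does not say which frequency is used.

```python
from collections import Counter


def frequent_mentions(texts, min_texts=50):
    """Return mentions that occur in at least `min_texts` texts.

    texts: iterable of texts, each an iterable of lemmatized mention strings.
    """
    doc_freq = Counter()
    for mentions in texts:
        # A set ensures each mention is counted once per text.
        doc_freq.update({m.lower() for m in mentions})

    kept = [(m, n) for m, n in doc_freq.items() if n >= min_texts]
    # Most frequent first; ties broken alphabetically for a stable order.
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return [m for m, _ in kept]
```

For example, with `min_texts=2` and four toy texts `[["Rok", "on"], ["rok", "rok"], ["on", "rok"], ["polska"]]`, the function keeps `rok` (three texts) and `on` (two texts) and drops `polska`.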


Published Online: 2019-08-17

Published in Print: 2019-06-26
