Abstract
This article addresses the problem of automatic summarization of press articles in Polish. The main novelty of this research lies in the proposal of a three-step summarization algorithm which benefits from using coreference information.
In the related work section, all coreference-based approaches to summarization are presented. We then describe in detail all publicly available summarization tools developed for Polish. We state the problem of single-document press article summarization for Polish and describe the training and evaluation dataset: the POLISH SUMMARIES CORPUS.
Next, a new coreference-based extractive summarization system, NICOLAS, is introduced. Its algorithm uses advanced third-party preprocessing tools to extract coreference information from the text to be summarized. This information is transformed into a complex set of features related to coreference concepts (mentions and coreference clusters), which are used to train the summarization system on a manually prepared corpus of gold summaries.
The proposed solution is compared to the best publicly available summarization systems for Polish and to two state-of-the-art tools developed for English but adapted to Polish for this article. The NICOLAS summarization system obtains the best scores and, for selected metrics, outperforms the other systems in a statistically significant way. The evaluation also reports two upper bounds: human performance and a theoretical upper bound.
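As an illustration of how mentions and coreference clusters might be turned into sentence-level features, the following minimal Python sketch counts, for each sentence, its mentions, the distinct clusters they belong to, and the size of the largest such cluster. The `Mention` type, the function name and the three feature names are assumptions made for this example only; they are not the feature set actually used by NICOLAS, which is described in the body of the article.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """Hypothetical mention record; not the data model used by NICOLAS."""
    sentence_index: int  # index of the sentence that contains the mention
    cluster_id: int      # identifier of the coreference cluster it belongs to


def sentence_features(mentions, sentence_count):
    """Return one illustrative feature dict per sentence."""
    # Size of every coreference cluster in the whole document.
    cluster_sizes = Counter(m.cluster_id for m in mentions)
    features = []
    for i in range(sentence_count):
        in_sentence = [m for m in mentions if m.sentence_index == i]
        sizes = [cluster_sizes[m.cluster_id] for m in in_sentence]
        features.append({
            "mention_count": len(in_sentence),
            "cluster_count": len({m.cluster_id for m in in_sentence}),
            "max_cluster_size": max(sizes, default=0),
        })
    return features


# Toy document: three sentences, two clusters (ids 1 and 2).
example_mentions = [Mention(0, 1), Mention(0, 2), Mention(1, 1), Mention(2, 1)]
example_features = sentence_features(example_mentions, sentence_count=3)
```

Feature dictionaries of this kind could then be passed, together with labels derived from gold summaries, to any standard classifier or regressor that scores sentences for extraction.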
7 Acknowledgements
We would like to thank the anonymous reviewers for their insightful comments. The work reported here was co-funded by the European Union from financial resources of the European Social Fund, project PO KL Information technologies: Research and their interdisciplinary applications.
Appendix
Most frequent mentions in the training set
Below are 237 mentions, lemmatized token by token and lowercased, each of which appears in at least 50 texts of the training corpus. Mentions are sorted from the most frequent to the least frequent.
Mentions list: on, to, co, rok, być, wszystko, polska, człowiek, sobie, raz, my, mieć, czas, państwo, praca, osoba, sprawa, ja, kraj, pieniądz, nikt, kto, przykład, nic, koniec, rząd, prawo, życie, miejsce, móc, fot, problem, władza, miesiąc, rzecz, stan, świat, wszyscy, mówić, rozmowa, coś, sytuacja, powód, początek, wiedzieć, dzień, uwaga, strona, udział, in, musieć, polityk, ktoś, ogół, polityka, chcieć, walka, zmiana, decyzja, ciąg, m ., pan, szansa, polak, przypadek, większość, pytanie, wzgląd, warszawa, proca, pomoc, prezydent, społeczeństwo, wynik, dziecko, prawda, związek, gospodarka, część, wojna, tydzień, granica, głos, przyszłość, autor, wybory, rynek, cel, ustawa, uważać, ten rok, droga, dom, rys, myśleć, firma, zasada, fakt, kolej, nadzieja, dolar, wraz, miasto, rozwój, ten sposób, europa, temat, siła, rodzina, minister, historia, wpływ, współpraca, środek, informacja, procent, wniosek, unia europejski, niemcy, podstawa, reforma, partia, interes, ten sprawa, kandydat, sukces, sposób, wątpliwość, złoty, sld, pracownik, stanowisko, dyskusja, telewizja, pewność, odpowiedź, rzeczywistość, program, cena, działanie, system, unia, ręka, odpowiedzialność, środowisko, solidarność, demokracja, maić, ramy, badanie, media, wartość, wybór, głowa, zostać, usa, pracować, porozumienie, widzieć, zdanie, akcja, wolność, spotkanie, przeszłość, stosunek, okazja, prowadzić, zachód, kobieta, obywatel, sąd, ubiegły rok, dziennikarz, kultura, grupa, opinia publiczny, obrona, bezpieczeństwo, opinia, rzeczpospolita, dokument, racja, szkoła, góra, warunek, organizacja, oko, godzina, tysiąc, ten czas, możliwość, błąd, ziemia, parlament, ten pora, chwila, naród, konflikt, działalność, sejm, powrót, premier, działać, rada, zdrowie, wiek, dodatek, poziom, widzenie, żyć, powiedzieć, inwestycja, rosja, niemiec, samochód, skutek, punkt, rola, mieszkaniec, wyborca, koszt, budżet, szef, styczeń, instytucja, pełnia, ulica, aws, ochrona, dostęp, zagrożenie, zgoda, ue, " rzeczpospolita ", liczba, wieś, połowa.
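A list of this kind can be derived by counting, for every normalized mention, the number of texts in which it occurs and keeping those that reach the threshold. The sketch below is a minimal Python illustration under stated assumptions: lemmatization is assumed to have been done upstream, the function name and the toy data are hypothetical, and ordering by the number of texts (with alphabetical tie-breaking) is one possible reading of "most frequent" rather than a confirmed detail of the original procedure.

```python
from collections import Counter


def frequent_mentions(texts, min_texts=50):
    """Return mentions occurring in at least `min_texts` texts, most frequent first.

    `texts` is an iterable of texts, each given as an iterable of
    already lemmatized mention strings (lemmatization is assumed upstream).
    """
    doc_freq = Counter()
    for mentions in texts:
        # A set ensures each mention is counted at most once per text.
        doc_freq.update({m.lower() for m in mentions})
    frequent = [(m, n) for m, n in doc_freq.items() if n >= min_texts]
    # Sort by descending text count, then alphabetically for a stable order.
    frequent.sort(key=lambda item: (-item[1], item[0]))
    return [m for m, _ in frequent]


# Toy corpus of three texts with a lowered threshold for demonstration.
toy_texts = [["rok", "polska", "rok"], ["rok", "rząd"], ["Polska", "rok"]]
toy_result = frequent_mentions(toy_texts, min_texts=2)
```

With the toy corpus and `min_texts=2`, "rok" occurs in three texts and "polska" in two, so only these two mentions are kept, in that order.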
© 2019 Faculty of English, Adam Mickiewicz University, Poznań, Poland
Articles in the same Issue
- Frontmatter
- Foreword
- Automatic transcription of the Polish newsreel
- Part of speech tagging for Polish
- Named entity recognition for Polish
- Recognition and normalisation of temporal expressions using conditional random fields and cascade of partial rules
- Dependency parsing of Polish
- A Weakly supervised word sense disambiguation for Polish using rich lexical resources
- Nominal coreference resolution for Polish
- Three-step coreference-based summarizer for Polish news texts
- Sentiment analysis for Polish
- Semantic approach for building generated virtual-parallel corpora from monolingual texts
- Statistical versus neural machine translation – a case study for a medium size domain-specific bilingual corpus