A new method for detecting multiple text change points

Berhane Abebe

doi:10.1515/glot-2025-2003

Article


Published/Copyright: April 14, 2025


From the journal Glottotheory

Abstract

This research introduces a new algorithm designed to identify double text change points within a concatenated text composed of three distinct texts. It also investigates the application of text homogeneity and text change point detection techniques to low-resource languages, specifically Tigre and Tigrigna. Leveraging recently developed probability models for text homogeneity and change point detection, the study proposes a novel algorithm capable of accurately locating multiple (double) text change points in a sequence of three texts while evaluating the error rate in estimating the primary and secondary points of concatenation. Data samples were gathered from three different genres in each of the target languages. The results demonstrate a notable reduction in the error rate for detecting text change points as the heterogeneity of the concatenated text increases.

Keywords: text homogeneity; multiple text change point detection; Tigrigna and Tigre; quantitative language models; urn model

Corresponding author: Berhane Abebe, Department of Statistics, Mainefhi College of Science, Mainefhi, Eritrea; and Department of Probability Theory and Mathematical Statistics, Novosibirsk State University, Novosibirsk, Russian Federation, E-mail: b.andemikael@g.nsu.ru

Acknowledgments

I would like to thank Rora Digital Library (Eritrean People’s Front for Democracy and Justice) and Mr. Adem Abaharish (Tigre program coordinator, Ministry of Information, Eritrea) for providing the scripts of the Tigrigna and Tigre texts, respectively. I also express my deep appreciation to Professor Artyom Kovalevskii for the time and effort he devoted to providing constructive feedback on my work. Finally, I am sincerely grateful to the editor and the reviewer, whose insightful feedback and constructive suggestions have significantly enhanced the quality of this article.

Conflict of interest: The author reports that there are no competing interests to declare.
Research funding: The author did not receive specific funding for this work.


Published Online: 2025-04-14


https://doi.org/10.1515/glot-2025-2003
