Abstract
This study introduces an algorithm for locating the two change points in a text formed by concatenating three distinct texts. It also investigates how text homogeneity and text change point detection techniques apply to two low-resource languages, Tigre and Tigrigna. Building on recently developed probability models for text homogeneity and change point detection, the proposed algorithm estimates both points of concatenation in a sequence of three texts, and the study evaluates the error rate in estimating the primary and secondary points. Data samples were gathered from three different genres in each of the target languages. The results show that the error rate for detecting text change points falls notably as the heterogeneity of the concatenated text increases.
Acknowledgments
The author would like to thank Rora Digital Library (Eritrean People’s Front for Democracy and Justice) and Mr. Adem Abaharish (Tigre program coordinator, Ministry of Information, Eritrea) for providing the scripts of the Tigrigna and Tigre texts, respectively. The author also expresses deep appreciation to Professor Artyom Kovalevskii for the time and effort he devoted to providing constructive feedback on this work, and sincere gratitude to the editor and the reviewer, whose insightful feedback and constructive suggestions significantly improved the quality of the article.
Conflict of interest: The author reports that there are no competing interests to declare.
Research funding: The author did not receive specific funding for this work.
© 2025 Walter de Gruyter GmbH, Berlin/Boston