A study on Chinese register characteristics based on regression analysis and text clustering

Renkui Hou; Chu-Ren Huang; Hongchao Liu

doi:10.1515/cllt-2016-0062


Published/Copyright: March 30, 2017


From the journal Corpus Linguistics and Linguistic Theory, Volume 15, Issue 1

Abstract

This paper reports an innovative Chinese register study based on regression analysis of sentence length distributions and on text clustering. Because the end of a sentence is not conventionally marked in Chinese, we resolve this issue by assuming that segments between periods, question marks, and exclamation marks are sentences, which can be further divided into simple sentences and compound sentences. We also assume that segments between punctuation marks that express pauses in utterances form sentences (i.e., clauses). Using regression analysis, we find that the frequency distribution of sentence and clause lengths in Chinese can be fitted by the formula F = a · L^b · c^L, where F is the frequency, L is the sentence or clause length, and a, b, and c are the fitted parameters. Texts from different registers give rise to different fitted parameter values, so these values can serve to differentiate the registers. Finally, we use these parameters to represent and cluster texts from different registers. The successful text clustering results further support the view that the fitted parameters are reliable linguistic characteristics of different registers. In terms of linguistic theory, our study shows that modeling sentence length in Chinese using sociological words (i.e., characters) is just as effective as using linguistic words.
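The last step described in the abstract, representing each text by its fitted parameters and clustering the resulting vectors, might be sketched as follows. This is only an illustration under stated assumptions: the abstract does not name the clustering algorithm, so average-linkage agglomerative clustering is swapped in here, and the six parameter triples are hypothetical values invented for the example, not results from the paper.

```python
import math

def standardize(vectors):
    """Scale every dimension to zero mean and unit variance."""
    dims = list(zip(*vectors))
    means = [sum(d) / len(d) for d in dims]
    stds = [math.sqrt(sum((x - m) ** 2 for x in d) / len(d)) or 1.0
            for d, m in zip(dims, means)]
    return [[(x - m) / s for x, m, s in zip(v, means, stds)] for v in vectors]

def average_linkage(points, n_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters."""
    clusters = [[i] for i in range(len(points))]

    def dist(c1, c2):
        # Mean Euclidean distance over all cross-cluster pairs of points.
        total = sum(math.dist(points[i], points[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: dist(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)

# HYPOTHETICAL (a, b, c) triples for six texts; invented for illustration only.
texts = [
    (900.0, 1.30, 0.56), (950.0, 1.25, 0.58), (870.0, 1.35, 0.55),
    (400.0, 0.60, 0.80), (420.0, 0.65, 0.78), (380.0, 0.55, 0.82),
]

groups = average_linkage(standardize(texts), n_clusters=2)
print(groups)
```

Standardizing first keeps the parameter a, whose scale depends on corpus size, from dominating the distance computation; whether the authors applied any such scaling is not stated in the abstract.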

Keywords: sentence length distribution; regression analysis; Chinese register; text clustering

Acknowledgments

We would like to thank the anonymous CLLT reviewers for their insightful and helpful comments. Research for this paper was funded by the National Social Science Fund in China (grant number 16BYY110) and by the Hong Kong Polytechnic University-Peking University Research Centre on Chinese Linguistics.


Appendix

Table 11: The numbers of simple sentences in every class.

Class	1	2	3	4	5	6	7	8	9	10	11	12
Length range	1–5	6–10	11–15	16–20	21–25	26–30	31–35	36–40	41–45	46–50	51–55	56–60
Number of simple sentences	731	1100	1256	1189	803	457	305	180	88	67	38	81
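As a rough illustration of how the formula F = a · L^b · c^L can be fitted to counts like these, taking logarithms gives ln F = ln a + b ln L + L ln c, which is linear in ln a, b, and ln c and can be solved by ordinary least squares. The sketch below makes two assumptions that the text here does not confirm: it uses the class index (1 to 12) as L, and it fits on the log scale rather than by nonlinear regression on the raw counts. The helper names `solve_linear` and `fit_abc` are hypothetical.

```python
import math

# Counts of simple sentences per length class, copied from Table 11.
CLASS_INDEX = list(range(1, 13))  # ASSUMPTION: L is the class index
COUNTS = [731, 1100, 1256, 1189, 803, 457, 305, 180, 88, 67, 38, 81]

def solve_linear(A, y):
    """Solve A x = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    M = [row[:] + [rhs] for row, rhs in zip(A, y)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_abc(lengths, freqs):
    """Least-squares fit of ln F = ln a + b ln L + L ln c (needs F > 0)."""
    rows = [(1.0, math.log(L), float(L)) for L in lengths]
    targets = [math.log(F) for F in freqs]
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(3)]
    ln_a, b, ln_c = solve_linear(xtx, xty)
    return math.exp(ln_a), b, math.exp(ln_c)

a, b, c = fit_abc(CLASS_INDEX, COUNTS)
fitted = [a * L ** b * c ** L for L in CLASS_INDEX]
for L, observed, value in zip(CLASS_INDEX, COUNTS, fitted):
    print(f"class {L:2d}: observed {observed:5d}, fitted {value:8.1f}")
```

A log-scale fit minimizes relative rather than absolute errors, so its parameter values can differ from those of a nonlinear fit on the raw counts. Classes with zero frequency would have to be dropped before taking logarithms.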

Figure 32: The observed and fitted values of the number of simple sentences (the gray line shows the observed values).

Published Online: 2017-03-30

Published in Print: 2019-05-27


https://doi.org/10.1515/cllt-2016-0062
