Abstract
This paper reports an innovative study of Chinese registers based on regression analysis of sentence length distributions and on text clustering. Although the end of a sentence is not conventionally marked in Chinese, we resolve this issue by assuming that segments between periods, question marks, and exclamation marks are sentences, which can be further divided into simple sentences and compound sentences. We also assume that segments between punctuation marks that express pauses in utterances form sentence-like units (i.e., clauses). Using regression analysis, we find that the frequency distribution of sentence and clause lengths in Chinese can be fitted by the formula $F = aL^{b}c^{L}$, where $F$ is the frequency, $L$ is the sentence or clause length, and $a$, $b$, and $c$ are the parameters to be fitted. Texts from different registers yield different fitted values of these parameters, which can therefore serve to differentiate the registers. Finally, we use these parameters to represent and cluster texts from different registers. The successful clustering results further support the claim that the fitted parameters are reliable linguistic characteristics of different registers. In terms of linguistic theory, our study shows that modeling sentence length in Chinese using sociological words (i.e., characters) is just as effective as using linguistic words.
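The segmentation assumption described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the set of pause marks (`PAUSE_MARKS`) is an assumption, since the abstract does not list them, and the helper names `split_segments`, `segment_length`, and `length_distribution` are hypothetical. Length is counted in characters, ignoring punctuation, which is also an assumption.

```python
import re
from collections import Counter

SENTENCE_END = "。？！"  # periods, question marks, exclamation marks (from the abstract)
PAUSE_MARKS = "，；："   # ASSUMED pause marks; the abstract does not list the exact set


def split_segments(text, marks):
    """Split text at any character in `marks` and drop empty segments."""
    pattern = "[" + re.escape(marks) + "]"
    return [seg.strip() for seg in re.split(pattern, text) if seg.strip()]


def segment_length(segment):
    """Length in characters, ignoring punctuation left inside the segment."""
    return sum(1 for ch in segment if ch not in SENTENCE_END + PAUSE_MARKS)


def length_distribution(segments):
    """Map each observed length L to its frequency F."""
    return Counter(segment_length(seg) for seg in segments)


# Made-up sample text, used only to exercise the functions.
sample = "今天天气很好，我们去公园。你去吗？"
sentences = split_segments(sample, SENTENCE_END)
clauses = split_segments(sample, SENTENCE_END + PAUSE_MARKS)
print(length_distribution(sentences))
print(length_distribution(clauses))
```

Splitting on the sentence-final marks alone gives sentences, while adding the pause marks gives clauses, so the same function produces both length distributions.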
Acknowledgments
We would like to thank the anonymous CLLT reviewers for their insightful and helpful comments. Research for this paper was funded by the National Social Science Fund of China (Grant Number 16BYY110) and by the Hong Kong Polytechnic University-Peking University Research Centre on Chinese Linguistics.
Appendix
Number of simple sentences in each length class.
| Class | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Length range | 1–5 | 6–10 | 11–15 | 16–20 | 21–25 | 26–30 | 31–35 | 36–40 | 41–45 | 46–50 | 51–55 | 56–60 |
| Number of simple sentences | 731 | 1100 | 1256 | 1189 | 803 | 457 | 305 | 180 | 88 | 67 | 38 | 81 |

Figure: Observed and fitted numbers of simple sentences (the gray line shows the observed values).
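The kind of fit summarized in this appendix can be approximated with a short script. The following is a minimal sketch, not the authors' procedure. It assumes that L is the class index (1 to 12) and not a raw length, and it swaps in an ordinary least-squares fit on the log-transformed model, ln F = ln a + b ln L + L ln c, in place of whatever regression routine the study used. The function name `fit_length_model` is hypothetical.

```python
import math

# Appendix data: number of simple sentences in length classes 1-12.
counts = [731, 1100, 1256, 1189, 803, 457, 305, 180, 88, 67, 38, 81]
# ASSUMPTION: L is the class index, not the raw sentence length.
lengths = list(range(1, len(counts) + 1))


def fit_length_model(lengths, freqs):
    """Fit F = a * L**b * c**L via least squares on ln F = ln a + b ln L + L ln c."""
    n = len(lengths)
    x1 = [math.log(l) for l in lengths]   # predictor ln L
    x2 = [float(l) for l in lengths]      # predictor L
    y = [math.log(f) for f in freqs]      # response ln F
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    # Centered sums of squares and cross-products.
    s11 = sum((u - m1) ** 2 for u in x1)
    s22 = sum((v - m2) ** 2 for v in x2)
    s12 = sum((u - m1) * (v - m2) for u, v in zip(x1, x2))
    s1y = sum((u - m1) * (w - my) for u, w in zip(x1, y))
    s2y = sum((v - m2) * (w - my) for v, w in zip(x2, y))
    # Solve the 2x2 normal equations for b and ln c.
    det = s11 * s22 - s12 ** 2
    b = (s1y * s22 - s12 * s2y) / det
    ln_c = (s11 * s2y - s12 * s1y) / det
    ln_a = my - b * m1 - ln_c * m2
    return math.exp(ln_a), b, math.exp(ln_c)


a, b, c = fit_length_model(lengths, counts)
fitted = [a * l ** b * c ** l for l in lengths]
for l, obs, fit in zip(lengths, counts, fitted):
    print(f"class {l:2d}: observed {obs:5d}, fitted {fit:8.1f}")
```

Because the fit is done on logarithms, it weights relative errors and will not generally reproduce the parameter values of a nonlinear fit on the raw counts. It is meant only to show how the three parameters shape a curve that rises and then decays.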
© 2019 Walter de Gruyter GmbH, Berlin/Boston