A time-series analysis of vocabulary in Japanese texts: Non-characteristic words and topic words
Abstract
In this study, I analyzed the distribution of words within texts from a time-series perspective. The data comprised 635 texts of 1,950 to 2,050 tokens each, taken from the Balanced Corpus of Contemporary Written Japanese. Each text was divided into 10 segments containing an equal number of words, and the distribution of words among the segments was investigated. The relationship between the frequency of appearance and the characteristics of the words was also analyzed. The following conclusions were drawn from the results. (1) The distribution of words among the segments follows a decreasing, Zipf-like curve, but rises again near the end of the curve. (2) At the token level, as the word appearance ratio increases, the ratio of particles increases and the ratio of nouns decreases. In addition, the ratio of auxiliary verbs becomes slightly higher, while the ratio of verbs shows no considerable change. (3) At the type level, by contrast, the proportions of the parts of speech remain almost unchanged. (4) On average, about 12 words per text appeared in all segments, and there was no significant difference between registers. (5) In total, 470 different words appeared in all segments. From the point of view of discourse structure, they were divided into topic words, scene words, function words, and non-characteristic words, and they were classified according to the number of texts in which they appeared.
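The segmentation procedure described in the abstract can be sketched as follows. This is only an illustrative sketch, not the author's implementation: the function names `segment_spread` and `spread_distribution` are hypothetical, the input is assumed to be an already tokenized text (a list of word strings), and segment boundaries are assumed to be rounded so that segment sizes differ by at most one token when the text length is not a multiple of 10.

```python
from collections import Counter


def segment_spread(tokens, n_segments=10):
    """Map each word type to the number of segments in which it occurs.

    Assumption: `tokens` is a list of already segmented words.
    """
    if len(tokens) < n_segments:
        raise ValueError("text is shorter than the number of segments")
    # Integer boundaries keep segment sizes within one token of each other.
    bounds = [i * len(tokens) // n_segments for i in range(n_segments + 1)]
    spread = Counter()
    for start, end in zip(bounds, bounds[1:]):
        # set() ensures a word is counted once per segment, however often it repeats.
        spread.update(set(tokens[start:end]))
    return spread


def spread_distribution(spread, n_segments=10):
    """Count how many word types occur in exactly 1, 2, ..., n_segments segments."""
    counts = Counter(spread.values())
    return [counts.get(k, 0) for k in range(1, n_segments + 1)]


# Toy example with 4 segments of 2 tokens each (hypothetical data).
toy = ["a", "b", "a", "c", "a", "d", "a", "e"]
toy_spread = segment_spread(toy, n_segments=4)
toy_dist = spread_distribution(toy_spread, n_segments=4)
# Words that occur in every segment, as in conclusions (4) and (5).
in_all_segments = sorted(w for w, k in toy_spread.items() if k == 4)
```

In the toy example, `"a"` occurs in all four segments and every other word in exactly one, so the last entry of `toy_dist` counts the words that appear in every segment.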
Chapters in this book
- Frontmatter
- Editors’ Foreword
- Contents
- Why does negation of the predicate shorten a clause?
- The co-effect of Menzerath-Altmann law and heavy constituent shift in natural languages
- Does the century matter? Machine learning methods to attribute historical periods in an Italian literary corpus
- Too much of a good thing
- Linguistic laws in Catalan
- Dating and geolocation of medieval and modern Spanish notarial documents using distributed representation
- Cross-modal authorship attribution in Russian texts
- Free or not so free? On stress position in Russian, Slovene, and Ukrainian
- Unpacking lexical intertextuality: Vocabulary shared among texts
- The Menzerath-Altmann law in the syntactic relations of the Chinese language based on Universal Dependencies (UD)
- Statistical tools, automatic taxonomies, and topic modelling in the study of self-promotional mission and vision texts of Polish universities
- Quantitative characteristics of phonological words (stress units)
- Explorative study on the Menzerath-Altmann law regarding style, text length, and distributions of data points
- Quantitative analysis of the authorship problem of “The Tale of Genji”
- Revisiting Zipf’s law: A new indicator of lexical diversity
- A time-series analysis of vocabulary in Japanese texts: Non-characteristic words and topic words
- Authors’ addresses
- Name index
- Subject index