A time-series analysis of vocabulary in Japanese texts: Non-characteristic words and topic words

Makoto Yamazaki

Abstract

In this study, I analyzed the distribution of words within texts from a time-series perspective. The data comprised 635 texts from the Balanced Corpus of Contemporary Written Japanese, each ranging from 1,950 to 2,050 tokens. Each text was divided into 10 segments containing an equal number of words, and the distribution of words among the segments was investigated. The relationship between a word's frequency of appearance and its characteristics was also analyzed. The results support the following conclusions. (1) The distribution of words among the segments follows a decreasing curve, similar to a Zipf curve, but rises again close to the end of the curve. (2) At the token level, as the word appearance ratio increases, the ratio of particles increases and the ratio of nouns decreases. Additionally, the ratio of auxiliary verbs becomes slightly higher, and there is no considerable change in the ratio of verbs. (3) Conversely, at the type level, the proportions of the parts of speech remain almost unchanged. (4) On average, about 12 words per text appeared in all segments, and there was no significant difference between registers. (5) A total of 470 different words appeared in all segments. From the point of view of discourse structure, they were divided into topic words, scene words, function words, and non-characteristic words, and they were classified according to the number of texts in which they appeared.
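The segmentation procedure summarized in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the function names `segment_spread` and `spread_histogram` are hypothetical, the input is assumed to be an already tokenized text (a list of word strings), and dropping any tokens left over after the equal split is an assumption made here for simplicity.

```python
from collections import Counter


def segment_spread(tokens, n_segments=10):
    """Map each word type to the number of segments in which it appears."""
    # Assumption: segments have equal length; leftover tokens are dropped.
    size = len(tokens) // n_segments
    spread = Counter()
    for i in range(n_segments):
        segment = tokens[i * size:(i + 1) * size]
        # Count each word type at most once per segment.
        spread.update(set(segment))
    return spread


def spread_histogram(spread, n_segments=10):
    """Number of word types appearing in exactly 1, 2, ..., n_segments segments."""
    hist = Counter(spread.values())
    return [hist.get(k, 0) for k in range(1, n_segments + 1)]


# Toy text: "no" occurs in every segment, each "w<i>" in only one.
tokens = []
for i in range(10):
    tokens += ["no", f"w{i}"]

spread = segment_spread(tokens)
histogram = spread_histogram(spread)
all_segment_words = [w for w, k in spread.items() if k == 10]
```

In this toy input, `all_segment_words` holds the single word that occurs in all 10 segments, which corresponds to the kind of word counted in conclusions (4) and (5), and `histogram` gives the distribution of word types over segment counts described in conclusion (1).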
