A Modern Greek readability tool
George Mikros and Rania Voskaki
Abstract
This paper develops an automatic readability analysis tool for Modern Greek as a foreign language. Building on previous work at the Centre for the Greek Language (CGL), we offer an enhanced methodology for readability prediction that matches Modern Greek texts to the adequacy levels (A1 to C2) of the Common European Framework of Reference for Languages. The proposed tool is based on several stylometric indices inspired by work in quantitative linguistics. The resulting feature vectors are used to train a Random Forest, a robust and accurate machine learning algorithm, which predicts readability on our testing dataset with an accuracy of 0.943, surpassing all previous readability tools for Modern Greek. Further analysis of the results with advanced visualization methods reveals the complex and fluid dynamics of the features used and their readability predictions.
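To make the notion of a stylometric feature vector concrete, the following is a minimal sketch only. The abstract does not list the indices the tool actually uses, so the three computed here (type-token ratio, mean word length, mean sentence length) and the function name `stylometric_features` are illustrative assumptions, not the authors' feature set.

```python
import re


def stylometric_features(text: str) -> dict[str, float]:
    """Compute a few illustrative surface indices for one text.

    The choice of indices is an assumption made for illustration;
    it is not the feature set described in the chapter.
    """
    # Split on ., ! and ? plus ';', which serves as the question mark in Greek.
    sentences = [s for s in re.split(r"[.!?;]+", text) if s.strip()]
    # \w+ matches Unicode letters and digits, so Greek words are tokenized too.
    words = [w.lower() for w in re.findall(r"\w+", text)]
    if not words or not sentences:
        raise ValueError("text must contain at least one word")
    return {
        # Distinct word forms divided by total word tokens.
        "type_token_ratio": len(set(words)) / len(words),
        # Average number of characters per word token.
        "mean_word_length": sum(len(w) for w in words) / len(words),
        # Average number of word tokens per sentence.
        "mean_sentence_length": len(words) / len(sentences),
    }


if __name__ == "__main__":
    sample = "Το σπίτι είναι μικρό. Το σπίτι είναι ωραίο."
    for name, value in stylometric_features(sample).items():
        print(f"{name}: {value:.3f}")
```

In a pipeline of the kind the abstract describes, one such vector per text, paired with its level label (A1 to C2), would form the training data for a classifier such as a Random Forest.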
Chapters in this book
- Prelim pages i
- Table of contents v
- Introduction 1
- Part I. Theory and models 7
- On the impact of the initial phrase length on the position of enclitics in Old Czech 9
- Term distance, frequency and collocations 21
- A method for the comparison of general sequences via type-token ratio 37
- Quantitative analysis of syllable properties in Croatian, Serbian, Russian, and Ukrainian 55
- N-grams of grammatical functions and their significant order in the Japanese clause 69
- Linking the dependents 93
- Grammar efficiency and the One-Meaning–One-Form Principle 109
- Distribution and characteristics of commonly used words across different texts in Japanese 121
- Part II. Empirical studies 135
- The perils of big data 137
- From distinguishability to informativity 145
- A Modern Greek readability tool 163
- Phonological properties as predictors of text success 177
- Calculating the victory chances 195
- Topological mapping for visualisation of high-dimensional historical linguistic data 209
- Book genre and author’s gender recognition based on titles 225
- Quantitative analysis of bibliographic corpora 239
- Analysis of English text genre classification based on dependency types 257
- In memory of Gabriel Altmann 271
- Index 277