3 Rapid and automated determination of cluster numbers for high-dimensional big data: a comprehensive update
Abstract
Automatically determining the optimal number of clusters is a central challenge in clustering. Balancing clustering quality against algorithmic efficiency in this determination is the tradeoff that motivated our research. Our approach automates the identification of the optimal number of clusters and is tailored to large, high-dimensional datasets, addressing both the quality and the efficiency of clustering. In experimental studies on five previously explored datasets [23] and four new, larger datasets introduced in this study, we observed that our procedure offers flexibility in selecting among diverse criteria for determining the optimal K in each circumstance. By leveraging the advantages of the bisecting K-means algorithm, our approach outperforms the Ray and Turi method, identifying the best number of clusters more efficiently.
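The abstract does not spell out the procedure, so the following is only a minimal sketch of the general idea, not the chapter's implementation. It assumes a bisecting K-means loop that always splits the cluster with the largest sum of squared errors, and it scores each K with an intra/inter ratio in the style of the Ray and Turi validity measure (mean squared distance to the own centroid, divided by the smallest squared distance between centroids; lower is better). The function names, the farthest-point initialization, and the fixed iteration count are all hypothetical choices made for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def sse(points):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(points)
    return sum(sq_dist(p, c) for p in points)


def two_means(points, iters=10):
    """Bisect one cluster with 2-means (Lloyd iterations).

    Assumption: seeds are chosen deterministically as the point farthest
    from the centroid and the point farthest from that first seed.
    """
    c = centroid(points)
    c0 = max(points, key=lambda p: sq_dist(p, c))
    c1 = max(points, key=lambda p: sq_dist(p, c0))
    left, right = [], []
    for _ in range(iters):
        left = [p for p in points if sq_dist(p, c0) <= sq_dist(p, c1)]
        right = [p for p in points if sq_dist(p, c0) > sq_dist(p, c1)]
        if not left or not right:
            break  # degenerate split, e.g. all points identical
        c0, c1 = centroid(left), centroid(right)
    return left, right


def validity(clusters):
    """Intra/inter ratio in the style of Ray and Turi (lower is better)."""
    n = sum(len(c) for c in clusters)
    intra = sum(sse(c) for c in clusters) / n
    cents = [centroid(c) for c in clusters]
    inter = min(sq_dist(a, b)
                for i, a in enumerate(cents) for b in cents[i + 1:])
    return float("inf") if inter == 0 else intra / inter


def choose_k(points, k_max):
    """Grow clusters by bisection up to k_max and score every K >= 2."""
    clusters = [list(points)]
    scores = {}
    while len(clusters) < k_max:
        splittable = [c for c in clusters if len(c) >= 2]
        if not splittable:
            break
        target = max(splittable, key=sse)  # bisect the worst cluster
        left, right = two_means(target)
        if not left or not right:
            break
        clusters = [c for c in clusters if c is not target]
        clusters += [left, right]
        scores[len(clusters)] = validity(clusters)
    if not scores:
        raise ValueError("could not split the data into two clusters")
    best_k = min(scores, key=scores.get)
    return best_k, scores
```

Calling `choose_k(points, k_max=5)` on a list of coordinate tuples returns the K with the lowest ratio together with the score for every K that was tried. Because each step reuses the previous partition and only bisects one cluster, all candidate values of K are scored in a single pass, which is the efficiency argument usually made for bisecting K-means over rerunning K-means from scratch for each K.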
Chapters in this book
- Frontmatter I
- Preface V
- Contents VII
Methods and instrumentation
- 1 Identifying and estimating outliers in time series with nonstationary mean through multiobjective optimization method 1
- 2 Using the intentionally linked entities (ILE) database system to create hypergraph databases with fast and reliable relationship linking, with example applications 21
- 3 Rapid and automated determination of cluster numbers for high-dimensional big data: a comprehensive update 37
- 4 Canonical correlation analysis and exploratory factor analysis of the four major centrality metrics 49
- 5 Navigating the landscape of automated data preprocessing: an in-depth review of automated machine learning platforms 71
- 6 Generating random XML 83
Applications and case studies
- 7 Exploring autism risk: a deep dive into graph neural networks and gene interaction data 105
- 8 Leveraging ChatGPT and table arrangement techniques in advanced newspaper content analysis for stock insights 121
- 9 An experimental study on road surface classification 145
- 10 RNN models for evaluating financial indices: examining volatility and demand-supply shifts in financial markets during COVID-19 165
- 11 Topological methods for vibration feature extraction 185
- 12 Dyna-SPECTS: DYNAmic enSemble of Price Elasticity Computation models using Thompson Sampling in e-commerce 215
- 13 Creating a metadata schema for reservoirs of data: a systems engineering approach 251
- 14 Implementation and evaluation of an eXplainable artificial intelligence to explain the evaluation of an assessment analytics algorithm for freetext exams in psychology courses in higher education to attest QBLM-based competencies 271
- 15 Toward a skill-centered qualification ontology supporting data mining of human resources in knowledge-based enterprise process representations 307
- Index 333