Machine learning-enabled techniques for speech categorization
Rajiv Ranjan Giri
Abstract
Trolling, abusive language, misuse of the right to freedom of speech in the name of free speech, and bigotry intended to incite hatred and disharmony among people or groups on various grounds are ubiquitous on social platforms. This work therefore not only flags comments on such platforms as hostile speech but also identifies religion-oriented, offensive, violence-provoking, and normal speech. The proposed algorithm uses a traditional yet simple word-matching criterion, with complexity O(n), to categorize speech into five classes by identifying the words in a comment that match three predesigned sets: SR (set of religion-oriented words), SO (set of offensive words), and SV (set of violence-provoking words). These words are selected on the basis of the Indian Penal Code (IPC) sections used to handle hate-speech cases. To test the algorithm, we extracted comments on three incidents, namely the Hijab case (Karnataka), Boycott movies, and Dharmsansad (Haridwar), from different YouTube news channels and their respective Facebook pages, and created three datasets. The datasets are then preprocessed to maintain anonymity and to obtain clean data for categorization. The total similarity index is also computed to determine which kind of words is used most in these datasets. After preprocessing, the final sample sizes are 116, 372, and 861 for the three datasets, respectively, and the misclassification rates of the present approach are 2.58%, 1.344%, and 0.813% of the total sample sizes, suggesting that the approach generalizes better with larger samples. To handle imbalanced data, SVMSMOTE is used, and its output is fed to linear-, polynomial-, and RBF-kernel support vector machines, logistic regression, decision trees with Gini index and entropy criteria, and random forest classifiers to estimate the efficacy of the proposed algorithm.
The highest and lowest efficacies, attained with random forest and polynomial-kernel SVM, respectively, are 99.03% and 92.95% for the Dharmsansad (Haridwar) dataset, 98.66% and 92.95% for the Boycott movies dataset, and 97.42% and 89.87% for the Hijab case (Karnataka) dataset. The work also has limitations: it cannot categorize comments with deep sarcasm or skepticism, and it does not grasp the context of statements that endorse violence without explicitly using violence-provoking phrases.
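The word-matching idea in the abstract can be illustrated with a minimal sketch. It is not the chapter's implementation: the three word sets below are hypothetical placeholders (the actual SR, SO, and SV lists are not given here), the function name `categorize` is invented for illustration, and the rule that a comment matching more than one set is labelled hostile is an assumption about how the five classes are separated.

```python
import re

# Hypothetical placeholder word sets; the chapter's actual SR, SO, and SV
# lists are not reproduced in the abstract.
SR = {"temple", "mosque", "church"}   # religion-oriented words
SO = {"idiot", "stupid"}              # offensive words
SV = {"kill", "burn", "attack"}       # violence-provoking words


def categorize(comment: str) -> str:
    """Label a comment with one of five classes in a single pass.

    Each token is checked against the three sets with constant-time
    lookups, so the cost grows linearly with the number of tokens.
    """
    hits = {"religion-oriented": 0, "offensive": 0, "violence-provoking": 0}
    for token in re.findall(r"[a-z']+", comment.lower()):
        if token in SR:
            hits["religion-oriented"] += 1
        if token in SO:
            hits["offensive"] += 1
        if token in SV:
            hits["violence-provoking"] += 1

    matched = [label for label, count in hits.items() if count > 0]
    if not matched:
        return "normal"
    if len(matched) == 1:
        return matched[0]
    # Assumption: words from more than one set mark the comment as hostile.
    return "hostile"
```

Under these assumptions, a comment with no matching words is labelled normal, a comment matching a single set takes that set's label, and a comment that draws on several sets is labelled hostile. Because matching is purely lexical, the sketch shares the limitation noted in the abstract: sarcasm and implicit endorsements of violence pass through as normal.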
Chapters in this book
- Frontmatter
- Preface
- Contents
- Machine learning-enabled techniques for speech categorization
- Comprehensive study of cybersecurity issues and challenges
- An energy-efficient FPGA-based implementation of AES algorithm using HSTL IO standards for new digital age technologies
- A comparative study on security issues and clustering of wireless sensor networks
- Heuristic approach and its application to solve NP-complete traveling salesman problem
- Assessment of fake news detection from machine learning and deep learning techniques
- Spam mail detection various machine learning methods and their comparisons
- Cybersecurity threats in modern digital world
- Mechanism to protect the physical boundary of organization where the private and public networks encounter
- By combining binary search and insertion sort, a sorting method for small input size
- Index