
Machine learning-enabled techniques for speech categorization

  • Rajiv Ranjan Giri, Richa Indu and Sushil Chandra Dimri
This chapter is in the book Algorithms

Abstract

Trolling, abusive language, abuse of the right to freedom of speech in the name of free speech, and bigotry intended to incite hatred and disharmony among people or groups on various grounds are now ubiquitous on social platforms. This work therefore not only identifies hostile speech in comments on such platforms but also distinguishes religion-oriented, offensive, violence-provoking, and normal speech. The proposed algorithm uses simple, traditional word matching, with complexity O(n), to categorize speech into these five classes by identifying words in a comment that match entries in three predesigned sets: SR (set of religion-oriented words), SO (set of offensive words), and SV (set of violence-provoking words). The words are selected on the basis of the IPC sections that deal with hate speech. To test the algorithm, we extracted comments on three incidents, namely the Hijab case (Karnataka), Boycott movies, and Dharmsansad (Haridwar), from various YouTube news channels and their respective Facebook pages, and created three datasets. These datasets are then preprocessed to maintain anonymity and obtain clean data for categorization. The total similarity index is also computed to determine which kind of words is used most in these datasets. After preprocessing, the final sample sizes are 116, 372, and 861 for the three datasets, respectively, and the misclassification rates of the present approach are 2.58%, 1.344%, and 0.813% of the total sample sizes, indicating that the approach generalizes better with larger samples. To handle class imbalance, SVMSMOTE is used, and its output is fed to linear-, polynomial-, and RBF-kernel support vector machines, logistic regression, decision trees with Gini index and entropy criteria, and random forest classifiers to estimate the efficacy of the proposed algorithm.
The highest and lowest efficacies of 99.03% and 92.95% for the Dharmsansad (Haridwar) dataset, 98.66% and 92.95% for the Boycott movies dataset, and 97.42% and 89.87% for the Hijab case (Karnataka) dataset are attained with random forest and polynomial-kernel SVM, respectively. The work also has limitations: it cannot categorize comments containing deep sarcasm or skepticism, and it does not comprehend the context of statements that endorse violence without explicitly using violence-provoking phrases.
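
The word-matching step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The contents of SR, SO, and SV below are hypothetical placeholders, since the abstract does not list the actual words. The rule that a comment matching more than one set is labeled "hostile" is likewise an assumption made only to obtain five classes. The function name `categorize` and the tokenization are also illustrative.

```python
import re

# Hypothetical placeholder word sets. The chapter's actual SR, SO, and SV
# lists (selected with reference to IPC sections) are not given in the abstract.
SR = frozenset({"temple", "mosque", "church"})   # religion-oriented words
SO = frozenset({"idiot", "stupid"})              # offensive words
SV = frozenset({"attack", "burn"})               # violence-provoking words


def categorize(comment: str) -> str:
    """Assign one of five labels to a comment by simple word matching."""
    # Lowercase and split into alphabetic tokens.
    tokens = re.findall(r"[a-z']+", comment.lower())

    in_sr = in_so = in_sv = False
    # Single pass over the tokens; each set lookup is O(1) on average,
    # so the cost is linear in the number of tokens.
    for tok in tokens:
        if tok in SR:
            in_sr = True
        if tok in SO:
            in_so = True
        if tok in SV:
            in_sv = True

    matched = [
        label
        for label, hit in (
            ("religion-oriented", in_sr),
            ("offensive", in_so),
            ("violence-provoking", in_sv),
        )
        if hit
    ]
    if not matched:
        return "normal"
    if len(matched) == 1:
        return matched[0]
    # Assumption: words from more than one set mark the comment as hostile.
    return "hostile"
```

Because matching relies only on the literal presence of listed words, a sketch like this shares the limitation noted in the abstract: sarcasm, or an endorsement of violence that avoids the listed words, is labeled "normal".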

Downloaded on 20.10.2025 from https://www.degruyterbrill.com/document/doi/10.1515/9783111229157-001/html