Machine learning-enabled techniques for speech categorization

Rajiv Ranjan Giri; Richa Indu; Sushil Chandra Dimri

Abstract

In the modern era, trolling, abusive language, misuse of the right to freedom of speech in the name of free speech, and bigotry intended to incite hatred and disharmony among people or groups on various grounds are ubiquitous on social platforms. Therefore, this work categorizes comments on such platforms not only as hostile speech but also as religion-oriented, offensive, violence-provoking, or normal speech. The proposed algorithm uses a traditional yet simple word-matching criterion with O(n) complexity to categorize speech into five classes by identifying the words in comments that match three predesigned sets: SR (set of religion-oriented words), SO (set of offensive words), and SV (set of violence-provoking words). These words are selected on the basis of the different Indian Penal Code (IPC) sections that address cases of hate speech. To test the algorithm, we extracted comments on three different incidents, namely the Hijab case (Karnataka), Boycott movies, and Dharmsansad (Haridwar), from different YouTube news channels and their respective Facebook pages, and created three datasets. These datasets are then preprocessed to maintain anonymity and obtain clean data for categorization. Furthermore, the total similarity index is computed to determine which kind of words is used most in these datasets. After preprocessing, the final sample sizes are 116, 372, and 861 for the three datasets, respectively, and the misclassification rates of the present approach are 2.58%, 1.344%, and 0.813% of the total sample sizes, suggesting that the presented approach generalizes better with larger samples. To handle imbalanced data, SVMSMOTE is used, and its output is fed to linear-, polynomial-, and RBF-kernel support vector machine (SVM), logistic regression, decision tree (with Gini index and entropy criteria), and random forest classifiers to estimate the efficacy of the proposed algorithm. The highest and lowest efficacies, attained with random forest and polynomial-kernel SVM, respectively, are 99.03% and 92.95% for the Dharmsansad (Haridwar) dataset, 98.66% and 92.95% for the Boycott movies dataset, and 97.42% and 89.87% for the Hijab case (Karnataka) dataset. However, the work also has some limitations, such as the inability to categorize comments with deep sarcasm or skepticism and a lack of comprehension of the context of statements that endorse violence without explicitly using violence-provoking phrases.
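The word-matching idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the chapter's implementation: the contents of SR, SO, and SV below are hypothetical placeholders, the function name `categorize` is invented for illustration, and the rule that maps set matches to the five classes (in particular, labeling a comment as hostile when it matches more than one set) is an assumption, since the abstract does not state it.

```python
import re

# Hypothetical placeholder vocabularies. The chapter's actual SR, SO, and SV
# sets are derived from IPC sections and are not listed in the abstract.
SR = {"temple", "mosque", "church"}   # religion-oriented words (assumed)
SO = {"idiot", "stupid"}              # offensive words (assumed)
SV = {"attack", "burn"}               # violence-provoking words (assumed)


def categorize(comment):
    """Assign one of five labels to a comment by simple word matching."""
    tokens = re.findall(r"[a-z']+", comment.lower())
    hits = {"religion-oriented": 0, "offensive": 0, "violence-provoking": 0}

    # Single pass over the tokens: O(n) in the number of tokens, because
    # membership tests on Python sets take constant time on average.
    for tok in tokens:
        if tok in SR:
            hits["religion-oriented"] += 1
        if tok in SO:
            hits["offensive"] += 1
        if tok in SV:
            hits["violence-provoking"] += 1

    matched = [label for label, count in hits.items() if count > 0]
    if not matched:
        return "normal"
    if len(matched) == 1:
        return matched[0]
    # Assumed rule: words from more than one set mark the comment as hostile.
    return "hostile"
```

Because the approach relies on explicit vocabulary, a comment that endorses violence without using any listed word would fall into the normal class here, which mirrors the limitation the abstract acknowledges.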
