Roman Urdu Multi-Class Offensive Text Detection using Hybrid Features and SVM
EasyChair Preprint 4810
5 pages • Date: December 25, 2020

Abstract
Hate content has become a significant issue worldwide due to the growth of social networking sites. Detecting hate content in languages other than English is challenging. We propose a new technique that automatically classifies Roman Urdu comments from YouTube videos into five classes: Religious Hate, Violence Promotion, Extremist (Racist), Threat/Fear, and Neutral. We generated a dataset by scraping Roman Urdu comments from YouTube videos and had it labeled by language experts. We used N-grams and TF-IDF values for feature extraction, followed by SVM classification. Some classes have relatively few instances, so we employed SMOTE for class balancing. The developed model achieves a classification performance of 77.45% under 10-fold cross-validation. The proposed approach offers superior classification results compared to others.

Keyphrases: Roman Urdu, TF-ID, Tri-gram, Uni-gram, YouTube, deep learning, forensic lab air, hate speech, machine learning, n-gram, religious hate, roman urdu data, uni gram bi, violence promotion
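The feature-extraction step named in the abstract (word N-grams weighted by TF-IDF) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `word_ngrams` and `tfidf_vectors`, the whitespace tokenization, the unsmoothed `log(N / df)` weighting, and the sample comments are all assumptions made for illustration. The abstract does not specify these details, and the SMOTE and SVM steps are not shown.

```python
import math
from collections import Counter


def word_ngrams(text, n_min=1, n_max=3):
    """Return all word n-grams of length n_min..n_max from a lowercased comment."""
    tokens = text.lower().split()  # assumption: simple whitespace tokenization
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def tfidf_vectors(comments, n_min=1, n_max=3):
    """Map each comment to a dict {ngram: tf-idf weight}.

    tf  = count of the n-gram in the comment / total n-grams in the comment
    idf = log(N / df), where df is the number of comments containing the n-gram
    (assumption: plain, unsmoothed weighting; the paper may use a variant)
    """
    docs = [Counter(word_ngrams(c, n_min, n_max)) for c in comments]
    n_docs = len(docs)

    # Document frequency: in how many comments each n-gram occurs.
    df = Counter()
    for counts in docs:
        df.update(counts.keys())

    vectors = []
    for counts in docs:
        total = sum(counts.values())
        vectors.append({
            gram: (count / total) * math.log(n_docs / df[gram])
            for gram, count in counts.items()
        })
    return vectors


# Hypothetical placeholder comments, not taken from the paper's dataset.
comments = ["yeh video acha hai", "yeh video bura hai", "bohat acha kaam"]
vectors = tfidf_vectors(comments)
```

Each resulting dictionary is a sparse feature vector for one comment. In this weighting, an n-gram that occurs in every comment gets weight zero, while n-grams unique to a single comment get the largest weights. In practice, a library vectorizer such as scikit-learn's `TfidfVectorizer` with an `ngram_range` setting would typically replace this hand-written version.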