
Explainable AI approach towards Toxic Comment Classification

EasyChair Preprint 2773, version 2

24 pages · Date: February 27, 2020

Abstract

In this paper, we document our results and discuss our approach towards solving a serious problem that people face online: encountering "toxic", "abusive", "inappropriate" and "offensive" content in the form of textual comments on social media. The purpose of this project is to help curb cyberbullying and foster a safe online environment. Our methodology comprises collecting data from online resources, preprocessing it, converting the textual data to vectors (TF-IDF, word embeddings), building machine learning and deep learning models, comparing the models using standard metrics as well as interpretability techniques, and thereby selecting the best model. After training and evaluating various models, we conclude that standard evaluation metrics (such as accuracy, precision, recall and F1-score) can often be deceiving, as almost every model we trained achieved very good accuracy scores on the test set. After applying the model-interpretability technique LIME and examining the explanations the models generated on a common set of manually created comments, we noticed that some models were relying on incorrect words when predicting a sentence as toxic. Hence, even with high accuracy or other strong evaluation scores, such models cannot be deployed in real-world scenarios. The combination that gave the best result in our study is a Gated Recurrent Unit (GRU) with word embeddings, which achieved a high accuracy score along with intuitive LIME explanations. This paper provides a comparative study of various machine learning and deep learning models for toxic comment classification. Through this study, we also emphasize that model-interpretability techniques such as LIME are pivotal both for selecting the best model for any ML/DL solution and for establishing end users' trust in the deployed model.
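
The abstract describes auditing a TF-IDF-based classifier with LIME to see which words drive a toxicity prediction. Below is a minimal sketch of that workflow, not the authors' code: the toy comments, the class names, and the logistic regression baseline are illustrative assumptions standing in for the models compared in the paper.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from lime.lime_text import LimeTextExplainer

# Toy corpus standing in for the collected comment data (assumption).
comments = ["you are an idiot", "have a great day",
            "shut up loser", "thanks for the help"]
labels = [1, 0, 1, 0]  # 1 = toxic, 0 = non-toxic

# TF-IDF vectors feeding a simple linear baseline.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(comments, labels)

# LIME perturbs the input comment and fits a local surrogate model,
# exposing which words the classifier actually relied on -- the check
# that revealed models attending to the wrong words despite high accuracy.
explainer = LimeTextExplainer(class_names=["non-toxic", "toxic"])
exp = explainer.explain_instance("shut up, you idiot",
                                 model.predict_proba, num_features=5)
print(exp.as_list())  # (word, weight) pairs; positive weights push toward "toxic"

A model whose top-weighted words match the genuinely offensive terms earns trust; one that weights innocuous words would be rejected, whatever its test-set accuracy.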

Keyphrases: Explainable AI, LIME, Toxic Comment Classification, deep learning, interpretability, machine learning

BibTeX entry
BibTeX does not have the right entry for preprints. This is a hack for producing the correct reference:
@booklet{EasyChair:2773,
  author       = {Aditya Mahajan and Divyank Shah and Gibraan Jafar},
  title        = {Explainable AI approach towards Toxic Comment Classification},
  howpublished = {EasyChair Preprint 2773},
  year         = {EasyChair, 2020}}