Text Clustering for Topic Identification: a TF-IDF and K-Means Approach Applied to the 20 Newsgroups Dataset

EasyChair Preprint 13917
5 pages • Date: July 10, 2024

Abstract

In this paper, we present an efficient approach for topic modeling using Term Frequency-Inverse Document Frequency (TF-IDF) and K-means clustering, applied to the 20 Newsgroups dataset. The 20 Newsgroups dataset is a well-known collection of approximately 20,000 newsgroup documents, partitioned across 20 different newsgroups. Our method involves preprocessing the text data to remove noise, calculating the TF-IDF matrix to represent the documents in a high-dimensional space, and employing K-means clustering to group the documents into distinct topics. The effectiveness of the approach is demonstrated through the identification of coherent topic clusters, highlighting the key terms associated with each cluster. This straightforward yet powerful combination of TF-IDF and K-means clustering offers a robust solution for text clustering and topic identification tasks, making it suitable for various natural language processing applications. The results show that our method can effectively uncover the underlying topics within a large text corpus, providing valuable insights for further text analysis and information retrieval.

Keyphrases: 20 news groups, Clusters, K-means, Natural Language Processing, Term frequency Inverse Term Frequency
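The pipeline the abstract describes (preprocess, build a TF-IDF matrix, cluster with K-means, read off the key terms per cluster) can be illustrated with a minimal, standard-library-only sketch. This is not the authors' implementation: the six-document toy corpus, the tiny stop-word list, the smoothed IDF variant, the farthest-point seeding, and all function names are assumptions chosen to keep the example small and deterministic. A real run on the 20 Newsgroups dataset would more likely use an established library and 20 clusters.

```python
import math
from collections import Counter

# Hypothetical toy corpus standing in for the 20 Newsgroups documents.
DOCS = [
    "The rocket launch put the satellite in orbit",
    "NASA plans a new rocket and satellite mission",
    "The orbit of the satellite decays slowly",
    "The hockey team won the game in overtime",
    "A hockey player scored in the playoff game",
    "The team lost the playoff game",
]
# Tiny illustrative stop-word list (an assumption, not the paper's list).
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens only, and drop stop words."""
    cleaned = "".join(c.lower() if c.isalpha() else " " for c in text)
    return [w for w in cleaned.split() if w not in STOP_WORDS]


def tfidf_matrix(docs):
    """Return (vocab, rows); each row is an L2-normalised TF-IDF vector."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n_docs = len(docs)
    doc_freq = Counter(w for toks in tokenized for w in set(toks))
    # Smoothed IDF, one common variant: ln((1 + N) / (1 + df)) + 1.
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = [tf[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        rows.append([v / norm for v in vec])
    return vocab, rows


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_centroids(rows, k):
    """Deterministic farthest-point seeding (assumption: replaces random init)."""
    centroids = [list(rows[0])]
    while len(centroids) < k:
        far = max(rows, key=lambda r: min(sq_dist(r, c) for c in centroids))
        centroids.append(list(far))
    return centroids


def kmeans(rows, k, n_iter=10):
    """Plain Lloyd iterations: assign to nearest centroid, then recompute means."""
    centroids = init_centroids(rows, k)
    labels = [0] * len(rows)
    for _ in range(n_iter):
        labels = [
            min(range(k), key=lambda j: sq_dist(r, centroids[j])) for r in rows
        ]
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def top_terms(vocab, centroid, n=3):
    """Terms with the largest centroid weights, used to describe a cluster."""
    ranked = sorted(range(len(vocab)), key=lambda i: centroid[i], reverse=True)
    return [vocab[i] for i in ranked[:n]]


vocab, rows = tfidf_matrix(DOCS)
labels, centroids = kmeans(rows, k=2)
for j, centroid in enumerate(centroids):
    print(f"cluster {j}: {top_terms(vocab, centroid)}")
print("labels:", labels)
```

Because the rows are L2-normalised, squared Euclidean distance is a monotone function of cosine similarity, which is one reason K-means tends to behave reasonably on TF-IDF vectors. The number of clusters is set to 2 here only to match the toy corpus.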

