Download PDFOpen PDF in browser

Genre Classification Problem: in Pursuit of Systematics on a Big Webcorpus

14 pagesPublished: March 18, 2019

Abstract

This article is devoted to the problem of defining a genre in computer linguistics and searching for parameters that could formalize the concept of a genre. All kinds of existing typologies of genres rely on different types of features, whereas in the practice of NLP, any modern applications are adapted to learning on big data, and therefore - on text features that do not require additional non-automatic markup. Based on such text-internal features, in this article we focus on differentiation of various genres and their grouping on the basis of a similar distribution of features. The description of the contribution of various types of features to the final result and their interpretation are given, and also an analysis of how such features can be used to further adaptation of NLP models is provided. The materials of the "Taiga" corpus with genre annotation are used as experimental data.

Keyphrases: genre classification, machine learning, text classification, web corpus

In: Gerhard Wohlgenannt, Ruprecht von Waldenfels, Svetlana Toldova, Ekaterina Rakhilina, Denis Paperno, Olga Lyashevskaya, Natalia Loukachevitch, Sergei O. Kuznetsov, Olga Kultepina, Dmitry Ilvovsky, Boris Galitsky, Ekaterina Artemova and Elena Bolshakova (editors). Proceedings of Third Workshop "Computational linguistics and language science", vol 4, pages 70-83.

BibTeX entry
@inproceedings{CLLS2018:Genre_Classification_Problem_Pursuit,
  author    = {Tatiana Shavrina},
  title     = {Genre Classification Problem: in Pursuit of Systematics on a Big Webcorpus},
  booktitle = {Proceedings of Third Workshop "Computational linguistics and language science"},
  editor    = {Gerhard Wohlgenannt and Ruprecht von Waldenfels and Svetlana Toldova and Ekaterina Rakhilina and Denis Paperno and Olga Lyashevskaya and Natalia Loukachevitch and Sergei O. Kuznetsov and Olga Kultepina and Dmitry Ilvovsky and Boris Galitsky and Ekaterina Artemova and Elena Bolshakova},
  series    = {EPiC Series in Language and Linguistics},
  volume    = {4},
  publisher = {EasyChair},
  bibsource = {EasyChair, https://easychair.org},
  issn      = {2398-5283},
  url       = {/publications/paper/kL4r},
  doi       = {10.29007/f1kr},
  pages     = {70-83},
  year      = {2019}}
Download PDFOpen PDF in browser