A network approach to dimensionality reduction in Text Mining

Michelangelo Misuraca; Germana Scepi; Maria Spano

Open Conference Systems, 50th Scientific meeting of the Italian Statistical Society

Last modified: 2018-06-04

Abstract

The ever-increasing popularity of the Internet, together with rapid progress in computer technology, has led to tremendous growth in the availability of electronic documents. There is great interest in developing statistical tools for the effective and efficient extraction of information from document repositories on the Web. The most common reference model for representing documents is the so-called vector space model. Documents are coded as bags of words, i.e. as unordered sets of terms, disregarding grammatical and syntactical roles. The focus is on the presence or absence of a term in a document, and on its characterisation and discrimination power. The knowledge discovery process implies a dimensionality reduction step, via feature selection, feature extraction, or both. Here we propose a novel strategy for dimensionality reduction in a Text Mining framework. The idea is that textual data can be processed at different levels, e.g. as single terms or as subsets of terms identifying different concepts. Network analysis tools allow the reduction to be performed by visualising the relationships among the terms and the concepts. An analysis of a set of tweets about the 2018 Italian general election will show the effectiveness of our proposal.
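
As a rough illustration of the bag-of-words coding and of a term network built on top of it, the following minimal Python sketch may help. It is not the authors' procedure: the toy corpus, the function names, the whitespace tokenisation and the `min_weight` threshold are all assumptions made for illustration. It codes each document by the presence or absence of terms, then links two terms whenever they appear together in at least `min_weight` documents.

```python
from collections import Counter
from itertools import combinations


def bag_of_words(doc):
    """Return the set of lower-cased terms in a document (presence/absence only)."""
    return set(doc.lower().split())


def document_term_matrix(docs):
    """Build a binary document-term matrix and its sorted vocabulary."""
    bags = [bag_of_words(d) for d in docs]
    vocab = sorted(set().union(*bags))
    matrix = [[1 if term in bag else 0 for term in vocab] for bag in bags]
    return vocab, matrix


def cooccurrence_edges(docs, min_weight=2):
    """Count the documents shared by each pair of terms.

    Only pairs co-occurring in at least `min_weight` documents are kept;
    this threshold is a hypothetical, very crude form of feature selection.
    """
    counts = Counter()
    for doc in docs:
        for pair in combinations(sorted(bag_of_words(doc)), 2):
            counts[pair] += 1
    return {pair: w for pair, w in counts.items() if w >= min_weight}


# Toy corpus, invented for illustration only (not the tweets analysed in the paper).
docs = [
    "election vote party",
    "vote party leader",
    "election vote turnout",
]

vocab, matrix = document_term_matrix(docs)
edges = cooccurrence_edges(docs, min_weight=2)
```

In this toy case the surviving edges link "vote" to "election" and to "party", so the weakly connected terms drop out of the network. A real analysis would need proper tokenisation and pre-processing, and a principled criterion for retaining terms and concepts.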
