Tree-based Semi-supervised Clustering

Claudio Conversano; Giulia Contu; Luca Frigau; Francesco Mola

Open Conference Systems, 50th Scientific meeting of the Italian Statistical Society

Last modified: 2018-05-31

Abstract

Recently, new clustering methods that take advantage of complex network algorithms have been developed in the literature (De Oliveira et al., 2008; De Arruda et al., 2012). Complex network approaches can achieve higher accuracy than traditional clustering methods. These approaches define the distances between objects with classical metrics and then apply complex network algorithms to partition the data, treating the distances as links from which the network is derived (Granell et al., 2002; De Oliveira et al., 2008; Granell et al., 2010; De Arruda et al., 2012).
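As a rough illustration of this pipeline, the following minimal sketch computes Euclidean distances, links objects whose distance falls below a threshold, and partitions the resulting network. The threshold rule, the weight 1 / (1 + d) and the deterministic label-propagation pass are assumptions made for the example; they stand in for the network construction and community detection algorithms of the cited works, which are not specified here.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    """Classical metric: Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_network(points, threshold):
    """Link two objects when their distance is below `threshold`.

    The threshold rule and the weight 1 / (1 + d) are illustrative
    assumptions, not the construction used in the cited papers.
    """
    adj = defaultdict(dict)
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = euclidean(points[i], points[j])
            if d < threshold:
                adj[i][j] = adj[j][i] = 1.0 / (1.0 + d)
    return adj


def label_propagation(adj, n, max_iter=100):
    """Deterministic label propagation (a simple stand-in for a
    community detection algorithm): each node repeatedly adopts the
    label with the largest total edge weight among its neighbours,
    ties going to the smallest label."""
    labels = list(range(n))
    for _ in range(max_iter):
        changed = False
        for node in range(n):
            if not adj[node]:
                continue  # isolated node keeps its own label
            score = defaultdict(float)
            for neighbour, weight in adj[node].items():
                score[labels[neighbour]] += weight
            best = max(sorted(score), key=lambda lab: score[lab])
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break
    return labels


# Two well-separated toy groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
network = build_network(points, threshold=2.0)
labels = label_propagation(network, len(points))
```

Objects that end up with the same label form one community; the label values themselves carry no meaning.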

We propose a new approach in which the distances are defined by machine learning methods. In particular, we consider tree-based methods such as Classification and Regression Trees (CART) and Random Forests (RF) (Breiman et al., 1984; Breiman, 2001). The result is a semi-supervised clustering method in which the communities are identified by community detection algorithms (traditional, divisive, modularity-based and other methods).
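One possible reading of this idea is sketched below with scikit-learn and NetworkX. It is not the authors' procedure, which the abstract does not detail. The sketch assumes that only some objects carry a class label (unlabelled ones are marked with -1, a convention chosen for the example), that the similarity between two objects is the Random Forest proximity (the share of trees in which both fall into the same leaf), and that greedy modularity maximisation serves as the community detection step. The function names `rf_proximity` and `tree_based_communities` are hypothetical.

```python
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier


def rf_proximity(forest, X):
    """Share of trees in which two samples land in the same leaf.

    Builds an (n, n, n_trees) boolean array, so it is only suitable
    for small data sets.
    """
    leaves = forest.apply(X)  # shape: (n_samples, n_trees)
    same_leaf = leaves[:, None, :] == leaves[None, :, :]
    return same_leaf.mean(axis=2)


def tree_based_communities(X, y_partial, n_trees=200, seed=0):
    """Assumed semi-supervised scheme: fit the forest on the labelled
    objects only, then derive a proximity network over all objects."""
    labelled = y_partial >= 0  # -1 marks an unlabelled object
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    forest.fit(X[labelled], y_partial[labelled])

    proximity = rf_proximity(forest, X)
    np.fill_diagonal(proximity, 0.0)  # no self-loops in the network

    # Non-zero proximities become weighted edges.
    graph = nx.from_numpy_array(proximity)
    communities = greedy_modularity_communities(graph, weight="weight")
    return [sorted(c) for c in communities]


# Toy data: three blobs, with a label kept for every third object.
X, y = make_blobs(n_samples=60, centers=3, cluster_std=0.5, random_state=0)
y_partial = np.where(np.arange(60) % 3 == 0, y, -1)
communities = tree_based_communities(X, y_partial, n_trees=50)
```

Each returned list holds the row indices of one community. Swapping `greedy_modularity_communities` for another NetworkX community routine would correspond to trying a different family of detection algorithms.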

We illustrate the advantages of our approach through a simulation experiment and some real data examples. One of them concerns the performance metrics and the content characteristics of 137 websites of UNESCO sites located in Italy, France and Spain.
