Co-clustering algorithms for histogram data

Antonio Balzanella; Francisco de A.T. De Carvalho; Rosanna Verde

Open Conference Systems, 50th Scientific meeting of the Italian Statistical Society

Last modified: 2018-06-04

Abstract

One requirement of the current big-data age is to represent groups of data by summaries that lose as little information as possible. Recently, histograms have been used to summarize numerical variables, since they retain more information about the data generation process than characteristic values such as the mean, the standard deviation, or quantiles. We propose two co-clustering algorithms for histogram data based on the double $k$-means algorithm. The first, named distributional double Kmeans (DDK), extends double Kmeans (DK), which was proposed for standard quantitative data, to histogram data. The second, named adaptive distributional double Kmeans (ADDK), extends DDK with automated variable weighting, allowing co-clustering and feature selection to be performed simultaneously.
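To make the idea of double $k$-means on distributional data concrete, the following is a minimal sketch and not the authors' implementation. It rests on assumptions the abstract does not state: each cell of the data table is summarized by its quantile function on a fixed grid of levels, and dissimilarity is the squared L2 Wasserstein distance, approximated by the mean squared difference between quantile vectors. The function names (`histograms_to_quantiles`, `distributional_double_kmeans`, `criterion`) are hypothetical, and the variable weighting of ADDK is not included.

```python
import numpy as np

def histograms_to_quantiles(samples, levels):
    """Summarise each cell (row, column) by its empirical quantiles.

    samples: array (n_rows, n_cols, n_obs) of raw values per cell.
    Returns an array (n_rows, n_cols, len(levels)).
    """
    return np.quantile(samples, levels, axis=2).transpose(1, 2, 0)

def criterion(Q, rows, cols, G):
    """Sum over cells of the approximate squared L2 Wasserstein distance
    between the cell's quantile vector and its block prototype."""
    return ((Q - G[rows][:, cols]) ** 2).mean(axis=2).sum()

def distributional_double_kmeans(Q, n_row_clusters, n_col_clusters,
                                 n_iter=20, seed=0):
    """Alternate prototype, row-partition and column-partition updates."""
    rng = np.random.default_rng(seed)
    n, p, m = Q.shape
    rows = rng.integers(n_row_clusters, size=n)  # random initial partitions
    cols = rng.integers(n_col_clusters, size=p)
    history = []
    for _ in range(n_iter):
        # 1. Block prototypes: average quantile vector of each block.
        #    Empty blocks keep a zero prototype (a simplification).
        G = np.zeros((n_row_clusters, n_col_clusters, m))
        for k in range(n_row_clusters):
            for h in range(n_col_clusters):
                block = Q[rows == k][:, cols == h]
                if block.size:
                    G[k, h] = block.mean(axis=(0, 1))
        # 2. Assign each row to the row cluster with the lowest cost.
        row_cost = ((Q[:, None] - G[:, cols][None]) ** 2).mean(axis=3).sum(axis=2)
        rows = row_cost.argmin(axis=1)
        # 3. Assign each column to the column cluster with the lowest cost.
        col_cost = ((Q[:, :, None] - G[rows][:, None]) ** 2).mean(axis=3).sum(axis=0)
        cols = col_cost.argmin(axis=1)
        history.append(criterion(Q, rows, cols, G))
    return rows, cols, G, history

# Toy data: a 6 x 4 table whose cells are normal samples with block-wise means.
rng = np.random.default_rng(1)
means = np.repeat(np.repeat(np.array([[0.0, 5.0], [10.0, 15.0]]), 3, axis=0), 2, axis=1)
samples = rng.normal(means[:, :, None], 1.0, size=(6, 4, 200))
levels = np.linspace(0.05, 0.95, 19)
Q = histograms_to_quantiles(samples, levels)
rows, cols, G, history = distributional_double_kmeans(Q, 2, 2)
```

Each of the three steps can only lower or preserve the criterion, so `history` is non-increasing. Like ordinary $k$-means, the procedure may stop at a local optimum that depends on the random initial partitions.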
