Open Conference Systems, 50th Scientific meeting of the Italian Statistical Society

Flexible clustering methods for high-dimensional data sets
Cristina Tortora, Paul McNicholas

Last modified: 2018-05-17

Abstract


Finite mixture models assume that a population is a convex combination of densities, with each cluster modeled by its own density function; they are therefore well suited to clustering applications. One of the most flexible distributions is the generalized hyperbolic distribution (GHD): it can handle skewness and heavy tails, and it has many well-known distributions as special or limiting cases. The multiple scaled GHD (MSGHD) and the mixture of coalesced GHDs (MCGHD) are even more flexible and can detect non-elliptical, and even non-convex, clusters. The drawback of high flexibility is heavy parametrization, especially for high-dimensional data, because the number of parameters depends on the number of variables. As a result, these methods are not well suited to high-dimensional data clustering. However, the eigen-decomposition of the component scale matrix can naturally be used for dimension reduction, yielding a transformation of the MSGHD and MCGHD that is better suited to high-dimensional data clustering.
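
To make the parametrization issue concrete, the sketch below counts the free parameters of a symmetric p x p scale matrix and shows how an eigen-decomposition can be truncated to project data onto a few leading directions. It is a generic illustration only, not the authors' transformation of the MSGHD or MCGHD; the function names, the choice of q, and the simulated scale matrix are all assumptions made for the example.

```python
import numpy as np


def scale_matrix_free_params(p: int) -> int:
    """Number of free parameters in a symmetric p x p scale matrix."""
    return p * (p + 1) // 2


def leading_eigen_projection(sigma: np.ndarray, q: int):
    """Return the q largest eigenvalues of a symmetric matrix and their eigenvectors.

    Hypothetical helper for illustration; it keeps only the q leading directions.
    """
    eigvals, eigvecs = np.linalg.eigh(sigma)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:q]      # indices of the q largest
    return eigvals[order], eigvecs[:, order]


# Simulated example (assumed values): p = 50 variables, q = 3 retained directions.
rng = np.random.default_rng(0)
p, q = 50, 3
a = rng.standard_normal((p, p))
sigma = a @ a.T + p * np.eye(p)                # symmetric positive-definite matrix

vals, gamma_q = leading_eigen_projection(sigma, q)
x = rng.standard_normal((10, p))               # 10 observations in p dimensions
z = x @ gamma_q                                # the same observations in q dimensions

print(scale_matrix_free_params(p))             # scale-matrix parameters for one component
print(z.shape)
```

With p = 50, a single unconstrained scale matrix already carries 1275 free parameters per mixture component, which is the growth in parametrization the abstract refers to.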

References


 

  1. Barndorff-Nielsen, O., Halgreen, C.: Infinite divisibility of the hyperbolic and generalized inverse Gaussian distributions. Z. Wahrscheinlichkeitstheorie Verw. Gebiete 38, 309–311 (1977)
  2. Barndorff-Nielsen, O., Kent, J., Sørensen, M.: Normal variance-mean mixtures and z distributions. International Statistical Review / Revue Internationale de Statistique 50(2), 145–159 (1982)
  3. Browne, R.P., McNicholas, P.D.: A mixture of generalized hyperbolic distributions. Canadian Journal of Statistics 43(2), 176–198 (2015)
  4. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B 39(1), 1–38 (1977)
  5. Forbes, F., Wraith, D.: A new family of multivariate heavy-tailed distributions with variable marginal amounts of tailweights: Application to robust clustering. Statistics and Computing 24(6), 971–984 (2014)
  6. Gneiting, T.: Normal scale mixtures and dual probability densities. Journal of Statistical Computation and Simulation 59(4), 375–384 (1997)
  7. Tortora, C., Franczak, B., Browne, R., McNicholas, P.: A mixture of coalesced generalized hyperbolic distributions. Journal of Classification (accepted) (2018)
  8. Tortora, C., McNicholas, P.D., Browne, R.P.: A mixture of generalized hyperbolic factor analyzers. Advances in Data Analysis and Classification 10(4), 423–440 (2016)
