Open Conference Systems, CLADAG2023

Simultaneous Clustering and Variable Selection on Multi-view Data
Shuai Yuan

Last modified: 2023-06-21

Abstract


To capture the heterogeneity of human behavior accurately, psychologists frequently apply cluster analysis to identify subgroups with distinctive behavioral profiles. Cluster analysis is especially useful in data-intensive studies involving a large number of variables collected from multiple sources, because the wealth of information in such data sets can lead to important discoveries about hitherto unknown subtypes. These applications, however, face two major challenges. First, large-scale, multi-source data sets are likely to contain irrelevant variables that do not contribute to the separation of the subgroups and, in the worst case, may even prevent accurate recovery of the clusters. Second, to avoid false detection of clusters, the findings should be validated through both theory-driven and data-driven lenses, but guidance for this validation process is lacking. In response to these two challenges, this tutorial describes a recently proposed method, Cardinality K-means (CKM), that performs variable selection and clustering simultaneously, and discusses a framework for cluster validation. The tutorial also provides a step-by-step guide to conducting simultaneous clustering and variable selection on multi-view data using the R package CKM and the Shiny app ClusterViz. To this end, an illustrative example of clustering citizens based on their political opinions is presented in detail, with annotated R code.
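
The idea of clustering while restricting the number of variables that contribute to the partition can be illustrated with a small, self-contained sketch. It is not the CKM package or its API, and it should not be read as the published algorithm. It is a simplified alternating scheme written under stated assumptions: variables are ranked by their between-cluster sum of squares, a fixed number (`n_keep`) is retained, and k-means is rerun on the retained variables until the selection stops changing. All function names, the farthest-first initialisation, and the toy data are hypothetical choices made for illustration.

```python
def sq_dist(a, b, cols):
    """Squared Euclidean distance restricted to the variables in `cols`."""
    return sum((a[j] - b[j]) ** 2 for j in cols)


def kmeans(data, k, cols, iters=100):
    """Plain Lloyd's k-means that only looks at the variables in `cols`."""
    # Deterministic farthest-first initialisation (illustrative assumption).
    centers = [list(data[0])]
    while len(centers) < k:
        far = max(data, key=lambda r: min(sq_dist(r, c, cols) for c in centers))
        centers.append(list(far))
    labels = None
    for _ in range(iters):
        new = [min(range(k), key=lambda c: sq_dist(r, centers[c], cols))
               for r in data]
        if new == labels:  # assignments are stable: converged
            break
        labels = new
        for c in range(k):
            members = [r for r, lab in zip(data, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(v) / len(members) for v in zip(*members)]
    return labels


def between_ss(data, labels, k, j):
    """Between-cluster sum of squares of variable j for a given partition."""
    grand = sum(r[j] for r in data) / len(data)
    total = 0.0
    for c in range(k):
        vals = [r[j] for r, lab in zip(data, labels) if lab == c]
        if vals:
            total += len(vals) * (sum(vals) / len(vals) - grand) ** 2
    return total


def cluster_and_select(data, k, n_keep, rounds=20):
    """Alternate between clustering and keeping the n_keep best variables."""
    p = len(data[0])
    cols = list(range(p))  # start from all variables
    for _ in range(rounds):
        labels = kmeans(data, k, cols)
        scores = [between_ss(data, labels, k, j) for j in range(p)]
        best = sorted(sorted(range(p), key=lambda j: -scores[j])[:n_keep])
        if best == cols:  # selection no longer changes
            break
        cols = best
    return kmeans(data, k, cols), cols


# Toy data: variables 0 and 1 separate two groups of 10 observations,
# variables 2 and 3 are small deterministic "noise".
toy = [[(5.0 if i >= 10 else 0.0) + 0.1 * (i % 3),
        (5.0 if i >= 10 else 0.0) - 0.1 * (i % 2),
        0.1 * ((7 * i) % 5 - 2),
        0.1 * ((3 * i) % 4 - 1.5)]
       for i in range(20)]

labels, selected = cluster_and_select(toy, k=2, n_keep=2)
print(selected)  # [0, 1]
```

In this sketch the number of retained variables is fixed in advance. In practice that number, like the number of clusters, would have to be chosen and the resulting partition validated, which is the second challenge the abstract raises.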