Least Squares Clustering
Project outline and aims
Clustering is an important part of data mining which aims at finding and describing coherent chunks of data in databases and data tables. This project intends to develop a mathematical theory for clustering based on the idea that the discovered clusters should be treated as an "ideal" representation of the data, to be used for recovering the original data back from the ideal format. This is the idea of the data recovery approach: not only use the data for finding clusters, but also use the clusters for recovering the data. In general, the data recovered from aggregate clusters cannot fit the original data exactly, and this discrepancy can be used to evaluate the quality of the clusters: the better the fit, the better the clusters. This perspective also leads to addressing issues in pre- and post-processing, which becomes possible because the parts of the data that are well explained by the clusters can be separated from those that are not.
This data recovery approach works because of the Pythagorean decomposition of the data scatter into "explained" and "unexplained" parts, thus extending to clustering what works well in more traditional data mining and statistics areas such as regression, analysis of variance and factor analysis.
The results found within this approach amount to a mathematical theory for clustering involving: (a) unification of a considerable part of clustering techniques, (b) effective techniques for tackling old and new problems, and (c) firm relations among various notions and algorithms in statistics, machine learning and combinatorial optimization. The unifying capability of data recovery clustering is grounded in convenient relations between approximation problems and geometrically explicit clustering.
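The Pythagorean decomposition can be illustrated numerically: when the data are centred and each centroid is the mean of its cluster, the data scatter splits exactly into an explained (between-cluster) part and an unexplained (within-cluster) part. The following is a minimal sketch, not the project's software; the synthetic data and the plain NumPy K-Means below are hypothetical.

```python
# Illustrative sketch of the Pythagorean decomposition behind least-squares
# clustering. Synthetic data and a bare-bones K-Means; not the project's code.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-Means; returns labels and the cluster-mean centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        labels = np.argmin(((X[:, None] - centroids) ** 2).sum(axis=2), axis=1)
        # Move each centroid to the mean of its cluster (skip empty clusters).
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])
X -= X.mean(axis=0)  # centre the data, as the theory assumes

labels, centroids = kmeans(X, 2)

T = (X ** 2).sum()  # total data scatter
# Explained part: cluster sizes times squared centroid norms.
B = sum((labels == j).sum() * (centroids[j] ** 2).sum() for j in range(2))
# Unexplained part: within-cluster least-squares residuals.
W = sum(((X[labels == j] - centroids[j]) ** 2).sum() for j in range(2))

assert np.isclose(T, B + W)  # Pythagorean decomposition: T = B + W
```

The ratio B/T is the share of the data scatter explained by the clusters, which is the sense in which a better fit means better clusters.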
The approach is described in
Recent invited presentations
Recent developments
Some publications: S. Nascimento, B. Mirkin, F. Moura-Pires (2003) Modeling proportional membership in fuzzy clustering, IEEE Transactions on Fuzzy Systems, 11, no. 2, 173-186; S. Nascimento (2005) Fuzzy Clustering Via Proportional Membership Model, IOS Press, 178 p.
Some publications: I. Wasito, B. Mirkin (2005) Nearest neighbour approach in the least squares data imputation algorithms, Information Sciences, 169, 1-25; I. Wasito, B. Mirkin (2006) Nearest neighbours in least-squares data imputation algorithms with different missing patterns, Computational Statistics & Data Analysis 50, 926-949.
Some papers: Robert W. Stanforth, Evgueni Kolossov, Boris Mirkin (2007) A Measure of domain of applicability for QSAR modelling based on intelligent K-Means clustering, QSAR and Combinatorial Science, 26, no. 7, 837-844; Robert W. Stanforth, Evgueni Kolossov, Boris Mirkin (2007) Hybrid K-Means: Combining regression-wise and centroid-based criteria for QSAR (in P. Brito, P. Bertrand, G. Cucumel, F. de Carvalho (Eds.) Selected Contributions in Data Analysis and Classification, Springer, 225-233).
Some papers: Mark MingTso Chiang and B. Mirkin (2006) Determining the number of clusters in the Straight K-means: Experimental comparison of eight options, UKCI 2006 Proceedings, 119-126; Mark MingTso Chiang and B. Mirkin (2010) Intelligent choice of the number of clusters in K-Means clustering: an experimental study with different cluster spreads, Journal of Classification, 27, no. 1, 1-38.
Some papers: B. Mirkin, E. Koonin (2003) A top-down method for building genome classification trees with linear binary hierarchies, in M. Janowitz, J.-F. Lapointe, F. McMorris, B. Mirkin, and F. Roberts (Eds.) Bioconsensus, DIMACS Series, Providence: AMS, 97-112; B. Mirkin, R. Camargo, T. Fenner, G. Loizou and P. Kellam (2010) Similarity clustering of proteins using substantive knowledge and reconstruction of evolutionary gene histories in herpesvirus, Theoretical Chemistry Accounts, 125, no. 3-6, 569-581.
Some publications: B. Mirkin (2001) Eleven Ways to Look at the Chi-Squared Coefficient for Contingency Tables, The American Statistician, 55, no. 2, 111-120; B. Mirkin (2001) Reinterpreting the Category Utility Function, Machine Learning, 45, 219-228; B. Mirkin (2002) Towards comprehensive clustering of mixed scale data with K-Means, Proceedings of the UK Workshop on Computational Intelligence, Birmingham, 2-4 September, 2002, 126-130.
B. Mirkin, S. Nascimento, L. Muniz Pereira (2007) How ACM classification can be used for profiling a University CS department, a PowerPoint presentation; ACM classification can be used for representing research organizations, DIMACS Technical Report 2007-13, 18 p.
________________________________________________________________________________
Topics for student projects
Building a classification tree for mixed data scales (extending some ideas from the Mirkin 2001 paper in Machine Learning).
Stratification of entities over multiple performance metrics.
Clustering text documents (extending some ideas from the Pampapathi, Mirkin and Levene 2006 paper in Machine Learning).