Clustering refers to data mining tools and techniques by which a set of cases is placed into natural groupings based on their measured characteristics. Since the number of characteristics is often large, a multivariate measure of similarity between cases needs to be employed. Statgraphics provides a number of methods for deriving clusters, including nearest neighbor, furthest neighbor, centroid, median, group average, Ward's method, and the method of K-Means. The results may be displayed as a dendrogram, a membership table, or an icicle plot. Agglomeration plots are used to suggest the proper number of clusters.
More: Cluster Analysis.pdf
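To make the nearest neighbor idea concrete, here is a minimal Python sketch of agglomerative clustering with single linkage. It is an illustration only, not the Statgraphics implementation: the function name `single_linkage`, the use of Euclidean distance as the similarity measure, and the data are all assumptions made for the example.

```python
import math

def single_linkage(points, n_clusters):
    """Agglomerative clustering with nearest-neighbor (single) linkage.

    Starts with every case in its own cluster and repeatedly merges the two
    clusters whose closest members are nearest, until n_clusters remain.
    Returns the clusters (lists of case indices) and the merge distances.
    """
    clusters = [[i] for i in range(len(points))]
    merge_distances = []
    while len(clusters) > n_clusters:
        best = None  # (distance, index_a, index_b) of the closest pair so far
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest two members.
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
        merge_distances.append(d)
    return clusters, merge_distances

# Made-up data: six cases, each with two measured characteristics.
cases = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0), (5.1, 4.8), (4.9, 5.2)]
clusters, merge_distances = single_linkage(cases, n_clusters=2)
membership = sorted(sorted(c) for c in clusters)
print(membership)                              # which cases ended up together
print([round(d, 3) for d in merge_distances])  # distance at each merge step
```

The list of merge distances is the kind of information an agglomeration plot charts: a sharp jump in distance from one step to the next suggests that the merge joined two genuinely different groups, so the clustering should stop before it.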
Classification refers to data mining tools and techniques by which a set of cases is assigned to levels of a categorical factor based on their characteristics. A training set of known cases is used to develop a classification algorithm, which can then be used to predict the category to which unknown cases most likely belong. For example, applicants for a loan might be placed into risk categories based on their personal characteristics, given an algorithm developed from previous applicants.
The Neural Network Classifier in Statgraphics uses a method based on nonparametric density function estimates combined with Bayesian priors.
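The general idea of combining nonparametric density estimates with prior probabilities can be sketched in a few lines of Python. This is a simplified illustration, not the Statgraphics algorithm: the Gaussian kernel, the fixed `bandwidth`, the function names, and the loan data are all assumptions chosen for the example.

```python
import math

def kernel_density(x, sample, bandwidth):
    """Gaussian kernel density estimate at point x from a list of training cases."""
    dim = len(x)
    norm = (2 * math.pi * bandwidth ** 2) ** (dim / 2)
    total = sum(math.exp(-math.dist(x, s) ** 2 / (2 * bandwidth ** 2))
                for s in sample)
    return total / (len(sample) * norm)

def classify(x, training, priors, bandwidth=1.0):
    """Return the category with the largest prior * estimated density at x."""
    scores = {label: priors[label] * kernel_density(x, cases, bandwidth)
              for label, cases in training.items()}
    return max(scores, key=scores.get)

# Hypothetical training set of previous applicants:
# (income, debt ratio), both in made-up scaled units.
training = {
    "low risk": [(8.0, 1.0), (7.5, 1.5), (9.0, 0.8)],
    "high risk": [(3.0, 4.0), (2.5, 4.5), (3.5, 3.8)],
}
priors = {"low risk": 0.5, "high risk": 0.5}  # assumed equal prior probabilities

prediction_a = classify((8.2, 1.1), training, priors)
prediction_b = classify((3.1, 4.2), training, priors)
print(prediction_a, prediction_b)
```

Changing the priors shifts the decision boundary: a category believed to be rare needs stronger evidence from the density estimate before a new case is assigned to it.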
Measures of association are used to identify variables that are related to each other. If the factors are quantitative, correlation coefficients may be used. If the factors are non-quantitative, other measures of association are used. A matrix plot with nonlinear Lowess smoothers is shown at the right.
Statgraphics includes statistics such as Pearson's product-moment correlation coefficient, Kendall and Spearman rank correlations, partial correlations, lambda, the uncertainty coefficient, Somers' D, the contingency coefficient, eta, Cramer's V, conditional gamma, Pearson's R, and Kendall's tau.
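The difference between the product-moment and rank correlations is easiest to see on a small example. The Python sketch below implements three of the statistics named above from their textbook definitions; the function names and the data are illustrative assumptions, and the Kendall version shown is the simple tau-a form with no correction for ties.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def kendall_tau_a(x, y):
    """(concordant pairs - discordant pairs) / total pairs, no tie correction."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)

# Made-up data: y = x**3 is perfectly monotone in x but not linear.
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]
r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
tau = kendall_tau_a(x, y)
print(round(r_pearson, 3), round(r_spearman, 3), round(tau, 3))
```

On this data both rank-based measures equal 1 because the relationship is perfectly monotone, while Pearson's coefficient is about 0.94 because the relationship is not a straight line.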
Prediction refers to the development of statistical models that can predict the value of one variable given the values of other variables. Regression models of various sorts are often used for this purpose. When the number of predictors is large, selection of a good model can be difficult. In Statgraphics, the Regression Model Selection procedure fits models involving all possible combinations of a set of predictors and selects the best models using criteria such as Mallows' Cp and the adjusted R-squared statistic.
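An all-subsets search of this kind can be sketched in plain Python. The code below is an illustration under stated assumptions, not the Statgraphics procedure: the function names, the normal-equations solver, and the data are invented for the example. It uses the usual definitions, adjusted R-squared = 1 - (SSE/(n - p)) / (SST/(n - 1)) and Cp = SSE/MSE_full - (n - 2p), where p counts the coefficients including the intercept.

```python
from itertools import combinations

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def sse(columns, y):
    """Residual sum of squares of a least-squares fit with an intercept."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    beta = solve(xtx, xty)
    return sum((yi - sum(bj * rj for bj, rj in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))

def all_subsets(predictors, y):
    """Fit every subset of predictors; return (names, adjusted R2, Cp) per model."""
    n = len(y)
    names = list(predictors)
    sst = sse([], y)  # intercept-only model gives the total sum of squares
    mse_full = sse([predictors[v] for v in names], y) / (n - len(names) - 1)
    results = []
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            p = size + 1  # number of coefficients, including the intercept
            e = sse([predictors[v] for v in subset], y)
            adj_r2 = 1 - (e / (n - p)) / (sst / (n - 1))
            cp = e / mse_full - (n - 2 * p)
            results.append((subset, adj_r2, cp))
    return results

# Made-up data: y depends on x1 and x2 plus small noise; x3 is irrelevant.
predictors = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [3, 7, 1, 8, 2, 6, 4, 5],
    "x3": [5, 3, 8, 1, 7, 2, 6, 4],
}
noise = [0.1, -0.2, 0.15, -0.05, 0.2, -0.1, -0.15, 0.05]
y = [2 * a + 3 * b + e for a, b, e in zip(predictors["x1"], predictors["x2"], noise)]

results = all_subsets(predictors, y)
best = max(results, key=lambda r: r[1])  # highest adjusted R-squared
for subset, adj_r2, cp in sorted(results, key=lambda r: -r[1]):
    print(subset, round(adj_r2, 4), round(cp, 2))
```

With k candidate predictors there are 2^k models to fit, which is why an exhaustive search like this is practical only for a moderate number of predictors.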