I. Stephanakis, Theodoros Iliou, G. Anastassopoulos
Aug 25, 2017
Clustering algorithms such as k-means, BIRCH, CLARANS and DBSCAN are designed to be scalable and to discover clusters in the full dimensional space of a database. Nevertheless, their behavior depends upon the size of the database. A database or data warehouse may store terabytes of data, and complex data analysis (mining) may take a very long time to run on such datasets. To accelerate information processing, one has to obtain a reduced representation of the dataset that is much smaller in volume yet produces the same, or almost the same, analytical results. Reduced representations yield simplified models that are easier to interpret, avoid the curse of dimensionality and enhance generalization by reducing overfitting. Data reduction methods include data cube aggregation, attribute subset selection, fitting data into models, dimensionality reduction and the use of hierarchies, among other approaches. Feature selection may be viewed as a special case of a more general paradigm, called Structure Learning, in which an outcome is associated with a set of attributes. Feature selection aims at selecting a minimum set of features such that the probability distribution of the different classes given the values of those features is as close as possible to the original distribution given the values of all features. A combined approach is herein proposed, based upon representing complex datasets in a database as a minimal set of connected attribute sets of reduced dimensions. Value-Difference (VD) metrics defined over binary, categorical and continuous values are used for subspace clustering. Each cluster can be represented by a different set of object features/attributes, maximizing the information rendered by the cluster representation. Numerical data regarding a test-bed system for anomaly detection are provided in order to illustrate the aforementioned approach.
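As a rough illustration of the Value-Difference metrics mentioned above, the following sketch implements the classical Value Difference Metric (VDM) for a single categorical attribute: two attribute values are considered close when they induce similar class-conditional probability distributions. This is a minimal, hypothetical example on toy data; the exact metric and its extension to binary and continuous attributes in the proposed approach may differ.

```python
from collections import Counter, defaultdict

def vdm_tables(values, labels):
    """Estimate P(class | value) for one categorical attribute from labeled data."""
    counts = defaultdict(Counter)   # value -> class -> count
    totals = Counter()              # value -> total count
    for v, c in zip(values, labels):
        counts[v][c] += 1
        totals[v] += 1
    classes = sorted(set(labels))
    return {v: [counts[v][c] / totals[v] for c in classes] for v in totals}

def vdm(p1, p2, q=2):
    """Value Difference Metric: sum over classes of |P(c|v1) - P(c|v2)|^q."""
    return sum(abs(a - b) ** q for a, b in zip(p1, p2))

# toy data: one categorical attribute, binary class labels
values = ["red", "red", "blue", "blue", "green", "green"]
labels = [0, 0, 0, 1, 1, 1]
tables = vdm_tables(values, labels)

# "red" always co-occurs with class 0, "green" with class 1,
# while "blue" is split evenly, so it lies between the two.
d_rb = vdm(tables["red"], tables["blue"])   # moderate distance
d_rg = vdm(tables["red"], tables["green"])  # maximal distance
```

Under such a metric, subspace clustering can use distances that are meaningful for categorical attributes, where ordinary Euclidean distance is undefined.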