Paper
Building a new taxonomy for data discretization techniques
Published Dec 1, 2009 · A. Bakar, Z. Othman, N. Shuib
2009 2nd Conference on Data Mining and Optimization
32 citations · 0 influential citations
Abstract
Data preprocessing is an important step in data mining. It resolves various types of problems in a large dataset in order to produce quality data, and it consists of four steps: data cleaning, integration, reduction and transformation. The literature shows that each preprocessing step comprises various techniques, so to produce quality data a data miner must choose the most appropriate technique at every step. In this study, we focus on data reduction, particularly data discretization, as one important preprocessing step. Data reduction shrinks the data distribution by mapping the range of continuous data onto a smaller range of values or categories. Data discretization plays a major role in reducing the attribute intervals of data values. Finding an appropriate number of discrete values can improve the performance of data mining models, particularly in terms of classification accuracy. This paper proposes a four-level taxonomy of data discretization: hierarchical and non-hierarchical; splitting, merging and combination; supervised and unsupervised combinations; and binning, statistic, entropy and other related techniques. The taxonomy is based on a detailed review of previous discretization techniques. More than fifty techniques are investigated, and the structure of the discretization approach is outlined. Guidelines on how to use the proposed taxonomy are also discussed.
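To make the idea of discretization concrete, the following is a minimal sketch of equal-width binning, one common unsupervised technique from the binning family the abstract mentions. It is an illustration only and is not taken from the paper: the function name `equal_width_bins`, the choice of three bins, and the sample `ages` values are all assumptions made for this example.

```python
def equal_width_bins(values, k):
    """Map each continuous value to one of k equal-width intervals (0 .. k-1).

    Hypothetical helper for illustration; not the paper's method.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    lo, hi = min(values), max(values)
    if lo == hi:
        # All values are identical, so a single interval covers them.
        return [0] * len(values)
    width = (hi - lo) / k
    # The maximum value would fall on the upper edge (index k),
    # so clamp it into the last interval.
    return [min(int((v - lo) / width), k - 1) for v in values]


# Assumed sample data: eight ages reduced to three categories.
ages = [18, 22, 25, 31, 40, 47, 58, 63]
print(equal_width_bins(ages, 3))  # [0, 0, 0, 0, 1, 1, 2, 2]
```

Here the number of intervals `k` is the quantity the abstract refers to as the number of discrete values. Equal-width binning ignores class labels, which is what makes it unsupervised; supervised techniques such as entropy-based splitting would instead choose cut points using the class attribute.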