Togaware DATA MINING
Desktop Survival Guide
by Graham Williams


Data Cleaning

Data cleaning deals with removing errant transactions, updating transactions to account for reversals, eliminating missing data, and so on.

The aim of data cleaning is to raise the data quality to a level suitable for the selected analyses.

The data cleaning to be performed depends on the purpose to which the data is to be put. Some activities will require a selection of data cleaning and data transformation modules to be applied to the data.

Data cleaning occurs early in the process and then continually throughout the process as we learn more about the data.

Often we will find ourselves loading data from a CSV file, which is readily supported by R (see Section 2.8). On first loading the data we generally want a quick summary, using R's summary function. It is here that we might notice that some numeric columns have become factors!
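As a minimal sketch of that first check, the class of each column can be listed directly rather than read off the summary. The inline CSV text and the names csv_text, toy, col_classes and suspect are made up for illustration and are not part of the cardiac example. Note also that since R 4.0.0 read.csv defaults to stringsAsFactors=FALSE, so the sketch sets it to TRUE to reproduce the factor behaviour described here.

```r
# Toy CSV text standing in for a real file; the values are invented.
csv_text <- "10,?,3.5
20,7,4.1
30,8,?"

# stringsAsFactors=TRUE reproduces the conversion to factors described above
# (with the R >= 4.0.0 default of FALSE the columns would be character).
toy <- read.csv(text=csv_text, header=FALSE, stringsAsFactors=TRUE)

# List the class of every column.
col_classes <- sapply(toy, class)
print(col_classes)

# Columns that are neither integer nor numeric deserve a closer look.
suspect <- names(col_classes)[!col_classes %in% c("integer", "numeric")]
print(suspect)
```

Here V1 loads as an integer column, while V2 and V3 are flagged because each contains a ? among otherwise numeric values.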

Consider the example of the cardiac dataset (see Section 2.8.2).



> cardiac <- read.csv("cardiac.data", header=FALSE)
> summary(cardiac)
[...]
      V10               V11           V12           V13           V14     
 Min.   :-172.00   52     : 13   60     : 23   49     :  9   ?      :376  
 1st Qu.:   3.75   36     : 10   ?      : 22   55     :  9   84     :  3  
 Median :  40.00   42     :  9   61     : 16   59     :  9   -157   :  2  
 Mean   :  33.68   10     :  8   56     : 14   62     :  9   -164   :  2  
 3rd Qu.:  66.00   33     :  8   58     : 13   26     :  8   -93    :  2  
 Max.   : 169.00   41     :  8   68     : 12   33     :  8   103    :  2  
                   (Other):396   (Other):352   (Other):400   (Other): 65  
[...]

Our understanding of the data might be that we expect these variables to be numeric. Indeed, the telltale sign is V14 having a ? as one of its values. A little more exploration to show the frequency of each value will indicate that the apparently nominal variables have only a single non-numeric value, the ? character. When we read the data from the CSV file we need to tell R that the ? is used to indicate missing values:



> cardiac <- read.csv("cardiac.data", header=FALSE, na.strings="?")
> summary(cardiac)
[...]
      V11               V12               V13               V14       
 Min.   :-177.00   Min.   :-170.00   Min.   :-135.00   Min.   :-179.00
 1st Qu.:  14.00   1st Qu.:  41.00   1st Qu.:  12.00   1st Qu.:-124.50
 Median :  41.00   Median :  56.00   Median :  40.00   Median : -50.50
 Mean   :  36.15   Mean   :  48.91   Mean   :  36.72   Mean   : -13.59
 3rd Qu.:  63.25   3rd Qu.:  65.00   3rd Qu.:  62.00   3rd Qu.: 117.25
 Max.   : 179.00   Max.   : 176.00   Max.   : 166.00   Max.   : 178.00
 NA's   :   8.00   NA's   :  22.00   NA's   :   1.00   NA's   : 376.00
[...]

That's looking better. Note that the NAs are now reported, and that V14 has 376 of them, in accord with the 376 ? values observed previously.
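The frequency exploration mentioned earlier, and the check that the NA counts match, might be sketched as follows. This is an illustration on invented inline data rather than on cardiac.data, and non_numeric_counts is a hypothetical helper written for this sketch, not a function provided by R.

```r
# Toy data standing in for cardiac.data; the values are invented.
csv_text <- "52,?,84
36,60,?
?,61,-157
52,?,?"

# Load as factors to mimic the first summary above.
toy <- read.csv(text=csv_text, header=FALSE, stringsAsFactors=TRUE)

# Hypothetical helper: tabulate the values in a column that cannot be
# parsed as numbers.
non_numeric_counts <- function(x) {
  x <- as.character(x)
  bad <- is.na(suppressWarnings(as.numeric(x))) & !is.na(x)
  table(x[bad])
}

# Frequency of each non-numeric value, column by column.
print(lapply(toy, non_numeric_counts))

# Re-read the same text, treating ? as missing.
clean <- read.csv(text=csv_text, header=FALSE, na.strings="?")

# Every column should now be numeric, and the NA counts should match
# the ? counts found above.
print(sapply(clean, is.numeric))
print(colSums(is.na(clean)))
```

If the only non-numeric value a column reports is ?, then declaring it through na.strings is enough. Any other value showing up in the tables would need its own decision before the column can be treated as numeric.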



Copyright © 2004-2005
Brought to you by Togaware.