DATA MINING
Desktop Survival Guide by Graham Williams
Data cleaning deals with issues such as removing errant transactions, updating transactions to account for reversals, eliminating missing data, and so on.
The aim of data cleaning is to raise the data quality to a level suitable for the selected analyses.
The data cleaning to be performed depends on the purpose to which the data is to be put. Some activities will require a selection of data cleaning and data transformation modules to be applied to the data.
Data cleaning occurs early in the process and then continually throughout the process as we learn more about the data.
Often we will find ourselves loading data from a CSV file, a format readily supported by R (see Section 2.8). On first loading the data we generally want a quick summary, using R's summary function. It is here that we might notice that some numeric columns have become factors!
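As a minimal sketch of that first check, the class of each column can also be listed directly rather than read off the summary. The small inline dataset below is made up for illustration (it is not the cardiac data), and stringsAsFactors=TRUE is set explicitly because R 4.0.0 and later read strings as character vectors by default, in which case the affected column would show up as character rather than factor.

```r
# Hypothetical inline data standing in for a CSV file; the second
# column contains one non-numeric entry.
csv.text <- "10,52,60
20,?,61
30,36,58"

# Read it as we would a file with no header row.
toy <- read.csv(text=csv.text, header=FALSE, stringsAsFactors=TRUE)

# List the class of every column; a numeric column that was read as a
# factor (or character) stands out immediately.
classes <- sapply(toy, class)
print(classes)

# Names of the columns that did not come through as numeric.
suspect <- names(classes)[!classes %in% c("integer", "numeric")]
print(suspect)
```

Any column named in suspect is worth a closer look before analysis begins.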
Consider the example of the cardiac dataset (see Section 2.8.2).
> cardiac <- read.csv("cardiac.data", header=FALSE)
> summary(cardiac)
[...]
      V10               V11           V12           V13           V14
 Min.   :-172.00   52     : 13   60     : 23   49     :  9   ?      :376
 1st Qu.:   3.75   36     : 10   ?      : 22   55     :  9   84     :  3
 Median :  40.00   42     :  9   61     : 16   59     :  9   -157   :  2
 Mean   :  33.68   10     :  8   56     : 14   62     :  9   -164   :  2
 3rd Qu.:  66.00   33     :  8   58     : 13   26     :  8   -93    :  2
 Max.   : 169.00   41     :  8   68     : 12   33     :  8   103    :  2
                   (Other):396   (Other):352   (Other):400   (Other): 65
[...]
Our understanding of the data might be that we expect these variables to be numeric. Indeed, the telltale sign is V14 having a ? as one of its values. A little more exploration to show the frequency of each value will indicate that the apparently nominal variables have only a single non-numeric value, the ?. When we read the data from the CSV file we need to tell R that the ? is used to indicate missing values:
> cardiac <- read.csv("cardiac.data", header=FALSE, na.strings="?")
> summary(cardiac)
[...]
      V11               V12               V13               V14
 Min.   :-177.00   Min.   :-170.00   Min.   :-135.00   Min.   :-179.00
 1st Qu.:  14.00   1st Qu.:  41.00   1st Qu.:  12.00   1st Qu.:-124.50
 Median :  41.00   Median :  56.00   Median :  40.00   Median : -50.50
 Mean   :  36.15   Mean   :  48.91   Mean   :  36.72   Mean   : -13.59
 3rd Qu.:  63.25   3rd Qu.:  65.00   3rd Qu.:  62.00   3rd Qu.: 117.25
 Max.   : 179.00   Max.   : 176.00   Max.   : 166.00   Max.   : 178.00
 NA's   :   8.00   NA's   :  22.00   NA's   :   1.00   NA's   : 376.00
[...]
That's looking better. Note that the NAs are reported and that V14 has 376 of them, in accord with the previous observation of 376 ?'s.
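The exploration described above, finding which values keep a column from being numeric and then confirming that they all become NA, can be sketched as follows. The inline data and the helper name non.numeric are illustrative assumptions; neither is part of the cardiac dataset or of R itself.

```r
# Hypothetical inline data standing in for a CSV file; "?" marks a
# missing value in the second column.
csv.text <- "10,52
20,?
30,36
40,?"

# First load: V2 is read as text because of the "?" entries.
raw <- read.csv(text=csv.text, header=FALSE, stringsAsFactors=FALSE)

# Illustrative helper: tabulate the values of x that cannot be
# converted to a number.
non.numeric <- function(x) {
  x <- as.character(x)
  bad <- is.na(suppressWarnings(as.numeric(x)))
  table(x[bad])
}
print(non.numeric(raw$V2))

# Second load: declare "?" as the missing value marker.
clean <- read.csv(text=csv.text, header=FALSE, na.strings="?")

# Count the NAs per column; the count for V2 should equal the number
# of "?" entries found above.
na.counts <- colSums(is.na(clean))
print(na.counts)
```

If the helper reports more than one distinct non-numeric value, na.strings also accepts a character vector listing all of the missing value markers.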
Brought to you by Togaware.