CSC 470 Data Mining Spring 2004

Assignment 1                           Data Preparation                              10 points

 

Assigned: 01/29/04

Due:         02/05/04

 

Task:

            Create arff files suitable for the WEKA software as described below:

1.     Obtain the file adult.data.adjusted.shortened.csv from my WWW page

2.     Remove attributes: wt, edu, relationship

3.     Use Excel to replace the capitalgain and capitalloss attributes with 4 new attributes:

a.      Whether the person had a capital gain (Y/N)

b.     Whether the person had a capital loss (Y/N)

c.      Whether the person had either of the above (Y/N)

d.     The net gain: capital gain minus capital loss (numeric)

Be careful!  If you don’t know how to do this relatively quickly in Excel – ask!

4.      Save the adjusted file in csv format, but with an .arff extension

5.     Edit the arff file using WordPad to add the arff header information (and save, of course)

6.     Open your file in WEKA to ensure that it opens correctly

7.     Discretize the hrsperwk attribute using WEKA:

a.      Use the unsupervised attribute filter Discretize under the Preprocess tab

b.     Choose options bins=5 and FindNumBins=True

8.     Look at the visualization in the Preprocess area to ensure that the discretizing worked

9.     Save to a different file name so that you have both versions of the file

10.  Re-open the discretized file to make sure that it opens

11.  Turn in both files on a disk
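
Hint: if you want to double-check your Excel and WordPad work for steps 2 through 5, the Python sketch below shows one possible way to do the same transformation. It is only an illustration under several assumptions, not the required method. The new attribute names (hasgain, hasloss, hasgainorloss, netgain), the age and class columns, and the three sample rows are all made up for the example. It also assumes that nominal values contain no spaces or commas, and that ? marks a missing value.

```python
import csv
import io

DROP = {"wt", "edu", "relationship"}  # attributes removed in step 2


def derive_capital_attributes(row):
    """Drop unwanted attributes and replace capitalgain/capitalloss
    with four derived attributes (names are hypothetical)."""
    gain = int(row["capitalgain"])
    loss = int(row["capitalloss"])
    out = {}
    for name, value in row.items():
        if name in DROP or name == "capitalloss":
            continue
        if name == "capitalgain":
            # The four new attributes take the place of capitalgain.
            out["hasgain"] = "Y" if gain > 0 else "N"
            out["hasloss"] = "Y" if loss > 0 else "N"
            out["hasgainorloss"] = "Y" if (gain > 0 or loss > 0) else "N"
            out["netgain"] = gain - loss
        else:
            out[name] = value.strip()
    return out


def is_number(text):
    try:
        float(text)
        return True
    except ValueError:
        return False


def to_arff(rows, relation):
    """Build ARFF text: @relation, one @attribute line per column, then @data."""
    names = list(rows[0].keys())
    lines = ["@relation " + relation, ""]
    for name in names:
        # "?" is treated as missing, so it is left out of the value list.
        known = [str(r[name]) for r in rows if str(r[name]) != "?"]
        if known and all(is_number(v) for v in known):
            lines.append("@attribute " + name + " numeric")
        else:
            choices = ",".join(sorted(set(known)))
            lines.append("@attribute " + name + " {" + choices + "}")
    lines += ["", "@data"]
    for r in rows:
        lines.append(",".join(str(r[name]) for name in names))
    return "\n".join(lines) + "\n"


# Made-up sample rows standing in for the real csv file.
SAMPLE = """age,wt,edu,relationship,capitalgain,capitalloss,hrsperwk,class
39,77516,Bachelors,Not-in-family,2174,0,40,low
50,83311,Bachelors,Husband,0,0,13,low
38,215646,HS-grad,Not-in-family,0,1902,45,high
"""

rows = [derive_capital_attributes(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
arff_text = to_arff(rows, "adult")
print(arff_text)
```

To try this on the real file, you would read it with open(...) in place of the io.StringIO sample and write arff_text to a file with an .arff extension. Compare the header it prints with the one you typed by hand. The nominal value lists here are inferred only from the rows present, so check them against the full data.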