CSC 470 Data Mining Fall 2005 – Evening Section

Assignment 1: Data Preparation (10 points)

 

Assigned: 09/14/05

Due:         09/21/05 

Task:

            Create arff files suitable for the WEKA software as described below:

1.      Obtain the file NJDOHcleanFinalDataReducedReady.csv from my WWW page

2.      Remove attribute: FIPScode

3.      Use Excel to replace the 1990 data (5 attributes) with the percent change from 1990 to 2000. Express the change as a decimal: for example, .02 is a two percent increase.

·        Be careful!  If you don’t know how to do this relatively quickly in Excel – ask!

·        Remove division-by-zero errors (Excel shows these as #DIV/0!), replacing each with ? for unknown

4.      Put 2000 drug admissions last (except for % town name)

5.      Save the adjusted file as a csv file, but with an .arff extension

6.      Edit the arff file using Wordpad to add the arff header information. Make Drug Admissions a nominal attribute with possible values 0, 1, 2, etc.

7.      Replace empty data (how does Excel save empty data when saving to csv?) with ? for unknown.

8.      Save, of course

9.      Open your file in WEKA to ensure that it opens correctly

10.  Discretize all numeric attributes using WEKA:

a.       Use Unsupervised Attribute Discretize under the Preprocess tab

b.      Choose options bins=5 and FindNumBins=True

11.  Look at visualization in Preprocess area to ensure discretizing worked

12.  Save to a different file name so that you have both versions of the file

13.  Re-open the discretized file to make sure that it opens correctly.

14.  Turn in both files you created on a disk
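
For reference, the arithmetic in step 3, the empty-value replacement in step 7, and the header layout in step 6 can be sketched in a few lines of Python. This is only an illustration of the calculations and the file format, not a substitute for the Excel and Wordpad steps above. The relation name and attribute names (njdoh, PopChange, DrugAdmissions2000) and the value list {0,1,2} are hypothetical placeholders; use the actual column names and values from the csv file.

```python
def percent_change(old, new):
    """Return the change from old to new as a decimal string.

    Returns "?" (the ARFF marker for unknown) when the 1990 value is
    zero or either value is missing, mirroring the #DIV/0! case in Excel.
    """
    if old is None or new is None or old == 0:
        return "?"
    return f"{(new - old) / old:.4f}"


def fill_missing(fields):
    """Replace empty csv fields with "?" for unknown."""
    return [f if f.strip() else "?" for f in fields]


def arff_header(relation, attributes):
    """Build ARFF header text from (name, type) pairs.

    A type is either a keyword such as "numeric" or a nominal
    value list such as "{0,1,2}".
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        lines.append(f"@attribute {name} {kind}")
    lines += ["", "@data"]
    return "\n".join(lines)


# Hypothetical names and values, for illustration only.
header = arff_header(
    "njdoh",
    [
        ("PopChange", "numeric"),
        ("DrugAdmissions2000", "{0,1,2}"),
    ],
)
print(header)
print(percent_change(100, 102))      # 0.0200, a two percent increase
print(percent_change(0, 5))          # ?
print(fill_missing(["12", "", "3"]))  # ['12', '?', '3']
```

The data rows follow the @data line, one comma-separated row per line, with one value for each attribute declared in the header and in the same order.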