CIS 658 Data Mining Fall 2007  

Assignment 1                          Data Preparation                                 10 points

 

Assigned: 09/13/07

Due:         09/20/07

 

Task:

            Create arff files suitable for the WEKA software as described below:

1.      Obtain the file cl-data-almost-ready-raw.csv from my WWW page

2.      (Using Excel?) Remove the attributes CountyCode, CommunCode, and Group, plus the extra crime attributes: every crime attribute except violPerPop (the crime attributes are at the end, in the rightmost columns)

3.      Use Excel to create 2 new attributes – difference between upper and lower quartile for:

·         Housing value (look for ownHousLowQ)

·         Rent

·         Be careful!  If you don’t know how to do this relatively quickly in Excel – ask!
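
If you want to cross-check the Excel result for step 3, the subtraction can be sketched in Python as below. This is only a sketch under assumptions: ownHousLowQ comes from the assignment, but the upper-quartile column name ownHousUpQ and the sample numbers are hypothetical, so substitute the names actually in your file. Empty cells are passed through as missing rather than treated as zero.

```python
import csv
import io


def quartile_spread(low, high):
    """Return high - low as a float, or None if either cell is empty or '?'."""
    if low in ("", "?") or high in ("", "?"):
        return None
    return float(high) - float(low)


# Tiny in-memory sample standing in for the real csv file.
# "ownHousUpQ" is an assumed column name, not taken from the assignment.
sample = io.StringIO(
    "ownHousLowQ,ownHousUpQ\n"
    "50000,90000\n"
    ",120000\n"
)

spreads = [
    quartile_spread(row["ownHousLowQ"], row["ownHousUpQ"])
    for row in csv.DictReader(sample)
]
print(spreads)
```

The same function would apply to the rent columns. Where the new columns go in the file is left to you, keeping step 4 in mind.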

4.      Make sure violPerPop is the last column (except for the % town name), which it should already be (probably)

5.      If you’re clever with Excel, you may be able to prepare for adding arff header info while in Excel.

6.      Save the adjusted file as a csv file, but with an arff extension

7.      Edit the arff file using Wordpad to add arff header information. I believe all attributes besides state are numeric.
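
For step 7, an arff header consists of an @relation line, one @attribute line per column, and an @data line before the rows. The sketch below generates such a header instead of typing it by hand. The relation name, the column name pop, and the two state values are made up for illustration; use your real column list and the full set of state codes in your file.

```python
def arff_header(relation, columns, nominal=None):
    """Build an ARFF header string.

    columns: attribute names in file order.
    nominal: optional dict mapping a name to its list of allowed values;
             every other attribute is declared numeric.
    """
    nominal = nominal or {}
    lines = ["@relation " + relation, ""]
    for name in columns:
        if name in nominal:
            lines.append("@attribute %s {%s}" % (name, ",".join(nominal[name])))
        else:
            lines.append("@attribute %s numeric" % name)
    lines += ["", "@data"]
    return "\n".join(lines) + "\n"


# Hypothetical three-column example; the real file has many more attributes.
header = arff_header(
    "communities",
    ["state", "pop", "violPerPop"],
    nominal={"state": ["NJ", "PA"]},
)
print(header)
```

The @attribute lines take the place of the column-name row that Excel writes, so that row should not remain in the data section.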

8.      Replace empty data (Question: how does Excel save empty data when saving to csv?) with ? for unknown.
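
One way to do the replacement in step 8 without hand-editing is sketched below. It assumes that an empty cell shows up in the csv as nothing at all between two commas (check your own file to confirm), and the sample rows are invented.

```python
import csv
import io


def fill_missing(lines, marker="?"):
    """Return csv text in which every empty field is replaced by marker."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(lines):
        # A field counts as empty if it holds only whitespace.
        writer.writerow([cell if cell.strip() else marker for cell in row])
    return out.getvalue()


# Invented rows with gaps in the middle and at the end of a line.
cleaned = fill_missing(io.StringIO("NJ,12.5,,3\nPA,,7.1,\n"))
print(cleaned)
```

A plain search-and-replace of ",," in Wordpad can miss an empty field at the end of a line or two empty fields in a row, which is why the sketch works field by field.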

9.      Save, of course

10.  Open your file in WEKA to ensure that it opens correctly

11.  Discretize all numeric attributes using WEKA:

·         Use the filter unsupervised > attribute > Discretize (Choose button under Filter on the Preprocess tab)

·         Choose options bins=5 and findNumBins=True

·         Make sure the last attribute is not set up as the “class” in order to allow it to be discretized
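
To get a feel for what step 11 does to a column, here is a rough sketch of plain equal-width binning into 5 bins. It is not WEKA's implementation: it ignores the findNumBins option (which lets WEKA choose a smaller number of bins), and its handling of values that fall exactly on a boundary may differ from WEKA's. The sample values are invented, and None stands for a missing value.

```python
def equal_width_bins(values, bins=5):
    """Map each number to a bin index 0..bins-1; None (missing) stays None."""
    known = [v for v in values if v is not None]
    if not known:
        return [None] * len(values)
    lo, hi = min(known), max(known)
    width = (hi - lo) / bins
    labels = []
    for v in values:
        if v is None:
            labels.append(None)
        elif width == 0:
            # All known values are identical, so there is only one bin.
            labels.append(0)
        else:
            # The maximum value would land in bin `bins`; clamp it to the last bin.
            labels.append(min(int((v - lo) / width), bins - 1))
    return labels


labels = equal_width_bins([0.0, 2.5, 5.0, None, 10.0], bins=5)
print(labels)
```

With equal-width bins, a column dominated by a few very large values can end up with nearly every community in the first bin.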

12.  Look at the visualization in the Preprocess area to see where discretizing worked. Remove any attribute in which all communities with a value are categorized into the same group (“all”). This should be only about 10 attributes (mostly attributes with pure counts rather than percentages, etc.)
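
The check in step 12 can also be stated as a small test: an attribute is a candidate for removal if, ignoring unknowns, it has at most one distinct value after discretizing. The sketch below assumes the discretized columns have already been read into a dict of lists; the attribute names and labels are made up.

```python
def single_bin_attributes(columns, missing="?"):
    """Return names of attributes whose known values all share one label.

    columns: dict mapping attribute name to its list of discretized labels.
    Attributes with no known values at all are reported too.
    """
    return [
        name
        for name, labels in columns.items()
        if len({x for x in labels if x != missing}) <= 1
    ]


# Hypothetical discretized data for three attributes.
discretized = {
    "popCount": ["all", "all", "?", "all"],
    "pctUrban": ["low", "high", "low", "?"],
    "violPerPop": ["b1", "b3", "b2", "b1"],
}
to_remove = single_bin_attributes(discretized)
print(to_remove)
```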

13.  Save to a different file name so that you have both versions of the file

14.  Re-open the discretized file to make sure that it opens correctly.

15.  Turn in both files you created as a single zip file, submitted via Blackboard. Questions? Ask!