CSC 470 Data Mining Fall 2005 – Evening Section
Assignment 1 Data Preparation 10 points
Assigned: 09/14/05
Due: 09/21/05
Task:
Create arff files suitable for the WEKA software as described below:
1. Obtain the file NJDOHcleanFinalDataReducedReady.csv from my WWW page
2. Remove attribute: FIPScode
3. Use Excel to replace the 1990 data (5 attributes) with the percent change from 1990 to 2000, expressed as a decimal (for example, .02 is a two percent increase)
· Be careful! If you don’t know how to do this relatively quickly in Excel – ask!
· Remove the #DIV/0! errors, replacing each with ? for unknown
4. Move the 2000 drug admissions column so that it is the last attribute (except for the % town name)
5. Save the adjusted file as a csv file, but with an .arff extension
6. Edit the .arff file using WordPad to add the ARFF header information. Make Drug Admissions a nominal attribute with possible values 0, 1, 2, etc.
7. Replace empty data (how does Excel save empty data when saving to csv?) with ? for unknown.
8. Save, of course
9. Open your file in WEKA to ensure that it opens correctly
10. Discretize all numeric attributes using WEKA:
a. Use Unsupervised Attribute Discretize under the Preprocess tab
b. Choose options bins=5 and FindNumBins=True
11. Look at visualization in Preprocess area to ensure discretizing worked
12. Save to a different file name so that you have both versions of the file
13. Re-open the discretized file to make sure that it opens correctly
14. Turn in both files you created on a disk
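
For anyone who wants to check their Excel results, steps 3, 6 and 7 can be sketched in Python. This is only an illustration, not a substitute for the Excel and WordPad steps the assignment asks for. The percent change is taken as (2000 value - 1990 value) / 1990 value. The attribute names, relation name and data values below are hypothetical, invented for the example; they are not the columns of the real NJDOH file.

```python
def percent_change(old, new):
    """Return (new - old) / old as a decimal, or '?' when it cannot be computed."""
    try:
        old_value = float(old)
        new_value = float(new)
    except (TypeError, ValueError):
        return "?"  # empty or non-numeric cell
    if old_value == 0:
        return "?"  # Excel would report a divide-by-zero error here
    return round((new_value - old_value) / old_value, 4)


def clean_cell(cell):
    """Replace an empty CSV cell with the ARFF missing-value marker '?'."""
    text = str(cell)
    return text if text.strip() != "" else "?"


def to_arff(relation, attributes, rows):
    """Build ARFF text: header lines first, then the comma-separated data."""
    lines = ["@relation " + relation, ""]
    for name, arff_type in attributes:
        lines.append("@attribute {} {}".format(name, arff_type))
    lines.append("")
    lines.append("@data")
    for row in rows:
        lines.append(",".join(clean_cell(value) for value in row))
    return "\n".join(lines) + "\n"


# Hypothetical attributes: one percent-change column, then the nominal class last.
attributes = [
    ("PopulationChange", "numeric"),
    ("DrugAdmissions2000", "{0,1,2}"),
]

# Hypothetical rows: (1990 value, 2000 value, drug admissions).
raw_rows = [
    ("100", "102", "2"),
    ("0", "50", "0"),   # 1990 value of zero, so the change is unknown
    ("80", "", ""),     # empty cells, as Excel saves them in a csv
]

rows = [[percent_change(old, new), admissions] for old, new, admissions in raw_rows]
arff_text = to_arff("nj_towns_example", attributes, rows)
print(arff_text)
```

The nominal attribute is declared by listing its allowed values in braces, and every unknown value appears as ? in the data section, which is what WEKA expects when it opens the file.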