CSC 470 Data Mining Fall 2005
Assignment 3: Data Mining Using WEKA (10 points)
Assigned: 09/29/05
Due: 10/11/05 AT THE START OF CLASS.
IF NOT TURNED IN AT THE START OF CLASS, YOU WILL BE CONSIDERED LATE!!!
Task:
Prepare two versions of the dataset:
a) The original data, with created attributes for percent change in population, etc.;
b) A nominal-only version, produced by discretizing all numeric attributes (findNumBins=true, bins=5).
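As a point of reference for what the discretization step does: with bins=5, WEKA's Discretize filter splits each numeric attribute's range into equal-width intervals (findNumBins=true additionally lets WEKA search for a better bin count). The following is a hypothetical sketch of plain equal-width binning, not WEKA's actual implementation:

```python
def equal_width_bins(values, n_bins=5):
    """Assign each numeric value to one of n_bins equal-width intervals
    spanning [min, max], labeling them bin1..binN."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    labels = []
    for v in values:
        # clamp the maximum value into the last bin; handle constant attributes
        b = min(int((v - lo) / width), n_bins - 1) if width > 0 else 0
        labels.append(f"bin{b + 1}")
    return labels

print(equal_width_bins([1, 2, 3, 50, 100]))
# -> ['bin1', 'bin1', 'bin1', 'bin3', 'bin5']
```

Note how skewed data lands mostly in the lowest bins under equal-width splitting; keep that in mind when interpreting the discretized runs.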
Using WEKA, generate:
a) Two OneR rules (one per dataset);
b) Two NaiveBayesSimple models (one per dataset).
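To make the OneR results easier to read, recall what the algorithm does: for each attribute it builds a rule mapping each attribute value to the majority class among training instances with that value, then keeps the single attribute whose rule makes the fewest training errors. A minimal sketch on nominal data (a simplified illustration, not WEKA's code):

```python
from collections import Counter

def one_r(rows, class_idx):
    """OneR: return (attribute index, value->class rule, training errors)
    for the single attribute whose majority-class rule errs least."""
    n_attrs = len(rows[0])
    best = None
    for a in range(n_attrs):
        if a == class_idx:
            continue
        # tally class counts for each value of attribute a
        by_value = {}
        for r in rows:
            by_value.setdefault(r[a], Counter())[r[class_idx]] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(1 for r in rows if rule[r[a]] != r[class_idx])
        if best is None or errors < best[2]:
            best = (a, rule, errors)
    return best

# toy data: (outlook, temperature, class)
rows = [("sunny", "hot", "no"), ("sunny", "mild", "no"),
        ("rain", "mild", "yes"), ("rain", "hot", "yes")]
print(one_r(rows, class_idx=2))
# -> (0, {'sunny': 'no', 'rain': 'yes'}, 0)
```

This is why OneR models are so easy to read in the output: each model is just one attribute and a lookup table of its values.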
To save your results from the Explorer:
a) In the Result list, right-click the result you want to save;
b) Choose “Save result buffer”;
c) Specify the file name and location in the file chooser that opens (end the file name in .txt to make it easy to open later).
Answer the following questions:
a) What do you think the “chance” probability of getting a prediction correct is in this problem? Explain.
b) Do you think that the programs’ accuracy is significantly better than “chance”? Explain.
c) Do the differences between the results of the different runs appear to be important/significant? Explain your answer.
d) Are there any differences in the more detailed results, as seen in the confusion matrices, between the different runs? Explain any important differences that you see.
e) Since the attribute being predicted is actually an ordinal variable, some incorrect predictions are worse than others. Which are the worst?
f) Which of the method/data combinations has the MOST of these bad errors?
g) Which has the fewest?
h) In this particular case, does discretization appear to make much difference (in either accuracy or the model generated)? Explain.
i) Which models are the most comprehensible?
j) Explain what you could learn about the data from the models generated in these runs.
k) Do you trust/believe in the models that were generated? Do they fit your expectations enough that you would be willing to accept patterns that you don’t already know about? Explain.
l) Which run do you want to use as a comparison for more sophisticated efforts? Explain.
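Two small computations may help with questions (a) and (e)–(g): one common definition of “chance” performance is the accuracy of always predicting the majority class, and the “worst” errors for an ordinal class are the confusion-matrix cells farthest from the diagonal. A hypothetical sketch (the class counts and matrix below are made-up examples, not your data):

```python
def majority_baseline(class_counts):
    """Accuracy of always predicting the most frequent class --
    one reasonable definition of 'chance' performance."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total

def worst_errors(conf_matrix):
    """Count off-diagonal cells at the maximum ordinal distance,
    assuming rows = actual class and columns = predicted class,
    both listed in ordinal order (as in WEKA's output)."""
    n = len(conf_matrix)
    max_dist = n - 1
    return sum(conf_matrix[i][j]
               for i in range(n) for j in range(n)
               if abs(i - j) == max_dist)

print(majority_baseline({"low": 30, "medium": 50, "high": 20}))  # -> 0.5
cm = [[10, 2, 1],
      [3, 12, 2],
      [4, 1, 9]]
print(worst_errors(cm))  # cells (low->high)=1 and (high->low)=4 -> 5
```

Comparing each run's accuracy to the majority baseline, and its worst-error count to the other runs', gives concrete grounding for your written answers.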
Turn in:
§ Disk with data and results files
§ Table summarizing results (Part 4)