CIS 658 Data Mining Fall 2007
Assignment 2: Data Mining Using WEKA (10 points)
Assigned: 09/27/07
Due: 10/04/07 AT THE START OF CLASS.
IF NOT TURNED IN AT THE START OF CLASS, YOU WILL BE CONSIDERED LATE!!!
Task:
Create two versions of the data:
a) One with nominal-only attributes, produced by discretizing all numeric attributes (findNumBins=true, bins=5) – created as part of Assignment 1.
b) One based on the original data, with just the violent-crime attribute discretized using the same method (the rest of the numeric data stays numeric!). This is needed because OneR and NaiveBayes cannot make numeric predictions.
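To make the discretization step concrete, here is a minimal sketch of equal-width binning, the simplest scheme WEKA's Discretize filter supports (WEKA's findNumBins option additionally searches for a good bin count; this sketch just uses a fixed number of bins, and the function name and sample values are illustrative, not from WEKA):

```python
def discretize_equal_width(values, bins=5):
    """Equal-width binning: split the numeric range into `bins`
    intervals of equal size and map each value to its interval index."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    labels = []
    for v in values:
        idx = int((v - lo) / width) if width > 0 else 0
        labels.append(min(idx, bins - 1))  # clamp the max value into the last bin
    return labels

print(discretize_equal_width([0, 1, 2, 9, 10], bins=5))  # → [0, 0, 1, 4, 4]
```

Each numeric value becomes a nominal bin label, which is why the fully discretized dataset contains only nominal attributes.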
Run the following (one run per dataset):
a) Two ZeroR predictions.
b) Two OneR rules.
c) Two NaiveBayes models.
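Before running them in WEKA, it helps to know what the two simplest learners actually do. A minimal sketch of the underlying logic (a simplified illustration, not WEKA's implementation; the toy data below is hypothetical):

```python
from collections import Counter, defaultdict

def zero_r(labels):
    """ZeroR: always predict the most frequent class in the training data."""
    return Counter(labels).most_common(1)[0][0]

def one_r(rows, labels):
    """OneR: for each attribute, build a rule mapping each attribute value
    to its majority class; keep the attribute whose rule makes the fewest
    training errors.  `rows` is a list of dicts of nominal attributes."""
    best_attr, best_rule, best_errors = None, None, len(labels) + 1
    for attr in rows[0]:
        counts = defaultdict(Counter)
        for row, y in zip(rows, labels):
            counts[row[attr]][y] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(rule[row[attr]] != y for row, y in zip(rows, labels))
        if errors < best_errors:
            best_attr, best_rule, best_errors = attr, rule, errors
    return best_attr, best_rule

rows = [{"a": "x", "b": "p"}, {"a": "x", "b": "q"}, {"a": "y", "b": "p"}]
labels = ["lo", "lo", "hi"]
print(zero_r(labels))        # → lo
print(one_r(rows, labels))   # picks attribute "a", whose rule makes no errors
```

ZeroR ignores the attributes entirely, which is why it is a useful baseline: any real learner should beat it.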
To save each result:
a) In the Result list, right-click the result you want to save.
b) Choose “Save result buffer”.
c) Specify the file name and location in the file chooser window that opens (end the file name in .txt to make it easy to open later).
Answer the following questions:
a) We speak of the “chance” probability of getting a prediction correct as the percent correct we might expect to get via random prediction (e.g. the “chance” probability of correctly guessing a coin flip is 50%). What do you think the “chance” probability of getting a prediction correct is in this problem? Explain.
b) Do you think the programs’ accuracy is significantly better (or worse) than “chance”? Or is it roughly the same as “chance”? Explain.
c) Do you think “chance” is the appropriate “strawman” in this problem? Explain. (A “strawman” is an “easy target” that a decent opponent should be able to defeat easily.)
d) Do the differences between the results of the different runs appear to be important/significant? Explain your answer.
e) Are there any differences in the more detailed results, as seen in the confusion matrices, between the different runs? Explain any important differences that you see.
f) Since the attribute being predicted is actually an ordinal variable, some incorrect predictions are worse than others. Which are the worst?
g) Which of the method/data combinations makes the MOST of these bad errors?
h) Which makes the fewest of these bad errors?
i) In this particular case, does discretization appear to make much difference (in either accuracy or the model generated)? Explain.
j) Which models are the most comprehensible? (understandable to humans)
k) Try to explain what you can get out of the models generated if you were trying to learn about the data based on these runs.
l) Do you trust/believe in the models that were generated? Do they fit your expectations enough that you would be willing to accept patterns that you don’t already know about? Explain.
m) Which run do you want to use as a comparison for more sophisticated efforts? Explain.
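As a back-of-the-envelope aid for question (a): with k equally likely classes, uniform random guessing is correct 1/k of the time, but other “chance” baselines exist when classes are imbalanced. A minimal sketch (the class frequencies below are hypothetical; the real distribution comes from your discretized violent-crime attribute):

```python
from collections import Counter

def chance_accuracies(labels):
    """Three 'chance' baselines for a class distribution:
    - uniform: guess each class equally often (1 / number of classes)
    - proportional: guess classes in proportion to their frequencies
      (expected accuracy = sum of squared class frequencies)
    - majority: always guess the largest class (what ZeroR achieves)"""
    n = len(labels)
    freqs = [c / n for c in Counter(labels).values()]
    return {
        "uniform": 1 / len(freqs),
        "proportional": sum(f * f for f in freqs),
        "majority": max(freqs),
    }

# hypothetical skewed 5-bin distribution: half the instances in the lowest bin
print(chance_accuracies(["b1"] * 5 + ["b2"] * 2 + ["b3", "b4", "b5"]))
```

Comparing WEKA's reported percent-correct against all three baselines, not just 1/k, makes questions (b) and (c) easier to argue.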
Turn in:
A zip file containing:
- data and results files
- a table summarizing results (Part 4)