CIS 658 Data Mining Fall 2007

Assignment 2: Data Mining Using WEKA (10 points)

 

Assigned: 09/27/07

Due: 10/04/07 AT THE START OF CLASS.

IF NOT TURNED IN AT THE START OF CLASS, YOU WILL BE CONSIDERED LATE!!!

 

Task:

  1. For these experiments you need two data sets, both based on the community data you worked with in Assignment 1 (a scripted way to prepare them is sketched after item b below):

a)      One with nominal-only attributes, produced by discretizing all numeric attributes (findNumBins=true, bins=5). You created this as part of Assignment 1.

b)      One based on the original data, with only the violent-crime attribute discretized using the same method (the rest of the numeric data stays numeric!). This is needed because OneR and NaiveBayes cannot make numeric predictions.
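
If you would rather script the data preparation than click through the Explorer GUI, here is a minimal sketch using the Weka Java API. It assumes weka.jar is on the classpath, that the violent-crime attribute is the last one in the file, and that the file names are placeholders you should adjust:

    import java.io.File;
    import weka.core.Instances;
    import weka.core.converters.ArffSaver;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Discretize;

    public class PrepareData {
        public static void main(String[] args) throws Exception {
            // Load the community data (placeholder file name).
            Instances data = DataSource.read("communities.arff");

            // (a) Discretize ALL numeric attributes.
            save(discretize(data, "first-last"), "communities-allnominal.arff");

            // (b) Discretize only the violent-crime attribute, assumed
            // here to be the last attribute in the file.
            save(discretize(data, "last"), "communities-classonly.arff");
        }

        // Discretization with findNumBins=true and bins=5, as in the
        // assignment, applied to the given attribute range.
        static Instances discretize(Instances in, String range) throws Exception {
            Discretize d = new Discretize();
            d.setBins(5);
            d.setFindNumBins(true);
            d.setAttributeIndices(range);
            d.setInputFormat(in);
            return Filter.useFilter(in, d);
        }

        static void save(Instances out, String name) throws Exception {
            ArffSaver saver = new ArffSaver();
            saver.setInstances(out);
            saver.setFile(new File(name));
            saver.writeBatch();
        }
    }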

  2. Run the following three Weka classifiers on each of the two data sets, producing six different models of the community data. Run each classifier with the 10-fold cross-validation test option and examine the bottom section of the classifier output (in the Classifier output window). Use default options for all runs. You should have (a scripted version of these runs is sketched after item c below):

a)      Two ZeroR predictions.

b)      Two OneR rules.

c)      Two NaiveBayes models.
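
The six runs themselves can also be scripted. The sketch below makes the same assumptions as the preparation sketch (weka.jar on the classpath, class attribute last, placeholder file names) and prints, for each run, the summary and confusion matrix you are asked to examine:

    import java.util.Random;
    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.classifiers.rules.OneR;
    import weka.classifiers.rules.ZeroR;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class RunClassifiers {
        public static void main(String[] args) throws Exception {
            String[] files = { "communities-allnominal.arff",
                               "communities-classonly.arff" };
            for (String file : files) {
                Instances data = DataSource.read(file);
                data.setClassIndex(data.numAttributes() - 1); // class assumed last
                Classifier[] learners = { new ZeroR(), new OneR(), new NaiveBayes() };
                for (Classifier c : learners) {
                    Evaluation eval = new Evaluation(data);
                    // 10-fold cross-validation, matching the Explorer test option.
                    eval.crossValidateModel(c, data, 10, new Random(1));
                    System.out.println("=== " + c.getClass().getSimpleName()
                            + " on " + file + " ===");
                    System.out.println(eval.toSummaryString()); // includes percent correct
                    System.out.println(eval.toMatrixString());  // confusion matrix
                }
            }
        }
    }

Redirecting this program’s output to a .txt file gives you the same text that the “Save result buffer” step below asks you to save.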

  3. Save the results of each run into an appropriately named file:

a)      In the Result list panel, right-click on the result you want to save.

b)      Choose “Save result buffer”

c)      Specify the file name and location in the file chooser window that opens (end your file name in .txt to make it easy to open later)

  4. Collect information about the accuracy of each model, using percent correctly classified instances as the measure of accuracy. Report it for all six runs, preferably in a table.
  5. Answer the questions listed below. (Some may not have obvious answers, and some may not have a single “correct” answer. Where a question says “explain”, your explanation is important; I’m trying to lead you into putting yourself “inside the loop” of data mining.)

a)      We speak of the “chance” probability of getting a prediction correct as the percent correct we might expect to get via random prediction (e.g. the “chance” probability of correctly guessing a coin flip is 50%). What do you think the “chance” probability of getting a prediction correct is in this problem? Explain.

b)      Do you think that the programs’ accuracy is significantly better (or worse) than “chance”? Or is it roughly the same as “chance”? Explain.

c)      Do you think “chance” is the appropriate “strawman” in this problem? Explain. (A “strawman” is an easy target that a decent opponent should be able to defeat easily.)

d)      Do the differences between the results of the different runs appear to be important/significant? Explain your answer.

e)      Are there any differences in the more detailed results, as seen in the confusion matrices, between the different runs?  Explain any important differences that you see.

f)       Since the attribute being predicted is actually an ordinal variable, some incorrect predictions are worse than others. Which are the worst?

g)      Which of the method/data combinations produces the MOST of these bad errors?

h)      Which produces the FEWEST?

i)        In this particular case, does discretization appear to make much difference (in either accuracy or the model generated)? Explain.

j)      Which models are the most comprehensible (i.e., understandable to humans)?

k)      Explain what you could learn about the data from the models these runs generated.

l)      Do you trust/believe the models that were generated? Do they fit your expectations well enough that you would be willing to accept patterns you don’t already know about? Explain.

m)    Which run would you use as the baseline for comparing more sophisticated efforts? Explain.

  6. With the OneR method and the (almost) original (numeric) data set, experiment with values for the B option (OneR’s minimum bucket size). Graph training and test accuracy for different values of B. At what value of B is the best accuracy found? Explain what is happening with the results.
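
One way to script this sweep is sketched below, with the same assumptions as above. OneR’s B option is exposed in the API as minBucketSize (default 6), and the list of B values here is only a starting point:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.rules.OneR;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class OneRBucketSweep {
        public static void main(String[] args) throws Exception {
            // The (almost) original data: only the class discretized.
            Instances data = DataSource.read("communities-classonly.arff");
            data.setClassIndex(data.numAttributes() - 1);
            for (int b : new int[] { 1, 2, 4, 6, 8, 16, 32, 64, 128 }) {
                // Training accuracy: build and evaluate on the same data.
                OneR trained = new OneR();
                trained.setMinBucketSize(b);
                trained.buildClassifier(data);
                Evaluation train = new Evaluation(data);
                train.evaluateModel(trained, data);

                // Test accuracy: 10-fold cross-validation on a fresh copy.
                OneR fresh = new OneR();
                fresh.setMinBucketSize(b);
                Evaluation cv = new Evaluation(data);
                cv.crossValidateModel(fresh, data, 10, new Random(1));

                System.out.printf("B=%3d  train=%6.2f%%  test(cv)=%6.2f%%%n",
                        b, train.pctCorrect(), cv.pctCorrect());
            }
        }
    }

Plot the two accuracy columns against B to produce the requested graph.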

Turn in:

-  A zip file containing:

   -  your data and results files

   -  a table summarizing the results (Part 4)