CSC 470 Data Mining Fall 2005

Assignment 3                                      Data Mining Using WEKA                                                     10 points

 

Assigned: 09/29/05

Due:         10/11/05   AT THE START OF CLASS.

IF NOT TURNED IN AT THE START OF CLASS, YOU WILL BE CONSIDERED LATE!!!

 

Task:

  1. For these experiments you need two data sets, both based on the community data you worked with in Assignment 1:

a)      Original data, with the attributes you created (percent change in population, etc.);

b)      Nominal-only data, produced by discretizing all numeric attributes (Discretize filter with findNumBins=true, bins=5). A sketch of one way to apply this filter through the WEKA Java API follows this item.
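If you would rather script the preprocessing than use the Explorer's Filter panel, here is a minimal sketch using the WEKA Java API. The file names (communities.arff, communities-discretized.arff) are placeholders for whatever your Assignment 1 files are actually called, and the class attribute is assumed to be the last one.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.PrintWriter;
    import weka.core.Instances;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Discretize;

    public class DiscretizeCommunities {
        public static void main(String[] args) throws Exception {
            // Load the original community data (placeholder file name).
            Instances data = new Instances(new BufferedReader(new FileReader("communities.arff")));
            data.setClassIndex(data.numAttributes() - 1);

            // Discretize all numeric attributes with the options from the assignment:
            // findNumBins=true, bins=5.
            Discretize disc = new Discretize();
            disc.setBins(5);
            disc.setFindNumBins(true);
            disc.setInputFormat(data);
            Instances nominalData = Filter.useFilter(data, disc);

            // Write the nominal-only copy out as ARFF for the second set of runs.
            PrintWriter out = new PrintWriter("communities-discretized.arff");
            out.print(nominalData);
            out.close();
        }
    }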

  2. Run the following two WEKA classifiers on each of the two data sets, producing four different models of the community data. Run each classifier with the 10-fold cross-validation test option and examine the bottom section of the output in the Classifier output window. Use default options for all runs; a sketch of the equivalent runs through the WEKA Java API follows this item. You should have:

a)      Two OneR rules.

b)      Two NaiveBayesSimple models.
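As a rough sketch, the same four evaluations can also be run programmatically. The class names below are WEKA's standard ones (NaiveBayesSimple is in weka.classifiers.bayes in the 3.4/3.5 line current at the time), the file names are the placeholders from the earlier sketch, and the seed of 1 matches the Explorer's default cross-validation seed.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.Random;
    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayesSimple;
    import weka.classifiers.rules.OneR;
    import weka.core.Instances;

    public class RunClassifiers {
        public static void main(String[] args) throws Exception {
            // Both data sets; file names are placeholders.
            String[] files = {"communities.arff", "communities-discretized.arff"};
            for (String file : files) {
                Instances data = new Instances(new BufferedReader(new FileReader(file)));
                data.setClassIndex(data.numAttributes() - 1);

                // Both classifiers with their default options.
                Classifier[] classifiers = {new OneR(), new NaiveBayesSimple()};
                for (Classifier c : classifiers) {
                    // 10-fold cross-validation, as in the Explorer's test option.
                    Evaluation eval = new Evaluation(data);
                    eval.crossValidateModel(c, data, 10, new Random(1));
                    System.out.println(file + " / " + c.getClass().getSimpleName());
                    System.out.println(eval.toSummaryString()); // "Correctly Classified Instances" line
                    System.out.println(eval.toMatrixString());  // confusion matrix, useful for question (d)
                }
            }
        }
    }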

  3. Save the results of each run to an appropriately named file:

a)      In the Result list, right-click on the result you want to save

b)      Choose “Save result buffer”

c)      Specify the file name and location in the file chooser window that opens (end your file name in .txt to make it easy to open later)

  4. Collect information about the accuracy of each model: use the percentage of correctly classified instances as the measure of accuracy. Report it for all four runs, preferably in a table (a possible layout is suggested at the end of this handout).
  5. Answer the questions listed below.

a)      What do you think the “chance” probability of getting a prediction correct is in this problem? Explain.

b)      Do you think that the programs’ accuracy is significantly better than “chance”? Explain.

c)      Do the differences between the results of the different runs appear to be important/significant? Explain your answer.

d)      Are there any differences in the more detailed results, as seen in the confusion matrices, between the different runs?  Explain any important differences that you see.

e)      Since the attribute being predicted is actually an ordinal variable, some incorrect predictions are worse than others. Which are the worst?

f)       Which of the method/data combinations makes the MOST of these bad errors?

g)      Which makes the FEWEST of these bad errors?

h)      In this particular case, does discretization appear to make much difference (in either accuracy or the model generated)? Explain.

i)        Which models are the most comprehensible?

j)        If you were trying to learn about the data from these runs, explain what you could get out of the models that were generated.

k)      Do you trust/believe in the models that were generated? Do they fit your expectations enough that you would be willing to accept patterns that you don’t already know about? Explain.

l)        Which run would you want to use as a baseline for comparison with more sophisticated efforts? Explain.

  6. With the OneR method and the original (numeric) data, experiment with values for the B option (OneR's minimum bucket size). Graph training and test accuracy for different values of B; a sketch of one way to automate this sweep follows this item. At what value of B is the best accuracy found? Explain what is happening with the results.
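A hedged sketch of one way to collect the numbers for the graph, again through the Java API: the file name is the same placeholder as before, and the range of B values (1 to 30) is arbitrary, so widen or narrow it to match what you actually explore.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.rules.OneR;
    import weka.core.Instances;

    public class OneRBucketSweep {
        public static void main(String[] args) throws Exception {
            // Original (numeric) data; file name is a placeholder.
            Instances data = new Instances(new BufferedReader(new FileReader("communities.arff")));
            data.setClassIndex(data.numAttributes() - 1);

            System.out.println("B\ttrain%\tcv%");
            for (int b = 1; b <= 30; b++) {
                // Training accuracy: build on all the data, then evaluate on the same data.
                OneR trainModel = new OneR();
                trainModel.setMinBucketSize(b);
                trainModel.buildClassifier(data);
                Evaluation trainEval = new Evaluation(data);
                trainEval.evaluateModel(trainModel, data);

                // Test accuracy: 10-fold cross-validation, as in the earlier runs.
                OneR cvModel = new OneR();
                cvModel.setMinBucketSize(b);
                Evaluation cvEval = new Evaluation(data);
                cvEval.crossValidateModel(cvModel, data, 10, new Random(1));

                System.out.println(b + "\t" + trainEval.pctCorrect() + "\t" + cvEval.pctCorrect());
            }
        }
    }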

Turn in:

-         Disk with data and results files

-         Table summarizing results (Part 4)
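One possible layout for the Part 4 summary table (the four runs are the ones defined in step 2; fill in the correctly classified percentages from your saved result files):

    Classifier            Original data        Discretized data
    OneR                  ____ % correct       ____ % correct
    NaiveBayesSimple      ____ % correct       ____ % correct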