CSC 470 Data Mining Spring 2004

Assignment 2: Data Mining Using WEKA (10 points)

 

Assigned: 02/12/04

Due:         02/26/04   AT THE START OF CLASS.

IF NOT TURNED IN AT THE START OF CLASS, YOU WILL BE CONSIDERED LATE!!!

 

Task:

  1. For these experiments, you need three data sets (all based on the adult census data you worked with in Assignment 1):

a)      Two data sets, which you have already prepared for Assignment 1:

-         The original data, with the created attributes for capital gain, capital loss, etc.;

-         The data with unsupervised discretization applied to hrsperwk, using automatic selection of the number of bins (findNumBins=true, bins=5).

b)      One data set with nominal attributes only, produced by discretizing all numeric attributes (findNumBins=true, bins=5). A Java sketch of this filtering step follows below.
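For reference, here is a minimal Java sketch of step 1b using Weka's unsupervised Discretize filter. The file names adult.arff and adult-nominal.arff are placeholders; setBins and setFindNumBins correspond to the bins and findNumBins options named above, and the Weka 3.4-era API is assumed:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.FileWriter;
    import weka.core.Instances;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Discretize;

    public class MakeNominal {
        public static void main(String[] args) throws Exception {
            // Load the original data; adult.arff is a placeholder name.
            Instances data = new Instances(new BufferedReader(new FileReader("adult.arff")));

            // Discretize all numeric attributes, letting the filter choose
            // the number of bins (findNumBins=true, bins=5).
            Discretize disc = new Discretize();
            disc.setBins(5);
            disc.setFindNumBins(true);
            disc.setInputFormat(data);
            Instances nominal = Filter.useFilter(data, disc);

            // Write the all-nominal data set out as ARFF.
            FileWriter out = new FileWriter("adult-nominal.arff");
            out.write(nominal.toString());
            out.close();
        }
    }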

  2. Run the following two Weka classifiers on each one of the three data sets, thus producing six different models of the census data. Run each classifier with the 10-fold cross-validation test option and examine the bottom section of the classifier output (in the Classifier output window). Use default options for all runs. A Java sketch of one such run follows this step. You should have:

a)      Three OneR rules.

b)      Three NaiveBayesSimple models.
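If you prefer to script a run rather than click through the Explorer, a minimal sketch along these lines should reproduce one of the six runs. It assumes the Weka 3.4-era Evaluation API, a placeholder file name adult.arff, and that the class is the last attribute:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.rules.OneR;
    import weka.core.Instances;

    public class RunOneR {
        public static void main(String[] args) throws Exception {
            // Load the data and mark the class attribute (assumed last).
            Instances data = new Instances(new BufferedReader(new FileReader("adult.arff")));
            data.setClassIndex(data.numAttributes() - 1);

            // 10-fold cross-validation with default classifier options.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new OneR(), data, 10, new Random(1));

            // These mirror the bottom of the Classifier output window:
            // the accuracy summary and the confusion matrix.
            System.out.println(eval.toSummaryString());
            System.out.println(eval.toMatrixString());
        }
    }

Substituting weka.classifiers.bayes.NaiveBayesSimple for OneR should give the corresponding NaiveBayesSimple run.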

  3. Save the results of each run into an appropriately named file:

a)      In the Result list panel, right-click on the result you want to save.

b)      Choose “Save result buffer”

c)      Specify the file name and location in the file chooser window that opens (end your file name in .txt to make it easy to open later). A command-line alternative is noted below.
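Note: if you run a classifier from the command line instead of the Explorer, you can capture the same output by redirecting it to a file, for example (a sketch; the file names are placeholders, -t names the training file, and -x sets the number of cross-validation folds):

    java weka.classifiers.rules.OneR -t adult.arff -x 10 > oner-original.txt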

  4. Collect information about the accuracy of each model: use correctly classified instances as the measure of accuracy. Report the accuracy for all six runs, preferably in a table (a sample layout appears at the end of this handout).
  5. Answer the questions listed below.

a)      What do you think the “chance” probability of getting a prediction correct is in this problem? Explain.

b)      Do you think that the programs’ accuracy is significantly better than “chance”? Explain.

c)      Do the differences between the results of the different runs appear to be important/significant? Explain your answer.

d)      Are there any differences in the more detailed results, as seen in the confusion matrices, between the different runs?  Explain any important differences that you see.

e)      In this particular case, does discretization appear to make much difference (in either accuracy or the model generated)? Explain.

f)       Which models are the most comprehensible?

g)      Explain what you could learn about the data from the models generated in these runs.

h)      Do you trust/believe in the models that were generated? Do they fit your expectations well enough that you would be willing to accept patterns that you don't already know about? Explain.

i)        Which run would you use as a baseline for comparison with more sophisticated efforts? Explain.

Turn in:

-         Disk with results files

-         Table summarizing results (a sample layout follows)
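One possible layout for the summary table (accuracy = percentage of correctly classified instances; fill in the cells from your saved result files):

    Data set                        OneR       NaiveBayesSimple
    Original                        ____%      ____%
    hrsperwk discretized            ____%      ____%
    All attributes discretized      ____%      ____%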