CSC 470 Data Mining Spring 2004

Assignment 5 Data Mining Using WEKA – Experimental Analysis 10 points

Assigned: 03/25/04

Due: 04/01/04 AT THE START OF CLASS.

IF NOT TURNED IN AT THE START OF CLASS, YOU WILL BE CONSIDERED LATE!!!

Task:

For these experiments you need two data sets (both based on the adult census data you worked with in previous assignments):

a) One data set, which you already have prepared for Assignment 2:

§ Adult census dataset with mix of numeric and nominal attributes – i.e. the first dataset produced

b) One data set, which you already have prepared for Assignment 4:

§ Adult census dataset, with nominal attributes only, with instances containing missing values removed – i.e. the most recent dataset produced

Set up the Weka Experimenter environment to use the above datasets with 5 variations of the instance-based algorithm (IBK): k=1, k=10, k=20, and k=30 (each with distance weighting = 1-distance), and k=30 (with distance weighting = 1/distance). Make sure to save the results into an arff file. I will demonstrate use of the Experimenter environment in class. If you miss the demonstration, get somebody to show you. Save the experiment setup into a configuration file so that it can easily be repeated. You should have:

                        Data
Algorithm               Non-Discretized Census    Discretized, No Missing, Census
IBK - K=1               Yes                       Yes
IBK - K=10              Yes                       Yes
IBK - K=20              Yes                       Yes
IBK - K=30              Yes                       Yes
IBK - K=30, 1/dist      Yes                       Yes
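
To see why the two distance weightings can disagree, here is a minimal sketch of a weighted neighbor vote. It is not Weka's implementation: the function name `weighted_vote`, the weighting labels, and the example neighbors are made up for illustration, and distances are assumed to be normalized to the range 0 to 1.

```python
from collections import defaultdict


def weighted_vote(neighbors, weighting="none"):
    """Pick a class label from (distance, label) pairs for the k nearest neighbors.

    Assumes distances are normalized to [0, 1]. Illustrative only.
    """
    scores = defaultdict(float)
    for distance, label in neighbors:
        if weighting == "none":
            weight = 1.0                       # plain majority vote
        elif weighting == "similarity":
            weight = 1.0 - distance            # weighting = 1-distance
        elif weighting == "inverse":
            weight = 1.0 / (distance + 1e-9)   # weighting = 1/distance; constant avoids division by zero
        else:
            raise ValueError("unknown weighting: " + weighting)
        scores[label] += weight
    # The label with the largest total weight wins.
    return max(scores, key=scores.get)


# Hypothetical neighbors: one very close, two moderately close.
neighbors = [(0.1, ">50K"), (0.5, "<=50K"), (0.5, "<=50K")]
print(weighted_vote(neighbors, "similarity"))
print(weighted_vote(neighbors, "inverse"))
```

With these made-up neighbors, 1-distance lets the two moderately close neighbors outvote the single close one, while 1/distance gives the close neighbor enough weight to win.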

Run the Weka Experimenter on these combinations. It will take several minutes (perhaps 10).
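
Once the run finishes, a quick check of the saved results file can confirm that every combination produced results. The sketch below reads a simple (non-sparse) arff file and reports the number of result rows and the mean percent correct for each dataset and algorithm setting. The column names `Key_Dataset`, `Key_Scheme`, `Key_Scheme_options`, and `Percent_correct` are assumptions; check them against the `@attribute` lines in your own file. The parser is deliberately minimal and does not handle attribute names that contain spaces.

```python
import csv
from collections import defaultdict


def read_arff(lines):
    """Return (attribute_names, rows) from the lines of a simple, non-sparse arff file."""
    names, data_lines, in_data = [], [], False
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("%"):
            continue                            # skip blank lines and comments
        if in_data:
            data_lines.append(line)
        elif line.lower().startswith("@attribute"):
            # Second token is the attribute name (no spaces assumed).
            names.append(line.split(None, 2)[1].strip("'\""))
        elif line.lower().startswith("@data"):
            in_data = True
    # Values are comma-separated; single quotes wrap values that contain spaces.
    reader = csv.reader(data_lines, quotechar="'", escapechar="\\",
                        skipinitialspace=True)
    rows = [dict(zip(names, values)) for values in reader]
    return names, rows


def summarize(rows, dataset_col="Key_Dataset", scheme_col="Key_Scheme",
              options_col="Key_Scheme_options", metric_col="Percent_correct"):
    """Map (dataset, scheme, options) to (row count, mean of the metric)."""
    groups = defaultdict(list)
    for row in rows:
        key = (row[dataset_col], row[scheme_col], row[options_col])
        groups[key].append(float(row[metric_col]))
    return {key: (len(vals), sum(vals) / len(vals)) for key, vals in groups.items()}


# Usage (the file name is a placeholder for wherever you saved your results):
#   with open("results.arff") as f:
#       names, rows = read_arff(f)
#   for key, (count, mean) in sorted(summarize(rows).items()):
#       print(key, count, round(mean, 2))
```

You should see one group per cell of the table above, ten in all; a missing group suggests that a dataset or algorithm was left out of the experiment setup.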

Go to the Analyze tab. If you click “Perform Test” after opening the results file, the results should be displayed. Find the percent correct for each combination and report it in a table like the one above. IBK with K=1 serves as the “baseline” of the experiment (assuming that it is the first algorithm you specify). Specify which algorithms are statistically significantly different from the baseline (indicate better or worse).
Answer the questions listed below.
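
To make "statistically significantly different from the baseline" concrete, here is a minimal sketch of a plain paired t-test on per-run percent-correct values. The function name `paired_t` is made up for illustration, and the tester inside Weka may apply a correction that this sketch does not, so its verdicts can differ from what “Perform Test” reports.

```python
import math


def paired_t(baseline, other):
    """Return (t statistic, degrees of freedom) for two paired samples.

    baseline[i] and other[i] must come from the same run/fold.
    A positive t means `other` scored higher than `baseline` on average.
    """
    if len(baseline) != len(other) or len(baseline) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [o - b for b, o in zip(baseline, other)]
    n = len(diffs)
    mean = sum(diffs) / n
    # Sample variance of the paired differences.
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if var == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean / math.sqrt(var / n), n - 1
```

Compare the absolute value of t with the critical value from a t table for the returned degrees of freedom (for example, 2.262 for 9 degrees of freedom at the two-tailed 0.05 level). If it is larger, the difference is significant, and the sign of t tells you whether the algorithm is better or worse than the baseline.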

a) For which dataset does using more neighbors prove beneficial? Why do you think this is? Explain.

b) Is there a point beyond which having more neighbors doesn’t help? Explain.

c) With the current data and methods, does the difference in distance weighting seem to make much of a difference? Explain.

d) Using WordPad, look at the results arff file. Obviously, this contains a lot more information than is shown via “Perform Test.” What might the more detailed results enable (suppose that you imported the file into Excel)? I’m interested in some specific benefits.

Turn in:

§ Disk with data files and experiment configuration file.

§ Table summarizing results

§ Answers to questions a) through d) above
