CSC 470 Data Mining Spring 2004

CSC 470 Data Mining Fall 2005 – Evening Section

Assignment 5 Data Mining Using WEKA – Experimental Analysis 10 points

Assigned: 11/09/05

Due: 11/30/05 AT THE START OF CLASS.

IF NOT TURNED IN AT THE START OF CLASS, YOU WILL BE CONSIDERED LATE!!!

(I had originally planned for this to be due next week, 11/16. I changed the date only in case anybody misses class on 11/9, since I plan for this work to be done during class time on 11/9.)

Task:

For these experiments you need two data sets (both based on the community data you worked with in previous assignments):

a) One data set, which you already have prepared for Assignment 1:

§ Community dataset with mix of numeric and nominal attributes – i.e. the first dataset produced

b) One data set, which you already have prepared for Assignment 4:

§ Community dataset, with nominal attributes only, with instances containing missing values removed – i.e. one of the most recent datasets produced

Set up the Weka Experimenter environment to run the above datasets through 14 variations of the instance-based algorithm (IBK): k = 1, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, and 100 (each with distance weighting = 1-distance), plus k = 5 and 10 (with distance weighting = 1/distance). Make sure to save the results into an arff file. I will demonstrate use of the Experimenter environment in class. If you miss class, get somebody to show you. Save the experiment setup into a configuration file so that it can easily be repeated. You should have:

	Data
Algorithm	Non-Discretized Community	Discretized, No Missing, Community
IBK - K=1	Yes	Yes
IBK - K=5	Yes	Yes
IBK - K=5, 1/ dist	Yes	Yes
IBK - K=10	Yes	Yes
IBK - K=10, 1/ dist	Yes	Yes
IBK - K=20	Yes	Yes
IBK - K=30	Yes	Yes
IBK - K=40	Yes	Yes
IBK - K=50	Yes	Yes
IBK - K=60	Yes	Yes
IBK - K=70	Yes	Yes
IBK - K=80	Yes	Yes
IBK - K=90	Yes	Yes
IBK - K=100	Yes	Yes
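
As a cross-check on the table above, the 14 algorithm configurations can be enumerated in a short Python sketch. The helper name `ibk_configurations` is hypothetical. The option flags (`-K` for number of neighbours, `-F` for 1-distance weighting, `-I` for 1/distance weighting) and the class name `weka.classifiers.lazy.IBk` are assumptions based on recent Weka releases, so verify them against the options shown in your copy of the Experimenter.

```python
# Hypothetical helper: builds the 14 IBk configurations listed in the table.
# Flag meanings are assumptions based on recent Weka releases:
#   -K <n>  number of neighbours
#   -F      weight neighbours by 1 - distance
#   -I      weight neighbours by 1 / distance

def ibk_configurations():
    """Return (label, option string) pairs in the same order as the table."""
    configs = []
    for k in [1, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]:
        # Every k is run with 1-distance weighting.
        configs.append((f"IBK - K={k}", f"weka.classifiers.lazy.IBk -K {k} -F"))
        # Only k=5 and k=10 are also run with 1/distance weighting.
        if k in (5, 10):
            configs.append(
                (f"IBK - K={k}, 1/ dist", f"weka.classifiers.lazy.IBk -K {k} -I")
            )
    return configs


if __name__ == "__main__":
    for label, options in ibk_configurations():
        print(f"{label:22s} {options}")
```

Each of these configurations is run on both datasets, so the results table should end up with 14 rows and 2 data columns.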

Run the Weka Experimenter on these combinations. It will take several minutes (possibly more than 10).

Go to the Analyze tab. If you click “Perform Test” after opening the results file, the results should be displayed. Find the percent correct for each combination and report it in a table like the one above. IBK with K=1 serves as the “baseline” of the experiment (assuming that it is the first algorithm you specify). Specify which algorithms are statistically significantly different from the baseline, and indicate whether each is better or worse.
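
To give a feel for what "statistically significantly different from the baseline" means, here is a minimal Python sketch of an ordinary paired t statistic computed over per-run percent-correct values. It is only an illustration: the function name `paired_t_statistic` is hypothetical, the numbers are made up and are not results from the community data, and Weka's own tester may apply a correction for repeated cross-validation runs, so its figures will not necessarily match this calculation.

```python
import math


def paired_t_statistic(baseline, candidate):
    """Paired t statistic for two equal-length lists of per-run scores."""
    diffs = [c - b for b, c in zip(baseline, candidate)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    # Sample variance of the differences (n - 1 in the denominator).
    variance = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    return mean_diff / math.sqrt(variance / n)


if __name__ == "__main__":
    # Made-up percent-correct values for five runs (illustration only).
    k1_runs = [70.0, 72.0, 68.0, 71.0, 69.0]
    k10_runs = [74.0, 75.0, 70.0, 76.0, 73.0]
    t = paired_t_statistic(k1_runs, k10_runs)
    print(f"t = {t:.2f} with {len(k1_runs) - 1} degrees of freedom")
```

A large positive value suggests the candidate beats the baseline consistently across runs, and a large negative value suggests it is consistently worse. The statistic is then compared against a t distribution at the chosen significance level.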
Answer the questions listed below.

a) Is there a point beyond which having more neighbors doesn’t help? Explain.

b) For which dataset does using more neighbors prove beneficial? Explain. Why do you think this is?

c) With the current data and methods, does the difference in distance weighting seem to make much of a difference? Explain.

d) Using WordPad, look at the results arff file. Obviously, this contains a lot more information than shown via “Perform Test.” What might the more detailed results enable? (suppose that you imported it into Excel). I’m interested in some specific benefits.
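
As background for question c, the two distance-weighting schemes named in the task can be sketched in Python as follows. This is an illustration under stated assumptions, not Weka's actual implementation: the function `weighted_vote` is hypothetical, distances are assumed to be scaled to the range 0 to 1, and the small constant added before dividing is an assumption used only to avoid division by zero.

```python
from collections import defaultdict


def weighted_vote(neighbors, scheme):
    """Pick a class label from (distance, label) pairs using weighted voting.

    Distances are assumed to be scaled to [0, 1].
    """
    totals = defaultdict(float)
    for distance, label in neighbors:
        if scheme == "1-distance":
            weight = 1.0 - distance
        elif scheme == "1/distance":
            # Small constant (an assumption) avoids dividing by zero.
            weight = 1.0 / (distance + 1e-9)
        else:
            raise ValueError(f"unknown weighting scheme: {scheme}")
        totals[label] += weight
    # The label with the largest total weight wins the vote.
    return max(totals, key=totals.get)


if __name__ == "__main__":
    # Made-up neighbours: one very close "low", two farther "high".
    neighbors = [(0.1, "low"), (0.5, "high"), (0.5, "high")]
    print(weighted_vote(neighbors, "1-distance"))
    print(weighted_vote(neighbors, "1/distance"))
```

With these made-up neighbours, 1-distance weighting gives "high" (1.0 against 0.9), while 1/distance weighting gives "low" because the single very close neighbour dominates the vote. Whether that kind of difference shows up on the community data is what the experiment is meant to reveal.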

Turn in:

§ Disk with data files and experiment configuration file.

§ Table summarizing results

§ Answers to questions a) through d) above
