CSC 470 Data Mining Fall 2005 – Day Section
Assignment 5 Data Mining Using WEKA – Experimental Analysis 10 points
Assigned: 10/27/05
Due: 11/10/05 AT THE START OF CLASS.
IF NOT TURNED IN AT THE START OF CLASS, YOU WILL BE CONSIDERED LATE!!!
Task:
a) One data set, which you have already prepared for Assignment 1:
§ Community dataset with mix of numeric and nominal attributes – i.e. the first dataset produced
b) One data set, which you have already prepared for Assignment 4:
§ Community dataset, with nominal attributes only, with instances containing missing values removed – i.e. one of the most recently produced datasets
Algorithm | Data: Non-Discretized Community | Data: Discretized, No Missing, Community
IBK - K=1 | Yes | Yes
IBK - K=5 | Yes | Yes
IBK - K=5, 1/dist | Yes | Yes
IBK - K=10 | Yes | Yes
IBK - K=10, 1/dist | Yes | Yes
IBK - K=20 | Yes | Yes
IBK - K=30 | Yes | Yes
IBK - K=40 | Yes | Yes
IBK - K=50 | Yes | Yes
IBK - K=60 | Yes | Yes
IBK - K=70 | Yes | Yes
IBK - K=80 | Yes | Yes
IBK - K=90 | Yes | Yes
IBK - K=100 | Yes | Yes
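For reference, the difference between a plain K-nearest-neighbor vote and the "1/dist" variant can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not Weka's IBK implementation: the function name knn_predict, the (features, label) data layout, and the small constant added to the distance are all hypothetical choices made for this sketch.

```python
import math
from collections import defaultdict


def knn_predict(train, query, k, inverse_distance=False):
    """Classify `query` by a vote among its k nearest training points.

    train: list of (feature_tuple, label) pairs (hypothetical toy format).
    inverse_distance: if True, each neighbor's vote counts 1/distance
    instead of 1.
    """
    # Sort training points by Euclidean distance to the query.
    by_distance = sorted(
        (math.dist(features, query), label) for features, label in train
    )
    votes = defaultdict(float)
    for distance, label in by_distance[:k]:
        if inverse_distance:
            # Assumption: a small constant avoids division by zero
            # when the query exactly matches a training point.
            votes[label] += 1.0 / (distance + 1e-9)
        else:
            votes[label] += 1.0
    # The label with the largest total vote wins.
    return max(votes, key=votes.get)


# Toy data: two "a" points close to the query, three "b" points farther away.
train = [
    ((0.0, 0.0), "a"),
    ((0.1, 0.0), "a"),
    ((1.0, 1.0), "b"),
    ((1.1, 1.0), "b"),
    ((1.0, 1.1), "b"),
]
query = (0.2, 0.1)

plain = knn_predict(train, query, k=5)
weighted = knn_predict(train, query, k=5, inverse_distance=True)
```

With K=5 every point votes, so the plain vote follows the three distant "b" points, while the 1/dist vote lets the two nearby "a" points dominate. Whether this kind of effect shows up on the Community data is what the experiment is meant to reveal.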
Run the Weka Experimenter environment on these combinations. This will take several minutes (possibly more than 10).
a) Is there a point beyond which having more neighbors doesn’t help? Explain.
b) For which dataset does using more neighbors prove beneficial? Explain. Why do you think this is?
c) With the current data and methods, does distance weighting seem to make much of a difference? Explain.
d) Using WordPad, look at the results arff file. Obviously, this contains a lot more information than is shown via “Perform Test.” What might the more detailed results enable (suppose that you imported the file into Excel)? I’m interested in some specific benefits.
Turn in:
§ Disk with data files and experiment configuration file.
§ Table summarizing results