CIS 658 Data Mining Fall 2007
Assignment 4 Data Mining Using WEKA – Experimental Analysis 10 points
Assigned: 11/29/07
Due: 12/06/07 AT THE START OF CLASS.
IF NOT TURNED IN AT THE START OF CLASS, YOU WILL BE CONSIDERED LATE!!!
(I plan on this being mostly done in class time on 11/29)
Task:
a) One data set, which you already have prepared for Assignment 1:
§ Community dataset with all attributes discretized – i.e. the second dataset produced
b) One data set, which you already have prepared for Assignment 3:
§ Community dataset, with nominal attributes only, with instances containing missing values removed – i.e. one of the most recent datasets produced
              | Data
Algorithm     | Non-Discretized Community | Discretized, No Missing, Community
IBK - K=1     | Yes                       | Yes
IBK - K=5     | Yes                       | Yes
IBK - K=10    | Yes                       | Yes
IBK - K=20    | Yes                       | Yes
IBK - K=30    | Yes                       | Yes
IBK - K=40    | Yes                       | Yes
Run the Weka Experimenter environment on these combinations. It will take ‘a while’ (< 1 hour).
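As a reminder of what each IBK run computes, here is a minimal Python sketch of k-nearest-neighbor classification on nominal attributes. It is an illustration only, not Weka's implementation: the function names and the toy data are made up, and distance is simply the count of attributes on which two instances differ. Changing K changes how many of the closest training instances get a vote.

```python
from collections import Counter


def overlap_distance(a, b):
    """Count the attribute positions where two nominal instances differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(train, query, k):
    """Predict by majority vote among the k training instances nearest to query.

    train is a list of (attributes, label) pairs.
    """
    # sorted() is stable, so instances at equal distance keep training order.
    nearest = sorted(train, key=lambda inst: overlap_distance(inst[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up toy data (NOT the Community dataset); attribute values are hypothetical.
train = [
    (("low", "urban"), "safe"),
    (("low", "rural"), "safe"),
    (("high", "urban"), "risky"),
    (("high", "rural"), "risky"),
    (("high", "urban"), "safe"),
]
query = ("high", "urban")

for k in (1, 3, 5):
    print("K =", k, "->", knn_predict(train, query, k))
```

On this toy data the prediction for the query differs between K=1 and K=3, which is the kind of effect the experiment measures at scale.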
a) Do you think we’ve reached the point where having more neighbors doesn’t help? Explain.
b) Do you think that using more neighbors proves to be more beneficial for one dataset than another? Explain.
c) Using WordPad or Excel, look at the results arff file. Obviously, this contains a lot more information than is shown via “Perform Test.” What might the more detailed results enable (suppose that you imported them into Excel)? I’m interested in some specific benefits.
Turn in:
§ Zip file with
o data files,
o results arff file,
o experiment configuration file,
o table summarizing results (either Excel or Word).
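
If you would rather script the summary table than build it by hand, the sketch below shows one possible approach in Python. It assumes the results arff file has columns named Key_Dataset, Key_Scheme_options, and Percent_correct; check the @attribute lines in your own file, since these names are an assumption here. The parser handles only simple, dense ARFF, and the sample rows are made-up numbers, not real results.

```python
import csv
from collections import defaultdict


def read_arff(text):
    """Return (attribute_names, rows) from simple, dense ARFF text."""
    names, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if in_data:
            # Values containing spaces or commas are wrapped in single quotes.
            rows.append(next(csv.reader([line], quotechar="'")))
        elif line.lower().startswith("@attribute"):
            names.append(line.split(None, 2)[1].strip("'\""))
        elif line.lower().startswith("@data"):
            in_data = True
    return names, rows


def mean_by_dataset_and_options(names, rows, metric="Percent_correct"):
    """Average the metric over all runs and folds for each (dataset, options) pair."""
    d = names.index("Key_Dataset")
    o = names.index("Key_Scheme_options")
    m = names.index(metric)
    groups = defaultdict(list)
    for row in rows:
        if row[m] != "?":  # '?' marks a missing value in ARFF
            groups[(row[d], row[o])].append(float(row[m]))
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


# Made-up sample in the assumed layout; replace with the text of your own file,
# e.g. open("results.arff").read() (hypothetical file name).
SAMPLE = """@relation InstanceResultListener
@attribute Key_Dataset {community_disc}
@attribute Key_Run {1}
@attribute Key_Fold {1,2}
@attribute Key_Scheme {weka.classifiers.lazy.IBk}
@attribute Key_Scheme_options {'-K 1 -W 0','-K 5 -W 0'}
@attribute Percent_correct numeric
@data
community_disc,1,1,weka.classifiers.lazy.IBk,'-K 1 -W 0',60.0
community_disc,1,2,weka.classifiers.lazy.IBk,'-K 1 -W 0',70.0
community_disc,1,1,weka.classifiers.lazy.IBk,'-K 5 -W 0',80.0
community_disc,1,2,weka.classifiers.lazy.IBk,'-K 5 -W 0',90.0
"""

names, rows = read_arff(SAMPLE)
means = mean_by_dataset_and_options(names, rows)
for (dataset, options), value in sorted(means.items()):
    print(dataset, "|", options, "|", value)
```

Each printed line corresponds to one cell of the summary table: one dataset, one K setting, and the mean of the chosen metric.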