CSC 470 Data Mining, Spring 2004
Assignment 5: Data Mining Using WEKA - Experimental Analysis (10 points)
Assigned: 03/25/04
Due: 04/01/04 AT THE START OF CLASS.
IF NOT TURNED IN AT THE START OF CLASS, YOU WILL BE CONSIDERED LATE!!!
Task:
a) One data set, which you have already prepared for Assignment 2:
   - Adult census dataset with a mix of numeric and nominal attributes, i.e. the first dataset produced
b) One data set, which you have already prepared for Assignment 4:
   - Adult census dataset with nominal attributes only and with instances containing missing values removed, i.e. the most recent dataset produced
|                    | Data                                                     |
| Algorithm          | Non-Discretized Census | Discretized, No Missing, Census |
| IBK - K=1          | Yes                    | Yes                             |
| IBK - K=10         | Yes                    | Yes                             |
| IBK - K=20         | Yes                    | Yes                             |
| IBK - K=30         | Yes                    | Yes                             |
| IBK - K=30, 1/dist | Yes                    | Yes                             |
Run the Weka Experimenter environment on these combinations. It will take several minutes (about 10?).
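For reference, the "1/dist" row in the table means that each neighbor's vote is weighted by the inverse of its distance to the instance being classified, rather than counting every neighbor equally. The following is a minimal Python sketch of that idea on made-up one-dimensional data. It is not Weka's IBk implementation; the function name `knn_predict`, the toy data, and the small epsilon used to avoid division by zero are all assumptions of this sketch.

```python
import math
from collections import defaultdict


def knn_predict(train, query, k, weight_by_inverse_distance=False):
    """Predict a label for `query` from its k nearest training points.

    `train` is a list of (feature_tuple, label) pairs with numeric features.
    """
    # Compute Euclidean distances to the query and keep the k closest points.
    neighbors = sorted(
        ((math.dist(features, query), label) for features, label in train),
        key=lambda pair: pair[0],
    )[:k]

    votes = defaultdict(float)
    for distance, label in neighbors:
        if weight_by_inverse_distance:
            # 1/dist weighting; the epsilon is an assumption of this sketch.
            votes[label] += 1.0 / (distance + 1e-9)
        else:
            # Plain majority vote: every neighbor counts the same.
            votes[label] += 1.0
    return max(votes, key=votes.get)


# Toy data (made up for illustration): two "low" points near 0, three "high" near 2.
train = [
    ((0.0,), "low"),
    ((0.1,), "low"),
    ((2.0,), "high"),
    ((2.1,), "high"),
    ((2.2,), "high"),
]
query = (0.5,)

print(knn_predict(train, query, k=1))  # low
print(knn_predict(train, query, k=5))  # high
print(knn_predict(train, query, k=5, weight_by_inverse_distance=True))  # low
```

In this toy case, the unweighted vote with K=5 is outvoted by the three distant "high" points, while the 1/dist vote lets the two close "low" points win. Whether anything similar happens on the census data is what the experiment is meant to show.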
a) For which dataset does using more neighbors prove beneficial? Why do you think this is? Explain.
b) Is there a point at which having more neighbors no longer helps? Explain.
c) With the current data and methods, does distance weighting seem to make much of a difference? Explain.
d) Using WordPad, look at the results ARFF file. Obviously, this contains a lot more information than is shown via "Perform Test." What might the more detailed results enable (suppose that you imported the file into Excel)? I'm interested in some specific benefits.
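If you would rather inspect the results file programmatically than in WordPad, the sketch below shows one possible way to read a simple ARFF file and convert it to CSV text that Excel can open. It is a minimal reader written for illustration only: it assumes dense (non-sparse) data rows, attribute names without spaces, and single-quoted string values without escaped quotes. The function names `read_arff` and `to_csv`, the attribute names in the sample, and the sample values are made up and are not real experiment results.

```python
import csv
import io


def read_arff(text):
    """Return (attribute_names, rows) from ARFF text (minimal, dense rows only)."""
    names, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and ARFF comments
        if in_data:
            # ARFF quotes strings with single quotes; let csv split the row.
            rows.append(next(csv.reader([line], quotechar="'", skipinitialspace=True)))
        elif line.lower().startswith("@attribute"):
            # The attribute name is the second whitespace-separated token.
            names.append(line.split(None, 2)[1].strip("'\""))
        elif line.lower().startswith("@data"):
            in_data = True
    return names, rows


def to_csv(names, rows):
    """Render the header and rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(names)
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample in ARFF layout; the values are placeholders, not real results.
sample = """@relation results
@attribute Key_Dataset {census}
@attribute Key_Scheme_options string
@attribute Percent_correct numeric
@data
census,'-K 1',79.5
census,'-K 10',82.0
"""

names, rows = read_arff(sample)
print(names)
print(len(rows), "rows")
print(to_csv(names, rows))
```

To use it on your own file, read the file's contents into a string, pass it to `read_arff`, and write the output of `to_csv` to a `.csv` file.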
Turn in:
- Disk with data files and experiment configuration file.
- Table summarizing results.