CSC 470 Data Mining, Spring 2004
Assignment 2: Data Mining Using WEKA (10 points)
Assigned: 02/12/04
Due: 02/26/04 AT THE START OF CLASS.
IF NOT TURNED IN AT THE START OF CLASS, YOU WILL BE CONSIDERED LATE!!!
Task:
a) Two data sets, which you have already prepared for Assignment 1:
   - Original data, with created attributes for capital gain and loss, etc.;
   - Unsupervised discretization on hrsperwk with automatic selection of the number of bins (findNumBins=true, bins=5).
b) One with nominal-only attributes, produced by discretizing all numeric attributes (findNumBins=true, bins=5).
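
For intuition about what discretization does to a numeric attribute such as hrsperwk, here is a minimal Python sketch of plain equal-width binning. It is an illustration only, not WEKA's code: the function name and the sample values are made up, and it uses a fixed number of bins rather than the automatic selection that findNumBins=true asks WEKA to perform.

```python
def equal_width_bins(values, num_bins=5):
    """Assign each numeric value to one of num_bins equal-width intervals.

    Returns a list of bin indices in the range 0 .. num_bins - 1.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so everything falls into a single bin.
        return [0] * len(values)
    width = (hi - lo) / num_bins
    # The maximum value would compute to index num_bins, so clamp it
    # into the last bin.
    return [min(int((v - lo) / width), num_bins - 1) for v in values]


# Made-up hours-per-week values, for illustration only.
hours = [10, 20, 35, 40, 40, 45, 60, 99]
print(equal_width_bins(hours))
```

Each printed index identifies the interval a value falls into; after discretization the attribute is treated as nominal, with one value per interval.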
a) Three OneR rules.
b) Three NaiveBayesSimple models.
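
To help you read the OneR output, the basic idea behind OneR can be sketched in a few lines of Python. This is a simplified version for nominal attributes only, not WEKA's implementation. The attribute names, class labels, and rows below are hypothetical and are not taken from the assignment data.

```python
from collections import Counter, defaultdict


def one_r(rows, attributes, target):
    """Pick the single attribute whose one-level rule makes the fewest errors.

    Returns (best_attribute, rule, errors), where rule maps each attribute
    value to the majority class seen with that value.
    """
    best = None
    for attr in attributes:
        # Count class labels separately for each value of this attribute.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[attr]][row[target]] += 1
        # For each value, predict its most frequent class.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        # Errors are the training instances not in the majority class of their value.
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best


# Hypothetical toy data, for illustration only.
rows = [
    {"education": "college", "sex": "M", "income": ">50K"},
    {"education": "college", "sex": "F", "income": ">50K"},
    {"education": "hs", "sex": "M", "income": "<=50K"},
    {"education": "hs", "sex": "F", "income": "<=50K"},
    {"education": "hs", "sex": "M", "income": ">50K"},
]
attr, rule, errors = one_r(rows, ["education", "sex"], "income")
print(attr, rule, errors)
```

The resulting rule tests one attribute only, which is why OneR models are easy to read and make a useful point of comparison.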
a) Under Results-list, right-click on the results you want to save.
b) Choose “Save result buffer”.
c) Specify the file name and location in the file chooser window that opens (end your file name in .txt to make it easy to open later).
a) What do you think the “chance” probability of getting a prediction correct is in this problem? Explain.
b) Do you think that the programs’ accuracy is significantly better than “chance”? Explain.
c) Do the differences between the results of the different runs appear to be important/significant? Explain your answer.
d) Are there any differences in the more detailed results, as seen in the confusion matrices, between the different runs? Explain any important differences that you see.
e) In this particular case, does discretization appear to make much difference (in either accuracy or the model generated)? Explain.
f) Which models are the most comprehensible?
g) Try to explain what you can get out of the models generated if you were trying to learn about the data based on these runs.
h) Do you trust/believe in the models that were generated? Do they fit your expectations enough that you would be willing to accept patterns that you don’t already know about? Explain.
i) Which run do you want to use as a comparison for more sophisticated efforts? Explain.
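
When comparing runs, it can help to see how accuracy, a confusion matrix, and a simple baseline relate to each other. The Python sketch below is one possible way to compute them and assumes that “chance” is read as always predicting the most frequent class; other readings are possible, and you should justify whichever one you use. The labels and predictions are made up and are not results from the assignment data.

```python
from collections import Counter


def majority_baseline(actual):
    """Accuracy obtained by always predicting the most frequent class."""
    return Counter(actual).most_common(1)[0][1] / len(actual)


def accuracy(actual, predicted):
    """Fraction of instances whose predicted class equals the actual class."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) pair of classes occurs."""
    return Counter(zip(actual, predicted))


# Made-up labels and predictions, for illustration only.
actual = ["no", "no", "no", "yes", "yes", "no", "no", "yes"]
predicted = ["no", "no", "yes", "yes", "no", "no", "no", "yes"]

print("baseline:", majority_baseline(actual))
print("accuracy:", accuracy(actual, predicted))
for (a, p), n in sorted(confusion_matrix(actual, predicted).items()):
    print(f"actual={a} predicted={p}: {n}")
```

Two runs with the same accuracy can still differ in which class they get wrong, and that difference only shows up in the confusion matrix.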
Turn in:
- Disk with results files
- Table summarizing results