CSC 470 Data Mining, Spring 2004
Assignment 4: Data Mining Using WEKA (10 points)
Assigned: 03/11/04
Due: 03/18/04 AT THE START OF CLASS. IF NOT TURNED IN AT THE START OF CLASS, YOU WILL BE CONSIDERED LATE!!!
Task:
a) One data set, which you already have prepared for Assignment 2, with nominal-only attributes, produced by discretizing all numeric attributes (findNumBins=true, bins=5). We need to use the fully discretized dataset because ID3 and Prism cannot handle numeric attributes.
b) One data set based on the above, with all instances having any missing values eliminated. This may be produced any way that you want, but WEKA provides a filter for this kind of thing. Under the “Preprocess” tab, first determine which attributes have missing values, then find and choose the filter “RemoveWithValues.” You can only filter based on one attribute at a time. Change the options to AttributeIndex = the number of the attribute you are filtering on; Invert Selection = True. Repeat until no missing values remain.
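For anyone who wants to see what the two preparation steps amount to outside of WEKA, here is a minimal Python sketch. It is not WEKA's implementation: `equal_width_bin` uses plain equal-width binning with a fixed bin count (WEKA's findNumBins option additionally searches for a suitable number of bins, which this sketch does not attempt), and `drop_missing` removes every instance in one pass instead of filtering one attribute at a time. The function names, the `bin1`..`bin5` labels, and the sample values are hypothetical; the only convention borrowed from WEKA is that ARFF files write a missing value as `?`.

```python
MISSING = "?"  # ARFF files mark a missing value with a question mark


def equal_width_bin(values, bins=5):
    """Replace each numeric value with a nominal bin label; missing stays missing."""
    known = [v for v in values if v != MISSING]
    lo, hi = min(known), max(known)
    # Fall back to a width of 1.0 if every known value is identical.
    width = (hi - lo) / bins or 1.0
    labels = []
    for v in values:
        if v == MISSING:
            labels.append(MISSING)
        else:
            # Clamp so the maximum value lands in the last bin.
            idx = min(int((v - lo) / width), bins - 1)
            labels.append(f"bin{idx + 1}")
    return labels


def drop_missing(rows):
    """Keep only the instances (rows) that have no missing value in any attribute."""
    return [row for row in rows if MISSING not in row]


if __name__ == "__main__":
    # Hypothetical numeric attribute with one missing value.
    print(equal_width_bin([1.0, 2.0, 3.0, 4.0, 5.0, MISSING, 6.0]))
    # Hypothetical discretized instances; two of them contain a missing value.
    print(drop_missing([["bin1", "yes"], ["?", "no"], ["bin3", "?"], ["bin2", "no"]]))
```

Comparing the row counts before and after `drop_missing` is a quick way to check that the second data set really has no missing values left.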
                 | Data
Algorithm        | Fully Discretized     | Fully Discretized, no Missing Values
ID3              | Not Applicable        | Yes
J48              | Yes                   | Yes
Prism            | Not Applicable        | Yes
OneR             | Done in Assignment #2 | Yes
NaiveBayesSimple | Done in Assignment #2 | Yes
a) Under Results-list, right-click on the results you want to save
b) Choose “Save result buffer”
c) Specify the file name and location in the file chooser window that opens (end your file name in .txt to make it easy to open later)
a) ID3 and Prism are more sophisticated algorithms than OneR and NaiveBayes in that they decide which attributes to use instead of using all (or just one). Does this sophistication pay off with better performance? Explain.
b) Rank order the models ID3, J48, Prism, and OneR on the comprehensibility of the model generated. Explain your ordering.
c) J48 is a more sophisticated algorithm than ID3 (which was covered in the book), developed by Quinlan later in his career. You don’t know the details of the algorithm, but looking at the results, what advantages do you see? Explain.
d) ID3 and Prism cannot handle missing values (at all). One of the criteria for saying that an algorithm is robust is that it can handle missing values without being negatively affected. For the algorithms that can handle missing values, comment on how well they perform when faced with missing values, as opposed to when there are no missing values.
e) Considering that the number of missing values is fairly low, do you think the algorithms’ performance would hold up with twice as many missing values? (Think about how the algorithms that you know work.)
f) If you were an interested business person (not an IT person), of the models generated by ID3 and Prism, which do you think would be more useful? Explain.
g) Try to explain what you can get out of the model generated by J48 if you were trying to learn about the data based on these runs.
h) Do you trust/believe in the models that were generated? Do they fit your expectations enough that you would be willing to accept patterns that you don’t already know about? Explain.
Turn in:
- Disk with results files
- Table summarizing results
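If you saved each result buffer as a .txt file, the accuracy figures for the summary table can be pulled out with a few lines of Python instead of being copied by hand. This is a sketch under one assumption: that each saved buffer contains a summary line beginning with "Correctly Classified Instances", followed by a count and a percentage. The function name and the sample text are hypothetical; check the pattern against your own output before relying on it.

```python
import re

# Assumed shape of the summary line: label, instance count, percent correct.
ACCURACY = re.compile(
    r"^Correctly Classified Instances\s+(\d+)\s+([\d.]+)\s*%", re.MULTILINE
)


def accuracy_from_buffer(text):
    """Return the last reported percent correct, or None if no summary line is found."""
    matches = ACCURACY.findall(text)
    # If a buffer reports more than one evaluation, the last one is used.
    return float(matches[-1][1]) if matches else None


if __name__ == "__main__":
    # Hypothetical excerpt standing in for a saved result buffer.
    sample = (
        "=== Summary ===\n\n"
        "Correctly Classified Instances           9               64.2857 %\n"
        "Incorrectly Classified Instances         5               35.7143 %\n"
    )
    print(accuracy_from_buffer(sample))
```

To use it on real files, read each one with `open(path).read()` and pass the text to `accuracy_from_buffer`. A result of `None` means the line was not found and the file should be checked by hand.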