CSC 470 Data Mining, Spring 2004
Assignment 4 Answers
1. For these experiments you need two data sets (all based on the adult census data you worked with in previous assignments):
a) …
b) One based on the above, with all instances having any missing values eliminated. This may be produced any way that you want, but WEKA provides a filter for this kind of thing. Under the "Preprocess" tab, first determine which attributes have missing values, then find and choose the filter "RemoveWithValues." You can only filter based on one attribute at a time: set attributeIndex to the attribute being filtered on and invertSelection to True, then repeat until no missing values remain. The resulting dataset should have 925 instances.
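Outside of WEKA, the same filtering can be sketched in a few lines of pandas (an illustration, not the assignment's required method; the tiny DataFrame here is made-up stand-in data). ARFF files mark missing values with "?", so we map "?" to NA before dropping rows:

```python
# Sketch: removing all instances with any missing value, pandas-style.
# WEKA's ARFF format writes missing values as "?", so treat "?" as NA.
import pandas as pd

df = pd.DataFrame({
    "workclass": ["Private", "?", "Self-emp"],
    "education": ["Bachelors", "HS-grad", "?"],
    "income":    ["<=50K", ">50K", "<=50K"],
})
df = df.replace("?", pd.NA)   # treat "?" as missing, as ARFF does
clean = df.dropna()           # drop every row with any missing value
print(len(clean))             # only the fully observed row remains: 1
```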
2. Run WEKA classifiers on the applicable data sets, as specified below, producing eight different models of the census data (you should have 2 of these from Assignment #2). Run each classifier with the 10-fold cross-validation test option, predicting the income attribute, and examine the bottom section of the classifier output (in the Classifier output window). Use default options for all runs. You should have: …

You should have 8 results files on disk.
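For readers without WEKA at hand, a rough scikit-learn analogue of one such run might look like this (an assumption for illustration: scikit-learn's decision tree stands in for J48, and synthetic data stands in for the census data):

```python
# Sketch of a 10-fold cross-validation run, scikit-learn style.
# make_classification generates stand-in data; a real run would load
# the census data instead.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
clf = DecisionTreeClassifier(random_state=0)
scores = cross_val_score(clf, X, y, cv=10)   # 10-fold CV, as in WEKA
print(round(scores.mean() * 100, 1))         # mean accuracy, in percent
```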
3. Save the results of each run into an appropriately named file:

a) Under Results-list, right-click on the results you want to save.

b) Choose "Save result buffer."

c) Specify the file name and location in the file chooser window that opens (end your file name in .txt to make it easy to open later).
4. Collect information about the accuracy of each model: use correctly classified instances as the measure of accuracy. Report for all, preferably in a table (perhaps mirroring the table above).
Accuracy (% correctly classified instances), by data set:

Algorithm        | Fully Discretized | Fully Discretized, no Missing Values
---------------- | ----------------- | ------------------------------------
ID3              | Not Applicable    | 73.0
J48              | 81.8              | 81.8
Prism            | Not Applicable    | 79.7
OneR             | 78.5              | 77.7
NaiveBayesSimple | 82.2              | 81.1
5. Answer the questions listed below.
a) ID3 and Prism are more sophisticated algorithms than OneR and NaiveBayes in that they decide which attributes to use, instead of using all of them (or just one). Does this sophistication pay off with better performance? Explain.

Not in this case. Naïve Bayes does better than either one on these datasets.
b) Rank order the models ID3, J48, Prism, and OneR on the comprehensibility of the model generated. Explain your ordering.

OneR is the most comprehensible, because it breaks everything down into one attribute and its values. J48 is next most comprehensible because its tree is much smaller and more compact than the remaining two, taking less than a page. ID3 and Prism generate models that take up more than 20 pages. Of those two, I would say that Prism's rules are more comprehensible because 1) they are neater, and 2) you can read a rule from one spot in the output, whereas finding the attribute values that lead to an ID3 leaf node may involve scanning over multiple pages. It also helps that all of the rules for 'less' income are together and all of the rules for 'more' income are together (on the other hand, the later rules for each category start getting very hairy).
c) J48 is a more sophisticated algorithm than ID3 (which was covered in the book), developed by Quinlan later in his career. You don't know the details of the algorithm, but looking at the results, what advantages do you see? Explain.

It produces a more compact tree, which makes it more comprehensible. It can handle missing values, so it can run on data that ID3 can't. And, at least on these datasets, it gets a higher percent correct than ID3.
d) ID3 and Prism cannot handle missing values (at all). One of the criteria for saying that an algorithm is robust is that it can handle missing values without being negatively affected. For the algorithms that can handle missing values, comment on how well they perform when faced with missing values, as opposed to when there are none.

Missing values don't appear to hurt these algorithms, since the results on the full dataset show at least as high a percent correct as on the reduced/filtered dataset, if not higher.
e) Considering that the number of missing values is fairly low, do you think the algorithms' performance would hold up with twice as many missing values? (Think about how the algorithms that you know work.)

OneR treats a missing value as just another attribute value. A missing value, if it doesn't somehow tie into what is being predicted, will probably be split fairly evenly among the possible predictions, producing a fairly high number of errors on training data. If an attribute had twice as many missing values, its error rate would go up, and that attribute would be less likely to be used (which is probably an appropriate response to the situation). If the increased missing values were spread among all attributes, with no pattern, they would probably increase errors in general but not affect the choice of attribute to use (which is also probably appropriate).
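The OneR behavior described above can be sketched in a few lines (a minimal illustration of the idea, not WEKA's implementation; the toy rows and attribute names are made up): each attribute value, "?" included, gets the majority class of the rows carrying it, and the attribute with the fewest training errors wins.

```python
# Minimal OneR sketch: one rule per attribute value, with a missing
# value ("?") treated as just another value; keep the attribute whose
# rule makes the fewest errors on the training data.
from collections import Counter

def one_r(rows, target):
    # rows: list of dicts mapping attribute -> value;
    # target: name of the class attribute to predict
    best = None
    for attr in rows[0]:
        if attr == target:
            continue
        # count classes per attribute value ("?" is an ordinary value)
        by_value = {}
        for r in rows:
            by_value.setdefault(r[attr], Counter())[r[target]] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(r[target] != rule[r[attr]] for r in rows)
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best

rows = [
    {"edu": "HS", "age": "young", "income": "less"},
    {"edu": "BS", "age": "old",   "income": "more"},
    {"edu": "?",  "age": "old",   "income": "more"},
    {"edu": "HS", "age": "young", "income": "less"},
]
attr, rule, errors = one_r(rows, "income")
print(attr, rule, errors)   # chosen attribute, its rule, training errors
```

Note how the "?" bucket gets its own prediction; if missings were split evenly across classes, that bucket's errors would drag the attribute's score down, as the answer above argues.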
Naïve Bayes is pretty robust with regard to missing values. The handling is natural and principled. I think twice the missing data should not be a problem.
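Why the handling is "natural and principled" can be shown in a small sketch (an assumed illustration, not WEKA's code; the toy priors and likelihoods are invented): a missing attribute simply contributes no factor to the product of per-attribute likelihoods, so classification falls back on the remaining evidence.

```python
# Sketch of Naive Bayes scoring where a missing value ("?") is simply
# skipped: it contributes no likelihood factor, so the posterior rests
# on the observed attributes (or just the prior, if all are missing).
def nb_score(instance, prior, likelihoods):
    # likelihoods[attr][value][cls] -> P(value | cls)
    scores = dict(prior)
    for attr, value in instance.items():
        if value == "?":
            continue               # missing: no factor at all
        for cls in scores:
            scores[cls] *= likelihoods[attr][value][cls]
    return scores

prior = {"less": 0.6, "more": 0.4}
likelihoods = {"edu": {"HS": {"less": 0.7, "more": 0.2},
                       "BS": {"less": 0.3, "more": 0.8}}}
full    = nb_score({"edu": "BS"}, prior, likelihoods)  # uses evidence
missing = nb_score({"edu": "?"},  prior, likelihoods)  # falls back to prior
print(max(full, key=full.get), max(missing, key=missing.get))
```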
We don't know much about J48's handling of missing values, but based on its performance being exactly the same with and without missing values, I suspect that it will still do OK.
f) If you were an interested business person (not an IT person), of the models generated by ID3 and Prism, which do you think would be more useful? Explain.

I think Prism's rules are more useful, because they are not spread across many pages. One has to remember, however, that they are not independent of each other.
g) Try to explain what you can get out of the model generated by J48 if you were trying to learn about the data based on these runs.

Capital gains/losses, marital status, race, education, and age appear to be good variables for predicting income. If somebody had capital gains or losses, then they had enough money to invest. If they are never married or separated, then their income is 'less'; perhaps they can invest even at a lower income than married people (more disposable income). Married people with capital gains/losses have 'more' income unless they are Native American. Divorced people are divided based on the size of the capital gains; the largest net gains are associated with people with 'more' income. Widowed people with capital gains/losses have 'more' income.

Among those with no capital gains/losses, divorced, never-married, separated, and widowed people have 'less' income. Among married people, education helps predict income: the highest education level predicts 'more' income, and the lowest three levels predict 'less' income. The level in between is divided based on age. Peak earning years are 43-56. People under 30 have 'less' income. Ages 30-43 mean 'more' income unless self-employed.
h) Do you trust/believe in the models that were generated? Do they fit your expectations enough that you would be willing to accept patterns that you don't already know about? Explain.

J48's tree makes a lot of sense to me; I think I would trust it. It is hard to get a feel for the models generated by ID3 and Prism because they are so massive; I think they probably overfit the data. As discussed in Assignment 2, with work you can pull information out of the Naïve Bayes model, and what I see in analyzing it makes a lot of sense. I trust these results.
Turn in:

- Disk with results files
- Table summarizing results
- Answers to questions in #5 above