CSC 470 Data Mining Spring 2004

Assignment 4            Answers                                  

1.        For these experiments you need two data sets (both based on the adult census data you worked with in previous assignments):

a)       One fully discretized (the discretized version of the census data from the previous assignments).

b)       One based on the above, with all instances having any missing values eliminated. This may be produced any way that you want, but WEKA provides a filter for this kind of thing. Under the “Preprocess” tab, first determine which attributes have missing values, then find and choose the filter “RemoveWithValues.”  You can only filter based on one attribute at a time: set attributeIndex to the number of the attribute you are filtering on, and set invertSelection to True.  Repeat until no missing values remain.

The resulting dataset should have 925 instances. A programmatic route to the same result is sketched below.
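This is a minimal sketch against the WEKA 3.x Java API, assuming a local copy of the census ARFF file (the file name is a placeholder); Instances.deleteWithMissing does per attribute what the repeated RemoveWithValues passes do in the GUI.

import weka.core.Instances;
import java.io.BufferedReader;
import java.io.FileReader;

public class RemoveMissing {
    public static void main(String[] args) throws Exception {
        Instances data = new Instances(
                new BufferedReader(new FileReader("census-discretized.arff")));

        // drop every instance that has a missing value on any attribute
        for (int i = 0; i < data.numAttributes(); i++) {
            data.deleteWithMissing(i);
        }
        System.out.println("Remaining instances: " + data.numInstances()); // should print 925
    }
}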

2.        Run WEKA classifiers on the applicable data sets, as specified below, thus producing eight different models of the census data (you should have 2 of these from assignment #2). Run each classifier with the 10-fold cross-validation test option, predicting the income attribute, and examine the bottom section of the classifier output (in the Classifier output window).  Use default options for all runs. You should have: J48, OneR, and NaiveBayesSimple run on both data sets, plus ID3 and Prism run on the no-missing-values data set only (they cannot handle missing values). A scripted version of one such run is sketched below.

After saving (step 3) you should have 8 results files on disk.
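For anyone who prefers to script the runs, here is a minimal sketch of one cross-validation run against the WEKA 3.x Java API. The file name, and the assumption that income is the last attribute in the ARFF file, are placeholders to adjust for your setup; 10 folds and seed 1 match the Explorer defaults.

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;

public class RunCV {
    public static void main(String[] args) throws Exception {
        Instances data = new Instances(
                new BufferedReader(new FileReader("census-discretized.arff")));
        data.setClassIndex(data.numAttributes() - 1); // assumes income is the last attribute

        J48 classifier = new J48(); // default options, as the assignment specifies
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(classifier, data, 10, new Random(1));

        // "Correctly Classified Instances" from the Classifier output window:
        System.out.printf("Correctly classified: %.1f%%%n", eval.pctCorrect());
        System.out.println(eval.toSummaryString());
    }
}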

3.        Save the results of each run into an appropriately named file (a scripted equivalent is sketched after this list):

a)       Under Results-list, right-click on the results you want to save

b)       Choose “Save result buffer”

c)       Specify the file name and location in the file chooser window that opens (end your file name in .txt to make it easy to open later)
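A programmatic counterpart of “Save result buffer,” reusing the setup from the sketch under #2 (the output file name is a placeholder):

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;
import java.util.Random;

public class SaveResultBuffer {
    public static void main(String[] args) throws Exception {
        Instances data = new Instances(
                new BufferedReader(new FileReader("census-discretized.arff")));
        data.setClassIndex(data.numAttributes() - 1);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));

        // end the name in .txt, as suggested above, so the file opens easily
        try (PrintWriter out = new PrintWriter("j48-discretized.txt")) {
            out.println(eval.toSummaryString());
        }
    }
}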

4.        Collect information about the accuracy of each model: use correctly classified instances as the measure of accuracy. Report for all runs, preferably in a table (such as the one below; a sketch that generates all the numbers from code follows the table).

 

Accuracy (% correctly classified instances, 10-fold cross-validation):

Algorithm           Fully Discretized     Fully Discretized, no Missing Values
ID3                 Not Applicable        73.0
J48                 81.8                  81.8
Prism               Not Applicable        79.7
OneR                78.5                  77.7
NaiveBayesSimple    82.2                  81.1
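For reference, a hedged sketch that reproduces the whole table from code. The Id3, Prism, and NaiveBayesSimple class names are the WEKA 3.4-era ones (in recent releases they moved to the simpleEducationalLearningSchemes package or were dropped), and the file names are placeholders, so treat those details as assumptions.

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayesSimple;
import weka.classifiers.rules.OneR;
import weka.classifiers.rules.Prism;
import weka.classifiers.trees.Id3;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;

public class AccuracyTable {
    public static void main(String[] args) throws Exception {
        Instances full = load("census-discretized.arff");
        Instances clean = load("census-discretized-nomissing.arff");

        // ID3 and Prism only run on the data set without missing values
        System.out.printf("ID3                 n/a   %5.1f%n", acc(new Id3(), clean));
        System.out.printf("J48               %5.1f   %5.1f%n", acc(new J48(), full), acc(new J48(), clean));
        System.out.printf("Prism               n/a   %5.1f%n", acc(new Prism(), clean));
        System.out.printf("OneR              %5.1f   %5.1f%n", acc(new OneR(), full), acc(new OneR(), clean));
        System.out.printf("NaiveBayesSimple  %5.1f   %5.1f%n",
                acc(new NaiveBayesSimple(), full), acc(new NaiveBayesSimple(), clean));
    }

    // percent correct from one 10-fold cross-validation with a fixed seed
    static double acc(Classifier c, Instances data) throws Exception {
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(c, data, 10, new Random(1));
        return eval.pctCorrect();
    }

    static Instances load(String file) throws Exception {
        Instances data = new Instances(new BufferedReader(new FileReader(file)));
        data.setClassIndex(data.numAttributes() - 1); // income is the last attribute
        return data;
    }
}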

5.        Answer the questions listed below.

a)       ID3 and Prism are more sophisticated algorithms than OneR and NaiveBayes in that they decide which attributes to use instead of using all (or just one). Does this sophistication pay off with better performance? Explain.

Not in this case. Naïve Bayes does better than either of them on these datasets (81.1% vs. ID3’s 73.0% and Prism’s 79.7% on the no-missing-values data).

b)       Rank order the models ID3, J48, Prism, and OneR on the comprehensibility of the model generated. Explain your ordering.

OneR is the most comprehensible, because it breaks everything down into one attribute and its values. J48 is next most comprehensible, because its tree is much smaller and more compact than the remaining two models, taking less than a page.  ID3 and Prism generate models that take up more than 20 pages. Of those two, I would say that Prism’s rules are more comprehensible because 1) they are neater, and 2) you can read a rule from one spot in the output, whereas finding the attribute values that lead to a leaf node in the ID3 tree may involve scanning over multiple pages. It also helps that all of the rules for ‘less’ income are together and all of the rules for ‘more’ income are together (on the other hand, the later rules in each category get very hairy).

c)       J48 is a more sophisticated algorithm than ID3 (which was covered in the book) – developed by Quinlan later in his career.  You don’t know the details of the algorithm, but looking at the results, what advantages do you see? Explain.

It produces a more compact tree, which makes it more comprehensible. It can handle missing values, so it can run on data that ID3 can’t run on. And, at least on these datasets, it gets a higher percent correct than ID3 (81.8 vs. 73.0).

d)       ID3 and Prism cannot handle missing values (at all).  One of the criteria for saying that an algorithm is robust is that it can handle missing values without being negatively affected. For the algorithms that can handle missing values, comment on how well they perform when faced with missing values, as opposed to when there are no missing values.

Missing values don’t appear to hurt these algorithms: for each one, the percent correct on the full dataset is at least as high as on the reduced/filtered dataset (J48 81.8 vs. 81.8; OneR 78.5 vs. 77.7; NaiveBayesSimple 82.2 vs. 81.1).

e)       Considering that the number of missing values is fairly low, do you think the algorithms’ performance would hold up with twice as many missing values? (think about how the algorithms that you know work)

OneR treats a missing value as another alternative value. A missing value, if it doesn’t somehow tie into what is being predicted, will probably be split fairly evenly among the possible predictions, producing a fairly high number of errors on training data. If one attribute had twice as many missing values, its error rate would go up, and that attribute would be less likely to be used (which is probably an appropriate response to the situation). If the increased missing values were spread among all attributes, with no pattern, they would probably increase errors in general, but not affect the choice of attribute to use (which is also probably appropriate).
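To make that concrete, here is a toy sketch (with made-up data) of how a OneR-style rule racks up training errors when “?” is counted as just another attribute value:

import java.util.HashMap;
import java.util.Map;

public class OneRMissingToy {
    public static void main(String[] args) {
        // (attribute value, class) pairs; "?" marks a missing value
        String[][] train = {
            {"private", "less"}, {"private", "less"}, {"private", "more"},
            {"gov", "more"},     {"gov", "more"},
            {"?", "less"},       {"?", "more"},       {"?", "more"},
        };

        // class counts per attribute value -- "?" gets a bucket of its own
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        for (String[] row : train) {
            counts.computeIfAbsent(row[0], k -> new HashMap<>())
                  .merge(row[1], 1, Integer::sum);
        }

        // each bucket's training errors = instances outside its majority class;
        // if "?" splits evenly between the classes, its bucket piles up errors,
        // and doubling the missing values makes the attribute look even worse
        int errors = 0;
        for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
            int total = e.getValue().values().stream().mapToInt(Integer::intValue).sum();
            int majority = e.getValue().values().stream().mapToInt(Integer::intValue).max().getAsInt();
            System.out.println(e.getKey() + ": majority covers " + majority + " of " + total);
            errors += total - majority;
        }
        System.out.println("training errors for this attribute: " + errors);
    }
}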

Naïve Bayes is pretty robust with regard to missing values; the handling is natural and principled (a missing attribute is simply skipped, both when counting during training and when multiplying probabilities during prediction). I think double the missing data should not be a problem.

We don’t know much about J48’s handling of missing values, but based on its performance being exactly the same with and without missing values (81.8 in both cases), I suspect it will still do OK.

f)        If you were an interested business person (not an IT person), of the models generated by ID3 and Prism, which do you think would be more useful?  Explain.

I think Prism’s rules are more useful, because each rule can be read in one place rather than being spread across many pages of a tree. One has to remember, however, that the rules are not independent of each other.

g)       Try to explain what you can get out of the model generated by J48 if you were trying to learn about the data based on these runs.

Capital gains/losses, marital status, race, education, and age appear to be good variables for predicting income. If somebody had capital gains or losses, then they had enough money to invest. Among those, if they are never married or separated, their income is ‘less’ – perhaps they can invest even on a lower income than married people (more disposable income). Married people with capital gains/losses have ‘more’ income unless they are Native American. Divorced people are divided based on the size of the capital gains; the largest net gains are associated with people with ‘more’ income. Widowed people with capital gains/losses have ‘more’ income.

Among those with no capital gains/losses, divorced, never married, separated, and widowed people have ‘less’ income. Among married people, education helps predict income. The highest education level predicts ‘more’ income; the lowest three levels of education predict ‘less’ income. The level in between is divided based on age.  Peak earning years are 43-56. People under 30 have ‘less’ income.  30-43 is ‘more’ income unless self-employed.

h)       Do you trust/believe in the models that were generated? Do they fit your expectations enough that you would be willing to accept patterns that you don’t already know about? Explain.

J48’s tree makes a lot of sense to me; I think I would trust it.  It is hard to get a feel for the models generated by ID3 and Prism because they are so massive, and I think they probably overfit the data.  As discussed in Assignment 2, with some work you can pull information out of the Naïve Bayes model, and what I see in analyzing it makes a lot of sense. I trust these results.

Turn in:

•         Disk with results files

•         Table summarizing results

•         Answers to questions in #5 above