CSC 470 Data Mining Spring 2004

Assignment 4 Data Mining Using WEKA 10 points

Assigned: 03/11/04

Due: 03/18/04 AT THE START OF CLASS.

IF NOT TURNED IN AT THE START OF CLASS, YOU WILL BE CONSIDERED LATE!!!

Task:

For these experiments you need two data sets (both based on the adult census data you worked with in previous assignments):

a) One data set, which you already have prepared for Assignment 2:

§ with nominal-only attributes, produced by discretizing all numeric attributes (findNumBins=true, bins=5). We need to use the fully discretized data set because ID3 and Prism cannot handle numeric attributes.

b) One based on the above, with every instance that has any missing value eliminated. You may produce this any way you want, but WEKA provides a filter for this kind of thing. Under the “Preprocess” tab, first determine which attributes have missing values, then find and choose the filter “RemoveWithValues.” You can only filter on one attribute at a time. Set the options to AttributeIndex = the number of the attribute you are filtering on; Invert Selection = True. Repeat until no missing values remain.
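As a rough illustration of what data set b) should look like (not a replacement for the WEKA filter), the following Python sketch drops every instance that has a missing value in any attribute. It assumes the ARFF convention that a missing value is written as `?`. The function name `drop_missing` and the sample rows are made up for the example.

```python
MISSING = "?"  # assumption: ARFF-style marker for a missing value

def drop_missing(rows):
    """Return only the rows that contain no missing values."""
    return [row for row in rows if MISSING not in row]

# Hypothetical instances: (workclass, education, income)
rows = [
    ("Private", "Bachelors", "<=50K"),
    ("?", "HS-grad", "<=50K"),
    ("Self-emp", "?", ">50K"),
    ("Private", "Masters", ">50K"),
]

complete = drop_missing(rows)
print(len(rows), "instances before,", len(complete), "after")
```

Comparing the instance counts before and after is a quick check that the filtering removed what you expected.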

Run WEKA classifiers on the applicable data sets, as specified below, producing eight different models of the census data (you should already have 2 of these from Assignment #2). Run each classifier with the 10-fold cross-validation test option, predicting the income attribute, and examine the bottom section of the classifier output (in the Classifier output window). Use default options for all runs. You should have:

	Data
Algorithm	Fully Discretized	Fully Discretized, no Missing Values
ID3	Not Applicable	Yes
J48	Yes	Yes
Prism	Not Applicable	Yes
OneR	Done in Assignment #2	Yes
NaiveBayesSimple	Done in Assignment #2	Yes
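To make the 10-fold cross-validation test option more concrete, here is a minimal Python sketch of the idea: the instances are split into 10 folds, each fold is held out once for testing while the other nine are used for training, and the correct predictions are pooled over all folds. This is only an illustration under stated assumptions. WEKA stratifies its folds and uses its own random seed, which this sketch does not reproduce, and the majority-class predictor, the function names, and the label counts are stand-ins invented for the example.

```python
import random
from collections import Counter

def make_folds(n_instances, k=10, seed=1):
    """Shuffle the instance indices and deal them into k nearly equal folds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(labels, k=10):
    """Pooled accuracy of a majority-class predictor under k-fold cross-validation."""
    folds = make_folds(len(labels), k)
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        # Train on everything outside the held-out fold.
        train = [labels[i] for i in range(len(labels)) if i not in held_out]
        prediction = Counter(train).most_common(1)[0][0]
        # Test on the held-out fold only.
        correct += sum(1 for i in test_idx if labels[i] == prediction)
    return correct / len(labels)

# Hypothetical income labels, not taken from the census data.
labels = ["<=50K"] * 75 + [">50K"] * 25
folds = make_folds(len(labels))
accuracy = cross_validate(labels)
print(len(folds), "folds, pooled accuracy:", accuracy)
```

Every instance is used for testing exactly once, which is why the number of classified instances in the summary equals the size of the data set.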

Save results of each run into an appropriately named file:

a) Under Results-list, right-click on the results you want to save

b) Choose “Save result buffer”

c) Specify the file name and location in the file chooser window that opens (end your file name in .txt to make it easy to open later)

Collect information about the accuracy of each model: Use correctly classified instances as the measure of accuracy. Report it for all models, preferably in a table (perhaps mirroring the table above).
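If you would rather not copy the numbers by hand, a small script can pull them out of the saved result files. The Python sketch below assumes the saved buffer contains a summary line beginning with "Correctly Classified Instances" followed by a count and a percentage. Check that against your own files before relying on it. The function name and the sample excerpt (including its numbers) are made up for illustration.

```python
import re

# Assumed layout: "Correctly Classified Instances   <count>   <percent> %"
PATTERN = re.compile(
    r"^Correctly Classified Instances\s+(\d+)\s+([\d.]+)\s*%", re.MULTILINE
)

def accuracy_from_output(text):
    """Return (count, percent) from the last matching summary line, or None."""
    matches = PATTERN.findall(text)
    if not matches:
        return None
    count, percent = matches[-1]
    return int(count), float(percent)

# Made-up excerpt in the style of a saved result buffer.
sample = """=== Summary ===

Correctly Classified Instances          14               87.5    %
Incorrectly Classified Instances         2               12.5    %
"""

result = accuracy_from_output(sample)
print(result)
```

To use it on a real run, read the file you saved with "Save result buffer" (for example with `open(name).read()`) and pass the text to `accuracy_from_output`.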
Answer the questions listed below.

a) ID3 and Prism are more sophisticated algorithms than OneR and NaiveBayes in that they decide which attributes to use instead of using all (or just one). Does this sophistication pay off with better performance? Explain.

b) Rank order the models ID3, J48, Prism, and OneR on the comprehensibility of the model generated. Explain your ordering.

c) J48 is a more sophisticated algorithm than ID3 (which was covered in the book) – developed by Quinlan later in his career. You don’t know the details of the algorithm, but looking at the results, what advantages do you see? Explain.

d) ID3 and Prism cannot handle missing values (at all). One of the criteria for saying that an algorithm is robust is that it can handle missing values without being negatively affected. For the algorithms that can handle missing values, comment on how well they perform when faced with missing values, as opposed to when there are no missing values.

e) Considering that the number of missing values is fairly low, do you think the algorithms’ performance would hold up with twice as many missing values? (think about how the algorithms that you know work)

f) If you were an interested business person (not an IT person), of the models generated by ID3 and Prism, which do you think would be more useful? Explain.

g) Try to explain what you can get out of the model generated by J48 if you were trying to learn about the data based on these runs.

h) Do you trust/believe in the models that were generated? Do they fit your expectations enough that you would be willing to accept patterns that you don’t already know about? Explain.

Turn in:

§ Disk with results files

§ Table summarizing results

§ Answers to the questions (a through h) above