CIS 658 Data Mining Fall 2007 Evening

Assignment 3                                       Data Mining Using WEKA                               10 points

 

Assigned: 10/25/07

Due:         11/01/07   AT THE START OF CLASS.

IF NOT TURNED IN AT THE START OF CLASS, YOU WILL BE CONSIDERED LATE!!!

Turn in:

§  Zip file with:

o    data (#1 below) and results (#2-3 below) files. PLEASE DO NOT print these!!!

o   Table summarizing results  (#4 below)

Task:

  1. For these experiments you need three data sets (all based on the community data you worked with in previous assignments):

a)      One data set, which you already have prepared for Assignments 1 & 2:

§  with nominal-only attributes, produced by discretizing all numeric attributes (findNumBins=true, bins=5). We need the fully discretized dataset because ID3 and Prism cannot handle numeric attributes.

§  I will provide a copy of mine on my www site for use by anybody who has problems with their version.
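To see what the discretization in a) is doing conceptually, here is a minimal sketch of equal-width binning into 5 bins. (Note: `discretize` is a hypothetical helper written for illustration; WEKA's Discretize filter with findNumBins=true actually searches for a good bin count, so this only shows the underlying idea of turning a numeric attribute into a nominal one.)

```python
# Equal-width binning: map each numeric value to one of 5 nominal labels.
# This mimics the *effect* of discretization, not WEKA's exact algorithm.

def discretize(values, bins=5):
    """Map each numeric value to a nominal bin label 'B1'..'B5'."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1  # avoid zero width if all values equal
    labels = []
    for v in values:
        i = min(int((v - lo) / width), bins - 1)  # clamp max into last bin
        labels.append(f"B{i + 1}")
    return labels

print(discretize([0.0, 0.2, 0.5, 0.9, 1.0]))  # -> ['B1', 'B2', 'B3', 'B5', 'B5']
```

Once every numeric attribute is replaced by such labels, ID3 and Prism can treat the data as purely nominal.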

b)      One based on the above, with all instances having any missing values eliminated. This may be produced any way that you want, but WEKA provides a filter for exactly this. Under the “Preprocess” tab, first determine which attributes (by #) have missing values (I think there are 21 of these – all but otherPerCap are late in the data – concerning police, and the last – violent crime). Then find and choose the filter “RemoveWithValues” under unsupervised > instance. You can only filter based on one attribute at a time. Change the options to:

§  attributeIndex = the # of the attribute you are filtering on;

§  matchMissingValues = True;

§  invertSelection = True.

§  Make sure you actually Apply the filter.

Repeat until no missing values remain – you should not need anything close to 21 removes, since most of the missing values occur in the same records – those missing police info (it took me 2 removes).

Make sure you Save to a different filename than the file in a) so that you still have both!
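The net effect of the repeated RemoveWithValues passes in b) can be sketched in a few lines. (This is an illustrative stand-in, not WEKA code; `remove_missing` is a hypothetical helper, and '?' marks a missing value, as in ARFF files.)

```python
# Drop every instance (row) that contains at least one missing value ('?').
# This is the combined effect of repeated RemoveWithValues passes with
# matchMissingValues=True and invertSelection=True.

def remove_missing(instances):
    return [row for row in instances if '?' not in row]

data = [
    ['low',  'urban', 'high'],
    ['low',  '?',     'high'],   # missing police-related value: dropped
    ['high', 'rural', '?'],      # missing violent crime value: dropped
    ['mid',  'urban', 'low'],
]
print(remove_missing(data))  # keeps only the two complete rows
```

Because the missing values cluster in the same records, one pass per affected attribute removes overlapping sets of rows – which is why only a couple of removes are needed in practice.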

c)      One based on a) above, but with values substituted for missing values. Do this in WEKA using the unsupervised attribute filter “ReplaceMissingValues”. (This filter replaces missing nominal values with the most common (mode) value for that attribute.)

§  To replace missing values in violent crime, make sure you set the class to “no class” before running the filter.

§  Remember to go back to the first file (from a)). You cannot do this from the second file, as all of its missing values are gone.

§  Make sure you Save to a different filename than the file in a) so that you still have both!
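The mode replacement in c) can be sketched as follows. (Again a hypothetical helper, `replace_missing`, written only to show what the ReplaceMissingValues filter does for nominal attributes: each '?' is filled with that attribute's most common value.)

```python
# Replace each missing nominal value ('?') with the mode of its attribute.
from collections import Counter

def replace_missing(instances):
    cols = list(zip(*instances))  # transpose rows into attribute columns
    modes = [Counter(v for v in col if v != '?').most_common(1)[0][0]
             for col in cols]     # mode of each attribute, ignoring '?'
    return [[modes[j] if v == '?' else v for j, v in enumerate(row)]
            for row in instances]

data = [['low', 'urban'], ['low', '?'], ['high', 'urban'], ['?', 'rural']]
print(replace_missing(data))
```

This keeps every instance (unlike b)), at the cost of inserting made-up values – a trade-off question e) below asks you to weigh.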


  2. Run WEKA classifiers on the applicable data sets, as specified below, thus producing thirteen different models of the community data (you should already have 1 of these from Assignment #2). Run each classifier with the 10-fold cross-validation test option, predicting the violent crime attribute, and examine the bottom section of the classifier output (in the Classifier output window). Use default options for all runs. You should have:

 

                                                Data
Algorithm          Fully Discretized       Fully Discretized,        Fully Discretized,
                                           Missing Values Removed    Missing Values Replaced
-----------------  ----------------------  ------------------------  ------------------------
ID3                Not Applicable          Yes                       Yes
J48                Yes                     Yes                       Yes
Prism              Not Applicable          Yes                       Yes
OneR               Done in Assignment #2   Yes                       Yes
NaiveBayesSimple   Yes                     Yes                       Yes

  3. Save the results of each run into an appropriately named file:

a)      Under Results-list, right-click on the results you want to save

b)      Choose “Save result buffer”

c)      Specify the file name and location in the file chooser window that opens (end your file name in .txt to make it easy to open later)

  4. Collect information about the accuracy of each model: use correctly classified instances as the measure of accuracy. Report for all runs, preferably in a table (perhaps mirroring the table above).
  5. Answer the questions listed below.

a)      ID3 and Prism are more sophisticated algorithms than OneR and NaiveBayes in that they decide which attributes to use instead of using all (or just one). Does this sophistication pay off with better performance? Explain.

b)      Rank order the models ID3, J48, Prism, and OneR on the comprehensibility (understandability) of the model generated. Explain your ordering.

c)      J48 is a more sophisticated algorithm than ID3 (which was covered in the book) – developed by Quinlan later in his career.  You don’t know the details of the algorithm, but looking at the results, does it seem to be a noticeable improvement? Explain.

d)      ID3 and Prism cannot handle missing values (at all).  One of the criteria for saying that an algorithm is robust is that it can handle missing values without being negatively affected. For the algorithms that can handle missing values, comment on how well they perform when faced with missing values, as opposed to when there are no missing values.

e)      Both approaches to eliminating missing values may seem questionable (removing instances reduces the amount of usable data, replacing them with made up values may seem unjustifiable). Do you think one seems to work better than the other? Explain.

f)       Considering that the proportion of missing values is fairly high (85%) in 19 attributes, do you think the algorithms would perform better with far fewer missing values? (Think about how the algorithms that you know work.)

g)      If you were an interested business person (user, not an IT person. Could be government or non-profit given the subject area), of the models generated by ID3 and Prism, which do you think would be more useful?  Explain.

h)      If you were a technical person (e.g. in IT), do you think you could get something useful out of the model generated by J48 if you were trying to learn about the data based on these runs? Explain.

i)        Do you trust/believe in the models that were generated? Do they fit your expectations enough that you would be willing to accept patterns that you don’t already know about? Explain.