CSC 470 Data Mining Fall 2005

Assignment 4                                        Data Mining Using WEKA                                10 points

 

Assigned: 10/13/05

Due:         11/08/05   AT THE START OF CLASS.

IF NOT TURNED IN AT THE START OF CLASS, YOU WILL BE CONSIDERED LATE!!!

Task:

  1. For these experiments you need three data sets (all based on the community data you worked with in previous assignments); a scripted alternative to the GUI preparation steps is sketched after this list:

a)      One data set, which you already have prepared for Assignments 1 & 3:

§         with nominal-only attributes, produced by discretizing all numeric attributes (findNumBins=true, bins=5). We need the fully discretized dataset because ID3 and Prism cannot handle numeric attributes.

b)      One based on the above, with all instances that have any missing values removed. You may produce this any way you like, but WEKA provides a filter for exactly this task. Under the “Preprocess” tab, first determine which attributes (by number) have missing values (I think there are 11 of these), then find and choose the filter “RemoveWithValues” under the unsupervised instance filters. You can only filter based on one attribute at a time. Change the options to:

§         attributeIndex = the number of the attribute you are filtering on;

§         matchMissingValues = True;

§         invertSelection = True.

Repeat until no missing values remain; you may not need to do 11 removes, since a record may have missing values for more than one attribute.

Make sure you Save to a different filename than the file in a) so that you still have both!

c)      One based on a) above, but with values substituted for the missing values. Do this in WEKA using the unsupervised attribute filter “ReplaceMissingValues”.
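
If you would rather script these preparation steps than click through the Explorer, the same filters are available through WEKA’s Java API. The sketch below is a minimal example, assuming the WEKA 3.x API; the file names and the attribute index passed to RemoveWithValues are placeholders that you would replace with your own values, and step b) would be repeated for each attribute that has missing values.

import java.io.BufferedReader;
import java.io.FileReader;

import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;
import weka.filters.unsupervised.instance.RemoveWithValues;

public class PrepareCommunityData {
    public static void main(String[] args) throws Exception {
        // Load the community data (file name is a placeholder).
        Instances raw = new Instances(new BufferedReader(new FileReader("community.arff")));

        // a) Discretize all numeric attributes (findNumBins=true, bins=5).
        Discretize disc = new Discretize();
        disc.setFindNumBins(true);
        disc.setBins(5);
        disc.setInputFormat(raw);
        Instances discretized = Filter.useFilter(raw, disc);

        // b) Remove instances with a missing value on one attribute, using the
        //    same options as in the Explorer; repeat for every attribute that
        //    has missing values.
        RemoveWithValues remove = new RemoveWithValues();
        remove.setAttributeIndex("3");        // placeholder attribute number
        remove.setMatchMissingValues(true);
        remove.setInvertSelection(true);
        remove.setInputFormat(discretized);
        Instances missingRemoved = Filter.useFilter(discretized, remove);

        // c) Alternatively, replace the missing values instead of removing instances.
        ReplaceMissingValues replace = new ReplaceMissingValues();
        replace.setInputFormat(discretized);
        Instances missingReplaced = Filter.useFilter(discretized, replace);

        // Report how many instances survive each treatment.
        System.out.println("Missing removed:  " + missingRemoved.numInstances() + " instances");
        System.out.println("Missing replaced: " + missingReplaced.numInstances() + " instances");
    }
}

Whichever way you produce them, remember to save each version as a separate .arff file so that you still have all three.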

  2. Run WEKA classifiers on the applicable data sets, as specified below, thus producing thirteen different models of the community data (you should already have 2 of these from Assignment #3). Run each classifier with the 10-fold cross-validation test option, predicting the drug admissions attribute, and examine the bottom section of the classifier output (in the Classifier output window). Use default options for all runs. A scripted alternative is sketched after the table. You should have:

 

                                          Data
Algorithm            Fully Discretized        Fully Discretized,        Fully Discretized,
                                              Missing Values Removed    Missing Values Replaced
-----------------    ---------------------    ----------------------    -----------------------
ID3                  Not Applicable           Yes                       Yes
J48                  Yes                      Yes                       Yes
Prism                Not Applicable           Yes                       Yes
OneR                 Done in Assignment #3    Yes                       Yes
NaiveBayesSimple     Done in Assignment #3    Yes                       Yes
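
If you want to script a run (or double-check what you see in the Explorer), WEKA’s Evaluation class performs the same 10-fold cross-validation. The sketch below is a minimal example, assuming the WEKA 3.x Java API; the file name and the class index are placeholders for your own data set and the drug admissions attribute, and J48 stands in for whichever of the five classifiers you are running with its default options.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class RunCrossValidation {
    public static void main(String[] args) throws Exception {
        // Load one of the prepared data sets (file name is a placeholder).
        Instances data = new Instances(new BufferedReader(new FileReader("community_replaced.arff")));

        // Set the class attribute (placeholder: adjust to the drug admissions attribute).
        data.setClassIndex(data.numAttributes() - 1);

        // 10-fold cross-validation with default classifier options.
        J48 classifier = new J48();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(classifier, data, 10, new Random(1));

        // The summary includes "Correctly Classified Instances", the accuracy
        // figure requested for the results table.
        System.out.println(eval.toSummaryString());
        System.out.println("Accuracy: " + eval.pctCorrect() + " %");
    }
}

The other classifiers used in this assignment should be found under weka.classifiers.trees (Id3), weka.classifiers.rules (Prism, OneR), and weka.classifiers.bayes (NaiveBayesSimple).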

  3. Save the results of each run into an appropriately named file (a command-line alternative is noted after these steps):

a)      Under the Result list, right-click on the result you want to save

b)      Choose “Save result buffer”

c)      Specify the file name and location in the file chooser window that opens (end your file name in .txt to make it easy to open later)
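
If you happen to run a classifier from the command line rather than the Explorer, you can capture the same output by redirecting it to a text file. A hypothetical example (the file names and class index are placeholders, and weka.jar must be on your classpath):

java weka.classifiers.trees.J48 -t community_replaced.arff -c 101 -x 10 > j48_replaced.txt

Here -t names the training file, -c the class attribute index, and -x the number of cross-validation folds.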

  4. Collect information about the accuracy of each model: use correctly classified instances (as reported in the classifier output) as the measure of accuracy. Report it for all runs, preferably in a table (perhaps mirroring the table above).
  5. Answer the questions listed below.

a)      ID3 and Prism are more sophisticated algorithms than OneR and NaiveBayes in that they decide which attributes to use instead of using all of them (or just one). Does this sophistication pay off with better performance? Explain.

b)      Rank order the models ID3, J48, Prism, and OneR on the comprehensibility of the model generated. Explain your ordering.

c)      J48 is a more sophisticated algorithm than ID3 (which was covered in the book); it was developed by Quinlan later in his career. You don’t know the details of the algorithm, but looking at the results, does it seem to be a noticeable improvement? Explain.

d)      ID3 and Prism cannot handle missing values (at all).  One of the criteria for saying that an algorithm is robust is that it can handle missing values without being negatively affected. For the algorithms that can handle missing values, comment on how well they perform when faced with missing values, as opposed to when there are no missing values.

e)      Both approaches to eliminating missing values may seem questionable (removing instances reduces the amount of usable data; replacing them with made-up values may seem unjustifiable). Do you think one seems to work better than the other? Explain.

f)       Considering that the number of missing values is fairly low, do you think the algorithms’ performance would hold up with twice as many missing values? (think about how the algorithms that you know work)

g)      If you were an interested business person (not an IT person), of the models generated by ID3 and Prism, which do you think would be more useful?  Explain.

h)      If you were a technical person (e.g. in IT), do you think you could get something useful out of the model generated by J48 if you were trying to learn about the data based on these runs? Explain.

i)        Do you trust/believe in the models that were generated? Do they fit your expectations enough that you would be willing to accept patterns that you don’t already know about? Explain.

Turn in:

§         Disk with results files

§         Table summarizing results