CSC 470 Data Mining Fall 2005
Assignment 4 Data Mining Using WEKA 10 points
Assigned: 10/13/05
Due: 11/08/05 AT THE START OF CLASS.
IF NOT TURNED IN AT THE START OF CLASS, YOU WILL BE CONSIDERED LATE!!!
Task: Prepare three data sets:
a) One data set, which you already have prepared for Assignments 1 & 3:
§ with nominal-only attributes, produced by discretizing all numeric attributes (findNumBins=true, bins=5). We need the fully discretized data set because ID3 and Prism cannot handle numeric attributes.
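If it helps to see what discretization is doing, here is a rough Python sketch of equal-width binning into 5 bins. This is illustrative only: the sample values are made up, and WEKA's Discretize filter handles options and edge cases differently.

```python
# Illustrative sketch of equal-width discretization into 5 bins,
# analogous to (but not identical to) WEKA's unsupervised Discretize filter.

def discretize_equal_width(values, bins=5):
    """Map each numeric value to a nominal bin label 'bin1'..'bin5'."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1  # avoid zero width when all values are equal
    labels = []
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # clamp the max value into the last bin
        labels.append(f"bin{idx + 1}")
    return labels

print(discretize_equal_width([1.0, 2.5, 4.0, 9.9, 10.0]))
# -> ['bin1', 'bin1', 'bin2', 'bin5', 'bin5']
```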
b) One based on the above, with all instances that have any missing values eliminated. You may produce this any way you want, but WEKA provides a filter for exactly this. Under the “Preprocess” tab, first determine which attributes (by #) have missing values (I think there are 11 of these), then find and choose the filter “RemoveWithValues” under unsupervised > instance. You can only filter based on one attribute at a time. Change the options to:
§ attributeIndex = the # of the attribute you are filtering on;
§ matchMissingValues = True;
§ invertSelection = True.
Repeat until no missing values remain – you may need fewer than 11 removes, since a record with missing values in more than one attribute is removed the first time any of those attributes is filtered.
Make sure you Save to a different filename than the file in a) so that you still have both!
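Conceptually, the repeated RemoveWithValues passes amount to the following one-step operation, sketched here in Python. The '?' marker is WEKA's ARFF symbol for a missing value; the sample instances are made up.

```python
# Illustrative sketch: drop every instance that has a missing value ('?')
# in any attribute -- the net effect of the repeated RemoveWithValues passes.

def drop_missing(instances):
    """Keep only instances with no missing ('?') values."""
    return [row for row in instances if '?' not in row]

data = [
    ['sunny', 'hot',  'no'],
    ['rainy', '?',    'yes'],   # has a missing value -> removed
    ['sunny', 'mild', 'yes'],
]
print(drop_missing(data))  # the second instance is gone
```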
c) One based on a) above, but with values substituted for the missing values. Do this in WEKA using the unsupervised attribute filter “ReplaceMissingValues”.
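For nominal attributes, replacing a missing value generally means filling it with the attribute's most frequent (modal) value. A rough Python sketch of that idea follows; the sample data is illustrative, and WEKA's ReplaceMissingValues filter also handles numeric attributes (using means), which this sketch does not.

```python
# Illustrative sketch: fill each missing nominal value ('?') with the
# most frequent value of that attribute (column).
from collections import Counter

def replace_missing(instances):
    """Replace '?' in each column with that column's modal value."""
    cols = list(zip(*instances))
    modes = [Counter(v for v in col if v != '?').most_common(1)[0][0] for col in cols]
    return [[modes[j] if v == '?' else v for j, v in enumerate(row)] for row in instances]

data = [
    ['sunny', 'hot'],
    ['?',     'mild'],
    ['sunny', '?'],
    ['rainy', 'hot'],
]
print(replace_missing(data))
# -> [['sunny', 'hot'], ['sunny', 'mild'], ['sunny', 'hot'], ['rainy', 'hot']]
```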
| Algorithm        | Fully Discretized     | Fully Discretized, Missing Values Removed | Fully Discretized, Missing Values Replaced |
|------------------|-----------------------|-------------------------------------------|--------------------------------------------|
| ID3              | Not Applicable        | Yes                                       | Yes                                        |
| J48              | Yes                   | Yes                                       | Yes                                        |
| Prism            | Not Applicable        | Yes                                       | Yes                                        |
| OneR             | Done in Assignment #3 | Yes                                       | Yes                                        |
| NaiveBayesSimple | Done in Assignment #3 | Yes                                       | Yes                                        |
To save your results:
a) In the Result list, right-click on the result you want to save
b) Choose “Save result buffer”
c) Specify the file name and location in the file chooser window that opens (end your file name in .txt to make it easy to open later)
Questions:
a) ID3 and Prism are more sophisticated algorithms than OneR and NaiveBayes in that they decide which attributes to use instead of using all of them (or just one). Does this sophistication pay off in better performance? Explain.
b) Rank order the models ID3, J48, Prism, and OneR on the comprehensibility of the model generated. Explain your ordering.
c) J48 is a more sophisticated algorithm than ID3 (which was covered in the book) – developed by Quinlan later in his career. You don’t know the details of the algorithm, but looking at the results, does it seem to be a noticeable improvement? Explain.
d) ID3 and Prism cannot handle missing values (at all). One of the criteria for saying that an algorithm is robust is that it can handle missing values without being negatively affected. For the algorithms that can handle missing values, comment on how well they perform when faced with missing values, as opposed to when there are no missing values.
e) Both approaches to eliminating missing values may seem questionable (removing instances reduces the amount of usable data, replacing them with made up values may seem unjustifiable). Do you think one seems to work better than the other? Explain.
f) Considering that the number of missing values is fairly low, do you think the algorithms’ performance would hold up with twice as many missing values? (think about how the algorithms that you know work)
g) If you were an interested business person (not an IT person), of the models generated by ID3 and Prism, which do you think would be more useful? Explain.
h) If you were a technical person (e.g. in IT), do you think that you could get something useful out of the model generated by J48 if you were trying to learn about the data based on these runs? Explain.
i) Do you trust/believe in the models that were generated? Do they fit your expectations enough that you would be willing to accept patterns that you don’t already know about? Explain.
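As background for the questions about attribute selection (e.g. a and c): ID3 picks, at each node, the attribute giving the highest information gain. A minimal Python sketch of that criterion follows; the toy data and names are illustrative, not from the assignment data set.

```python
# Illustrative sketch of ID3's attribute-selection criterion: information gain,
# i.e. the reduction in class entropy achieved by splitting on an attribute.
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr_index, labels):
    """Entropy reduction from splitting rows on the attribute at attr_index."""
    base = entropy(labels)
    subsets = {}
    for row, y in zip(rows, labels):
        subsets.setdefault(row[attr_index], []).append(y)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return base - remainder

# A perfectly predictive attribute yields gain equal to the full class entropy:
print(info_gain([['a'], ['a'], ['b'], ['b']], 0, ['yes', 'yes', 'no', 'no']))  # -> 1.0
```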
Turn in:
§ Disk with results files
§ Table summarizing results