CIS 658 Data Mining Fall 2007

Assignment 4                                        Data Mining Using WEKA – Experimental Analysis                   10 points

Assigned: 11/29/07

Due:         12/06/07   AT THE START OF CLASS.  

IF NOT TURNED IN AT THE START OF CLASS, YOU WILL BE CONSIDERED LATE!!!

(I plan on this being mostly done in class time on 11/29)

 

Task:

  1. For these experiments you need two data sets (both based on the community data you worked with in previous assignments – two of the three datasets used in Assignment 3):

a)       One data set, which you have already prepared for Assignment 1:

§  Community dataset with all attributes discretized – i.e. the second dataset produced

b)       One data set, which you have already prepared for Assignment 3:

§  Community dataset with nominal attributes only and with instances containing missing values removed – i.e. one of the most recent datasets produced. (A sketch of reproducing these preparation steps via the WEKA API appears below.)
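
For reference, the same preparation can be reproduced with the WEKA Java API. This is only an illustrative sketch: the file names are placeholders, it uses the unsupervised Discretize filter for (a) and Instances.deleteWithMissing() for (b), and it derives both files from a single raw arff purely for brevity – your actual files should follow the exact steps from Assignments 1 and 3.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;

import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

public class PrepareCommunityData {
    public static void main(String[] args) throws Exception {
        // Load the raw community data (placeholder file name).
        Instances raw = new Instances(new BufferedReader(new FileReader("community.arff")));

        // (a) Discretize all numeric attributes (unsupervised, default settings).
        Discretize disc = new Discretize();
        disc.setInputFormat(raw);
        Instances discretized = Filter.useFilter(raw, disc);
        save(discretized, "community-discretized.arff");

        // (b) Remove every instance that has a missing value on any attribute.
        Instances noMissing = new Instances(discretized);
        for (int i = 0; i < noMissing.numAttributes(); i++) {
            noMissing.deleteWithMissing(i);
        }
        save(noMissing, "community-nominal-nomissing.arff");
    }

    private static void save(Instances data, String fileName) throws Exception {
        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File(fileName));
        saver.writeBatch();
    }
}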

  2. Set up the Weka Experimenter environment to use the above datasets with six variations of the instance-based algorithm (IBk) – k = 1, 5, 10, 20, 30, and 40, each with distance weighting = 1-distance. Make sure to save the results into an arff file. I demonstrated use of the Experimenter environment in class, and quick instructions are posted on my www site under ‘Review’. Save the experiment setup into a configuration file so that it can easily be repeated.  You should have:

 

Algorithm       Non-Discretized Community       Discretized, No Missing, Community
IBK - K=1       Yes                             Yes
IBK - K=5       Yes                             Yes
IBK - K=10      Yes                             Yes
IBK - K=20      Yes                             Yes
IBK - K=30      Yes                             Yes
IBK - K=40      Yes                             Yes

      Run the Weka Experimenter on these combinations. It will take ‘a while’ (< 1 hour). An optional sanity check using the WEKA Java API is sketched below.
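
(Optional.) If you want to sanity-check a single cell of the table, the same classifier settings can be run from the WEKA Java API as sketched below. The file name is a placeholder, and a single 10-fold cross-validation run will not exactly match the Experimenter’s default of 10 repetitions of 10-fold cross-validation, so expect small differences in percent correct.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.Utils;

public class IBkSweep {
    public static void main(String[] args) throws Exception {
        // Placeholder file name - point this at either community arff file.
        Instances data = new Instances(
            new BufferedReader(new FileReader("community-discretized.arff")));
        data.setClassIndex(data.numAttributes() - 1);

        int[] ks = {1, 5, 10, 20, 30, 40};
        for (int k : ks) {
            IBk ibk = new IBk();
            // -K sets the number of neighbors; -F weights neighbors by 1 - distance.
            ibk.setOptions(Utils.splitOptions("-K " + k + " -F"));

            // One run of 10-fold cross-validation (the Experimenter averages 10 such runs by default).
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(ibk, data, 10, new Random(1));
            System.out.printf("IBk k=%d: %.2f%% correct%n", k, eval.pctCorrect());
        }
    }
}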

  3. Go to the Analyze tab. After opening the results file, click “Perform Test” and the results should be displayed. Report, in a table like the one above, the percent correct for each combination. IBK with K=1 serves as the “baseline” of the experiment (assuming that is the first algorithm you specify). Indicate in the results table which algorithms are statistically significantly different from the baseline (better or worse).
  4. Answer the questions listed below.

a)       Do you think we’ve reached the point where having more neighbors doesn’t help? Explain.

b)       Do you think that using more neighbors is more beneficial for one dataset than the other? Explain.

c)       Using WordPad or Excel, look at the results arff file. Obviously, it contains a lot more information than is shown via “Perform Test.” What might the more detailed results enable, supposing you imported them into Excel? I’m interested in some specific benefits. (An optional sketch for converting the results arff to CSV appears after these questions.)
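
(Optional.) If Excel will not open the arff file directly, one way to get the results into a spreadsheet is to convert them to CSV with WEKA’s CSVSaver, as sketched below; both file names are placeholders.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;

import weka.core.Instances;
import weka.core.converters.CSVSaver;

public class ResultsToCsv {
    public static void main(String[] args) throws Exception {
        // Placeholder names: the arff the Experimenter wrote, and the CSV for Excel.
        Instances results = new Instances(
            new BufferedReader(new FileReader("experiment-results.arff")));

        CSVSaver saver = new CSVSaver();
        saver.setInstances(results);
        saver.setFile(new File("experiment-results.csv"));
        saver.writeBatch();
    }
}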

Turn in:

§  Zip file with

o    data files,

o    results arff file,

o    experiment configuration file,

o    table summarizing results (either Excel or Word).