CSC 470 Data Mining Fall 2005

Assignment 2: Data Mining Programming

20 points (double the value of the other assignments, because it will be a lot more work).

 

Assigned: 09/22/05

Due: 11/03/05 AT THE START OF CLASS.

IF NOT TURNED IN AT THE START OF CLASS, YOU WILL BE CONSIDERED LATE!!!

Note that at least Assignment 3 will also be assigned during this time frame, and this task will probably be challenging, so DON’T PUT THIS OFF!!

Groups: You may work alone or in pairs for this assignment.

 

Task:

  1. Write your own version of the OneR algorithm in any programming language you choose.
  2. To receive an A grade on this assignment, your program needs to be able to read from ARFF files (any ARFF file, subject to the simplifications discussed below). However, I encourage you to put a higher priority on the core algorithm. It is better to have a program that carries out OneR but cannot read an ARFF file than one that attempts to open an ARFF file and fails without displaying any ability to do OneR.
  3. Your program should divide the data into training and test sets, and report the accuracy of its predictions on the test data. It should also report the rule (tree stub) learned.
  4. I am willing to offer up to 20% extra credit for successfully carrying out full 10-fold cross validation and displaying a confusion matrix. However, I am only offering this if the rest of your program basically works (few, if any, mistakes). Focus on your main task first.
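
For orientation, the core of OneR described in task 1 can be sketched in a few lines of Python. This is only one possible sketch, not a reference solution. The function names (`one_r`, `predict`, `accuracy`) and the choice to break ties in favor of the first attribute are assumptions made for illustration. Each row is assumed to be a list of nominal values with the class in the last position.

```python
from collections import Counter, defaultdict


def one_r(rows, verbose=False):
    """Learn a one-attribute rule from rows whose last value is the class.

    Returns (attribute_index, rule, default_class, training_errors), where
    rule maps each value of the chosen attribute to a predicted class.
    """
    n_attrs = len(rows[0]) - 1
    # Fallback prediction for attribute values never seen in training.
    default_class = Counter(r[-1] for r in rows).most_common(1)[0][0]
    best = None
    for a in range(n_attrs):
        # Count class frequencies for every value of attribute a.
        counts = defaultdict(Counter)
        for r in rows:
            counts[r[a]][r[-1]] += 1
        # Each attribute value predicts its most frequent class.
        rule = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        # Errors are the instances that fall outside the majority class.
        errors = sum(sum(c.values()) - c[rule[v]] for v, c in counts.items())
        if verbose:
            print(f"attribute {a}: {errors} errors on {len(rows)} instances")
        # Strict '<' keeps the first attribute when error counts tie.
        if best is None or errors < best[2]:
            best = (a, rule, errors)
    return best[0], best[1], default_class, best[2]


def predict(model, row):
    """Predict the class of one row using a model returned by one_r."""
    attr, rule, default_class, _ = model
    return rule.get(row[attr], default_class)


def accuracy(model, rows):
    """Fraction of rows whose class the model predicts correctly."""
    return sum(predict(model, r) == r[-1] for r in rows) / len(rows)
```

Calling `one_r(train_rows, verbose=True)` prints the error count per attribute, which is one way to provide the intermediate output requested in the notes.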

Simplifications:

  1. The data includes only nominal attributes (as OneR expects). You need not discretize numeric attributes.
  2. The data contains no missing values.
  3. The last attribute in the data is ALWAYS the one to predict. No user options for anything else are needed.
  4. As mentioned in #4 under Task, cross validation is not necessary; do one training run and one test run (90% of the data for training, 10% for testing).
  5. As mentioned in #4 under Task, producing a confusion matrix as output is not necessary. Accuracy alone is sufficient.
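
The single 90%/10% split and the accuracy report from the list above might look roughly like the following Python sketch. The names `split_train_test` and `evaluate`, the fixed random seed, and the decision to shuffle before splitting are illustrative assumptions rather than requirements of the assignment. The optional confusion matrix is shown as a simple count of (actual, predicted) pairs.

```python
import random
from collections import Counter


def split_train_test(rows, test_fraction=0.1, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = rows[:]
    # A seeded generator makes the split repeatable between runs.
    random.Random(seed).shuffle(shuffled)
    # Keep at least one test instance even for very small data sets.
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def evaluate(predict_fn, test_rows):
    """Return (accuracy, confusion) for a row -> class prediction function.

    confusion is a Counter keyed by (actual_class, predicted_class).
    """
    confusion = Counter((r[-1], predict_fn(r)) for r in test_rows)
    correct = sum(n for (actual, predicted), n in confusion.items()
                  if actual == predicted)
    return correct / len(test_rows), confusion
```

Here `predict_fn` stands for whatever function your learned rule exposes for classifying a single row; the sketch does not depend on how the rule itself is represented.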

 

Notes:

  1. Your program should give intermediate output showing what is happening as it runs, so that I can follow its behavior and performance.
  2. If your program doesn’t read from ARFF files, it should read from some general file format, so that your program can be tried on more than one set of data.
  3. If you plan on using Java, but do not have much background in reading from files, I will try to post a simple file reading example on my web page.
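
As a rough illustration of the file reading discussed in these notes, here is a minimal sketch of a reader for the simplified ARFF files allowed in this assignment. It is written in Python rather than Java, and it is not the example promised in note 3. The function name `read_arff` is hypothetical, and the sketch assumes nominal attributes only, no missing values, and no quoted names or values containing spaces or commas.

```python
def read_arff(lines):
    """Parse simplified ARFF text given as an iterable of lines.

    Returns (attributes, rows): attributes is a list of
    (name, [allowed values]) pairs and rows is a list of value lists.
    """
    attributes = []
    rows = []
    in_data = False
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and '%' comment lines.
        if not line or line.startswith("%"):
            continue
        lower = line.lower()  # ARFF keywords are case-insensitive
        if in_data:
            # After @data, every line is one comma-separated instance.
            rows.append([v.strip() for v in line.split(",")])
        elif lower.startswith("@attribute"):
            # Expected form: @attribute name {value1, value2, ...}
            head, _, values = line.partition("{")
            name = head.split()[1]
            allowed = [v.strip() for v in values.rstrip("}").split(",")]
            attributes.append((name, allowed))
        elif lower.startswith("@data"):
            in_data = True
        # Other header lines such as @relation are ignored here.
    return attributes, rows
```

With an open file, usage could be as simple as `attributes, rows = read_arff(open("weather.arff"))`, where the file name is only a placeholder. The last entry of `attributes` is then the class to predict.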

Miscellaneous: You must develop code of your own for this assignment. You may not copy or derive your program from existing programs or from classmates’ programs. The whole idea is to understand an algorithm well enough to program it, and to deepen your understanding of data mining. Many things can be done in data mining with existing software, but to push the envelope you need to write your own programs. Being able to program existing methods is a foundation for being able to design and write your own new data mining methods.

Turn in:

  - Disk with all files necessary to run your program.
  - Simple instructions for what I need to do to run your program.