CSC 470 Data Mining Fall 2005
Assignment 2: Data Mining Programming
20 points: double the value of other assignments, because it will be a lot
more work.
Assigned: 09/22/05
Due: 11/03/05 AT THE START OF CLASS.
IF NOT TURNED IN AT THE START OF CLASS, YOU WILL BE CONSIDERED LATE!!!
Note that at least Assignment 3 will also be assigned during this time frame,
and this task will probably be challenging, so DON’T PUT THIS OFF!!
Groups: You may work alone or in pairs for this assignment.
Task:
- Write your own version of
the OneR algorithm in any programming language you choose.
- In order to receive an A
grade on this assignment, your program needs to be able to read from ARFF
files (any ARFF file, subject to the simplifications discussed below).
However, I encourage you to put a higher priority on the core algorithm. It is
better to have a program that carries out OneR but cannot read an ARFF file
than to have one that attempts to open an ARFF file and fails, without
displaying any ability to do OneR.
- Your program should divide
the data into training and test sets, and report the accuracy of its
predictions on the test data. It should also report the Rule/Tree Stub learned.
- I am willing to offer up to
20% extra credit for successfully carrying out full 10-fold cross-validation
and displaying a confusion matrix. However, I am only offering
this if the rest of your program basically works (few, if any, mistakes).
Focus on your main task first.
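For orientation only, the core of OneR can be sketched as follows. This is a minimal illustration in Python, not a reference solution: it assumes each instance is a tuple of nominal values with the class in the last position, and the names one_r and predict are hypothetical. For each attribute it builds a rule mapping every value to its most frequent class, counts the training errors of that rule, and keeps the attribute with the fewest errors. Ties are broken by whichever attribute or class is seen first, which is an arbitrary choice.

```python
from collections import Counter, defaultdict

def one_r(rows):
    """Learn a OneR rule from rows of nominal values (class is the last column).

    Returns (attribute_index, {value: predicted_class}, default_class).
    """
    n_attrs = len(rows[0]) - 1
    # Overall majority class, used for attribute values never seen in training.
    default_class = Counter(r[-1] for r in rows).most_common(1)[0][0]
    best = None  # (errors, attribute_index, rule)
    for a in range(n_attrs):
        # Count class frequencies for each value of attribute a.
        counts = defaultdict(Counter)
        for r in rows:
            counts[r[a]][r[-1]] += 1
        # Each value predicts its most frequent class.
        rule = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        # Training errors: instances outside the majority class of their value.
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        # Intermediate output so the run can be followed.
        print(f"attribute {a}: rule={rule} errors={errors}/{len(rows)}")
        if best is None or errors < best[0]:
            best = (errors, a, rule)
    _, attr, rule = best
    return attr, rule, default_class

def predict(model, row):
    """Apply a learned rule to one instance (with or without its class value)."""
    attr, rule, default_class = model
    return rule.get(row[attr], default_class)
```

Calling one_r on a list of training tuples returns the chosen attribute index and its value-to-class rule, which together are the one-level tree to report.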
Simplifications:
- The data includes only nominal
attributes (as OneR expects). You need not handle
numeric attributes by discretizing data.
- No missing values in data.
- The last attribute in the
data is ALWAYS the one to predict. No user options for anything
else are needed.
- As mentioned in the fourth item under
Task, cross-validation is not necessary; do one training run and one test run
(90% of the data for training, 10% for testing).
- As mentioned in the fourth item under
Task, producing a confusion matrix as output is not necessary. Accuracy
alone is sufficient.
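The evaluation steps described above might look something like the following Python sketch. All names here (split_train_test, accuracy, confusion_matrix, k_folds) are hypothetical, and predict_fn stands for whatever function your classifier exposes: it is assumed to take the attribute values of one instance and return a class label. The split shuffles with a fixed seed so runs are repeatable. The fold generator is a plain, unstratified split, shown only for those attempting the extra credit.

```python
import random
from collections import Counter

def split_train_test(rows, test_fraction=0.1, seed=0):
    """Shuffle a copy of rows; hold out test_fraction of them as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(predict_fn, test_rows):
    """Fraction of test instances whose predicted class matches the actual one."""
    correct = sum(predict_fn(r[:-1]) == r[-1] for r in test_rows)
    return correct / len(test_rows)

def confusion_matrix(predict_fn, test_rows):
    """Count (actual, predicted) pairs over the test instances."""
    return Counter((r[-1], predict_fn(r[:-1])) for r in test_rows)

def k_folds(rows, k=10, seed=0):
    """Yield (train, test) pairs for k-fold cross-validation (not stratified)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    for i in range(k):
        test = shuffled[i::k]  # every k-th instance, starting at offset i
        train = [r for j, r in enumerate(shuffled) if j % k != i]
        yield train, test
```

With 10-fold cross-validation, the overall accuracy is usually reported as total correct predictions over all folds divided by the total number of instances.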
Notes:
- Your program should give
intermediate output showing what is happening as it runs, so that I can
follow its behavior and performance.
- If your program doesn’t
read from ARFF files, it should read from files in some general format,
so that your program can be tried on more than one set of data.
- If you plan on using Java,
but do not have much background in reading from files, I will try to post
a simple file reading example on my WWW page.
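As a rough idea of what reading a simplified ARFF file involves, here is a minimal sketch in Python. It is an illustration under the simplifications listed earlier (nominal attributes only, no missing values, last attribute is the class), not a full ARFF parser: it assumes attribute names contain no spaces, values contain no commas, and the data section is not in sparse format. The function name parse_arff is hypothetical.

```python
def parse_arff(lines):
    """Parse a simplified ARFF file given as an iterable of text lines.

    Returns (attribute_names, rows); each row is a tuple of strings and
    the last attribute is treated as the class.
    """
    names, rows, in_data = [], [], False
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        lower = line.lower()  # ARFF keywords are case-insensitive
        if in_data:
            # One instance per line, values separated by commas.
            rows.append(tuple(v.strip().strip("'\"") for v in line.split(",")))
        elif lower.startswith("@attribute"):
            # "@attribute name {v1, v2, ...}": keep only the name here.
            names.append(line.split(None, 2)[1].strip("'\""))
        elif lower.startswith("@data"):
            in_data = True
        # Other header lines, such as @relation, are ignored.
    return names, rows
```

An open file object can be passed directly, since iterating over it yields lines; the resulting rows can then be divided into training and test sets.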
Miscellaneous: You must develop code of your own for
this assignment. You may not copy or derive your program from previously
existing programs or from classmates’ programs. The whole idea is to understand an
algorithm well enough to program it, and to increase depth of understanding of
data mining. Many things can be done in data mining with existing software,
but to push the envelope you need to write your own programs. Being able to
program existing methods is a foundation for being able to design and write
your own new data mining methods.
Turn in:
- Disk with all files necessary to
run your program.
- Simple instructions of what I need
to do to run your program.