CSC 470 Data Mining, Spring 2004
Assignment 3: Data Mining Programming
20 points (double the other assignments, because it will be a lot more work).
Assigned: 02/26/04
Due: 03/25/04 AT THE START OF CLASS.
IF NOT TURNED IN AT THE START OF CLASS, YOU WILL BE CONSIDERED LATE!!!
Note that at least Assignment 4 will also be assigned during this time frame, and this task will probably be challenging, so DON’T PUT THIS OFF!!
Groups: You may work alone or in pairs for this assignment.
Task:
- Write your own version of the OneR algorithm in any programming language you choose.
- In order to receive an A grade on this assignment, your program needs to be able to read from ARFF files (any ARFF file, subject to the simplifications discussed below). However, I encourage you to put a higher priority on the core algorithm. It is better to have a program that carries out OneR but cannot read an ARFF file than to have one that attempts to open an ARFF file and fails, not displaying any ability to do OneR.
- Your program should divide the data into training and test sets, and report the accuracy of its predictions on the test data.
- I am willing to offer up to 20% extra credit for successfully carrying out full 10-fold cross-validation and displaying a confusion matrix. However, I am only offering this if the rest of your program basically works (few, if any, mistakes). Focus on your main task first.
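For orientation only, the core OneR idea can be sketched as follows. This is a minimal Python illustration under stated assumptions, not a reference solution: the function names `one_r` and `predict`, the tie-breaking choice (the first attribute with the fewest errors wins), the fallback to the overall majority class for unseen values, and the toy rows are all hypothetical choices made for this sketch.

```python
from collections import Counter, defaultdict

def one_r(rows):
    """Train OneR on rows of nominal values; the last value in each row is the class.

    Returns (attribute_index, rules, default_class), where rules maps each
    value of the chosen attribute to the class it predicts.
    """
    n_attrs = len(rows[0]) - 1
    # Assumption: unseen attribute values fall back to the overall majority class.
    default_class = Counter(row[-1] for row in rows).most_common(1)[0][0]
    best = None  # (errors, attribute_index, rules)
    for a in range(n_attrs):
        # Count class frequencies for each value of attribute a.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[a]][row[-1]] += 1
        # Each value predicts its most frequent class.
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        # Errors are the rows that do not belong to the predicted class.
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        print(f"attribute {a}: rules={rules} errors={errors}/{len(rows)}")
        # Assumption: on a tie, the earlier attribute is kept.
        if best is None or errors < best[0]:
            best = (errors, a, rules)
    _, attr, rules = best
    return attr, rules, default_class

def predict(model, row):
    """Predict the class of one row using the single chosen attribute."""
    attr, rules, default_class = model
    return rules.get(row[attr], default_class)

# Hypothetical toy data, not taken from any course file.
toy_rows = [
    ("sunny", "high", "no"),
    ("sunny", "normal", "yes"),
    ("rainy", "high", "no"),
    ("rainy", "normal", "yes"),
    ("overcast", "high", "yes"),
]
model = one_r(toy_rows)
print("chosen attribute index:", model[0])
print("prediction:", predict(model, ("sunny", "normal", "?")))
```

The per-attribute print statement is one way to show intermediate output while the program runs.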
Simplifications:
- The data includes only nominal attributes (as OneR expects). You need not handle numeric attributes by discretizing data.
- The data contains no missing values.
- The last attribute in the data is ALWAYS the one to predict. No user options for anything else are needed.
- As mentioned in the extra-credit item under Task, cross-validation is not necessary; do one training run and one test run (90% of the data for training, 10% for testing).
- As mentioned in the extra-credit item under Task, producing a confusion matrix as output is not necessary. Accuracy alone is sufficient.
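The single 90%/10% split and the accuracy report described above could look roughly like the sketch below. It is only an illustration: the helper names `split_train_test` and `accuracy`, the fixed random seed, and the decision to shuffle before splitting are assumptions of this sketch, and the always-"yes" predictor merely stands in for a trained model.

```python
import random

def split_train_test(rows, test_fraction=0.10, seed=0):
    """Shuffle a copy of rows and hold out roughly test_fraction of them as test data."""
    shuffled = list(rows)
    # Assumption: a fixed seed, so repeated runs give the same split.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[:-n_test], shuffled[-n_test:]

def accuracy(predict_fn, test_rows):
    """Fraction of test rows whose last value (the class) matches predict_fn(row)."""
    correct = sum(1 for row in test_rows if predict_fn(row) == row[-1])
    return correct / len(test_rows)

# Hypothetical data: 20 rows, each with one attribute and a class.
rows = [("a", "yes")] * 12 + [("b", "no")] * 8
train, test = split_train_test(rows)
print(f"{len(train)} training rows, {len(test)} test rows")
# Placeholder predictor that always answers "yes".
print("accuracy:", accuracy(lambda row: "yes", test))
```

Shuffling first avoids a test set that contains only the last few rows of a file, which are often sorted by class.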
Notes:
- Your program should give intermediate output showing what is happening as it runs, so that I can follow its behavior and performance.
- If your program doesn’t read from ARFF files, it should read from some sort of general file structure, so that your program can be tried on more than one set of data.
- If you plan on using Java, but do not have much background in reading from files, I will try to post a simple file-reading example on my WWW page.
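As a rough illustration of file reading under the simplifications above, the sketch below parses a cut-down ARFF layout in Python. It assumes nominal attributes only, no missing values, and no quoted names or values containing commas, so it is not a full ARFF parser; the function name `read_arff` and the sample text are hypothetical.

```python
def read_arff(lines):
    """Parse a simplified ARFF file: nominal attributes only, no quoting, no missing values.

    `lines` is any iterable of strings (for example an open file).
    Returns (attribute_names, rows).
    """
    names, rows, in_data = [], [], False
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()  # ARFF keywords are case-insensitive
        if in_data:
            # After @data, every line is one comma-separated instance.
            rows.append([v.strip() for v in line.split(",")])
        elif lower.startswith("@attribute"):
            # "@attribute outlook {sunny, overcast, rainy}" -> "outlook"
            names.append(line[len("@attribute"):].split("{")[0].strip())
        elif lower.startswith("@data"):
            in_data = True
    return names, rows

# Hypothetical sample text; a real program would pass an open file instead.
sample = """% toy example
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute play {yes, no}
@data
sunny,no
overcast,yes
""".splitlines()
names, rows = read_arff(sample)
print(names)
print(rows)
```

Because the function accepts any iterable of lines, the same code can be handed an open file object in place of the in-memory sample.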
Miscellaneous: You must develop code of your own for this assignment. You may not copy or derive your program from previously existing programs or from classmates’ programs. The whole idea is to understand an algorithm well enough to program it, and to increase your depth of understanding of data mining. Many things can be done in data mining with existing software, but to push the envelope you need to write your own programs. Being able to program existing methods is a foundation for being able to design and write your own new data mining methods.
Turn in:
- Disk with all files necessary to run your program.
- Simple instructions for what I need to do to run your program.