Test 1

CSC 470 Fall 2005

10/12/05 Midterm Exam Test Form C

Name:

Instructions:

Answer all questions on the answer sheet provided. On Multiple Choice, choose the ONE BEST ANSWER.

Remember to put the letter of your test form on the top of your answer sheet.

Hand in the Test, Answer Sheet, and Help Sheets, all with your name on them.

Multiple Choice

(3 points each)

1. Which of the following is true of the NaiveBayes algorithm?

A) cannot handle missing data in test examples

B) cannot handle numeric values for attributes that aren’t being predicted

C) is able to make numeric predictions

D) easily accommodates missing data in training examples

E) all of the above

F) none of the above

2. Which of the following is true about data mining?

A) in many real world data sets, some records are missing values for some attributes

B) in many real world data sets, there are some errors in the data

C) in many real world data sets, some advance preparation may be needed to create interesting new attributes

D) all of the above

E) none of the above

3. Which of the following is true about data mining?

A) simple algorithms sometimes work surprisingly well

B) different approaches work better for different data

C) successful data mining usually involves trying a number of approaches in a series of experiments

D) all of the above

E) none of the above

4. Which of the following is true about the OneR algorithm?

A) it considers all attributes

B) it chooses exactly one attribute to use in making predictions during tests

C) it evaluates decisions based on error rate on training data

D) all of the above

E) none of the above

Completion (fill in the blank; mostly key terms, not all of which are a single word)

(3 points each)

5. Many machine learning algorithms follow a(n) ________ approach to search: while searching for a solution, once they build part of the answer they never retract or reconsider that decision; they only move forward.

6. An algorithm is said to be ________ if it may look at the attribute to be predicted as it proceeds (during training or data preparation, NOT during testing).

7. In instance-based representation, predictions are made by finding the most similar training example(s) to a given test example; these examples are known as the ________.

8. In the NaiveBayes approach, the probability of a given hypothesis (without evidence) is known as the ________, since it would be the estimated probability before seeing any evidence.

9. With some data mining approaches, what the program has learned and how it is using it is incomprehensible to humans. These approaches can be considered ________ approaches.

10. An attribute whose values are taken from a limited set of possible values would be considered to be a(n) ________ attribute. For example, for loan approval, what the loan is for could have possible values of house, car, PC, etc.; the set would be finite and known.

True/False - If false, explain why.

(4 points each)

11. Normalization or standardization can easily be done in a spreadsheet program, such as Excel.

12. A disadvantage of instance-based approaches is that there is not really a structural pattern that a user can examine to determine a “take-home” message.

13. Data mining is too new a field to be practical; nobody has made any money from it yet.

14. Machine learning techniques are only suitable for predicting from a small number of categories; numeric prediction is not possible.

Short Answer

(6 points each)

15. Briefly explain the ways in which people play an important part in a data mining effort.

16. Briefly explain how data mining raises ethical issues related to discrimination.

Problems

(points as shown)

(10 points)

17. On the answer sheet, fill in all of the information concerning the results of an experiment, based on the confusion matrix below:

=== Confusion Matrix ===

 a  b   <-- classified as
 5  2   a = cancer
 7  4   b = not
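For checking a hand calculation, the usual summary figures for a confusion matrix like the one above can be computed with a short sketch. It assumes the common layout in which rows are actual classes and columns are predicted classes, and it treats "cancer" as the positive class; both are assumptions, as is the choice of which measures to report (the answer sheet may ask for others). The names `matrix` and `summarize` are hypothetical.

```python
# Confusion matrix from the problem.
# Assumption: rows = actual class, columns = predicted class.
matrix = {
    "cancer": {"cancer": 5, "not": 2},
    "not": {"cancer": 7, "not": 4},
}


def summarize(matrix, positive):
    """Return common evaluation measures, treating `positive` as the positive class."""
    classes = list(matrix)
    total = sum(sum(row.values()) for row in matrix.values())
    correct = sum(matrix[c][c] for c in classes)

    tp = matrix[positive][positive]                                   # true positives
    fn = sum(matrix[positive][c] for c in classes if c != positive)   # false negatives
    fp = sum(matrix[c][positive] for c in classes if c != positive)   # false positives
    tn = total - tp - fn - fp                                         # true negatives

    return {
        "total": total,
        "correct": correct,
        "accuracy": correct / total,
        "error_rate": (total - correct) / total,
        "tp": tp, "fn": fn, "fp": fp, "tn": tn,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


results = summarize(matrix, positive="cancer")
for name, value in results.items():
    print(f"{name}: {value}")
```

If the matrix is read the other way around (rows as predictions), the false positive and false negative counts swap, so the orientation should be confirmed before relying on these figures.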

(20 points)

18. Given the following tallies of training data, using NaiveBayes with the Laplace Estimator (not currently included in the tallies), what result (Purchase = Yes or No) will be predicted for the following test example? (Show your work!)

Area	Purchase = Yes	Purchase = No
Mt Airy	3	3
Germantown	4	2
Manyunk	0	5

Home	Purchase = Yes	Purchase = No
Own	4	4
Rent	3	6

Age	Purchase = Yes	Purchase = No
Young	5	1
Established	2	4
Middle Aged	0	3
Old	0	2

To Predict	Yes	No
Purchase	7	10

Test Instance: Mt Airy, Rent, Established
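A hand calculation for this problem can be checked with a minimal sketch like the one below. It assumes one common form of the Laplace estimator: add 1 to each attribute-value count and add the number of distinct values of that attribute to the class total. It also assumes the class prior is left unsmoothed; some implementations smooth the prior as well. The names `tallies`, `class_totals`, and `laplace_score` are hypothetical, and exact fractions are used so the intermediate values match work done by hand.

```python
from fractions import Fraction

# Training tallies from the problem: value -> (count for Yes, count for No).
tallies = {
    "Area": {"Mt Airy": (3, 3), "Germantown": (4, 2), "Manyunk": (0, 5)},
    "Home": {"Own": (4, 4), "Rent": (3, 6)},
    "Age": {
        "Young": (5, 1),
        "Established": (2, 4),
        "Middle Aged": (0, 3),
        "Old": (0, 2),
    },
}
class_totals = {"Yes": 7, "No": 10}
CLASS_INDEX = {"Yes": 0, "No": 1}


def laplace_score(instance, label):
    """Unnormalized Naive Bayes score for `label` with add-one smoothing."""
    idx = CLASS_INDEX[label]
    n = sum(class_totals.values())
    # Assumption: the prior is the raw class frequency (not smoothed).
    score = Fraction(class_totals[label], n)
    for attribute, value in instance.items():
        counts = tallies[attribute]
        # Add 1 to the count and the number of distinct values to the total.
        score *= Fraction(counts[value][idx] + 1,
                          class_totals[label] + len(counts))
    return score


instance = {"Area": "Mt Airy", "Home": "Rent", "Age": "Established"}
scores = {label: laplace_score(instance, label) for label in class_totals}
prediction = max(scores, key=scores.get)

normalizer = sum(scores.values())
for label, score in scores.items():
    print(f"{label}: score = {score}, normalized = {float(score / normalizer):.4f}")
print("Predicted Purchase =", prediction)
```

Using `Fraction` rather than floats keeps each factor visible as a ratio, which makes it easier to compare against written work line by line.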

(12 points)

19. The data below describe TV shows and whether or not a company buys advertisements during them. Use supervised discretization, as done by OneR, with N = 3 on the attribute Ratings to create the ranges that will be the resulting discrete values.

Type	Ratings	Time Slot	Buy?
Comedy	28	Early	Yes
Drama	30	Late	No
Comedy	27	Late	No
Reality	49	Early	No
News	29	Late	No
Comedy	40	Early	Yes
Drama	38	Late	No
Reality	48	Early	No
Comedy	35	Late	No
Comedy	53	Early	Yes
Drama	41	Late	No
Comedy	41	Early	Yes
News	33	Late	No
Comedy	25	Early	Yes
Comedy	30	Late	No
Comedy	21	Early	No
Reality	51	Early	No
Comedy	45	Early	No
Comedy	52	Early	Yes