CSC 470                                                                Fall 2005

10/12/05                    Midterm Exam                    Test Form C

 

Name:

 

Instructions:

    Answer all questions on the answer sheet provided. For Multiple Choice, choose the ONE BEST ANSWER.

    Remember to put the letter of your test form on the top of your answer sheet.

    Hand in the Test, Answer Sheet, and Help Sheets, each with your name on it.

 

 

Multiple Choice

(3 points each)

 

1.       Which of the following is true of the NaiveBayes algorithm?

A)      cannot handle missing data in test examples

B)      cannot handle numeric values for attributes that aren’t being predicted

C)      is able to make numeric predictions

D)      easily accommodates missing data in training examples

E)      all of the above

F)       none of the above

 

2.       Which of the following is true about data mining?

A)      in many real world data sets, some records are missing values for some attributes

B)      in many real world data sets, there are some errors in the data

C)      in many real world data sets, some advance preparation may be needed to create interesting new attributes

D)      all of the above

E)      none of the above

 

3.       Which of the following is true about data mining?

A)      simple algorithms sometimes work surprisingly well

B)      different approaches work better for different data

C)      successful data mining usually involves trying a number of approaches in a series of experiments

D)      all of the above

E)      none of the above

 

4.       Which of the following is true about the OneR algorithm?

A)      it considers all attributes

B)      it chooses exactly one attribute to use in making predictions during tests

C)      it evaluates decisions based on error rate on training data

D)      all of the above

E)      none of the above

 

 

Completion (fill in the blank; mostly key terms, not all of which are a single word)

(3 points each)

 

5.       Many machine learning algorithms follow a(n) ________ approach to search: once they build part of a solution, they never retract or reconsider that decision; they only move forward.

 

6.       An algorithm is said to be ________ if it may look at the attribute to be predicted as it proceeds (during training or data preparation, NOT testing).

 

7.       In instance-based representation, predictions are made by finding the most similar training example(s) to a given test example; these examples are known as the ________.

 

8.       In the NaiveBayes approach, the probability of a given hypothesis (without evidence) is known as the ________, since it would be the estimated probability before seeing any evidence.

 

9.       With some data mining approaches, what the program has learned and how it is using it is incomprehensible to humans. These approaches can be considered ________ approaches.

 

10.    An attribute whose values are taken from a limited set of possible values would be considered to be a(n) ________ attribute. For example, for loan approval, the purpose of the loan could have possible values of house, car, pc, etc.; the set would be finite and known.

 

 

True/False  - If false, explain why.

(4 points each)

 

11.    Normalization or standardization can easily be done in a spreadsheet program, such as Excel.

 

12.    A disadvantage of instance-based approaches is that there is not really a structural pattern that a user can examine to determine a “take-home” message.

 

13.    Data mining is too new a field to be practical; nobody has made any money from it yet.

 

14.    Machine learning techniques are only suitable for predicting from a small number of categories; numeric prediction is not possible.

 

 

Short Answer

(6 points each)

 

15.    Briefly explain the ways in which people play an important part in a data mining effort.

 

16.    Briefly explain how data mining raises ethical issues related to discrimination.

 

 

Problems

(points as shown)

 

 

(10 points)

17.    On the answer sheet, fill in all of the information concerning the results of an experiment, based on the confusion matrix below:

 

==== Confusion Matrix ====

a              b              <-- Classified As

5              2              a = cancer

7              4              b = not
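
As a study aid, the quantities usually read off a two-class confusion matrix can be sketched as below. This is a minimal sketch under stated assumptions: rows are taken to be the actual class and columns the predicted class (the usual reading of a "Classified As" header), "cancer" is treated as the positive class, and the function name summarize and the particular statistics chosen are illustrative, since the fields on the answer sheet are not shown here.

```python
def summarize(matrix):
    """Summarize a 2x2 confusion matrix.

    Assumption: matrix[i][j] counts instances whose actual class is i
    and whose predicted class is j, with class 0 as the positive class.
    """
    (tp, fn), (fp, tn) = matrix
    total = tp + fn + fp + tn
    return {
        "correct": tp + tn,                # instances on the diagonal
        "incorrect": fn + fp,              # instances off the diagonal
        "accuracy": (tp + tn) / total,     # fraction classified correctly
        "tp_rate": tp / (tp + fn),         # recall for the positive class
        "fp_rate": fp / (fp + tn),         # negatives wrongly called positive
        "precision": tp / (tp + fp),       # predicted positives that are right
    }


# The matrix from the question: a = cancer, b = not.
stats = summarize([[5, 2], [7, 4]])
for name, value in stats.items():
    print(name, round(value, 3))
```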

 


(20 points)

18.    Given the following tallies of training data, and using NaiveBayes with the Laplace Estimator (not currently included in the tallies), what result (Purchase = Yes or No) will be predicted for the following test example? (Show your work!)

 

Area            Purchase = Yes    Purchase = No
Mt Airy         3                 3
Germantown      4                 2
Manyunk         0                 5

 

Home            Purchase = Yes    Purchase = No
Own             4                 4
Rent            3                 6

 

Age             Purchase = Yes    Purchase = No
Young           5                 1
Established     2                 4
Middle Aged     0                 3
Old             0                 2

 

To Predict      Yes               No
Purchase        7                 10

 

Test Instance:  Mt Airy, Rent, Established
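
For checking hand calculations, the NaiveBayes computation with a Laplace estimator can be sketched as follows. This is a sketch under stated assumptions, not the grading key: the estimator adds 1 to every attribute-value count and adds the number of distinct values of that attribute to the denominator, and the class prior is left as the raw fraction. Some implementations also smooth the prior, so the course's convention may differ. The names tallies, class_counts, and laplace_score are illustrative.

```python
from fractions import Fraction

# Tallies copied from the tables above: attribute -> value -> class counts.
tallies = {
    "Area": {
        "Mt Airy": {"Yes": 3, "No": 3},
        "Germantown": {"Yes": 4, "No": 2},
        "Manyunk": {"Yes": 0, "No": 5},
    },
    "Home": {
        "Own": {"Yes": 4, "No": 4},
        "Rent": {"Yes": 3, "No": 6},
    },
    "Age": {
        "Young": {"Yes": 5, "No": 1},
        "Established": {"Yes": 2, "No": 4},
        "Middle Aged": {"Yes": 0, "No": 3},
        "Old": {"Yes": 0, "No": 2},
    },
}
class_counts = {"Yes": 7, "No": 10}


def laplace_score(instance, label):
    """Unnormalized NaiveBayes score for one class.

    Assumption: Laplace smoothing (+1 per value) is applied to the
    attribute counts only; the prior is the unsmoothed class fraction.
    """
    score = Fraction(class_counts[label], sum(class_counts.values()))
    for attribute, value in instance.items():
        values = tallies[attribute]
        numerator = values[value][label] + 1
        denominator = class_counts[label] + len(values)
        score *= Fraction(numerator, denominator)
    return score


instance = {"Area": "Mt Airy", "Home": "Rent", "Age": "Established"}
scores = {label: laplace_score(instance, label) for label in class_counts}
total = sum(scores.values())
prediction = max(scores, key=scores.get)

for label, score in scores.items():
    # Print the exact fraction and the normalized probability.
    print(label, score, float(score / total))
print("Predicted Purchase =", prediction)
```

Exact fractions are used so that each factor can be compared directly with the work shown by hand.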

 

 

(12 points)

19.    Data about TV shows, and whether or not a company buys advertisements during them, are shown below. Use supervised discretization, as done by OneR, with N = 3 on the attribute Ratings to create the ranges that will be the resulting discrete values.

 

Type        Ratings    Time Slot    Buy?
Comedy      28         Early        Yes
Drama       30         Late         No
Comedy      27         Late         No
Reality     49         Early        No
News        29         Late         No
Comedy      40         Early        Yes
Drama       38         Late         No
Reality     48         Early        No
Comedy      35         Late         No
Comedy      53         Early        Yes
Drama       41         Late         No
Comedy      41         Early        Yes
News        33         Late         No
Comedy      25         Early        Yes
Comedy      30         Late         No
Comedy      21         Early        No
Reality     51         Early        No
Comedy      45         Early        No
Comedy      52         Early        Yes