CSC 470                                                 Spring 2004

02/19/04                    Midterm Exam                    Test Form A

 

Name:

 

Instructions:

    Answer all questions on the answer sheet provided. On Multiple Choice, choose the ONE BEST ANSWER.

    Remember to put the letter of your test form on the top of your answer sheet.

    Hand in Test, Answer Sheet, and Help Sheets, all with your name.

 

 

Multiple Choice

(3 points each)

 

1.        Which of the following is true about “black-box” data mining approaches?

A)     produce a structural description that can be used by people

B)      organize examples into a cube shape that they are inside or outside of

C)      may not be as trusted by humans, since their methods are not visible

D)      all of the above

E)       none of the above

 

2.        Which of the following is true about data mining?

A)     graphical display of data can help to find problems in the data

B)      automatic methods are of such power and quality that human analysis of input data is not needed

C)      an IT person can generally carry out a data mining project without consultation with people from the “business-side” of the organization.

D)      All of the above

E)       None of the above

 

3.        Which of the following is true about numeric prediction?

A)     regression trees have a regression equation for each leaf node in the tree

B)      model trees predict based on the average of instances at the leaves in the tree

C)      statistical regression is now considered obsolete

D)      all of the above

E)       none of the above

 

4.  Which of the following is true of instance-based approaches?

A)     no training examples should ever be discarded

B)      there should never be any attempt at generalization of examples

C)      all examples are equally valuable for successful prediction

D)      all of the above

E)       none of the above

 

 

Completion (fill in the blank; mostly key terms, not all of which are a single word)

(3 points each)

 

5. Some efforts to learn rules do not focus on learning rules that predict a certain attribute, but instead look for any relationship between attributes that might be interesting. Such ________ rules are largely valuable to the extent that a person can figure out a way to use them; they are not used for automatic prediction or decision making.

 

6.  Sometimes learning methods try to match training data “too exactly”, resulting in a model that suits training data very well, but does not do as well when tested.  Idiosyncratic records in the training data may lead the algorithm astray. This problem is known as ________.

 

7. Many machine learning algorithms have a(n) ________ that leads them to prefer some possible conclusions over others – and to not even consider some possibilities.

 

8.        Many data mining techniques include some form of ________ - a rule of thumb that is frequently helpful but which is not guaranteed to be successful.

 

9.        Some machine learning schemes require all numeric attributes to be on the same scale. Hence data preparation needs to include ________, which gives an attribute a new value based on how it compares to other values for the attribute.

 

10.     In evaluating data mining success it is common to use an experimental method known as ________; in this method, training and testing are repeated, such that eventually every example has been used in multiple training sessions and has been tested on once.

 

 

True/False  - If false, explain why!!!!!!

(4 points each)

 

11. The field of machine learning has been stalled because of the inability to practically define “learning.”

 

12. Decision trees are superior to decision rules.

 

13. For any data mining effort, data must be put into .arff format.

 

14. Statistical regression is the standard for comparison for numeric prediction; any approach begins to judge its worth by whether it can do better than regression.

 

15. In theory, in instance-based approaches, all attributes should be weighted equally.

 

 

Short Answer

(5 points each)

 

16. Briefly explain why data mining approaches should be tested on data that is separate from that used for training.

 

17. Briefly explain at least 2 ethical issues related to data mining.

 

 

Problems

(points as shown)

 

 

(10 points)

18. On the answer sheet, fill in all of the information concerning results on an experiment based on the confusion matrix below:

 

===== Confusion Matrix =====

a       b       <-- Classified As

5       4       a = Rainy

2       8       b = NoRain

 

 

 

 

 

 


(5 points)

19. Given the following decision tree, determine what prediction will be made for the following test instance:

 

 


Outlook
    Sunny: Temp
        Hot: windy
            true: yes
            false: no
        mild: yes
        cool: yes
    overcast: windy
        true: no
        false: yes
    rainy: no

 

 

 

TEST:  Outlook=Sunny, Temp=Hot, Humidity=Normal, Windy=True

 

 

 

(15 points)

20. Using the OneR algorithm and the following training data, what would the final learned concept description be? (This data is made up and does not necessarily reflect any reality.) Show your work!!!!

 

Pressure    Pressure Change    Temperature    Forecast (class to be predicted)

High        Steady             Warm           NoRain
Low         Increasing         Cool           NoRain
Low         Decreasing         Cool           Rain
High        Decreasing         Warm           Rain
Medium      Steady             Cool           NoRain
High        Increasing         Warm           NoRain
Medium      Increasing         Warm           NoRain
Medium      Decreasing         Cool           Rain
High        Increasing         Cool           NoRain
Low         Decreasing         Warm           Rain
High        Steady             Warm           NoRain
Low         Steady             Cool           NoRain
High        Decreasing         Cool           Rain
Low         Increasing         Warm           NoRain
Low         Steady             Warm           NoRain
Medium      Steady             Warm           NoRain
Medium      Increasing         Cool           NoRain

 

 

 

 

(10 points)

21. Given the following tallies of training data (computed using the Laplace estimator), what result (Rain or NoRain) will NaiveBayes predict for the following test example? (Show your work!!!!)

 

 

Pressure           Rain    NoRain
High               3       5
Medium             2       5
Low                3       5

Pressure Change    Rain    NoRain
Increasing         1       7
Steady             1       7
Decreasing         6       1

Temperature        Rain    NoRain
Warm               3       8
Cool               4       6

To Predict         Rain    NoRain
Forecast           6       13

 

Test:  Pressure=Medium, Pressure Change=Decreasing, Temperature=Warm