Assignment 1

CSC 470 Spring 2004

02/19/04 Midterm Exam Test Form A

Name:

Instructions:

Answer all questions on the answer sheet provided. For Multiple Choice questions, choose the ONE BEST ANSWER.

Remember to put the letter of your test form on the top of your answer sheet.

Hand in Test, Answer Sheet, and Help Sheets, all with your name.

Multiple Choice

(3 points each)

1. Which of the following is true about “black-box” data mining approaches?

A) produce a structural description that can be used by people

B) organize examples into a cube shape that they are inside or outside of

C) may not be as trusted by humans, since their methods are not visible

D) all of the above

E) none of the above

2. Which of the following is true about data mining?

A) graphical display of data can help to find problems in the data

B) automatic methods are of such power and quality that human analysis of input data is not needed

C) an IT person can generally carry out a data mining project without consulting people from the “business side” of the organization

D) All of the above

E) None of the above

3. Which of the following is true about numeric prediction?

A) regression trees have a regression equation for each leaf node in the tree

B) model trees predict based on the average of instances at the leaves in the tree

C) statistical regression is now considered obsolete

D) all of the above

E) none of the above

4. Which of the following is true of instance-based approaches?

A) no training examples should ever be discarded

B) there should never be any attempt at generalization of examples

C) all examples are equally valuable for successful prediction

D) all of the above

E) none of the above

Completion (fill in the blank; mostly key terms, not all of which are a single word)

(3 points each)

5. Some efforts to learn rules do not focus on learning rules that predict a certain attribute, but instead look for any relationship between attributes that might be interesting. Such ________ rules are largely valuable to the extent that a person can figure out a way to use them; they are not used for automatic prediction or decision making.

6. Sometimes learning methods try to match training data “too exactly”, resulting in a model that suits training data very well, but does not do as well when tested. Idiosyncratic records in the training data may lead the algorithm astray. This problem is known as ________.

7. Many machine learning algorithms have a(n) ________ that leads them to prefer some possible conclusions over others, and to not even consider some possibilities.

8. Many data mining techniques include some form of ________: a rule of thumb that is frequently helpful but is not guaranteed to be successful.

9. Some machine learning schemes require all numeric attributes to be on the same scale. Hence data preparation needs to include ________, which gives an attribute a new value based on how it compares to other values for the attribute.

10. In evaluating data mining success, it is common to use an experimental method known as ________; in this method, training and testing are repeated so that eventually every example has been used in multiple training sessions and has been tested on exactly once.

True/False - If false, explain why!!!!!!

(4 points each)

11. The field of machine learning has been stalled because of the inability to practically define “learning.”

12. Decision trees are superior to decision rules.

13. For any data mining effort, data must be put into .arff format.

14. Statistical regression is the standard for comparison for numeric prediction; any approach begins to judge its worth by whether it can do better than regression.

15. In theory, in instance-based approaches, all attributes should be weighted equally.

Short Answer

(5 points each)

16. Briefly explain why data mining approaches should be tested on data that is separate from that used for training.

17. Briefly explain at least 2 ethical issues related to data mining.

Problems

(points as shown)

(10 points)

18. On the answer sheet, fill in all of the information concerning the results of an experiment, based on the confusion matrix below:

=== Confusion Matrix ===

a   b   <-- classified as

5   4   | a = Rainy

2   8   | b = NoRain
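For checking hand calculations on a matrix like this one, the following is a minimal sketch. It assumes that rows hold the actual class and columns hold the class assigned by the classifier, and that the information asked for includes accuracy plus per-class true positive rate, false positive rate, and precision; the answer sheet may ask for other figures. The names `labels`, `matrix`, and `per_class_stats` are illustrative only.

```python
# Confusion matrix from question 18.
# Assumption: rows are the actual class, columns are the predicted class.
labels = ["Rainy", "NoRain"]
matrix = [
    [5, 4],  # actual Rainy:  5 classified Rainy, 4 classified NoRain
    [2, 8],  # actual NoRain: 2 classified Rainy, 8 classified NoRain
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / total


def per_class_stats(i):
    """Return TP rate (recall), FP rate, and precision for class index i."""
    tp = matrix[i][i]
    fn = sum(matrix[i]) - tp                 # actual i, predicted otherwise
    fp = sum(row[i] for row in matrix) - tp  # predicted i, actually otherwise
    tn = total - tp - fn - fp
    return {
        "tp_rate": tp / (tp + fn),
        "fp_rate": fp / (fp + tn),
        "precision": tp / (tp + fp),
    }


stats = {label: per_class_stats(i) for i, label in enumerate(labels)}

print(f"correct: {correct} of {total} (accuracy {accuracy:.3f})")
for label, values in stats.items():
    print(label, {name: round(v, 3) for name, v in values.items()})
```

If the matrix were laid out with predicted classes in rows instead, the false positive and false negative counts would swap, so the layout assumption is worth confirming first.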

(5 points)

19. Given the following decision tree, determine what prediction will be made for the following test instance:

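Outlook = Sunny: test Temp

    Temp = Hot: test Windy

        Windy = true: yes

        Windy = false: no

    Temp = mild: yes

    Temp = cool: yes

Outlook = overcast: test Windy

    Windy = true: no

    Windy = false: yes

Outlook = rainy: no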

TEST: Outlook=Sunny, Temp=Hot, Humidity=Normal, Windy=True

(15 points)

20. Using the OneR algorithm on the following training data, what would the final learned concept description be? (This data is made up and does not necessarily reflect reality.) Show your work!!!!

Pressure	Pressure Change	Temperature	Forecast (class to be predicted)
High	Steady	Warm	NoRain
Low	Increasing	Cool	NoRain
Low	Decreasing	Cool	Rain
High	Decreasing	Warm	Rain
Medium	Steady	Cool	NoRain
High	Increasing	Warm	NoRain
Medium	Increasing	Warm	NoRain
Medium	Decreasing	Cool	Rain
High	Increasing	Cool	NoRain
Low	Decreasing	Warm	Rain
High	Steady	Warm	NoRain
Low	Steady	Cool	NoRain
High	Decreasing	Cool	Rain
Low	Increasing	Warm	NoRain
Low	Steady	Warm	NoRain
Medium	Steady	Warm	NoRain
Medium	Increasing	Cool	NoRain
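One way to check hand-tallied work for this question is a short script. The sketch below is an illustration of the usual OneR procedure (for each attribute, predict the majority class for each value, count the training errors, and keep the attribute with the fewest errors). It is not a reference implementation, ties are broken by first occurrence, and the names `rows`, `attributes`, and `one_r` are hypothetical.

```python
from collections import Counter, defaultdict

attributes = ["Pressure", "Pressure Change", "Temperature"]

# Training data from question 20; the last field is the class (Forecast).
rows = [
    ("High", "Steady", "Warm", "NoRain"),
    ("Low", "Increasing", "Cool", "NoRain"),
    ("Low", "Decreasing", "Cool", "Rain"),
    ("High", "Decreasing", "Warm", "Rain"),
    ("Medium", "Steady", "Cool", "NoRain"),
    ("High", "Increasing", "Warm", "NoRain"),
    ("Medium", "Increasing", "Warm", "NoRain"),
    ("Medium", "Decreasing", "Cool", "Rain"),
    ("High", "Increasing", "Cool", "NoRain"),
    ("Low", "Decreasing", "Warm", "Rain"),
    ("High", "Steady", "Warm", "NoRain"),
    ("Low", "Steady", "Cool", "NoRain"),
    ("High", "Decreasing", "Cool", "Rain"),
    ("Low", "Increasing", "Warm", "NoRain"),
    ("Low", "Steady", "Warm", "NoRain"),
    ("Medium", "Steady", "Warm", "NoRain"),
    ("Medium", "Increasing", "Cool", "NoRain"),
]


def one_r(rows, attributes):
    """Return (attribute name, value -> class rule, training errors)."""
    best = None
    for col, name in enumerate(attributes):
        # Count class frequencies for each value of this attribute.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[col]][row[-1]] += 1
        # Each value predicts its majority class.
        rule = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        # Errors are the instances outside the majority class of their value.
        errors = sum(sum(c.values()) - c.most_common(1)[0][1]
                     for c in counts.values())
        print(f"{name}: {errors} errors out of {len(rows)}")
        if best is None or errors < best[2]:
            best = (name, rule, errors)
    return best


best_attribute, best_rule, best_errors = one_r(rows, attributes)
print("chosen attribute:", best_attribute)
for value, predicted in best_rule.items():
    print(f"  {value} -> {predicted}")
```

The per-attribute error lines printed along the way correspond to the tallies that "show your work" asks for.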

(10 points)

21. Given the following tallies of training data (which used the Laplace estimator), using NaiveBayes, what result (Rain or NoRain) will be predicted on the following test example? (Show your work!!!!)

Pressure	Rain	NoRain
High	3	5
Medium	2	5
Low	3	5

Pressure Change	Rain	NoRain
Increasing	1	7
Steady	1	7
Decreasing	6	1

Temperature	Rain	NoRain
Warm	3	8
Cool	4	6

To Predict	Rain	NoRain
Forecast	6	13

Test: Medium, Decreasing, Warm
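The arithmetic for this question can be double-checked with a small script. The sketch below assumes that the tallies already include the Laplace correction, so each conditional probability is the tally divided by the sum of its column within that attribute's table, and the prior is the Forecast tally divided by the Forecast total. The names `tallies`, `class_counts`, and `naive_bayes_scores` are illustrative only.

```python
from fractions import Fraction

classes = ("Rain", "NoRain")

# Tallies from question 21 as (Rain, NoRain) pairs.
# Assumption: the Laplace correction is already included in these counts.
tallies = {
    "Pressure": {"High": (3, 5), "Medium": (2, 5), "Low": (3, 5)},
    "Pressure Change": {"Increasing": (1, 7), "Steady": (1, 7),
                        "Decreasing": (6, 1)},
    "Temperature": {"Warm": (3, 8), "Cool": (4, 6)},
}
class_counts = (6, 13)  # Forecast row: Rain, NoRain


def naive_bayes_scores(test):
    """Return the unnormalized Naive Bayes score for each class."""
    scores = {}
    for k, cls in enumerate(classes):
        score = Fraction(class_counts[k], sum(class_counts))  # prior
        for attribute, value in test.items():
            column_total = sum(pair[k] for pair in tallies[attribute].values())
            score *= Fraction(tallies[attribute][value][k], column_total)
        scores[cls] = score
    return scores


test_instance = {"Pressure": "Medium", "Pressure Change": "Decreasing",
                 "Temperature": "Warm"}
scores = naive_bayes_scores(test_instance)
prediction = max(scores, key=scores.get)

for cls, score in scores.items():
    print(f"{cls}: {score} = {float(score):.5f}")
print("prediction:", prediction)
```

Exact fractions are used so that each factor can be compared directly with the hand-written work; dividing each score by the sum of both scores gives normalized probabilities.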