November 12, 2018

Data Mining - Mid Sem Solutions


Question:
Give an example for each of the following data preprocessing issues:
a. Incomplete
b. Inconsistent

Answer:
Data Preprocessing: A data mining technique that transforms raw data into an understandable format. Real-world data is often incomplete, inconsistent, or lacking in certain behaviors or trends, and is likely to contain many errors. Preprocessing is therefore needed to resolve such issues.
"Preprocessing is needed to improve data quality"

A. Incomplete: Lacking attribute values, lacking certain attributes of interest, or containing only aggregate data.
E.g. Many tuples have no recorded value for several attributes,
Occupation = “ ” (missing data)

B. Inconsistent: Containing discrepancies in codes or names.
E.g.
Age = “42”, Birthday = “03/07/2010”
Was rating “1, 2, 3”, now rating “A, B, C”
discrepancy between duplicate records


Question:
Classify the following attributes as binary, discrete, or continuous. Also classify them as qualitative (nominal or ordinal) or quantitative (interval or ratio).
a. Number of patients in hospital
b. ISBN numbers for books

Answer:
A. Number of patients in hospital - Discrete, quantitative, ratio
B. ISBN numbers for books - Discrete, qualitative, nominal

For your reference:

Attribute: A data field, representing a characteristic or feature of a data object.

Types:
  • Nominal
  • Binary
  • Ordinal
  • Numeric: quantitative
    • Interval-scaled
    • Ratio-scaled
Nominal: categories, states, or “names of things”
Hair_color = {auburn, black, blond, brown, grey, red, white}
marital status, occupation, ID numbers, zip codes

Binary
Nominal attribute with only 2 states (0 and 1)

Symmetric binary: both outcomes equally important
e.g., gender
Asymmetric binary: outcomes not equally important.
e.g., medical test (positive vs. negative)
Convention: assign 1 to most important outcome (e.g., HIV positive)

Ordinal
Values have a meaningful order (ranking) but magnitude between successive values is not known.
Size = {small, medium, large}, grades, army rankings

Examples for exercise:
Brightness as measured by a light meter.
Answer: Continuous, quantitative, ratio

Angles as measured in degrees between 0° and 360°
Answer: Continuous, quantitative, ratio

Bronze, Silver, and Gold medals as awarded at the Olympics.
Answer: Discrete, qualitative, ordinal

Ability to pass light in terms of the following values: opaque, translucent, transparent.
Answer: Discrete, qualitative, ordinal

Military rank.
Answer: Discrete, qualitative, ordinal

Density of a substance in grams per cubic centimeter.
Answer: Continuous, quantitative, ratio

Question:
Suppose that the data for analysis includes the attribute price, with the values 4, 8, 15, 21, 21, 24, 25, 28, and 34. Use smoothing by bin means to smooth the above data, using a bin depth of 3. Illustrate your steps.

Answer:
The following steps smooth the above data using smoothing by bin means with a bin depth of 3.

Step 1: Sort the data. [It is already sorted.]

Step 2: Partition the data into equi-depth bins of depth 3.
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34

Step 3: Calculate the arithmetic mean of each bin. [Sum of the values / count of the values]
Bin 1: (4 + 8 + 15) / 3 = 9
Bin 2: (21 + 21 + 24) / 3 = 22
Bin 3: (25 + 28 + 34) / 3 = 29

Step 4: Replace each of the values in each bin by the arithmetic mean calculated for the bin.
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
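
The four steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name smooth_by_bin_means is made up for this example, and it assumes the number of values is a multiple of the bin depth (otherwise the last bin is simply shorter).

```python
def smooth_by_bin_means(values, depth):
    """Equi-depth binning, then replace each value by its bin mean."""
    ordered = sorted(values)                      # Step 1: sort
    # Step 2: consecutive chunks of `depth` values form the bins
    bins = [ordered[i:i + depth] for i in range(0, len(ordered), depth)]
    # Steps 3 and 4: replace every value in a bin by that bin's mean
    return [[sum(b) / len(b)] * len(b) for b in bins]


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(prices, depth=3)
print(smoothed)  # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
```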

For Reference: Can be asked as

Using Equi width:
width = (max − min) / N
Given N = 3

Step 1: Sort the data.

Step 2: width = (34 − 4) / 3 = 10

Step 3: Starting from the minimum value, the N = 3 bin intervals are [4, 14), [14, 24), [24, 34].

Step 4: Assign each value to the interval that contains it:
Bin 1 - 4, 8
Bin 2 - 15, 21, 21
Bin 3 - 24, 25, 28, 34
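
A minimal Python sketch of equi-width binning follows. It rests on assumptions that are not stated in the question: the intervals start at the minimum value (4), each interval is half-open, and the maximum value is placed in the last bin. Other conventions shift the boundaries and can change which bin a value falls into. The function name equal_width_bins is hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Split values into n_bins intervals of equal width (assumes max > min)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in sorted(values):
        # Index of the half-open interval [lo + k*width, lo + (k+1)*width);
        # the maximum value is clamped into the last bin.
        k = min(int((v - lo) / width), n_bins - 1)
        bins[k].append(v)
    return width, bins


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
width, bins = equal_width_bins(prices, n_bins=3)
print(width)  # 10.0
print(bins)   # [[4, 8], [15, 21, 21], [24, 25, 28, 34]]
```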

Using Equi depth: After sorting the data, follow the step below:

Sorted: 4, 8, 15, 21, 21, 24, 25, 28, 34

Step: Partition the data into equi-depth bins of depth 3.
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34

Question:
Consider the one-dimensional data set shown in the table below.

X   0.6  3.2  4.5  4.6  4.9  5.2  5.6  5.8  7.1  9.5
Y   -    -    +    +    +    -    -    +    -    -

Classify the data point x = 5.0 according to its 3-nearest and 9-nearest neighbors (using majority vote).

Answer:
We first need to find the absolute difference between each data point X and x = 5.0. Refer to the table below.

x    X    Difference |x − X|  Y
5.0  0.6  4.4                 -
5.0  3.2  1.8                 -
5.0  4.5  0.5                 +
5.0  4.6  0.4                 +
5.0  4.9  0.1                 +
5.0  5.2  0.2                 -
5.0  5.6  0.6                 -
5.0  5.8  0.8                 +
5.0  7.1  2.1                 -
5.0  9.5  4.5                 -

As asked,
Using the 3-nearest-neighbors method, the 3 closest points to x = 5.0 are the ones with the smallest differences: 4.9, 5.2, 4.6
Classes -> +, -, +
Using majority vote, 3-nearest neighbor: +

Using the 9-nearest-neighbors method, the 9 closest points to x = 5.0 are the ones with the smallest differences: 4.9, 5.2, 4.6, 4.5, 5.6, 5.8, 3.2, 7.1, 0.6
Classes -> +, -, +, +, -, +, -, -, -
Using majority vote, 9-nearest neighbor: - (4 votes for + against 5 votes for -)
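
The same procedure can be checked with a short Python sketch. The function name knn_classify is made up for this illustration; it uses the absolute difference as the distance and does no special tie handling, which is enough here because no two points are equally far from x = 5.0.

```python
from collections import Counter


def knn_classify(points, labels, query, k):
    """Majority vote among the k points closest to `query` (1-D data)."""
    # Rank the training points by absolute distance to the query
    ranked = sorted(zip(points, labels), key=lambda pl: abs(pl[0] - query))
    # Count the class labels of the k nearest points
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


X = [0.6, 3.2, 4.5, 4.6, 4.9, 5.2, 5.6, 5.8, 7.1, 9.5]
Y = ["-", "-", "+", "+", "+", "-", "-", "+", "-", "-"]

print(knn_classify(X, Y, 5.0, k=3))  # +
print(knn_classify(X, Y, 5.0, k=9))  # -
```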

Question:
Consider the following data set for a binary class problem.
A  B  Class Label
T  F  +
T  T  +
T  T  +
T  F  -
T  T  +
F  F  -
F  F  -
F  F  -
T  T  -
T  F  -




Calculate the gain in the Gini index when splitting on A and B. Which Attribute would the decision tree induction algorithm choose?

Answer:

First we will create count tables from the data above:

     A = T  A = F
+    4      0
-    3      3

     B = T  B = F
+    3      1
-    1      5

Using Gini Index method:
The overall Gini index before splitting is: $G_{\text{original}} = 1 - (4/10)^2 - (6/10)^2 = 0.48$

The gain in Gini after splitting on A is:
$G_{A=T} = 1 - (4/7)^2 - (3/7)^2 = 0.4898$
$G_{A=F} = 1 - (3/3)^2 - (0/3)^2 = 0$
$\Delta = G_{\text{original}} - (7/10)\,G_{A=T} - (3/10)\,G_{A=F} = 0.1371$

The gain in Gini after splitting on B is:
$G_{B=T} = 1 - (1/4)^2 - (3/4)^2 = 0.3750$
$G_{B=F} = 1 - (1/6)^2 - (5/6)^2 = 0.2778$
$\Delta = G_{\text{original}} - (4/10)\,G_{B=T} - (6/10)\,G_{B=F} = 0.1633$

Therefore, attribute B will be chosen to split the node. 
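
The Gini computation can be reproduced with a small Python sketch. The helper names gini and gini_gain are invented for this example, and the rows are typed in from the data set in the question, with the class label as the last element of each tuple.

```python
from collections import Counter


def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_gain(rows, attr_index):
    """Parent Gini minus the weighted Gini of the children after the split."""
    gain = gini([r[-1] for r in rows])
    for value in set(r[attr_index] for r in rows):
        subset = [r[-1] for r in rows if r[attr_index] == value]
        gain -= len(subset) / len(rows) * gini(subset)
    return gain


# (A, B, class label), copied from the question
rows = [("T", "F", "+"), ("T", "T", "+"), ("T", "T", "+"), ("T", "F", "-"),
        ("T", "T", "+"), ("F", "F", "-"), ("F", "F", "-"), ("F", "F", "-"),
        ("T", "T", "-"), ("T", "F", "-")]

gain_a = gini_gain(rows, 0)
gain_b = gini_gain(rows, 1)
print(round(gain_a, 4), round(gain_b, 4))  # 0.1371 0.1633
```

Since the gain for B is larger, the sketch agrees with the choice of B under the Gini criterion.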

For Reference: Can be asked

--> Calculate the information gain when splitting on A and B. Which attribute would the decision tree induction algorithm choose?
Answer:
Using Information Gain method:
The contingency tables after splitting on attributes A and B are:

     A = T  A = F
+    4      0
-    3      3

     B = T  B = F
+    3      1
-    1      5

The overall entropy before splitting is (logarithms are base 2, and $0 \log_2 0$ is taken as 0):
$E_{\text{original}} = -0.4 \log_2 0.4 - 0.6 \log_2 0.6 = 0.9710$

The information gain after splitting on A is:
$E_{A=T} = -(4/7)\log_2(4/7) - (3/7)\log_2(3/7) = 0.9852$
$E_{A=F} = -(3/3)\log_2(3/3) - (0/3)\log_2(0/3) = 0$
$\Delta = E_{\text{original}} - (7/10)\,E_{A=T} - (3/10)\,E_{A=F} = 0.2813$

The information gain after splitting on B is:
$E_{B=T} = -(3/4)\log_2(3/4) - (1/4)\log_2(1/4) = 0.8113$
$E_{B=F} = -(1/6)\log_2(1/6) - (5/6)\log_2(5/6) = 0.6500$
$\Delta = E_{\text{original}} - (4/10)\,E_{B=T} - (6/10)\,E_{B=F} = 0.2565$

Therefore, attribute A will be chosen to split the node.
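
A similar Python sketch reproduces the information gain. The helper names entropy and info_gain are invented for this example. Classes that do not occur in a subset are simply not counted, which corresponds to treating 0 log 0 as 0. The gain for B comes out as roughly 0.2564 rather than 0.2565, because the hand calculation rounds the intermediate entropies.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy in bits over the classes present in `labels`."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def info_gain(rows, attr_index):
    """Parent entropy minus the weighted entropy of the children."""
    gain = entropy([r[-1] for r in rows])
    for value in set(r[attr_index] for r in rows):
        subset = [r[-1] for r in rows if r[attr_index] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain


# (A, B, class label), copied from the question
rows = [("T", "F", "+"), ("T", "T", "+"), ("T", "T", "+"), ("T", "F", "-"),
        ("T", "T", "+"), ("F", "F", "-"), ("F", "F", "-"), ("F", "F", "-"),
        ("T", "T", "-"), ("T", "F", "-")]

gain_a = info_gain(rows, 0)
gain_b = info_gain(rows, 1)
print(gain_a > gain_b)  # True: A gives the larger information gain
```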

Question: [10 Marks]
Derive rules for the data shown in Table 1 using the indirect method (based on a decision tree), and assign the class using the derived rules for the data given in Table 2.
Table 1

Table 2

Outlook  Temp  Humidity  Windy  Class
Rain     65    80        FALSE
Sunny    90    75        TRUE


Answer:
Given:
  • The data set has five attributes.
  • There is a special attribute: the attribute class is the class label.
  • The attributes temp and humidity are numerical.
  • The other attributes are categorical, that is, they cannot be ordered.
  • So, based on the Table 1 data set, we want to derive a set of rules that tell us which values of outlook, temp, humidity, and windy determine whether or not to play.
We first tabulate the class counts for each value of Outlook and Windy, and look for a pure node:


Outlook  Sunny  Overcast  Rain
Play     2      4         3
Don't    3      0         2

Windy    TRUE   FALSE
Play     3      6
Don't    4      1

As we can see, splitting on Outlook yields a pure node (Overcast), so Outlook can be taken as the first attribute for splitting. [A node is pure when all of its data belongs to a single class.]
In the second step, we look only at the Sunny and Rain branches.
Note: We will define ranges for Temperature as well as Humidity.


Outlook = Sunny
Temp      61..70  71..80  81..90
Play      1       1       0
Don't     0       2       1

Outlook = Sunny
Humidity  61..70  71..80  81..90  91..100
Play      2       0       0       0
Don't     0       0       2       1

Outlook = Sunny
Windy     TRUE    FALSE
Play      1       1
Don't     2       1

As we can see, pure nodes are obtained for the attribute Humidity, so it can be taken as the second attribute for splitting. [The range 71..80 contains no records, so it is not considered in the tree.]

When Outlook = Rain

Outlook = Rain
Temp      61..70  71..80  81..90
Play      2       1       1
Don't     1       1       0

Outlook = Rain
Humidity  61..70  71..80  81..90  91..100
Play              3       0       1
Don't     1       1       0       0

Outlook = Rain
Windy     TRUE    FALSE
Play      0       4
Don't     2       0

As we can see, pure nodes are obtained for the attribute Windy, so it is used to split the Rain branch.


Below are the derived rules using indirect method (Based on decision tree) :
R1. IF (outlook=sunny) and (humidity=61..70) THEN (class=Play).
R2. IF (outlook=sunny) and (humidity=81..100) THEN (class=Don't).
R3. IF (outlook=overcast) THEN (class=Play).
R4. IF (outlook=rain) and (windy=true) THEN (class=Don't).
R5. IF (outlook=rain) and (windy=false) THEN (class=Play).
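
Rules R1 to R5 can be written as a small Python function for checking classifications. This is only a sketch: the function name classify is made up, Temp is left out because none of the rules use it, and the sketch returns None when no rule fires. For example, a sunny record with humidity 75 falls in the 71..80 range that R1 and R2 do not cover, so labeling it needs a fallback such as a default class.

```python
def classify(outlook, humidity, windy):
    """Apply rules R1 to R5; return None if no rule covers the record."""
    outlook = outlook.lower()
    if outlook == "sunny":
        if 61 <= humidity <= 70:
            return "Play"        # R1
        if 81 <= humidity <= 100:
            return "Don't"       # R2
        return None              # humidity range not covered by R1 or R2
    if outlook == "overcast":
        return "Play"            # R3
    if outlook == "rain":
        return "Don't" if windy else "Play"  # R4 and R5
    return None                  # unknown outlook value


print(classify("Rain", 80, False))   # Play
print(classify("Sunny", 75, True))   # None
```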

Assigning class in Table 2:
Outlook  Temp  Humidity  Windy  Class
Rain     65    80        FALSE  Play
Sunny    90    75        TRUE   Don't

Referenced Question:
Draw a decision tree for the below scenario.


Answer:
Note: This is the same problem as the one solved above; you just need to find the pure nodes for Temperature and Humidity as well.

