Question:
Give an example for each of the following preprocessing activities:
a. Incomplete
b. Inconsistent
Answer:
Data Preprocessing: It is a data mining technique that transforms raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Preprocessing is needed to resolve such issues.
"Preprocessing is needed to improve data quality"
A. Incomplete: Lacking attribute values, lacking certain attributes of interest, or containing only aggregate data.
E.g., many tuples have no recorded value for several attributes, such as:
Occupation = " " (missing data)
B. Inconsistent: Containing discrepancies in codes or names.
E.g.:
Age = "42", Birthday = "03/07/2010"
Rating was "1, 2, 3", is now "A, B, C"
Discrepancies between duplicate records
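Both kinds of problem can be detected programmatically. A minimal sketch, assuming illustrative records, field names, and a reference year (none of these come from a real dataset):

```python
# Flag records that are incomplete (missing attribute values) or
# inconsistent (contradictory values, e.g. Age vs. Birthday).
records = [
    {"occupation": "", "age": 42, "birth_year": 2010},         # both problems
    {"occupation": "teacher", "age": 30, "birth_year": 1994},  # clean
]

CURRENT_YEAR = 2024  # assumed reference year for the consistency check
flags = []
for i, rec in enumerate(records):
    if rec["occupation"] == "":                 # incomplete: missing value
        flags.append((i, "incomplete"))
    derived_age = CURRENT_YEAR - rec["birth_year"]
    if abs(derived_age - rec["age"]) > 1:       # inconsistent: age vs. birth year
        flags.append((i, "inconsistent"))

print(flags)  # [(0, 'incomplete'), (0, 'inconsistent')]
```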
Question:
Classify following attributes as binary, discrete or continuous. Also classify them as qualitative (nominal or ordinal) or quantitative (interval or ratio).
a. Number of patients in hospital
b. ISBN numbers for books
Answer:
A. Number of patients in hospital → Discrete, quantitative, ratio
B. ISBN numbers for books → Discrete, qualitative, nominal
For your reference:
Attribute: A data field, representing a characteristic or feature of a data object.
Types:
- Nominal
- Binary
- Numeric (quantitative):
  - Interval-scaled
  - Ratio-scaled
Nominal
e.g., Hair_color = {auburn, black, blond, brown, grey, red, white}; marital status, occupation, ID numbers, zip codes
Binary
Nominal attribute with only 2 states (0 and 1)
Symmetric binary: both outcomes equally important
e.g., gender
Asymmetric binary: outcomes not equally important.
e.g., medical test (positive vs. negative)
Convention: assign 1 to most important outcome (e.g., HIV positive)
Ordinal
Values have a meaningful order (ranking) but magnitude between successive values is not known.
Size = {small, medium, large}, grades, army rankings
Examples for exercise:
Brightness as measured by a light meter.
Answer: Continuous, quantitative, ratio
Angles as measured in degrees between 0° and 360°.
Answer: Continuous, quantitative, ratio
Bronze, Silver, and Gold medals as awarded at the Olympics.
Answer: Discrete, qualitative, ordinal
Ability to pass light in terms of the following values: opaque, translucent, transparent.
Answer: Discrete, qualitative, ordinal
Military rank.
Answer: Discrete, qualitative, ordinal
Density of a substance in grams per cubic centimeter.
Answer: Continuous, quantitative, ratio
Question:
Suppose that the data for analysis includes the attribute price, with the values 4, 8, 15, 21, 21, 24, 25, 28, and 34. Use smoothing by bin means to smooth the data, using a bin depth of 3. Illustrate your steps and comment on the effect of this technique for the given data.
Answer:
The following steps are required to smooth the above data by bin means with a bin depth of 3.
Step 1: Sort the data. [Already sorted.]
Step 2: Partition the data into equi-depth bins of depth 3.
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Step 3: Calculate the arithmetic mean of each bin. [Mean = sum of values / count of values]
Step 4: Replace each value in each bin by the arithmetic mean calculated for that bin.
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Effect: each value is replaced by its bin mean, so small variations and noise within a bin are smoothed away while the overall shape of the data is preserved.
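The four steps above can be sketched in plain Python (a minimal illustration, not a library routine):

```python
# Smoothing by bin means with a given bin depth.
def smooth_by_bin_means(values, depth):
    values = sorted(values)                          # Step 1: sort
    smoothed = []
    for start in range(0, len(values), depth):       # Step 2: equi-depth bins
        bin_vals = values[start:start + depth]
        mean = round(sum(bin_vals) / len(bin_vals))  # Step 3: bin mean
        smoothed.extend([mean] * len(bin_vals))      # Step 4: replace values
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))  # [9, 9, 9, 22, 22, 22, 29, 29, 29]
```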
For Reference: Can also be asked using equi-width binning:
width = (max − min) / N
Given N = 3:
Step 1: Sort the data.
Step 2: width = (34 − 4) / 3 = 10
Step 3: So the bin intervals are (1,10), (11,20), (21,30), (31,40).
Step 4: Place each value within its interval:
Bin 1 → 4, 8
Bin 2 → 15
Bin 3 → 21, 21, 24, 25, 28
Bin 4 → 34
Using equi-depth: after sorting the data, follow the step below:
Sorted: 4, 8, 15, 21, 21, 24, 25, 28, 34
Step: Partition the data into equi-depth bins of depth 3.
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
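The equi-width variant can be sketched as follows, assuming the rounded width-10 intervals starting at 1 used above:

```python
# Assign each value to a fixed-width interval (width 10, starting at 1).
def equi_width_bins(values, width, low):
    bins = {}
    for v in sorted(values):
        # Interval containing v: [start, start + width - 1]
        start = low + ((v - low) // width) * width
        bins.setdefault((start, start + width - 1), []).append(v)
    return bins

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(equi_width_bins(prices, 10, 1))
# Bin (1,10) → 4, 8; (11,20) → 15; (21,30) → 21, 21, 24, 25, 28; (31,40) → 34
```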
Question:
Consider the one-dimensional data set shown in the below table:

X:  0.6  3.2  4.5  4.6  4.9  5.2  5.6  5.8  7.1  9.5
Y:   −    −    +    +    +    −    −    +    −    −
Classify the data point x = 5.0 according to its 3 and 9 nearest neighbors (using majority vote).
Answer:
We first find the absolute difference of each data point X with respect to x = 5.0; refer to the table below.
x     X     |x − X|   Y
5.0   0.6   4.4       −
5.0   3.2   1.8       −
5.0   4.5   0.5       +
5.0   4.6   0.4       +
5.0   4.9   0.1       +
5.0   5.2   0.2       −
5.0   5.6   0.6       −
5.0   5.8   0.8       +
5.0   7.1   2.1       −
5.0   9.5   4.5       −
Using the 3-nearest-neighbors method:
The 3 closest points to x = 5.0 are those with the smallest differences → 4.9, 5.2, 4.6
Classes → +, −, +
By majority vote, the 3-nearest-neighbor classification is: +
Using the 9-nearest-neighbors method:
The 9 closest points to x = 5.0 are → 4.9, 5.2, 4.6, 4.5, 5.6, 5.8, 3.2, 7.1, 0.6
Classes → +, −, +, +, −, +, −, −, −
By majority vote, the 9-nearest-neighbor classification is: −
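The majority-vote procedure can be sketched in plain Python (no ML library; `knn_classify` is an illustrative helper name):

```python
from collections import Counter

# Training data from the table above.
X = [0.6, 3.2, 4.5, 4.6, 4.9, 5.2, 5.6, 5.8, 7.1, 9.5]
Y = ['-', '-', '+', '+', '+', '-', '-', '+', '-', '-']

def knn_classify(x, k):
    # Sort the training points by absolute distance to x, take the k
    # closest, and return the majority class among their labels.
    neighbours = sorted(zip(X, Y), key=lambda p: abs(p[0] - x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

print(knn_classify(5.0, 3))  # '+'
print(knn_classify(5.0, 9))  # '-'
```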
Question:
Consider the following data set for a binary class problem.
A   B   Class Label
T   F   +
T   T   +
T   T   +
T   F   −
T   T   +
F   F   −
F   F   −
F   F   −
T   T   −
T   F   −
Calculate the gain in the Gini index when splitting on A and B. Which attribute would the decision tree induction algorithm choose?
Answer:
Using the Gini index method:
The overall Gini before splitting is: G(original) = 1 − (4/10)² − (6/10)² = 0.48
The Gini of the children after splitting on A is:
G(A=T) = 1 − (4/7)² − (3/7)² = 0.4898
G(A=F) = 1 − (3/3)² − (0/3)² = 0
Gain: ∆ = G(original) − (7/10)·G(A=T) − (3/10)·G(A=F) = 0.1371
The Gini of the children after splitting on B is:
G(B=T) = 1 − (1/4)² − (3/4)² = 0.3750
G(B=F) = 1 − (1/6)² − (5/6)² = 0.2778
Gain: ∆ = G(original) − (4/10)·G(B=T) − (6/10)·G(B=F) = 0.1633
Since B gives the larger reduction in Gini, attribute B will be chosen to split the node.
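The Gini gains above can be checked with a short script (a sketch; `gini_gain` is an illustrative helper name):

```python
data = [  # (A, B, class) — the 10 records from the table above
    ('T', 'F', '+'), ('T', 'T', '+'), ('T', 'T', '+'), ('T', 'F', '-'),
    ('T', 'T', '+'), ('F', 'F', '-'), ('F', 'F', '-'), ('F', 'F', '-'),
    ('T', 'T', '-'), ('T', 'F', '-'),
]

def gini(labels):
    # Gini impurity: 1 - sum of squared class proportions.
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_gain(data, attr_index):
    # Parent impurity minus the weighted impurity of the children.
    labels = [row[2] for row in data]
    gain = gini(labels)
    for value in set(row[attr_index] for row in data):
        subset = [row[2] for row in data if row[attr_index] == value]
        gain -= len(subset) / len(data) * gini(subset)
    return gain

print(round(gini_gain(data, 0), 4))  # gain for A: 0.1371
print(round(gini_gain(data, 1), 4))  # gain for B: 0.1633
```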
For Reference: Can also be asked as:
> Calculate the information gain when splitting on A and B. Which attribute would the decision tree induction algorithm choose ?
Using the information gain method (log base 2):
The class counts after splitting on attributes A and B are: A=T: 4+/3−, A=F: 0+/3−; B=T: 3+/1−, B=F: 1+/5−.
E(original) = −0.4·log(0.4) − 0.6·log(0.6) = 0.9710
The entropy of the children after splitting on A is:
E(A=T) = −(4/7)·log(4/7) − (3/7)·log(3/7) = 0.9852
E(A=F) = −(3/3)·log(3/3) − (0/3)·log(0/3) = 0
Gain: ∆ = E(original) − (7/10)·E(A=T) − (3/10)·E(A=F) = 0.2813
The entropy of the children after splitting on B is:
E(B=T) = −(3/4)·log(3/4) − (1/4)·log(1/4) = 0.8113
E(B=F) = −(1/6)·log(1/6) − (5/6)·log(5/6) = 0.6500
Gain: ∆ = E(original) − (4/10)·E(B=T) − (6/10)·E(B=F) = 0.2565
Since A gives the larger information gain, attribute A will be chosen to split the node.
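The same check works for information gain (a sketch; `info_gain` is an illustrative helper name). Note that the 0.2565 above comes from rounding the intermediate entropies; at full precision the gain for B is ≈ 0.2564.

```python
from math import log2

data = [  # (A, B, class) — same 10 records as the Gini example
    ('T', 'F', '+'), ('T', 'T', '+'), ('T', 'T', '+'), ('T', 'F', '-'),
    ('T', 'T', '+'), ('F', 'F', '-'), ('F', 'F', '-'), ('F', 'F', '-'),
    ('T', 'T', '-'), ('T', 'F', '-'),
]

def entropy(labels):
    # Shannon entropy in bits; iterating over set(labels) skips
    # zero-count classes, so log2(0) never occurs.
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def info_gain(data, attr_index):
    # Parent entropy minus the weighted entropy of the children.
    labels = [row[2] for row in data]
    gain = entropy(labels)
    for value in set(row[attr_index] for row in data):
        subset = [row[2] for row in data if row[attr_index] == value]
        gain -= len(subset) / len(data) * entropy(subset)
    return gain

print(round(info_gain(data, 0), 4))  # gain for A ≈ 0.2813
print(round(info_gain(data, 1), 4))  # gain for B ≈ 0.2564
```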
Question:
Derive rules for the data shown in Table 1 using the indirect method (based on a decision tree) and assign the class, using the derived rules, for the data given in Table 2.
Table 1:

Outlook   Temp   Humidity   Windy   Class
Rain      65     80         FALSE
Sunny     90     75         TRUE
Answer:
Given:
- The data set has five attributes.
- There is a special attribute: Class is the class label.
- The attributes Temp and Humidity are numerical.
- The other attributes are categorical, that is, they cannot be ordered.
- Based on the Table 1 data set, we want to derive a set of rules to determine which values of Outlook, Temp, Humidity and Windy determine whether or not to play.
Outlook:
         Sunny   Overcast   Rain
Play     2       4          3
Don't    3       0          2

Windy:
         TRUE   FALSE
Play     3      6
Don't    4      1
As we can see, a pure node is obtained for Outlook while splitting, so it can be taken as the first attribute for splitting. [A pure node is one where all of its data belongs to a single class.]
In the second step, we only need to look at Sunny and Rain.
Note: we will define ranges for Temperature as well as Humidity.
Outlook = Sunny:

Temp:
         61..70   71..80   81..90
Play     1        1        0
Don't    0        2        1

Humidity:
         61..70   71..80   81..90   91..100
Play     2        0        0        0
Don't    0        0        2        1

Windy:
         TRUE   FALSE
Play     1      1
Don't    2      1
As we can see, pure nodes are obtained for the attribute Humidity, so it can be taken as the second attribute for splitting. [The 71..80 Humidity range has no records, so it is not included in the tree.]
When Outlook = Rain:

Temp:
         61..70   71..80   81..90
Play     2        1        1
Don't    1        1        0

Humidity:
         61..70   71..80   81..90   91..100
Play     3        0        1
Don't    1        1        0        0

Windy:
         TRUE   FALSE
Play     0      4
Don't    2      0
As we can see, pure nodes are obtained for the attribute Windy.
Below are the derived rules using indirect method (Based on decision tree) :
R1. IF (outlook=sunny) and (humidity=61..70) THEN (class=Play).
R2. IF (outlook=sunny) and (humidity=81..100) THEN (class=Don't).
R3. IF (outlook=overcast) THEN (class=Play).
R4. IF (outlook=rain) and (windy=true) THEN (class=Don't).
R5. IF (outlook=rain) and (windy=false) THEN (class=Play).
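The rules R1–R5 can be encoded directly (a sketch; `classify` is an illustrative helper name, and the Temp parameter is carried along even though no derived rule uses it):

```python
# Apply the derived rules R1-R5 to a single record.
def classify(outlook, temp, humidity, windy):
    if outlook == 'sunny':
        # R1: humidity 61..70 -> Play; R2: humidity 81..100 -> Don't
        return 'Play' if 61 <= humidity <= 70 else "Don't"
    if outlook == 'overcast':
        return 'Play'                                # R3
    if outlook == 'rain':
        return "Don't" if windy else 'Play'          # R4 / R5

# The two Table 2 records:
print(classify('rain', 65, 80, False))  # Play
print(classify('sunny', 90, 75, True))  # Don't
```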
Assigning classes in Table 2:

Outlook   Temp   Humidity   Windy   Class
Rain      65     80         FALSE   Play
Sunny     90     75         TRUE    Don't

Referenced Question:
Draw a decision tree for the below scenario.
Answer:
Note: This is the same scenario as solved above; you just also need to check for pure nodes on Temperature and Humidity.