Question:
For the following vectors x and y, calculate the cosine similarity and the Euclidean distance:
x = (4, 4, 4, 4), y = (2, 2, 2, 2)
Solution:
Cosine
x ● y = 4*2 + 4*2 + 4*2 + 4*2 = 32
||x|| = sqrt(4*4 + 4*4 + 4*4 + 4*4) = sqrt(64) = 8
||y|| = sqrt(2*2 + 2*2 + 2*2 + 2*2) = sqrt(16) = 4
cos(x,y) = (x ● y) / (||x|| * ||y||) = 32 / (8*4)
cos(x,y) = 1
Euclidean
d(x, y) = sqrt((4−2)^2 + (4−2)^2 + (4−2)^2 + (4−2)^2) = sqrt(16)
Euclidean distance = 4
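The two measures can be checked with a few lines of Python. This is only a minimal sketch using the standard library; the function names cosine_similarity and euclidean_distance are chosen here for illustration.

```python
import math

def cosine_similarity(x, y):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def euclidean_distance(x, y):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

x = (4, 4, 4, 4)
y = (2, 2, 2, 2)
print(cosine_similarity(x, y))   # 1.0
print(euclidean_distance(x, y))  # 4.0
```

The cosine is 1 because y is a positive multiple of x, so the two vectors point in the same direction even though they are 4 units apart.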
Question:
Consider the one-dimensional data set shown in the table below.
X    0.6   3.2   4.5   4.6   4.9   5.2   5.6   5.8   7.1   9.5
Y    −     −     +     +     +     −     −     +     −     −
Classify the data point x = 5.0 according to its 3 and 9 nearest neighbors (using majority vote).
Answer:
First compute the absolute difference between each data point X and x = 5.0, as shown in the table below.
x      X      Difference |x − X|    Y
5.0    0.6    4.4                   −
5.0    3.2    1.8                   −
5.0    4.5    0.5                   +
5.0    4.6    0.4                   +
5.0    4.9    0.1                   +
5.0    5.2    0.2                   −
5.0    5.6    0.6                   −
5.0    5.8    0.8                   +
5.0    7.1    2.1                   −
5.0    9.5    4.5                   −
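The nearest-neighbor vote on this table can be sketched in Python as follows. This is an illustrative sketch, not a library implementation: the helper name knn_classify is hypothetical, and an ASCII "-" stands in for the − class label.

```python
from collections import Counter

# Training data from the table: (X value, class label).
data = [(0.6, "-"), (3.2, "-"), (4.5, "+"), (4.6, "+"), (4.9, "+"),
        (5.2, "-"), (5.6, "-"), (5.8, "+"), (7.1, "-"), (9.5, "-")]

def knn_classify(query, k, points=data):
    """Return the majority label among the k points closest to query."""
    nearest = sorted(points, key=lambda p: abs(p[0] - query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_classify(5.0, 3))  # +
print(knn_classify(5.0, 9))  # -
```

With odd k and two classes the vote can never tie, which is one reason odd values of k are a common choice.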

Using the 3-nearest-neighbor method:
The 3 points closest to x = 5.0 are those with the smallest differences -> 4.9, 5.2, 4.6
Classes -> + − +
By majority vote, the 3-nearest-neighbor classification is: +
Using the 9-nearest-neighbor method:
The 9 points closest to x = 5.0 are those with the smallest differences -> 4.9, 5.2, 4.6, 4.5, 5.6, 5.8, 3.2, 7.1, 0.6
Classes -> + − + + − + − − −
By majority vote (4 votes for +, 5 votes for −), the 9-nearest-neighbor classification is: −
Question:
Suppose a group of 12 sales price records has been sorted as follows:
5, 10, 11, 13, 15, 35, 50, 55, 72, 90, 204, 215.
Partition them into three bins by each of the following methods.
(a) equal-frequency partitioning
(b) equal-width partitioning
(c) clustering
Answer:
(a) equal-frequency (equi-depth) partitioning:
Partition the data into equi-depth bins of depth 4 (12 values / 3 bins = 4 values per bin):
Bin 1: 5, 10, 11, 13
Bin 2: 15, 35, 50, 55
Bin 3: 72, 90, 204, 215
(b) equal-width partitioning:
Partitioning the data into 3 equal-width bins requires a width of (215−5)/3 = 70.
Starting at the minimum value, this gives the intervals [5, 75), [75, 145), [145, 215].
Bin 1: 5, 10, 11, 13, 15, 35, 50, 55, 72
Bin 2: 90
Bin 3: 204, 215
(c) clustering:
Using k-means clustering to partition the data into three bins, we get
Bin 1: 5, 10, 11, 13, 15, 35
Bin 2: 50, 55, 72, 90
Bin 3: 204, 215
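The three partitionings can be reproduced with a short Python sketch. Two assumptions are made here: the equal-width intervals start at the minimum value (5) and the maximum value is placed in the last bin, and the k-means result is only checked for stability rather than recomputed, since k-means output depends on its starting centers. All function names are illustrative.

```python
prices = [5, 10, 11, 13, 15, 35, 50, 55, 72, 90, 204, 215]

def equal_frequency_bins(values, n_bins):
    """Split sorted values into n_bins bins holding the same number of values."""
    values = sorted(values)
    depth = len(values) // n_bins  # assumes len(values) is divisible by n_bins
    return [values[i * depth:(i + 1) * depth] for i in range(n_bins)]

def equal_width_bins(values, n_bins):
    """Split the range [min, max] into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in sorted(values):
        # The maximum value would index one past the end, so clamp it.
        index = min(int((v - lo) / width), n_bins - 1)
        bins[index].append(v)
    return bins

def is_kmeans_fixed_point(clusters):
    """True if every value is closest to the mean of its own cluster."""
    means = [sum(c) / len(c) for c in clusters]
    for i, cluster in enumerate(clusters):
        for v in cluster:
            nearest = min(range(len(means)), key=lambda j: abs(v - means[j]))
            if nearest != i:
                return False
    return True

print(equal_frequency_bins(prices, 3))
# [[5, 10, 11, 13], [15, 35, 50, 55], [72, 90, 204, 215]]
print(equal_width_bins(prices, 3))
# [[5, 10, 11, 13, 15, 35, 50, 55, 72], [90], [204, 215]]
print(is_kmeans_fixed_point([[5, 10, 11, 13, 15, 35], [50, 55, 72, 90], [204, 215]]))
# True
```

The last check confirms that the clustering in (c) is a stable k-means solution: reassigning each value to its nearest cluster mean leaves the bins unchanged.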
Question:
a. How do you evaluate a classifier when there is a class imbalance?
Answer: In the normal case, accuracy and error rate are adequate measures. Under class imbalance they can be misleading, so we need sensitivity (the true positive rate) and specificity (the true negative rate).
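A small Python sketch shows why accuracy alone is not enough. The counts below are hypothetical and chosen only for illustration: 10 positive records, 990 negative records, and a classifier that labels every record negative.

```python
def evaluate(tp, fn, tn, fp):
    """Return accuracy, sensitivity and specificity from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return accuracy, sensitivity, specificity

# Hypothetical counts: every record is predicted negative.
print(evaluate(tp=0, fn=10, tn=990, fp=0))  # (0.99, 0.0, 1.0)
```

The accuracy is 99% although not a single positive record is found, which the sensitivity of 0 makes visible.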
b. Assume that a search for ‘computer programming’ gave you a result of 100 web pages. Give 5-6 factors that would have been used by the search engine to determine the order in which the result pages are listed.
Answer:
Frequency of the search terms in the pages; place of occurrence (title, tags, paragraphs, etc.); number of users visiting the page; number of links to the page; geographical location; search and click history of the user; importance of the domain; whether the search term appears in the domain name; etc.
c. How do clustering tendency and cluster validity help in data mining?
Answer:
Clustering makes sense only if the data is non-random. Clustering tendency measures such as the Hopkins statistic help to check this.
Cluster validity helps in evaluating clusters using unsupervised, supervised, or relative measures.
Question:
1. A database has five transactions. Let min_sup = 60% and min_conf = 80%. (5+2 marks)
TID     Items bought
T100    Bread, Butter, Beans, Potato, Jam, Milk
T200    Bread, Butter, Shampoo, Potato, Jam, Milk
T300    Beans, Soap, Butter, Bread
T400    Beans, Onion, Apple, Butter, Milk
T500    Apple, Banana, Jam, Bread, Butter

(b) List all of the strong association rules (with support s and confidence c) matching the following: buys(X; item1) ^ buys(X; item2) => buys(X; item3) [s; c]
Solution:
Note: To solve this question, first work through the reference question below for practice; you can then solve this one yourself by the same method.
Reference question:
A database has five transactions. Let min_sup = 60% and min_conf = 75%.
TID     Items bought
T100    M, O, N, N, K, E, Y, Y
T200    D, D, O, N, K, E, Y
T300    M, M, A, K, E, E
T400    M, U, C, C, Y, C, E, O
T500    C, O, O, K, I, I, E

(a) Find all frequent itemsets using Apriori method.
Solution:
The database is scanned once to generate the frequent 1-itemsets. To do this, I use absolute support, where duplicate values are counted only once per TID. The total number of TIDs is 5, so a minimum support of 60% is equivalent to a support count of 3 out of 5. Thus itemsets with a support count of 1 or 2 are eliminated.
Table 1a. 1-itemset results, raw
Table 1b. 1-itemset results, consolidated
Now the database is scanned a second time to generate the frequent 2-itemsets.
The five frequent items give 5!/(3!2!) = 10 possible combinations. Using absolute support, each combination is counted per TID, and combinations below the support count of 3 are eliminated.
Table 2a. 2-itemset results, raw
Table 2b. 2-itemset results, consolidated
I proceed to scan the database again to generate the frequent 3-itemsets.
The sets {E, K}, {K, O}, {E, O} make {E, K, O} possible. Likewise, {E, O}, {E, Y}, {O, Y} make {E, O, Y} possible.
Table 3a. 3-itemset results
Frequent 4-itemsets cannot be generated, because the sets {K, O, Y} and {E, K, Y} are not frequent. So all frequent itemsets have been found.
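The level-wise search described above can be sketched in Python. This is a simplified illustration of the Apriori idea on the reference data, not an optimized implementation; the names support_count and frequent_itemsets are chosen here, and each transaction is written as a string of single-letter items.

```python
from itertools import combinations

transactions = {
    "T100": "MONNKEYY",
    "T200": "DDONKEY",
    "T300": "MMAKEE",
    "T400": "MUCCYCEO",
    "T500": "COOKIIE",
}
# Duplicates inside a transaction count once, so convert each to a set.
baskets = [set(items) for items in transactions.values()]
min_count = 3  # 60% of 5 transactions

def support_count(itemset):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for basket in baskets if itemset <= basket)

def frequent_itemsets():
    """Level-wise search returning {itemset: support count}."""
    items = sorted(set().union(*baskets))
    level = [frozenset([i]) for i in items if support_count({i}) >= min_count]
    result = {}
    k = 1
    while level:
        for itemset in level:
            result[itemset] = support_count(itemset)
        k += 1
        survivors = sorted(set().union(*level))
        candidates = [frozenset(c) for c in combinations(survivors, k)]
        # Apriori pruning: every (k-1)-subset must already be frequent.
        level = [c for c in candidates
                 if all(frozenset(s) in result for s in combinations(c, k - 1))
                 and support_count(c) >= min_count]
    return result

found = frequent_itemsets()
for itemset in sorted(found, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), found[itemset])
```

On this data the search returns five frequent 1-itemsets (E, K, M, O, Y), six frequent 2-itemsets, and the two frequent 3-itemsets {E, K, O} and {E, O, Y}, each of the latter with a support count of 3.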
(b) List all of the strong association rules (with support s = 60% and confidence c = 75%) matching the following metarule, where X is a variable representing customers and item_i denotes variables representing items (e.g., “A”, “B”, etc.): buys(X; item1) ^ buys(X; item2) => buys(X; item3) [s; c]
Solution:
The largest frequent itemsets are {E, K, O} and {E, O, Y}. Thus there can be 2 × 3!/(1!2!) = 6 possible association rules following the metarule of selecting 2 inputs for testing association with 1 output. Each rule has support 3/5 = 60%.
Association rules from {E, K, O}:
R1. E ^ K => O: confidence = #{E, K, O} / #{E, K} = 3 / 4 = 75%
Therefore, R1 is a strong association rule.
R2. E ^ O => K: confidence = #{E, K, O} / #{E, O} = 3 / 4 = 75%
Therefore, R2 is a strong association rule.
R3. K ^ O => E: confidence = #{E, K, O} / #{K, O} = 3 / 3 = 100%
Therefore, R3 is a strong association rule.
Association rules from {E, O, Y}:
R4. E ^ O => Y: confidence = #{E, O, Y} / #{E, O} = 3 / 4 = 75%
Therefore, R4 is a strong association rule.
R5. E ^ Y => O: confidence = #{E, O, Y} / #{E, Y} = 3 / 3 = 100%
Therefore, R5 is a strong association rule.
R6. O ^ Y => E: confidence = #{E, O, Y} / #{O, Y} = 3 / 3 = 100%
Therefore, R6 is a strong association rule.
In this case all 6 association rules are strong, meaning that customers who purchase any two of the products E, K, O are likely to purchase the remaining one, and customers who purchase any two of the products E, O, Y are likely to purchase the remaining one.
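The confidence values above can be recomputed with a short Python sketch. It assumes the two frequent 3-itemsets are already known and only enumerates rules of the form "two items imply the third"; the helper names count and rules are illustrative.

```python
from itertools import combinations

# Reference transactions with duplicates removed.
baskets = [set("MONKEY"), set("DONKEY"), set("MAKE"), set("MUCYEO"), set("COKIE")]

def count(items):
    """Number of transactions that contain all the given items."""
    return sum(1 for basket in baskets if set(items) <= basket)

def rules(itemset, min_conf=0.75):
    """Return (antecedent, consequent, support, confidence) for strong rules."""
    strong = []
    for antecedent in combinations(sorted(itemset), 2):
        consequent = (set(itemset) - set(antecedent)).pop()
        support = count(itemset) / len(baskets)
        confidence = count(itemset) / count(antecedent)
        if confidence >= min_conf:
            strong.append((antecedent, consequent, support, confidence))
    return strong

for itemset in ("EKO", "EOY"):
    for (a, b), consequent, s, c in rules(itemset):
        print(f"{a} ^ {b} => {consequent}  [s={s:.0%}, c={c:.0%}]")
```

Raising min_conf to 0.8 would keep only the three rules with 100% confidence, which shows how sensitive the rule list is to the threshold.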
Question: Give appropriate solutions for the following. (3+3 = 6 marks)
a. Suppose that the data for analysis includes the attribute age. The age values for the data tuples are 70, 20, 16, 16, 52, 15, 20, 21, 22, 25, 22, 30, 25, 46, 25, 33, 36, 35, 40, 35, 35, 33, 35, 45, 25, 19, 13. Use smoothing by bin means to smooth the above data, using a bin depth of 3. Illustrate your steps.
Answer:
Step 1: Sort the data. 13, 15, 16, 16, 19, 20, 20, 21, 22,
22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35,
35, 36, 40, 45, 46, 52, 70
Step 2: Partition the data into equal-frequency bins of size 3.
Bin 1: 13, 15, 16
Bin 2: 16, 19, 20
Bin 3: 20, 21, 22
Bin 4: 22, 25, 25
Bin 5: 25, 25, 30
Bin 6: 33, 33, 35
Bin 7: 35, 35, 35
Bin 8: 36, 40, 45
Bin 9: 46, 52, 70
Step 3: Calculate the arithmetic mean of each bin.
Step 4: Replace each of the values in each bin by the arithmetic mean calculated for the bin (rounded to two decimals).
Bin 1: 14.67, 14.67, 14.67
Bin 2: 18.33, 18.33, 18.33
Bin 3: 21, 21, 21
Bin 4: 24, 24, 24
Bin 5: 26.67, 26.67, 26.67
Bin 6: 33.67, 33.67, 33.67
Bin 7: 35, 35, 35
Bin 8: 40.33, 40.33, 40.33
Bin 9: 56, 56, 56
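The four steps can be sketched in Python as follows. The function name smooth_by_bin_means is chosen for illustration, and the means are rounded to two decimals, which is an assumption about the desired precision.

```python
ages = [70, 20, 16, 16, 52, 15, 20, 21, 22, 25, 22, 30, 25, 46, 25,
        33, 36, 35, 40, 35, 35, 33, 35, 45, 25, 19, 13]

def smooth_by_bin_means(values, depth):
    """Sort, cut into bins of `depth` values, and replace each value by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), depth):
        bin_values = ordered[start:start + depth]
        mean = sum(bin_values) / len(bin_values)
        smoothed.append([round(mean, 2)] * len(bin_values))
    return smoothed

for number, bin_values in enumerate(smooth_by_bin_means(ages, 3), start=1):
    print(f"Bin {number}: {bin_values}")
```

The 27 values divide evenly into 9 bins of depth 3; with a length that is not a multiple of the depth, this sketch would simply leave a shorter last bin.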
b. Outliers are often discarded as noise. However, one person’s garbage could be another’s treasure. For example, exceptions in credit card transactions can help us detect the fraudulent use of credit cards. Taking fraud detection as an example, propose two methods that can be used to detect outliers and discuss which one is more reliable.
Answer:
- Using clustering techniques: After clustering, the different clusters represent the different kinds of data (transactions). The outliers are those data points that do not fall into any cluster. Among the various kinds of clustering methods, density-based clustering may be the most effective.
- Using prediction (or regression) techniques: Construct a probability (regression) model based on all of the data. If the predicted value for a data point differs greatly from the given value, then the given value may be considered an outlier.
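The prediction-based approach can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the data is made up, the model is a simple least-squares line, and a point is flagged when its residual exceeds two standard deviations of all residuals.

```python
from statistics import mean, pstdev

def regression_outliers(xs, ys, threshold=2.0):
    """Fit y = a + b*x by least squares and return the indices whose
    residual is more than `threshold` residual standard deviations from zero."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    spread = pstdev(residuals)
    return [i for i, r in enumerate(residuals) if abs(r) > threshold * spread]

# Made-up data: y = 10 + 2x except for one inflated value at index 4.
xs = list(range(9))
ys = [10, 12, 14, 16, 58, 20, 22, 24, 26]
print(regression_outliers(xs, ys))  # [4]
```

Because the model is fitted on all of the data, the outlier itself shifts the fitted line, so the threshold has to be chosen with care.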