An Innovative Approach in Text Mining
(1) R. Santhanalakshmi, Research Scholar,
Dept. of MCA, Computer Centre,
Madurai Kamaraj University, [email protected]
(2) Dr. K. Alagarsamy, Associate Professor,
Dept. of MCA, Computer Centre,
Madurai Kamaraj University, [email protected]
Abstract:
Text mining covers classification and predictive modelling; techniques based on bootstrapping re-use a source data set for a specific application, which helps avoid information overload and redundancy. The resulting classification and prediction output is smaller than the original data source. Text mining is the common approach used to examine text and data in order to draw conclusions about the structure of, and relationships between, the sets of information contained in the original collection, or to approximate expected values. In this paper we retrieve bovine-disease information from the internet using k-means clustering and principal component analysis.
Keywords: Bovine Diseases, K-Means Clustering, Principal Component Analysis.
I. Introduction:
1.1 Bovine Diseases:
Bovine diseases are the common diseases of the cattle sector. They take a variety of forms and present many symptoms; we discuss some of the forms here. BVDV is one of the common causes of infectious abortion. It is also associated with a wide range of conditions, from infertility to pneumonia, diarrhoea and poor growth. BVDV is normally the major viral cause of disease in cattle. BVDV belongs to the family of pestiviruses; other diseases associated with pestiviruses include classical swine fever and border disease in sheep. Pestiviruses infect cloven-hoofed stock only, and BVDV has been found in pigs and sheep. Because BVDV causes such a wide range of disease, it is rarely possible to diagnose it from clinical signs alone. Testing the blood for antibodies and virus is the best method of diagnosis. A paired blood sample for antibodies is useful for pneumonia, diarrhoea and infertility: if the first sample is taken when the animal is ill and the second two to three weeks later, a rise in antibodies suggests that there was active infection.

BVD is a viral disease of cattle caused by a pestivirus. It has many different manifestations in a herd, depending on the herd's immune and reproductive status. Transient diarrhoea, mixed respiratory infection, infertility or abortion, and mucosal disease are the most common clinical signs of the disease and can be seen simultaneously in a herd. Because of its varied manifestations and subclinical nature in many herds, the significance of the disease was not understood until recently, when diagnostic methods improved. Bovine herpesvirus 1 is a virus of the family Herpesviridae that causes several diseases in cattle worldwide, including rhinotracheitis,
R. Santhanalakshmi, Dr. K. Alagarsamy, Int. J. Comp. Tech. Appl., Vol 2 (1), 193-198. ISSN: 2229-60
vaginitis, balanoposthitis, abortion, conjunctivitis and enteritis. BHV-1 is also a contributing factor in shipping fever. Bovine leukemia virus (BLV) is a bovine virus closely related to HTLV-I, a human tumour virus. BLV is a retrovirus which integrates a DNA intermediate as a provirus into the DNA of the B-lymphocytes of blood and milk. It contains an oncogene coding for a protein called Tax.
1.2 K-Means Clustering:
In statistics and machine learning, k-means clustering [4] is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. It is similar to the expectation-maximization algorithm for mixtures of Gaussians in that both attempt to find the centers of natural clusters in the data, and both employ an iterative refinement approach.
Procedure:
- As a first step of the cluster analysis, the user decides on the number of clusters, k. This parameter takes an integer value with a lower bound of 1 and an upper bound equal to the total number of samples.
- The algorithm is initiated by creating k different clusters.
- The given sample set is first randomly distributed between these k clusters.
- Next, the distance from each sample within a given cluster to its respective cluster centroid is calculated.
- Samples are then moved to the cluster whose centroid lies at the shortest distance from the sample.
The k-means algorithm is repeated a number of times to obtain an optimal clustering solution, each time starting from a different random set of initial clusters.
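The restart strategy described above can be sketched as follows. This is a minimal pure-Python illustration on invented toy points, not the authors' Matlab implementation; the helper names `kmeans` and `kmeans_restarts` are our own.

```python
import math
import random

def kmeans(points, k, iters=100):
    """One k-means run from a random initialization; returns (centroids, labels, sse)."""
    centroids = random.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each sample to the cluster with the nearest centroid (Euclidean distance).
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
        # Replace each centroid with the mean of the samples assigned to it.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:  # no mean changed: converged
            break
        centroids = new_centroids
    sse = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return centroids, labels, sse

def kmeans_restarts(points, k, restarts=20):
    """Repeat k-means from random initial clusters and keep the lowest-error run."""
    return min((kmeans(points, k) for _ in range(restarts)), key=lambda run: run[2])

random.seed(0)  # make the toy run reproducible
data = [(0.1, 0.2), (0.0, 0.1), (5.0, 5.1), (5.2, 4.9), (9.8, 0.1), (10.0, 0.0)]
centroids, labels, sse = kmeans_restarts(data, k=3)
```

Keeping the run with the smallest sum of squared errors is one common way to realize "an optimal clustering solution" across random restarts.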
1.3 Principal Component Analysis:
The main premise of PCA-based dimension reduction is that PCA retains the dimensions with the largest variances. Mathematically, this is equivalent to finding the best low-rank approximation of the data via the singular value decomposition. However, this noise-reduction property alone is inadequate to explain the effectiveness of PCA.
PCA is also a basic method of social-network mining, with applications to ranking and clustering, and it can be further deployed in marketing and in user segmentation by selecting communities with desired or undesired properties. In particular, the friends list of a blog can be used for social filtering, that is, reading the posts that friends write or have recently read.

Principal component analysis is similar to the HITS ranking algorithm; in fact, the hub and authority rankings are defined by the first left and right singular vectors, and the use of higher dimensions has already been suggested and analyzed in detail. Several authors use HITS for measuring authority in mailing lists or blogs, the latter result observing a strong correlation between HITS score and degree, indicating that the first principal axis contains no high-level information but simply orders by number of friends. HITS-style ranking can therefore be used, but with special care, due to the Tightly Knit Community (TKC) effect, in which communities that are small at the global level grab the first principal axes. The first work to identify the TKC problem in the HITS algorithm gave an algorithmic solution that, however, turns out to merely compute in- and out-degrees. In contrast, we keep PCA as the underlying matrix method and filter the relevant high-level structural information by removing tightly knit communities.
II. Proposed Method (SAN Method):
In our method we combine k-means clustering and principal component analysis for effective clustering and an optimized solution. While searching for information on the internet, we must retrieve the information we actually require, otherwise the search becomes null and void. Every clustering method has its own strategy and importance. We cannot say that a single clustering mechanism is enough for every kind of search, nor can we ensure that every clustering method provides the same result for the same key term. For this reason we combined both techniques and propose a new idea for optimizing search over a large database, the internet, etc. Both techniques relate to clustering: k-means groups the source data into certain groups, called clusters, based on a distance measure, while principal component analysis focuses on dimension reduction based on mathematical models.
Our domain information relates to bovine disease, which is very specific, rather than searching across all domains. Even with a specific domain, we still have to search throughout the internet if the search is online, or otherwise through a large database. In our earlier work we used a modified HITS algorithm for searching, and in another we used a stemming algorithm with hierarchical clustering. Here we combine k-means and principal component analysis and evaluate the results. Our research ends with a comparison among all those approaches to determine which technique is best for our task.
Given a bovine-disease keyword as the search element, we first form the initial clusters. For example, suppose we have n sample feature vectors bv1, bv2, ..., bvn, all from the same class, and we know that they fall into k compact clusters, k < n. Let mi be the mean of the vectors in cluster i. For calculating distances we use the Euclidean distance formula, a standard and simple way to measure the distance between two elements. If the clusters are well separated, we can use a minimum-distance classifier to separate them: we can say that x is in cluster i if ||x - mi|| is the minimum of all the k distances. This suggests the following procedure for finding the k means:
- Make initial guesses for the means m1, m2, ..., mk.
- Until there are no changes in any mean:
  - Use the estimated means to classify the samples into clusters.
  - For i from 1 to k, replace mi with the mean of all of the samples assigned to cluster i.
- end_until

In addition, to improve the k-means algorithm while forming the clusters, we include stemming in two variants: a brute-force approach and suffix stripping. Brute-force stemmers maintain a lookup table containing relations between root forms and inflected forms. To stem a word, the table is queried for a matching inflection; if one is found, the word is replaced by its associated root. Suffix-stripping algorithms use no lookup table; instead, a typically smaller list of rules is stored which provides a path for the algorithm, given an input word form, to find its root form. Some examples of the rules: Rule 1) if the word ends in 'ed', remove the 'ed'; Rule 2) if the word ends in 'ing', remove the 'ing'; Rule 3) if the word ends in 'ly', remove the 'ly'. In this way groups of clusters are formed at the final stage, but we cannot claim these are the final optimized result, so we analyze the clusters further: the final clusters are passed to principal component analysis.
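The three quoted rules can be sketched as a toy suffix-stripping stemmer, with a brute-force lookup stemmer for contrast. The rule order, the minimum-stem-length guard, and the sample lookup entries are our own assumptions, not taken from the text.

```python
def strip_suffix(word):
    """Suffix-stripping stemmer for the three rules quoted in the text.

    Rules are tried in order, and a rule only fires when it leaves a stem of
    at least three letters (a guard of our own, so 'red' is not cut to 'r').
    """
    for suffix in ("ing", "ed", "ly"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# A brute-force stemmer, by contrast, queries a lookup table relating
# inflected forms to root forms (these entries are examples only).
LOOKUP = {"calves": "calf", "diseases": "disease"}

def brute_force_stem(word):
    return LOOKUP.get(word, word)
```

The suffix-stripping variant needs no table but can over- or under-stem; the brute-force variant is exact but only for words it has seen.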
Principal component analysis is a mathematical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of uncorrelated variables called principal components.
PCA is mathematically defined as an orthogonal linear transformation that maps the data to a new coordinate system such that the greatest variance under any projection of the data comes to lie on the first coordinate, the second greatest variance on the second coordinate, and so on.
Define a data matrix, X^T, with zero empirical mean, where each of the n rows represents a different repetition of the experiment, and each of the m columns gives a particular kind of datum. The singular value decomposition of X is X = W Σ V^T, where the m × m matrix W is the matrix of eigenvectors of XX^T (so that XX^T = W Σ Σ^T W^T), the matrix Σ is an m × n rectangular diagonal matrix with nonnegative real numbers on the diagonal, and the matrix V is n × n. The PCA transformation that preserves dimensionality is then given by:

Y^T = X^T W = V Σ^T W^T W = V Σ^T

V is not uniquely defined in the usual case when m < n - 1.
Given a set of points in Euclidean space, the first principal component corresponds to a line that passes through the multidimensional mean and minimizes the sum of squares of the distances of the points from the line. The second principal component corresponds to the same concept after all correlation with the first principal component has been subtracted out from the points. The singular values (in Σ) are the square roots of the eigenvalues of the matrix XX^T. Each eigenvalue is proportional to the portion of the variance that is associated with its eigenvector. The sum of all the eigenvalues equals the sum of the squared distances of the points from their multidimensional mean. PCA essentially rotates the set of points around their mean so as to align them with the principal components. This moves as much of the variance as possible into the first few dimensions; the values in the remaining dimensions therefore tend to be small and may be dropped with minimal loss of information. Finally, we obtain the reduced cluster as the output of our query.
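The SVD-based reduction above can be sketched as follows, assuming NumPy is available. The `pca_reduce` helper and the toy data are invented for illustration and are not the authors' Matlab code.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the rows of X onto its top principal components via the SVD.

    X has one sample per row. The data are centered first, so the SVD of the
    centered matrix yields the principal axes as the rows of Vt, ordered by
    decreasing singular value.
    """
    Xc = X - X.mean(axis=0)                      # zero empirical mean
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    # The squared singular values are proportional to the variance captured
    # by each axis, so keeping the first rows of Vt keeps the most variance.
    return Xc @ Vt[:n_components].T

rng = np.random.default_rng(0)
# Toy data: points stretched along one direction in 3-D, plus small noise,
# so a single principal component should capture almost all the variance.
base = rng.normal(size=(200, 1)) @ np.array([[3.0, 1.0, 0.5]])
X = base + 0.05 * rng.normal(size=(200, 3))
Y = pca_reduce(X, 1)
```

Dropping the remaining components here loses almost no information, which is exactly the dimension-reduction property the method relies on.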
III. Result Analysis:
Simulation is carried out in Matlab (Matrix Laboratory) software. As an example, take the query "Symptoms of bovine leukemia". First we show the outcome of k-means clustering in Fig. 1 and Fig. 2.

Fig. 1

Fig. 2

The k-means cluster output is then given to the principal component analysis, which generates the variance matrix; this is reduced in further steps until we finally obtain the result of the query. The comparison analysis gives the performance evaluation of the combined approach against each individual method:
(Terms clustered for the query, as shown in Fig. 1 and Fig. 2, include: feeding, gouge dehorning, rhinotracheitis, ataxia, palpation, provirus, lymphocytes, mononucleosis, B-cell leukemia, colostrum, leukaemia, BLV.)

Sample size | SAN method | K-means | PCA
2750  | 0.91 | 0.87 | 0.82
4550  | 0.81 | 0.75 | 0.69
7700  | 0.84 | 0.65 | 0.62
10100 | 0.89 | 0.73 | 0.65
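As a sanity check on the stated claim, the reported ratios can be transcribed and verified programmatically; the dictionary below simply copies the comparison table above.

```python
# Performance ratios transcribed from the comparison table, keyed by sample size.
results = {
    2750:  {"SAN": 0.91, "k-means": 0.87, "PCA": 0.82},
    4550:  {"SAN": 0.81, "k-means": 0.75, "PCA": 0.69},
    7700:  {"SAN": 0.84, "k-means": 0.65, "PCA": 0.62},
    10100: {"SAN": 0.89, "k-means": 0.73, "PCA": 0.65},
}
# The claim in the text: SAN outperforms both baselines at every sample size.
san_best = all(row["SAN"] >= max(row["k-means"], row["PCA"]) for row in results.values())
```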
As the result analysis depicts, the SAN method's performance is higher than that of the other methods, and as the data set grows the performance ratio decreases for both k-means and principal component analysis.
IV. Conclusion:
In this paper we provided an effective method for information retrieval about bovine disease. The SAN method gives an optimal solution compared with principal component analysis and k-means alone. In our earlier work we focused on enhancing Medline & PubMed search using a modified HITS algorithm, and we also experimented with stemming algorithms. Our conclusion is that, among all those methods, the SAN method gave the most effective solution for bovine-disease searching.
V. References:
[1] Lada A. Adamic and Natalie Glance. The political blogosphere and the 2004 U.S. election: divided they blog. In LinkKDD '05: Proceedings of the 3rd International Workshop on Link Discovery, pages 36-43, New York, NY, USA, 2005. ACM.
[2] Pedro Domingos and Matt Richardson. Mining the network value of customers. In KDD '01: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 57-66, New York, NY, USA, 2001. ACM.
[3] Lars Backstrom, Dan Huttenlocher, Jon Kleinberg, and Xiangyang Lan. Group formation in large social networks: membership, growth, and evolution. In KDD '06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 44-54, New York, NY, USA, 2006. ACM Press.
[4] D. Cheng, R. Kannan, S. Vempala, and G. Wang. On a recursive spectral algorithm for clustering from pairwise similarities. Technical Report MIT-LCS-TR-906, MIT LCS, 2003.
[5] Matthew Hurst, Matthew Siegler, and Natalie Glance. On estimating the geographic distribution of social media. In Proceedings of the International Conference on Weblogs and Social Media (ICWSM 2007), 2007.
[6] M. Newman. Detecting community structure in networks. The European Physical Journal B - Condensed Matter, 38(2):321-330, March 2004.
[7] Jun Zhang, Mark S. Ackerman, and Lada Adamic. Expertise networks in online communities: structure and algorithms. In WWW '07: Proceedings of the 16th International Conference on World Wide Web, pages 221-230, New York, NY, USA, 2007. ACM Press.