An Innovative Approach in Text Mining


(1) R. Santhanalakshmi, Research Scholar,
Dept. of MCA, Computer Center,
Madurai Kamaraj University, [email protected]

(2) Dr. K. Alagarsamy, Associate Professor,
Dept. of MCA, Computer Center,
Madurai Kamaraj University, [email protected]

Abstract:

Text mining covers classification and predictive modelling based on bootstrapping techniques that re-use a source data set for a specific application, with the aim of avoiding information overload and redundancy. The resulting classification and prediction output is small compared with the original data source. Text mining is a common approach used to examine text and data in order to draw conclusions about the structure of, and relationships between, sets of information contained in the original set, or to approximate some expected values. In this paper we retrieve information on bovine diseases from the internet using k-means clustering and principal component analysis.

Keywords: Bovine Diseases, K-Means Clustering, Principal Component Analysis.

I. Introduction:

1.1 Bovine Diseases:

Bovine diseases are the common diseases of the cattle sector. They take a variety of forms and show a large number of symptoms; here we discuss some of these forms. Bovine viral diarrhoea virus (BVDV) is one of the common causes of infectious abortion. It is also associated with a wide range of diseases, from infertility to pneumonia, diarrhoea and poor growth. BVDV is normally the major viral cause of disease in cattle. BVDV belongs to the pestiviruses. Other diseases associated with other pestiviruses include classical swine fever and border disease in sheep. Pestiviruses infect cloven-hoofed stock only; BVDV has been found in pigs and sheep. Because BVDV causes such a wide range of disease, it is rarely possible to diagnose it on clinical signs alone. Testing the blood for antibodies and virus is the best method of diagnosis. A paired blood sample for antibodies is useful for pneumonia, diarrhoea and infertility. If the first sample is taken when the animal is ill and the second two to three weeks later, a rise in antibodies suggests that there was active infection.

BVD is a viral disease of cattle caused by a pestivirus. It has many different manifestations in a herd, depending on the herd's immune and reproductive status. Transient diarrhoea, mixed respiratory infection, infertility or abortion, and mucosal disease are the most common clinical signs of the disease and can be seen simultaneously in a herd. Due to its varied manifestations and subclinical nature in many herds, the significance of the disease was not understood until recently, when diagnostic methods improved.

Bovine herpesvirus 1 (BHV-1) is a virus of the family Herpesviridae that causes several diseases worldwide in cattle, including rhinotracheitis, vaginitis, balanoposthitis, abortion, conjunctivitis and enteritis. BHV-1 is also a contributing factor in shipping fever. Bovine leukemia virus (BLV) is a bovine virus closely related to HTLV-I, a human tumour virus. BLV is a retrovirus which integrates a DNA intermediate as a provirus into the DNA of B-lymphocytes of blood and milk. It contains an oncogene coding for a protein called Tax.

R. Santhanalakshmi, Dr. K. Alagarsamy, Int. J. Comp. Tech. Appl., Vol 2 (1), 193-198

ISSN: 2229-60

1.2 K-Means Clustering:

In statistics and machine learning, k-means clustering [4] is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. It is similar to the expectation-maximization algorithm for mixtures of Gaussians in that both attempt to find the centers of natural clusters in the data, and in the iterative refinement approach employed by both algorithms.

Procedure:

As a first step to the cluster analysis, the user decides on the number of clusters k. This parameter can take integer values with a lower bound of 1 and an upper bound that equals the total number of samples.

The algorithm is initiated by creating k different clusters. The given sample set is first randomly distributed between these k clusters.

As a next step, the distance between each sample within a given cluster and its respective cluster centroid is calculated.

Samples are then moved to the cluster that records the shortest distance from the sample to the cluster centroid.

The K-means algorithm is repeated a number of times to obtain an optimal clustering solution, every time starting with a random set of initial clusters.
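The procedure above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the function names, the toy points, the restart count and the handling of an emptied cluster (re-seeding it with a random sample) are all choices made for the example.

```python
import math
import random


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans_once(samples, k, rng, max_iter=100):
    """One run, started by randomly distributing the samples between k clusters."""
    labels = [i % k for i in range(len(samples))]
    rng.shuffle(labels)  # random partition; every cluster starts non-empty
    for _ in range(max_iter):
        centroids = []
        for j in range(k):
            members = [s for s, lab in zip(samples, labels) if lab == j]
            # Assumption: an emptied cluster is re-seeded with a random sample.
            centroids.append(centroid(members) if members else rng.choice(samples))
        # Move each sample to the cluster whose centroid is nearest (Euclidean distance).
        new_labels = [min(range(k), key=lambda j: math.dist(s, centroids[j]))
                      for s in samples]
        if new_labels == labels:
            break  # no sample moved, so the clustering is stable
        labels = new_labels
    # Sum of squared sample-to-centroid distances, used to compare restarts.
    cost = sum(math.dist(s, centroids[lab]) ** 2 for s, lab in zip(samples, labels))
    return labels, centroids, cost


def kmeans(samples, k, restarts=10, seed=0):
    """Repeat the run from several random starts and keep the lowest-cost result."""
    if not 1 <= k <= len(samples):
        raise ValueError("k must lie between 1 and the number of samples")
    rng = random.Random(seed)
    runs = [kmeans_once(samples, k, rng) for _ in range(restarts)]
    return min(runs, key=lambda run: run[2])


# Toy data (illustrative only): two well-separated groups of 2-D points.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
labels, centroids, cost = kmeans(points, k=2)
```

Keeping the lowest-cost run is one common way to read "optimal clustering solution" when the algorithm is restarted from several random partitions; the paper does not say which criterion it uses.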

1.3 Principal Component Analysis:

The main basis of PCA-based dimension reduction is that PCA picks up the dimensions with the largest variances. Mathematically, this is equivalent to finding the best low-rank approximation of the data via the singular value decomposition. However, this noise reduction property alone is inadequate to explain the effectiveness of PCA.
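The link between largest-variance directions and low-rank approximation can be checked numerically. The sketch below assumes NumPy and uses synthetic data invented for the illustration: 50 samples whose variance lies almost entirely along two hidden directions. It truncates the SVD at rank 2 and compares the approximation error with the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 50 samples, 5 features, variance concentrated in 2 hidden directions.
latent = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 5))
data = latent + 0.01 * rng.normal(size=(50, 5))
centered = data - data.mean(axis=0)

# Thin SVD: centered = U @ diag(s) @ Vt, singular values in decreasing order.
U, s, Vt = np.linalg.svd(centered, full_matrices=False)

rank = 2
approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]

# Eckart-Young: the Frobenius error of the best rank-r approximation equals
# the square root of the sum of the squared discarded singular values.
error = np.linalg.norm(centered - approx)
expected_error = np.sqrt((s[rank:] ** 2).sum())

# Share of the total variance captured by the first `rank` principal directions.
explained = (s[:rank] ** 2).sum() / (s ** 2).sum()
```

Here `explained` is close to 1 because the synthetic noise is small; on real term-document data the split between retained and discarded variance would need to be inspected rather than assumed.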

PCA is a basic method of social network mining, with applications to ranking and clustering that can be further deployed in marketing and in user segmentation by selecting communities with desired or undesired properties. In particular, the friends list of a blog can be used for social filtering, that is, reading posts that one's friends write or have recently read.

Principal component analysis is similar to the HITS ranking algorithm; in fact, the hub and authority rankings are defined by the first left and right singular vectors, and the use of higher dimensions has already been suggested and analyzed in detail. Several authors use HITS for measuring authority in mailing lists or blogs, the latter result observing a strong correlation between HITS score and degree, indicating that the first principal axis will contain no high-level information but simply order by number of friends. We demonstrate that HITS-style ranking can be used, but with special care, due to the Tightly Knit Community (TKC) effect, which results in communities that are small on a global level grabbing the first principal axes. The TKC problem in the HITS algorithm has been identified before; the algorithmic solution proposed for it, however, turns out to merely compute in- and out-degrees. In contrast, we keep PCA as the underlying matrix method and filter the relevant high-level structural information by removing TK
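The stated relation between HITS and the singular vectors can be illustrated on a small graph. The sketch below assumes NumPy; the four-node adjacency matrix is hypothetical, and the code is the plain HITS power iteration, not the TKC-filtered variant discussed above.

```python
import numpy as np

# Hypothetical directed graph: adjacency[i, j] = 1 if node i links to node j.
adjacency = np.array([
    [0, 1, 1, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
], dtype=float)


def hits(adjacency, iterations=200):
    """Plain HITS power iteration returning unit-length hub and authority scores."""
    hubs = np.ones(adjacency.shape[0])
    for _ in range(iterations):
        # A node's authority is the sum of the hub scores of the nodes linking to it.
        authorities = adjacency.T @ hubs
        authorities /= np.linalg.norm(authorities)
        # A node's hub score is the sum of the authority scores of the nodes it links to.
        hubs = adjacency @ authorities
        hubs /= np.linalg.norm(hubs)
    return hubs, authorities


hubs, authorities = hits(adjacency)

# First left and right singular vectors of the adjacency matrix.
U, s, Vt = np.linalg.svd(adjacency)
# Singular vectors are defined only up to sign, so compare absolute values.
first_left, first_right = np.abs(U[:, 0]), np.abs(Vt[0])
```

On this graph `hubs` agrees with `first_left` and `authorities` with `first_right`, which is the sense in which HITS ranking is a rank-one special case of the SVD behind PCA.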

II. Proposed Method (SAN Method):

In our method we combine k-means clustering and principal component analysis for effective clustering and an optimized solution. When searching for information on the internet we have to get the information we require; otherwise the search is null and void. Every clustering method has its own strategy and importance. We cannot say that a single clustering mechanism is enough for every kind of search, nor can we ensure that every clustering method provides the same result for the same key term. For this reason we combine the two techniques, giving a new idea for optimizing search over a large database or the internet. Both techniques are related to clustering. K-means clustering groups the source data into groups, called clusters, based on a distance measure. Principal component analysis focuses on dimension reduction based on a mathematical model.

Our domain is information related to bovine disease, which is very specific, instead of searching all domains. Even though we have a specific domain, we must search throughout the internet if the search is online, or otherwise through a large database. In our earlier work we used a modified HITS algorithm for searching, and in another work we used a stemming algorithm with hierarchical clustering. Here we combine K-means and principal component analysis and evaluate the results. Our research ends with a comparison between all of these techniques to determine which works best for our task.

The bovine disease keyword is given as the search element, and using that keyword we first form the initial clusters. For example, suppose we have n sample feature vectors $bv_1, bv_2, \ldots, bv_n$, all from the same class, and we know that they fall into k compact clusters, k < n. Let $m_i$ be the mean of the vectors in cluster i. To measure how far a sample lies from a mean we use the Euclidean distance formula, which is a standard and simple way to find the distance between two elements. If the clusters are well separated, we can use a minimum-distance classifier to separate them. That is, we can say that x is in cluster i if $\|x - m_i\|$ is the minimum of all the k distances. This suggests the following procedure for finding the k means:

Make initial guesses for the means m_1, m_2, ..., m_k.
Until there are no changes in any mean
    o Use the estimated means to classify the samples into clusters
    o For i from 1 to k
        o Replace m_i with the mean of all of the samples for cluster i
    o end_for
end_until

In addition, to improve the K-means algorithm while forming the clusters, we include stemming, using both the brute-force approach and suffix stripping. Brute-force stemmers maintain a lookup table which contains relations between root forms and inflected forms. To stem a word, the table is queried to find a matching inflection. If a matching inflection is found, the word is replaced by the associated root form. Suffix-stripping algorithms do not rely on a lookup table. Instead, a typically smaller list of rules is stored which provides a path for the algorithm, given an input word form, to find its root form. Some examples of the rules include: Rule 1) if the word ends in 'ed', remove the 'ed'. Rule 2) if the word ends in 'ing', remove the 'ing'. Rule 3) if the word ends in 'ly', remove the 'ly'. In this way a group of clusters is formed at the final stage, but we cannot say that these are the final optimized result, so we analyze these clusters further: the final clusters are passed on to principal component analysis.
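The two stemming approaches described above can be sketched as follows. The lookup table entries are hypothetical, the three suffix rules are the ones listed in the text, and trying the table before the rules is an assumption made for the example rather than something the paper specifies.

```python
# Hypothetical lookup table mapping inflected forms to root forms (brute-force approach).
LOOKUP = {"ran": "run", "running": "run", "calves": "calf", "infected": "infect"}

# Suffix rules from the text, tried in this order: 'ed', 'ing', 'ly'.
SUFFIXES = ("ed", "ing", "ly")


def brute_force_stem(word):
    """Return the root stored in the table, or None when the word is not listed."""
    return LOOKUP.get(word.lower())


def suffix_strip(word):
    """Remove the first matching suffix; other words are returned lower-cased."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        # Require a longer word so that a word equal to a suffix is left unchanged.
        if lowered.endswith(suffix) and len(lowered) > len(suffix):
            return lowered[: -len(suffix)]
    return lowered


def stem(word):
    """Assumed combination: try the lookup table first, then fall back to the rules."""
    root = brute_force_stem(word)
    return root if root is not None else suffix_strip(word)
```

For example, `suffix_strip("running")` yields `"runn"`, while `stem("running")` yields `"run"` because the table is consulted first. This shows why the table-based approach is more accurate on listed words and the rule-based approach is only an approximation on unlisted ones.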

Principal component analysis is a mathematical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of uncorrelated variables called principal components.

PCA is mathematically defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate, the second greatest variance on the second coordinate, and so on.

Define a data matrix, $X^T$, with zero sample mean, where each of the n rows represents a different repetition of data from the experiment, and each of the m columns gives a particular kind of datum. The singular value decomposition of X is

$$X = W \Sigma V^T,$$

where the $m \times m$ matrix W is the matrix of eigenvectors of $XX^T$, the matrix $\Sigma$ is an $m \times n$ rectangular diagonal matrix with nonnegative real numbers on the diagonal, and the matrix V is $n \times n$. The PCA transformation that preserves dimensionality is then given by:

$$Y^T = X^T W = V \Sigma^T$$

V is not uniquely defined in the usual case when m

$$X X^T = W \Sigma \Sigma^T W^T$$
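Both identities follow by substituting the decomposition $X = W \Sigma V^T$ and using the fact that W and V are orthogonal ($W^T W = I$, $V^T V = I$). The intermediate steps, written out as a check and not taken from the original text, are:

```latex
\begin{aligned}
Y^{T} &= X^{T} W = (W \Sigma V^{T})^{T} W = V \Sigma^{T} W^{T} W = V \Sigma^{T}, \\
X X^{T} &= (W \Sigma V^{T})(W \Sigma V^{T})^{T} = W \Sigma V^{T} V \Sigma^{T} W^{T} = W \Sigma \Sigma^{T} W^{T}.
\end{aligned}
```

Since $\Sigma \Sigma^{T}$ is an $m \times m$ diagonal matrix holding the squared singular values, the second line shows that the columns of W are eigenvectors of $XX^{T}$.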

Given a set of points in Euclidean space, the first principal component corresponds to a line that passes through the multidimensional mean and minimizes the sum of squares of the distances of the points from the line. The second principal component corresponds to the same concept after all correlation with the first principal component has been subtracted out from the points. The singular values (in $\Sigma$) are the square roots of the eigenvalues of the matrix $XX^T$. Each eigenvalue is proportional to the portion of the variance that is correlated with each eigenvector. The sum of all the eigenvalues is equal to the sum of the squared distances of the points from their multidimensional mean. PCA essentially rotates the set of points around their mean in order to align with the principal components. This moves as much of the variance as possible into the first few dimensions. The values in the remaining dimensions, therefore, tend to be small and may be dropped with minimal loss of information. Finally we get the reduced cluster as the output of our query.
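The properties listed above can be verified numerically. The following sketch assumes NumPy and uses synthetic data in the layout of the text (n repetitions as rows of the transposed data matrix, m kinds of datum as columns). The mixing matrix and the 95% variance threshold are assumptions for the illustration, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: X_T has n rows (repetitions) and m columns (kinds of datum).
n, m = 200, 3
mixing = np.array([[3.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.5, 0.2, 0.1]])
X_T = rng.normal(size=(n, m)) @ mixing
X_T = X_T - X_T.mean(axis=0)   # zero sample mean
X = X_T.T                      # m x n, as in the text

# X = W @ Sigma @ V^T; NumPy returns the singular values as a 1-D array.
W, singular_values, Vt = np.linalg.svd(X, full_matrices=False)

# PCA scores Y^T = X^T W: one row per repetition, one column per component.
Y_T = X_T @ W

# Eigenvalues of X X^T are the squared singular values.
eigenvalues = singular_values ** 2
# Their sum equals the summed squared distance of the points from their mean
# (the mean is the origin after centering).
total_squared_distance = (X_T ** 2).sum()

# Fraction of the variance carried by each component, largest first.
variance_share = eigenvalues / eigenvalues.sum()

# Keep the fewest leading components that reach 95% of the variance (assumed threshold).
keep = int(np.searchsorted(np.cumsum(variance_share), 0.95) + 1)
reduced = Y_T[:, :keep]
```

The columns of `Y_T` are uncorrelated, so dropping the trailing ones removes only the variance recorded in the corresponding entries of `variance_share`.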

III. Result Analysis:

The simulation is carried out in MATLAB (Matrix Laboratory) software. For example, take this as the query: "Symptoms of Bovine leukemia". First we see the outcome of K-means clustering in Fig. 1.

Fig. 1

Fig. 2

Terms shown in Fig. 1 and Fig. 2: Feeding, Gouge Dehorning, Rhinotracheitis, Ataxia, Palpation, Provirus, Lymphocytes, Mononucleosis, B-cell leukemia, Colostrum, Leukaemia, BLV.

The K-means cluster output is given to the principal component analysis. It generates the variance matrix, which is reduced in further steps, and finally we get the reduced set of terms as the result of the query. The comparison analysis gives a performance evaluation of the combined approach against K-means and PCA used alone:

Sample size    SAN method    K-means    PCA
2750           0.91          0.87       0.82
4550           0.81          0.75       0.69
7700           0.84          0.65       0.62
10100          0.89          0.73       0.65

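As the result analysis shows, the SAN method performs better than either of the other methods at every sample size. As the data set grows, the performance ratio of K-means and principal component analysis tends to decrease (from 0.87 and 0.82 at 2750 samples to 0.73 and 0.65 at 10100 samples).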
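The comparison can be made explicit with a little arithmetic on the table values. The snippet below only re-uses the numbers reported in the table; the paper does not define the performance measure, so the margins are differences in whatever unit the table uses.

```python
# Scores transcribed from the comparison table: sample size -> (SAN, K-means, PCA).
results = {
    2750: (0.91, 0.87, 0.82),
    4550: (0.81, 0.75, 0.69),
    7700: (0.84, 0.65, 0.62),
    10100: (0.89, 0.73, 0.65),
}

# Margin of the SAN method over each single method, rounded to two decimals.
margins = {
    size: (round(san - kmeans, 2), round(san - pca, 2))
    for size, (san, kmeans, pca) in results.items()
}

# True when SAN has the highest score at every sample size.
san_always_best = all(san > max(kmeans, pca) for san, kmeans, pca in results.values())

# Mean score of each method across the four sample sizes (SAN, K-means, PCA).
means = tuple(round(sum(col) / len(col), 4) for col in zip(*results.values()))
```

The margin over K-means is largest at 7700 samples and the margin over PCA is largest at 10100 samples, which is where the single methods have lost the most ground.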

IV. Conclusion:

In this paper we provided an effective method for information retrieval on bovine disease. The SAN method gives a better solution than principal component analysis or K-means used alone. In our earlier work we focused on enhancing Medline and PubMed search using a modified HITS algorithm, and we also tried stemming algorithms. Our conclusion is that, among all those methods, the SAN method gave the most effective solution for bovine disease search.

V. References:

[1] Lada A. Adamic and Natalie Glance. The political blogosphere and the 2004 U.S. election: divided they blog. In LinkKDD '05: Proceedings of the 3rd international workshop on Link discovery, pages 36-43, New York, NY, USA, 2005. ACM.

[2] Pedro Domingos and Matt Richardson. Mining the network value of customers. In KDD '01: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pages 57-66, New York, NY, USA, 2001. ACM.

[3] Lars Backstrom, Dan Huttenlocher, Jon Kleinberg, and Xiangyang Lan. Group formation in large social networks: membership, growth, and evolution. In KDD '06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 44-54, New York, NY, USA, 2006. ACM Press.

[4] D. Cheng, R. Kannan, S. Vempala, and G. Wang. On a recursive spectral algorithm for clustering from pairwise similarities. Technical report, MIT LCS Technical Report MIT-LCS-TR-906, 2003.

[5] Matthew Hurst, Matthew Siegler, and Natalie Glance. On estimating the geographic distribution of social media. In Proceedings Int. Conf. on Weblogs and Social Media (ICWSM-2007), 2007.

[6] M. Newman. Detecting community structure in networks. The European Physical Journal B - Condensed Matter, 38(2):321-330, March 2004.

[7] Jun Zhang, Mark S. Ackerman, and Lada Adamic. Expertise networks in online communities: structure and algorithms. In WWW '07: Proceedings of the 16th international conference on World Wide Web, pages 221-230, New York, NY, USA, 2007. ACM Press.

