An Innovative Approach in Text Mining


    (1) R. Santhanalakshmi, Research Scholar,

    Dept of MCA, Computer Center,

    Madurai Kamaraj University, [email protected]

    (2) Dr. K. Alagarsamy, Associate Professor,

    Dept of MCA, Computer Center,

    Madurai Kamaraj University, [email protected]

    Abstract:

    Text mining covers classification and predictive modelling. Methods based on bootstrapping techniques re-use a source data set for a specific application and are specialized to avoid information overload and redundancy. The classification and prediction results they offer are small compared with the original data source. Text mining is a common approach for examining text and data in order to draw conclusions about the structure of, and relationships between, the sets of information contained in the original set, or to approximate some expected values. In this paper we retrieve bovine disease information from the internet using k-means clustering and principal component analysis.

    Keywords: Bovine Diseases, K-Means Clustering, Principal Component Analysis.

    I. Introduction:

    1.1 Bovine Diseases:

    Bovine diseases are the common diseases of the cattle sector. They take a variety of forms and have a large number of symptoms. Here we discuss some of these forms. BVDV (bovine viral diarrhoea virus) is one of the common causes of infectious abortion. It is also correlated with a wide range of diseases, from infertility to pneumonia, diarrhoea and poor growth. BVDV is normally the major viral cause of disease in cattle. BVDV belongs to the pestivirus family. Diseases associated with other pestiviruses include classical swine fever and border disease in sheep. Pestiviruses infect cloven-hoofed stock only; BVDV has been found in pigs and sheep. Because BVDV causes such a wide range of disease, it is rarely possible to diagnose it on clinical signs alone. Testing the blood for antibodies and virus is the best method of diagnosis. A paired blood sample for antibodies is useful for pneumonia, diarrhoea and infertility. If the first sample is taken when the animal is ill and the second two to three weeks later, a rise in antibodies suggests that there was active infection.

    BVD is a viral disease of cattle caused by a pestivirus. It has many different manifestations in a herd, depending on the herd's immune and reproductive status. Transient diarrhoea, mixed respiratory infection, infertility or abortion, and mucosal disease are the most common clinical signs of the disease and can be seen simultaneously in a herd. Due to its varied manifestations and subclinical nature in many herds, the significance of the disease was not understood until recently, when diagnostic methods improved.

    Bovine herpesvirus 1 (BHV-1) is a virus of the family Herpesviridae that causes several diseases worldwide in cattle, including rhinotracheitis, vaginitis, balanoposthitis, abortion, conjunctivitis and enteritis. BHV-1 is also a contributing factor in shipping fever. Bovine leukemia virus (BLV) is a bovine virus closely related to HTLV-I, a human tumour virus. BLV is a retrovirus which integrates a DNA intermediate as a provirus into the DNA of B-lymphocytes of blood and milk. It contains an oncogene coding for a protein called Tax.

    R. Santhanalakshmi, Dr. K. Alagarsamy, Int. J. Comp. Tech. Appl., Vol 2 (1), 193-198

    ISSN: 2229-60

    1.2 K-Means Clustering:

    In statistics and machine learning, k-means clustering [4] is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. It is similar to the expectation-maximization algorithm for mixtures of Gaussians in that both attempt to find the centers of natural clusters in the data, and in the iterative refinement approach employed by both algorithms.

    Procedure:

    • As a first step of the cluster analysis, the user decides on the number of clusters k. This parameter can take any integer value, with a lower bound of 1 and an upper bound equal to the total number of samples.

    • The algorithm is initiated by creating k different clusters.

    • The given sample set is first randomly distributed between these k different clusters.

    • As a next step, the distance from each sample within a given cluster to its respective cluster centroid is calculated.

    • Samples are then moved to the cluster that records the shortest distance from the sample to the cluster centroid.

    The K-means algorithm is repeated a number of times in search of the best clustering solution, each time starting with a random set of initial clusters.
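    The procedure above can be sketched in a few lines of Python with NumPy. This is a minimal sketch under stated assumptions, not the implementation used in the paper (which reports MATLAB simulations): the function names, the synthetic two-blob data, the choice of 10 restarts and the use of the within-cluster sum of squared distances to pick the best run are all illustrative assumptions.

```python
import numpy as np

def kmeans_random_partition(X, k, rng, max_iter=100):
    """One K-means run that starts from a random assignment of samples to clusters."""
    n = X.shape[0]
    # Labels 0..k-1 are dealt out round-robin and then shuffled,
    # so every cluster starts with at least one sample (requires n >= k).
    labels = rng.permutation(np.arange(n) % k)
    for _ in range(max_iter):
        # Centroid of each cluster; an empty cluster is re-seeded with a random sample.
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else X[rng.integers(n)]
            for j in range(k)
        ])
        # Squared Euclidean distance from every sample to every centroid.
        dist2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        # Move each sample to the cluster with the nearest centroid.
        new_labels = dist2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    # Within-cluster sum of squared distances, used to compare runs.
    inertia = dist2[np.arange(n), labels].sum()
    return labels, centroids, inertia

def kmeans_restarts(X, k, n_restarts=10, seed=0):
    """Repeat K-means from different random partitions and keep the best run."""
    rng = np.random.default_rng(seed)
    runs = [kmeans_random_partition(X, k, rng) for _ in range(n_restarts)]
    return min(runs, key=lambda run: run[2])

# Hypothetical data: two well-separated groups of 20 points each.
data_rng = np.random.default_rng(42)
X = np.vstack([
    data_rng.normal(0.0, 0.3, size=(20, 2)),
    data_rng.normal(5.0, 0.3, size=(20, 2)),
])
labels, centroids, inertia = kmeans_restarts(X, k=2)
print(labels)
print(centroids)
```

    Keeping the run with the smallest sum of squared distances is one common way to choose among restarts; the paper does not state which criterion it uses.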

    1.3 Principal Component Analysis:

    The main basis of PCA-based dimension reduction is that PCA picks up the dimensions with the largest variances. Mathematically, this is equivalent to finding the best low-rank approximation of the data via the singular value decomposition. However, this noise reduction property alone is inadequate to explain the effectiveness of PCA.
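    The equivalence between keeping the largest-variance directions and taking a low-rank approximation can be checked numerically. The sketch below uses synthetic data and hypothetical variable names (none of them come from the paper); it shows that projecting centered data onto the top r principal directions gives the same matrix as the rank-r truncated SVD, and that the approximation error is determined by the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples in 5 dimensions, most variance in 2 latent directions.
latent = rng.normal(size=(200, 2)) * np.array([5.0, 2.0])
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 5))
Xc = X - X.mean(axis=0)                      # center each column

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
r = 2

# Rank-r truncated SVD of the centered data.
X_r = (U[:, :r] * s[:r]) @ Vt[:r]

# PCA view: project onto the top-r principal directions, then map back.
scores = Xc @ Vt[:r].T
X_pca = scores @ Vt[:r]

# Frobenius-norm error of the approximation and the value predicted
# from the discarded singular values.
err = np.linalg.norm(Xc - X_r)
expected_err = np.sqrt((s[r:] ** 2).sum())

# Variance along each principal direction (largest first).
variances = s ** 2 / (Xc.shape[0] - 1)
print("variances:", variances)
print("error:", err, "expected:", expected_err)
```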

    PCA is a basic method of social network mining with applications to ranking and clustering that can be further deployed in marketing and in user segmentation, by selecting communities with desired or undesired properties. In particular, the friends list of a blog can be used for social filtering, that is, reading posts that their friends write or recently read.

    Principal component analysis is similar to the HITS ranking algorithm; in fact, the hub and authority ranking is defined by the first left and right singular vectors, and the use of higher dimensions has already been suggested and analyzed in detail. Several authors use HITS for measuring authority in mailing lists or blogs, the latter result observing a strong correlation between HITS score and degree, indicating that the first principal axis will contain no high-level information but will simply order by number of friends. We demonstrate that HITS-style ranking can be used, but with special care because of the Tightly Knit Community (TKC) effect, which results in communities that are small on a global level grabbing the first principal axes. The TKC problem in the HITS algorithm has been identified before; the algorithmic solution proposed for it, however, turns out to merely compute in- and out-degrees. In contrast, we keep PCA as the underlying matrix method and filter the relevant high-level structural information by removing TK
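    The link between HITS and the first singular vectors can be illustrated on a toy graph. The following is a sketch under assumptions: the 4-node adjacency matrix is hypothetical, the function name `hits` is illustrative, and the iteration is the textbook hub/authority update rather than the modified HITS algorithm of the authors' earlier work.

```python
import numpy as np

# Hypothetical link graph: A[i, j] = 1 when node i links to node j.
A = np.array([
    [0, 1, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
], dtype=float)

def hits(A, n_iter=100):
    """Textbook HITS: alternate authority and hub updates with L2 normalization."""
    hubs = np.ones(A.shape[0])
    for _ in range(n_iter):
        auth = A.T @ hubs            # good authorities are linked from good hubs
        auth /= np.linalg.norm(auth)
        hubs = A @ auth              # good hubs link to good authorities
        hubs /= np.linalg.norm(hubs)
    return hubs, auth

hubs, auth = hits(A)

# First left / right singular vectors of A (sign is arbitrary, so compare magnitudes).
U, s, Vt = np.linalg.svd(A)
print("hubs       :", np.round(hubs, 4), " |u1|:", np.round(np.abs(U[:, 0]), 4))
print("authorities:", np.round(auth, 4), " |v1|:", np.round(np.abs(Vt[0]), 4))
```

    In this toy graph the top authority is the node with the largest in-degree and the top hub is the node with the largest out-degree, which is the degree correlation the paragraph above describes.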

    II. Proposed Method (SAN Method):

    In our method we combine k-means clustering and principal component analysis to obtain effective clustering and an optimized solution. A search of the internet has to return the information that is actually required; otherwise the search is null and void. Every clustering method has its own strategy and importance. We cannot say that a single clustering mechanism is enough for every kind of search, nor can we ensure that every clustering method provides the same result for the same key term. For this reason we combine the two techniques into a new approach for optimizing search over a large database, the internet, and similar sources. Both techniques are related to clustering. K-means clustering groups the source data into groups, called clusters, based on a distance measure. Principal component analysis focuses on dimension reduction based on mathematical models.

    Our domain information relates to bovine disease, so the search is very specific rather than covering all domains. Even with a specific domain, we still have to search throughout the internet when working online, or otherwise in a large database. In our earlier work we used a modified HITS algorithm for searching and, in another study, a stemming algorithm with hierarchical clustering. In this work we combine K-means and principal component analysis and evaluate the results. Our research ends with a comparison between all of these approaches to determine which technique is best suited to our work.

    The bovine disease keyword is given as the search element, and using that keyword we first form the initial clusters. For example, suppose we have n sample feature vectors $bv_1, bv_2, \ldots, bv_n$, all from the same class, and we know that they fall into k compact clusters, k < n. Let $m_i$ be the mean of the vectors in cluster i. To measure distances we use the Euclidean distance formula, a standard and simple way to find the distance between two elements. If the clusters are well separated, we can use a minimum-distance classifier to separate them. That is, we can say that x is in cluster i if $\|x - m_i\|$ is the minimum of all the k distances. This suggests the following procedure for finding the k means:

    • Make initial guesses for the means $m_1, m_2, \ldots, m_k$.

    • Until there are no changes in any mean:

        o Use the estimated means to classify the samples into clusters.

        o For i from 1 to k:

            o Replace $m_i$ with the mean of all of the samples for cluster i.

        o end_for

    • end_until

    In addition, to improve the K-means algorithm while forming the clusters, we include stemming, using both a brute-force approach and suffix stripping. Brute-force stemmers maintain a lookup table which contains relations between root forms and inflected forms. To stem a word, the table is queried to find a matching inflection. If a matching inflection is found, the word is replaced by the associated root form. Suffix-stripping algorithms do not rely on a lookup table. Instead, a typically smaller list of rules is stored which provides a path for the algorithm, given an input word form, to find its root form. Some examples of the rules include: Rule 1) if the word ends in 'ed', remove the 'ed'. Rule 2) if the word ends in 'ing', remove the 'ing'. Rule 3) if the word ends in 'ly', remove the 'ly'. In this way a group of clusters is formed at the final stage, but we cannot say that these are the final optimized result, so we analyze the clusters further: the final clusters are passed on to principal component analysis.
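    The two stemming strategies described above can be sketched as follows. This is an illustrative sketch only: the lookup table entries, the function names and the minimum-length guard (which stops very short words such as 'red' from being stripped) are assumptions that do not come from the paper; only the three suffix rules are taken from the text.

```python
# Hypothetical lookup table relating inflected forms to root forms (brute-force stemmer).
LOOKUP = {
    "ran": "run",
    "calves": "calf",
    "infected": "infect",
    "diseases": "disease",
}

# Suffix rules from the text: remove 'ed', 'ing' or 'ly' at the end of a word.
SUFFIX_RULES = ("ed", "ing", "ly")

def lookup_stem(word):
    """Brute-force stemming: return the root stored for a known inflection."""
    word = word.lower()
    return LOOKUP.get(word, word)

def suffix_strip(word):
    """Suffix stripping: apply the first matching rule.

    The length check is an added assumption so that at least three
    characters remain after a suffix is removed.
    """
    word = word.lower()
    for suffix in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def stem(word):
    """Try the lookup table first, then fall back to suffix stripping."""
    root = lookup_stem(word)
    if root != word.lower():
        return root
    return suffix_strip(word)

for w in ["calves", "infected", "testing", "rarely", "red", "cattle"]:
    print(w, "->", stem(w))
```

    Trying the table before the rules lets irregular forms such as 'calves' be handled, while the rules cover regular inflections that are not in the table.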

    Principal component analysis is a mathematical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of uncorrelated variables called principal components.

    PCA is mathematically defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance (difference) by any projection of the data comes to lie on the first coordinate, the second greatest variance on the second coordinate, and so on.
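    The decorrelating effect of this transformation can be seen on a small synthetic example. The sketch below assumes hypothetical two-dimensional data and computes the transformation from the eigenvectors of the sample covariance matrix, which is one standard way to obtain it; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical observations of two strongly correlated variables.
x = rng.normal(size=500)
X = np.column_stack([x, 0.8 * x + 0.3 * rng.normal(size=500)])
Xc = X - X.mean(axis=0)

# Eigen-decomposition of the sample covariance matrix.
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]            # reorder: greatest variance first
eigvals, W = eigvals[order], eigvecs[:, order]

# Orthogonal transformation to the new coordinate system.
Y = Xc @ W
cov_Y = np.cov(Y, rowvar=False)

print("correlation before:", np.corrcoef(X, rowvar=False)[0, 1])
print("covariance after:\n", np.round(cov_Y, 6))
```

    The covariance matrix of the transformed data is diagonal, with the greatest variance on the first coordinate.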

    Define a data matrix, $X^T$, with zero sample mean, where each of the n rows represents a different repetition of data from the experiment, and each of the m columns gives a particular kind of datum. The singular value decomposition of X is $X = W \Sigma V^T$, where the $m \times m$ matrix W is the matrix of eigenvectors of $X X^T$, the matrix $\Sigma$ is an $m \times n$ rectangular diagonal matrix with nonnegative real numbers on the diagonal, and the matrix V is $n \times n$. The PCA transformation that preserves dimensionality is then given by:

    $Y^T = X^T W = V \Sigma^T$

    V is not uniquely defined in the usual

    case when m

    $X X^T = W \Sigma \Sigma^T W^T$

    Given a set of points in Euclidean space, the first principal component corresponds to a line that passes through the multidimensional mean and minimizes the sum of squares of the distances of the points from the line. The second principal component corresponds to the same concept after all correlation with the first principal component has been subtracted out from the points. The singular values (in $\Sigma$) are the square roots of the eigenvalues of the matrix $X X^T$. Each eigenvalue is proportional to the portion of the variance that is correlated with each eigenvector. The sum of all the eigenvalues is equal to the sum of the squared distances of the points from their multidimensional mean. PCA essentially rotates the set of points around their mean in order to align with the principal components. This moves as much of the variance as possible into the first few dimensions. The values in the remaining dimensions therefore tend to be small and may be dropped with minimal loss of information. Finally, we obtain the reduced cluster as the output of our query.
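    The identities stated in this paragraph can be verified numerically. The sketch below follows the convention used above (X is m × n with zero-mean rows and X = W Σ V^T); the synthetic data, the choice of L = 2 retained dimensions and the variable names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 4, 50                                  # m kinds of datum, n repetitions
# Hypothetical data with very different spread in each of the m rows.
X = rng.normal(size=(m, n)) * np.array([[4.0], [2.0], [0.5], [0.1]])
X = X - X.mean(axis=1, keepdims=True)         # zero mean for every row

# Singular value decomposition X = W diag(sigma) V^T.
W, sigma, Vt = np.linalg.svd(X, full_matrices=False)

# Eigenvalues of X X^T, sorted from largest to smallest.
eigvals = np.linalg.eigvalsh(X @ X.T)[::-1]

# Sum of squared distances of the points from their (zero) mean.
total_sq_dist = (X ** 2).sum()

# Rotate onto the principal components, then keep only the first L dimensions.
Y = W.T @ X
L = 2
Y_reduced = Y[:L]
retained = (sigma[:L] ** 2).sum() / (sigma ** 2).sum()

print("singular values squared:", np.round(sigma ** 2, 4))
print("eigenvalues of X X^T   :", np.round(eigvals, 4))
print("share of variance kept with L = 2:", round(float(retained), 4))
```

    With data of this kind, almost all of the variance sits in the first two dimensions, so the remaining rows of Y can be dropped with little loss.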

    III. Result Analysis:

    The simulation is carried out in MATLAB (Matrix Laboratory) software. As an example, take the query "Symptoms of Bovine leukemia". We first look at the outcome of K-means clustering in Fig. 1.

    Fig. 1

    Fig. 2

    Term labels recovered from Fig. 1 and Fig. 2: Feeding, Gouge Dehorning, Rhinotracheitis, Ataxia, Palpation, Provirus, Lymphocytes, Mononucleosis, B-cell leukemia, Colostrum, Leukaemia, BLV; B-cell leukemia, Lymphocytes, BLV, Ataxia, Colostrum.

    The K-means cluster output is then passed to principal component analysis, which generates the variance matrix and reduces it in further steps, finally yielding the result of the query. The comparison analysis evaluates the performance of the combined approach against K-means and PCA applied individually:

    Sample size    SAN method    K-means    PCA
    2750           0.91          0.87       0.82
    4550           0.81          0.75       0.69
    7700           0.84          0.65       0.62
    10100          0.89          0.73       0.65

    As the result analysis shows, the SAN method performs better than either of the other methods. As the data set grows, the performance ratio of K-means and of principal component analysis decreases.
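    The figures in the comparison table can be summarized with a few lines of arithmetic. The sketch below only re-uses the twelve values reported in the table; the paper does not name the performance measure, so it is treated here simply as a score where higher is better, and the variable names are illustrative.

```python
# Scores copied from the comparison table: sample size -> (SAN, K-means, PCA).
scores = {
    2750: (0.91, 0.87, 0.82),
    4550: (0.81, 0.75, 0.69),
    7700: (0.84, 0.65, 0.62),
    10100: (0.89, 0.73, 0.65),
}
methods = ("SAN", "K-means", "PCA")

# Mean score of each method over the four sample sizes.
means = {
    name: sum(row[i] for row in scores.values()) / len(scores)
    for i, name in enumerate(methods)
}

# Margin of the SAN method over the better of the two single methods.
margins = {size: round(row[0] - max(row[1:]), 2) for size, row in scores.items()}

# Does the SAN method score highest at every sample size?
san_always_best = all(row[0] > max(row[1:]) for row in scores.values())

for name in methods:
    print(f"{name:8s} mean score = {means[name]:.4f}")
print("SAN margin over best single method:", margins)
print("SAN best at every sample size:", san_always_best)
```

    On these figures the SAN method leads at every sample size, and its margin is larger for the two biggest data sets. Both single-method scores are lower at 10100 samples than at 2750, although each rises slightly between 7700 and 10100.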

    IV. Conclusion:

    In this paper we have provided an effective method for information retrieval on bovine disease. The SAN method gives a better solution than principal component analysis or K-means alone. In our earlier work we focused on enhancing Medline and PubMed search using a modified HITS algorithm, and we also experimented with stemming algorithms. Our conclusion is that, among all of these methods, the SAN method gave the most effective solution for searching bovine disease information.

    V. References:

    [1] Lada A. Adamic and Natalie Glance. The political blogosphere and the 2004 U.S. election: divided they blog. In LinkKDD '05: Proceedings of the 3rd international workshop on Link discovery, pages 36-43, New York, NY, USA, 2005. ACM.

    [2] Pedro Domingos and Matt Richardson. Mining the network value of customers. In KDD '01: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pages 57-66, New York, NY, USA, 2001. ACM.

    [3] Lars Backstrom, Dan Huttenlocher, Jon Kleinberg, and Xiangyang Lan. Group formation in large social networks: membership, growth, and evolution. In KDD '06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 44-54, New York, NY, USA, 2006. ACM Press.

    [4] D. Cheng, R. Kannan, S. Vempala, and G. Wang. On a recursive spectral algorithm for clustering from pairwise similarities. Technical Report MIT-LCS-TR-906, MIT LCS, 2003.

    [5] Matthew Hurst, Matthew Siegler, and Natalie Glance. On estimating the geographic distribution of social media. In Proceedings Int. Conf. on Weblogs and Social Media (ICWSM-2007), 2007.

    [6] M. Newman. Detecting community structure in networks. The European Physical Journal B - Condensed Matter, 38(2):321-330, March 2004.

    [7] Jun Zhang, Mark S. Ackerman, and Lada Adamic. Expertise networks in online communities: structure and algorithms. In WWW '07: Proceedings of the 16th international conference on World Wide Web, pages 221-230, New York, NY, USA, 2007. ACM Press.
