CHAPTER 5
EFFECTIVE TEXT MINING USING INTELLIGENCE
LEARNING BASED MCMM (ETM-ILM)
5.1 DESIGN OF PROPOSED ETM-ILM
The proposed system design of ETM-ILM is shown in Figures
5.1 and 5.2. Figure 5.1 shows the flow diagram of proposed ETM-ILM in
the training session. Figure 5.2 shows the flow diagram of proposed
ETM-ILM in the testing session.
The functionality of each component in the proposed system design
is described below.
The MCMM is used for effective text classification. The proposed
MCMM operates through two algorithms: a training algorithm and a
testing algorithm.
In the conceptual analysis stage, the terms that appear in
each STL are searched for in the given training documents. The meta data
analysis (mda) value of the documents is given by equation 5.1,
\[ mda = \frac{\text{number of matching terms in STL}}{\text{total number of terms in documents}} \tag{5.1} \]
In the classification stage, the STL field with the highest
'mda' value is identified, and the document is clustered under the name
of that STL.
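As a minimal sketch of equation 5.1, the following Python function computes 'mda' for one document against one STL. The tokenisation rule, the function name `mda`, the sample STL and the choice to count every matching occurrence are assumptions made for illustration; the chapter does not specify how terms are split or normalised.

```python
import re


def mda(document, stl):
    """Return (matching terms in the STL) / (total terms in the document)."""
    # Assumption: a term is a run of letters, compared in lowercase.
    terms = re.findall(r"[a-z]+", document.lower())
    if not terms:
        return 0.0
    # Assumption: every occurrence of an STL term counts as a match.
    matching = sum(1 for term in terms if term in stl)
    return matching / len(terms)


# Hypothetical STL for the computer network field.
network_stl = {"network", "computer", "protocol"}
score = mda("The network is a collection of computers", network_stl)
```

In this example only "network" matches (no stemming is applied, so "computers" does not match "computer"), giving one match out of seven terms.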
Figure 5.1 Flow diagram of proposed ETM-ILM in the training
session
[Figure not reproduced; block labels: MCMM Training Data (60%), STL, ABI]
Figure 5.2 Flow diagram of proposed ETM-ILM in the testing
session
[Figure not reproduced; block labels: MCMM Testing Data (40%), STL, Output]
This process continues for each training document, and each
additional relevant term identified by the training algorithm is added to
the concerned STL.
In the conceptual analysis stage, the 'mda' value of each term
that appears in every STL is calculated from the given document.
In the classification stage, the STL field with the highest 'mda' value
is identified, and the document is clustered under the name of that STL.
The performance of algorithms and techniques used in the
computational domain can be improved by means of a proper learning
method. Hence, in order to improve the performance of the proposed
MCMM, a learning method is proposed.
The proposed learning model involves the learning of
conceptual terms from the MCMM. The terms learned by the proposed
learning algorithm are grouped and added to the STL. Frequent
updating of the conceptual terms in the STL is important for effective
clustering. To learn such terminology, the proposed work
applies an ANN based learning algorithm.
The proposed ANN based unsupervised learning method is termed
Analysis of Bilateral Intelligence (ABI). The ABI applies the learning
process to identify two equivalent terms which have the same meaning.
The ABI takes text documents as datasets; the required output is improved
accuracy of text clustering, and the goal is to achieve error-free
clustering in a shorter time.
The working model of the proposed ABI Learning method is
explained in the following:
The following sigmoidal function is applied in the proposed
ABI,
\[ X_A = \frac{1}{1 + e^{-x}} \tag{5.2} \]
where 'X_A' is the output in the hidden and output layers, and 'x'
denotes the inputs, which are connected from the input layer to the
hidden layer. The connections between the input layer and the hidden
layer have weights 'r_ai'. The values computed between the hidden layer
and the output layer are referred to as 's_ba'. Here there are 'b'
neurons in the output layer, 'a' neurons in the hidden layer and 'i'
neurons in the input layer.
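The sigmoidal function of equation 5.2 and its use in the hidden layer can be sketched as follows. The sketch assumes that the argument 'x' of the sigmoid is the weighted sum of the inputs to a hidden neuron; the function names and the weight values are hypothetical.

```python
import math


def sigmoid(x):
    """Equation 5.2: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def hidden_outputs(inputs, r):
    """Output of each hidden neuron a, where r[a][i] is the weight from input i."""
    # Assumption: each hidden neuron applies the sigmoid to its weighted input sum.
    return [
        sigmoid(sum(weight * x for weight, x in zip(row, inputs)))
        for row in r
    ]


# Hypothetical weights: 2 hidden neurons, 3 inputs.
r = [[0.5, -0.5, 0.0],
     [1.0, 1.0, 0.0]]
z = hidden_outputs([1.0, 1.0, 1.0], r)
```

The first hidden neuron receives a weighted sum of 0 and therefore outputs 0.5; the second receives 2 and outputs a value close to 0.88.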
\[ S_{optimum} = A^{-1} \times B \tag{5.3} \]
where
\[ A = \sum_{p=1}^{P} Z_a^p Z_i^p, \qquad a, i = 1, \ldots, P \tag{5.4} \]
\[ B = \sum_{p=1}^{P} Z_a^p t_b^p, \qquad a, b = 1, \ldots, P \tag{5.5} \]
where 'Z^p' is the scalar output of the hidden neuron for training
data 'p', 'A' and 'B' are the outputs of the hidden layer and output layer
respectively, 'a' and 'b' index neurons in the hidden layer and output
layer, 'i' indexes a neuron in the input layer, and 't' is the
transaction function.
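Equations 5.3 to 5.5 can be illustrated with a small pure-Python sketch. It rests on two assumptions that go beyond the text: the second index in equation 5.4 is read as running over hidden neurons, so that 'A' is square and invertible, and 't' is read as the desired value for output neuron 'b' on training pattern 'p'. The function names and the data are hypothetical, and the linear system is solved directly rather than by forming the inverse explicitly.

```python
def solve(a, b):
    """Solve A S = B for S by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    # Augmented matrix [A | B]; row copies leave the inputs unchanged.
    aug = [row_a[:] + row_b[:] for row_a, row_b in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(aug[row][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("A is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for row in range(n):
            if row != col:
                factor = aug[row][col]
                aug[row] = [v - factor * p for v, p in zip(aug[row], aug[col])]
    return [row[n:] for row in aug]


def optimum_output_weights(z, t):
    """z[p][a]: hidden outputs, t[p][b]: assumed targets. Returns S[a][b]."""
    hidden, outputs = len(z[0]), len(t[0])
    # Equation 5.4 (assumed reading): A[a][i] = sum over p of Z_a^p * Z_i^p.
    a = [[sum(zp[i] * zp[j] for zp in z) for j in range(hidden)]
         for i in range(hidden)]
    # Equation 5.5: B[a][b] = sum over p of Z_a^p * t_b^p.
    b = [[sum(zp[i] * tp[k] for zp, tp in zip(z, t)) for k in range(outputs)]
         for i in range(hidden)]
    # Equation 5.3: S_optimum = A^-1 x B.
    return solve(a, b)


# Hypothetical data: 3 training patterns, 2 hidden neurons, 1 output neuron.
z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
t = [[2.0], [3.0], [5.0]]
s = optimum_output_weights(z, t)
```

For this data A = [[2, 1], [1, 2]] and B = [[7], [8]], so the solution is S = [[2], [3]].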
The state vector, or simply the state, denoted by 'x_k', is defined as
the minimal set of data that is sufficient to uniquely describe the
unforced dynamical behaviour of the system; the subscript 'k' denotes
discrete time. In other words, the state is the least amount of data on the
past behaviour of the system that is needed to predict its future
behaviour. Typically, the state 'x_k' is unknown. To estimate it, a set
of observed data, denoted by the vector 'y_k', is used.
The RMS error (E_RMS) is then calculated by comparing the 'R_test'
matrix with the 'S_optimum' matrices.
a. E_RMS < E (5.6)
The hidden layer weight matrix 'R' is updated as 'R' = 'R_test', and
the influence of the penalty term is decreased by decreasing 'µ'.
b. E_RMS ≥ E (5.7)
The influence of 'µ' is increased.
If the RMS error is not within the desired range, the process is
repeated; otherwise the training process is ceased. After the successful
completion of the training algorithm, sample real-time data are given as
input to the system. The system will choose the comparatively best path.
This thesis used 60% of the dataset for training and 40% for testing.
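The two cases (5.6) and (5.7) can be sketched as a single accept-or-reject step. The amount by which 'µ' is changed is not given in the text, so the multiplicative `factor` below is a hypothetical choice, as are the function names; `rms_error` simply assumes that the two matrices being compared have the same shape.

```python
import math


def rms_error(m1, m2):
    """Root-mean-square difference between two equally sized matrices."""
    diffs = [x - y for row1, row2 in zip(m1, m2) for x, y in zip(row1, row2)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


def accept_or_reject(r, r_test, e_rms, e, mu, factor=10.0):
    """Apply cases (5.6) and (5.7) and return the new (R, mu)."""
    if e_rms < e:
        # Case a: accept the trial weights and weaken the penalty term.
        return r_test, mu / factor
    # Case b: keep the current weights and strengthen the penalty term.
    return r, mu * factor


# Hypothetical values for one training iteration.
error = rms_error([[1.0, 2.0]], [[1.0, 4.0]])
r_new, mu_new = accept_or_reject([[0.1]], [[0.2]], e_rms=0.05, e=0.1, mu=1.0)
```

Here the trial error 0.05 is below the threshold 0.1, so the trial weights are accepted and 'µ' is reduced.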
5.2 PROPOSED ETM-ILM ALGORITHM
The proposed ETM-ILM algorithm also executes as two
algorithms: a training algorithm and a testing algorithm. The
detailed ETM-ILM training algorithm is described below.
5.2.1 Training Algorithm
Step 1 : Apply preprocessing
Step 2 : Prepare STL for each field of study
Step 3 : Check that the metadata stored in each STL is unique
and primary data
Step 4 : If all training documents have been read, go to
step 9; otherwise continue with step 5.
Step 5 : Count the number of terms in the given document
that match the STL, 'm', and the total number of
terms in the given document, 'n'.
Step 6 : Compute 'mda', where 'mda' = m / n
Step 7 : Sort the 'mda' values in decreasing order and check
whether the terms with higher 'mda' are in the STL;
if available, go to step 8, otherwise go to step 9.
Step 8 : Add these new terms to the concerned STL and go to
step 3
Step 9 : Apply the ABI learning algorithm
Step 10 : Compare the output with a minimum threshold. If the
output is above the threshold, go to step 11; otherwise
go to step 12.
Step 11 : Update STL
Step 12 : Go to the testing process
5.2.2 Testing Algorithm
Step 1 : Apply preprocessing
Step 2 : Collect STL for each field of study from training
algorithm
Step 3 : Check that the metadata stored in each STL is unique
and primary data
Step 4 : Apply ABI and verify the interestingness of each
keyword present in the STL. If the confidence of
interestingness is not at an acceptable level, the
keyword may be removed from the concerned STL.
Step 5 : Apply the input test document
Step 6 : Read each term in every STL and count the number
of terms in the given test document that match the
STL, 'm', and the total number of terms in the
given test document, 'n'.
Step 7 : Calculate 'mda', where 'mda' = m / n
Step 8 : Sort the 'mda' values in decreasing order
Step 9 : Check the terms which have higher 'mda'
Step 10 : Check whether the highest 'mda' term is available
in the given STL; if available, go to step 11,
otherwise go to step 12.
Step 11 : Classify the given test document as the field of
the matching STL
Step 12 : Identify the next higher 'mda' term until the end
of the test document and go to step 9
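The core of the testing algorithm (steps 5 to 11) can be sketched in Python as follows. This is a simplified illustration, not the thesis implementation: it computes one 'mda' value per STL and assigns the document to the field with the highest value, leaving out the ABI check of step 4 and the term-by-term iteration of step 12. The tokenisation rule, the function name `classify` and the sample STLs are assumptions.

```python
import re


def classify(document, stls):
    """Return the field whose STL gives the highest mda, plus all mda values."""
    # Assumption: a term is a run of letters, compared in lowercase.
    terms = re.findall(r"[a-z]+", document.lower())
    n = len(terms)
    scores = {}
    for field, stl in stls.items():
        m = sum(1 for term in terms if term in stl)  # matching terms (step 6)
        scores[field] = m / n if n else 0.0          # mda = m / n (step 7)
    best = max(scores, key=scores.get)               # highest mda (steps 8 to 11)
    return best, scores


# Hypothetical STLs for two fields of study.
stls = {
    "Computer Network": {"network", "protocol", "internet"},
    "Electrical": {"voltage", "current", "circuit"},
}
field, scores = classify("The protocol used on the Internet network", stls)
```

Three of the seven terms match the computer network STL and none match the electrical STL, so the document is assigned to "Computer Network".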
5.3 WORKING MODEL OF ETM-ILM
A sample input file in the computer network domain is
shown in Table 5.1. Table 5.2 is prepared from the frequent item set
obtained using ETM-ILM.
Table 5.3 shows the STL for the computer network field of
study. The non-technical terms that appear in Table 5.2 are removed by
the training algorithm of the proposed work.
Table 5.1 Sample Input File
Computer Network
Computer Communication is a major field of study in circuit branches.
Communications between Computers are carried out through networks.
The network is a collection of computers. Local Area Network is an
integrated network within campus, which is also termed Intranet. Like
Intranet, a collection of network is called Internet. The Internet is a
world-wide computer integration based on common protocols. The
protocol defines the exact applications and implementations of each task
used for computer communication.
Table 5.2 Sample Frequent Term List
List of terms            Frequency
Network                  5
Computer                 5
Communication            3
Intranet                 2
Internet                 2
Integrated/integration   2
Collection               2
Protocol                 2
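A frequent term list of the kind shown in Table 5.2 can be produced with a simple frequency count. The sketch below is only illustrative: the stop-word list, the minimum frequency of 2 and the function name are assumptions, and no stemming is applied, so it would not merge forms such as "Integrated/integration" the way the table does.

```python
import re
from collections import Counter

# Assumption: a small hand-picked stop-word list; the source does not give one.
STOP_WORDS = {"the", "is", "a", "of", "an", "are", "in", "and", "which", "on"}


def frequent_terms(text, min_count=2):
    """Return (term, frequency) pairs with frequency >= min_count, most frequent first."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(word for word in words if word not in STOP_WORDS)
    return [(term, n) for term, n in counts.most_common() if n >= min_count]


sample = "The network is a collection of computers. The protocol defines the network."
terms = frequent_terms(sample)
```

In this short sample only "network" occurs at least twice after stop-word removal.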
Table 5.3 Sample STL File (After training algorithm)
List of terms: Computer Network
Network
Computer
Communication
Intranet
Internet
Protocol
5.4 RESULT AND ANALYSIS
The proposed ETM-ILM is implemented in MATLAB. The
performance of the proposed work is analysed and compared with the
existing TBM, CBMM and PTM methods. The results of the
implementation in terms of F-Measure and Entropy are recorded and
shown in Tables 5.4 and 5.5 respectively.
Table 5.4 compares the F-Measure of the existing methods and
the proposed ETM-ILM, and Table 5.5 compares their Entropy. The
improvement of F-Measure in ETM-ILM over MCMM is shown in
Table 5.6, and the improvement of Entropy in ETM-ILM over MCMM is
shown in Table 5.7.
Table 5.4 Comparison of F-Measure of existing vs. proposed
methods
Field of Study   Data Set   TBM     CBMM    PTM     Proposed ETM-ILM
Electrical       IEEE       0.697   0.741   0.780   0.833
Electrical       ACM        0.767   0.812   0.834   0.890
Electrical       Scopus     0.724   0.807   0.828   0.887
Electronics      IEEE       0.688   0.731   0.753   0.825
Electronics      ACM        0.757   0.801   0.836   0.876
Electronics      Scopus     0.715   0.797   0.821   0.861
Civil            IEEE       0.756   0.804   0.847   0.903
Civil            ACM        0.832   0.881   0.906   0.962
Civil            Scopus     0.785   0.875   0.899   0.944
Computer         IEEE       0.746   0.793   0.835   0.892
Computer         ACM        0.821   0.869   0.898   0.950
Computer         Scopus     0.775   0.864   0.886   0.930
Mechanical       IEEE       0.736   0.783   0.807   0.880
Mechanical       ACM        0.810   0.858   0.892   0.936
Mechanical       Scopus     0.765   0.853   0.871   0.918
Table 5.5 Comparison of Entropy of existing vs. proposed
methods
Field of Study   Data Set   TBM     CBMM    PTM     Proposed ETM-ILM
Electrical       IEEE       0.329   0.214   0.191   0.125
Electrical       ACM        0.317   0.178   0.160   0.116
Electrical       Scopus     0.412   0.380   0.358   0.260
Electronics      IEEE       0.325   0.211   0.202   0.123
Electronics      ACM        0.313   0.176   0.159   0.114
Electronics      Scopus     0.407   0.375   0.355   0.256
Civil            IEEE       0.357   0.232   0.213   0.136
Civil            ACM        0.344   0.193   0.174   0.125
Civil            Scopus     0.447   0.412   0.393   0.282
Computer         IEEE       0.352   0.229   0.213   0.134
Computer         ACM        0.339   0.191   0.165   0.123
Computer         Scopus     0.441   0.407   0.370   0.278
Mechanical       IEEE       0.348   0.226   0.185   0.132
Mechanical       ACM        0.335   0.188   0.167   0.122
Mechanical       Scopus     0.435   0.401   0.366   0.275
Table 5.6 Improvement of F-Measure in ETM-ILM over MCMM
Field of Study   Data Set   Proposed MCMM   Proposed ETM-ILM
Electrical       IEEE       0.823           0.833
Electrical       ACM        0.876           0.890
Electrical       Scopus     0.859           0.887
Electronics      IEEE       0.812           0.825
Electronics      ACM        0.865           0.876
Electronics      Scopus     0.848           0.861
Civil            IEEE       0.892           0.903
Civil            ACM        0.950           0.962
Civil            Scopus     0.932           0.944
Computer         IEEE       0.881           0.892
Computer         ACM        0.938           0.950
Computer         Scopus     0.919           0.930
Mechanical       IEEE       0.869           0.880
Mechanical       ACM        0.925           0.936
Mechanical       Scopus     0.907           0.918
Table 5.7 Improvement of Entropy in ETM-ILM over MCMM
Field of Study   Data Set   Proposed MCMM   Proposed ETM-ILM
Electrical       IEEE       0.143           0.125
Electrical       ACM        0.132           0.116
Electrical       Scopus     0.297           0.260
Electronics      IEEE       0.141           0.123
Electronics      ACM        0.130           0.114
Electronics      Scopus     0.293           0.256
Civil            IEEE       0.155           0.136
Civil            ACM        0.143           0.125
Civil            Scopus     0.322           0.282
Computer         IEEE       0.153           0.134
Computer         ACM        0.141           0.123
Computer         Scopus     0.318           0.278
Mechanical       IEEE       0.151           0.132
Mechanical       ACM        0.139           0.122
Mechanical       Scopus     0.314           0.275
Data sets from three different web sources, namely IEEE, ACM and
Scopus, in five different fields, namely electrical, electronics, civil,
computer and mechanical, are taken for analysis and subjected to four
different methods, namely TBM, CBMM, PTM and the proposed ETM-ILM.
The results are indicated in Tables 5.4 to 5.7 and Figures 5.3 to 5.22.
Figure 5.3 Comparison of F-Measure on Electrical data
Figure 5.4 Comparison of F-Measure on Electronics data
Figure 5.5 Comparison of F-Measure on Civil data
Figure 5.6 Comparison of F-Measure on Computer data
Figure 5.7 Comparison of F-Measure on Mechanical data
[Figures 5.3 to 5.7 not reproduced; x-axis: Data Set (IEEE, ACM, Scopus); y-axis: F-Measure; series: TBM, CBMM, PTM, Proposed ETM-ILM]
Figure 5.8 Comparison of Entropy on Electrical data
Figure 5.9 Comparison of Entropy on Electronics data
Figure 5.10 Comparison of Entropy on Civil data
Figure 5.11 Comparison of Entropy on Computer data
Figure 5.12 Comparison of Entropy on Mechanical data
[Figures 5.8 to 5.12 not reproduced; x-axis: Data Set (IEEE, ACM, Scopus); y-axis: Entropy; series: TBM, CBMM, PTM, Proposed ETM-ILM]
Figure 5.13 Improvement of F-Measure in ETM-ILM over MCMM
on Electrical data
Figure 5.14 Improvement of F-Measure in ETM-ILM over MCMM
on Electronics data
Figure 5.15 Improvement of F-Measure in ETM-ILM over MCMM
on Civil data
Figure 5.16 Improvement of F-Measure in ETM-ILM over MCMM
on Computer data
Figure 5.17 Improvement of F-Measure in ETM-ILM over MCMM
on Mechanical data
[Figures 5.13 to 5.17 not reproduced; x-axis: Data Set (IEEE, ACM, Scopus); y-axis: F-Measure; series: Proposed MCMM, Proposed ETM-ILM]
Figure 5.18 Improvement of Entropy in ETM-ILM over MCMM on
Electrical data
Figure 5.19 Improvement of Entropy in ETM-ILM over MCMM on
Electronics data
Figure 5.20 Improvement of Entropy in ETM-ILM over MCMM on
Civil data
Figure 5.21 Improvement of Entropy in ETM-ILM over MCMM on
Computer data
Figure 5.22 Improvement of Entropy in ETM-ILM over MCMM on
Mechanical data
[Figures 5.18 to 5.22 not reproduced; x-axis: Data Set (IEEE, ACM, Scopus); y-axis: Entropy; series: Proposed MCMM, Proposed ETM-ILM]
The F-Measure improvement of the proposed ETM-ILM method over
the other existing methods is seen from Table 5.4 in all the fields of
study. From Figures 5.3 and 5.4 it is evident that the proposed ETM-ILM
performs better than the other methods on the electrical and electronics
web-based data used in the present study. The same trend is obtained
for the civil and computer data, as indicated by Figures 5.5 and 5.6,
and for the mechanical data, as seen from Figure 5.7.
From Figures 5.3 to 5.7 it is seen that, for the electrical,
electronics, civil, computer and mechanical data from all the web
sources, the F-Measure is highest for the proposed ETM-ILM compared
to the other three existing methods.
The F-Measure of the proposed ETM-ILM improves on that of
TBM by a minimum of 16% and a maximum of 23%, on that of CBMM
by a minimum of 8% and a maximum of 13%, and on that of PTM by a
minimum of 5% and a maximum of 10%.
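The quoted ranges can be checked against Table 5.4 with a short Python sketch. It assumes that each percentage is the relative gain (proposed minus existing, divided by existing), rounded to a whole percent; the text does not state the formula, and the variable and function names are illustrative.

```python
# Rows of Table 5.4 in order: (TBM, CBMM, PTM, Proposed ETM-ILM).
TABLE_5_4 = [
    (0.697, 0.741, 0.780, 0.833), (0.767, 0.812, 0.834, 0.890), (0.724, 0.807, 0.828, 0.887),
    (0.688, 0.731, 0.753, 0.825), (0.757, 0.801, 0.836, 0.876), (0.715, 0.797, 0.821, 0.861),
    (0.756, 0.804, 0.847, 0.903), (0.832, 0.881, 0.906, 0.962), (0.785, 0.875, 0.899, 0.944),
    (0.746, 0.793, 0.835, 0.892), (0.821, 0.869, 0.898, 0.950), (0.775, 0.864, 0.886, 0.930),
    (0.736, 0.783, 0.807, 0.880), (0.810, 0.858, 0.892, 0.936), (0.765, 0.853, 0.871, 0.918),
]
BASELINES = ("TBM", "CBMM", "PTM")


def improvement_range(rows):
    """Min and max relative F-Measure gain of ETM-ILM per baseline, in whole percent."""
    result = {}
    for idx, name in enumerate(BASELINES):
        # Assumed formula: (proposed - existing) / existing * 100.
        gains = [(row[3] - row[idx]) / row[idx] * 100 for row in rows]
        result[name] = (round(min(gains)), round(max(gains)))
    return result


ranges = improvement_range(TABLE_5_4)
```

Under this reading the table yields 16% to 23% over TBM, 8% to 13% over CBMM and 5% to 10% over PTM, which agrees with the figures stated above.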
The Entropy improvement of the proposed ETM-ILM method over
the other existing methods is seen from Table 5.5 in all the fields of
study. From Figures 5.8 and 5.9 it is evident that the proposed ETM-ILM
performs better than the other methods on the electrical and electronics
web-based data used in the present study. The same trend is obtained
for the civil and computer data, as indicated by Figures 5.10 and 5.11,
and for the mechanical data, as seen from Figure 5.12.
From Figures 5.8 to 5.12 it is seen that the entropy for the
electrical, electronics, civil, computer and mechanical data is minimum
for the proposed ETM-ILM compared to the other three existing
methods.
The entropy of the proposed ETM-ILM is lower than that of TBM
by a minimum of 37% and a maximum of 64%, lower than that of CBMM
by a minimum of 31% and a maximum of 42%, and lower than that of
PTM by a minimum of 25% and a maximum of 39%.
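As a worked example, and assuming that each percentage is the relative reduction in entropy (the formula is not stated in the text), the extremes against TBM follow from the Mechanical/Scopus and Computer/ACM rows of Table 5.5:

```latex
\[
\text{reduction} = \frac{E_{\text{existing}} - E_{\text{ETM-ILM}}}{E_{\text{existing}}}
\]
\[
\text{Mechanical, Scopus: } \frac{0.435 - 0.275}{0.435} \approx 0.368 \;(37\%),
\qquad
\text{Computer, ACM: } \frac{0.339 - 0.123}{0.339} \approx 0.637 \;(64\%)
\]
```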
The F-Measure improvement of the proposed ETM-ILM over the
proposed MCMM method is seen from Table 5.6 in all the fields of study.
From Figures 5.13 and 5.14 it is evident that the proposed ETM-ILM
outperforms the proposed MCMM method on the electrical and electronics
web-based data used in the present study. The same trend is obtained
for the civil and computer data, as indicated by Figures 5.15 and 5.16,
and for the mechanical data, as seen from Figure 5.17.
From Figures 5.13 to 5.17 it is seen that, for the electrical,
electronics, civil, computer and mechanical data from all the web
sources, the F-Measure of the proposed ETM-ILM is higher than that of
the proposed MCMM method.
The Entropy improvement of the proposed ETM-ILM over the
proposed MCMM method is seen from Table 5.7 in all the fields of study.
From Figures 5.18 and 5.19 it is evident that the proposed ETM-ILM
outperforms the proposed MCMM method on the electrical and electronics
web-based data used in the present study. The same trend is obtained
for the civil and computer data, as indicated by Figures 5.20 and 5.21,
and for the mechanical data, as seen from Figure 5.22.
From Figures 5.18 to 5.22 it is seen that, for the electrical,
electronics, civil, computer and mechanical data from all the web
sources, the entropy of the proposed ETM-ILM is lower than that of
the proposed MCMM method.
The improvement of F-Measure in the proposed ETM-ILM over
the proposed MCMM ranges from a minimum of 1% to a maximum of 3%.
The improvement of entropy in the proposed ETM-ILM over the proposed
MCMM ranges from a minimum of 12% to a maximum of 13%.
From the results and performance analysis, it is concluded that
the proposed MCMM with the ABI learning algorithm (ETM-ILM) gives
better results than existing and notable recent works in the document
clustering domain.