CHAPTER 6
SOLUTION TO NETWORK TRAFFIC PROBLEM IN MIGRATING
PARALLEL CRAWLERS USING FUZZY LOGIC
6.1 Introduction
The properties of the Internet that make web crawling challenging are its vast amount
of data, its dynamically generated pages and its rapid rate of change. A web crawler must
be scalable, robust and make efficient use of available bandwidth, even though most
crawlers are built around standard components. Politeness is an important issue that needs
to be addressed when designing a web crawler. Crawlers should not overload a web server by
requesting a large number of web pages in a short interval of time; they should follow
the restrictions outlined by web site administrators and identify themselves when
requesting pages. Crawlers therefore observe a waiting time between two successive
requests to the same web server. This waiting time is called the request interval and is
typically 30 seconds between two downloads. To enforce it, a shuffling mechanism is
implemented inside the queue: the queue is scrambled into a random order so that URLs
from the same web server are spread out evenly throughout it. Other crawlers, such as
Mercator, implement the URL queue as a collection of sub-queues, one per domain.
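The per-domain sub-queue scheme can be sketched as follows. This is a minimal Python illustration, not Mercator's actual implementation; the class name, method names and the 30-second default interval are assumptions for the sketch:

```python
import heapq
import time
from collections import deque
from urllib.parse import urlparse

class PoliteFrontier:
    """Mercator-style URL frontier sketch: one FIFO sub-queue per host, plus a
    min-heap of (earliest_allowed_time, host) entries enforcing the request
    interval between successive downloads from the same host."""

    def __init__(self, interval=30.0):
        self.interval = interval
        self.queues = {}   # host -> deque of URLs waiting for that host
        self.ready = []    # min-heap of (next_allowed_time, host)

    def add(self, url):
        host = urlparse(url).netloc
        if host not in self.queues:
            self.queues[host] = deque()
            heapq.heappush(self.ready, (0.0, host))  # a new host is ready at once
        self.queues[host].append(url)

    def next_url(self, now=None):
        """Return the next URL whose host's waiting time has elapsed, else None."""
        now = time.monotonic() if now is None else now
        while self.ready:
            t, host = self.ready[0]
            if t > now:
                return None            # every queued host is still inside its interval
            heapq.heappop(self.ready)
            q = self.queues.get(host)
            if q:
                url = q.popleft()
                heapq.heappush(self.ready, (now + self.interval, host))
                return url
            del self.queues[host]      # drained host; re-registered on the next add()
        return None
```

The alternative shuffling approach scatters URLs from one host randomly through a single queue; per-host sub-queues make the interval guarantee explicit rather than probabilistic.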
6.2 Quality and Network Metrics
There is always scope to improve the quality of the data collected during a crawl. The
ordering of the URL queue determines the type of search performed on the web graph. The
queue can be ordered by taking into account the in-link factors of pages, and breadth-first
search can improve the quality of downloaded pages. There also exist a large number of
infinitely branching crawler traps and spam sites on the Internet whose pages are
dynamically generated and designed to have a very high in-link factor. In this section
the network metrics of geographic distance and latency are discussed.
6.2.1 Geographic Distance
There exist services on the Internet that provide a mapping between IP addresses and
geographic information. These services parse registration data to derive latitude and
longitude from registrar address data; if two hosts share a common latitude and longitude,
they are generally managed by the same ISP. Once the latitude and longitude have
been obtained for a pair of Internet hosts, their geographical distance can be calculated
using spherical coordinates on the Earth.
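Given two (latitude, longitude) pairs, the great-circle distance follows from spherical coordinates. A minimal Python sketch using the standard haversine formula (the coordinates used below for illustration are approximate city locations, not from the chapter):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (latitude, longitude)
    points given in degrees, via the haversine formula on a spherical Earth."""
    r = 6371.0  # mean Earth radius, km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2.0 * r * math.asin(math.sqrt(a))
```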
6.2.2 Latency
There are various ways of determining the round-trip time (RTT) between two Internet hosts.
The first method uses the Unix ping utility and the second uses the traceroute utility.
Ping uses ICMP ECHO requests; however, the ICMP replies are sometimes blocked or
manipulated by ISPs. Traceroute sends out TTL-restricted UDP packets, which may be
blocked by some routers.
6.2.3 Correlation between Metrics
There is a strong correlation between latency and geographic distance. The observation
is that at lower values of linearized distance, the correlation between distance and RTT
is stronger. Linearized distance along a path implies a minimum end-to-end RTT, and
linearized distance and RTT are more strongly correlated than end-to-end distance
and RTT.
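The strength of a distance–RTT relationship can be quantified with the Pearson correlation coefficient. A minimal sketch; the (distance, RTT) samples below are purely hypothetical, chosen only to illustrate a strongly correlated data set:

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# hypothetical (linearized distance in km, RTT in ms) measurement pairs
samples = [(100, 12), (500, 35), (1200, 70), (3000, 150), (8000, 310)]
r = pearson([d for d, _ in samples], [t for _, t in samples])
```

A value of r near 1 indicates the strong positive correlation described above.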
6.3 Case Study of Crawler Load
Figure 6.1 illustrates the client throughput in the traditional and active networks. The
vertical axis denotes the client throughput (number of bits received by clients per
simulation time unit) and the horizontal axis denotes the arrival rate of client requests
[159]. The client throughput for 0% overhead active indexing is proportional to that for
the 0% crawler case, which establishes the comparability of the remaining cases. As the
systems become saturated the throughput drops rapidly: both simulations reach a similar
peak throughput of about 222 bits/tick, after which the throughput settles at about
140 bits/tick [159].
Figure 6.1: Client throughput in all cases [159]
Figure 6.2 shows the traditional-network crawler throughput. The vertical axis denotes
the number of bits per simulation time unit received by crawlers, and the horizontal axis
denotes the total request arrival rate. The requests originate from both human clients
and crawlers [159].
Figure 6.2: Crawler throughput
Figure 6.3 illustrates the average client request delay for active indexing. The vertical
axis denotes the average client response delay, while the horizontal axis denotes the rate
at which requests are generated by human clients. The average client delay in the
traditional network with 20% or 40% crawler traffic is higher than in the active networks [159].
Figure 6.3: Average Client request delay in all cases [159]
Figure 6.4: Total Request Arrival Time vs. Average Crawler Request Delay [159]
Figure 6.4 plots the average crawler request delay against the total request arrival
rate. The two curves are similar, which implies that increasing the crawler load does
not affect the delay seen by the crawler sites [159].
Figure 6.5: Completed Client Request Rates in all cases [159]
Figure 6.5 illustrates the fraction of client requests completed in all cases. When the
request arrival rate is low, all requests are satisfied; the 20% and 40% crawler cases,
however, show a significant drop in the rate at which client requests are
completed [159].
6.4 Fuzzy Inference Systems and Fuzzy Logic
A fuzzy inference system (FIS) uses a fuzzy inference engine to derive answers from a
knowledge base. The inference engine is the brain of an expert system: it provides the
methodologies for reasoning with the information in the knowledge base and for
formalizing results. Fuzzy logic is the extension of Boolean algebra that deals with
partial truth; it denotes the degree to which a proposition is true. In Boolean algebra
everything is expressed in terms of binary values, zero and one; fuzzy logic replaces
these with levels of truth. Levels of truth capture imprecise modes of reasoning, which
play an important role in the human ability to make decisions in an atmosphere of
imprecision and uncertainty. In fuzzy sets the membership function plays the role of the
indicator function of classical set theory. A membership function is a curve that maps
each point of the input space to a value between 0 and 1; typical shapes are triangular,
trapezoidal and bell curves. The input space is called the universe of discourse.
A fuzzy inference system is conceptually simple and easy to implement. It consists of
three stages: an input stage, a processing stage and an output stage. The input stage
maps the inputs into membership functions. The processing stage invokes the appropriate
rules, generates a result for each rule and combines these results. The output stage
then converts the combined result into the output.
The processing stage is referred to as the inference engine. It is based on a set of
logic rules in the form of IF–THEN statements, where the IF part is the "antecedent"
and the THEN part is the "consequent". A fuzzy inference system stores any number of
such rules in a knowledge base and proceeds through the following steps:
• Fuzzification of the input values
• Application of fuzzy operators
• Application of implication methods
• Aggregation of the outputs
• Defuzzification of the result
Fuzzification of the inputs is the process of determining, via the membership functions,
the degree to which each input belongs to its fuzzy sets. Defuzzification takes a fuzzy
set as input and produces a crisp value as output. There are two commonly used inference
methods in fuzzy systems: Mamdani's fuzzy inference method, proposed by Ebrahim Mamdani
in 1975, and the Takagi–Sugeno–Kang method, proposed in 1985. The two methods are
similar in many ways, such as the process of fuzzifying the inputs and applying fuzzy
operators. The output membership functions in Sugeno's method are either linear or
constant, while in Mamdani's inference they are fuzzy sets. Sugeno's method is
computationally efficient, works well with optimization and adaptive techniques, and
lends itself to mathematical analysis.
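The five FIS steps can be sketched end to end. The following is a deliberately minimal one-input Mamdani system in Python with made-up Gaussian membership parameters; the chapter's actual system, built in the MATLAB toolbox, has three inputs, and with single-antecedent rules the fuzzy-operator step here collapses into the implication step:

```python
import math

def gauss(x, sig, c):
    """Gaussian membership value of x for a set with width sig and centre c."""
    return math.exp(-((x - c) ** 2) / (2.0 * sig ** 2))

def mamdani(comm):
    """Toy one-input Mamdani FIS over the domain [0, 10] (hypothetical parameters)."""
    in_sets = {"low": (1.5, 0.0), "high": (1.5, 10.0)}    # (sigma, centre)
    out_sets = {"low": (1.5, 0.0), "high": (1.5, 10.0)}
    rules = [("low", "low"), ("high", "high")]            # antecedent -> consequent
    # 1) fuzzification: membership degree of the crisp input in each input set
    mu = {name: gauss(comm, *p) for name, p in in_sets.items()}
    # 2-4) min-implication per rule, max-aggregation over a sampled output domain
    xs = [i / 10.0 for i in range(101)]                   # 0.0, 0.1, ..., 10.0
    agg = [max(min(mu[a], gauss(x, *out_sets[c])) for a, c in rules) for x in xs]
    # 5) defuzzification by centroid of the aggregated output set
    return sum(x * m for x, m in zip(xs, agg)) / sum(agg)
```

A low input yields a low crisp output, a high input a high one, and a mid-range input lands near the middle of the domain, which is the qualitative behaviour the rule table in Section 6.6 encodes.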
Quality is maintained by the crawling process. Web crawling follows one of two
approaches: the crawlers are either allowed to communicate among themselves or they are
not. Both approaches place an extra burden on network traffic. Here a fuzzy-logic-based
algorithm is proposed and implemented in MATLAB using the Fuzzy Logic Toolbox; it
predicts the load at a particular node and routes network traffic accordingly.
6.5 Proposed Solution
1. Using a Fuzzy Inference System to solve the network traffic problem in migrating
parallel crawlers.
2. Defining the FIS variables and fuzzification of the input variables using the
Membership Function Editor.
3. Specifying rules for the fuzzy inference system using the Rule Editor for the
network traffic problem in migrating parallel crawlers.
4. Rule evaluation.
5. Aggregation of the rule outputs.
6. Defuzzification of the output value.
6.6 Description
1. Using a Fuzzy Inference System to solve the network traffic problem in migrating
parallel crawlers.
The theory of fuzzy logic is based on fuzzy sets. Each point in the input space is
mapped to a membership value between 0 and 1, determined by a curve called the
membership function. A fuzzy set is a set without a clearly defined crisp boundary.
The tools used for building and editing fuzzy inference systems in the Fuzzy Logic
Toolbox are:
1. Fuzzy Inference System (FIS) Editor
2. Membership Function Editor
3. Rule Editor
4. Rule Viewer
5. Surface Viewer
The Mamdani method is used as it is widely accepted for capturing knowledge. It allows
the expertise to be described in a more human-like manner.
2. Defining the FIS variables and fuzzification of the input variables using the
Membership Function Editor.
gaussmf: gaussmf is the built-in Gaussian membership function of the Fuzzy Logic
Toolbox. The syntax is y = gaussmf(x,[sig c]). The symmetric Gaussian function depends
on two parameters, σ and c, and is given by

f(x; σ, c) = exp(−(x − c)² / (2σ²))

For example:
x = 0:0.1:10;
y = gaussmf(x,[2 5]);
plot(x,y)
xlabel('gaussmf, P=[2 5]')
Figure 6.6(a): gaussmf curve
trimf: trimf is the built-in triangular membership function of the Fuzzy Logic Toolbox.
The syntax is y = trimf(x,params); with y = trimf(x,[a b c]) the triangular curve is a
function of a vector x and depends on three parameters a, b and c:

f(x; a, b, c) = max( min( (x − a)/(b − a), (c − x)/(c − b) ), 0 )

The first parameter a and the third parameter c locate the feet of the triangle, and the
second parameter b locates its peak. For example:
x=0:0.1:10;
y=trimf(x,[3 6 8]);
plot(x,y)
xlabel('trimf, P=[3 6 8]')
Figure 6.6(b): trimf function
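The two membership functions above can be reproduced outside MATLAB. A small Python sketch of the same formulas, useful for checking values by hand (the scalar-input signatures are a simplification of the Toolbox's vector versions):

```python
import math

def gaussmf(x, sig, c):
    """Gaussian membership: exp(-(x - c)^2 / (2*sig^2)), as in the Toolbox formula."""
    return math.exp(-((x - c) ** 2) / (2.0 * sig ** 2))

def trimf(x, a, b, c):
    """Triangular membership with feet at a and c and peak at b."""
    if a < x < b:
        return (x - a) / (b - a)
    if x == b:
        return 1.0
    if b < x < c:
        return (c - x) / (c - b)
    return 0.0
```

For instance, with the parameters of the MATLAB example, trimf(6, 3, 6, 8) is 1.0 at the peak and falls to 0 at the feet x = 3 and x = 8.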
Figure 6.6(c): FIS editor for Network Traffic Problem
Figure 6.7: FIS variable Communication
Figure 6.8: FIS variable Bandwidth
Figure 6.9: FIS variable Noise
Figure 6.10: FIS output variable NetworkTraffic
Figure 6.6(c) shows the FIS Editor for the network traffic problem. Figure 6.7 shows
the FIS variable Communication, Figure 6.8 the FIS variable Bandwidth and Figure 6.9
the FIS variable Noise. Figure 6.10 shows the FIS output variable NetworkTraffic.
3. Specifying rules for the fuzzy inference system using the Rule Editor for the
network traffic problem in migrating parallel crawlers.
Communication   Bandwidth   Noise    Network Traffic
low             low         low      low
low             low         medium   low
low             low         high     low
low             medium      low      low
low             medium      medium   medium
low             medium      high     medium
low             high        low      medium
low             high        medium   medium
low             high        high     high
medium          low         low      low
medium          low         medium   medium
medium          low         high     medium
medium          medium      low      medium
medium          medium      medium   medium
medium          medium      high     medium
medium          high        low      medium
medium          high        medium   medium
medium          high        high     high
high            low         low      medium
high            low         medium   medium
high            low         high     medium
high            medium      low      medium
high            medium      medium   medium
high            medium      high     high
high            high        low      medium
high            high        medium   high
high            high        high     high
Table 6.1: Rules for FIS
Figure 6.11: Rules Editor for Network Traffic Problem
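Table 6.1 can be checked programmatically by encoding it as a lookup table. The sketch below is an exact transcription of the 27 rules; the variable and function names are illustrative, and unlike the real FIS it returns only crisp levels rather than interpolating between them:

```python
# Table 6.1 encoded as (communication, bandwidth, noise) -> network traffic
TABLE = """
low low low low
low low medium low
low low high low
low medium low low
low medium medium medium
low medium high medium
low high low medium
low high medium medium
low high high high
medium low low low
medium low medium medium
medium low high medium
medium medium low medium
medium medium medium medium
medium medium high medium
medium high low medium
medium high medium medium
medium high high high
high low low medium
high low medium medium
high low high medium
high medium low medium
high medium medium medium
high medium high high
high high low medium
high high medium high
high high high high
"""
RULES = {}
for line in TABLE.split("\n"):
    if line.strip():
        comm, bw, noise, out = line.split()
        RULES[(comm, bw, noise)] = out

def network_traffic(comm, bw, noise):
    """Crisp rule lookup; the FIS itself blends these rules via fuzzy sets."""
    return RULES[(comm, bw, noise)]
```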
4. Rule evaluation, aggregation of the rule outputs and defuzzification of the output
value.
Figure 6.12: Rule Evaluation Aggregation of the rule output
Figure 6.13: Surface Viewer for Network Traffic Problem
Table 6.1 lists the rules for the FIS. Figure 6.11 shows the Rule Editor for the
network traffic problem, Figure 6.12 the rule evaluation and aggregation of the rule
outputs, and Figure 6.13 the Surface Viewer for the network traffic problem.
6.7 Result
The above module is integrated with the algorithm, and the code is generated with the
help of the MATLAB Compiler. The implementation is run on existing websites and
compared with existing web crawlers.
             Page 1   Page 2   Page 3   Total Load in KB
visit 1         185      185      185   555
visit 2         193      196      195
visit 3         188      189      199
visit 4         200      201      205
visit 5         188      199      188
load caused     954      970      972   2896
visit 6         188      189      188
visit 7         198      198      189
visit 8         178      176      189
visit 9         189      187      189
visit 10        199      189      198
load caused    1906     1909     1925   5740
Table 6.2: Load caused using Conventional Crawler
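Reading the tables: each "load caused" row is a cumulative per-page sum, first over visits 1–5 and then over all ten visits. For Table 6.2 the figures can be reproduced as follows (the same arithmetic applies to Tables 6.3–6.5):

```python
# Per-visit page sizes (KB) for Pages 1-3, copied from Table 6.2
visits = [
    (185, 185, 185), (193, 196, 195), (188, 189, 199), (200, 201, 205), (188, 199, 188),
    (188, 189, 188), (198, 198, 189), (178, 176, 189), (189, 187, 189), (199, 189, 198),
]
# cumulative per-page load after five and after ten visits
load_after_5 = [sum(v[i] for v in visits[:5]) for i in range(3)]   # [954, 970, 972]
load_after_10 = [sum(v[i] for v in visits) for i in range(3)]      # [1906, 1909, 1925]
total_5 = sum(load_after_5)    # 2896
total_10 = sum(load_after_10)  # 5740
```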
             Page 1   Page 2   Page 3   Total Load in KB
visit 1          78       87       98   263
visit 2          87       89       98
visit 3          76       98       98
visit 4          87       98       87
visit 5          87       98       89
load caused     415      470      470   1355
visit 6          87       89       87
visit 7          78       98       98
visit 8          98       76       98
visit 9          87       97       98
visit 10         78       98       87
load caused     843      928      938   2709
Table 6.3: Load caused using Single threaded Crawler
             Page 1   Page 2   Page 3   Total Load in KB
visit 1          35       35       37   107
visit 2          36       37       37
visit 3          43       36       45
visit 4          34       45       57
visit 5          34       43       43
load caused     182      196      219   597
visit 6          43       53       43
visit 7          43       34       34
visit 8          45       54       43
visit 9          34       43       45
visit 10         34       34       45
load caused     381      414      429   1224
Table 6.4: Load caused using Agent Based Crawler
             Page 1   Page 2   Page 3   Total Load in KB
visit 1          23       23       24   70
visit 2          24       24       24
visit 3          24       28       27
visit 4          27       26       27
visit 5          24       27       27
load caused     122      128      129   379
visit 6          27       27       27
visit 7          26       26       26
visit 8          26       26       27
visit 9          27       25       27
visit 10         25       26       27
load caused     253      258      263   774
Table 6.5: Load caused using Migrating Parallel Web Crawler
Figure 6.14: Graph showing network load caused in various approaches
Table 6.2 shows the load caused using the conventional crawler, Table 6.3 the load
using the single-threaded crawler, Table 6.4 the load using the agent-based crawler
and Table 6.5 the load using the migrating parallel web crawler; Figure 6.14 graphs
the network load caused by the various approaches. To analyze and compare the
approaches, three websites were taken. The average size of an HTML page was 205 KB, so
the network traffic generated using the traditional centralized crawling approach was
555 KB, whereas in our approach the pages were compressed at the server side and the
traffic load found was 70 KB. After five visits to the pages the load incurred was
2896 KB, 1355 KB, 597 KB and 379 KB respectively, and after ten visits it was 5740 KB,
2709 KB, 1224 KB and 774 KB respectively, as shown in Figure 6.14. This results in
reduced network traffic.
6.8 Conclusion
In this chapter the crawling process was discussed in terms of two approaches: the
crawlers are either generously allowed to communicate among themselves or they are not
allowed to communicate at all, and both approaches put an extra burden on network
traffic. A fuzzy-logic-based algorithm was therefore proposed and implemented in MATLAB
using the Fuzzy Logic Toolbox; it predicts the load at a particular node and routes
network traffic accordingly. The experimental results show that with the migrating
parallel web crawler the network load is reduced.