13.--Analysis of Web Server Log -full

ANALYSIS OF WEB SERVER LOG BY WEB USAGE MINING FOR EXTRACTING

USERS PATTERNS

OM KUMAR C. U.1 & P. BHARGAVI

2

1Department of Computer Science, Sree Vidyaniketan Engineering College, Tirupathi, Andhra Pradesh, India

2Senior Asst.Prof , Dept of CSE, Sree Vidyanikethan Engineering College , Tirupathi, Andhra Pradesh, India

ABSTRACT

WWW is a system of interlinked hypertext documents accessed via the Internet. Around 11 Hundred million

people access internet daily. And so the information available on WWW is also growing. With this continued growth of

information and proliferation of web services and web based information systems, web sites are also growing to host them.

Before analyzing such data using data mining technique the Servers web log need to be preprocessed. The log file data

offer insight into website usage. They can be collected from Web servers, Proxy Servers, Web Client. Web Usage mining

applies data mining technique to extract knowledge from these web log files. This paper discusses about the Log files and

uses Web mining techniques to extract usage patterns by using WEKA.

KEYWORDS: Pre-Processing, Web Usage Mining, Web Server Log Data, Classification, Clustering, Rule Based

Mining, Pattern Discovery

INTRODUCTION

WWW continues to grow at an astounding rate in both information and users perspective. The scale of information

on the internet is growing at an comprehensible rate, similar to the mystifying size of planets and stars. Internet has become

a place where a massive amount of information and data is being generated every day. Every Minute YouTube users upload

48 hours of video, Facebook users share 684,748 pieces of content, Instagram users share 3600 pictures and Tumblr shares

27,778 new posting. Over the last decade with the continued increase in the usage of WWW, Web mining has been

established as an important area of research. Web mining is used to analyse users using WWW who leave abundant

information in web log, which is structurally complex and incremental in nature. A Log file is a record that records

everything that goes in and out of a particular server. Analysing such data will yield knowledge but pre-processing of that

data is required before analysing it. Once analysing the Log File, they provide activities of users over a potentially long

period of time. They can be collected from web server, proxy server and Web client. These logs when mined properly

provide useful information for decision making. They contain information such as username, IP Address, timestamp, bytes

transferred, referred URL, User agent.

Based on the research of web mining [8] they are classified into 3 domains.

Web Content Mining

Web Structure Mining

Web Usage Mining.

Web usage mining in Figure 2 is a process of extracting information from server logs (i.e.,) users history. They

help in finding out what users are looking out in Internet.

International Journal of Computer Science Engineering

and Information Technology Research (IJCSEITR)

ISSN 2249-6831

Vol. 3, Issue 2, Jun 2013, 123-136

© TJPRC Pvt. Ltd.

124 Om Kumar C.U. & P. Bhargavi

Web Structure Mining in Figure 2 is the process of using graph theory to analyse node and connection structure of

web sites. They are of 2 types.

Extracting Pattern from Hyperlinks in the Web

They are structural components that connect the web page to different location.

Mining Document Structure

Analyses tree like structure of page to describe html, xml tag usages.Web content mining in Figure 2 is the mining,

finding pattern and extracting knowledge from web contents. They are of 2 types.

Information retrieval view

Database View

RELATED WORK

Data Preprocessing

The Log file contains immaterial attributes. So before mining the Log file, preprocessing needs to be done.

General preprocessing techniques applied on data are cleaning, Integration, Transformation and reduction. By applying the

above preprocessing techniques incomplete attributes, noisy data which contain errors and inconsistent data that has

discrepancies can be removed. By applying Data preprocessing we improve the quality of data.

Once the cleaned data is transformed, User sessions may be tracked to identify the user, and from it the user

patterns can be extracted.

Figure 1: Data Pre-Processing

The obtained data is now clean and can be analysed. Web mining process is considered for analysing the pre-

processed Log File.

Analysis of Web Server Log by Web Usage Mining for Extracting Users Patterns 125

Figure 2: Web Mining

Web Usage Mining

The goal of web usage mining is to get into the records of the servers (log files) that store the transactions that are

performed in the web in order to find patterns revealing the usage the customers [7][11]. We can also distinguish here:

General access pattern tracking. Here we combine the access patterns of a group rather than an individual to get a

trend that allow us to organize the web structure in such a way that the user is facilitated.

Customized access pattern tracking. Here we gather information regarding a client’s behavior with the website.

Based on the gathered information suggestions and advices are provided to improve the quality.

Web Content Mining

Here information is gathered regarding the search performed on the content to identify user patterns. There are two

main strategies of Web Content Mining are as follows:

Information Retrieval View: R. Kosala et al. summarized the research works done for unstructured data and semi-

structured data from information retrieval view. Study has revealed that researches use frequent words, which is generally a

single word. These Single words are considered as training corpuses for ranking the content in the web based on the number

of referrals. For the semi-structured data, all the works utilize the HTML structures inside the documents and some utilized

the hyperlink structure between the documents for document representation.

Database View

As for the database view, in order to have the better information management and querying on the web, the mining

always tries to infer the structure of the web site to transform a web site to become a database.

Web Structured Mining

This type of mining is usually done to reveal the structure of websites by gathering structure related data. Typically

it takes into account two types of links: static and dynamic.

SERVER LOG

The server responds to user requests and the server log records all the transactions right from start up to shutdown

of the server. They provide time stamp to user requests and respond by recording requested ID with the requested action.


These Log files can be located in 3 places.

Web Servers- A web server is dispenses the web pages as they are requested.

Proxy Server- A proxy server is a intermediary compute that acts as a computer hub through which user requests

are processed.

Web Client- A Web client is a computer application, such as a web browser, that runs on a user local computer or

workstation and connects to a server as necessary

Table 1: Types of Server Logs

S.No Type of Server Log Example

Transfer- these log records remote hosts visiting it with

its time stamp. 120.236.0.14 -2007-11-05

2 Agent - A remote host surfs through a browser. Agent

log records the user agent (browser). InternetExplorer/5.0(win 7;)

3 Error- This log records all the warnings, errors caused

to its system with the time and type of error.

10/10/2012 10:07:09 unable to open

file‖academics.‖ —No such file 05/06/2008

13:04:10—could not load repository template

extension.

4

Referrer- They provide extra features like user reference

as links. When used with agent log provides detail like

the type of user and the user agent. They can track

external hosts using your document from your space.

http://myblaze.sez.html>/pictures/myspace/happy.gif

Contents of Log File

A server log file is a log file that automatically creates and maintains the activities performed in it. It maintains a

history of page requests. It helps us in understanding how and when your website pages and application are being accessed

by the web browser. These log files contain information such as the IP address of the remote host,content requested, and a

time of request.

SYNTAX

IPaddress, logproprietor, Username, [DD:MM:YYYY: Timestamp GMToffset] , "req method" .

Ex

104.11.13.108 - - [13/Jan/2006:16:56:12 -0600]

"GET /EDC/cell.htm HTTP/1.0" 200 4093.

IP address- The IP address of the http request is recorded to identify the remote host. Ex: 204.31.113.138.

log proprietor- The name of the owner making an http request is recorded through this field. They do not expose

this information for security purpose. When they are not exposed they are denoted by (-).

Username- This field records the name of the user when it gets a http request. They do not expose this information

for security purpose. When they are not exposed they are denoted by (-).

date- Request date is recorded here in the mentioned format. Ex: 13/Jan/2006.

time- Request Time of the HTTP request is recorded here in astronomical format. Ex: 16:56:12.


GMToffset- This field shows the time difference between the actual request time and Greenwich Mean time so

that request from corner of the world can be analysed in any part of the world. Ex: -0600.

Reqmethod- The request type of the request is stored. Ex: GET.

Types of Log Formats

NCSA Log Formats,

W3C Extended Log Format,

Microsoft IIS Log Format,

Sun One Web Server Format.

NCSA Log Formats

National Centre for Supercomputing Application (NCSA) established in 1986 developed a web server called httpd

at its centre. This web server had a log initially which had several extensions later.

NCSA Common Log or Access Log Format

Stores basic information about the request received.

Syntax

Host IP address, Proprietor, Username, date: time, request method, status code, byte size.

Ex

200.40.12.4, -, -, [2006/Oct/10:10:16:52 +0500], GET /svec.html http 1.0‖, 200, 1460.

NCSA Combined Log Format

Stores all common log information with two additional fields. ―referrer‖ –history of request, ―user_agent‖ – type

of browser.

Syntax

Host IPaddress, Proprietor, Username, date: time, request method, status code, byte size, referrer, User_agent,

Cookie.

Ex

200.40.12.4, -, -,[2006/oct/10:10:16:52 +0500], GET /svec.html http 1.0‖, 200, 1460, http://www.cbci.com/,

―Mozilla/5.0 (WIN:7)‖, ―UserID=om123;Pwd=101112‖.

NCSA Separate Log Format

In this type the information is split into 3 log files instead of storing it in a single file.

The three log files are a) access log

b) Referral log

c) Agent log

http://www.cbci.com/


W3C Extended Log Format

The worldwide web consortium (w3c) is an international standards organization. They provide rich information

hence the name extended log format.

The lines starting with # contain directives.

#version <int><int>

#Software- the software which generated the log .

#date-<date><time>

#fields-this directive lists a sequence of entries. They are as follows.

Table 2: W3C Directives

Acronym Description

C Client

S Server

CS Client to Server

SC Server to Client

r Remote host

Sr Server to Remote host

rS Remote host to Server

Ex

#version: 2.0

#Software: Microsoft windows server1.0

#date : < date> <time>

#field: C-ip S-ip CS-username CS-method CS-uri CS-version CS-user-agent SC-status

2000-04-14 10: 16: 42 , 192.16.14.1, 200.14.100.4 - GET/Mypictures.gif http1.0 ―Mozilla/4.0‖ 200.

Microsoft IIS Log File

Internet information service enables you to track or record the activities happening in your website through File

transfer protocol (ftp), Network News Transfer Protocol (NNTP), Simple Mail Transfer Protocol (SMTP) by allowing you

to choose a log format that works in synchronization with your system environment. Some supplementary attributes

provided are as follows. Elapsed time, total bytes transferred, target file.

Syntax

IP add, date timestamp, Server name, Server IP, elapsed time, http request size, byte size, status code, error,

request method.

Ex

192.16.10.1,-,10/4/01 14:02:10, svec, 170.42.14.2, 1604, 140, 4240, 200, 0, GET, /Mypicture.gif.

Sun One Web Server

They are similar in functionality with the above mentioned log formats. But provides more security in 2 ways.


-By using Secure Socket Layer (SSL) between client and server.

-Administrator can provide access controls or permissions to files and directories.

Request & Response by the Server

All the log formats specified above has fields that record http request in the form of elapse time and response in

the form of status code.

To Handle Request

Authorization

Checks user ID and password

URI Translation

Translates the Uniform Resource Identifier to local system path.

Checking

Checks the correctness of file path with user privileges.

MIME type checking:

Checks the Multi-Purpose Internet mail Encoding of the requested resource.

Input

Prepares the system for reading input.

Output

Prepares the output for client.

Service

Generates the response to client

Log Entry

Record the activity into the Log.

Error

This field is used only if any of the above mentioned field fails from its normal execution. They are of 2 types.

They are as follows.

Connection Errors

They happen when a connection established for communication with the web server drops.

They are classified as follows:

Void URL

This simply means that the format of the Uniform Resource Locater is invalid.


Host Not Found

This error occurs when the Server could not be found with its host/domain name.

Time Out

When a connection could not be established with in a predetermined time this error occurs. The default time out is

set to 90 seconds.

Connection Refused

This error occurs when an identified host refuses connection through its default port.

No response from Web Server

When an identified Web server fails to respond with a time period this error has said to be occurred.

Unexpected Error

These are errors that does not report itself in an anticipated manner; it cannot be classified into one of the

predetermined categories.

To Handle Response

If a connection is established successfully with a Web Server then the Server responds with one of the following

status codes.

Web Server Status Code & Messages

The Status-Code element is a 3-digit integer result code of the attempt to understand and satisfy the request [6].

Table 3-clearly explains the message status.

Table 3: Status Codes

Status Code Message

1xx Informational

100 Continue

101 Switching protocols

2xx Success

201 Created

201 Accepted

203 Non authoritative Information

204 No content

205 Reset content

206 Partial content

3xx Redirection

301 Moved permanently

302 Moved temporarily

303 See other

304 Not modified

305 Use proxy

4xx Client error

401 Bad request

402 Unauthorized

403 Payment required

404 Forbidden

405 Method not allowed


Table 3:Contd.,

406 Not acceptable

407 Proxy Authentication required

408 Request timeout

409 Conflict

410 Gone

411 Length Required

412 Precondition Failed

413 Required entity too long

414 Required URI too long

415 Unsupported media type

5xx Server error

501 Not implemented

502 Bad gateway

503 Service unavailable

504 Gateway timeout

505 http version timeout

EXPERIMENTAL RESULTS

A company’s log file in Figure 3 is analyzed using WEKA. We apply classification and clustering [9][15]

technique in it.

Figure 3: Log File

(Username, ReqType and UserAgent of the Log File are Not Considered here for Mining)

By applying If Then classification [10] rules we obtain

Rule

IF

freemem <= 193.5

freemem > 117.5

THEN

usr = 93

+ 0.0001 * outtime


- 0.0016 * intime

- 0.0019 * bytesize

+ 0.0022 * req

- 1.9716 * exec

Rule: 2

IF

freemem > 304.5

fork <= 1.095

bytesize > 1626.5

THEN

usr = 88

+ 0.152 * intime

- 0.0009 * outime

- 3.0119 *req

- 0.0011 * exec

+ 0.31 * freeswap

Similarly around 17 rules can be mined. Applying kmeans we obtained prior probabilities of clusters.

Mean Distribution

Attribute: UserId

Normal Distribution. Mean = 25.1373 StdDev = 60.3306

Attribute:In Time


Attribute: Out Time


Attribute: Byte Size


Attribute: Request


Attribute: Exec



Figure 4: Kmeans Cluster-Distribution

Cluster: 0 Prior probability: 0.6381

Cluster: 1 Prior probability: 0.3619

From the above graph Cluster 0 contains maximum number of users.

Farthest First

Farthest first is a variant of K Means that places each cluster centre in turn at the point furthermost from the

existing cluster centre. This point must lie within the data area. This greatly speeds up the clustering in most of the cases

since less reassignment and adjustment is needed.

Cluster Centroids

Cluster 0

1098.0 rajesh M Other password 10/9/2010 10/9/2010 2

Cluster 1

123.0 Vaishnavee F Mobile Bill 9.940870123E9 9/1/2009 9/1/2009 227

Time taken to build model (full training data) : 0.06 seconds

Clustered Instances

0 740 (95%)

1 42 (5%)

Figure 5: Centroids of Cluster


CONCLUSIONS AND FUTURE WORK

There is a growing trend among companies, organizations and individuals alike to gather information from log

files to gather information regarding user but it is a challenging task for them to fulfill the user needs .Web mining has

valuable uses to marketing of business and a direct impact to the success of their promotional strategies and internet traffic.

This information is gathered on a daily basis and continues to be analyzed consistently [15]. Analysis of this pertinent

information will help companies to develop promotions that are more effective, internet accessibility, inter-company

communication and structure, and productive marketing skills through web usage mining This paper gives a detailed look

about servers, data mining, web mining, web server log file and its format. Further we extracted patterns of the user using

clustering, decision trees and If-Then rules.

The extended work to this research work is to mine the log file based user clicks .This need to have a deep insight

in to log files that stores clicks of the user.

REFERENCES

1. V.V.R.MaheswaraRao , Dr.V.Valli Kumari ―An Enhanced Pre-Processing Research Framework For Web Log

Data Using a Learning Algorithm‖, Journal Of Computer science & information technology‖ ,pp.01-15, 2011.

2. Kobra Etminani ,Mohammad-R. Akbarzadeh-T.,Noorali Raeeji Yanehsari,‖ Web Usage Mining: users

navigational patterns extraction from web logs using Ant-based Clustering Method‖, International Fuzzy System

Association &European Society Of fuzzy Logic Technology(IFSA-EUSFLAT )‖ ,pp.396-401, 2009.

3. Mohd Helmy Abd Wahab, Mohd Norzali Haji Mohd, Hafizul Fahri Hanafi, Mohamad Farhan Mohamad Mohsin,‖

Data Pre-processing on Web Server Logs for Generalized Association Rules Mining Algorithm‖,Proc.Of World

Academy Of Science, Engineering and Technology, Vol 36 pp.970-977, Dec 2008.

4. Ratnesh Kumar Jain1 , Dr. R. S. Kasana, Dr. Suresh Jain,‖ Efficient Web Log Mining using Doubly Linked Tree‖,

International Journal of Computer Science and Information Security ‖, Vol. 3, No. 1, pp.402-407, 2009.

5. Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava,‖ Data Preparation for Mining World Wide Web

Browsing Patterns‖.

6. K. R. Suneetha, Dr. R. Krishnamoorthi,‖ Identifying User Behavior by Analyzing Web Server Access Log File‖,

International Journal of Computer Science and Network Security, VOL.9 No.4, pp.327-332, April 2009.

7. S. K. Pani , L.Panigrahy, V.H.Sankar, Bikram Keshari Ratha, A.K.Mandal, S.K.Padhi,‖ Web Usage Mining: A

Survey on Pattern Extraction from Web Logs‖, International Journal of Instrumentation, Control & Automation ,

Volume 1, Issue 1,pp.15-23, 2011.

8. Arvind Kumar Sharma,Dr. P.C. Gupta,‖ Exploration of Efficient Methodologies for the Improvement In Web

Mining Techniques: A Survey‖, International Journal of Research in IT & Management Vol 1, Issue 3, pp.85-95 ,

July 2011.

9. Stavros Valsamidis, Sotirios Kontogiannis, Ioannis Kazanidis, Theodosios Theodosiouand Alexandros Karakos,‖

A Clustering Methodology of Web Log Data for Learning Management Systems‖, pp. 154–167, 2012.


10. M. Malarvizhi, S. A. Sahaaya Arul Mary,‖ Preprocessing of Educational Institution Web Log Data for Finding

Frequent Patterns using Weighted Association Rule Mining Technique‖, European Journal of Scientific Research

Vol.74 No.4, pp. 617-633,2012.

11. Sawan Bhawsar, Kshitij Pathak, Sourabh Mariya, Sunil Parihar,‖ Extraction of Business Rules from Web logs to

Improve Web Usage Mining‖, Vol 2, Issue 8, Aug,pp.333-340, 2012.

12. Vijay K. Gurbani,Eric Burger, Carol Davids, Tricha Anjal,‖ SIP CLF: A Common Log Format (clf) For The

Session Initiation Protocol‖,pp.1-8, 2010. [13]

13. Thanakorn Pamutha, Siriporn Chimphlee, Chom Kimpan, and Parinya Sanguansat,‖ Data Preprocessing on Web

Server Log Files for Mining Users Access Patterns‖, International Journal of Research and Reviews in Wireless

Communications (IJRRWC), Vol. 2, No. 2, June ,pp.92-98, 2012,.

14. Navin Kumar Tyagi, A. K. Solanki and Manoj Wadhwa,‖ Analysis of Server Log by Web Usage Mining for

Website Improvement‖, International Journal of Computer Science Issues(IJCSI), Vol. 7, Issue 4, No 8, July

pp.17-20, 2010.

15. Ian H.Witten, and Eibe Frank,‖ Data Mining: Practical Machine Learning Toolsand Techniques with Java

Implementations‖Morgan Kaufman Publishers, 1999.

Documents

13.--Analysis of Web Server Log -full