
Masarykova univerzita
Fakulta informatiky

Application Log Analysis

Master’s thesis

Júlia Murínová

Brno, 2015


Declaration

I hereby declare that this paper is my original authorial work, which I have worked out on my own. All sources, references and literature used or excerpted during the elaboration of this work are properly cited and listed in complete reference to the due source.

Júlia Murínová

Advisor: doc. RNDr. Vlastislav Dohnal, Ph.D.


Acknowledgement

I would like to express my gratitude to doc. RNDr. Vlastislav Dohnal, Ph.D. for his guidance and help during the work on this thesis. Furthermore, I would like to thank my parents, friends and family for their continuous support. My thanks also belong to my boyfriend for all his assistance and help.


Abstract

The goal of this thesis is to introduce the log analysis area in general, compare available systems for web log analysis, choose an appropriate solution for the sample data and implement the proposed solution. The thesis contains an overview of monitoring and log analysis, the specifics of application log analysis and definitions of log file formats. Various available systems for log analysis, both proprietary and open-source, are compared and categorized, with overview comparison tables of the supported functionality.

Based on the comparison and the requirements analysis, an appropriate solution for the sample data is chosen. The ELK stack (Elasticsearch, Logstash and Kibana) and the ElastAlert framework are deployed and configured for analysis of the sample application log data. The Logstash configuration is adjusted for collecting, parsing and processing the sample data input, supporting reading from a file as well as online socket-based log collection. Additional information for anomaly detection is computed and added to log records during Logstash processing. Elasticsearch is deployed as the indexing and storage system for the sample logs. Various Kibana dashboards for overall statistics, metrics and anomaly detection are created and provided. ElastAlert rules are set for real-time alerting based on sudden changes in monitored events. The system supports two types of input – server logs and client logs – that can be reviewed in the same UI.


Keywords

log analysis, threat detection, application log, machine learning, knowledge discovery, anomaly detection, real-time monitoring, web analytics, log file format, Elasticsearch, Kibana, Logstash, ElastAlert, dashboarding, alerting


Contents

1 Introduction
2 Monitoring & Data analysis
  2.1 Monitoring in IT
  2.2 Online service/application monitoring
  2.3 Data analysis
    2.3.1 Big data analysis
    2.3.2 Data science
    2.3.3 Data analysis in statistics
  2.4 Data mining
  2.5 Machine learning
  2.6 Business intelligence
3 Log analysis
  3.1 Web log analysis
  3.2 Analytic tests
  3.3 Data anomaly detection
  3.4 Security domain
  3.5 Software application troubleshooting
  3.6 Log file contents
    3.6.1 Basic types of log files
    3.6.2 Common Log File contents
    3.6.3 Log4j files contents
  3.7 Analysis of log files contents
4 Comparison of systems for log analysis
  4.1 Comparison measures
    4.1.1 Tracking method
    4.1.2 Data processing location
  4.2 Client-side information processing software
  4.3 Web server log analysis
  4.4 Custom application log analysis
  4.5 Software supporting multiple log files types analysis with advanced functionality
  4.6 Custom log file analysis using multiple software solutions integration
5 Requirements analysis
  5.1 Task description
  5.2 Requirements and their analysis
  5.3 System selection
  5.4 Proposed solution
  5.5 Deployment
6 Application log data
  6.1 Server log file
  6.2 Client log file
  6.3 Data contents issues
7 Logstash configuration
  7.1 Input
    7.1.1 File input
    7.1.2 Multiline
    7.1.3 Socket based input collection
  7.2 Filter
    7.2.1 Filter plugins used in configuration
    7.2.2 Additional computed fields
    7.2.3 Adjusting and adding fields
    7.2.4 Other Logstash filters
  7.3 Output
    7.3.1 Elasticsearch output
    7.3.2 File output
    7.3.3 Email output
  7.4 Running Logstash
8 Elasticsearch
  8.1 Query syntax
  8.2 Mapping
  8.3 Accessing Elasticsearch
9 Kibana configuration
  9.1 General dashboard
  9.2 Anomaly dashboard
  9.3 Client dashboard
  9.4 Encountered issues and summary
10 ElastAlert
  10.1 Types of alert rules
  10.2 Created alert rules
11 Conclusion
  11.1 Future work
    11.1.1 Nested queries
    11.1.2 Alignment of client/server logs
12 Appendix 1: Electronic version
13 Appendix 2: User Guide
  13.1 Discover tab
  13.2 Settings tab
  13.3 Dashboard tab
  13.4 Visualization tab
14 Appendix 3: Installation and setup
  14.1 Logstash setup
  14.2 Elasticsearch setup
  14.3 Kibana setup
  14.4 ElastAlert setup
15 Appendix 4: List of compared log analysis software
16 Literature


1 Introduction

Millions of online accesses and transactions per day create great amounts of data that are a significant source of valuable information. Analysis of such high amounts of data requires appropriate and sophisticated methods to process them promptly, efficiently and precisely.

Data logging is an important asset in web application monitoring and reporting, as it captures massive amounts of data about the application behavior. Analysis of logged data can be a great help with reporting of malicious use, intruder detection, compliance assurance and detection of anomalies that might lead to actual damage.

In my master’s thesis I will be looking into the main benefits of monitoring, web application service log analysis and log records processing. I will be comparing a number of available systems for log records collection and processing, considering both existing commercial and open-source solutions. With regard to the sample data collected from a chosen web application, the most fitting solution will be selected and proposed for the required data processing.

This solution will then be implemented, deployed and tested on the sample application log records. The goals of this thesis are:

• Get familiar with the terms of monitoring, data mining and log records analysis;

• Investigate possibilities and benefits of log records data collection and analysis;

• Look into different types of log formats and the information they contain;

• Compare and categorize commercial and open-source systems available for log analysis;

• Propose an appropriate solution for the sample log records analysis based on the previous comparison and requisites;

• Implement the proposed solution, deploy and test it on the sample data;

• Summarize the results of the implementation and list possible future improvements.


2 Monitoring & Data analysis

Monitoring¹ as a verb means: to watch and check a situation carefully for a period of time in order to discover something about it.

1. From the Cambridge dictionary: http://dictionary.cambridge.org/dictionary/british/monitor

The fundamental challenge in the IT monitoring process is to adapt quickly to continuous changes and to make sure that cost-effective and appropriate software tools are used. The strength of the controlling process is based on both preventive and detective controls, which are also crucial parts of change monitoring. There might be some bottlenecks with regard to the different types of data that need to be monitored, as not all types of monitoring systems allow records logging. Also, automated data logging processes might not be cost-effective because they slow down the processing of the data itself. Basically, the strategies for automated monitoring include IT-inherent, IT-configurable, IT-dependent manual or manual guidelines, and these need to be evaluated carefully considering the requisites and available resources. [1]

2.1 Monitoring in IT

For information technologies in particular, there are a few types of monitoring that are distinguished according to their purpose rather than the contents of the monitoring processes themselves, as those often overlap. Some of the types are listed below and briefly described:

• System monitoring – a system monitor (SM) is a basic process of collecting and storing system state data;

• Network monitoring – a monitoring system set up for reporting network issues (slow processing, connection discrepancies);

• Error monitoring – focuses on error detection, catching and handling potential issues within the code;

• Website monitoring – specific monitoring of website contents and access, reporting broken functionality or other issues related to the monitored website;

• APM (Application performance management) [2] – Based on end user experience and other IT metrics, APM is a fundamental software application monitoring and reporting system that ensures a certain level of service. It consists of four elements, see Figure 2.1:


– Top Down Monitoring (Real-time Application Monitoring) – focuses on end-user experience and can be active or passive;

– Bottom Up Monitoring (Infrastructure Monitoring) – monitoring of operations and a central collection point for events within processes;

– Incident Management Process (as defined in ITIL) – foundation pillar of APM, focuses on improvement of the application;

– Reporting (Metrics) – monitor collecting raw data for analysis of application performance.

Figure 2.1: Anatomy of APM [2]

Online service/application monitoring using log analysis is often compared to APM or error monitoring and contains many overlapping processes. The main difference between them is in the core purpose of monitoring. For APM the emphasis is put more on the end user perspective and on enabling the best application performance possible. Error monitoring focuses on catching potential code errors by implementing an adequate level of error controlling mechanisms in the code.


2.2 Online service/application monitoring

Near real-time monitoring of data logging with automatic reporting is needed to obtain the expected levels of security and quality that need to be maintained 24 hours a day. A certain level of uniformity in logging patterns is important for more possibilities in standardization of the log analysis process. The specified event levels and categories should simplify detecting and handling suspicious activity or system failures. [3]

The crucial part of the monitoring and reporting process is identifying the problematic data in log records and evaluating the appropriate response – automatic, semi-automatic or manual. To decide on the rules to be run for recognition of these malicious patterns, the speed of detection and the processing capabilities need to be considered.

2.3 Data analysis

What is usually understood by the term data analysis is a process of preparing, transforming, evaluating and modeling data to discover useful information that is helpful in subsequent conclusion finding and data-driven decision making.

The process itself includes obtaining raw data, converting it to a format appropriate for analysis (cleaning the dataset), applying the required algorithms on the collected data and visualizing the output for evaluation.

2.3.1 Big data analysis

The term Big data² is mostly used for much larger and more complicated data sets than usual. Huge amounts of records cause great challenges in their treatment and processing, as the traditional approaches are often not effective enough. Advanced techniques are needed to extract and analyze Big data, and new promising approaches are being developed specifically for its treatment.

2. Definition from the Cambridge dictionary: http://dictionary.cambridge.org/dictionary/english/big-data

Big data processing focuses on the collection and management of large amounts of various data to serve large-scale web applications and sensor networks. A field called data science focuses on discovering underlying patterns in complex data and modeling them into the required output. [32]

2.3.2 Data science

A basic data science process consists of a few phases (see Figure 2.2 for a visualization). The process is iterative due to the possible introduction of new characteristics during execution. The phases of data science are listed below: [4]

• Data requirements – Clear understanding of data specifics that need to be analyzed;

• Data collection – Collection of data from specific sources (sensors in environment, recording, online monitoring etc.);

• Data processing – Organization and processing of obtained data into a suitable form;

• Data cleaning – Process of detecting and correcting errors in data (missing, duplicate and incorrect values);

• Exploratory data analysis – Summarizing main characteristics of data and its properties;

• Models and algorithms – Data modeling using specific algorithms based on the type of problem;

• Data product – Result of the analysis based on required output;

• Communication – Visualization and evaluation of the data product, modifications based on feedback.

Figure 2.2: The data science process [4]


2.3.3 Data analysis in statistics

Statistical methods are essential in data analysis as they can derive the most important characteristics from the data set and use this information directly for visualization via basic information graphics (line chart, histogram, plots and charts). In statistics, data analysis can be divided into three different areas: [5]

• Descriptive statistics – It is mostly used for quantitative description. It contains basic functions (sum, median, mean) as characteristics of the data set.

• Confirmatory data analysis (CDA) (also referred to as hypothesis testing) – It is based on probability theory (significance level). It is used to confirm or reject a hypothesis.

• Exploratory data analysis (EDA) – In comparison to Confirmatory data analysis, EDA does not have a pre-specified hypothesis. It is mostly used for summarizing main characteristics and exploring data without formal modeling or testing of content assumptions.

2.4 Data mining

Data mining is, in a sense, a deeper step inside the analyzed data. It is a computational process of discovering patterns in the full data set records to gain knowledge about its contents. Data mining combines the areas of artificial intelligence, machine learning, statistics and database systems to achieve significant information extraction and transformation into a simplified format for future use.

A basic task of data mining is mostly automatic analysis of large amounts of data to detect outstanding patterns, which might be consequently used for further analysis by machine learning or other analytics. There are six basic tasks in data mining:

• Anomaly detection (Outlier/change/deviation detection) – Detection of outstanding records in a data set;

• Association rule learning (Dependency modeling) – Detection of relationships between variables and attributes;

• Clustering – Detection of similar properties of analyzed data and creating groups based on this information;

• Classification – Generalization of type of structure and classification of input data based on the learnt information;


• Regression – Detection of a function to model data with the least error;

• Summarization – Detection of compact structure representing the data set (often using visualization and reports).

Data mining is also considered to be the analytics part of the Knowledge Discovery in Databases (KDD) process, used for processing data stored in database systems. Data mining placement in the KDD process is shown in Figure 2.3. The additional parts, such as data collection and preparation or results evaluation, do not belong to data mining but rather to the KDD process as a whole. [7]

Figure 2.3: Data mining placed in the KDD Process [7]

2.5 Machine learning

Machine learning is a specific field exploring possibilities to use algorithms that are capable of learning from data. These algorithms are based on finding structural foundations to build a model from the training data and derive rules and predictions. Based on the given input, machine learning is divided into the main categories listed below:

• Supervised learning – Example input and corresponding output are presented in training data.


• Unsupervised learning – No upfront information is given about the data, leaving the pattern recognition to the algorithm itself.

• Semi-Supervised Learning – Incomplete input information is provided. It is a mixture of known and unknown desired output information.

• Reinforcement learning – It is based on interaction with a dynamic environment to reach a certain goal (e.g. winning a game and developing a strategy based on the previous success).

Machine learning and data mining use similar methods and often overlap. However, they can be distinguished based on the properties they are processing. While machine learning works with known properties learnt from the training data, data mining focuses on unknown properties and pattern recognition. [6]

2.6 Business intelligence

Business intelligence (BI) is a set of tools and technologies used for processing raw data and other relevant information into business analysis. There are numerous definitions of what exactly BI consists of. In this thesis the definition where internal data analysis is considered a part of BI is used³: business intelligence is the process of collecting business data and turning it into information that is meaningful and actionable towards a strategic goal.

BI is based on the transformation of available data into a presentable form enabling easy-to-use visualization. This information might be crucial for strategic business decisions, threat and opportunity detection and better business insight. [9] The basic elements of Business intelligence are:

• Reporting – Accessing and processing of raw data into a usable form;

• Analysis – Identifying patterns in reported data and initial analysis;

• Data mining – Extraction of relevant information from collected data;

• Data quality and interpretation – Quality assurance and comparison between the obtained data and the real objects they represent;

• Predictive analysis – Using the output information to predict probabilities and trends.

3. Definition available on World Wide Web: <http://www.logianalytics.com/resources/bi-encyclopedia/business-intelligence/>


3 Log analysis

The term data log is usually used for monitoring systems, where data logs are records of the events detected by the sensors. The data logs are further processed in the log analysis.

Log analysis consists of the subsequent research, interpretation and processing of the records generated by data logging. The most usual reasons for log analysis are: security, regulation, troubleshooting, research and automatic incident response. The semantics of specific log records are designed by the developers of the software and might therefore differ for some specific areas of usage, and sometimes these differences are not fully documented. A significant amount of time might therefore be needed for pre-processing the log records and modifying them into a form usable for the subsequent data analysis.

In this thesis I will mainly focus on web log analysis – the analysis of logs generated in web communication and interaction. The following sections include general information about web log analysis, its possible uses and the common formats of these logs.

3.1 Web log analysis

The web log is basically an electronic record of the interaction between the system and its user. Therefore there may be additional user actions that would trigger a record creation (not only the requests for connection or data transmission but also the overall behavior on the webpage, link/button clicking and similar). The area that targets measurement, collection, analysis and reporting of web data is called web analytics. Web analytics has been studied and improved significantly over the past years, mainly because of its significance in increasing the usability of web applications and gaining more users/customers from the marketing point of view. [11]

In comparison, Business Intelligence is focused more on marketing-based analysis of internal data from multiple sources. Even though there are various approaches and software solutions available, BI is still considered freer in terms of implementation and depends highly on the organization's needs, structure and tools. Web analytics, on the other hand, is specialized in analysis of web traffic and web usage trends. As a whole it offers a solution for one area and is separated from the rest of the data. However, the borders are now more blurred and web analytics can sometimes be perceived as one specific data flow used, along with others, as part of Business Intelligence.

The purpose of web log analysis also lies in monitoring the system-user communication. The actions of this communication are stored in electronic records and are subsequently analyzed for behavior patterns. These patterns are important for the research of both user and system behavior and their reaction to various actions. The users' actions can include useful information about their usage of web applications and can be analyzed for system improvements, security defect detection and compliance records. The system replies and actions can reveal malfunctions on the server side, unusual behavior in the treatment of specific actions and erroneous responses. As a result, there are specific areas for the analytic tests performed on the data logs that are discussed in the following section.

3.2 Analytic tests

From the statistical analysis of the data, there are two main kinds of approaches, or branches of communication information classification. The quantitative approach focuses on the numbers of accesses, transmissions, requests/actions and their distribution over time, and the number of clients/ports/sessions. The qualitative approach, on the other hand, detects parts of the communication which are out of the ordinary. Either according to the expectation of the web application usage or the analysis of the test data, there is a certain basic behavioral pattern expected to be seen in the log records output. The records that follow the expected values are considered the normal dataset, and most of the overall analyzed dataset usually belongs to this group.

3.3 Data anomaly detection

There are often records that indicate different results than expected and might be significantly different from the other records in the dataset. These are considered anomalies in the data, and one of the most important goals of log analysis is their detection and treatment.

Anomaly detection, also called outlier detection, is one of the primary steps in data-mining applications. In the first steps of an analysis there is the detection of the outlying observations, which may be considered an error or noise but also carry significant information, as such observations might lead to incorrect specification and results. Some definitions of outliers are more general than others, depending on the context, data structure and method of detection used. The most basic view is that an outlier is an observation in the data set which appears inconsistent with the rest of the data. There are multiple methods for outlier detection, differing according to the data set specifics, and they are often based on distance measures, clustering and spatial methods. Outlier/anomaly detection is used for various applications, such as credit card fraud, data cleansing, network intrusion, weather prediction and other data-mining tasks. [10]
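To make the idea concrete, the sketch below flags values that lie unusually far from the mean of a numeric series, for example request counts per minute extracted from a log. This is only a minimal illustration of one simple statistical approach with example data of my own; it is not the detection method used later in this thesis, and practical systems usually rely on richer distance-based or clustering methods.

import java.util.Arrays;

// Flag values whose distance from the mean exceeds k standard deviations.
public class SimpleOutlierCheck {

    public static boolean[] flagOutliers(double[] values, double k) {
        double mean = Arrays.stream(values).average().orElse(0.0);
        double variance = Arrays.stream(values)
                .map(v -> (v - mean) * (v - mean))
                .average().orElse(0.0);
        double stdDev = Math.sqrt(variance);

        boolean[] outlier = new boolean[values.length];
        for (int i = 0; i < values.length; i++) {
            outlier[i] = stdDev > 0 && Math.abs(values[i] - mean) > k * stdDev;
        }
        return outlier;
    }

    public static void main(String[] args) {
        // e.g. requests per minute observed in a log; the last value stands out
        double[] requestsPerMinute = {12, 15, 14, 13, 16, 14, 15, 240};
        // prints [false, false, false, false, false, false, false, true]
        System.out.println(Arrays.toString(flagOutliers(requestsPerMinute, 2.5)));
    }
}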

The subsequent anomaly analysis is essential for the root cause investigation of the detected anomaly and it helps greatly in both inside and outside threat prevention. The inside kind of defects might include malfunctions in the system code or erroneous request processing. The outside threats are often web-based attacks and intrusion attempts. Anomaly detection plays a significant role in web-based attack detection, in so-called anomaly-based intrusion detection systems (IDSs). A basic intrusion detection system monitors the web communication against a directory of known types of intrusion attacks and takes action once suspicious behavior is detected. However, to ensure a certain level of security against unknown types of attacks, potentially anomalous communication should also be monitored for possible threats. In this area the monitoring of web traffic anomalies is essential for finding new types of attack attempts that can be detected from the behavior records stored in the data logs. [13]

3.4 Security domain

Frequent attack attempts are based on finding applications with flawed functionality. Taking advantage of vulnerabilities, the attacker inserts code which is executed by the web application, causing transfer of malicious code into the backend or reading of unauthorized data from the database.

These types of attacks can be detected in the log files, as the injected code is recorded when sent to the server. Such post-detection is important for avoiding future attacks, but because it runs only after the fact, pro-active monitoring is essential. Basic regular expressions or more complicated methods can be used to create rules for detection of known attacks. Communication containing injected harmful code is then rejected as a result.
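As a small illustration of such regular-expression rules, a logged request string can be checked against a short blacklist of known attack patterns. The class name and the patterns below are illustrative assumptions of mine, not rules taken from the thesis, and a real rule set would be much larger:

import java.util.List;
import java.util.regex.Pattern;

// Minimal sketch of static (rule-based) detection: flag logged request
// strings that match simple patterns typical for known attack attempts.
public class StaticRuleCheck {

    private static final List<Pattern> BLACKLIST = List.of(
            Pattern.compile("(?i)union\\s+select"),   // UNION-based SQL injection
            Pattern.compile("(?i)or\\s+1\\s*=\\s*1"), // classic SQL tautology
            Pattern.compile("(?i)<script[^>]*>")      // reflected XSS attempt
    );

    // Returns true when the logged request matches any blacklist rule.
    public static boolean isSuspicious(String loggedRequest) {
        return BLACKLIST.stream().anyMatch(p -> p.matcher(loggedRequest).find());
    }

    public static void main(String[] args) {
        System.out.println(isSuspicious("GET /products?id=5 HTTP/1.1"));                       // false
        System.out.println(isSuspicious("GET /products?id=5 UNION SELECT password HTTP/1.1")); // true
    }
}

The same check can be run over each incoming log record; matching records are reported or rejected, which corresponds to the negative security model described later in this section.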

The application runs on the 7th layer of the ISO/OSI model, and for detection to be efficient it has to see the relevant traffic. There are multiple parts of the communication that can be subject to attack. Figure 3.1 illustrates high-level attack detection in a network.

Figure 3.1: Illustration of communication zones for attack detection [18]

On the lower layers (network and transport layer) there is a firewall working on traffic analysis based on common protocols. It can detect anomalies in protocols; however, it cannot detect attacks on the application as it does not see the additional data from higher layers. A web application firewall, on the other hand, processes the higher-layer protocols and can analyze the traffic more precisely. It contains enough information for filtering and detection and as a result is a good place to define allowance rules for specific requests and to detect attacks.

Web servers such as Apache and IIS usually create log files in the Common Log Format (CLF) described further in the following section. But this kind of format does not contain data sent in the HTTP header (e.g. POST parameters) – since this header information can contain important data about possible attacks, it is a great deficiency of web server logs. As a part of the application logic there should also be a certain degree of validation of input and output data and integrated logging of security information. The application log files should contain full information about the actions of the user and therefore allow wide possibilities for misuse and threat detection mechanisms. A network intrusion detection system (NIDS) analyzes the whole traffic to and from the application. However, it has some disadvantages, such as difficulties with decrypting SSL communication and real-time processing under high traffic load. Also, working on ISO/OSI layers 3 and 4 makes it unable to detect attacks targeted at higher-layer information.

For attack detection, there are two possibilities – log file analysis and full traffic analysis. Even though log files do not contain all data about the communication, they are easily available and collected. Due to default server-side logging to standard formats, and applications usually containing a basic logging process for traceability of users' actions, log files provide an easily set-up process for security monitoring.

Attacks can be detected using two strategies – static and dynamic rules. The differences between them are based on the way they are created. The recommended attack monitoring system should consist of both types of rules. [18]


• Rule-based detection (static rules) – This strategy defines static rules based on known attack patterns that need to be rejected in order to avoid attacks. These rules are specifically prepared beforehand and stay the same during detection. Static rules are prepared manually based on pre-known information. Static rules can be divided into two models:

– Negative security model – The blacklist approach allows everything by default, all is considered normal and the policy defines what is not allowed (listed on a blacklist). The biggest disadvantage is in the quality of the policy and its need to be updated regularly.

– Positive security model – The positive model is the opposite of the negative one – it denies all the traffic except for that allowed by the policy (listed on a whitelist). Whitelist contents can be learnt in a training phase by a machine learning algorithm or defined manually.

• Anomaly-based detection (dynamic rules) – Dynamic rules are not prepared beforehand on known information. They are obtained in the learning phase on a training dataset using machine learning algorithms. It is essential to make sure the dataset is without any attacks and anomalies to ensure the correct rules are generated. Afterwards, traffic considered different from the normal dataset will be flagged as anomalous.

Anomalous patterns may also be helpful in other application monitoring areas like system troubleshooting. While security monitoring targets detection of suspicious behavior coming from the outside, system performance analysis and troubleshooting are focused on the inside behavior. Internal behavior patterns might reveal errors in the code or even in the design of the application or system setup.

3.5 Software application troubleshooting

Log files can be used in multiple stages of software development, mainly debugging and functionality testing. It is possible to check the logic of a program without the need to run it in a debug mode, using log files for information extraction. Another advantage is that this type of testing is not affected by the probe effect (time-based issues introduced when testing in a specific run-time environment) or by the environment and system setting generation required by currently used testing and debugging customs, and it offers important insight into the overall functionality and performance of a system.

With a sufficient background implementation for automatic log file analysis in software testing, making use of language and specification capabilities, log file analysis can be considered a useful methodology for software verification, somewhere between current testing practice and formal verification methodologies. [26]

From the software development, testing and monitoring perspective, there is valuable information that can be extracted from the log files. This information can be divided into several main classes: [23]

• Generic statistics (e.g. peak and average values, median, mode, deviations) – They are mostly used in setting hardware requirements, accounting and a general view into the system functionality.

• Program or system warnings (e.g. power failure, low memory) – They are mostly used in system maintenance and performance analysis.

• Security related warnings – They are used in security monitoring, discussed in the previous section.

• Validation of program runs – It is used as a type of software testing, included in the development cycle.

• Time related characteristics – They are important for software profiling and benchmarking and can also reveal system performance issues.

• Causality and trends – They contain essential information about the processed transactions and are used mostly in data mining.

• Behavioral patterns – They are mostly used in system troubleshooting, performance and reliability monitoring.

For system troubleshooting, there are various types of valuable information logged, and their extraction can provide essential knowledge about the system behavior and detect performance issues that are not easily found otherwise. Some of the most basic ways to use log analysis for system performance analysis are: [24]

• Slow response – Detection of slow response times can point out directly the functionality area that should be optimized and checked for eventual code errors.

• Memory issues and Garbage collection – Basic error message analysis can provide indications about malformed behavior in specific scenarios, and out-of-memory issues are some of the most common ones. These might often be caused by slow or long-lasting garbage collection, which can also result in overall slow application behavior.

• Deadlocks and Threading issues – With more users accessing the application resources simultaneously, the greater becomes the potential of them creating deadlock situations¹. Preventing as well as dealing with these occurrences is therefore an important part of application logic, and their detection can significantly improve performance optimization.

• High resource usage (CPU/Disk/Network) – High resource usage might result in slowing down the performance or even halting the system. These irregularities can therefore help to detect the busiest times of system usage or even the need for additional resource allocation due to increased user demands.

• Database issues – Once the applications communicate directly with the database, the query results as well as response times and potential multithreaded access issues are significant to the overall functionality and application responsiveness.

1. Deadlock – a situation in which two or more competing actions are each waiting for the other to finish, and thus neither ever does.

However, not only what occurs in the system is worth detecting. Inactivity, which can be easily found by log file analysis, also provides important insight for system monitoring. If an important action that was scheduled to run did not happen, it would not generate any error message, but it would still make a significant impact. As a result, it is important not only to monitor and search the logged data for error messages and behavior patterns that happened, but also to detect those actions and situations where nothing happened even though it should have. Therefore it is worth looking into the possibilities of their detection and compilation in order to maintain a certain quality of service. [25]

However, the contents of log files can differ greatly from system to system. Depending on the desired information, the format of log files often needs to be adjusted to contain the specific information. Basic server logging files usually contain standard information used for server-side monitoring and troubleshooting; for specific application logic analysis, additional log files may need to be generated with more descriptive information.

3.6 Log file contents

Web log analysis software (sometimes also called a web log analyzer) is a tool processing a log file from a server; according to its values it obtains knowledge about who, when and from where accessed the system and what actions took place during a session. There are various approaches to log file generation and processing. Logs may be parsed and analyzed in real time, or may be collected and stored in databases to be examined later on. The subsequent analysis then depends on the required metrics and the types of data the analysis focuses on. The basic information contained in the web log format tends to be similar across different systems. This, however, depends on the software application type. As a result, there is a different log file output generated by intrusion detection systems, antivirus software, the operating system or the web server when creating access logs. These differences need to be taken into account when storing and processing data from multiple sources. There are also various recommendations for log management security published by the National Institute of Standards and Technology that should be followed when processing log records internally within organizations. [14]

There are some default types of variables and values that are generated for the web logs by the specific web server software solutions. However, even for web server solutions like the Apache web server software², it is possible to alter and configure the generated web log format according to specific needs. [67]

2. The Apache web server software is one of the most used open-source solutions worldwide. More information is available from the World Wide Web: <https://www.apache.org/>

3.6.1 Basic types of log files

There are basic log file types that are used by web server logging services. These may differ according to the type of server as well as its version, and an important part of the preparation for log file analysis is getting familiar with their contents and requirements. There are also multiple different logs generated based on their triggering event, contents and logic, such as error logs, access logs, security logs and piped logs. Selected web server log formats are: [15]

• NCSA Log Formats – The NCSA log formats are based on NCSA httpd and are mostly used as a standard for HTTP server logging contents. There are also specific types of NCSA formats such as:

– NCSA Common (also referred to as access log) – It contains only basic HTTP access information. Its specific contents are listed in the following section.

– NCSA Combined Log Format – It is an extension of the Common NCSA format as it contains the same information with additional fields (referrer, user agent and cookie field).

– NCSA Separate (three-log format) – In this case the information is stored in three separate logs – access log, referrer log and agent log.


• W3C Extended Log Format – This type of log format is used by Microsoft IIS (Internet Information Services) versions. It contains a set of lines that might consist of directives or entries. Entries are made of fields corresponding to HTTP transactions, separated by spaces and using a dash for fields with missing values. Directives contain information about the rules for the logging process.

Apart from the main server log file types, there are multiple specific ones that might be generated by FTP servers, supplemental servers or application servers³.

3. For example, the server log types of the Tomcat server: <https://support.pivotal.io/hc/en-us/articles/202653818-Tomcat-tc-Server-log-file-types-2009881>

It is also important to set up application logging functionality in order to simplify troubleshooting and maintenance as well as to increase protection from outside threats. A lot of systems contain server and database logging, but application event logging is missing, disabled or poorly configured. However, application logging provides valuable insight into the application specifics and has the potential of bringing much more information than the basic server data compilation. Application log formats might differ greatly as they are highly dependent on the application specifics, its development and needs. Nevertheless, within an application, organization or infrastructure the log file format should be consistent and as close to standards as possible. [27]

There are also logging utilities created for simplified definition of consistent application logging and tracking APIs. Once a standardized logging file format is used, its subsequent pre-processing and analysis become much simpler. An example of a widely used API is the open-source log4j API for Java, which offers a whole package of logging capabilities and is often used for log generation in applications written in Java. [28]

However, for basic logging functionality or a simple web application the default utilities might generate sufficient records. To decide whether the default log file contents are sufficient for the users' needs, basic insight and knowledge about the common log file formats is required.

3.6.2 Common Log File contents

The Common Log Format, or the NCSA Common log format, is based on logging information about the client accessing the server. Due to its standardization, it can be more easily used in multiple web log analysis software tools. It contains the requested resource and some additional information, but no referrer, user agent or cookie information. All the log contents are stored in a single file.


An example of the log file format is:

host id username date:time request status bytes

• host – the IP address of the HTTP client that made a request;

• id – the identifier used for a client identification;

• username – the username or the user ID for the authentication of the client;

• date:time – the date and time stamp of the HTTP request;

• request – the HTTP request, containing three pieces of information – the resource (e.g. URL), the HTTP method (e.g. GET/POST) and the HTTP protocol version;

• status – the numeric code indicating the success/failure of the request;

• bytes – the number of bytes of data transferred as part of the request, without the HTTP header.

The described type of common log file format contains only the most essential information. Usually more items are added into the log obtained throughout the session, depending on the type of data that needs to be received from the web server visit logs. Often information is included about the browser type and its version, the operating system, or other actions of the user during the session.
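To illustrate, a single record in this format and a minimal parsing sketch are shown below. The sample line and its field values are illustrative only (not taken from the thesis data set), and the regular expression is a simplifying assumption that the records are well formed:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Parse one NCSA Common Log Format record with a regular expression.
public class CommonLogParser {

    // host id username [date:time zone] "request" status bytes
    private static final Pattern CLF = Pattern.compile(
            "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)$");

    public static void main(String[] args) {
        String line = "192.168.1.10 - jdoe [10/Oct/2015:13:55:36 +0200] "
                + "\"GET /index.html HTTP/1.1\" 200 2326";

        Matcher m = CLF.matcher(line);
        if (m.matches()) {
            System.out.println("host:    " + m.group(1));
            System.out.println("user:    " + m.group(3));
            System.out.println("time:    " + m.group(4));
            System.out.println("request: " + m.group(5));
            System.out.println("status:  " + m.group(6));
            System.out.println("bytes:   " + m.group(7));
        }
    }
}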

3.6.3 Log4j files contents

Log4j is a Java logging utility developed under the Apache Software Foundation and is platform independent. Log file contents are labeled with defined standard levels of severity of the generated message. The basic Log4j log message levels are listed below: [30]

• OFF – The OFF level has the highest possible rank and is intended to turn logging off.

• FATAL – The FATAL level designates very severe error events that will presumably lead the application to abort.

• ERROR – The ERROR level designates error events that might still allow the application to continue running.

• WARN – The WARN level designates potentially harmful situations.

• INFO – The INFO level designates informational messages that highlight the progress of the application at a coarse-grained level.

• DEBUG – The DEBUG level designates fine-grained informational events that are most useful for application debugging.

• TRACE – The TRACE level designates finer-grained informational events than the DEBUG level.

• ALL – The ALL level has the lowest possible rank and is intended to turn all logging on.

Log4j file contents can be adjusted using a properties file, XML or through Java code itself. The log4j logging utility is based on three main components which can be configured:

• Loggers – Loggers are logical log file names which can be independently configured according to their level of logging, and they are used in application code to log a message.

• Appenders – Appenders are responsible for sending a log message to an output, e.g. a file or a remote computer. Multiple appenders can be assigned to a logger to enable sending its information to more outputs.

• Layouts – Layouts are used by appenders for output formatting. The most used format, with every log entry on one line containing the defined information, is PatternLayout, which can be further specified using the ConversionPattern parameter.

The PatternLayout is a flexible layout type defined by a conversion pattern string (a regular expression defining the requested string pattern). The goal is to format logging event information into a suitable format and return it as a string. Each conversion specifier starts with a percent sign (%) and is followed by optional format modifiers and a conversion character. The conversion character specifies the type of data, e.g. category, priority, date, thread name. Any type of literal text can be inserted into the pattern. [31] Conversion characters are listed in Table 3.1. As a result, the ConversionPattern can be used to define the specific logger output format using the listed characters in its definition.


Conversion character    Type of data

c    Category of the logging event
C    Class name of the caller issuing the logging request
d    Date of the logging event
F    File name where the logging request was issued
l    Location information of the caller which generated the logging event
L    Line number from where the logging request was issued
m    Application supplied message associated with the event
M    Method name where the logging request was issued
n    Platform dependent line separator character or characters
p    Priority of the logging event
r    Number of milliseconds elapsed from the construction of the layout until the creation of the logging event
t    Name of the thread that generated the logging event
x    NDC (nested diagnostic context) associated with the thread that generated the logging event
X    MDC (mapped diagnostic context) associated with the thread that generated the logging event
%    The sequence %% outputs a single percent sign.

Table 3.1: List of Conversion characters used in ConversionPattern

For example, the desired pattern can be defined by the string sequence:

%d [%t] %-5p %c - %m%n

A possible output might then look like:

2015-02-03 00:00 [main] INFO log4j.SortAlgo - Start sort

The meanings of the items separated by spaces in the example are:

• %d (2015-02-03 00:00) – date of the logging event;

• %t ([main]) – name of the thread that generated the logging event (in brackets according to the pattern definition);

• %-5p (INFO) – priority of the logging event (the conversion specifier %-5p means the priority of the logging event should be left justified to a width of five characters);


• %c (log4j.SortAlgo) – category of the logging event;

• %m (Start sort) – application supplied message associated with the logging event (the dash is an added literal character between the category and the full message text, according to the pattern definition);

• %n – adds a line separator after the logging event record.
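A minimal programmatic configuration that produces records in this pattern might look as follows. This is only a sketch assuming the log4j 1.x library on the classpath; in practice the same setup is usually kept in a log4j.properties file, and the logger name here simply mirrors the example output above:

import org.apache.log4j.ConsoleAppender;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.log4j.PatternLayout;

// Configure a logger with the ConversionPattern discussed above and log one message.
public class Log4jPatternExample {

    public static void main(String[] args) {
        PatternLayout layout = new PatternLayout("%d [%t] %-5p %c - %m%n");
        ConsoleAppender appender = new ConsoleAppender(layout);

        Logger logger = Logger.getLogger("log4j.SortAlgo");
        logger.addAppender(appender);
        logger.setLevel(Level.INFO);

        // prints e.g.: 2015-02-03 00:00:00,123 [main] INFO  log4j.SortAlgo - Start sort
        logger.info("Start sort");
    }
}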

The log4j logging utility provides wide possibilities for adjusting the format, contents and functionality of application logging, which can ease the subsequent analysis and log file management.

There is also a variety of possibilities for filtering message contents in the generated log file records. Full textual searches and result filtering based on specific message strings that might reveal potential threats or system malfunctioning can be configured and automated. Contextual patterns that are potentially important to review can also often be easily defined by e.g. regular expressions and searched for.

3.7 Analysis of log files contents

To gain the desired knowledge from the log file contents, the subjected parts of the records need to be collected, extracted, pre-processed and analyzed as a dataset. The subsequent visual representation allows easier behavior and pattern recognition from the development or marketing point of view. Some of the basic metrics learnt from web log analysis are listed below; a small counting sketch follows the list:

• Number of users and their visits;

• Number of visits and their duration;

• Amount and the size of the accessed/transferred data;

• Days/hours with the highest numbers of visits;

• Additional information about the users (e.g. domain, country, OS).
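As a small counting sketch (using illustrative host and date values of my own, not the thesis data), two of these metrics – visits per client host and visits per day – can be derived from already parsed log records as follows:

import java.util.Map;
import java.util.TreeMap;

// Count visits per host and per day from already parsed log records.
public class BasicLogMetrics {

    public static void main(String[] args) {
        // host and date fields extracted from parsed log records
        String[][] records = {
                {"192.168.1.10", "10/Oct/2015"},
                {"192.168.1.10", "10/Oct/2015"},
                {"10.0.0.7",     "10/Oct/2015"},
                {"10.0.0.7",     "11/Oct/2015"},
        };

        Map<String, Integer> visitsPerHost = new TreeMap<>();
        Map<String, Integer> visitsPerDay  = new TreeMap<>();
        for (String[] r : records) {
            visitsPerHost.merge(r[0], 1, Integer::sum);
            visitsPerDay.merge(r[1], 1, Integer::sum);
        }

        System.out.println("Visits per host: " + visitsPerHost);
        System.out.println("Visits per day:  " + visitsPerDay);
    }
}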

The goal of web log analysis software is therefore to obtain, among others, the listed information from the generated log records. The following chapter provides an overview of selected available software systems designed for this task and their comparison.


4 Comparison of systems for log analysis

When choosing the most appropriate analytics software, there are a couple of things that need to be taken into account. These include the required or expected functionality of the analysis software, the web application and data storage specifics, and the size and amount of data for analysis. Support and competency on premises as well as financial options should also be evaluated when making the decision.

There are various possibilities for categorization of the available systems for log analysis. In this thesis, I first describe multiple different approaches and categorize them according to their main focus. Then I choose and compare some existing systems that belong to the specified categories according to the capabilities they offer. This comparison is based on the overall information offered publicly by the selected systems and is meant primarily as a high-level overview of the functionality that is available.

4.1 Comparison measures

Considering web analytics as not only a tool for web traffic measurement but also a business research information source, the offerings of some web analytics software types might contain functionality closer to web page optimization and performance increase with on-page actions monitoring.

These are divided into off-site web analytics, which analyze the web page visibility on the Internet as a whole, and on-site web analytics, which track user actions while visiting the page. To ease the use of web analytics with no on-premises demands, and also to enable client-side monitoring, a different method apart from log file analysis came up – page tagging. As a result, software can be divided into categories according to the tracking method it uses – client-side tracking (page tagging), physical log file tracking and analysis, or eventually full network traffic monitoring.

4.1.1 Tracking method

The two main approaches, considered mainly in the web log analysis area, are tracking the client-side and the server-side information. Page tagging is a tracking method based on adding a third party script to the webpage code, enabling recording of user actions on the client side using JavaScript and Cookies and sending the information to an outside server. These types of solutions are also often based on a hosted software approach or Software as a Service (SaaS). On the other hand, log files are generated on the server side and therefore contain server-side information. However, log files can also be transferred outside for processing, and there are also hosted software solutions available for log file analysis that is done on third party premises. Some of the differences in contents between the client- and server-side information processing are listed below: [19]

• Visits – Due to tracking based on JavaScript and cookies, the hostedsoftware might not be able to output completely accurate information asa result of users with disabled JavaScript, regularly deleted Cookies orblocked access to analytics. Also it does not track robots and spiderswhile all the interaction information including the above mentioned isrecorded in the web logs.

• Page views – While the log file is tracking only communication goingthrough the server, it would not include the page reload as it is usuallycached in the browser. Client-side software would on the other handrecord the re-visit.

• Visitors – There is a difference in visitor recognition, as the tagging script identifies the user by cookies (which might be deleted), while the log file records the Internet address and browser.

• Privacy – Specifically for SaaS tagging-based systems, as a third party collects and processes the obtained information, there are privacy concerns that are not present for local log file analysis.

To sum up, there are advantages and disadvantages to both approaches, and the decision should be based on the specific requirements for the software. Log file analysis does not require changes to webpages and contains the basic required information by default (as default logging can be easily enabled and tracked on web servers). Data are stored and processed on premises, and more internal information can be extracted from the records. Page tagging, on the other hand, captures information from the client side that is not recorded in log files (e.g. clicks, cached re-visits etc.) and is available to web page owners that do not have local web servers or support for on-premises analysis. Often both approaches are combined and used for in-depth analytics.

However, even though the term page tagging is mostly used for client-side information tracking, PHP server-based tags can also be used to generate additional information. As already mentioned, a number of valuable data sources are omitted when using only client-side tagging, while physical log files contain too much redundant information. PHP tagging enables both acquiring server-side information and choosing which information needs to be collected.

There are also other tracking methods that can be used in (mostly) web-based applications/systems, such as full network traffic monitoring. Network traffic monitoring may include much more information about the overall system behavior than log files or page-tag output. However, it is also more complicated to implement, and the whole monitoring process needs to be set up carefully and manually, while logging is generally a built-in capability that is easy to set up, adjust and process.

4.1.2 Data processing location

As partly noted in the previous section, systems for log analysis can also be categorized according to how (or where) the obtained data is collected and processed. From this point of view, the basic distinction is between the hosted (SaaS) type, which processes data on centrally hosted servers (also referred to as on-demand software), and the self-hosted (on-premise) type, which runs on the user's local server.

The gradually increasing interest in cloud-based and outsourced services shows that they are often the easiest solution for standalone, non-complex applications and for small businesses without a sufficient hardware and software foundation. Software as a Service, or the hosted type of software solution, is based on a delivery model where data is processed (and sometimes also collected) on the premises of the software provider. The main advantages of this approach are that the user neither needs to own hardware and software with the desired capacity and performance, nor needs to cover maintenance, support and additional technical services. The basic idea of hosted software is that the service is managed entirely by the software provider and the user only gets the desired results of the process. The understandable disadvantage is that data (often containing sensitive information) is transferred to and processed by a third party, which raises security and privacy concerns. Even though cloud and SaaS providers are legally required to commit to certain data protection, transparency and security standards, users might consider processing their data on premise a safer and more convenient approach.

The second type of data processing location is the traditionally self-hosted approach, also called on-premise deployment. This approach involves installing and setting up the software solution on the user's server and letting it process data locally.

In conclusion, data processing location requirements may differ according to the type of organization, on-premise hardware and software support or the sensitivity of data contents. Apart from data location and the type of tracking used, there is one more important thing to consider when choosing an appropriate solution: the price and license of the software.


4.2 Client-side information processing software

Client-side information is usually obtained using page tagging, even though some of the software solutions listed in this section also include log analysis as an additional source of input data. The common feature of these types of software is a focus on tracking user actions and activity as well as basic statistics containing information about the background of the user. The aim of client-side tracking software is to optimize the performance of a web-based application/page so that it is appealing to current customers/users as well as attractive to new ones. Selected client-based software solutions:

• Google Analytics [33] – One of the most widely used web analytics tools worldwide, containing a wide variety of features. It includes anomaly detection [21], is easy to use and is free for basic use (with the possibility to upgrade to a paid premium version).

• Clicky web analytics [34] – Hosted analytics software that offers real-time results processing, basic customer interaction monitoring functionality and ease of use. Pricing depends on the daily page views and number of tracked web pages.

• KISSmetrics [35] – Tool offering funnel (visitors' progression through specified flows)1, A/B test and behavior change reports. It offers a 14-day trial and the starter price begins at $200 per month.

• ClickTale [36] – Software focused on customer interaction monitoring, providing heat map analytics, session playback and conversion funnels along with basic web analytics reports. It offers a trial demo and pricing depends on the purchased solution.

• CardioLog [37] – Software designed for the Windows platform, intended for use with on-premises SharePoint servers, Yammer and hybrid deployments, including Active Directory integration. It contains basic analytic reporting with a UI built directly into the SharePoint site and is easy to deploy. A 30-day trial is available; full-functionality pricing depends on the chosen solution (on premise/on demand/hybrid) and the chosen features.

• WebTrends [38] – Solution offering rich functionality covering mobile, web, social and SharePoint monitoring. Apart from reports, it is possible to integrate internal data into the statistics and use performance monitoring for anomaly detection.

1. More information about funnel functionality: <http://support.kissmetrics.com/tools/funnels/>


• Mint [39] – On-premises JavaScript tagging based tool, offering basic reports for visits, page views, referrers etc. Requirements are Apache with MySQL and PHP scripting; the price is $30 per site.

• Open Web Analytics [40] – Open source web analytics software written in PHP, working with a MySQL database, that is deployed on premise but also uses tagging for analytics processing. There is built-in support for content management frameworks like WordPress and MediaWiki.

• Piwik [41] – Open analytics platform that, apart from default JavaScript tracking and PHP server-side tagging, also offers the option to import log files to the Piwik server for analysis and reporting. There are more possibilities to adjust the reporting according to needs; however, as a result the solution is not as easy to use. Piwik PRO also contains on-premises solutions for Enterprise and SharePoint with pricing depending on the scale.

• CrawlTrack2 [42] – Open source analytics tool that is based on PHP tagging, enabling a wider range of obtained information including spider hits and other server-side information.

• W3Perl [43] – CGI-based open source web analytics tool that works with both page tracking tags and reporting from log files.

Selected features of client-side tracking software are compared in Table 4.1. The first compared feature, Tracking traffic sources & visitors, is fundamental functionality of client-side analysis software, as it is based on client-side information and unique visitor IDs. The Tracking robot visits feature is less often supported, as robots usually cannot be detected using a client script only (they can, however, be detected by PHP tagging). The Custom dashboard feature compares the capability of adjusting the dashboard or statistics report contents. Real-time analysis builds on the continuity of information being processed/received thanks to the script present on pages, which makes this functionality easier to support than in log file analysis. Keyword analysis can be a very helpful feature, mainly for SEO optimization work, although it does not always belong to the basic features of client-side analyzers. Mobile geo-location is a nice feature for increased tracking ability, but it is supported by only a limited number of the reviewed solutions.

2. CrawlTrack uses PHP tagging, which also enables server-side information; however, due to its main focus on basic client-side statistics with only spider hits included, it is listed among the client-side tracking software.


Solution             Tracking traffic     Tracking       Custom      Real-time   Keyword    Mobile
                     sources & visitors   robot visits   dashboard   analysis    analysis   geo-location
Google Analytics     ✓                    ✗              ✓           ✓           ✓          ✓
Clicky               ✓                    ✗              ✗           ✓           ✗          ✗
KISSmetrics          ✓                    ✗              ✓           ✓           ✓          –
ClickTale            ✓                    ✗              ✗           ✓           ✓          ✓
CardioLog            ✓                    ✗              ✓           ✓           ✓          ✗
WebTrends            ✓                    ✓              ✓           ✓           ✓          ✗
Mint                 ✓                    ✗              ✓           ✓           ✗          ✗
Open Web Analytics   ✓                    ✓              ✓           ✗           ✓          ✗
Piwik                ✓                    ✓              ✓           ✓           ✓          ✓
CrawlTrack           ✓                    ✓              ✗           ✓           ✓          ✓
W3Perl               ✓                    ✓              ✗           ✓           ✓          ✓

Table 4.1: Comparison of selected client-based software features

4.3 Web server log analysis

Web server log analysis tools use log files in their standard formats (generated by IIS or Apache) and are optimized for their processing. Even though they might also support analysis of customized log file formats, the output is mostly intended for basic server connectivity statistics and monitoring, with no additional features that might be required for application log analysis.

• AWStats [44] – Free open source tool that works as a CGI script on the web server or can be launched from the command line. It evaluates the log file records and creates basic reports for visits, page views, referrers etc. It can also be used for FTP and mail logs.

• Analog [45] – Open source web log analysis program running on all major operating systems. It is provided in multiple languages and processes configurable log file formats as well as the standard ones for Apache, IIS and iPlanet.

• Webalizer [46] – Portable, free, platform-independent solution with advantages in scalability and speed. However, it does not support as wide a range of reporting mechanisms as other alternatives.

• GoAccess [47] – Open source real-time web log analyzer for Unix-like systems with an interactive view running in the terminal. It provides mostly general server statistics on the fly for system administrators.


• Angelfish [48] – Proprietary option for on-premise analysis, often accompanying page tagging solutions. It also contains traffic and bandwidth analysis and can include client-side information gained from web analytics tagging software in the reports. Pricing starts at $1,295 per year.

Selected features of server log file analysis software are compared in Table 4.2. First is the Custom log format capability, which might not always be available but is often crucial when slightly modified log files are to be analyzed. The Unique human visitors feature is quite easy to accomplish with client-side tracking; from the log file analysis standpoint, however, it is not always a priority, and the same holds for the Session duration property. On the other hand, log files offer an easy way to Report countries based on domain and IP address. Detailed Daily statistics are often supported, but Weekly statistics might not be supported in analyzers with basic functionality due to the high number of records to compute.

Solution    Custom       Unique human   Session    Report        Daily        Weekly
            log format   visitors       duration   countries     statistics   statistics
AWStats     ✓            ✓              ✓          IP & Domain   ✓            ✗
Analog      ✓            ✗              ✗          Domain name   ✓            ✗
Webalizer   ✗            ✗              ✗          Domain name   ✓            ✗
GoAccess    ✓            ✗              ✗          IP & Domain   ✓            ✗
Angelfish   ✓            ✓              ✓          IP & Domain   ✓            ✓

Table 4.2: Comparison of selected server log file analysis software features

4.4 Custom application log analysis

Fundamental functionality expected from application log analysis consists of parsing custom fields in log records, viewing the records in a consolidated form, searching for specific data using custom queries and highlighting results that might be of interest. For a simple application, log file viewers with searching capabilities might offer sufficient functionality for basic application monitoring, as they can be set up to search high numbers of log records for specific issues and to work with custom log field data that differ across platforms and application types. Searching and filtering is often based on regular expression input and configurable queries filtering the contents. Some of the application log file viewing and analysis tools are:


• Log Expert [49] – Free open source tool for Windows; contains search, filtering, highlighting and timestamp features.

• Chainsaw [50] – Open source project under Apache Logging Services focused on GUI-based viewing, monitoring and processing of Log4j files. It offers searching, filtering and highlighting features.

• BareTail [51] – A free real-time log file monitoring tool with built-in filtering, searching and highlighting capabilities supporting multiple platforms and also configurable user preferences.

• GamutLogViewer [52] – Free Windows log file viewer that works with Log4j, Log4Net, NLog and user-defined formats including ColdFusion. It supports filtering, searching, highlighting and other useful features.

• OtrosLogViewer [53] – Open source software for log and trace analysis. Contains searching, filtering with automatic highlighting based on filters and multiple additional options available via plugins.

• LogMX [54] – Universal log analyzer for multiple types of log files; includes a built-in customizable parser, filtering & searching options for large files, and real-time monitoring with alert and auto-response options. Pricing starts at $99 for a basic one-user license.

• Retrospective [55] – Commercial solution for managing log file data, working on multiple platforms and offering wide search, monitoring, security and analytic capabilities with a friendly UI design. Pricing for personal use starts at $92.

These tools can differ according to the supported Log files (even though a custom log file format is often configurable) and also according to Platform. In the following table, All listed for platform stands for Windows, OS X and Unix-like, while Win stands for the Windows platform. While client-side analyzers are often based on tracking statistics, visitors and source referrers and charting them in dashboards, application log analysis tools might not even support statistics generation, as these types of tools are mainly used for intermediate processing after logging and before visualization. Their capabilities are based on filtering & highlighting tools to make better sense of multiple types of data. Log files can be designed to be straightforward, in which case only specific types of log file data are of interest. These can be easily retrieved with configurable searching and automatic highlighting and filtering. Regex or regular expression3 functionality is priceless when searching custom data sources, as regular expressions are powerful tools for retrieving valuable information in a specified format. As for Real time support, this capability can be a plus for locally gathering multiple format types (mainly for monitoring); however, it is often not treated as a priority.

3. A regular expression (regex or regexp for short) is a special text string for describing a search pattern – more information at http://www.regular-expressions.info/

Solution         Platform   Statistics   Log files            Filter &    Regex    Real
                                                              Highlight   search   time
Log Expert       Win        ✗            Custom               ✓           ✓        ✗
Chainsaw         All        ✗            Log4j                ✓           –        ✗
BareTail         Win        ✗            IIS/Unix/custom      ✓           ✗        ✓
GamutLogViewer   Win        ✗            Log4j/custom         ✓           ✗        ✗
OtrosLogViewer   All        ✗            Log4j/Java logs      ✓           ✓        ✗
LogMX            All        ✓            Log4j/custom         ✓           ✓        ✓
Retrospective    All        ✓            Server/Java/custom   ✓           ✓        ✓

Table 4.3: Comparison of selected application log file analysis software features

The solutions listed so far contain mostly basic functionality for visitor/page view/referrer statistics extraction and visualization, working either with client-side tracked information (obtained by page tagging) or with standard web server file format analysis. The custom application log viewers contain basic searching and highlighting capabilities based on custom search rule setup. Even though some also offer additional functionality for bandwidth/anomaly detection/performance monitoring, they are mostly recommended for small to midsize businesses with webpages or simple web application monitoring.

Once the application log files need to be processed more in depth and specific statistics for security and compliance are required, standard reporting mechanisms might not be sufficient for web log file analysis.

4.5 Software supporting multiple log file types analysis with advanced functionality

Software solutions that include deeper analytics capabilities as well as processing of distinct log file formats can be used for analysis of both basic log files and client-side tagging output. They also often offer a fully functional platform for log file analysis that can factor in additional data input streams. According to specific needs, the contents (input and/or output) can be highly customized and prepared to fit the user's requirements.


To recall the basic steps of data analysis, it consists of data collection, pre-processing, data cleaning, analysis, results overview and communication. It is possible to get a full solution including all the required steps. On the other hand, it is also possible to compose the output from separate software tools according to the data management systems already used in the organization. Some of the tools also used for application log file analytics include:

• Logentries [56] – Hosted SaaS cloud-based alternative for log file collection and analysis. It collects and analyzes log data in real time using a pre-processing layer to filter, correlate and visualize. The software offers rich functionality including security alerts, anomaly detection and both log file and on-page analytics. A free trial is available; a limited-functionality option for sending less than 5 GB/month is free. The starter pack for up to 30 GB/month costs $29 per month.

• Sawmill [57] – Mostly universal solution using both log file entry analysis and on-page script tagging; can be deployed locally or hosted. It also covers web, media, mail, security, network and application logs and supports most platforms and databases. Pricing depends on the chosen solution; the lite pack with limited functionality starts at $99.

According to the needs of the analysis, multiple possibilities exist for acquiring data from local/hosted log files. The system used for data collection as well as the overall data management is significant in choosing the appropriate tool [17]. Some of the richer-functionality solutions for web log monitoring and analysis are:

• Splunk [58] – Splunk is a solution based on working with machine data from the whole environment – devices, apps, logs, traffic and cloud. It therefore offers powerful tools for data management, analysis and results visualization. There is a cloud-based option for data management and storage, or it can be deployed over on-premise databases; it offers data stream processing, insight into mobile device data and a Big data solution – Splunk analytics for Hadoop4 and NoSQL data stores. A 60-day trial is available; pricing depends on the data volume per day, and the Splunk Cloud version costs $675 per month.

– There are also additional analysis tools available for the Splunk solution, such as anomaly detection from Prelert called Anomaly Detective Application for Splunk Enterprise [60]. Prelert offers a REST API which can process basically any feed and also offers a 6-month trial for developers. The application is mostly used in its Splunk plugin form, which adds easy-to-use anomaly detection capabilities to the machine data analysis and monitoring process.

4. Hadoop [69] – framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models

• Sumo Logic [59] – Sumo Logic is a cloud-based solution providing a native log analytics service and machine learning algorithms developed to efficiently analyze and visualize the information from data processing. It includes incident management with pattern recognition, anomaly detection and other monitoring and management tools. It also has a capability called LogReduce at its disposal, which consolidates log lines using recurring pattern detection. On top of this, anomaly detection is available: Sumo Logic scans historical data for patterns and, thanks to LogReduce, works also on lines that are not identical. The tool also allows annotating and naming anomalies, so when one occurs again it can be considered known. There is a 30-day trial option and pricing depends on the data volume per day; at 1 GB/day the cost is $90 per month.

• Grok [61] – Numenta, a developer of data-analysis solutions, released a data-prediction and anomaly detection library modeled after human memory. Grok for IT analytics is an anomaly detection tool for AWS5. It works with most of Amazon's web services and has an API that analyzes system metrics. This solution therefore processes generated metrics rather than log file lines, covers most monitoring capabilities and comes with a friendly, adjustable UI.

• XpoLog Analytic Search [62] – The XpoLog Log Management and Analysis Platform offers a full solution for almost any log file data analysis area, including monitoring, scanning for errors and anomalies, and rule-based detection. It offers collection, management, search, analytics and visualization of virtually any data format – including server and application log files – and built-in cloud integration possibilities for data hosted on Amazon or Google clouds. It can also be used for Big data analytics thanks to the option of integration with Hadoop, and it can be deployed on premise or in the cloud. Pricing depends on the daily log volume.

• Skyline by Etsy [63] – Skyline is an open source solution for anomaly detection based on operational metrics. It consists of several components: the Python-based daemon Horizon accepts data from TCP and UDP inputs and uploads the data to Redis, where they are processed by an Analyzer which utilizes statistical algorithms for abnormal pattern detection. The results are then displayed in a minimalist UI. Additionally, anomaly investigation is also implemented – Oculus [64] is a search engine for graphs, useful for finding graphs similar to the anomaly detected by Skyline.

5. Amazon Web Services [73] – offers a broad set of cloud-based global compute, storage, database, analytics, application, and deployment services

A basic comparison and general information about the software solutions listed in this chapter are given in Table 15 in Appendix 4.

4.6 Custom log file analysis using multiple software solutions integration

Another possibility is to use multiple software solutions for specific parts of the log file management tasks. Some development platforms also offer specific tools that can be used as standalone units or integrated into an existing infrastructure.

• Graylog [65] – Graylog is a fully integrated open source log management platform used for collecting, indexing and analyzing multiple types of data streams. It uses a few key open source technologies: Elasticsearch, MongoDB and Apache Kafka, which allows streamed data to be partitioned across a cluster and provides multiple functions suitable for big data analysis.

• Elastic platform [66] – Platform offering both commercial and open source products aimed at search, analysis and visualization of data insights in real time. The well-known combination of the three open source projects Logstash, Elasticsearch and Kibana is also referred to as the ELK stack.

– Logstash – Collection, parsing and enrichment pipeline designed for easy integration. It is tailored to processing streams of logs, events and other unstructured data sources for further processing.

– Elasticsearch – Distributed, easy-to-use search and analytics engine offering quick searching and analytics via a query language.

– Kibana – Visualization platform for interaction with data analysis output, including a variety of histograms, diagrams and dashboard possibilities.

• Apache family and integration [67] – Open source software integration possibilities may also offer efficient combinations of tools for log file data collection, processing and output.


– Apache Flume [68] – Distributed service built for collecting, aggregating and moving high amounts of data. It is based on joining and collecting multiple distinct data streams. It has a robust, fault-tolerant structure containing recovery mechanisms and optionally defined rule-based pre-processing or alerting possibilities.

– Hadoop HDFS and HBase [69] – Flume provides a pipeline to Hadoop, and the ecosystem of distributed file systems with possible open-source additions offers various analytics capabilities and might be the most efficient option for analysis of large amounts of log data from multiple sources.

– Solr [70] – Offers functionality similar to Elasticsearch, based on search, analytics and monitoring capabilities. It is easily integrated with Hadoop and can also deliver interesting outputs.

– Spark [71] – Engine for large-scale data processing from the Hadoop ecosystem that also supports machine learning, stream processing and graph generation.

– Apache Storm [72] – Distributed real-time computation system for streams of data with built-in capabilities for analytics, online machine learning and others. [22]

The ELK stack abbreviation is generally used even though Logstash, as the data collector and parser, is the first tool used in processing the data. This is explained as: "Because LEK is an unpleasant acronym, people refer to this trinity as the ELK stack." [75] The ELK stack, with its wide user and developer community, is often used for customized input and multiple data stream processing for middle-sized data as well as for higher amounts. It can be easily integrated into open-source built solutions and is often also used as a part of a Big data processing setup.

On the other hand, Apache Flume and Hadoop are mostly used for Big data processing thanks to built-in distributed capabilities and native support for Big data storage, processing and analysis. There are various possibilities for integration and cooperation between Apache and similar open-source project solutions, and they can create powerful data processing tools. Due to the increasing demand for Big data processing, including real-time input of multiple data streams and efficient data storage and analytics performance, the capabilities of Big data processing solutions are rapidly improving. For commercial use, Hunk – the combination of Splunk analytics designed to operate with Hadoop and NoSQL databases – might also be a powerful tool in data management.

For a high-level comparison of the presented software solutions divided into categories, see Table 15 in Appendix 4.


5 Requirements analysis

Once a deeper insight into the purpose and processing of logging and the possibilities of its subsequent analysis is gained, the requirements for the provided sample data can be analyzed and an adequate solution can be proposed. The specifics of the sample data along with the requirements analysis, solution proposal, implementation and evaluation are discussed in the following chapters.

To propose a solution for the specific sample data, the requirements for the desired processing input and, mainly, output need to be analyzed. First, the basic idea of what the processing should work with and what the expected output should consist of is given. Next, specific requirements are listed for selecting the right log analysis system.

5.1 Task description

The basic task is to collect, process and analyze the logs of an application for a car tracking service in order to gain insight into the application's functionality and behavior. The input consists of application log files in log4j format with no additional storage system set up. The expected results should include tracking of user operations, malfunctions and suspicious behavior patterns based on real-time detection of anomalies and known issues, with alerting and automatic response functions built in.

5.2 Requirements and their analysis

A summary of the most important requirements:

• Open-source solution considered as a priority;

• Analysis of a custom application log file format using the log4j logging API (online collection of records is a necessity);

• Focus on analysis for system and application troubleshooting and unknown behavior pattern detection;

• Tracking user operations for compliance reports;

• Creating rules on logs for anomaly detection;

• Getting alerts on errors and suspicious behavior to avoid losses;

• Possibility to retrieve the specific log file contents from detected anomalies or alerts;


• Integration with logs from client devices.

Besides functional requirements, there are two main technical aspects:

• Custom fields – Log records consist of application-specific information and therefore cannot be left unprocessed. The processing system should thus enable custom file format parsing or configurable parsing rules.

• Focus on internal behavior of the application – The chosen system should be able to process physical application log files and include server-side information in the output, as behavior patterns of client-server communication are the essential area of interest.

5.3 System selection

Considering the comparison of systems from the previous chapter, there are types of software that can be considered inappropriate or insufficient for the listed requirements:

• Client-side tagging – This type of tool is not suitable for application internal behavior analysis as it provides only client-side information.

• Server log file parsers and analyzers – These types of software do not support advanced functionality for custom field parsing configuration and analysis.

• Application log file viewers – Viewers are a considerable option for the task, with the following pros and cons:

– Pros – Viewers often offer decent capabilities for processing large files. The main features of the viewers consist of searching, filtering and highlighting of results, which can actually be sufficient for most application behavior monitoring.

– Cons – They often come as standalone applications to which custom analytics functionality is difficult to add. They have limited parsing and storage adjustment capabilities, which makes them more suitable for processing already structured or simple enough messages.

• Open source systems for multiple log file processing – They often consist of a compilation of multiple software tools. This is the most suitable solution for the task, considering the integration abilities of standalone open-source applications and the possibility to adjust the functionality at the code level. One of these merged solutions is the ELK stack.


One of the biggest advantages of the ELK stack (Elasticsearch, Logstash and Kibana) is that it can be used for relatively big data flows as well as for basic application log file analysis, with wide possibilities for add-on functionality and adjustments. Thanks to an active user community and easily obtainable support, its setup is not complicated. There is also a wide spectrum of options for advanced functionality enhancements and, if required, even commercial products can be used on top of an ELK deployment. One example might be Prelert Anomaly Detective.1

5.4 Proposed solution

According to the outcome of the requirements analysis, deployment of the ELK stack for sample data processing is recommended. Logstash for online record collection can be set up easily, and it is possible to configure the default parsing template for custom log file format pre-processing. Elasticsearch, with its rich filtering and search capabilities including options for rule creation (plugins) and anomaly detection, is also appropriate for the task. Kibana would be a benefit for analysis output generation as a friendly GUI tool used for visualization with optional content adjustments.

The following chapters describe the provided application sample log data as well as the implementation and deployment of the proposed solution. Issues and limitations encountered during implementation are also listed according to the part of the system they were detected in. An overall summary of the implementation as well as general issues and possible future work improvements are listed in the Conclusion, chapter 12. Contents of the upcoming chapters:

• Application log data – specifics of sample data contents and encountered issues;

• Logstash configuration – input data collection, parsing and processing tool;

• Elasticsearch – full-text search engine and data storage using JSON documents;

• Kibana configuration – visualizations as well as description of default dashboarding;

• ElastAlert – Elasticsearch framework for alert creation based on rules, and the contents of the default rules.

1. Further information about the integration of Prelert and ELK can be found at http://info.prelert.com/prelert-extends-anomaly-detection-to-elasticsearch.


5.5 Deployment

The topology of the ELK stack components is represented in Figure 5.1. Logstash can collect data from various sources, and these can then be joined together using a broker. Afterwards, the collected unstructured data can be processed by the Logstash configuration for data parsing and enrichment. From Logstash, the processed data is usually indexed directly into Elasticsearch. Once in Elasticsearch, documents can be easily queried and visualized using the Kibana browser-based GUI.

Figure 5.1: ELK stack topology [74]


6 Application log data

The proposed solution for log analysis is supposed to process log records from a car tracking system. I was provided with sample data of two origins on which the solution was tested. The first, and more important for the ongoing analysis, are the application server logs containing records generated in the server-side communication of the application. The second source of data are the client log records from the client device, which can be packaged and sent to the server on request if further information about the previous communication of a specific client is needed. Client logs need to be treated separately on input due to their slightly different contents. However, they are required to be searchable and displayed alongside the server records for efficient troubleshooting.

To gain structured information about log records, two main data adjustments and additions are used:

• Fields – By default, the log record consists of only one field – the overall record text. To gain structured information from the log text, it is parsed using Logstash parsing filters. The parsed parts of log records are stored as contents/values of specified fields. For example, from the record text CarKey 3052, only the number is stored in a field called CarKey.

• Tags – Tags specifying the type of message can be added to the parsed messages as strings appended to the list of values of the tags field. They are useful for tracking a specific type of message so that it is easier to search on. For example, all records for a finished connection are tagged with Connection_finished, as sketched in the example below.
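The following minimal sketch (placed in the Logstash filter section) illustrates how such a field and tag could be produced; the pattern is illustrative only and does not reproduce the exact filters described in chapter 7.

#Illustrative sketch: extract the numeric CarKey value and tag finished connections
if "Connection finished" in [message] {
  grok {
    match => { "message" => "CarKey %{INT:CarKey}" }
  }
  mutate {
    add_tag => "Connection_finished"
  }
}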

6.1 Server log file

Server log files are created by the system using the log4j framework. [28] The pattern of records defined by the log4j PatternLayout is given in Section 3.6.3. The leading section of the log file record contains the same information for all record types, which is:

logdate [thread_name-thread_id (optional) session_id (optional)] message_type module_name - message_text

Example of the server sample data log line:

2015-02-03 00:00:01.900 [WorkerUDP-4246 a88615be-7d07-49f4-8b3f-fe6a7d594c21] INFO ConnectionManager.UDP.Worker - Text of log record


As the messages for connection treatment are of most interest, these are the base records for analytics and dashboards. Three types of connections can be logged in the files: TCP, UDP from the client and UDP initiated by the server (PUSH).

When a connection is initiated, a record is logged containing the IP address and port from which the client connects. These messages differ according to the type of connection and also do not seem to be consistent across various log records. Connection initiation messages are helpful for tracking the overall connection time and make it possible to aggregate messages of a specific connection session together. Nevertheless, there are connections where the initiation message is missing. More information about misalignments in the provided sample data is listed in the following section.

For any connection there is a Connection finished type of record. This message is crucial for the analysis output, as it should be present for all server-client connections and contains all the information about the connection duration and transferred records/files. Example of a Connection finished type of message (message text only):

Connection finished: WorkerUDP: client from /10.1.82.85:45555, TM A24, IMSI 230025100345197, SN null, CarKey 3052, Session 8502f266-aad3-489e-b839-8b0d25f26f9a, Status RUNNING, Failure OK, Created 49 msecs ago, records: 0,1,0, connection: type=U, bytes=278, pos=0, srv=1, msg=0

There are three types of records transferred in a client-server connection in the sample data:

• Position record;

• Service record;

• Instant message record.

The types of records transferred in a specific connection can be tracked according to the records: 0,1,0 part of the message, where the first number specifies the number of positions, the second the number of services and the third the number of transferred instant message records. In the Connection finished record there is also information about these numbers in the connection: type=U, bytes=278, pos=0, srv=1, msg=0 section. According to the provided information about the sample data, the number of transferred records should be the same in both parts of the message. Nevertheless, there happen to be some misalignments, which are flagged as part of anomaly detection as well. The connection type in this message can be U, T or P, corresponding to the type of connection – UDP, TCP or Push. A sketch of how these two parts of the message could be parsed is shown below.
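As an illustration only (the grok expressions actually used are described in chapter 7, and the exact whitespace in the real messages may differ), the two counter sections could be parsed in the filter section along these lines:

#Sketch: parse "records: 0,1,0" into separate counter fields
grok {
  match => { "message_text" => "records: %{INT:record_pos},%{INT:record_srv},%{INT:record_msg}" }
}
#Sketch: parse the "connection: type=U, bytes=278, pos=0, srv=1, msg=0" section
grok {
  match => { "message_text" => "connection: type=%{WORD:conn_type}, bytes=%{INT:conn_bytes}, pos=%{INT:conn_pos}, srv=%{INT:conn_srv}, msg=%{INT:conn_msg}" }
}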


Depending on the connection outcome, there can be either a Connection succeeded or a Connection failed message. Both contain information about the records transferred and the status of the communication – the reason for the failure is also listed in the case of a failed connection.

Example of a basic connection that ended with SUCCESS and contains no error in the inner protocol (message text only):

Connection succeeded: WorkerUDP: client from /37.188.133.70:17225, TM A24, IMSI 230024100956629, SN null, CarKey 2494, Session a88615be-7d07-49f4-8b3f-fe6a7d594c21, Status SUCCESS, Failure OK, Created 40 msecs ago, records: 0,1,0

Example of a basic connection that ended with FAILURE and contains information about the error that occurred (message text only):

Connection failed: WorkerUDP: client from /37.48.42.100:61838, TM A16, IMSI null, SN null, CarKey 0, Session b41803e8-c39b-4a05-a765-d43123dff8a2, Status FAILURE, Failure CLIENT_UNREGISTERED, Created 2 msecs ago, records: 0,7,0 FailureReason: CLIENT_UNREGISTERED

Apart from these, various other types of server log messages are collected. Some of the significant parts of messages parsed from the server logs are:

• logdate – timestamp from the log record in the format YYYY-MM-dd HH:mm:ss.SSS;

• thread_name-thread_id (optional) – type of thread that generated the message and its number (e.g. WorkerUDP-4246);

• session_id (optional) – generated ID of session (e.g. 5dbe1a66-0ee0-4f0d-bd3a-44afe4c852fa);

• message_type – type of message (e.g. ERROR);

• module_name – module that generated message (e.g. ClientRegistry);

• message_text – text of logged message – to be parsed for specific information;

• ip – IP address of client;

• port – communication port of client;


• TM – machine type number/null (number after dash is for internal use only and does not need to be included);

• IMSI – client number/null;

• SN – serial number/null;

• CarKey – identifier of a car/null;

• Status – status of connection – success or failure;

• Failure – if OK no failure;

• FailureReason – reason of failure;

• Created – number of msecs from when connection started;

• files – file names if associated with connection;

• Connection information from the Connection finished message:

– conn_type – U/P/T;

– conn_bytes – number of bytes transferred in communication;

– conn_pos – records of positions;

– conn_srv – records of services;

– conn_msg – records of messages.

• Records from the Connection finished and Connection failed or Connection succeeded types of messages:

– record_pos – records of positions;

– record_srv – records of services;

– record_msg – records of messages.

• DK – driver key.

Furthermore, these parsed parts of records are stored in Elasticsearch and can be used for filtering and querying.

Information about the parsing filters used in the Logstash configuration to obtain this information is given in the Logstash configuration chapter.


6.2 Client log file

Client log files are logs generated by the client communication device. They are in a different format in comparison to the server log messages.

Even though the basic log analysis is run on the server log files, the system also supports adding client log messages. Client logs can be requested by the server and parsed using a specific Logstash configuration file. As a result, log lines from both the server and the client can be reviewed in the same UI and checked for possible communication issues. Information about the client identification number and TM number can be parsed from the log filename (a sketch of this parsing is shown below). For example, an uploaded client log file can be named: sample-client-tld.domain.mod-tmlog-imsi230033164533642-day20150812-A29-10.10.44.138_3A35437-1439377227.9754-contents.txt
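A minimal sketch of how the IMSI and TM values could be extracted from such a filename in the Logstash filter section is given below; the exact pattern is an illustrative assumption (the file input stores the filename in the path field):

#Sketch: pull the IMSI and TM type out of the uploaded client log filename
grok {
  match => { "path" => "imsi%{INT:IMSI}-day%{INT}-%{WORD:TM}-" }
}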

Client log messages also comply with the log4j format. The leading section of the log file record contains the same information for all record types, which is:

logdate - service_name: message_text

Example of the client sample data log line:

2015-08-12 12:52:03 - PowerControl: keepWokenUp called

While the client log messages are considered only an additional information source, they are still parsed to gain information from the message text contents. In combination with the server log files, they can provide valuable insight into the client-server communications that occurred. Some of the information parsed from the client log files:

• logdate – timestamp of log record in format YYYY-MM-dd HH:mm:ss;

• service_name – service that generated log message (can be omitted);

• message_text – text of logged message – to be parsed for specific information;

• IMSI – client number/null (from log filename);

• TM – machine type number/null (from log filename);

• ip – connecting ip address;

• port – connecting port;

• SN – serial number/null;


• simState – status of SIM;

• DK – driver keys;

• globalStatus – global status of device;

• statusMessage – global status message;

• dataStatus – status of data upload;

• networkStatus – status of network;

• GPSStatus – status of GPS;

• satelliteStatus – number of satellites;

• filename – filename of an uploaded/downloaded file;

• uploadStatus – status of file upload/download.

Due to the less precise log time format, misalignments in time may occur when reviewing messages from the server and client logs. There might also be a transfer delay between these messages, so better alignment of client and server messages for simpler review and troubleshooting would require consistent time format usage. The possibility to match the information from client log connection messages with the corresponding sessions in server log connection messages would be a plus. Possibilities for this alignment are outlined in the future work section (11.1).
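For completeness, a date filter matching the coarser client timestamp format might look like the following sketch (analogous to the server-side date filter described in chapter 7; this exact snippet is an illustration, not the configuration used in the implementation):

#Sketch: map the client log timestamp (second precision only) to the Kibana timestamp
date {
  match => [ "logdate", "YYYY-MM-dd HH:mm:ss" ]
  target => "@timestamp"
}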

6.3 Data contents issues

Multiple issues were found in the sample data contents, causing additional effort in their processing. Some of these are listed below:

• Inconsistent sample data contents – Additional sample data are missing previously present types of generated messages, resulting in different processing needs.

• Inconsistent sample data format – Additional sample data use a different format for some messages, resulting in parsing failures.

• Missing sessions for some logs, no sessions for client logs – It is hard to aggregate events belonging to the same connection when the unique identifier is missing.


• Messages with identical message text – Messages with the same contents are sent right after one another (a few milliseconds apart). An example of messages with identical message text from the sample data is listed below.

2015-11-29 16:25:10.735 [ClientRegistry] INFO ClientRegistry - Refreshed information for carKey 2280: IMSI=230023100144619, serialNumber=null, compKey=6564
2015-11-29 16:25:10.739 [ClientRegistry] INFO ClientRegistry - Refreshed information for carKey 2280: IMSI=230023100144619, serialNumber=null, compKey=6564

• Inconsistent property names – There are changes in property names such as CarKey/carKey/car_key, or they need to be derived from context (e.g. IMSI/client).

• Different empty field value – The empty field can be parsed from messages as null or 0 (e.g. the CarKey field value).

• Different values and field names for properties in the same session – The field value changes to null/0 within the same session. An example of the inconsistent messaging format, field names and changes in values is listed below (the SerialNumber/SN value changes).

2015-02-03 23:59:59.989 [WorkerUDP-514 9ca1010b-5484-466c-b091-e5ed77f5b92a] DEBUG ClientRegistry - Loaded Client: CarKey=3111, IMSI=230023100133488, SerialNumber=null, phone= ...
2015-02-03 23:59:59.938 [WorkerUDP-514 9ca1010b-5484-466c-b091-e5ed77f5b92a] DEBUG ClientRegistry - Getting client information for IMSI=230023100133488, SerialNumber=89420203
2015-02-03 23:59:59.938 [WorkerUDP-514 9ca1010b-5484-466c-b091-e5ed77f5b92a] INFO ClientRegistry.Client.Abstract - Connection finished: WorkerUDP: client from /10.1.82.222:37656, TM A22, IMSI 230023100133488, SN null ...

• Noise values of field names – Values of properties set to unusual values such as %2$d. Parsing of these is omitted in the Logstash configuration file. An example is listed below (message text only).

%s information for carKey %2$d: IMSI=%1$s, serialNumber=%4$s, compKey=%3$d, driverKey=%5$d, eco=%6$s


• Malformed invalidated data – Examples of possibly invalid data are: phone=+420209457971. (redundant full stop), IMSI=23002 (too short).

• Suspicious messaging for changes in values – There are occurrences of messages where the new and old values listed are the same. Example of such a record:

Run: carKey=2802 has the same SIM card, oldIMSI=230020300780218, newIMSI=230020300780218

• Unclear messaging – Log record contents are often hard to understand and process (inconsistently formatted).

• Different messaging for the same events – An example of different messaging is the new connection record, which differs between connection types as well as between sample data sets.

• Unexpected message contents – There are unexpected log record contents such as unparsed packets, whole SQL queries and possibly unhandled Java exceptions.

Apart from checks for these misalignments, it is highly recommended to also investigate and fix the application code to avoid these issues – mainly the possibly unhandled Java exceptions and direct SQL code in the output. Adjusting the messaging to be consistent and machine-processable may also significantly increase the efficiency of log data analysis.

The sample data were provided in their original format in files containing records logged per day. Nevertheless, the input system also supports online log record collection. Both input options are supported by the Logstash input configuration and are further described in the following chapter.


7 Logstash configuration

Logstash is a log management tool for centralized logging, log enrichment and parsing. The overall purpose of Logstash is to collect unstructured data from input data streams, parse it according to a set of filter rules and possibly add some computed information. It then outputs the processed data for additional processing or storage.

All information for Logstash processing is set in a *.conf configuration file, so multiple configuration files can be created for distinct data inputs. The configuration of Logstash is divided into three sections:

• Input – setting input data streams;

• Filter – setting parsing filters for computing structured information from often unstructured input data;

• Output – setting output for data processed by filters.

There are also multiple possibilities for enriching log processing using Logstash plugins and Ruby code within the configuration file. The contents of the Logstash configuration file used for processing the sample data are described in the following sections.

7.1 Input

The input section of a Logstash configuration file contains the definition of input data streams. There are various input possibilities that can be used in Logstash, and they can be combined. [76]

For the sample server data there are two main possibilities for processing input log data streams – reading updates from a file and online socket listening. For the client logs, the whole file should be processed from the beginning.

7.1.1 File input

For the file input plugin, a path to the file needs to be defined. Log records are read from the file by tailing – from the last update to the file (similar to tail -0f) – but they can also be read from the beginning if this is set in the configuration file. By default, every log line is considered one event. The file input plugin keeps track of the current position in each file by recording it in a separate file named sincedb. This makes it possible to stop and restart Logstash and have it pick up where it left off without missing the lines that were added to the file while Logstash was stopped. The path of this file can also be set in the input section.


#Reading whole file from specified location
file {
  path => "C:/testdata/filename.log"
  start_position => "beginning"
  type => "log_server"
  sincedb_path => "C:/testdata/sincedb"
}

The following properties for file reading are set in this section:

• path – location of the file1 (it can include wildcard characters and it can also be a directory name);

• start_position – setting reading of the file from the beginning; by default it is read by tailing;

• type – setting the type of messages read by the specified input;

• sincedb_path – tracking the position in the watched file.

The same settings are also used for client log records, as they are read from a file at a specified location from the beginning; a sketch of such a client input block is shown below.
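A minimal sketch of what the corresponding client log input block could look like is given below; the path and type values are illustrative assumptions, not the exact ones used in the provided configuration:

#Reading client log files from a specified directory
file {
  path => "C:/testdata/clientlogs/*.txt"
  start_position => "beginning"
  type => "log_client"
  sincedb_path => "C:/testdata/sincedb_client"
}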

7.1.2 Multiline

To handle log messages that occupy more than one line, a multiline codec needs to be defined. In this case, it needs to be added for both server and client logs. The multiline codec is defined as below: [77]

#Including information from records on multiple lines
codec => multiline {
  pattern => "^%{TIMESTAMP_ISO8601}"
  negate => true
  what => previous
}

In this section, the following properties for the multiline codec are set:

• pattern – indicator that the field is part of a multi-line event;

• negate – can be true or false; if true, a message not matching the pattern will constitute a match of the multiline filter and the what will be applied;

1. For file locations in Windows OS, slashes in the path need to be changed to unix-style due to the backslash (\) being treated as an escape character – see more information on this at https://logstash.jira.com/browse/LOGSTASH-430


• what – previous or next indicates the relation to the multi-line event.

The above definition of a multiline codec therefore means that every line that does not start with a timestamp is considered part of a multi-line log record and should be added to the contents of the previous line. By default, a multiline tag is added to every record that was processed by the multiline codec. This tag can, however, be removed from the processed records if it is not necessary.
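If the tag is not needed, it could be dropped later in the filter section, for instance with a mutate filter similar to this sketch:

#Optional: drop the automatically added multiline tag
mutate {
  remove_tag => [ "multiline" ]
}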

7.1.3 Socket based input collection

The log4j input type reads events over a TCP socket from a Log4j SocketAppender. This input option is currently commented out in the Logstash configuration file but can be enabled for online socket listening. This input type is defined as follows:

#Read events over a TCP socket from a Log4j SocketAppender.
log4j {
  mode => server
  host => "0.0.0.0"
  port => [log4j_port]
  type => "log4j"
}

Alternatively, direct collection via UDP and TCP port listening can be used for online data collection, which is also commented out in the Logstash configuration file. [79] An example definition is listed below:

#Setting listeners for both TCP and UDP
tcp {
  port => 514
  type => "server_tcp"
}
udp {
  port => 514
  type => "server_udp"
}

The settings contain a listening port, host information and an optional data type setting according to the collection method used. Additional properties can be adjusted in case the log4j socket listener is used.2

2. These can be found in Logstash documentation here: https://www.elastic.co/guide/en/logstash/current/plugins-inputs-log4j.html


7.2 Filter

Parsing and data enrichment rules are defined in the filter section of the Logstash configuration file. They mostly consist of regular expressions that parse the input log lines into specific fields according to their contents. The grok filter plugin is essential for parsing unstructured input information; however, there are multiple other useful filter plugins that can be used in the Logstash filter section. [80]

7.2.1 Filter plugins used in configuration

Grok filter

The grok filter is the essential Logstash filter plugin used for parsing unstructured data into a structured and queryable format. It supports a lot of already defined default patterns; however, custom regular expression patterns can be defined as well. The grok filter definition consists of an existing field and a provided regular expression that matches it. This regular expression then parses the information of the input string into additional fields. In the configuration created for sample data processing, the grok filter is first used for parsing the overall message contents. Then the message text contents are parsed according to the type of message to acquire specific data from the unstructured text. Custom tags are added depending on the type of message contents to allow simplified searching. [80]

Two kinds of field definitions are used in the Logstash configuration: pre-defined pattern strings such as %{IP:ip_address} and custom regular expressions using the Oniguruma syntax such as (?<client_id>[0-9]{14,15}). In the first case, the parsed IP address in the pre-defined format is stored in the ip_address field. The custom regular expression match from the second example is stored as client_id.

The overall message contents are parsed first, while the message text contents (the message_text field) are parsed later using additional grok expressions. The grok filter used for parsing all messages is defined in the Logstash configuration as follows:

# Overall regex pattern for logged records
grok {
  match => { "message" => "%{TIMESTAMP_ISO8601:logdate}\s*\[(?<thread_name>[a-zA-Z]*)(\-(?<thread_id>[0-9]*))?\s*((?<session_id>[0-9a-z\-]*))?\s*\]\s*%{LOGLEVEL:message_type}((?<module_name>[a-zA-Z\.]*))?\-*%{GREEDYDATA:message_text}" }
}

Essential information about a log record is parsed from the default field message that contains the whole log line. The parsed fields are:

• logdate – Date and time of log record;


• thread_name and thread_id – Name and ID of the thread that generated the message;

• session_id – Identification of a specific client-server communication;

• message_type – Type of the logged record, i.e. the logging level name;

• module_name – Name of the module that generated the message;

• message_text – Unstructured message text.

Some of these fields, such as thread_id and session_id, are not present in all log records and are therefore marked as optional in the regular expression.

For processing message_text, the Logstash filter section is divided into sections according to the module name that generated the message. Specific information is then parsed from the message text using grok filters, and tags are added accordingly using the mutate filter. [80] An example is given below: it adds the Listening_ip tag if the record contains ConnectionManager in the parsed field module_name and Listening in the message field.

# ConnectionManager
if "ConnectionManager" in [module_name] { ...
  # IP & port for connections listening
  if "Listening" in [message] {
    grok {
      match => { "message_text" => "[a-zA-Z\s*]*:\s*\/%{IP:ip}:%{INT:port}" }
    }
    mutate { add_tag => "Listening_ip" }
  }
... }

Most of the log record message texts are parsed in a similar fashion, adding a tag and parsing the required information from the message text. Apart from these, some additional Logstash filter plugins are used to compute extra information. As noted before, the messages with the most important information for the analysis output are those logged while handling connections. To simplify processing of the additional metrics, computed fields are added using Logstash filter plugins.

Date filter

When the log records are parsed using the Logstash dynamic mapping, a timestamp field used for querying in Kibana is added according to the time of


log record processing. To use the logdate field as the timestamp instead, it needs to be set in the Logstash configuration using the date filter:

# Use date from record as timestamp in Kibana
date {
  match => ["logdate", "YYYY-MM-dd HH:mm:ss.SSS"]
  target => "@timestamp"
}

Ruby filter

The aim of the ruby filter plugin is to embed ruby code directly for the computation of additional field values. It references the already added or parsed fields of a log record using event['fieldname'] in the code. The whole ruby code section is enclosed in quotes and supports full ruby syntax including local variables and functions.

The ruby filter is used in the Logstash configuration file for computing the sum of transferred records in the Connection_finished log events. There are actually two instances of the connection records count in the Connection_finished event that are computed: the sum of records listed in records:0,1,0 and in connection: type=U, bytes=278, pos=0, srv=1, msg=0. These fields can then be queried and checked for high total counts of transferred records. Additionally, as the records listed in these two parts of the Connection_finished message should be the same, their difference is considered an anomaly. The comparison result of these two computed fields is therefore stored in the additional boolean field called records_mismatch. An example of the ruby code used for the sum of records computation:

ruby {
  code => "event['records_total'] = event['record_msg'] + event['record_pos'] + event['record_srv']
           event['conn_records_total'] = event['conn_msg'] + event['conn_pos'] + event['conn_srv']"
}

Aggregate filter

The aim of the aggregate filter is to combine information from several log records that belong together. This can be used to aggregate records of the same session or to track events of a specific client or car. The overall idea is to store some value present in the events of a task and then add the computed field to the last


event. The behavior of this filter can be adjusted easily because its computation is written as ruby code.

For example, duplicate record detection is implemented using the aggregate filter, executing a ruby procedure for all events of all clients (tasks are aggregated by the IMSI/client field). During its first run, it sets the init and records_same local variables to 0, but only if they were not initialized before. Once the initialization variable init is set to 0, the current record types and counts are saved in the local variables and init is increased, as initialization is complete. For every following log line containing the same client number, the procedure runs with init already increased, so the first section of the code is executed (it runs only if init is greater than 0). The records transferred in the current log event are compared to those saved in the local variables, and if all three of them have the same value, the count of duplicate events (saved in the records_same field) is increased. This procedure counts only duplicate transfer attempts that immediately follow one another, so once an event with different record values is processed, all local variables are re-initialized and the procedure then compares against the updated set of saved record values. This procedure is marked in the Logstash configuration file by the comment #Handling duplicate events - comparing records sent by client.
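A minimal sketch of how such an aggregate filter could look is shown below. The field and variable names follow the description above, but the exact thesis configuration may differ; the map used here is the aggregate plugin's per-task storage.

# Sketch: counting immediately repeated record transfers per client (IMSI)
aggregate {
  task_id => "%{IMSI}"
  code => "
    map['init'] ||= 0
    map['records_same'] ||= 0
    if map['init'] > 0
      if map['record_msg'] == event['record_msg'] &&
         map['record_pos'] == event['record_pos'] &&
         map['record_srv'] == event['record_srv']
        map['records_same'] += 1
      else
        map['records_same'] = 0
      end
    end
    # remember the records seen in this event for the next comparison
    map['record_msg'] = event['record_msg']
    map['record_pos'] = event['record_pos']
    map['record_srv'] = event['record_srv']
    map['init'] += 1
    event['records_same'] = map['records_same']
  "
}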

Elapsed filter

The elapsed filter is a useful Logstash plugin that tracks the time difference between two log records. Start and end records are chosen (identified by their tags) and a unique field is used to pair these two events of a specific session. This filter is used to compute the time between the Connection_new record and the Connection_finished record of a specific session:

# Duration from starting connection to end
elapsed {
  start_tag => "Connection_new"
  end_tag => "Connection_finished"
  unique_id_field => "session_id"
}

Some sessions in the sample data were missing the Connection_new tagged record. If the start tag is not found, the elapsed filter adds an elapsed_end_without_start tag and the elapsed time is not computed. For one type of session missing the starting tag, an aggregate code section was added to check whether a starting event had been seen.


This aggregate procedure creates the boolean value started and saves it as true for the Connection_new type of message. For a specific type of record (the one found first in sessions where the first message was often missing), the stored value of started is then checked. If the value is false, no beginning message has been received for the session yet and the Connection_new tag is added. If there are more types of messages where the starting event is sometimes missing, this piece of code can be added for them as well. A sketch of this check is shown below.
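The following sketch illustrates the idea; First_record_type is a hypothetical placeholder for the record type that often appears first in such sessions, and the exact thesis configuration may differ.

# Sketch: remember that a session start was seen
if "Connection_new" in [tags] {
  aggregate {
    task_id => "%{session_id}"
    code => "map['started'] = true"
  }
}
# Sketch: if the record type that often appears first arrives without a start event, add the start tag
if "First_record_type" in [tags] {
  aggregate {
    task_id => "%{session_id}"
    code => "
      unless map['started']
        event['tags'] ||= []
        event['tags'] << 'Connection_new'
      end
    "
  }
}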

Mutate filter

The mutate filter is a basic Logstash plugin used for making changes to document fields. It can be used for field addition and removal as well as for manipulating the tags added to a log event. Additionally, field contents can be updated using this filter – e.g. to replace a specific string in a field.
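For illustration, a few typical mutate operations are sketched below; the field, tag and replacement values are illustrative and not taken from the thesis configuration.

# Sketch: common mutate operations
mutate {
  add_field    => { "source_type" => "server" }
  remove_field => ["unused_field"]
  add_tag      => ["parsed"]
  remove_tag   => ["multiline"]
  # replace tab characters in the message text with spaces
  gsub         => ["message_text", "\t", " "]
}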

7.2.2 Additional computed fields

Some additional computed fields and tags were also defined in the filter section of the Logstash configuration. Where the computed information needed a value to be stored, a field was added; if no value was required, only a tag was added to the records of interest. The fields and tags added by the Logstash filters are listed below:

• files_total – Similarly to the total records computation, the sum of files transferred in a session is computed.

• time_difference and time_mismatch fields – Using the previously described elapsed filter and the Created field in the message, the overall time of the session can be checked. To compare these two values, they first need to be adjusted, as the value of the Created field is in milliseconds while the value of elapsed_time is in seconds. Afterwards, the difference between these two fields is computed and added to the field time_difference. If this difference is greater than 0.1 seconds (this can be adjusted accordingly), an additional boolean field time_mismatch is added.

• Empty_connection tag – For tracking empty connections, the tag Empty_connection is added to all connection log records where neither records nor files were transferred.

• Too_many_bytes tag – If an unusually high number of bytes is transferred while neither records nor files were transferred, the tag Too_many_bytes is added. The number of bytes for an empty connection is


usually below 200 bytes. If more bytes are transferred in an empty connection, the Too_many_bytes tag is added to the Connection_finished event.

• SQL_code and Exception_code tags – These tags are added if SQL code or a Java exception is present in the message. They were added as a result of the content anomalies detected in the original log files and can be removed if this behavior is expected. Additionally, custom tags can be added for any other type of event contents that should be tracked.

• Changed_same_value tag – There are multiple log lines where a property is being updated. In some of these updates the field is being updated to the same value as before, which is also considered an anomaly and is tracked by the tag Changed_same_value. These change messages with no actual change in their contents are checked for all from/to fields (a sketch of such a check is given at the end of this section).

• fieldname_check tag – The grok filters were set up to process all kinds of input strings, even though there should be stricter validation for fields with a specific input format. Once the misalignments in the data are taken care of, the regular expressions for these can also be used in the main parsing section. Currently, misalignments in some of the field formats are only flagged by adding a fieldname_check tag. These checks are set for the IMSI, CarKey and Session fields.

An example of the IMSI format check is listed below:

# Most IMSI values should be 14- to 15-digit -- check if not
if [IMSI] {
  grok {
    match => { "IMSI" => "[0-9]{14,15}" }
    tag_on_failure => ["IMSI_check"]
  }
}

The default tag added when the grok filter cannot parse the input string using the provided regular expression is _grokparsefailure. However, this tag can be customized, as shown in the example above.
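For the Changed_same_value tag mentioned above, the comparison of the from/to fields could look like the following sketch; the status_from and status_to field names are illustrative, while the actual configuration compares all parsed from/to field pairs.

# Sketch: flag update messages whose value did not actually change
if [status_from] and [status_to] {
  if [status_from] == [status_to] {
    mutate { add_tag => ["Changed_same_value"] }
  }
}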

7.2.3 Adjusting and adding fields

The connection types are parsed from the input data as first letters only (T/U/P). For better usability, these fields are updated to contain the full


connection type name (TCP/UDP/PUSH). This field update can be specified in the Logstash configuration using a mutate filter:

# Update connection type fields
if [conn_type] {
  if [conn_type] == "U" {
    mutate {
      update => { "conn_type" => "UDP" }
    }
  } ...

As noted in the previous section, there were some inconsistencies in the provided sample data logs. The first logs provided contained DEBUG messages with information about the client and car at the beginning of each session. As these two fields are useful to track for all messages of a specific connection session, they were added using the aggregate filter. This approach works only when the messages containing this information are present as the first log records of the session, because the Logstash code can only work with information from lines already read. The sample data provided later did not contain these DEBUG messages, so this trick cannot be used for all types of sample data; without it, all records of all communication of a client cannot be retrieved by querying Elasticsearch with a specific IMSI number.

The aggregate filter procedure that adds the IMSI and CarKey fields to all messages generated in a specific session is triggered by the Registry_getting_client and Registry_getting_car tagged messages. The section of the configuration that adds this information is marked with the comment #Adding IMSI & CarKey (if Debug messages are enabled) and works as listed below:

• For all Registry_getting_client tagged messages – The aggregate filter saves the information about the client number for the specific session;

• For all Registry_getting_car tagged messages – The aggregate filter saves the information about the car number for the specific session;

• For all messages that contain the session_id field – The aggregate filter adds the saved client and car number fields to the message;

• For Confirmation_failed or Confirmation_success events – Messages tagged with Confirmation are usually the last messages logged in a session, so for these messages the aggregate filter finishes the task and removes the saved information.


7.2.4 Other Logstash filters

There are many more possibilities for Logstash filters apart from those used in the sample data configuration. Two filters with interesting functionality are the elasticsearch filter and the metrics filter plugin.

Elasticsearch filter

The elasticsearch filter enables sending queries to the Elasticsearch instance and processing the query results. Although it is mostly useful when Elasticsearch is used as the Logstash output, it can retrieve information from other Elasticsearch indexes. The elasticsearch filter could also have been used instead of some of the aggregate code sections in the Logstash configuration.
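As an illustration, the sketch below looks up the matching Connection_new event for the current session and copies its timestamp into the current event. The query and copied field names are assumptions for the example, and the option syntax follows the plugin documentation of that era, which may differ between versions.

# Sketch: enrich an event with data from a previously indexed document
elasticsearch {
  hosts  => ["localhost"]
  query  => "tags:Connection_new AND session_id:%{session_id}"
  fields => { "@timestamp" => "started_at" }
}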

Metrics filter

The metrics filter aims at storing the frequency of specified types of messages over time. It creates and refreshes counts of occurrences of specified fields in 1, 5 and 15 minute sliding windows. This type of filter can also be used as a monitoring tool, generating alerts in case of high numbers of certain record occurrences. The downside of this filter is that it creates its own event instances with the processing timestamp, generating a lot of new fields for all types of rates. Processing these in Kibana is therefore problematic, and due to the short rate windows (the longest is 15 minutes), this functionality was replaced in the implementation by ElastAlert monitoring rules.
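A minimal sketch of such a metrics filter is shown below; the meter name is illustrative, and the exact rate field names depend on the plugin version.

# Sketch: meter the rate of ERROR messages
if [message_type] == "ERROR" {
  metrics {
    meter   => "error_events"
    add_tag => "metric"
  }
}
# The generated metric events then carry counters and rates for the meter,
# e.g. one-, five- and fifteen-minute rates, and can be routed to a separate output.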

7.3 Output

The output section of the Logstash configuration file specifies where the parsed log records are sent. Since the full ELK stack is used for processing the sample data, the output is set to Elasticsearch. There is also a file output defined for parsing failures and an email output for alerting.

7.3.1 Elasticsearch output

The Elasticsearch output defines that the processed data should be indexed to the Elasticsearch instance running on a specified host.


This is defined in the Logstash configuration as follows:

# Indexing output data to elasticsearch
elasticsearch {
  action => "index"
  hosts => ["localhost"]
  workers => 2
}

By default, dynamic mapping is used for the created Logstash index and the data is sent to the Elasticsearch instance running on the specified hosts. The default mapping creates one document per event, i.e. per parsed log line, with fields and tags added as defined in the Logstash filter section. Every document is saved as a JSON object with the defined fields as properties under the logstash-YYYY.MM.DD index.
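If a differently named index is preferred, it can be set explicitly in the output; the index pattern in the sketch below is illustrative.

# Sketch: the same output with an explicitly named index
elasticsearch {
  action => "index"
  hosts => ["localhost"]
  index => "logstash-server-%{+YYYY.MM.dd}"
  workers => 2
}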

7.3.2 File output

Apart from the basic output to Elasticsearch, an output to a file in the original message format is added to the configuration for handling parsing failures that require modifications in the Logstash filter section. This section is specified as follows:

# Setting folder of messages failed in parsing
if "_grokparsefailure" in [tags] {
  file {
    message_format => "%{[message]}"
    path => "C:/grokparsefailure-%{+YYYY.MM.dd}.log"
  }
}

According to this setting, all messages that contain the _grokparsefailure tag are written to a separate output file. They are listed in their original format, so they can be used as an input file without any changes to the configured filters and processed again once the problematic filter section is adjusted.

7.3.3 Email output

An additional possibility of the Logstash output settings is the triggering of alert emails3

when specified conditions are met. Logstash alerts are suitable for basic issues that can be checked right after an event is processed. There are also additional

3. Functionality of the email alerts was tested using the smtp4dev testing and debugging tool, which collects the emails sent to the specified port of localhost without actually delivering them. It is downloadable from: http://smtp4dev.codeplex.com/


possibilities for adding counts of events, setting complex aggregate filters or adding computed fields using ruby code for frequency checks. However, if an Elasticsearch instance is used for storing the data parsed by Logstash, querying Elasticsearch for frequency-based events using the ElastAlert framework is easier to implement. Three alerts were added to the Logstash output configuration:

• if a message of type ERROR occurs;

• if no records but more than 200 bytes are transferred in a connection;

• if a client with an IP address from outside Europe is connecting.

These are set as below:

if "ERROR" in [message_type] {
  email {
    from => "logstash_alert@company.local"
    subject => "logstash alert"
    to => "email@example.com"
    via => "smtp"
    port => 35555
    body => "ERROR message. Here is the event line that occurred: %{message}"
  }
}

In this section, the following properties of the email output are set: [78]

• from – the sender address of the generated message;

• subject – the subject of the generated message;

• to – the email addresses the generated message is sent to;

• via – the transport used for sending – smtp in this case;

• port – the port used for sending the email – the default is 25;

• address – the address used to connect to the mail server – the default is localhost;

• body – the body of the generated message.


7.4 Running Logstash

The Logstash configuration is run from the /bin folder using the command logstash -f *.conf.

To edit or add new filters, the provided configuration needs to be adjusted. Existing filters can be edited by finding the corresponding filter in the configuration file, searching either for the added tag or for the module name that generated the message. The Grok Debugger4 is a helpful tool for verifying grok parsing patterns.

4. Can be accessed from here: http://grokdebug.herokuapp.com/
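Before restarting Logstash with an edited configuration, its syntax can be checked first; the sketch below assumes the command-line flags of the Logstash version used at the time of writing.

# Sketch: verify the configuration syntax without processing any data
logstash -f logstash.conf --configtest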


8 Elasticsearch

Elasticsearch is a real-time distributed search and analytics tool. It is built on top of the Apache Lucene full-text search engine and offers quick querying capabilities. Documents are stored in JSON format and all fields are natively indexed for search. When Logstash is used as the input to Elasticsearch, dynamic mapping is used, creating one JSON document per parsed log record. Elasticsearch runs by default on port 9200 of localhost and is accessible through a REST API. [81]

8.1 Query syntax

A rich set of possibilities exists for querying Elasticsearch, e.g. using the Lucene query syntax. Complex queries including aggregations can be built and processed quickly thanks to the flat and easily searchable structure. Some of the querying possibilities using the Elasticsearch Query DSL are: [81]

• Full text queries – match or multimatch queries;

• Term level queries – including missing/exists and range queries;

• Compound queries – including boolean logic in queries, filtering and limiting results;

• Joining queries – used for nested fields and queries (the mapping needs to be adjusted);

• Specialized queries – comparing and scripted queries.

An example of an Elasticsearch query using the REST API:

GET localhost:9200/logstash-*/log_server/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "IMSI": "230025100345197" }},
        { "match": { "message_type": "INFO" }}
      ],
      "filter": [
        { "term": { "tags": "Connection_finished" }},
        { "range": { "Created": { "gte": "300" }}}
      ]
    }
  }
}


The query searches in the index and document type specified in the GET instruction – /index_name/type_name/_search. The bool section of the query uses the keyword must as the AND operator. As a result, all documents where both conditions are met (in this case, where the IMSI and message_type fields correspond to the queried values) are returned. On top of the query results, a filter is applied so that only records with the specified tags and Created field values are returned.1

The index name is specified when storing data in Elasticsearch and basically works as a package containing a specific type of data. Wildcard characters can also be used; in this case logstash-* would search all logstash-%DATE indices. Different types of data are usually processed and saved using different indices. One index may contain multiple types of documents (e.g. log_server) that are also defined on data input. For example, bookstore contents can be stored in an Elasticsearch index called bookstore and contain multiple types of documents, such as book and customer. If no index is specified in the Logstash output configuration, the logstash-%DATE index is used (the date is set according to the log message timestamp).

Kibana, as a visualization tool for Elasticsearch, has the same expressive capabilities as querying the Elasticsearch instance directly. For more complex aggregations and compound filters and queries, visualizations on top of searches are used. These queries can also be shown in the raw format in which they are sent to Elasticsearch. Filters in Kibana can also be edited by updating the query source directly.

8.2 Mapping

Mapping in Elasticsearch defines how the indexed documents and their fields are stored. It defines field types such as full-text search strings, numbers and geo-location strings, date formats and other custom rules for the stored contents:

• String fields – are analyzed by default (enabling search within parts of the string), while a not-analyzed version of the field is also stored as field.raw;

• Numeric fields – if the type is set as a number (integer/float), it is also stored as a number in Elasticsearch and numeric operations such as SUM or AVG can be applied;

• IP and geolocation – using the geo-location plugin, fields marked as IP are processed for geo-location information detection;

1. Differences between query and filter can be found here: https://www.elastic.co/guide/en/elasticsearch/guide/current/_queries_and_filters.html


• timestamp – a timestamp is generated for all processed documents using the current time if no field is set;

• _source – this field contains the original JSON document body;

• Every event/log line is processed as a separate document – if any nested properties should be set, these need to be adjusted in the configuration and also updated in the Elasticsearch mapping.

Since log lines are processed as separate documents by default and there were no special needs for the Elasticsearch mapping, this setting was not edited for the sample data processing.
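For illustration, if the mapping were adjusted manually, a string field with an explicitly not-analyzed raw sub-field could be defined as in the sketch below (Elasticsearch 1.x/2.x multi-field syntax; the index, type and field names are illustrative).

PUT localhost:9200/logstash-custom
{
  "mappings": {
    "log_server": {
      "properties": {
        "module_name": {
          "type": "string",
          "fields": {
            "raw": { "type": "string", "index": "not_analyzed" }
          }
        }
      }
    }
  }
}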

One thing to consider for Elasticsearch usage is that, along with better speed, performance and scalability, its structure is quite different from a relational database. For relational databases, relations between tables and their contents are crucial. Elasticsearch, on the other hand, has much less support for modeling data relationships, and it is often complicated to retrieve complex dependent information. In the flat world of JSON documents the scaling options are horizontal – instead of joining, the data scheme needs to be adjusted by tweaking the mapping definition.

8.3 Accessing Elasticsearch

Elasticsearch can be easily accessed using curl requests against the REST API. An example of a curl request searching for records with a specified client number is listed below:

curl -XPOST 'localhost:9200/logstash-*/log_client/_search' -d '{
  "query": {
    "term": { "IMSI": "230024100717559" }
  }
}'

A nice, easy-to-use option for accessing the Elasticsearch instance directly from a browser is the Chrome Sense plugin.2 It is a JSON-aware developer console for Elasticsearch that enables sending requests to the Elasticsearch instance directly through the browser and reviewing the results.

The assumption for the overall ELK stack implementation is that querying and working with Elasticsearch directly is minimized, since querying and visualizations are handled by Kibana.

2. Downloadable from the Chrome web store; the github project for this add-on is at https://github.com/cheics/sense


9 Kibana configuration

Kibana is a browser-based analytics and dashboarding tool for Elasticsearch. The GUI provided by Kibana is easy to use and enables searching and creating visualizations on top of the data stored in Elasticsearch indexes. It is best suited for working with time-based data; however, it can work with all kinds of Elasticsearch mappings.

The assumption for the implementation part is that Kibana would be used regularly for the visualizations and for detection of the anomalous properties added during Logstash parsing. This section describes the visualizations in the three dashboards created as part of this thesis and also provides a short user guide for output processing in Kibana. The time-based visualizations are displayed according to the time constraint set in Kibana (defaulting to the last 15 minutes). Additional information about working with Kibana and adjusting the display is given in Appendix 2: User Guide.

Note that due to the dynamic mapping set in Elasticsearch, the string fields are analyzed by default. The string fields containing the original contents can be used in the form fieldname.raw.

9.1 General dashboard

The General dashboard contains overall statistics and visualizations of the processed log data. The visualizations are mostly time-based and their purpose is to provide general information about system performance and the specifics of the processed connections. The following visualizations were created and added to the General dashboard:

• General_all_type visualization gives an overview of all messages logged in the specified time with added information about the type of generated message (e.g. INFO/DEBUG). This visualization can be helpful for monitoring the overall count of log messages for possible spikes or drops. Additionally, it provides information about the types of messages that are logged most often, which may be useful in case of unusual changes in the most used message type, possibly indicating a problem with a logging module or discrepancies in the data.

• General_CarKey_TM visualization gives an overview of the unique CarKeys processed in the specified time with added information about the client machine type (TM) used. This visualization is for monitoring unusual changes in the count of specific car connections, which might


indicate connection or server issues. It also includes information about the client device software used, for two reasons:

– General overview of what software is mostly used by active clients.

– Detection of a possible relation between the software type and changes in car connections (e.g. a sudden increase of messages from clients using the same device software).

• General_IMSI_country visualization gives an overview of the count of unique client numbers (IMSI) processed in the specified time with added information about the country they were connecting from (using geodetection capabilities). The purpose of this graph is to monitor the total count of distinct connecting clients and to list the countries most clients connect from. An example of this visualization is shown in Figure 9.1 below.

The overall count of clients generating messages hourly during one day is visible in the screenshot. The number of logging clients increased noticeably around 2 and 3 pm, which might be worth investigating. Apart from that, most clients apparently come from the Czech Republic and Slovakia.

Figure 9.1: General_IMSI_country visualization screenshot


• General_sessions_type visualization gives an overview of the count of unique sessions processed in the specified time with added information about the connection type (UDP/TCP/PUSH). This graph can be used for detecting connection overload (in case of high numbers of sessions) as well as for an overview of the most used connection types.

• General_failed_reason visualization gives an overview of all failed connections that occurred in the specified time with added information about the failure reason. The purpose of this visualization is to provide general overview information about the failed connections and their causes for monitoring erroneous communication. An example of this visualization is shown in Figure 9.2 below.

There are multiple peaks in the failed connections count that are presumably worth investigating. These are mostly caused by timeout and concurrent connection errors. Apart from these, there are also multiple unregistered client connection attempts within a short timeframe, which might also indicate an attack attempt.

Figure 9.2: General_failed_reason visualization screenshot

• General_map visualization is based on the geolocation information gained from the connecting IP addresses, shown on a world map. As most of the clients connect from Europe, the assumption is that all


connections from other parts of the world (e.g. from Asia or Africa) are considered an anomaly.

• General_carkey_type is a pie chart visualization of the top 10 CarKeys occurring in the logged messages, including information about their connection types (UDP/TCP/PUSH).

• General_module_type is a pie chart visualization of the top 5 modules that generated the most messages, including information about the logged message types. An example of this visualization is shown in Figure 9.3 below.

According to the visualization, the largest part of the messages was logged by WorkerUDP (i.e. generated in UDP connections). This is also helpful for better insight into the distribution of the incoming connection types. Additionally, there is information about the types of messages generated by the modules, among which the INFO type takes the lead. However, the blue section marking the WARN type of message might be worth looking into, as warning messages often contain valuable information about processing issues.

Figure 9.3: General_module_type visualization screenshot


• General_client_TM is a pie chart visualization of the top 10 clients that generated the most messages. It also includes information about the clients' machine type (TM).

• General_records_sum is a visualization of trends and changes in the sum of transferred records over time. Separate lines compare the sums of processed service, message and position record types for better insight into what types of records are processed the most. An example of this visualization is shown in Figure 9.4.

There are visible peaks in the counts of transferred records according to their type. Apart from that, it is also visible that service type records are transferred much more frequently than the other two record types.

Figure 9.4: General_records_sum visualization screenshot

• General_files_sum is a visualization of changes and trends in the sum of transferred files. Separate lines compare the sums of forwarded and received file types, which might be helpful for learning what types of files are processed the most and how frequently.

• General_max_same is a visualization of changes and trends in the maximum of the same transferred records counter (the records_same field).


This visualization can be helpful in detecting unusually high counts of same-record transfer attempts. Only the type of the transferred record can be learnt from the log records, so low numbers might refer to transfers of different records of the same type. However, once the records_same value is significantly high, it is worth looking into.

• General_bytes_max is a visualization of the maximum numbers of transferred bytes in a specified time, giving better insight into peaks (if any) of bytes transferred in connections. The graphic also includes information about the connection type (TCP/UDP/PUSH) for monitoring which connections generate the maximum numbers of bytes. An example of this visualization is shown in Figure 9.5.

There are some visible peaks in the maximum of transferred bytes in the screenshot. An additional insight into the application behavior is that even though most messages are logged by UDP connections (as shown in the previous visualization General_module_type), TCP connections are responsible for the highest byte transfers.

Figure 9.5: General_bytes_max visualization screenshot

• General_bytes_sum is a visualization of changes in the sum of transferred bytes, including information about the module that generated the message. The purpose of this graphic is to gain knowledge


about which modules are responsible for peaks in the sum of transferred bytes, as well as an overview of the bytes transferred in a specific timeframe. This information can be helpful when monitoring sudden changes in transferred bytes or unusually high/low numbers.

• The General_tags_overview table lists the counts of messages for all tags, message types and module names. This table is meant as an overview of tagged log messages for quick detection of sudden and suspicious changes in the counts of a specific type of message.

9.2 Anomaly dashboard

The Anomaly dashboard contains visualizations of detected anomalies in the processed log records. They are mostly based on the computed fields added to documents as part of the Logstash parsing and processing. Most of the computed fields are listed and described in the Logstash configuration chapter. The following visualizations were created as part of the Anomaly dashboard:

• Anomaly_empty visualization shows trends in the count of empty connections over time. Its purpose is the possible tracking of recurring patterns in empty connection counts. An example of this visualization is shown in Figure 9.6.

Figure 9.6: Anomaly_empty visualization screenshot


• Anomaly_max_time visualization compares the maximum values of the Created field parsed from the finished connection event message (containing information about the connection duration) and the elapsed_time field computed as the time between the new connection message and the finished connection event. Differences between these two fields can also be caused by start events missing for the connection, resulting in the elapsed filter not functioning properly. The overall goal should, however, be to align these two metrics.

• Anomaly_time_created is a visualization of the sum of all Created fields in a specific time, for monitoring eventual increased processing delays. It also includes information about the clients. An example of this visualization is shown in Figure 9.7.

Figure 9.7: Anomaly_time_created visualization screenshot

• Anomaly_max_bytes visualization compares trends in the changes of the maximum values of bytes transferred in connections and helps detect possible spikes. There might be a specific time of day that is usually overloaded with a high number of byte transfers, and if so, the system can be adjusted accordingly.

• Anomaly_duplicated_records visualization is created for monitoring the sum of duplicated transfers (the records_same field), including information about the client numbers. The records_same value might not


always mark a truly duplicate record transfer, as the explicit record contents cannot be learnt from the log messages. However, this visualization helps with detecting the clients with the highest number of same-record transfer attempts; these clients can then be filtered on.

• Anomaly_exception & Anomaly_SQL visualizations are tracking themessages that contain suspicious contents on output, such as Javaexceptions and SQL code. Example of the SQL code anomaly detectionvisualization is shown in Figure 9.8. Tracking of Java exceptions isdisplayed using the same type of visualization.

Figure 9.8: Anomaly_SQL visualization screenshot

• Anomaly_bytes_client is a visualization of the empty connections that transfer more bytes than usual – messages tagged with the Too_many_bytes tag. The expected number of bytes for an empty connection (considering protocol management needs) is below 200 bytes. This visualization also includes information about the clients so they can be easily filtered for further investigation.

• Anomaly_time_mismatch visualization tracks records where the Created time value differs from the computed elapsed_time. This difference is stored in the computed field time_difference, and the record is flagged only if the difference is greater than the chosen threshold (currently set to 0.1


seconds). This threshold can be adjusted if greater differences are expected.

• Anomaly_empty_clients is a visualization of all the empty connections in a specified time with added information about the clients. The overall purpose of this visualization is to investigate clients that often connect to the server without any records or files being transferred. The clients occurring most often can then be filtered using the table display of the visualization (accessible by clicking the arrow icon at the bottom of the graph). An example of this visualization is shown in Figure 9.9.

Figure 9.9: Anomaly_empty_clients visualization screenshot

• Anomaly_records_mismatch is a visualization of the Connection_finished messages where the sums of record counts differ within the message contents (a comparison of the records_total computed field and the conn_records_total field). This anomaly has already been discussed in the server log data section 6.1.

• Anomaly_change_same is a visualization of all instances of messages containing information about a change where both values (before and after) were the same. This graph is based on the computed tag Changed_same_value that is added as part of the Logstash parsing


filters by comparing the previous (_from field) and new (_to field) values in all messages containing field value changes.

• Anomaly_field_check visualization was added for monitoring misalignments in the field formats (e.g. client/IMSI numbers that are not the usual 14 or 15-digit numbers). These formats are checked as part of the Logstash parsing filters (a possible future step is to define these fields strictly in the overall parsing filters).

9.3 Client dashboard

Only a few general visualizations were created for the client log records, as these are expected to be used only on request. They are mostly targeted at status distribution graphics and an overall tags overview. Additional visualizations can be created and added to this dashboard for tracking other elements parsed from the client log files. The Client dashboard visualizations are listed below:

• Client_runtime_exception visualization monitors the messages containing a Java exception in their contents. An example of this visualization is shown in Figure 9.10.

Figure 9.10: Client_runtime_exception visualization screenshot

• Client_all_service visualization is created for listing all the log records with the name of the service (or module) that generated them.


• Client_status is a visualization of the distribution of global statuses in the log messages overview. These statuses can take three distinct values – GREEN/RED/YELLOW.

• Client_tags_overview is a table with the counts of all message types listed based on their tags.

• Client_dataStatus visualization is used for an overview of data statuses in the client message contents.

9.4 Encountered issues and summary

Multiple functionality downsides of Kibana 4 were found during the implementation. Active issues of Kibana 4 can also be found on the github project page, where developers react to the flagged issues and provide information on possible feature inclusion in future releases1. Some of the encountered limitations of Kibana 4 are:

• The time field cannot be hidden on the Discover tab, so the listing of raw messages is not very neat;

• Results on the Discover tab are wrapped to fit the page instead of being listed one record per line (with possible horizontal scrolling out of the page);

• Default Discover tab contents cannot be pre-configured and default filtering cannot be set;

• Results on the Discover tab are limited by a maximum number of results, instead of allowing all results to be listed on more pages;

• Nested aggregations are not supported, so it is not possible to e.g. compute the sum of the number of records;

• It is hard to locate points when using the geodetection map visualization;

• Visualizations on the Dashboard tab do not always align correctly and do not display the same after a re-load.

All dashboards, visualizations and searches discussed in this chapter are included in the electronic version. They can be easily imported into a running Kibana instance using the Import option on the Settings -> Objects tab of the Kibana GUI. See the location of the Import button in the screenshot in Figure 9.11.

1. Kibana issues tracked on the github page can be reviewed here: https://github.com/elastic/kibana/issues


Figure 9.11: Kibana Objects import

A short User Guide for working with searches and visualizations in Kibana is included in the Appendices.


10 ElastAlert

As Kibana, due to its browser-based nature, is not suitable for real-time alerting, the ElastAlert framework is used for frequency monitoring and email alert generation. ElastAlert is a simple framework for alerting on anomalies, spikes, or other patterns of interest in data stored in Elasticsearch. The overall functionality of ElastAlert lies in a default configuration of the host and port of the Elasticsearch instance and a set of rules created for anomaly pattern detection. The rules are located in a specified folder and contain an Elasticsearch query that is triggered at a set interval.

10.1 Types of alert rules

There are multiple types of rules that are supported by ElastAlert: [83]

• Any – every hit that the query returns will generate an alert;

• Blacklist – the rule checks whether a specified field is on a blacklist and matches if it is present;

• Whitelist – the rule checks whether a specified field is on a whitelist and matches if it is not present;

• Change – monitors a certain field value and alerts if it changes;

• Frequency – triggered when a certain number of specified events happen in a given timeframe;

• Spike – matches when the volume of events during a given time period is spike_height times larger or smaller than the volume during the previous time period;

• Flatline – matches when the total number of events is under a given threshold for a time period;

• New Term – matches when a new value appears in a field that was not seen before;

• Cardinality – matches when the total number of unique specified events/values in a given timeframe is higher or lower than a preset threshold.

Various other alerting outputs are listed in the ElastAlert documentation; for the created rules, however, only alerting by email was used.
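For illustration, a frequency rule with an email alert could look like the sketch below. It corresponds to the rule_failed_frequency rule described in the next section; the filter tag, threshold and email address are assumptions for the example and the exact rule files shipped with the thesis may differ.

# Sketch: ElastAlert frequency rule – more than 15 failed connections in 15 minutes
name: rule_failed_frequency
type: frequency
index: logstash-*
num_events: 15
timeframe:
  minutes: 15
filter:
- term:
    tags: "Connection_failed"
alert:
- "email"
email:
- "email@example.com"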


ElastAlert is written in Python and needs the required Python libraries installed and set up in order to run.1

10.2 Created alert rules

A set of default ElastAlert rules was created for the sample data monitoring. The thresholds and frequency settings are based on the sample data output but can be adjusted accordingly. The following alert rules are provided:

• rule_connections_spike – sends an alert once there is a 3-fold difference in the count of Connection_finished events in comparison to the previous time window (1 hour);

• rule_empty_spike – sends an alert once there is a 3-fold difference in the count of Empty_connection events in comparison to the previous time window (1 hour);

• rule_exception_spike – sends an alert once there is a 2-fold difference in the count of Exception_code events in comparison to the previous window (1 hour);

• rule_finished_frequency – sends an alert once there are more than 3000 connections in the last 15 minutes;

• rule_failed_frequency – sends an alert once there are more than 15 failed connections in the last 15 minutes;

• rule_empty_frequency – sends an alert once there are more than 400 empty connections in the last 15 minutes;

• rule_duplicate_transfer_frequency – sends an alert once there are more than 700 connections attempting to transfer the same records as before in the last 15 minutes (across all clients);

• rule_sessionid_cardinality – sends an alert if there are more than 3000 unique sessions in the last 15 minutes, for overload monitoring;

• rule_IMSI_cardinality – sends an alert if there are more than 1000 distinct clients connecting in the last hour;

• rule_CarKey_cardinality – sends an alert if there are more than 1000 distinct car connections in the last hour;

1. All requirements for running ElastAlert are listed in its documentation at http://elastalert.readthedocs.org/en/latest/running_elastalert.html.


• rule_duplicate_IMSIcardinality – sends an alert if there are more than 550 distinct clients2 attempting to transfer the same records as before in the last 15 minutes.

These rules were created for monitoring overall changes in the numbers of connections and clients, providing early alerting before actual damage can occur. Spikes in both connection events and empty connections are monitored for potential system issues. Unusually high numbers of transfers within a timeframe are flagged as well, to uncover possible server processing issues. The cardinality rules tracking the unique count of clients are set for revealing device or communication issues with a specific client. Similarly, both the overall and the client-specific duplicate transfer attempts are monitored for potential issues with the data transfer and client-server communication.

All these rules are commented and can be adjusted, and similar additional rules can be created using the ElastAlert framework [83].

2. This number might appear too high in comparison to the overall number of connecting clients. As the logs contain only the type of a record, transfers flagged as the same will also include regular repeated transfers of different records of the same type. The purpose of monitoring these is to flag unusually high numbers of same-record transfers for possible system issue detection.


11 Conclusion

Every source of information about the overall behavior and patterns of a web-based application is important for gaining knowledge and improving the service. Logs, as a valuable source of information, are often underestimated. However, their processing and analysis may significantly improve troubleshooting efforts and uncover issues not visible in everyday use. Anomaly detection and monitoring of the communication flow can reveal important information about the processing flow and help catch issues before they cause actual damage.

Multiple log analysis systems were compared and categorized. The categorization was based on the information available about their functionality, in an attempt to get an overview of possible solutions varying by requirements.

Real-life data from a car tracking service were used to propose an open-source solution for log record processing and analytics, based on the ELK stack. The proposed solution was implemented, and the sample data was processed and analyzed using it. The following results were accomplished:

• Sample data contents were successfully processed – Logstash configuration files were created for parsing the information of interest from the original, mostly unstructured data;

• Various Kibana visualizations were created and exported for overall statistics and monitoring and for anomalous behavior detection;

• Logstash email alerts were set up for alerting on event-dependent issues and errors at processing time;

• ElastAlert rules were created for real-time alerting capabilities based on sudden changes (spikes) and event frequency monitoring.

Various issues were encountered during the writing and implementation of this thesis:

• Non-standard categorization of log analysis – Overall, log analysis varies in requirements and execution. During the investigation, multiple distinct sources of information were found that differ greatly in their understanding of the concept, meaning and goal of log analysis.

• Sample data contents issues – Multiple discrepancies and inconsistencies were present in the sample data set. Issues such as duplicate records, different messaging for the same actions and inconsistent property names and values were found. There were also differences between records logged on different days (e.g. logs for one day contained


additional types of messages that were not present in logs from a different day). These issues made the implementation, testing and maintenance of the Logstash parsing and filter sections much more complicated and time-consuming.

• Limitations of the chosen implementation – One of the preferred pieces of functionality for the sample data log analysis is listing all the messages contained in the sessions of a specific client. However, due to the flat structure of Elasticsearch, JOIN functionality is not natively supported. There is also a number of issues tracked for Kibana 4, and some functionality that was supported in Kibana 3 is not yet available in the current version.

11.1 Future work

There are multiple possibilities for future improvements of the solution provided in this thesis.

11.1.1 Nested queries

One possibility to improve functionality would be to enable the usage of nested queries. These would allow listing all messages received in the sessions of a specific client. This would normally require a JOIN operation, which is not supported by Elasticsearch. The required result can be accomplished in multiple ways:

• Add information about the client to the following session messages – this information should ideally be present in the first event of a specific session and would then be added using the aggregate filter in Logstash (implemented and working if DEBUG messages are enabled);

• Adjust the Logstash configuration and Elasticsearch mapping to use nested properties – in this case the overall data model would need to be adjusted so that the messages are stored as a list belonging to a specific session (a parent-child relationship could then be defined and the messages of a specific session could be searched using nested queries);

• Create a custom application which communicates with Elasticsearch and uses the saved result from a first query (get all sessions of a specific client) in a second query (get all messages for the sessions from the first query result).


11.1.2 Alignment of client/server logs

Client and server logs contain a number of misalignments. To enable better comparison and troubleshooting capabilities, the improvement ideas below should be considered:

• Standardize the time format – currently it is YYYY-MM-dd HH:mm:ss.SSS for the server and YYYY-MM-dd HH:mm:ss for the client;

• Include session information also in the client log file for better matching;

• Correct the possible time delay between the server and client logs so they are shown at corresponding times.


12 Appendix 1: Electronic version

Electronic version of this thesis includes:

• Logstash configuration files for both server and client logs;

• Searches/visualizations/dashboards exported from Kibana;

• ElastAlert rules and configuration file;

• Example of sample data;

• Text and images of thesis document in TeX.


13 Appendix 2: User Guide

As the Kibana GUI is assumed to be the most used tool for investigating the logged data, a short user guide is included to get started. Kibana is accessible at http://localhost:5601/.

When Kibana is accessed for the first time, the index pattern to be used for data querying needs to be set. For the sample data processed by Logstash, time-based events should be checked, the index pattern logstash-* entered and @timestamp chosen as the time-field name. See the screenshot in Figure 13.1 for reference. This step can be done only when some indexes already exist in Elasticsearch, i.e. some log files have been processed by Logstash.

Figure 13.1: Setting index in Kibana GUI

Once the pattern is set, one can proceed with investigating the data by selecting the Discover tab; this is detailed in the next section. The default time range when opening Kibana is the last 15 minutes. In case there are no records logged in this timeframe, a page stating that no results were found is displayed. The date for which all records are shown (including visualizations on dashboards) can be adjusted in the time-picker widget in the top right corner of the Kibana GUI. The time picker placement can be seen in Figure 13.2.


Figure 13.2: Changing date and time in Kibana GUI

The chosen date can also be adjusted by selecting individual parts of the time-based visualization graphs. Choosing a section or columns of the visualization creates a time filter for the selected area.

13.1 Discover tab

The Discover tab of the Kibana GUI is designed for basic searching and querying of the data present in Elasticsearch. The default GUI sections present on the Discover tab are described below, based on the screenshot with highlighted sections in Figure 13.3.

Figure 13.3: Kibana Discover tab GUI

• The search window situated at the top of the page is used for entering a query (marked yellow in the screenshot) – input queries are based on the Lucene search syntax;

• Searches can be saved, loaded and exported (marked orange in the screenshot) – saved searches can be used in visualizations;


• Below the search toolbar, on the right side, there is the number of hits returned by the query (marked pink in the screenshot);

• Directly below the search bar there is a small visualization of the overall record count logged in the timeframe selected in the time picker widget (marked brown in the screenshot) – by selecting a part of this section, the timeframe gets updated to show only the records in the timeframe chosen on the visualization;

• Below the visualization there is a list of records returned by the query (marked red in the screenshot) – records are wrapped to show the first few lines by default and the full contents of a record can be shown by clicking the arrow on the left;

• On the left side of the Discover tab there is a list of fields present in the records returned by the query (marked grey in the screenshot) – this field section is updated automatically according to the contents returned by the query and can be hidden by clicking the arrow on the right side of the panel;

• The fields from the left panel can be added as columns to the results list (marked purple in the screenshot), replacing the default _source field – e.g. message can be added to review only the message text of the listed records;

• The fields listed on the left panel can be added to and removed from the main results table as needed, moved left/right and sorted – e.g. for reviewing client and server logs it is suggested to add the type column, which distinguishes between log_server and log_client records.

All field values listed in the log record detail can be used as filters directly (see the bottom part of Figure 13.4 for reference). These filters can also be edited manually on the source level – the updated query is sent to Elasticsearch directly.


Figure 13.4: Kibana Discover tab filter

Some of the basic queries used in the Kibana Discover tab are listed below; the sketch after this list shows how the same expressions can be sent to Elasticsearch directly:

• Searching for records that are missing a specific field is done by querying for _missing_:fieldname.

• Searching for records in which a specific field is present is done by querying for _exists_:fieldname.

• A field with a specific value can be searched by using the query fieldname:value – in the case of a string, "value" needs to be enclosed in quotes for an exact match (otherwise all parts of the string will be searched and matched).

• For numeric fields, basic comparisons can also be done, e.g. records_same:>10.


• A query may contain the boolean operators AND, NOT, OR and brackets for more complex queries such as tags:Connection_finished AND NOT (TM:A29 OR TM:A23).

• Scripted fields can be added to records, but they cannot be searched in the Discover tab, only used in visualizations.
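The same Lucene expressions can also be sent to Elasticsearch outside of Kibana through the URI search API. The following is a minimal sketch, assuming the default http://localhost:9200 endpoint and the logstash-* index pattern; the field names are taken from the examples above.

import requests

def search(lucene_query, size=10):
    # URI search: the q parameter accepts the same Lucene syntax as the Kibana search bar.
    response = requests.get(
        "http://localhost:9200/logstash-*/_search",
        params={"q": lucene_query, "size": size},
    )
    return response.json()["hits"]

print(search("_exists_:fieldname")["total"])
print(search("type:log_client AND records_same:>10")["total"])
print(search("tags:Connection_finished AND NOT (TM:A29 OR TM:A23)")["total"])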

13.2 Settings tab

On the Settings -> Advanced tab the default configuration can be adjusted. For example, the number of lines shown by default is 500, but this can be increased up to 9999. The provided dashboards, visualizations and searches can also be imported from here, as instructed in the Kibana chapter.

13.3 Dashboard tab

Dashboards can be opened on the Dashboard tab using the same Save/Open/New icons as in the search toolbar of the Discover tab. Visualizations can be added to the currently opened dashboard by using the (+) icon. All visualizations present on a dashboard can be accessed and edited by choosing the pencil icon in the upper right corner of each visualization. Visualizations can also be rearranged in the dashboard display by drag and drop, or removed by clicking the cross icon in the upper right corner. All visualizations on a dashboard are aligned to the date and time chosen in the time picker widget. See the dashboard screenshot in Figure 13.5.

Figure 13.5: Kibana Dashboard visualizations


13.4 Visualization tab

When a visualization is clicked (pencil icon), it is opened in the Visualize tab, where it can be further explored and adjusted. All sections in the display can be filtered by clicking on them directly. The results of the visualization in table format can be accessed by clicking the arrow at the bottom of the visualization. The fields and values listed in this table can also be filtered by clicking on them.

All applied filters can be pinned (using the pin option on the filter) and then applied across all Kibana tabs. As a result, the value and time of interest can be set/filtered on the Visualize tab, pinned, and the records themselves can then be investigated on the Discover tab. Looking into the visualization results in more detail, using the table display and filtering, can be very helpful in the overall investigation of anomalous behavior and troubleshooting. For an example of adding a filter, see the screenshot in Figure 13.6.

Figure 13.6: Kibana visualization editing

There are multiple options for adjusting the display of a visualization:

• Adjusting metrics (marked yellow in the screenshot) – e.g. the timeframe can be adjusted to show counts hourly or by minutes;

• Opening the table list by clicking the arrow at the bottom of the graph (marked pink in the screenshot);

• Setting a filter on the value of interest (marked orange in the screenshot);

• Applying filters at the top of the page (marked red in the screenshot).

Complete information about Kibana usage and much more is available in the official Kibana 4 documentation pages [82].


14 Appendix 3: Installation and setup

The system for processing sample log data consists of several components that need to be set up:

• Logstash – setting up data collection and adjusting the configuration file;

• Elasticsearch – run the downloaded Elasticsearch package directly;

• Kibana – set the index name and import dashboards (from version 4.3 it creates an index in Elasticsearch automatically);

• ElastAlert – setting up the framework and running the provided rules.

14.1 Logstash setup

The Logstash package needs to be downloaded from the official elastic.co pages1. The version used in this master's thesis is Logstash 2.1.0 (All Plugins). This version was chosen so there would be no need to install additional plugins. The provided configuration file needs to be adjusted according to the specific input and output requirements.

For the Logstash file input, the path to the input file must be updated in the configuration accordingly. Additionally, the sincedb and _grokparsefailure output paths must be updated. For the alerting outputs, the required e-mail addresses need to be updated along with the host and port to be used for sending e-mails via SMTP.

Then make sure the Elasticsearch instance is running (so output can be written there right away) and run the configuration by using the command logstash -f ls_server.conf or logstash -f ls_client.conf (the executable file is located in the bin folder of the Logstash installation package). To make sure the configuration is set correctly with no syntax errors, the command logstash -f ls_server.conf --configtest can be run to find possible issues in the configuration file syntax.

14.2 Elasticsearch setup

Elasticsearch needs to be downloaded from the official elastic.co pages2. Elasticsearch version 2.1.1 was used in the local tests. The installation steps are quite straightforward – download and unzip the distribution and run bin/elasticsearch on Unix or bin/elasticsearch.bat on Windows. Elasticsearch is then accessible on http://localhost:9200.
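A quick sanity check that the Elasticsearch instance is up before starting Logstash or Kibana; a minimal sketch, assuming the default endpoint mentioned above.

import requests

# The Elasticsearch root endpoint returns basic cluster information as JSON.
info = requests.get("http://localhost:9200").json()
print("Elasticsearch %s is running" % info["version"]["number"])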

1. Logstash can be downloaded from https://www.elastic.co/downloads/logstash
2. Elasticsearch can be downloaded from https://www.elastic.co/downloads/elasticsearch


14.3 Kibana setup

Kibana will not be accessible until Elasticsearch starts and works as expected, because it is a browser-based querying tool for an Elasticsearch instance. Download the Kibana installation package from the elastic.co pages and choose the appropriate OS version3. Kibana version 4.3.1 was used for the local testing and dashboard creation. After the download is complete, Kibana needs to be unzipped, elasticsearch.url needs to be pointed to the Elasticsearch instance and ./bin/kibana (or bin/kibana.bat on Windows) can be run. Note that once Kibana detects the corresponding Elasticsearch instance, it creates its own index .kibana. Kibana can be accessed on http://localhost:5601.

14.4 ElastAlert setup

The ElastAlert framework setup might be more complicated if Python is not installed on the host yet. The framework requires Python 2.6 and pip4. For Python 2.x versions pip is often already included in the downloadable Python package. Note that Python 3.x will not work, as there were multiple syntax changes between the Python 2.x and 3.x versions and the ElastAlert scripts are based on the 2.x syntax.

The ElastAlert project can be downloaded from the GitHub project page5. To get ElastAlert running, all steps in the documentation [83] need to be followed:

• Install the framework by running python setup.py install;

• Replace config.yaml.example with the provided config.yaml file and adjust it accordingly (Elasticsearch host/port, e-mail alert settings, etc.);

• Download the rules and replace the example_rules folder with the provided rules folder;

• Update the rule.yaml files in the rules folder accordingly – the Elasticsearch host/port and e-mail need to be set;

• Test a rule by running elastalert-test-rule rules/rule_frequency.yaml;

• If no issues are encountered, start the rules by running python -m elastalert.elastalert --verbose --rule example_rulefilename.yaml.6

3. Kibana can be downloaded from https://www.elastic.co/downloads/kibana
4. Python package index – available here: https://pypi.python.org/pypi/pip
5. Downloadable/cloned via the link: https://github.com/Yelp/elastalert.git
6. On Windows these need to be launched from the C:/PythonXX/Scripts/ folder with the absolute path to the rule set in the command – e.g. python -m elastalert.elastalert --verbose --rule c:/path/elastalert/rules/rule_CarKey_cardinality.yaml


15 Appendix 4: List of compared log analysis software

Name | Type of software | Tracking method | Hosted/On-premise | Pricing
Google analytics | Client side | JS Page tagging | Hosted | Free
Clicky analytics | Client side | JS Page tagging | Hosted | Price varies
KISSmetrics | Client side | JS Page tagging | Hosted | Price varies
ClickTale | Client side | JS Page tagging | Hosted | Price varies
CardioLog | Client side | JS Page tagging | Hosted/On-premise | Price varies
WebTrends | Client side | JS Page tagging | Hosted | Not available
Mint | Client side | JS Page tagging | On-premise | $30 per site
Open Web Analytics | Client side | JS Page tagging | On-premise | Free (GNU)
Piwik | Client/Server side | JS/PHP tagging & Log files | On-premise/Hosted | Free/PRO
CrawlTrack | Client/Server side | PHP tagging | On-premise | Free (GNU)
W3Perl | Client/Server side | JS tagging or Log files | On-premise | Free (GNU)
AWStats | Server side | Log files | On-premise | Free (GNU)
Analog | Server side | Log files | On-premise | Free (GNU)
Webalizer | Server side | Log files | On-premise | Free (GNU)
GoAccess | Server side | Log files | On-premise | Free (GNU)
Angelfish | Server side | Log files | On-premise | $1,295 per year
Log Expert | Server side | Log files | On-premise | Free (MIT)
Chainsaw | Server side | Log files | On-premise | Free (GNU)
BareTail | Server side | Log files | On-premise | Free
GamutLogViewer | Server side | Log files | On-premise | Free/PRO
OtrosLogViewer | Server side | Log files | On-premise | Open source
LogMX | Server side | Log files | On-premise | Price varies
Retrospective | Server side | Log files | On-premise | Price varies
Logentries | Client/Server side | JS tagging & Log files | Hosted | Price varies
Sawmill | Client/Server side | JS tagging & Log files | On-premise/Hosted | Price varies
Splunk | Platform solution | Tagging/Log files – more sources | On-premise | Price varies
Sumo Logic | Platform solution | Tagging/Log files – more sources | Hosted | Price varies
Grok by Numenta | AWS dedicated | System metrics | Hosted | Price varies
XpoLog | Platform solution | Tagging/Log files – more sources | On-premise/Hosted | Price varies
Skyline by Etsy | Traffic monitoring | System metrics | On-premise | Free (GNU)
ELK stack | Server side | Log files – more sources | On-premise | Free (GNU)
Apache family | Platform solution | Log files – more sources | On-premise | Free (GNU)


List of Tables

3.1 List of Conversion characters used in ConversionPattern 22
4.1 Comparison of selected client-based software features 30
4.2 Comparison of selected server log file analysis software features 31
4.3 Comparison of selected application log file analysis software features 33


List of Figures

2.1 Anatomy of APM [2] 4
2.2 The data science process [4] 6
2.3 Data mining placed in the KDD Process [7] 8
3.1 Illustration of communication zones for attack detection [18] 14
5.1 ELK stack topology [74] 42
9.1 General_IMSI_country visualization screenshot 70
9.2 General_failed_reason visualization screenshot 71
9.3 General_module_type visualization screenshot 72
9.4 General_records_sum visualization screenshot 73
9.5 General_bytes_max visualization screenshot 74
9.6 Anomaly_empty visualization screenshot 75
9.7 Anomaly_time_created visualization screenshot 76
9.8 Anomaly_SQL visualization screenshot 77
9.9 Anomaly_empty_clients visualization screenshot 78
9.10 Client_runtime_exception visualization screenshot 79
9.11 Kibana Objects import 81
13.1 Setting index in Kibana GUI 93
13.2 Changing date and time in Kibana GUI 94
13.3 Kibana Discover tab GUI 94
13.4 Kibana Discover tab filter 96
13.5 Kibana Dashboard visualizations 97
13.6 Kibana visualization editing 98


16 Literature

[1] ISACA. Monitoring Internal Control Systems and IT. 124 pages. ISBN 9781604201109.

[2] APM digest. The Anatomy of APM – 4 Foundational Elements to a Successful Strategy. [cited 16 June 2015] From the World Wide Web: <http://apmdigest.com/the-anatomy-of-apm-4-foundational-elements-to-a-successful-strategy>

[3] Patents. On-line service/application monitoring and reporting system. [cited 16 June 2015] From the World Wide Web: <https://www.google.com/patents/US7457872>

[4] Cathy O'Neil, Rachel Schutt. Doing Data Science. O'Reilly Media, 2013. 406 pages. ISBN 978-1-44935-865-5.

[5] Mann, Prem S. (1995). Introductory Statistics (2nd ed.). Wiley. ISBN 0-471-31009-3.

[6] Ian H. Witten, Eibe Frank, Mark A. Hall. Data Mining – Practical Machine Learning Tools and Techniques (Third Edition). Copyright © 2011 Elsevier Inc. ISBN 978-0-12-374856-0.

[7] Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. From Data Mining to Knowledge Discovery in Databases. [cited 16 June 2015] From the World Wide Web: <http://www.kdnuggets.com/gpspubs/aimag-kdd-overview-1996-Fayyad.pdf>

[8] Stanford Coursera. Machine Learning. [cited 16 June 2015] From the World Wide Web: <https://class.coursera.org/ml-005/lecture/preview>

[9] Logi Analytics. Business Intelligence. [cited 16 June 2015] From the World Wide Web: <http://www.logianalytics.com/bi-encyclopedia/business-intelligence>

[10] Oded Maimon, Lior Rokach. Data Mining and Knowledge Discovery Handbook. Second edition. Springer Science+Business Media, LLC 2005, 2010. e-ISBN 978-0-387-09823-4.

[11] Bernard J. Jansen, Amanda Spink, Isak Taksa. Handbook of Research on Web Log Analysis. Copyright © 2009 by IGI Global. ISBN 978-1-60566-975-5.


[12] Šárka Řezáčová. Creation and Vizualization of a Timeline of Budget of Municipality. Master thesis. [cited 16 June 2015] From the World Wide Web: <https://is.muni.cz/auth/th/268820/fi_m/?fakulta=1433;obdobi=6184>

[13] Christopher Kruegel, Giovanni Vigna. Anomaly Detection of Web-based Attacks. [cited 16 June 2015] From the World Wide Web: <https://cs.ucsb.edu/~vigna/publications/2003_kruegel_vigna_ccs03.pdf>

[14] Karen Kent, Murugiah Souppaya. Guide to Computer Security Log Management. NIST Special Publication 800-92. September 2006. [cited 16 June 2015] From the World Wide Web: <http://csrc.nist.gov/publications/nistpubs/800-92/SP800-92.pdf>

[15] IBM Corp. 2001, 2003. Log File Formats. [cited 16 June 2015] From the World Wide Web: <http://publib.boulder.ibm.com/tividd/td/ITWSA/ITWSA_info45/en_US/HTML/guide/c-logs.html>

[16] Sitepoint. Blane Warrene. Configure Web Logs in Apache. February 23, 2004. From the World Wide Web: <http://www.sitepoint.com/configuring-web-logs-apache/>

[17] BigPanda. Elik Eizenberg. A Practical Guide to Anomaly Detection for DevOps. June 26, 2014. [cited 16 June 2015] From the World Wide Web: <https://www.bigpanda.io/blog/a-practical-guide-to-anomaly-detection/23-a-practical-guide-to-anomaly-detection>

[18] SANS Institute InfoSec Reading Room. Roger Meyer. Detecting Attacks on Web Applications from Log Files. Accepted: 26 January 2008. Copyright SANS Institute. [cited 16 June 2015] From the World Wide Web: <http://www.sans.org/reading-room/whitepapers/logging/detecting-attacks-web-applications-log-files-2074>

[19] ApacheViewer. Web Analytics vs. Log File Analysis. [cited 16 June 2015] From the World Wide Web: <http://www.apacheviewer.com/web_analytics_vs_log_files.pdf>

[20] Hsinchun Chen, Roger H. L. Chiang, Veda C. Storey. Business Intelligence and Analytics: From Big Data to Big Impact. MISQ 2012. [cited 16 June 2015] From the World Wide Web: <http://hmchen.shidler.hawaii.edu/Chen_big_data_MISQ_2012.pdf>


[21] Igvita. By Ilya Grigorik on November 30, 2012. Web Performance Anomaly Detection with Google Analytics. [cited 16 June 2015] From the World Wide Web: <https://www.igvita.com/2012/11/30/web-performance-anomaly-detection-with-google-analytics/>

[22] Speaker Deck. Elasticsearch Inc. Real-time Analytics and Anomalies Detection using Elasticsearch, Hadoop and Storm. Published June 3, 2014 in Technology. [cited 16 June 2015] From the World Wide Web: <https://speakerdeck.com/elasticsearch/real-time-analytics-and-anomalies-detection-using-elasticsearch-hadoop-and-storm>

[23] Jan Valdman. Log File Analysis. PhD report. July 2001. [cited 16 June 2015] From the World Wide Web: <https://www.kiv.zcu.cz/site/documents/verejne/vyzkum/publikace/technicke-zpravy/2001/tr-2001-04.pdf>

[24] Logentries. Posted on June 5, 2014 by Trevor Parsons. 5 Ways to Use Log Data to Analyze System Performance. [cited 16 June 2015] From the World Wide Web: <https://blog.logentries.com/2014/06/5-ways-to-use-log-data-to-analyze-system-performance/>

[25] Logentries. Posted on January 6, 2015 by Chris Riley. Catching Inactivity Before It Catches You. [cited 16 June 2015] From the World Wide Web: <https://blog.logentries.com/2015/01/catching-inactivity-before-it-catches-you/>

[26] James H. Andrews. Testing using Log File Analysis: Tools, Methods, and Issues. Dept. of Computer Science, University of Western Ontario. [cited 16 June 2015] From the World Wide Web: <http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.44.1120&rep=rep1&type=pdf>

[27] OWASP. Logging Cheat Sheet. [cited 16 June 2015] From the World Wide Web: <https://www.owasp.org/index.php/Logging_Cheat_Sheet>

[28] JavaWorld. By Ceki Gülcü. Log4j delivers control over logging. Nov 22, 2000 12:00. [cited 16 June 2015] From the World Wide Web: <http://www.javaworld.com/article/2076243/java-se/log4j-delivers-control-over-logging.html>

[29] Ted Dunning and Ellen Friedman. Practical Machine Learning. Published by O'Reilly Media, Inc. Copyright © 2014. ISBN 978-1-491-90408-4.


[30] Apache logging Class Level. Copyright © 1999-2012 Apache Software Foundation. [cited 16 June 2015] From the World Wide Web: <http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/Level.html>

[31] Apache logging Class PatternLayout. Copyright © 1999-2012 Apache Software Foundation. [cited 16 June 2015] From the World Wide Web: <https://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/PatternLayout.html>

[32] KDnuggets. By Sean McClure (ThoughtWorks). Data Science and Big Data: Two very Different Beasts. [cited 16 December 2015] From the World Wide Web: <http://www.kdnuggets.com/2015/07/data-science-big-data-different-beasts.html>

[33] Google analytics homepage. [cited 16 June 2015] From the World Wide Web: <http://www.google.com/analytics/>

[34] Clicky homepage. [cited 16 December 2015] From the World Wide Web: <http://clicky.com/>

[35] KISSmetrics homepage. [cited 16 December 2015] From the World Wide Web: <https://www.kissmetrics.com/>

[36] ClickTale homepage. [cited 16 December 2015] From the World Wide Web: <http://www.clicktale.com/>

[37] CardioLog homepage. [cited 16 December 2015] From the World Wide Web: <http://www.intlock.com/intlocksite/productsandservices/cardiolog/cardiolog-analytics.asp>

[38] WebTrends homepage. [cited 16 December 2015] From the World Wide Web: <https://www.webtrends.com/>

[39] Mint homepage. [cited 16 December 2015] From the World Wide Web: <http://haveamint.com/>

[40] Open Web analytics homepage. [cited 16 December 2015] From the World Wide Web: <http://www.openwebanalytics.com/>

[41] Piwik homepage. [cited 16 December 2015] From the World Wide Web: <http://piwik.org/>

[42] CrawlTrack homepage. [cited 16 December 2015] From the World Wide Web: <http://www.crawltrack.net/>


[43] W3Perl homepage. [cited 16 December 2015] From the World Wide Web: <http://www.w3perl.com/>

[44] AWStats homepage. [cited 16 December 2015] From the World Wide Web: <http://www.awstats.org/>

[45] Analog homepage. [cited 16 December 2015] From the World Wide Web: <http://web.archive.org/web/20140626104625/http://analog.cx/index.html>

[46] Webalizer homepage. [cited 16 December 2015] From the World Wide Web: <http://www.webalizer.org/>

[47] GoAccess homepage. [cited 16 December 2015] From the World Wide Web: <http://goaccess.io/>

[48] Angelfish homepage. [cited 16 December 2015] From the World Wide Web: <http://analytics.angelfishstats.com/>

[49] Log Expert homepage. [cited 16 December 2015] From the World Wide Web: <http://www.log-expert.de/>

[50] Apache Chainsaw information page. [cited 16 December 2015] From the World Wide Web: <http://logging.apache.org/chainsaw/index.html>

[51] BareTail homepage. [cited 16 December 2015] From the World Wide Web: <http://www.baremetalsoft.com/baretail/>

[52] Gamut log viewer homepage. [cited 16 December 2015] From the World Wide Web: <http://sourceforge.net/projects/gamutlogviewer/>

[53] Otros log viewer homepage. [cited 16 December 2015] From the World Wide Web: <https://github.com/otros-systems/otroslogviewer>

[54] Log MX homepage. [cited 16 December 2015] From the World Wide Web: <http://www.logmx.com/>

[55] Retrospective log analyzer homepage. [cited 16 December 2015] From the World Wide Web: <http://www.retrospective.centeractive.com/log-analyzer>

[56] Logentries homepage. [cited 16 December 2015] From the World Wide Web: <https://logentries.com/>


[57] Sawmill homepage. [cited 16 December 2015] From the World Wide Web: <http://www.sawmill.co.uk/>

[58] Splunk homepage. [cited 16 December 2015] From the World Wide Web: <http://www.splunk.com/>

[59] Sumo logic homepage. [cited 16 December 2015] From the World Wide Web: <https://www.sumologic.com/>

[60] Anomaly detective by Prelert homepage. [cited 16 December 2015] From the World Wide Web: <http://info.prelert.com/products/anomaly-detective-app>

[61] Grok by Numenta homepage. [cited 16 December 2015] From the World Wide Web: <https://numenta.com/grok/>

[62] Xpolog homepage. [cited 16 December 2015] From the World Wide Web: <http://www.xpolog.com/home/solutions/logAnalysis.jsp>

[63] Skyline by Etsy project on GitHub. [cited 16 December 2015] From the World Wide Web: <https://github.com/etsy/skyline>

[64] Oculus by Etsy project on GitHub. [cited 16 December 2015] From the World Wide Web: <https://github.com/etsy/oculus>

[65] Graylog homepage. [cited 16 December 2015] From the World Wide Web: <https://www.graylog.org/>

[66] Elastic products homepage. [cited 16 December 2015] From the World Wide Web: <https://www.elastic.co/products>

[67] Apache foundation homepage. [cited 16 December 2015] From the World Wide Web: <http://www.apache.org/>

[68] Apache Flume information page. [cited 16 December 2015] From the World Wide Web: <http://flume.apache.org/>

[69] Apache Hadoop information page. [cited 16 December 2015] From the World Wide Web: <http://hadoop.apache.org/>

[70] Apache Lucene Solr information page. [cited 16 December 2015] From the World Wide Web: <http://lucene.apache.org/solr/features.html>

[71] Apache Spark information page. [cited 16 December 2015] From the World Wide Web: <http://spark.apache.org/>


[72] Apache Storm information page. [cited 16 December 2015] From the World Wide Web: <http://storm.apache.org/>

[73] Amazon web services homepage. [cited 16 December 2015] From the World Wide Web: <http://aws.amazon.com/>

[74] Sixtree. Posted by Swapnil Desai on 14 November 2014. Introduction to Elasticsearch, Logstash and Kibana (ELK) Stack. [cited 26 December 2015] From the World Wide Web: <http://www.sixtree.com.au/articles/2014/intro-to-elk-and-capturing-application-logs/>

[75] Wikitech. Analytics/Cluster/Logstash. [last modified on 25 November 2014, at 19:17] [cited 26 December 2015] From the World Wide Web: <https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Logstash>

[76] Logstash input plugins information page. [cited 26 December 2015] From the World Wide Web: <https://www.elastic.co/guide/en/logstash/current/input-plugins.html>

[77] Logstash codec plugins information page. [cited 26 December 2015] From the World Wide Web: <https://www.elastic.co/guide/en/logstash/current/codec-plugins.html>

[78] Logstash output plugins information page. [cited 26 December 2015] From the World Wide Web: <https://www.elastic.co/guide/en/logstash/current/output-plugins.html>

[79] James Turnbull. When Logstash And Syslog Go Wrong. Sat, Sep 20, 2014. [cited 26 December 2015] From the World Wide Web: <http://kartar.net/2014/09/when-logstash-and-syslog-go-wrong/>

[80] Logstash filter plugins information page. [cited 26 December 2015] From the World Wide Web: <https://www.elastic.co/guide/en/logstash/current/filter-plugins.html>

[81] Elasticsearch documentation. [cited 29 December 2015] From the World Wide Web: <https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html>

[82] Kibana documentation. [cited 29 December 2015] From the World Wide Web: <https://www.elastic.co/guide/en/kibana/current/index.html>

[83] ElastAlert documentation pages. [cited 29 December 2015] From the World Wide Web: <http://elastalert.readthedocs.org/en/latest/>
