
Florida State University Libraries

Electronic Theses, Treatises and Dissertations The Graduate School

2007.

Automation of Email Analysis Using a Database

Jasbinder Singh Bali

Follow this and additional works at the FSU Digital Library. For more information, please contact [email protected]

THE FLORIDA STATE UNIVERSITY

COLLEGE OF ARTS AND SCIENCES

AUTOMATION OF EMAIL ANALYSIS USING A DATABASE

By

JASBINDER S. BALI

A Thesis submitted to the Department of Computer Science
in partial fulfillment of the requirements for the degree of
Master of Science

Degree Awarded: Fall Semester, 2007

The members of the Committee approve the Thesis of Jasbinder Bali defended on October

10, 2007.

Sudhir Aggarwal
Professor Directing Thesis

Zhenhai Duan
Committee Member

Piyush Kumar
Committee Member

Approved:

David Whalley, Chair
Department of Computer Science

Joseph Travis, Dean, College of Arts and Sciences

The Office of Graduate Studies has verified and approved the above named committee members.


This thesis is dedicated to almighty Waheguru and all my family members.


ACKNOWLEDGEMENTS

I would like to thank my advisor Dr. Sudhir Aggarwal for giving me an opportunity to

work on UnMASK. I would also like to thank Leo Kermes and Zhenghui Zhu who worked

hand in hand with me on this project, for all their help and support.

— Jasbinder S. Bali


TABLE OF CONTENTS

List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii

Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix

1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

2. Background and Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 4

3. Overview of UnMASK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
   3.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
   3.2 Proof of Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
   3.3 Software Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
   3.4 Email Parser . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
   3.5 Unix Tools Server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

4. Database Server: Automation Engine for UnMASK . . . . . . . . . . . . . . . 11
   4.1 Why PostgreSql . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
   4.2 UnMASK Database Design . . . . . . . . . . . . . . . . . . . . . . . . . 13
       4.2.1 Raw Email Analysis: Basis of DB Design . . . . . . . . . . . . . . 13
       4.2.2 Database Design Factors . . . . . . . . . . . . . . . . . . . . . . . 16
       4.2.3 Final Database Design . . . . . . . . . . . . . . . . . . . . . . . . 19

5. Automation Using a Database . . . . . . . . . . . . . . . . . . . . . . . . . . 46
   5.1 UnMASK Unix Tools Connection Protocol (UUTC) . . . . . . . . . . . . 46
   5.2 Automating Email Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 46

6. Performance Improvement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

7. State Maintenance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
   7.1 Why State Maintenance . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
   7.2 Accomplishing State Maintenance . . . . . . . . . . . . . . . . . . . . . . 56

8. Tagging Unix Tools Result with an Email . . . . . . . . . . . . . . . . . . . . 59

9. Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61


10. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
    10.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

BIOGRAPHICAL SKETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70


LIST OF TABLES

3.1 UnMASK Unix Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

4.1 tbl_users . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

4.2 tbl_email . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

4.3 tbl_l_header . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

4.4 tbl_ul_header . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

4.5 tbl_ul_header_received . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

4.6 tbl_email_address . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

4.7 tbl_uri . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

4.8 tbl_href . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

4.9 tbl_website . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

4.10 tbl_source_master . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

4.11 tbl_whois . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

4.12 tbl_traceroute . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

4.13 tbl_dig . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

4.14 tbl_country . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

4.15 tbl_vrfy_mx . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

4.16 tbl_concurrent_db . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

4.17 tbl_concurrent_unix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

4.18 UnMASK Triggers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45


LIST OF FIGURES

3.1 UnMASK: Software Architecture . . . . . . . . . . . . . . . . . . . . . . . . 9

4.1 Sample Raw Email . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

4.2 UnMASK: Tables, Triggers and Dataflow . . . . . . . . . . . . . . . . . . . . 19

4.3 User Defined Function sp_fetch_tools_result . . . . . . . . . . . . . . . . . 41

4.4 Trigger Function func_client_socket . . . . . . . . . . . . . . . . . . . . . 44

5.1 Workflow: Automating email analysis . . . . . . . . . . . . . . . . . . . . . 47

5.2 Implementation of User Defined Function sp_email_parser . . . . . . . . . 48

6.1 Communication mechanism between the database server and Unix tools server. 52

9.1 UnMASK User Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

9.2 Segment of a Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

9.3 Part of header field report . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66


ABSTRACT

Phishing scams which use emails to trick users into revealing personal data have become

pandemic in the world. Analyzing such emails to extract maximum information about

them and make intelligent forensic decisions based on such an analysis is a major task

for law enforcement agencies. To date such analysis is done by manually checking various

headers of a raw email and running various Unix tools on its constituent parts such as

IP addresses, links, and domain names. This thesis describes the design and development of a

database system used for automation of a system called the Undercover Multipurpose Anti-

Spoofing Kit (UnMASK) that will enable investigators to reduce the time and effort needed

for digital forensic investigations of email-based crimes. It also describes how the database

is used to perform such automation. UnMASK uses a database for organizing a workflow

to automatically launch Unix tools to collect additional information from the Internet. The

retrieved information is in turn added to the database. UnMASK is a working system. To

the best of our knowledge, UnMASK is the first comprehensive system that can automate

the process of analyzing emails using a database and then generate forensic reports that can

be used for subsequent investigation and prosecution.


CHAPTER 1

Introduction

Phishing scams which use emails to trick users into revealing personal data have become

pandemic in the world. Analyzing such emails to extract maximum information about them

and make intelligent forensic decisions based on such an analysis is a major task for law

enforcement agencies. To date, such analysis is done by manually checking various headers of a raw email and running Unix tools like traceroute, whois, and dig to fetch information about its IP addresses, domain names, and links from the Internet. Currently, there are

programs/tools like Sam Spade[1] and Domain Tools[2] that share the same goal of running Unix tools like traceroute, dig, and whois on these IP addresses and domain names. However, this process is not a complete system in itself: it still needs human effort to parse raw emails manually and then feed their constituent IP addresses, email addresses, domain names, etc. to these Unix tools. Also, there is no correlation between the results obtained from various runs of the Unix tools, and the data is not stored in a central repository for future reference.

This thesis describes the design and development of a database system that leads to the creation of a self-sufficient system called UnMASK[3]. UnMASK aims at parsing a raw email, breaking it into its constituent headers and body, picking IP addresses, email addresses, domain names, and links from these headers and the body, and then automatically invoking a Unix tools server that runs various Unix tools like traceroute, dig, whois, and GeoIP to fetch information from the Internet and store this information back into

the database. The database system described in this thesis aims at automating all the

processes that would take place in UnMASK and making UnMASK a self-sufficient system

for analyzing emails. We accomplish this automation using the PostgreSql database, with a novel use of database triggers as a workflow manager that automatically launches Unix tools to retrieve additional desired information from the Internet and store it into the database.

Chapter 2 throws light on the background and related work in the field of database

driven automation and analysis of phishing emails. Once we are aware of the current trends

and techniques of database driven automation and analysis of phishing emails, we get a fair

idea of our direction of research and the methodologies we would use to design the database

to accomplish the goal of UnMASK.

Chapter 3 gives an overview of the UnMASK project. Here we discuss its problem statement

and various plausible solutions from the database point of view. Then we talk about the

final solution adopted for accomplishing UnMASK’s goal of automatic email analysis. Before

explaining the database driven automation, we also throw light on UnMASK’s software

architecture to get an idea about the UnMASK system. Email Parser and the Unix Tools

Server are the two major components of UnMASK based on which, we designed our database

system. So, it becomes necessary to understand these two components of UnMASK before

discussing the database design. We refer [3] and explain these two important components

of UnMASK in Sections 3.5 and 3.4. It should be noted that the design and development

of Email Parser and the Unix Tools Server used in UnMASK is not a part of this thesis work.

PostgreSql is a powerful database system because of its provision for various programming

language interfaces. In Chapter 4, we discuss the reasons for choosing the PostgreSql database for UnMASK. Here we also discuss the UnMASK database design in detail.

In Chapter 5, we discuss in depth the concept of automation of email analysis using a

database. Section 5.1 talks about UnMASK Unix Tools Connection (UUTC) protocol. This

protocol is the basis of database driven automation in UnMASK and explains the connection

mechanism between the database server and the Unix tools server. In Section 5.2 we discuss

the process of automating email analysis in detail.

Performance is a major issue in database driven automation. Various processes running

to achieve this automation in UnMASK make the system very slow. In Chapter 6, we discuss


how the database driven automation is made efficient and how a speed up is achieved in this

process.

The database server and the Unix tools server are two distinct entities with various inter-related processes running at both ends. In Chapter 7, we throw light on how the process of database

driven automation in UnMASK takes care of maintaining state between the database server

and the Unix tools server. Here we also talk about the need of maintaining such a state.

Results obtained after running various tools in the Unix tools server should be properly

tagged to various entities1 of an email. In Chapter 8, we discuss how results obtained from

the Unix tools server are tagged with various entities of an email.

Chapter 9 illustrates a case study and explains the results obtained from UnMASK

system after running it for a particular email. Here we talk about the various reports that can be generated after a raw email is automatically parsed and the results obtained from the Internet by running various Unix tools are stored back into the database. These reports are used by the

law enforcement officers to analyze the email and take forensic decisions accordingly.

In Chapter 10, we give a summary of the contributions of this thesis work. We also give insight into future work that could add new features to the system and further improve its performance.

1In this thesis, we use the word 'entity' for items like email address, IP address, domain name, link, URL or URI found in an email.


CHAPTER 2

Background and Related Work

The goal of the UnMASK project is to create a fully automated all-encompassing tool for

processing those emails that are submitted to or acquired by law enforcement agencies

as potential phishing emails. Such email messages sometimes contain spoofed, concealed,

incomplete, or otherwise flawed information, and are generally untrustworthy. In the current

practice, in order to extract reliable and useful information from an email message, it is

important to involve a computer forensics expert who understands the complexity of the

Internet email protocols and the loopholes in current email messaging systems. UnMASK

aims to provide automated email processing capability to law enforcement agents to help

them discern and filter any flaws or fraudulent information, reveal the true nature of the

email, and present derived evidence or leads to track down the transgressors. In UnMASK, we chose a database as the automation engine for email analysis.

In the current UnMASK system, as soon as a raw email is submitted, it is automatically parsed in the database, and various network querying tools available to Unix users, including but not limited to dig, whois, and traceroute, are automatically invoked at appropriate times through the database itself. The database acts as a full-fledged automation engine for the whole system.

Currently, there are some working systems that involve database automation. Some of

them are explained as follows:

1. Microsoft’s AutoAdmin[4] project is an effort to make database systems self-tuning and

self-administering. With this database automation, the database auto-tunes itself instead of applications tracking and tuning it, and thus becomes more responsive to application needs. This project is used to self-tune components included in Microsoft SQL Server.

2. As far as analyzing an email is concerned, some projects and tools have been developed. Tools or websites such as Sam Spade[1] or Domain Tools[2] share a similar goal to

ours. These tools are used interactively to various degrees by the law enforcement

community. These tools/websites provide network-query functionality that lets users

probe domain names, IP addresses, etc. Sam Spade, for example, lets users crawl

websites to pull out a list of email-addresses/links. These tools also let users analyze

email headers to determine whether the email message was sent from a valid address or

forwarded via an open relay to cover the sender's tracks. However, these tools expect reasonable networking expertise from the user. More importantly, they do

not sufficiently automate the work nor do they provide a database for further analysis.

3. SPARTA’s Phisherman project[5] is more closely related to UnMASK, in that both

employ a database as a central repository. However, the Phisherman project is a global

effort to simply collect and archive data related particularly to phishing scams and

disseminate this data to its subscribers. In contrast, our goal is to help users in a more

direct way, i.e., provide them an automated tool to process emails.


CHAPTER 3

Overview of UnMASK

The basis of design and development of the database system described in this thesis is the

goal of UnMASK. The database system is designed in such a way that it fits the requirements of UnMASK. This database system works hand in hand with other components

of UnMASK. So, it is very important to understand the goal of UnMASK and details about

its other components based on which we designed our database system. In this chapter, we

essentially discuss the problem statement of UnMASK, the proof of concept for the final

solution and other major components of UnMASK like the Raw Email Parser and the Unix

Tools Server. Please note that the Raw Email Parser and the Unix Tools Server are not a part

of this thesis work. However, these two entities go hand in hand with the database server

described in this thesis. So it becomes essential to get an idea about these two entities in

UnMASK before discussing various aspects of the database server.

3.1 Problem Statement

UnMASK aims at automating the analysis of emails. This should be done by automatically

parsing a raw email and invoking various Unix tools like dig, traceroute, whois etc. and

storing the information retrieved by running these tools back into the database. Appropriate

reports should be automatically generated giving a full-fledged analysis of the email. The whole system should be a one-click process.


3.2 Proof of Concept

To begin with, a raw email should be parsed and broken down into its constituent headers

and the body. In these headers and the body, entities like IP address, email address, domain

names would be picked, stored in the database tables and Unix tools like Traceroute, dig,

whois and Unix commands like ESMTP VRFY and ESMTP EXPN should automatically

run and give an analysis of this email in the form of detailed reports. A Proof of Concept

had to be made that would essentially explain a feasible way to achieve this automation.

The initial approach towards running the Unix tools automatically on email addresses, domain names, and IP addresses was to have a separate process watching for inserts into the tables where these entities are stored. This process would pick up the email address, domain name, or IP address from newly inserted record(s) not yet serviced by the Unix tools server and run Unix tools like traceroute, dig, and whois on them. This was to be done by a periodic cron job, or by using the NOTIFY/LISTEN commands in PostgreSql[6]. This is the conventional approach in such a scenario. However, after thorough research, a major flaw in this approach was highlighted: the cron job would constantly check whether any new record had been inserted into the database tables. It would hence be an infinite loop polling for new records, causing undesirable thrashing if there is too much contention. From the operating system point of view, this approach proved to be very dangerous.

After some more research, we finally decided to use database triggers in a novel way to

automate the whole process. This method is really unconventional and highlights the beauty

of the way triggers can be used in PostgreSql. The final Proof of Concept is as follows:

As soon as a raw email is stored in the database, an action is triggered to automatically parse the email and store entities like email addresses, IP addresses, domain names, links, URLs, and URIs in separate database tables. These tables can further trigger actions to automatically connect to the Unix tools server, passing the parameters required by the tools server to run various Unix tools like traceroute, dig, and whois, and store the results from the tools server back into the database. Using database


triggers to automatically parse the email and then automatically connect to the Unix tools

server was the final Proof of Concept for automating the email analysis.
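The trigger-based Proof of Concept described above can be sketched in PL/pgSQL. This is a minimal illustration, not the actual UnMASK code: the table name tbl_email_address comes from the schema, while the function name, trigger name, and the use of RAISE NOTICE in place of the real socket call are assumptions for this example.

```sql
-- Sketch of the Proof of Concept: an AFTER INSERT trigger on an entity
-- table fires a function that, in UnMASK, would contact the Unix tools
-- server. Function/trigger names here are illustrative.
CREATE OR REPLACE FUNCTION func_call_tools_server() RETURNS trigger AS $$
BEGIN
    -- The real trigger function opens a connection to the Unix tools
    -- server, passing the new row's parameters (unmask_id, domain, ...).
    RAISE NOTICE 'would invoke Unix tools for unmask_id %', NEW.unmask_id;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_email_address_insert
    AFTER INSERT ON tbl_email_address
    FOR EACH ROW EXECUTE PROCEDURE func_call_tools_server();
```

With such a trigger in place, no polling process is needed: the insert itself drives the next step of the workflow.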

3.3 Software Architecture

In this section, we describe the architecture of UnMASK system and show how different

components in the system interact with each other. Figure 3.1 shows the interaction of

the User Interface, PostgreSql database and the Unix Tools system. As seen in the figure,

a user uploads an email into the system using a User Interface. This invokes a server side

JSP script which opens an ODBC connection with PostgreSql database and the email gets

stored in a database table. At this point, a trigger is invoked which in turn calls

a User Defined Function in the database that parses the email. Various headers and body

of the email after being parsed correctly are further stored in appropriate tables. Tables in

which entities like IP addresses, domain names and email addresses are stored further have

triggers associated with them. As soon as a record is inserted in these tables, a trigger is

invoked which in turn calls a User Defined Function to establish a connection with the Unix

tools server, passing the required parameters to it. The Unix tools server, using the parameters sent by the database server, runs various Unix tools like traceroute, dig, whois, etc. and returns

the result back to the database where it is stored in appropriate result tables.
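In this flow, storing the raw email is the only manual step; everything downstream is trigger-driven. A hedged illustration of the kickoff (the column names below are assumptions, not the exact thesis schema):

```sql
-- Hypothetical kickoff of the UnMASK workflow: a single INSERT of the
-- raw email. The parsing trigger, and then the tool-invoking triggers
-- on the entity tables, fire automatically from here on.
INSERT INTO tbl_email (unmask_id, raw_email)
VALUES (1001, '...full raw RFC 2822 message...');
```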

3.4 Email Parser

The major part of the database design is based on how the parsing of an email is done in

UnMASK. Based on that, various database design decisions were taken and the final design was arrived at. In UnMASK, various parsers are used to deconstruct a raw email, analyze email

headers and the body, extract specific components from the email such as email addresses,

IP addresses, domain names, links, URLs, URIs, etc. More details on the email parsers used

in UnMASK can be found in [3]. To reiterate, development of the Email Parser is not a part of this thesis work. However, database-related transactions inside the parser are part of this

thesis.


Figure 3.1: UnMASK: Software Architecture

3.5 Unix Tools Server

The Unix tools server is a daemon that runs programs (tools) invoked by the database server.

Some of the tools that were developed in UnMASK are shown in Table 3.1, which also shows the parameters required by each tool. The parameter unmask_id is a digital ID generated by our database server, explained in detail in the next section. The parameters 'domain' and 'local_name' are the domain name and user name parts of an email address, respectively. 'source' indicates the source of any entity found in the email. 'dns_server' is an optional parameter sent to the Unix tools server. Refer to [3] for more details on the Unix tools server used in UnMASK.

Again, development of the Unix tools server is not a part of this thesis work. However, database-

related transactions inside the server code are part of this thesis.

1Table acquired from [3], Page 6


Table 3.1: UnMASK Unix Tools

tool1
    Parameters: unmask_id, domain, local_name, source, [dns_server]
    Function:   To find the mail servers of the domain, and then ESMTP
                VRFY to verify the email address at one of the mail servers.

tool2
    Parameters: unmask_id, host
    Function:   To find reachability and routes to an IP address or
                canonical host name.

tool3
    Parameters: unmask_id, domain
    Function:   To find registration data for a domain.

tool4
    Parameters: unmask_id, domain, source, [dns_server]
    Function:   To get full DNS information.

tool5 (uses a package called IPGEO[7])
    Parameters: unmask_id, host
    Function:   To find the geographical location (currently only country)
                of an IP address or a canonical host name.


CHAPTER 4

Database Server: Automation Engine for UnMASK

To implement the UnMASK system, we chose to use the PostgreSql database. Our requirements for a database were: (1) the ability to store all email-related data after parsing it to an appropriate level of granularity, and (2) mechanisms to invoke a toolkit of various Unix tools like traceroute, dig, whois, etc. to retrieve additional information related to the email from the Internet. In this chapter, we explain why we chose the PostgreSql database for UnMASK. We also discuss in detail the database design, keeping in mind the Email Parser and the Unix Tools Server explained in Chapter 3.

4.1 Why PostgreSql

We chose PostgreSQL over other relational database management systems because it is free/open source and it has excellent support for many features, including the following:

1. Native Interfaces for Procedural Languages: PostgreSQL allows user defined functions

to be written in various programming languages besides the native PL/pgSQL.

Currently supported languages include Perl, C, Python, etc. We extensively use Perl in our database programming, as Perl packages for email parsing are available on the CPAN[8] website. Also, Perl is particularly well suited to string manipulation. Other popular relational databases like Oracle and Sql Server have

limited support for programming languages. Sql Server 2005 supports .Net compliant

languages like C# and Oracle supports Java. So PostgreSql has an edge over other

popular relational database management systems for its more versatile procedural

languages support.


2. Transactional Data Definition Language (DDL): DDL statements are used to build

and modify the structure of tables and other objects in the database, for example, CREATE TABLE, ALTER TABLE, DROP TABLE, etc. A sample DDL statement is shown as follows:

    CREATE TABLE tbl_l_header
    (
        unmask_id int4,
        header_name text,
        header_content text,
        time_stamp timestamp
    );

This CREATE TABLE statement creates the structure of the database table tbl_l_header. Another example of a DDL statement is as follows:

    ALTER TABLE tbl_l_header
        ADD CONSTRAINT fk_tbl_l_header_uid FOREIGN KEY (unmask_id)
        REFERENCES tbl_email (unmask_id);

The ALTER TABLE statement shown above modifies table tbl_l_header and adds a foreign key constraint to it so that its column unmask_id references column unmask_id in table tbl_email.

In Oracle and other major RDBMSs, these statements are not atomic when included in one single transaction. Consider the following transaction as an example:

    BEGIN
        DDL1;
        DDL2;
    END

In Oracle, DDL1 would commit right after it is issued, without waiting for the end of the overall transaction, making the transaction non-atomic. If DDL1 succeeds and DDL2 fails, there would be an inconsistency in the database if there is a dependence between DDL1 and DDL2. In PostgreSql, however, there is a concept of Transactional DDL: the above transaction would be strictly atomic and would work the same way as transactions with DML (Data Manipulation Language) statements like INSERT, SELECT, etc. In UnMASK, to migrate the database from the development to the production environment, we need to run various database scripts in the new database server. These scripts include DDL statements along with a number of DML statements. So while running a database script, if any of the DDL statements fails, the whole script aborts instead of creating an inconsistent database.
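The migration scenario just described relies on transactional DDL. A minimal sketch, assuming hypothetical table and constraint names (only tbl_email is from the actual schema):

```sql
-- Transactional DDL in PostgreSQL: if the ALTER TABLE fails (e.g. a
-- typo, or tbl_email is missing), the CREATE TABLE is rolled back too,
-- leaving the schema untouched rather than half-migrated.
BEGIN;
CREATE TABLE tbl_example (unmask_id int4);
ALTER TABLE tbl_example
    ADD CONSTRAINT fk_example_uid FOREIGN KEY (unmask_id)
    REFERENCES tbl_email (unmask_id);
COMMIT;
```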

3. Version control of User Defined Functions: Another unique feature of PostgreSql is version control of User Defined Functions. For example, suppose a User Defined Function is created and, while some application is running using it, the function is updated. The application would still use the old version of the User Defined Function instead of the updated one, and the update will not affect the currently running application. On the other hand, applications running on an Oracle database may crash, hang, or show undesired output if Stored Procedures are updated while they are being used by an application. Note that we did not use the term Stored Procedure in the context of PostgreSql because there is no concept of Stored Procedures in PostgreSql; this is explained in detail in Section 4.2.3.2.
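In PostgreSql, this behavior is exercised through CREATE OR REPLACE FUNCTION, which swaps in the new definition without disturbing sessions already executing the old body. A hypothetical example (the function below is illustrative; only tbl_l_header is from the actual schema):

```sql
-- Replacing a function while it is in use does not affect sessions
-- already executing it; they complete with the old definition.
CREATE OR REPLACE FUNCTION get_header_count(p_unmask_id int4)
RETURNS bigint AS $$
    SELECT count(*) FROM tbl_l_header WHERE unmask_id = p_unmask_id;
$$ LANGUAGE sql;
```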

4.2 UnMASK Database Design

4.2.1 Raw Email Analysis: Basis of DB Design

The UnMASK database is designed for storing raw email data and the results retrieved by running various Unix tools. So before discussing the actual database structure, we need to analyze a


raw email. A sample raw email is shown in Figure 4.1.

After analyzing the raw email, we identified the following logical divisions in it:

1. Limited header fields: These are the fields in a raw email that appear only once. For example, 'from', 'to', 'sender', 'subject', etc.

2. Unlimited header fields: These are the fields in a raw email that can appear one or more times. For example, 'cc', 'bcc', 'received' fields, 'resent' fields, etc.

3. Body: The body of an email can be in different forms depending upon the MIME-Version header field. It can be in the form of plain text, HTML, or any other valid format defined in RFC 2822.

Various header fields like ’from’, ’to’, ’cc’, ’bcc’, ’received’ etc. and the body of raw email

carry information like email address, domain name, URL, URI on which we decided to run

various Unix tools. Keeping various header fields decoupled from each other in the database,

based on their logical division in raw email, would help in querying the database efficiently.

For example, if limited and unlimited fields are stored together in one single database table, then various Sql queries would take longer to fetch results as the number of records keeps increasing. So it is advisable to divide the various header fields among different database tables

for a better throughput of Sql queries.

Among various unlimited header fields, there are header fields called ’received fields’

carrying vital information. These header fields are very important and used to trace the

trail of an email from the sending to the receiving end. Each relay server between the sending end and the receiving end of an email adds one received header field to it. Let us analyze one such received header field:

    Received: from officialgiftcards.info (OFFICIALGIFTCARDS.INFO [66.240.223.56])
        by mx.google.com with ESMTP id g17si3029410nfd.2007.10.13.15.25.13;
        Sat, 13 Oct 2007 15:25:15 -0700 (PDT)

Figure 4.1: Sample Raw Email

Here we see that the received header field is further divided into various name-value pairs. Looking at the first line, which carries information about the relay server, we see that this field ('from') is sub-divided into entities that we term from-from (officialgiftcards.info), from-domain (OFFICIALGIFTCARDS.INFO) and from-address (66.240.223.56). All these fields should be stored distinctly in the database to ease creating the correlation logic for report generation and the various other SQL queries used within the database code.
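The decomposition of the 'from' clause above can be approximated with a regular expression. This is a simplified sketch assuming the common 'helo (reverse-dns [ip])' shape shown in the example, not the full grammar handled by the parser:

```python
import re

def parse_received_from(received):
    """Extract from-from, from-domain and from-address from the 'from'
    clause of a Received header of the form
    'from helo.name (REVERSE.DNS [1.2.3.4]) by ...'."""
    m = re.match(r"from\s+(\S+)\s+\((\S+)\s+\[([^\]]+)\]\)", received)
    if not m:
        return None
    return {"from_from": m.group(1),
            "from_domain": m.group(2),
            "from_address": m.group(3)}

hdr = ("from officialgiftcards.info "
       "(OFFICIALGIFTCARDS.INFO [66.240.223.56]) "
       "by mx.google.com with ESMTP id g17si3029410nfd")
parts = parse_received_from(hdr)
```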

As seen in Figure 4.1, all the header fields are logically divided into name-value pairs. For example, the 'from' field carries information like From: [email protected]. 'Received' fields, as already discussed, are sub-divided into 'from', 'by', 'via' in the form of name-value pairs. Analysis of the body of a raw email shows that the way different fields appear in the body depends upon the content type of the email. If the email is in plain-text format, then entities like email addresses, domain names, URLs etc. are found inline. If the email is in HTML format, then these entities can exist in the form of clickable links. These links appear as anchor (<a>) tags. An anchor tag looks as follows:

. <a href="http://192.145.2.1">Bank of America</a>

Here the sender claims that clicking this link will take the recipient to Bank of America's website. However, the href part of the anchor tag would actually take the recipient to some unknown IP address. Such links are very prevalent in phishing emails and extremely useful for making forensic decisions about an email. Hence, such data should be decoupled from the rest of the email data.
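A rough sketch of how such a mismatch between the visible text and the actual target can be detected is shown below. It assumes the simple double-quoted anchor form from the example above and is not the parser UnMASK uses:

```python
import re
from urllib.parse import urlparse

def suspicious_links(html):
    """Flag anchors whose visible text does not mention the host the
    href actually points to -- a common phishing pattern."""
    flagged = []
    for href, display in re.findall(r'<a href="([^"]+)">([^<]*)</a>', html):
        host = urlparse(href).hostname or ""
        if host.lower() not in display.lower():
            flagged.append((href, display))
    return flagged

body = '<a href="http://192.145.2.1">Bank of America</a>'
hits = suspicious_links(body)
```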

4.2.2 Database Design Factors

Based on the raw email analysis discussed in the previous section, various factors were considered before finalizing the design of the database. These are enumerated as follows:

1. The database should keep the various logical entities decoupled from each other based on the various headers in the raw email. Decoupling should also be done on the basis of the body of the email. Also, since 'received' fields carry vital information about the relay servers, these should be stored separately for easy and efficient analysis of such fields.

2. Email addresses, IP addresses, domain names, URLs, links etc., being the entities on which the Unix tools run, should be stored separately. The tables in which such entities are stored should trigger the appropriate actions to run the Unix tools on them. Also, these entities should be properly tagged to a particular email.

3. Results fetched by running the various Unix tools should be stored separately in result tables. These results should be properly tagged to an email.

4. Data in the result tables should not be redundant. For example, suppose the 'whois' result for yahoo.com is already in the database and 'whois' is run on yahoo.com again after a certain period of time. If the result of the current run is the same as what is already stored, there is no point in storing the same 'whois' result as a separate record. Instead, the old 'whois' result can be logically tagged to the yahoo.com entity found in the current email. This saves considerable storage space in the database.

5. Unnecessary calls from the database server to the Unix tools server should be avoided. For example, if yahoo.com is found multiple times in an email, then a connection to the Unix tools server should not be established more than once for that entity in the same email. This is because we cannot expect Unix tools like whois, traceroute, dig etc. to give different information within a period of a few seconds or minutes. So if an entity occurs more than once in a raw email, a connection with the Unix tools server should be established only for the first occurrence of that entity. For the remaining occurrences of the same entity in the same email, the results from the first run of the Unix tools should be reused. This increases the efficiency of the database server by avoiding expensive and unnecessary calls to the Unix tools server. However, there is an exception to this rule: if the last run of the tool on the same entity stored a blank result in the database, then the tool should be run again even though it was already run for the same entity in the same email.

6. Since network querying tools like whois, traceroute, GeoIP and dig are not expected to give different results for an entity within a period of 10 days[1], the Unix tools server should not be contacted more than once in a period of 10 days for the same entity across emails. For example, suppose Email-1 was uploaded into the UnMASK system and the IP address 192.145.0.1 was found in it. All the network querying tools were run on this entity by establishing a connection with the Unix tools server. Now, let us say that after 6 days Email-2 is uploaded into the system and the same IP address, i.e. 192.145.0.1, is found in it. A connection with the Unix tools server should not be established for this entity because one was already made within the 10-day period. Only after the 10-day period would a connection with the Unix tools server be established for the same entity across different emails. This rule has two exceptions:

• Since email addresses are ephemeral, the 10-day logic explained above is not used for email addresses when running Unix commands like ESMTP VRFY and ESMTP EXPN on them.

• If the result of the last run of a tool on an entity was null, then the 10-day rule is not applied and the tool is run again on that entity.

7. As discussed in the previous section, since hrefs carry vital information and need to be analyzed separately, this entity should be decoupled from the rest of the entities and stored separately. One of the requirements of the reports displayed in the UnMASK User Interface (see Chapter 9) is to list the various websites found in an email and the link(s) under each website. For this purpose, every href, link, URL, URI etc. found in the email should be deconstructed and its website part stored separately. Also, all the websites and the links from which the website names are taken should be properly mapped to each other.

8. All database inserts and retrievals should take place through User Defined Functions to avoid SQL Injection [9]. Also, all the business logic inside the database should be embedded in User Defined Functions.

9. When the database inserts occur, they can initiate other database activities through the use of database triggers. Activities can include parsing fields of records in tables, initiating a connection to the Unix tools server and entering new records into the tables.

[1] Choosing a period of 10 days is just a design decision and is not a standard followed across various systems.

Figure 4.2: UnMASK: Tables, Triggers and Dataflow
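The caching rules in points 4 through 6 above boil down to one decision per entity: contact the Unix tools server or reuse a stored result. A minimal Python sketch of that decision follows, with illustrative names and a simplified in-memory view of the result tables; it is not UnMASK's actual plpgsql logic:

```python
from datetime import datetime, timedelta

CACHE_WINDOW = timedelta(days=10)  # a design decision, not a standard

def should_contact_tools_server(entity, seen_in_this_email, last_run,
                                last_result, is_email_address, now):
    """Decide whether to open a new connection to the Unix tools server.
    seen_in_this_email: entities already sent for the current unmask_id.
    last_run / last_result: newest stored result for the entity across
    all emails (None if no record exists)."""
    if entity in seen_in_this_email and last_result:
        return False                  # reuse the first run for this email
    if last_run is None or not last_result:
        return True                   # never run, or last result was blank
    if is_email_address:
        return True                   # VRFY/EXPN results are ephemeral
    return now - last_run > CACHE_WINDOW

now = datetime(2007, 10, 19)
six_days_ago = now - timedelta(days=6)
# yahoo.com queried 6 days ago with a non-blank result: reuse it.
rerun = should_contact_tools_server("yahoo.com", set(), six_days_ago,
                                    "whois text", False, now)
# Same entity, but the stored result was blank: run again.
rerun_blank = should_contact_tools_server("yahoo.com", set(), six_days_ago,
                                          "", False, now)
```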

4.2.3 Final Database Design

Based on the various design factors explained in the previous section, we designed the database so that the tables that contain the raw email and its deconstructed components are "write once". This helps in maintaining an evidentiary trail for subsequent prosecution. We divided the flow of data in the database into three levels, as shown in Figure 4.2. The various database entities used in UnMASK, such as tables, User Defined Functions and triggers, are explained as follows:

4.2.3.1 Tables

As discussed in the previous section, we divided the whole data flow into three levels. These are:

19

• Level 1 Tables: These are the initial database tables in the system. They store the raw email in text format and all the basic information related to the email being uploaded and the user uploading it. The Level 1 database tables are enumerated as follows:

1. tbl users: This database table stores information about the user uploading an email into the system. See Table 4.1 for column details.

2. tbl email: This table stores the raw email in text format and generates a unique ID called unmask id for every email uploaded in the system. See Table 4.2 for column details.

• Level 2 Tables: These tables store data obtained from the raw email after it is parsed. This includes limited header fields, unlimited header fields and entities like email addresses, IP addresses, domains, links, URLs etc. The Level 2 tables are enumerated as follows:

1. tbl l header: All limited header fields in a raw email, like 'from', 'sender', 'to', 'subject' etc., are stored in this table. The table is designed in such a way that only those limited header fields that exist in the email are stored. See Table 4.3 for column details.

2. tbl ul header: All unlimited header fields in a raw email, like 'cc', 'bcc', 'resent' fields etc., excluding received fields, are stored in this table. The table is designed in such a way that only those unlimited header fields that exist in the email are stored. Since these fields are unlimited in number, a column called seq no is added to maintain the sequence of such fields within an email. See Table 4.4 for column details.

3. tbl ul header received: As discussed in Section 4.2.1, received fields carry vital information about the relay servers and need to be analyzed separately. So all the received fields in a raw email are stored in this table as separate records. If a raw email has three received fields, then three different records are inserted in this table, with the value of the seq no column ranging from 1 to 3. As shown in Table 4.5, the columns of this table are chosen in compliance with RFC 2822 [10].

20

4. tbl email address: This table stores email addresses found anywhere in an email. The 'source' column shows which header an email address belongs to and is mapped to the 'source text' column in table tbl source master (see Table 4.10). See Table 4.6 for column details.

5. tbl uri: This table stores entities like IP addresses, domain names, links, URLs and URIs found anywhere in the email. The 'source' column shows which header an entity belongs to and is mapped to the 'source text' column in table tbl source master. See Table 4.7 for column details.

6. tbl href: As discussed in Section 4.2.1, hrefs need to be analyzed separately for the type of information they carry, so a separate database table is created to store href information. Apart from hrefs, any link, URL or URI found in the email is also stored in this table. All such entities that are not part of an anchor (<a>) tag have a NULL value in the 'display' column. See Table 4.8 for column details. All hrefs, links, URLs and URIs have a website part. For example, for the link http://www.abc.com?ui=123 found in an email, the website part is www.abc.com and gets stored in table tbl website (see Table 4.9). The 'website id' column of tbl href is mapped to the 'website id' column of tbl website.

7. tbl website: This table is created for the reasons discussed in Section 4.2.2, point 7. See Table 4.9 for column details.

8. tbl concurrent db: This table stores the number of times a successful connection is established with the Unix tools server by the database server for a particular email (unmask id). The db count value for the current unmask id is incremented by 1 for every connection opened successfully with the Unix tools server (right after an insert, either in table tbl email address or tbl uri, in the parser code). As soon as the parser code finishes its processing, the db count value for a particular unmask id gives the total number of connections the database server opened with the Unix tools server for that email. This value is compared with the unix count column value in Table 4.17 for maintaining state between the database server and the Unix tools server.

• Level 3 Tables: Level 3 tables store the results fetched by running the various Unix tools in the Unix tools server. These results are stored in the database as-is, without any further parsing. One of the Level 3 tables, tbl concurrent unix, is used for state maintenance (see Chapter 7) between the database server and the Unix tools server. As already explained in Section 4.2.2, point 4, data in the result tables is not duplicated. The Level 3 tables are explained as follows:

1. tbl verify mx: The result obtained by running the Unix commands ESMTP VRFY and ESMTP EXPN on an email address is stored in this table. See Table 4.15 for column details.

2. tbl whois : Result of ’whois’ run on a particular entity is stored in this table. See

Table 4.11 for column details.

3. tbl dig : Result of ’dig’ run on a particular entity is stored in this table. See Table

4.13 for column details.

4. tbl traceroute: Result of ’traceroute’ run on a particular entity is stored in this

table. See Table 4.12 for column details.

5. tbl country : Result of ’GeoIP’ tool run on a particular entity is stored in this

table. See Table 4.14 for column details.

6. tbl concurrent unix: This table stores the number of times a tool in the Unix tools server finishes its job for a particular email's (unmask id) entity. The unix count column value for the current unmask id is incremented by 1 after a Unix tool or command completes its run on a particular email entity. As soon as the Unix tools server has serviced all the requests from the database server for a particular email, the unix count value for that unmask id gives the total number of tools run by the Unix tools server for that email. This value is matched against the db count column value in Table 4.16 for maintaining state between the database server and the Unix tools server.
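The state maintenance performed with tbl concurrent db and tbl concurrent unix can be pictured as a pair of counters per unmask id. The toy Python model below (a hypothetical class, not UnMASK code) shows the invariant being checked:

```python
class ConcurrencyState:
    """Toy model of tbl_concurrent_db / tbl_concurrent_unix for one
    unmask_id: the parser bumps db_count for each connection opened,
    the tools server bumps unix_count for each tool run completed."""
    def __init__(self):
        self.db_count = 0
        self.unix_count = 0

    def connection_opened(self):
        self.db_count += 1       # database server side

    def tool_finished(self):
        self.unix_count += 1     # Unix tools server side

    def all_serviced(self):
        # Every opened connection has a matching completed tool run.
        return self.db_count == self.unix_count

state = ConcurrencyState()
state.connection_opened()
state.connection_opened()
state.tool_finished()
pending = state.all_serviced()   # one tool still running
state.tool_finished()
done = state.all_serviced()
```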

Apart from the Level 1, 2 and 3 tables that take care of the data flow in the UnMASK system, we also have a helper table in the database, explained as follows:

1. tbl source master: This table is a master table for the various 'source' column values used in database tables like tbl email address, tbl uri and tbl href. Source means the part of the raw email to which an entity belongs. For example, a limited header field like 'From: [email protected]' stores an email address; this email address is stored in table tbl email address with the value of the 'source' column as 'from'. See Table 4.10 for column details.
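Several of the tables above (tbl href, tbl website) rely on deconstructing a link into its website part. This split can be sketched with Python's standard urlparse; the sketch is an illustrative stand-in, not the routine UnMASK actually uses:

```python
from urllib.parse import urlparse

def website_part(link):
    """Split an href/link/URL/URI into the (host, port) pair that would
    be stored in tbl_website and mapped back to the link in tbl_href."""
    if "://" not in link:          # tolerate scheme-less links
        link = "http://" + link
    parsed = urlparse(link)
    return parsed.hostname, parsed.port

site, port = website_part("http://www.abc.com?ui=123")
site2, port2 = website_part("www.abc.com:8080/page")
```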


Table 4.1: tbl users
. username (text): A username under which the email is uploaded.
. firstname (text): First name of the user uploading the email.
. lastname (text): Last name of the user uploading the email.
. org (serial): Name of the organization of the user uploading the email.
. phone (text): Phone number of the user uploading the email.
. email (text): Email address of the user uploading the email.
. time stamp (timestamp): Keeps track of the date-time when the email was uploaded in the system. Automatically inserted by the system and has a default value of the current date-time.
Primary Key: username.
Foreign Key: None.


Table 4.2: tbl email
. username (text): A username using which the email is uploaded.
. casename (text): Name of the case under which the email is uploaded. Each email belongs to a particular case.
. email filename (text): Physical name of the raw email file uploaded.
. unmask id (serial): Unique identifier automatically generated by the system and assigned to an email uploaded in the system.
. raw email (text): Text form of the raw email uploaded in the system.
. time stamp (timestamp): Keeps track of the date-time when the email was uploaded in the system. Automatically inserted by the system and has a default value of the current date-time.
Primary Key: unmask id.
Foreign Key: None.

Table 4.3: tbl l header
. unmask id (int4): Unique identifier assigned to the email uploaded in the system to which the header belongs.
. header name (text): Name of the limited header field present in the email.
. header content (text): Contents of the limited header present in the email.
. time stamp (timestamp): Date-time when the record is inserted in the table. Automatically inserted by the system and has a default value of the current date-time.
Primary Key: unmask id, header name.
Foreign Key: unmask id REFERENCES unmask id in Table tbl email.


Table 4.4: tbl ul header
. unmask id (int4): Unique identifier assigned to the email uploaded in the system to which the header belongs.
. seq no (text): Unique sequence number of the various unlimited header fields in a particular email (unmask id).
. header name (text): Name of the unlimited header field present in the email.
. header content (text): Contents of the unlimited header field present in the email (received fields not included).
. time stamp (timestamp): Date-time when the record is inserted in the table. Automatically inserted by the system and has a default value of the current date-time.
Primary Key: unmask id, header name, seq no.
Foreign Key: unmask id REFERENCES unmask id in Table tbl email.


Table 4.5: tbl ul header received
. unmask id (int4): Unique identifier assigned to the email uploaded in the system to which the header belongs.
. seq no (text): Unique sequence number of the various received header fields for a particular email.
. id (text): Id part of a received field header.
. from from (text): from-from part of a received field header.
. from domain (text): from-domain part of a received field header.
. from address (text): from-address part of a received field header.
. rec by (text): received-by part of a received field header.
. via (text): via part of a received field header.
. rec with (text): received-with part of a received field header.
. rec for (text): received-for part of a received field header.
. date time (text): date-time part of a received field header.
. comments (text): comments part of a received field header.
. time stamp (timestamp): Date-time when the record is inserted in the table. Automatically inserted by the system and has a default value of the current date-time.
Primary Key: unmask id, seq no.
Foreign Key: unmask id REFERENCES unmask id in Table tbl email.


Table 4.6: tbl email address
. unmask id (int4): Unique identifier assigned to the email uploaded in the system to which the header belongs.
. source (text): Part or name of the header in the email where the email address was found.
. email local (text): Local part of the email address.
. email domain (text): Domain part of the email address.
. ip address (inet): IP address of the mail server for the domain of this email address.
. vrfy mx id (int4): Foreign key to the vrfy mx id column of table tbl verify mx that stores the ESMTP VRFY and ESMTP EXPN records for this email address.
. time stamp (timestamp): Date-time when the record is inserted in the table. Automatically inserted by the system and has a default value of the current date-time.
Primary Key: unmask id, source, email local, email domain.
Foreign Key: unmask id REFERENCES unmask id in Table tbl email,
. source REFERENCES source text in Table tbl source master,
. vrfy mx id REFERENCES vrfy mx id in Table tbl verify mx.


Table 4.7: tbl uri
. unmask id (int4): Unique identifier assigned to every email uploaded in the system to which the header belongs.
. source (text): Part or name of the header in the email where the IP address, link, URL, URI or domain name was found.
. canonical name (text): Name or value of the IP address, link, URL, URI or domain that was found in the email.
. whois id (int4): Foreign key to the whois id column of table tbl whois that stores the 'whois' records for this URI/URL/IP address.
. traceroute id (int4): Foreign key to the traceroute id column of table tbl traceroute that stores the 'traceroute' records for this URI/URL/IP address.
. dig id (int4): Foreign key to the dig id column of table tbl dig that stores the 'dig' records for this URI/URL/IP address.
. country id (int4): Foreign key to the country id column of table tbl country that stores the 'country' records for this URI/URL/IP address.
. time stamp (timestamp): Date-time when the record is inserted in the table. Automatically inserted by the system and has a default value of the current date-time.
Primary Key: unmask id, source, canonical name.
Foreign Key: unmask id REFERENCES unmask id in Table tbl email,
. source REFERENCES source text in Table tbl source master,
. whois id REFERENCES whois id in Table tbl whois,
. traceroute id REFERENCES traceroute id in Table tbl traceroute,
. dig id REFERENCES dig id in Table tbl dig,
. country id REFERENCES country id in Table tbl country.


Table 4.8: tbl href
. unmask id (int4): Unique identifier assigned to the email uploaded in the system to which the header belongs.
. seq no (int4): Unique sequence number of the various hrefs or links present in an email.
. source (text): Part or header name of the email where the href / link was found.
. href (text): Text stored in the href part of the anchor tag found in the email.
. display (text): Display part of the anchor tag found in the raw email. Remains NULL if it is a normal link rather than an anchor tag.
. website id (int4): Foreign key to the website id column of table tbl website that stores the website record for this href/link.
. time stamp (timestamp): Date-time when the record is inserted in the table. Automatically inserted by the system and has a default value of the current date-time.
Primary Key: unmask id, seq no, source.
Foreign Key: unmask id REFERENCES unmask id in Table tbl email,
. source REFERENCES source text in Table tbl source master,
. website id REFERENCES website id in Table tbl website.


Table 4.9: tbl website
. website id (int4): Unique identifier assigned to a website found in an email. It can also be the website part of any href or normal link found in the email.
. website name (text): Name of the website.
. port (int4): Port associated with any href / link found in the raw email.
. time stamp (timestamp): Date-time when the record is inserted in the table. Automatically inserted by the system and has a default value of the current date-time.
Primary Key: website id.
Foreign Key: None.

Table 4.10: tbl source master
. source text (text): Name of the source of an entity found in an email.
. source header (text): Name of the header in the raw email where the source is found, e.g. limited-header, body etc.
Primary Key: source text.
Foreign Key: None.


Table 4.11: tbl whois
. whois id (serial): Unique identifier (auto-incremented field) assigned to a whois result.
. canonical parameter (text): Canonical parameter on which whois is run. The Unix tool 'whois' runs on the host name. The canonical parameter in this thesis is defined as the host part of any link, URL, URI etc. found in an email.
. whois result (text): Result of whois run on a host name (canonical parameter column).
. time stamp (timestamp): Date-time when the record is inserted in the table. Automatically inserted by the system and has a default value of the current date-time.
Primary Key: whois id.
Foreign Key: None.

Table 4.12: tbl traceroute
. traceroute id (serial): Unique identifier (auto-incremented field) assigned to a traceroute result.
. canonical name (text): Entity (domain name, IP address etc.) on which traceroute is run.
. traceroute result (text): Result of traceroute run on an entity (canonical name column).
. time stamp (timestamp): Date-time when the record is inserted in the table. Automatically inserted by the system and has a default value of the current date-time.
Primary Key: traceroute id.
Foreign Key: None.

Table 4.13: tbl dig
. dig id (serial): Unique identifier (auto-incremented field) assigned to a dig result.
. canonical name (text): Entity (domain name, IP address etc.) on which dig is run.
. dig result (text): Result of dig run on an entity (canonical name column).
. time stamp (timestamp): Date-time when the record is inserted in the table. Automatically inserted by the system and has a default value of the current date-time.
Primary Key: dig id.
Foreign Key: None.

Table 4.14: tbl country
. country id (serial): Unique identifier (auto-incremented field) assigned to the result of the GeoIP tool [7] run on an entity (domain name, IP address).
. canonical name (text): Entity (domain name, IP address etc.) on which the GeoIP tool is run.
. country result (text): Result of the GeoIP tool run on an entity (canonical name column).
. time stamp (timestamp): Date-time when the record is inserted in the table. Automatically inserted by the system and has a default value of the current date-time.
Primary Key: country id.
Foreign Key: None.


Table 4.15: tbl verify mx
. vrfy mx id (serial): Unique identifier (auto-incremented field) assigned to each record inserted in this table.
. email local (text): Local part of the email address that needs to be verified.
. email domain (text): Domain part of the email address that needs to be verified.
. verify result (text): Result of running the Unix command ESMTP VRFY on the local part of the email address (email local) against the domain part (email domain).
. expn result (text): Result of running the Unix command ESMTP EXPN on the local part of the email address (email local) against the domain part (email domain).
. mail server (text): Name of the mail server of the domain part (email domain) of the email address.
. mx records (text): MX records of the mail server of the domain to which the email address belongs.
. time stamp (timestamp): Date-time when the record is inserted in the table. Automatically inserted by the system and has a default value of the current date-time.
Primary Key: vrfy mx id.
Foreign Key: None.


Table 4.16: tbl concurrent db
. unmask id (int4): Unique identifier assigned to each and every email uploaded in the system.
. dbcount (int4): Number of socket connections opened from the database server side for a particular email (unmask id).
. time stamp (timestamp): Date-time when the record is inserted in the table. Automatically inserted by the system and has a default value of the current date-time.
Primary Key: None.
Foreign Key: None.

Table 4.17: tbl concurrent unix
. unmask id (int4): Unique identifier assigned to each and every email uploaded in the system.
. unixcount (int4): Number of tools run by the Unix tools server for a particular email (unmask id).
. time stamp (timestamp): Date-time when the record is inserted in the table. Automatically inserted by the system and has a default value of the current date-time.
Primary Key: None.
Foreign Key: None.


4.2.3.2 User Defined Functions

All database inserts and retrievals in UnMASK take place through User Defined Functions; there are no inline queries in the email parser code or the User Interface code. Using User Defined Functions helps guard against SQL injection. User Defined Functions in PostgreSQL have version control as explained in Section 4.1. Almost all the business logic of UnMASK is embedded inside User Defined Functions.
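UnMASK enforces this with PostgreSQL User Defined Functions; the sketch below uses Python's sqlite3 module purely to illustrate the principle that a single parameter-binding entry point keeps crafted input as data rather than executable SQL. The table and function names here are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl_email_address (unmask_id INT, email_local TEXT)")

def insert_email_address(unmask_id, email_local):
    """Single entry point for inserts: parameters are bound, never
    concatenated into the SQL string, so crafted input stays data."""
    conn.execute("INSERT INTO tbl_email_address VALUES (?, ?)",
                 (unmask_id, email_local))

insert_email_address(1, "jsmith")
# A hostile value is stored verbatim instead of being executed:
insert_email_address(1, "x'); DROP TABLE tbl_email_address; --")
rows = conn.execute("SELECT email_local FROM tbl_email_address").fetchall()
```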

While designing the database for UnMASK, one of the design decisions was not to have inline SQL queries in the User Interface code, the email parser code or the database code used in the Unix tools server. This was necessary to avoid SQL injection arising from the User Interface side. Also, we wanted to keep all our business logic in the database itself for various security reasons. Thus, we wanted to create Stored Procedures that would hold all the inline SQL queries used at the various tiers of the UnMASK system along with the database-related business logic. However, after thorough research, we found that there is no concept of Stored Procedures in PostgreSQL. Instead, PostgreSQL has User Defined Functions, and stored-procedure-like logic is wrapped in User Defined Functions. To shed more light on this, let us analyze the differences between a Stored Procedure and a User Defined Function in other database management systems like Microsoft's SQL Server and Oracle. These differences can be enumerated as follows:

1. Stored Procedures are parsed, compiled and stored in compiled format in the database. We can also say that Stored Procedures are stored as pseudo code, i.e. in compiled form. On the other hand, User Defined Functions are parsed and compiled at runtime.

2. A User Defined Function must return a value, whereas a Stored Procedure does not need to (it certainly can, if required).

3. A User Defined Function can be used within any SQL statement. For example, suppose we have a function FuncSal(int) that returns the salary of a person. This function can be used in a SQL statement as follows:

. SELECT * FROM tbl sal WHERE salary = FuncSal(x)

Here, internally, a call is made to the User Defined Function FuncSal with any desired integer x, and the result is compared with the 'salary' column of the database table tbl sal.

We can have Data Manipulation Language (DML) statements like insert, update and delete in a function. However, in these systems we cannot call such a function (one containing insert, update or delete) from a SQL query. For example, if we have a function FuncUpdate(int) that updates a table, then we cannot call that function from a SQL query:

. SELECT FuncUpdate(field) FROM sometable; -- will throw an error

On the other hand, Stored Procedures cannot be called inside a SQL statement at all.

4. Operationally, when an error is encountered, a function stops, while a Stored Procedure ignores the error and proceeds to the next statement in the code (provided one has included error handling support).

5. User Defined Functions return values of a single type, whereas Stored Procedures can return values of multiple types.

6. Stored Procedures support deferred name resolution. To explain this, let us say we have a stored procedure that refers to tables tbl x and tbl y, but these tables do not actually exist in the database when the stored procedure is created. Creating such a stored procedure does not throw any error. At runtime, however, it would certainly throw an error that tables tbl x and tbl y do not exist in the database. User Defined Functions, on the other hand, do not support such deferred name resolution.

In PostgreSQL, as already mentioned, there is no provision for Stored Procedures; it only provides User Defined Functions. However, these functions have mixed behavior: in some scenarios they behave like a Stored Procedure and in others like a normal User Defined Function. To explain this, let us revisit each difference mentioned above and see how a PostgreSQL User Defined Function behaves in each scenario.


1. In PostgreSQL, whether a User Defined Function gets parsed, interpreted or compiled at run time strictly depends upon the type of language interface used in the function. If the language interface is 'plpgsql' or 'C', then such User Defined Functions are compiled only once, when they are created. This compiled version is kept in the database for any future calls to the same function instead of compiling it every single time it is called. On the other hand, if the language interface is that of Perl or any other interpreted language, then the rules for that language are followed. In the case of Perl, the User Defined Function is parsed and interpreted every single time it is called, following Perl language standards. In a nutshell, what happens to a User Defined Function when it is called by an application strictly depends upon the type of language interface it is using. It may act like SQL Server's or Oracle's stored procedures, or like their User Defined Functions, depending upon the language interface.

2. As far as return values are concerned, a User Defined Function in PostgreSQL behaves strictly like Microsoft SQL Server's or Oracle's User Defined Functions, i.e. it must return a value.

3. A User Defined Function in PostgreSQL, like in any other RDBMS, can be used with any SQL statement. At the same time, we can also use functions that contain Data Manipulation Language (DML) statements like insert, update and delete. Such functions cannot be used within SQL statements in other RDBMS like Oracle, as already explained above.

4. In PostgreSql, a User Defined Function has a provision for Exception Handling. For

example, in function sp client socket() in UnMASK, the following code snippet can be

seen using exception handling :-

. BEGIN

. PERFORM INET(new.canonical name);

. EXCEPTION WHEN invalid text representation THEN

. ip address present=0;

. END;


Here we check whether the ’canonical name’ column in a table is in the form of an IP address or plain text. Function INET, which is a PostgreSql internal function, throws an exception if a non-IP-address argument is provided to it. We catch this exception and give a value of 0 to the variable ip address present. This is one common feature in Stored

Procedures in other RDBMS that we see in PostgreSql User Defined Functions. In other

RDBMS like Microsoft’s Sql Server or Oracle, exception handling is not supported in

User Defined Functions. There it is only supported in stored procedures.

5. In PostgreSql, a User Defined Function may return multiple values, but of the same

type. This is strictly in accordance with User Defined Functions in other RDBMS like

Sql Server and Oracle. A User Defined Function in PostgreSql can’t return multiple

values of different types, unlike stored procedures in Sql Server and Oracle that can

return multiple values of different types.

6. User Defined Functions in PostgreSql support deferred name resolution like stored

procedures in Sql Server and Oracle. Sql Server and Oracle don’t support deferred

name resolution in their User Defined Functions, unlike PostgreSql.

Based on the differences between a Stored Procedure and a User Defined Function in other RDBMS like Sql Server and Oracle, and after analyzing the way User Defined Functions in PostgreSql behave in all such scenarios, we can say that in PostgreSql a Stored Procedure is wrapped inside a User Defined Function.

Figure 4.3 shows an example of a User Defined Function called sp fetch tools result

used in UnMASK. This User Defined Function fetches the Unix tools results (dig, traceroute, whois, ESMTP VRFY & ESMTP EXPN) for entities like email addresses, URLs, URIs, links and domain names in an email (unmask id) and tags the results with these entities in tables tbl email address and tbl uri. The Unix tools results are fetched from tables tbl dig,

tbl traceroute, tbl whois, tbl country, tbl verify mx. As seen in the figure, this User Defined

Function uses database cursors[11] to iterate over the records in tables tbl uri and tbl email address.


Figure 4.3: User Defined Function sp fetch tools result


4.2.3.3 Triggers

Triggers in PostgreSql behave exactly the same way as in other RDBMS like Sql Server and

Oracle, although there is a slight difference between the trigger constructs. In Sql Server

and Oracle, the set of activities that a trigger has to perform is a part of the trigger body.

However, PostgreSql has a special type of User Defined Function called a Trigger Function, with a return type of ’trigger’. That is how a Trigger Function is distinguished from a regular

User Defined Function in PostgreSql. A trigger in PostgreSql always has an associated

Trigger Function that has the set of activities the trigger needs to perform. A trigger body in PostgreSql looks as follows:-

. CREATE TRIGGER trg email address

. AFTER INSERT

. ON tbl email address

. FOR EACH ROW

. EXECUTE PROCEDURE func client socket (’email address’)

As we can see in the code above, a trigger body contains the name of the trigger, the type of the trigger, the name of the database table the trigger is associated with, and the name of the Procedure (Trigger Function) it needs to execute when the trigger is invoked. Here, trigger trg email address is an ’After Insert’2 trigger associated with table tbl email address

and executes Trigger Function func client socket right after a row is inserted in table

tbl email address. A Trigger Function can internally call other User Defined Functions,

if required. Figure 4.4 shows an example of a Trigger Function func client socket used

in UnMASK. As seen in the figure, the Trigger Function doesn’t accept any parameters

in its signature even though we passed a text parameter ’email address’ (func client socket

(’email address’)) while calling it from the trigger body. A Trigger Function also differs

from a normal User Defined Function in parameter passing. In PostgreSql, the number

of parameters one can pass to a Trigger Function is unlimited and dynamic. If passed,

these parameters become an element of a special array TG ARGV[ ], defined internally

in PostgreSql libraries. This array gets dynamically created and populated based on the number of parameters passed to the Trigger Function. If no parameter is passed to a Trigger Function, the TG ARGV[ ] array associated with that Trigger Function is undefined. In Figure 4.4, we can see that the parameter ’email address’ passed to the Trigger Function is accessed using TG ARGV[0], i.e. the first and only element of array TG ARGV[ ]. Note that Figure 4.4 is a code snippet taken from Trigger Function func client socket used in UnMASK. It doesn’t show the complete implementation of the Trigger Function; the purpose of including this figure is just to show how Trigger Functions work in PostgreSql.

2After Insert triggers are invoked after every successful insertion of a row in the database table.

In UnMASK, all the triggers created are ’After Insert’ triggers used to invoke desired

action(s) after a record is inserted in tables tbl email, tbl email address and tbl uri. Table

4.18 gives a complete list of triggers used in UnMASK.


Figure 4.4: Trigger Function func client socket


Table 4.18: UnMASK Triggers

Trigger name | Associated table | Functionality

trg email | tbl email | This trigger is invoked after a record is inserted in table tbl email and calls User Defined Function sp email, which parses the raw email and deconstructs it down into its constituent headers.

trg email address | tbl email address | This trigger is invoked after a record is inserted in table tbl email address and calls User Defined Function sp client socket. This User Defined Function opens a socket connection between the database server and the Unix tools server to verify the validity of an email address.

trg uri whois | tbl uri | This trigger is invoked after a record is inserted in table tbl uri and calls User Defined Function sp client socket. This User Defined Function opens a socket connection between the database server and the Unix tools server to run whois on the newly inserted canonical name column in the table.

trg uri traceroute | tbl uri | This trigger is invoked after a record is inserted in table tbl uri and calls User Defined Function sp client socket. This User Defined Function opens a socket connection between the database server and the Unix tools server to run traceroute on the newly inserted canonical name column in the table.

trg uri dig | tbl uri | This trigger is invoked after a record is inserted in table tbl uri and calls User Defined Function sp client socket. This User Defined Function opens a socket connection between the database server and the Unix tools server to run dig on the newly inserted canonical name column in the table.

trg uri country | tbl uri | This trigger is invoked after a record is inserted in table tbl uri and calls User Defined Function sp client socket. This User Defined Function opens a socket connection between the database server and the Unix tools server to run the GeoIP tool on the newly inserted canonical name column in the table.


CHAPTER 5

Automation Using a Database

In this chapter, we discuss how automation of email analysis is achieved using a database

in the context of UnMASK. Before going into the details of this automation, we explain the

UUTC protocol that we developed in this project for establishing a connection between the

database server and the Unix tools server.

5.1 UnMASK Unix Tools Connection Protocol (UUTC)

The connection between the database server and the Unix tools server is through a new

protocol that we designed and implemented, called the UnMASK Unix Tools Connection

(UUTC) protocol. This protocol opens a socket connection, when needed, to a daemon

process (the Unix tools server) and allows parameters needed for invoking specific tools to

be sent across the connection, and permits return information to be properly put back into

the database. Opening connections from the database server to an external daemon process in this way is, to our knowledge, a novel idea.
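The thesis defines UUTC by its behavior rather than a wire format. The database-side half of the exchange, which fires the parameters at the daemon and deliberately does not wait for results, can be sketched in Python; the host, port and field separator below are illustrative assumptions, not the actual UnMASK values:

```python
import socket

def uutc_send(entity_type, value, host="localhost", port=9000):
    """Sketch of the UUTC client side: connect to the Unix tools
    server daemon, send the parameters a tool needs, and return
    without waiting for results. The daemon closes the socket once
    the parameters arrive; results come back later over a separate
    ODBC connection."""
    with socket.create_connection((host, port)) as sock:
        # The actual UnMASK message format is not documented here;
        # a '|'-delimited line is an illustrative assumption.
        sock.sendall(f"{entity_type}|{value}\n".encode())
    # No read: the database-side caller is deliberately fire-and-forget.
```

The absence of any read call is the essential point: the database transaction is never blocked on the daemon.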

5.2 Automating Email Analysis

Having discussed project UnMASK in detail and the Proof of Concept for accomplishing

the automation of email analysis in Chapter 3, the design of the database server in Chapter 4, the Email Parser in Section 3.4 and the Unix tools server in Section 3.5, we now discuss the

automation process in detail.


Figure 5.1: Workflow: Automating email analysis

Figure 5.1 shows the work flow of the entire automation process. As soon as a raw

email is inserted in the PostgreSql database, it gets stored in table tbl email as a text field. Table tbl email has an ’After Insert’ trigger, trg email, associated with it. As soon as a record gets inserted in table tbl email, trigger trg email is invoked. This trigger internally calls

a User Defined Function sp email parser. As seen in Figure 5.2, User Defined Function

sp email parser is internally divided into 3 sequential transactions.


Figure 5.2: Implementation of User Defined Function sp email parser

Transaction 1 is a PostgreSql function written in Perl. The use of Transactions 2 and 3 is explained later in Chapters 7 and 8. Coming back to Transaction 1, this User Defined

Function is the implementation of an email parser and iteratively parses a raw email into

finer and finer granularity. The part of this parser code that is the work of this thesis is the database transactions taking place inside it.

In the parser, wherever an email address is found in the raw email, it gets stored as a separate record in table tbl email address in the database. Similarly, wherever an IP address, link, domain name, URI or URL is found in the raw email, it gets stored in table tbl uri. Tables

tbl email address and tbl uri both have ’After Insert’ triggers associated with them, as shown

in Table 4.18. As soon as a record is inserted in these two tables, a User Defined Function

sp client socket is called. User Defined Function sp client socket is a PostgreSql function


written in Perl. This User Defined Function is an implementation of a client socket. The

socket is used to establish a communication channel between the database server and the

Unix tools server. The parameters passed through this socket depend on which trigger established the socket connection.

If the socket connection is established because of the action of trigger trg email address on

table tbl email address, then the ’local’ and the ’domain’ part of an email address are passed

to the Unix tools server through the socket. This methodology of data transfer through a

socket connection between the database server and the Unix tools server is what we call

the UUTC protocol defined in Section 5.1. The Unix tools server, as explained in Section 3.5, is a daemon that runs programs (tools) invoked by the database server. The socket connection established by trigger trg email address initiates a tool called ’Tool1’ in the Unix tools server as

described in Table 3.1. Tool1 is implemented as follows: It runs ’dig’ on the domain part

of the email address to find out the mail server to which this domain belongs. If a valid

mail server is returned then, Unix commands ESMTP VRFY and ESMTP EXPN 1 are run

on that local and domain part of the email address. After Tool1 has completed its task, an

ODBC connection (part of the UUTC protocol) is opened between the Unix tools server and the

database server. Using this ODBC connection, the results of Tool1, i.e. the mail server records and the results of ESMTP VRFY and EXPN, are stored in table tbl verify mx.
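The control flow of Tool1 just described can be sketched as follows. This is a Python illustration only; `lookup_mx` and `run_vrfy_expn` are hypothetical stand-ins for the actual dig and ESMTP VRFY/EXPN invocations:

```python
def run_tool1(local, domain, lookup_mx, run_vrfy_expn):
    """Sketch of Tool1: find the mail server for the email domain,
    then verify the mailbox on it. `lookup_mx` and `run_vrfy_expn`
    stand in for the real dig and ESMTP VRFY/EXPN calls."""
    mail_server = lookup_mx(domain)          # dig MX <domain>
    if not mail_server:                      # no valid mail server
        return {"mx": None, "vrfy": None, "expn": None}
    vrfy, expn = run_vrfy_expn(mail_server, local, domain)
    # In UnMASK these results are then written to tbl_verify_mx
    # over a separate ODBC connection.
    return {"mx": mail_server, "vrfy": vrfy, "expn": expn}
```

The guard on the MX lookup reflects the rule above: VRFY and EXPN are only attempted when dig returns a valid mail server.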

Similarly, if the socket connection is established because of the action of triggers

trg uri whois, trg uri traceroute, trg uri dig, trg uri country on table tbl uri as shown in

Table 4.18 then, the canonical name field in table tbl uri is sent across to the Unix tools

server through the socket. All of these four triggers run independently of each other, even

though they are associated with the same database table, open a separate socket connection

with the Unix tools server and an appropriate tool in the Unix tools server is initiated. Whois,

dig, traceroute and country results are saved back in tables tbl whois, tbl dig, tbl traceroute,

tbl country by opening an ODBC connection between the Unix tools server and the database

server as explained above.

1ESMTP stands for ’Extended Simple Mail Transfer Protocol’. Only those mail servers that support ESMTP can run the VRFY and EXPN commands. VRFY is used to verify the validity of a user in a mail server. EXPN returns the whole mailing list in that mail server of which the verified user is a part.


One important point to note here is that the data from the Unix tools server back to the

database server is not passed using the same socket connection. The reason behind this is

that the Unix tools server closes this socket connection right after it receives the required

parameters from the database server. This is done so that the database server process does not wait for the Unix tools server process, which at times might take long to return results, but instead returns right after establishing the socket connection with the Unix tools server and passing the required parameters to it. The reason behind closing this socket

connection in such a manner is explained in detail in Chapter 6.


CHAPTER 6

Performance Improvement

Unix tools like dig, traceroute, whois etc. connect to the internet and return results. Results

returned are stored back in the database. Sometimes the return time of these tools is very high, depending upon the network speed. In UnMASK, for one single email, a connection to the Unix tools server is made several times, depending upon the number of email addresses, IP addresses, domain names, etc. found in the email. Each of these entities found in the email

means an insert in table tbl email address or tbl uri, opening a separate socket connection

with the Unix tools server.

We use PostgreSql’s ”Server Programming Interface”[12] technique to make all the

database inserts inside the email parser code part of one single atomic transaction. We make

this transaction atomic to avoid any inconsistency in the database. This ensures that either

all or none of the database inserts in the parser code will commit. Let’s say there are 20

database inserts in the parser code for 20 different email addresses, IP addresses, domain names, etc. found in the email. All these inserts, being part of one single atomic transaction, are sequential. This means that the nth insert statement in the parser code on tables tbl email address or tbl uri would start only when the (n-1)th database insert, along with all

the actions initiated by it, is successful. Both the tables tbl email address and tbl uri have

triggers associated with them. These triggers internally implement UUTC protocol and open

a socket connection with the Unix tools server. Now if the client socket at the database server,

started by the (n-1)th database insert inside the parser code, waits for the Unix tools server to

respond, which might take long to return results depending upon the network speed, then all

the database inserts in the parser code right after that would hang, waiting for the Unix tools

server to return results for the (n-1)th insert. This delay from the Unix tools server would keep


Figure 6.1: Communication mechanism between the database server and Unix tools server.

accruing from every database insert inside the parser code, as each database insert opens a

new socket connection with the Unix tools server, making the overall system terribly slow.

This can be visualized in Figure 6.1.

To overcome this performance issue, we, from the Unix tools server side, close the socket

connection between the database server and the Unix tools server right after passing the


required parameters to the Unix tools server through this socket. As a result, we disconnect

the Unix tools server from the database transaction and hence the latency of the Unix tools

server won’t affect the performance of the database process. All the database inserts can

run independently of the Unix tools server. The Unix tools server, after fetching results from the internet for a particular tool, opens a separate ODBC connection with the database server

and stores the results in result tables like tbl verify mx, tbl traceroute, etc. So we gain in the overall performance of the system by closing the socket connection between the database server and the Unix tools server right after the required parameters are passed from the database server. However, closing the socket connection in this manner introduces a state maintenance problem between the database server and the Unix tools server, dealt with in the next chapter.

To further improve the performance of the system, we do the following at the database

server end:-

• In one single email, none of the Unix tools is run more than once for a particular email

address, IP address, URI, domain name, etc., even though the same entity has several occurrences in that email. That means that, for the same entity in an email, the socket

connection with the Unix tools server would not be opened more than once.

• Across different emails uploaded in the system, if a tool, except Tool1 (ESMTP VRFY and EXPN), was already run in the past 10 days for a particular entity, that tool is not run again for the same entity. The results from the old run of that tool on that entity are tagged to it programmatically. An exception to this rule is if the tool’s last result in the past 10 days for that entity was null; in that case the tool is run again.

The rationale behind adopting the above methodology (10 days rule) to improve the

performance of the system is that the results from running tools like whois, dig, traceroute and GeoIP are not expected to change drastically in a period of 10 days. So we save on

resources and time by running these tools just once in a period of 10 days. We

run Unix commands ESMTP VRFY and ESMTP EXPN every single time because


email addresses are ephemeral and we can’t expect the same results from these Unix

commands, for the same email address, even after one single day.
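The 10-day rule above can be expressed as a small decision function. This is an illustrative Python sketch only; in UnMASK the check is done with SQL against the result tables:

```python
from datetime import datetime, timedelta

def should_run_tool(tool, last_run, last_result, now=None):
    """Sketch of the 10-day rule: rerun a tool only if it is Tool1
    (VRFY/EXPN), has never run for this entity, last ran more than
    10 days ago, or last returned nothing."""
    now = now or datetime.now()
    if tool == "tool1":            # email addresses are ephemeral
        return True
    if last_run is None:           # never run for this entity
        return True
    if now - last_run > timedelta(days=10):
        return True
    return last_result is None     # null result: try again
```

The early return for Tool1 encodes the exception stated above: ESMTP VRFY and EXPN are rerun every time regardless of cached results.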


CHAPTER 7

State Maintenance

As discussed in the previous chapter, closing the socket connection right after passing the required parameters to the Unix tools server through this socket gives rise to the state maintenance problem between the database server and the Unix tools server. As soon as the socket connection is closed by the Unix tools server, the database server has no way of keeping track of the success or failure of the tool(s) run at the Unix tools server side. This means that the database server becomes stateless with regard to the Unix tools server.

In this chapter, we discuss why the database server needs to maintain the state of the Unix tools server and how we maintain such a state.

7.1 Why State Maintenance

As already discussed, the aim of this automation process is to generate reports that law

enforcement agencies would use to analyze an email for further investigations. Since the

database process and the Unix tools server process get disconnected by closing the socket

connection right after the required parameters are passed from the database server side to

the Unix tools server side, the email parser code keeps executing all its database inserts and

returns an OK signal to the User Interface right after all the database inserts are successfully

completed. At this point of time, even though the database server completed all its activities,

it is not sure whether the Unix tools server serviced all the requests sent by it. The User

Interface, on receiving an OK signal from the database server, can immediately show a screen having links for the generation of different kinds of reports (see Figure 9.2 for a report

segment). In this screen, the user can click on various links to see reports on various tool


results. However, if the Unix tools server is still not done with its job(s) on the email for

which the user wants to see reports, this page won’t show any results as the result data

has not yet been completely populated by the Unix tools server in result tables. This would

result in an ambiguity in the system. To avoid this ambiguity, there is a need for the database

server to know the state of the Unix tools server. The database server should send an OK

signal to the User Interface only when the Unix tools server has serviced all the requests sent

from the database server side for a particular email. So we had to design a methodology to deal with this state maintenance problem, explained in the next section.

7.2 Accomplishing State Maintenance

To solve the state maintenance problem, we refer to Transaction 2 in Figure 5.2. We created

two database tables: table tbl concurrent db (Table 4.16), having columns unmask id and dbcount, and table tbl concurrent unix (Table 4.17), having columns unmask id and unixcount. Column dbcount in table tbl concurrent db stores the number of database inserts done in the email parser code at the database server side for a particular unmask id (email), and column unixcount in table tbl concurrent unix stores the number of requests the Unix tools server serviced for that unmask id (email). For a particular unmask id (email), the value stored in column dbcount in table tbl concurrent db is incremented right after any insert statement inside the parser code, and the value of column unixcount in table tbl concurrent unix is incremented by the Unix tools server right after it finishes running a tool. Fundamentally, for a particular

email, if the values of columns dbcount and unixcount in tables tbl concurrent db and

tbl concurrent unix respectively are equal, that means the Unix tools server has serviced

all the requests sent by the database server for a particular unmask id (email). Also, the

Unix tools server would still increment the ’unixcount’ value even if the tool fails and doesn’t

return any results.

So, before updating tables tbl email address and tbl uri and sending an OK signal to the

User Interface, the database server keeps checking repeatedly in a loop whether the values

of ’dbcount’ and ’unixcount’ are equal for that email. This loop can be visualized as follows:

START LOOP


. IF dbcount = unixcount THEN

. BREAK AND SEND OK TO UI

. ELSE

. LOOP AGAIN

. END IF

. END LOOP

As seen in the code snippet above, the loop keeps executing until the dbcount and the unixcount values in tables tbl concurrent db and tbl concurrent unix are equal for a particular email (unmask id). However, after some research and testing, we figured out the following flaws in this method:-

• This loop would keep executing forever if the Unix tools server crashed for any reason. In that scenario, the dbcount and unixcount values would never be equal and hence the whole system would hang.

• This loop is a ’Busy Wait’: it executes continuously, consuming a lot of CPU resources.

To solve these problems, we modified our loop as follows:

START LOOP

. IF dbcount = unixcount THEN

. BREAK AND SEND OK TO UI

. ELSE

. IF Unix Tools Server Still Alive THEN

. SLEEP 5 SECONDS

. LOOP AGAIN

. ELSE

. BREAK AND SEND ERROR CODE TO UI

. END IF

. END IF


. END LOOP

As seen in the modified code above, we added two extra conditions. These are enumerated

as follows:-

1. The first one, i.e. ”IF Unix Tools Server Still Alive THEN”, checks whether the Unix tools server is still alive before re-iterating the loop. This check is done by pinging the Unix tools server on each iteration. If the Unix tools server crashed for any reason, the loop ends and an appropriate error code is sent to the User Interface.

2. The second condition, i.e. ”SLEEP 5 SECONDS”, pauses the loop for 5 seconds to avoid busy waits and save CPU resources. PostgreSql has a function called pg sleep; however, it is not available in the version of PostgreSql that we are using: pg sleep is available in PostgreSql 8.2 and above, and we use PostgreSql 8.1 in UnMASK. As a workaround, to implement SLEEP, we call a User Defined Function in Perl that runs the Perl sleep command for 5 seconds. In the future, when we upgrade our database to a newer version of PostgreSql, we will remove the call to the Perl User Defined Function and use pg sleep.
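The modified loop can be sketched in Python. This is illustrative only: `get_dbcount`, `get_unixcount` and `server_alive` are hypothetical stand-ins for the SQL count queries and the ping check, and in UnMASK the loop itself runs inside a PostgreSql function:

```python
import time

def wait_for_tools(get_dbcount, get_unixcount, server_alive,
                   interval=5, sleep=time.sleep):
    """Sketch of the state-maintenance loop: poll until the Unix
    tools server has serviced every request (unixcount == dbcount),
    sleeping between polls, and bail out with an error if the
    server dies."""
    while True:
        if get_dbcount() == get_unixcount():
            return "OK"            # all requests serviced
        if not server_alive():     # daemon crashed: give up
            return "ERROR"
        sleep(interval)            # avoid a busy wait
```

The sleep between polls and the liveness check correspond to the two conditions added above.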


CHAPTER 8

Tagging Unix Tools Result with an Email

Referring to Figure 5.2, Transaction 3 is used to tag the results of tools running in the Unix

tools server to a particular email (unmask id). These results are stored in result tables like

tbl verify mx, tbl dig, tbl traceroute, tbl whois and tbl country in the database. Before we

go ahead and explain this tagging process, we first need to understand why this tagging is

not done at the Unix tools server side when it stores the tools result in the database. Also,

as explained in Section 5.2, the database server and the Unix tools server are disconnected and hence there is no way the Unix tools server can maintain the state of the database server. Now,

at the time when the Unix tools server opens an ODBC connection with the database server to store the tool results in the result tables, it has the unmask id and the name of the entity of the email for which a tool was run. Using the same ODBC connection, the Unix tools server could update table tbl email address with column vrfy mx id and table tbl uri with columns traceroute id, dig id, whois id and country id. However, since all the inserts in Transaction 1 (see

Figure 5.2) are part of the main transaction that started with an insert in table tbl email,

all these inserts are not visible outside this main transaction. For example, suppose the Unix tools server runs traceroute for unmask id 967 and canonical name yahoo.com, and then opens an ODBC connection with the database to update table tbl uri with column traceroute id. At this point of time, there is a strong possibility that the main database transaction that we just talked about has not completed yet. This means that the record for unmask id 967 and canonical name yahoo.com in table tbl uri is still not visible to the Unix tools server, so the Unix tools server may not be able to tag the traceroute result for yahoo.com for unmask id 967 in table tbl uri. This holds true for the other tools as well. Thus, this tagging is not done at

the Unix tools server side. Instead of that, we have Transaction 3 (see figure 5.2) dedicated

for this tagging job.


We also need to understand the way results from the Unix tools server are stored in the result tables like tbl verify mx, tbl traceroute, tbl dig, tbl whois and tbl country. In these tables, there is no unmask id column, so we can’t associate any tool result with an email directly from these tables. Also, records in these tables are not duplicated. For example, suppose we already have a ’whois’ record for yahoo.com in tbl whois and whois is run again on yahoo.com. If the latest whois result is the same as the one already stored in the table, it is not stored again. If it is different, a new record is inserted with a new whois id and the latest whois result. The same holds true for the other result tables. This way, we save a lot of database storage space by not storing redundant information.
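The duplicate-avoidance rule can be sketched as follows. This is a Python illustration with an in-memory list of rows standing in for a result table such as tbl whois; the real logic in UnMASK is SQL:

```python
def store_result(table, entity, result):
    """Sketch of duplicate avoidance in result tables: if the newest
    stored result for the entity is identical, reuse its id;
    otherwise insert a new record with a fresh id. `table` is a list
    of (id, entity, result) rows standing in for e.g. tbl_whois."""
    latest = None
    for row in table:                      # newest matching record
        if row[1] == entity:
            latest = row
    if latest is not None and latest[2] == result:
        return latest[0]                   # unchanged: no new row
    new_id = (table[-1][0] + 1) if table else 1
    table.append((new_id, entity, result))
    return new_id
```

Returning the existing id when the result is unchanged is what lets the tagging step (Chapter 8) reuse one stored record across many emails.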

Now that we understand why the process of tagging a tool result with an email is not

done at the Unix tools server side, we explain how this tagging is done at the database

side. Referring to Figure 5.2 again, transaction 3 has unmask id and name of the entity on

which the tool was run, and using this information it can update tables tbl email address

and tbl uri. For a particular entity (columns email local and email domain in table

tbl email address and column canonical name in table tbl uri), this transaction picks the

latest ID’s of tool results from various result tables and updates tables tbl email address

and tbl uri with these ID’s. Column vrfy mx id is updated for table tbl email address and

and columns traceroute id, dig id, whois id and country id are updated for table tbl uri.

Therefore, at the end of Transaction 3, tools result are tagged with entities in an email

in tables tbl email address and tbl uri and can be used for report generation at the User

Interface level.


CHAPTER 9

Case Study

The user interface for UnMASK supports a case management system for the uploading of email

files for analysis as well as generation of reports based on information stored in the database.

UnMASK uses password-based user access control. In order to submit an email, the user

first logs into the system. The user can submit an email file (in eml format) by browsing for it

locally and uploading it as part of a new or an existing case. After the email is deconstructed

and processed as discussed in the preceding chapters, the user is able to view the generated reports. Figure

9.1 illustrates the case management screen of the UnMASK user interface. The three cases

that user liu is investigating are listed, and further information on each can be accessed by

clicking on the case name. The implementation of the user interface is through an interactive

web-based infrastructure rendered using dynamic web pages written in Sun Microsystems’

Java Server Pages (JSP) technology. The application logic in a JSP page uses the Java

Database Connectivity (JDBC) API to create dynamically-generated HTML output from

the contents of the database by providing a call-level API for SQL-based database access.

Also, within a JSP page, we have HTML code that displays static text and graphics. When

the page is displayed in a user’s browser, it contains both static HTML content and dynamic

information retrieved from the database about his or her specific case.

Reports are designed to support law-enforcement in analyzing email components. For

example, the sender email address in the raw email being analyzed may have been forged,

or a URL in the rendered email may be redirecting the recipient to a website different from

what is commonly inferred from its name. As part of requirements analysis, a brief survey

was done to ascertain what investigators would ideally like to see in a report. Some of these desired features were determined to be: for each email address found in the phishing email, determine the MX record for its domain, and also the results of executing ESMTP EXPN and ESMTP VRFY on the mail server, clearly mentioning what field (i.e., ”From”, ”Cc”, ”Bcc”, etc.) the particular email address was found in; determine the IP address of the originating machine, and run the network utilities traceroute, dig, and whois on this address; and for each IP address/URL specified anywhere in the body of the raw email, again run the aforementioned network utilities. The reports that UnMASK generates include all the above information organized in a structured fashion discussed in the next subsection.

Figure 9.1: UnMASK User Interface

See Figure 9.2 for a portion of a report illustrating the Registrant analysis of a domain

name found in the email body.
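The mapping from extracted items to tool invocations described above can be sketched as follows. This is a hypothetical Python sketch, not UnMASK's implementation; it only constructs the command lines (as a real implementation might pass to a process runner) rather than executing them, since the tools require network access.

```python
# Hypothetical sketch of mapping each item found in an email to the Unix tool
# invocations described above. The "kind" classification and the exact flag
# choices are assumptions for illustration; commands are built but not run.

def plan_lookups(item, kind):
    """Return the tool command lines to run for one extracted item.

    kind is "email", "ip", or "url" -- a simplification for this sketch.
    """
    if kind == "email":
        # MX record lookup and registrant lookup for the address's domain.
        domain = item.split("@", 1)[1]
        return [["dig", "+short", "MX", domain], ["whois", domain]]
    if kind == "ip":
        return [["traceroute", item], ["dig", "-x", item], ["whois", item]]
    if kind == "url":
        # Crude host extraction from a URL, sufficient for the sketch.
        host = item.split("//", 1)[-1].split("/", 1)[0]
        return [["traceroute", host], ["dig", host], ["whois", host]]
    raise ValueError(f"unknown kind: {kind}")
```

Each returned command list corresponds to one row of tool output that would be stored in the database and surfaced in the report.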

The report follows the structure of an email message. Starting with the email header, the
report shows each header field isolated for clarity and coupled with the information gathered
by the Unix tools.

Figure 9.2: Segment of a Report

This additional information expands the investigator's understanding of that field. For
example, the trace fields "Received:" would appear together with an analysis of the sending
and receiving mail hosts (IP address, domain name, traceroute result, DNS and whois
records, etc.). As mail hosts (represented by name or IP

address), email addresses, website links, and other items appear in different sections of the
report, so does the information gathered about them. This provides the investigator with as
much information as possible and aids in deciding which forensic leads to follow further.

To better understand the UnMask report structure and how it may be used by law
enforcement, we present an example section of a report that provides detailed forensic
information on the "Received:" fields in an email header. Each email message carries

in its header a set of "Received:" fields (the set can be empty), which collectively describe
the route the message took from sender to recipient at the mail relay server level. It is
important to note that, in order to mislead the recipient (or investigator) about where a
message originated, senders of spam and phishing messages commonly forge the first few
"Received:" fields. Even so, the set of "Received:" fields still contains the true portion of the
path the message took, so it provides valuable information for a law enforcement
investigation.

As discussed above, an UnMask report follows the structure of an email message. For each
field, we provide additional forensic information gathered by the Unix Tools system. In
particular, for each "Received:" field, we first extract the domain names and IP addresses
of the mail relay servers appearing in the field. To aid the investigation, we then launch the
corresponding Unix tools to determine, among other information, the location and contact
information of the organization (or person) responsible for each domain name or IP address,
the route to the mail relay server, and the IP address corresponding to a domain name (and
vice versa). Discrepancies discovered during the analysis of the "Received:" fields are also
reported and highlighted. The following snippet is an example "Received:" field from an
email message that we received; it contains two domain names (walking14.legessermon.com,
mx.google.com) and one IP address (64.192.31.14). For each of these, we collect the
appropriate forensic information by launching the corresponding tools. The information
shown was collected within one day of receiving the message.

    Received: from walking14.legessermon.com (walking14.legessermon.com [64.192.31.14])
        by mx.google.com with ESMTP id e18si15752160qbe.2007.05.30.10.46.13;
        Wed, 30 May 2007 10:46:24 -0700 (PDT)

Figure 9.3 shows a snapshot of the report section related to the domain name
walking14.legessermon.com found in the example "Received:" field (the snapshot captures
only a segment of it). In this section of the report, we first determine the location and
contact information of the organization responsible for the domain name (partially shown
in the figure), the MX and DNS records for the corresponding domain, the route to the
domain name, and the IP address of the domain name, among other things. We noted that
the IP address returned by our tool is 64.192.31.2, which differs from the one listed in the
"Received:" field for this domain name. No strong conclusion can be drawn from this
discrepancy (note that the two IP addresses are on the same subnet); however, we report
the fact as it is.
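The discrepancy check just described, comparing the listed address against the resolved one and noting whether they share a subnet, can be sketched as follows. This is an illustrative sketch (an assumption, not part of UnMASK); the /24 prefix is chosen only because both example addresses fall in the same /24 network.

```python
import ipaddress

# Illustrative sketch of the discrepancy check described above: does the IP
# address listed in a "Received:" field match the one a lookup tool returns,
# and if not, do the two at least fall in the same network? The /24 prefix
# is an assumption made for this example.

def compare_addresses(listed, resolved, prefix=24):
    """Compare a listed and a resolved IP address; report match and subnet."""
    same_net = (
        ipaddress.ip_network(f"{listed}/{prefix}", strict=False)
        == ipaddress.ip_network(f"{resolved}/{prefix}", strict=False)
    )
    return {
        "match": ipaddress.ip_address(listed) == ipaddress.ip_address(resolved),
        "same_subnet": same_net,
    }
```

For the pair discussed above, the sketch flags a mismatch but confirms the shared subnet, mirroring the cautious conclusion drawn in the report.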

We collected analogous information for the IP address 64.192.31.14 and the domain name
mx.google.com; we do not discuss those results here due to space limitations. We note,
however, that the location and contact information returned from probing IP addresses
tend to be more long-lived and reliable than those returned from probing domain names.
Domain names (especially for phishing sites) and their associated registration information
tend to be short-lived, whereas IP address allocation is nowadays normally delegated to
ISPs and is quite stable. Three days later we re-ran the tools to generate another report on
the message, including the domain name walking14.legessermon.com and its associated
registration and contact information. The resulting information turned out to be the same;
had it been different, further investigation might have been warranted.


Figure 9.3: Part of a Header Field Report


CHAPTER 10

Conclusion

This thesis presented a method for automating the analysis of an email using a database.
Using database triggers and the Perl language interface in PostgreSQL, emails can be parsed
on the fly and a socket connection opened to a daemon (the Unix tools server) running
separately on another machine. We described the UUTC protocol that we developed for
this automation process and used it to establish a communication channel between the
database server and the Unix tools server. Upon accepting the required parameters from
the database server through the socket connection, the Unix tools server runs various tools
such as traceroute, dig, and whois, and stores the results back in the database. Reports
presenting the tool results are then generated and used to analyze an email for further
investigation.
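The database-to-daemon exchange summarized above can be sketched in miniature. The following is an illustrative Python sketch, not the actual UUTC implementation (which uses a PL/Perl trigger inside PostgreSQL); the newline-delimited "tool|argument" message format and the stub tool table are assumptions made for this example.

```python
import socket
import threading

# Illustrative sketch of the database-server / tools-server exchange described
# above. The real system uses PL/Perl inside PostgreSQL and the UUTC protocol;
# the "tool|argument" wire format below is an assumption for this sketch.

TOOLS = {
    # Stand-ins for traceroute, dig, whois, etc.
    "echo": lambda arg: f"echo:{arg}",
}

def serve_once(host="127.0.0.1", port=0):
    """Start a one-shot tools server; return the port it is listening on."""
    srv = socket.socket()
    srv.bind((host, port))
    srv.listen(1)
    actual_port = srv.getsockname()[1]

    def handle():
        conn, _ = srv.accept()
        with conn:
            request = conn.recv(1024).decode().strip()
            tool, _, arg = request.partition("|")
            result = TOOLS.get(tool, lambda a: "unknown tool")(arg)
            conn.sendall(result.encode())
        srv.close()

    threading.Thread(target=handle, daemon=True).start()
    return actual_port

def query_tool(port, tool, arg):
    """What the database trigger does: open a socket and send parameters."""
    with socket.create_connection(("127.0.0.1", port)) as conn:
        conn.sendall(f"{tool}|{arg}\n".encode())
        return conn.recv(4096).decode()
```

In the real system the returned result is not handed back to a caller but stored in the database, where the report generator picks it up.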

There is, however, scope for improvement; several possibilities are discussed in the next
section.

10.1 Future Work

In the future, the following could be incorporated into the automation process described in
this thesis to make it more efficient:

• Since PostgreSQL is an open-source database, the UUTC protocol could be
implemented directly in the PostgreSQL source code. Instead of the PostgreSQL
engine interpreting the client socket code every single time, ready-to-use libraries
would be available in the database itself. Embedding the UUTC code in the
PostgreSQL source and providing precompiled UUTC libraries would avoid repeating
the same expensive interpretation several times for a single email, further improving
the performance of the automation process.

• Since the analysis of a raw email is done manually by studying the generated reports,
reaching a conclusion on whether an email is a phishing email is slow. Incorporating
the concept of uncertain databases [13][14] would allow the system itself to judge
whether an email is a phishing email, giving a human analyst substantial help during
the analysis.


REFERENCES

[1] Sam Spade. http://www.pcworld.com/downloads/file/fid,4709-order,1-page,1-c,spamblockers/description.html.

[2] Domain Tools. http://www.domaintools.com.

[3] Sudhir Aggarwal, Jasbinder Bali, Zhenhai Duan, Leo Kermes, Wayne Liu, Shahank Sahai, and Zhenghui Zhu. The design and development of an undercover multipurpose anti-spoofing kit (UnMASK). Annual Computer Security Applications Conference (ACSAC), December 2007.

[4] Auto Admin. http://research.microsoft.com/dmx/autoadmin/.

[5] Phisherman, SPARTA Inc. http://www.isso.sparta.com/documents/phisherman.pdf.

[6] PostgreSQL. http://www.postgresql.org.

[7] IPGEO Tools. http://www.ipgeo.com.

[8] CPAN. http://www.cpan.org.

[9] SQL Injection. http://en.wikipedia.org/wiki/SQL_injection.

[10] RFC 2822. http://www.ietf.org/rfc/rfc2822.txt.

[11] Database Cursors in PostgreSQL. http://www.postgresql.org/docs/8.2/static/sql-declare.html.

[12] PostgreSQL: Server Programming Interface. http://www.postgresql.org/docs/8.2/static/spi.html.

[13] Laks V. S. Lakshmanan and Nematollaah Shiri. A parametric approach to deductive databases with uncertainty. IEEE Transactions on Knowledge and Data Engineering, 13(4):554–570, July/August 2001.

[14] S. McClean, B. Scotney, and M. Shapcott. Aggregation of imprecise and uncertain information in databases. IEEE Transactions on Knowledge and Data Engineering, 13(6):902–912, November/December 2001.


BIOGRAPHICAL SKETCH

Jasbinder S. Bali

Jasbinder S. Bali was born in Kashmir, India, in November 1980. He received his B.Tech.
in Electrical Engineering from the National Institute of Technology, Jamshedpur, India,
graduating in July 2002. He came to the USA in August 2005 to pursue a Master's in
Computer Science at The Florida State University.

Before pursuing his Master's in Computer Science, Jasbinder worked in the information
technology industry for about three years, serving as a Software Engineer with Computer
Sciences Corporation, New Delhi, India, and Patni Computer Systems, Bombay, India,
from October 2002 to July 2005, on various software design and development projects.
Currently, he is working with Dr. Sudhir Aggarwal at FSU's Electronic Crime Investigative
Technologies Laboratory on a phishing email project called UnMASK. He is also working
toward the completion of his Master's in Fall 2007.

Jasbinder is the current president of Florida State University's Cricket Club and has
captained the team for the past two seasons.
