Florida State University Libraries
Electronic Theses, Treatises and Dissertations
The Graduate School

2007

Automation of Email Analysis Using a Database

Jasbinder Singh Bali
Follow this and additional works at the FSU Digital Library. For more information, please contact [email protected]
THE FLORIDA STATE UNIVERSITY
COLLEGE OF ARTS AND SCIENCES
AUTOMATION OF EMAIL ANALYSIS USING A DATABASE
By
JASBINDER S. BALI
A Thesis submitted to the Department of Computer Science
in partial fulfillment of the requirements for the degree of
Master of Science

Degree Awarded: Fall Semester, 2007
The members of the Committee approve the Thesis of Jasbinder Bali defended on October
10, 2007.
Sudhir Aggarwal
Professor Directing Thesis

Zhenhai Duan
Committee Member

Piyush Kumar
Committee Member
Approved:
David Whalley, Chair
Department of Computer Science
Joseph Travis, Dean, College of Arts and Sciences
The Office of Graduate Studies has verified and approved the above named committee members.
ACKNOWLEDGEMENTS
I would like to thank my advisor Dr. Sudhir Aggarwal for giving me an opportunity to
work on UnMASK. I would also like to thank Leo Kermes and Zhenghui Zhu who worked
hand in hand with me on this project, for all their help and support.
— Jasbinder S. Bali
TABLE OF CONTENTS
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2. Background and Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 4
3. Overview of UnMASK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
   3.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
   3.2 Proof of Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
   3.3 Software Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
   3.4 Email Parser . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
   3.5 Unix Tools Server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
4. Database Server: Automation Engine for UnMASK . . . . . . . . . . . . . . . 11
   4.1 Why PostgreSQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
   4.2 UnMASK Database Design . . . . . . . . . . . . . . . . . . . . . . . . . 13
       4.2.1 Raw Email Analysis: Basis of DB Design . . . . . . . . . . . . . . 13
       4.2.2 Database Design Factors . . . . . . . . . . . . . . . . . . . . . . . 16
       4.2.3 Final Database Design . . . . . . . . . . . . . . . . . . . . . . . . 19
5. Automation Using a Database . . . . . . . . . . . . . . . . . . . . . . . . . . 46
   5.1 UnMASK Unix Tools Connection Protocol (UUTC) . . . . . . . . . . . . 46
   5.2 Automating Email Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 46
6. Performance Improvement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
7. State Maintenance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
   7.1 Why State Maintenance . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
   7.2 Accomplishing State Maintenance . . . . . . . . . . . . . . . . . . . . . . 56
8. Tagging Unix Tools Result with an Email . . . . . . . . . . . . . . . . . . . . 59
9. Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
10. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
    10.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
BIOGRAPHICAL SKETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
LIST OF TABLES
3.1 UnMASK Unix Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
4.1 tbl_users . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.2 tbl_email . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.3 tbl_l_header . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.4 tbl_ul_header . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.5 tbl_ul_header_received . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.6 tbl_email_address . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.7 tbl_uri . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.8 tbl_href . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.9 tbl_website . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.10 tbl_source_master . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.11 tbl_whois . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.12 tbl_traceroute . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.13 tbl_dig . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.14 tbl_country . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.15 tbl_vrfy_mx . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.16 tbl_concurrent_db . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.17 tbl_concurrent_unix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.18 UnMASK Triggers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
LIST OF FIGURES
3.1 UnMASK: Software Architecture . . . . . . . . . . . . . . . . . . . . . . . . 9
4.1 Sample Raw Email . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.2 UnMASK: Tables, Triggers and Dataflow . . . . . . . . . . . . . . . . . . . . 19
4.3 User Defined Function sp_fetch_tools_result . . . . . . . . . . . . . . . . . 41
4.4 Trigger Function func_client_socket . . . . . . . . . . . . . . . . . . . . . 44
5.1 Workflow: Automating email analysis . . . . . . . . . . . . . . . . . . . . . 47
5.2 Implementation of User Defined Function sp_email_parser . . . . . . . . . 48
6.1 Communication mechanism between the database server and Unix tools server. 52
9.1 UnMASK User Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
9.2 Segment of a Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
9.3 Part of header field report . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
ABSTRACT
Phishing scams, which use emails to trick users into revealing personal data, have become
pandemic worldwide. Analyzing such emails to extract the maximum information about
them, and making intelligent forensic decisions based on that analysis, is a major task
for law enforcement agencies. To date, such analysis has been done by manually checking
the various headers of a raw email and running various Unix tools on its constituent parts,
such as IP addresses, links, and domain names. This thesis describes the design and development of a
database system used for automation of a system called the Undercover Multipurpose Anti-
Spoofing Kit (UnMASK) that will enable investigators to reduce the time and effort needed
for digital forensic investigations of email-based crimes. It also describes how the database
is used to perform such automation. UnMASK uses a database for organizing a workflow
to automatically launch Unix tools to collect additional information from the Internet. The
retrieved information is in turn added to the database. UnMASK is a working system. To
the best of our knowledge, UnMASK is the first comprehensive system that can automate
the process of analyzing emails using a database and then generate forensic reports that can
be used for subsequent investigation and prosecution.
CHAPTER 1
Introduction
Phishing scams, which use emails to trick users into revealing personal data, have become
pandemic worldwide. Analyzing such emails to extract the maximum information about them,
and making intelligent forensic decisions based on that analysis, is a major task for law
enforcement agencies. To date, such analysis has been done by manually checking the various
headers of a raw email and running Unix tools such as traceroute, whois, and dig to fetch
information about its IP addresses, domain names, and links from the Internet. Currently,
programs and tools such as Sam Spade[1] and Domain Tools[2] share the same goal of running
Unix tools like traceroute, dig, and whois on these IP addresses and domain names. However,
this process is not a complete system in itself: it still needs human effort to parse raw emails
manually and then feed their constituent IP addresses, email addresses, domain names, etc. to
these Unix tools. Also, there is no correlation between the results obtained from various runs
of the Unix tools, and the data is not stored in a central repository for future reference.
This thesis describes the design and development of a database system that leads to the
creation of a self-sufficient system called UnMASK[3]. UnMASK aims at parsing a raw email,
breaking it into its constituent headers and body, picking out the IP addresses, email addresses,
domain names, and links found in them, and then automatically invoking a Unix tools server
that runs various Unix tools such as traceroute, dig, whois, and GeoIP to fetch information
from the Internet and store it back into the database. The database system described in this
thesis automates all the processes that take place in UnMASK, making UnMASK a self-sufficient
system for analyzing emails. We accomplish this automation using the PostgreSQL database
and a novel use of database triggers to create a workflow manager, together with the automated
use of Unix tools to retrieve additional desired information from the Internet and store it
into the database.
Chapter 2 reviews the background and related work in the fields of database-driven
automation and phishing-email analysis. Being aware of the current trends and techniques
in these fields gives a fair idea of our direction of research and of the methodologies we use
to design the database to accomplish the goal of UnMASK.
Chapter 3 gives an overview of the UnMASK project. Here we discuss its problem statement
and various plausible solutions from the database point of view. Then we describe the
final solution adopted for accomplishing UnMASK's goal of automatic email analysis. Before
explaining the database-driven automation, we also outline UnMASK's software architecture
to give an idea of the overall system. The Email Parser and the Unix Tools Server are the two
major components of UnMASK around which we designed our database system, so it is
necessary to understand these two components before discussing the database design.
Following [3], we explain these two important components of UnMASK in Sections 3.4 and
3.5. It should be noted that the design and development of the Email Parser and the Unix
Tools Server used in UnMASK are not part of this thesis work.
PostgreSQL is a powerful database system, in part because of its interfaces for various
procedural languages. In Chapter 4, we discuss the reasons for choosing the PostgreSQL
database for UnMASK, and we discuss the UnMASK database design in detail.
In Chapter 5, we discuss in depth the concept of automating email analysis using a
database. Section 5.1 describes the UnMASK Unix Tools Connection (UUTC) protocol. This
protocol is the basis of the database-driven automation in UnMASK and defines the connection
mechanism between the database server and the Unix tools server. In Section 5.2, we discuss
the process of automating email analysis in detail.
Performance is a major issue in database-driven automation: the various processes running
to achieve this automation in UnMASK can make the system very slow. In Chapter 6, we discuss
how the database-driven automation is made efficient and how a speed-up is achieved in this
process.
The database server and the Unix tools server are two distinct entities with various inter-related
processes running at both ends. In Chapter 7, we describe how the process of database-driven
automation in UnMASK maintains state between the database server and the Unix tools server,
and why maintaining such state is needed.

Results obtained after running the various tools in the Unix tools server should be properly
tagged to the various entities1 of an email. In Chapter 8, we discuss how results obtained from
the Unix tools server are tagged with the various entities of an email.
Chapter 9 illustrates a case study and explains the results obtained from the UnMASK
system after running it on a particular email. Here we discuss the various reports that can
be generated after a raw email has been automatically parsed and the results obtained from the
Internet by running the various Unix tools have been stored back into the database. These
reports are used by law enforcement officers to analyze the email and make forensic decisions
accordingly.

In Chapter 10, we give a summary of the contributions of this thesis work. We also give
insight into future work that could be conducted to add new features to the system
and further improve its performance.

1In this thesis, we use the word 'entity' for items such as an email address, IP address, domain name, link, URL or URI found in an email.
CHAPTER 2
Background and Related Work
The goal of the UnMASK project is to create a fully automated, all-encompassing tool for
processing emails that are submitted to or acquired by law enforcement agencies as potential
phishing emails. Such email messages sometimes contain spoofed, concealed, incomplete, or
otherwise flawed information, and are generally untrustworthy. In current practice, extracting
reliable and useful information from an email message requires a computer forensics expert
who understands the complexity of the Internet email protocols and the loopholes in current
email messaging systems. UnMASK aims to provide automated email processing capability
to law enforcement agents to help them discern and filter any flaws or fraudulent information,
reveal the true nature of the email, and present derived evidence or leads to track down the
transgressors. In UnMASK, we chose a database as the automation engine for email analysis.

In the current UnMASK system, as soon as a raw email is submitted, it is automatically
parsed in the database, and various network querying tools available to Unix users, including
but not limited to dig, whois, and traceroute, are automatically invoked at appropriate times
through the database itself. The database acts as a full-fledged automation engine for the
whole system.
Currently, there are some working systems that involve database automation. Some of
them are the following:

1. Microsoft's AutoAdmin[4] project is an effort to make database systems self-tuning and
self-administering. With this automation, the database tunes itself instead of relying on
applications to track and tune it, and thus becomes more responsive to application needs.
This project is used to self-tune components of Microsoft SQL Server.
2. As far as analyzing emails is concerned, some projects and tools have been developed.
Tools and websites such as Sam Spade[1] and Domain Tools[2] share a goal similar to ours
and are used interactively, to various degrees, by the law enforcement community. They
provide network-query functionality that lets users probe domain names, IP addresses, etc.
Sam Spade, for example, lets users crawl websites to pull out a list of email addresses and
links. These tools also let users analyze email headers to determine whether an email message
was sent from a valid address or forwarded via an open relay to cover the sender's tracks.
However, these tools expect reasonable networking expertise from the user. More importantly,
they neither sufficiently automate the work nor provide a database for further analysis.
3. SPARTA's Phisherman project[5] is more closely related to UnMASK, in that both
employ a database as a central repository. However, Phisherman is a global effort simply
to collect and archive data related particularly to phishing scams and disseminate this data
to its subscribers. In contrast, our goal is to help users in a more direct way, i.e., to provide
them an automated tool for processing emails.
CHAPTER 3
Overview of UnMASK
The design and development of the database system described in this thesis is driven by
the goal of UnMASK. The database system is designed so that it fits the requirements of
UnMASK and works hand in hand with the other components of the system. It is therefore
important to understand the goal of UnMASK and the other components around which we
designed the database. In this chapter, we discuss the problem statement of UnMASK, the
proof of concept for the final solution, and the other major components of UnMASK, namely
the Raw Email Parser and the Unix Tools Server. Please note that the Raw Email Parser
and the Unix Tools Server are not part of this thesis work. However, these two entities go
hand in hand with the database server described in this thesis, so it is essential to get an
idea of them before discussing the various aspects of the database server.
3.1 Problem Statement
UnMASK aims at automating the analysis of emails. This is done by automatically
parsing a raw email, invoking various Unix tools such as dig, traceroute, and whois, and
storing the information retrieved by these tools back into the database. Appropriate reports
should be generated automatically, giving a full-fledged analysis of the email. The whole
system should be a one-click process.
6
3.2 Proof of Concept
To begin with, a raw email should be parsed and broken down into its constituent headers
and body. From these headers and the body, entities such as IP addresses, email addresses,
and domain names would be picked out and stored in database tables, and Unix tools such as
traceroute, dig, and whois, and ESMTP commands such as VRFY and EXPN, should run
automatically and give an analysis of the email in the form of detailed reports. A proof of
concept had to be made that would explain a feasible way to achieve this automation.

Our initial approach to running the Unix tools automatically on email addresses, domain
names, and IP addresses was to have a separate process watch for inserts into the tables where
these entities are stored. This process would pick up the email address, domain name, or IP
address from newly inserted records not yet serviced by the Unix tools server and run Unix
tools such as traceroute, dig, and whois on them. We considered implementing this with a
periodic cron job or with the NOTIFY/LISTEN commands in PostgreSQL[6]; this is the
conventional approach in such a scenario. However, after thorough research, a major flaw
in this approach was highlighted: the cron job would constantly check whether any new record
had been inserted into the database tables. It would thus be an infinite loop polling for new
records, causing undesirable thrashing under contention. From the operating system's point
of view, this approach proved to be dangerous.
After some more research, we finally decided to use database triggers in a novel way to
automate the whole process. This method is unconventional and highlights how elegantly
triggers can be used in PostgreSQL. The final proof of concept is as follows: as soon as a raw
email is stored in the database, an action is triggered to automatically parse the email and
store entities such as email addresses, IP addresses, domain names, links, URLs, and URIs in
separate database tables. These tables can in turn trigger actions that automatically connect
to the Unix tools server, passing the parameters required by the tools server to run various
Unix tools such as traceroute, dig, and whois, and store the results from the tools server
back in the database. Using database triggers to automatically parse the email and then
automatically connect to the Unix tools server was the final proof of concept for automating
the email analysis.
3.3 Software Architecture
In this section, we describe the architecture of the UnMASK system and show how its
components interact. Figure 3.1 shows the interaction of the User Interface, the PostgreSQL
database, and the Unix tools system. As seen in the figure, a user uploads an email into the
system through the User Interface. This invokes a server-side JSP script, which opens an
ODBC connection to the PostgreSQL database, and the email is stored in a database table.
At this point, a trigger is invoked, which in turn calls a User Defined Function in the database
that parses the email. The various headers and the body of the email, after being parsed
correctly, are stored in the appropriate tables. The tables in which entities such as IP addresses,
domain names, and email addresses are stored have further triggers associated with them. As
soon as a record is inserted into these tables, a trigger is invoked which calls a User Defined
Function to establish a connection with the Unix tools server, passing it the required
parameters. The Unix tools server, using the parameters sent by the database server, runs
various Unix tools such as traceroute, dig, and whois, and returns the results to the database,
where they are stored in the appropriate result tables.
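The trigger chain described above can be sketched in PL/pgSQL. This is only a minimal illustration, not the thesis's actual code: tbl_email and sp_email_parser are names used elsewhere in this thesis, while the trigger trg_email_insert, the wrapper function, and the parser's assumed signature are purely illustrative.

```sql
-- Illustrative sketch: fire the email parser whenever a raw email is inserted.
-- tbl_email and sp_email_parser appear in this thesis; trg_email_insert and
-- func_on_email_insert are hypothetical names.
CREATE OR REPLACE FUNCTION func_on_email_insert() RETURNS trigger AS $$
BEGIN
    -- Hand the newly stored raw email (identified by its unmask_id)
    -- to the parsing user-defined function.
    PERFORM sp_email_parser(NEW.unmask_id);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_email_insert
    AFTER INSERT ON tbl_email
    FOR EACH ROW
    EXECUTE PROCEDURE func_on_email_insert();
```

The entity tables would carry analogous AFTER INSERT triggers whose functions open a socket to the Unix tools server instead of calling the parser.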
3.4 Email Parser
A major part of the database design is based on how the parsing of an email is done in
UnMASK; various database design decisions were taken on that basis to arrive at the final
design. In UnMASK, various parsers are used to deconstruct a raw email, analyze the email
headers and the body, and extract specific components from the email such as email addresses,
IP addresses, domain names, links, URLs, and URIs. More details on the email parsers used
in UnMASK can be found in [3]. To reiterate, the development of the Email Parser is not
part of this thesis work; however, the database-related transactions inside the parser are.
Figure 3.1: UnMASK: Software Architecture
3.5 Unix Tools Server
The Unix tools server is a daemon that runs programs (tools) invoked by the database server.
Some of the tools developed for UnMASK are shown in Table 3.1 1, which also shows
the parameters required by each tool. The parameter unmask_id is a digital ID generated by
our database server, explained in detail in the next section. The parameters domain and
local_name are the domain-name and user-name parts of an email address, respectively.
source indicates the source of any entity found in the email, and dns_server is an optional
parameter sent to the Unix tools server. Refer to [3] for more details on the Unix tools server
used in UnMASK. Again, the development of the Unix tools server is not part of this thesis
work; however, the database-related transactions inside the server code are.
1Table acquired from [3], Page 6
Table 3.1: UnMASK Unix Tools

Tool Name     Parameters     Function
tool1         unmask_id      To find the mail servers of the domain, and then
              domain         ESMTP VRFY to verify the email address at one of
              local_name     the mail servers.
              source
              [dns_server]
tool2         unmask_id      To find reachability and routes to an IP address
              host           or canonical host name.
tool3         unmask_id      To find registration data for a domain.
              domain
tool4         unmask_id      To get full DNS information.
              domain
              source
              [dns_server]
tool5 (uses   unmask_id      To find the geographical location (currently only
a package     host           country) of an IP address or a canonical host
called                       name.
IPGEO[7])
CHAPTER 4
Database Server: Automation Engine for UnMASK
To implement the UnMASK system, we chose to use the PostgreSQL database. Our
requirements for a database were: (1) the ability to store all email-related data after parsing
it to an appropriate level of granularity, and (2) mechanisms to invoke a toolkit of Unix
tools such as traceroute, dig, and whois to retrieve additional information related to the
email from the Internet. In this chapter, we explain why we chose the PostgreSQL database
for UnMASK. We also discuss in detail the database design, keeping in mind the Email
Parser and the Unix Tools Server explained in Chapter 3.
4.1 Why PostgreSQL
We chose PostgreSQL over other relational database management systems because it is free
and open source and has excellent support for many features, including the following:

1. Native interfaces for procedural languages: PostgreSQL allows user-defined functions
to be written in various programming languages besides the native PL/pgSQL; currently
supported languages include Perl, C, and Python. We use Perl extensively in our database
programming, as Perl packages for email parsing are available on the CPAN[8] website and
Perl is widely considered one of the most useful languages for string manipulation. Other
popular relational databases, such as Oracle and SQL Server, have more limited language
support: SQL Server 2005 supports .NET-compliant languages such as C#, and Oracle
supports Java. So PostgreSQL has an edge over other popular relational database management
systems in its more versatile procedural-language support.
2. Transactional Data Definition Language (DDL): DDL statements build and modify the
structure of tables and other objects in the database; examples are CREATE TABLE,
ALTER TABLE, and DROP TABLE. A sample DDL statement follows:

    CREATE TABLE tbl_l_header (
        unmask_id      int4,
        header_name    text,
        header_content text,
        time_stamp     timestamp
    );
This CREATE TABLE statement creates the structure of the database table tbl_l_header.
Another example of a DDL statement is:
    ALTER TABLE tbl_l_header
        ADD CONSTRAINT fk_tbl_l_header_uid FOREIGN KEY (unmask_id)
        REFERENCES tbl_email (unmask_id);
This ALTER TABLE statement modifies table tbl_l_header, adding a foreign-key constraint
so that its column unmask_id references column unmask_id in table tbl_email.
In Oracle and most other major RDBMSs, DDL statements included in a single transaction
are not atomic. Consider the following transaction as an example:

    BEGIN
        DDL1;
        DDL2;
    END
In Oracle, DDL1 commits right after it is issued, without waiting for the end of the overall
transaction, making the transaction non-atomic. If DDL1 succeeds and DDL2 fails, the
database is left inconsistent whenever there is a dependence between DDL1 and DDL2.
PostgreSQL, by contrast, supports transactional DDL: the above transaction is strictly atomic
and works the same way as transactions containing DML (Data Manipulation Language)
statements such as INSERT and SELECT. In UnMASK, to migrate the database from the
development to the production environment, we need to run various database scripts on the
new database server. These scripts mix DDL statements with a number of DML statements,
so if any DDL statement fails, the whole script aborts instead of creating an inconsistent
database.
3. Versioning of User Defined Functions: Another useful feature of PostgreSQL is its
handling of updates to User Defined Functions. Suppose a User Defined Function is replaced
while an application is running and using it; the application continues to use the old version
of the function instead of the updated one, so the currently running application is not
affected. Applications running on an Oracle database, on the other hand, may crash, hang,
or show undesired output if Stored Procedures are updated while being used by an application.
Note that we do not use the term Stored Procedure in the context of PostgreSQL, because
PostgreSQL has no concept of Stored Procedures; this is explained in detail in Section 4.2.3.2.
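The transactional-DDL behavior described in point 2 can be seen with a short sketch. This is illustrative only: tbl_demo is a hypothetical table name, while tbl_email is a table used in this thesis.

```sql
-- If any statement in this block fails (say, the ALTER TABLE, because the
-- referenced table is missing), PostgreSQL rolls back the whole block,
-- including the CREATE TABLE, so no half-built schema survives.
BEGIN;

CREATE TABLE tbl_demo (unmask_id int4);

ALTER TABLE tbl_demo
    ADD CONSTRAINT fk_demo FOREIGN KEY (unmask_id)
    REFERENCES tbl_email (unmask_id);

COMMIT;
```

In Oracle, each DDL statement would implicitly commit on its own, which is exactly the non-atomic behavior discussed above.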
4.2 UnMASK Database Design
4.2.1 Raw Email Analysis: Basis of DB Design
The UnMASK database is designed to store raw email data and the results retrieved by
running the various Unix tools. So before discussing the actual database structure, we need
to analyze a raw email. A sample raw email is shown in Figure 4.1.
After analyzing the raw email, we identified the following logical divisions in it:

1. Limited header fields: fields in a raw email that appear only once, for example 'from',
'to', 'sender', and 'subject'.

2. Unlimited header fields: fields in a raw email that may appear one or more times, for
example 'cc', 'bcc', 'received' fields, and 'resent' fields.

3. Body: the body of an email can take different forms depending on the MIME-Version
header field. It can be plain text, HTML, or any other valid format defined in RFC 2822.
Various header fields such as 'from', 'to', 'cc', 'bcc', and 'received', along with the body of
the raw email, carry information such as email addresses, domain names, URLs, and URIs,
on which we decided to run the various Unix tools. Keeping the header fields decoupled from
each other in the database, based on their logical division in the raw email, helps in querying
the database efficiently. For example, if limited and unlimited fields are stored together in a
single database table, various SQL queries take longer to fetch results as the number of
records keeps increasing. So it is advisable to divide the header fields among different
database tables for better SQL query throughput.
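As a sketch of the resulting access pattern, each lookup touches only the relevant table. The columns of tbl_l_header come from the CREATE TABLE example in Section 4.1; assuming, purely for illustration, that tbl_ul_header has a similar shape and that 42 is some email's unmask_id:

```sql
-- Fetch the (unique) subject line of one email from the limited-header table.
SELECT header_content
FROM   tbl_l_header
WHERE  unmask_id = 42 AND header_name = 'subject';

-- Fetch every received field of the same email from the unlimited-header table.
SELECT header_content
FROM   tbl_ul_header
WHERE  unmask_id = 42 AND header_name = 'received';
```

Neither query has to scan past the other category's rows, which is the throughput benefit argued for above.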
Among the unlimited header fields, the 'received' fields carry vital information: they are
used to trace the trail of an email from the sending end to the receiving end. Every relay
server between the sending and receiving ends of an email adds one received header field to
it. Let us analyze one such received header field:
    Received: from officialgiftcards.info (OFFICIALGIFTCARDS.INFO [66.240.223.56])
        by mx.google.com with ESMTP id g17si3029410nfd.2007.10.13.15.25.13;
        Sat, 13 Oct 2007 15:25:15 -0700 (PDT)
Here we see that the received header field is further divided into various name-value pairs.
Looking at the first line, which carries information about the relay server, we see that the
'from' field is sub-divided into entities that we term from-from (officialgiftcards.info),
from-domain (OFFICIALGIFTCARDS.INFO), and from-address (66.240.223.56). All these
fields should be stored distinctly in the database to ease the correlation logic used for report
generation and the various other SQL queries within the database code.

As seen in Figure 4.1, all the header fields are logically divided into name-value pairs.
For example, the 'from' field carries information like From: [email protected]. 'Received' fields,
as already discussed, are sub-divided into 'from', 'by', and 'via' in the form of name-value pairs.
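Name-value pairs like these can be pulled apart inside the database itself. The following is a minimal sketch using PostgreSQL's substring with POSIX regular expressions; the query, its aliases, and the inline sample row are purely illustrative and are not the UnMASK schema:

```sql
-- Illustrative only: split one received header line into its sub-fields.
SELECT
    substring(hdr from 'from ([^ ]+)')     AS from_from,     -- officialgiftcards.info
    substring(hdr from '[(]([^ ]+) ')      AS from_domain,   -- OFFICIALGIFTCARDS.INFO
    substring(hdr from '[[]([0-9.]+)[]]')  AS from_address   -- 66.240.223.56
FROM (SELECT 'from officialgiftcards.info (OFFICIALGIFTCARDS.INFO [66.240.223.56])'::text
      AS hdr) AS t;
```

When a capture group is present, substring returns only the captured portion, which is what makes this style of field extraction convenient inside trigger functions.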
Analysis of the body of a raw email shows that the way different fields appear in the body
depends upon the content type of the email. If the email is in plain text format, then all
the entities like email addresses, domain names, URLs etc. are found inline. If the email is
in HTML format, then these entities can exist in the form of clickable links. These links
appear in the form of anchor (<a>) tags. An anchor tag looks as follows:-
. <a href="http://192.145.2.1">Bank of America</a>
Here the sender claims that clicking this link will take the recipient to Bank of
America's website. However, the href part of the anchor tag would actually take the recipient
to some unknown IP address. Such links are very prevalent in phishing emails and extremely
useful for making forensic decisions about an email. Hence, this type of data should be
decoupled from the rest of the email data.
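A sketch of how such anchor tags can be pulled apart, using Python's standard html.parser. This is illustrative only, not UnMASK's actual parser: it collects (href, display text) pairs, and a mismatch between the two (a brand name displayed over a bare IP address) is the phishing indicator described above.

```python
from html.parser import HTMLParser

class AnchorExtractor(HTMLParser):
    """Collect (href, display text) pairs from anchor tags in an email body."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

p = AnchorExtractor()
p.feed('<a href="http://192.145.2.1">Bank of America</a>')
print(p.links)   # [('http://192.145.2.1', 'Bank of America')]
```

The href and display parts can then be stored in separate columns, decoupled from the rest of the email data as argued above.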
4.2.2 Database Design Factors
Based on the raw email analysis discussed in the previous section, various factors were
considered before finalizing the design of the database. These are enumerated as follows:-
1. The database should have the various logical entities decoupled from each other based on
the various headers in the raw email. Decoupling should also be done on the basis of the
body of the email. Also, since 'received' fields carry vital information about the relay
servers, they should be stored separately for easy and efficient analysis of such fields.
2. Email addresses, IP addresses, domain names, URLs, links etc., being the entities on which
the Unix tools run, should be stored separately. The tables in which such entities are
stored should trigger appropriate actions to run the Unix tools on them. Also, these
entities should be properly tagged to a particular email.
3. Results fetched by running the various Unix tools should be stored separately in result
tables. These results should be properly tagged to an email.
4. Data in the result tables should not be redundant. For example, suppose the 'whois' result
for yahoo.com is already in the database and 'whois' is run on yahoo.com again after a
certain period of time. If the result of the current run is the same as what is already
stored, there is no point in storing the same 'whois' result as a separate record in the
database. Instead, the old 'whois' result can be logically tagged to the occurrence of
yahoo.com found in the current email. This saves considerable storage space in the
database.
5. Unnecessary calls from the database server to the Unix tools server should be avoided.
For example, if yahoo.com is found multiple times in an email, a connection to the Unix
tools server should not be established more than once for that entity within the same
email, because tools like whois, traceroute and dig cannot be expected to give different
information within a period of a few seconds or minutes. So if an entity occurs more than
once in a raw email, the connection with the Unix tools server is established only for the
first occurrence of that entity; for the remaining occurrences in the same email, the
results from the first run of the tools are reused. This increases the efficiency of the
database server by avoiding expensive and unnecessary calls to the Unix tools server.
There is one exception to this rule: if the last run of a tool on the same entity stored
a blank result in the database, the tool is run again even though it was already run for
the same entity in the same email.
6. Since network querying tools like whois, traceroute, GeoIP and dig are not expected to
give different results for an entity within a period of 10 days¹, the Unix tools server
should not be contacted more than once in a period of 10 days for the same entity, even
across emails. For example, suppose Email-1 is uploaded into the UnMASK system and the IP
address 192.145.0.1 is found in it; all the network querying tools are run on this entity
by establishing a connection with the Unix tools server. Now say that after 6 days Email-2
is uploaded into the system and the same IP address, 192.145.0.1, is found in it. No
connection with the Unix tools server is established for this entity, because one was
already made within the 10-day period. Only after the 10-day period has elapsed is a
connection with the Unix tools server established for the same entity across different
emails. This rule has 2 exceptions, listed as follows:-
• Since email addresses are ephemeral, the 10-day logic explained above is not used for
email addresses when running Unix commands like ESMTP VRFY and ESMTP EXPN on them.

• If the result of the last run of a tool on an entity was null, the 10-day rule is not
applied and the tool is run again on that entity.
7. As discussed in the previous section, since hrefs carry vital information and need to be
analyzed separately, this entity should be decoupled from the rest of the entities and
stored separately. One of the requirements of the reports displayed in the UnMASK User
Interface (see Chapter 9) is to list the various websites found in an email and the
link(s) under each website. For this purpose, every href, link, URL or URI found in the
email should be deconstructed and its website part stored separately. Also, the websites
and the links from which the website names are taken should be properly mapped to each
other.
8. All database inserts and retrievals should take place using User Defined Functions to
avoid SQL Injection [9]. Also all the business logic inside the database should be
embedded in User Defined Functions.
9. When database inserts occur, they can initiate other database activities through the
use of database triggers. These activities can include parsing fields of records in tables,
initiating a connection to the Unix tools server and entering new records into the tables.

¹Choosing a period of 10 days is just a design decision and is not a standard followed
across various systems.

Figure 4.2: UnMASK: Tables, Triggers and Dataflow
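The reuse rules in points 4 to 6 boil down to a freshness check before contacting the Unix tools server. A minimal sketch in Python follows; the function name and the 10-day threshold constant are illustrative, not UnMASK's actual code.

```python
from datetime import datetime, timedelta

FRESHNESS = timedelta(days=10)   # a design decision, not a standard

def needs_rerun(last_run, last_result, now=None):
    """Decide whether a network tool must be run again on an entity.

    Rerun if there is no prior result, if the prior result was null/blank
    (the exception to the 10-day rule), or if the prior run is older than
    the 10-day freshness window.
    """
    now = now or datetime.now()
    if last_run is None or not last_result:
        return True
    return now - last_run > FRESHNESS

now = datetime(2007, 10, 13)
assert needs_rerun(None, None, now)                                  # never run
assert needs_rerun(now - timedelta(days=6), "", now)                 # blank result
assert not needs_rerun(now - timedelta(days=6), "whois data", now)   # still fresh
assert needs_rerun(now - timedelta(days=11), "whois data", now)      # stale
```

Email addresses, being ephemeral, would bypass this check entirely, as stated in the first exception to point 6.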
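The trigger-driven flow described in point 9 can be illustrated with SQLite from Python. SQLite's triggers are far simpler than PostgreSQL's trigger functions, and the table names here are hypothetical, so this is only a sketch of the idea that an insert initiates further database activity.

```python
import sqlite3

# Illustrative only: an insert into an entity table fires a trigger that
# records a pending job for the Unix tools server.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE tbl_uri (unmask_id INTEGER, canonical_name TEXT);
CREATE TABLE tool_queue (unmask_id INTEGER, entity TEXT, tool TEXT);
CREATE TRIGGER trg_uri_insert AFTER INSERT ON tbl_uri
BEGIN
    INSERT INTO tool_queue VALUES (NEW.unmask_id, NEW.canonical_name, 'whois');
END;
""")
db.execute("INSERT INTO tbl_uri VALUES (1, 'yahoo.com')")
print(db.execute("SELECT * FROM tool_queue").fetchall())
```

In UnMASK the corresponding PostgreSQL trigger functions also parse record fields and open the actual connection to the Unix tools server, as described above.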
4.2.3 Final Database Design
Based on the various design factors explained in the previous section, we designed the
database so that the tables that contain the raw email and its deconstructed components are
"write once". This helps in maintaining an evidentiary trail for subsequent prosecution. We
divided the flow of data in the database into 3 levels, as shown in Figure 4.2.
The various database entities used in UnMASK, such as tables, User Defined Functions and
triggers, are explained as follows:-
4.2.3.1 Tables
As discussed in the previous section, we divided the whole data flow into 3 levels. These
are:-
• Level 1 Tables : These are the initial database tables in the system. They store the raw
email in text format along with all basic information about the email being uploaded and
about the user uploading it. Level 1 database tables are enumerated as follows:-
1. tbl users : This database table stores information about the user uploading an
email into the system. See Table 4.1 for column details.
2. tbl email : This table stores the raw email in text format and generates a unique
ID, called unmask id, for every email uploaded in the system. See Table 4.2 for
column details.
• Level 2 Tables : These tables store the data obtained from the raw email after it is
parsed: limited header fields, unlimited header fields, and entities like email
addresses, IP addresses, domains, links and URLs. Level 2 tables are enumerated as
follows:-
1. tbl l header : All limited header fields in a raw email like ’from’, ’sender’, ’to’,
’subject’ etc. are stored in this table. This table is designed in such a way that
only those limited header fields are stored that exist in the email. See Table 4.3
for column details.
2. tbl ul header : All unlimited header fields in a raw email like 'cc', 'bcc', 'resent-
fields' etc., excluding received fields, are stored in this table. This table is designed
in such a way that only those unlimited header fields are stored that exist in the
email. Since these fields are unlimited in number, a column called seq no is added to
maintain the sequence of such fields within an email. See Table 4.4 for column details.
3. tbl ul header received : As discussed in Section 4.2.1, received fields carry vital
information about the relay servers and need to be analyzed separately. So all the
received fields in a raw email are stored in this table as separate records. If a raw
email has 3 received fields, then 3 different records are inserted in this table with
the value of column seq no ranging from 1 to 3. As shown in Table 4.5, various
columns of this table are chosen in compliance with RFC 2822 [10].
4. tbl email address : This table stores email addresses found anywhere in an email.
The ’source’ column in the table shows which header this email address belongs to.
’Source’ column is mapped with ’source text’ column in table tbl source master
(See Table 4.10).
5. tbl uri : This table stores entities like IP address, domain names, links, URLs,
URIs found anywhere in the email. The ’source’ column in the table shows which
header an entity belongs to. ’Source’ column is mapped with ’source text’ column
in table tbl source master.
6. tbl href : As discussed in Section 4.2.1, hrefs need to be analyzed separately for the
type of information they carry, so a separate database table is created to store href
information. Apart from hrefs, any link, URL or URI found in the email is also stored
in this table. All such entities that are not part of an anchor (<a>) tag have a NULL
value in the 'display' column. See Table 4.8 for column details. Every href, link, URL
and URI has a website part; for example, the website part of the link
http://www.abc.com?ui=123 is www.abc.com, and it gets stored in table tbl website (see
Table 4.9). The 'website id' column of tbl href is mapped to the 'website id' column of
tbl website.
7. tbl website: This table is created for the reasons discussed in Section 4.2.2,
point 7. See Table 4.9 for column details.
8. tbl concurrent db: This table stores the number of times the database server
successfully establishes a connection with the Unix tools server for a particular email
(unmask id). The db count value for the current unmask id is incremented by 1 for every
connection successfully opened with the Unix tools server (right after an insert into
either tbl email address or tbl uri in the parser code). As soon as the parser code
finishes its processing, the db count value for a particular unmask id gives the total
number of connections the database server opened with the Unix tools server for that
email. This value is compared with the unix count column value in Table 4.17 to
maintain state between the database server and the Unix tools server.
• Level 3 Tables : Level 3 tables store the results fetched by running the various Unix
tools on the Unix tools server. These results are stored in the database as-is, without
any further parsing. One of the Level 3 tables, tbl concurrent unix, is used for state
maintenance (see Chapter 7) between the database server and the Unix tools server. As
already explained in Section 4.2.2, point 4, data in the result tables is not
duplicated. Level 3 tables are explained as follows:-
1. tbl verify mx : Results obtained by running the Unix commands ESMTP VRFY and
ESMTP EXPN on an email address are stored in this table. See Table 4.15 for
column details.
2. tbl whois : Result of ’whois’ run on a particular entity is stored in this table. See
Table 4.11 for column details.
3. tbl dig : Result of ’dig’ run on a particular entity is stored in this table. See Table
4.13 for column details.
4. tbl traceroute: Result of ’traceroute’ run on a particular entity is stored in this
table. See Table 4.12 for column details.
5. tbl country : Result of ’GeoIP’ tool run on a particular entity is stored in this
table. See Table 4.14 for column details.
6. tbl concurrent unix : This table stores the number of times a tool on the Unix tools
server finishes its job for a particular email's (unmask id) entity. The unix count
column value for the current unmask id is incremented by 1 each time a Unix tool or
command completes its run on an email entity. As soon as the Unix tools server has
serviced all the requests from the database server for a particular email, the
unix count value for that unmask id gives the total number of tools run by the Unix
tools server for that email. This value is matched with the db count column value in
Table 4.16 to maintain state between the database server and the Unix tools server.
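The complementary counters kept in tbl concurrent db and tbl concurrent unix amount to a simple state check: the email is fully serviced when the number of tool runs completed equals the number of connections opened. A minimal in-memory sketch follows (the real mechanism is described in Chapter 7; these function names are illustrative).

```python
db_count = {}     # unmask_id -> connections opened by the database server
unix_count = {}   # unmask_id -> tool runs completed by the Unix tools server

def open_connection(unmask_id):
    db_count[unmask_id] = db_count.get(unmask_id, 0) + 1

def tool_finished(unmask_id):
    unix_count[unmask_id] = unix_count.get(unmask_id, 0) + 1

def email_fully_serviced(unmask_id):
    # Serviced when every opened connection has a completed tool run.
    return db_count.get(unmask_id, 0) == unix_count.get(unmask_id, 0) > 0

open_connection(1); open_connection(1)
tool_finished(1)
print(email_fully_serviced(1))   # False: one request still outstanding
tool_finished(1)
print(email_fully_serviced(1))   # True
```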
Apart from the Level 1, 2 and 3 tables that take care of the data flow in the UnMASK system,
we also have a helper table in the database, explained as follows:-

1. tbl source master : This table is a master table for the various 'source' column values
used in tables like tbl email address, tbl uri and tbl href. Source means the part of the
raw email to which an entity belongs. For example, a limited header field like
'From: [email protected]' stores an email address; this email address is then stored in table
tbl email address with the value of the 'source' column as 'from'. See Table 4.10 for
column details.
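The website-part deconstruction used for tbl website and tbl href (see also Section 4.2.2, point 7) can be sketched with Python's standard urllib. The handling of scheme-less links is an assumption about the input, not a description of UnMASK's own parser.

```python
from urllib.parse import urlsplit

def website_part(link):
    """Deconstruct a link found in an email into its website (host) part
    and port, the two pieces stored in a table like tbl_website."""
    parts = urlsplit(link)
    # Links in email bodies sometimes lack a scheme ("www.abc.com/x");
    # re-parse with a dummy scheme so the host lands in netloc.
    if not parts.netloc:
        parts = urlsplit("http://" + link)
    return parts.hostname, parts.port

print(website_part("http://www.abc.com?ui=123"))   # ('www.abc.com', None)
print(website_part("www.abc.com:8080/login"))      # ('www.abc.com', 8080)
```

The returned host maps to the 'website name' column and the port to the 'port' column, with the full link retained in tbl href so the two remain mapped to each other.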
Table 4.1: tbl users

Column name   Data type   Description
username      text        A username under which the email is uploaded.
firstname     text        First name of the user uploading the email.
lastname      text        Last name of the user uploading the email.
org           serial      Name of organization of the user uploading the email.
phone         text        Phone number of the user uploading the email.
email         text        Email address of the user uploading the email.
time stamp    timestamp   Date-time when the email was uploaded in the system. Automatically inserted by the system and has a default value of current date-time.

Primary Key: username.
Foreign Key: None.
Table 4.2: tbl email

Column name      Data type   Description
username         text        A username using which the email is uploaded.
casename         text        Name of the case under which the email is uploaded. Each email belongs to a particular case.
email filename   text        Physical name of the raw email file uploaded.
unmask id        serial      Unique identifier automatically generated by the system and assigned to an email uploaded in the system.
raw email        text        Text form of the raw email uploaded in the system.
time stamp       timestamp   Date-time when the email was uploaded in the system. Automatically inserted by the system and has a default value of current date-time.

Primary Key: unmask id.
Foreign Key: None.
Table 4.3: tbl l header

Column name      Data type   Description
unmask id        int4        Unique identifier assigned to an email uploaded in the system to which the header belongs.
header name      text        Name of the limited header field present in the email.
header content   text        Contents of the limited header present in the email.
time stamp       timestamp   Date-time when record is inserted in the table. Automatically inserted by the system and has a default value of current date-time.

Primary Key: unmask id, header name.
Foreign Key: unmask id REFERENCES unmask id in Table tbl email.
Table 4.4: tbl ul header

Column name      Data type   Description
unmask id        int4        Unique identifier assigned to an email uploaded in the system to which the header belongs.
seq no           text        Unique sequence number of various unlimited header fields in a particular email (unmask id).
header name      text        Name of unlimited header field present in the email.
header content   text        Contents of unlimited header field present in the email (received fields not included).
time stamp       timestamp   Date-time when record is inserted in the table. Automatically inserted by the system and has a default value of current date-time.

Primary Key: unmask id, header name, seq no.
Foreign Key: unmask id REFERENCES unmask id in Table tbl email.
Table 4.5: tbl ul header received

Column name    Data type   Description
unmask id      int4        Unique identifier assigned to an email uploaded in the system to which the header belongs.
seq no         text        Unique sequence number of various received header fields for a particular email.
id             text        Id part of a received field header.
from from      text        from-from part of a received field header.
from domain    text        from-domain part of a received field header.
from address   text        from-address part of a received field header.
rec by         text        received-by part of a received field header.
via            text        via part of a received field header.
rec with       text        received-with part of a received field header.
rec for        text        received-for part of a received field header.
date time      text        date-time part of a received field header.
comments       text        comments part of a received field header.
time stamp     timestamp   Date-time when record is inserted in the table. Automatically inserted by the system and has a default value of current date-time.

Primary Key: unmask id, seq no.
Foreign Key: unmask id REFERENCES unmask id in Table tbl email.
Table 4.6: tbl email address

Column name    Data type   Description
unmask id      int4        Unique identifier assigned to an email uploaded in the system to which the header belongs.
source         text        Part or name of the header in the email where the email address was found.
email local    text        Local part of the email address.
email domain   text        Domain part of the email address.
ip address     inet        IP address of the mail server for the domain of this email address.
vrfy mx id     int4        Foreign key to vrfy mx id column of table tbl verify mx that stores ESMTP VRFY and ESMTP EXPN records for this email address.
time stamp     timestamp   Date-time when record is inserted in the table. Automatically inserted by the system and has a default value of current date-time.

Primary Key: unmask id, source, email local, email domain.
Foreign Key: unmask id REFERENCES unmask id in Table tbl email,
             source REFERENCES source text in Table tbl source master,
             vrfy mx id REFERENCES vrfy mx id in Table tbl verify mx.
Table 4.7: tbl uri

Column name      Data type   Description
unmask id        int4        Unique identifier assigned to every email uploaded in the system to which the header belongs.
source           text        Part or name of the header in the email where the IP address, link, URL, URI or domain name was found.
canonical name   text        Name or value of the IP address, link, URL, URI or domain that was found in the email.
whois id         int4        Foreign key to whois id column of table tbl whois that stores 'whois' records for this URI/URL/IP address.
traceroute id    int4        Foreign key to traceroute id column of table tbl traceroute that stores 'traceroute' records for this URI/URL/IP address.
dig id           int4        Foreign key to dig id column of table tbl dig that stores the 'dig' records for this URI/URL/IP address.
country id       int4        Foreign key to country id column of table tbl country that stores the 'country' records for this URI/URL/IP address.
time stamp       timestamp   Date-time when record is inserted in the table. Automatically inserted by the system and has a default value of current date-time.

Primary Key: unmask id, source, canonical name.
Foreign Key: unmask id REFERENCES unmask id in Table tbl email,
             source REFERENCES source text in Table tbl source master,
             whois id REFERENCES whois id in Table tbl whois,
             traceroute id REFERENCES traceroute id in Table tbl traceroute,
             dig id REFERENCES dig id in Table tbl dig,
             country id REFERENCES country id in Table tbl country.
Table 4.8: tbl href

Column name   Data type   Description
unmask id     int4        Unique identifier assigned to an email uploaded in the system to which the header belongs.
seq no        int4        Unique sequence number of various hrefs or links present in an email.
source        text        Part or header name of the email where the href / link was found.
href          text        Text stored in the href part of the anchor tag found in the email.
display       text        Display part of the anchor tag found in the raw email. Remains NULL if the entity is a normal link that is not part of an anchor tag.
website id    int4        Foreign key to website id column of table tbl website that stores the website part of this href/URL/domain name/IP address.
time stamp    timestamp   Date-time when record is inserted in the table. Automatically inserted by the system and has a default value of current date-time.

Primary Key: unmask id, seq no, source.
Foreign Key: unmask id REFERENCES unmask id in Table tbl email,
             source REFERENCES source text in Table tbl source master,
             website id REFERENCES website id in Table tbl website.
Table 4.9: tbl website

Column name    Data type   Description
website id     int4        Unique identifier assigned to a website found in an email. It can also be the website part of any href or normal link found in the email.
website name   text        Name of the website.
port           int4        Port associated with any href / link found in raw email.
time stamp     timestamp   Date-time when record is inserted in the table. Automatically inserted by the system and has a default value of current date-time.

Primary Key: website id.
Foreign Key: None.

Table 4.10: tbl source master

Column name     Data type   Description
source text     text        Name of the source of an entity found in an email.
source header   text        Name of the header in raw email where the source is found, e.g. limited-header, body etc.

Primary Key: source text.
Foreign Key: None.
Table 4.11: tbl whois

Column name           Data type   Description
whois id              serial      Unique identifier (auto incremented field) assigned to a whois result.
canonical parameter   text        Canonical parameter on which whois is run. The Unix tool 'whois' runs on the host name. Canonical parameter in this thesis is defined as the host part of any link, URL, URI etc. found in an email.
whois result          text        Result of whois run on a host name (canonical parameter column).
time stamp            timestamp   Date-time when record is inserted in the table. Automatically inserted by the system and has a default value of current date-time.

Primary Key: whois id.
Foreign Key: None.

Table 4.12: tbl traceroute

Column name         Data type   Description
traceroute id       serial      Unique identifier (auto incremented field) assigned to a traceroute result.
canonical name      text        Entity (domain name, IP address etc.) on which traceroute is run.
traceroute result   text        Result of traceroute run on an entity (canonical name column).
time stamp          timestamp   Date-time when record is inserted in the table. Automatically inserted by the system and has a default value of current date-time.

Primary Key: traceroute id.
Foreign Key: None.
Table 4.13: tbl dig

Column name      Data type   Description
dig id           serial      Unique identifier (auto incremented field) assigned to a dig result.
canonical name   text        Entity (domain name, IP address etc.) on which dig is run.
dig result       text        Result of dig run on an entity (canonical name column).
time stamp       timestamp   Date-time when record is inserted in the table. Automatically inserted by the system and has a default value of current date-time.

Primary Key: dig id.
Foreign Key: None.

Table 4.14: tbl country

Column name      Data type   Description
country id       serial      Unique identifier (auto incremented field) assigned to the result of the GeoIP tool [7] run on an entity (domain name, IP address).
canonical name   text        Entity (domain name, IP address etc.) on which the GeoIP tool is run.
country result   text        Result of the GeoIP tool run on an entity (canonical name column).
time stamp       timestamp   Date-time when record is inserted in the table. Automatically inserted by the system and has a default value of current date-time.

Primary Key: country id.
Foreign Key: None.
Table 4.15: tbl vrfy mx

Column name     Data type   Description
vrfy mx id      serial      Unique identifier (auto incremented field) assigned to each record inserted in this table.
email local     text        Local part of the email address that needs to be verified.
email domain    text        Domain part of the email address that needs to be verified.
verify result   text        Result of running Unix command ESMTP VRFY on the local part of the email address (email local) against the domain part (email domain).
expn result     text        Result of running Unix command ESMTP EXPN on the local part of the email address (email local) against the domain part (email domain).
mail server     text        Name of the mail server of the domain part (email domain) of the email address.
mx records      text        MX records of the mail server of the domain to which the email address belongs.
time stamp      timestamp   Date-time when record is inserted in the table. Automatically inserted by the system and has a default value of current date-time.

Primary Key: vrfy mx id.
Foreign Key: None.
Table 4.16: tbl concurrent db

Column name   Data type   Description
unmask id     int4        Unique identifier assigned to each and every email uploaded in the system.
dbcount       int4        Number of socket connections opened from the database server side for a particular email (unmask id).
time stamp    timestamp   Date-time when record is inserted in the table. Automatically inserted by the system and has a default value of current date-time.

Primary Key: None.
Foreign Key: None.

Table 4.17: tbl concurrent unix

Column name   Data type   Description
unmask id     int4        Unique identifier assigned to each and every email uploaded in the system.
unixcount     int4        Number of tools run by the Unix tools server for a particular email (unmask id).
time stamp    timestamp   Date-time when record is inserted in the table. Automatically inserted by the system and has a default value of current date-time.

Primary Key: None.
Foreign Key: None.
4.2.3.2 User Defined Functions
All database inserts and retrievals in UnMASK take place using User Defined Functions;
there are no inline queries in the email parser code or the User Interface code. Using
User Defined Functions helps guard against SQL injection. User Defined Functions in
PostgreSQL have version control, as explained in Section 4.1. Almost all the business
logic of UnMASK is embedded inside User Defined Functions.
While designing the database for UnMASK, one design decision was not to have inline SQL
queries in the User Interface code, the email parser code or the database code used on the
Unix tools server. This was necessary to avoid SQL injection from the User Interface side.
Also, we wanted to keep all our business logic in the database itself for various security
reasons. Thus, we wanted to create stored procedures that would contain all the inline SQL
queries used at the various tiers of the UnMASK system, along with the database-related
business logic. However, after thorough research, we found that there is no concept of
stored procedures in PostgreSQL; instead, PostgreSQL has User Defined Functions, and
stored-procedure functionality is wrapped in User Defined Functions. To shed more light on
this, let us analyze the differences between a stored procedure and a User Defined Function
in other database management systems such as Microsoft's SQL Server and Oracle. These
differences can be enumerated as follows:-
1. Stored procedures are parsed, compiled and stored in compiled form (pseudo code) in the
database. User Defined Functions, on the other hand, are parsed and compiled at runtime.
2. A User Defined Function must return a value, whereas a stored procedure need not (though
it certainly can, if required).
3. A User Defined Function can be used within a SQL statement. For example, suppose we have
a function FuncSal(int) that returns the salary of a person. This function can be used
in a SQL statement as follows:-

. SELECT * FROM tbl sal WHERE salary = FuncSal(x)

Internally, a call is made to the User Defined Function FuncSal with any desired integer
x, and the result is compared with the 'salary' column of the database table tbl sal.
A function may contain Data Manipulation Language (DML) statements like INSERT, UPDATE
and DELETE; however, such a function cannot be called from within a SQL query. For
example, if we have a function FuncUpdate(int) that updates a table, then

. SELECT FuncUpdate(field) FROM sometable;

will throw an error. Stored procedures, on the other hand, cannot be called inside a
SQL statement at all.
4. Operationally, when an error is encountered, a function stops, while a stored procedure
can ignore the error and proceed to the next statement in the code (provided one has
included error-handling support).
5. User Defined Functions return values of a single type, whereas stored procedures can
return values of multiple types.
6. Stored procedures support deferred name resolution. To explain this, say we have a
stored procedure that references tables tbl x and tbl y, but these tables do not actually
exist in the database at the time the stored procedure is created. Creating such a
stored procedure does not throw an error; at runtime, however, it would certainly throw
an error that tables tbl x and tbl y do not exist in the database. User Defined
Functions do not support such deferred name resolution.
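The idea in point 3, a user-defined function invoked from inside a SELECT, can be tried out from Python using sqlite3. SQLite is a different engine from PostgreSQL, so this only illustrates the general concept; FuncSal's body here is hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tbl_sal (name TEXT, salary INTEGER)")
db.executemany("INSERT INTO tbl_sal VALUES (?, ?)",
               [("alice", 5000), ("bob", 7000)])

# Register a scalar function and use it inside a SQL statement, analogous
# to: SELECT * FROM tbl_sal WHERE salary = FuncSal(x)
def func_sal(x):
    return x * 1000   # hypothetical mapping from a grade to a salary

db.create_function("FuncSal", 1, func_sal)
rows = db.execute("SELECT name FROM tbl_sal WHERE salary = FuncSal(7)").fetchall()
print(rows)   # [('bob',)]
```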
In PostgreSQL, as already mentioned, there is no provision for stored procedures; it only
provides User Defined Functions. However, these functions have mixed behavior: in some
scenarios they behave like a stored procedure and in others like a normal User Defined
Function. To explain this, let us revisit each difference mentioned above and see how a
PostgreSQL User Defined Function behaves in each scenario.
1. In PostgreSQL, whether a User Defined Function is compiled once or parsed at run time
depends strictly upon the language interface used in the function. If the language
interface is 'plpgsql' or 'C', the function is compiled only once, when it is created;
this compiled version is kept in the database for any future calls instead of the
function being recompiled every single time it is called. On the other hand, if the
language interface is Perl or another interpreted language, the rules of that language
are followed: a Perl function is parsed and interpreted every single time it is called.
In a nutshell, what happens to a User Defined Function when it is called by an
application depends strictly upon the language interface it uses; it may act like SQL
Server's or Oracle's stored procedures, or like their User Defined Functions,
accordingly.
2. As far as return values are concerned, a User Defined Function in PostgreSQL behaves
strictly like Microsoft SQL Server's or Oracle's User Defined Functions, i.e. it must
return a value.
3. A User Defined Function in PostgreSQL, as in any other RDBMS, can be used within a SQL
statement. At the same time, we can also use functions that contain DML statements like
INSERT, UPDATE and DELETE from within SQL statements; such functions cannot be used with
SQL statements in other RDBMSs like Oracle, as explained above.
4. In PostgreSQL, a User Defined Function has a provision for exception handling. For
example, in the function sp client socket() in UnMASK, the following code snippet uses
exception handling:-
. BEGIN
. PERFORM INET(new.canonical name);
. EXCEPTION WHEN invalid text representation THEN
. ip address present=0;
. END;
Here we check whether the 'canonical name' column value is in the form of an IP address
or plain text. The function INET, a PostgreSQL internal function, throws an exception if
it is given a non-IP-address argument. We catch this exception and assign the value 0 to
the variable ip address present. This is one feature common to stored procedures in other
RDBMSs that we see in PostgreSQL User Defined Functions: in RDBMSs like Microsoft's SQL
Server or Oracle, exception handling is supported only in stored procedures, not in User
Defined Functions.
5. In PostgreSQL, a User Defined Function may return multiple values, but of the same
type. This is strictly in accordance with User Defined Functions in other RDBMSs like
SQL Server and Oracle. A User Defined Function in PostgreSQL cannot return multiple
values of different types, unlike stored procedures in SQL Server and Oracle, which can.
6. User Defined Functions in PostgreSql support deferred name resolution, like stored
procedures in Sql Server and Oracle. Sql Server and Oracle, unlike PostgreSql, don't
support deferred name resolution in their User Defined Functions.
Based on the differences between a Stored Procedure and a User Defined Function in other
RDBMS like Sql Server and Oracle, and on how User Defined Functions in PostgreSql
behave in all such scenarios, we can say that in PostgreSql a Stored Procedure is effectively
wrapped inside a User Defined Function.
Figure 4.3 shows an example of a User Defined Function called sp fetch tools result
used in UnMASK. This User Defined Function fetches the Unix tools results (dig,
traceroute, whois, ESMTP VRFY & ESMTP EXPN) for entities such as email addresses,
URLs, URIs, links and domain names in an email (unmask id) and tags them to these
entities in tables tbl email address and tbl uri. The Unix tools results are fetched from
tables tbl dig, tbl traceroute, tbl whois, tbl country and tbl verify mx. As seen in the
figure, this User Defined Function uses database cursors[11] to iterate over the records
in tables tbl uri and tbl email address.
4.2.3.3 Triggers
Triggers in PostgreSql behave exactly the same way as in other RDBMS like Sql Server and
Oracle, although there is a slight difference in the trigger constructs. In Sql Server and
Oracle, the set of activities a trigger has to perform is part of the trigger body. PostgreSql,
however, has a special type of User Defined Function called a Trigger Function, with return
type 'trigger'; that is how a Trigger Function is distinguished from a regular User Defined
Function. A trigger in PostgreSql always has an associated Trigger Function that holds the
set of activities the trigger needs to perform. A trigger body in PostgreSql looks as follows:
. CREATE TRIGGER trg_email_address
. AFTER INSERT
. ON tbl_email_address
. FOR EACH ROW
. EXECUTE PROCEDURE func_client_socket('email_address');
As we can see in the code above, a trigger body contains the name of the trigger, the type
of the trigger, the name of the database table the trigger is associated with, and the name
of the procedure (Trigger Function) it executes when the trigger is invoked. Here, trigger
trg email address is an 'After Insert'2 trigger associated with table tbl email address and
executes Trigger Function func client socket right after a row is inserted in table
tbl email address. A Trigger Function can internally call other User Defined Functions,
if required. Figure 4.4 shows an example of a Trigger Function func client socket used
in UnMASK. As seen in the figure, the Trigger Function doesn't accept any parameters
in its signature even though we passed a text parameter ’email address’ (func client socket
(’email address’)) while calling it from the trigger body. A Trigger Function also differs
from a normal User Defined Function in parameter passing. In PostgreSql, the number
of parameters one can pass to a Trigger Function is unlimited and dynamic. If passed,
these parameters become elements of a special array TG ARGV[ ], defined internally in
the PostgreSql libraries. This array is created and populated dynamically based on the
number of parameters passed to the Trigger Function. If no parameter is passed to a
Trigger Function, the TG ARGV[ ] array associated with that Trigger Function is undefined.
2 After Insert triggers are invoked after every successful insertion of a row in the database table.
In Figure 4.4, we can see that the parameter 'email address' passed to the Trigger Function
is accessed using TG ARGV[0], i.e., the first and only element of array TG ARGV[ ].
Note that Figure 4.4 is a code snippet taken from Trigger Function sp client socket used
in UnMASK; it doesn't show the complete implementation of the Trigger Function. The
purpose of including this figure is simply to show how Trigger Functions work in PostgreSql.
In UnMASK, all the triggers created are 'After Insert' triggers used to invoke the desired
action(s) after a record is inserted in tables tbl email, tbl email address and tbl uri. Table
4.18 gives a complete list of triggers used in UnMASK.
Table 4.18: UnMASK Triggers

trg email (on table tbl email): Invoked after a record is inserted in table tbl email; calls User Defined Function sp email, which parses the raw email and deconstructs it into its constituent headers.

trg email address (on table tbl email address): Invoked after a record is inserted in table tbl email address; calls User Defined Function sp client socket, which opens a socket connection between the database server and the Unix tools server to verify the validity of an email address.

trg uri whois (on table tbl uri): Invoked after a record is inserted in table tbl uri; calls User Defined Function sp client socket, which opens a socket connection between the database server and the Unix tools server to run whois on the newly inserted canonical name column in the table.

trg uri traceroute (on table tbl uri): Invoked after a record is inserted in table tbl uri; calls User Defined Function sp client socket, which opens a socket connection between the database server and the Unix tools server to run traceroute on the newly inserted canonical name column in the table.

trg uri dig (on table tbl uri): Invoked after a record is inserted in table tbl uri; calls User Defined Function sp client socket, which opens a socket connection between the database server and the Unix tools server to run dig on the newly inserted canonical name column in the table.

trg uri country (on table tbl uri): Invoked after a record is inserted in table tbl uri; calls User Defined Function sp client socket, which opens a socket connection between the database server and the Unix tools server to run the GeoIP tool on the newly inserted canonical name column in the table.
CHAPTER 5
Automation Using a Database
In this chapter, we discuss how automation of email analysis is achieved using a database
in the context of UnMASK. Before examining this automation in depth, we explain the
UUTC protocol that we developed in this project for establishing a connection between the
database server and the Unix tools server.
5.1 UnMASK Unix Tools Connection Protocol (UUTC)
The connection between the database server and the Unix tools server uses a new protocol
that we designed and implemented, called the UnMASK Unix Tools Connection (UUTC)
protocol. This protocol opens a socket connection, when needed, to a daemon process (the
Unix tools server), allows the parameters needed for invoking specific tools to be sent across
the connection, and permits return information to be properly put back into the database.
The protocol embodies the novel idea of opening connections from the database server to
an external daemon process.
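The UUTC exchange can be sketched in outline. The thesis does not specify the wire format, so the sketch below (in Python rather than the PL/Perl used in UnMASK) assumes a simple '|'-separated, newline-terminated request message; all function and message names are illustrative, not UnMASK's actual code. It shows the two defining behaviors of the protocol: the client (database side) only sends parameters and returns at once, and the daemon closes the socket as soon as the request is read, running the tool afterwards.

```python
import socket
import threading

def send_tool_request(host, port, unmask_id, tool, args):
    """Client side (database server): open a socket, send the parameters
    needed to invoke a tool, and return immediately. Tool results come
    back later over a separate ODBC connection, not over this socket."""
    msg = "|".join([str(unmask_id), tool] + list(args)) + "\n"
    with socket.create_connection((host, port)) as s:
        s.sendall(msg.encode())
    # note: no recv() -- the daemon closes the socket after reading

def serve_one_request(port, handler, ready):
    """Server side (Unix tools daemon): accept one connection, read the
    request line, close the socket at once, then run the tool."""
    srv = socket.socket()
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", port))
    srv.listen(1)
    ready.set()                          # signal that the daemon is listening
    conn, _ = srv.accept()
    line = conn.makefile().readline()    # the whole request fits on one line
    conn.close()                         # free the database server right away
    srv.close()
    unmask_id, tool, *args = line.strip().split("|")
    handler(int(unmask_id), tool, args)  # the real daemon would run dig/whois/... here
```

A real deployment would loop over accept() and dispatch each tool run to a worker; the single-request version above keeps the sketch short.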
5.2 Automating Email Analysis
Having discussed project UnMASK and the proof of concept for automating email analysis
in Chapter 3, the design of the database server in Chapter 4, the email parser in Section 3.4,
and the Unix tools server in Section 3.5, we now discuss the automation process in detail.
Figure 5.1: Workflow: Automating email analysis
Figure 5.1 shows the workflow of the entire automation process. As soon as a raw
email is inserted in the PostgreSql database, it gets stored in table tbl email as a text field.
Table tbl email has the 'After Insert' trigger trg email associated with it, so as soon as a
record is inserted in table tbl email, trigger trg email is invoked. This trigger internally
calls a User Defined Function sp email parser. As seen in Figure 5.2, User Defined Function
sp email parser is internally divided into 3 sequential transactions.
Figure 5.2: Implementation of User Defined Function sp email parser
Transaction 1 is a PostgreSql function written in Perl. The use of Transactions 2 and 3 is
explained later, in Chapters 7 and 8. Coming back to Transaction 1, this User Defined
Function is the implementation of an email parser that iteratively parses a raw email into
finer and finer granularity. The part of this parser code that is the work of this thesis is the
database transactions taking place inside it.
In the parser, wherever an email address is found in the raw email, it gets stored as a separate
record in table tbl email address in the database. Similarly, wherever an IP address, link,
domain name, URI or URL is found in the raw email, it gets stored in table tbl uri. Tables
tbl email address and tbl uri both have 'After Insert' triggers associated with them, as shown
in Table 4.18. As soon as a record is inserted in either of these two tables, a User Defined
Function sp client socket is called. sp client socket is a PostgreSql function written in Perl
that implements a client socket. The socket is used to establish a communication channel
between the database server and the Unix tools server. The parameters passed through this
socket depend on which trigger established the socket connection.
If the socket connection is established because of the action of trigger trg email address on
table tbl email address, then the 'local' and the 'domain' parts of an email address are passed
to the Unix tools server through the socket. This methodology of data transfer through a
socket connection between the database server and the Unix tools server is what we call
the UUTC protocol defined in Section 5.1. The Unix tools server, as explained in Section 3.5,
is a daemon that runs programs (tools) invoked by the database server. The socket connection
established by trigger trg email address initiates a tool called 'Tool1' in the Unix tools server,
as described in Table 3.1. Tool1 is implemented as follows: it runs 'dig' on the domain part
of the email address to find the mail server to which this domain belongs. If a valid
mail server is returned, the Unix commands ESMTP VRFY and ESMTP EXPN 1 are run
on the local and domain parts of the email address. After Tool1 has completed its task, an
ODBC connection (part of the UUTC protocol) is opened between the Unix tools server and the
database server. Using this ODBC connection, the results of Tool1, i.e., the mail server records
and the results of ESMTP VRFY and EXPN, are stored in table tbl verify mx.
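The control flow of Tool1 described above can be sketched as follows. This is a hedged Python illustration, not the thesis's actual implementation: the dig and ESMTP lookups are injected as callables so the ordering (dig first, VRFY/EXPN only when a valid mail server is found) can be shown without real DNS or SMTP traffic, and all names are assumptions.

```python
def run_tool1(local, domain, dig_mx, smtp_vrfy, smtp_expn):
    """Run dig on the domain part to find its mail server; if one exists,
    run ESMTP VRFY and EXPN for local@domain against it. The returned
    dict models a row destined for table tbl_verify_mx."""
    mx_host = dig_mx(domain)              # e.g. wraps 'dig MX <domain>'
    if not mx_host:                       # no valid mail server: stop here
        return {"mx": None, "vrfy": None, "expn": None}
    address = local + "@" + domain
    return {
        "mx": mx_host,
        "vrfy": smtp_vrfy(mx_host, address),
        "expn": smtp_expn(mx_host, address),
    }
```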
Similarly, if the socket connection is established because of the action of triggers
trg uri whois, trg uri traceroute, trg uri dig or trg uri country on table tbl uri, as shown in
Table 4.18, then the canonical name field in table tbl uri is sent across to the Unix tools
server through the socket. Although all four of these triggers are associated with the same
database table, each runs independently of the others, opens a separate socket connection
with the Unix tools server, and initiates the appropriate tool there. The whois, dig,
traceroute and country results are saved back in tables tbl whois, tbl dig, tbl traceroute
and tbl country by opening an ODBC connection between the Unix tools server and the
database server, as explained above.
1 ESMTP stands for 'Extended Simple Mail Transfer Protocol'. Only mail servers that support ESMTP can run the VRFY and EXPN commands. VRFY is used to verify the validity of a user on a mail server. EXPN returns the whole mailing list on that mail server of which the verified user is a part.
One important point to note here is that the data from the Unix tools server back to the
database server is not passed over the same socket connection. The reason is that the
Unix tools server closes this socket connection right after it receives the required parameters
from the database server. This is done so that the database server process does not wait
for the Unix tools server process, which at times can take long to return results; instead,
the database server returns right after establishing a socket connection with the Unix tools
server and passing the required parameters to it. The rationale for closing this socket
connection in such a manner is explained in detail in Chapter 6.
CHAPTER 6
Performance Improvement
Unix tools like dig, traceroute and whois connect to the internet and return results, which
are stored back in the database. Depending on network speed, the time these tools take to
return can be very high. In UnMASK, for one single email, a connection to the Unix tools
server is made several times, depending on the number of email addresses, IP addresses,
domain names, etc. found in the email. Each of these entities found in the email means
an insert in table tbl email address or tbl uri, and each insert opens a separate socket
connection with the Unix tools server.
We use PostgreSql's "Server Programming Interface"[12] to make all the database inserts
inside the email parser code part of one single atomic transaction. We make this transaction
atomic to avoid any inconsistency in the database: it ensures that either all or none of the
database inserts in the parser code will commit. Let's say there are 20 database inserts in
the parser code for 20 different email addresses, IP addresses, domain names, etc. found in
the email. All these inserts, being part of one single atomic transaction, are sequential. This
means that the nth insert statement in the parser code on table tbl email address or tbl uri
starts only when the (n-1)th database insert, along with all the actions initiated by it, has
succeeded. Both tables tbl email address and tbl uri have triggers associated with them;
these triggers internally implement the UUTC protocol and open a socket connection with
the Unix tools server. Now, if the client socket at the database server, started by the (n-1)th
database insert inside the parser code, waited for the Unix tools server to respond, which
might take long depending on network speed, then all the database inserts in the parser
code after it would hang, waiting for the Unix tools server to return results for the (n-1)th
insert. This delay from the Unix tools server would keep accruing with every database insert
inside the parser code, as each database insert opens a new socket connection with the Unix
tools server, making the overall system terribly slow. This can be visualized in Figure 6.1.

Figure 6.1: Communication mechanism between the database server and Unix tools server.
To overcome this performance issue, we close the socket connection between the database
server and the Unix tools server, from the Unix tools server side, right after the required
parameters have been passed to the Unix tools server through this socket. As a result, we
disconnect the Unix tools server from the database transaction, and hence the latency of
the Unix tools server won't affect the performance of the database process: all the database
inserts can run independently of the Unix tools server. The Unix tools server, after fetching
results from the internet for a particular tool, opens a separate ODBC connection with the
database server and stores the results in result tables like tbl verify mx, tbl traceroute, etc.
So we gain in the overall performance of the system by closing the socket connection between
the database server and the Unix tools server right after the required parameters have been
passed from the database server. This method of improving performance by closing the
socket connection in the manner explained above opens the door to the state maintenance
problem between the database server and the Unix tools server, explained in the next
chapter.
To further improve the performance of the system, we do the following at the database
server end:
• In one single email, none of the Unix tools is run more than once for a particular email
address, IP address, URI, domain name, etc., even if the same entity occurs several
times in that email. That means that for the same entity in an email, the socket
connection with the Unix tools server is not opened more than once.

• Across different emails uploaded in the system, if a tool other than Tool1 (ESMTP
VRFY and EXPN) was already run within the past 10 days for a particular entity,
that tool is not run again for the same entity within that 10-day period. Results from
the old run of the tool on that entity are tagged to it programmatically. An exception
to this rule: if the tool's last result within the past 10 days for that entity was null,
the tool is run again.

The rationale behind this 10-day rule is that the results of running tools like whois,
dig, traceroute and GeoIP are not expected to change drastically within a period of
10 days, so we save resources and time by running these tools at most once in that
period. We run the Unix commands ESMTP VRFY and ESMTP EXPN every single
time because email addresses are ephemeral, and we can't expect the same results from
these commands, for the same email address, even after a single day.
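Taken together, the rules above amount to a small caching policy, which can be sketched as a single decision function. This is a hedged Python sketch; the names and the record shape are assumptions, and in UnMASK the check is done against the result tables in the database.

```python
from datetime import datetime, timedelta

CACHE_WINDOW = timedelta(days=10)   # the 10-day rule

def should_run_tool(tool, last_run, now):
    """Decide whether a tool must be run again for an entity.
    last_run is None if the tool was never run for this entity,
    otherwise a (timestamp, result) pair from the result tables."""
    if tool == "tool1":              # ESMTP VRFY/EXPN: always re-run
        return True
    if last_run is None:             # never run for this entity
        return True
    ts, result = last_run
    if now - ts > CACHE_WINDOW:      # last run is older than 10 days
        return True
    return result is None            # re-run if the cached result was null
```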
CHAPTER 7
State Maintenance
As discussed in the previous chapter, closing the socket connection right after the required
parameters are passed to the Unix tools server gives rise to the state maintenance problem
between the database server and the Unix tools server. As soon as the socket connection is
closed by the Unix tools server, the database server has no way of keeping track of the success
or failure of the tool(s) run at the Unix tools server side. This means that the database server
becomes stateless with regard to the Unix tools server.
In this chapter, we discuss why the database server needs to maintain the state of the
Unix tools server and how we maintain such a state.
7.1 Why State Maintenance
As already discussed, the aim of this automation process is to generate reports that law
enforcement agencies can use to analyze an email for further investigation. Since the
database process and the Unix tools server process are disconnected by closing the socket
connection right after the required parameters are passed from the database server side to
the Unix tools server side, the email parser code keeps executing all its database inserts and
returns an OK signal to the User Interface right after all the database inserts have completed
successfully. At this point, even though the database server has completed all its activities,
it does not know whether the Unix tools server has serviced all the requests sent to it. The
User Interface, on receiving an OK signal from the database server, can immediately show a
screen having links for the generation of different kinds of reports (see Figure 9.2 for a report
segment). In this screen, the user can click on various links to see reports on various tool
results. However, if the Unix tools server is still not done with its job(s) on the email for
which the user wants to see reports, this page won't show any results, as the result data
has not yet been completely populated in the result tables by the Unix tools server. This
results in an ambiguity in the system. To avoid it, the database server needs to know the
state of the Unix tools server: it should send an OK signal to the User Interface only when
the Unix tools server has serviced all the requests sent from the database server side for a
particular email. So we had to design a methodology to deal with this state maintenance
problem, explained in the next section.
7.2 Accomplishing State Maintenance
To solve the state maintenance problem, we refer to Transaction 2 in Figure 5.2. We created
two database tables: table tbl concurrent db (Table 4.16) with columns unmask id and
dbcount, and table tbl concurrent unix (Table 4.17) with columns unmask id and unixcount.
Column dbcount in table tbl concurrent db stores the number of database inserts done in
the email parser code at the database server side for a particular unmask id (email), and
column unixcount in table tbl concurrent unix stores the number of requests the Unix tools
server serviced for that unmask id (email). For a particular unmask id (email), the value
stored in column dbcount in table tbl concurrent db is incremented right after each insert
statement inside the parser code, and the value of column unixcount in table
tbl concurrent unix is incremented by the Unix tools server right after it finishes running
a tool. Fundamentally, for a particular email, if the values of columns dbcount and
unixcount in tables tbl concurrent db and tbl concurrent unix are equal, then the Unix
tools server has serviced all the requests sent by the database server for that unmask id
(email). Note that the Unix tools server still increments the 'unixcount' value even if a tool
fails and doesn't return any results.
So, before updating tables tbl email address and tbl uri and sending an OK signal to the
User Interface, the database server repeatedly checks in a loop whether the values of
'dbcount' and 'unixcount' are equal for that email. This loop can be visualized as follows:
START LOOP
. IF dbcount = unixcount THEN
. BREAK AND SEND OK TO UI
. ELSE
. LOOP AGAIN
. END IF
END LOOP
As seen in the code snippet above, the loop keeps executing until the dbcount and the
unixcount values in tables tbl concurrent db and tbl concurrent unix are equal for a
particular email (unmask id). However, after some research and testing, we found the
following flaws in this method:

• This loop would keep executing forever if the Unix tools server happened to crash. In
that scenario, the dbcount and unixcount values would never become equal and hence
the whole system would hang.

• This loop is a 'busy wait' and, executing continuously, consumes a lot of CPU
resources.
To solve these problems, we modified our loop as follows:
START LOOP
. IF dbcount = unixcount THEN
. BREAK AND SEND OK TO UI
. ELSE
. IF Unix Tools Server Still Alive THEN
. SLEEP 5 SECONDS
. LOOP AGAIN
. ELSE
. BREAK AND SEND ERROR CODE TO UI
. END IF
. END IF
END LOOP
As seen in the modified code above, we added two extra conditions:

1. The first, i.e., "IF Unix Tools Server Still Alive THEN", checks whether the Unix
tools server is still alive before re-iterating the loop. This check is done by pinging the
Unix tools server on each iteration. If the Unix tools server has crashed for any reason
whatsoever, the loop ends and an appropriate error code is sent to the User Interface.

2. The second, i.e., "SLEEP 5 SECONDS", pauses the loop for 5 seconds to avoid a
busy wait and save the CPU resources that continuous looping would consume.
PostgreSql provides a function called pg sleep; however, it is not available in the version
of PostgreSql that we are using: pg sleep is available in PostgreSql 8.2 and above, and
we use PostgreSql 8.1 in UnMASK. As a workaround, to implement SLEEP we call
a User Defined Function in Perl that runs Perl's sleep command for 5 seconds. In the
future, when we upgrade our database to a later version of PostgreSql, we will remove
the call to the Perl User Defined Function and use pg sleep.
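The modified loop can be rendered as the following sketch. In UnMASK the loop lives inside a database function; here the counter reads, the liveness ping and the sleep are injected as callables so that only the control flow is shown, and all names are illustrative.

```python
import time

def wait_for_unix_tools(get_counts, server_alive, sleep=time.sleep, poll=5):
    """Poll until the Unix tools server has serviced every request for an
    email. get_counts() returns (dbcount, unixcount) as read from tables
    tbl_concurrent_db and tbl_concurrent_unix. Returns True to send OK to
    the UI, or False (an error code) if the daemon died in the meantime."""
    while True:
        dbcount, unixcount = get_counts()
        if dbcount == unixcount:     # every request has been serviced
            return True
        if not server_alive():       # ping the Unix tools server
            return False
        sleep(poll)                  # avoid a busy wait between checks
```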
CHAPTER 8
Tagging Unix Tools Result with an Email
Referring to Figure 5.2, Transaction 3 is used to tag the results of the tools run by the Unix
tools server to a particular email (unmask id). These results are stored in result tables like
tbl verify mx, tbl dig, tbl traceroute, tbl whois and tbl country in the database. Before we
explain this tagging process, we first need to understand why the tagging is not done at
the Unix tools server side when it stores the tools results in the database. As explained in
Section 5.2, the database server and the Unix tools server are disconnected, and hence there
is no way the Unix tools server can maintain the state of the database server. At the time
the Unix tools server opens an ODBC connection with the database server to store the tools
results in the result tables, it has the unmask id and the name of the entity of the email
for which a tool was run. Using the same ODBC connection, the Unix tools server could
update table tbl email address with column vrfy mx id and table tbl uri with columns
traceroute id, dig id, whois id and country id. However, since all the inserts in Transaction 1
(see Figure 5.2) are part of the main transaction that started with an insert in table
tbl email, these inserts are not visible outside this main transaction. For example, say the
Unix tools server runs traceroute for unmask id 967 and canonical name yahoo.com, and
then opens an ODBC connection with the database to update column traceroute id in table
tbl uri. At this point, there is a strong possibility that the main database transaction that
we just talked about has not completed yet. This means that the record for unmask id 967
and canonical name yahoo.com in table tbl uri is still not visible to the Unix tools server,
so the Unix tools server may not be able to tag the traceroute result for yahoo.com for
unmask id 967 in table tbl uri. The same holds true for the other tools. Thus, this tagging
is not done at the Unix tools server side; instead, we have Transaction 3 (see Figure 5.2)
dedicated to this tagging job.
We also need to understand the way results from the Unix tools server are stored in the
result tables tbl verify mx, tbl traceroute, tbl dig, tbl whois and tbl country. These tables
have no unmask id column, so we can't associate any tool result with an email directly from
them. Also, records in these tables are not duplicated. For example, suppose we already
have a 'whois' record for yahoo.com in tbl whois and whois is run again on yahoo.com. If
the latest whois result is the same as the one already stored in the table, it is not stored; if
it is different, a new record is inserted with a new whois id and the latest whois result. The
same holds true for the other result tables. This way, we save a lot of database storage space
by not storing redundant information unnecessarily.
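The insert-if-changed rule can be modeled in a few lines. The sketch below keeps a result table in memory purely for illustration; in UnMASK the rule is enforced with SQL against tables like tbl whois, and the names here are assumptions.

```python
def store_result(table, entity, new_result, next_id):
    """table maps entity -> list of (id, result) pairs, newest last.
    Insert a new record only when the result differs from the latest
    stored one; return the id that is now current for the entity."""
    history = table.setdefault(entity, [])
    if history and history[-1][1] == new_result:
        return history[-1][0]        # unchanged result: keep the old record
    rid = next_id()                  # new id for a new or changed result
    history.append((rid, new_result))
    return rid
```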
Now that we understand why the process of tagging a tool result with an email is not
done at the Unix tools server side, we explain how this tagging is done at the database
side. Referring to Figure 5.2 again, Transaction 3 has the unmask id and the name of the
entity on which the tool was run, and using this information it can update tables
tbl email address and tbl uri. For a particular entity (columns email local and email domain
in table tbl email address, and column canonical name in table tbl uri), this transaction
picks the latest IDs of tool results from the various result tables and updates tables
tbl email address and tbl uri with these IDs: column vrfy mx id is updated in table
tbl email address, and columns traceroute id, dig id, whois id and country id are updated
in table tbl uri. Therefore, at the end of Transaction 3, tool results are tagged to the entities
of an email in tables tbl email address and tbl uri and can be used for report generation at
the User Interface level.
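Transaction 3's tagging step can be modeled as follows. This is an in-memory Python sketch of the update; the real transaction is SQL inside the database, and the row and table shapes here are assumptions for illustration.

```python
def tag_latest_results(entity_rows, result_table, id_column):
    """entity_rows: rows of tbl_uri (or tbl_email_address) for one
    unmask_id, each a dict with a 'canonical_name' key. result_table maps
    entity -> list of (id, result) pairs, newest last. Writes the newest
    matching result id into id_column of each row."""
    for row in entity_rows:
        history = result_table.get(row["canonical_name"], [])
        if history:                        # pick the latest id, if any
            row[id_column] = history[-1][0]
    return entity_rows
```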
CHAPTER 9
Case Study
The user interface for UnMASK supports a case management system for uploading email
files for analysis, as well as generation of reports based on information stored in the database.
UnMASK uses password-based user access control. In order to submit an email, the user
first logs into the system. The user can submit an email file (in eml format) by browsing for it
locally and uploading it as part of a new or an existing case. After the email is deconstructed
and processed as discussed in Chapter 4, the user is able to view the generated reports. Figure
9.1 illustrates the case management screen of the UnMASK user interface. The three cases
that user liu is investigating are listed, and further information on each can be accessed by
clicking on the case name. The user interface is implemented as an interactive web-based
infrastructure rendered using dynamic web pages written with Sun Microsystems' Java
Server Pages (JSP) technology. The application logic in a JSP page uses the Java Database
Connectivity (JDBC) API, which provides a call-level API for SQL-based database access,
to create dynamically generated HTML output from the contents of the database. Within
a JSP page we also have HTML code that displays static text and graphics. When the page
is displayed in a user's browser, it contains both static HTML content and dynamic
information retrieved from the database about his or her specific case.
Reports are designed to support law enforcement in analyzing email components. For
example, the sender email address in the raw email being analyzed may have been forged,
or a URL in the rendered email may be redirecting the recipient to a website different from
what is commonly inferred from its name. As part of requirements analysis, a brief survey
was done to ascertain what investigators would ideally like to see in a report. Some of
the desired features were determined to be: for each email address found in the phishing
email, determine the MX record for its domain, along with the results of executing ESMTP
EXPN and ESMTP VRFY on the mail server, clearly stating in which field (i.e., "From",
"Cc", "Bcc", etc.) the particular email address was found; determine the IP address of the
originating machine, and run the network utilities traceroute, dig and whois on this address;
and for each IP address/URL specified anywhere in the body of the raw email, again run
the aforementioned network utilities. The reports that UnMASK generates include all the
above information, organized in the structured fashion discussed in the next subsection.

Figure 9.1: UnMASK User Interface
See Figure 9.2 for a portion of a report illustrating the Registrant analysis of a domain
name found in the email body.
The report follows the structure of an email message. Starting with the email header
information, the report shows the specific header fields, isolated for clarity and coupled
with information gathered by the Unix tools. This additional information expands the
investigator's understanding of that field. For example, the trace fields "Received:" would
appear with an analysis of the sending and receiving mail hosts (IP address, domain name,
traceroute result, DNS and whois records, etc.). As mail hosts (represented by name or IP
address), email addresses, website links and other items appear in the report at different
sections of the email, so does the information gathered about these items. This provides the
investigator with as much information as possible and aids in the decision-making process
on what forensic leads to follow further.

Figure 9.2: Segment of a Report
To have a better understanding of the UnMask report structure and how it may be
used by law enforcement, we present an example section of a report that provides detailed
forensic information on the ”Received:” fields in an email header. Each email message carries
in its header a set of ”Received:” fields (the set can be empty), which collectively describe
the routes that the message takes from the sender to the recipient of the message at the
mail relay server level. It is important to note that, in order to mislead the recipient (or
investigator) of an email message about where the message originated, it is common for the
sender of a (spam or phishing) message to forge the first few "Received:" fields. However,
the set of "Received:" fields in a message still contains, as part of the route it shows, the
true path that the message took, so it provides valuable investigative information for law
enforcement.
As discussed above, an UnMask report follows the structure of an email message. For
each field, we provide additional forensic information gathered by the Unix Tools system. In
particular, for each "Received:" field, we first extract the domain names and IP addresses
of the mail relay servers appearing in the field. To aid the law enforcement investigation, we
then launch the corresponding Unix tools to determine, among other information, the location
and contact information of the organization (or person) responsible for each domain name (or
IP address), the route to the mail relay server, and the IP address of each domain name (and
vice versa). Discrepancies discovered during the analysis of the
"Received:" fields are also reported and highlighted. The following snippet is an example
"Received:" field from an email message that we received; it contains two domain names
(walking14.legessermon.com, mx.google.com) and one IP address (64.192.31.14). For each
of them, we collect proper forensic information by launching the corresponding tools. The
information shown was collected within one day after we received the message.
    Received: from walking14.legessermon.com (walking14.legessermon.com [64.192.31.14])
        by mx.google.com with ESMTP id e18si15752160qbe.2007.05.30.10.46.13;
        Wed, 30 May 2007 10:46:24 -0700 (PDT)
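The extraction step described above can be sketched as follows. UnMask's actual parser is written in Perl inside the PostgreSQL database; this Python sketch uses deliberately naive regular expressions for illustration only, and the variable names are my own.

```python
import re

# Example "Received:" field (ESMTP id truncated); in UnMask this text
# arrives from the email parsed inside PostgreSQL.
received = ("from walking14.legessermon.com "
            "(walking14.legessermon.com [64.192.31.14]) "
            "by mx.google.com with ESMTP")

# Naive patterns for illustration; a production parser must also handle
# header comments, IPv6 literals, and malformed fields.
ip_re = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
host_re = re.compile(r"\b(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}\b")

ips = set(ip_re.findall(received))      # IP addresses to probe
hosts = set(host_re.findall(received))  # hostnames to probe with whois, dig, traceroute

print(sorted(hosts))  # ['mx.google.com', 'walking14.legessermon.com']
print(sorted(ips))    # ['64.192.31.14']
```

Each extracted name or address then becomes one request to the Unix tools server.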
Figure 9.3 shows a snapshot of the report section related to the domain name
walking14.legessermon.com found in the example "Received:" field (the snapshot captures
only a segment of it). In this section of the report, we first determine the location and contact
information of the organization that is responsible for the domain name (partially shown
in the figure), the MX and DNS records for the corresponding domain, the route to the
domain name, and the IP address of the domain name, among other things. We noted that
the IP address returned by our tool is 64.192.31.2, which is different from
the one listed in the "Received:" field for this domain name. No strong
conclusion can be drawn from this discrepancy (note that the two IP addresses are on the
same subnet); however, we report the fact as is.
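A discrepancy check of this kind can be sketched as below. The function name is my own, and the lookup is network-dependent, so the result reflects DNS at the time the tool runs.

```python
import socket

def received_ip_matches(hostname, claimed_ip):
    """Return True if `claimed_ip` is among the IPv4 addresses currently
    resolving for `hostname`, False if it is not, and None when the name
    does not resolve at all (common for short-lived phishing domains)."""
    try:
        infos = socket.getaddrinfo(hostname, None, socket.AF_INET)
    except socket.gaierror:
        return None
    resolved = {info[4][0] for info in infos}
    return claimed_ip in resolved
```

For the example above, a call such as `received_ip_matches("walking14.legessermon.com", "64.192.31.14")` would have returned False at the time, since the name resolved to 64.192.31.2; the report records the mismatch without drawing a conclusion.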
We collected similar information about the IP address 64.192.31.14 and the domain
name mx.google.com; we do not discuss those results here due to space limitations. We note,
however, that the location and contact information returned from probing IP addresses tends
to be more long-lived and reliable than that returned from probing domain names. Domain
names (especially for phishing sites) and their associated registration information tend to
be short-lived, whereas IP address allocation is normally delegated to ISPs and is quite
stable. Three days later we re-ran the tools to generate another report on the message,
containing the domain name walking14.legessermon.com and its associated registration and
contact information. The resulting information turned out to be the same; had it been
different, further investigation might have been warranted.
CHAPTER 10
Conclusion
This thesis provided a method to automate the analysis of an email using a database.
Using database triggers and the Perl language interface in PostgreSQL, emails can be parsed
on the fly and a socket connection can be opened to a daemon (the Unix tools server) running
separately on another server machine. We described the UUTC protocol that we developed for
this automation process. Using this protocol, we established a communication channel between
the database server and the Unix tools server. The Unix tools server, upon accepting the
required parameters from the database server through the socket connection, runs various
tools such as traceroute, dig, and whois, and stores the results back in the database. Reports
are then generated in which the tool results can be seen; these reports are used to analyze
an email for further investigation.
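The tool-launching step on the Unix tools server can be sketched as follows. The real server is a Perl daemon speaking the UUTC protocol; the command-line flags, helper names, and tool table below are assumptions for illustration, not the thesis implementation.

```python
import subprocess

# Command templates such a server might use; the flags are illustrative.
TOOLS = {
    "whois": ["whois"],
    "dig": ["dig", "+short"],
    "traceroute": ["traceroute", "-m", "10"],
}

def build_command(tool, target):
    """Build the argv list for one lookup of `target` (a domain or IP)."""
    if tool not in TOOLS:
        raise ValueError("unknown tool: %s" % tool)
    return TOOLS[tool] + [target]

def run_tool(tool, target, timeout=30):
    """Run one lookup; return its stdout, or "" on error or timeout, so
    that a result row can be stored back in the database either way."""
    try:
        proc = subprocess.run(build_command(tool, target),
                              capture_output=True, text=True,
                              timeout=timeout)
        return proc.stdout
    except (OSError, subprocess.TimeoutExpired):
        return ""
```

Passing each domain name or IP address extracted from the email through `run_tool` for every entry in `TOOLS` yields the raw text that the report generator later organizes by header field.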
There is, however, scope for improvement; some possibilities are explained in the next
section.
10.1 Future Work
In the future, the following could be incorporated into the automation process explained in
this thesis to make it more efficient:
• Since PostgreSQL is an open source database, the UUTC protocol could be implemented
directly in the PostgreSQL source code. Instead of the PostgreSQL engine interpreting
the client socket code every single time, pre-compiled, ready-to-use UUTC libraries
would be available in the database itself. This would avoid repeating the same expensive
interpretation several times for a single email and would further improve the performance
of the automation process.
• Since the analysis of a raw email is done manually by studying the final reports
generated, it is slow to reach a conclusion on whether an email is a phishing email or
not. Incorporating the concept of uncertain databases [13, 14] would allow the system
itself to judge whether an email is a phishing email, giving the human analyst
considerable help during the analysis.
REFERENCES
[1] Sam Spade. http://www.pcworld.com/downloads/file/fid,4709-order,1-page,1-c,spamblockers/description.html.

[2] Domain Tools. http://www.domaintools.com.

[3] Sudhir Aggarwal, Jasbinder Bali, Zhenhai Duan, Leo Kermes, Wayne Liu, Shahank Sahai, and Zhenghui Zhu. The design and development of an undercover multipurpose anti-spoofing kit (UnMask). Annual Computer Security Applications Conference (ACSAC), December 2007.

[4] Auto Admin. http://research.microsoft.com/dmx/autoadmin/.

[5] Phisherman, SPARTA Inc. http://www.isso.sparta.com/documents/phisherman.pdf.

[6] PostgreSQL. http://www.postgresql.org.

[7] IPGEO Tools. http://www.ipgeo.com.

[8] CPAN. http://www.cpan.org.

[9] SQL Injection. http://en.wikipedia.org/wiki/SQL_injection.

[10] RFC 2822. http://www.ietf.org/rfc/rfc2822.txt.

[11] Database Cursors in PostgreSQL. http://www.postgresql.org/docs/8.2/static/sql-declare.html.

[12] PostgreSQL: Server Programming Interface. http://www.postgresql.org/docs/8.2/static/spi.html.

[13] Laks V. S. Lakshmanan and Nematollaah Shiri. A parametric approach to deductive databases with uncertainty. IEEE Transactions on Knowledge and Data Engineering, 13(4):554–570, July/August 2001.

[14] S. McClean, B. Scotney, and M. Shapcott. Aggregation of imprecise and uncertain information in databases. IEEE Transactions on Knowledge and Data Engineering, 13(6):902–912, November/December 2001.
BIOGRAPHICAL SKETCH
Jasbinder S. Bali
Jasbinder S. Bali was born in Kashmir, India in November 1980. He received his B.Tech in
Electrical Engineering from the National Institute of Technology, Jamshedpur (India),
graduating in July 2002. He came to the USA in August 2005 to pursue a Masters in Computer
Science at The Florida State University.
Before pursuing the Masters in Computer Science, Jasbinder worked in the information
technology industry for about three years as a Software Engineer, with Computer Sciences
Corporation, New Delhi (India) and Patni Computer Systems, Bombay (India), from October
2002 to July 2005, working on various software design and development projects. Currently,
he is working with Dr. Sudhir Aggarwal at FSU's Electronic Crime Investigative Technologies
Laboratory on a phishing email project called UnMASK. He is also working towards the
completion of his Masters in Fall 2007.

Jasbinder is the current president of Florida State University's Cricket Club and has
captained the team for the past two seasons.