Visual fingerprinting for malicious websites

BY IBRAHIM MOSAAD

SUPERVISED BY OSAMA KAMAL

VISUAL FINGERPRINTING FOR MALICIOUS DOMAINS

OUTLINE

• Introduction

• Statistics of malicious Domains/URLs

• Goal

• How

• Conceptually

• Theoretically

• Practically

• Testing And Results

• Challenges

• Future Work

INTRODUCTION

• Statistics

• In 2014, Kaspersky Lab’s web antivirus detected 123,054,503 unique malicious objects: scripts, exploits, executable files, etc.

INTRODUCTION

• Exploit kits

• How common are exploit kits?

• 6000 infections/0.2 hour

• 2B visitors/month

• 2/3 of all malware is delivered by exploit kits

GOAL

“Create an automated system to differentiate between benign and malicious websites”

HOW - CONCEPTUALLY

• How do malicious websites behave?

• Lack of a good training set

• How do benign websites behave?

• Testing top 250 websites from different categories in Alexa

• Scoring system

HOW – THEORETICALLY

• Browse websites using a real or emulated system

• Store and visualize the collected data

• Score it

HOW - PRACTICALLY

• Browsing websites using honeyclients

• Low-interaction

• Thug

• HoneySpider Network 2.0

• High-interaction

• Capture-HPC

• HoneyClient

HOW - PRACTICALLY

• HSN

• Modular Framework – Extendable

• Wappalyzer module (Developed)

• Peepdf Module (Developed)

• Cuckoo sandbox module (Updated)

• Yara module (Updated)

HOW - PRACTICALLY

• Storing collected data

• Graph database: Neo4j

• Graph database driver for HSN, written with py2neo

• Scoring system

• Mix of first-degree and second-degree functions
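
The slides do not give the actual scoring functions or their weights. As a minimal sketch of what a "mix of first-degree and second-degree functions" could look like, the example below assumes some features contribute a linear term (w * x) and others a quadratic term (w * x^2). The feature-to-term assignment, the `Features` class, and every coefficient are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Features:
    levels: int        # depth of the resource tree
    resources: int     # number of fetched resources
    redirections: int  # number of redirects followed
    iframes: int       # number of iframes encountered


# Hypothetical coefficients; the real weights are not given in the slides.
LINEAR_WEIGHTS = {"levels": 1.0, "resources": 0.1}
QUADRATIC_WEIGHTS = {"redirections": 0.5, "iframes": 0.25}


def score(f: Features) -> float:
    """Sum first-degree terms (w * x) and second-degree terms (w * x**2)."""
    linear = sum(w * getattr(f, name) for name, w in LINEAR_WEIGHTS.items())
    quadratic = sum(
        w * getattr(f, name) ** 2 for name, w in QUADRATIC_WEIGHTS.items()
    )
    return linear + quadratic
```

A quadratic term makes the score grow faster as a count rises, which may suit features such as redirections where a few are normal but many are suspicious. That choice is an assumption here, not something the slides state.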

FIRST RUN - TRAINING

• Number of websites: 1500

MOZILLA.ORG

AVG.COM

ORACLE.COM

APPLE.COM

FIRST RUN

• Feature extraction

• Number of levels

• Number of resources

• Number of redirections

• Number of iframes

• Website Topology
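
The features listed above can be sketched as one traversal over a crawl tree. This is only an illustration: the `CrawlNode` class and its `kind` labels ("page", "resource", "redirect", "iframe") are hypothetical and are not HSN's actual data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CrawlNode:
    url: str
    kind: str = "resource"  # hypothetical labels: page, resource, redirect, iframe
    children: List["CrawlNode"] = field(default_factory=list)


def extract_features(root: CrawlNode) -> Dict[str, int]:
    """Count levels, resources, redirections and iframes in a crawl tree."""
    counts = {"levels": 0, "resources": 0, "redirections": 0, "iframes": 0}
    stack = [(root, 1)]  # iterative depth-first walk; the root is level 1
    while stack:
        node, depth = stack.pop()
        counts["levels"] = max(counts["levels"], depth)
        if node.kind == "redirect":
            counts["redirections"] += 1
        elif node.kind == "iframe":
            counts["iframes"] += 1
        elif node.kind == "resource":
            counts["resources"] += 1
        stack.extend((child, depth + 1) for child in node.children)
    return counts
```

The topology itself (which node links to which) is kept in the tree rather than reduced to a number, so it can still be stored or drawn as a graph.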

BABYLON.COM

SECOND RUN – REAL CASE

• Top domains that looked malicious

• http://dictionary.reverso.net

• http://n4hr.com

• http://s02.arab.sh

• http://dc11.arabsh.com

http://dictionary.reverso.net/

http://n4hr.com/

http://s02.arab.sh/

http://dc11.arabsh.com/

CHALLENGES

• HSN

• Lack of good documentation

• Last version was released in 2013

• Code written in 3 languages: C, Python, and Java

• Lack of community support

CHALLENGES

• Graph Database (py2neo)

• Insertion

• Library is still immature

• REST-API can’t handle it

• 7000 URLs * 30 * 2 = 420,000 ≈ 0.5M nodes

• Store the queries in one request?!

• Huge POST request

• Querying

• 7000 URL => 7000*20 = 140K
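
One middle ground between one request per node and a single huge POST is to send statements in fixed-size batches. The sketch below uses only the standard library and stops short of talking to Neo4j. The `Url` label, the `address` property, the `$address` parameter style, and the batch size are all assumptions, and the meaning of the factors 30 and 2 is taken from the slide without further detail.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Tuple

Statement = Tuple[str, Dict[str, str]]  # (Cypher text, parameters)


def url_statements(urls: Iterable[str]) -> Iterator[Statement]:
    """Yield one parameterized Cypher MERGE per URL (hypothetical schema)."""
    for url in urls:
        yield ("MERGE (u:Url {address: $address})", {"address": url})


def in_batches(items: Iterable[Statement], size: int) -> Iterator[List[Statement]]:
    """Group statements into lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def estimated_nodes(urls: int, per_url: int = 30, factor: int = 2) -> int:
    """Reproduce the slide's estimate: URLs * 30 * 2."""
    return urls * per_url * factor
```

Each batch would then be submitted as one transaction or request. A suitable batch size depends on the server and driver, so it needs to be measured rather than guessed.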

FUTURE WORK

• HSN

• Enhance the web-client module

• Enhance SWF emulation module

• Scoring System

• Machine learning

• Graph Database

• Adopt Giraph rather than Neo4j

• Monitoring governmental websites

BIGGER PICTURE

Questions?
