Harvesting SSL Certificate Data to Identify Web-Fraud. Reporter: 鄭志欣. Advisor: Hsing-Kuo Pao. 2010/10/04


Page 1: Harvesting SSL Certificate Data to Identify Web-Fraud

Harvesting SSL Certificate Data to Identify Web-Fraud

Reporter: 鄭志欣

Advisor: Hsing-Kuo Pao

2010/10/04

1

Page 2: Harvesting SSL Certificate Data to Identify Web-Fraud

Mishari Al Mishari, Emiliano De Cristofaro, Karim El Defrawy, and Gene Tsudik. "Harvesting SSL Certificate Data to Identify Web-Fraud." Submitted to ICDCS’10. http://arxiv.org/abs/0909.3688

2

Conference

Page 3: Harvesting SSL Certificate Data to Identify Web-Fraud

  • Introduction
  • X.509 certificates
  • Measurements and Analysis of SSL Certificates
  • Certificate-Based Classifier
  • Conclusion

3

Outline

Page 4: Harvesting SSL Certificate Data to Identify Web-Fraud

Web-fraud is one of the most unpleasant features of today’s Internet. Two common forms are phishing and typosquatting.

Can we use the information in SSL certificates to identify web-fraud activities such as phishing and typosquatting, without compromising user privacy?

This paper presents a novel technique to detect web-fraud domains that utilize HTTPS.

4

Introduction

Page 5: Harvesting SSL Certificate Data to Identify Web-Fraud

5

Typosquatting

Page 6: Harvesting SSL Certificate Data to Identify Web-Fraud

The classifier achieves a detection accuracy over 80% and, in some cases, as high as 95%.

Our classifier is orthogonal to prior mitigation techniques and can be integrated with other methods.

Note that the classifier only relies on data in the SSL certificate and not any other private user information.

6

Contributions

Page 7: Harvesting SSL Certificate Data to Identify Web-Fraud

7

X.509 certificates

Page 8: Harvesting SSL Certificate Data to Identify Web-Fraud

A. HTTPS Usage and Certificate Harvest
  • Legitimate
  • Phishing and Typosquatting

B. Certificate Analysis
  • Analysis of Certificate Boolean Features
  • Analysis of Certificate Non-Boolean Features

8

Measurements and Analysis of SSL Certificates

Page 9: Harvesting SSL Certificate Data to Identify Web-Fraud

9

A. HTTPS Usage and Certificate Harvest

Page 10: Harvesting SSL Certificate Data to Identify Web-Fraud

Legitimate and Popular Domain Data Sets.
  • Alexa: the 100,000 most popular domains according to Alexa.
  • .com: 100,000 random samples of the .com domain zone file, collected from VeriSign.
  • .net: 100,000 random samples of the .net domain zone file, collected from VeriSign.

We find that 34% of Alexa domains use HTTPS; 21% in .com and 16% in .net. (Commercial)
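The harvesting step behind these percentages could be sketched roughly as follows. This is a minimal illustration, not the authors' tooling: the function names `fetch_pem` and `https_usage` are hypothetical, and it assumes that "uses HTTPS" means a TLS handshake on port 443 yields a certificate.

```python
import socket
import ssl


def fetch_pem(host, port=443, timeout=5.0):
    """Return the server certificate as PEM text, or None if none is served.

    Hypothetical helper. Verification is switched off on purpose, because
    expired and self-signed certificates are exactly what is being collected.
    """
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                der = tls.getpeercert(binary_form=True)
    except OSError:  # covers refused connections, timeouts and ssl.SSLError
        return None
    return ssl.DER_cert_to_PEM_cert(der) if der else None


def https_usage(domains, fetch=fetch_pem):
    """Return (fraction of domains that served a certificate, {domain: pem})."""
    harvested = {}
    for domain in domains:
        pem = fetch(domain)
        if pem is not None:
            harvested[domain] = pem
    fraction = len(harvested) / len(domains) if domains else 0.0
    return fraction, harvested
```

Passing the fetcher as a parameter keeps the counting logic testable without network access; a real harvest over 100,000 domains would also need concurrency and retries, which are left out here.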

10

A. HTTPS Usage and Certificate Harvest

Page 11: Harvesting SSL Certificate Data to Identify Web-Fraud

Phishing Data Set
  • We collected 2,811 domains considered to be hosting phishing scams from the PhishTank web site.
  • 30% of these phishing web sites employ HTTPS.

Typosquatting Data Set
  • We first identified the typo domains in our .com and .net data sets by using Google’s typo correction service.
  • We discovered that 9,830 out of 38,617 are parked domains.

11

A. HTTPS Usage and Certificate Harvest

Page 12: Harvesting SSL Certificate Data to Identify Web-Fraud

| Feature | Name | Type | Used in Classifier | Notes |
| --- | --- | --- | --- | --- |
| F1 | md5 | boolean | Yes | The signature algorithm of the certificate is "md5WithRSAEncryption" |
| F2 | bogus subject | boolean | Yes | The subject section of the certificate has bogus values (e.g., -, ., Somestate, somecity) |
| F3 | self-signed | boolean | Yes | The certificate is self-signed |
| F4 | expired | boolean | Yes | The certificate is expired |
| F5 | verification failed | boolean | No | The certificate failed the verification of OpenSSL 0.9.8k 25 Mar 2009 (for Debian Linux) |
| F6 | common certificate | boolean | Yes | The certificate of the given domain is the same as a certificate of another domain |
| F7 | common serial | boolean | Yes | The serial number of the certificate is the same as the serial number of another one |
| F8 | validity period > 3 years | boolean | Yes | The validity period is more than 3 years |
| F9 | issuer common name | string | Yes | The common name of the issuer |
| F10 | issuer organization | string | Yes | The organization name of the issuer |
| F11 | issuer country | string | Yes | The country name of the issuer |
| F12 | subject country | string | Yes | The country name of the subject |
| F13 | exact validity duration | integer | No | The number of days between the starting date and the expiration date |
| F14 | serial number length | integer | Yes | The number of characters in the serial number |
| F15 | host-common name distance | real | Yes | The Jaccard distance value between the host name and the common name in the subject section |
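As a rough illustration of how a few of these features could be derived once a certificate has been parsed, consider the sketch below. The dictionary layout (`signature_algorithm`, `issuer`, `subject`, `not_before`, `not_after`, `serial_hex`) and the function name are hypothetical, not taken from the paper. Two simplifications are assumed: self-signed is approximated by issuer equal to subject rather than by checking the signature, and "3 years" is taken as 3 × 365 days.

```python
from datetime import datetime


def certificate_features(cert, now):
    """Compute some of the features F1..F14 from already-parsed fields.

    `cert` is a hypothetical dict; a real pipeline would fill it from an
    X.509 parser. Only features that depend on a single certificate are
    shown (F6 and F7 need the whole data set).
    """
    validity_days = (cert["not_after"] - cert["not_before"]).days
    return {
        "F1_md5": cert["signature_algorithm"] == "md5WithRSAEncryption",
        # Assumption: issuer == subject stands in for a real signature check.
        "F3_self_signed": cert["issuer"] == cert["subject"],
        "F4_expired": cert["not_after"] < now,
        # Assumption: three years counted as 3 * 365 days.
        "F8_validity_gt_3_years": validity_days > 3 * 365,
        "F13_exact_validity_duration": validity_days,
        "F14_serial_number_length": len(cert["serial_hex"]),
    }


# Made-up certificate fields, for illustration only.
example = {
    "signature_algorithm": "md5WithRSAEncryption",
    "issuer": "CN=example.test",
    "subject": "CN=example.test",
    "not_before": datetime(2005, 1, 1),
    "not_after": datetime(2009, 1, 1),
    "serial_hex": "00A1B2",
}
features = certificate_features(example, now=datetime(2010, 10, 4))
```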

12

B. Certificate Analysis

Page 13: Harvesting SSL Certificate Data to Identify Web-Fraud

Analysis of Certificate Boolean Features

13

B. Certificate Analysis

Page 14: Harvesting SSL Certificate Data to Identify Web-Fraud

14

B. Certificate Analysis

Fig: CDF of serial number length of Alexa, .com, .net, (c) phishing, (d) typosquatting

F14: Serial Number Length
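A CDF like the one in this figure can be computed from a list of serial number lengths in a few lines. The sketch below is only illustrative: the function name is hypothetical and the sample lengths are made up, not taken from the data sets.

```python
def empirical_cdf(lengths):
    """Return sorted (value, fraction of samples <= value) pairs."""
    if not lengths:
        return []
    ordered = sorted(lengths)
    total = len(ordered)
    cdf = []
    for rank, value in enumerate(ordered, start=1):
        if cdf and cdf[-1][0] == value:
            # Same value seen again: raise its cumulative fraction.
            cdf[-1] = (value, rank / total)
        else:
            cdf.append((value, rank / total))
    return cdf


# Made-up serial number lengths (in characters), for illustration only.
sample_cdf = empirical_cdf([8, 16, 16, 32])
```

Computing one such curve per data set and plotting them together is one way to compare how serial number lengths are distributed across the sets.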

Page 15: Harvesting SSL Certificate Data to Identify Web-Fraud

15

Certificate Analysis

F15: Jaccard Distance
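Feature F15 is the Jaccard distance between the host name and the common name in the certificate's subject section. The Jaccard distance of two sets A and B is 1 − |A ∩ B| / |A ∪ B|. The slides do not say how the two strings are turned into sets, so the sketch below assumes lower-cased character bigrams; that choice and the function names are illustrative only.

```python
def char_ngrams(text, n=2):
    """Return the set of lower-cased character n-grams of `text`."""
    text = text.lower()
    if len(text) < n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard_distance(host_name, common_name, n=2):
    """1 - |A & B| / |A | B| over character n-grams (assumed tokenization)."""
    a = char_ngrams(host_name, n)
    b = char_ngrams(common_name, n)
    union = a | b
    if not union:
        return 0.0  # two empty strings are treated as identical
    return 1.0 - len(a & b) / len(union)
```

A distance of 0 means the host name and the common name yield identical n-gram sets, and a distance of 1 means they share none.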

Page 16: Harvesting SSL Certificate Data to Identify Web-Fraud

Around 20% of legitimate popular domains are still using the signature algorithm “md5WithRSAEncryption” despite its clear insecurity.

A significant percentage (> 30%) of legitimate domain certificates are expired and/or self-signed.

Duplicate certificate percentages are very high in phishing domains.

For most features, the difference in distributions between Alexa and malicious sets is larger than that between .com/.net and malicious sets.

16

Summary of Certificate Feature Analysis

Page 17: Harvesting SSL Certificate Data to Identify Web-Fraud

A. Phishing Classifier

B. Typosquatting Classifier

17

Certificate-Based Classifier

Page 18: Harvesting SSL Certificate Data to Identify Web-Fraud

| Classifier | Positive Recall | Positive Precision |
| --- | --- | --- |
| Random Forest | 0.74 | 0.77 |
| SVM | 0.68 | 0.75 |
| Decision Tree | 0.70 | 0.79 |
| Bagging Decision Tree | 0.73 | 0.80 |
| Boosting Decision Tree | 0.74 | 0.69 |
| Decision Tree | 0.72 | 0.78 |
| Nearest Neighbor | 0.74 | 0.73 |
| Neural Networks | 0.70 | 0.77 |

18

Phishing Classifier

Table IV: Performance of classifiers. Data set consists of (A) 420 phishing certificates and (B) 420 non-phishing certificates (Alexa, .COM and .NET)

| Classifier | Positive Recall | Positive Precision |
| --- | --- | --- |
| Random Forest | 0.90 | 0.89 |
| SVM | 0.91 | 0.87 |
| Decision Tree | 0.86 | 0.89 |
| Bagging Decision Tree | 0.90 | 0.88 |
| Boosting Decision Tree | 0.89 | 0.81 |
| Decision Tree | 0.90 | 0.84 |
| Nearest Neighbor | 0.87 | 0.86 |
| Neural Networks | 0.87 | 0.89 |

Table V: Performance of classifiers. Data set consists of (A) 420 phishing certificates and (B) 420 non-phishing certificates (Alexa)

Page 19: Harvesting SSL Certificate Data to Identify Web-Fraud

19

Typosquatting Classifier

| Classifier | Positive Recall | Positive Precision |
| --- | --- | --- |
| Random Forest | 0.86 | 0.88 |
| SVM | 0.86 | 0.88 |
| Decision Tree | 0.84 | 0.93 |
| Bagging Decision Tree | 0.87 | 0.90 |
| Boosting Decision Tree | 0.89 | 0.85 |
| Decision Tree | 0.86 | 0.87 |
| Nearest Neighbor | 0.86 | 0.84 |
| Neural Networks | 0.84 | 0.89 |

Table VI: Performance of classifiers. Data set consists of (A) 486 typosquatting certificates and (B) 486 non-typosquatting certificates (Top Alexa, .COM and .NET)

| Classifier | Positive Recall | Positive Precision |
| --- | --- | --- |
| Random Forest | 0.95 | 0.93 |
| SVM | 0.94 | 0.92 |
| Decision Tree | 0.95 | 0.94 |
| Bagging Decision Tree | 0.96 | 0.94 |
| Boosting Decision Tree | 0.98 | 0.90 |
| Decision Tree | 0.96 | 0.92 |
| Nearest Neighbor | 0.93 | 0.94 |
| Neural Networks | 0.95 | 0.95 |

Table VII: Performance of classifiers. Data set consists of (A) 486 typosquatting certificates and (B) 486 popular domain certificates
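For reference, the two columns reported in the tables above follow the usual definitions: positive recall is TP / (TP + FN) and positive precision is TP / (TP + FP), where the positive class is the fraudulent one. A minimal sketch, with a hypothetical function name and made-up labels that are not results from the paper:

```python
def positive_precision_recall(true_labels, predicted_labels, positive=1):
    """Return (precision, recall) for the positive (fraudulent) class."""
    tp = fp = fn = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if guess == positive and truth == positive:
            tp += 1  # fraudulent certificate correctly flagged
        elif guess == positive:
            fp += 1  # legitimate certificate wrongly flagged
        elif truth == positive:
            fn += 1  # fraudulent certificate missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels: 1 = fraudulent, 0 = legitimate.
truth = [1, 1, 1, 1, 0, 0, 0, 0]
guess = [1, 1, 1, 0, 1, 0, 0, 0]
precision, recall = positive_precision_recall(truth, guess)
```

Because each data set is balanced (equal numbers of positive and negative certificates), a classifier that flags everything would reach recall 1.0 but precision of only 0.5, which is why both columns are reported.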

Page 20: Harvesting SSL Certificate Data to Identify Web-Fraud

We design and build a machine-learning-based classifier that identifies fraudulent domains using HTTPS based solely on their SSL certificates, thus also preserving user privacy.

We believe that our results may serve as a motivating factor to increase the use of HTTPS on the Web.

Use of HTTPS can help identify web-fraud domains.

20

Conclusion