Network Traffic Search using Apache HBase Evans Ye @ TWHUG 2014 Q1 2014/3/8


Page 1: Network Traffic Search using Apache HBase

Network Traffic Search using Apache HBase

Evans Ye @ TWHUG 2014 Q1

2014/3/8

Page 2: Network Traffic Search using Apache HBase

• Evans Ye @

– Dumbo Team
  • Dumbo In Taiwan Blog

– Talk in TWHUG 2013 Q4
  • Building Hadoop Based Big Data Environment

– Apache Bigtop Contributor

Who am I

04/11/2023 Copyright 2013 Trend Micro Inc.

Page 3: Network Traffic Search using Apache HBase

• Problem to Solve

• Solution Design

• Flume ETL Process

• Experience Sharing

• Future Work

Agenda

Page 4: Network Traffic Search using Apache HBase

Step aside and let the professionals handle it!

Security Department: Hey SPN, I have a big data problem…

Page 5: Network Traffic Search using Apache HBase

Network Traffic Analysis Example

[Diagram: INTRANET with the TW branch and US branch, containing VICTIM 1 to VICTIM 4; INTERNET with C&C 1, C&C 2 and C&C 3]

Page 6: Network Traffic Search using Apache HBase

• ArcSight Common Event Format
– Volume: 250 GB / 180 million records per day

Find Malicious Connections by Searching Netflow logs

Page 7: Network Traffic Search using Apache HBase

• src: source ip

• dst: destination ip

• spt: source port

• dpt: destination port

• proto: protocol (TCP, UDP, …)

• rt: timestamp in epoch milliseconds, e.g. 1386018915000

Valuable Fields in Netflow log
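
As a minimal sketch of pulling these fields out of one record: ArcSight CEF carries key=value pairs in the extension part that follows the seven pipe-delimited header fields. The sample line below is made up for illustration (vendor and product names are placeholders), and the parser deliberately ignores CEF escaping and values that contain spaces.

```python
from datetime import datetime, timezone

NEEDED = ("src", "dst", "spt", "dpt", "proto", "rt")

def parse_cef_fields(line):
    """Return the needed key=value fields from a CEF record's extension."""
    # CEF:Version|Vendor|Product|DeviceVersion|SignatureID|Name|Severity|Extension
    parts = line.split("|", 7)
    if len(parts) < 8 or not parts[0].startswith("CEF:"):
        raise ValueError("not a CEF record")
    # Simplification: assumes no escaped '=' and no spaces inside values.
    fields = dict(tok.split("=", 1) for tok in parts[7].split() if "=" in tok)
    return {key: fields[key] for key in NEEDED if key in fields}

# Hypothetical sample record, not taken from the real data set.
sample = (
    "CEF:0|ExampleVendor|ExampleProduct|1.0|100|netflow|1|"
    "src=10.1.2.3 dst=203.0.113.7 spt=51234 dpt=443 proto=TCP rt=1386018915000"
)
record = parse_cef_fields(sample)

# rt is epoch milliseconds, so divide by 1000 before converting.
when = datetime.fromtimestamp(int(record["rt"]) / 1000, tz=timezone.utc)
```

Here `when` comes out as 2013-12-02 21:15:15 UTC for the timestamp shown on the slide.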

Page 8: Network Traffic Search using Apache HBase

Search for Connections

[Diagram: a query against NetflowLogger takes about 8~10 min]

Page 9: Network Traffic Search using Apache HBase

Big Data Problem

Page 10: Network Traffic Search using Apache HBase

• Big data solutions

• Why HBase?
– We wanted to try it and find out the limitations of HBase Thrift
– We wanted to see how HBase performs on this kind of problem

Choosing The Right Tool

Page 11: Network Traffic Search using Apache HBase

Solution Design

Page 12: Network Traffic Search using Apache HBase

Architecture

[Architecture diagram]

Data Source: sends Netflow via syslog

HBase Thrift Server: talk to HBase using C++, Python, PHP, Ruby, Perl…

Query portal: a simple Python web framework, only one file under 150k

Page 13: Network Traffic Search using Apache HBase

• Searchable Fields
– src: source ip
– dst: destination ip
– spt: source port
– dpt: destination port
– proto: protocol, TCP, UDP…
– rt: timestamp, 1386018915000

• Values
– in, cn2, ad.tcp__flags

User Requirement

Page 14: Network Traffic Search using Apache HBase

• Compose the searchable fields into the rowkey

• For a client query, scan with an HBase filter applied
– RowFilter (=, 'regexstring:^src#dst#[^#]*#spt#dpt#proto$')
– See the HBase Thrift Filter doc

HBase Rowkey Design – First Attempt
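
A minimal sketch of how a portal could assemble that filter string from whatever fields the user filled in. The key order `src#dst#rt#spt#dpt#proto` is an assumption (the slide only shows a wildcard in the third position), and the helper names are hypothetical. IP dots are escaped so that they match literally in the regex.

```python
import re

# Assumed rowkey layout; the talk only shows a wildcard in the third slot.
KEY_ORDER = ("src", "dst", "rt", "spt", "dpt", "proto")

def build_row_filter(**query):
    """Build a filter-language RowFilter string for the first-attempt rowkey."""
    parts = []
    for name in KEY_ORDER:
        value = query.get(name)
        # Unspecified fields match any run of non-'#' characters.
        parts.append("[^#]*" if value is None else re.escape(str(value)))
    return "RowFilter (=, 'regexstring:^%s$')" % "#".join(parts)

def row_regex(filter_string):
    """Extract the regex so it can be tried locally with Python's re module."""
    return filter_string.split("regexstring:", 1)[1].rsplit("'", 1)[0]

example = build_row_filter(
    src="10.1.2.3", dst="203.0.113.7", spt=51234, dpt=443, proto="TCP"
)
```

Trying `row_regex(example)` against a few sample rowkeys with `re.match` is a cheap way to check a pattern before sending it to the server. Note that the server still evaluates the regex against every rowkey.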

Page 15: Network Traffic Search using Apache HBase

RD Style Search Portal

Page 16: Network Traffic Search using Apache HBase

• Tested on 12 million sample records

• The search performance: 1.5~2min

• Since we need to keep at least 3 months of data available for query, this performance might not be good enough…

Performance

Page 17: Network Traffic Search using Apache HBase

• Avoid full table scans
– HBase filters only keep unwanted data from being sent to the client side
– On the server side, HBase still needs to compare every rowkey when applying a filter
– Set STARTROW and STOPROW instead

Lesson Learned

Page 18: Network Traffic Search using Apache HBase

• HBase is natively designed to store data sorted by rowkey

• So scanning rows is fast when a rowkey prefix is specified

– This is only fast when the source ip is specified
– How about destination ip, port, protocol, …?

Avoid Full Table Scan
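
The difference can be illustrated with a toy model: a sorted Python list stands in for the table, `bisect` stands in for seeking to STARTROW, and a counter shows how many rowkeys a filter-only scan has to compare. The keys are synthetic, and Python compares strings by code point, which agrees with HBase's byte order only for ASCII keys like these.

```python
from bisect import bisect_left

def prefix_range(prefix):
    """STARTROW/STOPROW covering every rowkey that begins with prefix."""
    # Bumping the last character gives the smallest key after the prefix range.
    # (Ignores the edge case where the last character is already the maximum.)
    return prefix, prefix[:-1] + chr(ord(prefix[-1]) + 1)

def range_scan(sorted_keys, start, stop):
    """Rows in [start, stop): only these are read by a bounded scan."""
    lo = bisect_left(sorted_keys, start)
    hi = bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]

def filtered_full_scan(sorted_keys, prefix):
    """Filter-only scan: every rowkey is compared, matches are returned."""
    compared, hits = 0, []
    for key in sorted_keys:
        compared += 1
        if key.startswith(prefix):
            hits.append(key)
    return hits, compared

# 3000 synthetic src#dst rowkeys.
keys = sorted(
    "10.0.%d.%d#203.0.113.%d" % (a, b, d)
    for a in range(20) for b in range(50) for d in range(3)
)
start, stop = prefix_range("10.0.7.5#")   # trailing '#' avoids matching 10.0.7.5x
bounded = range_scan(keys, start, stop)
hits, compared = filtered_full_scan(keys, "10.0.7.5#")
```

Both approaches return the same 3 rows, but the filter-only scan compares all 3000 keys while the bounded scan touches only the matching slice.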

Page 19: Network Traffic Search using Apache HBase

• Searchable Fields
– src: source ip
– dst: destination ip
– spt: source port
– dpt: destination port
– proto: protocol
– rt: timestamp

• Users want to track down suspicious connections
– A query needs at least one IP, so src or dst is required

Rethink The User Requirement

Page 20: Network Traffic Search using Apache HBase

– Search on source ip

– Search on destination ip

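– Put the netflow timestamp into the HBase timestamp to leverage HBase TimeRange scans

– Set VERSIONS => 2147483647 so that records with the same rowkey and qualifier but different timestamps are all kept instead of colliding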

HBase Rowkey Design – Second Attempt !
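
A toy in-memory model of why this works, assuming the rowkey is `src#dst` and the qualifier is `spt#dpt#proto` as the later slides suggest. It is only a sketch of the data model, not HBase itself: each cell keeps every timestamped version, and a scan combines a row range with a half-open time range. The values `flow-a` and `flow-b` are placeholders.

```python
from collections import defaultdict

class VersionedTable:
    """Toy model of one column family that keeps every version of a cell."""

    def __init__(self):
        # (rowkey, qualifier) -> {timestamp_ms: value}
        self.cells = defaultdict(dict)

    def put(self, row, qualifier, timestamp_ms, value):
        # Same row + qualifier + timestamp still overwrites, as in HBase.
        self.cells[(row, qualifier)][timestamp_ms] = value

    def scan(self, start_row, stop_row, min_ts, max_ts):
        """Yield cells with start_row <= row < stop_row and min_ts <= ts < max_ts."""
        for (row, qualifier), versions in sorted(self.cells.items()):
            if start_row <= row < stop_row:
                for ts in sorted(versions):
                    if min_ts <= ts < max_ts:
                        yield row, qualifier, ts, versions[ts]

table = VersionedTable()
row, qualifier = "10.1.2.3#203.0.113.7", "51234#443#TCP"
table.put(row, qualifier, 1386018915000, "flow-a")
table.put(row, qualifier, 1386018975000, "flow-b")  # same connection, one minute later

# Only the first record falls inside this one-minute window.
window = list(table.scan("10.1.2.3#", "10.1.2.3$", 1386018900000, 1386018960000))
```

With a version limit of 1, the second put would have replaced the first; with the limit raised, both records survive and the time range picks between them.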

Page 21: Network Traffic Search using Apache HBase

• Search the other searchable fields by applying a Qualifier Filter:
– QualifierFilter (=, 'regexstring:^spt#dpt#proto$')

HBase Rowkey Design – Second Attempt !

Page 22: Network Traffic Search using Apache HBase

• Searchable Fields
– src: source ip (specify STARTROW/STOPROW)
– dst: destination ip (specify STARTROW/STOPROW)
– spt: source port (apply qualifier filter)
– dpt: destination port (apply qualifier filter)
– proto: protocol (apply qualifier filter)
– rt: timestamp (specify HBase TimeRange)

Check The User Requirement 
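
The mapping above can be sketched as a small planner that turns a portal query into scan parameters. The table names `netflow_src` and `netflow_dst`, the function name and the returned dict shape are all hypothetical; only the rowkey/qualifier/timestamp split comes from the slides.

```python
import re

def plan_scan(src=None, dst=None, spt=None, dpt=None, proto=None, rt_range=None):
    """Map a query onto table, row range, qualifier filter and time range."""
    if src is None and dst is None:
        raise ValueError("a query needs at least one IP")
    # Lead with whichever IP is known; table names are placeholders.
    if src is not None:
        table, first, second = "netflow_src", src, dst
    else:
        table, first, second = "netflow_dst", dst, src
    if second is None:
        # Every key that begins with "first#"; '$' is the character after '#'.
        start, stop = first + "#", first + "$"
    else:
        # Both IPs known: exactly one row. key + NUL is the smallest key after it.
        start = first + "#" + second
        stop = start + "\x00"
    plan = {"table": table, "startrow": start, "stoprow": stop}
    if any(v is not None for v in (spt, dpt, proto)):
        parts = ["[^#]*" if v is None else re.escape(str(v)) for v in (spt, dpt, proto)]
        plan["filter"] = "QualifierFilter (=, 'regexstring:^%s$')" % "#".join(parts)
    if rt_range is not None:
        plan["timerange"] = rt_range  # (min_ms, max_ms), max exclusive
    return plan

example = plan_scan(src="10.1.2.3", dpt=443, rt_range=(1386018900000, 1386018960000))
```

The row range narrows the scan first, so the qualifier filter only runs over the rows for the requested IP.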

Page 23: Network Traffic Search using Apache HBase

Deliver New Portal

Page 24: Network Traffic Search using Apache HBase

• Tested on 70 million sample records

• The search performance: <1s~1min

• Enough?
– Since malicious connections won’t have a large volume, 80% of queries should get a response within a second

• Duplicate issue:
– Since we only store the needed fields in HBase, the data volume is only 150MB/day, or 300MB/day duplicated
– Storing 3 months of data = 13.5GB, or 27GB duplicated (GZed) (record count = 12 billion)

Performance
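
The storage figures follow from simple arithmetic, sketched here under two assumptions: "3 months" is taken as 90 days, and units are decimal (1 GB = 1000 MB). The factor of two reflects keeping the data twice, presumably once keyed by source ip and once by destination ip.

```python
MB_PER_DAY = 150   # needed fields only, per the slide
DAYS = 90          # assumption: "3 months" taken as 90 days
COPIES = 2         # assumption: one copy keyed by src, one keyed by dst

single_gb = MB_PER_DAY * DAYS / 1000    # decimal GB
duplicated_gb = single_gb * COPIES
```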

Page 25: Network Traffic Search using Apache HBase

• Tested on 240 million sample records

• The search performance: <1s~3min

• The query time stays robust for the 80% query case

Test on Even Larger Data

Page 26: Network Traffic Search using Apache HBase

Flume ETL Process

Page 27: Network Traffic Search using Apache HBase

Architecture

[Architecture diagram: the Data Source sends Netflow via syslog; queries go to HBase through the HBase Thrift Server]

Page 28: Network Traffic Search using Apache HBase

Flume Process

[Diagram: Data Source → Flume Spooling Directory Source → Flume File Channel → Flume HBase Sink (with Serializer) → HBase]

Serializer

1. Extract the needed fields from the Netflow log

2. Create an HBase Put object for the Sink to execute
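
The second serializer step can be sketched in Python as a function that turns already-parsed fields into the coordinates of one put. A real Flume serializer would be a Java class plugged into the HBase sink; this only illustrates the mapping. The column family name `f` and the way the values are packed into one string are assumptions, since the talk does not describe them.

```python
def to_put(fields, value_keys=("in", "cn2", "ad.tcp__flags")):
    """Turn parsed Netflow fields into the coordinates of one HBase put."""
    return {
        # Rowkey: source ip first, so scans by source ip are prefix scans.
        "row": "%s#%s" % (fields["src"], fields["dst"]),
        # Column family "f" is a placeholder name.
        "column": "f:%s#%s#%s" % (fields["spt"], fields["dpt"], fields["proto"]),
        # Netflow rt (epoch ms) becomes the cell timestamp.
        "timestamp": int(fields["rt"]),
        # Packing of the values is an assumption: joined with commas here.
        "value": ",".join(str(fields.get(key, "")) for key in value_keys),
    }

# Hypothetical parsed record.
example_put = to_put({
    "src": "10.1.2.3", "dst": "203.0.113.7",
    "spt": "51234", "dpt": "443", "proto": "TCP",
    "rt": "1386018915000", "in": "1024",
})
```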

Page 29: Network Traffic Search using Apache HBase

Dual Table Write

[Diagram: Infosec / Data Source → Flume Spooling Directory Source → Channel1 → Sink1, and → Channel2 → Sink2]

flume.conf
…
agent1.sinks.sink1.serializer.rowKey = src, dst
agent1.sinks.sink2.serializer.rowKey = dst, src

Duplicate, Again!

Page 30: Network Traffic Search using Apache HBase

Step 1 • A put triggers the prePut coprocessor

Step 2 • The coprocessor puts to the dst table in dst#src format

Step 3 • The regular put goes to the src table in src#dst format

More Elegant Way

[Diagram: Infosec / Data Source → Flume Spooling Directory Source → Channel1 → Sink1 → src table; a prePut coprocessor hooked on the src table writes to the dst table]
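
The three steps can be mimicked with a toy in-memory model. A real implementation would be a Java coprocessor (a RegionObserver with a prePut hook) running on the region servers; the classes and names below are hypothetical and only show the order of operations.

```python
class Table:
    """Toy table: a dict of rowkey -> value, with optional pre-put hooks."""

    def __init__(self, name):
        self.name = name
        self.rows = {}
        self.pre_put_hooks = []

    def put(self, row, value):
        for hook in self.pre_put_hooks:   # Step 1: the put triggers the hook
            hook(row, value)
        self.rows[row] = value            # Step 3: the regular put

def mirror_to(dst_table):
    """Build a hook that writes the same value under dst#src in dst_table."""
    def hook(row, value):
        src, dst = row.split("#", 1)
        dst_table.put("%s#%s" % (dst, src), value)   # Step 2
    return hook

src_table, dst_table = Table("src"), Table("dst")
src_table.pre_put_hooks.append(mirror_to(dst_table))

# The client (the Flume sink) only writes to the src table.
src_table.put("10.1.2.3#203.0.113.7", "flow-a")
```

After the single client put, both tables hold the record, so Flume needs only one channel and one sink.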

Page 31: Network Traffic Search using Apache HBase

Experience Sharing & Future Work

Page 32: Network Traffic Search using Apache HBase

• Thrift
– Thrift is not a first-class citizen of HBase; for example, Thrift scans do not support TimeRange and Version
– New filters (for example, FuzzyRowFilter) are not supported, since Thrift has its own filter language

• Bottle
– It won’t hurt when you delete your web backend code that is implemented with Bottle

Experience Sharing

Page 33: Network Traffic Search using Apache HBase

• Flume
– There is also a Flume Syslogudp Source, but it does not work well without extra work
  • 768 bytes per message limitation (fixed in FLUME-2130)
  • Still has a 2048-byte limitation in the netty event decoder
  • Data may be lost due to concatenated messages...

– Spooling Directory Source is much more stable

Experience Sharing

Page 34: Network Traffic Search using Apache HBase

• Make the index table transparent to clients
– Use a coprocessor to hook the client scan and decide which table to scan

• Make Thrift scan support specifying versions:
– For now I use scan to fetch rows and qualifiers, then use getVer to fetch the different versions (Thrift does support “version” on get)

Future Work
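
The current scan-then-getVer workaround might look roughly like the sketch below. It assumes a client generated from the HBase Thrift (version 1) interface, with `scannerOpenWithScan`, `scannerGetList`, `scannerClose` and `getVer` taking the arguments shown; those signatures should be checked against the generated code for the HBase version in use. The function name and the `scan` argument (a prepared TScan) are hypothetical.

```python
def fetch_all_versions(client, table, scan, max_versions=2147483647, batch=100):
    """Scan for (row, column) pairs, then fetch every version of each with getVer."""
    results = []
    scanner_id = client.scannerOpenWithScan(table, scan, {})
    try:
        while True:
            rows = client.scannerGetList(scanner_id, batch)
            if not rows:
                break
            for row_result in rows:
                # The scan only returns the newest version of each column,
                # so ask for the older versions one column at a time.
                for column in row_result.columns:
                    cells = client.getVer(
                        table, row_result.row, column, max_versions, {}
                    )
                    results.extend(
                        (row_result.row, column, cell.timestamp, cell.value)
                        for cell in cells
                    )
    finally:
        client.scannerClose(scanner_id)
    return results
```

The cost is one extra round trip per column returned by the scan, which is why pushing version support into the scan itself is listed as future work.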

Page 35: Network Traffic Search using Apache HBase

Questions?

Page 36: Network Traffic Search using Apache HBase

Thank you !