BIG DATA MODERN TECHNOLOGIES György Balogh LogDrill Ltd. SECWorld – 7 May 2014


DESCRIPTION

György Balogh gave a presentation at the SECWorld 2014 conference about cutting-edge yet affordable Big Data technologies.

Text of the presentation

  • 1. BIG DATA MODERN TECHNOLOGIES György Balogh LogDrill Ltd. SECWorld – 7 May 2014
  • 2. AGENDA What is Big Data? Why do we have to talk about it? Paradigm shift in information management. Technology and efficiency.
  • 3. WHAT IS BIG DATA? Data volumes that cannot be handled by traditional solutions (e.g. relational databases). More than 100 million data rows, typically multiple billions.
  • 4. GLOBAL RATE OF DATA PRODUCTION (PER SECOND) 30 TB/sec (22,000 films). Digital media: 2 hours of YouTube video. Communication: 3,000 business emails, 300,000 SMS. Web: half a million page views. Logs: billions of entries.
  • 5. BIG DATA MARKET
  • 6. HYPE OR REALITY?
  • 7. WHY NOW? Long-term trends: the volume of stored data has doubled every 40 months since the 1980s. Moore's law: the number of transistors on integrated circuits doubles every 18 months.
  • 8. DIFFERENT EXPONENTIAL TRENDS
  • 9. HARD DRIVES IN 1991 AND 2012 1991: 40 MB, 3500 RPM, 0.7 MB/sec, full scan ~1 minute. 2012: 4 TB (x 100,000), 7200 RPM, 120 MB/sec (x 170), full scan ~8 hours (x 480).
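
The asymmetry above is easy to verify with the slide's own numbers: capacity grew about 100,000x while sequential bandwidth grew only about 170x, so a full scan of the drive takes several hundred times longer. A quick back-of-the-envelope check in Python:

```python
# Full-scan time = capacity / sequential transfer rate,
# using the 1991 and 2012 drive specs quoted above.
MB = 1.0
TB = 1_000_000 * MB  # decimal terabyte, as drive vendors count

t_1991 = 40 * MB / 0.7    # 40 MB at 0.7 MB/s -> ~57 s
t_2012 = 4 * TB / 120     # 4 TB at 120 MB/s  -> ~33,000 s

print(f"1991 full scan: {t_1991:.0f} s (~1 minute)")
print(f"2012 full scan: {t_2012 / 3600:.1f} h (the slide rounds to 8 hours)")
print(f"scan time grew ~{t_2012 / t_1991:.0f}x, "
      f"capacity ~{4 * TB / (40 * MB):,.0f}x")
```
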
  • 10. DATA ACCESS BECOMES THE SCARCE RESOURCE!
  • 11. GOOGLE'S HARDWARE IN 1998
  • 12. GOOGLE'S HARDWARE IN 2013 12 data centers worldwide. More than a million nodes. A data center costs $600 million to build. The Oregon data center: 15,000 m², the power consumption of 30,000 homes.
  • 13. GOOGLE'S HARDWARE IN 2013 Cheap commodity hardware: each node has its own battery! Modular data centers: a standard container holds 1,160 servers. Efficiency: 11% overhead (power transformation, cooling).
  • 14. THE BIG DATA PARADIGM SHIFT
  • 15. TECHNOLOGIES Hadoop 2.0, Google BigQuery, Cloudera Impala, Apache Spark
  • 16. HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
  • 17. HADOOP MAPREDUCE
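
Since the MapReduce slide itself is a diagram, here is a toy illustration of the programming model it depicts: map emits key-value pairs, the framework shuffles them by key, and reduce folds each group. This is a local, pure-Python simulation of the model, not the Hadoop Java API, and the log-line format is made up for the example:

```python
from collections import defaultdict

log_lines = [
    "GET /a/b/c 200",
    "GET /a/b/c 200",
    "GET /a/b/c 404",
    "POST /a/b/c 200",
]

# Map: turn each input record into (key, value) pairs.
def map_fn(line):
    method, _, status = line.split()
    yield ((method, status), 1)

# Shuffle: the framework groups all emitted values by key.
groups = defaultdict(list)
for line in log_lines:
    for key, value in map_fn(line):
        groups[key].append(value)

# Reduce: fold each key's values into a final result.
def reduce_fn(key, values):
    return key, sum(values)

for key in sorted(groups):
    print(reduce_fn(key, groups[key]))
# (('GET', '200'), 2), (('GET', '404'), 1), (('POST', '200'), 1)
```
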
  • 18. HADOOP Who uses Hadoop? Facebook: 100 PB. Yahoo: 4,000 nodes. More than half of the Fortune 50 companies! History: a replica of Google's architecture (GFS, BigTable) in Java, under the Apache license. Hadoop 2.0: full high availability, advanced resource management (YARN).
  • 19. GOOGLE BIGQUERY SQL queries on terabytes of data in seconds. Data is distributed over thousands of nodes; each node processes one part of the dataset. Thousands of nodes work for us for a few milliseconds: select year, SUM(mother_age * record_weight) / SUM(record_weight) as age from publicdata:samples.natality where ever_born = 1 group by year order by year;
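
For reference, a sketch of issuing the same query programmatically with the google-cloud-bigquery Python client, which postdates this 2014 talk; the dataset is the same public natality sample under its standard-SQL name, and configured credentials plus a billing project are assumed:

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()  # assumes project and credentials are set up

sql = """
    SELECT year,
           SUM(mother_age * record_weight) / SUM(record_weight) AS age
    FROM `bigquery-public-data.samples.natality`
    WHERE ever_born = 1
    GROUP BY year
    ORDER BY year
"""

# The query fans out over many workers inside BigQuery;
# the client only streams back the small aggregated result.
for row in client.query(sql).result():
    print(row.year, row.age)
```
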
  • 21. CLOUDERA IMPALA Like BigQuery, but on top of Hadoop: standard SQL on Big Data. On a cluster costing about 10 million HUF, terabytes of data can be analyzed interactively. Scales to thousands of nodes. Under the hood: run-time code generation with LLVM and the column-oriented Parquet format.
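
Column orientation is what makes Parquet attractive for scans like these: each column is stored contiguously, so a query reads only the columns it touches. A minimal sketch using the pyarrow library (an assumed dependency; any Parquet writer would do):

```python
import pyarrow as pa
import pyarrow.parquet as pq  # pip install pyarrow

# A small log-like table; Parquet stores each column contiguously.
table = pa.table({
    "method": ["GET", "GET", "POST"],
    "status": [200, 404, 200],
    "bytes":  [22957, 234, 4353],
})
pq.write_table(table, "requests.parquet")

# Column pruning: only the named columns are read from disk,
# which is what makes analytical scans cheap.
subset = pq.read_table("requests.parquet", columns=["method", "status"])
print(subset.to_pydict())
```
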
  • 22. APACHE SPARK From UC Berkeley. Achieves a 100x speed-up compared to Hadoop on certain tasks. In-memory computation across the cluster.
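
The in-memory point is visible in Spark's API: intermediate datasets can be cached in cluster RAM instead of being rewritten to disk between stages, which is where the 100x on iterative workloads comes from. A minimal PySpark sketch, assuming a local Spark installation and a hypothetical logs.txt whose lines start with "METHOD STATUS" fields:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "log-count")

counts = (
    sc.textFile("logs.txt")                # hypothetical input file
      .map(lambda line: line.split())
      .map(lambda f: ((f[0], f[1]), 1))    # key: (method, status)
      .reduceByKey(lambda a, b: a + b)
)
counts.cache()   # keep the RDD in cluster memory for reuse across jobs
print(counts.collect())

sc.stop()
```
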
  • 23. INEFFICIENCY CAN WASTE A HUGE AMOUNT OF RESOURCES A 300-node Hadoop Hive cluster = one Vectorwise node. Vectorwise holds the world speed record for analytical database queries on a single node.
  • 24. CLEVER WAYS TO IMPROVE EFFICIENCY Lossless data compression (up to 50x!). Clever lossy compression of the data (e.g. OLAP cubes). Cache-aware implementations (asymmetric trends make memory access the bottleneck).
  • 25. LOSSLESS DATA COMPRESSION Compression can boost sequential data access by up to 50 times (100 MB/sec -> 5 GB/sec): less data means fewer I/O operations. A single CPU core can decompress data at up to 5 GB/sec. gzip decompression is very slow; snappy, LZO and LZ4 can reach 1 GB/sec decompression speed; the decompression used by column-oriented databases (PFOR) can reach 5 GB/sec, i.e. two billion integers per second, almost one integer per clock cycle!
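
These throughput claims are easy to probe on repetitive data such as log files. A minimal sketch with the lz4 Python bindings (an assumed third-party package; python-snappy behaves similarly); exact figures depend on the machine:

```python
import time
import lz4.frame  # pip install lz4

# Highly repetitive data, like the log lines on the next slide,
# is where compression ratios of tens of x are realistic.
line = b"2011-01-08 00:00:01 X1 Y1 1.2.3.4 GET /a/b/c HTTP/1.1 200 22957\n"
raw = line * 1_000_000

compressed = lz4.frame.compress(raw)
print(f"compression ratio: {len(raw) / len(compressed):.0f}x")

start = time.perf_counter()
lz4.frame.decompress(compressed)
elapsed = time.perf_counter() - start
print(f"decompression speed: {len(raw) / elapsed / 1e9:.1f} GB/s")
```
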
  • 26. EXAMPLE: LOGDRILL Raw log lines:
    2011-01-08 00:00:01 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 22957 562
    2011-01-08 00:00:09 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 2957 321
    2011-01-08 00:01:04 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 43422 522
    2011-01-08 00:01:08 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 234 425
    2011-01-08 00:02:23 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 404 0 0 234 432
    2011-01-08 00:02:45 X1 Y1 1.2.3.4 POST /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 4353 134
    Aggregated per (minute, method, status) -> count:
    2011-01-08 00:00 GET 200 2
    2011-01-08 00:01 GET 200 2
    2011-01-08 00:02 GET 404 1
    2011-01-08 00:02 POST 200 1
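
What the LogDrill example shows is the lossy roll-up into an OLAP-cube-style summary mentioned two slides earlier: many raw request lines collapse into one count per (minute, method, status). A minimal sketch of exactly that aggregation over the slide's rows:

```python
from collections import Counter

raw = [
    ("2011-01-08 00:00:01", "GET",  200),
    ("2011-01-08 00:00:09", "GET",  200),
    ("2011-01-08 00:01:04", "GET",  200),
    ("2011-01-08 00:01:08", "GET",  200),
    ("2011-01-08 00:02:23", "GET",  404),
    ("2011-01-08 00:02:45", "POST", 200),
]

# Truncate the timestamp to the minute, then count each
# (minute, method, status) combination.
cube = Counter((ts[:16], method, status) for ts, method, status in raw)

for (minute, method, status), n in sorted(cube.items()):
    print(minute, method, status, n)
# 2011-01-08 00:00 GET 200 2
# 2011-01-08 00:01 GET 200 2
# 2011-01-08 00:02 GET 404 1
# 2011-01-08 00:02 POST 200 1
```
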
  • 27. CACHE-AWARE PROGRAMMING CPU speed has been increasing by about 60% a year, memory speed by only 10% a year. The growing gap is bridged with multi-level cache memories. The cache is usually under-exploited: exploiting it can yield a 100x speed-up!
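
The cache gap can be made visible even from Python with NumPy (an assumption; the slide itself targets lower-level code). Summing every 16th float32 touches a fresh 64-byte cache line per element, so although it does 1/16 of the arithmetic it is nowhere near 16x faster:

```python
import time
import numpy as np

x = np.ones(64 * 1024 * 1024, dtype=np.float32)  # 256 MB, far beyond cache

def timed(arr):
    start = time.perf_counter()
    arr.sum()
    return time.perf_counter() - start

t_seq = timed(x)            # sequential: streams whole cache lines
t_strided = timed(x[::16])  # 1/16 of the work, one cache miss per element

print(f"sequential sum of all elements: {t_seq * 1e3:.0f} ms")
print(f"strided sum of 1/16 of them:    {t_strided * 1e3:.0f} ms")
# The strided pass reads 16x fewer elements yet takes a comparable time:
# memory access, not arithmetic, is the bottleneck.
```
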
  • 28. LESSONS LEARNED Big Data is not hype, at least from the technological viewpoint. Modern technologies (Impala, Spark) can reach the theoretical limits of the cluster hardware configuration. A deep understanding of both the problem and the technologies is required to create efficient Big Data solutions.
  • 29. THANK YOU! Q&A?