BIG DATA MODERN TECHNOLOGIES György Balogh LogDrill Ltd. SECWorld – 7 May 2014


DESCRIPTION

György Balogh gave a presentation at the SECWorld 2014 conference about cutting-edge yet affordable Big Data technologies.

Text of the presentation

  • 1. BIG DATA MODERN TECHNOLOGIES György Balogh LogDrill Ltd. SECWorld – 7 May 2014
  • 2. AGENDA What is Big Data? Why do we have to talk about it? Paradigm shift in information management. Technology and efficiency.
  • 3. WHAT IS BIG DATA? Data volumes that cannot be handled by traditional solutions (e.g. relational databases). More than 100 million data rows, typically multiple billions.
  • 4. GLOBAL RATE OF DATA PRODUCTION (PER SECOND) 30 TB/sec (22,000 films). Digital media: 2 hours of YouTube video. Communication: 3,000 business emails, 300,000 SMS. Web: half a million page views. Logs: billions of entries.
  • 5. BIG DATA MARKET
  • 6. HYPE OR REALITY?
  • 7. WHY NOW? Long-term trends: the volume of stored data has doubled every 40 months since the 1980s. Moore's law: the number of transistors on integrated circuits doubles every 18 months.
  • 8. DIFFERENT EXPONENTIAL TRENDS
  • 9. HARD DRIVES IN 1991 AND 2012 1991: 40 MB, 3500 RPM, 0.7 MB/sec, full scan ~1 minute. 2012: 4 TB (x 100,000), 7200 RPM, 120 MB/sec (x 170), full scan ~8 hours (x 480).
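
The asymmetry above is easy to verify with the slide's own numbers: capacity grew about 100,000x while sequential bandwidth grew only about 170x, so a full scan of the drive takes several hundred times longer. A quick back-of-the-envelope check in Python:

```python
# Full-scan time = capacity / sequential transfer rate,
# using the 1991 and 2012 drive specs quoted above.
MB = 1.0
TB = 1_000_000 * MB  # decimal terabyte, as drive vendors count

t_1991 = 40 * MB / 0.7    # 40 MB at 0.7 MB/s -> ~57 s
t_2012 = 4 * TB / 120     # 4 TB at 120 MB/s  -> ~33,000 s

print(f"1991 full scan: {t_1991:.0f} s (~1 minute)")
print(f"2012 full scan: {t_2012 / 3600:.1f} h (the slide rounds to 8 hours)")
print(f"scan time grew ~{t_2012 / t_1991:.0f}x, "
      f"capacity ~{4 * TB / (40 * MB):,.0f}x")
```
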
  • 10. DATA ACCESS BECOMES THE SCARCE RESOURCE!
  • 11. GOOGLE'S HARDWARE IN 1998
  • 12. GOOGLE'S HARDWARE IN 2013 12 data centers worldwide. More than a million nodes. A data center costs $600 million to build. The Oregon data center: 15,000 m², the power consumption of 30,000 homes.
  • 13. GOOGLE'S HARDWARE IN 2013 Cheap commodity hardware: each node has its own battery! Modular data centers: a standard container holds 1,160 servers. Efficiency: 11% overhead (power transformation, cooling).
  • 14. THE BIG DATA PARADIGM SHIFT
  • 15. TECHNOLOGIES Hadoop 2.0, Google BigQuery, Cloudera Impala, Apache Spark
  • 16. HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
  • 17. HADOOP MAPREDUCE
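
Since the MapReduce slide itself is a diagram, here is a toy illustration of the programming model it depicts: map emits key-value pairs, the framework shuffles them by key, and reduce folds each group. This is a local, pure-Python simulation of the model, not the Hadoop Java API, and the log-line format is made up for the example:

```python
from collections import defaultdict

log_lines = [
    "GET /a/b/c 200",
    "GET /a/b/c 200",
    "GET /a/b/c 404",
    "POST /a/b/c 200",
]

# Map: turn each input record into (key, value) pairs.
def map_fn(line):
    method, _, status = line.split()
    yield ((method, status), 1)

# Shuffle: the framework groups all emitted values by key.
groups = defaultdict(list)
for line in log_lines:
    for key, value in map_fn(line):
        groups[key].append(value)

# Reduce: fold each key's values into a final result.
def reduce_fn(key, values):
    return key, sum(values)

for key in sorted(groups):
    print(reduce_fn(key, groups[key]))
# (('GET', '200'), 2), (('GET', '404'), 1), (('POST', '200'), 1)
```
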
  • 18. HADOOP Who uses Hadoop? Facebook: 100 PB. Yahoo: 4,000 nodes. More than half of the Fortune 50 companies! History: a replica of Google's architecture (GFS, BigTable) in Java, under the Apache license. Hadoop 2.0: full high availability, advanced resource management (YARN).
  • 19. GOOGLE BIGQUERY SQL queries on terabytes of data in seconds. Data is distributed over thousands of nodes; each node processes one part of the dataset. Thousands of nodes work for us for a few milliseconds: select year, SUM(mother_age * record_weight) / SUM(record_weight) as age from publicdata:samples.natality where ever_born = 1 group by year order by year;
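
For reference, a sketch of issuing the same query programmatically with the google-cloud-bigquery Python client, which postdates this 2014 talk; the dataset is the same public natality sample under its standard-SQL name, and configured credentials plus a billing project are assumed:

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()  # assumes project and credentials are set up

sql = """
    SELECT year,
           SUM(mother_age * record_weight) / SUM(record_weight) AS age
    FROM `bigquery-public-data.samples.natality`
    WHERE ever_born = 1
    GROUP BY year
    ORDER BY year
"""

# The query fans out over many workers inside BigQuery;
# the client only streams back the small aggregated result.
for row in client.query(sql).result():
    print(row.year, row.age)
```
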
  • 21. CLOUDERA IMPALA Like BigQuery, but on top of Hadoop: standard SQL on Big Data. On a cluster costing about 10 million HUF, terabytes of data can be analyzed interactively. Scales to thousands of nodes. Under the hood: run-time code generation with LLVM and the column-oriented Parquet format.
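
Column orientation is what makes Parquet attractive for scans like these: each column is stored contiguously, so a query reads only the columns it touches. A minimal sketch using the pyarrow library (an assumed dependency; any Parquet writer would do):

```python
import pyarrow as pa
import pyarrow.parquet as pq  # pip install pyarrow

# A small log-like table; Parquet stores each column contiguously.
table = pa.table({
    "method": ["GET", "GET", "POST"],
    "status": [200, 404, 200],
    "bytes":  [22957, 234, 4353],
})
pq.write_table(table, "requests.parquet")

# Column pruning: only the named columns are read from disk,
# which is what makes analytical scans cheap.
subset = pq.read_table("requests.parquet", columns=["method", "status"])
print(subset.to_pydict())
```
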
  • 22. APACHE SPARK From UC Berkeley. Achieves a 100x speed-up compared to Hadoop on certain tasks. In-memory computation across the cluster.
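
The in-memory point is visible in Spark's API: intermediate datasets can be cached in cluster RAM instead of being rewritten to disk between stages, which is where the 100x on iterative workloads comes from. A minimal PySpark sketch, assuming a local Spark installation and a hypothetical logs.txt whose lines start with "METHOD STATUS" fields:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "log-count")

counts = (
    sc.textFile("logs.txt")                # hypothetical input file
      .map(lambda line: line.split())
      .map(lambda f: ((f[0], f[1]), 1))    # key: (method, status)
      .reduceByKey(lambda a, b: a + b)
)
counts.cache()   # keep the RDD in cluster memory for reuse across jobs
print(counts.collect())

sc.stop()
```
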
  • 23. INEFFICIENCY CAN WASTE A HUGE AMOUNT OF RESOURCES A 300-node Hadoop Hive cluster = one Vectorwise node. Vectorwise holds the world speed record for analytical database queries on a single node.
  • 24. CLEVER WAYS TO IMPROVE EFFICIENCY Lossless data compression (up to 50x!). Clever lossy compression of the data (e.g. OLAP cubes). Cache-aware implementations (asymmetric trends make memory access the bottleneck).
  • 25. LOSSLESS DATA COMPRESSION Compression can boost sequential data access by up to 50 times (100 MB/sec -> 5 GB/sec): less data means fewer I/O operations. A single CPU core can decompress data at up to 5 GB/sec. gzip decompression is very slow; snappy, LZO and LZ4 can reach 1 GB/sec decompression speed; the decompression used by column-oriented databases (PFOR) can reach 5 GB/sec, i.e. two billion integers per second, almost one integer per clock cycle!
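
These throughput claims are easy to probe on repetitive data such as log files. A minimal sketch with the lz4 Python bindings (an assumed third-party package; python-snappy behaves similarly); exact figures depend on the machine:

```python
import time
import lz4.frame  # pip install lz4

# Highly repetitive data, like the log lines on the next slide,
# is where compression ratios of tens of x are realistic.
line = b"2011-01-08 00:00:01 X1 Y1 1.2.3.4 GET /a/b/c HTTP/1.1 200 22957\n"
raw = line * 1_000_000

compressed = lz4.frame.compress(raw)
print(f"compression ratio: {len(raw) / len(compressed):.0f}x")

start = time.perf_counter()
lz4.frame.decompress(compressed)
elapsed = time.perf_counter() - start
print(f"decompression speed: {len(raw) / elapsed / 1e9:.1f} GB/s")
```
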
  • 26. EXAMPLE: LOGDRILL Raw log lines:
    2011-01-08 00:00:01 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 22957 562
    2011-01-08 00:00:09 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 2957 321
    2011-01-08 00:01:04 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 43422 522
    2011-01-08 00:01:08 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 234 425
    2011-01-08 00:02:23 X1 Y1 1.2.3.4 GET /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 404 0 0 234 432
    2011-01-08 00:02:45 X1 Y1 1.2.3.4 POST /a/b/c - 1.2.3.4 HTTP/1.1 Mozilla 200 0 0 4353 134
    Aggregated per (minute, method, status) -> count:
    2011-01-08 00:00 GET 200 2
    2011-01-08 00:01 GET 200 2
    2011-01-08 00:02 GET 404 1
    2011-01-08 00:02 POST 200 1
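
What the LogDrill example shows is the lossy roll-up into an OLAP-cube-style summary mentioned two slides earlier: many raw request lines collapse into one count per (minute, method, status). A minimal sketch of exactly that aggregation over the slide's rows:

```python
from collections import Counter

raw = [
    ("2011-01-08 00:00:01", "GET",  200),
    ("2011-01-08 00:00:09", "GET",  200),
    ("2011-01-08 00:01:04", "GET",  200),
    ("2011-01-08 00:01:08", "GET",  200),
    ("2011-01-08 00:02:23", "GET",  404),
    ("2011-01-08 00:02:45", "POST", 200),
]

# Truncate the timestamp to the minute, then count each
# (minute, method, status) combination.
cube = Counter((ts[:16], method, status) for ts, method, status in raw)

for (minute, method, status), n in sorted(cube.items()):
    print(minute, method, status, n)
# 2011-01-08 00:00 GET 200 2
# 2011-01-08 00:01 GET 200 2
# 2011-01-08 00:02 GET 404 1
# 2011-01-08 00:02 POST 200 1
```
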
  • 27. CACHE-AWARE PROGRAMMING CPU speed has been increasing by about 60% a year, memory speed by only 10% a year. The growing gap is bridged with multi-level cache memories. The cache is usually under-exploited: exploiting it can yield a 100x speed-up!
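
The cache gap can be made visible even from Python with NumPy (an assumption; the slide itself targets lower-level code). Summing every 16th float32 touches a fresh 64-byte cache line per element, so although it does 1/16 of the arithmetic it is nowhere near 16x faster:

```python
import time
import numpy as np

x = np.ones(64 * 1024 * 1024, dtype=np.float32)  # 256 MB, far beyond cache

def timed(arr):
    start = time.perf_counter()
    arr.sum()
    return time.perf_counter() - start

t_seq = timed(x)            # sequential: streams whole cache lines
t_strided = timed(x[::16])  # 1/16 of the work, one cache miss per element

print(f"sequential sum of all elements: {t_seq * 1e3:.0f} ms")
print(f"strided sum of 1/16 of them:    {t_strided * 1e3:.0f} ms")
# The strided pass reads 16x fewer elements yet takes a comparable time:
# memory access, not arithmetic, is the bottleneck.
```
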
  • 28. LESSONS LEARNED Big Data is not hype, at least from the technological viewpoint. Modern technologies (Impala, Spark) can reach the theoretical limits of the cluster hardware configuration. A deep understanding of both the problem and the technologies is required to create efficient Big Data solutions.
  • 29. THANK YOU! Q&A?