Data Storage Infrastructure at Facebook Spring 2018 Cleveland State University CIS 601 Presentation Yi Dong Instructor: Dr. Chung

Data Storage Infrastructure at Facebook

Spring 2018 Cleveland State University

CIS 601 Presentation

Yi Dong

Instructor: Dr. Chung

Outline

● Strategy of data storage, processing, and log collection

● Data flow from the source to the data warehouse

● Storage systems and optimization

● Data discovery and analysis

● Challenges in resource sharing

Facebook’s Architecture

● PHP, HipHop compiler

● Scribe, Thrift

● Hadoop, HBase, Haystack

● Hive, MySQL

● Memcached

Part 1: Strategy for Data Storage, Processing, and Log Collection

● Apache Hadoop

● Apache Hive

● Scribe

Why Hadoop?

● Scalability

– Able to process multi-petabyte datasets

● Fault Tolerance

– Node failure is expected every day

– Number of nodes is not constant

● High Availability

– Users can access data from the nearest node

● Cost Efficiency

– Open source

– Uses commodity hardware as nodes in Hadoop clusters

– Eliminates dependency on any particular technology

Hadoop Architecture

● HDFS (Hadoop Distributed File System)

● Map-Reduce Infrastructure

Hive

● SQL-like analysis tool (HiveQL) on top of Hadoop

● Dramatically improves productivity on Hadoop and broadens its usage

– With Hive, users without programming experience can use Hadoop for their work

– Without Hive, a basic data manipulation on Hadoop, such as GROUP BY, can take more than 100 lines of Java/Python code

– Even worse, a programmer without database knowledge will likely write code that uses a sub-optimal algorithm, often a badly sub-optimal one
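
To make the GROUP BY comparison concrete, here is a minimal sketch in plain Python (not Hadoop code) of the map, shuffle, and reduce phases a programmer has to spell out by hand to count rows per key. The function names and the sample records are hypothetical. In HiveQL the same result would be roughly one statement, along the lines of SELECT country, COUNT(*) FROM visits GROUP BY country.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical sample records: (user_id, country)
RECORDS: List[Tuple[int, str]] = [
    (1, "US"), (2, "CA"), (3, "US"), (4, "IN"), (5, "CA"), (6, "US"),
]


def map_phase(records: Iterable[Tuple[int, str]]) -> Iterator[Tuple[str, int]]:
    """Emit (group key, 1) for every input record."""
    for _user_id, country in records:
        yield country, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Collect all values that share a key, as the framework does between map and reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the values of each group to get the per-key count."""
    return {key: sum(values) for key, values in groups.items()}


def group_by_count(records: Iterable[Tuple[int, str]]) -> Dict[str, int]:
    """Run the three phases in order."""
    return reduce_phase(shuffle_phase(map_phase(records)))


if __name__ == "__main__":
    print(group_by_count(RECORDS))
```

A real Hadoop job adds input and output formats, serialization, and job configuration on top of this logic, which is where the line count grows.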

Hive Architecture

Scribe – Scalable Logging System

● Distributed and scalable logging system

● Combined with HDFS

● Aggregate logs from thousands of web servers

Part 2: Data Flow Architecture

● Two Sources of Data

– Web Server: log data, copied every 5-15 minutes

– Federated MySQL: information data, copied daily

● Two different clusters

– Production Hive-Hadoop cluster

– Ad-hoc Hive-Hadoop cluster

Dealing with Data Delivery Latency

● Even though log data is copied at 5-15 minute intervals, the loader only loads it into Hive native tables at the end of the day

● Solution at Facebook:

– Use Hive’s external table feature to create table metadata on the raw HDFS files

– After the data is loaded into the Hive native table at the end of the day, remove the raw HDFS files from the external table

– New solutions are needed to enable continuous loading of log data
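
The external-table workaround can be pictured with a small in-memory model. This is only a sketch under stated assumptions: the `Warehouse` class, its method names, and the day strings are hypothetical and stand in for Hive tables and HDFS files. It shows raw data becoming queryable as soon as it is copied, and then moving to the native table at the end of the day.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Warehouse:
    # day -> rows already loaded into the (hypothetical) native table
    native: Dict[str, List[str]] = field(default_factory=dict)
    # day -> rows in raw files that are exposed through the external table
    external: Dict[str, List[str]] = field(default_factory=dict)

    def copy_logs(self, day: str, rows: List[str]) -> None:
        """Periodic copy: raw files land and are visible through the external table."""
        self.external.setdefault(day, []).extend(rows)

    def end_of_day_load(self, day: str) -> None:
        """Load the day's raw data into the native table and drop it from the external table."""
        self.native[day] = self.external.pop(day, [])

    def query(self, day: str) -> List[str]:
        """Read a day's rows from whichever table currently holds them."""
        return self.native.get(day, []) + self.external.get(day, [])


if __name__ == "__main__":
    w = Warehouse()
    w.copy_logs("2018-03-01", ["click a", "click b"])
    print(w.query("2018-03-01"))  # data is readable before the daily load
    w.end_of_day_load("2018-03-01")
    print(w.query("2018-03-01"))  # same rows, now served from the native table
```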

Part 3: Storage Optimization

● All data needs to be compressed to save space

– Hadoop allows user-specified codecs; Facebook uses the gzip codec to get a compression factor of 6-7

● HDFS by default keeps 3 copies of data to prevent data loss

– Using erasure codes (2 copies of data and 2 copies of error correction code), this multiple can be brought down to 2.2

– Hadoop RAID is used on older data sets, while newer data sets are kept replicated 3 ways
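
The storage figures above can be checked with a short calculation. The sketch below is illustrative only: the function names are hypothetical, and the stripe length of 10 data blocks per parity block is an assumption that is not stated on the slide. It is chosen because it reproduces the 2.2 multiple (2 data copies plus 2 parity copies spread over 10 blocks).

```python
def raid_multiple(data_copies: int, parity_copies: int, stripe_length: int) -> float:
    """Storage multiple when each parity block covers `stripe_length` data blocks.

    stripe_length is an assumed parameter, not a value given in the slides.
    """
    return data_copies + parity_copies / stripe_length


def physical_size(logical_size: float, compression_factor: float, storage_multiple: float) -> float:
    """Physical space used after compression and replication (same unit as the input)."""
    return logical_size / compression_factor * storage_multiple


if __name__ == "__main__":
    replicated = 3.0                    # HDFS default: 3 full copies
    raided = raid_multiple(2, 2, 10)    # assumed stripe length of 10
    for label, multiple in (("3-way replication", replicated), ("Hadoop RAID", raided)):
        # 600 TB of uncompressed data with a gzip compression factor of 6
        print(label, physical_size(600.0, 6.0, multiple), "TB")
```

Under these assumptions, moving older data from 3-way replication to the erasure-coded layout saves a little over a quarter of its physical footprint.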

Part 3: Storage Optimization (continued)

● Reduce memory usage of the HDFS NameNode

– Trade off latency to reduce memory pressure

– Implement a file format that reduces the number of map tasks

● Data federation

– Distribute data based on time: queries on data that crosses a time boundary need more joins

– Distribute data based on application: some of the common data has to be replicated
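
As a rough illustration of the time-based option, the sketch below routes each day to one of two clusters and shows that a date range crossing the boundary touches both, which is where the extra cross-cluster work comes from. The boundary date and the cluster names are hypothetical.

```python
from datetime import date
from typing import List

# Hypothetical cut-over date between the two clusters
BOUNDARY = date(2017, 1, 1)


def cluster_for(day: date) -> str:
    """Return the (hypothetical) cluster that stores data for a given day."""
    return "archive" if day < BOUNDARY else "recent"


def clusters_for_range(start: date, end: date) -> List[str]:
    """Clusters a query over [start, end] must read from.

    With a single boundary, checking the two endpoints is sufficient.
    """
    if start > end:
        raise ValueError("start must not be after end")
    return sorted({cluster_for(start), cluster_for(end)})


if __name__ == "__main__":
    print(clusters_for_range(date(2017, 3, 1), date(2017, 4, 1)))
    print(clusters_for_range(date(2016, 12, 1), date(2017, 1, 15)))
```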

Part 4: Data Discovery and Analysis

● Hive

– Makes the scalability of Hadoop available to non-engineering users, such as business analysts and product managers

● Data discovery

– Internal tool that enables a wiki-style approach to metadata creation

– Tools to extract lineage information from query logs

● Periodic Batch Jobs

– For such jobs, inter-job dependencies and the ability to schedule the jobs are critical
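
Dependencies between periodic jobs amount to a directed acyclic graph, and a valid run order is a topological sort of it. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later). The job names are hypothetical and this is not Facebook's scheduler, only an illustration of the ordering problem.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical jobs: each job maps to the set of jobs that must finish first
JOBS: Dict[str, Set[str]] = {
    "load_logs": set(),
    "load_mysql_dump": set(),
    "daily_summary": {"load_logs", "load_mysql_dump"},
    "weekly_report": {"daily_summary"},
}


def schedule(jobs: Dict[str, Set[str]]) -> List[str]:
    """Return one run order in which every job comes after its dependencies.

    TopologicalSorter raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(jobs).static_order())


if __name__ == "__main__":
    print(schedule(JOBS))
```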

Part 5: Resource Sharing

● Support the co-existence of interactive jobs and batch jobs on the same Hadoop cluster

– Implement Hadoop Fair Share Scheduler

– Isolate ad-hoc queries and periodic batch queries

– Enhance the scheduler to make it more aware of system resource usage caused by poorly written ad-hoc queries
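
The idea behind fair sharing can be sketched as a max-min allocation of task slots across pools. This is a simplified illustration, not the actual Hadoop Fair Share Scheduler code. The pool names, slot counts, and the `fair_share` function are hypothetical.

```python
from typing import Dict


def fair_share(total_slots: int, demands: Dict[str, int]) -> Dict[str, int]:
    """Split slots evenly among pools, redistributing what small pools do not need."""
    allocation = {pool: 0 for pool in demands}
    active = {pool for pool, demand in demands.items() if demand > 0}
    remaining = total_slots
    while remaining > 0 and active:
        # Equal share of what is left, at least one slot per grant
        share = max(remaining // len(active), 1)
        for pool in sorted(active):
            if remaining == 0:
                break
            grant = min(share, demands[pool] - allocation[pool], remaining)
            allocation[pool] += grant
            remaining -= grant
        # Pools whose demand is met drop out; their unused share is redistributed
        active = {pool for pool in active if allocation[pool] < demands[pool]}
    return allocation


if __name__ == "__main__":
    # A heavy ad-hoc pool cannot starve the smaller batch pool
    print(fair_share(100, {"adhoc": 500, "batch": 30}))
```

In this example the batch pool receives all 30 slots it asks for and the ad-hoc pool gets the remaining 70, even though it requested far more.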

Take Home Message

● When designing a data warehouse, consider:

– What kinds of data sources and data flow architecture

– What kind of storage architecture

– What kinds of users and tasks

– How to make usage easier

– How to share resources between jobs

End

Thank you