A “Hitchhiker’s” Guide to Fast and Efficient Data Reconstruction in Erasure-coded Data Centers

K. V. Rashmi, Nihar Shah, D. Gu, H. Kuang, D. Borthakur, K. Ramchandran

UC Berkeley, Facebook

ACM SIGCOMM 2014

A Solution to the Network Challenges of Data Recovery in Erasure-coded Distributed Storage Systems: A Study on the Facebook Warehouse Cluster

K. V. Rashmi, Nihar Shah, D. Gu, H. Kuang, D. Borthakur, K. Ramchandran

UC Berkeley, Facebook

The 5th USENIX Workshop on Hot Topics in File and Storage Technologies, HotStorage 2013

http://www.camdemy.com/media/11869

Outline

• Introduction
• Hitchhiker’s erasure code
• Evaluation results
• Conclusion

Need for Redundant Storage in Data Centers

• Frequent unavailability events in data centers
– Unreliable components
– Software glitches, maintenance shutdowns, power failures, etc.
• Redundancy necessary for reliability and availability

Popular Approach for Redundant Storage: Replication

• Distributed file systems used in data centers store multiple copies of data on different machines

• Machines typically chosen on different racks
– to tolerate rack failures

• E.g., Hadoop Distributed File System (HDFS) stores 3 replicas by default


Massive Data Sizes: Need Alternative to Replication

• Small to moderately sized data: disk storage is inexpensive
– Replication viable
• No longer true for massive scales of operation
– e.g., Facebook data warehouse cluster stores multiple tens of Petabytes (PBs)

“Erasure codes” are an alternative

Erasure Codes in Data Centers

• Facebook data warehouse cluster
– Uses Reed-Solomon (RS) codes instead of 3-replication on a portion of the data
– Savings of multiple Petabytes of storage space

Erasure Codes: Replication vs. Reed-Solomon (RS) code

[Figure: Replication and an RS code side by side, each storing four blocks. Replication stores data blocks plus redundant copies; the RS code stores data blocks plus parity blocks.]

First-order comparison:
• Replication: tolerates any one failure
• RS code: tolerates any two failures

In general:
• Replication: lower MTTDL (Mean Time To Data Loss), high storage requirement
• Erasure codes: order-of-magnitude higher MTTDL with much less storage

Erasure Codes

• Using RS codes instead of 3-replication on less-frequently accessed data has led to savings of multiple Petabytes in the Facebook warehouse cluster

• Facebook warehouse cluster employs a (k=10, r=4) RS code, resulting in a 1.4x storage requirement
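The 1.4x figure follows directly from the code parameters: a (k, r) code stores k + r blocks for every k blocks of data. A minimal sketch of that arithmetic, with the helper name `storage_overhead` chosen here purely for illustration:

```python
from fractions import Fraction

def storage_overhead(k: int, r: int) -> Fraction:
    """Blocks stored per block of user data for a (k, r) code: (k + r) / k."""
    return Fraction(k + r, k)

rs_overhead = storage_overhead(10, 4)   # (k=10, r=4) RS code from the slide
replication_overhead = Fraction(3)      # 3-replication keeps three full copies

print(float(rs_overhead))               # 1.4
print(rs_overhead < replication_overhead)
```

Exact fractions are used so the comparison does not depend on floating-point rounding.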

Reed-Solomon (RS) Codes

• (#data, #parity) RS code:
– tolerates failure of any #parity blocks
– these (#data + #parity) blocks constitute a “stripe”
• Facebook warehouse cluster uses a (10, 4) RS code

Example: (2, 2) RS code
– #data = 2 (data blocks)
– #parity = 2 (parity blocks)
– 4 blocks in a stripe
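A minimal sketch of a (2, 2) code of this kind, to show what "tolerates failure of any #parity blocks" means in practice. Several things here are assumptions made for illustration and are not stated on the slides: the first parity is taken to be a1+a2 (only a1+2a2 appears in the later toy example), arithmetic is done modulo the prime 257 rather than in the GF(2^8) field that production RS implementations typically use, and the names `encode` and `decode` are hypothetical.

```python
from itertools import combinations

P = 257  # assumption: small prime field, for readability only

# Block i stores c1*a1 + c2*a2 for these coefficient pairs.
# Blocks 1 and 2 are the data bytes; block 3 (a1+a2) is assumed; block 4 is a1+2a2.
COEFFS = [(1, 0), (0, 1), (1, 1), (1, 2)]

def encode(a1, a2):
    """Return the four bytes of one stripe."""
    return [(c1 * a1 + c2 * a2) % P for c1, c2 in COEFFS]

def decode(survivors):
    """Recover (a1, a2) from any two surviving blocks given as {index: value}."""
    (i, x), (j, y) = sorted(survivors.items())
    (c11, c12), (c21, c22) = COEFFS[i], COEFFS[j]
    det = (c11 * c22 - c12 * c21) % P
    inv = pow(det, -1, P)  # modular inverse (Python 3.8+); det is nonzero for every pair
    a1 = ((x * c22 - y * c12) * inv) % P
    a2 = ((c11 * y - c21 * x) * inv) % P
    return a1, a2

stripe = encode(42, 7)
for lost in combinations(range(4), 2):  # every way of losing two of the four blocks
    survivors = {i: v for i, v in enumerate(stripe) if i not in lost}
    assert decode(survivors) == (42, 7)
```

The loop checks all six two-block failure patterns, which is the property the slide states for #parity = 2.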

Existing Systems

• Need additional storage
– Huang et al. (Windows Azure) 2012, Sathiamoorthy et al. (Xorbas) 2013, Esmaili et al. (CORE) 2013
• Add additional parities to reduce download
– Hu et al. (NCFS) 2011
• Highly restricted parameters
– Khan et al. (Rotated-RS) 2012: #parity ≤ 3
– Xiang et al., Wang et al. 2010, Hu et al. (NCCloud) 2012: #parity ≤ 2
– Hitchhiker performs as well or better for these restricted settings too

Erasure codes in Data Centers: HDFS-RAID

Borthakur, “HDFS and Erasure Codes (HDFS-RAID)”
Fan, Tantisiriroj, Xiao and Gibson, “DiskReduce: RAID for Data-Intensive Scalable Computing”, PDSW 09

(10, 4) Reed-Solomon code
• Any 10 blocks sufficient
• Can tolerate any 4 failures

Impact on Data Center Network

RS codes significantly increase network usage during reconstruction

Burdens the already oversubscribed top-of-rack (TOR) switches and higher-level routers

Machine Unavailability Events

• From HDFS Name Node logs
• Logged when no heart-beat for > 15 min
– machines unavailable for more than 15 minutes in a day
– 15 minutes is the default wait-time of the cluster to flag a machine as unavailable
• Blocks marked unavailable, periodic recovery process
• The period 22nd Jan. to 24th Feb. 2013

http://hadoop.apache.org/

Rashmi et al., “A Solution to the Network Challenges of Data Recovery in Erasure-coded Storage: A Study on the Facebook Warehouse Cluster”, USENIX HotStorage Workshop 2013. http://www.camdemy.com/media/11869

Machine Unavailability Events

Median of ≈50 machine-unavailability events logged per day
http://www.camdemy.com/media/11869

Missing blocks per stripe

Dominant scenario: Single block recovery
http://www.camdemy.com/media/11869

Facebook Data Warehouse Cluster

• Median of 180 TB transferred across racks per day for recovery operations
• Reduction of more than 50 TB of cross-rack traffic per day
• Around 5 times that under 3-replication

RS codes: The Good and The Bad

• Maximum possible fault tolerance for given storage overhead
– Storage capacity optimal
– (“maximum-distance-separable” in coding theory parlance)
• Flexibility in choice of parameters
– Supports any number of data and parity blocks
• Not designed to handle reconstruction operations efficiently
– Negative impact on the network

• To build a system with:

Hitchhiker

• Is a system with:

At an Abstract Level

HITCHHIKER

Hitchhiker’s Erasure Code: Toy Example

Take a (2, 2) Reed-Solomon code

[Figure: Blocks 1 to 4 (two data blocks, two parity blocks) holding two groups of bytes, a and b; a parity block holds the 1-byte symbols a1+2a2 and b1+2b2.]

(In (2,2) RS code: recovery download & IO = 4 Bytes)

Add information from first group on to parities of the second group

[Figure: the same parity block now holds a1+2a2 and b1+2b2+a1.]

No additional storage!

Fault-Tolerance (Toy Example)

Same fault tolerance as RS code: can tolerate failure of any 2 nodes
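The claim can be checked by brute force on the toy code. The sketch below assumes the block layout Block 1 = (a1, b1), Block 2 = (a2, b2), Block 3 = (a1+a2, b1+b2), Block 4 = (a1+2a2, b1+2b2+a1); only Block 4's contents appear on the slides, so the rest is an assumption, as are the prime modulus 257 (standing in for a finite field) and the function names.

```python
from itertools import combinations

P = 257  # assumption: prime field for illustration
COEFFS = [(1, 0), (0, 1), (1, 1), (1, 2)]  # RS coefficients per block (block 3 assumed)

def rs_encode(x1, x2):
    return [(c1 * x1 + c2 * x2) % P for c1, c2 in COEFFS]

def rs_decode(i, x, j, y):
    """Solve for the two data bytes from RS symbols x (block i) and y (block j)."""
    (c11, c12), (c21, c22) = COEFFS[i], COEFFS[j]
    inv = pow((c11 * c22 - c12 * c21) % P, -1, P)  # Python 3.8+
    return ((x * c22 - y * c12) * inv) % P, ((c11 * y - c21 * x) * inv) % P

def hh_encode(a1, a2, b1, b2):
    """One (a-byte, b-byte) pair per block; block 4's b-byte carries the piggyback a1."""
    a, b = rs_encode(a1, a2), rs_encode(b1, b2)
    b[3] = (b[3] + a1) % P
    return list(zip(a, b))

def hh_decode(survivors):
    """Recover (a1, a2, b1, b2) from any two surviving blocks {index: (a_byte, b_byte)}."""
    (i, (ai, bi)), (j, (aj, bj)) = sorted(survivors.items())
    a1, a2 = rs_decode(i, ai, j, aj)   # the a group is an unmodified RS code
    if j == 3:                         # i < j, so only j can be the piggybacked block
        bj = (bj - a1) % P             # subtract the piggyback to get a plain RS symbol
    b1, b2 = rs_decode(i, bi, j, bj)
    return a1, a2, b1, b2

stripe = hh_encode(10, 20, 30, 40)
for lost in combinations(range(4), 2):
    survivors = {i: v for i, v in enumerate(stripe) if i not in lost}
    assert hh_decode(survivors) == (10, 20, 30, 40)
```

Decoding the a group first is what makes the subtraction possible: once a1 is known, the piggybacked symbol reduces to an ordinary RS parity.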

[Figure: animation of recovery in the toy example (Blocks 1 to 4, with parity symbols a1+2a2 and b1+2b2+a1): subtract a1 from the piggybacked parity, then recover a1, a2, b1, b2.]

Efficient Reconstruction

Data transferred (Download & IO) only 3 Bytes (instead of 4 Bytes as in RS)
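A minimal sketch of where the 3 bytes come from, assuming the failed block is Block 1 = (a1, b1) and the same assumed layout as elsewhere in the toy example (Block 2 = (a2, b2), Block 3 = (a1+a2, b1+b2), Block 4 = (a1+2a2, b1+2b2+a1)). Arithmetic modulo 257 and the function names are illustrative assumptions, not the authors' implementation.

```python
P = 257  # assumption: prime field for illustration

def hh_encode(a1, a2, b1, b2):
    """Toy Hitchhiker stripe: one (a-byte, b-byte) pair per block."""
    return [
        (a1, b1),
        (a2, b2),
        ((a1 + a2) % P, (b1 + b2) % P),
        ((a1 + 2 * a2) % P, (b1 + 2 * b2 + a1) % P),  # piggyback a1 on the b parity
    ]

def reconstruct_block1(b2, b_parity, b_parity_piggybacked):
    """Rebuild (a1, b1) from three downloaded bytes, all taken from the b group."""
    b1 = (b_parity - b2) % P                          # (b1 + b2) - b2
    a1 = (b_parity_piggybacked - b1 - 2 * b2) % P     # (b1 + 2*b2 + a1) - b1 - 2*b2
    return a1, b1

blocks = hh_encode(10, 20, 30, 40)
downloaded = [blocks[1][1], blocks[2][1], blocks[3][1]]  # one byte each from blocks 2, 3, 4
assert reconstruct_block1(*downloaded) == (10, 30)
print(len(downloaded))  # 3
```

Under plain RS the same repair would read both bytes from two surviving blocks, which is the 4 bytes quoted on the slide.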

[Figure: animation of reconstruction in the toy example (Blocks 1 to 4, with parity symbols a1+2a2 and b1+2b2+a1), using two subtract steps.]

Hitchhiker’s Erasure Code

• Builds on top of RS codes
• Uses our theoretical framework of “Piggybacking”*
• Three versions
– XOR
– XOR+
– Non-XOR

* K. V. Rashmi, Nihar Shah, K. Ramchandran, “A Piggybacking Design Framework for Read- and Download-efficient Distributed Storage Codes”, in IEEE International Symposium on Information Theory, 2013.

Hop and couple (disk layout)

• Way of choosing which bytes to mix
– couples bytes farther apart in block
– to minimize the degree of discontinuity in disk reads during data reconstruction
• Translate savings in network-transfer to savings in disk-IO as well
– By making reads contiguous

RS vs Hitchhiker from the Network’s Perspective…

Data Transfer during Reconstruction in RS-based System

Transfer: 10 full blocks
Connect to 10 machines

Data Transfer during Reconstruction in Hitchhiker

Reconstruction of data blocks 1-9:

Transfer: 2 full blocks + 9 half blocks (= 6.5 blocks total)
Connect to 11 machines

Data Transfer during Reconstruction in Hitchhiker

Reconstruction of data block 10:

Transfer: 13 half blocks (= 6.5 blocks total)
Connect to 13 machines
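The transfer counts above can be turned into a saving relative to RS with a few lines of arithmetic. This is only a restatement of the slide's numbers; the variable names are illustrative.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

rs_transfer = Fraction(10)            # RS: 10 full blocks
hh_blocks_1_to_9 = 2 + 9 * HALF       # Hitchhiker, data blocks 1-9: 2 full + 9 half blocks
hh_block_10 = 13 * HALF               # Hitchhiker, data block 10: 13 half blocks

for transfer in (hh_blocks_1_to_9, hh_block_10):
    saving = 1 - transfer / rs_transfer
    print(float(transfer), f"{float(saving):.0%}")   # 6.5 35%
```

Both cases move 6.5 blocks instead of 10, a 35% reduction in data transferred, at the cost of contacting more machines (11 or 13 instead of 10).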

Hop-and-Couple

• Technique to pair bytes under Hitchhiker’s erasure code

• Makes disk reads during reconstruction contiguous

Hop-and-Couple

(a) hop length = 1 byte
(b) hop length = half the size of a unit

Figure 7: Two ways of coupling bytes to form stripes for Hitchhiker's erasure code. The shaded bytes are read and downloaded for the reconstruction of the first unit. While both methods require the same amount of data to be read, the reading is discontiguous in (a), while (b) ensures that the data to be read is contiguous.
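The effect of the hop length on read contiguity can be illustrated with a small sketch. It models a unit as `unit_size` byte offsets, couples offset i with offset i + hop, and counts how many contiguous runs are needed to read the second byte of every couple, which stands in here for the half-unit reads during reconstruction. The functions `couple` and `read_runs` and this simplified read model are assumptions for illustration, not the HDFS implementation.

```python
def couple(unit_size, hop):
    """Pair byte offsets (i, i + hop) so that every offset in the unit is used exactly once.

    Assumes unit_size is a multiple of 2 * hop.
    """
    pairs = []
    for start in range(0, unit_size, 2 * hop):
        for i in range(start, start + hop):
            pairs.append((i, i + hop))
    return pairs

def read_runs(offsets):
    """Number of contiguous runs needed to read the given byte offsets."""
    offsets = sorted(offsets)
    return 1 + sum(1 for prev, cur in zip(offsets, offsets[1:]) if cur != prev + 1)

unit_size = 8
for hop in (1, unit_size // 2):
    second_bytes = [j for _, j in couple(unit_size, hop)]
    print(hop, second_bytes, read_runs(second_bytes))
# hop 1 -> offsets 1, 3, 5, 7: four separate reads
# hop 4 -> offsets 4, 5, 6, 7: one contiguous read
```

Both hop lengths read the same number of bytes; only the larger hop lets them be fetched in a single sequential disk read.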

Figure 2: Two stripes of a (k=10, r=4) Reed-Solomon (RS) code. Ten units of data (first ten rows) are encoded using the RS code to generate four parity units (last four rows).

Figure 3: The theoretical framework of Piggybacking [22] for parameters (k=10, r=4). Each row represents one unit of data.


Hitchhiker-XOR

• The XOR-only feature of these erasure codes significantly reduces the computational complexity of decoding, making degraded reads and failure recovery faster.

• Hitchhiker's erasure code optimizes only the reconstruction of data units; reconstruction of parity units is performed as in RS codes.

Hitchhiker-XOR

Figure 4: Hitchhiker-XOR code for (k=10, r=4). Each row represents one unit of data.

Hitchhiker-XOR+

• Hitchhiker-XOR+ further reduces the amount of data required for reconstruction compared to Hitchhiker-XOR, and employs only additional XOR operations.

• It requires the underlying RS code to have what we term the all-XOR-parity property: at least one parity function of the RS code must be an XOR of all the data units.

Hitchhiker-XOR+

Figure 5: Hitchhiker-XOR+ for (k=10, r=4). Parity 2 of the underlying RS code is all-XOR.

Hitchhiker-nonXOR

• Hitchhiker-nonXOR guarantees the same savings as Hitchhiker-XOR+ even when the underlying RS code does not possess the all-XOR-parity property, but at the cost of additional finite-field arithmetic.

• Hitchhiker-nonXOR can be built on top of any RS code.

Hitchhiker-nonXOR

Figure 6: Hitchhiker-nonXOR code for (k=10, r=4). This can be built on any RS code. Each row is one unit of data.

Implementation & Evaluation Setup (1)

• Implemented on top of HDFS-RAID
– Erasure coding module in HDFS based on RS
– Used in the Facebook data warehouse cluster
• Deployed and tested on a 60-machine test cluster at Facebook
– Verified 35% reduction in the network transfers during reconstruction

Implementation & Evaluation Setup (2)

• Evaluation of timing metrics on the Facebook data warehouse cluster in production
– under real-time production traffic and workloads
– using Map-Reduce to run encoding and reconstruction jobs, just as HDFS-RAID does

Decoding Time

• RS decoding on only half of each block
• Faster computation for degraded reads and recovery
• XOR versions: 25% less than non-XOR

[Figure annotation: 36% reduction]

Read & Transfer Time

• Read & transfer time 30% lower in Hitchhiker (HH)
• Similar reduction for other block sizes (4, 64 MB) as well

[Figure: median and 95th percentile of the read time; annotation: 35% less]

Encoding Time

Benefits outweigh higher encoding cost in many systems (e.g., HDFS):
• encoding of raw data into erasure-coded data is a one-time operation
• often run as a background job

[Figure annotation: 72% higher (Hitchhiker)]

Conclusion

• We present Hitchhiker, a new erasure-coded storage system.

• Hitchhiker reduces both network and disk traffic during reconstruction by 25% to 45% compared to RS-based systems.

References

• [6] A. G. Dimakis, P. B. Godfrey, Y. Wu, M. Wainwright, and K. Ramchandran. Network coding for distributed storage systems. IEEE Trans. Inf. Theory, Sept. 2010.

• [17] S. Mahesh, M. Asteris, D. Papailiopoulos, A. G. Dimakis, R. Vadali, S. Chen, and D. Borthakur. Xoring elephants: Novel erasure codes for big data. In VLDB, 2013.

• [19] D. Papailiopoulos, A. Dimakis, and V. Cadambe. Repair optimal erasure codes through Hadamard designs. IEEE Trans. Inf. Theory, May 2013.

• [20] K. V. Rashmi, N. B. Shah, D. Gu, H. Kuang, D. Borthakur, and K. Ramchandran. A solution to the network challenges of data recovery in erasure-coded distributed storage systems: A study on the Facebook warehouse cluster. In Proc. USENIX HotStorage, June 2013.

• [21] K. V. Rashmi, N. B. Shah, and P. V. Kumar. Optimal exact-regenerating codes for the MSR and MBR points via a product-matrix construction. IEEE Trans. Inf. Theory, 2011.

• [22] K. V. Rashmi, N. B. Shah, and K. Ramchandran. A piggybacking design framework for read- and download-efficient distributed storage codes. In IEEE International Symposium on Information Theory, 2013.

• [26] N. Shah, K. Rashmi, P. Kumar, and K. Ramchandran. Distributed storage codes with repair-by-transfer and non-achievability of interior points on the storage-bandwidth tradeoff. IEEE Trans. Inf. Theory, 2012.

Figure 9: Data read patterns during reconstruction of blocks 1, 4 and 10 in Hitchhiker-XOR+: the shaded bytes are read and downloaded.
