New Protocols for Remote File Synchronization Based on Erasure Codes
Utku Irmak
Svilen Mihaylov
Torsten Suel
Polytechnic University
Outline Introduction and Common Applications Problem Formalization Contributions An Approach Based on Erasure Codes
A Simple Multi-Round Protocol An Efficient Single-Round Protocol A Practical Protocol Based on Erasure Codes
Implementation Overview Preliminary Results Conclusions
Introduction
Remote File Synchronization Problem: How to update an outdated version of a file over a network with a minimal amount of communication
When the versions are very similar, the total data transmitted should be significantly smaller than the file size
Machine A Machine B
Current Version Outdated Version
Common Applications Synchronization of User Files
Synchronization between different machines that may only be connected over a slow network (home and work machine)
Both rsync and unison are widely used tools Web and FTP Site Mirroring
Significant similarities between successive versions Including sites distributing new versions of software rsync is widely used
Common Applications Content Distribution Networks
File synchronization is a natural approach for updating content replicated at the network edge
Web Access over Slow Links A user revisiting a webpage may already have a previous
version in the browser cache It would be desirable to avoid retransmitting the entire page This idea is implemented in rproxy, which uses the rsync algorithm
Problem Formalization We have two files (strings) over some alphabet: fnew
(current file), fold (outdated file) We have two machines: C (the client), S (the server)
connected by a communication link C only has a copy of fold, and S only has a copy of fnew
Goal: Design a protocol between the parties that results in C holding a copy of fnew while minimizing the total communication cost
Problem Formalization The communication cost should depend on the
degree of similarity between the two files The Hamming distance The edit distance The edit distance with block moves
We focus mainly on the edit distance with block moves. We assume that each block move operation adds 3 to the distance, while other operations add 1
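As a toy illustration of this cost convention (the operation-list format below is hypothetical, used only for this sketch, and is not part of any protocol):

```python
# Sketch of the cost convention for "edit distance with block moves":
# insertions, deletions, and substitutions each add 1 to the distance,
# while a block move adds 3.

def edit_script_cost(ops):
    """Sum the cost of a hypothetical edit script under this convention."""
    cost = 0
    for op in ops:
        if op == "block_move":
            cost += 3          # a block move adds 3 to the distance
        else:
            cost += 1          # insert / delete / substitute each add 1
    return cost

# Example: two character edits plus one block move -> distance 5
print(edit_script_cost(["insert", "delete", "block_move"]))
```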
Problem Formalization We focus on single-round protocols between client
and server Single-round protocols can be more easily integrated into
existing tools currently relying on rsync Multiple rounds are undesirable in many scenarios
involving small files or large latencies Multi-round protocols can introduce other complications
due to state that may have to be kept at the server for best performance
Assumptions The collection consists of unstructured files We are not concerned with issues of
consistency between synchronization steps A simple two-party scenario where it is
known which files need to be updated and which is the current version
Contributions We describe a new approach to single-round file
synchronization based on erasure codes We derive a protocol that communicates at most
O(k lg(n) lg(n/k)) bits on files with edit distance with block moves of at most k
We derive another, more practical algorithm and an optimized implementation that achieve very promising improvements over rsync
A Simple Multi-Round Protocol Runs in a number of rounds In the first round, server partitions the file
into blocks of size bmax and sends a hash (MD5) for each block
Client attempts to match the received hashes to all possible alignments in the outdated file.
Client responds with a bit vector to notify the server which of the hashes are understood
Server repeats the process for the blocks whose hashes did not find a match
Once block size bmin is reached, the server sends all the unmatched blocks
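The rounds above can be sketched as a single-process simulation. This is a simplifying sketch: it uses full MD5 digests and a plain halving schedule, whereas the real protocol truncates hashes and runs client and server over a network:

```python
# Single-process sketch of the simple multi-round protocol: the "server"
# contributes only block hashes; the "client" matches them against every
# alignment of f_old; unmatched blocks are halved until b_min, then sent
# as literals.
import hashlib

def md5(block):
    return hashlib.md5(block).digest()

def sync_multi_round(f_old, f_new, b_max, b_min):
    """Reconstruct f_new at the client from hashes plus unmatched literals."""
    result = {}                                   # offset in f_new -> bytes
    pending = [(i, f_new[i:i + b_max])            # server-side blocks
               for i in range(0, len(f_new), b_max)]
    b = b_max
    while pending and b >= b_min:
        # Client side: hash every alignment of f_old at this block size.
        index = {md5(f_old[j:j + b]): f_old[j:j + b]
                 for j in range(max(len(f_old) - b + 1, 0))}
        unmatched = []
        for off, block in pending:
            h = md5(block)                        # server sends only the hash
            if h in index:
                result[off] = index[h]
            else:
                unmatched.append((off, block))
        if b == b_min:                            # smallest block size reached
            pending = unmatched
            break
        b //= 2                                   # server halves unmatched blocks
        pending = [(off + d, blk[d:d + b])
                   for off, blk in unmatched for d in (0, b) if blk[d:d + b]]
    for off, blk in pending:                      # final round: send literals
        result[off] = blk
    return b"".join(result[k] for k in sorted(result))

f_old = b"the quick brown fox jumps over the lazy dog"
f_new = b"the quick brown cat jumps over the lazy dog"
assert sync_multi_round(f_old, f_new, 16, 2) == f_new
```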
A Simple Multi-Round Protocol Given two files with edit distance with block moves of k, if
we choose bmax = the largest power of 2 not exceeding n/k bmin = lg(n) hash size = 4lg(n) bits
Lemma: If we partition fnew into some number of blocks, then at most k of these blocks do not occur in fold On each level, at most k hashes do not find a match
The algorithm transmits at most O(k lg(n) lg(n/k) ) bits and correctly updates the file with probability at least 1-1/n
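For concreteness, these parameter choices can be computed as follows (the values of n and k below are made up for illustration):

```python
# Parameter choices from the analysis, for an illustrative file size n
# and distance bound k.
import math

n, k = 1 << 20, 64                           # hypothetical example values
b_max = 1 << ((n // k).bit_length() - 1)     # largest power of 2 <= n/k
b_min = math.ceil(math.log2(n))              # lg(n)
hash_bits = 4 * math.ceil(math.log2(n))      # 4 lg(n) bits per hash
print(b_max, b_min, hash_bits)
```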
An Efficient Single-Round Protocol First, we define the complete multi-round algorithm:
It sends hashes for all blocks on every level, not only the unmatched ones
Second, we briefly describe systematic erasure codes
Erasure Code Erasure Code: Given k source
data items of size s, they are encoded into n > k encoded items of the same size s.
If up to n-k of the encoded items are lost, they can be recovered: any k encoded items suffice to reconstruct the source
A systematic erasure code is one where the encoded items consist of the k source items plus n-k additional items
Figure by Luigi Rizzo
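As a minimal sketch of the systematic property, here is a toy code with k source items plus a single XOR parity item (so n = k + 1), which tolerates one lost item; practical protocols use Reed-Solomon-style codes that tolerate many more losses:

```python
# Toy systematic erasure code: the encoded list is the k source items
# followed by one XOR parity item. Any single lost item can be recovered
# by XOR-ing the remaining n-1 items.
from functools import reduce

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode(source):
    """source: list of equal-length byte strings; returns source + [parity]."""
    return source + [reduce(xor_bytes, source)]   # systematic: source first

def recover(items, lost_index):
    """items: the encoded list with one entry lost (set to None)."""
    known = [it for i, it in enumerate(items) if i != lost_index]
    return reduce(xor_bytes, known)

src = [b"abcd", b"efgh", b"ijkl"]
enc = encode(src)
enc[1] = None                                     # lose one encoded item
assert recover(enc, 1) == b"efgh"
```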
Any hash value sent in the complete multi-round algorithm that would not be sent in the simple multi-round algorithm is not transmitted
Any hash value that would be sent by the simple multi-round algorithm is also not sent to the client, but considered lost
On each level there can be at most 2k lost hashes The client can recreate the entire level of hashes, using the 2k
erasure hashes to recover the lost ones
An Efficient Single-Round Protocol Theorem: Given a bound k on the edit distance between fold
and fnew, the erasure-based file synchronization algorithm correctly updates fold to fnew with probability at least 1-1/n, using a single message of O(k lg(n) lg(n/k)) bits
We note that there are highly efficient single-message protocols for estimating the file distance k
Another property of the protocol is that by broadcasting a single message, the current version can be communicated to several clients that have different outdated versions
A Practical Protocol Based on Erasure Codes Previous protocol has two main shortcomings:
The protocol requires us to estimate an upper bound on the file distance k; an underestimate would make recovery at the client impossible
More importantly, the algorithm does not support compression of unmatched literals
To address these problems we design another erasure-based algorithm that works better in practice
A Practical Protocol Based on Erasure Codes The hashes are now sent from the client to the server For level i, m_i erasure hashes are sent The server identifies the common blocks and then sends the
unmatched literals in compressed form
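A simplified sketch of the server side, assuming the client has sent MD5 hashes of the aligned blocks of fold. Unlike the real protocol, this toy version matches only aligned blocks, uses a single level, and omits the erasure hashes; zlib stands in for the optimized compressor:

```python
# Server side: match blocks of f_new against the client's block hashes,
# emit match tokens for common blocks and the unmatched literals in
# compressed form. The token format here is illustrative only.
import hashlib
import zlib

def server_delta(f_new, client_hashes, b):
    index = {h: i for i, h in enumerate(client_hashes)}
    tokens, literals = [], bytearray()
    for off in range(0, len(f_new), b):
        block = f_new[off:off + b]
        h = hashlib.md5(block).digest()
        if h in index:
            tokens.append(("match", index[h]))    # reference a client block
        else:
            tokens.append(("literal", len(block)))
            literals += block
    return tokens, zlib.compress(bytes(literals)) # literals sent compressed

def client_apply(f_old, tokens, compressed, b):
    lits, pos, out = zlib.decompress(compressed), 0, bytearray()
    for kind, v in tokens:
        if kind == "match":
            out += f_old[v * b:(v + 1) * b]       # copy matched block
        else:
            out += lits[pos:pos + v]              # splice in literal bytes
            pos += v
    return bytes(out)

f_old = b"aaaabbbbccccdddd"
f_new = b"aaaaXXXXccccdd"
hashes = [hashlib.md5(f_old[i:i + 4]).digest() for i in range(0, len(f_old), 4)]
tokens, comp = server_delta(f_new, hashes, 4)
assert client_apply(f_old, tokens, comp, 4) == f_new
```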
Implementation Overview We included three additional optimizations over rsync:
The server now transmits the resulting delta and a bit vector to allow the client to create the same reference file
1) We replace the gzip algorithm used for transmission of the unmatched literals and match tokens with an optimized delta compressor
2) We make a better choice of the number of bits per hash:
We assume some upper bound on the probability of a collision, say 1/2^d for some d; then we use lg(n)+lg(y)+d bits per hash
n is the file size
y is the total number of hashes sent from client to server
3) We integrate decomposable hashes:
This technique allows the hash of a child block to be computed from the hashes of its parent and sibling, halving the number of erasure hashes transmitted
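The decomposable-hash idea can be illustrated with a toy polynomial hash mod a prime, for which the right child's hash follows from the parent's hash and the left sibling's hash; this is only a sketch of the property, not the construction used in the implementation:

```python
# Toy decomposable hash: a polynomial hash mod prime P satisfies
#   h(parent) = h(left) + 256^len(left) * h(right)   (mod P)
# when parent is the concatenation left + right, so the right child's
# hash can be derived from the parent and the left sibling. This halves
# the number of child hashes that must be transmitted per pair.
P = (1 << 61) - 1                              # a Mersenne prime modulus

def h(block):
    """Position-weighted byte sum: sum of b_i * 256^i mod P."""
    return sum(b * pow(256, i, P) for i, b in enumerate(block)) % P

left, right = b"hello, ", b"world"
parent = left + right
# Derive h(right) from h(parent) and h(left) via a modular inverse
# (pow(x, -1, P) requires Python 3.8+).
inv = pow(pow(256, len(left), P), -1, P)
derived_right = (h(parent) - h(left)) * inv % P
assert derived_right == h(right)
```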
Preliminary Results For the experiments we used the gcc and emacs datasets,
consisting of versions 2.7.0 and 2.7.1 of gcc and versions 19.28 and 19.29 of emacs
Conclusions We have described a new approach to remote
file synchronization based on erasure codes Using this approach, we derived a single-round protocol that is feasible and communication-efficient w.r.t. a common file distance measure