Blob sync. Optimized updating of blobs on Azure

DESCRIPTION

Optimize the way you update blobs on Azure Blob Storage. Only upload/download the deltas instead of wasting your bandwidth.

SAVE BANDWIDTH (AND LEARN TO LOVE BLOBS)

CLOUD STORAGE 101

• DURABLE

• HIGHLY AVAILABLE

• ACCESS ANYWHERE (WITH CREDENTIALS)

• SCALABLE

• CHEAP!!

CLOUD STORAGE 101: BLOB BASICS

• USE BLOCKS OF DATA TO CONSTRUCT BLOB

• REPLACE BLOCKS IN EXISTING BLOBS
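
A minimal sketch of the block idea, using an in-memory toy rather than the Azure SDK. The class `ToyBlockBlob` and its methods are hypothetical names made up for illustration; they loosely mirror the "put block, then commit a block list" model that block blobs use.

```python
class ToyBlockBlob:
    """In-memory stand-in for a block blob (hypothetical, not an Azure API)."""

    def __init__(self):
        self.blocks = {}         # block_id -> bytes
        self.block_list = []     # ordered block IDs that make up the blob
        self.bytes_uploaded = 0  # running total of "network" traffic

    def put_block(self, block_id, data):
        # Upload one block; it is not part of the blob until committed.
        self.blocks[block_id] = data
        self.bytes_uploaded += len(data)

    def commit(self, block_ids):
        # Define the blob as this ordered sequence of already-uploaded blocks.
        missing = [b for b in block_ids if b not in self.blocks]
        if missing:
            raise KeyError(f"blocks not uploaded: {missing}")
        self.block_list = list(block_ids)

    def content(self):
        return b"".join(self.blocks[b] for b in self.block_list)


blob = ToyBlockBlob()
for block_id, data in [("a", b"AAAA"), ("b", b"BBBB"), ("c", b"CCCC")]:
    blob.put_block(block_id, data)
blob.commit(["a", "b", "c"])

# Replace only the middle block: 4 bytes travel instead of all 12.
before = blob.bytes_uploaded
blob.put_block("b2", b"bbbb")
blob.commit(["a", "b2", "c"])
replaced_cost = blob.bytes_uploaded - before
```

Here `replaced_cost` counts only the bytes of the one new block, which is the saving the rest of the deck builds on.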

CLOUD STORAGE 101

• UPLOAD ENTIRE BLOB AGAIN? WHY?

• TRY AGAIN: UPLOAD SINGLE BLOCK

BLOBSYNC AWESOMESAUCE

• DETECTS CHANGES

• DOES NOT NEED ORIGINAL FILE TO DETECT CHANGES

• UPLOADS/DOWNLOADS CHANGES ONLY

• A TRANSPARENT BLACK BOX: OPEN SOURCE, BUT CAN BE TREATED AS A BLACK BOX

THEORY VS REALITY

• THEORY

[Diagram: Azure Blob Storage vs local machine, offsets 0 to 400]

THEORY…

• IS ALL GOOD IN THEORY

THEORY VS REALITY

• REALITY

[Diagram: Azure Blob Storage holds blocks A, B, C and D; the local machine holds A, B’, C, D; offsets 0 to 400]

FINDING COMMON GROUND

• HOW DO WE FIND MOVED BLOCKS?

• USE HASH/SIGNATURES FOR EACH BLOCK

• SEARCH FOR EACH SIGNATURE THROUGHOUT THE ENTIRE FILE
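
The two bullets above can be sketched as follows. This is an illustration only: the 4-byte block size, the sample data and the helper names `block_signatures` and `find_block` are assumptions, and MD5 is used simply because the deck mentions it later.

```python
import hashlib

BLOCK_SIZE = 4  # tiny, for illustration only


def block_signatures(data, block_size=BLOCK_SIZE):
    """Map MD5 hex digest -> offset for each fixed-size block of `data`."""
    return {
        hashlib.md5(data[off:off + block_size]).hexdigest(): off
        for off in range(0, len(data), block_size)
    }


def find_block(signature, data, block_size=BLOCK_SIZE):
    """Return the first byte offset whose window hashes to `signature`, else None."""
    for off in range(len(data) - block_size + 1):
        if hashlib.md5(data[off:off + block_size]).hexdigest() == signature:
            return off
    return None


remote = b"AAAABBBBCCCCDDDD"
local = b"AAAAxxBBBBCCCCDDDD"  # two bytes inserted after block A

sigs = block_signatures(remote)
sig_c = hashlib.md5(b"CCCC").hexdigest()
offset_c = find_block(sig_c, local)  # block C is found even though it moved
```

Note that `find_block` hashes a full window at every byte offset, which is exactly the cost problem raised a few slides later.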

THEORY VS REALITY

• SEARCH LOCAL

• E.G. SEARCH FOR ‘C’

[Diagram: local machine holds blocks A, B’, C, D; offsets 0 to 400]

SUCCESS!

• CAN NOW FIND BLOCKS EVEN WHEN MOVED

• IF WE CAN FIND A BLOCK WE CAN DETERMINE IF WE CAN REUSE IT

• BUT…

• MD5/SHA ETC. ARE TOO SLOW TO DO THIS

• TOO SLOW? NO WAY!

• E.G.

• 100MB FILE/BLOB

• BLOCK OF 100K

• > 104M HASH CALCULATIONS (ONE PER BYTE OFFSET), JUST TO FIND THAT ONE BLOCK
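
The 104M figure can be checked with a little arithmetic, assuming 1 MB = 1024 × 1024 bytes and 1 K = 1024 bytes, and assuming a brute-force search that hashes one full window at every byte offset.

```python
FILE_SIZE = 100 * 1024 * 1024  # 100 MB in bytes (assumed binary units)
BLOCK_SIZE = 100 * 1024        # 100 K in bytes

# One candidate window starts at every byte offset that still fits a full block.
window_positions = FILE_SIZE - BLOCK_SIZE + 1

# Each of those hashes reads a whole block, so the total work is far larger.
bytes_hashed = window_positions * BLOCK_SIZE

print(f"{window_positions:,} hash calculations")
print(f"{bytes_hashed:,} bytes fed through the hash")
```

Under these assumptions the worst case is about 104.7 million hashes, each over 100 K of data, just to locate a single block.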

YOU HAVE TO ROLL WITH IT.

• ROLLING SIGNATURE

• EXTREMELY QUICK.

• DUE TO FALSE POSITIVES USE MD5/SHA AS CONFIRMATION STEP
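
A rough sketch of the confirmation step, assuming a deliberately weak checksum (a plain byte sum) so that a false positive is easy to show. The function names are hypothetical and the checksum is not claimed to be the one BlobSync uses.

```python
import hashlib


def weak_sig(block):
    # Plain byte sum: cheap, but any reordering of the same bytes collides.
    return sum(block) % 65536


def strong_sig(block):
    return hashlib.md5(block).digest()


def matches(candidate, wanted_weak, wanted_strong):
    """Cheap check first; only pay for MD5 when the weak signature agrees."""
    if weak_sig(candidate) != wanted_weak:
        return False
    return strong_sig(candidate) == wanted_strong


target = b"abcd"
w, s = weak_sig(target), strong_sig(target)

false_positive = weak_sig(b"dcba") == w   # weak signature alone is fooled
confirmed = matches(b"dcba", w, s)        # the MD5 step rejects it
```

Most windows fail the cheap test and never reach MD5, so the expensive hash only runs on the rare candidates.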

YOU HAVE TO ROLL WITH IT.

• SIG = FUNC( 0 .. 4 )

• CALCULATE SIG OF 1..5 BASED OFF OLD SIG

• NEW SIG = OLD SIG - ARRAY[0] + ARRAY[5]
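
The update rule on this slide can be sketched with a plain sum as the signature function. `rolling_sums` is a hypothetical helper; practical rolling checksums (rsync's, for example) add a position-weighted term, but the "subtract the byte that left, add the byte that entered" step is the same idea.

```python
def rolling_sums(data, window):
    """Yield (offset, sum of data[offset:offset + window]) with O(1) work per step."""
    if len(data) < window:
        return
    sig = sum(data[:window])  # only the first window is summed in full
    yield 0, sig
    for start in range(1, len(data) - window + 1):
        # Drop the element that slid out, add the one that slid in.
        sig = sig - data[start - 1] + data[start + window - 1]
        yield start, sig


data = [3, 1, 4, 1, 5, 9, 2, 6]
sigs = list(rolling_sums(data, window=5))
# sigs == [(0, 14), (1, 20), (2, 21), (3, 23)]
```

The second entry is the slide's case: the signature of elements 1..5 is the old signature minus `data[0]` plus `data[5]`.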

YOU HAVE TO ROLL WITH IT.

• CAN SEARCH ENTIRE FILE WITH MINIMAL CALCULATIONS. IE FAST!

SO WHAT NOW?

• CAN NOW SEARCH FILES QUICKLY FOR SIGNATURE MATCHES

• MEANS WE CAN FIGURE OUT WHAT IS COMMON BETWEEN CLOUD AND LOCAL

• CAN DOWNLOAD/UPLOAD ONLY THE DIFFERENCES.
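
Putting the pieces together, here is a minimal sketch of how an upload plan might be derived. It assumes fixed-size remote blocks, a byte-sum rolling signature and MD5 confirmation; `plan_upload` and its ("reuse", index) / ("upload", bytes) step format are invented for illustration and are not BlobSync's actual API.

```python
import hashlib


def plan_upload(remote_blocks, local, block_size):
    """Return steps that rebuild `local` from reused remote blocks plus new bytes."""
    # Index remote blocks by weak signature; keep MD5 for confirmation.
    index = {}
    for i, block in enumerate(remote_blocks):
        index.setdefault(sum(block), []).append((hashlib.md5(block).digest(), i))

    plan, pending = [], bytearray()
    pos, sig = 0, None  # sig: weak signature of local[pos:pos + block_size]
    while pos + block_size <= len(local):
        window = local[pos:pos + block_size]
        if sig is None:
            sig = sum(window)
        match = None
        for strong, i in index.get(sig, []):
            if hashlib.md5(window).digest() == strong:  # confirm the candidate
                match = i
                break
        if match is not None:
            if pending:  # flush bytes that matched nothing
                plan.append(("upload", bytes(pending)))
                pending.clear()
            plan.append(("reuse", match))
            pos += block_size
            sig = None
        else:
            pending.append(local[pos])
            if pos + block_size < len(local):
                # Roll the window forward by one byte.
                sig = sig - local[pos] + local[pos + block_size]
            pos += 1
    pending.extend(local[pos:])  # tail shorter than one block
    if pending:
        plan.append(("upload", bytes(pending)))
    return plan


remote_blocks = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
local = b"AAAAxxBBBBCCCCDDDD"  # two bytes inserted after block A
plan = plan_upload(remote_blocks, local, block_size=4)
uploaded = sum(len(data) for kind, data in plan if kind == "upload")
```

In this example only the two inserted bytes need uploading; every other step refers to a block the cloud copy already holds.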

PROVE IT!

FILE INTERNALS

• ADD

• DELETE

• REPLACE

LIES, MORE LIES AND STATISTICS

• SMALL DB (14M): CLEARED A SMALL TABLE. UPDATE SIZE: 340K

• LARGE DB (555M): CLEARED A SMALL TABLE. UPDATE SIZE: 720K

• VM (8G): DELETED SOME FILES. UPDATE SIZE: 800M
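
For scale, the figures above can be turned into the share of each file that actually had to be transferred. This assumes binary units (1 M = 1024 K, 1 G = 1024 M) and that each "update" number is the amount of data transferred.

```python
K, M, G = 1, 1024, 1024 * 1024  # all sizes expressed in K

cases = {
    "SMALL DB": (14 * M, 340 * K),
    "LARGE DB": (555 * M, 720 * K),
    "VM": (8 * G, 800 * M),
}

# Percentage of the full file size that was transferred for each update.
percent_transferred = {
    name: 100 * update / total for name, (total, update) in cases.items()
}

for name, pct in percent_transferred.items():
    print(f"{name}: {pct:.2f}% of the file transferred")
```

Under those assumptions every case transfers less than a tenth of the file, and the large database transfers well under one percent.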

UPCOMING CHANGES

• DEFRAG

• DYNAMICALLY DETERMINE BLOCK SIZE

• BETTER PARALLEL UPLOAD/DOWNLOAD

• 32-BIT VERSION

LINKS

• BLOG ON BLOBSYNC:

• HTTPS://KPFAULKNER.WORDPRESS.COM/CATEGORY/BLOBSYNC/

• NUGET PACKAGE:

• HTTPS://WWW.NUGET.ORG/PACKAGES/BLOBSYNC/

• GITHUB WITH SOURCE:

• HTTPS://GITHUB.COM/KPFAULKNER/BLOBSYNC/
