Security and Deduplication in the Cloud Danny Harnik - IBM Haifa Research Labs


What is Deduplication?

Deduplication: storing only a single copy of redundant data
  Applied at the file or block level
Major savings in backup environments (saves more than 90% in common business scenarios)
“Most impactful storage technology”
  April 2008: IBM acquires Diligent
  July 2009: EMC acquires Data Domain
  July 2010: Dell acquires Ocarina

How are files deduped?

Fingerprint each file using a hash function
  Common hashes used: SHA-1, SHA-256, others…
Store an index of all the hashes already in the system
New file:
  Compute hash
  Look hash up in index table
  If new → add to index
  If known hash → store as pointer to existing data
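
The lookup steps above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not any vendor's implementation: the class name `DedupStore`, its methods, and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib


class DedupStore:
    """Toy in-memory store that keeps one copy of each distinct file."""

    def __init__(self):
        self.index = {}   # fingerprint -> stored bytes (the single copy)
        self.files = {}   # file name -> fingerprint (a pointer to the data)

    def put(self, name: str, data: bytes) -> bool:
        """Store a file; return True if it was deduplicated."""
        fingerprint = hashlib.sha256(data).hexdigest()
        is_duplicate = fingerprint in self.index
        if not is_duplicate:
            self.index[fingerprint] = data      # new content: add to index
        self.files[name] = fingerprint          # new or known: keep a pointer
        return is_duplicate

    def get(self, name: str) -> bytes:
        return self.index[self.files[name]]


store = DedupStore()
first = store.put("a.mp3", b"same song bytes")
second = store.put("copy-of-a.mp3", b"same song bytes")
```

Here the second `put` finds the fingerprint in the index, so only a pointer is added and the index still holds a single copy.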

Client-side deduplication

Save bandwidth as well as storage
Also known as “source-based dedupe” or “WAN deduplication”
Client computes the hash and sends it to the server
  If new → server asks the client for the file (upload data)
  Otherwise (dedupe) → skip the upload and register the client as another owner of the file

[Figure: the client hashes “Let it be.mp3” to 2fd4e1 and sends the hash to the server; the server's index already holds 2fd4e1 for “Let it be.mp3”]
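
A rough sketch of the client-side exchange follows, to show where the bandwidth saving comes from. The `Server` class, the `offer`/`upload` method names, and the use of SHA-256 are hypothetical choices for this example and do not describe any particular service.

```python
import hashlib


class Server:
    def __init__(self):
        self.index = {}    # hash -> file bytes
        self.owners = {}   # hash -> set of user names

    def offer(self, user: str, file_hash: str) -> bool:
        """Client announces a hash. Return True if an upload is needed."""
        if file_hash in self.index:
            self.owners[file_hash].add(user)   # dedupe: just add an owner
            return False
        return True

    def upload(self, user: str, data: bytes) -> None:
        file_hash = hashlib.sha256(data).hexdigest()
        self.index[file_hash] = data
        self.owners.setdefault(file_hash, set()).add(user)


def client_store(server: Server, user: str, data: bytes) -> int:
    """Store a file and return the number of payload bytes sent."""
    file_hash = hashlib.sha256(data).hexdigest()
    if server.offer(user, file_hash):
        server.upload(user, data)
        return len(data)
    return 0                                   # only the short hash was sent


server = Server()
song = b"Let it be" * 1000
sent_by_alice = client_store(server, "alice", song)
sent_by_bob = client_store(server, "bob", song)
```

Alice pays for the full upload; Bob sends only the hash and is registered as a second owner.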

Deduplication and privacy

Our attacks are relevant to the following setting:
  Client-side deduplication
  Cross-user deduplication
    If two or more users store the same file, only a single copy is stored.

Cloud storage and deduplication

Cloud storage services are gaining popularity
  Online file backup and synchronization is huge
  Lots to gain from deduplication
Use/used cross-user client-side deduplication:
  Mozy, Dropbox, Memopal, MP3Tunes, …

Deduplication and privacy I
Harnik, Pinkas & Shulman-Peleg, IEEE Security & Privacy, Vol. 8, 2010

Client learns if an object is already in the system
  A narrow “peephole” into the contents of other users
Discussed attacks and partial solutions
  Illegal content searching
  “Salary attack”
  Covert channel
Several ways to prevent:
  Encrypt or dedupe server-side only
  Dedupe only on long files
  Noisy dedupe…
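
The “peephole” can be illustrated with a short sketch. The slide only names the “salary attack”; the reading below is an assumption: the attacker knows a document template and tries candidate values until the server reports that one is already stored. `DedupServer`, `guess_salary`, and the template text are hypothetical.

```python
import hashlib


def sha(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class DedupServer:
    """Hypothetical cross-user, client-side dedup service."""

    def __init__(self):
        self.known_hashes = set()

    def needs_upload(self, file_hash: str) -> bool:
        return file_hash not in self.known_hashes

    def upload(self, data: bytes) -> None:
        self.known_hashes.add(sha(data))


def guess_salary(server, template: str, candidates):
    """Return the candidate values whose filled-in document is already stored."""
    hits = []
    for value in candidates:
        document = template.format(salary=value).encode()
        if not server.needs_upload(sha(document)):   # dedupe => someone has it
            hits.append(value)
    return hits


server = DedupServer()
template = "Dear Bob, your annual salary is ${salary}."
server.upload(template.format(salary=83000).encode())     # the victim's backup

found = guess_salary(server, template, range(50000, 100001, 1000))
```

The attacker never sees the victim's file; the only signal is whether the server asks for an upload.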

Deduplication and privacy II
Halevi, Harnik, Pinkas & Shulman-Peleg, ACM CCS 2011

A more direct attack
  Starting point: Suppose I get the hash value of your file…

The attack

Attacker obtains the hash of the victim’s file
Signs up for the service with their own account
Attempts to upload a file, but swaps the hash value with that of the victim’s file
File is now registered to the attacker
  Download file…

[Figure: the client holds an arbitrary file whose real hash is e3b890 but sends 2fd4e1 to the server; the server's index holds 2fd4e1 for “Let it be.mp3”]
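
The steps of the attack can be replayed against a toy server. This is a sketch under stated assumptions: `NaiveDedupServer` is a hypothetical server that trusts whatever hash the client announces, which is the behaviour the attack relies on.

```python
import hashlib


def sha(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class NaiveDedupServer:
    """Hypothetical server that trusts the hash a client announces."""

    def __init__(self):
        self.index = {}    # hash -> file bytes
        self.owners = {}   # hash -> set of users

    def store(self, user, claimed_hash, data=None):
        if claimed_hash not in self.index:
            if data is None or sha(data) != claimed_hash:
                raise ValueError("unknown hash: a matching upload is required")
            self.index[claimed_hash] = data
        # Known hash: no bytes are requested, the user simply becomes an owner.
        self.owners.setdefault(claimed_hash, set()).add(user)

    def download(self, user, file_hash):
        if user not in self.owners.get(file_hash, set()):
            raise PermissionError("not an owner")
        return self.index[file_hash]


server = NaiveDedupServer()
secret = b"victim's private file"
server.store("victim", sha(secret), secret)

leaked_hash = sha(secret)                  # the attacker learns only this value
server.store("attacker", leaked_hash)      # "upload" with the swapped hash
stolen = server.download("attacker", leaked_hash)
```

The attacker ends up with the full file after sending nothing but a short hash value.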

Obtaining the hash

1. Hash used for other services
   Hash does not reveal “anything” about the file; it is not meant to be secret
2. Malicious software
   Easier to send a small signature undetected
   Also true for a break-in at the server side
3. CDN attack
   Alice sends all her friends the hash of a movie
   Friends can download it from the server
   Server essentially serves as a Content Distribution Network (CDN)
   Might break its cost structure, if it planned on serving only a few restore ops

Swapping the hash

[Dorrendorf & Pinkas 2011]
  Implemented the attacks against two major storage services
  One service uses SHA-256 to identify files
  Another uses a 160-bit hash value which was not identified
Dropship (April 2011): implementation of the CDN over Dropbox
  “written in Python. Allow you to download to your Dropbox any file, which description we got in JSON format (similar as description propagated in .torrent files).”
[Mulazzani, Schrittwieser, Leithner, Huber & Weippl 2011]
  Implemented the attack on Dropbox
  In USENIX Security 2011
A non-issue in upcoming cloud storage standards

SOLUTIONS!

Naïve Solutions

Use a non-standard hash (e.g. Hash(“service name” | file))
  But all clients must know the hash function
  Irrelevant in most scenarios (CDN, malicious software, etc.)

Better naïve Solutions

Use a challenge-response phase
  For every upload, the server picks a random nonce and asks the client to compute Hash(nonce | file)
  This requires the client to have the file
  But the server, too, must now retrieve the file from secondary storage and compute the hash
Alternative: pre-compute Hash(nonce | file) and store it together with the hash
  Back to the root cause of the problem: a short hash represents the file entirely
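
The nonce-based check can be sketched as follows. The function names, the 16-byte nonce, and SHA-256 are assumptions for illustration; the sketch also makes the cost visible, since the server has to read the whole stored file to verify.

```python
import hashlib
import os


def respond(nonce: bytes, data: bytes) -> str:
    """Compute Hash(nonce | file); this needs the file bytes, not just a hash."""
    return hashlib.sha256(nonce + data).hexdigest()


def server_verify(stored_file: bytes, client_answer_fn) -> bool:
    nonce = os.urandom(16)                       # fresh for every upload
    expected = respond(nonce, stored_file)       # server must read the whole file
    return client_answer_fn(nonce) == expected


file_bytes = b"Let it be" * 1000
file_hash = hashlib.sha256(file_bytes).digest()

honest = server_verify(file_bytes, lambda nonce: respond(nonce, file_bytes))
# A client holding only the short hash can at best hash that value instead.
cheater = server_verify(file_bytes, lambda nonce: respond(nonce, file_hash))
```

Knowing the plain file hash no longer helps, because the expected answer depends on a nonce the attacker could not predict.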

Proofs of Ownership (POWs)

Server preprocesses the file
  Stores some short information per file (a few bytes only)
Proof stage: a challenge-response, done only during file upload
  Honest client has access to the file
  Server has access only to the preprocessed information; it cannot retrieve files from secondary storage
  Must be bandwidth efficient
  Client computation should be efficient (time and memory)
Security definition: a malicious client may
  Have partial knowledge of the file (the file has k bits of min-entropy to it)
  Receive additional information from accomplices (m bits)
  If k - m > security parameter, then the proof fails w.h.p.

[Figure: the file, annotated with the regions “Prior knowledge” and “Accomplice data” and the labels k and s]

Proofs of Retrievability (PORs)

Role reversal: the server proves to the client that it actually stores the client’s file
  Strong extraction-based definition (we use a relaxed notion)
State-of-the-art solutions all send a pre-processed file to the server
  E.g. [NR05], [JK07], [SW08], [DVW09]
  Cannot be done in our setting
In general, a POR without preprocessing is a good POW
Our first solution is a Merkle-tree-based POR

Solution – first attempt

[Figure: a Merkle tree built over the file]

Preprocessing: server stores the root of the tree
Proof: server asks the client to present paths to t random leaves
A client which knows only a p fraction of the file succeeds with probability < p^t.
√ Very efficient
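
A compact sketch of the Merkle-tree proof is given below. It is an illustration under assumptions, not the authors' code: the 64-byte blocks, SHA-256, and the rule of duplicating the last node on odd levels are choices made here, and the helper names are hypothetical. The server would keep only `root` (plus the number of leaves); the tree itself is rebuilt by the client from its copy of the file.

```python
import hashlib
import random


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_tree(blocks):
    """Return the tree as a list of levels, leaves first, root last."""
    level = [h(b) for b in blocks]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:                      # pad odd levels with the last node
            level = level + [level[-1]]
            levels[-1] = level
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def prove(levels, index):
    """Sibling hashes on the path from leaf `index` up to the root."""
    path = []
    for level in levels[:-1]:
        path.append(level[index ^ 1])
        index //= 2
    return path


def verify(root, block, index, path):
    node = h(block)
    for sibling in path:
        node = h(node + sibling) if index % 2 == 0 else h(sibling + node)
        index //= 2
    return node == root


blocks = [bytes([i]) * 64 for i in range(10)]
levels = build_tree(blocks)
root = levels[-1][0]                            # all the server needs to store

t = 5
challenge = random.sample(range(len(blocks)), t)             # server picks leaves
answers = [(i, blocks[i], prove(levels, i)) for i in challenge]  # honest client
accepted = all(verify(root, blk, i, path) for i, blk, path in answers)
```

Each answer costs one block plus a logarithmic number of hashes, which is why the slide calls the scheme very efficient.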

Problem and solution

Does not suffice when min-entropy is low (e.g. the attacker already knows 90% of the file)
Solution: apply the tree to an erasure coding of the file
  Satisfies the security of both POW and POR
Efficient encoding? Must pay either:
  Large memory
  Multiple disk accesses
  Bad for large files

[Figure: the file is erasure-coded, and the Merkle tree is built over the encoded file]
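
To see why a large known fraction is a problem for the plain tree, one can ask how many leaves the server must query. The sketch below starts from the p^t bound of the first attempt; the target of 2^-s and the value s = 20 are assumptions chosen for illustration, and `leaves_needed` is a hypothetical helper.

```python
import math


def leaves_needed(p: float, s: int) -> int:
    """Smallest t with p**t <= 2**-s, where p is the known fraction of leaves."""
    return math.ceil(s / math.log2(1 / p))


for p in (0.5, 0.9, 0.99):
    print(p, leaves_needed(p, 20))
```

As p approaches 1 the number of queried leaves grows quickly, which is one way to read the motivation for building the tree over an erasure-coded file instead.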

Protocols with small space

Limit the solution to use an L-byte buffer for all the computation
  For example: L = 64 MB
Relax security guarantees:
  Can only tolerate L bytes of accomplice data

[Figure: the file, annotated with the regions “Prior knowledge” and “Accomplice” and the labels s and L]

Second protocol: hash to small space

First hash the file to a buffer of L bytes, then construct a Merkle tree over the buffer
  Reducer: use pairwise-independent hashing
Security: the POW will fail (w.h.p.) an adversary that
  Has at least k bits of min-entropy on the file
  Receives fewer than min(L, k - s) bits from an accomplice

[Figure: File → Reducer → Reduced file → Merkle tree]

Is this efficient enough?

Still not really practical
  File size M
  Buffer size L
  Reducer requires Ω(M·L) time
We want to push it further down…

Third protocol: Reduce and Mix

In the Reducer: XOR each block to a constant number of random locations
  Runs in O(M+L) time
Add a mixing phase
Hypothesis: reduce + mix forms a good code
Security defined against a generalized block-fixing source distribution

[Figure: File → Reducer → Reduced file → Mixer → Reduced & mixed file → Merkle tree]
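
The reduce step can be sketched directly from the description above; the mixing step cannot, since the slide does not spell it out. In the sketch below the block size, the number of target locations, the seeded `random.Random` generator (a stand-in for whatever public, hash-derived choice a real protocol would use), and the entire `mix` rule are assumptions for illustration only.

```python
import random

BLOCK = 64            # bytes per block (assumed)
COPIES = 4            # the "constant number" of target locations (assumed)


def xor_into(buffer, slot, block):
    start = slot * BLOCK
    for i, byte in enumerate(block):
        buffer[start + i] ^= byte


def reduce_file(data: bytes, slots: int, seed: int) -> bytearray:
    """XOR every file block into COPIES pseudo-random slots of a fixed buffer."""
    rng = random.Random(seed)
    buffer = bytearray(slots * BLOCK)
    for offset in range(0, len(data), BLOCK):
        block = data[offset:offset + BLOCK]
        for slot in rng.sample(range(slots), COPIES):
            xor_into(buffer, slot, block)
    return buffer


def mix(buffer: bytearray, slots: int, seed: int, rounds: int = 2) -> bytearray:
    """Assumed mixing rule: XOR each slot into a few pseudo-random other slots."""
    rng = random.Random(seed + 1)
    for _ in range(rounds):
        for slot in range(slots):
            block = bytes(buffer[slot * BLOCK:(slot + 1) * BLOCK])
            for target in rng.sample(range(slots), COPIES):
                if target != slot:
                    xor_into(buffer, target, block)
    return buffer


data = bytes(range(256)) * 40            # a 10,240-byte stand-in for the file
reduced = mix(reduce_file(data, slots=16, seed=7), slots=16, seed=7)
```

Each file block is touched a constant number of times and each buffer slot a constant number of times per round, so the work is linear in M + L; the Merkle tree would then be built over `reduced` rather than over the file.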

Performance of the different phases of the low-space PoW


When is it worth the effort?

Summary

Identified security implications of client-side deduplication
Introduced POWs to enable client-side deduplication in the cloud
  The challenge: offer meaningful privacy guarantees with a limited toll on the resources
