33
Fine Grain Access Control for Big Data: ORC Column Encryption Owen O’Malley [email protected] @owen_omalley March 2019

ORC Encryption 2019 - SCALE · • Hive & Spark •No change other than defining table properties • Apache Hive’s LLAP •Cacheandfastprocessing of SQL queries •Column encryption

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: ORC Encryption 2019 - SCALE · • Hive & Spark •No change other than defining table properties • Apache Hive’s LLAP •Cacheandfastprocessing of SQL queries •Column encryption

Fine Grain Access Control for Big Data: ORC Column EncryptionOwen O’[email protected]@owen_omalley

March 2019

Page 2: ORC Encryption 2019 - SCALE · • Hive & Spark •No change other than defining table properties • Apache Hive’s LLAP •Cacheandfastprocessing of SQL queries •Column encryption

2 © Hortonworks Inc. 2011 – 2019. All Rights Reserved

Who Am I?• Worked on Hadoop since Jan 2006• Worked at Yahoo on Web Search team• First committer added to Hadoop project

• MapReduce, Security, Hive, and ORC• Worked on many different file formats• Sequence File, Avro, RC File and ORC File

• Spun Hortonworks out of Yahoo in 2011

Page 3: ORC Encryption 2019 - SCALE · • Hive & Spark •No change other than defining table properties • Apache Hive’s LLAP •Cacheandfastprocessing of SQL queries •Column encryption

3 © Hortonworks Inc. 2011 – 2019. All Rights Reserved

What is the Problem?

Page 4: ORC Encryption 2019 - SCALE · • Hive & Spark •No change other than defining table properties • Apache Hive’s LLAP •Cacheandfastprocessing of SQL queries •Column encryption

4 © Hortonworks Inc. 2011 – 2019. All Rights Reserved

Controlling Sensitive Data• Some data is very sensitive• Personally Identifiable Information• Credit card• Medical information

• Companies run on data• Need controls on data• GDPR is a BIG deal

Page 5: ORC Encryption 2019 - SCALE · • Hive & Spark •No change other than defining table properties • Apache Hive’s LLAP •Cacheandfastprocessing of SQL queries •Column encryption

5 © Hortonworks Inc. 2011 – 2019. All Rights Reserved

What is the Problem?• Related data, different security requirements

• Authorization – who can see it

• Audit – track who read it

• Encrypt on disk – regulatory

• File-level (or blob) granularity isn’t enough

• File systems don’t understand columns

Page 6: ORC Encryption 2019 - SCALE · • Hive & Spark •No change other than defining table properties • Apache Hive’s LLAP •Cacheandfastprocessing of SQL queries •Column encryption

6 © Hortonworks Inc. 2011 – 2019. All Rights Reserved

Requirements• Readers should transparently decrypt data• If and only if the user has access to the key• The data must be decrypted locally

• Columns are only decrypted as necessary• Master keys must be managed securely• Support for Key Management Server & hardware• Support for key rolling

Page 7: ORC Encryption 2019 - SCALE · • Hive & Spark •No change other than defining table properties • Apache Hive’s LLAP •Cacheandfastprocessing of SQL queries •Column encryption

7 © Hortonworks Inc. 2011 – 2019. All Rights Reserved

Partial Solutions

Page 8: ORC Encryption 2019 - SCALE · • Hive & Spark •No change other than defining table properties • Apache Hive’s LLAP •Cacheandfastprocessing of SQL queries •Column encryption

8 © Hortonworks Inc. 2011 – 2019. All Rights Reserved

Partial Solution – HDFS Encryption• Transparent HDFS Encryption• Encryption zones• HDFS directory trees• Unique master key for each zone• Client decrypts data

• Key Management via KeyProvider API

Page 9: ORC Encryption 2019 - SCALE · • Hive & Spark •No change other than defining table properties • Apache Hive’s LLAP •Cacheandfastprocessing of SQL queries •Column encryption

9 © Hortonworks Inc. 2011 – 2019. All Rights Reserved

HDFS Encryption Limitations• Very coarse protection• Only entire directory subtrees• No ability to protect columns• A lot of users need access to keys

• Moves between zones is painful• When writing with Hive, data is moved

multiple times per a query

Page 10: ORC Encryption 2019 - SCALE · • Hive & Spark •No change other than defining table properties • Apache Hive’s LLAP •Cacheandfastprocessing of SQL queries •Column encryption

10 © Hortonworks Inc. 2011 – 2019. All Rights Reserved

Partial Solution – Hive Server 2• All queries sent to Hive Server 2• Only ‘hive’ user has access to data in HDFS• Supports LLAP

• Integrates with Apache Ranger• Control access to rows & columns• Dynamically mask data

Page 11: ORC Encryption 2019 - SCALE · • Hive & Spark •No change other than defining table properties • Apache Hive’s LLAP •Cacheandfastprocessing of SQL queries •Column encryption

11 © Hortonworks Inc. 2011 – 2019. All Rights Reserved

Hive Architecture with Hive Server 2

Page 12: ORC Encryption 2019 - SCALE · • Hive & Spark •No change other than defining table properties • Apache Hive’s LLAP •Cacheandfastprocessing of SQL queries •Column encryption

12 © Hortonworks Inc. 2011 – 2019. All Rights Reserved

Hive Server 2 Limitations• Limits access to Hive SQL• Breaks Hadoop’s multi-paradigm data access• Many customers use both Hive & Spark

• JDBC is not distributed• Throughput is limited to 1 machine• New Spark to LLAP connector addresses this

Page 13: ORC Encryption 2019 - SCALE · • Hive & Spark •No change other than defining table properties • Apache Hive’s LLAP •Cacheandfastprocessing of SQL queries •Column encryption

13 © Hortonworks Inc. 2011 – 2019. All Rights Reserved

Partial Solution – Separate tables• Split private information out of tables• Separate directories in HDFS• HDFS and/or HS2 authorization• Enables HDFS encryption

• Limitations• Need to join with other tables• Higher operational overhead

Page 14: ORC Encryption 2019 - SCALE · • Hive & Spark •No change other than defining table properties • Apache Hive’s LLAP •Cacheandfastprocessing of SQL queries •Column encryption

14 © Hortonworks Inc. 2011 – 2019. All Rights Reserved

Partial Solution – Encryption UDF• Hive has user defined functions• aes_encrypt and aes_decrypt

• Limitations• Key management is problematic• Encryption is not seeded• Size of value leaks information

Page 15: ORC Encryption 2019 - SCALE · • Hive & Spark •No change other than defining table properties • Apache Hive’s LLAP •Cacheandfastprocessing of SQL queries •Column encryption

15 © Hortonworks Inc. 2011 – 2019. All Rights Reserved

Solution

Page 16: ORC Encryption 2019 - SCALE · • Hive & Spark •No change other than defining table properties • Apache Hive’s LLAP •Cacheandfastprocessing of SQL queries •Column encryption

16 © Hortonworks Inc. 2011 – 2019. All Rights Reserved

Columnar Encryption• Columnar file formats (eg. ORC)• Write data in columns• Column projection• Better compression

• Encryption works really well• Only encrypt bytes for column• Can store multiple variants of data

Page 17: ORC Encryption 2019 - SCALE · • Hive & Spark •No change other than defining table properties • Apache Hive’s LLAP •Cacheandfastprocessing of SQL queries •Column encryption

17 © Hortonworks Inc. 2011 – 2019. All Rights Reserved

ORC File Format

File Footer

Postscript

Index Data

Row Data

Stripe Footer

~200

MB

Strip

e

Index Data

Row Data

Stripe Footer

~200

MB

Strip

e

Index Data

Row Data

Stripe Footer

~200

MB

Strip

e

Column 1

Column 2

Column 7

Column 8

Column 3

Column 6

Column 4

Column 5

Column 1

Column 2

Column 7

Column 8

Column 3

Column 6

Column 4

Column 5

Stream 2.1

Stream 2.2

Stream 2.3

Stream 2.4

Page 18: ORC Encryption 2019 - SCALE · • Hive & Spark •No change other than defining table properties • Apache Hive’s LLAP •Cacheandfastprocessing of SQL queries •Column encryption

18 © Hortonworks Inc. 2011 – 2019. All Rights Reserved

User Experience• Set table properties for encryption

• orc.encrypt.pii = ”ssn,email”

• orc.encrypt.credit = “card_info”

• Define where to get the encryption keys

• Configuration defines the key provider via URI

• Can use the Hadoop or Ranger KMS

• Compatible with public cloud KMS

Page 19: ORC Encryption 2019 - SCALE · • Hive & Spark •No change other than defining table properties • Apache Hive’s LLAP •Cacheandfastprocessing of SQL queries •Column encryption

19 © Hortonworks Inc. 2011 – 2019. All Rights Reserved

Key Management• Uses Hadoop or Ranger KMS• Create a master key for each use• “pii”, “pci”, or “hippa”

• Each column in each file uses unique local key• Policies limit access to master keys• User never gets master keys

Page 20: ORC Encryption 2019 - SCALE · • Hive & Spark •No change other than defining table properties • Apache Hive’s LLAP •Cacheandfastprocessing of SQL queries •Column encryption

20 © Hortonworks Inc. 2011 – 2019. All Rights Reserved

KeyProvider API• Provides limited access to encryption keys• Encrypts or decrypts local keys• Key versions and key rolling• Allows 3rd party plugins• Supports Hadoop or Ranger KMS

Page 21: ORC Encryption 2019 - SCALE · • Hive & Spark •No change other than defining table properties • Apache Hive’s LLAP •Cacheandfastprocessing of SQL queries •Column encryption

21 © Hortonworks Inc. 2011 – 2019. All Rights Reserved

Encryption Data Flow

Page 22: ORC Encryption 2019 - SCALE · • Hive & Spark •No change other than defining table properties • Apache Hive’s LLAP •Cacheandfastprocessing of SQL queries •Column encryption

22 © Hortonworks Inc. 2011 – 2019. All Rights Reserved

Encryption Flow• Local key• Random for each encrypted column in file• Encrypted w/ master key by KMS• Encrypted local key is stored in file metadata

• IV is generated to be unique• Column, kind, stripe, & counter

Page 23: ORC Encryption 2019 - SCALE · • Hive & Spark •No change other than defining table properties • Apache Hive’s LLAP •Cacheandfastprocessing of SQL queries •Column encryption

23 © Hortonworks Inc. 2011 – 2019. All Rights Reserved

Data Masking• What happens without key access?• Define static masks• Nullify – all values become null• Redact – mask values ‘Xxxxx Xxxxx!’• Can define ranges to unmask

• SHA256 – replace with SHA256• Custom - user defined

Page 24: ORC Encryption 2019 - SCALE · • Hive & Spark •No change other than defining table properties • Apache Hive’s LLAP •Cacheandfastprocessing of SQL queries •Column encryption

24 © Hortonworks Inc. 2011 – 2019. All Rights Reserved

Data Masking• Anonymization is hard!• AOL search logs• Netflix prize datasets• NYC taxi dataset

• Always evaluate security tradeoffs• Tokenization is a useful technique• Assign arbitrary replacements

Page 25: ORC Encryption 2019 - SCALE · • Hive & Spark •No change other than defining table properties • Apache Hive’s LLAP •Cacheandfastprocessing of SQL queries •Column encryption

25 © Hortonworks Inc. 2011 – 2019. All Rights Reserved

Key Disposal• Often need to keep data for 90 days• Currently the data is written twice• With column encryption:• Roll keys daily• Delete master key after 90 days

Page 26: ORC Encryption 2019 - SCALE · • Hive & Spark •No change other than defining table properties • Apache Hive’s LLAP •Cacheandfastprocessing of SQL queries •Column encryption

26 © Hortonworks Inc. 2011 – 2019. All Rights Reserved

ORC Encryption Design• Write both variants of streams• Masked unencrypted• Unmasked encrypted

• Encrypt both data and statistics• Maintain compatibility for old readers• Read unencrypted variant

• Preserve ability to seek in file

Page 27: ORC Encryption 2019 - SCALE · • Hive & Spark •No change other than defining table properties • Apache Hive’s LLAP •Cacheandfastprocessing of SQL queries •Column encryption

27 © Hortonworks Inc. 2011 – 2019. All Rights Reserved

ORC Write Pipeline• Streams go through pipeline• Run length encoding• Compression (zlib, snappy, or lzo)• Encryption

• Encryption is AES/CTR• Allows seek• No padding

Page 28: ORC Encryption 2019 - SCALE · • Hive & Spark •No change other than defining table properties • Apache Hive’s LLAP •Cacheandfastprocessing of SQL queries •Column encryption

28 © Hortonworks Inc. 2011 – 2019. All Rights Reserved

Conclusions

Page 29: ORC Encryption 2019 - SCALE · • Hive & Spark •No change other than defining table properties • Apache Hive’s LLAP •Cacheandfastprocessing of SQL queries •Column encryption

29 © Hortonworks Inc. 2011 – 2019. All Rights Reserved

Conclusions• ORC column encryptions provides• Transparent encryption• Multi-paradigm column security• Audit logging (via KMS logging)• Static masking

• Supports file merging• Different stripes with different local key

Page 30: ORC Encryption 2019 - SCALE · • Hive & Spark •No change other than defining table properties • Apache Hive’s LLAP •Cacheandfastprocessing of SQL queries •Column encryption

30 © Hortonworks Inc. 2011 – 2019. All Rights Reserved

Integration with Other Tools• Apache Ranger• Provides security from a single control panel• Provides Attribute Based Access Control (ABAC)• Manages encryption settings based on policies• Controls access to decryption keys

• Apache Atlas• Metadata driven governance for enterprises• Provides ability to tag tables or columns

Page 31: ORC Encryption 2019 - SCALE · • Hive & Spark •No change other than defining table properties • Apache Hive’s LLAP •Cacheandfastprocessing of SQL queries •Column encryption

31 © Hortonworks Inc. 2011 – 2019. All Rights Reserved

Integration with Other Tools• Hive & Spark• No change other than defining table properties

• Apache Hive’s LLAP• Cache and fast processing of SQL queries• Column encryption changes internal interfaces• Cache both encrypted and unencrypted variants• Ensure audit log reflects end-user and what they accessed

Page 32: ORC Encryption 2019 - SCALE · • Hive & Spark •No change other than defining table properties • Apache Hive’s LLAP •Cacheandfastprocessing of SQL queries •Column encryption

32 © Hortonworks Inc. 2011 – 2019. All Rights Reserved

Limitations• Need encryption policy for write

• Current Atlas & Ranger tags lag data

• Auto-discovery requires pre-access

• Changes to masking policy

• Need to re-write files

• Need additional data masks

• Credit card, addresses, etc.

• Decrypted local keys could be saved

Page 33: ORC Encryption 2019 - SCALE · • Hive & Spark •No change other than defining table properties • Apache Hive’s LLAP •Cacheandfastprocessing of SQL queries •Column encryption

33 © Hortonworks Inc. 2011 – 2019. All Rights Reserved

Thank you!Twitter: @owen_omalleyEmail: [email protected]