Securing Spark Applications

Hadoop Summit 2016 - Dublin
Marcelo Vanzin

Cloudera, Inc. All rights reserved.

What is Security?

  • Security has many facets
  • This talk will focus on three areas:
      • Encryption
      • Authentication
      • Authorization

Why do I need security?

  • User identification
  • Application isolation
  • Access control enforcement
  • Compliance with government regulations

Before we go further...

  • Set up Kerberos
  • Use HDFS (or another secure filesystem)
  • Use YARN
  • Configure them for security (enable auth, encryption)

Kerberos, HDFS, and YARN provide the security backbone for Spark.

Encryption

  • In a secure cluster, data should not be visible in the clear
      • On-the-wire data
      • At-rest data
  • Very important to financial / government institutions
      • Or anyone who works with sensitive data

What a Spark app looks like

[Diagram. Components: Driver, Executor (x2), Shuffle Service, UI, Disk (x2). Labeled flows: Control RPC, File Download, Shuffle / Cached Blocks, Shuffle Blocks, Shuffle Blocks / Metadata.]

Prior to Spark 1.6

Different channel, different method:

  • Control plane: SSL
  • File distribution: SSL
  • Shuffle Blocks: SASL
  • User UI / REST API: Nothing
  • Spilled/Shuffle Blocks: Use ecryptfs (or equivalent)

What is wrong with SSL?

Why not SSL?

  • SSL can be hard to set up
  • Need certificates readable on every node
  • Sharing certificates not as secure
  • Hard to have per-user certificates

Spark 1.6

  • Standardizes around a common transport library
  • Replaces Akka RPC (SPARK-6028)
  • Replaces HTTP File service (SPARK-11140)
  • Uses SASL encryption

But...

  • WebUI still has no encryption
  • Shuffle / Spilled blocks still require FS-level encryption
  • SASL in JVM restricted to 3DES encryption, which is not very strong
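
As a rough sketch of what turning on SASL encryption might look like in practice, the snippet below renders `--conf` arguments for `spark-submit`. The property names are not on the slides; they are assumptions based on the Spark 1.6 configuration documentation, so check them against the version you run. The helper `to_submit_args` is hypothetical.

```python
# Properties that enable authentication and SASL encryption on the
# common transport library. ASSUMPTION: names taken from the Spark 1.6
# configuration docs, not from this talk; verify them for your version.
SASL_ENCRYPTION_CONF = {
    "spark.authenticate": "true",
    "spark.authenticate.enableSaslEncryption": "true",
    "spark.network.sasl.serverAlwaysEncrypt": "true",
}


def to_submit_args(conf):
    """Render a config dict as a flat list of spark-submit arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


if __name__ == "__main__":
    print(" ".join(["spark-submit"] + to_submit_args(SASL_ENCRYPTION_CONF)))
```

Keeping the settings in one dict makes it easy to review exactly which channels are covered; per the slide, the Web UI and blocks spilled to disk are not among them.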

Spark 2.0

  • REPL class distribution using transport lib (SPARK-11563)
  • HTTPS Support for WebUI and History Server (SPARK-2750)
  • Encrypting shuffle blocks is almost in (SPARK-5682)
      • Depends on third party Chimera library for encryption
      • Work is being done to add Chimera to Apache Commons

Future:

  • Use Chimera to encrypt over-the-wire data

Authentication

Who is reading my data?

  • Spark relies on Kerberos, the necessary evil
  • Ubiquitous in Hadoop
      • YARN, HDFS, Hive...

Who is reading my data?

Kerberos provides secure authentication.

[Diagram: a User running an Application talks to the KDC.]

  • User: "Hi, I'm Bob."
  • KDC: "Hello Bob. Here's your TGT."
  • User: "Here's my TGT. I want to talk to HDFS."
  • KDC: "Here's your HDFS ticket."
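
The exchange above can be sketched as a toy model. This is only an illustration of the two-step flow (log in for a TGT, then trade the TGT for a service ticket); real Kerberos relies on encrypted tickets and shared keys, none of which is modeled here. The class `ToyKDC` and its methods are hypothetical names.

```python
import secrets


class ToyKDC:
    """Toy stand-in for a KDC. No real cryptography is involved."""

    def __init__(self, passwords):
        self._passwords = dict(passwords)  # principal -> password
        self._tgts = {}                    # opaque TGT -> principal
        self.requests = 0                  # number of times the KDC was contacted

    def login(self, principal, password):
        """'Hi, I'm Bob.' -> 'Hello Bob. Here's your TGT.'"""
        self.requests += 1
        if self._passwords.get(principal) != password:
            raise PermissionError("authentication failed")
        tgt = secrets.token_hex(8)
        self._tgts[tgt] = principal
        return tgt

    def service_ticket(self, tgt, service):
        """'Here's my TGT. I want to talk to HDFS.' -> 'Here's your HDFS ticket.'"""
        self.requests += 1
        principal = self._tgts.get(tgt)
        if principal is None:
            raise PermissionError("unknown TGT")
        return {"client": principal, "service": service}


if __name__ == "__main__":
    kdc = ToyKDC({"bob": "s3cret"})
    tgt = kdc.login("bob", "s3cret")
    print(kdc.service_ticket(tgt, "hdfs"))
    print("KDC requests:", kdc.requests)
```

Note that every step goes through the KDC, which is fine for a single user process.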

Now with a distributed app...

[Diagram: eight Executors each tell the KDC "Hi, I'm Bob." at the same time.]

Something is wrong.

Kerberos in Hadoop / Spark

Hadoop services use delegation tokens to avoid KDC limitations.

[Diagram: Driver, NameNode, Executor, DataNode.]
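
To make the motivation concrete, here is a toy comparison of KDC load with and without delegation tokens. It is a sketch under simplifying assumptions: the classes and `kdc_requests` are hypothetical, the "token" is just a random string, and a single counter stands in for all Kerberos traffic.

```python
import secrets


class ToyKDC:
    """Counts how many authentication requests it receives."""

    def __init__(self):
        self.requests = 0

    def authenticate(self, principal):
        self.requests += 1
        return f"ticket-for-{principal}"


class ToyNameNode:
    """Toy service that accepts either Kerberos or a delegation token."""

    def __init__(self, kdc):
        self._kdc = kdc
        self._tokens = {}  # token -> principal

    def issue_token(self, principal):
        # Getting a token requires Kerberos, so it costs one KDC request.
        self._kdc.authenticate(principal)
        token = secrets.token_hex(8)
        self._tokens[token] = principal
        return token

    def read_with_kerberos(self, principal, path):
        self._kdc.authenticate(principal)
        return f"{principal} read {path}"

    def read_with_token(self, token, path):
        principal = self._tokens[token]  # KeyError if the token is unknown
        return f"{principal} read {path}"


def kdc_requests(num_executors, use_tokens):
    """Return how many KDC requests a toy job with N executors generates."""
    kdc = ToyKDC()
    namenode = ToyNameNode(kdc)
    if use_tokens:
        token = namenode.issue_token("bob")  # done once, by the driver
        for i in range(num_executors):
            namenode.read_with_token(token, f"/data/part-{i}")
    else:
        for i in range(num_executors):
            namenode.read_with_kerberos("bob", f"/data/part-{i}")
    return kdc.requests


if __name__ == "__main__":
    print("without tokens:", kdc_requests(8, use_tokens=False))
    print("with tokens:   ", kdc_requests(8, use_tokens=True))
```

In this model the KDC load stays constant as executors are added, because only the driver ever authenticates with Kerberos.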

Delegation Tokens

  • Like Kerberos tickets, they have a TTL.
  • OK for most batch applications.
  • Not OK for long running applications
      • Streaming
      • Spark SQL Thrift Server

Since 1.4, Spark can manage delegation tokens, but support is very limited.

  • Full support only for HDFS.
  • Limited support for Hive, HBase.
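
The TTL problem can be illustrated with a small model of a renewable token. This is a sketch, not Hadoop's implementation: `ToyToken` is a hypothetical class, and the 24-hour renew interval and 7-day maximum lifetime are assumed values chosen for illustration.

```python
from dataclasses import dataclass, field

HOUR = 3600  # seconds


@dataclass
class ToyToken:
    issued_at: float
    renew_interval: float = 24 * HOUR     # ASSUMED value, for illustration
    max_lifetime: float = 7 * 24 * HOUR   # ASSUMED value, for illustration
    expires_at: float = field(init=False)

    def __post_init__(self):
        self.expires_at = self.issued_at + self.renew_interval

    def is_valid(self, now):
        return now < self.expires_at

    def renew(self, now):
        """Extend the token, but never past its maximum lifetime."""
        if not self.is_valid(now):
            raise PermissionError("token already expired")
        hard_limit = self.issued_at + self.max_lifetime
        self.expires_at = min(now + self.renew_interval, hard_limit)
        return self.expires_at


if __name__ == "__main__":
    token = ToyToken(issued_at=0)
    print(token.is_valid(25 * HOUR))  # not renewed in time, so it has expired
    token = ToyToken(issued_at=0)
    token.renew(23 * HOUR)
    print(token.is_valid(25 * HOUR))  # renewed, so still valid
```

Once the maximum lifetime is reached, renewal no longer helps and a fresh token is needed, which in turn requires Kerberos credentials. As far as I know, this is why long running applications on YARN are typically started with the `--principal` and `--keytab` options of `spark-submit`, so that Spark can log in again on the application's behalf.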

How about Secure Kafka?

Spark Streaming with Kafka

  • Kafka 0.9 supports some security features
  • Requires the use of a new consumer API (SPARK-12177)
  • Kafka 0.9 does not support delegation tokens! (KAFKA-1696)

Authorization

How can I share my data?

Simplest form of authorization: file permissions.

  • Unix-style user/group/other or ACLs
  • Simple, but high maintenance
      • umask
      • manually change new files
  • Trusted entity (OS kernel) enforces access control
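
The "high maintenance" point can be seen with plain local files. The sketch below, which assumes a POSIX system, creates a file under a restrictive umask and then manually opens it up to the group; the helper names `describe` and `create_with_umask` are hypothetical. HDFS permissions follow a similar user/group/other model, but this example only touches the local filesystem.

```python
import os
import stat
import tempfile


def describe(path):
    """Return the permission bits of a file as an 'rw-r-----' style string."""
    return stat.filemode(os.stat(path).st_mode)[1:]


def create_with_umask(directory, name, umask):
    """Create an empty file under the given umask and return its path."""
    previous = os.umask(umask)
    try:
        path = os.path.join(directory, name)
        # Ask for rw-rw-rw-; the umask strips bits from this request.
        fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o666)
        os.close(fd)
        return path
    finally:
        os.umask(previous)  # always restore the process-wide umask


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        private = create_with_umask(tmp, "accounts.csv", 0o077)
        print(describe(private))  # owner-only access
        os.chmod(private, 0o640)  # manually share read access with the group
        print(describe(private))
```

Every new file needs the same treatment, either through the umask at creation time or through a manual change afterwards.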

More than just FS semantics

  • Not all applications operate on files...
  • Tables, columns, partitions instead of files and directories
  • Trusted service needs to understand the app's semantics

Trusted Service Example: Hive

[Diagram: Client, HiveServer2, HMS, DataNode (x2).]

Untrusted App Example: Spark

[Diagram: User Code, HMS, DataNode (x2).]

Apache Sentry

  • Role-based access control to resources
  • Integrates with HMS / HS2 to control access to data
  • Fine-grained (up to column level) controls

HDFS plugin synchronizes file permissions.

  • Permission to read table = permission to read table's files
  • Permission to create table = permission to write to database's directory
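
The two synchronization rules above can be written down as a small mapping. This is a toy illustration of the idea only, not Sentry's actual code or API: the function `hdfs_permissions`, the grant tuple layout, and the `/warehouse/<database>.db/<table>` path layout are all assumptions made for the example.

```python
WAREHOUSE = "/warehouse"  # hypothetical warehouse root


def hdfs_permissions(grants):
    """Translate table-level grants into path-level access.

    `grants` is a list of (role, privilege, database, table) tuples.
    Returns a dict mapping a path to a set of (role, access) pairs.
    """
    acls = {}
    for role, privilege, database, table in grants:
        db_dir = f"{WAREHOUSE}/{database}.db"
        if privilege == "SELECT":
            # Permission to read a table = permission to read its files.
            acls.setdefault(f"{db_dir}/{table}", set()).add((role, "read"))
        elif privilege == "CREATE":
            # Permission to create a table = permission to write to the
            # database's directory.
            acls.setdefault(db_dir, set()).add((role, "write"))
        else:
            raise ValueError(f"unsupported privilege: {privilege}")
    return acls


if __name__ == "__main__":
    grants = [
        ("analyst", "SELECT", "sales", "accounts"),
        ("etl", "CREATE", "sales", None),
    ]
    for path, access in sorted(hdfs_permissions(grants).items()):
        print(path, sorted(access))
```

The output only ever names files and directories, so anything finer than a whole table cannot be expressed this way.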

Still restricted to FS view of the world!

  • Files, directories, etc.
  • Cannot provide column-level and row-level access control.
  • Whole table or nothing.

Still, it goes a long way in allowing Spark applications to work well with Hive data in a shared, secure environment.

But...

A Simple Example

Assume we had a table accounts:

column_name    column_type
name           string
country        string
balance        int

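Untrusted App Example: Spark

[Diagram: User Code, HMS, HDFS.]

  • User Code to HMS: "Where's table accounts?"
  • HMS: "In path /accounts"
  • User Code to HDFS: "Give me the files in /accounts"
  • HDFS: "Here's the file" (columns: name, country, balance)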

Future: RecordService

A distributed, scalable, data access service for unified authorization in Hadoop.

  • Drop in replacement for Hive InputFormats
  • Integration with Spark SQL Data Sources API
  • Predicate pushdown, projection

RecordService

Users can enforce row- and column-level permissions using views.

name     country    balance
Alice    US         1000
Bob      BR         1500
Eve      US         2000

> create view customers as select name, country from accounts

> create view balances_us as select name, balance from accounts where country = 'US'
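
To see what each view exposes, the example can be reproduced with Python's built-in `sqlite3` module. This is only a sketch of the view semantics: it uses the column names of the accounts table (name, country, balance), SQLite stands in for the real engine, and SQLite itself does not enforce who may query the base table. In the setup described here, that enforcement is the job of the trusted service.

```python
import sqlite3


def build_demo():
    """Create the accounts table and the two restricted views in memory."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table accounts (name text, country text, balance int)")
    conn.executemany(
        "insert into accounts values (?, ?, ?)",
        [("Alice", "US", 1000), ("Bob", "BR", 1500), ("Eve", "US", 2000)],
    )
    # Column-level restriction: balances are hidden.
    conn.execute("create view customers as select name, country from accounts")
    # Row-level restriction: only US accounts are visible.
    conn.execute(
        "create view balances_us as "
        "select name, balance from accounts where country = 'US'"
    )
    return conn


if __name__ == "__main__":
    conn = build_demo()
    print(conn.execute("select * from customers order by name").fetchall())
    print(conn.execute("select * from balances_us order by name").fetchall())
```

A user who is granted access to `customers` but not to `accounts` can list every customer without ever seeing a balance.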

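Untrusted App Example: Spark

[Diagram: User Code, RS Planner, RS Worker.]

  • User Code to RS Planner: "Where's table accounts?"
  • RS Planner: "Sorry, you can't read it."
  • User Code to RS Planner: "Where's table customers?"
  • RS Planner: "In Worker X"
  • User Code to RS Worker: "Give me table customers"
  • RS Worker: "Here's a list of (name, country)"

(Stored columns: name, country, balance. Returned columns: name, country.)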

Takeaways

  • Spark can be made secure today!
  • Builds on top of security features in Hadoop
  • Still work to be done
      • Stronger encryption
      • Easier to use SSL
      • Better integration with Sentry / RecordService

References

  • Encryption: SPARK-6017, SPARK-5682
  • Delegation tokens: SPARK-5342
  • Sentry: http://sentry.apache.org/
  • HDFS synchronization: SENTRY-432
  • RecordService: http://cloudera.github.io/RecordServiceClient/

Thanks!

Questions?
