33
Apache Spark & Apache Zeppelin: Enterprise Security for production deployments Director, Product Management Dec 8, 2016 Twitter: @neomythos Vinay Shukla

Apache Spark & Apache Zeppelin: Security for Enterprise Deployments

Embed Size (px)

Citation preview

Page 1: Apache Spark & Apache Zeppelin: Security for Enterprise Deployments

Apache Spark & Apache Zeppelin: Enterprise Security for production deployments

Director, Product Management Dec 8, 2016Twitter: @neomythos

Vinay Shukla

Page 2: Apache Spark & Apache Zeppelin: Security for Enterprise Deployments

2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Thank You

Page 3: Apache Spark & Apache Zeppelin: Security for Enterprise Deployments

3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

whoami

Product Management Spark for 2.5 + years, Hadoop for 3+ years Blog at www.vinayshukla.com Twitter: @neomythos Addicted to Yoga, Hiking, & Coffee Smallest contributor to Apache Zeppelin

Programmer > Product Management > Programmer > Product Management

Page 4: Apache Spark & Apache Zeppelin: Security for Enterprise Deployments

4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

What are the enterprise security requirements?

Spark user should be authenticated Integrate with corporate LDAP/AD Allow only authorized users access Audit all access Protect data both in motion & at rest Easily manage all security Make security easy to manage …

Page 5: Apache Spark & Apache Zeppelin: Security for Enterprise Deployments

5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Security: Rings of Defense

Perimeter Level Security•Network Security (i.e. Firewalls)

Data Protection•Wire encryption•HDFS TDE/DARE•Others

Authentication•Kerberos•Knox (Other Gateways)

OS Security

Authorization•Apache Ranger/Sentry•HDFS Permissions•HDFS ACLs•YARN ACL

Page 6: Apache Spark & Apache Zeppelin: Security for Enterprise Deployments

6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Interacting with Spark

Ex

Spark on YARN

Zeppelin

Spark-Shell

Ex

Spark Thrift Server

Driver

REST ServerDriver

Driver

Driver

Page 7: Apache Spark & Apache Zeppelin: Security for Enterprise Deployments

7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Context: Spark Deployment Modes

• Spark on YARN–Spark driver (SparkContext) in YARN AM(yarn-cluster)–Spark driver (SparkContext) in local (yarn-client):

• Spark Shell & Spark Thrift Server runs in yarn-client only

Client

Executor

App Master

Spark Driver

Client

Executor

App Master

Spark Driver

YARN-Client YARN-Cluster

Page 8: Apache Spark & Apache Zeppelin: Security for Enterprise Deployments

8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Spark on YARN

Spark Submit

John Doe

Spark AM

1

Hadoop Cluster

HDFS

Executor

YARN RM

4

2 3

Node Manager

Page 9: Apache Spark & Apache Zeppelin: Security for Enterprise Deployments

9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Spark – Security – Four Pillars

Authentication Authorization Audit Encryption

Spark leverages Kerberos on YARNEnsure network is secure

Page 10: Apache Spark & Apache Zeppelin: Security for Enterprise Deployments

10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Authenticate users with AD/LDAP

KDC

Use Spark ST, submit Spark Job

Spark gets Namenode (NN) service ticket

YARN launches Spark Executors using John Doe’s identity

Get service ticket for Spark,

John Doe

Spark AMNN

Executor reads from HDFS using John Doe’s delegation token

kinit

1

2

3

4

5

6

7

Hadoop Cluster

AD/LDAP

Page 11: Apache Spark & Apache Zeppelin: Security for Enterprise Deployments

11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Spark – Kerberos - Example

kinit -kt /etc/security/keytabs/johndoe.keytab [email protected]

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10

Page 12: Apache Spark & Apache Zeppelin: Security for Enterprise Deployments

12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

HDFS

Allow only authorized users access to Spark jobs

YARN Cluster

A B C

KDC

Use Spark ST, submit Spark Job

Get Namenode (NN) service ticket

Executors read from HDFS

Client gets service ticket for Spark

Ranger/Sentry

Can John launch this job?Can John read this file

John Doe

Page 13: Apache Spark & Apache Zeppelin: Security for Enterprise Deployments

13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Secure data in motion: Wire Encryption with Spark

Spark Submit

RM

Shuffle Service

AMDriver

NM

Ex 1 Ex N

Shuffle Data

Control/RPC

ShuffleBlockTransfer

DataSource

Read/Write Data

FS – Broadcast,File Download

Page 14: Apache Spark & Apache Zeppelin: Security for Enterprise Deployments

14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Spark Communication Encryption Settings

Shuffle Data

Control/RPC

ShuffleBlockTransfer

Read/Write Data

FS – Broadcast,File Download

spark.authenticate.enableSaslEncryption= true

spark.authenticate = true. Leverage YARN to distribute keys

Depends on Data Source, For HDFS RPC (RC4 | 3DES) or SSL for WebHDFS

NM > Ex leverages YARN based SSL

spark.ssl.enabled = true

Page 15: Apache Spark & Apache Zeppelin: Security for Enterprise Deployments

15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Sharp Edges with Spark Security SparkSQL – Only coarse grain access control today Client -> Spark Thrift Server > Spark Executors – No identity propagation on 2nd hop

– Lowers security, forces STS to run as Hive user to read all data– Use SparkSQL via shell or programmatic API– https://issues.apache.org/jira/browse/SPARK-5159

Spark Stream + Kafka + Kerberos– No SSL support yet

Spark Shuffle > Only SASL, no SSL support Spark Shuffle > No encryption for spill to disk or intermediate data

Page 16: Apache Spark & Apache Zeppelin: Security for Enterprise Deployments

16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

SparkSQL: Fine grained security

Page 17: Apache Spark & Apache Zeppelin: Security for Enterprise Deployments

17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Key Features: Spark Column Security with LLAP

Fine-Grained Column Level Access Control for SparkSQL.

Fully dynamic policies per user. Doesn’t require views.

Use Standard Ranger policies and tools to control access and masking policies.

Flow:1.SparkSQL gets data locations known as “splits” from HiveServer and plans query.2.HiveServer2 authorizes access using Ranger. Per-user policies like row filtering are applied.3.Spark gets a modified query plan based on dynamic security policy.4.Spark reads data from LLAP. Filtering / masking guaranteed by LLAP server.

HiveServer2

Authorization

Hive MetastoreData Locations

View Definitions

LLAPData Read

Filter Pushdown

Ranger Server

Dynamic Policies

Spark Client

12

4

3

Page 18: Apache Spark & Apache Zeppelin: Security for Enterprise Deployments

18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Example: Per-User Row Filtering by Region in SparkSQL

Spark User 2(East Region)

Spark User 1(West Region)

Original Query:SELECT * from CUSTOMERS

WHERE total_spend > 10000

Query Rewrites based onDynamic Ranger Policies

LLAP Data AccessUser ID Region Total Spend1 East 5,1312 East 27,8283 West 55,4934 West 7,1935 East 18,193

Dynamic Rewrite:SELECT * from CUSTOMERS

WHERE total_spend > 10000AND region = “east”

Dynamic Rewrite:SELECT * from CUSTOMERS

WHERE total_spend > 10000AND region = “west”

Fine grained Security to SparkSQL

http://bit.ly/2bLghGzhttp://bit.ly/2bTX7Pm

Page 19: Apache Spark & Apache Zeppelin: Security for Enterprise Deployments

19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Apache Zeppelin Security

Page 20: Apache Spark & Apache Zeppelin: Security for Enterprise Deployments

20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Apache Zeppelin: Authentication + SSL

Spark on YARNEx Ex

LDAP

John Doe

1

2

3

SSL

Firewall

Page 21: Apache Spark & Apache Zeppelin: Security for Enterprise Deployments

21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Security in Apache Zeppelin?

Zeppelin leverages Apache Shiro for authentication/authorization

Page 22: Apache Spark & Apache Zeppelin: Security for Enterprise Deployments

22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Example Shiro.ini

# =======================# Shiro INI configuration# =======================

[main]## LDAP/AD configuration

[users]# The 'users' section is for simple deployments# when you only need a small number of statically-defined# set of User accounts.

[urls]# The 'urls' section is used for url-based security#

Edit with Ambari or your favorite text editor

Page 23: Apache Spark & Apache Zeppelin: Security for Enterprise Deployments

23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Apache Zeppelin: AD Authentication Configure Zeppelin to use AD

[main]activeDirectoryRealm = org.apache.zeppelin.server.ActiveDirectoryGroupRealm activeDirectoryRealm.systemUsername = XXXXX activeDirectoryRealm.systemPassword = XXXXXXXXXXXXXXXXX activeDirectoryRealm.searchBase = DC=hdpqa,DC=Example,DC=com activeDirectoryRealm.url = ldap://hdpqa.example.com:389 activeDirectoryRealm.principalSuffix = @hdpqa.example.com activeDirectoryRealm.groupRolesMap = "CN=hdpdv_admin,DC=hdpqa,DC=example,DC=com":"admin" activeDirectoryRealm.authorizationCachingEnabled = true sessionManager = org.apache.shiro.web.session.mgt.DefaultWebSessionManager cacheManager = org.apache.shiro.cache.MemoryConstrainedCacheManager securityManager.cacheManager = $cacheManager securityManager.sessionManager = $sessionManager securityManager.sessionManager.globalSessionTimeout = 86400000shiro.loginUrl = /api/login

Page 24: Apache Spark & Apache Zeppelin: Security for Enterprise Deployments

24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Apache Zeppelin: LDAP Authentication Configure Zeppelin to use LDAP

[main]ldapRealm = org.apache.zeppelin.server.LdapGroupRealm ldapRealm = org.apache.shiro.realm.ldap.JndiLdapRealm ldapRealm.contextFactory.environment[ldap.searchBase] = DC=hdpqa,DC=example,DC=com ldapRealm.userDnTemplate = uid={0},OU=Accounts,DC=hdpqa,DC=example,DC=com ldapRealm.contextFactory.url = ldaps://hdpqa.example.com:636 ldapRealm.contextFactory.authenticationMechanism = simplesessionManager = org.apache.shiro.web.session.mgt.DefaultWebSessionManager securityManager.sessionManager = $sessionManager # 86,400,000 milliseconds = 24 hour securityManager.sessionManager.globalSessionTimeout = 86400000 shiro.loginUrl = /api/login

Page 25: Apache Spark & Apache Zeppelin: Security for Enterprise Deployments

25 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Don’t want passwords in clear in shiro.ini? Create an entry for AD credential

–Zeppelin leverages Hadoop Credential API–hadoop credential createactiveDirectoryRealm.systemPassword -provider jceks:///etc/zeppelin/conf/credentials.jceksEnter password: Enter password again: activeDirectoryRealm.systemPassword has been successfully created.org.apache.hadoop.security.alias.JavaKeyStoreProvider has been updated.

Make credentials.jceks only Zeppelin user readable chmod 400 with only Zeppelin process r/w access, no other user allowed access to

Credentials Edit shiro.in

activeDirectoryRealm.systemPassword -provider jceks://etc/zeppelin/conf/credentials.jceks

Page 26: Apache Spark & Apache Zeppelin: Security for Enterprise Deployments

26 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Want to connect to LDAP over SSL? Change protocol to ldaps in shiro.ini

ldapRealm.contextFactory.url = ldaps://hdpqa.example.com:636

If LDAP is using self signed certificate, import the certificate into truststore of JVM running Zeppelin

echo -n | openssl s_client –connect ldap.example.com:389 | \ sed -ne '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p' > /tmp/examplecert.crt

keytool –import -keystore $JAVA_HOME/jre/lib/security/cacerts \

-storepass changeit -noprompt -alias mycert -file /tmp/examplecert.crt

Page 27: Apache Spark & Apache Zeppelin: Security for Enterprise Deployments

27 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Zeppelin + Livy E2E Security

Zeppelin

Spark

Yarn

Livy

Ispark GroupInterpreter

SPNego: Kerberos Kerberos/RPC

Livy APIs

LDAP

John Doe

Job runs as John Doe

LDAP/LDAPS

Page 28: Apache Spark & Apache Zeppelin: Security for Enterprise Deployments

28 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Apache Zeppelin: Authorization

Note level authorization Grant Permissions (Owner, Reader, Writer)

to users/groups on Notes LDAP Group integration

Zeppelin UI Authorization Allow only admins to configure interpreter Configured in shiro.ini

For Spark with Zeppelin > Livy > Spark– Identity Propagation Jobs run as End-User

For Hive with Zeppelin > JDBC interpreter Shell Interpreter

– Runs as end-user

Authorization in Zeppelin Authorization at Data Level

[urls]/api/interpreter/** = authc, roles[admin]

/api/configurations/** = authc, roles[admin]

/api/credential/** = authc, roles[admin]

Page 29: Apache Spark & Apache Zeppelin: Security for Enterprise Deployments

29 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Map admin role to AD Group

Allows mapped AD group access to Configure Interpreters

[main]activeDirectoryRealm = org.apache.zeppelin.server.ActiveDirectoryGroupRealm activeDirectoryRealm.systemUsername = XXXXX activeDirectoryRealm.systemPassword = XXXXXXXXXXXXXXXXX activeDirectoryRealm.searchBase = DC=hdpqa,DC=Example,DC=com activeDirectoryRealm.url = ldap://hdpqa.example.com:389 activeDirectoryRealm.principalSuffix = @hdpqa.example.com activeDirectoryRealm.groupRolesMap = "CN=hdpdv_admin,DC=hdpqa,DC=example,DC=com":"admin" activeDirectoryRealm.authorizationCachingEnabled = true sessionManager = org.apache.shiro.web.session.mgt.DefaultWebSessionManager cacheManager = org.apache.shiro.cache.MemoryConstrainedCacheManager securityManager.cacheManager = $cacheManager securityManager.sessionManager = $sessionManager securityManager.sessionManager.globalSessionTimeout = 86400000

Page 30: Apache Spark & Apache Zeppelin: Security for Enterprise Deployments

30 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

User reports: Can’t see interpreter Page

Zeppelin has URL based access control enabled User does not have the role Or Role incorrectly mapped[main]activeDirectoryRealm = org.apache.zeppelin.server.ActiveDirectoryGroupRealm activeDirectoryRealm.systemUsername = XXXXX activeDirectoryRealm.systemPassword = XXXXXXXXXXXXXXXXX activeDirectoryRealm.searchBase = DC=hdpqa,DC=Example,DC=com activeDirectoryRealm.url = ldap://hdpqa.example.com:389 activeDirectoryRealm.principalSuffix = @hdpqa.example.com activeDirectoryRealm.groupRolesMap = "CN=hdpdv_admin,DC=hdpqa,DC=example,DC=com":"admin" activeDirectoryRealm.authorizationCachingEnabled = true sessionManager = org.apache.shiro.web.session.mgt.DefaultWebSessionManager cacheManager = org.apache.shiro.cache.MemoryConstrainedCacheManager securityManager.cacheManager = $cacheManager securityManager.sessionManager = $sessionManager securityManager.sessionManager.globalSessionTimeout = 86400000

Page 31: Apache Spark & Apache Zeppelin: Security for Enterprise Deployments

31 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

User reports: Livy interpreter fails to run with access error

Ensure Livy has ability to proxy user

Ensure Livy has Impersonation enabled In /etc/livy/conf/livy.conf

livy.impersonation.enabled true

Edit HDFS core-site.xml via Ambari: <property> <name>hadoop.proxyuser.livy_qa.groups</name> <value>*</value> </property> <property> <name>hadoop.proxyuser.livy_qa.hosts</name> <value>*</value> </property>

Page 32: Apache Spark & Apache Zeppelin: Security for Enterprise Deployments

32 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Apache Zeppelin: Credentials

LDAP/AD account Zeppelin leverages Hadoop Credential API

Interpreter Credentials Not solved yet

Credentials in Zeppelin

This is still an open issue

Page 33: Apache Spark & Apache Zeppelin: Security for Enterprise Deployments

33 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Thank You

Vinay Shukla @neomythos