Hadoop, Hive, Spark and Object Stores


Hadoop, Spark, Hive & Cloud Storage

Hadoop, Hive, Spark and Object Stores
Steve Loughran
stevel@hortonworks.com   @steveloughran
November 2016

© Hortonworks Inc. 2011-2016. All Rights Reserved

Steve Loughran, Hadoop committer, PMC member, ASF Member
Chris Nauroth, Apache Hadoop committer & PMC; ASF member
Rajesh Balamohan, Tez Committer, PMC Member

Now people may be saying "hang on, these aren't Spark developers". Well, I do have some integration patches for Spark, but a lot of the integration problems are actually lower down:
- filesystem connectors
- ORC performance
- Hive metastore

Rajesh has been doing lots of scale runs and profiling, initially for Hive/Tez, now looking at Spark, including some of the Parquet problems. Chris has done work on HDFS, Azure WASB and most recently S3A. Me? Co-author of the Swift connector, author of the Hadoop FS spec and general mentor of the S3A work, even when not actively working on it. Been full time on S3A, using Spark as the integration test suite, since March.

Make Apache Hadoop at home in the cloud
Step 1: Hadoop runs great on Azure
Step 2: Beat EMR on EC2

Simple goal: make ASF Hadoop at home in cloud infrastructure. It's always been a bit of a mixed bag, and there's a lot around agility we need to address: things fail differently.

Step 1: Azure. That's the work with Microsoft on wasb://; you can use Azure as a drop-in replacement for HDFS in Azure.
Step 2: EMR. More specifically, have the ASF Hadoop codebase get higher numbers than EMR.
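A minimal sketch of what "drop-in replacement" looks like in configuration terms. The container ("demo") and storage account ("example") are the same placeholders as in the streaming example later in the deck; in a real cluster these settings live in core-site.xml and the key goes into a credential provider.

import org.apache.hadoop.conf.Configuration

// Sketch: point the cluster's default filesystem at a WASB container.
// Container and account names here are placeholders.
val conf = new Configuration()
conf.set("fs.defaultFS", "wasb://demo@example.blob.core.windows.net")

// Storage account credentials (keep these in a credential provider,
// not in plain configuration files).
conf.set("fs.azure.account.key.example.blob.core.windows.net",
  "<storage account key>")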

[Diagram: Elastic ETL — inbound data sources, a cluster with transient HDFS, external ORC datasets in the object store]

This is one of the simplest deployments in cloud: scheduled/dynamic ETL. Incoming data sources save to an object store; a Spark cluster is brought up for the ETL, either direct cleanup/filter or multistep operations, but either way an ETL pipeline. HDFS on the VMs provides transient storage; the object store is the destination for data now in a more efficient format such as ORC or Parquet.

[Diagram: Notebooks — external notebooks and a library talking to a cluster, with ORC and Parquet datasets in the object store]

Notebooks on demand: the notebook talks to Spark in the cloud, which then does the work against external and internal data.

Your notebook itself can be saved to the object store, for persistence and sharing.

Streaming

Example: streaming on Azure

A Filesystem: Directories, Files, Data
[Diagram: a /work directory tree with pending/part-00 and pending/part-01 holding blocks of data; rename("/work/pending/part-01", "/work/complete") moves part-01 into the complete directory]


Object Store: hash(name) -> blob
[Diagram: objects spread across servers s01-s04
 hash("/work/pending/part-00") -> ["s01", "s02", "s04"]
 hash("/work/pending/part-01") -> ["s02", "s03", "s04"]
 "rename" = copy("/work/pending/part-01", "/work/complete/part-01") then delete("/work/pending/part-01")]
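Through the Hadoop FileSystem API the difference shows up in rename(). A minimal sketch, assuming an illustrative s3a:// bucket: what is an O(1) metadata operation on HDFS becomes a copy of every byte plus a delete against an object store.

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative bucket; any s3a:// URI behaves the same way.
val fs = FileSystem.get(new URI("s3a://example-bucket/"), new Configuration())

// HDFS: one metadata operation in the NameNode.
// Object store: the connector must COPY the blob(s) to the new key(s)
// and then DELETE the originals, so the cost scales with the data size.
fs.rename(
  new Path("/work/pending/part-01"),
  new Path("/work/complete/part-01"))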


REST APIs
HEAD /work/complete/part-01
PUT /work/complete/part-01 (x-amz-copy-source: /work/pending/part-01)
DELETE /work/pending/part-01
PUT /work/pending/part-01 ... DATA ...
GET /work/pending/part-01 (Content-Length: 1-8192)
GET /?prefix=/work&delimiter=/


Often Eventually Consistent
DELETE /work/pending/part-00
HEAD /work/pending/part-00   -> 200
GET /work/pending/part-00    -> 200
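What that can mean to client code, as a hedged sketch (behaviour depends on the store and is not guaranteed to reproduce): metadata calls made immediately after a change may briefly see the old state.

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new URI("s3a://example-bucket/"), new Configuration())
val part = new Path("/work/pending/part-00")

// The DELETE itself succeeds...
fs.delete(part, false)

// ...but a HEAD/GET issued straight afterwards may still return 200
// until the change propagates, so exists() can briefly say true and
// directory listings can still show the entry.
val stillVisible = fs.exists(part)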


org.apache.hadoop.fs.FileSystem

hdfs

s3a

wasb

adl

swift

gs

Same API

Everything uses the Hadoop FS API to talk to HDFS, Hadoop-compatible filesystems and object stores. There are actually two APIs: the one with a clean split between client side and "driver side", and the older one which is a direct connection. Most code uses the latter and, in terms of opportunities for object store integration tweaking, it is the one where we can innovate most easily. That is: there's nothing in the way.

Under the FS API go filesystems and object stores.

HDFS is "real" filesystem; WASB/Azure close enough. What is "real?". Best test: can support HBase.

Just a different URL to read

val csvdata = spark.read.options(Map(
    "header" -> "true",
    "inferSchema" -> "true",
    "mode" -> "FAILFAST"))
  .csv("s3a://landsat-pds/scene_list.gz")

Writing looks the same

val p = "s3a://hwdev-stevel-demo/landsat"csvData.write.parquet(p)

val o = "s3a://hwdev-stevel-demo/landsatOrc"csvData.write.orc(o)

You used to have to disable summary data in the Spark context's Hadoop options, but https://issues.apache.org/jira/browse/SPARK-15719 fixed that for you.
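For reference, the old workaround looked roughly like this; the property is the standard Parquet job-summary switch, set through the Spark context's Hadoop configuration.

// Only needed before SPARK-15719: writing the _metadata /
// _common_metadata summary files cost extra object store requests,
// so they were switched off by hand.
spark.sparkContext.hadoopConfiguration
  .set("parquet.enable.summary-metadata", "false")

csvData.write.parquet("s3a://hwdev-stevel-demo/landsat")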

Hive

CREATE EXTERNAL TABLE `scene`(
  `entityid` string,
  `acquisitiondate` timestamp,
  `cloudcover` double,
  `processinglevel` string,
  `path` int,
  `row_id` int,
  `min_lat` double,
  `min_long` double,
  `max_lat` double,
  `max_lon` double,
  `download_url` string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 's3a://hwdev-rajesh-new2/scene_list'
TBLPROPERTIES ('skip.header.line.count'='1');

(needed to copy file to R/W object store first)


> select entityID from scene where cloudCover < 0 limit 10;

+------------------------+
|        entityid        |
+------------------------+
| LT81402112015001LGN00  |
| LT81152012015002LGN00  |
| LT81152022015002LGN00  |
| LT81152032015002LGN00  |
| LT81152042015002LGN00  |
| LT81152052015002LGN00  |
| LT81152062015002LGN00  |
| LT81152072015002LGN00  |
| LT81162012015009LGN00  |
| LT81162052015009LGN00  |
+------------------------+

Spark Streaming on Azure Storage

val streamc = new StreamingContext(sparkConf, Seconds(10))
val azure = "wasb://demo@example.blob.core.windows.net/in"
val lines = streamc.textFileStream(azure)
val matches = lines.map(line => {
  println(line)
  line
})
matches.print()
streamc.start()


Where did those object store clients come from?

[Timeline 2006-2017:
 s3://    inode on S3
 s3://    Amazon EMR S3
 s3n://   native S3
 swift:// OpenStack
 wasb://  Azure WASB
 s3a://   replaces s3n — Phase I: Stabilize; Phase II: Speed & Scale; Phase III: Speed & Consistency
 oss://   Aliyun
 gs://    Google Cloud
 adl://   Azure Data Lake]

This is the history.

Problem: S3 work is too slow
- Analyze benchmarks and bug reports
- Fix read path
- Fix write path
- Improve query partitioning
- The Commitment Problem


[Flame graph: LLAP (single node) on AWS, TPC-DS queries at 200 GB scale — time spent in getFileStatus(), read() and readFully(pos)]

Here's a flame graph of LLAP (single node) with AWS + HDC for a set of TPC-DS queries at 200 GB scale; we should stick this up online.

Only about 2% of the time (in optimised code) is spent doing S3 IO; something at the start is partitioning data.

The Performance Killers

getFileStatus(Path) (+ isDirectory(), exists())

HEAD path         // file?
HEAD path + "/"   // empty directory?
LIST path         // path with children?

read(long pos, byte[] b, int idx, int len)

readFully(long pos, byte[] b, int idx, int len)
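A sketch of why this adds up in application code (bucket and path are illustrative): every probe below is a separate getFileStatus() under the covers, potentially HEAD + HEAD + LIST each time, whereas one call answers all the questions.

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new URI("s3a://example-bucket/"), new Configuration())
val p = new Path("/datasets/orc/part-00000")

// Three probes, three rounds of metadata requests.
if (fs.exists(p) && !fs.isDirectory(p) && fs.getFileStatus(p).getLen > 0) {
  // ... open and read ...
}

// Cheaper: one getFileStatus(), reuse the result
// (it throws FileNotFoundException if the path is absent).
val status = fs.getFileStatus(p)
if (!status.isDirectory && status.getLen > 0) {
  // ... open and read ...
}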

Positioned reads: close + GET, close + GET

read(long pos, byte[] b, int idx, int len) throws IOException {
  long oldPos = getPos();
  int nread = -1;
  try {
    seek(pos);
    nread = read(b, idx, len);
  } catch (EOFException e) {
  } finally {
    seek(oldPos);
  }
  return nread;
}

seek() is the killer, especially the seek() back.

Why so much seeking? It's the default implementation of read.

HADOOP-12444 Support lazy seek in S3AInputStream

public synchronized void seek(long targetPos) throws IOException {
  nextReadPos = targetPos;
}

+ configurable readahead before open/close()

fs.s3a.readahead.range 256K

But: ORC reads were still underperforming
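Setting that option from a Spark job is a one-liner; a sketch (the value is given in bytes, the 256K quoted above):

// A larger readahead range keeps the current HTTPS request open across
// small forward seeks instead of aborting it and issuing a new GET.
spark.sparkContext.hadoopConfiguration
  .set("fs.s3a.readahead.range", "262144")  // 256K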


HADOOP-13203: fs.s3a.experimental.input.fadvise

// Before
GetObjectRequest req = new GetObjectRequest(bucket, key)
    .withRange(pos, contentLength - 1);

// After
finish = calculateRequestLimit(inputPolicy, pos, length,
    contentLength, readahead);

GetObjectRequest req = new GetObjectRequest(bucket, key)
    .withRange(pos, finish);

bad for full file reads
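A sketch of opting in for column-oriented input (ORC, Parquet) while leaving whole-file formats on the default policy; the property name is the one from HADOOP-13203 above.

// "random" favours positioned reads of column stripes/pages;
// the default policy remains better for streaming a file end to end
// (gzip, CSV), which is why this is not switched on globally.
spark.sparkContext.hadoopConfiguration
  .set("fs.s3a.experimental.input.fadvise", "random")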

A big killer turned out to be that if we had to break and re-open the connection on a large file, this was done by closing the TCP connection and opening a new one.

The fix: ask for data in smaller blocks, the max of (requested-length, min-request-len). Result: significantly lower cost for backward seeks and very-long-distance forward seeks, at the expense of an increased cost for end-to-end reads of a file (gzip, CSV).

It's an experimental option for this reason; I