AWS CLOUD 2017 - Fast Data Querying and Processing with Amazon Athena and Glue (김상필, Solutions Architect)

Fast Data Querying and Processing with Amazon Athena and Glue

김상필, Solutions Architect

Agenda

• Introduction to Amazon Athena, a serverless interactive query service

• Introduction to AWS Glue, a fully managed ETL service


AWS big data analytics architecture

[Figure: pipeline from Data to Answers & insights through the stages Ingest/Collect, Store, Process/Analyze, and Consume/Visualize]

[Figure: AWS services across the Collect, Store, and Analyze stages: AWS Data Pipeline, AWS Database Migration Service, Amazon Kinesis, Direct Connect, AWS IoT, AWS Snowball, Amazon S3, Amazon Glacier, DynamoDB, EMR, Amazon Machine Learning, Amazon Redshift, QuickSight, Amazon Athena, EC2, Amazon Elasticsearch Service, Lambda, AWS Glue]

Introduction to Amazon Athena

Existing challenges

• Significant amount of work required to analyze data in Amazon S3

• Users often only have access to aggregated data sets

• Managing a Hadoop cluster or data warehouse requires expertise

What is Amazon Athena?

Amazon Athena is an interactive query service that makes it easy to analyze data directly from Amazon S3 using standard SQL.

Amazon Athena features

Serverless

• No infrastructure or administration

• Zero spin-up time

• Transparent upgrades

Highly available

• Connect to a service endpoint or log into the console

• Uses warm compute pools across multiple AZs

• Your data is in Amazon S3

Easy to use

• Log into the console

• Create a table
  • Type in a Hive DDL statement
  • Use the console Add Table wizard

• Start querying

Query data directly in Amazon S3

• No loading of data

• Query data in its raw format
  • Text, CSV, JSON, weblogs, AWS service logs
  • Convert to an optimized form like ORC or Parquet for the best performance and lowest cost

• No ETL required

• Stream data directly from Amazon S3

• Take advantage of Amazon S3 durability and availability

Use ANSI SQL

• Start writing ANSI SQL

• Support for complex joins, nested queries & window functions

• Support for complex data types (arrays, structs)

• Support for partitioning of data by any key
  • Date, time, or custom keys
  • e.g., Year, Month, Day, Hour or Customer Key, Date
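To make the partitioning bullet concrete, here is a minimal sketch of how data partitioned by Year, Month, and Day might be laid out in Amazon S3 using Hive-style `key=value` prefixes. The bucket name, the `logs` prefix, and the partition key names are hypothetical and chosen only for illustration.

```python
from datetime import date


def partition_prefix(base, day, customer_key=None):
    """Build a Hive-style S3 prefix such as .../year=2017/month=03/day=05/."""
    parts = [
        f"year={day.year:04d}",
        f"month={day.month:02d}",
        f"day={day.day:02d}",
    ]
    # Optional custom key, mirroring the "Customer Key, Date" example.
    if customer_key is not None:
        parts.append(f"customer_key={customer_key}")
    return base.rstrip("/") + "/" + "/".join(parts) + "/"


# Hypothetical bucket and date, for illustration only.
prefix = partition_prefix("s3://example-bucket/logs", date(2017, 3, 5))
print(prefix)
```

A query that filters on the partition keys only needs to read objects under the matching prefixes, which is why partitioning reduces the amount of data scanned.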

Built on familiar technologies

Presto

• Used for SQL queries

• In-memory distributed query engine

• ANSI-SQL compatible with extensions

Apache Hive

• Used for DDL functionality

• Complex data types

• Multitude of formats

• Supports data partitioning

Data formats supported by Amazon Athena

• Text files, e.g., CSV, raw logs

• Apache Web Logs, TSV files

• JSON (simple, nested)

• Compressed files

• Columnar formats such as Apache Parquet & Apache ORC

• AVRO support – coming soon

Amazon Athena is fast

• Tuned for performance

• Automatically parallelizes queries

• Results are streamed to console

• Results also stored in S3

• Improve Query performance

• Compress your data

• Use columnar formats

Amazon Athena is cost-effective

• Pay per query

• $5 per TB scanned from S3

• DDL Queries and failed queries are free

• Save by using compression, columnar formats, partitions

Example data analytics pipeline

[Figure: pipeline diagram with ad-hoc access to raw data using SQL]

[Figure: pipeline diagram with ad-hoc access to data using Athena. Athena can query aggregated datasets as well.]

Solving the existing challenges

• Significant amount of work required to analyze data in Amazon S3

• No ETL required. No loading of data. Query data where it lives

• Users often only have access to aggregated data sets

• Query data at whatever granularity you want

• Managing a Hadoop cluster or data warehouse requires expertise

• No infrastructure to manage

Accessing Amazon Athena

Simple query editor with key bindings

Autocomplete functionality

Catalog

Tables and columns

Can also see a detailed view in the catalog tab

You can also check the properties. Note the location.

JDBC driver support

QuickSight allows you to connect to data from a wide variety of AWS, third-party, and on-premises sources including Amazon Athena

Amazon RDS

Amazon S3

Amazon Redshift

Amazon Athena

Athena access through Amazon QuickSight

Creating tables and querying data

Creating tables

• Create table statements (DDL) are written in Hive
  • High degree of flexibility
  • Schema on read

• Hive is SQL-like but allows other concepts such as “external tables” and partitioning of data

• Data formats supported: JSON, TXT, CSV, TSV, Parquet, and ORC (via SerDes)

• Data is stored in Amazon S3

• Metadata is stored in a metadata store
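As an illustration of the Hive DDL described above, the following sketch assembles a `CREATE EXTERNAL TABLE` statement for delimited text data in Amazon S3. The helper name `create_table_ddl`, the database, table, column names, and bucket path are all hypothetical; the statement is only built as a string here, not submitted to Athena.

```python
def create_table_ddl(database, table, columns, location,
                     partitions=None, delimiter=","):
    """Return a Hive-style CREATE EXTERNAL TABLE statement as a string.

    columns and partitions are lists of (name, hive_type) pairs.
    """
    cols = ",\n  ".join(f"`{name}` {ctype}" for name, ctype in columns)
    ddl = f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n  {cols}\n)"
    if partitions:
        parts = ", ".join(f"`{name}` {ctype}" for name, ctype in partitions)
        ddl += f"\nPARTITIONED BY ({parts})"
    # Schema on read: the table only points at the files in S3.
    ddl += (
        f"\nROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'"
        f"\nLOCATION '{location}'"
    )
    return ddl


# Hypothetical names, for illustration only.
ddl = create_table_ddl(
    database="weblogs",
    table="requests",
    columns=[("request_time", "string"), ("url", "string"), ("status", "int")],
    partitions=[("year", "string"), ("month", "string")],
    location="s3://example-bucket/logs/",
)
print(ddl)
```

Because the table is external, dropping it removes only the metadata; the data in Amazon S3 is left in place.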

Athena's internal metadata store

• Stores metadata
  • Table definitions, column names, partitions

• Highly available and durable

• Requires no management

• Access via DDL statements

• Similar to a Hive Metastore

Running a simple query

Run time and data scanned

Apache Parquet and Apache ORC: columnar formats

Parquet

• Columnar format

• Schema segregated into footer

• Column major format

• All data is pushed to the leaf

• Integrated compression and indexes

• Support for predicate pushdown

ORC

• Apache top-level project

• Schema segregated into footer

• Column major with stripes

• Integrated compression, indexes, and stats

• Support for predicate pushdown
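The benefit of a column-major layout can be sketched with a toy model. The example below is not how Parquet or ORC are implemented; it only compares how many characters must be read to answer a query on one column when records are stored row by row versus column by column. The sample records and function names are made up for illustration.

```python
# Toy log records; the field names and values are made up.
rows = [
    {"ts": "2017-01-01T00:00:00", "url": "/index.html", "status": 200},
    {"ts": "2017-01-01T00:00:01", "url": "/missing", "status": 404},
    {"ts": "2017-01-01T00:00:02", "url": "/about.html", "status": 200},
]


def row_major_scanned(rows, wanted):
    """Row layout: every field of every row is read, even unwanted ones."""
    return sum(len(str(value)) for row in rows for value in row.values())


def column_major_scanned(rows, wanted):
    """Column layout: only the requested columns are read."""
    columns = {name: [row[name] for row in rows] for name in rows[0]}
    return sum(len(str(value)) for name in wanted for value in columns[name])


row_cost = row_major_scanned(rows, ["status"])
col_cost = column_major_scanned(rows, ["status"])
print(row_cost, col_cost)
```

Since Athena charges by data scanned, reading only the needed columns lowers both run time and cost.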

Cost per query: $5 per TB scanned

• Pay by the amount of data scanned per query

• Ways to save costs
  • Compress
  • Convert to columnar format
  • Use partitioning

• Free: DDL queries, failed queries

Dataset                                 Size on Amazon S3       Query run time   Data scanned            Cost
Logs stored as text files               1 TB                    237 seconds      1.15 TB                 $5.75
Logs stored in Apache Parquet format*   130 GB                  5.13 seconds     2.69 GB                 $0.013
Savings                                 87% less with Parquet   34x faster       99% less data scanned   99.7% cheaper
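The cost column in the table follows directly from the $5 per TB price. A minimal sketch of the arithmetic, assuming 1 TB = 1024 GB and ignoring any per-query minimum or rounding that the service may apply:

```python
PRICE_PER_TB_USD = 5.0  # price stated on the slide


def query_cost_usd(tb_scanned):
    """Cost of one query, given the amount of data scanned in TB."""
    return tb_scanned * PRICE_PER_TB_USD


# Figures from the table above.
text_cost = query_cost_usd(1.15)            # logs stored as text files
parquet_cost = query_cost_usd(2.69 / 1024)  # 2.69 GB scanned with Parquet
savings = 1 - parquet_cost / text_cost

print(f"text: ${text_cost:.2f}, parquet: ${parquet_cost:.3f}")
```

Rounded to the precision shown in the table, these match the $5.75 and $0.013 entries.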

Athena complements Amazon Redshift and Amazon EMR

[Figure: Amazon S3, EMR, Athena, Redshift, and QuickSight]

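Fully managed ETL service: AWS Glue

AWS has many ETL partners…

[Figure: partner logos, including Fivetran]

… but in practice, hand-written code is used more than tools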

ETL is the most time-consuming part of analytics

[Figure: analytics workflow of ETL, Data Warehousing (Amazon Redshift), and Business Intelligence (Amazon QuickSight); 70% of time is spent on ETL]

Resulting in a data gap

[Chart: data volume over time, 1990 to 2020, comparing generated data with data available for analysis; the difference between them is labeled “The Data Gap”]

Glue automates ETL work

✓ Cataloging data sources

✓ Identifying data formats and data types

✓ Generating Extract, Transform, Load code

✓ Executing ETL jobs; managing dependencies

✓ Handling errors

✓ Managing and scaling resources

AWS Glue components

Data Catalog

• Hive metastore-compatible metadata repository of data sources.

• Crawls data sources to infer table, data type, and partition format.

Job Execution

• Runs jobs in Spark containers, with automatic scaling based on SLA.

• Serverless: only pay for the resources you consume.

Job Authoring

• Generates Python code to move data from source to destination.

• Edit with your favorite IDE; share code snippets using Git.

Glue Data Catalog: discover and organize your data sets

Manage table metadata through a Hive metastore API or Hive SQL. Supported by tools such as Hive, Presto, Spark, etc.

We added a few extensions:

• Search metadata for data discovery

• Connection info: JDBC URLs, credentials

• Classification for identifying and parsing files

• Versioning of table metadata as schemas evolve and other metadata are updated

Populate using Hive DDL, bulk import, or automatically through crawlers.

Crawlers: automatic population of the Data Catalog

Automatic schema inference:

• Built-in classifiers detect file type and extract schema: record structure and data types.

• Add your own or share with others in the Glue community. It's all Grok and Python.

Auto-detects Hive-style partitions, grouping similar files into one table.

Run crawlers on a schedule to discover new data and schema changes.

Serverless: only pay when crawls run.
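The idea of detecting Hive-style partitions and grouping similar files into one table can be sketched in a few lines. This is not the crawler's actual algorithm, only an illustration that treats every `key=value` path segment as a partition and groups objects by the remaining path. The object keys and function names are hypothetical.

```python
def parse_partitions(key):
    """Extract Hive-style key=value partitions from an object key."""
    partitions = {}
    for segment in key.split("/")[:-1]:  # the last segment is the file name
        if "=" in segment:
            name, _, value = segment.partition("=")
            partitions[name] = value
    return partitions


def group_by_table(keys):
    """Group object keys into tables by their non-partition path."""
    tables = {}
    for key in keys:
        segments = key.split("/")[:-1]
        root = "/".join(s for s in segments if "=" not in s)
        tables.setdefault(root, []).append(parse_partitions(key))
    return tables


# Hypothetical object keys, for illustration only.
keys = [
    "logs/year=2017/month=02/part-0000.csv",
    "logs/year=2017/month=03/part-0000.csv",
    "sales/year=2017/part-0000.csv",
]
tables = group_by_table(keys)
print(tables)
```

Here the two `logs` objects end up in one table with two partitions, while `sales` becomes a separate table.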

Job authoring in Glue

Make ETL job authoring like code development, using your own tools.

Automatic code generation

1. Pick sources and targets from the data catalog

2. Glue generates transformation graph and Python code

3. Specify trigger condition

[Figure: example job graph. Trigger: every Friday at 3PM GMT. Source table @ Amazon S3, Transform: Relationalize, Transform: Filter table, two target tables @ Amazon Redshift]
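The job graph above chains a Relationalize transform and a filter. The sketch below is plain Python, not Glue-generated code or the Glue API; it only illustrates the general idea of flattening nested records into flat, table-like rows and then filtering them. It handles nested structs only (not arrays), and the record fields are made up.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into a single-level dict with dotted names."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat


def filter_rows(rows, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


# Hypothetical semi-structured source records.
source = [
    {"id": 1, "user": {"name": "a", "geo": {"country": "KR"}}},
    {"id": 2, "user": {"name": "b", "geo": {"country": "US"}}},
]
flat_rows = [flatten(record) for record in source]
kr_rows = filter_rows(flat_rows, lambda row: row["user.geo.country"] == "KR")
print(kr_rows)
```

The flat rows have a fixed set of columns, which is the shape a relational target such as Amazon Redshift expects.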

Flexibility of Glue ETL scripts

• Human-readable code run on a scalable platform, PySpark

• Forgiving in the face of failures: handles bad data and crashes

• Flexible: handles complex semi-structured data, and adapts to source schema changes

Git integration

Glue integrates job authoring and execution with your preferred Git services.

Push job code to your Git repository; Glue automatically pulls the latest version on job invocation.

Customize ETL jobs in your favorite IDE, with no need to learn new tools.

No need to start from scratch.

[Figure: AWS CodeCommit]

Orchestration & resource management

Fully managed, serverless job execution

Compose jobs globally with event-based dependencies

• Easy to reuse and leverage work across organization boundaries

Multiple triggering mechanisms

• Schedule-based: e.g., time of day

• Event-based: e.g., data availability, job completion

• External sources: e.g., AWS Lambda

Job composition and triggers

[Figure: example job graph. Jobs: “Marketing: Ad-spend by customer segment”, “Sales: Revenue by customer segment”, and “Central: ROI by customer segment”. Inputs: ad-click logs and weekly sales. Trigger labels: “Data based” (including “>10 MB new”) and “Schedule”.]
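The two kinds of trigger condition shown in these slides, a schedule ("Every Friday at 3PM GMT") and a data-based threshold (">10 MB new"), can be expressed as simple predicates. This is a sketch of the conditions only, not Glue's trigger API; the function names are hypothetical and 10 MB is assumed to mean 10 × 1024 × 1024 bytes.

```python
from datetime import datetime, timezone


def schedule_due(now, weekday=4, hour=15):
    """True during the 15:00 UTC hour on Fridays (Monday is weekday 0)."""
    now = now.astimezone(timezone.utc)
    return now.weekday() == weekday and now.hour == hour


def data_trigger_due(new_bytes, threshold_bytes=10 * 1024 * 1024):
    """True once more than the threshold of new data has arrived."""
    return new_bytes > threshold_bytes


friday_3pm = datetime(2017, 3, 3, 15, 0, tzinfo=timezone.utc)    # a Friday
saturday_3pm = datetime(2017, 3, 4, 15, 0, tzinfo=timezone.utc)  # a Saturday

print(schedule_due(friday_3pm), schedule_due(saturday_3pm))
print(data_trigger_due(12 * 1024 * 1024), data_trigger_due(1024))
```

An orchestrator would evaluate conditions like these and start the dependent job when one becomes true.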

Dynamic orchestration

Example: Dynamic number of jobs based on application type and number of message types

[Figure: click logs from Application #1 (3 different message types), Application #2 (5 different message types), and Application #3 (4 different message types) are each split by message type, followed by one “summarize message type” job per type]

• Add jobs dynamically as the graph unfolds, which makes data-dependent orchestration possible

• Glue provides fault-tolerant orchestration: retries on job failure

• Monitoring and metrics: job run history and event tracking for debugging
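The example above, where the number of jobs depends on the data, together with retries on failure, can be sketched as follows. This is an illustration of the concept rather than Glue's orchestration engine; the job naming scheme, the retry limit of three attempts, and the function names are assumptions.

```python
def plan_jobs(message_type_counts):
    """One split job per application, plus one summarize job per message type."""
    jobs = []
    for app, count in message_type_counts.items():
        jobs.append(f"split:{app}")
        jobs.extend(f"summarize:{app}:type-{i}" for i in range(1, count + 1))
    return jobs


def run_with_retries(job, max_attempts=3):
    """Run job(), retrying on failure; return (result, attempts used)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job(), attempt
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the last attempt


# Message type counts taken from the example on the slide.
jobs = plan_jobs({"app1": 3, "app2": 5, "app3": 4})

# A made-up job that fails twice before succeeding.
calls = []


def flaky_job():
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("transient failure")
    return "ok"


result, attempts = run_with_retries(flaky_job)
print(len(jobs), result, attempts)
```

The job list is computed from the input rather than fixed in advance, which is what makes the orchestration data dependent.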

Serverless job execution

• Warm pools: pre-configured fleets of instances to reduce job startup time

• Auto-configure VPC and role-based access

• Automatically scale resources to meet SLA and cost objectives

• You pay only for the resources you consume, while you consume them.

There is no need to provision, configure, or manage servers.

[Figure: warm pool of instances serving two customer VPCs]

Sign up for the Glue preview

So that's the basics of what we are doing.

You can sign up for a preview at aws.amazon.com/glue.

We should start adding people soon.

Thank you
