Data Integration with Embulk

Data Science Weekend 2016, Yogyakarta

Teguh Nugraha

Page 2: Data integration with embulk

Multi Data Formats and Storages

MySQL

PostgreSQL

MongoDB

CSV files

BigQuery

Redshift

HDFS

Google Analytics

Mixpanel

Zendesk

Elasticsearch

Page 3: Data integration with embulk

Multi Data Sources

User data in MySQL

Offline data in CSV

Traffic data in Google Analytics

Log data

Bulk Data Loading: Load data from A to B

Page 4: Data integration with embulk

Problems

Parsing files

Error handling

Idempotent Retrying

Performance

Scalability

Format compatibility

Page 5: Data integration with embulk

Solution

Reliable framework with parallel execution, data validation, error recovery, auto guessing, resuming, and extensive plugins

github.com/embulk/embulk

Page 6: Data integration with embulk

Embulk: Bulk Data Loader

Page 7: Data integration with embulk

Plugins by Category

•Input plugins

•Output plugins

•Filter plugins

•File parser plugins

•File decoder plugins

•File formatter plugins

•File encoder plugins

•Executor plugins

Page 8: Data integration with embulk

Getting Started

1. Embulk requires Java

2. Download Embulk: http://dl.embulk.org/embulk-latest.jar

3. Make it executable, then check that it runs:
$ embulk --version

4. Run an example:
$ embulk example

Page 9: Data integration with embulk

Installing Embulk Plugin

$ embulk gem install embulk-input-mysql

$ embulk gem install embulk-output-postgresql

List of plugins:

https://embulk.org/plugins

Page 10: Data integration with embulk

Embulk Configuration File

Page 11: Data integration with embulk

Embulk Configuration File (YAML)

in: Input plugin options.
◦ parser: If the input is file-based, the parser plugin parses the file format (built-in csv, json, etc.).
◦ decoder: If the input is file-based, the decoder plugin decodes compression or encryption (built-in gzip, bzip2, zip, tar.gz, etc.).

out: Output plugin options.
◦ formatter: If the output is file-based, the formatter plugin writes the file format (such as built-in csv or JSON).
◦ encoder: If the output is file-based, the encoder plugin applies compression or encryption (such as built-in gzip or bzip2).

filters: Filter plugin options (optional).

exec: Executor plugin options. An executor plugin controls parallel processing (such as the built-in thread executor or the Hadoop MapReduce executor).
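
To show how these sections fit together, here is a minimal sketch of a configuration that reads gzip-compressed CSV files and prints the records to standard output. The path and the column names are hypothetical. In an actual configuration file the decoder section is written as the list `decoders:` (and likewise `encoders:` on the output side).

```yaml
# Sketch only: path_prefix and the column definitions are hypothetical.
in:
  type: file                     # built-in file input plugin
  path_prefix: ./csv/sample_     # prefix of the files to load
  decoders:
  - {type: gzip}                 # decompress each file before parsing
  parser:
    type: csv                    # built-in CSV parser
    charset: UTF-8
    delimiter: ','
    skip_header_lines: 1
    columns:
    - {name: id, type: long}
    - {name: name, type: string}
    - {name: created_at, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
exec:
  max_threads: 4                 # optional: cap on parallel threads
out:
  type: stdout                   # print records instead of loading them
```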

Page 12: Data integration with embulk

Using Guess Command

The guess command guesses parser and decoder options

$ embulk guess seed.yml -o config.yml
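
The seed file only needs the parts that cannot be inferred from the data. A minimal sketch of seed.yml, assuming CSV files under a hypothetical ./csv directory:

```yaml
# seed.yml (sketch): the path is hypothetical
in:
  type: file
  path_prefix: ./csv/sample_
out:
  type: stdout
```

Running the guess command on this file writes config.yml, with the guessed parser and decoder options added under the in section.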

Page 13: Data integration with embulk

Using guess command

Page 14: Data integration with embulk

Previewing and Running

$ embulk preview config.yml

$ embulk run config.yml

Set up a cron schedule

Page 15: Data integration with embulk

embulk-input-mysql

https://github.com/embulk/embulk-input-jdbc/tree/master/embulk-input-mysql

$ embulk gem install embulk-input-mysql
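
Once the plugin is installed, a MySQL table can be used as the input. This is a minimal sketch; the host, credentials, database, table, and column names are all hypothetical, and the output is set to stdout only to keep the example short.

```yaml
# Sketch only: connection details and names below are hypothetical.
in:
  type: mysql
  host: localhost
  port: 3306
  user: embulk_user
  password: "secret"
  database: app_db
  table: users
  select: "id, name, created_at"   # columns to read
out:
  type: stdout
```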

Page 16: Data integration with embulk

embulk-output-postgresql

https://github.com/embulk/embulk-output-jdbc/tree/master/embulk-output-postgresql

$ embulk gem install embulk-output-postgresql
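
Once the plugin is installed, PostgreSQL can be used as the output. The fragment below is a sketch of the out section only; it has to be paired with an in section. The host, credentials, database, and table names are hypothetical.

```yaml
# Sketch only: connection details and names below are hypothetical.
out:
  type: postgresql
  host: localhost
  port: 5432
  user: embulk_user
  password: "secret"
  database: analytics
  table: users
  mode: insert                     # append rows to the target table
```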

Page 17: Data integration with embulk

Using Variables

To use variables, the configuration file name must end with .yml.liquid

Environment variables are available through the env variable
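
As an illustration, the sketch below (saved as, for example, config.yml.liquid) takes its connection settings from the environment. The environment variable names, the database, and the table are hypothetical.

```yaml
# config.yml.liquid (sketch): variable names below are hypothetical
in:
  type: mysql
  host: {{ env.mysql_host }}
  user: {{ env.mysql_user }}
  password: "{{ env.mysql_password }}"
  database: app_db
  table: users
out:
  type: stdout
```

Keeping credentials in environment variables means the same file can be reused across environments without editing it.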

Page 18: Data integration with embulk

Include File

The included file is searched for relative to the path of the including configuration file, and its file name must be _<name>.yml.liquid
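
A minimal sketch of an include, assuming two hypothetical files in the same directory. The main file, config.yml.liquid, pulls in the input section by name:

```yaml
# config.yml.liquid (sketch)
{% include 'in_mysql' %}
out:
  type: stdout
```

The included file is named with a leading underscore, _in_mysql.yml.liquid, and holds the shared section (connection details here are hypothetical):

```yaml
# _in_mysql.yml.liquid (sketch)
in:
  type: mysql
  host: localhost
  user: embulk_user
  password: "secret"
  database: app_db
  table: users
```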

Page 19: Data integration with embulk
Page 20: Data integration with embulk

Thank You

Teguh Nugraha

Data Science Lead, [email protected]

www.slideshare.net/teguhn

Page 21: Data integration with embulk

References

https://embulk.org

https://github.com/embulk/embulk

http://www.slideshare.net/frsyuki/fighting-against-chaotically-separated-values-with-embulk