Data Integration with Embulk

Data Science Weekend 2016, Yogyakarta

Teguh Nugraha

Multi Data Formats and Storages

MySQL

PostgreSQL

MongoDB

CSV files

BigQuery

Redshift

HDFS

Google Analytics

Mixpanel

Zendesk

Elasticsearch

Multi Data Sources

User data in MySQL

Offline data in CSV

Traffic data in Google Analytics

Log data

Bulk Data Loading: Load data from A to B

Problems

Parsing files

Error handling

Idempotent retrying

Performance

Scalability

Format compatibility

Solution

A reliable framework with parallel execution, data validation, error recovery, automatic guessing, resuming, and an extensive set of plugins

github.com/embulk/embulk

Embulk: Bulk Data Loader

Plugins by Category

•Input plugins

•Output plugins

•Filter plugins

•File parser plugins

•File decoder plugins

•File formatter plugins

•File encoder plugins

•Executor plugins

Getting Started

1. Embulk requires Java

2. Download Embulk: http://dl.embulk.org/embulk-latest.jar

3. Make it executable and verify the installation:

$ embulk --version

4. Run an example:

$ embulk example

Installing Embulk Plugins

$ embulk gem install embulk-input-mysql

$ embulk gem install embulk-output-postgresql

List of plugins:

https://embulk.org/plugins

Embulk Configuration File (YAML)

in: Input plugin options.

◦ parser: If the input is file-based, a parser plugin parses the file format (built-in csv, json, etc.).

◦ decoder: If the input is file-based, a decoder plugin decodes compression or encryption (built-in gzip, bzip2, zip, tar.gz, etc.).

out: Output plugin options.

◦ formatter: If the output is file-based, a formatter plugin writes the file format (such as built-in csv, JSON).

◦ encoder: If the output is file-based, an encoder plugin applies compression or encryption (such as built-in gzip or bzip2).

filters: Filter plugin options (optional).

exec: Executor plugin options. An executor plugin controls parallel processing (such as the built-in thread executor or the Hadoop MapReduce executor).
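The four sections can be seen together in one file. The following is a minimal sketch, not a configuration from the talk: the path prefix, column names, and thread count are illustrative assumptions, and the list-valued key is spelled `decoders` as in Embulk's own example configs.

```shell
# Write a small Embulk config that uses all four top-level sections.
# Paths, columns, and values are hypothetical placeholders.
cat > config.yml <<'EOF'
in:
  type: file
  path_prefix: ./data/sample_          # hypothetical: matches ./data/sample_*.csv.gz
  decoders:
  - {type: gzip}                       # decode gzip before parsing
  parser:
    type: csv
    charset: UTF-8
    delimiter: ','
    skip_header_lines: 1
    columns:
    - {name: id, type: long}
    - {name: name, type: string}
    - {name: created_at, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
filters:
- type: rename                         # built-in filter: rename a column
  columns:
    name: user_name
exec:
  max_threads: 4                       # assumption: tune for your machine
out:
  type: stdout
EOF
```

With `out` set to `stdout`, the loaded records are simply printed, which is a convenient way to check the input side before pointing the output at a real database.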

Using Guess Command

The guess command guesses parser and decoder options

$ embulk guess seed.yml -o config.yml
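A seed file only needs enough information for Embulk to find the input; guess fills in the rest. A minimal sketch, assuming CSV files under a hypothetical `./data/` directory:

```shell
# Write a minimal seed file; the path prefix is a hypothetical placeholder.
cat > seed.yml <<'EOF'
in:
  type: file
  path_prefix: ./data/sample_
out:
  type: stdout
EOF
```

Running the guess command on this seed writes a config.yml in which the parser section (delimiter, header lines, column names and types) has been filled in from a sample of the data.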

Previewing and Running

$ embulk preview config.yml

$ embulk run config.yml

Setting Up a Cron Schedule
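One way to schedule a load is a small wrapper script called from cron. This is a sketch under assumptions: the job directory, log file, and the location of the embulk executable are hypothetical and should be adapted to the actual installation.

```shell
# Create a wrapper script that cron can call.
# Directory and binary locations below are assumptions.
cat > run_embulk.sh <<'EOF'
#!/bin/sh
# cron runs with a minimal PATH, so call embulk by absolute path.
cd "$HOME/embulk-jobs" || exit 1
"$HOME/.embulk/bin/embulk" run config.yml >> embulk.log 2>&1
EOF
chmod +x run_embulk.sh

# Example crontab entry (add it with `crontab -e`): run daily at 02:00.
# 0 2 * * * /path/to/run_embulk.sh
```

Keeping the command in a script rather than directly in the crontab makes it easier to test the job by hand and to capture its output in a log.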

embulk-input-mysql

https://github.com/embulk/embulk-input-jdbc/tree/master/embulk-input-mysql

$ embulk gem install embulk-input-mysql

embulk-output-postgresql

https://github.com/embulk/embulk-output-jdbc/tree/master/embulk-output-postgresql

$ embulk gem install embulk-output-postgresql
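With both plugins installed, a MySQL to PostgreSQL load can be described in a single config. The sketch below uses placeholder hosts, credentials, database names, and table names, all of which are assumptions for illustration.

```shell
# Write a config that copies one table from MySQL to PostgreSQL.
# Every host, credential, and table name here is a hypothetical placeholder.
cat > mysql_to_postgresql.yml <<'EOF'
in:
  type: mysql
  host: localhost
  port: 3306
  user: embulk_reader
  password: change_me
  database: app
  table: users
  select: "id, email, created_at"
out:
  type: postgresql
  host: localhost
  port: 5432
  user: embulk_writer
  password: change_me
  database: warehouse
  table: users
  mode: insert
EOF
```

Previewing this file first shows the rows and column types read from MySQL without writing anything to PostgreSQL.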

Using Variables

The configuration file name must end with .yml.liquid

Environment variables are available through the env variable
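As an illustration, credentials can be kept out of the config file and read from the environment. A minimal sketch, where the variable names `MYSQL_HOST`, `MYSQL_USER`, and `MYSQL_PASSWORD` and the database and table names are assumptions:

```shell
# Write a Liquid-templated config; the heredoc is quoted so the shell
# leaves the {{ ... }} placeholders untouched for Embulk to expand.
cat > config.yml.liquid <<'EOF'
in:
  type: mysql
  host: {{ env.MYSQL_HOST }}
  user: {{ env.MYSQL_USER }}
  password: "{{ env.MYSQL_PASSWORD }}"
  database: app
  table: users
out:
  type: stdout
EOF
```

The variables are exported in the shell (or in the cron wrapper) before calling embulk with the .yml.liquid file.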

Include File

The included file is searched for relative to the path of the input configuration file, and its file name must be _<name>.yml.liquid
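The naming rule can be sketched with two files: a shared partial and a config that includes it. The file names and connection settings are hypothetical; the include is placed at the top level so the partial's indentation carries over unchanged.

```shell
# Shared partial: the leading underscore and .yml.liquid suffix are required.
cat > _in_mysql.yml.liquid <<'EOF'
in:
  type: mysql
  host: localhost
  user: embulk_reader
  password: "{{ env.MYSQL_PASSWORD }}"
  database: app
  table: users
EOF

# Main config: 'in_mysql' resolves to _in_mysql.yml.liquid in the same directory.
cat > load_users.yml.liquid <<'EOF'
{% include 'in_mysql' %}
out:
  type: stdout
EOF
```

Several job configs can then share one input definition and differ only in their output section.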

Thank You

Teguh Nugraha

Data Science Lead, Bukalapak

teguh@bukalapak.com

http://www.slideshare.net/teguhn

References

https://embulk.org

https://github.com/embulk/embulk

http://www.slideshare.net/frsyuki/fighting-against-chaotically-separated-values-with-embulk
