© 2014 Progress Software Corporation. All rights reserved.
Supercharge Sqoop with magical JDBC drivers
Sumit Sarkar, Principal Systems Engineer
www.linkedin.com/in/meetsumit
@SAsinSumit
Supercharge Sqoop with magical JDBC drivers
What is Sqoop?
Who is Progress DataDirect and their magical JDBC drivers?
How do you supercharge Sqoop?
What is Sqoop?
Apache Sqoop
Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache
Hadoop and structured datastores such as relational databases.
Sqoop successfully graduated from the Incubator in March of 2012 and is now a Top-Level
Apache project
http://sqoop.apache.org/
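As a sketch of what that bulk transfer looks like in practice, here is a minimal Sqoop import of one relational table into HDFS. The host, database, table, and credentials below are hypothetical placeholders, not values from this deck:

```shell
# Hypothetical example: pull the "orders" table from a MySQL database
# into HDFS with 4 parallel map tasks. Connection details are placeholders.
SQOOP_IMPORT="sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user --password-file /user/etl/.pw \
  --table orders \
  --target-dir /user/hdfs/orders \
  -m 4"
echo "$SQOOP_IMPORT"
```

The matching `sqoop export` reverses the direction, writing HDFS files back into a relational table, as shown later in this deck.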
What’s the deal with DataDirect?
We connect the world's applications to the world's data using industry standards
20+ years of experience
100,000+ end users
250+ ISVs
200 team members
Relational
Databases
SaaS
Application
Data
NoSQL,
NewSQL,
Big Data
Custom
Connectivity
EDI / XML
Transformation
Disrupting the disruption from the Hadoop Ecosystem
Big Data ODBC drivers are built for workloads too large for traditional drivers:
• Apache Hive 0.8-0.13 (CDH 4.0-5.1, HDP 1.3-2.1)
• Amazon Elastic MapReduce (Hive 0.8-0.11, HDP 1.2-2.0)
How do you supercharge Sqoop?
Closer look at magical JDBC with Sqoop
Unique support for SQL Server NTLM authentication from Linux
Breadth of JDBC drivers well beyond relational (NoSQL, SaaS, Cloud, NewSQL, etc.)
Improved performance on LOAD and EXPORT
For EXPORT, we're hearing impressive performance with DataDirect bulk load for Sqoop
• Sample numbers: 39 GB loaded in < 10 minutes (SQL Server)
For LOAD, we've seen 2x performance (Oracle)
For EXPORT/LOAD, a single driver supports multiple versions (Oracle 9i-12c; SQL Server 2000-2014; DB2 across z/OS, iSeries, and Windows/UNIX)
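As a back-of-envelope check on the EXPORT sample number above (39 GB in under 10 minutes), the implied sustained throughput can be computed with shell arithmetic:

```shell
# Back-of-envelope throughput for the EXPORT sample above:
# 39 GB loaded in 10 minutes implies roughly this sustained rate.
gb=39
minutes=10
mb_per_sec=$(( gb * 1024 / (minutes * 60) ))
echo "${mb_per_sec} MB/s"
```

Since 10 minutes is an upper bound, this works out to a sustained rate of at least roughly 66 MB/s.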
Magical JDBC drivers for use with Sqoop
Big Data/NoSQL
Pivotal HAWQ
MongoDB
Cassandra
Data Warehouses
Amazon Redshift
SAP Sybase IQ
Teradata
Oracle Exadata
Pivotal Greenplum
Relational
Oracle DB
Microsoft SQL Server
IBM DB2
MySQL
MemSQL
PostgreSQL
IBM Informix
SAP Sybase
Pervasive SQL
Progress OpenEdge
Progress Rollbase
SaaS/Cloud
Salesforce.com
Database.com
FinancialForce
Veeva CRM
ServiceMAX
Hubspot
Marketo
Microsoft Dynamics CRM
Microsoft SQL Azure
Oracle Eloqua
Oracle Service Cloud
Google Analytics
Example of DataDirect JDBC bulk load for Sqoop
sqoop export --connect 'jdbc:datadirect:sqlserver://nc-sqlserver:1433;database=test;user=test01;password=test01;EnableBulkLoad=true;BulkLoadBatchSize=1024;BulkLoadOptions=0' --driver com.ddtek.jdbc.sqlserver.SQLServerDriver --table 'blah_1024MB' --export-dir /user/hdfs/blah_1024MB/ --input-lines-terminated-by "\n" --input-fields-terminated-by ',' --batch -m 10

Notes:
--batch: batch mode is used for underlying INSERT statement execution.
--driver: must be specified when using the generic JDBC connector.
--connect: the JDBC URL. EnableBulkLoad=true authorizes the DataDirect SQL Server driver to use the bulk load protocol when inserting rows. The BulkLoadBatchSize value tells the driver how many rows it will attempt to bulk load on a single round trip to the server. If this value is less than the sqoop.export.records.per.statement value, each call to executeBatch will make more than one round trip to the server to insert the batch of rows.
--table: the table to be populated in the target relational database as data is transferred from HDFS.
--export-dir: identifies the HDFS directory containing the Hadoop table to be exported.
--input-lines-terminated-by: identifies the character that separates rows in the HDFS files.
--input-fields-terminated-by: identifies the character that separates columns in the HDFS files.
-D sqoop.export.records.per.statement is not the equivalent of a JDBC batch size and is not recommended here; rather, it specifies the number of rows per SQL statement for data sources that support multi-row inserts, such as PostgreSQL.
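The relationship between BulkLoadBatchSize and sqoop.export.records.per.statement described in the notes above can be sketched with ceiling division (the records-per-statement value here is illustrative, not from the deck):

```shell
# Each executeBatch() call carries records_per_statement rows; the driver
# splits them into bulk-load round trips of at most bulk_load_batch_size rows.
records_per_statement=10000     # illustrative value
bulk_load_batch_size=1024       # matches BulkLoadBatchSize in the example URL
# ceiling division in POSIX shell arithmetic
round_trips=$(( (records_per_statement + bulk_load_batch_size - 1) / bulk_load_batch_size ))
echo "round trips per executeBatch: $round_trips"
```

With these values each executeBatch call costs 10 round trips, which is why sizing BulkLoadBatchSize relative to the batch size matters for export throughput.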
Apache Hive ODBC Driver
Progress DataDirect: "We have no life outside of data connectivity"