



Sqoop

Distributed Systems
Fall 2014
Zubair Amjad

Outline

- Motivation
- What is Sqoop?
- How Sqoop works?
- Sqoop Architecture
- Import
- Export
- Sqoop Connectors
- Sqoop Commands

Motivation

- SQL servers are already widely deployed worldwide
- Nightly processing has been done on SQL servers for years
- As more organizations deploy Hadoop to analyze vast streams of information, they may find they need to transfer large amounts of data between Hadoop and their existing databases
- Loading bulk data into Hadoop, or accessing it from map-reduce applications, is a challenging task
- Transferring data using scripts is inefficient and time-consuming
- Traditional databases already back the reporting, data visualization, and other applications built in the enterprise
- Processed data from Hadoop needs to be brought back into these applications

What is Sqoop?

- A tool to automate data transfer between Hadoop and relational databases
- Transform data in Hadoop with MapReduce or Hive, then export it back into the relational database
- Allows easy import and export of data from structured datastores: relational databases, enterprise data warehouses, and NoSQL systems
- Provisions data from external systems onto HDFS
- Sqoop integrates with Oozie, a workflow scheduler system for Apache Hadoop jobs, to allow scheduling and automation of import and export tasks
- Sqoop uses a connector-based architecture that supports plugins providing connectivity to new external systems (a quick connectivity check is sketched below)
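Before importing anything, it is worth confirming that Sqoop can reach the database at all. A minimal sketch using the standard list-databases tool; the local MySQL host and the user/pass credentials are hypothetical placeholders:

    $ sqoop list-databases \
        --connect jdbc:mysql://localhost/ \
        --username user --password pass

If the connection string and credentials are valid, this prints the databases visible to that account.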

How Sqoop Works

- Runs on a Hadoop cluster and has access to the Hadoop core
- Sqoop uses mappers to slice the incoming data, which is then placed on HDFS
- The dataset being transferred is sliced up into partitions
- A map-only job is launched, with individual mappers responsible for transferring a slice of the dataset (see the sketch after this list)
- Each record is handled in a type-safe manner, since Sqoop uses the database metadata to infer the data types
- Many data transfer formats are supported
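A sketch of how this slicing is controlled from the command line. --split-by names the column Sqoop uses to partition the table and --num-mappers sets how many map tasks transfer slices in parallel; both are standard Sqoop options, while the database, table, column, and credentials are hypothetical:

    $ sqoop import \
        --connect jdbc:mysql://localhost/shop \
        --username user --password pass \
        --table orders \
        --split-by order_id \
        --num-mappers 4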

Sqoop Architecture

Sqoop Import

Command:

$ sqoop import --connect jdbc:mysql://localhost/DB_NAME --table TABLE_NAME --username USER_NAME --password PASSWORD

Options:
- --connect, --username, --password: parts of the connection string
- --table: database table name

Step 1: Sqoop examines the database to gather the necessary metadata for the data being imported.

Step 2: Sqoop submits a map-only Hadoop job to the cluster; the job performs the data transfer using the metadata captured in step 1. (A filtered variant of the import command is sketched below.)
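The import does not have to copy the whole table. A sketch using the standard --columns and --where options to restrict what is transferred; the table, column names, and filter are hypothetical:

    $ sqoop import \
        --connect jdbc:mysql://localhost/shop \
        --username user --password pass \
        --table orders \
        --columns "order_id,customer_id,total" \
        --where "order_date >= '2014-01-01'"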

Sqoop Import (continued)

- The imported data is saved in an HDFS directory named after the table being imported
- The user can specify an alternative directory where the files should be placed
- By default these files contain comma-delimited fields, with new lines separating records
- The user can override this format by explicitly specifying the field separator and record terminator characters
- Sqoop also supports other data formats for imported data

Sqoop Export

Command:

$ sqoop export --connect jdbc:mysql://SERVER/DB_NAME --table TARGET_TABLE_NAME --username USER_NAME --password PASSWORD --export-dir EXPORT_DIR

Options:
- --connect, --username, --password: parts of the connection string
- --table: target database table name
- --export-dir: HDFS directory from which the data will be exported

Step 1: Sqoop examines the database for metadata; this is followed by the second step of transferring the data.

Step 2: Data transfer
- Sqoop divides the input dataset into splits
- Sqoop uses individual map tasks to push the splits to the database
- Each map task performs its transfer over many transactions, to ensure optimal throughput and minimal resource utilization (a sketch follows below)
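A sketch of an export tuned along these lines. -m (short for --num-mappers) controls the number of map tasks and --batch enables JDBC statement batching; both are standard Sqoop options, while the server, table, and HDFS path are hypothetical:

    $ sqoop export \
        --connect jdbc:mysql://dbserver/shop \
        --username user --password pass \
        --table order_summary \
        --export-dir /user/zubair/order_summary \
        -m 4 --batch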

Sqoop Connectors

- Generic JDBC connector: can be used to connect to any database that is accessible via JDBC; this is the default Sqoop connector
- Default connectors designed for specific databases such as MySQL, PostgreSQL, Oracle, SQL Server, and DB2
- Fast-path connectors: specialized connectors that use database-specific batch tools to transfer data with high throughput; available for MySQL and PostgreSQL

Sqoop Commands

Available commands:
- codegen: generate code to interact with database records
- create-hive-table: import a table definition into Hive
- eval: evaluate a SQL statement and display the results
- export: export an HDFS directory to a database table
- help: list available commands
- import: import a table from a database to HDFS
- import-all-tables: import tables from a database to HDFS
- job: work with saved jobs (a sketch follows below)
- list-databases: list available databases on a server
- list-tables: list available tables in a database
- merge: merge results of incremental imports
- metastore: run a standalone Sqoop metastore
- version: display version information
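As a closing example, the job subcommand can save a parameterized import and rerun it later. A sketch reusing the hypothetical connection details from above; the job name nightly_orders is also a placeholder:

    $ sqoop job --create nightly_orders -- import \
        --connect jdbc:mysql://localhost/shop \
        --username user --password pass \
        --table orders

    $ sqoop job --list
    $ sqoop job --exec nightly_orders

Thank you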