Teradata Data Archival Strategy with Hadoop and Hive


  • 1. September 2014
  • 2. The Data Archival proof of concept is currently underway under the direction and guidance of the Business Insurance (BI) Teradata 14.10 upgrade program. This high-level proof-of-concept design focuses on techniques and practices for archiving and retrieving BI Warehouse data between the Teradata and Hadoop environments.
  • 3. Proof-of-concept cases:
    Case 1: Copy an existing Teradata base table and all of its data to HDFS and verify that the data is structurally equivalent. Move the data back into the Teradata database as a relational table and verify that the structure and data match exactly.
    Case 2: Copy an existing Teradata SCD table and all of its data to HDFS and verify that the data is structurally equivalent. Capture the CDC values from the Teradata table, apply those changes to the HDFS table, and verify that they are reflected there. Move the HDFS table back to Teradata and verify that the structure and data match between Teradata and HDFS.
    (A hedged Sqoop round-trip sketch for Case 1 appears after the last slide.)
  • 4. Out of scope:
    - Ab Initio based ETL from Teradata to HDFS (extracts to the landing zone may be considered)
    - Non-Apache Hadoop drivers and connectors
    - Cluster hardware / software configurations
    - Security layer implementation (data encryption, masking, etc.)
    - Performance tuning and benchmarking
    - High availability and DR
  • 5. Installed vs. desired component versions:
    Component        | Installed Version        | Desired Version | Features in Desired Version
    DataFu           | pig-udf-datafu-0.0.4+11  |                 |
    Apache Flume     | flume-ng-1.4.0+96        |                 |
    Apache Hadoop    | hadoop-2.0.0+1554        |                 |
    Apache HBase     | hbase-0.94.15+86         |                 |
    Apache Hive      | hive-0.10.0+237          | 0.14            | Truncate, more data types
    Hue              | hue-2.5.0+217            |                 |
    Apache Mahout    | mahout-0.7+15            |                 |
    Apache Oozie     | oozie-3.3.2+100          |                 |
    Parquet          | parquet-1.2.5+7          |                 |
    Apache Pig       | pig-0.11.0+42            |                 |
    Apache Sentry    | sentry-1.1.0+20          |                 |
    Apache Sqoop     | sqoop-1.4.3+92           |                 |
    Apache Sqoop2    | sqoop2-1.99.2+99         |                 |
    Apache Whirr     | whirr-0.8.2+15           |                 |
    Apache ZooKeeper | zookeeper-3.4.5+25       |                 |
  • 6. Architecture diagram (connector-based transfer): Source Layer (Teradata) to Storage Layer (HDFS / HBase), with import/export paths via a custom MapReduce JDBC utility, the Cloudera Sqoop connector powered by Teradata, and the Teradata Connector for Hadoop (TDCH) CLI utility; the Hadoop ecosystem provides MapReduce, SQL query (Hive), scripting (Pig), Oozie, JDBC/ODBC access, and Cloudera management & monitoring services.
  • 7. Solution components:
    1. Source Layer - Teradata: Contains the Teradata tables that need to be migrated to Hadoop storage. Tables may be full-refresh tables or SCD tables.
    2. Storage Layer: CDH 4.x - Cloudera Distribution with Cloudera Manager for management and monitoring. The Hadoop stack includes Hive, Pig, HBase, Oozie, and Sqoop.
    3. Sqoop Connector for Teradata / Teradata Connector for Hadoop (CLI):
       - Cloudera connector for Sqoop powered by Teradata, developed by Cloudera and Teradata. Supports importing data split by AMP / VALUE / PARTITION / HASH, exporting data via batch.insert, multiple.fastload, or internal.fastload, and importing/exporting data in Text, Sequence, or Avro file formats.
       - Cloudera recommendation: use the Cloudera Connector Powered by Teradata rather than the older Cloudera Connector for Teradata.
       - TDCH: a command-line utility provided by Teradata, leveraging the Teradata Java SDK (TeradataImportTool / TeradataExportTool), for data transfer between Hadoop and Teradata.
    4. Hadoop / Other (Processing Layer): HDFS stores the files for processing. Sqoop-imported files can also be imported directly into Hive or loaded into HBase through a custom loading utility. SCD: load fact tables into Hive.
    5. Oozie: Hadoop processing can be scheduled as a workflow through Oozie.
    (An import/export sketch using these connector options appears after the last slide.)
  • 8. Architecture diagram (Teradata utility / file-based transfer): Source Layer (Teradata) feeds an ETL layer that extracts and loads through a file landing zone; files are copied between the landing zone and HDFS, alongside the TDCH and Sqoop paths into HDFS / HBase; the Hadoop ecosystem provides MapReduce, SQL query (Hive), scripting (Pig), Oozie, JDBC/ODBC access, and Cloudera management & monitoring services.
  • 9. Solution components (file-based approach):
    1. Source Layer - Teradata: Contains the Teradata tables that need to be migrated to Hadoop storage. Tables may be full-refresh tables or SCD tables.
    2. Source Layer - Ab Initio / File Landing Zone: Leverage Ab Initio to extract data into flat files when importing data to HDFS, and to load data into Teradata tables from files exported from HDFS. Files are copied between Ab Initio and HDFS through a designated file landing zone.
    3. Storage Layer: CDH 4.x - Cloudera Distribution with Cloudera Manager for management and monitoring. The Hadoop stack includes Hive, Pig, HBase, Oozie, and Sqoop.
    4. Hadoop / Other (Processing Layer): HDFS stores the files for processing. Files can also be imported directly into Hive or loaded into HBase through a custom program. SCD: TBA.
    5. Oozie / Autosys: Hadoop processing can be scheduled as a workflow through Oozie; Ab Initio processing can be scheduled via Autosys.
    (A landing-zone hand-off sketch appears after the last slide.)
  • 10. Known limitations:
    1. Cloudera Connector Powered by Teradata (latest version 1.2c5):
       - Does not support HCatalog.
       - Does not support import into HBase.
       - Does not support upsert functionality (parameter --update-mode allowinsert).
       - Does not support the --boundary-query option.
    2. Cloudera Connector for Teradata (older version):
       - Does not support HCatalog.
       - Does not support import into HBase.
       - Does not support the Avro format.
       - Does not support import-all-tables.
       - Does not support upsert functionality (parameter --update-mode allowinsert).
       - Does not support imports from views.
       - Does not support the data types INTERVAL, PERIOD, and DATE/TIME/TIMESTAMP WITH TIME ZONE.
       - The optional query band is not set for queries executed by the Teradata JDBC driver itself (namely BT and SET SESSION CHARACTERISTICS AS TRANSACTION ISOLATION LEVEL SR).
    3. Hive:
       - Does not provide record-level update, insert, or delete.
       - Does not provide transactions.
       - Compared with an OLTP database, Hive queries have higher latency due to the start-up overhead of MapReduce jobs.
    4. Sqoop:
       - Each execution requires a password, which can be passed on the command line, as standard input, or from a password file; a password file is the more secure way to automate a Sqoop workflow.
       - Encoding of NULL values during import/export needs to be considered.
       - Incremental updates need to use the Sqoop metastore to preserve the last value.
    (Sketches of the password-file and incremental-import practices appear below.)
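Sketch for slide 3, Case 1: a minimal Sqoop round trip between Teradata and HDFS, assuming a hypothetical host (td-host), database (BI_DW), table (POLICY_FACT), staging table, and password file location; none of these names come from the deck.

    # Case 1 round trip (illustrative only -- all names below are placeholders).
    TD_URL="jdbc:teradata://td-host/DATABASE=BI_DW"   # assumed Teradata JDBC URL
    TD_USER="etl_user"
    PASS_FILE="/user/etl/.td_password"                # password file kept in HDFS

    # Import the base table into HDFS as delimited text files.
    sqoop import \
      --connect "$TD_URL" --username "$TD_USER" --password-file "$PASS_FILE" \
      --table POLICY_FACT \
      --target-dir /data/archive/policy_fact \
      --num-mappers 8 --as-textfile

    # Export the archived files back into an empty Teradata staging table.
    sqoop export \
      --connect "$TD_URL" --username "$TD_USER" --password-file "$PASS_FILE" \
      --table POLICY_FACT_RESTORE \
      --export-dir /data/archive/policy_fact

    # Structure and row-count comparison between POLICY_FACT and
    # POLICY_FACT_RESTORE is then done on the Teradata side.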
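Sketch for slide 7: how the import split methods and export write methods listed for the Cloudera Connector Powered by Teradata might be selected. The connector-specific arguments after the bare "--" (--input-method, --output-method) and all object names are assumptions based on the options named in the slide, not verified syntax.

    # Import a Teradata table into Hive, asking the connector to split by hash
    # (split.by.amp / split.by.value / split.by.partition are the alternatives).
    sqoop import \
      --connect jdbc:teradata://td-host/DATABASE=BI_DW \
      --username etl_user --password-file /user/etl/.td_password \
      --table CLAIM_DIM \
      --hive-import --hive-table claim_dim \
      --num-mappers 4 \
      -- --input-method split.by.hash

    # Export HDFS files back to Teradata using one of the supported write
    # methods (batch.insert, multiple.fastload, internal.fastload).
    sqoop export \
      --connect jdbc:teradata://td-host/DATABASE=BI_DW \
      --username etl_user --password-file /user/etl/.td_password \
      --table CLAIM_DIM_RESTORE \
      --export-dir /user/hive/warehouse/claim_dim \
      -- --output-method internal.fastload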
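Sketch for slide 9: the file landing-zone hand-off between Ab Initio and HDFS, with a Hive external table defined over the copied extract. Paths, table name, column layout, and delimiter are hypothetical.

    # Copy an Ab Initio extract from the landing zone into HDFS.
    hadoop fs -mkdir -p /data/landing/policy_fact
    hadoop fs -put /landing_zone/policy_fact_20140901.dat /data/landing/policy_fact/

    # Expose the files to Hive without moving them, as an external table.
    hive -e "
    CREATE EXTERNAL TABLE IF NOT EXISTS archive_policy_fact (
      policy_id  BIGINT,
      eff_dt     STRING,
      premium    DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
    LOCATION '/data/landing/policy_fact';"

    # Reverse path: pull files produced on HDFS back to the landing zone for
    # Ab Initio to load into Teradata.
    hadoop fs -get /data/export/policy_fact_restore/* /landing_zone/export/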
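Sketch for slide 10: the password-file and incremental-import practices mentioned under the Sqoop limitations. File locations, job name, table, and check column are placeholders.

    # 1. Keep the Teradata password in a restricted HDFS file instead of on
    #    the command line.
    echo -n 'secret' | hadoop fs -put - /user/etl/.td_password
    hadoop fs -chmod 400 /user/etl/.td_password

    # 2. Define a saved job; Sqoop records the incremental --last-value in its
    #    metastore so each run resumes where the previous one stopped.
    sqoop job --create policy_incr -- import \
      --connect jdbc:teradata://td-host/DATABASE=BI_DW \
      --username etl_user --password-file /user/etl/.td_password \
      --table POLICY_FACT \
      --target-dir /data/archive/policy_fact \
      --incremental append --check-column LOAD_SEQ --last-value 0 \
      --null-string '\\N' --null-non-string '\\N'   # explicit NULL encoding

    # Run the saved job; subsequent runs import only rows with a higher LOAD_SEQ.
    sqoop job --exec policy_incr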
