© 2016 DataTorrent
Chinmay Kolhatkar ([email protected])
Committer, Apache Apex
Engineer, DataTorrent
July 21, 2016
Data Ingestion Dedup-Enrich-ETL
Agenda
• About Apache Apex
• Apex Platform Overview
• Apex - Native Hadoop Integration
• Apex Malhar Library
• What is Data Ingestion?
• Data Ingestion - Use Cases
• Dedup-Enrich ETL Demo
About Apache Apex
• Platform and runtime engine that enables development of scalable and fault-tolerant distributed applications
• Hadoop native (Hadoop >= 2.2)
  ᵒ No separate service to manage stream processing
  ᵒ Streaming Engine built into Application Master and Containers
• Process streaming or batch big data
• High throughput and low latency
• Library of commonly needed business logic
• Write any custom business logic in your application
Apex Platform Overview
Apex - Native Hadoop Integration
• YARN is the resource manager
• HDFS used for storing any persistent state
Apex Malhar Library
RDBMS
• Vertica
• MySQL
• Oracle
• JDBC

NoSQL
• Cassandra, HBase
• Aerospike, Accumulo
• Couchbase/CouchDB
• Redis, MongoDB
• Geode

Messaging
• Kafka
• Solace
• Flume, ActiveMQ
• Kinesis, NiFi

File Systems
• HDFS/Hive
• NFS
• S3

Parsers
• XML
• JSON
• CSV
• Avro
• Parquet

Transformations
• Filters
• Rules
• Expression
• Dedup
• Enrich

Analytics
• Dimensional Aggregations (with state management for historical data + query)

Protocols
• HTTP
• FTP
• WebSocket
• MQTT
• SMTP

Other
• Elastic Search
• Script (JavaScript, Python, R)
• Solr
• Twitter
What is Data Ingestion?
• Data Ingestion
  ᵒ A process of obtaining, importing, and analyzing data for later use or storage in a database
• Big Data Ingestion
  ᵒ Reading from data sources
  ᵒ Importing the data
  ᵒ Processing data to produce intermediate data
  ᵒ Sending data out to durable data stores
• ETL + Big Data => Data ingestion
Data Ingestion - Use Cases
• Data Sync
  ᵒ Read data from source
  ᵒ Write to destination
  ᵒ Keep syncing data as per rules
• Real-time IoT Data Processing
  ᵒ Read sensor data from sources
  ᵒ Do some processing over the received data
  ᵒ Store/Publish the results over destination
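The IoT use case above (read sensor data, process it, publish the result) can be illustrated with a small plain-Java sketch. It does not use the Apex or Malhar APIs. The `sensorId,value` line format, the class name `SensorAverages`, and the choice of a per-sensor average as the "processing" step are all assumptions made for illustration; the slides do not specify any of them.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an IoT ingestion flow: read -> process -> publish.
class SensorAverages {

    // Reads "sensorId,value" lines (assumed format), skips malformed ones,
    // and returns the mean value per sensor in first-seen order.
    static Map<String, Double> averageBySensor(List<String> lines) {
        Map<String, double[]> acc = new LinkedHashMap<>(); // sensorId -> {sum, count}
        for (String line : lines) {
            String[] parts = line.split(",");
            if (parts.length != 2) {
                continue; // wrong number of fields
            }
            double value;
            try {
                value = Double.parseDouble(parts[1].trim());
            } catch (NumberFormatException e) {
                continue; // value is not numeric
            }
            double[] sumAndCount = acc.computeIfAbsent(parts[0].trim(), k -> new double[2]);
            sumAndCount[0] += value;
            sumAndCount[1] += 1;
        }
        Map<String, Double> averages = new LinkedHashMap<>();
        for (Map.Entry<String, double[]> e : acc.entrySet()) {
            averages.put(e.getKey(), e.getValue()[0] / e.getValue()[1]);
        }
        return averages;
    }

    public static void main(String[] args) {
        // In-memory list stands in for the sensor source; stdout stands in for the destination.
        List<String> readings = new ArrayList<>();
        readings.add("s1,20.0");
        readings.add("s1,22.0");
        readings.add("s2,5.0");
        readings.add("garbage");
        System.out.println(averageBySensor(readings));
    }
}
```

In a streaming engine the same aggregation would typically run over windows of incoming tuples rather than a complete list; the batch form here is only meant to show the shape of the three steps.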
Dedup-Enrich-ETL Application
• KafkaInput - Reads data from Kafka
• CSVParser - Parses CSV data and converts to POJO
• Dedup - Deduplicates the data
• Enrich - Enriches the data using an external source
• HDFSOut - Writes the data out to HDFS
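The data flow through the stages listed above can be sketched in plain Java. This is not the demo application and does not use the Apex or Malhar operator classes: an in-memory list stands in for Kafka, a returned list of strings stands in for HDFS, and an in-memory map stands in for the external enrichment source. The `Order` POJO, its fields, the `id,productId,amount` CSV layout, and the keep-first dedup rule are all hypothetical choices made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, single-process stand-in for the parse -> dedup -> enrich -> write flow.
class DedupEnrichSketch {

    // Assumed POJO shape; the slides do not describe the record schema.
    static final class Order {
        final String id;
        final String productId;
        final double amount;
        String productName; // filled in by the enrich step

        Order(String id, String productId, double amount) {
            this.id = id;
            this.productId = productId;
            this.amount = amount;
        }
    }

    // "CSVParser" step: one "id,productId,amount" line becomes one POJO.
    static Order parse(String line) {
        String[] f = line.split(",");
        return new Order(f[0].trim(), f[1].trim(), Double.parseDouble(f[2].trim()));
    }

    // "Dedup" step: keep the first record seen for each id, drop later repeats.
    static List<Order> dedup(List<Order> in) {
        Set<String> seen = new HashSet<>();
        List<Order> out = new ArrayList<>();
        for (Order o : in) {
            if (seen.add(o.id)) {
                out.add(o);
            }
        }
        return out;
    }

    // "Enrich" step: look up extra fields by key in an external source (here, a map).
    static void enrich(List<Order> orders, Map<String, String> productNames) {
        for (Order o : orders) {
            o.productName = productNames.getOrDefault(o.productId, "UNKNOWN");
        }
    }

    // Runs all steps and returns the lines that would be written to the output store.
    static List<String> run(List<String> lines, Map<String, String> productNames) {
        List<Order> parsed = new ArrayList<>();
        for (String line : lines) {
            parsed.add(parse(line));
        }
        List<Order> unique = dedup(parsed);
        enrich(unique, productNames);
        List<String> out = new ArrayList<>();
        for (Order o : unique) {
            out.add(o.id + "," + o.productId + "," + o.productName + "," + o.amount);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> lookup = new HashMap<>();
        lookup.put("p1", "Widget");
        List<String> input = Arrays.asList("1,p1,9.5", "2,p2,20.0", "1,p1,9.5");
        for (String line : run(input, lookup)) {
            System.out.println(line);
        }
    }
}
```

A streaming dedup cannot keep every key in memory forever, so a production operator would need some policy for bounding or expiring that state; the unbounded `HashSet` here is a simplification.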
Dedup-Enrich-ETL Live Demo
Resources
• Apache Apex - http://apex.apache.org/
• Subscribe - http://apex.apache.org/community.html
• Download - https://www.datatorrent.com/download/
• Twitter
  ᵒ @ApacheApex; Follow - https://twitter.com/apacheapex
  ᵒ @DataTorrent; Follow - https://twitter.com/datatorrent
• Meetups - http://www.meetup.com/topics/apache-apex
• Webinars - https://www.datatorrent.com/webinars/
• Videos - https://www.youtube.com/user/DataTorrent
• Slides - http://www.slideshare.net/DataTorrent/presentations
• Startup Accelerator Program - Full featured enterprise product
  ᵒ https://www.datatorrent.com/product/startup-accelerator/
We Are Hiring
• [email protected]
• Developers/Architects
• QA Automation Developers
• Information Developers
• Build and Release
• Community Leaders