
  • A World of Data

    The “Thingsternet”, living online, “gizillions” of mobile transactions: compete by asking bigger questions

  • $

    $$$...

  • ???

  • SLA

  • Yaaaay – Hadoop to Save the Daaaay!!

    • But it’s not always easy to tame an elephant…

  • Introducing “DataCo”

    [Diagram: customers, a web client, a web shop backend, and a web shop database holding ~100GB of product and customer transaction data]

    “We don’t really have a big data problem…”

  • > 6 months?

    Introducing “DataCo”

    [Diagram: the same web shop architecture, now also producing mobile app data, web app click stream data, and IT/Ops and InfoSec data alongside the product and customer transaction data]

  • Active Archive / Self-Serve Ad-hoc BI

    • Top sold products last 6, 12, and 18 months?

    [Diagram: SQL queries through Hive and Impala over data in HDFS]

  • Using Sqoop to Ingest Data from MySQL

    • Sqoop is a bi-directional structured data ingest tool

    • Simple UI in Hue, more commonly used from the shell

    $ sqoop import-all-tables -m 12 --connect \
        jdbc:mysql://my.sql.host:3306/retail_db --username=dataco_dba \
        --password=yow!2014 --compression-codec=snappy --as-avrodatafile \
        --warehouse-dir=/user/hive/warehouse

    $ sqoop import -m 12 --connect jdbc:mysql://my.sql.host:3306/retail_db \
        --username=dataco_dba --password=yow!2014 \
        --table my_cool_table --hive-import --as-parquetfile
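The `-m 12` flag sets the import parallelism: Sqoop queries the MIN and MAX of the split column (by default the table's primary key) and hands each of the 12 mappers a contiguous key range to SELECT. A rough sketch of that range arithmetic (plain Python, not Sqoop's actual code):

```python
# Conceptual sketch of how "sqoop import -m 12" divides a table:
# each mapper gets a contiguous slice of the split column's range.

def split_ranges(min_id, max_id, num_mappers):
    """Divide [min_id, max_id] into num_mappers contiguous ranges."""
    span = max_id - min_id + 1
    base, extra = divmod(span, num_mappers)
    ranges, lo = [], min_id
    for i in range(num_mappers):
        size = base + (1 if i < extra else 0)  # spread the remainder
        hi = lo + size - 1
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges

# e.g. primary keys 1..1000 split across 12 mappers
for lo, hi in split_ranges(1, 1000, 12):
    print(f"mapper: SELECT ... WHERE id BETWEEN {lo} AND {hi}")
```

Skewed key distributions make these ranges uneven in row count, which is why `--split-by` lets you pick a better column.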

  • Create Tables in Hive

    • Hive is a batch query tool, but also the keeper of table structures

    • Remember: structure is stored _separately_ from the data

    hive> CREATE EXTERNAL TABLE products
        > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
        > STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
        > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
        > LOCATION 'hdfs:///user/hive/warehouse/products'
        > TBLPROPERTIES ('avro.schema.url'='hdfs://namenode_dataco/user/examples/products.avsc');
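The `avro.schema.url` property keeps the table's structure in an external Avro schema file rather than alongside the data. Avro schemas are plain JSON, so here is a sketch of what such a `products.avsc` might contain (the field names are hypothetical, not taken from the talk):

```python
import json

# Hypothetical products.avsc contents; Avro schemas are plain JSON,
# so the table structure can evolve without rewriting the data files.
products_avsc = """
{
  "type": "record",
  "name": "products",
  "fields": [
    {"name": "product_id",    "type": "int"},
    {"name": "product_name",  "type": "string"},
    {"name": "product_price", "type": "double"}
  ]
}
"""

schema = json.loads(products_avsc)
print([f["name"] for f in schema["fields"]])
# ['product_id', 'product_name', 'product_price']
```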

  • Use Impala via Hue to Query

  • $

    $$$...

  • Correlate Multi-type Data Sets

    • Top viewed products last 6, 12, and 18 months?

    [Diagram: Flume feeding HDFS, queried with SQL through Hive and Impala]

  • Ingest Data Using Flume

    • Pub/sub ingest framework

    • Flexible multi-level (mini-transformation) pipeline

    [Diagram: a Flume agent connects a Flume source (continuously generated events, e.g. syslog, tweets) through optional logic to a Flume sink (another Flume agent, HDFS, HBase, Solr, or other destination)]
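Inside an agent, events from the source are buffered in a channel, and a sink drains them in batches toward the destination. A toy sketch of that source, channel, sink flow (plain Python, not the Flume API):

```python
# Conceptual sketch of Flume's source -> channel -> sink pipeline:
# the channel decouples ingest rate from delivery rate.
from collections import deque

class MemoryChannel:
    """Minimal stand-in for a Flume memory channel."""
    def __init__(self):
        self.events = deque()

    def put(self, event):
        self.events.append(event)          # source side

    def take(self, batch_size):
        batch = []                         # sink side: drain in batches
        while self.events and len(batch) < batch_size:
            batch.append(self.events.popleft())
        return batch

channel = MemoryChannel()
# "Source": continuously generated events, e.g. web log lines
for line in ["GET /product/1", "GET /product/2", "GET /checkout"]:
    channel.put(line)

# "Sink": deliver toward HDFS/Solr one batch at a time
print(channel.take(batch_size=2))   # ['GET /product/1', 'GET /product/2']
```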

  • Create Hive Tables over Log Data

    • New use case, new data

    • Create new tables over semi-structured log data

    CREATE EXTERNAL TABLE intermediate_access_logs (
        ip STRING, date STRING, method STRING, url STRING, http_version STRING,
        code1 STRING, code2 STRING, dash STRING, user_agent STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
        "input.regex" = "([^ ]*) - - \\[([^\\]]*)\\] \"([^\ ]*) ([^\ ]*) ([^\ ]*)\" (\\d*) (\\d*) \"([^\"]*)\" \"([^\"]*)\"",
        "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s")
    LOCATION '/user/hive/warehouse/original_access_logs';

    CREATE EXTERNAL TABLE tokenized_access_logs (
        ip STRING, date STRING, method STRING, url STRING, http_version STRING,
        code1 STRING, code2 STRING, dash STRING, user_agent STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/user/hive/warehouse/tokenized_access_logs';

    ADD JAR /opt/cloudera/parcels/CDH/lib/hive/lib/hive-contrib.jar;

    INSERT OVERWRITE TABLE tokenized_access_logs
    SELECT * FROM intermediate_access_logs;

    exit;
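To see what the RegexSerDe pattern actually captures, here is the same expression in Python (single backslashes instead of Hive's doubled ones) applied to a made-up combined-format log line:

```python
import re

# Same pattern as the "input.regex" above, with Python escaping;
# the sample log line is invented for illustration.
LOG_RE = re.compile(
    r'([^ ]*) - - \[([^\]]*)\] "([^ ]*) ([^ ]*) ([^ ]*)" (\d*) (\d*) "([^"]*)" "([^"]*)"'
)

line = ('192.168.1.10 - - [14/Jun/2014:10:30:13 -0400] '
        '"GET /product/cool_thing HTTP/1.1" 200 1024 "-" "Mozilla/5.0"')

# One capture group per column of intermediate_access_logs
ip, date, method, url, http_version, code1, code2, dash, user_agent = \
    LOG_RE.match(line).groups()
print(method, url, code1)   # GET /product/cool_thing 200
```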

  • Use Impala and Hue to Query

    [Chart: product results with one marked “Missing!!!”]

  • $

    $$$...

  • !!!

  • Multi-Use-Case Data Hub

    • Why are sales dropping over the last 3 days?

    [Diagram: Flume feeding HDFS, with search queries served by Solr]

  • Create your Index

    • Create an empty Solr index configuration directory

    • Edit the Solr Schema file to have the fields you want to search over

    $ solrctl --zk /solr instancedir --generate live_logs_dir

  • Create your Index cont.

    • Upload your configuration for a collection to ZooKeeper

    • Tell Solr to start serving up a collection and start indexing data for it

    $ solrctl --zk /solr instancedir --create live_logs ./live_logs_dir

    $ solrctl --zk /solr collection --create live_logs -s 4

  • Flume and Morphline Pipeline

  • Flume with Morphlines Configured

    • Configure Flume to use your Morphlines and post parsed data to Solr

    …

    # Describe solrSink
    agent1.sinks.solrSink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
    agent1.sinks.solrSink.channel = memoryChannel
    agent1.sinks.solrSink.batchSize = 1000
    agent1.sinks.solrSink.batchDurationMillis = 1000
    agent1.sinks.solrSink.morphlineFile = /opt/examples/flume/conf/morphline.conf
    agent1.sinks.solrSink.morphlineId = morphline
    agent1.sinks.solrSink.threadCount = 1

    …

  • Dynamic Search UI in Hue

  • Shared Storage!!

  • How Do We Improve Healthcare?

    Challenges
    • Only 3 days of monitoring data capacity
    • No ability to correlate large research data sets
    • No ability to study environment impact ad hoc

    Solution
    • 50GB monitor data per week
    • 2TB capacity
    • Sqoop, Solr, Impala, HDFS

    Benefits
    • Ad-hoc and faster insight
    • Reduced asthma-related ICU visits
    • Total license fees < 3 processor licenses for EDW

  • How Do We Feed The World?

    Global Warming Changes Conditions

    How do we improve quality and resistance of crops and seeds in a variety of global and rapidly changing environments?

  • How Do We Feed The World?

    Challenges
    • Time to market for each new product: 5-10 years
    • 1,000+ scientists working in silos
    • Data processing bottlenecks slow development

    Solution
    • PB-scale
    • HBase, HDFS, Solr, MapReduce, Sqoop, Impala, …

    Benefits
    • Streamlined processes
    • Time to results reduced from years to months!!!

  • Challenges
    • 100-200 B events/month
    • Real-time multi-type event correlation complex
    • No way to do ad-hoc game analytics

    Solution
    • ~20 nodes
    • 256GB RAM servers
    • Flume, Solr, Impala, HDFS

    Benefits
    • Ad-hoc insight on feature trends
    • Significant TTR reduction
    • ROI realized in the 1st week

  • Learn More?

    • Stop by the Cloudera booth today!

    • Play on your own: cloudera.com/live

    • Get training: http://cloudera.com/content/cloudera/en/training.html

    • Join the Community: cdh-user@cloudera.org

    • Connect with me: @EvaAndreasson

  • Hope You Enjoyed This Talk!

    Don’t forget to VOTE!!!

  • Bonus Track…

  • My Advice for the Road…

  • Try Something Simple First…

  • Decide what to Cook!

  • Collect All Ingredients

  • Use the Right Tool for the Right Task

  • Prepare All Ingredients

  • Don’t Forget the Importance of Visualization!

  • Challenges
    • Tons of information locked away in medical records & scientific studies
    • Different sources & systems can’t “talk” to each other

    Solution
    • Integration & storage of multi-structured experimental data
    • Data access & exploration via Impala, R, HBase, Solr, Hive

    Benefits
    • Faster, cheaper genome sequencing
    • Searchable index of variant call data for biologists to explore

  • Using Sqoop to Ingest Data from MySQL

    • View your imported “tables”

    • View all Avro files constituting a table

    $ hadoop fs -ls /user/hive/warehouse/

    $ hadoop fs -ls /user/hive/warehouse/mytablename/

  • Hadoop - A New Approach to Data Management

    Schema on Read

    Distributed Storage

    Distributed Processing

    Active Archive

    Cost-Efficient Offload

    Flexible Analytics

  • Hadoop: Storage & Batch Processing

    The Birth of the Data Lake

  • The ecosystem grew with each release:

    • Core Hadoop
    • Core Hadoop
    • Core Hadoop, HBase, ZooKeeper, Mahout
    • Core Hadoop, HBase, ZooKeeper, Mahout, Pig, Hive
    • Core Hadoop, HBase, ZooKeeper, Mahout, Pig, Hive, Flume, Avro, Sqoop