April 10-12 | Chicago, IL Big Data and NoSQL for Database and BI Pros Andrew J. Brust, Founder and CEO, Blue Badge Insights

Big Data and NoSQL for Database and BI Pros


DESCRIPTION

Big Data and NoSQL for Database and BI Pros - PASS Business Analytics Conference 2013


Page 1: Big Data and NoSQL for Database and BI Pros

April 10-12 | Chicago, IL

Big Data and NoSQL for Database and BI Pros

Andrew J. Brust, Founder and CEO, Blue Badge Insights

Page 2: Big Data and NoSQL for Database and BI Pros


Please silence cell phones

Page 3: Big Data and NoSQL for Database and BI Pros


Meet Andrew

CEO and Founder, Blue Badge Insights

Big Data blogger for ZDNet
Microsoft Regional Director, MVP
Co-chair VSLive! and 17 years as a speaker
Founder, Microsoft BI User Group of NYC
• http://www.msbinyc.com
Co-moderator, NYC .NET Developers Group
• http://www.nycdotnetdev.com
“Redmond Review” columnist for Visual Studio Magazine and Redmond Developer News
brustblog.com, Twitter: @andrewbrust

Page 4: Big Data and NoSQL for Database and BI Pros

Andrew’s New Blog (bit.ly/bigondata)

Page 5: Big Data and NoSQL for Database and BI Pros

Lynn Langit (in absentia)

CEO and Founder, Lynn Langit consulting
Former Microsoft Evangelist (4 years)
Google Developer Expert
MongoDB Master
MCT 13 years – 7 certifications
Cloudera Certified Developer
MSDN Magazine articles
• SQL Azure
• Hadoop on Azure
• MongoDB on Azure
www.LynnLangit.com
@LynnLangit

Page 6: Big Data and NoSQL for Database and BI Pros

Read all about it!

Page 7: Big Data and NoSQL for Database and BI Pros

Agenda

Overview / Landscape
• Big Data, and Hadoop
• NoSQL
• The Big Data-NoSQL Intersection

Drilldown on Big Data
Drilldown on NoSQL

Page 8: Big Data and NoSQL for Database and BI Pros

What is Big Data?

100s of TB into PB and higher
Involving data from: financial data, sensors, web logs, social media, etc.
Parallel processing often involved
Hadoop is emblematic, but other technologies are Big Data too
Processing of data sets too large for transactional databases
Analyzing interactions, rather than transactions
The three V’s: Volume, Velocity, Variety
Big Data tech sometimes imposed on small data problems

Page 9: Big Data and NoSQL for Database and BI Pros

Big Data = Exponentially More Data
Retail Example -> ‘Feedback Economy’
• Number of transactions
• Number of behaviors (collected every minute)

Page 10: Big Data and NoSQL for Database and BI Pros

Big Data = ‘Next State’ Questions

• What could happen?
• Why didn’t this happen?
• When will the next new thing happen?
• What will the next new thing be?
• What happens?

Collecting behavioral data

Page 11: Big Data and NoSQL for Database and BI Pros

My Data: An Example from Health Care
Medical records
• Regular
• Emergency
• Genetic data – 23andMe
Food data
• SparkPeople
Purchasing
• Grocery card
• Credit card
Search – Google
Social media
• Twitter
• Facebook
Exercise
• Nike Fuel Band
• Kinect
• Location - phone

Page 12: Big Data and NoSQL for Database and BI Pros

Big Data = More Data

Page 13: Big Data and NoSQL for Database and BI Pros

Big Data Considerations

Collection – get the data
Storage – keep the data
Querying – make sense of the data
Visualization – see the business value

Page 14: Big Data and NoSQL for Database and BI Pros

Data Collection

Types of Data
• Structured, semi-structured, unstructured vs. data standards
• Behavioral vs. transactional data

Methods of collection
• Sensors everywhere
• Machine-2-Machine
• Public Datasets
  • Freebase
  • Azure DataMarket
  • Hilary Mason’s list

Page 15: Big Data and NoSQL for Database and BI Pros

What’s MapReduce?

Partition the bulk input data and send to mappers (nodes in cluster)
Mappers pre-process, put into key-value format, and send all output for a given (set of) key(s) to a reducer
Reducer aggregates; one output per key, with value
Map and Reduce code natively written as Java functions
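
The flow described on this slide can be sketched in a few lines of plain Python. This is only an in-memory illustration of the map, group-by-key and reduce steps using a word count; it is not Hadoop code, and the names mapper, shuffle, reducer and run_job are made up for the example.

```python
from collections import defaultdict

def mapper(line):
    """Map step: emit a (key, value) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(key, values):
    """Reduce step: aggregate to one output per key."""
    return key, sum(values)

def run_job(lines):
    """Run map over every line, group by key, then reduce each group."""
    pairs = [pair for line in lines for pair in mapper(line)]
    return dict(reducer(key, values) for key, values in shuffle(pairs).items())

counts = run_job(["big data", "big clusters", "data data"])
print(counts)
```

On a real cluster the mapper calls would run in parallel on different nodes and the grouping would move data across the network; here everything runs in one process so the three stages are easy to follow.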

Page 16: Big Data and NoSQL for Database and BI Pros

MapReduce, in a Diagram

[Diagram: input splits feed six mappers; mapper output is routed by key to three reducers (K1 and K4; K2 and K5; K3 and K6), each of which writes output.]

Page 17: Big Data and NoSQL for Database and BI Pros

A MapReduce Example

• Count by suite, on each floor
• Send per-suite, per platform totals to lobby
• Sort totals by platform
• Send two platform packets to 10th, 20th, 30th floor
• Tally up each platform
• Merge tallies into one spreadsheet
• Collect the tallies

Page 18: Big Data and NoSQL for Database and BI Pros

What’s a Distributed File System?

One where data gets distributed over commodity drives on commodity servers
Data is replicated
• If one box goes down, no data lost
• “Shared Nothing”

BUT: Immutable
• Files can only be written to once
• So updates require drop + re-write (slow)
• You can append though
• Like a DVD/CD-ROM

Page 19: Big Data and NoSQL for Database and BI Pros

Hadoop = MapReduce + HDFS

Modeled after Google MapReduce + GFS
Have more data? Just add more nodes to cluster.
• Mappers execute in parallel
• Hardware is commodity
• “Scaling out”

Use of HDFS means data may well be local to mapper processing
• So, not just parallel, but minimal data movement, which avoids network bottlenecks

Page 20: Big Data and NoSQL for Database and BI Pros

Comparison: RDBMS vs. Hadoop

                      Traditional RDBMS          Hadoop / MapReduce
Data Size             Gigabytes (Terabytes)      Petabytes (Exabytes)
Updates               Read / Write many times    Write once, Read many times
Integrity             High (ACID)                Low
Query Response Time   Can be near immediate      Has latency (due to batch processing)

Page 21: Big Data and NoSQL for Database and BI Pros

Just-in-Time Schema

When looking at unstructured data, schema is imposed at query time
Schema is context specific
• If scanning a book, are the values words, lines, or pages?
• Are notes a single field, or is each word a value?
• Are date and time two fields or one?
• Are street, city, state, zip separate or one value?
• Pig and Hive let you determine this at query time
• So does the Map function in MapReduce code
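
As a rough illustration of schema being imposed at query time, the sketch below reads the same raw lines in two different ways. The sample records, the "|" delimiter and the function names are assumptions invented for this example; the point is only that the stored bytes do not change, while the fields do.

```python
# Hypothetical raw records, stored as plain text with no declared schema.
RAW_LINES = [
    "2013-04-10 09:15:00|123 Main Street, New York, NY, 10014",
    "2013-04-11 17:40:00|321 Elm Street, Chicago, IL, 60601",
]

def as_coarse_record(line):
    """One reading: a single timestamp field and a single address field."""
    timestamp, address = line.split("|", 1)
    return {"timestamp": timestamp, "address": address}

def as_fine_record(line):
    """Another reading of the same text: date and time split, address parsed."""
    timestamp, address = line.split("|", 1)
    date, time = timestamp.split(" ")
    street, city, state, zip_code = [part.strip() for part in address.split(",")]
    return {"date": date, "time": time, "street": street,
            "city": city, "state": state, "zip": zip_code}

coarse = [as_coarse_record(line) for line in RAW_LINES]
fine = [as_fine_record(line) for line in RAW_LINES]
print(coarse[0])
print(fine[0])
```

Which reading is right depends on the question being asked, which is the sense in which the schema is context specific.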

Page 22: Big Data and NoSQL for Database and BI Pros

What’s HBase?

A Wide-Column Store NoSQL database
Modeled after Google BigTable
Uses HDFS
Therefore, Hadoop-compatible
Hadoop MapReduce often used with HBase
But you can use either without the other

Page 23: Big Data and NoSQL for Database and BI Pros


Page 24: Big Data and NoSQL for Database and BI Pros

NoSQL Confusion

Many ‘flavors’ of NoSQL data stores
Easiest to group by functionality, but…
• Dividing lines are not clear or consistent
NoSQL choice(s) driven by many factors
• Type of data
• Quantity of data
• Knowledge of technical staff
• Product maturity
• Tooling

Page 25: Big Data and NoSQL for Database and BI Pros

So much wrong information

Everything is ‘new’
People are religious about data storage
Lots of incorrect information
‘Try’ before you ‘buy’ (or use)
Watch out for oversimplification
Confusion over vendor offerings

Page 26: Big Data and NoSQL for Database and BI Pros

Common NoSQL Misconceptions

Problems

Everything is ‘new’
People are religious about data storage
Open source is always cheaper
Cloud is always cheaper
Replace RDBMS with NoSQL

Solutions

‘Try’ before you ‘buy’ (or use)
Leverage NoSQL communities
Add NoSQL to existing RDBMS solution

Page 27: Big Data and NoSQL for Database and BI Pros


Drilldown on Big Data

Page 28: Big Data and NoSQL for Database and BI Pros

The Hadoop Stack

MapReduce, HDFS

Database

RDBMS Import/Export

Query: HiveQL and Pig Latin

Machine Learning/Data Mining

Log file integration

Page 29: Big Data and NoSQL for Database and BI Pros

What’s Hive?

Began as Hadoop sub-project
Now top-level Apache project

Provides a SQL-like (“HiveQL”) abstraction over MapReduce
Has its own HDFS table file format (and it’s fully schema-bound)
Can also work over HBase
Acts as a bridge to many BI products which expect tabular data

Page 30: Big Data and NoSQL for Database and BI Pros

Hadoop Distributions

Cloudera
Hortonworks
• HCatalog: Hive/Pig/MR Interop
MapR
• Network File System replaces HDFS
IBM InfoSphere BigInsights
• HDFS<->DB2 integration
And now Microsoft…

Page 31: Big Data and NoSQL for Database and BI Pros

Microsoft HDInsight

Developed with Hortonworks and incorporates Hortonworks Data Platform (HDP) for Windows
Windows Azure HDInsight and Microsoft HDInsight (for Windows Server)
• Single node preview runs on Windows client

Includes ODBC Driver for Hive
JavaScript MapReduce framework
Contribute it all back to open source Apache Project

Page 32: Big Data and NoSQL for Database and BI Pros

Amenities for Visual Studio/.NET

Hortonworks Data Platform for Windows
• MRLib (NuGet Package)
• LINQ to Hive
• OdbcClient + Hive ODBC Driver
• Deployment
• Debugging
• MR code in C#, HadoopJob, MapperBase, ReducerBase

Page 33: Big Data and NoSQL for Database and BI Pros

Some ways to work

Microsoft HDInsight
• Cloud: go to www.windowsazure.com, request a cluster
• Local: Download Microsoft HDInsight
  • Runs on just about anything, including Windows XP
  • Get it via the Web Platform installer (WebPI)
• Local version is free; cloud billed at 50% discount during preview
Amazon Web Services Elastic MapReduce
• Create AWS account
• Select Elastic MapReduce in Dashboard
• Cheap for experimenting, but not free
Cloudera CDH VM image
• Download as .tar.gz file
• “Un-tar” (can use WinRAR, 7zip)
• Run via VMWare Player or Virtual Box
• Everything’s free

Page 34: Big Data and NoSQL for Database and BI Pros

Some ways to work

HDInsight EMR CDH 4

Page 35: Big Data and NoSQL for Database and BI Pros

Microsoft HDInsight

Much simpler than the others
Browser-based portal
• Launch MapReduce jobs
• Azure: Provisioning cluster, managing ports, gather external data

Interactive JavaScript & Hive console
• JS: HDFS, Pig, light data visualization
• Hive commands and metadata discovery
• New console coming

Desktop Shortcuts:
• Command window, MapReduce, Name Node status in browser
• Azure: from portal page you can RDP directly to Hadoop head node for these desktop shortcuts

Page 36: Big Data and NoSQL for Database and BI Pros

Demo: Windows Azure HDInsight

Page 37: Big Data and NoSQL for Database and BI Pros

Amazon Elastic MapReduce

Lots of steps!
At a high level:
• Set up AWS account and S3 “buckets”
• Generate Key Pair and PEM file
• Install Ruby and EMR Command Line Interface
• Provision the cluster using CLI
  • A batch file can work very well here
• Set up and run SSH/PuTTY
• Work interactively at command line

Page 38: Big Data and NoSQL for Database and BI Pros

Demo: Amazon Elastic MapReduce

Page 39: Big Data and NoSQL for Database and BI Pros

Cloudera CDH4 Virtual Machine

Get it for free, in VMWare and Virtual Box versions.
• VMWare player and Virtual Box are free too

Run it, and configure it to have its own IP on your network. Use ifconfig to discover IP.
Assuming IP of 192.168.1.59, open browser on your own (host) machine and navigate to:
• http://192.168.1.59:8888

Can also use browser in VM and hit:
• http://localhost:8888

Work in “Hue”…

Page 40: Big Data and NoSQL for Database and BI Pros

Hue

Browser based UI, with front ends for:
• HDFS (w/ upload & download)
• MapReduce job creation and monitoring
• Hive (“Beeswax”)
And in-browser command line shells for:
• HBase
• Pig (“Grunt”)

Page 41: Big Data and NoSQL for Database and BI Pros

Impala: What it Is

Distributed SQL query engine over Hadoop cluster
Announced at Strata/Hadoop World in NYC on October 24th
In Beta, as part of CDH 4.1
Works with HDFS and Hive data
Compatible with HiveQL and Hive drivers
• Query with Beeswax

Page 42: Big Data and NoSQL for Database and BI Pros

Impala: What it’s Not

Impala is not Hive
• Hive converts HiveQL to Java MapReduce code and executes it in batch mode
• Impala executes query interactively over the data
• Brings BI tools and Hadoop closer together

Impala is not an Apache Software Foundation project
• It is open source and Apache-licensed, but it’s still incubated by Cloudera
• Only in CDH

Page 43: Big Data and NoSQL for Database and BI Pros

Demo: Cloudera CDH4, Impala

Page 44: Big Data and NoSQL for Database and BI Pros

Hadoop commands

HDFS
• hadoop fs -filecommand
• Create and remove directories
  • mkdir, rm, rmr
• Upload and download files to/from HDFS
  • get, put
• View directory contents
  • ls, lsr
• Copy, move, view files
  • cp, mv, cat

MapReduce
• Run a Java jar-file based job
  • hadoop jar jarname params

Page 45: Big Data and NoSQL for Database and BI Pros

Demo: Hadoop (directly)

Page 46: Big Data and NoSQL for Database and BI Pros

HBase

Concepts:
• Tables, column families
• Columns, rows
• Keys, values
Commands:
• Definition: create, alter, drop, truncate
• Manipulation: get, put, delete, deleteall, scan
• Discovery: list, exists, describe, count
• Enablement: disable, enable
• Utilities: version, status, shutdown, exit
• Reference: http://wiki.apache.org/hadoop/Hbase/Shell

Moreover,
• Interesting HBase work can be done in MapReduce, Pig

Page 47: Big Data and NoSQL for Database and BI Pros

HBase Examples

create 't1', 'f1', 'f2', 'f3'
describe 't1'
alter 't1', {NAME => 'f1', VERSIONS => 5}
put 't1', 'r1', 'f1:c1', 'value'
get 't1', 'r1'
count 't1'

Page 48: Big Data and NoSQL for Database and BI Pros

Demo: HBase

Page 49: Big Data and NoSQL for Database and BI Pros

Submitting, Running and Monitoring Jobs

Upload a JAR
Use Streaming
• Use other languages (i.e. other than Java) to write MapReduce code
• Python is a popular option
• Any executable works, even C# console apps
• On MS HDInsight, JavaScript works too
• Still uses a JAR file: streaming.jar

Run at command line (passing JAR name and params) or use GUI
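
Since Python is named as a popular Streaming language, here is a hedged sketch of what a Streaming-style word count might look like. Streaming programs read lines and write tab-separated key/value lines; the two functions below follow that convention but take and return iterables so they can be run locally. The function names and the sample input are assumptions for illustration only.

```python
from itertools import groupby

def stream_mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def stream_reducer(sorted_lines):
    """Sum the counts per key; the input must already be sorted by key."""
    parsed = (line.split("\t", 1) for line in sorted_lines)
    for key, group in groupby(parsed, key=lambda pair: pair[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"

# Local stand-in for the cluster: map, sort by key, then reduce.
mapped = sorted(stream_mapper(["pig hive pig", "hive pig"]))
reduced = list(stream_reducer(mapped))
print(reduced)
```

In an actual Streaming job each function would typically live in its own script that loops over standard input and prints to standard output, with the framework doing the sort between the two phases.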

Page 50: Big Data and NoSQL for Database and BI Pros

Demo: Running MapReduce Jobs

Page 51: Big Data and NoSQL for Database and BI Pros

Hive

Used by most BI products which connect to Hadoop
Provides a SQL-like abstraction over Hadoop
• Officially HiveQL, or HQL

Works on own tables, but also on HBase
Query generates MapReduce job, output of which becomes result set
Microsoft has Hive ODBC driver
• Connects Excel, Reporting Services, PowerPivot, Analysis Services Tabular Mode (only)

Page 52: Big Data and NoSQL for Database and BI Pros

Hive, Continued

Load data from flat HDFS files
• LOAD DATA [LOCAL] INPATH 'myfile' INTO TABLE mytable;

SQL Queries
• CREATE, ALTER, DROP
• INSERT OVERWRITE (creates whole tables)
• SELECT, JOIN, WHERE, GROUP BY
• SORT BY, but ordering data is tricky!
• MAP/REDUCE/TRANSFORM…USING allows for custom map, reduce steps utilizing Java or streaming code

Page 53: Big Data and NoSQL for Database and BI Pros

Data Explorer
• Beta add-in for Excel
• Acquire, transform data
• Data sources include Facebook, HDFS
• Visually- or script-driven
• Also includes Azure BLOB storage backing up HDInsight

Page 54: Big Data and NoSQL for Database and BI Pros

Demo: Hive, Data Explorer

Page 55: Big Data and NoSQL for Database and BI Pros

Pig

Instead of SQL, employs a language (“Pig Latin”) that accommodates data flow expressions
• Do a combo of Query and ETL

“10 lines of Pig Latin ≈ 200 lines of Java.”
Works with structured or unstructured data
Operations
• As with Hive, a MapReduce job is generated
• Unlike Hive, output is only flat file to HDFS or text at command line console
• With HDInsight, can easily convert to JavaScript array, then manipulate

Use command line (“Grunt”) or build scripts

Page 56: Big Data and NoSQL for Database and BI Pros

Example

A = LOAD 'myfile' AS (x, y, z);
B = FILTER A BY x > 0;
C = GROUP B BY x;
D = FOREACH C GENERATE group, COUNT(B);
STORE D INTO 'output';
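
For readers more used to general-purpose code, the filter, group and count steps of the script above correspond roughly to the Python below. The sample rows are hypothetical stand-ins for the contents of 'myfile', and this runs in memory rather than as a MapReduce job.

```python
from collections import Counter

# Hypothetical stand-in for the rows of 'myfile': tuples of (x, y, z).
rows = [(1, "a", 10), (0, "b", 20), (1, "c", 30), (2, "d", 40), (-3, "e", 50)]

# FILTER ... BY x > 0: keep only rows with a positive x.
filtered = [row for row in rows if row[0] > 0]

# GROUP ... BY x, then COUNT: number of surviving rows for each value of x.
counts_by_x = Counter(row[0] for row in filtered)

# One (x, count) pair per group; sorted here only to make the output stable.
result = sorted(counts_by_x.items())
print(result)
```

The data-flow style is the same in both versions: each statement names an intermediate data set that the next statement consumes.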

Page 57: Big Data and NoSQL for Database and BI Pros

Pig Latin Examples

Imperative, file system commands
• LOAD, STORE
  • Schema specified on LOAD

Declarative, query commands (SQL-like)
• xxx = file or data set
• FOREACH xxx GENERATE (SELECT…FROM xxx)
• JOIN (WHERE/INNER JOIN)
• FILTER xxx BY (WHERE)
• ORDER xxx BY (ORDER BY)
• GROUP xxx BY / GENERATE COUNT(xxx) (SELECT COUNT(*) GROUP BY)
• DISTINCT (SELECT DISTINCT)
Syntax is assignment statement-based:
• MyCusts = FILTER Custs BY SalesPerson == 15;
Access HBase
• CpuMetrics = LOAD 'hbase://SystemMetrics' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cpu:','-loadKey -returnTuple');

Page 58: Big Data and NoSQL for Database and BI Pros

Demo: Pig

Page 59: Big Data and NoSQL for Database and BI Pros

Sqoop

sqoop import --connect "jdbc:sqlserver://<servername>.database.windows.net:1433;database=<dbname>;user=<username>@<servername>;password=<password>" --table <from_table> --target-dir <to_hdfs_folder> --split-by <from_table_column>

Page 60: Big Data and NoSQL for Database and BI Pros

Sqoop

sqoop export --connect "jdbc:sqlserver://<servername>.database.windows.net:1433;database=<dbname>;user=<username>@<servername>;password=<password>" --table <to_table> --export-dir <from_hdfs_folder> --input-fields-terminated-by "<delimiter>"

Page 61: Big Data and NoSQL for Database and BI Pros

Flume NG

Source
• Avro (data serialization system – can read json-encoded data files, and can work over RPC)
• Exec (reads from stdout of long-running process)

Sinks
• HDFS, HBase, Avro

Channels
• Memory, JDBC, file

Page 62: Big Data and NoSQL for Database and BI Pros

Flume NG (next generation)

Setup conf/flume.conf

# Define a memory channel called ch1 on agent1
agent1.channels.ch1.type = memory

# Define an Avro source called avro-source1 on agent1 and tell it
# to bind to 0.0.0.0:41414. Connect it to channel ch1.
agent1.sources.avro-source1.channels = ch1
agent1.sources.avro-source1.type = avro
agent1.sources.avro-source1.bind = 0.0.0.0
agent1.sources.avro-source1.port = 41414

# Define a logger sink that simply logs all events it receives
# and connect it to the other end of the same channel.
agent1.sinks.log-sink1.channel = ch1
agent1.sinks.log-sink1.type = logger

# Finally, now that we've defined all of our components, tell
# agent1 which ones we want to activate.
agent1.channels = ch1
agent1.sources = avro-source1
agent1.sinks = log-sink1

From the command line:
flume-ng agent --conf ./conf/ -f conf/flume.conf -n agent1

Page 63: Big Data and NoSQL for Database and BI Pros

Mahout Algorithms

Recommendation
• Your info + community info
• Give users/items/ratings; get user-user/item-item
• itemsimilarity
Classification/Categorization
• Drop into buckets
• Naïve Bayes, Complementary Naïve Bayes, Decision Forests
Clustering
• Like classification, but with categories unknown
• K-Means, Fuzzy K-Means, Canopy, Dirichlet, Mean-Shift

Page 64: Big Data and NoSQL for Database and BI Pros

Workflow, Syntax

Workflow
• Run the job
• Dump the output
• Visualize, predict

mahout algorithm --input folderspec --output folderspec --param1 value1 --param2 value2…
Example:
• mahout itemsimilarity --input <input-hdfs-path> --output <output-hdfs-path> --tempDir <tmp-hdfs-path> -s SIMILARITY_LOGLIKELIHOOD

Page 65: Big Data and NoSQL for Database and BI Pros

The Truth About Mahout

Mahout is really just an algorithm engine
Its output is almost unusable by non-statisticians/non-data scientists
You need a staff or a product to visualize, or make into a usable prediction model
Investigate Predixion Software
• CTO, Jamie MacLennan, used to lead SQL Server Data Mining team
• Excel add-in can use Mahout remotely, visualize its output, run predictive analyses
• Also integrates with SQL Server, Greenplum, MapReduce
• http://www.predixionsoftware.com

Page 66: Big Data and NoSQL for Database and BI Pros

The “Data-Refinery” Idea

Use Hadoop to “on-board” unstructured data, then extract manageable subsets
Load the subsets into conventional DW/BI servers and use familiar analytics tools to examine them
This is the current rationalization of Hadoop + BI tools’ coexistence
Will it stay this way?

Page 67: Big Data and NoSQL for Database and BI Pros

Google BigQuery

Dremel-based service for massive amounts of data
Pay for query and storage
SQL-like query language
Has an Excel connector

Page 68: Big Data and NoSQL for Database and BI Pros


Google BigQuery

Page 69: Big Data and NoSQL for Database and BI Pros


Drilldown on NoSQL

Page 70: Big Data and NoSQL for Database and BI Pros

NoSQL Data Fodder

Addresses
Preferences
Notes
Friends, Followers
Documents

Page 71: Big Data and NoSQL for Database and BI Pros

“Web Scale”

This is the term used to justify NoSQL
Scenario is simple needs but “made up for in volume”
• Millions of concurrent users

Think of sites like Amazon or Google
Think of non-transactional tasks like loading catalog data to display product page, or environment preferences

Page 72: Big Data and NoSQL for Database and BI Pros

NoSQL Common Traits

Non-relational
Non-schematized/schema-free
Open source
Distributed
Eventual consistency
“Web scale”
Developed at big Internet companies

Page 73: Big Data and NoSQL for Database and BI Pros

So many NoSQL options

More than just the Elephant in the room
More than 120 types of NoSQL databases

Page 74: Big Data and NoSQL for Database and BI Pros

Concepts

Consistency
CAP Theorem
Indexing
Queries
MapReduce
Sharding

Page 75: Big Data and NoSQL for Database and BI Pros

Consistency

CAP Theorem

• Databases may only excel at two of the following three attributes: consistency, availability and partition tolerance

NoSQL does not offer “ACID” guarantees

• Atomicity, consistency, isolation and durability

Instead offers “eventual consistency”

Similar to DNS propagation

Page 76: Big Data and NoSQL for Database and BI Pros

Things like inventory, account balances should be consistent

• Imagine updating a server in Seattle that stock was depleted

• Imagine not updating the server in NY

• Customer in NY goes to order 50 pieces of the item

• Order processed even though no stock

Things like catalog information don’t have to be, at least not immediately

• If a new item is entered into the catalog, it’s OK for some customers to see it even before the other customers’ server knows about it

But catalog info must come up quickly

• Therefore don’t lock data in one location while waiting to update the other

Therefore, OK to sacrifice consistency for speed, in some cases

Consistency
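
The Seattle/NY inventory scenario can be simulated with a toy model of eventual consistency. This is a sketch under simplifying assumptions: two in-memory replicas, a single queue of pending updates, and invented class and key names. Real stores use far more elaborate replication and conflict handling.

```python
class Replica:
    """One server's copy of the data."""
    def __init__(self, name):
        self.name = name
        self.data = {}

class EventuallyConsistentStore:
    def __init__(self, replicas):
        self.replicas = replicas
        self.pending = []  # updates accepted but not yet copied everywhere

    def write(self, replica, key, value):
        """Apply locally and return at once; other replicas catch up later."""
        replica.data[key] = value
        self.pending.append((replica, key, value))

    def propagate(self):
        """Deliver queued updates to every other replica."""
        for origin, key, value in self.pending:
            for replica in self.replicas:
                if replica is not origin:
                    replica.data[key] = value
        self.pending.clear()

seattle, new_york = Replica("Seattle"), Replica("NY")
store = EventuallyConsistentStore([seattle, new_york])
for replica in (seattle, new_york):
    replica.data["stock:52134"] = 50

store.write(seattle, "stock:52134", 0)      # stock depleted in Seattle
stale_read = new_york.data["stock:52134"]    # NY has not heard yet
store.propagate()
fresh_read = new_york.data["stock:52134"]    # NY has caught up
print(stale_read, fresh_read)
```

The window between the write and the propagation is where an order for out-of-stock items could slip through, which is why the slide reserves this trade-off for data such as catalog entries rather than inventory or balances.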

Page 77: Big Data and NoSQL for Database and BI Pros

CAP Theorem

Consistency

Availability

Partition Tolerance

Relational

NoSQL

Page 78: Big Data and NoSQL for Database and BI Pros

Indexing

Most NoSQL databases are indexed by key
Some allow so-called “secondary” indexes
Often the primary key indexes are clustered
HBase uses HDFS (the Hadoop Distributed File System), which is append-only
• Writes are logged

• Logged writes are batched

• File is re-created and sorted
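
The log, batch and re-create cycle listed above can be mimicked with a small in-memory class. This is a deliberately simplified sketch, not HBase's actual implementation; the class name, the flush threshold and the use of Python lists as stand-ins for files are all assumptions.

```python
class AppendOnlyTable:
    def __init__(self, flush_threshold=3):
        self.log = []            # every write is logged first
        self.buffer = {}         # logged writes are batched in memory
        self.sorted_file = []    # stand-in for the immutable, key-sorted file
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.log.append((key, value))
        self.buffer[key] = value
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Re-create the file: merge old contents with the batch, sort by key."""
        merged = dict(self.sorted_file)
        merged.update(self.buffer)
        self.sorted_file = sorted(merged.items())
        self.buffer.clear()

    def get(self, key):
        """Newest value wins: check the batch first, then the sorted file."""
        if key in self.buffer:
            return self.buffer[key]
        return dict(self.sorted_file).get(key)

table = AppendOnlyTable()
for key, value in [("r3", "c"), ("r1", "a"), ("r2", "b"), ("r1", "z")]:
    table.put(key, value)
print(table.sorted_file, table.get("r1"))
```

Because the file is only ever rewritten whole and in key order, lookups by key stay cheap even though individual updates are never applied in place.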

Page 79: Big Data and NoSQL for Database and BI Pros

Queries

Typically no query language
Instead, create procedural program
Sometimes SQL is supported
Sometimes MapReduce code is used…

Page 80: Big Data and NoSQL for Database and BI Pros

MapReduce

This is not Hadoop’s MapReduce, but it’s conceptually related
Map step: pre-processes data
Reduce step: summarizes/aggregates data
Will show a MapReduce code sample for Mongo soon
Will demo map code on CouchDB

Page 81: Big Data and NoSQL for Database and BI Pros


Page 82: Big Data and NoSQL for Database and BI Pros

Sharding

A partitioning pattern where separate servers store partitions
Fan-out queries supported
Partitions may be duplicated, so replication also provided
• Good for disaster recovery

Since “shards” can be geographically distributed, sharding can act like a CDN
Good for keeping data close to processing
• Reduces network traffic when MapReduce splitting takes place
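
A minimal sketch of the pattern, assuming hash-based placement (one common choice among several) and using in-memory dictionaries in place of separate servers. The class and method names are invented for the example, and replication is left out to keep it short.

```python
import hashlib

class ShardedStore:
    def __init__(self, shard_count):
        # Each dict stands in for one server holding one partition.
        self.shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key):
        """Pick a shard from a stable hash of the key."""
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key):
        """A lookup by key touches exactly one shard."""
        return self._shard_for(key).get(key)

    def fan_out(self, predicate):
        """Send the same query to every shard and merge the partial results."""
        results = {}
        for shard in self.shards:
            results.update({k: v for k, v in shard.items() if predicate(v)})
        return results

store = ShardedStore(shard_count=3)
for order_id, price in [("1501", 300), ("1502", 2500), ("1503", 75)]:
    store.put(order_id, price)
big_orders = store.fan_out(lambda price: price > 200)
print(big_orders)
```

Key lookups stay local to one partition, while queries that cannot be answered from the key alone have to visit every shard, which is the fan-out case named on the slide.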

Page 83: Big Data and NoSQL for Database and BI Pros

NoSQL Categories

Graph
Wide Column
Document
Key/Value

Page 84: Big Data and NoSQL for Database and BI Pros


Key-Value Stores

The most common; not necessarily the most popular
Has rows, each with something like a big dictionary/associative array
• Schema may differ from row to row

Common on cloud platforms
• e.g. Amazon SimpleDB, Azure Table Storage

MemcacheDB, Voldemort, Couchbase, DynamoDB (AWS), Dynomite, Redis and Riak

Page 85: Big Data and NoSQL for Database and BI Pros

Key-Value Stores

Database

Table: Customers
  Row ID: 101
    First_Name: Andrew
    Last_Name: Brust
    Address: 123 Main Street
    Last_Order: 1501
  Row ID: 202
    First_Name: Jane
    Last_Name: Doe
    Address: 321 Elm Street
    Last_Order: 1502

Table: Orders
  Row ID: 1501
    Price: 300 USD
    Item1: 52134
    Item2: 24457
  Row ID: 1502
    Price: 2500 GBP
    Item1: 98456
    Item2: 59428

Page 86: Big Data and NoSQL for Database and BI Pros

Wide Column Stores

Has tables with declared column families

• Each column family has “columns” which are KV pairs that can vary from row to row

These are the most foundational for large sites

• BigTable (Google)

• HBase (Originally part of Yahoo-dominated Hadoop project)

• Cassandra (Facebook)

• Calls column families “super columns” and tables “super column families”

They are the most “Big Data”-ready

• Especially HBase + Hadoop

Page 87: Big Data and NoSQL for Database and BI Pros

Wide Column Stores

Table: Customers
  Row ID: 101
    Super Column: Name
      Column: First_Name: Andrew
      Column: Last_Name: Brust
    Super Column: Address
      Column: Number: 123
      Column: Street: Main Street
    Super Column: Orders
      Column: Last_Order: 1501
  Row ID: 202
    Super Column: Name
      Column: First_Name: Jane
      Column: Last_Name: Doe
    Super Column: Address
      Column: Number: 321
      Column: Street: Elm Street
    Super Column: Orders
      Column: Last_Order: 1502

Table: Orders
  Row ID: 1501
    Super Column: Pricing
      Column: Price: 300 USD
    Super Column: Items
      Column: Item1: 52134
      Column: Item2: 24457
  Row ID: 1502
    Super Column: Pricing
      Column: Price: 2500 GBP
    Super Column: Items
      Column: Item1: 98456
      Column: Item2: 59428

Page 88: Big Data and NoSQL for Database and BI Pros

Demo: Wide Column Stores

Page 89: Big Data and NoSQL for Database and BI Pros

Document Stores

Have “databases,” which are akin to tables
Have “documents,” akin to rows

• Documents are typically JSON objects

• Each document has properties and values

• Values can be scalars, arrays, links to documents in other databases or sub-documents (i.e. contained JSON objects - Allows for hierarchical storage)

• Can have attachments as well

Old versions are retained

• So Doc Stores work well for content management

Some view doc stores as specialized KV stores
Most popular with developers, startups, VCs
The biggies:

• CouchDB

• Derivatives

• MongoDB

Page 90: Big Data and NoSQL for Database and BI Pros

Document Store Application Orientation

Documents can each be addressed by URIs
CouchDB supports full REST interface
Very geared towards JavaScript and JSON

• Documents are JSON objects

• CouchDB/MongoDB use JavaScript as native language

In CouchDB, “view functions” also have unique URIs and they return HTML

• So you can build entire applications in the database

Page 91: Big Data and NoSQL for Database and BI Pros

Document Stores

Database: Customers
  Document ID: 101
    First_Name: Andrew
    Last_Name: Brust
    Address:
      Number: 123
      Street: Main Street
    Orders:
      Most_recent: 1501
  Document ID: 202
    First_Name: Jane
    Last_Name: Doe
    Address:
      Number: 321
      Street: Elm Street
    Orders:
      Most_recent: 1502

Database: Orders
  Document ID: 1501
    Price: 300 USD
    Item1: 52134
    Item2: 24457
  Document ID: 1502
    Price: 2500 GBP
    Item1: 98456
    Item2: 59428
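
The customer and order documents from this slide map naturally onto nested dictionaries, which is close to how they would look as JSON. The sketch below uses plain Python dicts as stand-ins for the two databases; the helper name most_recent_order is made up, and no actual document store API is involved.

```python
import json

# Each top-level dict stands in for one "database" of documents keyed by ID.
customers = {
    101: {"First_Name": "Andrew", "Last_Name": "Brust",
          "Address": {"Number": 123, "Street": "Main Street"},
          "Orders": {"Most_recent": 1501}},
    202: {"First_Name": "Jane", "Last_Name": "Doe",
          "Address": {"Number": 321, "Street": "Elm Street"},
          "Orders": {"Most_recent": 1502}},
}
orders = {
    1501: {"Price": "300 USD", "Item1": 52134, "Item2": 24457},
    1502: {"Price": "2500 GBP", "Item1": 98456, "Item2": 59428},
}

def most_recent_order(customer_id):
    """Follow the link from a customer document to its order document."""
    order_id = customers[customer_id]["Orders"]["Most_recent"]
    return orders[order_id]

# A document serializes directly to JSON, sub-documents included.
as_json = json.dumps(customers[101], sort_keys=True)
print(as_json)
print(most_recent_order(202))
```

Note that the address is stored inside the customer as a sub-document, while the order is reached through a link to a document in another database; both options are mentioned on the earlier Document Stores slide.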

Page 92: Big Data and NoSQL for Database and BI Pros

Comparing…

Page 93: Big Data and NoSQL for Database and BI Pros

Demo: Document Stores

Page 94: Big Data and NoSQL for Database and BI Pros

Graph Databases

Great for social network applications and others where relationships are important
Nodes and edges
• Edge like a join
• Nodes like rows in a table

Nodes can also have properties and values
Neo4j is a popular graph db
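
To make the node-and-edge idea concrete, here is a small in-memory graph with labeled edges and a friends-of-friends query. It is an illustrative sketch only: the Graph class, the FRIEND_OF label and the particular friendships are assumptions, and a product such as Neo4j has its own API and query language.

```python
from collections import defaultdict

class Graph:
    def __init__(self):
        self.node_properties = {}
        self.edges = defaultdict(list)   # node -> [(label, other node)]

    def add_node(self, name, **properties):
        self.node_properties[name] = properties

    def add_edge(self, source, label, target):
        self.edges[source].append((label, target))

    def neighbors(self, name, label):
        """Follow edges with the given label, much as a join follows a key."""
        return [target for edge_label, target in self.edges[name]
                if edge_label == label]

def friends_of_friends(graph, name):
    """People two FRIEND_OF hops away who are not already direct friends."""
    direct = set(graph.neighbors(name, "FRIEND_OF"))
    two_hops = set()
    for friend in direct:
        two_hops.update(graph.neighbors(friend, "FRIEND_OF"))
    return two_hops - direct - {name}

graph = Graph()
graph.add_node("Andrew Brust", city="New York")
graph.add_node("Jane Doe")
graph.add_node("Joe Smith")
graph.add_edge("Andrew Brust", "FRIEND_OF", "Jane Doe")
graph.add_edge("Jane Doe", "FRIEND_OF", "Andrew Brust")
graph.add_edge("Jane Doe", "FRIEND_OF", "Joe Smith")
suggestions = friends_of_friends(graph, "Andrew Brust")
print(suggestions)
```

A relational version of the same question needs a self-join per hop; in a graph model each hop is just another edge traversal.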

Page 95: Big Data and NoSQL for Database and BI Pros

Graph Databases

[Diagram: a graph database. Person nodes: Andrew Brust, Jane Doe, Joe Smith, George Washington. Edge labels: Sent invitation to, Commented on photo by, Friend of, Address, Placed order, Item1, Item2.]
Address node: Street: 123 Main Street, City: New York, State: NY, Zip: 10014
Order node: ID: 252, Total Price: 300 USD
Item nodes: ID: 52134, Type: Dress, Color: Blue; ID: 24457, Type: Shirt, Color: Red

Page 96: Big Data and NoSQL for Database and BI Pros

NoSQL on Windows Azure

Platform as a Service
• Cloudant: https://cloudant.com/azure/
• MongoDB (via MongoLab): http://blog.mongolab.com/2012/10/azure/

MongoDB, DIY:
• On an Azure Worker Role: http://www.mongodb.org/display/DOCS/MongoDB+on+Azure+Worker+Roles
• On a Windows VM: http://www.mongodb.org/display/DOCS/MongoDB+on+Azure+VM+-+Windows+Installer
• On a Linux VM:
  http://www.mongodb.org/display/DOCS/MongoDB+on+Azure+VM+-+Linux+Tutorial
  http://www.windowsazure.com/en-us/manage/linux/common-tasks/mongodb-on-a-linux-vm/

Page 98: Big Data and NoSQL for Database and BI Pros

NoSQL + BI

NoSQL databases are bad for ad hoc query and data warehousing
BI applications involve models; models rely on schema
Extract, transform and load (ETL) may be your friend
Wide-column stores, however, are good for “Big Data”

• See next slide

Wide-column stores and column-oriented databases are similar technologically

Page 99: Big Data and NoSQL for Database and BI Pros

NoSQL + Big Data

Big Data and NoSQL are interrelated
Typically, Wide-Column stores used in Big Data scenarios
Prime example:
• HBase and Hadoop

Why?• Lack of indexing not a problem

• Consistency not an issue

• Fast reads very important

• Distributed file systems important too

• Commodity hardware and disk assumptions also important

• Not Web scale but massive scale-out, so similar concerns

Page 100: Big Data and NoSQL for Database and BI Pros

NoSQL Compromises

Eventual consistency
Write buffering
Only primary keys can be indexed
Queries must be written as programs
Tooling
• Productivity (= money)

Page 101: Big Data and NoSQL for Database and BI Pros

Common DBA Tasks in NoSQL

RDBMS                            NoSQL
Import Data                      Import Data
Setup Security                   Setup Security
Perform a Backup                 Make a copy of the data
Restore a Database               Move a copy to a location
Create an Index                  Create an Index
Join Tables Together             Run MapReduce
Schedule a Job                   Schedule a (Cron) Job
Run Database Maintenance         Monitor space and resources used
Send an Email from SQL Server    Set up resource threshold alerts
Search BOL                       Interpret Documentation

Page 102: Big Data and NoSQL for Database and BI Pros

Which Type of NoSQL for Which Type of Data?

Type of Data               Type of NoSQL solution   Example
Log files                  Wide Column              HBase
Product Catalogs           Key Value on disk        DynamoDB
User profiles              Key Value in memory      Redis
Startups                   Document                 MongoDB
Social media connections   Graph                    Neo4j
LOB w/Transactions         NONE! Use RDBMS          SQL Server

Page 103: Big Data and NoSQL for Database and BI Pros

Relational vs. NoSQL

Line of Business -> Relational

Large, public (consumer)-facing sites -> NoSQL

Complex data structures -> Relational

Big Data -> NoSQL

Transactional -> Relational

Content Management -> NoSQL

Enterprise -> Relational

Consumer Web -> NoSQL

Page 104: Big Data and NoSQL for Database and BI Pros

Data Scientists…

Page 105: Big Data and NoSQL for Database and BI Pros

NoSQL To-Do List

Understand CAP & types of NoSQL databases
• Use NoSQL when business needs designate
• Use the right type of NoSQL for your business problem

Try out NoSQL on the cloud
• Quick and cheap for behavioral data
• Mashup cloud datasets
• Good for specialized use cases, e.g. dev, test, training environments

Learn NoSQL access technologies
• New query languages, e.g. MapReduce, R, Infer.NET
• New query tools (vendor-specific) – Google Refine, Amazon Karmasphere, Microsoft Excel connectors, etc…

Page 107: Big Data and NoSQL for Database and BI Pros

Thank You

[email protected]
• @andrewbrust on twitter
• Want to get on Blue Badge Insights’ list? Text “bluebadge” to 22828

Page 108: Big Data and NoSQL for Database and BI Pros

Thank you!
Diamond Sponsor