April 10-12 | Chicago, IL
Big Data and NoSQL for Database and BI Pros
Andrew J. Brust, Founder and CEO, Blue Badge Insights
Meet Andrew
• CEO and Founder, Blue Badge Insights
• Big Data blogger for ZDNet
• Microsoft Regional Director, MVP
• Co-chair of VSLive! and 17 years as a speaker
• Founder, Microsoft BI User Group of NYC: http://www.msbinyc.com
• Co-moderator, NYC .NET Developers Group: http://www.nycdotnetdev.com
• "Redmond Review" columnist for Visual Studio Magazine and Redmond Developer News
• brustblog.com, Twitter: @andrewbrust
Lynn Langit (in absentia)
• CEO and Founder, Lynn Langit Consulting
• Former Microsoft Evangelist (4 years)
• Google Developer Expert
• MongoDB Master
• MCT 13 years, 7 certifications
• Cloudera Certified Developer
• MSDN Magazine articles: SQL Azure, Hadoop on Azure, MongoDB on Azure
• www.LynnLangit.com, @LynnLangit
Read all about it!
Agenda
• Overview / Landscape: Big Data and Hadoop; NoSQL; the Big Data-NoSQL intersection
• Drilldown on Big Data
• Drilldown on NoSQL
What is Big Data?
• 100s of TB into PB and higher
• Involving data from: financial data, sensors, web logs, social media, etc.
• Parallel processing often involved
• Hadoop is emblematic, but other technologies are Big Data too
• Processing of data sets too large for transactional databases
• Analyzing interactions, rather than transactions
• The three V's: Volume, Velocity, Variety
• Big Data tech is sometimes imposed on small data problems
Big Data = Exponentially More Data
• Retail example: the "Feedback Economy"
• Number of transactions
• Number of behaviors (collected every minute)
Big Data = "Next State" Questions
• What could happen?
• Why didn't this happen?
• When will the next new thing happen?
• What will the next new thing be?
• What happens?
Collecting behavioral data
My Data: An Example from Health Care
• Medical records: regular, emergency
• Genetic data: 23andMe
• Food data: SparkPeople
• Purchasing: grocery card, credit card
• Search: Google
• Social media: Twitter, Facebook
• Exercise: Nike Fuel Band, Kinect
• Location: phone
Big Data = More Data
Big Data Considerations
• Collection: get the data
• Storage: keep the data
• Querying: make sense of the data
• Visualization: see the business value
Data Collection
Types of data
• Structured, semi-structured, unstructured vs. data standards
• Behavioral vs. transactional data
Methods of collection
• Sensors everywhere
• Machine-to-machine
• Public datasets: Freebase, Azure DataMarket, Hillary Mason's list
What’s MapReduce?
• Partition the bulk input data and send it to mappers (nodes in the cluster)
• Mappers pre-process, put data into key-value format, and send all output for a given (set of) key(s) to a reducer
• The reducer aggregates: one output per key, with value
• Map and Reduce code is natively written as Java functions (a conceptual sketch follows below)
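To make the flow concrete, here is a minimal Python sketch that simulates the map, shuffle, and reduce phases for a word count. It is purely illustrative (plain Python, not Hadoop's API), and the sample data is made up:

from collections import defaultdict

def mapper(line):
    # Map: emit one (word, 1) key-value pair per word in the input line
    for word in line.split():
        yield (word.lower(), 1)

def reducer(key, values):
    # Reduce: aggregate every value emitted for one key
    return (key, sum(values))

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# "Shuffle": group all mapped values by key, as Hadoop does between phases
groups = defaultdict(list)
for line in lines:
    for key, value in mapper(line):
        groups[key].append(value)

# One reducer output per key
for key in sorted(groups):
    print(reducer(key, groups[key]))  # e.g. ('the', 3)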
MapReduce, in a Diagram
[Diagram: input splits flow to parallel mappers; mapper output is partitioned by key (K1/K4, K2/K5, K3/K6) across three reducers, each of which emits one output per key.]
A MapReduce Example
• Count by suite, on each floor
• Send per-suite, per-platform totals to lobby
• Sort totals by platform
• Send two platform packets to 10th, 20th, 30th floors
• Tally up each platform
• Merge tallies into one spreadsheet
• Collect the tallies
What’s a Distributed File System?
• One where data gets distributed over commodity drives on commodity servers
• Data is replicated: if one box goes down, no data is lost ("shared nothing")
• BUT: immutable. Files can only be written to once, so updates require drop + re-write (slow); you can append, though. Like a DVD/CD-ROM.
Hadoop = MapReduce + HDFS
• Modeled after Google MapReduce + GFS
• Have more data? Just add more nodes to the cluster ("scaling out"): mappers execute in parallel on commodity hardware
• Use of HDFS means data may well be local to mapper processing: not just parallel, but minimal data movement, which avoids network bottlenecks
Comparison: RDBMS vs. Hadoop
                      Traditional RDBMS         Hadoop / MapReduce
Data size             Gigabytes (terabytes)     Petabytes (exabytes)
Updates               Read/write many times     Write once, read many times
Integrity             High (ACID)               Low
Query response time   Can be near immediate     Has latency (due to batch processing)
Just-in-Time Schema
• When looking at unstructured data, schema is imposed at query time
• Schema is context-specific:
  • If scanning a book, are the values words, lines, or pages?
  • Are notes a single field, or is each word a value?
  • Are date and time two fields or one?
  • Are street, city, state, zip separate or one value?
• Pig and Hive let you determine this at query time; so does the Map function in MapReduce code (see the sketch below)
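As an illustration of imposing schema at read time, here is a small Python sketch that parses the same raw line two different ways; the line format and field names are invented for the example:

raw = "2013-04-10 09:15:00|123 Main Street, New York, NY 10014"

# Schema A: treat date-and-time as one field and address as one value
timestamp, address = raw.split("|")

# Schema B: impose a finer-grained schema on the very same bytes
date, time = timestamp.split(" ")
street, city, state_zip = [part.strip() for part in address.split(",")]
state, zip_code = state_zip.split()

print(timestamp, address)
print(date, time, street, city, state, zip_code)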
What’s HBase?
• A wide-column store NoSQL database, modeled after Google BigTable
• Uses HDFS, and is therefore Hadoop-compatible
• Hadoop MapReduce is often used with HBase, but you can use either without the other
NoSQL Confusion
• Many "flavors" of NoSQL data stores
• Easiest to group by functionality, but the dividing lines are not clear or consistent
• NoSQL choice(s) driven by many factors: type of data, quantity of data, knowledge of technical staff, product maturity, tooling
So much wrong information
• Everything is "new"
• People are religious about data storage
• Lots of incorrect information
• "Try" before you "buy" (or use)
• Watch out for oversimplification
• Confusion over vendor offerings
Common NoSQL Misconceptions
Problems
• Everything is "new"
• People are religious about data storage
• Open source is always cheaper
• Cloud is always cheaper
• Replace RDBMS with NoSQL

Solutions
• "Try" before you "buy" (or use)
• Leverage NoSQL communities
• Add NoSQL to an existing RDBMS solution
Drilldown on Big Data
The Hadoop Stack
• MapReduce, HDFS
• Database
• RDBMS import/export
• Query: HiveQL and Pig Latin
• Machine learning/data mining
• Log file integration
What’s Hive?
• Began as a Hadoop sub-project; now a top-level Apache project
• Provides a SQL-like ("HiveQL") abstraction over MapReduce
• Has its own HDFS table file format (and it's fully schema-bound)
• Can also work over HBase
• Acts as a bridge to many BI products which expect tabular data
Hadoop Distributions
• Cloudera
• Hortonworks (HCatalog: Hive/Pig/MR interop)
• MapR (Network File System replaces HDFS)
• IBM InfoSphere BigInsights (HDFS <-> DB2 integration)
• And now Microsoft…
Microsoft HDInsight
• Developed with Hortonworks; incorporates Hortonworks Data Platform (HDP) for Windows
• Windows Azure HDInsight and Microsoft HDInsight (for Windows Server); a single-node preview runs on Windows client
• Includes an ODBC Driver for Hive
• JavaScript MapReduce framework
• Contributes it all back to the open source Apache project
Amenities for Visual Studio/.NET
• Hortonworks Data Platform for Windows
• MRLib (NuGet package): MR code in C#, with HadoopJob, MapperBase, ReducerBase
• LINQ to Hive
• OdbcClient + Hive ODBC Driver
• Deployment
• Debugging
Some ways to work
Microsoft HDInsight
• Cloud: go to www.windowsazure.com, request a cluster
• Local: download Microsoft HDInsight; runs on just about anything, including Windows XP; get it via the Web Platform Installer (WebPI)
• The local version is free; the cloud is billed at a 50% discount during preview

Amazon Web Services Elastic MapReduce
• Create an AWS account
• Select Elastic MapReduce in the Dashboard
• Cheap for experimenting, but not free

Cloudera CDH VM image
• Download as a .tar.gz file; "un-tar" (can use WinRAR, 7zip)
• Run via VMware Player or VirtualBox
• Everything's free
Some ways to work
[Screenshots: HDInsight, EMR, CDH 4]
Microsoft HDInsight
• Much simpler than the others
• Browser-based portal: launch MapReduce jobs; on Azure, provision the cluster, manage ports, gather external data
• Interactive JavaScript & Hive console: JS for HDFS, Pig, and light data visualization; Hive commands and metadata discovery; new console coming
• Desktop shortcuts: command window; MapReduce; Name Node status in browser; on Azure, from the portal page you can RDP directly to the Hadoop head node for these desktop shortcuts
Demo: Windows Azure HDInsight
Amazon Elastic MapReduce
Lots of steps! At a high level:
• Set up an AWS account and S3 "buckets"
• Generate a key pair and PEM file
• Install Ruby and the EMR command line interface
• Provision the cluster using the CLI (a batch file can work very well here)
• Set up and run SSH/PuTTY
• Work interactively at the command line
Demo: Amazon Elastic MapReduce
Cloudera CDH4 Virtual Machine
• Get it for free, in VMware and VirtualBox versions (VMware Player and VirtualBox are free too)
• Run it, and configure it to have its own IP on your network; use ifconfig to discover the IP
• Assuming an IP of 192.168.1.59, open a browser on your own (host) machine and navigate to http://192.168.1.59:8888
• Can also use the browser in the VM and hit http://localhost:8888
• Work in "Hue"…
Hue
Browser-based UI, with front ends for:
• HDFS (with upload & download)
• MapReduce job creation and monitoring
• Hive ("Beeswax")
And in-browser command-line shells for:
• HBase
• Pig ("Grunt")
Impala: What it Is
• Distributed SQL query engine over a Hadoop cluster
• Announced at Strata/Hadoop World in NYC on October 24th
• In beta, as part of CDH 4.1
• Works with HDFS and Hive data
• Compatible with HiveQL and Hive drivers; query with Beeswax
Impala: What it’s Not
Impala is not Hive:
• Hive converts HiveQL to Java MapReduce code and executes it in batch mode
• Impala executes queries interactively over the data
• Brings BI tools and Hadoop closer together

Impala is not an Apache Software Foundation project:
• Though open source and Apache-licensed, it's still incubated by Cloudera
• Only in CDH
Demo: Cloudera CDH4, Impala
Hadoop commands
HDFS: hadoop fs <filecommand>
• Create and remove directories: mkdir, rm, rmr
• Upload and download files to/from HDFS: get, put
• View directory contents: ls, lsr
• Copy, move, view files: cp, mv, cat

MapReduce: run a Java JAR-file-based job
• hadoop jar <jarname> <params>
Demo: Hadoop (directly)
HBase
Concepts:
• Tables, column families
• Columns, rows
• Keys, values

Commands:
• Definition: create, alter, drop, truncate
• Manipulation: get, put, delete, deleteall, scan
• Discovery: list, exists, describe, count
• Enablement: disable, enable
• Utilities: version, status, shutdown, exit
• Reference: http://wiki.apache.org/hadoop/Hbase/Shell

Moreover, interesting HBase work can be done in MapReduce and Pig.
HBase Examples
create 't1', 'f1', 'f2', 'f3'
describe 't1'
alter 't1', {NAME => 'f1', VERSIONS => 5}
put 't1', 'r1', 'f1:c1', 'value'
get 't1', 'r1'
count 't1'
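For programmatic access, roughly the same operations can be sketched in Python with the third-party happybase client. This is a hedged sketch, not part of the deck; it assumes HBase's Thrift gateway is running locally:

import happybase  # third-party client; assumes an HBase Thrift server

connection = happybase.Connection('localhost')  # default Thrift port 9090
table = connection.table('t1')

# Equivalent of: put 't1', 'r1', 'f1:c1', 'value'
table.put(b'r1', {b'f1:c1': b'value'})

# Equivalent of: get 't1', 'r1'
print(table.row(b'r1'))  # {b'f1:c1': b'value'}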
April 10-12 | Chicago, IL
DemoHBase
Submitting, Running and Monitoring Jobs
• Upload a JAR
• Use Streaming:
  • Use other languages (i.e. other than Java) to write MapReduce code
  • Python is a popular option
  • Any executable works, even C# console apps
  • On MS HDInsight, JavaScript works too
  • Still uses a JAR file: streaming.jar
• Run at the command line (passing the JAR name and params) or use the GUI (a Python streaming sketch follows below)
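As a hedged sketch of what a streaming pair can look like in Python (word count; the file names mapper.py and reducer.py are illustrative):

mapper.py:
#!/usr/bin/env python
# Read raw lines on stdin; emit tab-separated key/value pairs on stdout
import sys
for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t1" % word.lower())

reducer.py:
#!/usr/bin/env python
# Streaming sorts mapper output by key, so identical keys arrive in runs
import sys
current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print("%s\t%d" % (current_word, count))
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print("%s\t%d" % (current_word, count))

Both scripts would be passed to streaming.jar as the map and reduce commands.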
Demo: Running MapReduce Jobs
Hive
• Used by most BI products which connect to Hadoop
• Provides a SQL-like abstraction over Hadoop: officially HiveQL, or HQL
• Works on its own tables, but also on HBase
• A query generates a MapReduce job, the output of which becomes the result set
• Microsoft has a Hive ODBC driver; it connects Excel, Reporting Services, PowerPivot, and Analysis Services Tabular mode (only)
Hive, Continued
Load data from flat HDFS files:
• LOAD DATA [LOCAL] INPATH 'myfile' INTO TABLE mytable;

SQL queries:
• CREATE, ALTER, DROP
• INSERT OVERWRITE (creates whole tables)
• SELECT, JOIN, WHERE, GROUP BY
• SORT BY, but ordering data is tricky!
• MAP/REDUCE/TRANSFORM…USING allows for custom map and reduce steps utilizing Java or streaming code
Data Explorer
• Beta add-in for Excel
• Acquire and transform data
• Data sources include Facebook, HDFS
• Visually- or script-driven
• Also includes Azure BLOB storage backing up HDInsight
Pig
• Instead of SQL, employs a language ("Pig Latin") that accommodates data flow expressions: do a combo of query and ETL
• "10 lines of Pig Latin ≈ 200 lines of Java."
• Works with structured or unstructured data
• Operations: as with Hive, a MapReduce job is generated; unlike Hive, output is only a flat file to HDFS or text at the command-line console; with HDInsight, you can easily convert output to a JavaScript array, then manipulate it
• Use the command line ("Grunt") or build scripts
Example
A = LOAD 'myfile' AS (x, y, z);
B = FILTER A BY x > 0;
C = GROUP B BY x;
D = FOREACH C GENERATE group, COUNT(B);
STORE D INTO 'output';
Pig Latin Examples
Imperative, file system commands:
• LOAD, STORE (schema specified on LOAD)

Declarative, query commands (SQL-like equivalents in parentheses):
• xxx = file or data set
• FOREACH xxx GENERATE (SELECT…FROM xxx)
• JOIN (WHERE/INNER JOIN)
• FILTER xxx BY (WHERE)
• ORDER xxx BY (ORDER BY)
• GROUP xxx BY / GENERATE COUNT(xxx) (SELECT COUNT(*)…GROUP BY)
• DISTINCT (SELECT DISTINCT)

Syntax is assignment statement-based:
• MyCusts = FILTER Custs BY SalesPerson eq 15;

Access HBase:
• CpuMetrics = LOAD 'hbase://SystemMetrics' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cpu:', '-loadKey -returnTuple');
Sqoop
sqoop import
  --connect "jdbc:sqlserver://<servername>.database.windows.net:1433;database=<dbname>;user=<username>@<servername>;password=<password>"
  --table <from_table>
  --target-dir <to_hdfs_folder>
  --split-by <from_table_column>
Sqoop
sqoop export
  --connect "jdbc:sqlserver://<servername>.database.windows.net:1433;database=<dbname>;user=<username>@<servername>;password=<password>"
  --table <to_table>
  --export-dir <from_hdfs_folder>
  --input-fields-terminated-by "<delimiter>"
Flume NG
Sources
• Avro (a data serialization system; can read JSON-encoded data files, and can work over RPC)
• Exec (reads from stdout of a long-running process)

Sinks
• HDFS, HBase, Avro

Channels
• Memory, JDBC, file
Flume NG (next generation)
Set up conf/flume.conf:

# Define a memory channel called ch1 on agent1
agent1.channels.ch1.type = memory

# Define an Avro source called avro-source1 on agent1 and tell it
# to bind to 0.0.0.0:41414. Connect it to channel ch1.
agent1.sources.avro-source1.channels = ch1
agent1.sources.avro-source1.type = avro
agent1.sources.avro-source1.bind = 0.0.0.0
agent1.sources.avro-source1.port = 41414

# Define a logger sink that simply logs all events it receives
# and connect it to the other end of the same channel.
agent1.sinks.log-sink1.channel = ch1
agent1.sinks.log-sink1.type = logger

# Finally, now that we've defined all of our components, tell
# agent1 which ones we want to activate.
agent1.channels = ch1
agent1.sources = avro-source1
agent1.sinks = log-sink1

From the command line:
flume-ng agent --conf ./conf/ -f conf/flume.conf -n agent1
Mahout Algorithms
Recommendation
• Your info + community info
• Give users/items/ratings; get user-user/item-item
• itemsimilarity

Classification/Categorization
• Drop into buckets
• Naïve Bayes, Complementary Naïve Bayes, Decision Forests

Clustering
• Like classification, but with categories unknown
• K-Means, Fuzzy K-Means, Canopy, Dirichlet, Mean-Shift
Workflow, Syntax
Workflow
• Run the job
• Dump the output
• Visualize, predict

Syntax:
mahout <algorithm> --input <folderspec> --output <folderspec> --param1 value1 --param2 value2 …

Example:
mahout itemsimilarity --input <input-hdfs-path> --output <output-hdfs-path> --tempDir <tmp-hdfs-path> -s SIMILARITY_LOGLIKELIHOOD
The Truth About Mahout
• Mahout is really just an algorithm engine
• Its output is almost unusable by non-statisticians/non-data scientists
• You need a staff or a product to visualize it, or make it into a usable prediction model
• Investigate Predixion Software:
  • CTO Jamie MacLennan used to lead the SQL Server Data Mining team
  • Its Excel add-in can use Mahout remotely, visualize its output, and run predictive analyses
  • Also integrates with SQL Server, Greenplum, MapReduce
  • http://www.predixionsoftware.com
The “Data-Refinery” Idea
• Use Hadoop to "on-board" unstructured data, then extract manageable subsets
• Load the subsets into conventional DW/BI servers and use familiar analytics tools to examine them
• This is the current rationalization of Hadoop + BI tools' coexistence
• Will it stay this way?
Google BigQuery
• Dremel-based service for massive amounts of data
• Pay for query and storage
• SQL-like query language
• Has an Excel connector

Demo: Google BigQuery
Drilldown on NoSQL
NoSQL Data Fodder
• Addresses
• Preferences
• Notes
• Friends, followers
• Documents
"Web Scale"
• This is the term used to justify NoSQL
• The scenario is simple needs, but "made up for in volume": millions of concurrent users
• Think of sites like Amazon or Google
• Think of non-transactional tasks like loading catalog data to display a product page, or environment preferences
NoSQL Common Traits
• Non-relational
• Non-schematized/schema-free
• Open source
• Distributed
• Eventual consistency
• "Web scale"
• Developed at big Internet companies

More than just the elephant in the room: over 120 types of NoSQL databases.
So many NoSQL options
Concepts
• Consistency
• CAP Theorem
• Indexing
• Queries
• MapReduce
• Sharding
Consistency
• CAP Theorem: databases may only excel at two of the following three attributes: consistency, availability and partition tolerance
• NoSQL does not offer "ACID" guarantees (atomicity, consistency, isolation and durability)
• Instead it offers "eventual consistency," similar to DNS propagation
• Things like inventory and account balances should be consistent:
  • Imagine updating a server in Seattle that stock was depleted, but not updating the server in NY
  • A customer in NY goes to order 50 pieces of the item, and the order is processed even though there is no stock
• Things like catalog information don't have to be, at least not immediately:
  • If a new item is entered into the catalog, it's OK for some customers to see it even before the other customers' server knows about it
  • But catalog info must come up quickly, so don't lock data in one location while waiting to update the other
• Therefore, it's OK to sacrifice consistency for speed, in some cases
CAP Theorem
[Diagram: the three attributes, Consistency, Availability, and Partition Tolerance, with relational and NoSQL databases positioned by which two they favor.]
Indexing
• Most NoSQL databases are indexed by key
• Some allow so-called "secondary" indexes
• Often the primary key indexes are clustered
• HBase uses HDFS (the Hadoop Distributed File System), which is append-only:
  • Writes are logged
  • Logged writes are batched
  • The file is re-created and sorted
Queries
• Typically no query language
• Instead, create a procedural program
• Sometimes SQL is supported
• Sometimes MapReduce code is used…
MapReduce
• This is not Hadoop's MapReduce, but it's conceptually related
• Map step: pre-processes data
• Reduce step: summarizes/aggregates data
• Will show a MapReduce code sample for Mongo soon (a preview follows below)
• Will demo map code on CouchDB
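As a preview of the promised Mongo sample, here is a hedged Python sketch using pymongo's map_reduce wrapper (present in older PyMongo releases; removed in PyMongo 4). The database, collection, and field names are invented; the map and reduce bodies are JavaScript strings executed by the server:

from pymongo import MongoClient
from bson.code import Code

db = MongoClient()["shop"]  # assumes a local mongod and a 'shop' database

# Map: emit one (key, value) pair per document
mapper = Code("function () { emit(this.category, this.price); }")

# Reduce: aggregate all values emitted for one key
reducer = Code("function (key, values) { return Array.sum(values); }")

# The output collection 'totals' gets one document per category
result = db.orders.map_reduce(mapper, reducer, "totals")
for doc in result.find():
    print(doc)  # e.g. {'_id': 'Dress', 'value': 300.0}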
Sharding
• A partitioning pattern where separate servers store the partitions
• Fan-out queries are supported
• Partitions may be duplicated, so replication is also provided (good for disaster recovery)
• Since "shards" can be geographically distributed, sharding can act like a CDN
• Good for keeping data close to processing: reduces network traffic when MapReduce splitting takes place (a routing sketch follows below)
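A minimal Python sketch of the routing idea behind sharding; the server list and hashing scheme are illustrative (real stores typically use more robust schemes, such as consistent hashing):

import hashlib

SHARDS = ["server-a", "server-b", "server-c"]  # hypothetical shard hosts

def shard_for(key):
    # Hash the key and map it to one shard; every client using the same
    # function routes the same key to the same server
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("customer:101"))  # always the same shard for this key
print(shard_for("customer:202"))  # may land on a different shard

# A fan-out query simply asks every shard and merges the results
results = ["query(%s)" % server for server in SHARDS]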
NoSQL Categories
• Graph
• Wide Column
• Document
• Key/Value
Key-Value Stores
• The most common; not necessarily the most popular
• Has rows, each with something like a big dictionary/associative array; the schema may differ from row to row
• Common on cloud platforms, e.g. Amazon SimpleDB, Azure Table Storage
• MemcacheDB, Voldemort, Couchbase, DynamoDB (AWS), Dynomite, Redis and Riak
Key-Value Stores: An Example

Table: Customers
  Row ID: 101
    First_Name: Andrew
    Last_Name: Brust
    Address: 123 Main Street
    Last_Order: 1501
  Row ID: 202
    First_Name: Jane
    Last_Name: Doe
    Address: 321 Elm Street
    Last_Order: 1502

Table: Orders
  Row ID: 1501
    Price: 300 USD
    Item1: 52134
    Item2: 24457
  Row ID: 1502
    Price: 2500 GBP
    Item1: 98456
    Item2: 59428
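The same data in an actual key-value store might look like the following hedged Python sketch, using the redis-py client against a local Redis server; the key-naming convention is invented for the example:

import json
import redis

r = redis.Redis()  # assumes a Redis server on localhost:6379

# Each row is just an opaque value stored under a key; the "schema"
# lives entirely in the application, not the database
r.set("customers:101", json.dumps({
    "First_Name": "Andrew", "Last_Name": "Brust",
    "Address": "123 Main Street", "Last_Order": 1501,
}))
r.set("orders:1501", json.dumps({
    "Price": "300 USD", "Item1": 52134, "Item2": 24457,
}))

# There is no JOIN: the application fetches by key and follows references
customer = json.loads(r.get("customers:101"))
order = json.loads(r.get("orders:%d" % customer["Last_Order"]))
print(order["Price"])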
Wide Column Stores
• Has tables with declared column families; each column family has "columns" which are KV pairs that can vary from row to row
• These are the most foundational for large sites:
  • BigTable (Google)
  • HBase (originally part of the Yahoo-dominated Hadoop project)
  • Cassandra (Facebook); calls column families "super columns" and tables "super column families"
• They are the most "Big Data"-ready, especially HBase + Hadoop
Wide Column Stores: An Example

Table: Customers
  Row ID: 101
    Super Column: Name
      Column: First_Name: Andrew
      Column: Last_Name: Brust
    Super Column: Address
      Column: Number: 123
      Column: Street: Main Street
    Super Column: Orders
      Column: Last_Order: 1501
  Row ID: 202
    Super Column: Name
      Column: First_Name: Jane
      Column: Last_Name: Doe
    Super Column: Address
      Column: Number: 321
      Column: Street: Elm Street
    Super Column: Orders
      Column: Last_Order: 1502

Table: Orders
  Row ID: 1501
    Super Column: Pricing
      Column: Price: 300 USD
    Super Column: Items
      Column: Item1: 52134
      Column: Item2: 24457
  Row ID: 1502
    Super Column: Pricing
      Column: Price: 2500 GBP
    Super Column: Items
      Column: Item1: 98456
      Column: Item2: 59428
Demo: Wide Column Stores
Document Stores
• Have "databases," which are akin to tables, and "documents," akin to rows
• Documents are typically JSON objects:
  • Each document has properties and values
  • Values can be scalars, arrays, links to documents in other databases, or sub-documents (i.e. contained JSON objects, which allows for hierarchical storage)
  • Can have attachments as well
• Old versions are retained, so doc stores work well for content management
• Some view doc stores as specialized KV stores
• Most popular with developers, startups, VCs
• The biggies: CouchDB (and derivatives), MongoDB
Document Store Application Orientation
• Documents can each be addressed by URIs; CouchDB supports a full REST interface (a sketch follows below)
• Very geared toward JavaScript and JSON:
  • Documents are JSON objects
  • CouchDB/MongoDB use JavaScript as their native language
• In CouchDB, "view functions" also have unique URIs and they return HTML, so you can build entire applications in the database
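Because the interface is plain REST, a document can be created with nothing more than an HTTP PUT. A hedged Python sketch using the requests library against a local CouchDB; the database name and document are made up:

import requests

base = "http://localhost:5984"  # assumes a local CouchDB

# Create a database (returns an error body if it already exists)
requests.put("%s/customers" % base)

# PUT a JSON document at a URI of our choosing; CouchDB assigns a revision
doc = {"First_Name": "Jane", "Last_Name": "Doe",
       "Address": {"Number": 321, "Street": "Elm Street"}}
resp = requests.put("%s/customers/202" % base, json=doc)
print(resp.json())  # e.g. {'ok': True, 'id': '202', 'rev': '1-...'}

# GET it back by the same URI
print(requests.get("%s/customers/202" % base).json())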
Document Stores: An Example

Database: Customers
  Document ID: 101
    First_Name: Andrew
    Last_Name: Brust
    Address:
      Number: 123
      Street: Main Street
    Orders:
      Most_recent: 1501
  Document ID: 202
    First_Name: Jane
    Last_Name: Doe
    Address:
      Number: 321
      Street: Elm Street
    Orders:
      Most_recent: 1502

Database: Orders
  Document ID: 1501
    Price: 300 USD
    Item1: 52134
    Item2: 24457
  Document ID: 1502
    Price: 2500 GBP
    Item1: 98456
    Item2: 59428

Comparing…
Demo: Document Stores
Graph Databases
• Great for social network applications and others where relationships are important
• Nodes and edges: an edge is like a join, nodes are like rows in a table
• Nodes can also have properties and values
• Neo4j is a popular graph DB
Graph Databases: An Example
[Diagram: person nodes (Andrew Brust, Jane Doe, Joe Smith, George Washington) linked by edges such as "friend of," "sent invitation to," "commented on photo by," and "placed order"; other nodes include an Address (123 Main Street, New York, NY 10014), an Order (ID: 252, Total Price: 300 USD), and Items (ID: 52134, Dress, Blue; ID: 24457, Shirt, Red).]
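To make nodes, edges, and properties concrete, here is a small pure-Python sketch of the structure in the diagram above; it mimics the idea, not any particular graph database's API:

# Nodes carry properties; edges carry a type and connect two node IDs
nodes = {
    "andrew": {"type": "Person", "name": "Andrew Brust"},
    "jane":   {"type": "Person", "name": "Jane Doe"},
    "order1": {"type": "Order", "id": 252, "total": "300 USD"},
}
edges = [
    ("andrew", "friend of", "jane"),
    ("andrew", "placed order", "order1"),
]

# A query is a traversal: find everyone Andrew is a friend of
friends = [nodes[dst]["name"]
           for src, rel, dst in edges
           if src == "andrew" and rel == "friend of"]
print(friends)  # ['Jane Doe']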
NoSQL on Windows Azure
Platform as a Service:
• Cloudant: https://cloudant.com/azure/
• MongoDB (via MongoLab): http://blog.mongolab.com/2012/10/azure/

MongoDB, DIY:
• On an Azure Worker Role: http://www.mongodb.org/display/DOCS/MongoDB+on+Azure+Worker+Roles
• On a Windows VM: http://www.mongodb.org/display/DOCS/MongoDB+on+Azure+VM+-+Windows+Installer
• On a Linux VM: http://www.mongodb.org/display/DOCS/MongoDB+on+Azure+VM+-+Linux+Tutorial and http://www.windowsazure.com/en-us/manage/linux/common-tasks/mongodb-on-a-linux-vm/
NoSQL on Windows Azure
Others, DIY (Linux VMs):
• Couchbase: http://blog.couchbase.com/couchbase-server-new-windows-azure
• CouchDB: http://ossonazure.interoperabilitybridges.com/articles/couchdb-installer-for-windows-azure
• Riak: http://basho.com/blog/technical/2012/10/09/Riak-on-Microsoft-Azure/
• Redis: http://blogs.msdn.com/b/tconte/archive/2012/06/08/running-redis-on-a-centos-linux-vm-in-windows-azure.aspx
• Cassandra: http://www.windowsazure.com/en-us/manage/linux/other-resources/how-to-run-cassandra-with-linux/
NoSQL + BI
• NoSQL databases are bad for ad hoc query and data warehousing
• BI applications involve models; models rely on schema
• Extract, transform and load (ETL) may be your friend
• Wide-column stores, however, are good for "Big Data" (see next slide)
• Wide-column stores and column-oriented databases are similar technologically
NoSQL + Big Data
• Big Data and NoSQL are interrelated
• Typically, wide-column stores are used in Big Data scenarios
• Prime example: HBase and Hadoop
• Why?
  • Lack of indexing is not a problem
  • Consistency is not an issue
  • Fast reads are very important
  • Distributed file systems are important too
  • Commodity hardware and disk assumptions also important
  • Not Web scale, but massive scale-out, so similar concerns
NoSQL Compromises
• Eventual consistency
• Write buffering
• Only primary keys can be indexed
• Queries must be written as programs
• Tooling; productivity (= money)
Common DBA Tasks in NoSQL
RDBMS                            NoSQL
Import data                      Import data
Set up security                  Set up security
Perform a backup                 Make a copy of the data
Restore a database               Move a copy to a location
Create an index                  Create an index
Join tables together             Run MapReduce
Schedule a job                   Schedule a (cron) job
Run database maintenance         Monitor space and resources used
Send an email from SQL Server    Set up resource threshold alerts
Search BOL                       Interpret documentation
Which Type of NoSQL for Which Type of Data?
Type of Data               Type of NoSQL Solution    Example
Log files                  Wide column               HBase
Product catalogs           Key-value on disk         DynamoDB
User profiles              Key-value in memory       Redis
Startups                   Document                  MongoDB
Social media connections   Graph                     Neo4j
LOB w/ transactions        NONE! Use an RDBMS        SQL Server
Relational vs. NoSQL
• Line of business -> Relational
• Large, public (consumer)-facing sites -> NoSQL
• Complex data structures -> Relational
• Big Data -> NoSQL
• Transactional -> Relational
• Content management -> NoSQL
• Enterprise -> Relational
• Consumer Web -> NoSQL
• Data scientists…
NoSQL To-Do List
• Understand CAP & the types of NoSQL databases:
  • Use NoSQL when business needs designate
  • Use the right type of NoSQL for your business problem
• Try out NoSQL on the cloud:
  • Quick and cheap for behavioral data
  • Mash up cloud datasets
  • Good for specialized use cases, i.e. dev, test, and training environments
• Learn NoSQL access technologies:
  • New query languages, i.e. MapReduce, R, Infer.NET
  • New query tools (vendor-specific): Google Refine, Amazon Karmasphere, Microsoft Excel connectors, etc.
NoSQL for .NET Developers
• RavenDB
• MongoDB C#/.NET driver
• MongoDB on Windows Azure
• Couchbase .NET client library
• Riak client for .NET
• AWS Toolkit for Visual Studio
• Google cloud APIs (REST-based)
Thank You
• [email protected]
• @andrewbrust on Twitter
• Want to get on Blue Badge Insights' list? Text "bluebadge" to 22828