April 10-12 | Chicago, IL
Big Data and NoSQL for Database and BI Pros
Andrew J. Brust, Founder and CEO, Blue Badge Insights
Meet Andrew
• CEO and Founder, Blue Badge Insights
• Big Data blogger for ZDNet
• Microsoft Regional Director, MVP
• Co-chair of VSLive! and 17 years as a speaker
• Founder, Microsoft BI User Group of NYC: http://www.msbinyc.com
• Co-moderator, NYC .NET Developers Group: http://www.nycdotnetdev.com
• "Redmond Review" columnist for Visual Studio Magazine and Redmond Developer News
• brustblog.com, Twitter: @andrewbrust
Lynn Langit (in absentia)
• CEO and Founder, Lynn Langit Consulting
• Former Microsoft Evangelist (4 years)
• Google Developer Expert
• MongoDB Master
• MCT 13 years, 7 certifications
• Cloudera Certified Developer
• MSDN Magazine articles: SQL Azure, Hadoop on Azure, MongoDB on Azure
• www.LynnLangit.com, @LynnLangit
Read all about it!
Agenda
• Overview / Landscape: Big Data and Hadoop; NoSQL; the Big Data-NoSQL intersection
• Drilldown on Big Data
• Drilldown on NoSQL
What is Big Data?
• 100s of TB into PB and higher
• Involving data from: financial data, sensors, web logs, social media, etc.
• Parallel processing often involved
• Hadoop is emblematic, but other technologies are Big Data too
• Processing of data sets too large for transactional databases
• Analyzing interactions, rather than transactions
• The three V's: Volume, Velocity, Variety
• Big Data tech is sometimes imposed on small data problems
Big Data = Exponentially More Data
• Retail example: the "Feedback Economy"
• Number of transactions
• Number of behaviors (collected every minute)
Big Data = "Next State" Questions
• What could happen?
• Why didn't this happen?
• When will the next new thing happen?
• What will the next new thing be?
• What happens?
Collecting behavioral data
My Data: An Example from Health Care
• Medical records: regular, emergency
• Genetic data: 23andMe
• Food data: SparkPeople
• Purchasing: grocery card, credit card
• Search: Google
• Social media: Twitter, Facebook
• Exercise: Nike Fuel Band, Kinect
• Location: phone
Big Data = More Data
Big Data Considerations
• Collection: get the data
• Storage: keep the data
• Querying: make sense of the data
• Visualization: see the business value
Data Collection
Types of data
• Structured, semi-structured, unstructured vs. data standards
• Behavioral vs. transactional data
Methods of collection
• Sensors everywhere
• Machine-to-machine
• Public datasets: Freebase, Azure DataMarket, Hillary Mason's list
What’s MapReduce?
• Partition the bulk input data and send it to mappers (nodes in the cluster)
• Mappers pre-process, put data into key-value format, and send all output for a given (set of) key(s) to a reducer
• The reducer aggregates: one output per key, with value
• Map and Reduce code is natively written as Java functions (a conceptual sketch follows below)
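To make the flow concrete, here is a minimal Python sketch that simulates the map, shuffle, and reduce phases for a word count. It is purely illustrative (plain Python, not Hadoop's API), and the sample data is made up:

from collections import defaultdict

def mapper(line):
    # Map: emit one (word, 1) key-value pair per word in the input line
    for word in line.split():
        yield (word.lower(), 1)

def reducer(key, values):
    # Reduce: aggregate every value emitted for one key
    return (key, sum(values))

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# "Shuffle": group all mapped values by key, as Hadoop does between phases
groups = defaultdict(list)
for line in lines:
    for key, value in mapper(line):
        groups[key].append(value)

# One reducer output per key
for key in sorted(groups):
    print(reducer(key, groups[key]))  # e.g. ('the', 3)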
MapReduce, in a Diagram
[Diagram: input splits flow to parallel mappers; mapper output is partitioned by key (K1/K4, K2/K5, K3/K6) across three reducers, each of which emits one output per key.]
A MapReduce Example
• Count by suite, on each floor
• Send per-suite, per-platform totals to lobby
• Sort totals by platform
• Send two platform packets to 10th, 20th, 30th floors
• Tally up each platform
• Merge tallies into one spreadsheet
• Collect the tallies
What’s a Distributed File System?
• One where data gets distributed over commodity drives on commodity servers
• Data is replicated: if one box goes down, no data is lost ("shared nothing")
• BUT: immutable. Files can only be written to once, so updates require drop + re-write (slow); you can append, though. Like a DVD/CD-ROM.
Hadoop = MapReduce + HDFS
• Modeled after Google MapReduce + GFS
• Have more data? Just add more nodes to the cluster ("scaling out"): mappers execute in parallel on commodity hardware
• Use of HDFS means data may well be local to mapper processing: not just parallel, but minimal data movement, which avoids network bottlenecks
Comparison: RDBMS vs. Hadoop
                      Traditional RDBMS         Hadoop / MapReduce
Data size             Gigabytes (terabytes)     Petabytes (exabytes)
Updates               Read/write many times     Write once, read many times
Integrity             High (ACID)               Low
Query response time   Can be near immediate     Has latency (due to batch processing)
Just-in-Time Schema
• When looking at unstructured data, schema is imposed at query time
• Schema is context-specific:
  • If scanning a book, are the values words, lines, or pages?
  • Are notes a single field, or is each word a value?
  • Are date and time two fields or one?
  • Are street, city, state, zip separate or one value?
• Pig and Hive let you determine this at query time; so does the Map function in MapReduce code (see the sketch below)
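As an illustration of imposing schema at read time, here is a small Python sketch that parses the same raw line two different ways; the line format and field names are invented for the example:

raw = "2013-04-10 09:15:00|123 Main Street, New York, NY 10014"

# Schema A: treat date-and-time as one field and address as one value
timestamp, address = raw.split("|")

# Schema B: impose a finer-grained schema on the very same bytes
date, time = timestamp.split(" ")
street, city, state_zip = [part.strip() for part in address.split(",")]
state, zip_code = state_zip.split()

print(timestamp, address)
print(date, time, street, city, state, zip_code)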
What’s HBase?
• A wide-column store NoSQL database, modeled after Google BigTable
• Uses HDFS, and is therefore Hadoop-compatible
• Hadoop MapReduce is often used with HBase, but you can use either without the other
NoSQL Confusion
• Many "flavors" of NoSQL data stores
• Easiest to group by functionality, but the dividing lines are not clear or consistent
• NoSQL choice(s) driven by many factors: type of data, quantity of data, knowledge of technical staff, product maturity, tooling
So much wrong information
• Everything is "new"
• People are religious about data storage
• Lots of incorrect information
• "Try" before you "buy" (or use)
• Watch out for oversimplification
• Confusion over vendor offerings
Common NoSQL Misconceptions
Problems
• Everything is "new"
• People are religious about data storage
• Open source is always cheaper
• Cloud is always cheaper
• Replace RDBMS with NoSQL

Solutions
• "Try" before you "buy" (or use)
• Leverage NoSQL communities
• Add NoSQL to an existing RDBMS solution
Drilldown on Big Data
The Hadoop Stack
• MapReduce, HDFS
• Database
• RDBMS import/export
• Query: HiveQL and Pig Latin
• Machine learning/data mining
• Log file integration
What’s Hive?
• Began as a Hadoop sub-project; now a top-level Apache project
• Provides a SQL-like ("HiveQL") abstraction over MapReduce
• Has its own HDFS table file format (and it's fully schema-bound)
• Can also work over HBase
• Acts as a bridge to many BI products which expect tabular data
Hadoop Distributions
• Cloudera
• Hortonworks (HCatalog: Hive/Pig/MR interop)
• MapR (Network File System replaces HDFS)
• IBM InfoSphere BigInsights (HDFS <-> DB2 integration)
• And now Microsoft…
Microsoft HDInsight
• Developed with Hortonworks; incorporates Hortonworks Data Platform (HDP) for Windows
• Windows Azure HDInsight and Microsoft HDInsight (for Windows Server); a single-node preview runs on Windows client
• Includes an ODBC Driver for Hive
• JavaScript MapReduce framework
• Contributes it all back to the open source Apache project
Amenities for Visual Studio/.NET
• Hortonworks Data Platform for Windows
• MRLib (NuGet package): MR code in C#, with HadoopJob, MapperBase, ReducerBase
• LINQ to Hive
• OdbcClient + Hive ODBC Driver
• Deployment
• Debugging
Some ways to work
Microsoft HDInsight
• Cloud: go to www.windowsazure.com, request a cluster
• Local: download Microsoft HDInsight; runs on just about anything, including Windows XP; get it via the Web Platform Installer (WebPI)
• The local version is free; the cloud is billed at a 50% discount during preview

Amazon Web Services Elastic MapReduce
• Create an AWS account
• Select Elastic MapReduce in the Dashboard
• Cheap for experimenting, but not free

Cloudera CDH VM image
• Download as a .tar.gz file; "un-tar" (can use WinRAR, 7zip)
• Run via VMware Player or VirtualBox
• Everything's free
Some ways to work
[Screenshots: HDInsight, EMR, CDH 4]
Microsoft HDInsight
• Much simpler than the others
• Browser-based portal: launch MapReduce jobs; on Azure, provision the cluster, manage ports, gather external data
• Interactive JavaScript & Hive console: JS for HDFS, Pig, and light data visualization; Hive commands and metadata discovery; new console coming
• Desktop shortcuts: command window; MapReduce; Name Node status in browser; on Azure, from the portal page you can RDP directly to the Hadoop head node for these desktop shortcuts
Demo: Windows Azure HDInsight
Amazon Elastic MapReduce
Lots of steps! At a high level:
• Set up an AWS account and S3 "buckets"
• Generate a key pair and PEM file
• Install Ruby and the EMR command line interface
• Provision the cluster using the CLI (a batch file can work very well here)
• Set up and run SSH/PuTTY
• Work interactively at the command line
Demo: Amazon Elastic MapReduce
Cloudera CDH4 Virtual Machine
• Get it for free, in VMware and VirtualBox versions (VMware Player and VirtualBox are free too)
• Run it, and configure it to have its own IP on your network; use ifconfig to discover the IP
• Assuming an IP of 192.168.1.59, open a browser on your own (host) machine and navigate to http://192.168.1.59:8888
• Can also use the browser in the VM and hit http://localhost:8888
• Work in "Hue"…
Hue
Browser-based UI, with front ends for:
• HDFS (with upload & download)
• MapReduce job creation and monitoring
• Hive ("Beeswax")
And in-browser command-line shells for:
• HBase
• Pig ("Grunt")
Impala: What it Is
• Distributed SQL query engine over a Hadoop cluster
• Announced at Strata/Hadoop World in NYC on October 24th
• In beta, as part of CDH 4.1
• Works with HDFS and Hive data
• Compatible with HiveQL and Hive drivers; query with Beeswax
Impala: What it’s Not
Impala is not Hive:
• Hive converts HiveQL to Java MapReduce code and executes it in batch mode
• Impala executes queries interactively over the data
• Brings BI tools and Hadoop closer together

Impala is not an Apache Software Foundation project:
• Though open source and Apache-licensed, it's still incubated by Cloudera
• Only in CDH
Demo: Cloudera CDH4, Impala
Hadoop commands
HDFS: hadoop fs <filecommand>
• Create and remove directories: mkdir, rm, rmr
• Upload and download files to/from HDFS: get, put
• View directory contents: ls, lsr
• Copy, move, view files: cp, mv, cat

MapReduce: run a Java JAR-file-based job
• hadoop jar <jarname> <params>
Demo: Hadoop (directly)
HBase
Concepts:
• Tables, column families
• Columns, rows
• Keys, values

Commands:
• Definition: create, alter, drop, truncate
• Manipulation: get, put, delete, deleteall, scan
• Discovery: list, exists, describe, count
• Enablement: disable, enable
• Utilities: version, status, shutdown, exit
• Reference: http://wiki.apache.org/hadoop/Hbase/Shell

Moreover, interesting HBase work can be done in MapReduce and Pig.
HBase Examples
create 't1', 'f1', 'f2', 'f3'
describe 't1'
alter 't1', {NAME => 'f1', VERSIONS => 5}
put 't1', 'r1', 'f1:c1', 'value'
get 't1', 'r1'
count 't1'
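For programmatic access, roughly the same operations can be sketched in Python with the third-party happybase client. This is a hedged sketch, not part of the deck; it assumes HBase's Thrift gateway is running locally:

import happybase  # third-party client; assumes an HBase Thrift server

connection = happybase.Connection('localhost')  # default Thrift port 9090
table = connection.table('t1')

# Equivalent of: put 't1', 'r1', 'f1:c1', 'value'
table.put(b'r1', {b'f1:c1': b'value'})

# Equivalent of: get 't1', 'r1'
print(table.row(b'r1'))  # {b'f1:c1': b'value'}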
April 10-12 | Chicago, IL
DemoHBase
Submitting, Running and Monitoring Jobs
• Upload a JAR
• Use Streaming:
  • Use other languages (i.e. other than Java) to write MapReduce code
  • Python is a popular option
  • Any executable works, even C# console apps
  • On MS HDInsight, JavaScript works too
  • Still uses a JAR file: streaming.jar
• Run at the command line (passing the JAR name and params) or use the GUI (a Python streaming sketch follows below)
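As a hedged sketch of what a streaming pair can look like in Python (word count; the file names mapper.py and reducer.py are illustrative):

mapper.py:
#!/usr/bin/env python
# Read raw lines on stdin; emit tab-separated key/value pairs on stdout
import sys
for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t1" % word.lower())

reducer.py:
#!/usr/bin/env python
# Streaming sorts mapper output by key, so identical keys arrive in runs
import sys
current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print("%s\t%d" % (current_word, count))
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print("%s\t%d" % (current_word, count))

Both scripts would be passed to streaming.jar as the map and reduce commands.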
Demo: Running MapReduce Jobs
Hive
• Used by most BI products which connect to Hadoop
• Provides a SQL-like abstraction over Hadoop: officially HiveQL, or HQL
• Works on its own tables, but also on HBase
• A query generates a MapReduce job, the output of which becomes the result set
• Microsoft has a Hive ODBC driver; it connects Excel, Reporting Services, PowerPivot, and Analysis Services Tabular mode (only)
Hive, Continued
Load data from flat HDFS files:
• LOAD DATA [LOCAL] INPATH 'myfile' INTO TABLE mytable;

SQL queries:
• CREATE, ALTER, DROP
• INSERT OVERWRITE (creates whole tables)
• SELECT, JOIN, WHERE, GROUP BY
• SORT BY, but ordering data is tricky!
• MAP/REDUCE/TRANSFORM…USING allows for custom map and reduce steps utilizing Java or streaming code
Data Explorer
• Beta add-in for Excel
• Acquire and transform data
• Data sources include Facebook, HDFS
• Visually- or script-driven
• Also includes Azure BLOB storage backing up HDInsight
Pig
• Instead of SQL, employs a language ("Pig Latin") that accommodates data flow expressions: do a combo of query and ETL
• "10 lines of Pig Latin ≈ 200 lines of Java."
• Works with structured or unstructured data
• Operations: as with Hive, a MapReduce job is generated; unlike Hive, output is only a flat file to HDFS or text at the command-line console; with HDInsight, you can easily convert output to a JavaScript array, then manipulate it
• Use the command line ("Grunt") or build scripts
Example
A = LOAD 'myfile' AS (x, y, z);
B = FILTER A BY x > 0;
C = GROUP B BY x;
D = FOREACH C GENERATE group, COUNT(B);
STORE D INTO 'output';
Pig Latin Examples
Imperative, file system commands:
• LOAD, STORE (schema specified on LOAD)

Declarative, query commands (SQL-like equivalents in parentheses):
• xxx = file or data set
• FOREACH xxx GENERATE (SELECT…FROM xxx)
• JOIN (WHERE/INNER JOIN)
• FILTER xxx BY (WHERE)
• ORDER xxx BY (ORDER BY)
• GROUP xxx BY / GENERATE COUNT(xxx) (SELECT COUNT(*)…GROUP BY)
• DISTINCT (SELECT DISTINCT)

Syntax is assignment statement-based:
• MyCusts = FILTER Custs BY SalesPerson eq 15;

Access HBase:
• CpuMetrics = LOAD 'hbase://SystemMetrics' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cpu:', '-loadKey -returnTuple');
Sqoop
sqoop import
  --connect "jdbc:sqlserver://<servername>.database.windows.net:1433;database=<dbname>;user=<username>@<servername>;password=<password>"
  --table <from_table>
  --target-dir <to_hdfs_folder>
  --split-by <from_table_column>
Sqoop
sqoop export
  --connect "jdbc:sqlserver://<servername>.database.windows.net:1433;database=<dbname>;user=<username>@<servername>;password=<password>"
  --table <to_table>
  --export-dir <from_hdfs_folder>
  --input-fields-terminated-by "<delimiter>"
Flume NG
Sources
• Avro (a data serialization system; can read JSON-encoded data files, and can work over RPC)
• Exec (reads from stdout of a long-running process)

Sinks
• HDFS, HBase, Avro

Channels
• Memory, JDBC, file
Flume NG (next generation)
Set up conf/flume.conf:

# Define a memory channel called ch1 on agent1
agent1.channels.ch1.type = memory

# Define an Avro source called avro-source1 on agent1 and tell it
# to bind to 0.0.0.0:41414. Connect it to channel ch1.
agent1.sources.avro-source1.channels = ch1
agent1.sources.avro-source1.type = avro
agent1.sources.avro-source1.bind = 0.0.0.0
agent1.sources.avro-source1.port = 41414

# Define a logger sink that simply logs all events it receives
# and connect it to the other end of the same channel.
agent1.sinks.log-sink1.channel = ch1
agent1.sinks.log-sink1.type = logger

# Finally, now that we've defined all of our components, tell
# agent1 which ones we want to activate.
agent1.channels = ch1
agent1.sources = avro-source1
agent1.sinks = log-sink1

From the command line:
flume-ng agent --conf ./conf/ -f conf/flume.conf -n agent1
Mahout Algorithms
Recommendation
• Your info + community info
• Give users/items/ratings; get user-user/item-item
• itemsimilarity

Classification/Categorization
• Drop into buckets
• Naïve Bayes, Complementary Naïve Bayes, Decision Forests

Clustering
• Like classification, but with categories unknown
• K-Means, Fuzzy K-Means, Canopy, Dirichlet, Mean-Shift
Workflow, Syntax
Workflow
• Run the job
• Dump the output
• Visualize, predict

Syntax:
mahout <algorithm> --input <folderspec> --output <folderspec> --param1 value1 --param2 value2 …

Example:
mahout itemsimilarity --input <input-hdfs-path> --output <output-hdfs-path> --tempDir <tmp-hdfs-path> -s SIMILARITY_LOGLIKELIHOOD
The Truth About Mahout
• Mahout is really just an algorithm engine
• Its output is almost unusable by non-statisticians/non-data scientists
• You need a staff or a product to visualize it, or make it into a usable prediction model
• Investigate Predixion Software:
  • CTO Jamie MacLennan used to lead the SQL Server Data Mining team
  • Its Excel add-in can use Mahout remotely, visualize its output, and run predictive analyses
  • Also integrates with SQL Server, Greenplum, MapReduce
  • http://www.predixionsoftware.com
The “Data-Refinery” Idea
• Use Hadoop to "on-board" unstructured data, then extract manageable subsets
• Load the subsets into conventional DW/BI servers and use familiar analytics tools to examine them
• This is the current rationalization of Hadoop + BI tools' coexistence
• Will it stay this way?
Google BigQuery
• Dremel-based service for massive amounts of data
• Pay for query and storage
• SQL-like query language
• Has an Excel connector

Demo: Google BigQuery
Drilldown on NoSQL
NoSQL Data Fodder
• Addresses
• Preferences
• Notes
• Friends, followers
• Documents
"Web Scale"
• This is the term used to justify NoSQL
• The scenario is simple needs, but "made up for in volume": millions of concurrent users
• Think of sites like Amazon or Google
• Think of non-transactional tasks like loading catalog data to display a product page, or environment preferences
NoSQL Common Traits
• Non-relational
• Non-schematized/schema-free
• Open source
• Distributed
• Eventual consistency
• "Web scale"
• Developed at big Internet companies

More than just the elephant in the room: over 120 types of NoSQL databases.
So many NoSQL options
Concepts
• Consistency
• CAP Theorem
• Indexing
• Queries
• MapReduce
• Sharding
Consistency
• CAP Theorem: databases may only excel at two of the following three attributes: consistency, availability and partition tolerance
• NoSQL does not offer "ACID" guarantees (atomicity, consistency, isolation and durability)
• Instead it offers "eventual consistency," similar to DNS propagation
• Things like inventory and account balances should be consistent:
  • Imagine updating a server in Seattle that stock was depleted, but not updating the server in NY
  • A customer in NY goes to order 50 pieces of the item, and the order is processed even though there is no stock
• Things like catalog information don't have to be, at least not immediately:
  • If a new item is entered into the catalog, it's OK for some customers to see it even before the other customers' server knows about it
  • But catalog info must come up quickly, so don't lock data in one location while waiting to update the other
• Therefore, it's OK to sacrifice consistency for speed, in some cases
CAP Theorem
[Diagram: the three attributes, Consistency, Availability, and Partition Tolerance, with relational and NoSQL databases positioned by which two they favor.]
Indexing
• Most NoSQL databases are indexed by key
• Some allow so-called "secondary" indexes
• Often the primary key indexes are clustered
• HBase uses HDFS (the Hadoop Distributed File System), which is append-only:
  • Writes are logged
  • Logged writes are batched
  • The file is re-created and sorted
Queries
• Typically no query language
• Instead, create a procedural program
• Sometimes SQL is supported
• Sometimes MapReduce code is used…
MapReduce
• This is not Hadoop's MapReduce, but it's conceptually related
• Map step: pre-processes data
• Reduce step: summarizes/aggregates data
• Will show a MapReduce code sample for Mongo soon (a preview follows below)
• Will demo map code on CouchDB
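As a preview of the promised Mongo sample, here is a hedged Python sketch using pymongo's map_reduce wrapper (present in older PyMongo releases; removed in PyMongo 4). The database, collection, and field names are invented; the map and reduce bodies are JavaScript strings executed by the server:

from pymongo import MongoClient
from bson.code import Code

db = MongoClient()["shop"]  # assumes a local mongod and a 'shop' database

# Map: emit one (key, value) pair per document
mapper = Code("function () { emit(this.category, this.price); }")

# Reduce: aggregate all values emitted for one key
reducer = Code("function (key, values) { return Array.sum(values); }")

# The output collection 'totals' gets one document per category
result = db.orders.map_reduce(mapper, reducer, "totals")
for doc in result.find():
    print(doc)  # e.g. {'_id': 'Dress', 'value': 300.0}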
Sharding
• A partitioning pattern where separate servers store the partitions
• Fan-out queries are supported
• Partitions may be duplicated, so replication is also provided (good for disaster recovery)
• Since "shards" can be geographically distributed, sharding can act like a CDN
• Good for keeping data close to processing: reduces network traffic when MapReduce splitting takes place (a routing sketch follows below)
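A minimal Python sketch of the routing idea behind sharding; the server list and hashing scheme are illustrative (real stores typically use more robust schemes, such as consistent hashing):

import hashlib

SHARDS = ["server-a", "server-b", "server-c"]  # hypothetical shard hosts

def shard_for(key):
    # Hash the key and map it to one shard; every client using the same
    # function routes the same key to the same server
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("customer:101"))  # always the same shard for this key
print(shard_for("customer:202"))  # may land on a different shard

# A fan-out query simply asks every shard and merges the results
results = ["query(%s)" % server for server in SHARDS]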
NoSQL Categories
• Graph
• Wide Column
• Document
• Key/Value
Key-Value Stores
• The most common; not necessarily the most popular
• Has rows, each with something like a big dictionary/associative array; the schema may differ from row to row
• Common on cloud platforms, e.g. Amazon SimpleDB, Azure Table Storage
• MemcacheDB, Voldemort, Couchbase, DynamoDB (AWS), Dynomite, Redis and Riak
Key-Value Stores: An Example

Table: Customers
  Row ID: 101
    First_Name: Andrew
    Last_Name: Brust
    Address: 123 Main Street
    Last_Order: 1501
  Row ID: 202
    First_Name: Jane
    Last_Name: Doe
    Address: 321 Elm Street
    Last_Order: 1502

Table: Orders
  Row ID: 1501
    Price: 300 USD
    Item1: 52134
    Item2: 24457
  Row ID: 1502
    Price: 2500 GBP
    Item1: 98456
    Item2: 59428
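The same data in an actual key-value store might look like the following hedged Python sketch, using the redis-py client against a local Redis server; the key-naming convention is invented for the example:

import json
import redis

r = redis.Redis()  # assumes a Redis server on localhost:6379

# Each row is just an opaque value stored under a key; the "schema"
# lives entirely in the application, not the database
r.set("customers:101", json.dumps({
    "First_Name": "Andrew", "Last_Name": "Brust",
    "Address": "123 Main Street", "Last_Order": 1501,
}))
r.set("orders:1501", json.dumps({
    "Price": "300 USD", "Item1": 52134, "Item2": 24457,
}))

# There is no JOIN: the application fetches by key and follows references
customer = json.loads(r.get("customers:101"))
order = json.loads(r.get("orders:%d" % customer["Last_Order"]))
print(order["Price"])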
Wide Column Stores
• Has tables with declared column families; each column family has "columns" which are KV pairs that can vary from row to row
• These are the most foundational for large sites:
  • BigTable (Google)
  • HBase (originally part of the Yahoo-dominated Hadoop project)
  • Cassandra (Facebook); calls column families "super columns" and tables "super column families"
• They are the most "Big Data"-ready, especially HBase + Hadoop
Wide Column Stores: An Example

Table: Customers
  Row ID: 101
    Super Column: Name
      Column: First_Name: Andrew
      Column: Last_Name: Brust
    Super Column: Address
      Column: Number: 123
      Column: Street: Main Street
    Super Column: Orders
      Column: Last_Order: 1501
  Row ID: 202
    Super Column: Name
      Column: First_Name: Jane
      Column: Last_Name: Doe
    Super Column: Address
      Column: Number: 321
      Column: Street: Elm Street
    Super Column: Orders
      Column: Last_Order: 1502

Table: Orders
  Row ID: 1501
    Super Column: Pricing
      Column: Price: 300 USD
    Super Column: Items
      Column: Item1: 52134
      Column: Item2: 24457
  Row ID: 1502
    Super Column: Pricing
      Column: Price: 2500 GBP
    Super Column: Items
      Column: Item1: 98456
      Column: Item2: 59428
Demo: Wide Column Stores
Document Stores
• Have "databases," which are akin to tables, and "documents," akin to rows
• Documents are typically JSON objects:
  • Each document has properties and values
  • Values can be scalars, arrays, links to documents in other databases, or sub-documents (i.e. contained JSON objects, which allows for hierarchical storage)
  • Can have attachments as well
• Old versions are retained, so doc stores work well for content management
• Some view doc stores as specialized KV stores
• Most popular with developers, startups, VCs
• The biggies: CouchDB (and derivatives), MongoDB
Document Store Application Orientation
• Documents can each be addressed by URIs; CouchDB supports a full REST interface (a sketch follows below)
• Very geared toward JavaScript and JSON:
  • Documents are JSON objects
  • CouchDB/MongoDB use JavaScript as their native language
• In CouchDB, "view functions" also have unique URIs and they return HTML, so you can build entire applications in the database
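Because the interface is plain REST, a document can be created with nothing more than an HTTP PUT. A hedged Python sketch using the requests library against a local CouchDB; the database name and document are made up:

import requests

base = "http://localhost:5984"  # assumes a local CouchDB

# Create a database (returns an error body if it already exists)
requests.put("%s/customers" % base)

# PUT a JSON document at a URI of our choosing; CouchDB assigns a revision
doc = {"First_Name": "Jane", "Last_Name": "Doe",
       "Address": {"Number": 321, "Street": "Elm Street"}}
resp = requests.put("%s/customers/202" % base, json=doc)
print(resp.json())  # e.g. {'ok': True, 'id': '202', 'rev': '1-...'}

# GET it back by the same URI
print(requests.get("%s/customers/202" % base).json())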
Document Stores: An Example

Database: Customers
  Document ID: 101
    First_Name: Andrew
    Last_Name: Brust
    Address:
      Number: 123
      Street: Main Street
    Orders:
      Most_recent: 1501
  Document ID: 202
    First_Name: Jane
    Last_Name: Doe
    Address:
      Number: 321
      Street: Elm Street
    Orders:
      Most_recent: 1502

Database: Orders
  Document ID: 1501
    Price: 300 USD
    Item1: 52134
    Item2: 24457
  Document ID: 1502
    Price: 2500 GBP
    Item1: 98456
    Item2: 59428

Comparing…
Demo: Document Stores
Graph Databases
• Great for social network applications and others where relationships are important
• Nodes and edges: an edge is like a join, nodes are like rows in a table
• Nodes can also have properties and values
• Neo4j is a popular graph DB
Graph Databases: An Example
[Diagram: person nodes (Andrew Brust, Jane Doe, Joe Smith, George Washington) linked by edges such as "friend of," "sent invitation to," "commented on photo by," and "placed order"; other nodes include an Address (123 Main Street, New York, NY 10014), an Order (ID: 252, Total Price: 300 USD), and Items (ID: 52134, Dress, Blue; ID: 24457, Shirt, Red).]
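To make nodes, edges, and properties concrete, here is a small pure-Python sketch of the structure in the diagram above; it mimics the idea, not any particular graph database's API:

# Nodes carry properties; edges carry a type and connect two node IDs
nodes = {
    "andrew": {"type": "Person", "name": "Andrew Brust"},
    "jane":   {"type": "Person", "name": "Jane Doe"},
    "order1": {"type": "Order", "id": 252, "total": "300 USD"},
}
edges = [
    ("andrew", "friend of", "jane"),
    ("andrew", "placed order", "order1"),
]

# A query is a traversal: find everyone Andrew is a friend of
friends = [nodes[dst]["name"]
           for src, rel, dst in edges
           if src == "andrew" and rel == "friend of"]
print(friends)  # ['Jane Doe']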
NoSQL on Windows Azure
Platform as a Service:
• Cloudant: https://cloudant.com/azure/
• MongoDB (via MongoLab): http://blog.mongolab.com/2012/10/azure/

MongoDB, DIY:
• On an Azure Worker Role: http://www.mongodb.org/display/DOCS/MongoDB+on+Azure+Worker+Roles
• On a Windows VM: http://www.mongodb.org/display/DOCS/MongoDB+on+Azure+VM+-+Windows+Installer
• On a Linux VM: http://www.mongodb.org/display/DOCS/MongoDB+on+Azure+VM+-+Linux+Tutorial and http://www.windowsazure.com/en-us/manage/linux/common-tasks/mongodb-on-a-linux-vm/
NoSQL on Windows Azure
Others, DIY (Linux VMs):
• Couchbase: http://blog.couchbase.com/couchbase-server-new-windows-azure
• CouchDB: http://ossonazure.interoperabilitybridges.com/articles/couchdb-installer-for-windows-azure
• Riak: http://basho.com/blog/technical/2012/10/09/Riak-on-Microsoft-Azure/
• Redis: http://blogs.msdn.com/b/tconte/archive/2012/06/08/running-redis-on-a-centos-linux-vm-in-windows-azure.aspx
• Cassandra: http://www.windowsazure.com/en-us/manage/linux/other-resources/how-to-run-cassandra-with-linux/
NoSQL + BI
• NoSQL databases are bad for ad hoc query and data warehousing
• BI applications involve models; models rely on schema
• Extract, transform and load (ETL) may be your friend
• Wide-column stores, however, are good for "Big Data" (see next slide)
• Wide-column stores and column-oriented databases are similar technologically
NoSQL + Big Data
• Big Data and NoSQL are interrelated
• Typically, wide-column stores are used in Big Data scenarios
• Prime example: HBase and Hadoop
• Why?
  • Lack of indexing is not a problem
  • Consistency is not an issue
  • Fast reads are very important
  • Distributed file systems are important too
  • Commodity hardware and disk assumptions also important
  • Not Web scale, but massive scale-out, so similar concerns
NoSQL Compromises
• Eventual consistency
• Write buffering
• Only primary keys can be indexed
• Queries must be written as programs
• Tooling; productivity (= money)
Common DBA Tasks in NoSQL
RDBMS                            NoSQL
Import data                      Import data
Set up security                  Set up security
Perform a backup                 Make a copy of the data
Restore a database               Move a copy to a location
Create an index                  Create an index
Join tables together             Run MapReduce
Schedule a job                   Schedule a (cron) job
Run database maintenance         Monitor space and resources used
Send an email from SQL Server    Set up resource threshold alerts
Search BOL                       Interpret documentation
Which Type of NoSQL for Which Type of Data?
Type of Data               Type of NoSQL Solution    Example
Log files                  Wide column               HBase
Product catalogs           Key-value on disk         DynamoDB
User profiles              Key-value in memory       Redis
Startups                   Document                  MongoDB
Social media connections   Graph                     Neo4j
LOB w/ transactions        NONE! Use an RDBMS        SQL Server
Relational vs. NoSQL
• Line of business -> Relational
• Large, public (consumer)-facing sites -> NoSQL
• Complex data structures -> Relational
• Big Data -> NoSQL
• Transactional -> Relational
• Content management -> NoSQL
• Enterprise -> Relational
• Consumer Web -> NoSQL
• Data scientists…
NoSQL To-Do List
• Understand CAP & the types of NoSQL databases:
  • Use NoSQL when business needs designate
  • Use the right type of NoSQL for your business problem
• Try out NoSQL on the cloud:
  • Quick and cheap for behavioral data
  • Mash up cloud datasets
  • Good for specialized use cases, i.e. dev, test, and training environments
• Learn NoSQL access technologies:
  • New query languages, i.e. MapReduce, R, Infer.NET
  • New query tools (vendor-specific): Google Refine, Amazon Karmasphere, Microsoft Excel connectors, etc.
NoSQL for .NET Developers
• RavenDB
• MongoDB C#/.NET driver
• MongoDB on Windows Azure
• Couchbase .NET client library
• Riak client for .NET
• AWS Toolkit for Visual Studio
• Google cloud APIs (REST-based)
Thank You
• [email protected]
• @andrewbrust on Twitter
• Want to get on Blue Badge Insights' list? Text "bluebadge" to 22828