Trigger Warning
This presentation, and materials to which it links, contains triggers. These will be triggering reactive, asynchronous, and message-driven environments.
A safe room is available in Empire West, where Alan Viars is presenting Modernizing National Health Care.
Objectionable Content
Language Impurities
• Basic Linear Algebra Subprograms (BLAS - Fortran)
• Node-RED visual programming
• Node.js
• Scala, with Perlish accent (Ehrmegerd, nerl perlish!)
• Java, C++, Prolog
• Twitter: unfiltered, live feed
• Machine recommendations
• Degenerate cases
Let’s Try It
Node-RED Twitter with Watson Resonance
db.tweets.aggregate([
  {$group: {
    _id: { hour: {$hour: "$date"}, minute: {$minute: "$date"} },
    total: {$sum: "$sentiment.score"},
    average: {$avg: "$sentiment.score"},
    count: {$sum: 1},
    happyTalk: {$push: "$sentiment.positive"}
  }},
  {$unwind: "$happyTalk"},
  {$unwind: "$happyTalk"},
  {$group: {
    _id: "$_id",
    total: {$first: "$total"},
    average: {$first: "$average"},
    count: {$first: "$count"},
    happyTalk: {$addToSet: "$happyTalk"}
  }},
  {$sort: {_id: -1}}
])
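To make the pipeline's effect concrete, the following is a minimal in-memory sketch of the same per-minute roll-up in plain JavaScript. It is not how MongoDB executes the query; the sample tweets and the function name `summarizeByMinute` are hypothetical, and only the field names (`date`, `sentiment.score`, `sentiment.positive`) come from the pipeline above.

```javascript
// Hypothetical sample tweets; only the field names come from the pipeline.
const tweets = [
  { date: new Date(Date.UTC(2015, 5, 1, 14, 30, 5)),  sentiment: { score: 2,  positive: ["love", "great"] } },
  { date: new Date(Date.UTC(2015, 5, 1, 14, 30, 40)), sentiment: { score: -1, positive: [] } },
  { date: new Date(Date.UTC(2015, 5, 1, 14, 31, 10)), sentiment: { score: 3,  positive: ["great", "happy"] } },
];

function summarizeByMinute(docs) {
  const groups = new Map();
  for (const doc of docs) {
    // $hour and $minute read the date in UTC.
    const hour = doc.date.getUTCHours();
    const minute = doc.date.getUTCMinutes();
    const key = hour * 60 + minute;
    if (!groups.has(key)) {
      groups.set(key, { _id: { hour, minute }, total: 0, count: 0, happyTalk: new Set() });
    }
    const g = groups.get(key);
    g.total += doc.sentiment.score;
    g.count += 1;
    // $push collects an array of arrays; the two $unwind stages flatten it
    // and $addToSet removes duplicates. A Set does both in one step here.
    for (const word of doc.sentiment.positive) g.happyTalk.add(word);
  }
  return [...groups.values()]
    // $unwind emits nothing for an empty array, so a minute with no
    // positive words at all drops out of the pipeline's result.
    .filter(g => g.happyTalk.size > 0)
    .map(g => ({
      _id: g._id,
      total: g.total,
      average: g.total / g.count,
      count: g.count,
      happyTalk: [...g.happyTalk],
    }))
    // {$sort: {_id: -1}}: newest minute first.
    .sort((a, b) => b._id.hour - a._id.hour || b._id.minute - a._id.minute);
}

const perMinute = summarizeByMinute(tweets);
console.log(JSON.stringify(perMinute, null, 2));
```

The sketch keeps the words in insertion order, whereas `$addToSet` makes no promise about element order.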
The What and the Why
Machine Learning
• What: depends on who you ask
  – learning that is done by machines [my lab partner]
  – algorithms that can learn from and make predictions on data [Wikipedia, just now]
  – induction and … other algorithms that can be said to “learn” [Kohavi 1998 goo.gl/WvEmNJ]
  – whatever the heck we’re selling [cloud vendors]
  – common cognitive framework, ingests content, observe, interpret, evaluate, decide [IBM Watson]
  – predictive analytics [Microsoft Azure, AWS]
  – algorithmic grab-bag [Mahout, MLlib]
• Why: depends on what you want
  – Engagement, discovery, decision [Watson]
  – Prediction: maintenance, demand, resource allocation [Azure]
  – Analytics: fraud, personalization, marketing, churn, support [AWS]
Apache Mahout Samsara
• Architectures: standalone, MapReduce, Spark, H2O
• Languages: DSL shell, Java
• Functions
  – Collaborative filtering
  – Classification
  – Clustering
  – Dimensionality reduction
  – Topic models
  – Miscellany
Example: Create topic grouping for Wikipedia articles
Spark MLlib
• Languages: Scala, Java, Python
• Clusters: EC2, YARN, Mesos, standalone
• Linear algebra: Scala Breeze / Fortran BLAS
• Data: vector, point, matrix
• Functions
  – Basic stats
  – Classification and regression
  – Collaborative filtering
  – Clustering
  – Dimensionality reduction (remove variables)
  – Feature extraction & transformation
  – Frequent pattern mining
  – Optimization (local min/max)
Example: interactive drill-down categories for large result set
The Magic of Alternating Least Squares Latent Factoring
Which is the real me?
Movies recommended for you:
 1: The Sound of Music (1965)
 2: Snow White and the Seven Dwarfs (1937)
 3: Beauty and the Beast (1991)
 4: Charlie Brown Christmas, A (1965)
 5: Bambi (1942)
 6: Seven Brides for Seven Brothers (1954)
 7: Mary Poppins (1964)
 8: Pinocchio (1940)
 9: Gone with the Wind (1939)
10: The Wizard of Oz (1939)

Movies recommended for you:
 1: Maradona by Kusturica (2008)
 2: Shadows of Forgotten Ancestors (1964)
 3: Rosario Tijeras (2005)
 4: Constantine's Sword (2007)
 5: Titicut Follies (1967)
 6: Lady Chatterley (2006)
 7: August Evening (2007)
 8: Power of Nightmares: The Rise of the Politics of Fear, The (2004)
 9: Sun Alley (Sonnenallee) (1999)
10: Who's Singin' Over There? (a.k.a. Who Sings Over There) (Ko to tamo peva) (1980)
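For readers who have not met alternating least squares, here is a toy sketch of the idea behind recommendation lists like the two above: each user and each item gets a short latent vector, and the algorithm alternates between solving for user vectors with item vectors fixed and the reverse. This is an illustration only, not MLlib's implementation. The rank (2), the regularization weight `lambda`, the deterministic starting values, and the ratings themselves are all assumptions chosen to keep the example small.

```javascript
// Toy ratings as [user, item, rating]; hypothetical data, not from the talk.
const ratings = [
  [0, 0, 5], [0, 1, 4], [0, 2, 1],
  [1, 0, 4], [1, 1, 5], [1, 3, 1],
  [2, 2, 5], [2, 3, 4], [2, 0, 1],
  [3, 2, 4], [3, 3, 5], [3, 1, 1],
];

// Solve one regularized least-squares problem for a rank-2 factor.
// pairs is a list of [[x, y], rating], where [x, y] is the fixed factor
// on the other side (an item vector when solving for a user, and vice versa).
function solveFactor(pairs, lambda) {
  let a = lambda, b = 0, d = lambda, e = 0, f = 0;
  for (const [[x, y], r] of pairs) {
    a += x * x; b += x * y; d += y * y;
    e += r * x; f += r * y;
  }
  const det = a * d - b * b; // positive because lambda > 0
  return [(d * e - b * f) / det, (a * f - b * e) / det];
}

// Squared error on observed ratings plus L2 penalty on all factors.
function objective(data, U, V, lambda) {
  let loss = 0;
  for (const [i, j, r] of data) {
    const err = r - (U[i][0] * V[j][0] + U[i][1] * V[j][1]);
    loss += err * err;
  }
  const sumSquares = M => M.reduce((s, [x, y]) => s + x * x + y * y, 0);
  return loss + lambda * (sumSquares(U) + sumSquares(V));
}

function als(data, numUsers, numItems, lambda = 0.1, sweeps = 10) {
  // Deterministic start for repeatability; real systems start randomly.
  let U = Array.from({ length: numUsers }, (_, i) => [1, i % 2 === 0 ? 0.5 : -0.5]);
  let V = Array.from({ length: numItems }, (_, j) => [1, j % 2 === 0 ? 0.5 : -0.5]);
  const history = [objective(data, U, V, lambda)];
  for (let s = 0; s < sweeps; s++) {
    // Hold item factors fixed and solve for every user.
    U = U.map((_, i) =>
      solveFactor(data.filter(([user]) => user === i).map(([, j, r]) => [V[j], r]), lambda));
    // Hold user factors fixed and solve for every item.
    V = V.map((_, j) =>
      solveFactor(data.filter(([, item]) => item === j).map(([i, , r]) => [U[i], r]), lambda));
    history.push(objective(data, U, V, lambda));
  }
  return { U, V, history };
}

const { U, V, history } = als(ratings, 4, 4);
const predict = (i, j) => U[i][0] * V[j][0] + U[i][1] * V[j][1];
console.log("objective:", history[0].toFixed(2), "->", history[history.length - 1].toFixed(2));
console.log("user 0, unrated item 3:", predict(0, 3).toFixed(2));
```

Because each half-step solves its least-squares problem exactly, the objective cannot increase from one sweep to the next, which is what makes the alternation well behaved even though the joint problem is not convex.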
Watson Developer Cloud
• Presented as services for Bluemix
• RESTful calls
• Node.js
• Node-RED
Example: Message resonance for email solicitation
Microsoft Azure
• R and Python
• Flowchart GUI
• Correlation, modeling, trend projection, forecasting
• HDInsight cloud Hadoop
• Publishing for profit via Machine Learning Gallery
  – Voice recognition
  – Customer churn prediction
  – Text extraction: sentiment and key phrase
  – Contributor donation propensity
  – Frequently bought together
  – Classifier
  – Clustering
  – Linear regression
  – … 35 total in market [goo.gl/LhMbUu]
Example: Retail forecasting
AWS
• Create models
• Generate predictions
• Data: S3, Redshift, RDS
• APIs: Java, .NET, Python, PHP, Node, Ruby
• Mobile SDK
• Use cases
  – Fraud detection
  – Content personalization
  – Marketing propensity modeling
  – Document classification
  – Customer churn prediction
  – Customer support solutions
Example: Marketing response prediction
MongoDB
• Next-gen database
  – Document-model
  – Scalable
  – Highly-available
  – Secondary indexes
• Agile with schema and query types
• Subsecond query response over multiple indexes
• Low-second aggregation framework for basic analytics
Example: Number of articles by author
• In-database mapReduce
• Hadoop connector
  – Mongo[Input|Output]Format
  – mongo.[input|output].uri or BSON
  – mongo.input.query
MongoDB Data Operations Spectrum
• Retrieve Nothing – infinitely fast
• Document Retrieval – 1ms if in cache, ~10ms from spinning disk
• .find() – per-document cost similar to single document
  – _id range
  – any secondary index range, can be composite key
  – intersect two indexes
  – covered indexes even faster
• .count(), .distinct(), .group() – fast, may be covered
• .aggregate() – retrieval cost like find, plus pipeline operations
  – $match
  – $group
  – $project
  – $redact
• .mapReduce() – in-database JavaScript
• Hadoop Connector
  – mongo.input.query for indexed partial scan
  – full scan
(Arrow alongside the list: faster at the top, slower at the bottom.)
Using Spark
Topic Detection
• Grouping documents according to topics, especially over time
  – Google News
• Latent Dirichlet Allocation
  – Corpus of M documents, each of N words; wij is the word at position j in document i
  – Documents have (latent) topic distributions with Dirichlet prior α; θi for document i
  – Topics have word distributions with Dirichlet prior β; φk for topic k
  – zij is the topic contributing the word at position j in document i
  – Remove stopwords!
• Tweets
  – Large, terse corpus
  – Highly sensitive to number of iterations (10 iterations returned little more than the word distribution)
  – Requires some iterative stopwording
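The LDA symbols in the bullets above fit the usual generative story for smoothed LDA, written out below. Two conventions are assumptions of this sketch rather than part of the slide: K stands for the number of topics, and the subscript ij means word position j in document i.

```latex
\begin{align*}
\theta_i  &\sim \operatorname{Dirichlet}(\alpha),          & i &= 1,\dots,M \\
\varphi_k &\sim \operatorname{Dirichlet}(\beta),           & k &= 1,\dots,K \\
z_{ij}    &\sim \operatorname{Categorical}(\theta_i),      & j &= 1,\dots,N \\
w_{ij}    &\sim \operatorname{Categorical}(\varphi_{z_{ij}})
\end{align*}

\[
p(W, Z, \theta, \varphi \mid \alpha, \beta)
  = \prod_{k=1}^{K} p(\varphi_k \mid \beta)
    \prod_{i=1}^{M} \Bigl[ p(\theta_i \mid \alpha)
    \prod_{j=1}^{N} p(z_{ij} \mid \theta_i)\, p(w_{ij} \mid \varphi_{z_{ij}}) \Bigr]
\]
```

Only the words are observed; inference recovers the per-document topic mixtures and the per-topic word distributions. That is why frequent stopwords, left in, tend to dominate every topic.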
"Smoothed LDA" by Slxu.public - Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:Smoothed_LDA.png#/media/File:Smoothed_LDA.png
"Dirichlet distributions" by en:User:ThG - en:Image:Dirichlet_distributions.png. Licensed under Public Domain via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:Dirichlet_distributions.png#/media/File:Dirichlet_distributions.png
*
*           Form  C := alpha*A**H*B + beta*C.
*
              DO 120 J = 1,N
                  DO 110 I = 1,M
                      TEMP = ZERO
                      DO 100 L = 1,K
                          TEMP = TEMP + CONJG(A(L,I))*B(L,J)
  100                 CONTINUE
                      IF (BETA.EQ.ZERO) THEN
                          C(I,J) = ALPHA*TEMP
                      ELSE
                          C(I,J) = ALPHA*TEMP + BETA*C(I,J)
                      END IF
  110             CONTINUE
  120         CONTINUE
          ELSE
*
*           Form  C := alpha*A**T*B + beta*C
*
              DO 150 J = 1,N
                  DO 140 I = 1,M
                      TEMP = ZERO
                      DO 130 L = 1,K
                          TEMP = TEMP + A(L,I)*B(L,J)
  130                 CONTINUE
                      IF (BETA.EQ.ZERO) THEN
                          C(I,J) = ALPHA*TEMP
                      ELSE
                          C(I,J) = ALPHA*TEMP + BETA*C(I,J)
                      END IF
  140             CONTINUE
  150         CONTINUE
          END IF
      ELSE IF (NOTA) THEN
          IF (CONJB) THEN
*
*           Form  C := alpha*A*B**H + beta*C.
*
              DO 200 J = 1,N
                  IF (BETA.EQ.ZERO) THEN
                      DO 160 I = 1,M
                          C(I,J) = ZERO
  160                 CONTINUE
                  ELSE IF (BETA.NE.ONE) THEN
                      DO 170 I = 1,M
                          C(I,J) = BETA*C(I,J)
  170                 CONTINUE
                  END IF
                  DO 190 L = 1,K
                      IF (B(J,L).NE.ZERO) THEN
                          TEMP = ALPHA*CONJG(B(J,L))
                          DO 180 I = 1,M
                              C(I,J) = C(I,J) + TEMP*A(I,L)
  180                     CONTINUE
                      END IF
  190             CONTINUE
  200         CONTINUE
          ELSE
*
*           Form  C := alpha*A*B**T + beta*C
*
              DO 250 J = 1,N
                  IF (BETA.EQ.ZERO) THEN
                      DO 210 I = 1,M
                          C(I,J) = ZERO
Create the Resilient Distributed Dataset (RDD)
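import org.apache.hadoop.conf.Configuration
import org.bson.BSONObject
import com.mongodb.hadoop.MongoInputFormat

// Point the MongoDB Hadoop connector at the source collection, restrict the
// scan with a query on _id, and name the output collection.
val config = new Configuration()
config.set("mongo.input.uri", "mongodb://127.0.0.1:27017/marketdata.minbars")
config.set("mongo.input.query", """{"_id":{"$gt":{"$date":1182470400000}}}""")
config.set("mongo.output.uri", "mongodb://127.0.0.1:27017/marketdata.fiveminutebars")

// sc is the SparkContext. Each record is a key/value pair whose value is
// the BSON document read from MongoDB.
val minBarRawRDD = sc.newAPIHadoopRDD(
  config,
  classOf[MongoInputFormat],
  classOf[Object],
  classOf[BSONObject])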
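Operate through Spark on the RDD Object
// groupBars is not defined on the slide; it is assumed to hold the one-minute
// bars grouped into five-minute windows, each group in time order.
val fiveMinBars = groupBars.map(
  g => (
    // Key: the _id of the first bar in the window.
    g.head.get("_id"),
    // Value: a copy of the first bar, overwritten with the window's
    // last close, highest high, lowest low, and total volume.
    new BasicBSONObject(g.head.toMap()).
      append("Close", g.last.get("Close")).
      append("High", g.map(b => b.get("High").toString.toFloat).reduceLeft(math.max)).
      append("Low", g.map(b => b.get("Low").toString.toFloat).reduceLeft(math.min)).
      append("Volume", g.map(b => b.get("Volume").toString.toInt).foldLeft(0)(_ + _))
  )
)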
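The roll-up in the Scala snippet is easier to follow on a handful of rows. The sketch below reproduces the same per-window logic in plain JavaScript on in-memory data: keep the first bar's fields, then take the last close, the maximum high, the minimum low, and the summed volume. The sample bars, the `ts` field, and the fixed five-minute bucketing are assumptions for illustration; the slide does not show how its groups are formed.

```javascript
// Hypothetical one-minute bars; ts is epoch milliseconds.
const minuteBars = [
  { ts: 0,      Open: 10.0, High: 11.0, Low: 9.0,  Close: 10.5, Volume: 100 },
  { ts: 60000,  Open: 10.5, High: 12.0, Low: 10.0, Close: 11.0, Volume: 200 },
  { ts: 120000, Open: 11.0, High: 11.5, Low: 10.5, Close: 11.0, Volume: 50 },
  { ts: 180000, Open: 11.0, High: 11.0, Low: 8.0,  Close: 9.0,  Volume: 300 },
  { ts: 240000, Open: 9.0,  High: 10.0, Low: 9.0,  Close: 9.5,  Volume: 150 },
  { ts: 300000, Open: 9.5,  High: 10.0, Low: 9.0,  Close: 10.0, Volume: 75 },
];

// Assumption: bars arrive in time order and a window is a fixed 5-minute bucket.
function toFiveMinuteBars(bars) {
  const windowMs = 5 * 60 * 1000;
  const groups = new Map();
  for (const bar of bars) {
    const bucket = Math.floor(bar.ts / windowMs) * windowMs;
    if (!groups.has(bucket)) groups.set(bucket, []);
    groups.get(bucket).push(bar);
  }
  return [...groups.values()].map(g => ({
    ...g[0],                                   // first bar supplies ts and Open
    Close: g[g.length - 1].Close,              // last close in the window
    High: Math.max(...g.map(b => b.High)),     // highest high
    Low: Math.min(...g.map(b => b.Low)),       // lowest low
    Volume: g.reduce((sum, b) => sum + b.Volume, 0), // total volume
  }));
}

const fiveMinuteBars = toFiveMinuteBars(minuteBars);
console.log(fiveMinuteBars);
```

Copying the first bar and overwriting four fields mirrors the `new BasicBSONObject(g.head.toMap())` step: any field not explicitly recomputed, such as Open, carries over from the start of the window.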
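Put It Back Where You Found It
// Create a separate Configuration for saving data back to MongoDB.
// mongoPort is defined outside this snippet and is expected to supply the
// host and port part of the connection string.
val outputConfig = new Configuration()
outputConfig.set("mongo.output.format", "com.mongodb.hadoop.MongoOutputFormat")
outputConfig.set("mongo.output.uri", "mongodb://"
  + mongoPort
  + "/marketdata.fiveminutebars")
// The file path is a placeholder: the API requires one, but the output
// goes to the collection named in mongo.output.uri.
fiveMinBars.saveAsNewAPIHadoopFile(
  "file:///dummy",
  classOf[Any],
  classOf[Any],
  classOf[MongoOutputFormat[_,_]],
  outputConfig)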