R Statistics with MongoDB

Dr. Markus Schmidberger

October 14th, 2013, Munich, Germany

Email: [email protected]

Twitter: @cloudHPC

Outline

Introduction to Big Data, MongoSoup and R

R statistics with MongoDB and Examples

Summary & Questions

Big Data

Wikipedia: … a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing. …

storing

processing

Storing: NoSQL - MongoDB

databases using looser consistency models to store data

German MongoDB as a Service: MongoSoup

cloudControl Add-On

currently running on AWS EU-Region (Ireland)

all features available: shared / dedicated hosting, replicaset, sharding

24/7 support available

MongoSoup in < 5 min

go to cloudControl: www.cloudcontrol.com

add an account and a billing address

create a new app, e.g. “rmongodb”

install cloudControl command line tools: cctrlapp

enable your preferred MongoSoup hosting: cctrlapp rmongodb/default addon.add mongosoup.medium

go to the cloudControl Web-Console-AddOns and get your credentials: https://www.cloudcontrol.com/console/app/rmongodb

Processing: Analyzing with R and Hadoop

backward-looking analysis is outdated

today: quasi real-time analysis

tomorrow: forward-looking predictive analysis

more complex methods, more data available, more processing time required

Check my Strata London Tutorial “Big Data Analyses with R”

Introduction to R

R is a free software environment for statistical computing and graphics

offers tools to manage and analyze data

standard statistical methods are implemented

compiles and runs under different OS

support via huge community

www.r-project.org

huge online libraries with > 5000 R packages: http://cran.r-project.org

possibility to write personalized code and to contribute new packages

really famous since January 6, 2009: The New York Times, “Data Analysts Captivated by R's Power”

RStudio IDE

http://www.rstudio.com

R as calculator

(5+5) - 1 * 3

[1] 7

x <- 3
x

[1] 3

x^2 + 4

[1] 13

y <- c(1,2,3)
y

[1] 1 2 3

x <- 1:10
x

[1] 1 2 3 4 5 6 7 8 9 10

x < 5

[1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE

x[3:7]

[1] 3 4 5 6 7

mean(x)

[1] 5.5

help("mean")
?mean

Many Statistical Functions

kmeans(dat, 4)

K-means clustering with 4 clusters of sizes 21, 18, 30, 31

Cluster means:
     [,1]    [,2]
1  0.7755  0.8509
2 -0.1557 -0.2305
3  1.2299  1.1472
4  0.1510  0.1507

Clustering vector:
 [1] 4 2 4 4 2 4 4 4 2 4 4 4 2 2 4 4 1 4 2 2 2 4 4 4 2 4 2 4 4 2 4 2 2 4 4
[36] 4 4 4 4 4 4 4 4 2 4 2 2 4 2 2 1 1 1 1 3 1 3 3 3 1 1 3 3 3 3 1 3 1 3 3
[71] 1 3 1 1 3 3 3 3 1 1 3 3 1 1 1 3 3 3 3 1 3 1 3 3 3 3 1 3 3 3

Within cluster sum of squares by cluster:
[1] 3.318 1.166 4.019 3.195
 (between_SS / total_SS = 83.0 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"
[5] "tot.withinss" "betweenss"    "size"

# cl is assumed to hold the result of kmeans(dat, 4) from the previous slide
plot(dat, col = cl$cluster, cex=2, pch=16)
points(cl$centers, col = 1:4, pch = 13, cex = 4)
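A self-contained sketch of the clustering and plot shown above. The slides do not show how dat was created, so the simulated two-group data below is an assumption (it follows the example in ?kmeans), and the numbers will differ from the printed output.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Assumed stand-in for the undefined `dat`: two Gaussian clouds in 2-D
dat <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2)
)

# Fit 4 clusters; nstart > 1 restarts from several random sets of centers
cl <- kmeans(dat, centers = 4, nstart = 10)

cl$size      # number of points per cluster
cl$centers   # one row per cluster center

# Points colored by cluster, centers drawn on top
plot(dat, col = cl$cluster, cex = 2, pch = 16)
points(cl$centers, col = 1:4, pch = 13, cex = 4)
```

Storing the fit in cl is what makes cl$cluster and cl$centers available to the plotting calls.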

R Shiny - easy web application

developed by RStudio

turns R analyses into interactive web applications that anyone can use

let your users choose input parameters using friendly controls like sliders, drop-downs, and text fields

easily incorporate any number of outputs like plots, tables, and summaries

no HTML or JavaScript knowledge is necessary, only R

http://www.rstudio.com/shiny/
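A minimal sketch of such an application, assuming a current version of the shiny package is installed. The input name "bins", the output name "hist" and the use of the built-in faithful data set are illustrative choices, not taken from the talk.

```r
library(shiny)

# UI: one slider as input control, one plot as output
ui <- fluidPage(
  titlePanel("Histogram demo"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 20)
    ),
    mainPanel(plotOutput("hist"))
  )
)

# Server: the plot is re-drawn whenever the slider value changes
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(faithful$waiting, breaks = input$bins,
         main = "Waiting time between eruptions", xlab = "minutes")
  })
}

app <- shinyApp(ui = ui, server = server)

# Launch the app only in an interactive session
if (interactive()) runApp(app)
```

Both the interface and the logic are plain R; shiny generates the HTML and JavaScript.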

R and Databases

SQL provides a standard language to filter, aggregate, group, sort data

SQL in new places: Hive, Impala, …

ODBC provides SQL interface to non-database data (Excel, CSV, text files)

R stores relational data in data.frames (extended lists)

data(iris)
head(iris, n=3)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa

class(iris)

[1] "data.frame"
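The SQL-style operations named earlier (filter, aggregate, group, sort) can also be sketched with base R functions on a data.frame. The columns used here are chosen only for illustration.

```r
data(iris)

# WHERE: keep only the rows matching a condition
setosa <- subset(iris, Species == "setosa")

# GROUP BY with an aggregate: mean sepal length per species
means <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)

# ORDER BY ... DESC: sort the groups by their mean
means <- means[order(means$Sepal.Length, decreasing = TRUE), ]

nrow(setosa)
means
```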

R package: sqldf

running SQL statements on R data frames

library(sqldf)
sqldf("select * from iris limit 2")

  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa

sqldf("select count(*) from iris")

  count(*)
1      150

Other relational R packages

RMySQL package provides an interface to MySQL

RPostgreSQL package provides an interface to PostgreSQL

ROracle package provides an interface for Oracle

RJDBC package provides access to databases through a JDBC interface

RSQLite package provides access to SQLite (SQLite engine is included)

One big problem: all packages read the full result in R memory

R and MongoDB

on CRAN there are two packages to connect R with MongoDB

rmongodb (supported by MongoDB, Inc.)

  powerful for big data

  difficult to use due to BSON objects

RMongo

  easy to use

  limited functionality

  reads full results in R memory

  does not work on Mac OS X

R package: RMongo

library(RMongo)
mongo <- mongoDbConnect("cc_JwQcDLJSYQJb", "dbs001.mongosoup.de", 27017)
dbAuthenticate(mongo, username="JwQcDLJSYQJb", password="RSXPkUkXXXXX")

dbShowCollections(mongo)
dbGetQuery(mongo, "zips", "{'state':'AL'}")
dbInsertDocument(mongo, "test_data", '{"foo": "bar", "size": 5 }')

dbDisconnect(mongo)

R package: rmongodb

developed on top of the MongoDB-supported C driver

library(rmongodb)
mongo <- mongo.create(host="dbs001.mongosoup.de", db="cc_JwQcDLJSYQJb", username="JwQcDLJSYQJb", password="RSXPkUkXXXXX")

mongo

[1] 0
attr(,"mongo")
<pointer: 0x105a1de80>
attr(,"class")
[1] "mongo"
attr(,"host")
[1] "dbs001.mongosoup.de"
attr(,"name")
[1] ""
attr(,"username")
[1] "JwQcDLJSYQJb"
attr(,"password")
[1] "RSXPkUkXXXXX"
attr(,"db")
[1] "cc_JwQcDLJSYQJb"
attr(,"timeout")
[1] 0

mongo.get.database.collections(mongo, "cc_JwQcDLJSYQJb")

[1] "cc_JwQcDLJSYQJb.zips" "cc_JwQcDLJSYQJb.ccp" "cc_JwQcDLJSYQJb.test"

mongo <- mongo.disconnect(mongo)

buf <- mongo.bson.buffer.create()
mongo.bson.buffer.append(buf, "state", "AL")

[1] TRUE

query <- mongo.bson.from.buffer(buf)
query

state : 2 AL

res <- mongo.find.one(mongo, "cc_JwQcDLJSYQJb.zips", query)
res

city : 2 ACMAR
loc : 4
  0 : 1 -86.515570
  1 : 1 33.584132
pop : 16 6055
state : 2 AL
_id : 2 35004

out <- mongo.bson.to.list(res)
out$loc

[1] -86.52 33.58

typeof(out$loc)

[1] "double"

out$pop

[1] 6055

out$state

[1] "AL"

cursor <- mongo.find(mongo, "cc_JwQcDLJSYQJb.zips", query)

res <- NULL
while (mongo.cursor.next(cursor)) {
  value <- mongo.cursor.value(cursor)
  Rvalue <- mongo.bson.to.list(value)
  res <- rbind(res, Rvalue)
}
err <- mongo.cursor.destroy(cursor)

head(res, n=4)

       city         loc       pop   state _id
Rvalue "ACMAR"      Numeric,2 6055  "AL"  "35004"
Rvalue "ADAMSVILLE" Numeric,2 10616 "AL"  "35005"
Rvalue "ADGER"      Numeric,2 3205  "AL"  "35006"
Rvalue "KEYSTONE"   Numeric,2 14218 "AL"  "35007"
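Growing res with rbind() inside the loop copies the result on every iteration and yields a matrix of list cells (hence the Numeric,2 entries). A possible alternative is sketched below without a database connection: the docs list is a hypothetical stand-in for the values returned by mongo.bson.to.list(), and doc_to_row() is a helper introduced here for illustration only.

```r
# Hypothetical stand-in for documents read from the cursor.
# The loc values other than ACMAR's are made-up placeholders.
docs <- list(
  list(city = "ACMAR", loc = c(-86.51557, 33.584132),
       pop = 6055, state = "AL", `_id` = "35004"),
  list(city = "ADAMSVILLE", loc = c(-86.9, 33.5),
       pop = 10616, state = "AL", `_id` = "35005"),
  list(city = "ADGER", loc = c(-87.1, 33.4),
       pop = 3205, state = "AL", `_id` = "35006")
)

# Flatten one document into a one-row data.frame (illustrative helper)
doc_to_row <- function(doc) {
  data.frame(
    id    = doc[["_id"]],
    city  = doc$city,
    state = doc$state,
    pop   = doc$pop,
    lon   = doc$loc[1],
    lat   = doc$loc[2],
    stringsAsFactors = FALSE
  )
}

# Convert every document first, then bind all rows once at the end
rows <- lapply(docs, doc_to_row)
zips <- do.call(rbind, rows)

str(zips)
```

The result is an ordinary data.frame with typed columns, so functions like mean(zips$pop) work directly.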

It is all about creating BSON query or field objects

b <- mongo.bson.from.list(list(name="Fred", age=29, city="Boston"))
b

name : 2 Fred
age : 1 29.000000
city : 2 Boston

mongo.bson.to.list(b)

$name
[1] "Fred"

$age
[1] 29

$city
[1] "Boston"

?mongo.bson
?mongo.bson.buffer.append
?mongo.bson.buffer.start.array
?mongo.bson.buffer.start.object

buf <- mongo.bson.buffer.create()
mongo.bson.buffer.append(buf, "aggregate", "zips")
mongo.bson.buffer.start.array(buf, "pipeline")
  # each pipeline stage is its own object in the array;
  # array elements are keyed "0", "1", ...
  mongo.bson.buffer.start.object(buf, "0")
    mongo.bson.buffer.start.object(buf, "$group")
      mongo.bson.buffer.append(buf, "_id", "$state")
      mongo.bson.buffer.start.object(buf, "totalPop")
        mongo.bson.buffer.append(buf, "$sum", "$pop")
      mongo.bson.buffer.finish.object(buf)
    mongo.bson.buffer.finish.object(buf)
  mongo.bson.buffer.finish.object(buf)
  mongo.bson.buffer.start.object(buf, "1")
    mongo.bson.buffer.start.object(buf, "$match")
      mongo.bson.buffer.start.object(buf, "totalPop")
        # numeric threshold, so it compares against the numeric sum
        mongo.bson.buffer.append(buf, "$gte", 10000)
      mongo.bson.buffer.finish.object(buf)
    mongo.bson.buffer.finish.object(buf)
  mongo.bson.buffer.finish.object(buf)
mongo.bson.buffer.finish.object(buf)  # closes the pipeline array
query <- mongo.bson.from.buffer(buf)
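What the pipeline above is meant to compute can be checked in plain R. The sketch assumes a small made-up zips data.frame (the rows are illustrative, not the real collection) and mirrors the two stages, $group and $match, with base functions; it does not talk to MongoDB.

```r
# Made-up sample with the two fields used by the pipeline
zips <- data.frame(
  state = c("AL", "AL", "AK", "AK", "WY"),
  pop   = c(6055, 10616, 3000, 2500, 12000),
  stringsAsFactors = FALSE
)

# $group: { _id: "$state", totalPop: { $sum: "$pop" } }
grouped <- aggregate(pop ~ state, data = zips, FUN = sum)
names(grouped) <- c("_id", "totalPop")

# $match: { totalPop: { $gte: 10000 } }
result <- grouped[grouped$totalPop >= 10000, ]
result
```

With this sample, AK (total 5500) is filtered out, while AL and WY remain.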

CCP Web Analytics Challenge

# empty query: match all documents
buf <- mongo.bson.buffer.create()
query <- mongo.bson.from.buffer(buf)
# return only the fields "user" and "type"
buf <- mongo.bson.buffer.create()
err <- mongo.bson.buffer.append(buf, "user", 1)
err <- mongo.bson.buffer.append(buf, "type", 1)
field <- mongo.bson.from.buffer(buf)
out <- mongo.find(mongo, "cc_JwQcDLJSYQJb.ccp", query, fields=field, limit=1000)
res <- NULL
while (mongo.cursor.next(out)) {
  value <- mongo.cursor.value(out)
  Rvalue <- mongo.bson.to.list(value)
  res <- rbind(res, Rvalue)
}

boxplot( as.integer(table(unlist(res[,2])) ), cex=4, horizontal=TRUE, main="Number of actions per user")
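The counting step inside the boxplot call can be tried without a database. The event log below is simulated (an assumption, not the CCP data); table() counts the rows per user, and those counts are what the boxplot summarizes.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Simulated event log: one row per action, up to 50 hypothetical users
res <- data.frame(
  user = sample(sprintf("u%02d", 1:50), size = 1000, replace = TRUE),
  type = sample(c("view", "click", "buy"), size = 1000, replace = TRUE),
  stringsAsFactors = FALSE
)

# Number of actions per user
actions <- as.integer(table(res$user))
summary(actions)

boxplot(actions, horizontal = TRUE, main = "Number of actions per user")
```

Selecting the column by name (res$user) avoids depending on the order in which the fields come back from the query.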

Shiny Mongo

R based MongoDB User Interface

R packages shiny and rmongodb

less than 200 lines of code

DEMO: http://localhost:8100

https://github.com/comsysto/ShinyMongo

Summary

R is a powerful statistical tool to analyse many different kinds of data

R can access databases

MongoDB and rmongodb ready for Big Data

start playing around with R, Big Data and MongoDB

http://www.r-project.org

http://www.mongodb.org

http://www.mongosoup.de

See you soon

thanks a lot for your attention

there are R trainings in December 2013 in Munich

we are hosting many events and meetups

meet you at the MongoSoup booth

http://comsysto.com/events.html#r

Email: [email protected]

Twitter: @cloudHPC