Microsoft
SQL Server 2016 R Services
Consistent experience from on-premises to cloud
Self-service BI per user: Microsoft $120, Tableau $480, Oracle $2,230
In-memory across all workloads
at massive scale
[Chart: SQL Server, Oracle, MySQL, and SAP HANA compared across six periods; vertical axis 0 to 80; data values not recoverable]
TPC-H: SQL Server is #1, #2, and #3; Oracle is #5
SQL Server 2016: Everything built-in
From data to decisions and actions
Value
Data
$1.6 trillion
Action, Decision
Microsoft advanced analytics products
Cortana Analytics Suite
SQL Server 2016
Typical advanced analytics lifecycle
Ingest, Transform, Explore, Model, Deploy, Score, Visualize, Measure
Model, Score: ƒ(x)
Phases: Preparation, Modeling, Production
Data scientists should be focused on building and testing models
Data scientist
But the reality is...
Data scientist focus time
Preparation, Modeling, Production: 80%, 5%, 15%
Decision
Production
Advanced analytics is a team sport
Preparation
Model
What is R?
Open source "lingua franca"
Analytics, computing, modeling
Global community
Millions of users
7,000+ packages
Big data
Ecosystem
Scalability
CRAN: The Comprehensive R Archive Network
In addition to CRAN, Bioconductor, GitHub, and others distribute R packages
A large pool of talent knows how to use it
Why R?
Scales with the data being computed on
Easier to protect important data
Efficiency for the roles that create and use it
$?
Challenges of open source R
Uncertain total cost of ownership and return on investment
Integrating R with existing and ever changing data infrastructures
Scale and Performance
Data movement restricts access for efficient data modeling
 | Open source R | Microsoft R | Benefit
Big Data | In-memory bound | Hybrid memory & disk scalability | Operates on bigger volumes & factors
Speed of Analysis | Single threaded | Parallel threading and processing | Shrinks analysis time
Enterprise Readiness | Community support | Commercial support | Delivers full-service production support
Analytic Breadth & Depth | 7,000+ innovative analytic packages | Leverage and optimize open source packages plus Big Data ready packages | Supercharges R
Commercial Viability | Risk of deployment of open source | Commercial licenses | Eliminates risk with open source
Benefits of Microsoft R
Faster And More Scalable
Custom parallelization
PEMA-R API
rxDataStep
rxExec
Data step
Data import – Delimited, fixed, SAS, SPSS, ODBC
Variable creation & transformation
Recode variables
Factor variables
Missing value handling
Sort, merge, split
Aggregate by category (means, sums)
Descriptive statistics
Min/max, mean, median (approx.)
Quantiles (approx.)
Standard deviation
Variance
Correlation
Covariance
Sum of squares (cross-product matrix for set variables)
Pairwise cross tabs
Risk ratio & odds ratio
Cross-tabulation of data (standard tables & long form)
Marginal summaries of cross tabulations
Statistical tests
Chi Square Test
Kendall Rank Correlation
Fisher’s Exact Test
Student’s t-Test
Sampling
Subsample (observations & variables)
Random sampling
Predictive models
Sum of squares (cross-product matrix for set variables)
Multiple linear regression
Generalized linear models (GLM) exponential family distributions: binomial,
Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions: cauchit,
identity, log, logit, probit. User defined distributions & link functions.
Covariance & correlation matrices
Logistic regression
Classification & regression trees
Predictions/scoring for models
Residuals for all models
Simulation
Simulation (e.g., Monte Carlo)
Parallel random number generation
Cluster analysis
K-Means
Classification
Decision trees
Decision forests
Gradient-boosted decision trees
Naïve Bayes
Parallelized, Remote Executing Algorithms
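The parallel, external-memory style behind the list above rests on one idea: summarize each block of rows independently, then merge the summaries, so the full data set never has to sit in memory at once. The following is a minimal Python sketch of that idea for mean and variance. It is not RevoScaleR code; the names `Moments`, `summarize_chunk`, `combine`, and `chunked_moments` are hypothetical, and the merge step uses the standard pairwise update for combining variances.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Moments:
    """Running count, mean, and sum of squared deviations (m2)."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0


def summarize_chunk(chunk: Iterable[float]) -> Moments:
    """Summarize one chunk in a single pass (Welford's update)."""
    s = Moments()
    for x in chunk:
        s.n += 1
        delta = x - s.mean
        s.mean += delta / s.n
        s.m2 += delta * (x - s.mean)
    return s


def combine(a: Moments, b: Moments) -> Moments:
    """Merge two partial summaries without revisiting the raw rows."""
    if a.n == 0:
        return b
    if b.n == 0:
        return a
    n = a.n + b.n
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return Moments(n, mean, m2)


def chunked_moments(chunks: Iterable[Iterable[float]]) -> Moments:
    """Fold per-chunk summaries into one result; chunks could come from disk."""
    total = Moments()
    for chunk in chunks:
        total = combine(total, summarize_chunk(chunk))
    return total


# Illustrative data only: three chunks standing in for blocks read from disk.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
chunks = [data[0:3], data[3:6], data[6:8]]
result = chunked_moments(chunks)
sample_variance = result.m2 / (result.n - 1)
```

Because `combine` is associative, the per-chunk summaries could just as well be computed on separate threads or machines and merged in any order, which is what makes this style of algorithm parallel as well as memory-friendly.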
In-database advanced analytics
Data Scientist: Interacts directly with data
SQL Developer/DBA: Manage data and analytics together
Extensibility
Example solutions
Sales forecasting
Warehouse efficiency
Predictive maintenance
Credit risk protection
Relational data
Analytics library
T-SQL interface
R integration
Built into SQL Server 2016
Real-time operational analytics without moving data
R with in-memory scalability
[Chart: rows vs. minutes; External Access compared with In-Database]
Flexibility & Agility
Write once, deploy anywhere
No model re-writes across platforms
No re-writes from modeling to scoring
Hybrid modeling & scoring
Model on premises, score on premises
Model on premises, score in the cloud
Model in the cloud, score on premises
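The "no re-writes from modeling to scoring" point can be illustrated with a small sketch: the fitted model is serialized once where it is trained, and the same bytes are loaded wherever scoring happens. This is a conceptual Python illustration under stated assumptions, not the R Services mechanism. The functions `fit_line` and `score` are hypothetical, a one-variable least-squares line stands in for a real model, and JSON stands in for whatever serialization the platform uses.

```python
import json
from typing import List


def fit_line(xs: List[float], ys: List[float]) -> dict:
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {"intercept": mean_y - slope * mean_x, "slope": slope}


def score(model: dict, xs: List[float]) -> List[float]:
    """Apply a fitted model to new inputs; no model logic is re-implemented."""
    return [model["intercept"] + model["slope"] * x for x in xs]


# "Modeling" environment (for example, on premises): train, then serialize.
model_blob = json.dumps(fit_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0]))

# "Scoring" environment (for example, the cloud): load the same blob and score.
predictions = score(json.loads(model_blob), [5.0, 6.0])
```

The only thing that crosses the boundary between the two environments is `model_blob`, so the modeling and scoring sides can sit on premises or in the cloud in any combination.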
Prepare, Model, Score: SQL Server
Parallelized Models
Financial Services
Digital Media & Retail
Healthcare & Pharma
Government & Academia
Analytics Service Providers
Manufacturing & High Tech
Some Microsoft R customers
SQL Server 2016 R Services (In-database)
In-DB analytics
Parallel threading and processing
Easy to operationalize
Developers, DBAs and Data Scientists can use their preferred tools
Model on-premises, score in cloud—or vice versa
Easy way to overcome memory limitations, enabling analysis of larger data sets
Included in SQL Server 2016
Reuse and optimization of existing R code
Reduced recoding and training costs