Oracle Statistical Functions 10gR2

Page 1: Oracle Statistical Functions 10gR2

Copyright © 2007 Oracle Corporation

[Diagram: Oracle 11g DB with Data Warehousing, ETL, OLAP, Data Mining, and Statistics]

Oracle's In-Database Statistical Functions

Charlie Berger
Sr. Director Product Management, Data Mining Technologies
Oracle Corporation
[email protected]

Page 2: Oracle Statistical Functions 10gR2

Synopsis

• Oracle has delivered on a multi-year strategy to transform the database from a data repository into an analytical database by bringing the "analytics" to the data (data mining, text mining, and statistical functions)

• This new “analytical database”, integrated with Oracle Business Intelligence EE, opens new doors for better BI
    • Why did something happen?
    • What corrective actions should be taken?
    • Which factors are influencing your business’s key performance indicators?
    • Which things should I target?
    • What will happen in the future and where should you focus limited resources?

• Overview of SQL statistical capabilities embedded in Oracle Database

• “Repeat what I was shown” hands-on session

Page 3: Oracle Statistical Functions 10gR2

Agenda

• Introduction
• Oracle’s in-Database Statistical Functions
• Several Simple Demonstrations
• Opportunities for Use Cases
• Hands-on Exercises
• User Stories
    • A
    • B
    • C
    • …

Page 4: Oracle Statistical Functions 10gR2

Market Trends: Analytics Provide Competitive Value

• Competing on Analytics, by Tom Davenport
    • “Some companies have built their very businesses on their ability to collect, analyze, and act on data.”
    • “Although numerous organizations are embracing analytics, only a handful have achieved this level of proficiency. But analytics competitors are the leaders in their varied fields - consumer products, finance, retail, and travel and entertainment among them.”

• “Organizations are moving beyond query and reporting” - IDC 2006

• Super Crunchers, by Ian Ayres
    • “In the past, one could get by on intuition and experience. Times have changed. Today, the name of the game is data.” - Steven D. Levitt, author of Freakonomics
    • “Data-mining and statistical analysis have suddenly become cool.... Dissecting marketing, politics, and even sports, stuff this complex and important shouldn't be this much fun to read.” - Wired

Page 5: Oracle Statistical Functions 10gR2

Market Trends: Analytics Save Lives

• Super Crunchers, by Ian Ayres
    • In December 2004, [Berwick] brazenly announced a plan to save 100,000 lives over the next year and a half. The “100,000 Lives Campaign” challenged hospitals to implement six changes in care to prevent avoidable deaths.
    • … He noticed that thousands of ICU patients die each year from infections after a central line catheter is placed in their chests. About half of all intensive care patients have central line catheters, and ICU infections are deadly (carrying mortality rates of up to 20 percent). He then looked to see if there was any statistical evidence of ways to reduce the chance of infection. He found a 2004 article in Critical Care Medicine that showed that systematic hand-washing (combined with a bundle of improved hygienic procedures such as cleaning the patient’s skin with an antiseptic called chlorhexidine) could reduce the risk of infection from central-line catheters by more than 90 percent. Berwick estimated that if all hospitals just implemented this one bundle of procedures, they might be able to save as many as 25,000 lives per year.

• Source: New York Times, August 23, 2007, “Attack of the Super Crunchers: Adventures in Data Mining”, by Melissa Lafsky

Page 6: Oracle Statistical Functions 10gR2

Competitive Advantage of BI & Analytics

[Figure: Competitive Advantage (vertical axis) versus Degree of Intelligence (horizontal axis); levels listed from highest to lowest]

Analytics
• Optimization: What’s the best that can happen?
• Predictive Modeling: What will happen next?
• Forecasting/Extrapolation: What if these trends continue?
• Statistical Analysis: Why is this happening?

Access & Reporting
• Alerts: What actions are needed?
• Query/drill down: Where exactly is the problem?
• Ad hoc reports: How many, how often, where?
• Standard Reports: What happened?

Source: Competing on Analytics, by T. Davenport & J. Harris

Page 7: Oracle Statistical Functions 10gR2

Oracle Data Mining & Statistical Functions

Page 8: Oracle Statistical Functions 10gR2

Definition: Statistics

“There are three kinds of lies: lies, damned lies, and statistics.” 1

1 This well-known saying is part of a phrase attributed to Benjamin Disraeli and popularized in the U.S. by Mark Twain. http://en.wikipedia.org/wiki/Statistics

Page 9: Oracle Statistical Functions 10gR2

Definition: Statistics

Statistics is a mathematical science pertaining to the collection, analysis, interpretation or explanation, and presentation of data. It is applicable to a wide variety of academic disciplines, from the physical and social sciences to the humanities. Statistics are also used for making informed decisions – and misused for other reasons – in all areas of business and government.

http://en.wikipedia.org/wiki/Statistics

Page 10: Oracle Statistical Functions 10gR2

Definitions: Statistics

Statistical methods can be used to summarize or describe a collection of data; this is called descriptive statistics. In addition, patterns in the data may be modeled in a way that accounts for randomness and uncertainty in the observations, and then used to draw inferences about the process or population being studied; this is called inferential statistics. Both descriptive and inferential statistics comprise applied statistics.

http://en.wikipedia.org/wiki/Statistics

Page 11: Oracle Statistical Functions 10gR2

Statistical Concepts

Page 12: Oracle Statistical Functions 10gR2

Statistics & SQL Analytics

• Ranking functions
    • rank, dense_rank, cume_dist, percent_rank, ntile
• Window Aggregate functions (moving and cumulative)
    • avg, sum, min, max, count, variance, stddev, first_value, last_value
• LAG/LEAD functions
    • Direct inter-row reference using offsets
• Reporting Aggregate functions
    • sum, avg, min, max, variance, stddev, count, ratio_to_report
• Statistical Aggregates
    • Correlation, linear regression family, covariance
• Linear regression
    • Fitting of an ordinary-least-squares regression line to a set of number pairs
    • Frequently combined with the COVAR_POP, COVAR_SAMP, and CORR functions
• Descriptive Statistics
    • average, standard deviation, variance, min, max, median (via percentile_cont), mode, group-by & roll-up
    • DBMS_STAT_FUNCS: summarizes numerical columns of a table and returns count, min, max, range, mean, stats_mode, variance, standard deviation, median, quantile values, +/- n sigma values, top/bottom 5 values
• Correlations
    • Pearson’s correlation coefficients, Spearman's and Kendall's (both nonparametric)
• Cross Tabs
    • Enhanced with % statistics: chi squared, phi coefficient, Cramer's V, contingency coefficient, Cohen's kappa
• Hypothesis Testing
    • Student t-test, F-test, Binomial test, Wilcoxon Signed Ranks test, Chi-square, Mann Whitney test, Kolmogorov-Smirnov test, One-way ANOVA
• Distribution Fitting
    • Kolmogorov-Smirnov Test, Anderson-Darling Test, Chi-Squared Test, Normal, Uniform, Weibull, Exponential

Note: Statistics and SQL Analytics are included in Oracle Database Standard Edition
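
To make the function families above concrete, here is a minimal sketch that puts a ranking function, a moving window aggregate, LAG, a reporting aggregate, and the linear regression aggregates in one script. The weekly_sales data is hypothetical and inlined from DUAL so that no tables are needed; it is an illustration, not part of the original deck.

```sql
-- Hypothetical weekly sales figures, inlined so the query needs no tables.
WITH weekly_sales AS (
  SELECT 1 AS week_no, 100 AS amount FROM dual UNION ALL
  SELECT 2, 120 FROM dual UNION ALL
  SELECT 3,  90 FROM dual UNION ALL
  SELECT 4, 150 FROM dual UNION ALL
  SELECT 5, 130 FROM dual
)
SELECT week_no,
       amount,
       -- Ranking function: 1 = largest amount
       RANK() OVER (ORDER BY amount DESC)            AS amount_rank,
       -- Window aggregate: moving average over the current and two prior weeks
       AVG(amount) OVER (ORDER BY week_no
                         ROWS BETWEEN 2 PRECEDING
                                  AND CURRENT ROW)   AS moving_avg_3wk,
       -- LAG: direct reference to the previous row
       amount - LAG(amount) OVER (ORDER BY week_no)  AS change_vs_prior,
       -- Reporting aggregate: each week's share of the total
       RATIO_TO_REPORT(amount) OVER ()               AS share_of_total
FROM   weekly_sales
ORDER  BY week_no;

-- Statistical aggregates: least-squares line of amount on week_no,
-- reported together with the Pearson correlation.
WITH weekly_sales AS (
  SELECT 1 AS week_no, 100 AS amount FROM dual UNION ALL
  SELECT 2, 120 FROM dual UNION ALL
  SELECT 3,  90 FROM dual UNION ALL
  SELECT 4, 150 FROM dual UNION ALL
  SELECT 5, 130 FROM dual
)
SELECT REGR_SLOPE(amount, week_no)     AS slope,
       REGR_INTERCEPT(amount, week_no) AS intercept,
       REGR_R2(amount, week_no)        AS r_squared,
       CORR(amount, week_no)           AS pearson_r
FROM   weekly_sales;
```

In the REGR_* functions the first argument is the dependent variable and the second is the independent variable.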

Page 13: Oracle Statistical Functions 10gR2

Descriptive Statistics

• MEDIAN & MODE
    • Median: takes numeric or datetime values and returns the middle value
    • Mode: returns the most common value

A. SELECT STATS_MODE(EDUCATION) FROM CD_BUYERS;

B. SELECT MEDIAN(ANNUAL_INCOME) FROM CD_BUYERS;

C. SELECT EDUCATION, MEDIAN(ANNUAL_INCOME) FROM CD_BUYERS GROUP BY EDUCATION;

D. SELECT EDUCATION, MEDIAN(ANNUAL_INCOME) FROM CD_BUYERS GROUP BY EDUCATION ORDER BY MEDIAN(ANNUAL_INCOME) ASC;
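
MEDIAN is a special case of PERCENTILE_CONT with a percentile of 0.5, so the same pattern extends to other quantiles. The sketch below uses a small hypothetical sample (buyers_sample, inlined from DUAL) in place of the CD_BUYERS table, whose contents are not shown in the deck.

```sql
-- Hypothetical stand-in for CD_BUYERS, inlined so the query is self-contained.
WITH buyers_sample AS (
  SELECT 'Bach.'   AS education, 52000 AS annual_income FROM dual UNION ALL
  SELECT 'Bach.',   61000 FROM dual UNION ALL
  SELECT 'Bach.',   58000 FROM dual UNION ALL
  SELECT 'HS-grad', 31000 FROM dual UNION ALL
  SELECT 'HS-grad', 35000 FROM dual UNION ALL
  SELECT 'HS-grad', 31000 FROM dual
)
SELECT education,
       MEDIAN(annual_income)                                       AS median_income,
       -- Same result as MEDIAN, written as an inverse distribution function
       PERCENTILE_CONT(0.5)  WITHIN GROUP (ORDER BY annual_income) AS pct50_income,
       -- Lower and upper quartiles
       PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY annual_income) AS q1_income,
       PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY annual_income) AS q3_income,
       -- Most frequent value within each group
       STATS_MODE(annual_income)                                   AS mode_income
FROM   buyers_sample
GROUP  BY education
ORDER  BY median_income;
```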

Page 14: Oracle Statistical Functions 10gR2

DBMS_STAT_FUNCS Package: SUMMARY procedure

• The SUMMARY procedure is used to summarize a numerical column (ADM_PULSE); the summary is returned as a record of type summaryType

    DECLARE
      v_ownername   varchar2(8);
      v_tablename   varchar2(50);
      v_columnname  varchar2(50);
      v_sigma_value number;
      type n_arr1 is varray(5) of number;
      type num_table1 is table of number;
      s1 dbms_stat_funcs.summaryType;
    BEGIN
      v_ownername   := 'cberger';
      v_tablename   := 'LYMPHOMA';
      v_columnname  := 'ADM_PULSE';
      v_sigma_value := 3;
      dbms_stat_funcs.summary(p_ownername   => v_ownername,
                              p_tablename   => v_tablename,
                              p_columnname  => v_columnname,
                              p_sigma_value => v_sigma_value,
                              s             => s1);
    END;
    /

Page 15: Oracle Statistical Functions 10gR2

DBMS_STAT_FUNCS Package: SUMMARY procedure

• The SUMMARY procedure is used to summarize a numerical column (ADM_PULSE); the summary is returned as a record of type summaryType

    set echo off
    connect CBERGER/CBERGER@ora10gr2
    set serveroutput on
    set echo on
    declare
      s DBMS_STAT_FUNCS.SummaryType;
    begin
      DBMS_STAT_FUNCS.SUMMARY('CBERGER','LYMPHOMA','ADM_PULSE',3,s);
      dbms_output.put_line('SUMMARY STATISTICS');
      dbms_output.put_line('Count: '||s.count);
      dbms_output.put_line('Min: '||s.min);
      dbms_output.put_line('Max: '||s.max);
      dbms_output.put_line('Range: '||s.range);
      dbms_output.put_line('Mean: '||round(s.mean));
      dbms_output.put_line('Mode Count: '||s.cmode.count);
      dbms_output.put_line('Mode: '||s.cmode(1));
      dbms_output.put_line('Variance: '||round(s.variance));
      dbms_output.put_line('Stddev: '||round(s.stddev));
      dbms_output.put_line('Quantile 5 '||s.quantile_5);
      dbms_output.put_line('Quantile 25 '||s.quantile_25);
      dbms_output.put_line('Median '||s.median);
      dbms_output.put_line('Quantile 75 '||s.quantile_75);
      dbms_output.put_line('Quantile 95 '||s.quantile_95);
      dbms_output.put_line('Extreme Count: '||s.extreme_values.count);
      dbms_output.put_line('Extremes: '||s.extreme_values(1));
      dbms_output.put_line('Top 3: '||s.top_5_values(1)||','||s.top_5_values(2)||','||s.top_5_values(3));
      dbms_output.put_line('Bottom 3: '||s.bottom_5_values(5)||','||s.bottom_5_values(4)||','||s.bottom_5_values(3));
    end;
    /

Page 16: Oracle Statistical Functions 10gR2

DBMS_STAT_FUNCS Package: SUMMARY procedure

• Running the PL/SQL block prints a subset of the statistics returned by the SUMMARY procedure

Page 17: Oracle Statistical Functions 10gR2

Summary Statistics and Histograms

• Oracle Data Miner (GUI for the Oracle Data Mining Option) provides graphical histograms with summary statistics

Page 18: Oracle Statistical Functions 10gR2

Hypothesis Testing

• Parametric Tests
    • Parametric tests make some assumptions about the data, typically that the data is normally distributed, among other assumptions

• Oracle 10g parametric hypothesis tests include:
    • T-test
    • F-test
    • One-Way ANOVA

Page 19: Oracle Statistical Functions 10gR2

T-Test

• T-tests are used to measure the significance of a difference of means.

• T-tests include the following:
    • One-sample T-test
    • Paired-samples T-test
    • Independent-samples T-test (pooled variances)
    • Independent-samples T-test (unpooled variances)

Page 20: Oracle Statistical Functions 10gR2

Basic Example

• Compare the difference in blood pressure between people who eat meat frequently and those who do not

Page 21: Oracle Statistical Functions 10gR2

One-Sample T-Test

STATS_T_TEST_*

The t-test functions are:
• STATS_T_TEST_ONE: A one-sample t-test
• STATS_T_TEST_PAIRED: A two-sample, paired t-test (also known as a crossed t-test)
• STATS_T_TEST_INDEP: A t-test of two independent groups with the same variance (pooled variances)
• STATS_T_TEST_INDEPU: A t-test of two independent groups with unequal variance (unpooled variances)

http://download-west.oracle.com/docs/cd/B19306_01/server.102/b14200/functions157.htm

Page 22: Oracle Statistical Functions 10gR2

One-Sample T-Test

• Query compares the mean of SURVIVAL_TIME_MO to the assumed value of 35:

    SELECT avg(SURVIVAL_TIME_MO) group_mean,
           stats_t_test_one(SURVIVAL_TIME_MO, 35, 'STATISTIC') t_observed,
           stats_t_test_one(SURVIVAL_TIME_MO, 35) two_sided_p_value
    FROM LYMPHOMA;

• Returns the observed t value and its related two-sided significance

Page 23: Oracle Statistical Functions 10gR2

Paired Samples T-Test

• Query compares the mean of LOGWT for Pig Weights in Week 3 to Week 8, grouped by Diet:

    SELECT substr(diet,1,1) as diet,
           avg(LOGWT3) logwt3_mean,
           avg(LOGWT8) logwt8_mean,
           stats_t_test_paired(LOGWT3, LOGWT8, 'STATISTIC') t_observed,
           stats_t_test_paired(LOGWT3, LOGWT8) two_sided_p_value
    FROM CBERGER.PIGLETS3
    GROUP BY ROLLUP(DIET)
    ORDER BY 5 ASC;

• Returns the observed t value and its related two-sided significance

Page 24: Oracle Statistical Functions 10gR2

Independent Samples T-Test (Pooled Variances)

• Query compares the mean of AMOUNT_SOLD between MEN and WOMEN within CUST_INCOME_LEVEL ranges:

    SELECT substr(cust_income_level,1,22) income_level,
           avg(decode(cust_gender,'M',amount_sold,null)) sold_to_men,
           avg(decode(cust_gender,'F',amount_sold,null)) sold_to_women,
           stats_t_test_indep(cust_gender, amount_sold, 'STATISTIC','F') t_observed,
           stats_t_test_indep(cust_gender, amount_sold) two_sided_p_value
    FROM sh.customers c, sh.sales s
    WHERE c.cust_id=s.cust_id
    GROUP BY rollup(cust_income_level)
    ORDER BY 1;
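
The deck lists STATS_T_TEST_INDEPU (unpooled variances) but does not show it in a query. The sketch below is one way it might be used next to the pooled version; the two_groups data is hypothetical and inlined from DUAL, with group B deliberately more spread out than group A.

```sql
-- Hypothetical two-group sample: group B has a visibly larger spread.
WITH two_groups AS (
  SELECT 'A' AS grp, 10.1 AS measurement FROM dual UNION ALL
  SELECT 'A', 10.4 FROM dual UNION ALL
  SELECT 'A',  9.8 FROM dual UNION ALL
  SELECT 'A', 10.0 FROM dual UNION ALL
  SELECT 'B', 12.5 FROM dual UNION ALL
  SELECT 'B',  8.0 FROM dual UNION ALL
  SELECT 'B', 15.2 FROM dual UNION ALL
  SELECT 'B', 11.1 FROM dual
)
SELECT -- Pooled-variance test (assumes both groups share one variance)
       STATS_T_TEST_INDEP (grp, measurement, 'STATISTIC', 'A') AS t_pooled,
       STATS_T_TEST_INDEP (grp, measurement)                   AS p_pooled,
       -- Unpooled-variance test (no equal-variance assumption)
       STATS_T_TEST_INDEPU(grp, measurement, 'STATISTIC', 'A') AS t_unpooled,
       STATS_T_TEST_INDEPU(grp, measurement, 'DF')             AS df_unpooled,
       STATS_T_TEST_INDEPU(grp, measurement)                   AS p_unpooled
FROM   two_groups;
```

The first argument is the grouping column and the second holds the sample values; the optional fourth argument names the group treated as the group of interest when the statistic is signed.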

Page 25: Oracle Statistical Functions 10gR2

Independent Samples T-Test (Pooled Variances)

Page 26: Oracle Statistical Functions 10gR2

F-Test

• Query compares the variance in the SIZE_TUMOR between MALES and FEMALES:

    SELECT variance(decode(GENDER,'0', SIZE_TUMOR_MM, null)) var_tumor_men,
           variance(decode(GENDER,'1', SIZE_TUMOR_MM, null)) var_tumor_women,
           stats_f_test(GENDER, SIZE_TUMOR_MM, 'STATISTIC', '1') f_statistic,
           stats_f_test(GENDER, SIZE_TUMOR_MM) two_sided_p_value
    FROM CBERGER.LYMPHOMA;

• Returns the observed F value and its two-sided significance

Page 27: Oracle Statistical Functions 10gR2

F-Test

• Query compares the mean SIZE_REDUCTION across TREATMENT_PLANs with a one-way ANOVA F-test, grouped by GENDER:

    SELECT GENDER,
           stats_one_way_anova(TREATMENT_PLAN, SIZE_REDUCTION, 'F_RATIO') f_ratio,
           stats_one_way_anova(TREATMENT_PLAN, SIZE_REDUCTION, 'SIG') p_value,
           AVG(SIZE_REDUCTION)
    FROM CBERGER.LYMPHOMA
    GROUP BY GENDER
    ORDER BY GENDER;

• Returns the observed F ratio, its significance, and the average SIZE_REDUCTION for each GENDER

Page 28: Oracle Statistical Functions 10gR2

One-Way ANOVA

• In statistics, analysis of variance (ANOVA, or sometimes A.N.O.V.A.) is a collection of statistical models, and their associated procedures, in which the observed variance is partitioned into components due to different explanatory variables.

• Example
    • Group A is given vodka, Group B is given gin, and Group C is given a placebo. All groups are then tested with a memory task. A one-way ANOVA can be used to assess the effect of the various treatments (that is, the vodka, gin, and placebo).

http://en.wikipedia.org/wiki/Statistics

Page 29: Oracle Statistical Functions 10gR2

One-Way ANOVA

• Query compares the average SIZE_REDUCTION within different TREATMENT_PLANS, grouped by LYMPH_TYPE:

    SELECT LYMPH_TYPE,
           stats_one_way_anova(TREATMENT_PLAN, SIZE_REDUCTION, 'F_RATIO') f_ratio,
           stats_one_way_anova(TREATMENT_PLAN, SIZE_REDUCTION, 'SIG') p_value
    FROM CBERGER.LYMPHOMA
    GROUP BY LYMPH_TYPE
    ORDER BY 1;

• Returns the one-way ANOVA F ratio and its significance, split by LYMPH_TYPE

Page 30: Oracle Statistical Functions 10gR2

Hypothesis Testing (Nonparametric)

• Nonparametric tests are used when certain assumptions about the data are questionable.

• For example, they can be used to test the difference between samples that are not normally distributed.

• All tests involving ordinal scales (in which data is ranked) are nonparametric.

• Nonparametric tests supported in Oracle Database 10g:
    • Binomial test
    • Wilcoxon Signed Ranks test
    • Mann-Whitney test
    • Kolmogorov-Smirnov test
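
The four nonparametric tests listed above map to the SQL functions STATS_BINOMIAL_TEST, STATS_WSR_TEST, STATS_MW_TEST, and STATS_KS_TEST. The following is a sketch only: the trial data (two groups with before and after scores) is hypothetical and inlined from DUAL, and a sample this small is for showing the call shapes, not for drawing conclusions.

```sql
-- Hypothetical paired scores for two groups, inlined for a self-contained query.
WITH trial AS (
  SELECT 'A' AS grp, 12 AS before_score, 15 AS after_score FROM dual UNION ALL
  SELECT 'A', 14, 18 FROM dual UNION ALL
  SELECT 'A', 11, 13 FROM dual UNION ALL
  SELECT 'A', 16, 21 FROM dual UNION ALL
  SELECT 'B', 13, 14 FROM dual UNION ALL
  SELECT 'B', 15, 12 FROM dual UNION ALL
  SELECT 'B', 10, 16 FROM dual UNION ALL
  SELECT 'B', 17, 19 FROM dual
)
SELECT -- Wilcoxon Signed Ranks: paired before/after comparison
       STATS_WSR_TEST(before_score, after_score, 'STATISTIC')     AS wsr_z,
       STATS_WSR_TEST(before_score, after_score, 'TWO_SIDED_SIG') AS wsr_p,
       -- Mann-Whitney: do the two groups differ in after_score?
       STATS_MW_TEST(grp, after_score, 'STATISTIC')               AS mw_z,
       STATS_MW_TEST(grp, after_score, 'TWO_SIDED_SIG')           AS mw_p,
       -- Kolmogorov-Smirnov: do the two groups share one distribution?
       STATS_KS_TEST(grp, after_score, 'STATISTIC')               AS ks_d,
       STATS_KS_TEST(grp, after_score, 'SIG')                     AS ks_p,
       -- Binomial: is the share of group 'A' rows consistent with 50%?
       STATS_BINOMIAL_TEST(grp, 'A', 0.5, 'TWO_SIDED_PROB')       AS binomial_p
FROM   trial;
```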

Page 31: Oracle Statistical Functions 10gR2

Customer Example

"..Our experience suggests that Oracle 10g Statistics and Data Mining features can reduce development effort of analytical systems by an order of magnitude."

Sumeet Muju
Senior Member of Professional Staff, SRA International
(SRA supports NIH bioinformatics development projects)

Page 32: Oracle Statistical Functions 10gR2

Correlation Functions

• The CORR_S and CORR_K functions support nonparametric or rank correlation (finding correlations between expressions that are ordinal scaled).

• Correlation coefficients take on a value ranging from –1 to 1, where:
    • 1 indicates a perfect relationship
    • –1 indicates a perfect inverse relationship
    • 0 indicates no relationship

• The following query determines whether there is a correlation between the AGE and WEIGHT of people within each TREATMENT_PLAN, using Spearman's correlation:

    select CORR_S(AGE, WEIGHT) coefficient,
           CORR_S(AGE, WEIGHT, 'TWO_SIDED_SIG') p_value,
           substr(TREATMENT_PLAN, 1,15) as TREATMENT_PLAN
    from CBERGER.LYMPHOMA
    GROUP BY TREATMENT_PLAN;
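
CORR_K (Kendall's tau-b) takes the same arguments as CORR_S, so the two rank correlations can be compared side by side with Pearson's CORR. A minimal sketch, using hypothetical age and weight pairs inlined from DUAL rather than the LYMPHOMA table:

```sql
-- Hypothetical age/weight pairs, inlined so the query needs no tables.
WITH people AS (
  SELECT 25 AS age, 68 AS weight FROM dual UNION ALL
  SELECT 31, 74 FROM dual UNION ALL
  SELECT 38, 71 FROM dual UNION ALL
  SELECT 44, 82 FROM dual UNION ALL
  SELECT 52, 79 FROM dual UNION ALL
  SELECT 60, 88 FROM dual
)
SELECT CORR(age, weight)                    AS pearson_r,
       CORR_S(age, weight, 'COEFFICIENT')   AS spearman_rho,
       CORR_S(age, weight, 'TWO_SIDED_SIG') AS spearman_p,
       CORR_K(age, weight, 'COEFFICIENT')   AS kendall_tau_b,
       CORR_K(age, weight, 'TWO_SIDED_SIG') AS kendall_p
FROM   people;
```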

Page 33: Oracle Statistical Functions 10gR2

Cross Tabulations

• This query analyzes the strength of the association between TREATMENT_PLAN and GENDER, grouped by LYMPH_TYPE, using a cross tabulation:

    SELECT LYMPH_TYPE,
           stats_crosstab(GENDER, TREATMENT_PLAN, 'CHISQ_OBS') chi_squared,
           stats_crosstab(GENDER, TREATMENT_PLAN, 'CHISQ_SIG') p_value,
           stats_crosstab(GENDER, TREATMENT_PLAN, 'PHI_COEFFICIENT') phi_coefficient
    FROM CBERGER.LYMPHOMA
    GROUP BY LYMPH_TYPE
    ORDER BY 1;

• Returns the observed chi-squared value, its significance (p_value), and the phi coefficient for each LYMPH_TYPE

Page 34: Oracle Statistical Functions 10gR2

Cross Tabulations

• The STATS_CROSSTAB function takes as arguments two expressions (the two variables being analyzed) and a value that determines which test to perform. These values include the following:
    • CHISQ_OBS (observed value of chi-squared)
    • CHISQ_SIG (significance of observed chi-squared)
    • CHISQ_DF (degree of freedom for chi-squared)
    • PHI_COEFFICIENT (phi coefficient)
    • CRAMERS_V (Cramer’s V statistic)
    • CONT_COEFFICIENT (contingency coefficient)
    • COHENS_K (Cohen’s kappa)

• The function returns the value specified by the third argument (default is CHISQ_SIG)
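
Because each call returns a single value, a query asks for one statistic per column. The sketch below requests several of the return values listed above in one pass; the patients data is hypothetical and inlined from DUAL.

```sql
-- Hypothetical gender/treatment pairs, inlined for a self-contained query.
WITH patients AS (
  SELECT 'M' AS gender, 'Chemo' AS treatment_plan FROM dual UNION ALL
  SELECT 'M', 'Chemo'     FROM dual UNION ALL
  SELECT 'M', 'Radiation' FROM dual UNION ALL
  SELECT 'M', 'Surgery'   FROM dual UNION ALL
  SELECT 'F', 'Chemo'     FROM dual UNION ALL
  SELECT 'F', 'Radiation' FROM dual UNION ALL
  SELECT 'F', 'Radiation' FROM dual UNION ALL
  SELECT 'F', 'Surgery'   FROM dual
)
SELECT STATS_CROSSTAB(gender, treatment_plan, 'CHISQ_OBS')        AS chi_squared,
       STATS_CROSSTAB(gender, treatment_plan, 'CHISQ_DF')         AS degrees_of_freedom,
       -- Omitting the third argument returns CHISQ_SIG, the default
       STATS_CROSSTAB(gender, treatment_plan)                     AS p_value,
       STATS_CROSSTAB(gender, treatment_plan, 'CRAMERS_V')        AS cramers_v,
       STATS_CROSSTAB(gender, treatment_plan, 'CONT_COEFFICIENT') AS contingency_coeff
FROM   patients;
```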

Page 35: Oracle Statistical Functions 10gR2

Distribution-Fitting Functions

• Distribution-fitting functions in Oracle Database 10g include the following:
    • NORMAL_DIST_FIT function
    • UNIFORM_DIST_FIT function
    • POISSON_DIST_FIT function
    • WEIBULL_DIST_FIT function
    • EXPONENTIAL_DIST_FIT function

• These functions test how well a sample of values “fits” a particular distribution

• The IN parameter of each function specifies which of the tests to use to measure the fit
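
These routines live in the DBMS_STAT_FUNCS package and are called from PL/SQL. As a sketch, the block below tests whether ADM_PULSE in the deck's CBERGER.LYMPHOMA table fits a normal distribution. It assumes that table is available and that the argument order is owner, table, column, test type, mean, standard deviation, and an OUT significance value; check the package reference for your release before relying on it.

```sql
SET SERVEROUTPUT ON

DECLARE
  v_mean  NUMBER;
  v_stdev NUMBER;
  v_sig   NUMBER;   -- significance of the fit, returned by the procedure
BEGIN
  -- The fit is tested against a normal distribution with these parameters.
  SELECT AVG(adm_pulse), STDDEV(adm_pulse)
  INTO   v_mean, v_stdev
  FROM   cberger.lymphoma;

  -- Fourth argument selects the goodness-of-fit test (assumed value:
  -- 'KOLMOGOROV_SMIRNOV'; other test types are listed in the package docs).
  DBMS_STAT_FUNCS.NORMAL_DIST_FIT(
    'CBERGER', 'LYMPHOMA', 'ADM_PULSE',
    'KOLMOGOROV_SMIRNOV',
    v_mean, v_stdev, v_sig);

  DBMS_OUTPUT.PUT_LINE('Normal fit significance: ' || v_sig);
END;
/
```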

Page 36: Oracle Statistical Functions 10gR2

Page 37: Oracle Statistical Functions 10gR2

Opportunities for Use Cases

• Control charts
    • Set flags on your data, e.g. when a value is above 3 sigma

Page 38: Oracle Statistical Functions 10gR2

Opportunities for Use Cases

• Construction of a Control Chart
    1. Calculate means and ranges for each “sample”
    2. Chart
    3. Apply out-of-control rules, e.g. outside of 3 sigma
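
The control-chart steps can be sketched in a single query: aggregate each sample, derive a center line and sigma with analytic functions, then flag samples outside 3 sigma. The measurements data is hypothetical and inlined from DUAL. As a simplification, sigma here is the standard deviation of the sample means; a textbook X-bar chart would usually estimate it from the average range instead.

```sql
-- Hypothetical readings: four samples of three readings each.
WITH measurements AS (
  SELECT 1 AS sample_id, 10.2 AS reading FROM dual UNION ALL
  SELECT 1, 10.0 FROM dual UNION ALL
  SELECT 1,  9.9 FROM dual UNION ALL
  SELECT 2, 10.1 FROM dual UNION ALL
  SELECT 2, 10.3 FROM dual UNION ALL
  SELECT 2,  9.8 FROM dual UNION ALL
  SELECT 3, 10.0 FROM dual UNION ALL
  SELECT 3, 10.2 FROM dual UNION ALL
  SELECT 3, 10.1 FROM dual UNION ALL
  SELECT 4, 11.4 FROM dual UNION ALL
  SELECT 4, 11.9 FROM dual UNION ALL
  SELECT 4, 11.6 FROM dual
),
-- Step 1: mean and range for each sample
sample_stats AS (
  SELECT sample_id,
         AVG(reading)                AS sample_mean,
         MAX(reading) - MIN(reading) AS sample_range
  FROM   measurements
  GROUP  BY sample_id
),
-- Center line and sigma across all samples
limits AS (
  SELECT s.sample_id,
         s.sample_mean,
         s.sample_range,
         AVG(s.sample_mean)    OVER () AS center_line,
         STDDEV(s.sample_mean) OVER () AS sigma
  FROM   sample_stats s
)
-- Steps 2 and 3: rows to chart, with an out-of-control flag
SELECT sample_id,
       sample_mean,
       sample_range,
       center_line - 3 * sigma AS lower_control_limit,
       center_line + 3 * sigma AS upper_control_limit,
       CASE
         WHEN ABS(sample_mean - center_line) > 3 * sigma THEN 'OUT OF CONTROL'
         ELSE 'OK'
       END AS status
FROM   limits
ORDER  BY sample_id;
```

In practice the limits are often computed from a baseline period and then applied to new samples, so that a new outlier does not inflate its own sigma.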

Page 41: Oracle Statistical Functions 10gR2

http://www.oracle.com/technology/products/bi/stats_fns/index.html

Page 42: Oracle Statistical Functions 10gR2

In-Database Statistics: Advantages

• Data remains in the database at all times, with appropriate access security control mechanisms and fewer moving parts

• Straightforward inclusion within interesting and arbitrarily complex queries

• Real-world scalability: available for mission-critical applications

[Diagram: Oracle 10g DB with Data Warehousing, ETL, OLAP, Data Mining, and Statistics]

Page 43: Oracle Statistical Functions 10gR2

Industry Analysts

PREDICTIVE ANALYTICS: Extending the Value of Your Data Warehousing Investment, by Wayne W. Eckerson

“…According to our survey, most organizations plan to significantly increase the analytic processing within a data warehouse database in the next three years, particularly for model building and scoring, which show 88% climbs. The amount of data preparation done in databases will only climb 36% in that time, but it will be done by almost two-thirds of all organizations (60%) - double the rate of companies planning to use the database to create or score analytical models.”

“…it’s surprising that about one-third of organizations plan to build analytical models in databases within three years.”

“‘We leverage the data warehouse database when possible,’ says one analytics manager. He says most analysts download a data sample to their desktop and then upload it to the data warehouse once it’s completed. ‘Ultimately, however, everything will run in the data warehouse,’ the manager says.”

http://download.101com.com/pub/tdwi/Files/PA_Report_Q107_F.pdf

Page 44: Oracle Statistical Functions 10gR2

In-Database Analytics Engine vs. External Analytical Engine

In-Database
1. In-Database Analytics Engine
    • Basic Statistics (Free)
    • Data Mining
    • Text Mining
2. Costs (ODM: $20K cpu)
    • Simplified environment
    • Single server
    • Security
3. IT Platform
    • SQL (standard)
    • Java (standard)

External
1. External Analytical Engine
    • Basic Statistics
    • Data Mining
    • Text Mining (separate: SAS EM for Text)
    • Advanced Statistics
2. Costs (SAS EM: $150K/5 users)
    • Duplicates data
    • Annual Renewal Fee (AUF) (~45% each year)
3. IT Platform
    • SAS Code (proprietary)

[Diagram: Oracle 11g DB with Data Warehousing, ETL, OLAP, Data Mining, and Statistics]

Page 46: Oracle Statistical Functions 10gR2

SAS In-Database Processing 3-Year Road Map

• “The goal of the SAS In-Database initiative is … to achieve deeper technical integration with database providers, but … also … blends the best SAS data integration and analytics with the core strengths of databases..”

• …Like all DBMS client applications, the SAS engine often must load and extract data over a network to and from the DBMS. This presents a series of challenges:
    • …Network bottlenecks between SAS and the DBMS constrain access to large volumes of data

The best practice today is to read data into the SAS environment for processing. For highly repeatable processes, this might not be efficient because it takes time to transfer the data and resources are used to temporarily store it in the SAS environment. In some cases, the results of the SAS processing must be transferred back to the DBMS for final storage, which further increases the cost. Addressing this challenge can result in improved resource utilization and enable companies to answer business questions more quickly.

• Oracle Data Mining is available today

Source: SAS In-Database Processing White Paper, October 2007

Page 47: Oracle Statistical Functions 10gR2

SAS In-Database Processing 3-Year Road Map…

“It boils down to this simple equation: Less data movement = faster analytics, and faster analytics = faster delivery of real-time BI throughout an enterprise.”
Source: http://www.teradata.com/t/pdf.aspx?a=83673&b=178909

Use SAS® to get more power out of your database
Move key components of BI, analytics and data integration processes from the server or desktop to inside the database and help shorten your time to intelligence

Page 48: Oracle Statistical Functions 10gR2

IDC Worldwide Business Analytics Software

Oracle

http://www.oracle.com/corporate/analyst/reports/infrastructure/bi_dw/208699e.pdf

Page 49: Oracle Statistical Functions 10gR2

References

1. “Back to Basics” Understanding and Visualising Variation in Data. Pete Ceuppens, Robert Shaw, Zhiping You. AstraZeneca R&D.

2. QuickStart: Oracle Statistics Release 10gR2. Charlie Berger, Oracle Corporation. April 2007.

3. Oracle® Database SQL Reference 10g Release 2 (10.2). Part Number: B14200-02. December 2005.

4. Applied Linear Statistical Models. John Neter, William Wasserman, Michael H. Kutner. IRWIN, 1985.

5. Mathematical Statistics with Applications. Mendenhall, Scheaffer, Wackerly. Duxbury Press, Boston, MA. 1981.

6. Oracle Database Data Warehousing Guide 10g Release 2 (10.2). Part Number: B14223-02. December 2005.

7. Oracle Technology Network: http://www.oracle.com/technology/products/bi/stats_fns/index.html

Source: Oracle 10gR2 Statistics Functions, OLSUG08 Workshop, Henri B. Tuthill, AstraZeneca & Charlie Berger, Oracle

Page 50: Oracle Statistical Functions 10gR2

Hands-on Exercises

• Quick Start Statistics

Page 51: Oracle Statistical Functions 10gR2

More Information:

Contact Information:
Email: [email protected]

Oracle Data Mining 10g
• oracle.com/technology/products/bi/odm/index.html

Oracle Statistical Functions
• http://www.oracle.com/technology/products/bi/stats_fns/index.html

Oracle Business Intelligence Solutions
• oracle.com/bi

Page 52: Oracle Statistical Functions 10gR2

QUESTIONS / ANSWERS

Page 53: Oracle Statistical Functions 10gR2

“This presentation is for informational purposes only and may not be incorporated into a contract or agreement.”