Descriptive Statistics & Parameter Estimation (B03)
吳漢銘 (Han-Ming Wu), Department of Statistics, National Taipei University
http://www.hmwu.idv.tw

  • http://www.hmwu.idv.tw

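    Chapter outline

    Statistics, Data Mining, Data Science
    Data description: central tendency, dispersion, skewness
    Basic statistical charts
    Standardization, correlation coefficient
    Parameter estimation: using sample statistics and their sampling distributions to estimate population parameters and so understand the population's characteristics
    Point estimation (method of moments, maximum likelihood, least squares)
    Criteria for judging estimators: unbiasedness, efficiency, consistency, minimum-variance unbiasedness, sufficiency
    Interval estimation
    Bayesian estimation
    Frequentist parameter estimation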

  • http://www.hmwu.idv.tw

    What is Statistics?

    The Merriam-Webster dictionary defines statistics as "a branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of numerical data."

    Traditional statistics (with a history going back to the 17th century) is divided into two branches:
    Descriptive statistics.
    Inferential statistics: it uses patterns in the sample data to draw inferences (estimation, hypothesis testing) about the population represented, accounting for randomness.

    Fields of statistical research include mathematical statistics, industrial statistics, business statistics, biostatistics, and so on.

    http://www.theusrus.de/blog/some-truth-about-big-data/

  • http://www.hmwu.idv.tw

    Data Mining Diagrams

    Source: "Language Technologies for Geomatics: From Intelligence to Agility", published Nov 26, 2014, http://www.slideshare.net/VisionGEOMATIQUE2014/gagnon-20141112vision

    Source: http://www.simplilearn.com/data-mining-vs-statistics-article

  • http://www.hmwu.idv.tw

    Difference between Machine Learning & Statistical Modeling

    https://www.analyticsvidhya.com/blog/2015/07/difference-machine-learning-statistical-modeling/

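    Further reading:
    Differences between machine learning and statistical models: http://vvar.pixnet.net/blog/post/242048881
    Why do statisticians and machine learning experts approach the same problem so differently? https://read01.com/EBPPK7.html
    In depth: are machine learning and statistics complementary? https://read01.com/ezQ3K.html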

    • Machine Learning is an algorithm that can learn from data without relying on rules-based programming.

    • Statistical Modelling is the formalization of relationships between variables in the form of mathematical equations.

  • http://www.hmwu.idv.tw

    Statistics, Data Mining and Big Data

    Source: http://www.theusrus.de/blog/some-truth-about-big-data/

  • http://www.hmwu.idv.tw

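    Differences between small data and big data

    Survey data: sampled, feedback from the sample, subjective, about outcomes, structured, collected at discrete points

    Monitoring data: full population, monitoring records, objective, about processes, unstructured, continuous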

  • http://www.hmwu.idv.tw

    Data Science

    The Data Science Venn Diagram: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

  • http://www.hmwu.idv.tw

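    Three recommended books

    Is statistics dead? Will it live forever?

    "As long as there are granaries, there will be mice; as long as there are data, methods for handling data will be developed. Whether that is called statistics, or the data mining of computer science, depends on how this generation of statisticians responds to the changing situation."

    趙民德, 1999, "Statistics is dead, long live statistics!" (「統計已死,統計萬歲!」), speech at the 8th Southern Region Statistics Conference

    (March 7, 2016)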

  • http://www.hmwu.idv.tw

    Types of Data Scales

    Qualitative data:
    Categorical, discrete, or nominal: values contain no ordering information, e.g., sex, ethnicity, education level, religion, mode of transport, music genre.
    Ordinal: values indicate order, but no arithmetic operations are meaningful (e.g., "novice", "experienced", and "expert" as designations of programmers participating in an experiment; strongly agree, agree, neutral, disagree, strongly disagree; excellent, good, poor).

    Quantitative data:
    Interval: distances between values are meaningful, but the zero point is not (e.g., degrees Fahrenheit, time).
    Ratio (continuous data): distances are meaningful and the zero point is meaningful (e.g., kelvins, annual income, years of service, height).

    Ordinal methods cannot be used with nominal variables.
    Nominal methods can be used with both nominal and ordinal variables.

  • http://www.hmwu.idv.tw

    Data without boundaries!

    The examples used in statistics teaching are almost all structured data. In the big-data era, 80% of data is unstructured.

    Image source: http://www.36dsj.com/archives/95185

    Image source: http://marketbusinessnews.com/financial-glossary/what-isa-statistician

  • http://www.hmwu.idv.tw

    Data description

    Central tendency: mean (average), mode, median.

    Dispersion: quartiles, range, interquartile range (IQR), percentiles, standard deviation, variance.

    https://zh.wikipedia.org/wiki/四分位距

  • http://www.hmwu.idv.tw

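    Data description: skewness

    http://www.t4tutorials.com/data-skewness-in-data-mining/

    Skewness coefficient:
    greater than 0: right-skewed distribution
    equal to 0: symmetric distribution
    less than 0: left-skewed distribution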

  • http://www.hmwu.idv.tw

    Data description: kurtosis
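The skewness and kurtosis slides give no formulas here, so the following is only a minimal sketch using the plain moment-based definitions. The helper names `skew` and `excess_kurt` and the toy vectors are assumptions for illustration; packaged functions (for example in `timeDate`) may apply different small-sample conventions and give slightly different numbers.

```r
# Moment-based skewness: third central moment over the 1.5 power of the second.
skew <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Moment-based excess kurtosis: fourth central moment over the squared
# second moment, minus 3 (so a normal distribution scores about 0).
excess_kurt <- function(x) {
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2 - 3
}

right_skewed <- c(1, 1, 2, 2, 3, 10)  # hypothetical data with a long right tail
symmetric    <- c(1, 2, 3, 4, 5)      # hypothetical symmetric data

skew(right_skewed)      # positive: right-skewed
skew(-right_skewed)     # negative: mirror image is left-skewed
skew(symmetric)         # 0: symmetric
excess_kurt(symmetric)  # negative: flatter than a normal distribution
```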

  • http://www.hmwu.idv.tw

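    Example: exploring Taiwan's recent salary landscape through tax big data

    "由財稅大數據探討臺灣近年薪資樣貌" (Exploring Taiwan's recent salary landscape through tax big data), Department of Statistics, Ministry of Finance, August 2017 (ROC year 106)
    https://www.mof.gov.tw/File/Attach/75403/File_10649.pdf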

  • http://www.hmwu.idv.tw

    Try it out: the earnings platform (薪情平臺)

    https://earnings.dgbas.gov.tw/

  • http://www.hmwu.idv.tw

    R exercise: weighted arithmetic mean

    Think about it: how should the weights be chosen? Dimension reduction methods (e.g., PCA)
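As a small illustration of the weighted arithmetic mean, the sketch below uses base R's `weighted.mean()` and checks it against the definition. The scores and weights are hypothetical values chosen for the example, not taken from the course data.

```r
# Hypothetical component scores and grading weights (assumed for illustration).
scores  <- c(quiz = 70, midterm = 80, final = 90)
weights <- c(quiz = 0.2, midterm = 0.3, final = 0.5)

# Built-in weighted mean.
wm_builtin <- weighted.mean(scores, weights)

# Same quantity from the definition: sum(w * x) / sum(w).
wm_manual <- sum(scores * weights) / sum(weights)

wm_builtin  # 83
wm_manual   # 83
```

Because the result is divided by `sum(weights)`, the weights do not need to sum to one.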

  • http://www.hmwu.idv.tw

    R exercise: descriptive statistics

    Column names: 座號 (seat number), 學號 (student ID), 性別 (sex: 女 = female, 男 = male), 姓名 (name), 小考1 to 小考4 (quizzes 1 to 4), 助教 (TA score), 期中考 (midterm), 期末考 (final exam), 出席次數 (attendance count).

    > score2015.orig <- ...   # data-loading call not preserved in the source
    > dim(score2015.orig)
    [1] 80 12
    > head(score2015.orig)
      座號      學號 性別   姓名 小考1 小考2 小考3 小考4 助教 期中考 期末考 出席次數
    1    1 920541081   女 高婕嘉     0     0     0    36   35     26     25        6
    2    2 920660451   女 倪儒子    30     0    NA    NA   19     28      0        4
    ...
    6    6 921451012   女 洪銘學    35    13    20    29   55     44     40        8
    > summary(score2015.orig[, 3:ncol(score2015.orig)])
     性別        姓名        小考1           小考2          小考3
     女:60   王彥珮 : 1   Min.   : 0.00   Min.   : 0.0   Min.   :  0.00
     男:20   王淳昀 : 1   1st Qu.:25.25   1st Qu.:10.0   1st Qu.: 20.00
             王銘軒 : 1   Median :40.00   Median :30.0   Median : 40.00
             朱新太 : 1   Mean   :40.00   Mean   :28.9   Mean   : 47.76
             何竣育 : 1   3rd Qu.:50.25   3rd Qu.:40.0   3rd Qu.: 80.00
             余馨繁 : 1   Max.   :90.00   Max.   :80.0   Max.   :100.00
             (Other):74   NA's   :4       NA's   :7      NA's   :13
         小考4             助教            期中考           期末考          出席次數
     Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :1.0
     1st Qu.: 36.00   1st Qu.: 35.00   1st Qu.: 32.00   1st Qu.: 23.75   1st Qu.:7.0
     Median : 67.00   Median : 59.50   Median : 68.50   Median : 50.00   Median :8.0
     Mean   : 56.75   Mean   : 56.24   Mean   : 57.56   Mean   : 46.71   Mean   :7.7
     3rd Qu.: 81.00   3rd Qu.: 75.25   3rd Qu.: 80.25   3rd Qu.: 69.50   3rd Qu.:9.0
     Max.   :100.00   Max.   :100.00   Max.   :100.00   Max.   :100.00   Max.   :9.0
     NA's   :15
    > table(score2015.orig["出席次數"])
     1  2  3  4  5  6  7  8  9
     1  1  2  3  3  7  4 21 38

  • http://www.hmwu.idv.tw

    R exercise: descriptive statistics

    > score2015 <- score2015.orig
    > score2015[is.na(score2015)] <- 0   # missing scores are counted as 0; this reproduces the means below
    > colMeans(score2015[, 5:11])
      小考1   小考2   小考3   小考4    助教  期中考  期末考
    38.0000 26.3750 40.0000 46.1125 56.2375 57.5625 46.7125
    > apply(score2015[, 5:11], 1, mean)   # mean score per student (row)
     [1] 17.4285714 11.0000000 32.1428571 58.8571429 71.5714286 33.7142857 51.1428571
     [8] 16.7142857 67.0000000 85.1428571 31.2857143 65.5714286 19.8571429 88.7142857
    ...
    [78]  3.4285714 19.2857143 23.1428571
    > apply(score2015[, 5:11], 2, sd)     # standard deviation per exam (column)
       小考1    小考2    小考3    小考4     助教   期中考   期末考
    23.29883 22.83478 36.26939 35.13014 27.04391 31.00708 30.71848
    > x <- score2015[, "小考1"]
    > min(x)
    [1] 0
    > max(x)
    [1] 90
    > sum(x)
    [1] 3040
    > mean(x)
    [1] 38
    > mean(x, trim=0.1)   # 10% trimmed mean
    [1] 37.45312
    > median(x)
    [1] 40
    > Mode(x)             # Mode() is not a base R function; it is a helper whose definition is not shown
    [1] 50
    > quantile(x)
      0%  25%  50%  75% 100%
       0   20   40   50   90
    > quantile(x, prob=seq(0, 100, 10)/100)
      0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100%
     0.0  4.5 14.6 27.4 33.6 40.0 45.0 50.0 55.0 68.2 90.0
    > range(x)
    [1]  0 90
    > sd(x)
    [1] 23.29883
    > var(x)
    [1] 542.8354
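Base R has no function that returns the statistical mode (`mode()` reports an object's storage mode), so the `Mode(x)` call above must rely on a helper whose definition is not shown. One possible definition, offered only as an assumption about what such a helper could look like, is sketched below; it returns every value tied for the highest frequency.

```r
# Hypothetical helper: most frequent value(s) of a vector, ignoring NA.
Mode <- function(x) {
  x <- x[!is.na(x)]
  counts <- table(x)                             # frequency of each distinct value
  modes <- names(counts)[counts == max(counts)]  # all values tied for the maximum
  if (is.numeric(x)) as.numeric(modes) else modes
}

Mode(c(50, 40, 50, 30, 50, 40))  # 50
Mode(c(1, 1, 2, 2, 3))           # 1 2 (two modes)
Mode(c("a", "b", "b"))           # "b"
```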

  • http://www.hmwu.idv.tw

    Basic statistical charts

    > par(mfrow=c(1, 3))
    > hist(score2015[, "小考1"], main="Quiz 1 histogram")
    > barplot(table(score2015[, "出席次數"]), main="Attendance bar chart")
    > pie(table(score2015[, "出席次數"]), main="Attendance pie chart")

  • http://www.hmwu.idv.tw

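    Basic statistical charts

    > par(mfrow=c(1, 3))
    > selected.id <- ...          # right-hand side not preserved in the source
    > student <- ...              # right-hand side not preserved in the source
    > row.names(student) <- ...   # right-hand side not preserved in the source
    > # one line per student: exams along the x-axis, scores on the y-axis
    > matplot(t(student), xlab="exam", ylab="score", type="l", xaxt="n", lwd=2,
    +         xlim=c(1, 6.5), main="Scores of 5 students across exams")
    > axis(1, at=1:6, labels=colnames(student))
    > text(rep(6, length(selected.id)), student[,6], pos=4, rownames(student), cex=0.6)
    >
    > plot(score2015[, "期中考"], score2015[, "期末考"], main="Midterm vs final exam",
    +      col=score2015[, "性別"], pch=16)
    > legend(20, 90, title="Sex", legend=c("女", "男"), col=1:2, pch=16)
    >
    > boxplot(score2015[, 5:11], main="Box plots of all scores")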

  • http://www.hmwu.idv.tw

    Relative frequency and cumulative relative frequency

    > h <- hist(...)   # arguments not preserved in the source; the counts agree with the 期末考 histogram below
    > h
    $breaks
     [1]   0  10  20  30  40  50  60  70  80  90 100

    $counts
     [1] 16  2  7  4 13 10  8  6  9  5

    $density
     [1] 0.02000 0.00250 0.00875 0.00500 0.01625 0.01250 0.01000 0.00750 0.01125 0.00625

    $mids
     [1]  5 15 25 35 45 55 65 75 85 95

    ...
    attr(,"class")
    [1] "histogram"
    > hist(score2015[, "期末考"], breaks=seq(0, 100, 20), plot=F)
    $breaks
    [1]   0  20  40  60  80 100

    $counts
    [1] 18 11 23 14 14
    ...
    > par(mfrow=c(2,2))
    > plot(h, xlab="Score", ylab="Frequency", main="Histogram")
    > plot(h$mids, h$counts, type="l", xlab="Score", ylab="Frequency", main="Line chart")
    > plot(h$mids, h$density, type="l", xlab="Score", ylab="Relative frequency", main="Line chart")
    > plot(h$mids, cumsum(h$density), type="l", xlab="Score", ylab="Cumulative relative frequency", main="Line chart")
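One detail worth noting: `h$density` is the relative frequency divided by the bin width, which is why the densities printed above sum to 0.1 for bins that are 10 wide. If proportions that accumulate to 1 are wanted, they can be computed from the counts. The sketch below shows this with a small hypothetical score vector (an assumption for illustration, not the course data).

```r
# Hypothetical scores (assumed for illustration).
x <- c(5, 12, 18, 25, 25, 33, 47, 58, 61, 99)

h <- hist(x, breaks = seq(0, 100, 20), plot = FALSE)

# Relative frequency: share of observations in each bin.
rel_freq <- h$counts / sum(h$counts)

# Cumulative relative frequency: ends at 1.
cum_rel_freq <- cumsum(rel_freq)

# Density is relative frequency per unit of x, so multiplying by the
# bin width recovers the relative frequency.
rel_from_density <- h$density * diff(h$breaks)

rel_freq
cum_rel_freq
```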

  • http://www.hmwu.idv.tw

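    Contingency Tables

    > table(score2015$性別)
    女 男
    60 20
    > mytable <- ...   # right-hand side not preserved in the source (a 性別 by 出席次數 table)
    > mytable
        出席次數
    性別  1  2  3  4  5  6  7  8  9
      女  0  0  2  3  2  6  2 16 29
      男  1  1  0  0  1  1  2  5  9
    > prop.table(mytable)
        出席次數
    性別      1      2      3      4      5      6      7      8      9
      女 0.0000 0.0000 0.0250 0.0375 0.0250 0.0750 0.0250 0.2000 0.3625
      男 0.0125 0.0125 0.0000 0.0000 0.0125 0.0125 0.0250 0.0625 0.1125
    > margin.table(mytable, 1)   # row totals
    性別
    女 男
    60 20
    > margin.table(mytable, 2)   # column totals
    出席次數
     1  2  3  4  5  6  7  8  9
     1  1  2  3  3  7  4 21 38
    > myClass <- ...   # right-hand side not preserved in the source
    > average <- ...   # right-hand side not preserved in the source
    > grade <- ...     # right-hand side not preserved in the source
    > table(score2015$性別, myClass, grade)
    > var.id <- ...    # right-hand side not preserved in the source
    > table(score2015[ , var.id], dnn=var.id)

    See also: Introduction to Contingency Tables in R, https://data-flair.training/blogs/r-contengency-tables/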

  • http://www.hmwu.idv.tw

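    Hmisc package: describe (1/2)

    Hmisc's describe() reports different summaries for different types of variables, following a fixed set of output rules.

    A numeric variable with no more than 10 distinct values is treated as discrete by default. For continuous variables the function lists the quantiles; for a non-binary variable with no more than 20 distinct values it lists the frequency table; and when a variable has more than 20 distinct values it lists the 5 lowest and the 5 highest values.

    > install.packages("Hmisc")
    > library(Hmisc)
    > library(MASS)   # the Insurance data set ships with the MASS package
    > describe(Insurance[, 1:3])
    Insurance[, 1:3]

     3  Variables      64  Observations
    -------------------------------------------------------------------------
    District
          n missing  unique
         64       0       4

    1 (16, 25%), 2 (16, 25%), 3 (16, 25%), 4 (16, 25%)
    -------------------------------------------------------------------------
    Group
          n missing  unique
         64       0       4

    <1l (16, 25%), 1-1.5l (16, 25%), 1.5-2l (16, 25%), >2l (16, 25%)
    -------------------------------------------------------------------------
    Age
          n missing  unique
         64       0       4

    <25 (16, 25%), 25-29 (16, 25%), 30-35 (16, 25%), >35 (16, 25%)
    -------------------------------------------------------------------------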

  • http://www.hmwu.idv.tw

    Hmisc package: describe (2/2)

    > describe(Insurance[, 4:5])
    Insurance[, 4:5]

     2  Variables      64  Observations
    --------------------------------------------------------------------------------
    Holders
          n missing  unique    Info    Mean     .05     .10     .25     .50     .75     .90     .95
         64       0      63       1     365   16.30   24.00   46.75  136.00  327.50  868.90 1639.25

    lowest :    3    7    9   16   18, highest: 1635 1640 1680 2443 3582
    --------------------------------------------------------------------------------
    Claims
          n missing  unique    Info    Mean     .05     .10     .25     .50     .75     .90     .95
         64       0      46       1   49.23    3.15    4.30    9.50   22.00   55.50  101.70  182.35

    lowest :   0   2   3   4   5, highest: 156 187 233 290 400
    --------------------------------------------------------------------------------

  • http://www.hmwu.idv.tw

    fBasics package: basicStats

    fBasics is a package for financial engineering; basicStats() computes basic summary statistics for time series data.

    > install.packages("fBasics")
    > library(fBasics)
    > basicStats(Insurance$Holders)
                X..Insurance.Holders
    nobs                6.400000e+01
    NAs                 0.000000e+00
    Minimum             3.000000e+00
    Maximum             3.582000e+03
    1. Quartile         4.675000e+01
    3. Quartile         3.275000e+02
    Mean                3.649844e+02
    Median              1.360000e+02
    Sum                 2.335900e+04
    SE Mean             7.784632e+01
    LCL Mean            2.094209e+02
    UCL Mean            5.205478e+02
    Variance            3.878432e+05
    Stdev               6.227706e+02
    Skewness            3.127833e+00
    Kurtosis            1.099961e+01

    The sum is 23359, so the data cover about 23 thousand policyholders, with an average of 365 policyholders per combination of District, Group and Age.

    > library(timeDate)
    > skewness(Insurance[, 4:5])
     Holders   Claims
    3.127833 2.877292
    > kurtosis(Insurance[, 4:5])
      Holders    Claims
    10.999610  9.377258

    Skewness coefficient s: s = 0: symmetric distribution; |s| > 1 and s > 0: right-skewed; |s| > 1 and s < 0: left-skewed ...

  • http://www.hmwu.idv.tw

    Standardization and z-scores

    Standardization, z = (x-x.bar)/s, (called z-score): the new variate z will have a mean of zero and a variance of one. (also called centering and scaling.)

    If the variables are measurements along a different scale or if the standard deviations for the variables are different from one another, then one variable might dominate the distance (or some other similar calculation) used in the analysis:

    Standardization is useful for comparing variables expressed in different units.

    In some multivariate contexts, the transformations may be applied to each variable separately.

    Standardization makes no difference to the shape of a distribution.
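The claims above (mean zero, variance one, unchanged shape) can be checked numerically. The following is a minimal sketch using the built-in `faithful` data set; the variable names are chosen for illustration.

```r
x <- faithful$waiting

# z-score by the definition z = (x - x.bar) / s ...
z_manual <- (x - mean(x)) / sd(x)

# ... and with scale(), which centers and scales each column.
z_scale <- as.numeric(scale(x))

mean(z_manual)  # numerically zero
sd(z_manual)    # 1

# Standardization is a linear map with positive slope, so the ordering and
# relative spacing of the observations (the shape) are unchanged.
cor(x, z_manual)  # 1
```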

  • http://www.hmwu.idv.tw

    Old Faithful Geyser Data

    Waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.

    A data frame with 272 observations on 2 variables.

    [,1] eruptions, Eruption time in mins

    [,2] waiting, Waiting time to next eruption (in mins)

    > head(faithful)
    > par(mfrow=c(2, 1))
    > hist(faithful$eruptions)
    > hist(faithful$waiting)
    >
    > hist(scale(faithful$eruptions))
    > hist(scale(faithful$waiting))

  • http://www.hmwu.idv.tw

    Distance and Similarity Measure

    [Figure labels: Data Matrix, Proximity Matrix, Euclidean Distance, Pearson Correlation Coefficient]

  • http://www.hmwu.idv.tw

    Pearson Correlation Coefficient

    dist(x, method = "euclidean", diag = FALSE, upper = FALSE, p = 2)
    method: one of "euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski" distance measure.

    cor(x, y = NULL, use = "everything", method = c("pearson", "kendall", "spearman"))

    > dist(iris[sample(1:15, 3), 1:4])
              13         2
    2  0.1414214
    4  0.2645751 0.3316625
    > cor(iris[, 1:4])
                 Sepal.Length Sepal.Width Petal.Length Petal.Width
    Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
    Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
    Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
    Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000
    > pairs(iris[, 1:4])
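To make the two measures concrete, the sketch below recomputes one Euclidean distance and one Pearson correlation from their textbook definitions and compares them with `dist()` and `cor()`. Rows 13, 2 and 4 of `iris` are fixed explicitly (matching the sampled rows in the output above) so that the result is reproducible; the variable names are illustrative.

```r
# Three iris flowers, four numeric measurements each.
x <- as.matrix(iris[c(13, 2, 4), 1:4])

# Euclidean distance between rows "13" and "2": square root of the sum
# of squared coordinate differences.
d_manual  <- sqrt(sum((x["13", ] - x["2", ])^2))
d_builtin <- as.matrix(dist(x))["13", "2"]

# Pearson correlation: covariance scaled by both standard deviations.
a <- iris$Sepal.Length
b <- iris$Petal.Length
r_manual <- sum((a - mean(a)) * (b - mean(b))) /
  sqrt(sum((a - mean(a))^2) * sum((b - mean(b))^2))
r_builtin <- cor(a, b)

c(d_manual, d_builtin)
c(r_manual, r_builtin)
```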

  • http://www.hmwu.idv.tw

    Dissimilarity/Similarity Measure for Quantitative Data

    Kendall’s tau

    More Similarity Measures (1/4)

    All indices range from -1 to +1

  • http://www.hmwu.idv.tw

    More Similarity Measures (2/4)

    measures the strength of a linear relationship

    measure any monotonic relationship between two variables

    non-monotonic: fail to detect the existence of a relationship

    more robust
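These remarks can be illustrated with `cor()` and its three `method` options. The sketch below uses two made-up relationships (assumptions for illustration): a monotone but non-linear curve, where the rank-based Spearman and Kendall coefficients equal 1 while Pearson does not, and a symmetric U-shape, where all three coefficients are zero even though y is a deterministic function of x.

```r
x <- 1:20
y_monotone <- exp(x / 4)      # increasing, but far from a straight line
y_ushape   <- (x - 10.5)^2    # symmetric U-shape: not monotone

methods <- c("pearson", "spearman", "kendall")

# Named vector with one coefficient per method.
cor_monotone <- sapply(methods, function(m) cor(x, y_monotone, method = m))
cor_ushape   <- sapply(methods, function(m) cor(x, y_ushape, method = m))

cor_monotone  # pearson below 1; spearman and kendall equal to 1
cor_ushape    # all three numerically zero
```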

  • http://www.hmwu.idv.tw

    Similarity Measures for Categorical Data

    2014, A survey of distance/similarity measures for categorical data, 2014 International Joint Conference on Neural Networks (IJCNN), 1907-1914.

  • http://www.hmwu.idv.tw

    Sample Variance-Covariance Matrix

    Correlation Matrix

    eigen-decomposition
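The slide's formulas are not reproduced here, so the following is only a sketch of the standard relationships among the three items, using the numeric columns of `iris`: the sample covariance matrix from centered data, the correlation matrix obtained by rescaling it, and the eigen-decomposition S = V Λ Vᵀ. Variable names are illustrative.

```r
X <- as.matrix(iris[, 1:4])
n <- nrow(X)

# Sample variance-covariance matrix from the centered data matrix.
Xc <- scale(X, center = TRUE, scale = FALSE)
S_manual <- crossprod(Xc) / (n - 1)
S <- cov(X)

# Correlation matrix: rescale S by the standard deviations on both sides.
D_inv <- diag(1 / sqrt(diag(S)))
R_manual <- D_inv %*% S %*% D_inv
R <- cor(X)

# Eigen-decomposition of the symmetric matrix S.
e <- eigen(S)
S_rebuilt <- e$vectors %*% diag(e$values) %*% t(e$vectors)

e$values  # all positive here; they sum to the total variance
```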

  • http://www.hmwu.idv.tw

    The Likelihood Function

  • http://www.hmwu.idv.tw

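    Maximum Likelihood Estimation

    Steps of point estimation:
    1. Draw a representative sample.
    2. Choose a good sample statistic as the estimator.
    3. Compute the value of the estimator (the estimate).
    4. Use the estimate to draw inferences about the population parameter and make decisions.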

  • http://www.hmwu.idv.tw

    MLE of (mu, sigma^2) from a normal population

    The probability density function for a sample of n independent identically distributed (iid) normal random variables (the likelihood) is

    https://en.wikipedia.org/wiki/Maximum_likelihood_estimation
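For reference, the standard form of this likelihood, as it is usually written for an iid normal sample (and as in the Wikipedia article cited above), is sketched below in LaTeX. The notation x_1, ..., x_n for the observations is an assumption, since the slide's own formula is not reproduced here.

```latex
\begin{aligned}
L(\mu, \sigma^2)
  &= f(x_1, \ldots, x_n \mid \mu, \sigma^2)
   = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}
     \exp\!\left( -\frac{(x_i - \mu)^2}{2\sigma^2} \right) \\
  &= \left( 2\pi\sigma^2 \right)^{-n/2}
     \exp\!\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 \right),
\\[6pt]
\log L(\mu, \sigma^2)
  &= -\frac{n}{2} \log\!\left( 2\pi\sigma^2 \right)
     - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 .
\end{aligned}
```

Maximizing the log-likelihood is equivalent to maximizing the likelihood, because the logarithm is strictly increasing.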

  • http://www.hmwu.idv.tw

    MLE of (mu, sigma^2) from a normal population

    The maximum likelihood estimator for θ = (μ, σ²) is θ̂ = (μ̂, σ̂²), where μ̂ = x̄ = (1/n) Σ xᵢ and σ̂² = (1/n) Σ (xᵢ − x̄)².
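As a numerical check of the normal-population MLE, the sketch below maximizes the log-likelihood with `optim()` and compares the result with the closed-form estimates: the sample mean, and the average squared deviation with divisor n (not n - 1). The simulated data, the starting values, and the log-sigma parameterization are all assumptions made for the illustration.

```r
set.seed(1)
x <- rnorm(200, mean = 5, sd = 2)   # simulated sample (assumed for illustration)

# Negative log-likelihood of N(mu, sigma^2); sigma is parameterized on the
# log scale so the optimizer can never propose a negative value.
negloglik <- function(par) {
  mu    <- par[1]
  sigma <- exp(par[2])
  -sum(dnorm(x, mean = mu, sd = sigma, log = TRUE))
}

fit <- optim(c(0, 0), negloglik)    # Nelder-Mead, started at mu = 0, sigma = 1

mu_hat     <- fit$par[1]
sigma2_hat <- exp(fit$par[2])^2

# Closed-form MLEs for comparison.
mu_closed     <- mean(x)
sigma2_closed <- mean((x - mean(x))^2)

c(numerical = mu_hat, closed_form = mu_closed)
c(numerical = sigma2_hat, closed_form = sigma2_closed)
```

Note that `var(x)` divides by n - 1, so it is slightly larger than the maximum likelihood estimate of σ².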

  • http://www.hmwu.idv.tw

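    Interval Estimation

    Interval estimation first obtains a point estimate of the unknown population parameter and then, at a given confidence level, derives a lower and an upper bound. The resulting interval is called a confidence interval, and the confidence level expresses how reliably such intervals contain the population parameter.

    A 95% confidence interval means that if the interval is constructed 100 times from repeated samples, about 95 of those intervals contain the population parameter.

    Interval Estimate of Population Mean

    For a large sample (n > 30) with known population σ, the central limit theorem implies that the sample mean is approximately normally distributed.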
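The usual statement of this result and the interval it leads to can be sketched in LaTeX as follows. The symbols (sample mean, n, and the upper α/2 standard normal quantile) are standard notation assumed here, not copied from the slide.

```latex
\bar{X} \;\overset{\text{approx.}}{\sim}\; N\!\left( \mu, \frac{\sigma^2}{n} \right)
\quad\Longrightarrow\quad
P\!\left( \bar{X} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}}
          \;\le\; \mu \;\le\;
          \bar{X} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \right) \approx 1 - \alpha ,
\qquad
\text{so the } 100(1-\alpha)\% \text{ confidence interval is }
\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}} .
```

For a 95% interval, α = 0.05 and z_{0.025} ≈ 1.96.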

  • http://www.hmwu.idv.tw

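    Example: time the elderly spend watching television

    According to a survey by the Directorate-General of Budget, Accounting and Statistics (行政院主計處), among people aged 15 and over in the Taiwan area, the elderly (65 and over) spend the most time watching television. The 新立 broadcasting company plans to launch television programs for the elderly and therefore wants to know how much time the elderly spend watching television, in order to decide how many programs to produce. The company drew a random sample of 100 elderly people in Taipei City and recorded their viewing hours; the average viewing time was 21.2 hours per week. Suppose that, based on several past surveys, the standard deviation of weekly viewing time is known to be 8 hours. What is the 95% confidence interval for the mean weekly viewing time?

    Conclusion: "The mean time the elderly spend watching television lies between 19.632 and 22.768 hours per week, and the confidence level of this interval is 95%."

    > alpha <- 0.05
    > xbar <- 21.2
    > sigma <- 8
    > n <- 100
    > v <- qnorm(1 - alpha/2) * sigma / sqrt(n)   # margin of error: z_{alpha/2} * sigma / sqrt(n)
    > c(xbar - v, xbar + v)
    [1] 19.63203 22.76797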
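The "about 95 out of 100 intervals" reading of the confidence level can be checked by simulation. The sketch below assumes, purely for illustration, a normal population whose true mean is 21.2 and whose standard deviation is 8, draws many samples of size 100, and counts how often the z-interval covers the true mean.

```r
set.seed(42)

mu    <- 21.2   # true population mean (assumed for the simulation)
sigma <- 8      # known population standard deviation
n     <- 100
alpha <- 0.05

# Half-width of the z-interval; fixed because sigma is known.
half_width <- qnorm(1 - alpha / 2) * sigma / sqrt(n)

# Repeat the sampling experiment and record whether each interval covers mu.
covered <- replicate(10000, {
  xbar <- mean(rnorm(n, mean = mu, sd = sigma))
  (xbar - half_width <= mu) && (mu <= xbar + half_width)
})

mean(covered)   # proportion of intervals containing mu; close to 0.95
```

Any single interval either contains the true mean or does not; the 95% refers to the long-run proportion over repeated samples.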

  • http://www.hmwu.idv.tw

    Bayesian Statistics

  • http://www.hmwu.idv.tw

    Bayesian Statistics

  • http://www.hmwu.idv.tw

    Bayes Estimator for the Mean of a Normal Distribution

  • http://www.hmwu.idv.tw

    Bayes Estimator for the Mean of a Normal Distribution

  • http://www.hmwu.idv.tw

    Bayes Estimator for the Mean of a Normal Distribution
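The derivation on these slides is not reproduced here. As a hedged illustration, the sketch below uses the textbook conjugate setting, which may differ from the slides' setup: observations from N(μ, σ²) with σ known, a N(μ0, τ²) prior on μ, and squared-error loss, under which the Bayes estimator is the posterior mean. The function name `bayes_normal_mean`, the data, and the prior values are all hypothetical.

```r
# Posterior mean and variance of mu for x_i ~ N(mu, sigma^2), sigma known,
# with prior mu ~ N(mu0, tau^2). Precisions (1 / variance) simply add.
bayes_normal_mean <- function(x, sigma, mu0, tau) {
  n <- length(x)
  prec_prior <- 1 / tau^2
  prec_data  <- n / sigma^2
  post_mean <- (prec_prior * mu0 + prec_data * mean(x)) / (prec_prior + prec_data)
  post_var  <- 1 / (prec_prior + prec_data)
  c(mean = post_mean, var = post_var)
}

x <- c(18, 22, 25, 19, 21)   # hypothetical observations, sample mean 21

# Informative prior centered at 15: the estimate is pulled toward 15.
post <- bayes_normal_mean(x, sigma = 8, mu0 = 15, tau = 4)
post

# Very diffuse prior: the estimate is essentially the sample mean.
post_diffuse <- bayes_normal_mean(x, sigma = 8, mu0 = 15, tau = 1e6)
post_diffuse
```

The posterior mean is a precision-weighted average of the prior mean and the sample mean, so it always lies between the two.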
