BIG DATA

BIG DATA AND HIGH DIMENSIONAL DATA ANALYSIS

B.L.S. Prakasa Rao

CR RAO ADVANCED INSTITUTE OF MATHEMATICS, STATISTICS AND COMPUTER SCIENCE (AIMSCS)
University of Hyderabad Campus
GACHIBOWLI, HYDERABAD 500046

e-mail:[email protected]

1 / 73 B.L.S. Prakasa Rao High Dimensional Data Analysis

BIG DATA

Statistical Methods have been applied in sciences such as

Biometrics
Bio-pharmaceutical Science
Epidemiology
Physical and Engineering Sciences
Environmental Sciences and Ecology

and in business sciences such as

Business and Economic Statistics
Quality and Productivity
Marketing

and for policy making in

Official Statistics
Health Policy Statistics
Defence and National Security

and in

Social Sciences
Survey Research

Statistics provides methodology for use in these diverse fields and helps to solve problems. An important and unique feature of being a statistician is that you may collaborate in any discipline and you may change fields of application depending on the need and demand. Statistics improves human welfare by its contributions to all fields. It serves society. Some statisticians do research or teach; others apply the methodology or develop software tools for applying the tools of statistics.

The present century is the century of data. We are collecting and processing data of all kinds on scales unimaginable earlier. Examples of such data are internet traffic, financial tick-by-tick data and DNA microarrays, which feed data in large streams into scientific and business databases worldwide.

What is BIG DATA?

BIG DATA is relentless. It is continuously generated on a massive scale. It is generated by online interactions among people, by transactions between people and systems and by sensor-enabled instrumentation.

BIG DATA can be related, linked and integrated to provide highly detailed information. Such detail makes it possible:

For banks to introduce individually tailored services

For health care providers to offer personalized medicine

For public safety departments to anticipate crime in targeted areas

Big Data is creating new businesses and transforming traditional markets. Big Data is a challenge to the statistical community.

BIG DATA is a class of data sets so large that they become difficult to process using the standard methods of data processing. The problems of such data include collection, storage, search, sharing, transfer, visualization and analysis. An important advantage of analysis of BIG DATA is the additional information that can be obtained from a single large set as opposed to separate smaller sets. BIG DATA allows correlations to be found, for instance, to spot business trends.

BIG DATA involves increasing volume, that is, the amount of data; velocity, that is, the speed at which data flows in and out; and variety, that is, the range of data types and sources. It is high in volume, high in velocity and also high in variety. This requires new forms of processing for decision making. BIG DATA can make important contributions to international development as it gives a cost-effective way to improve decision making in areas such as health care, employment, economic productivity, resource management and natural disasters. There are of course concerns, such as privacy, with the collection of such data.

BIG DATA has unique features that are not shared by traditional data sets. It is characterized by massive sample size and high dimensionality. Massive sample sizes allow us to discover hidden patterns associated with small sub-populations. Modeling the intrinsic heterogeneity of BIG DATA requires sophisticated methods. High dimensionality and BIG DATA have special features such as noise accumulation, spurious correlation and incidental endogeneity.

BIG DATA is a combination of many data sources coming from different sub-populations. Each sub-population may exhibit features which are not shared by others. A mixture model may be appropriate to model BIG DATA. Analyzing such data requires simultaneous estimation or testing of several parameters. Errors associated with estimation of these parameters accumulate when a decision rule or prediction rule depends on these parameters. Such a noise accumulation effect is severe in high dimensions and might even dominate the true signal. This is handled by a sparsity assumption in high-dimensional data.

High dimensionality brings in spurious correlation due to the fact that many uncorrelated random variables may have a high sample correlation coefficient in high dimensions. Spurious correlation in turn leads to wrong inferences. The sheer size of BIG DATA leads to incidental endogeneity. It is the existence of correlation between variables "unintentionally" as well as due to "high-dimensionality". It is like finding two persons in a large group of people who look alike but have no genetic relationship, or meeting an acquaintance by chance in a big city. Endogeneity occurs as a result of selection bias, measurement errors and omitted variables.
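The spurious correlation effect described above can be seen in a small simulation. The sketch below is only an illustration under assumed settings (N = 20 observations, independent standard normal variables, and the hypothetical helper names `pearson` and `max_spurious_correlation`); every true correlation is zero, yet the largest sample correlation found can only grow as more variables are scanned.

```python
import math
import random


def pearson(x, y):
    """Sample correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def max_spurious_correlation(response, predictors):
    """Largest absolute sample correlation between the response and any predictor."""
    return max(abs(pearson(response, column)) for column in predictors)


rng = random.Random(0)  # fixed seed so the run is repeatable
N, P_LARGE, P_SMALL = 20, 2000, 5  # assumed sizes, chosen for illustration

# Everything is independent N(0, 1): the true correlations are all zero.
response = [rng.gauss(0, 1) for _ in range(N)]
predictors = [[rng.gauss(0, 1) for _ in range(N)] for _ in range(P_LARGE)]

# The small set is a subset of the large one, so the maximum cannot decrease.
max_small = max_spurious_correlation(response, predictors[:P_SMALL])
max_large = max_spurious_correlation(response, predictors)

print(f"p = {P_SMALL}: max |r| = {max_small:.2f}")
print(f"p = {P_LARGE}: max |r| = {max_large:.2f}")
```

A variable picked because it has the largest sample correlation would look informative here even though it is pure noise, which is the kind of wrong inference the slide warns about.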

High Dimensional Data

In traditional statistical data analysis, we think of observations of instances of a particular phenomenon. Each observation is a vector of values measured on several variables such as blood pressure, weight, height, etc. In traditional statistical methodology, we assumed many observations and a few selected variables which are well chosen to explain the phenomenon.

The trend today is towards more observations but, even more so, towards a very large number of variables: the automatic, systematic collection of a large amount of detailed information about each observation. The observations could be curves, images or movies, so that a single observation has dimension in the thousands, millions or billions, while only tens or hundreds of observations are available for study. This is HIGH DIMENSIONAL DATA.

Classical methods are not designed to cope with this kind of growth of dimensionality of the observation vector.

Traditionally, data analysis is a part of the subject of statistics with its basis in probability theory, decision theory and analysis. New sources of data, such as satellites, automatically generate huge volumes of data whose summarization calls for a wide variety of data processing and analysis tools. For such data, traditional ideas of mathematical statistics such as hypothesis testing and confidence intervals do not help.

RECENT DATA TRENDS

Over the last twenty years, data, data management and data processing have become important factors. Large investments have been made in gathering various data and in data processing mechanisms. The information technology industry is the fastest growing and most lucrative segment of the world economy and much of the growth occurs in the development, management and warehousing of streams of data for scientific, medical, engineering and commercial purposes. Examples of such data are the following:

Biotech data: Everybody is aware of the progress made in getting data about the human genome. The genome is indirectly related to protein function and the protein function is indirectly related to the cell function. At each step, more and more massive data are compiled.

Financial data: Over the last twenty years, high-frequency financial data has become available.

Satellite imagery: Providers of satellite imagery have vast databases of such images available. National remote sensing data is one example; applications of such imagery include natural resource discovery and agriculture.

Consumer financial data: Every transaction we make on the web, whether a visit, a search or a purchase, is being recorded, correlated, compiled in databases and sold and resold as advertisers scramble to correlate consumer action with pockets of demand for various goods and services.

How should such data be stored? In a data matrix.

DATA MATRIX

DATA MATRIX: A data matrix is a rectangular array with N rows and p columns, the rows giving different observations or individuals and the columns giving different attributes or variables.

In a classical framework, N may be 25000 and p = 100.

Let us see some recent examples indicating some applications of N × p data matrices.

(i) Web term-document data: Document retrieval by web searching is an important activity. One approach to document retrieval is the vector space model of information retrieval. Here one compiles term-document matrices, N × p arrays, where N, the number of documents, may be in the millions and p, the number of terms (words), is in the tens of thousands; each entry in the array measures the frequency of occurrence of the given term in the given document in a suitable normalization. Each search request can be looked at as a vector of term frequencies.
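A toy version of the vector space model may make the N × p term-document matrix concrete. In the sketch below the three documents, the whitespace tokenizer, the normalization (term count divided by document length) and the cosine similarity used for ranking are all assumptions made for illustration; the slide does not fix any of them.

```python
import math
from collections import Counter


def term_document_matrix(documents):
    """Return (vocabulary, matrix): one row per document, one column per term.

    Each entry is the term count divided by the document length, which is
    one possible normalization (an assumption, not prescribed by the source).
    """
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] / len(tokens) for term in vocabulary])
    return vocabulary, matrix


def query_vector(query, vocabulary):
    """Represent a search request as a vector of term counts over the vocabulary."""
    counts = Counter(query.lower().split())
    return [float(counts[term]) for term in vocabulary]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical miniature corpus: N = 3 documents.
documents = [
    "big data analysis",
    "high dimensional data",
    "satellite imagery archive",
]
vocabulary, matrix = term_document_matrix(documents)
query = query_vector("data analysis", vocabulary)
scores = [cosine(query, row) for row in matrix]
ranking = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
print(ranking, [round(s, 3) for s in scores])
```

Here the first document shares two terms with the request, the second shares one and the third shares none, so the ranking follows that order.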

(ii) Sensor array data: Consider a problem in neuroscience. An array of p sensors may be attached to the scalp, with each sensor recording N observations over a period of seconds at a rate of ten thousand samples per second. One hopes to use such data to study the response of the human neuronal system to external stimuli.

(iii) Gene expression data: Here we obtain data on the relative abundance of p genes in each of N different cell lines. The goal is to determine which genes are associated with various diseases or other states associated with cell lines.

(iv) Consumer preference data: These are used to gather information about the browsing and shopping behaviour of customers and to use this to modify the presentation of information to users. Examples include recommendation systems like that of Netflix. Each consumer is asked to rate 100 films; based on that rating, one consumer is compared to other customers with similar preferences, and predictions are made of other movies which might be of interest to that consumer based on the experiences of other customers who viewed or rated those movies. Here we have a rectangular array giving responses of N individuals on p movies, with N possibly running into millions and p in the hundreds or thousands.

(v) Consumer financial history: An example is credit card transaction records on N = 250,000 consumers, where several dozen variables are available on each consumer. Here p may be in the hundreds.

IMPACT ON COMPUTING INFRASTRUCTURE

The massive sample size of BIG DATA challenges the traditional computing infrastructure. BIG DATA is highly dynamic and impossible to store in a central database. The basic approach to such data is to divide and analyze. The idea is to partition a large problem into more tractable subproblems. Each subproblem is attacked in parallel by different processing units. Results from the subproblems are combined to obtain the final result. When a large number of computers are connected to process large computing tasks, it is possible that some of them fail. Given the size of the computing task, we may want to distribute the work evenly between computers for a balanced work output. Designing very large scale, highly adaptive and fault-tolerant computing systems is a difficult problem.
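The divide-and-analyze idea can be sketched on a single machine. In the example below, which is an assumption-laden toy rather than a distributed system, the data are split into nearly equal chunks, each chunk is reduced to a small summary (count, sum, sum of squares) by a worker thread standing in for a separate processing unit, and the summaries are combined into the overall mean and variance. The function names `partition`, `summarize` and `combine` are hypothetical, and fault tolerance is not addressed.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, n_parts):
    """Split data into n_parts contiguous chunks whose sizes differ by at most one."""
    size, extra = divmod(len(data), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def summarize(chunk):
    """Work done on one processing unit: count, sum and sum of squares."""
    return len(chunk), sum(chunk), sum(x * x for x in chunk)


def combine(summaries):
    """Merge the per-chunk summaries into the overall mean and population variance."""
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    total_sq = sum(s[2] for s in summaries)
    mean = total / n
    variance = total_sq / n - mean ** 2
    return mean, variance


data = [float(i % 17) for i in range(10_000)]  # made-up data for the sketch
chunks = partition(data, n_parts=8)

# Threads stand in for separate machines in this sketch.
with ThreadPoolExecutor(max_workers=4) as pool:
    summaries = list(pool.map(summarize, chunks))

mean, variance = combine(summaries)
print(mean, variance)
```

Only the three numbers per chunk travel to the combining step, so the raw data never need to sit in one place; statistics that cannot be written as such sums need a different combining rule.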

IMPACT ON COMPUTING METHODS

Analysis of BIG DATA leads to problems of large scale optimization. Optimization with a large number of variables is not only expensive but also suffers from slow numerical rates of convergence. Methods for implementation of large-scale non-smooth optimization procedures have to be investigated.

It is often computationally infeasible to make inferences directly from the raw data. To handle BIG DATA from both the statistical and the computational points of view, dimension reduction is an important step before the processing of BIG DATA begins.

HIGH DIMENSIONAL DATA ANALYSIS

(A) Classification: In classification, one of the p variables is an indicator of class membership. For instance, in a consumer financial database, most of the variables measure consumer payment history and one of the variables indicates whether the consumer has declared bankruptcy. The analyst would like to predict bankruptcy from credit history. Many approaches have been suggested for classification, ranging from identifying hyperplanes which partition the sample space into non-overlapping groups to k-nearest neighbour classification.
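As a minimal illustration of the k-nearest neighbour approach mentioned above, the sketch below classifies a new consumer by majority vote among the k closest training points. The two features, the six training records and the choice k = 3 are invented for the example and carry no empirical meaning.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbours."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (late payments, credit utilisation) -> bankruptcy flag.
train_points = [(0, 0.1), (1, 0.2), (0, 0.3), (6, 0.9), (7, 0.8), (5, 0.95)]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, (1, 0.15)))
print(knn_predict(train_points, train_labels, (6, 0.85)))
```

In practice the variables would usually be rescaled first, since Euclidean distance is dominated by whichever variable has the largest range.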

(B) Regression: In the regression setting, one of the p variables is a quantitative response variable. An example, in a financial database, is the variability of exchange rates today given recent exchange rates.

Tools for regression analysis:

(i) Linear regression modeling:

$$X_{i1} = a_0 + a_2 X_{i2} + \cdots + a_p X_{ip} + Z_i$$

Here $X_{i1}$ is the response and $X_{i2}, \ldots, X_{ip}$ are the predictors.
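One common way to estimate the coefficients of such a linear model is ordinary least squares. The sketch below is an illustration only: it assumes more observations than coefficients, solves the normal equations with a small hand-written Gaussian elimination (the helper names are hypothetical), and uses noise-free made-up data so that the fitted coefficients can be checked against the ones used to generate the response.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for j in range(col, n + 1):
                aug[row][j] -= factor * aug[col][j]
    x = [0.0] * n
    for row in reversed(range(n)):
        tail = sum(aug[row][j] * x[j] for j in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def least_squares(predictors, response):
    """Fit response ~ intercept + linear combination of predictors (normal equations)."""
    design = [[1.0] + list(row) for row in predictors]  # prepend intercept column
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(design, response)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Noise-free toy data generated from X_i1 = 1 + 2 X_i2 - 3 X_i3 (so Z_i = 0).
predictors = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 3)]
response = [1 + 2 * x2 - 3 * x3 for x2, x3 in predictors]

coefficients = least_squares(predictors, response)
print([round(c, 6) for c in coefficients])
```

With real, noisy data the recovered coefficients are estimates rather than exact values, and the normal equations become ill-conditioned or unsolvable as the number of predictors approaches the number of observations.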

(ii) Nonlinear regression modeling:

$$X_{i1} = f(X_{i2}, \ldots, X_{ip}) + Z_i.$$

(iii) Latent variable analysis:

In latent variable modeling, it is proposed that

X = AS

where X is a vector-valued observable, S is a vector of unobserved latent variables and A is a linear transformation converting one into the other. It is hoped that a few underlying latent variables are responsible for essentially all the structure we see in the array X and that, by uncovering those variables, we get important insights.

Principal Component Analysis (PCA) is an example. Here one takes the covariance matrix C of the observables X, obtains the eigenvectors, which will be orthogonal, places them as columns in an orthogonal matrix U and defines

$$S = U'X.$$

Here we have the latent variable form with A = U. This technique is widely used for data analysis in sciences, engineering and commercial applications. The mathematical reason for this approach is that the projection on the space spanned by the first k eigenvectors of C gives the best rank k approximation to the vector X in a mean square sense.
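The rank k approximation property can be checked on a toy case with k = 1. The sketch below computes the sample covariance matrix, finds only its leading eigenvector by power iteration (a simple stand-in for a full eigendecomposition, chosen here to avoid external libraries), and reconstructs the centered data from the single latent score. The data are made up so that they lie exactly on a line, in which case the rank 1 reconstruction should be essentially exact.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix C of observations given as rows, plus centered data."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return cov, centered


def leading_eigenvector(matrix, iterations=200):
    """Power iteration: repeatedly apply the matrix and renormalize."""
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Toy observations lying exactly on a line with direction (1, 2, 2).
rows = [(t, 2 * t, 2 * t) for t in (-2.0, -1.0, 0.0, 1.0, 2.0)]
cov, centered = covariance_matrix(rows)
u1 = leading_eigenvector(cov)

# Latent score s = u1' x for each centered observation, then rank-1 reconstruction.
scores = [sum(u * x for u, x in zip(u1, row)) for row in centered]
reconstructed = [[s * u for u in u1] for s in scores]
error = max(abs(a - b) for r, c in zip(reconstructed, centered) for a, b in zip(r, c))
print(u1, error)
```

For data that only approximately follow a low-dimensional pattern the reconstruction error is no longer zero, and the remaining eigenvectors would be obtained in the same spirit, for example by deflation or with a linear algebra library.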

(C) Clustering: Here one seeks to arrange an unordered collection of objects in a fashion so that nearby objects are similar.

COMMENTS

Huge resources are now invested on a global scale in the collection of massive databases. In spite of the large variety of phenomena that data can measure, from astronomical to financial, there are some standard forms that data often take and some standard questions we can ask of these data.

There is a large group of people working to answer practical problems based on real data. They do not use heavy mathematical machinery but use computer simulation.

For such data analysts:

(i) there is no formal definition of the phenomenon in terms of a carefully studied mathematical model;

(ii) there is no formal derivation of the proposed method suggesting that it is in some sense naturally associated with the phenomenon to be treated; and

(iii) there is no formal analysis justifying the apparent improvement in simulation.

This is far from the tradition of mathematicians and of statisticians.

HIGH DIMENSIONALITY

High Dimensionality: We are now in the era of massive automatic data collection, systematically obtaining many measurements, not knowing which ones will be relevant to the phenomenon of interest. Our aim is to find the relevant variables from a large number of them.

This is different from the past where it was assumed that one was dealing with a few well chosen variables, for example, using scientific knowledge to measure just the right variables in advance. It has become much cheaper to gather data than to worry much about what data to gather.

The basic methodology which was used in classical statistical inference is no longer applicable. The classical theory was based on the assumption that p < N and N tends to ∞. Many of the results assume that the observations are multivariate normal. All the results fail if p > N. In the high dimensional case, we might even have p tends to ∞ with N fixed.
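One concrete way to see why classical methods break down when p > N is that the p × p matrix X'X, which least squares and many multivariate procedures need to invert, has rank at most N and is therefore singular. The sketch below checks this numerically on simulated data; the sizes N = 4 and p = 10, the tolerance, and the helper names `matrix_rank` and `gram_matrix` are assumptions made for the illustration.

```python
import random


def matrix_rank(matrix, tol=1e-9):
    """Numerical rank by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue  # no usable pivot in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            for j in range(col, n_cols):
                m[r][j] -= factor * m[rank][j]
        rank += 1
    return rank


def gram_matrix(x):
    """Return X'X for a data matrix X stored as a list of rows."""
    p = len(x[0])
    return [[sum(row[i] * row[j] for row in x) for j in range(p)] for i in range(p)]


rng = random.Random(1)
N, p = 4, 10  # fewer observations than variables
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(N)]

rank = matrix_rank(gram_matrix(X))
print(f"X'X is {p} x {p} but has rank {rank}")
```

Because the rank cannot exceed N, the usual least squares solution is not unique in this regime, which is one reason extra structure such as sparsity is assumed.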

For many types of event we can think of, we have potentially a very large number of measurables quantifying that event but relatively few instances of that event.

Example: Many genes and relatively few patients with a given genetic disease.

(A) Curse of dimensionality: The phrase "curse of dimensionality" was introduced by Richard Bellman, who developed the method of dynamic programming for the study of some optimization problems. In optimization: if we must minimize a function f of d variables and we know that it is Lipschitz, that is,

$$|f(x) - f(y)| \le C\,\|x - y\|, \qquad x, y \in \mathbb{R}^d,$$

then we need on the order of $(1/\varepsilon)^d$ evaluations on a grid in order to approximate the minimizer within error ε.
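To get a feel for how quickly the grid count grows, the short sketch below tabulates the order-of-magnitude figure for ε = 0.1 over several dimensions. It assumes, for illustration, that the search is over the unit cube with a regular grid of spacing ε; the function name `grid_evaluations` is hypothetical and constants are ignored.

```python
def grid_evaluations(eps, d):
    """Order-of-magnitude count (1/eps)^d of grid points in the unit cube [0, 1]^d."""
    points_per_axis = round(1 / eps)
    return points_per_axis ** d


for d in (1, 2, 5, 10, 20):
    print(f"d = {d:2d}: about {grid_evaluations(0.1, d):.0e} evaluations")
```

Each extra variable multiplies the work by a factor of 1/ε, so a search that is trivial in two dimensions becomes infeasible well before d reaches the sizes met in high dimensional data.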

(B) Blessings of dimensionality: There are practical difficulties caused by the increase in dimensionality but there are theoretical benefits due to probability theory. The regularity of having many "identical" dimensions over which one can "average" is a fundamental tool.
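A simple instance of this averaging effect is that the length of a d-dimensional standard normal vector, divided by the square root of d, settles close to 1 as d grows, because the squared length is an average of d identically distributed terms. The simulation below is a sketch under assumed settings (dimensions 2 and 1000, 200 draws each, fixed seed); the function name `scaled_norms` is hypothetical.

```python
import math
import random
import statistics


def scaled_norms(d, n_samples, rng):
    """Euclidean norms of d-dimensional standard normal vectors, divided by sqrt(d)."""
    norms = []
    for _ in range(n_samples):
        squared = sum(rng.gauss(0, 1) ** 2 for _ in range(d))
        norms.append(math.sqrt(squared / d))
    return norms


rng = random.Random(0)  # fixed seed so the run is repeatable
spread = {}
for d in (2, 1000):
    values = scaled_norms(d, n_samples=200, rng=rng)
    spread[d] = statistics.stdev(values)
    print(f"d = {d:4d}: mean = {statistics.mean(values):.3f}, spread = {spread[d]:.3f}")
```

The spread of the scaled length shrinks as the dimension increases, so in high dimensions such quantities behave almost deterministically, which is what makes dimension asymptotics tractable.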

(i) Dimension asymptotics: One use is that we can obtain results on the phenomenon by letting the dimension go to infinity.

(ii) Approach to continuum: Many times we have high dimensional data because the underlying objects are really continuous space or continuous phenomena; there is an underlying curve or image that we are sampling, such as in functional data analysis or image processing. Since the measured curves are continuous, there is an underlying compactness to the space of observed data which will be reflected by an approximate finite-dimensionality and an increasing simplicity of analysis for large d.

IMPACT ON DATA ANALYSIS

"Curse of dimensionality" in nonparametric estimation: Suppose we have a data set with d variables and the first one is dependent on the others through a model of the form

$$X_{i1} = f(X_{i2}, \ldots, X_{id}) + \varepsilon_i. \qquad (1)$$

Suppose that f is unknown and we are not willing to specify a specific model for f such as a linear model. Instead, we are willing to assume that f is a Lipschitz function of these variables and that the $\varepsilon_i$ are i.i.d. N(0,1) variables.

How does the accuracy of the estimate of f depend on N, the number of observations in our data set?


IMPACT ON DATA ANALYSIS

Let Θ be the class of functions f which are Lipschitz on $[0,1]^d$. It can be shown that, for any estimator $\hat f$ of f based on N observations,

$$\sup_{f \in \Theta} E\,[\hat f(x) - f(x)]^2 \ge C N^{-2/(2+d)}$$

(cf. Ibragimov and Khasminskii (1981)).


IMPACT ON DATA ANALYSIS

This lower bound is non-asymptotic. How much data do we need in order to obtain an estimate of f accurate to within ε = .1? The result above answers this question: we need a very large sample, and the required sample size increases as the dimension d increases. This is the curse of dimensionality.
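
The lower bound can be turned into a rough sample-size calculation. The sketch below is illustrative only and rests on two assumptions that are not in the slides: "accurate to within ε" is read as mean squared error at most ε², and the unknown constant C is set to 1. Under these assumptions the bound gives N ≥ (C/ε²)^((2+d)/2). The function name `min_sample_size` is hypothetical.

```python
def min_sample_size(d, eps=0.1, C=1.0):
    """Smallest N allowed by the lower bound C * N**(-2/(2+d)) <= eps**2.

    Assumptions (not from the slides): accuracy eps means MSE <= eps**2,
    and the constant C is taken to be 1 purely for illustration.
    """
    return (C / eps ** 2) ** ((2 + d) / 2)


# The required sample size grows very quickly with the dimension d.
for d in (1, 2, 5, 10, 20):
    print(f"d = {d:2d}: N must be at least about {min_sample_size(d):.3g}")
```

With a different constant C the numbers change, but the exponent (2+d)/2 still makes the requirement grow geometrically in d whenever C/ε² exceeds 1.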


IMPACT ON DATA ANALYSIS

Suppose we have a linear regression problem where there is a dependent variable $X_{i1}$ which we want to model as a linear function of $X_{i2}, \dots, X_{id}$ as in (1). However, d is very large, and let us suppose that we are in a situation where there are thought to be only a few relevant variables but we do not know which are relevant. If we leave out too many relevant variables in the model, we might get very poor performance. For this reason, statisticians have for a long time considered model selection by searching among subsets of the possible explanatory variables, trying to find just a few variables among the many which will adequately explain the dependent variable.


IMPACT ON DATA ANALYSIS

Since the 1970s, one approach to model selection has been to optimize a complexity-penalized criterion over subset models: minimize

$$\mathrm{RSS}(\text{Model}) + \lambda \cdot (\text{Model complexity}),$$

where RSS is the sum of squares of the residuals $X_{i1} - \text{Model}_{i1}$ and model complexity is the number of variables among $X_{i2}, \dots, X_{id}$ used in forming the model.


IMPACT ON DATA ANALYSIS

One choice is λ = 2σ², where σ² is the assumed variance of the noise in (1). The idea is to impose a cost on large, complex models. More recently, another choice suggested for λ is λ = 2σ² log d (Johnstone (1998)). With a logarithmic penalty function, one can mine the data to one's taste while controlling the risk of finding spurious structure.
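
A minimal sketch of complexity-penalized subset selection, comparing the two choices of λ on simulated data. Everything specific here is an assumption for illustration: the data-generating model, the helper names `rss` and `best_subset`, the exhaustive search (feasible only for a handful of candidate variables), and the use of the number of candidate regressors in place of d inside the logarithm.

```python
import itertools

import numpy as np


def rss(X, y, cols):
    """Residual sum of squares of the least-squares fit of y on the chosen columns."""
    if len(cols) == 0:
        return float(y @ y)
    coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
    resid = y - X[:, cols] @ coef
    return float(resid @ resid)


def best_subset(X, y, lam):
    """Exhaustive search minimising RSS(model) + lam * (number of variables used)."""
    n_candidates = X.shape[1]
    best_cols, best_score = (), rss(X, y, [])
    for k in range(1, n_candidates + 1):
        for cols in itertools.combinations(range(n_candidates), k):
            score = rss(X, y, list(cols)) + lam * k
            if score < best_score:
                best_cols, best_score = cols, score
    return best_cols


# Simulated example (hypothetical): only columns 0 and 3 are relevant.
rng = np.random.default_rng(0)
N, n_candidates, sigma = 200, 8, 1.0
X = rng.standard_normal((N, n_candidates))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + sigma * rng.standard_normal(N)

subset_2s2 = best_subset(X, y, lam=2 * sigma ** 2)
subset_log = best_subset(X, y, lam=2 * sigma ** 2 * np.log(n_candidates))
print("lambda = 2*sigma^2       ->", subset_2s2)
print("lambda = 2*sigma^2*log d ->", subset_log)
```

The larger logarithmic penalty can never select more variables than the smaller one, which is how it limits spurious inclusions. The exhaustive search costs 2^d model fits, so it is a teaching device rather than a method for large d.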


IMPACT ON DATA ANALYSIS

Asymptotics of principal components: This is another instance of the "blessing of dimensionality": results for higher dimensions are easier to derive than for moderate dimensions. Suppose

$$X^{(i)} = (X_{i1}, \dots, X_{id}) \sim N(0, \Gamma).$$

We are interested in knowing whether Γ = I or Γ ≠ I.


IMPACT ON DATA ANALYSIS

This problem can be rephrased as testing $\lambda_1 = 1$ against $\lambda_1 > 1$, where $\lambda_1$ is the largest eigenvalue of Γ. It is natural to develop a test based on $\ell_1$, the largest eigenvalue of the empirical covariance matrix

$$C = N^{-1} X'X.$$


IMPACT ON DATA ANALYSIS

This leads to finding the null distribution of $\ell_1$. An exact formula for this distribution is known but is not useful in practice. Suppose d is large, so that we are in the setting of many observations and many variables.


IMPACT ON DATA ANALYSIS

What is the behaviour of the largest eigenvalue of the empirical covariance matrix $C = C_{d,N}$ as $d/N \to \beta$? This is a problem discussed in "Random Matrix Theory". Iain Johnstone obtained asymptotic results for the largest eigenvalue in the statistical setting.
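
The question can be explored numerically. The sketch below simulates data under the null hypothesis Γ = I and computes the largest eigenvalue of C = X'X/N. It compares the result with the value (1 + √(d/N))², the classical random-matrix limit for the largest eigenvalue in this setting. That limit is quoted here as a known fact that is not stated in the slides, and the sample sizes and function name are arbitrary choices for illustration.

```python
import numpy as np


def largest_sample_eigenvalue(N, d, rng):
    """Largest eigenvalue of C = X'X / N for an N x d matrix of i.i.d. N(0, 1) entries."""
    X = rng.standard_normal((N, d))
    C = X.T @ X / N
    # eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
    return float(np.linalg.eigvalsh(C)[-1])


rng = np.random.default_rng(0)
N, d = 400, 100  # d / N = 0.25 (illustrative values)
ell_1 = largest_sample_eigenvalue(N, d, rng)
edge = (1 + np.sqrt(d / N)) ** 2  # limiting value under the null, here 2.25

print(f"largest sample eigenvalue: {ell_1:.3f}")
print(f"(1 + sqrt(d/N))^2:         {edge:.3f}")
```

Even though every population eigenvalue equals 1, the largest sample eigenvalue sits well above 1, so a test based on $\ell_1$ must be calibrated against its null distribution rather than against 1.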


IMPACT ON DATA ANALYSIS

These results are found to be useful even for dimension d = 6. Hence, in some cases, we can obtain useful results for moderate dimensions from high-dimensional analysis.


LASSO ESTIMATOR

Modeling high-dimensional data is challenging. For a continuous response variable Y, a simple but very useful approach is given by the linear model

$$Y_i = \sum_{j=1}^{p} \beta_j X_i^{(j)} + \varepsilon_i, \quad i = 1, \dots, n,$$

where the $\varepsilon_i$, i = 1, ..., n, are zero-mean independent identically distributed errors. The unusual aspect of this model is that p, the number of covariates, is very large compared to n, the number of observations.


LASSO ESTIMATOR

Since p > n, the ordinary least squares estimator is not unique and will heavily overfit the data. Tibshirani (1996) proposed an estimator called LASSO (Least Absolute Shrinkage and Selection Operator). It has become very popular for high-dimensional data analysis.
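
The slides do not write out the LASSO criterion. A standard form, used here as an assumption, is to minimise (1/(2n))‖Y − Xβ‖² + λ‖β‖₁ over β. The sketch below solves it by cyclic coordinate descent with soft-thresholding, which is one common algorithm and not necessarily the one the author has in mind. The function names, the simulated data and the value of `lam` are all hypothetical.

```python
import numpy as np


def soft_threshold(z, t):
    """Shrink z towards zero by t, setting it exactly to zero when |z| <= t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def lasso_cd(X, y, lam, n_iter=200):
    """Minimise (1/(2n))||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n  # (1/n) * ||x_j||^2 for each column
    resid = y.astype(float)            # residual y - X @ beta, with beta = 0 initially
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual that excludes column j.
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            resid += X[:, j] * (beta[j] - new_bj)  # keep the residual in sync
            beta[j] = new_bj
    return beta


# Simulated p > n example (hypothetical): only the first three covariates matter.
rng = np.random.default_rng(0)
n, p = 50, 200
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + 0.5 * rng.standard_normal(n)

beta_hat = lasso_cd(X, y, lam=0.3)
print("number of nonzero coefficients:", np.count_nonzero(beta_hat))
print("indices of the three largest |coefficients|:", np.argsort(-np.abs(beta_hat))[:3])
```

The ℓ₁ penalty sets many coefficients exactly to zero, so the estimator performs shrinkage and variable selection at the same time, which is what makes it usable when p > n.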


DIRECTION OF PROGRESS

High-dimensional analysis is leading to challenges in both mathematics and statistics. New subjects such as high-dimensional approximate linear algebra and high-dimensional approximation theory using bases such as wavelets are being developed to face such challenges.


DATA SCIENTISTS

Data scientists are people who draw insights from large amounts of data. They are expected to have expertise in statistical modeling and machine learning, specialized programming skills and a solid grasp of the business environment they are working in. Data science is an interdisciplinary blend of statistical, mathematical and computational sciences.


e-COMMERCE

Electronic commerce (e-commerce) has experienced an extreme surge of popularity in recent years. It has transformed the economy, eliminated borders, opened the door to many innovations and created new ways in which consumers and businesses interact. E-commerce transactions include activities such as online buying, selling or investing; e-book stores, e-grocers, web-based reservation systems and ticket purchasing; online banking, etc.


e-COMMERCE

Due to the availability of massive amounts of data, empirical research is thriving. E-commerce data arrives with many new statistical issues and challenges, from data collection to data exploration and on to analysis and modeling. E-commerce data is usually very large in both dimensionality and size. It is often a combination of longitudinal and cross-sectional data. Most of it is observational. Data privacy protection is a major issue for e-commerce.


e-COMMERCE

There is a fairly new branch of statistics called "Functional Data Analysis" (FDA) which is being used for analyzing the data obtained via e-commerce. Here the objects of interest are a set of curves or shapes and, in general, a set of functional observations. In classical statistics, the interest is in data values or data vectors. E-commerce is creating new challenges.


e-COMMERCE

The first step in any FDA consists of recovering the underlying functional object from the observed data. This involves choosing a suitable family of basis functions to reduce the dimensionality of the problem, which in turn depends on
(i) the nature of the data;
(ii) the level of smoothness that the application warrants;
(iii) the aspects of the data we want to study;
(iv) the size of the data; and
(v) the type of analysis we plan to perform.
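
A minimal sketch of this first step, assuming a periodic curve observed with noise on a regular grid and a small Fourier basis fitted by least squares. The curve, the noise level, the number of harmonics and the function name `fourier_basis` are all hypothetical choices. Other bases, such as B-splines or wavelets, may suit other kinds of data better.

```python
import numpy as np


def fourier_basis(t, n_harmonics):
    """Design matrix with columns 1, sin(2*pi*k*t), cos(2*pi*k*t) for k = 1..n_harmonics."""
    cols = [np.ones_like(t)]
    for k in range(1, n_harmonics + 1):
        cols.append(np.sin(2 * np.pi * k * t))
        cols.append(np.cos(2 * np.pi * k * t))
    return np.column_stack(cols)


# Hypothetical data: 200 noisy samples of a smooth periodic curve on [0, 1).
rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 200, endpoint=False)
truth = np.sin(2 * np.pi * t) + 0.5 * np.cos(4 * np.pi * t)
observed = truth + 0.2 * rng.standard_normal(t.size)

# Represent the curve by 7 basis coefficients instead of 200 raw values.
B = fourier_basis(t, n_harmonics=3)
coef, *_ = np.linalg.lstsq(B, observed, rcond=None)
smooth = B @ coef

print("number of basis coefficients:", coef.size)
print("MSE of raw observations:", float(np.mean((observed - truth) ** 2)))
print("MSE of fitted curve:    ", float(np.mean((smooth - truth) ** 2)))
```

The number of basis functions plays the role of the smoothness level in item (ii): fewer functions give a smoother but possibly biased curve, more functions follow the noise more closely.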


STATISTICIANS

Statistical thinking sets us, the statisticians, apart from others in three ways:

(1) Like others, we look for features in large data, but we guard against FALSE DISCOVERY, BIAS AND CONFOUNDING;

(2) Like others, we build statistical models that explain, predict and forecast, but we qualify their use with MEASURES OF UNCERTAINTY;

(3) Like others, we work with available data, but we also design studies to produce data that have the RIGHT INFORMATION CONTENT.

That is the difference between us, the STATISTICIANS, and them, the DATA MINERS/DATA SCIENTISTS.


STATISTICIANS

In his recent address to the ASA, its former president Robert Rodriguez said "Young statisticians will need to have continuing education in statistical theory, methodology and applications. They will need to know about new data, new problems and new computational technology. They need to develop skills in collaboration, communication and leadership. It is necessary to have a broad view of what is being accomplished as a statistician."


STATISTICIANS

"It is more than designing an experiment and analyzing data. Ideas should transform business, influence the decision making in formulating public policy. The future of statistics as a profession is unlimited as we grow the opportunity for statistics to meet the needs of a data dependent society".

This is true for Indian Statisticians too!!!


THANK YOU
