Seminar on Sub-linear algorithms Prof. Ronitt Rubinfeld

Seminar on Sub-linear algorithms

Prof. Ronitt Rubinfeld

What are we here for?

Big data?

Really Big Data

• Impossible to access all of it

• Potentially accessible data is too enormous to be viewed by a single individual

• Once accessed, data can change

Approaches

• Ignore the problem

• Develop algorithms for dealing with such data

Small world phenomenon

• each “node” is a person

• “edge” between people that know each other

Small world property

• “connected” if every pair can reach each other

• “distance” between two people is the minimum number of edges to reach one from another

• “diameter” is the maximum distance between any pair

6 degrees of separation property

In our language:

diameter of the world population is 6

Does earth have the small world property?

• How can we know?
  • data collection problem is immense
  • unknown groups of people found on earth
  • births/deaths

The Gold Standard

• linear time algorithms
• Inadequate…

What can we hope to do without viewing most of the data?

• Can’t answer “for all” or “there exists” and other “exactly” type statements:
  • are all individuals connected by at most 6 degrees of separation?
  • exactly how many individuals on earth are left-handed?

• Maybe can answer?
  • is there a large group of individuals connected by at most 6 degrees of separation?
  • is the average pairwise distance of a graph roughly 6?
  • approximately how many individuals on earth are left-handed?

What can we hope to do without viewing most of the data?

• Must change our goals:
  • for most interesting problems: algorithm must give approximate answer

• we know we can answer some questions...
  • e.g., sampling to approximate average, median values
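As an illustration of the sampling bullet, here is a minimal sketch of approximating an average from a few random queries. It assumes every value lies in [0, 1] and picks the number of samples from Hoeffding's inequality; the names `sample_size` and `estimate_mean`, and the synthetic `population`, are hypothetical and not from the slides.

```python
import math
import random

def sample_size(eps, delta):
    """Number of samples that suffices, by Hoeffding's inequality,
    for values in [0, 1], error eps and failure probability delta."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

def estimate_mean(values, eps, delta, rng=None):
    """Estimate the mean of `values` to within +/- eps with probability
    at least 1 - delta, reading only sample_size(eps, delta) entries."""
    rng = rng or random.Random()
    k = sample_size(eps, delta)
    total = 0.0
    for _ in range(k):
        total += values[rng.randrange(len(values))]  # one random-access query
    return total / k

# Hypothetical data: 30% of one million entries are 1, the rest 0.
population = [1] * 300_000 + [0] * 700_000
estimate = estimate_mean(population, eps=0.05, delta=0.001, rng=random.Random(0))
```

The number of queries depends only on `eps` and `delta`, not on the length of the input, which is what makes the estimate sublinear.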

Sublinear time models:

• Random Access Queries
  • Can access any word of input in one step

• Samples
  • Can get sample of a distribution in one step,
  • Alternatively, can only get random word of input in one step
    • When computing functions depending on frequencies of data elements
    • When data in random order

[Figure: distribution p producing samples x1, x2, x3, …]

Sublinear space model: streaming

• Data flows by and once it goes by you can’t see it again
• you do have small space to work with and remember some things…
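To make the streaming model concrete, here is a minimal sketch of one classic small-space technique, reservoir sampling. The slides do not name it; it is shown only as an example of an algorithm that sees each item once and remembers k of them. The function name `reservoir_sample` is hypothetical.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown
    length, using O(k) memory and a single pass."""
    rng = rng or random.Random()
    reservoir = []
    for t, item in enumerate(stream):
        if t < k:
            reservoir.append(item)       # fill the reservoir first
        else:
            j = rng.randrange(t + 1)     # uniform in 0..t
            if j < k:
                reservoir[j] = item      # item t is kept with probability k/(t+1)
    return reservoir

# The stream is consumed once; only 5 items are ever stored.
sample = reservoir_sample(iter(range(10_000)), 5, random.Random(0))
```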

Part I: Query model

A. Standard approximation problems

• Estimate value of optimization problem after viewing only small portion of data

• Classically studied problems: Minimum spanning tree, vertex cover, max cut, positive linear program, edit distance, …

• some problems are “hard” some are “easy”…

Example

• Vertex Cover:
  • What is the smallest set of nodes that “sees” each edge?

• What’s known?
  • NP-complete (unlikely to have an efficient solution)
  • Can approximate to within a multiplicative factor of 2 in linear time
  • Can approximate to within a multiplicative factor of 2 and a bit of additional additive error in CONSTANT TIME on sparse graphs! [Nguyen Onak 2008][Onak Ron Rosen Rubinfeld 2012]
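The linear-time factor-2 bound can be sketched with the textbook greedy maximal-matching argument. This is the classical linear-time algorithm, not the constant-time algorithms cited above; the function name `vertex_cover_2approx` and the example graphs are hypothetical.

```python
def vertex_cover_2approx(edges):
    """Greedy maximal matching: whenever an edge has neither endpoint in
    the cover yet, add both endpoints. Any cover must contain at least
    one endpoint of each matched edge, so the result is at most twice
    the optimum. Runs in time linear in the number of edges."""
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.add(u)
            cover.add(v)
    return cover

path = [(0, 1), (1, 2), (2, 3)]   # optimum cover is {1, 2}, size 2
star = [(0, 1), (0, 2), (0, 3)]   # optimum cover is {0}, size 1
path_cover = vertex_cover_2approx(path)
star_cover = vertex_cover_2approx(star)
```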

B. Property testing

Main Goal:

• Quickly distinguish inputs that have specific property from those that are far from having the property

• Benefits:
  • natural question
  • just as good when data constantly changing
  • fast sanity check:
    • rule out “bad” inputs (e.g., restaurant bills)
    • when is expensive processing worthwhile?
  • Machine learning: Model selection problem

[Figure: regions labeled “all inputs”, “inputs with the property”, and “close to having property”]

Property testing algorithms:

• Does the input object have crucial properties?
  • Clusterability, small diameter graph, close to a codeword, in alphabetical order,…

• Examples:
  • Can test if the social network has 6 degrees of separation in CONSTANT TIME! [Parnas Ron]
  • Can test if a large dataset is sorted in O(log n) time [Ergun Kannan Kumar Rubinfeld Viswanathan]
  • Can test if data is well-clusterable in CONSTANT time [Parnas Ron Rubinfeld]

Property Testing

• Properties of any object, e.g.,
  • Functions
  • Graphs
  • Strings
  • Matrices
  • Codewords

• Model must specify
  • representation of object and allowable queries
  • notion of close/far, e.g.,
    • number of bits/words that need to be changed
    • edit distance

A simple property tester

Sortedness of a sequence

• Given: list y1 y2 ... yn

• Question: is the list sorted?

• Clearly requires n steps – must look at each yi

Sortedness of a sequence

• Given: list y1 y2 ... yn

• Question: can we quickly test if the list is close to sorted?

What do we mean by “quick”?

• query complexity measured in terms of list size n

• Our goal (if possible):
  • Very small compared to n, will go for c log n

What do we mean by “close”?

Definition: a list of size n is ε-close to sorted if one can delete at most εn values to make it sorted. Otherwise, it is ε-far.

(ε is given as input, e.g., ε=1/10)

Sorted: 1 2 4 5 7 11 14 19 20 21 23 38 39 45
Close: 1 4 2 5 7 11 14 19 20 39 23 21 38 45
  (sorted after deletions: 1 4 5 7 11 14 19 20 23 38 45)
Far: 45 39 23 1 38 4 5 21 20 19 2 7 11 14
  (sorted after deletions: 1 4 5 7 11 14)
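The definition can be checked exactly, though not in sublinear time, by computing the longest sorted subsequence: the minimum number of deletions is n minus its length. The sketch below is only a reference implementation for the definition, run on the three example lists from the slide; the function names are hypothetical.

```python
from bisect import bisect_right

def deletions_to_sort(ys):
    """Minimum number of deletions that leaves a non-decreasing list:
    len(ys) minus the length of a longest non-decreasing subsequence
    (computed by patience sorting in O(n log n) time)."""
    tails = []  # tails[k] = smallest possible last value of a sorted subsequence of length k+1
    for y in ys:
        pos = bisect_right(tails, y)
        if pos == len(tails):
            tails.append(y)
        else:
            tails[pos] = y
    return len(ys) - len(tails)

def is_eps_close(ys, eps):
    """True if ys is eps-close to sorted in the sense of the definition."""
    return deletions_to_sort(ys) <= eps * len(ys)

sorted_list = [1, 2, 4, 5, 7, 11, 14, 19, 20, 21, 23, 38, 39, 45]
close_list = [1, 4, 2, 5, 7, 11, 14, 19, 20, 39, 23, 21, 38, 45]
far_list = [45, 39, 23, 1, 38, 4, 5, 21, 20, 19, 2, 7, 11, 14]
```

On these lists the deletion counts are 0, 3, and 8 out of 14 elements, matching the subsequences shown on the slide.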

Requirements for algorithm:

• Pass sorted lists
• Fail lists that are ε-far.
  • Equivalently: if list likely to pass test, can change at most ε fraction of list to make it sorted

Probability of success > ¾ (can boost it arbitrarily high by repeating several times and outputting “fail” if ever see a “fail”, “pass” otherwise)

• Can test in O((1/ε) log n) time (and can’t do any better!)

What if list not sorted, but not far?

An attempt:

• Proposed algorithm:
  • Pick random i and test that y_i ≤ y_{i+1}

• Bad input type:
  • 1,2,3,4,5,…,n/4, 1,2,…,n/4, 1,2,…,n/4, 1,2,…,n/4
  • Difficult for this algorithm to find “breakpoint”
  • But other tests work well…

[Figure: plot of yi against i]
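A small sketch makes the failure of the adjacent-pair test concrete. It builds the bad input for a hypothetical n = 1000, counts the adjacent pairs that are out of order, and computes the longest sorted subsequence to show that the list is nonetheless far from sorted. All function names are illustrative.

```python
from bisect import bisect_right

def repeated_blocks(n):
    """The bad input: the block 1, 2, ..., n/4 repeated four times."""
    return list(range(1, n // 4 + 1)) * 4

def adjacent_violations(ys):
    """Number of positions i with ys[i] > ys[i+1]."""
    return sum(1 for i in range(len(ys) - 1) if ys[i] > ys[i + 1])

def longest_sorted_subsequence(ys):
    """Length of a longest non-decreasing subsequence (patience sorting)."""
    tails = []
    for y in ys:
        pos = bisect_right(tails, y)
        if pos == len(tails):
            tails.append(y)
        else:
            tails[pos] = y
    return len(tails)

ys = repeated_blocks(1000)
breakpoints = adjacent_violations(ys)       # only the 3 block boundaries
keepable = longest_sorted_subsequence(ys)   # roughly n/4 elements
```

Only 3 of the 999 adjacent pairs reveal a problem, so a random i almost never hits a breakpoint, even though about three quarters of the list would have to be deleted to sort it.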

A second attempt:

• Proposed algorithm:
  • Pick random i<j and test that yi≤yj

• Bad input type:
  • n/4 groups of 4 decreasing elements: 4,3,2,1,8,7,6,5,12,11,10,9,…,4k,4k-1,4k-2,4k-3,…
  • Largest monotone sequence is n/4
  • must pick i,j in same group to see problem
  • need Ω(n^(1/2)) samples

[Figure: plot of yi against i]
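The same kind of count shows why random pairs struggle on this input. The sketch below builds the list for a hypothetical n = 400 and counts, by brute force, how many of the pairs i < j are out of order; the function names are illustrative.

```python
from itertools import combinations

def decreasing_groups(n):
    """The bad input: 4,3,2,1, 8,7,6,5, ... (n/4 groups of 4 decreasing elements)."""
    return [4 * g + r for g in range(n // 4) for r in (4, 3, 2, 1)]

def violating_pairs(ys):
    """Number of pairs i < j with ys[i] > ys[j] (brute force, for illustration only)."""
    return sum(1 for i, j in combinations(range(len(ys)), 2) if ys[i] > ys[j])

n = 400
ys = decreasing_groups(n)
bad = violating_pairs(ys)      # 6 violating pairs inside each group of 4
total = n * (n - 1) // 2       # all pairs i < j
```

Here 600 of the 79,800 pairs are violations, a fraction of 3/(n-1), so a single random pair almost never lands inside one group. A tester that compares all pairs among s sampled indices sees about s² pairs, which is consistent with the Ω(n^(1/2)) bound on the slide.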

A minor simplification:

• Assume list is distinct (i.e. xi ≠ xj for i ≠ j)

• Claim: this is not really easier
  • Why? Can “virtually” append i to each xi

x1,x2,…,xn → (x1,1), (x2,2),…,(xn,n)
e.g., 1,1,2,6,6 → (1,1),(1,2),(2,3),(6,4),(6,5)
Breaks ties without changing order
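In code the trick is a one-liner, because Python tuples compare lexicographically. The helper name `with_positions` is hypothetical; the input is the example from the slide.

```python
def with_positions(xs):
    """Pair each value with its 1-based position so that all keys are distinct.
    Tuples compare by value first and position second, so ties are broken
    without changing the relative order of different values."""
    return [(x, i) for i, x in enumerate(xs, start=1)]

pairs = with_positions([1, 1, 2, 6, 6])
```

A tester never needs to build this list: it can form the pair (xi, i) on the fly each time it queries position i.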

A test that works

• The test:
  Test O(1/ε) times:
  • Pick random i
  • Look at value of yi
  • Do binary search for yi
  • Does the binary search find any inconsistencies? If yes, FAIL
  • Do we end up at location i? If not FAIL
  Pass if never failed

• Running time: O(ε^(-1) log n) time
• Why does this work?
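The test above can be sketched as follows. This is one possible reading of the slide, not the authors' code: ties are broken with the (value, index) trick, an “inconsistency” is taken to mean a pivot whose key falls outside the range implied by earlier pivots, and the constant c = 2 in the number of trials is an assumption chosen so that an ε-far list is rejected with probability above 3/4. The names `is_good` and `sortedness_tester` are hypothetical.

```python
import math
import random

def is_good(ys, i):
    """Binary-search for ys[i] over the index range; return True only if
    the search is consistent and ends at location i."""
    key = lambda idx: (ys[idx], idx)   # distinct keys, even with repeated values
    target = key(i)
    lo, hi = 0, len(ys) - 1
    lower, upper = None, None          # bounds implied by the pivots seen so far
    while lo <= hi:
        mid = (lo + hi) // 2
        pivot = key(mid)
        if (lower is not None and pivot <= lower) or \
           (upper is not None and pivot >= upper):
            return False               # inconsistency found: FAIL
        if mid == i:
            return True                # ended up at location i
        if target < pivot:
            hi, upper = mid - 1, pivot
        else:
            lo, lower = mid + 1, pivot
    return False                       # search ended somewhere else: FAIL

def sortedness_tester(ys, eps, rng=None, c=2):
    """Pass every sorted list; fail eps-far lists with probability > 3/4.
    Makes O((1/eps) log n) queries."""
    rng = rng or random.Random()
    if len(ys) <= 1:
        return True
    for _ in range(math.ceil(c / eps)):
        if not is_good(ys, rng.randrange(len(ys))):
            return False
    return True
```

For example, on a reversed list of 1000 elements only the middle index is good, so the tester rejects as soon as it samples any other index.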

Behavior of the test:

• Define index i to be good if binary search for yi successful
• O((1/ε) log n) time test (restated):
  • pick O(1/ε) i’s and pass if they are all good
• Correctness:
  • If list is sorted, then all i’s good (uses distinctness) ⇒ test always passes
  • If list likely to pass test, then at least (1-ε)n i’s are good.
  • Main observation: good elements form increasing sequence
    • Proof: for i<j both good need to show yi < yj
    • let k = least common ancestor of i,j
    • Search for i went left of k and search for j went right of k ⇒ yi < yk < yj
  • Thus list is ε-close to monotone (delete < εn bad elements)
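The main observation can be checked empirically. The sketch below assumes distinct values, marks an index as good when the binary search for its value ends at that index, and then looks at the values at the good indices of a shuffled list. The helper names and the shuffled test input are hypothetical.

```python
import random

def ends_at(ys, i):
    """True if binary search for ys[i] (distinct values assumed) ends at index i."""
    lo, hi = 0, len(ys) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if mid == i:
            return True
        if ys[i] < ys[mid]:
            hi = mid - 1
        else:
            lo = mid + 1
    return False

def good_indices(ys):
    """All good indices, in increasing order (reads the whole list; for checking only)."""
    return [i for i in range(len(ys)) if ends_at(ys, i)]

ys = list(range(200))
random.Random(1).shuffle(ys)
good = good_indices(ys)
good_values = [ys[i] for i in good]
```

Whatever the input, `good_values` comes out strictly increasing, so deleting the elements at bad indices always leaves a sorted list. That is the step that turns “few bad indices” into “close to sorted”.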

Part II: Sample model

Testing and learning distributions over BIG domains

Outbreak of diseases

• Similar patterns?
• Correlated with income level?
• More prevalent near large cities?

[Figure: panels labeled “Flu 2015” and “Flu 2016”; callout: “Lots of zip codes!”]

Transactions of 20-30 yr olds

Transactions of 30-40 yr olds

Testing closeness of two distributions:

trend change?

Lots of product types!

Play the lottery?

Is it uniform?

Is it independent?

Information in neural spike trains

• Each application of stimuli gives sample of signal (spike train)

• Entropy of (discretized) signal indicates which neurons respond to stimuli

[Figure: neural signals over time]

[Strong, Koberle, de Ruyter van Steveninck, Bialek ’98]

High dimensional data

Clusterable?

Depends on all or few variables?

Blood pressure, age, weight, pulse, temperature,… vs. ICU, mortality, ventilator,…

What properties do your big distributions have?

Properties of distributions

Given samples of a distribution, need to know, e.g.,

• Is it uniform, normal, Poisson, Gaussian, Zipfian…?
• Correlations/independence?
• Entropy
• Number of distinct elements
• “Shape” (monotone, bimodal,…)

Key Question

• How many samples do you need in terms of domain size?

• Do you need to estimate the probabilities of each domain item?

-- OR --

• Can sample complexity be sublinear in size of the domain?

Rules out standard statistical techniques

Well?

In many cases, the answer is YES!
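One way to see how sample complexity can be sublinear in the domain size is the standard collision statistic for uniformity, sketched below. The slides do not describe this method; it is shown as one known technique. The fraction of sample pairs that collide estimates the sum of squared probabilities, which is exactly 1/n for the uniform distribution on n items and at least (1+ε²)/n for any distribution that is ε-far from uniform in L1 distance. The function names, the acceptance threshold (1+ε²/2)/n, and the parameters n, m, and ε are assumptions made for illustration.

```python
import random
from collections import Counter

def collision_rate(samples):
    """Fraction of unordered sample pairs that are equal."""
    m = len(samples)
    collisions = sum(c * (c - 1) // 2 for c in Counter(samples).values())
    return collisions / (m * (m - 1) // 2)

def looks_uniform(samples, n, eps):
    """Accept if the collision rate is close to the 1/n expected under uniformity."""
    return collision_rate(samples) <= (1 + eps ** 2 / 2) / n

rng = random.Random(0)
n, m = 1_000_000, 100_000   # far fewer samples than domain elements
uniform_samples = [rng.randrange(n) for _ in range(m)]
skewed_samples = [rng.randrange(n // 2) for _ in range(m)]  # uniform on half the domain
```

With these parameters most domain elements are never seen at all, yet the two sample sets are told apart by their collision counts (about 5,000 expected for the uniform case versus about 10,000 for the skewed one).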

Assumptions on distributions?

• Is it
  • Smooth?
  • Monotone?
  • Normal?
  • Gaussian?

• Hopefully – NO ASSUMPTIONS!

Distributions on BIG domains

• Previous work (including in other communities):
  • E.g., Can you beat n log n samples for estimating entropy?
  • O(n) samples is a big success!

• Great progress!
  • Some optimal bounds:
    • Additive estimates of entropy, support size, closeness of two distributions: n/log n [Valiant Valiant 2011]
    • γ-multiplicative estimate of entropy: n^(1/γ²) [Batu Dasgupta Kumar Rubinfeld 2005] [Valiant 2008]
    • Two distributions - the same or far (in L1 distance)? n^(2/3) [Batu Fortnow Rubinfeld Smith White 2000] [Valiant 2008]
  • Can do better for data with special properties

• Independence,
• Monotonicity,
• Gaussian, normal, uniform,…
• Clusterable,
• Depends on few variables

• And lots and lots more!

Many more!

To summarize: What can we do in sublinear time/space?

• Approximations – what do we mean by this?
  • Optimization problems
  • Decision problems – property testing
  • Which distance metric?

• Which types of problems?
  • Graphs and combinatorial objects
  • Strings
  • Functions
  • Probability distributions

• Settings?
  • Small time
  • Small space
  • Assistance of a prover?

Related course:

• Sublinear time algorithms 0368.4167
• Monday 13-16 in Shenkar 104

My contact info:

• www.cs.tau.ac.il/~Rubinfeld/COURSES/F15sem/index.html

• Can also google “Ronitt” and follow links.

• My email: [email protected]

Giving a talk

Main issue

Giving an INTERESTING talk

• Good material

• Good slides

• Good delivery

Material

Pick something you like!

How many technical details?

Make sure we learn something technical/conceptual

Slides

• At least 24 point font
• As few words as possible
  • Split into more slides?

• Organization
  • Use bullets, sub-bullets
  • Use color wisely

• Write anything which should be remembered
  • Repeat a definition on next slide? Use a board?

• Animations? Don’t overdo it!
• SPELL CHECK!!

Delivery

• Bad:
  • Mumbling
  • Monotone voice
  • Repeat each sentence twice
  • Filler words
  • Stand in front of slides
  • Don’t look at audience

• Important: Practice the talk – several times!

What now?

• Pick a paper, or two or three?
• Come talk to me!