ISAAC

Future directions in computer science research

23rd International Symposium on Algorithms and Computation

John Hopcroft, Cornell University

Time of change

The information age is a revolution that is changing all aspects of our lives.

Those individuals, institutions, and nations who recognize this change and position themselves for the future will benefit enormously.

Computer Science is changing

Early years: programming languages, compilers, operating systems, algorithms, databases

Emphasis on making computers useful

Computer Science is changing

The future years

Tracking the flow of ideas in scientific literature
Tracking evolution of communities in social networks
Extracting information from unstructured data sources
Processing massive data sets and streams
Extracting signals from noise
Dealing with high dimensional data and dimension reduction

The field will become much more application oriented

Computer Science is changing

Drivers of change

Merging of computing and communication

The wealth of data available in digital form

Networked devices and sensors

Implications for Theoretical Computer Science

Need to develop theory to support the new directions

Update computer science education

This talk consists of three parts.

A view of the future.

The science base needed to support future activities.

What a science base looks like.

Big data

We generate 2.5 exabytes of data per day, $2.5 \times 10^{18}$ bytes. We broadcast 2 zettabytes per day, approximately 174 newspapers per day for every person on the earth.

Maybe 20 billion web pages.

Facebook

Higgs Boson

CERN's Large Hadron Collider generates hundreds of millions of particle collisions each second. Recording, storing and analyzing these vast amounts of collisions presents a massive data challenge because the collider produces roughly 20 million gigabytes of data each year.

1,000,000,000,000,000: The number of proton-proton collisions, a thousand trillion, analyzed by ATLAS and CMS experiments.

100,000: The number of CDs it would take to record all the data from the ATLAS detector per second, or a stack reaching 450 feet (137 meters) high every second; at this rate, the CD stack could reach the moon and back twice each year, according to CERN.

27: The number of CDs per minute it would take to hold the amount of data ATLAS actually records, since it only records data that shows signs of something new.

"Without the worldwide grid of computing this result would not have happened," said Rolf-Dieter Heuer, director general at CERN during a press conference. The computing power and the network that CERN uses is a very important part of the research, he added.

Current database tools are insufficient to capture, analyze, search, and visualize data of the size encountered today.

Theory to support new directions

Large graphs
Spectral analysis
High dimensions and dimension reduction
Clustering
Collaborative filtering
Extracting signal from noise
Sparse vectors

Sparse vectors

There are a number of situations where sparse vectors are important.

Tracking the flow of ideas in scientific literature

Biological applications

Signal processing

Sparse vectors in biology

plants

Genotype: internal code

Phenotype: observables, outward manifestation

Digitization of medical records

Doctor – needs my entire medical record
Insurance company – needs my last doctor visit, not my entire medical record
Researcher – needs statistical information but no identifiable individual information

Relevant research – zero knowledge proofs, differential privacy

A zero knowledge proof of a statement is a proof that the statement is true without providing you any other information.

Zero knowledge proof

Graph 3-colorability

Problem is NP-hard - No polynomial time algorithm unless P=NP

Zero knowledge proof

I send the sealed envelopes.

You select an edge and open the two envelopes corresponding to the end points.

Then we destroy all envelopes and start over, but I permute the colors and then resend the envelopes.
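The rounds above can be sketched in Python. This is a minimal simulation under stated assumptions, not the protocol as presented in the talk: a SHA-256 hash with a random nonce stands in for a sealed envelope, and the names `seal`, `opens_to`, `run_round`, and `accepts` are hypothetical.

```python
import hashlib
import random
import secrets


def seal(color):
    """Seal a color in an 'envelope': a hash of the color plus a random nonce."""
    nonce = secrets.token_hex(16)
    digest = hashlib.sha256(f"{nonce}:{color}".encode()).hexdigest()
    return digest, nonce


def opens_to(digest, nonce, color):
    """Check that an opened envelope really contains the claimed color."""
    return hashlib.sha256(f"{nonce}:{color}".encode()).hexdigest() == digest


def run_round(edges, coloring, rng):
    """One round: permute colors, seal them, let the verifier open one edge."""
    perm = [0, 1, 2]
    rng.shuffle(perm)  # fresh color permutation each round
    permuted = {v: perm[c] for v, c in coloring.items()}
    sealed = {v: seal(c) for v, c in permuted.items()}
    envelopes = {v: digest for v, (digest, _) in sealed.items()}  # what is sent

    u, v = rng.choice(edges)  # the verifier selects an edge
    for w in (u, v):
        if not opens_to(envelopes[w], sealed[w][1], permuted[w]):
            return False
    return permuted[u] != permuted[v]  # end points must differ in color


def accepts(edges, coloring, rounds=200, seed=0):
    """The verifier accepts only if every round passes."""
    rng = random.Random(seed)
    return all(run_round(edges, coloring, rng) for _ in range(rounds))


edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
good = {0: 0, 1: 1, 2: 2, 3: 0}
bad = {0: 0, 1: 1, 2: 2, 3: 2}  # edge (2, 3) has equal colors
print(accepts(edges, good), accepts(edges, bad))
```

Because the colors are permuted afresh each round, the two opened envelopes show only two different colors. An invalid coloring is caught in a round with probability at least one over the number of edges, so repeating the round drives the chance of cheating toward zero.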

Digitization of medical records is not the only system

Car and road – GPS – privacy

Supply chains

Transportation systems

In the past, sociologists could study groups of a few thousand individuals.

Today, with social networks, we can study interaction among hundreds of millions of individuals.

One important activity is how communities form and evolve.

Early work
Min cut – two equal sized communities
Conductance – minimizes cross edges

Future work
Consider communities with more external edges than internal edges
Find small communities
Track communities over time
Develop appropriate definitions for communities
Understand the structure of different types of social networks
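As an illustration of the conductance criterion, the sketch below uses one common definition, which is an assumption here since the slide does not give a formula: the number of edges crossing the cut divided by the smaller of the two sides' total degree. The function name and the adjacency-list format are hypothetical.

```python
def conductance(adj, community):
    """Cross edges divided by the smaller of the two volumes (sums of degrees)."""
    inside = set(community)
    cut = sum(1 for u in inside for v in adj[u] if v not in inside)
    vol_in = sum(len(adj[u]) for u in inside)
    vol_out = sum(len(adj[u]) for u in adj if u not in inside)
    smaller = min(vol_in, vol_out)
    return cut / smaller if smaller else float("inf")


# Two triangles joined by the single edge 2-3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
print(conductance(adj, {0, 1, 2}))  # one cross edge over volume 7
```

A low value means few cross edges relative to the community's size, which is why this measure struggles with communities that have more external than internal edges.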

Our view of a community

TCS

Me

Colleagues at Cornell

Classmates

Family and friends

More connections outside than inside

Ongoing research on finding communities

Spectral clustering with K-means.

Instead of two overlapping clusters, we find three clusters.

Instead of clustering the rows of the singular vectors, find the minimum 0-norm vector in the space spanned by the singular vectors.

The minimum 0-norm vector is, of course, the all zero vector, so we require one component to be 1.

Finding the minimum 0-norm vector is NP-hard.

Use the minimum 1-norm vector as a proxy. This is a linear programming problem.
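One way to write the 1-norm proxy as a linear program is sketched below. The symbols are assumptions introduced for illustration: the columns of $V$ are the singular vectors, $y$ holds the coefficients of a vector $x = Vy$ in their span, $i$ is the component required to be 1, and $t$ is an auxiliary vector bounding the absolute values.

```latex
\begin{aligned}
\min_{y,\,t} \quad & \sum_{j} t_j \\
\text{subject to} \quad & -t_j \le (Vy)_j \le t_j \quad \text{for all } j, \\
& (Vy)_i = 1 .
\end{aligned}
```

At an optimum each $t_j$ equals $|(Vy)_j|$, so the objective is the 1-norm of $x = Vy$, and both the objective and the constraints are linear in $(y, t)$.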

What we have described is how to find global structure.

We would like to apply these ideas to find local structure.

We want to find a community of size 50 in a network of size $10^9$.

Minimum 1-norm vector is not an indicator vector.

By thresholding the components, convert it to an indicator vector for the community.

In practice, allow the vector to be close to the subspace.

Random walk

How long?

What dimension?

Structure of communities

How many communities is a person in? Small, medium, large?

How many seed points are needed to uniquely specify a community a person is in? Which seeds are good seeds? Etc.

What types of communities are there?

How do communities evolve over time?

Are all social networks similar?

Are the underlying graphs for social networks similar or do we need different algorithms for different types of networks?

G(1000,1/2) and G(1000,1/4) are similar, one is just denser than the other. G(2000,1/2) and G(1000,1/2) are similar, one is just larger than the other.

Two G(n,p) graphs are similar even though they have only 50% of edges in common.

What do we mean mathematically when we say two graphs are similar?
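The 50% figure for G(n,1/2) can be checked with a small simulation. This is an illustrative sketch; the function names and the choice n = 200 are assumptions made to keep it fast.

```python
import random
from itertools import combinations


def gnp(n, p, rng):
    """Sample G(n, p): each potential edge is present independently with probability p."""
    return {e for e in combinations(range(n), 2) if rng.random() < p}


def shared_fraction(n, p, seed=0):
    """Fraction of the first graph's edges that also appear in the second graph."""
    rng = random.Random(seed)
    g1, g2 = gnp(n, p, rng), gnp(n, p, rng)
    return len(g1 & g2) / len(g1)


print(shared_fraction(200, 0.5))  # close to 0.5
```

Each edge of the first graph appears in the second independently with probability p, so the shared fraction concentrates near p even though the two graphs come from the same model.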

Theory of Large Graphs

Large graphs with billions of vertices

Exact edges present not critical

Invariant to small changes in definition

Must be able to prove basic theorems

Erdős-Rényi

n vertices

Each of the $\binom{n}{2}$ potential edges is present with independent probability $p$

Binomial degree distribution: $\binom{N}{n} p^n (1-p)^{N-n}$

[Figure: number of vertices versus vertex degree]

Generative models for graphs

Vertices and edges added at each unit of time

Rule to determine where to place edges
Uniform probability
Preferential attachment - gives rise to power law degree distributions

[Figure: number of vertices versus vertex degree]

Preferential attachment gives rise to the power law degree distribution common in many graphs.
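A minimal sketch of a preferential attachment process follows. It assumes the simplest variant, in which each new vertex adds a single edge to an existing vertex chosen with probability proportional to its degree. The talk does not fix these details, and the function name is hypothetical.

```python
import random
from collections import Counter


def preferential_attachment(n, seed=0):
    """Grow a graph one vertex at a time; return the degree of every vertex."""
    rng = random.Random(seed)
    endpoints = [0, 1]  # one entry per edge end point; start from the edge 0-1
    degree = Counter({0: 1, 1: 1})
    for new in range(2, n):
        # A uniform pick from the end-point list is a pick proportional to degree.
        target = rng.choice(endpoints)
        endpoints += [new, target]
        degree[new] += 1
        degree[target] += 1
    return degree


degree = preferential_attachment(5000)
histogram = Counter(degree.values())  # vertex degree -> number of vertices
print(sorted(histogram.items())[:5], max(degree.values()))
```

Most vertices end with degree 1 or 2 while a few early vertices collect many edges, which is the heavy tail a power law describes.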

Protein interactions

2730 proteins in data base

3602 interactions between proteins

SIZE OF COMPONENT         1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16    … 1000
NUMBER OF COMPONENTS     48  179   50   25   14    6    4    6    1    1    1    0    0    0    0    1         0

Science 1999 July 30; 285:751-753

Only 899 proteins in components. Where are the 1851 missing proteins?

Protein interactions

2730 proteins in data base

3602 interactions between proteins

SIZE OF COMPONENT         1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16    … 1851
NUMBER OF COMPONENTS     48  179   50   25   14    6    4    6    1    1    1    0    0    0    0    1         1

Science 1999 July 30; 285:751-753

Science Base

What do we mean by science base?

Example: High dimensions

High dimension is fundamentally different from 2 or 3 dimensional space

High dimensional data is inherently unstable.

Given n random points in d-dimensional space, essentially all $n^2$ distances are equal.

$$|x - y|^2 = \sum_{i=1}^{d} (x_i - y_i)^2$$
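The concentration of distances can be seen numerically. The sketch below assumes the random points are drawn from a standard Gaussian, which is one choice of "random" and is not specified on the slide. The function name is hypothetical.

```python
import math
import random
from itertools import combinations


def distance_spread(n, d, seed=0):
    """Ratio of the largest to the smallest pairwise distance among n points."""
    rng = random.Random(seed)
    pts = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    dists = [math.dist(a, b) for a, b in combinations(pts, 2)]
    return max(dists) / min(dists)


print(distance_spread(30, 2))     # large: distances vary widely in the plane
print(distance_spread(30, 1000))  # near 1: all distances are almost equal
```

Each squared distance is a sum of d independent terms, so its relative fluctuation shrinks roughly like one over the square root of d.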

High Dimensions

Intuition from two and three dimensions is not valid for high dimensions.

Volume of cube is one in all dimensions.

Volume of sphere goes to zero.
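The shrinking volume can be computed directly from the standard closed form for the ball of radius 1 in d dimensions, $\pi^{d/2} / \Gamma(d/2 + 1)$. The unit radius and the function name are assumptions of this sketch.

```python
import math


def unit_ball_volume(d):
    """Volume of the radius-1 ball in d dimensions: pi^(d/2) / Gamma(d/2 + 1)."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)


for d in (2, 3, 10, 20, 100):
    print(d, unit_ball_volume(d))
```

The Gamma function in the denominator grows faster than any exponential, so the volume peaks at a small dimension and then falls toward zero, while the unit cube keeps volume one.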

Gaussian distribution

Probability mass concentrated between dotted lines

Gaussian in high dimensions

[Figure labels: 3 and √d]

Two Gaussians

3√d

[Figure: 2 Gaussians with 1000 points each: mu=1.000, sigma=2.000, dim=500]

[Figure: 2 Gaussians with 1000 points each: mu=1.000, sigma=2.000, dim=500]

Distance between two random points from same Gaussian

Points on thin annulus of radius $\sqrt{d}$

Approximate by a sphere of radius $\sqrt{d}$

Average distance between two points is $\sqrt{2d}$ (Place one point at N. Pole, the other point at random. Almost surely, the second point will be near the equator.)
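A quick numerical check of these two scales, assuming a unit-variance spherical Gaussian centered at the origin: a sample's length should be close to the square root of d, and the distance between two samples close to the square root of 2d. The function name is hypothetical.

```python
import math
import random


def norm_and_distance_ratios(d, seed=0):
    """Return (|x| / sqrt(d), |x - y| / sqrt(2d)) for two Gaussian samples x, y."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(d)]
    y = [rng.gauss(0, 1) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in x))
    return norm / math.sqrt(d), math.dist(x, y) / math.sqrt(2 * d)


print(norm_and_distance_ratios(10000))  # both ratios close to 1
```

The second ratio reflects the North Pole picture: two nearly orthogonal vectors of equal length form a right triangle whose hypotenuse is longer by a factor of the square root of 2.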

[Figure: lengths labeled $\sqrt{2d}$, $\sqrt{d}$, $\sqrt{d}$]

Expected distance between points from two Gaussians separated by δ: $\sqrt{\delta^2 + 2d}$
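The distance between points drawn from two different Gaussians can be simulated in the same way. The sketch assumes unit-variance spherical Gaussians whose centers differ by δ along the first coordinate axis and compares the observed distance with $\sqrt{\delta^2 + 2d}$. The function name and the parameter values are illustrative.

```python
import math
import random


def cross_distance(d, delta, seed=0):
    """Distance between one sample from each of two Gaussians delta apart."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(d)]
    y = [rng.gauss(0, 1) for _ in range(d)]
    y[0] += delta  # shift the second center by delta along the first axis
    return math.dist(x, y)


d, delta = 10000, 50.0
print(cross_distance(d, delta), math.sqrt(delta ** 2 + 2 * d))
```

The offset between the centers is nearly orthogonal to the random difference of the two samples, so the squared lengths add as in the Pythagorean theorem.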

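Can separate points from two Gaussians if

$$\sqrt{\delta^2 + 2d} > \sqrt{2d} + \gamma$$

$$\sqrt{2d}\left(1 + \frac{\delta^2}{2d}\right)^{1/2} > \sqrt{2d} + \gamma$$

$$\sqrt{2d}\left(1 + \frac{1}{2}\,\frac{\delta^2}{2d}\right) > \sqrt{2d} + \gamma$$

$$\frac{\delta^2}{2\sqrt{2d}} > \gamma$$

$$\delta > \sqrt{2\gamma}\,(2d)^{1/4}$$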

Dimension reduction

Project points onto subspace containing centers of Gaussians.

Reduce dimension from d to k, the number of Gaussians

Centers retain separation

Average distance between points reduced by $\sqrt{d/k}$

$$(x_1, x_2, \ldots, x_d) \rightarrow (x_1, x_2, \ldots, x_k, 0, \ldots, 0)$$

$$\sqrt{d}\,x_i \rightarrow \sqrt{k}\,x_i$$
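The effect of the projection can be sketched numerically. The code assumes, for illustration only, that the centers lie in the span of the first k coordinate axes, so projecting means keeping the first k coordinates. In practice the subspace would have to be found from the data. The function names are hypothetical.

```python
import math
import random


def project(x, k):
    """Keep the first k coordinates (the subspace assumed to hold the centers)."""
    return x[:k]


def mean_within_distance(d, k, pairs=200, seed=0):
    """Average distance between two points of one Gaussian, before and after projection."""
    rng = random.Random(seed)
    full = projected = 0.0
    for _ in range(pairs):
        x = [rng.gauss(0, 1) for _ in range(d)]
        y = [rng.gauss(0, 1) for _ in range(d)]
        full += math.dist(x, y)
        projected += math.dist(project(x, k), project(y, k))
    return full / pairs, projected / pairs


full, projected = mean_within_distance(500, 2)
center_a, center_b = [0.0] * 500, [5.0] + [0.0] * 499
print(full, projected, math.dist(project(center_a, 2), project(center_b, 2)))
```

The distance between the two centers is unchanged by the projection, while the typical distance between points of the same Gaussian drops from about $\sqrt{2d}$ to a value on the order of $\sqrt{2k}$.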

Can separate Gaussians provided

$$\sqrt{\delta^2 + 2k} > \sqrt{2k} + \gamma$$

$\delta >$ some constant involving $k$ and $\gamma$, independent of the dimension

We have just seen what a science base for high dimensional data might look like.

For what other areas do we need a science base?

Ranking is important
Restaurants, movies, books, web pages
Multi-billion dollar industry

Collaborative filtering
When a customer buys a product, what else is he or she likely to buy?

Dimension reduction
Extracting information from large data sources
Social networks

This is an exciting time for computer science.

There is a wealth of data in digital format, information from sensors, and social networks to explore.

It is important to develop the science base to support these activities.

Remember that institutions, nations, and individuals who position themselves for the future will benefit immensely.

Thank You!
