72
Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

  • View
    214

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Future Directions in Computer Science

John HopcroftCornell UniversityIthaca, New York

Page 2: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Time of change

There is a fundamental revolution taking place that is changing all aspects of our lives.

Those individuals who recognize this and position themselves for the future will benefit enormously.

Page 3: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Drivers of change in Computer Science

Computers are becoming ubiquitous Speed sufficient for word processing,

email, chat and spreadsheets Merging of computing and

communications Data available in digital form Devices being networked

Page 4: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Computer Science departments are beginning to develop courses that cover underlying theory for

Random graphs Phase transitions Giant components Spectral analysis Small world phenomena Grown graphs

Page 5: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

What is the theory needed to support the future?

Search Networks and sensors Large amounts of noisy, high dimensional

data Integration of systems and sources of

information

Page 6: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Internet queries are changing Today Autos Graph theory

Colleges, universities

Computer science

Tomorrow Which car should I buy? Construct an annotated

bibliography on graph theory

Where should I go to college?

How did the field of CS develop?

Page 7: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

What car should I buy?

List of makesCostReliabilityFuel economyCrash safety

Pertinent articlesConsumer reportsCar and driver

Page 8: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Where should I go to college? List of factors that might influence choice

CostGeographic locationSizeType of institution

MetricsRanking of programsStudent faculty ratiosGraduates from your high school/neighborhood

Page 9: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

How did field develop?

From ISI database create set of important papers in field? Extract important authors

author of key paper author of several important papers thesis advisor of Ph.D.’s student(s) who is (are) important author(s)

Extract key institutions institution of important author(s) Ph.D. granting institution of important authors

Output Directed graph of key papers Flow of Ph.D.’s geographically with time Flow of ideas

Page 10: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Temporal Cluster Histograms: NIPS Results

12: chip, circuit, analog, voltage, vlsi11: kernel, margin, svm, vc, xi10: bayesian, mixture, posterior, likelihood,

em9: spike, spikes, firing, neuron, neurons8: neurons, neuron, synaptic, memory,

firing7: david, michael, john, richard, chair6: policy, reinforcement, action, state,

agent5: visual, eye, cells, motion, orientation4: units, node, training, nodes, tree3: code, codes, decoding, message, hints2: image, images, object, face, video1: recurrent, hidden, training, units, error0: speech, word, hmm, recognition, mlp

NIPS k-means clusters (k=13)

0

20

40

60

80

100

120

140

160

180

1 2 3 4 5 6 7 8 9 10 11 12 13 14Year

Num

ber o

f Pap

ers

Shaparenko, Caruana, Gehrke, and Thorsten

Page 11: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York
Page 12: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York
Page 13: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York
Page 14: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York
Page 15: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York
Page 16: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York
Page 17: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Theory to supportnew directions

Large graphs Spectral analysis High dimensions and dimension reduction Clustering Collaborative filtering Extracting signal from noise

Page 18: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Graph Theory of the 50’s

Theorem: A graph is planar if it does not contain a Kuratowski subgraph as a contraction.

Page 19: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Theory of Large Graphs

Large graphs Billion vertices Exact edges present not critical

Theoretical basis for study of large graphs

Maybe theory of graph generation Invariant to small changes in definition Must be able to prove basic theorems

Page 20: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Erdös-Renyi n vertices each of n2 potential edges is present with

independent probability

Nn

pn (1-p)N-n

vertex degreebinomial degree distribution

numberof

vertices

Page 21: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York
Page 22: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Generative models for graphs

Vertices and edges added at each unit of time

Rule to determine where to place edgesUniform probabilityPreferential attachment - gives rise to

power law degree distributions

Page 23: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Vertex degree

Number

of

vertices

Preferential attachment gives rise to the power law degree distribution common in many graphs

Page 24: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Protein interactions

2730 proteins in data base

3602 interactions between proteins

SIZE OF COMPONENT

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 … 1851

NUMBER OF COMPONENTS

48 179 50 25 14 6 4 6 1 1 1 0 0 0 0 1 1

Science 1999 July 30; 285:751-753

Page 25: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Giant Component

1.Create n isolated vertices

2.Add Edges randomly one by one

3.Compute number of connected components

Page 26: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Giant Component1

1000

1 2

998 1

1 2 3 4 5 6 7 8 9 10 11

548 89 28 14 9 5 3 1 1 1 1

Page 27: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Giant Component1 2 3 4 5 6 7 8 9 10 11

548 89 28 14 9 5 3 1 1 1 1

1 2 3 4 5 6 7 8

367 70 24 12 9 3 2 2

9 10 12 13 14 20 55 101

2 2 1 2 2 1 1 1

1 2 3 4 5 6 7 8 9 11 514

252 39 13 6 3 6 2 1 1 1 1

Page 28: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Tubes

Tendrils

TendrilsIn44 million

Out44 million

SCC56 million nodes

Disconnected components

Source: almaden.ibm.com

Page 29: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Phase transitions G(n,p)

Emergence of cycleGiant componentConnected graph

N(p)Emergence of arithmetic sequence

CNF satisfiabilityFix number of variables, increase number of clauses

Page 30: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Access to InformationSMART Technology

document

aardvark 0abacus 0 . . .antitrust 42 . . .CEO 17 . . .microsoft 61 . . .windows 14wine 0wing 0winner 3winter 0 . . .zoo 0zoology 0Zurich 0

Page 31: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Locating relevant documents

Query: Where can I get information on gates?

2,060,000 hits

Bill Gates 593,000

Gates county 177,000

baby gates 170,000

gates of heaven 169,000

automatic gates 83,000

fences and gates 43,000

Boolean gates 19,000

Page 32: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Clustering documents

cluster documentsrefine cluster

microsoftwindowsantitrust

Booleangates

GatesCounty

automaticgates

gates

Bill Gates

Page 33: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Refinement of another type: books

mathem

atics

physics

chem

istry

astronomy

children’s books

textbooks

reference books

general population

Page 34: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

High Dimensions

Intuition from two and three dimensions not valid for high dimension

Volume of cube is one in all dimensions

Volume of sphere goes to zero

Page 35: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

1

Unit sphere

Unit square

2 Dimensions

2 2 11 10.707

2 2 2

Page 36: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

2 2 2 21 1 1 1

12 2 2 2

4 Dimensions

Page 37: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

21

22

dd

d Dimensions

1

Page 38: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Almost all area of the unit cube is outside the unit sphere

Page 39: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

High dimension is fundamentally different from 2

or 3 dimensional space

Page 40: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

High dimensional data is inherently unstable

Given n random points in d dimensional space essentially all n2 distances are equal.

22

1

d

i ii

x yx y

Page 41: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Gaussian distribution

Probability mass concentrated between dotted lines

Page 42: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Gaussian in high dimensions

3

√d

Page 43: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Two Gaussians

3√d

Page 44: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York
Page 45: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Distance between two random points from same Gaussian

Points on thin annulus of radius

Approximate by sphere of radius

Average distance between two points is

(Place one pt at N. Pole other at random. Almost surely second point near the equator.)

d

d

2d

Page 46: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York
Page 47: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

2d

d

d

Page 48: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Expected distance between pts from two Gaussians separated by δ

Rotate axis so first point is perpendicular to screen

All other points close to plane of screen

Page 49: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Expected distance between pts from two Gaussians separated by δ

Page 50: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Expected distance between pts from two Gaussians separated by δ

2 d

d

Page 51: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Expected distance between pts from two Gaussians separated by δ

2 d

d

2 2d

d

Page 52: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Can separate points from two Gaussians if

2

14

2

12 2

2

2 2

2 1 2

12 2

2 2

d

d d

d d

d

d

Page 53: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Dimension reduction

Project points onto subspace containing centers of Gaussians

Reduce dimension from d to k, the number of Gaussians

Page 54: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Centers retain separation Average distance between points reduced

by dk

1 2 1 2, , , , , , ,0, ,0d k

i i

x x x x x x

d x k x

Page 55: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Can separate Gaussians provided

2 2 2k k

> some constant involving k and γ independent of the dimension

Page 56: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Ranking is importantRestaurantsMoviesWeb pages

Multi billion dollar industry

Page 57: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Page rank equals stationary probability of random walk

Page 58: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

restart

15%

Page 59: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

restart

15%

Restart yields strongly connected graph

Page 60: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Suppose you wish to increase the page rank of vertex v

Capture restart

web farm

Capture random walksmall cycles

Page 61: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Capture restart

Buy 20,000 url’s and capture restart

Can be countered by small restart value Small restart increases web rank of page that

captures random walk by small cycles.

Page 62: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

1 0.851

0.66

0.15 restart

0.23 restart

0.661

1.56

Capture random walk

0.1restart

0.66

Page 63: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

1X=1.56

Y=0.66

0.23restart

0.1restart

X=1+0.85*y

Y=0.85*x/2

0.66

0.56 0.66

Page 64: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

If one loop increases Pagerank from 1 to 1.56 why not add many self loops?

Maximum increase in Pagerank is 6.67

Page 65: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Discovery time – time to first reach a vertex by random walk from uniform start

Cannot lower discovery time of any page in S below minimum already in S

S

Page 66: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Why not replace Pagerank by discovery time?

No efficient algorithm for discovery time

DiscoveryTime(v)

remove edges out of v

calculate Pagerank(v) in modified graph

Page 67: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Is there a way for a spammer to raise Pagerank in a way that is not statistically detectable

Page 68: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Information is important

When a customer makes a purchase what else is he likely to buy?

Camera Memory card Batteries Carrying case Etc.

Knowing what a customer is likely to buy is important information.

Page 69: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

How can we extract information from a customer’s visit to a web site?

What web pages were visited?What order?How long?

Page 70: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Detecting trends before they become obvious

Is some category of customer changing their buying habits?Purchases, travel destination, vacations

Is there some new trend in the stock market?

How do we detect changes in a large database over time?

Page 71: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Identifying Changing Patterns in a Large Data Set

How soon can one detect a change in patterns in a large volume of information?

How large must a change be in order to distinguish it from random fluctuations?

Page 72: Future Directions in Computer Science John Hopcroft Cornell University Ithaca, New York

Conclusions

We are in an exciting time of change. Information technology is a big driver of

that change. The computer science theory of the last

thirty years needs to be extended to cover the next thirty years.