Topics in Algorithms and Data Science

Introduction

Omid Etesami

Early Computer Science (according to John Hopcroft)

• CS in 1960’s: emphasis on programming languages, compilers, operating systems

• CS theory in 1960’s: finite automata, regular expressions, context free languages, computability

• CS in 1970’s: making computers more useful for well-defined tasks

• CS theory in 1970’s: important addition of algorithms

Modern CS

• More focus on applications

• Merging of computing and communication

• More collected data in natural sciences, commerce, …

• Web, social networks

• Requires understanding data

Modern CS theory

Not only discrete mathematics but also probability, statistics, numerical methods

Textbook for the course

• Foundations of Data Science (draft of a new book as of May 2015) by Avrim Blum, John Hopcroft, Ravi Kannan

• We will cover the first four chapters.

• Available online: http://www.cs.cornell.edu/jeh/bookMay2015.pdf

Outline of the course

• Random graphs

• High-dimensional geometry

• Singular value decomposition

Random Graphs

• Models for web and social networks

• Simplest model: Erdős-Rényi random graph model

• Understanding global phenomena, such as the giant connected component, in terms of local choices

• Other models of random graphs: non-uniform models, growth models with or without preferential attachment, small-world graphs
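
The link between local choices and a global phenomenon can be explored with a small simulation. The following is a minimal sketch, assuming the G(n, p) form of the Erdős-Rényi model, in which each of the possible edges is included independently with probability p. The function name `largest_component_fraction` and the choice p = c/n are illustrative assumptions, not part of the slides.

```python
import random
from collections import Counter

def largest_component_fraction(n, p, seed=0):
    """Sample G(n, p) and return the fraction of vertices
    that lie in its largest connected component."""
    rng = random.Random(seed)
    parent = list(range(n))  # union-find forest over the vertices

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    # Local choice: each pair of vertices becomes an edge independently.
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                ru, rv = find(u), find(v)
                if ru != rv:
                    parent[ru] = rv

    sizes = Counter(find(v) for v in range(n))
    return max(sizes.values()) / n

if __name__ == "__main__":
    n = 1000
    for c in (0.5, 1.0, 2.0):
        # Expected degree is roughly c when p = c / n.
        print(c, largest_component_fraction(n, c / n))
```

When the expected degree c is well below 1, the largest component should cover only a small fraction of the vertices. When c is well above 1, a single component should cover a constant fraction of them.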

Random graphs (continued)

• Random constraint satisfaction problems (like 3-SAT)

• Non-uniform random graphs and their relation to modern coding theory (like fountain codes)

[Figure: 3-SAT solution space (height represents the number of unsatisfied constraints)]

High-dimensional geometry

• Represent data with vectors of many components (e.g. in Search or Machine Learning)

• Intuition from two or three dimensions can be misleading in high dimensions

[Figures: sphere in 3 dimensions; stereographic projection of a sphere in 4 dimensions]
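
One concrete way to see low-dimensional intuition break down is the volume of the unit ball. The sketch below evaluates the standard closed form V(d) = π^(d/2) / Γ(d/2 + 1). The function name `unit_ball_volume` is an illustrative choice.

```python
import math

def unit_ball_volume(d):
    """Volume of the radius-1 ball in d dimensions,
    using V(d) = pi^(d/2) / Gamma(d/2 + 1)."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)

if __name__ == "__main__":
    for d in (1, 2, 3, 5, 10, 20, 50, 100):
        print(d, unit_ball_volume(d))
```

The volume grows for the first few dimensions and then shrinks toward zero, even though the ball always has radius 1.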

Singular value decomposition (SVD)

• To deal with high-dimensional data, we need matrix algebra and matrix algorithms

• Singular value decomposition is an important tool

• Applications of SVD:

  • Principal Component Analysis

  • Clustering statistical mixtures of Gaussian probability densities

  • Discrete optimization like Max-CUT
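
To make the SVD slightly more concrete, here is a minimal sketch that approximates the largest singular value of a matrix and its right singular vector by power iteration on AᵀA. It assumes a small dense matrix stored as a list of rows, with at least one nonzero entry. The name `top_singular_pair` and the iteration count are illustrative assumptions. For centered data, this vector is the direction of the first principal component.

```python
import math
import random

def top_singular_pair(A, iters=200, seed=0):
    """Approximate the largest singular value of A and the matching
    right singular vector via power iteration on A^T A.
    A is a non-empty list of equal-length rows and must not be all zeros."""
    rng = random.Random(seed)
    n_rows, n_cols = len(A), len(A[0])
    v = [rng.gauss(0, 1) for _ in range(n_cols)]  # random starting direction

    def apply_A(x):
        return [sum(a * xi for a, xi in zip(row, x)) for row in A]

    for _ in range(iters):
        Av = apply_A(v)
        # w = A^T (A v)
        w = [sum(A[i][j] * Av[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]

    sigma = math.sqrt(sum(x * x for x in apply_A(v)))  # |A v| with |v| = 1
    return sigma, v

if __name__ == "__main__":
    A = [[3.0, 0.0], [0.0, 1.0]]
    print(top_singular_pair(A))
```

Power iteration is only one of several ways to compute singular vectors. It is used here because it needs nothing beyond matrix-vector products.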

Grading

• Around 7 points for homework and quizzes.

• Around 5 points for midterm

• Around 8 points for final

• Additional points for presentation and project

Homework

• Late homework is NOT accepted. Prepare early.

• You can work on homework together, but you should acknowledge your collaborators and your write-up should be your own. (If you do not acknowledge them, you can receive negative points.)

• If you use the internet, you should acknowledge your sources.

Prerequisites

• Probability, including problem-solving skills and basic inequalities

• Linear algebra, including eigenvalues and eigenvectors

• Asymptotic analysis of algorithms

• Basic discrete math, basic calculus

• Most importantly, mathematical maturity, such as being able to prove things rigorously.

A few teasers (reflecting the background you need for the course)

Sex bias in graduate admissions

• 8442 men applied (44% admitted)

• 4321 women applied (35% admitted)

• In each department: (admitted women / women who applied) >= (admitted men / men who applied)

Can this happen?
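
A small numerical experiment suggests that it can. This effect is usually called Simpson's paradox. The sketch below uses two hypothetical departments with invented numbers. They are not the admissions data quoted above and were chosen only so that women have the higher admission rate in each department while men have the higher rate overall.

```python
from fractions import Fraction

# Hypothetical data, invented for illustration: (admitted, applied).
departments = {
    "A": {"men": (80, 100), "women": (9, 10)},   # easy department, mostly men apply
    "B": {"men": (2, 10), "women": (30, 100)},   # hard department, mostly women apply
}

def rate(admitted, applied):
    """Exact admission rate as a fraction."""
    return Fraction(admitted, applied)

def overall_rate(group):
    """Admission rate of a group pooled over all departments."""
    admitted = sum(d[group][0] for d in departments.values())
    applied = sum(d[group][1] for d in departments.values())
    return Fraction(admitted, applied)

if __name__ == "__main__":
    for name, d in departments.items():
        print(name, "men", float(rate(*d["men"])), "women", float(rate(*d["women"])))
    print("overall men", float(overall_rate("men")))
    print("overall women", float(overall_rate("women")))
```

The reversal comes from the weights: in this made-up example most women apply to the department that admits few applicants.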

Generating a random permutation

for i = 1 to n
    j = uniformly random integer between 1 and n (inclusive)
    swap(x[i], x[j])

How can you prove the above algorithm does not generate a uniformly random permutation (for all n >= 3)?
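
One possible line of attack is counting. The loop makes n independent choices of j, so it has n^n equally likely execution paths, while there are n! permutations. A uniform output would need n! to divide n^n, which fails for n >= 3 because n - 1 divides n! but shares no factor with n. The sketch below checks this by brute force for small n. It uses 0-based indices instead of the 1-based indices of the pseudocode, and the name `naive_shuffle_counts` is an illustrative choice.

```python
from collections import Counter
from itertools import product

def naive_shuffle_counts(n):
    """Run the shuffle above on every possible sequence of random choices
    and count how often each final permutation appears."""
    counts = Counter()
    for choices in product(range(n), repeat=n):  # all n^n choice sequences
        x = list(range(n))
        for i, j in enumerate(choices):
            x[i], x[j] = x[j], x[i]
        counts[tuple(x)] += 1
    return counts

if __name__ == "__main__":
    counts = naive_shuffle_counts(3)
    for perm, c in sorted(counts.items()):
        print(perm, c)
```

If the algorithm were uniform, every permutation would appear the same number of times in this table.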

Matrix rank

Why is the number of linearly independent rows exactly equal to the number of linearly independent columns?
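
Before attempting a proof, it can help to check the claim on an example. The sketch below computes the rank by Gaussian elimination with exact fractions and applies it to a matrix and to its transpose. The rank of the transpose is the maximum number of linearly independent columns of the original matrix. The matrix `A` and the helper names are illustrative assumptions, and an agreement on examples is of course not a proof.

```python
from fractions import Fraction

def rank(matrix):
    """Number of linearly independent rows, via Gaussian elimination
    with exact rational arithmetic."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    n_cols = len(rows[0]) if rows else 0
    r = 0  # number of pivots found so far
    for col in range(n_cols):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

A = [[1, 2, 3, 4],
     [2, 4, 6, 8],
     [1, 0, 1, 0]]

if __name__ == "__main__":
    print(rank(A), rank(transpose(A)))
```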

Volume of the sphere

Can you work out the volume of the sphere in 3 dimensions?
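
One possible route, sketched here for a ball of radius r, is to slice it into disks perpendicular to the x-axis. The disk at position x has radius sqrt(r^2 - x^2), so its area is π(r^2 - x^2), and integrating the areas gives the volume. The symbols r and x are notation introduced for this sketch.

```latex
V = \int_{-r}^{r} \pi \left( r^2 - x^2 \right) \, dx
  = \pi \left[ r^2 x - \frac{x^3}{3} \right]_{-r}^{r}
  = \frac{4}{3} \pi r^3
```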