CS345: Advanced Databases Chris Ré

Page 1: CS345: Advanced Databases

CS345: Advanced Databases

Chris Ré

Page 2: CS345: Advanced Databases

What this course is

Database fundamentals:
– Theory
– Old Crusty, Good SQL stuff
– No/New/Not-Yet SQL

New stuff: Knowledge bases & Inference

Databases is a strange and beautiful area: Theory, Algorithms, Systems, & Applications

It’s a bit scattered, and I love it.

Page 3: CS345: Advanced Databases

A Brief, Biased Database History

Page 4: CS345: Advanced Databases

Three Turing Award Winners

Charles Bachman

Edgar Codd

Jim Gray

Seminal contributions made in Industry

Page 5: CS345: Advanced Databases

The Birth of the Relational Model (1970)

database: a handful of relations (tables) with fixed schema.

WorksIn(Employee,Dept)

Query with small # of operations: Selection (filter), Projection, Join, Union.

Basically, an operational finite model theory.

Page 6: CS345: Advanced Databases

Data and Query Model

Data:
$R(A,B) = \{ (a_1,b_1), \dots, (a_n,b_n) \}$
$S(B,C,D) = \{ (b'_1,c_1,d_1), \dots, (b'_m,c_m,d_m) \}$

Projection: $\pi_A(R) = \{ a : \exists b.\ (a,b) \in R \}$

Selection: $\sigma_F(R) = \{ (a,b) : (a,b) \in R \text{ and } F((a,b)) \}$, where $F : D(R) \to \{\text{True}, \text{False}\}$

Join: $\mathrm{Join}(R,S) = \{ (a,b,c,d) : (a,b) \in R \text{ and } (b,c,d) \in S \}$
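To make the three operators on this slide concrete, here is a minimal sketch that models relations as Python sets of tuples. The relation names R and S follow the slide; the function names (`project_a`, `select`, `join`) and the sample data are illustrative assumptions, not part of the course material, and a real system would not evaluate a join with a nested loop over both relations.

```python
# Relations are modeled as sets of tuples, matching the set notation on the slide.
# Function names and sample data are hypothetical, chosen only for illustration.

def project_a(r):
    """Projection onto A: { a : exists b. (a, b) in R }."""
    return {a for (a, b) in r}

def select(r, f):
    """Selection: keep the tuples of R for which the predicate F is True."""
    return {t for t in r if f(t)}

def join(r, s):
    """Join on the shared attribute B:
    { (a, b, c, d) : (a, b) in R and (b, c, d) in S }."""
    return {(a, b, c, d) for (a, b) in r for (b2, c, d) in s if b == b2}

R = {(1, "x"), (2, "y"), (3, "x")}
S = {("x", 10, 100), ("z", 20, 200)}

print(sorted(project_a(R)))
print(sorted(select(R, lambda t: t[0] > 1)))
print(sorted(join(R, S)))
```

Each function returns a new set, so the operators compose: for example, `project_a(select(R, f))` is itself a relation that can be queried again.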

Page 7: CS345: Advanced Databases

Key idea of the Relational Model

Declarative: the user says what they want, not how to get it.
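A small sketch of the contrast, using Python's built-in `sqlite3` module and the WorksIn(Employee, Dept) relation from the earlier slide. The sample rows and variable names are assumptions made up for illustration. The first query states only the condition and leaves the evaluation strategy to the engine; the second spells out the scan and the filter by hand.

```python
import sqlite3

# In-memory database holding the WorksIn(Employee, Dept) relation.
# The rows below are hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE WorksIn (Employee TEXT, Dept TEXT)")
conn.executemany(
    "INSERT INTO WorksIn VALUES (?, ?)",
    [("alice", "db"), ("bob", "systems"), ("carol", "db")],
)

# Declarative: say what is wanted; the engine decides how to find it.
declarative = sorted(
    row[0] for row in conn.execute("SELECT Employee FROM WorksIn WHERE Dept = 'db'")
)

# Imperative: fetch every row and apply the filter explicitly.
imperative = []
for employee, dept in conn.execute("SELECT Employee, Dept FROM WorksIn"):
    if dept == "db":
        imperative.append(employee)
imperative.sort()

print(declarative)
print(declarative == imperative)
```

Both versions give the same answer here, but only the declarative one leaves the engine free to change its plan (for instance, to use an index on Dept if one were added) without the user rewriting anything.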

Page 8: CS345: Advanced Databases

Key question: Can one implement the Relational Model efficiently?

Page 9: CS345: Advanced Databases

System R

In 1974, System R shows it is possible to get good performance.

1st implementation of SQL.

IBM didn’t push it, worried about IMS cannibalization, but…

Pat Selinger

Page 10: CS345: Advanced Databases

Others Come on to the Scene…

Larry Ellison hears about IBM’s Research prototype and founds a company….

Page 11: CS345: Advanced Databases

Fast Forward to Today

The relational model is the dominant model of data.

Page 12: CS345: Advanced Databases

Takeaways about Database Research

Started with mathematical elegance and with close ties to industry.

Improve runtime performance as a proxy to increase programmer productivity.

Page 13: CS345: Advanced Databases

The Big Ideas

Page 14: CS345: Advanced Databases

Independence

Declarative languages can improve productivity
– Different team members work independently
  • Backend, Storage, UI, BI, etc.
– Transactional model.
– Challenge: Support efficient concurrent access?

Page 15: CS345: Advanced Databases

Performance

Parallel programming is hard; SQL is most popular parallel programming language.
– How do you deal with asymmetry of memory hierarchy (Disk/MM/Cache)?
– How do you structure parallel optimization?
– Concurrency?

Page 16: CS345: Advanced Databases

Manageability

Systems live over time, and the system should automate many routine tasks.
– Maintain derived data products (views)
– Self-monitoring systems (autonomic)

Page 17: CS345: Advanced Databases

Course Topics

Page 18: CS345: Advanced Databases

A user says what they want—not how to get it.

Page 19: CS345: Advanced Databases

Topic 1: QP Fundamentals

Query Processing Fundamentals
1. Empirical Join evaluation from 70s!
2. System R: The Archetype (Cardinality)
3. Formal Query Languages
4. Acyclic Query Evaluation (Structure)
5. Worst-case Optimal Join Algorithms (S + C)

This will be the most formal part of the course.

Page 20: CS345: Advanced Databases

Analyzing your data before it was big (when it was just very large…)

Page 21: CS345: Advanced Databases

Topic 2: OLAP-Style Analytics

Building new and old data systems:
1. Theory of Materialized View
2. Gamma (Parallel DBs)
3. MapReduce & the Rise of NoSQL (2000s)
4. NewSQL & Optimizing Joins on MR (theory)
5. Fagin’s Algorithm (theory)
6. Statistical Analytic Systems

Page 22: CS345: Advanced Databases

My biased view of the future…

Page 23: CS345: Advanced Databases

Topic 3: Next-Generation Systems

1. Information Extraction
2. Probabilistic Query Evaluation (Theory)
3. Scalable Inference
4. Knowledge Bases

Page 24: CS345: Advanced Databases

Transactions.

Page 25: CS345: Advanced Databases

Topic 4: OLTP Style

Transactional Systems
1. The rise of Key-Value Stores
2. The case for determinism
3. CALM & CAPs
4. The Return of Main Memory DBs.
5. Spanner, F1, and Data Centers

Page 26: CS345: Advanced Databases

Course Logistics

Page 27: CS345: Advanced Databases

Grading
• Course Project (More next)
– Do something interesting with data.
– Teams OK
– Form teams soon and email me by Jan 12.

• Midterm Exam

Page 28: CS345: Advanced Databases

Projects in each topic
1. Knowledgebase Construction
– Pick a domain and build a KBC system for it with DeepDive

2. Join Algorithms
– Certificate versions (see me)
– MapReduce? GraphLab? Spark?

3. Analytics Systems

4. Transactional Systems.

You are free to choose other projects

Page 29: CS345: Advanced Databases

Datasets
• Snapshot of the web marked up with NLP tools and structured data (KBP and KBA challenges)

• 500k+ docs used by PaleoBiologists and structured data.

• We can mark up even more stuff.

• Benchmark ML, graphs if you want to work on analytics or join evaluation.

Page 30: CS345: Advanced Databases

Wednesday

• Wednesday we begin the ancient art of join evaluation. All who pass this way must pass through this ancient topic!

• Read: Shapiro.
– not too carefully, we’ll go through details