Lecture 1: Overview of CSCI 585

Prof. Shahram Ghandeharizadeh

Director of USC Database Lab (http://dblab.usc.edu)

Computer Science Department

University of Southern California


Logistics

Collection of technical papers: ACM/IEEE/Springer digital libraries. URLs work from USC machines.

Prerequisites for the course: CSCI 485: Introduction to File and Database Management, and knowledge of the C++ programming language.

Extensive use of Blackboard for homework and project submissions. Make sure to have access to: http://den.usc.edu

PowerPoint presentations are also available from http://dblab.usc.edu

Pre-Req

585 assumes you know the following:

Transactions and their ACID properties.

Concurrency control protocols such as locking and time-stamp based protocols.

Crash recovery techniques such as logging and shadow paging.

Physical characteristics of magnetic disks.

SQL

Relational algebra operators

ER data modeling

Alternative normal forms.

Visit http://dblab.usc.edu/csci485 for an overview of this material.

Instructor Details

Dr. Shahram Ghandeharizadeh

Office: SAL 208

E-mail: [email protected]

Phone: 213-740-4781

Office Hours:

Tuesday: 12:30 to 2 pm

Thursday: 4:30 to 5:30 pm

Class URL: http://dblab.usc.edu/csci585

TA

Shahin Shayandeh

Office: SAL 200C

E-mail: [email protected]

Office Hours:

Monday: 3:30 to 5 pm

Thursday: 12:30 to 2 pm

Outline

Motivation for DBMS

An outline for the course material

Grading: Assignments and projects

Database Management Systems (DBMS)

Used almost on a daily basis for either individual or business use.

Relational database vendors were one of the fastest growing sectors during the .COM boom!

DATABASE & DBMS

Database: An integrated collection of data, usually stored on secondary storage, typically describing the activities of one or more related organizations.

Database management system (DBMS): A collection of software/programs designed to assist in maintaining and utilizing large collections of data.

BEFORE DBMS

[Figure: User 1 and User 2, each with their own application programs and their own data.]

AFTER DBMS

[Figure: User 1 and User 2, application programs, a DBMS, and data managed by the DBMS.]

WHY A DBMS?

1. Reduced application development time

2. Data independence: application programs do not depend on data representation and storage details

3. Data sharing: data is better utilized (discovered and reused), and redundancy of data is minimized

4. Data integrity and consistency: one may enforce consistency constraints on data, e.g., number of seats sold ≤ number of seats on the plane × 1.1

5. Centralized control: the DBA tunes the database to balance users' needs

6. Security: mechanisms to prevent unauthorized access. These mechanisms are based on content instead of a file-oriented approach.

7. Concurrency control: avoids undesirable race conditions that arise with simultaneous access/updates to data

8. Crash recovery: ensures the integrity of data in the presence of failures
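
The seat-sale constraint in item 4 can be illustrated with a small sketch. It uses Python's built-in sqlite3 module as a stand-in DBMS; the `flight` table, its columns, and the flight number are hypothetical names chosen for illustration, not part of the course material. The point is only that the constraint is declared once in the schema and the DBMS enforces it for every application.

```python
import sqlite3

# Taken from the slide's example: seats sold <= seats on the plane x 1.1
OVERBOOKING_FACTOR = 1.1


def make_flight_db():
    """Create an in-memory database with a hypothetical `flight` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        f"""
        CREATE TABLE flight (
            flight_no  TEXT PRIMARY KEY,
            seats      INTEGER NOT NULL,
            seats_sold INTEGER NOT NULL DEFAULT 0,
            CHECK (seats_sold <= seats * {OVERBOOKING_FACTOR})
        )
        """
    )
    conn.execute("INSERT INTO flight (flight_no, seats) VALUES ('XY100', 100)")
    conn.commit()
    return conn


def sell_seats(conn, flight_no, count):
    """Try to sell `count` seats; return False if the DBMS rejects the update."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "UPDATE flight SET seats_sold = seats_sold + ? WHERE flight_no = ?",
                (count, flight_no),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint failed, so the update was not applied.
        return False


def seats_sold(conn, flight_no):
    """Return the number of seats currently sold on the given flight."""
    row = conn.execute(
        "SELECT seats_sold FROM flight WHERE flight_no = ?", (flight_no,)
    ).fetchone()
    return row[0]
```

With a 100-seat plane, selling 110 seats is accepted and any further sale is rejected, without the application program checking the limit itself.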

DBMS ARCHITECTURE

[Figure: User 1 through User n, conceptual schema, physical data, DB, and DBMS.]

An Emerging Phenomena

[Figure: User 1, User 2, application programs, a DBMS, and data managed by the DBMS.]

Example

F. Chang et al. Bigtable: A Distributed Storage System for Structured Data. In OSDI 2006. Last paragraph of the paper:

“Finally, we have found that there are significant advantages to building our own storage solution at Google. We have gotten a substantial amount of flexibility from designing our own data model for Bigtable. In addition, our control over Bigtable’s implementation, and the other Google infrastructure upon which Bigtable depends, means that we can remove bottlenecks and inefficiencies as they arise.”

WHAT HAS CHANGED?

1. Relational database technology is now more than a quarter of a century old.

2. While concepts such as concurrency control are extremely valuable, the performance loss attributed to their use is not justified for some non-banking applications.

   E.g., a social networking site is not a banking application.

3. RDBMS vendors increased functionality for their own niche, increasing complexity. Each application used a decreasing fraction of the provided features.

   A deployment requires a specialist, trained in database administration, for maintenance.

4. Availability of data is paramount.

   Cost of downtime is estimated at thousands of dollars per minute.

5. SQL is too general and cumbersome to use with some applications.

6. Storage has become larger and more economical.

   10 cents per Gigabyte of magnetic disk storage.

   Flash as a new layer in the storage hierarchy: DRAM, Flash, Disk.

   7 to 8 dollars per Gigabyte of DRAM.

   A bank’s data (TPC benchmark) becomes main memory resident!
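
The price points in item 6 lend themselves to a quick back-of-the-envelope calculation. The sketch below uses only the per-Gigabyte figures quoted on the slide; the data set size passed in is a hypothetical input, since the slide does not state how large the bank's data is.

```python
# Per-Gigabyte prices quoted on the slide.
DISK_DOLLARS_PER_GB = 0.10
DRAM_DOLLARS_PER_GB_LOW = 7.0
DRAM_DOLLARS_PER_GB_HIGH = 8.0


def storage_cost(size_gb):
    """Return the cost in dollars of holding `size_gb` Gigabytes on each medium."""
    return {
        "disk": size_gb * DISK_DOLLARS_PER_GB,
        "dram_low": size_gb * DRAM_DOLLARS_PER_GB_LOW,
        "dram_high": size_gb * DRAM_DOLLARS_PER_GB_HIGH,
    }


def dram_to_disk_ratio():
    """Return the (low, high) price ratio of DRAM to magnetic disk."""
    return (
        DRAM_DOLLARS_PER_GB_LOW / DISK_DOLLARS_PER_GB,
        DRAM_DOLLARS_PER_GB_HIGH / DISK_DOLLARS_PER_GB,
    )


if __name__ == "__main__":
    # Hypothetical 100 GB data set, chosen only to make the arithmetic concrete.
    for medium, dollars in storage_cost(100).items():
        print(f"{medium}: ${dollars:,.2f}")
```

At these prices DRAM costs roughly 70 to 80 times as much per Gigabyte as disk, yet a data set in the range of hundreds of Gigabytes still costs only a few thousand dollars to keep entirely in memory.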

Cross-roads

Since 1998, database researchers have been aware of the limitations:

More modular architecture based on simple, component-based building blocks.

One architecture will not satisfy all applications.

585 Syllabus

Storage and Storage Management:

M. Seltzer. Beyond Relational Databases. Communications of the ACM, July 2008, Vol. 51, No. 7.

D. A. Patterson, G. Gibson, and R. H. Katz. A Case for Redundant Arrays of Inexpensive Disks (RAID). ACM SIGMOD, 1988.

G. Graefe. The five-minute rule twenty years later, and how flash memory changes the rules. Proceedings of the Third International Workshop on Data Management on New Hardware (DaMoN), 2007.

Flash as a new storage medium.

2-3 weeks.

Start homework 1 using Berkeley DB.

585 Syllabus (Cont…)

Parallel DBMS:

D. DeWitt et al. The Gamma Database Machine Project. IEEE Transactions on Knowledge and Data Engineering, Vol. 2, 1990.

F. Chang et al. Bigtable: A Distributed Storage System for Structured Data. In OSDI 2006.

J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Communications of the ACM, Vol. 51, No. 1, 2008.

Data intensive applications can be parallelized effectively.

2 Weeks.
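
The remark that data intensive applications can be parallelized effectively can be made concrete with a toy word count in the map/shuffle/reduce style. This is a minimal sketch, not the system described in the MapReduce paper: an in-process thread pool stands in for a cluster, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical. What it shows is that each input chunk is mapped independently and each key is reduced independently, which is what makes the work easy to spread across machines.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values emitted for one key."""
    return key, sum(values)


def word_count(chunks, workers=4):
    """Count words across chunks; map and reduce tasks run on a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each chunk is mapped independently of the others.
        mapped = list(pool.map(map_phase, chunks))
        groups = shuffle(mapped)
        # Each key is reduced independently of the others.
        reduced = pool.map(lambda item: reduce_phase(*item), groups.items())
        return dict(reduced)
```

For example, `word_count(["a b a", "b c"])` counts two occurrences each of "a" and "b" and one of "c".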

585 Syllabus (Cont…)

Spatial Index Structures:

A. Guttman. R-Trees: A Dynamic Index Structure for Spatial Searching. In ACM SIGMOD 1984.

P. E. O’Neil and D. Quass. Improved Query Performance with Variant Indexes. In ACM SIGMOD 1997.

No substitute for smart data indexing techniques! Brute-force approaches are not acceptable.

2 Weeks.

Initiate your project to build relational query processing software using Berkeley DB.
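
The contrast between brute force and indexing can be sketched in one dimension. The example below is deliberately much simpler than the R-tree and variant indexes in the readings: it is a sorted array searched with Python's bisect module, and the names `scan_range` and `SortedIndex` are hypothetical. The brute-force scan examines every value for every query, while the index locates both ends of the range with binary search.

```python
import bisect


def scan_range(values, low, high):
    """Brute force: examine every value and keep those in [low, high]."""
    return sorted(v for v in values if low <= v <= high)


class SortedIndex:
    """A one-dimensional index: keys kept in sorted order for binary search."""

    def __init__(self, values):
        self._keys = sorted(values)

    def range(self, low, high):
        """Return all keys in [low, high] using two binary searches."""
        start = bisect.bisect_left(self._keys, low)
        end = bisect.bisect_right(self._keys, high)
        return self._keys[start:end]
```

Both approaches return the same answer; the difference is that the index pays a one-time sorting cost and then answers each range query in logarithmic time plus the size of the result, instead of touching every value.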

585 Syllabus (Cont…)

Query optimizations:

P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, T. G. Price. Access Path Selection in a Relational Database Management System. In ACM SIGMOD 1979.

S. Chaudhuri. An Overview of Query Optimization in Relational Systems. PODS 1998.

Techniques to select index structures. Focus is on your project.

2 Weeks.

585 Syllabus (Cont…)

Decision Support:

R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules in Large Databases. In VLDB 1994.

J. Gray et al. Data Cube: A Relational Aggregation Operator Generalizing Group-by, Cross-Tab and SubTotals. Data Mining and Knowledge Discovery 1(1), 1997.

C. Stolte, D. Tang, and P. Hanrahan. Polaris: A System for Query, Analysis, and Visualization of Multidimensional Databases. Communications of the ACM, Vol. 51, No. 11, November 2008.

Discovery of trends in large data sets and their visualization.

2-3 Weeks.

585 Syllabus (Cont…)

Main Memory Databases:

P. A. Boncz, M. L. Kersten, and S. Manegold. Breaking the Memory Wall in MonetDB. Communications of the ACM, December 2008, Vol. 51, No. 12.

Use L2 cache of a CPU!

2 Weeks.

585 Syllabus (Cont…)

Cache Management:

S. Ghandeharizadeh and S. Shayandeh. Greedy Cache Management Techniques for Mobile Devices. In First International IEEE Workshop on Ambient Intelligence, Media and Sensing. April 2007.

Effective support for variable sized objects.

Time permitting, 1 to 2 weeks.

Grading

Midterm 1: 35%

Midterm 2: 35%

Assignments: 10%

Project: 20%

For next lecture

Read: M. Seltzer. Beyond Relational Databases. Communications of the ACM, July 2008, Vol. 51, No. 7.