Info systems Spring 2013

Preview:

DESCRIPTION

Info systems Spring 2013. Focus of class. Cutting edge topics in information systems Conceptual material, as opposed to coding If you don’t have a good knowledge of relational databases and SQL, now is a good time to study it… A good MySQL book: murach mysql. Class blog & reaching me. - PowerPoint PPT Presentation

Citation preview

INFO SYSTEMS SPRING 2013

Focus of class• Cutting edge topics in information systems• Conceptual material, as opposed to coding• If you don’t have a good knowledge of relational

databases and SQL, now is a good time to study it…• A good MySQL book: murach mysql

Class blog & reaching me• http://rogerking.me

• All assignments and announcements will be on the blog.• I will not send out class email.• Only read blog entries starting on December 30.

• To contact me• buzzking@ymail.com• Do not use my CU mail, please. It is inundated with Spam.• 303 437 7419.• To talk to me, see me after class, or send email for an appointment.• rogerbuzzking on Skype, but I do not stay logged into Skype; you

must call or email first.

Assignments and exams• Two exams – material will be from

• 2 required books• Slides posted on blog, along with corresponding lectures• Handouts posted on blog

• Exam and projects• Midterm – 35 % of final grade

• On March 11• Second exam – 35 % of final grade

• This will be on the last schedule day of class, NOT at the final slot• i.e., it will be on May 1

• Small projects/assignments – 30% of final grade total

The First Assignment• Build a simple application using Mongo• Groups of 1 or 2• Due date to be posted on blog• The application will hold very large video and audio objects• More later…

Suggestions…• Please come to class.• Feel free to suggest references and/or topics.• Speak up at any time – it would be nice to have an

interactive class.

Likely and possible topics• Post relational database systems• Full text processing• “Big” data• Databases and XML• Media and databases (images, video, audio, animation)• Personal information systems (not just databases)• Integrating diverse data sets• ???

Review of relational databases• Schema based technology

• Small schema• Lots of data

• SQL, set-based data manipulation• Multi-language application development• Strict 2 phase ACID transactions• Rigidly structured, identically structured, simply structured

data• User defined types extensions for “objects in the small”

Relational basics• Tables (relations)• Columns (attributes)• Domains• Rows (tuples)• Primary and candidate keys• Foreign keys• Null values• Triggers• SQL as DDL and DML

Relational applications• A lot of data, but in amounts no longer considered huge• Transactions manipulating sets of rows• Complete accuracy is required• Not real time or interactive• If distributed, only a modest number of centrally controlled

servers• If distributed, there is a single, unified schema• Complex objects built by connecting tables• Logic of data is mostly in schema and SQL, not app• Mix of reads and writes, without writes being minimal• SQL programs (run as transactions) compiled once, and

optimized, run many times

Important facts about SQL DBs• Elegant, set theoretic underpinnings• Originally defined with an algebra:

• Projection, join, union/difference/intersection, selection• Closed under this algebra

• SQL is a tuple calculus based on the algebra• Declarative• Far more compact than algebra

• Equally elegant algebra for optimization• Query results are legal view tables• Serious impedance mismatch between o-o languages and

SQL

Syntax of SQLSELECT select_listFROM table_sourceWHERE search_condition

ExampleSELECT invoice_total, payment_total, credit_total, invoice_total - payment_total - credit_total AS balance_dueFROM invoicesWHERE NOT (invoice_total >= 5000 OR NOT invoice_date <= '2011-08-01')

ExampleSELECT 'Paid' AS source, invoice_number, invoice_date, invoice_totalFROM paid_invoicesWHERE invoice_date >= '2011-06-01'ORDER BY invoice_total DESC

Manipulating multiple tables• You can use the Where clause, T.color = R.color• You can use an explicit join

• Equijoin, outerjoin

SELECT vendor_name, invoice_number, invoice_totalFROM vendors LEFT JOIN invoices ON vendors.vendor_id = invoices.vendor_idORDER BY vendor_name

AggregatesAVG([ALL|DISTINCT] expression)SUM([ALL|DISTINCT] expression)MIN([ALL|DISTINCT] expression)MAX([ALL|DISTINCT] expression)COUNT([ALL|DISTINCT] expression)COUNT(*) – counts nulls

ExampleSELECT COUNT(*) AS number_of_invoices, SUM(invoice_total – payment_total – credit_total) AS total_dueFROM invoicesWHERE invoice_total – payment_total – credit_total > 0

Strict 2 phase transactions

• Each SQL program is within a begin and end transaction pair• Each transaction has its own workspace for DB items it is going

to update• Any transactions that overlap in execution time will appear to

have run in some serial order• This is done by transactions requesting read and write locks

(also known as shared and exclusive locks)• Read locks can be shared with other readers• Write locks cannot be shared with readers or writers• All locks held until the end of the transaction• They are released and then the changes that a transaction has

made are moved to the DB

Serializability & transactions• The net effect is that the transactions that overlap in

execution time appear to have run in some serial order• Transactions can be undone by throwing away the local

store (conceptually, at least)• The write period at the end of the transaction must be

atomic• The two phases:

• Request read, write, upgrade locks (and wait on locks) and process

• Release locks and move updates to the DB• There is a notion of “serializability” which means that the

actual schedule of executed steps corresponds to some serial order of running the transactions

Major concept: DB “state”• Changes at the end of each transaction (potentially)• Relies on 2 Phase transactions• We tradeoff throughput in favor of having safe state

changes

Normalization• Start with an entity model• Map to tables• Create PKs and FKs• Create other constraints• Normalize tables• Goals

• Minimize redundant data• Minimize “update anomalies”• Minimize misinterpreted db by documenting more constraints

• Problems• Skinny tables and more joins• Increases impedance mismatch

Dependencies• FD

• We say that ai FD-> aj• Or “ai functionally determines aj”

• MVD->• We say that ai MVD-> aj• Or “ai multivalued determines aj”

• Note: the right side of an FD or an MVD can be a set of attributes

Third normal form• First (1NF) The value stored at the intersection of each

row and column must be a scalar value, and a table must not contain any repeating columns.

• Second (2NF) Every non-key column must depend on the entire primary key.

• Third (3NF) Every non-key column must depend only on the primary key.

Fixed 3NF and 4NF• Boyce-Codd (BCNF) A non-key column can’t be

dependent on another non-key column. • Fourth (4NF) A table must not have more than one

multivalued dependency, where the primary key has a one-to-many relationship to non-key columns.

ExampleCustomer ID Address ZIP18 112 First 80304 17 123 Ash 80303 16 123 Ash 80303

Break into two tables:

Customer ID AddressAddress Zip

ExampleMothers_Phone Fathers_Phone Child_Name

Break into: Mothers_Phone Child_Name 3030000000 Sue 3031111111 SueAnd Fathers_Phone Child_Name 3032222222 Sue

3033333333 Sue

Note: both fields needed for PK

Stored programs• Stored procedures (can be called by an application)• Stored functions (can be called by an SQL program)• Triggers (tied to an operation like INSERT)• Events (tied to a clock)