62
Where do data come from? CS100 - Guest Lecture - Databases and Provenance Boris Glavic 1 DBGroup Illinois Institute of Technology October 25, 2019 1 [email protected] Slide 1 of 44 Boris Glavic - Where do data come from?:

Where do data come from?Queries (cont.) CWID Name Major GPA Phone A1333331 Peter CS 3.5 3125558888 A5552341 Alice‘ CS 4.0 3125557777 A1325324 Elisa Bio 3.2 3125555555 How many students

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

  • Where do data come from?CS100 - Guest Lecture - Databases and Provenance

    Boris Glavic1

    DBGroupIllinois Institute of Technology

    October 25, 2019

    [email protected] 1 of 44 Boris Glavic - Where do data come from?:

  • Outline

    1 Who I am

    2 What are Databases?

    3 Data Provenance

    4 Questions

  • Who I am

    Hi, I am Boris

    Slide 3 of 44 Boris Glavic - Where do data come from?: Who I am

  • Who I am

    Hi, I am Boris

    I am a database guy!

    Slide 4 of 44 Boris Glavic - Where do data come from?: Who I am

  • Who I am

    Hi, I am Boris

    I am a database guy!

    I will tell you:1) Why DBs are important2) Why DBs are interesting

    3) My Research

    Slide 5 of 44 Boris Glavic - Where do data come from?: Who I am

  • Outline

    1 Who I am

    2 What are Databases?

    3 Data Provenance

    4 Questions

  • Where do data come from?

    Where do data come from?

    Slide 7 of 44 Boris Glavic - Where do data come from?: What are Databases?

  • You might have heard . . .

    Slide 8 of 44 Boris Glavic - Where do data come from?: What are Databases?

  • What Are Databases?

    Database systems and databasesdatabase systems manage databasesa database is a structured collection of data

    What do database systems do?1 Provide persistent storage of data2 Efficient declarative access to data: querying3 Protection from data loss under failures4 Safe concurrent access to data

    Slide 9 of 44 Boris Glavic - Where do data come from?: What are Databases?

  • Who Uses Databases?

    Most large software systems use databases!Business Intelligence, e.g., IBM cognosWeb-based systems

    Desktop softwareYou music playerYou email client (most likely at least)

    Every big company uses DBsbanksinsurancegovernment agencies. . .

    Slide 10 of 44 Boris Glavic - Where do data come from?: What are Databases?

  • Who Creates Databases?

    Relational databases is big businessIBM DB2OracleMicrosoft SQLServerTeradataOpen Source Systems: PostgreSQL, MySQL

    Distributed systemsCloud storage and Key-value stores

    Amazon S3, Google Big Table, CassandraBig Data Analytics

    MapReduce, Spark, Flink

    Slide 11 of 44 Boris Glavic - Where do data come from?: What are Databases?

  • Why Are Database Interesting?

    Combination of systems and theoretical researchInteresting systems problems

    Hacking complex and large systemsLow-level optimizations

    exploit modern hardware

    Interesting theoretical foundationsComplexity of answering queriesExpressiveness of query languagesStrong connections to logic

    Slide 12 of 44 Boris Glavic - Where do data come from?: What are Databases?

  • Why Are Databases Interesting (cont.)

    Connections to other CS fieldsDistributed systems

    getting more and more important

    CompilersModelingAI and machine learning

    Data mining

    Operating and File Systems

    Slide 13 of 44 Boris Glavic - Where do data come from?: What are Databases?

  • The Relational Datamodel

    Relations aka Tablesa table consists of columns and rowstables store one type of entity

    e.g., students, bank accounts, loans, . . .each row is one entity

    e.g., one studentcolumns store a particular type of information about an entity

    e.g., name of a student

    Slide 14 of 44 Boris Glavic - Where do data come from?: What are Databases?

  • The Relational Datamodel (cont.)

    Example TablesStudents table

    CWID Name Major GPA PhoneA1333331 Peter CS 3.5 312 555 8888A5552341 Alice‘ CS 4.0 312 555 7777A1325324 Elisa Bio 3.2 312 555 5555

    Grades table

    CWID Course GradeA1333331 CS100 AA5552341 CS425 CA1325324 CS525 AA1325324 CS566 B

    Slide 15 of 44 Boris Glavic - Where do data come from?: What are Databases?

  • Queries

    What do I do with the data in my database?you can interrogate the database system to extract information aboutyour datathis is done using a programming language called SQLSQL is a declarative language

    say what data you want not how to compute it

    Queries return table (closed language)

    Slide 16 of 44 Boris Glavic - Where do data come from?: What are Databases?

  • Queries (cont.)

    CWID Name Major GPA PhoneA1333331 Peter CS 3.5 312 555 8888A5552341 Alice‘ CS 4.0 312 555 7777A1325324 Elisa Bio 3.2 312 555 5555

    How many students are in mydatabase?

    #Students3

    Who has the highest GPA?

    NameAlice

    What are the names of CSstudents?

    NamePeterAlice

    Slide 17 of 44 Boris Glavic - Where do data come from?: What are Databases?

  • Persistence and Recovery

    What if you shutdown your computer?will you loose your precious data?

    What happens when your computer crashes?will you loose your precious data?

    No!the database system stores your data on stable storage (disk)database systems know how recover from failureswhen the database system signals to you that a change you made wasapplied then you will never loose it

    Slide 18 of 44 Boris Glavic - Where do data come from?: What are Databases?

  • Persistence and Recovery

    What if you shutdown your computer?will you loose your precious data?

    What happens when your computer crashes?will you loose your precious data?

    No!the database system stores your data on stable storage (disk)database systems know how recover from failureswhen the database system signals to you that a change you made wasapplied then you will never loose it

    Slide 18 of 44 Boris Glavic - Where do data come from?: What are Databases?

  • Concurrent Access

    Banking ExampleAccount A: $50Account B: $50Transfer $25 from A to BBank gives all accounts 10% interest

    Transfer Money

    ActionSubtract $25 from A

    Add $25 to B

    Give 10% interestAction

    Add %5 interest

    Balances

    Account A Account B$25 $50$27.5 $55$27.5 $80

    We have lost interest!

    Slide 19 of 44 Boris Glavic - Where do data come from?: What are Databases?

  • Concurrent Access (cont.)

    Concurrency Controldatabases manage concurrent operationsprevent bad things from happeningfrom user perspective:

    behaves like your program is the only one running!

    Can we loose interest?Nope!

    Slide 20 of 44 Boris Glavic - Where do data come from?: What are Databases?

  • Outline

    1 Who I am

    2 What are Databases?

    3 Data Provenance

    4 Questions

  • What is Provenance?

    Provenance in Artrecord of ownership of a piece of art

    Arnolfini PortraitThe provenance of the painting begins in 1434when it was dated by van Eyck and presumablyowned by the sitter(s). At some point before1516 it came into the possession of Don Diegode Guevara (d. Brussels 1520), a Spanish careercourtier of the Habsburgs . . .By 1516 he had given the portrait to Margaret ofAustria, . . .

    Slide 22 of 44 Boris Glavic - Where do data come from?: Data Provenance

  • What is Provenance?

    Provenance in DatabasesRecords how data was produced

    which other data was used in the creation processwhich operations were involved in its creation

    For sake of this lecture, only provenance of queries

    Provenance of a query resultSelect one row from the result of a queryWhich input rows were used to compute it?Maybe also: how were these rows combined

    Slide 23 of 44 Boris Glavic - Where do data come from?: Data Provenance

  • Example

    Compute average salary of employees per department

    SELECT dept, avg(salary) AS avgsalFROM empGROUP BY dept

    name salary deptPeter 10 HRBob 20 HRAlice 5 IT

    Slide 24 of 44 Boris Glavic - Where do data come from?: Data Provenance

  • Example

    Compute average salary of employees per department

    dept avgsalHR 15IT 5

    SELECT dept, avg(salary) AS avgsalFROM empGROUP BY dept

    name salary deptPeter 10 HRBob 20 HRAlice 5 IT

    Slide 25 of 44 Boris Glavic - Where do data come from?: Data Provenance

  • Example (cont.)

    The first result row depends on the first two input rows

    dept avgsalHR 15IT 5

    SELECT dept, avg(salary) AS avgsalFROM empGROUP BY dept

    name salary deptPeter 10 HRBob 20 HRAlice 5 IT

    Slide 26 of 44 Boris Glavic - Where do data come from?: Data Provenance

  • Some Observations

    Provenance maps output rows of queries to input rowshere track this per rowcould also track attribute values (higher fidelity)could also track tables (lower fidelity)

    Slide 27 of 44 Boris Glavic - Where do data come from?: Data Provenance

  • Why should I care?

    Use casesDebugging queries and dataAuditingExplainabilityOptimizing DB operationsDetermining trust in data

    Slide 28 of 44 Boris Glavic - Where do data come from?: Data Provenance

  • Queries are Functions

    Functional View of QueryingA query takes as input a database and outputs a tableWe can think about queries as functions from databases to resulttables!

    What then is provenance?Select one of the outputs of the queryWhich inputs were used to compute it?

    Slide 29 of 44 Boris Glavic - Where do data come from?: Data Provenance

  • Recap Functions

    What are functions in math?you already know functions from high school math!

    Examplesf (x) = x

    f (1) = 1f (2) = 2. . .

    f (x) = x2

    f (1) = 1f (2) = 4. . .

    f (x , y) = x + y

    f (1, 2) = 1+ 2 = 3. . .

    Slide 30 of 44 Boris Glavic - Where do data come from?: Data Provenance

  • Recap Functions

    What are functions in math?you already know functions from high school math!

    Examplesf (x) = x

    f (1) = 1

    f (2) = 2. . .

    f (x) = x2

    f (1) = 1f (2) = 4. . .

    f (x , y) = x + y

    f (1, 2) = 1+ 2 = 3. . .

    Slide 30 of 44 Boris Glavic - Where do data come from?: Data Provenance

  • Recap Functions

    What are functions in math?you already know functions from high school math!

    Examplesf (x) = x

    f (1) = 1f (2) = 2

    . . .

    f (x) = x2

    f (1) = 1f (2) = 4. . .

    f (x , y) = x + y

    f (1, 2) = 1+ 2 = 3. . .

    Slide 30 of 44 Boris Glavic - Where do data come from?: Data Provenance

  • Recap Functions

    What are functions in math?you already know functions from high school math!

    Examplesf (x) = x

    f (1) = 1f (2) = 2. . .

    f (x) = x2

    f (1) = 1f (2) = 4. . .

    f (x , y) = x + y

    f (1, 2) = 1+ 2 = 3. . .

    Slide 30 of 44 Boris Glavic - Where do data come from?: Data Provenance

  • Recap Functions

    What are functions in math?you already know functions from high school math!

    Examplesf (x) = x

    f (1) = 1f (2) = 2. . .

    f (x) = x2

    f (1) = 1f (2) = 4. . .

    f (x , y) = x + y

    f (1, 2) = 1+ 2 = 3. . .

    Slide 30 of 44 Boris Glavic - Where do data come from?: Data Provenance

  • Recap Functions

    What are functions in math?you already know functions from high school math!

    Examplesf (x) = x

    f (1) = 1f (2) = 2. . .

    f (x) = x2

    f (1) = 1

    f (2) = 4. . .

    f (x , y) = x + y

    f (1, 2) = 1+ 2 = 3. . .

    Slide 30 of 44 Boris Glavic - Where do data come from?: Data Provenance

  • Recap Functions

    What are functions in math?you already know functions from high school math!

    Examplesf (x) = x

    f (1) = 1f (2) = 2. . .

    f (x) = x2

    f (1) = 1f (2) = 4

    . . .

    f (x , y) = x + y

    f (1, 2) = 1+ 2 = 3. . .

    Slide 30 of 44 Boris Glavic - Where do data come from?: Data Provenance

  • Recap Functions

    What are functions in math?you already know functions from high school math!

    Examplesf (x) = x

    f (1) = 1f (2) = 2. . .

    f (x) = x2

    f (1) = 1f (2) = 4. . .

    f (x , y) = x + y

    f (1, 2) = 1+ 2 = 3. . .

    Slide 30 of 44 Boris Glavic - Where do data come from?: Data Provenance

  • Recap Functions

    What are functions in math?you already know functions from high school math!

    Examplesf (x) = x

    f (1) = 1f (2) = 2. . .

    f (x) = x2

    f (1) = 1f (2) = 4. . .

    f (x , y) = x + y

    f (1, 2) = 1+ 2 = 3. . .

    Slide 30 of 44 Boris Glavic - Where do data come from?: Data Provenance

  • Recap Functions

    What are functions in math?you already know functions from high school math!

    Examplesf (x) = x

    f (1) = 1f (2) = 2. . .

    f (x) = x2

    f (1) = 1f (2) = 4. . .

    f (x , y) = x + y

    f (1, 2) = 1+ 2 = 3

    . . .

    Slide 30 of 44 Boris Glavic - Where do data come from?: Data Provenance

  • Recap Functions

    What are functions in math?you already know functions from high school math!

    Examplesf (x) = x

    f (1) = 1f (2) = 2. . .

    f (x) = x2

    f (1) = 1f (2) = 4. . .

    f (x , y) = x + y

    f (1, 2) = 1+ 2 = 3. . .

    Slide 30 of 44 Boris Glavic - Where do data come from?: Data Provenance

  • Recap Functions (cont.)

    What makes a function a function?Does it have to return numbers?Does have to take numbers as input?

    Counterexamplesf takes names as an input and returns the name converted to lowercase

    f (Peter) = peterf (Bob) = bob

    f takes text as input and returns the numbers of characters in the textf (Bob) = 3f (Alice) = 5

    Slide 31 of 44 Boris Glavic - Where do data come from?: Data Provenance

  • Recap Functions (cont.)

    Definition (Function)Input domain: A set of values IOutput domain: A set of value OMapping: Associate each value from I with one value from O

    Queries as FunctionsInput domain: databasesOutput domains: tables

    Slide 32 of 44 Boris Glavic - Where do data come from?: Data Provenance

  • What is the Provenance of a Function?

    Provenance = Inversion?We want to understand which input was used to generate an outputIn math this is called function inversionThe inverse f −1 of a function f takes an output of f and returns thecorresponding input

    When f (x) = y then f −1(y) = x

    Examples

    if f (x) = 2x then f −1(x) = 0.5xif f (x) = x3 then f −1(x) = 3

    √x

    if f (x) = x2 then f −1(x) = 2√x?

    this does not work (two possible solutions)

    Slide 33 of 44 Boris Glavic - Where do data come from?: Data Provenance

  • What is the Provenance of a Function?

    Provenance = Inversion?We want to understand which input was used to generate an outputIn math this is called function inversionThe inverse f −1 of a function f takes an output of f and returns thecorresponding input

    When f (x) = y then f −1(y) = x

    Examplesif f (x) = 2x then f −1(x) = 0.5x

    if f (x) = x3 then f −1(x) = 3√x

    if f (x) = x2 then f −1(x) = 2√x?

    this does not work (two possible solutions)

    Slide 33 of 44 Boris Glavic - Where do data come from?: Data Provenance

  • What is the Provenance of a Function?

    Provenance = Inversion?We want to understand which input was used to generate an outputIn math this is called function inversionThe inverse f −1 of a function f takes an output of f and returns thecorresponding input

    When f (x) = y then f −1(y) = x

    Examplesif f (x) = 2x then f −1(x) = 0.5x

    if f (x) = x3 then f −1(x) = 3√x

    if f (x) = x2 then f −1(x) = 2√x?

    this does not work (two possible solutions)

    Slide 33 of 44 Boris Glavic - Where do data come from?: Data Provenance

  • What is the Provenance of a Function?

    Provenance = Inversion?We want to understand which input was used to generate an outputIn math this is called function inversionThe inverse f −1 of a function f takes an output of f and returns thecorresponding input

    When f (x) = y then f −1(y) = x

    Examplesif f (x) = 2x then f −1(x) = 0.5xif f (x) = x3 then f −1(x) = 3

    √x

    if f (x) = x2 then f −1(x) = 2√x?

    this does not work (two possible solutions)

    Slide 33 of 44 Boris Glavic - Where do data come from?: Data Provenance

  • What is the Provenance of a Function?

    Provenance = Inversion?We want to understand which input was used to generate an outputIn math this is called function inversionThe inverse f −1 of a function f takes an output of f and returns thecorresponding input

    When f (x) = y then f −1(y) = x

    Examplesif f (x) = 2x then f −1(x) = 0.5xif f (x) = x3 then f −1(x) = 3

    √x

    if f (x) = x2 then f −1(x) = 2√x?

    this does not work (two possible solutions)

    Slide 33 of 44 Boris Glavic - Where do data come from?: Data Provenance

  • What is the Provenance of a Function?

    Provenance = Inversion?We want to understand which input was used to generate an outputIn math this is called function inversionThe inverse f −1 of a function f takes an output of f and returns thecorresponding input

    When f (x) = y then f −1(y) = x

    Examplesif f (x) = 2x then f −1(x) = 0.5xif f (x) = x3 then f −1(x) = 3

    √x

    if f (x) = x2 then f −1(x) = 2√x?

    this does not work (two possible solutions)

    Slide 33 of 44 Boris Glavic - Where do data come from?: Data Provenance

  • What is the Provenance of a Function?

    Provenance = Inversion?We want to understand which input was used to generate an outputIn math this is called function inversionThe inverse f −1 of a function f takes an output of f and returns thecorresponding input

    When f (x) = y then f −1(y) = x

    Examplesif f (x) = 2x then f −1(x) = 0.5xif f (x) = x3 then f −1(x) = 3

    √x

    if f (x) = x2 then f −1(x) = 2√x?

    this does not work (two possible solutions)

    Slide 33 of 44 Boris Glavic - Where do data come from?: Data Provenance

  • What is the Provenance of a Function?

    Provenance = Inversion?We want to understand which input was used to generate an outputIn math this is called function inversionThe inverse f −1 of a function f takes an output of f and returns thecorresponding input

    When f (x) = y then f −1(y) = x

    Examplesif f (x) = 2x then f −1(x) = 0.5xif f (x) = x3 then f −1(x) = 3

    √x

    if f (x) = x2 then f −1(x) = 2√x?

    this does not work (two possible solutions)

    Slide 33 of 44 Boris Glavic - Where do data come from?: Data Provenance

  • From Invertible Functions to Queries

    Queries are typically not invertible!Return the number of students in CS100Let’s say the result is 3 studentsInverse function would have to magically guess who these 3 studentsare!

    Queries operate on tablesWe want more fine-granular information:

    Which rows from the input affected which rows from the output!

    Slide 34 of 44 Boris Glavic - Where do data come from?: Data Provenance

  • Annotation Propagation to the Rescue

    Quadratic functionInput Output

    -2 4-1 10 01 12 4

    Cannot invert thisThe output is not enough to compute provenance!How can we deal with that?

    Slide 35 of 44 Boris Glavic - Where do data come from?: Data Provenance

  • Magic Machine Riddle

    “Magic Machine”

    Slide 36 of 44 Boris Glavic - Where do data come from?: Data Provenance

  • Magic Machine Solution

    “Magic Machine”

    Slide 37 of 44 Boris Glavic - Where do data come from?: Data Provenance

  • Generalizing the solution

    Approachannotate input data with unique identifiers (colors)outputs annotated with the color of the input they are derived from

    Quadratic functionInput Output(-2,�) (4,�)(-1,�) (1,�)(0,�) (0,�)(1,�) (1,�)(2,�) (4,�)

    Slide 38 of 44 Boris Glavic - Where do data come from?: Data Provenance

  • Caveats

    We assumed that the function happily accepts inputs that are pairs ofnumbers and colorsIf inputs and outputs are tables then we need to understand theinternals of the function to know how they are related

    Slide 39 of 44 Boris Glavic - Where do data come from?: Data Provenance

  • Provenance for Queries (how it is actually done)

    Encode annotations by extending tableseach row is extended with extra attributesthese attributes are used to store provenance

    Query instrumentationWe rewrite queries input queries into queries that

    1 create annotations for each input2 propagate these annotations to produced annotated outputs

    Slide 40 of 44 Boris Glavic - Where do data come from?: Data Provenance

  • Outline

    1 Who I am

    2 What are Databases?

    3 Data Provenance

    4 Questions

  • DBGroup - What do we do?

    Distributed and High-performance DatabasesHRDBMS - a scalable database with high per-node performanceHCDF - operating system - database co-design

    Data Integration and Cleaninghow to systematically evaluate cleaning and integration systems

    BartiBench

    Data ProvenanceGProM - a generic provenance middlewareRelevance-based Data Management - optimizing data operationsbased on what data is relevant

    Slide 42 of 44 Boris Glavic - Where do data come from?: Questions

    http://www.cs.iit.edu/~dbgroup/projects/hrdbms.htmlhttp://db.unibas.it/projects/bart/http://www.cs.iit.edu/~dbgroup/projects/ibench.htmlhttp://www.cs.iit.edu/~dbgroup/projects/gprom.htmlhttp://www.cs.iit.edu/~dbgroup/projects/relevance.html

  • DBGroup - What do we do? (cont.)

    Data ScienceWe are data science enablers!Vizier - a data-centric notebook platform with uncertainty tracking

    Slide 43 of 44 Boris Glavic - Where do data come from?: Questions

    https://vizierdb.info/

  • Questions?

    IIT DBGroupstudents: 7 Ph.D., 1 Master, 1 Undergraduateresearch group: http://www.cs.iit.edu/~dbgroup/personal page:http://www.cs.iit.edu/~dbgroup/members/bglavic.htmlgithub: https://github.com/IITDBGroup

    Slide 44 of 44 Boris Glavic - Where do data come from?: Questions

    http://www.cs.iit.edu/~dbgroup/http://www.cs.iit.edu/~dbgroup/members/bglavic.htmlhttps://github.com/IITDBGroup

    Who I amWhat are Databases?Data ProvenanceQuestionsAppendix