class 2 data systems 101 - Data Systems...

Preview:

Citation preview

data systems 101prof. Stratos Idreos

HTTP://DASLAB.SEAS.HARVARD.EDU/CLASSES/CS265/

class 2

CS265, Spring 2017 Stratos Idreos /552

big data V’s (it is not about size only)

volume velocity variety veracity

actually none of that is really new…

new: our ability to gather and store machine generated data broad understanding that we cannot just manually get value out of data

CS265, Spring 2017 Stratos Idreos /553

a data system stores data and provides access to data

& makes knowledge generation easy

data system analysisdata

knowledgeX

CS265, Spring 2017 Stratos Idreos /554

database kernel

data data data

algo

rithm

s/op

erat

ors

applications

sql

disk

memory

cpu

CS265, Spring 2017 Stratos Idreos /555

declarative interface ask ‘’what’’ you want

data system

the system decides “how” to best store and access data

CS265, Spring 2017 Stratos Idreos /556

SQL noSQL

as apps become more complex

as apps need to be more scalable

complex legacy tuning

expensive …

simple clean

just enough …

newSQL

hadoop spark flink

goal: understand why, how and what is next

Gartner: DBMS Market = ~36Billion

noSQL/Hadoop =~700Million SQL = ~rest

http://www.infoworld.com/article/3056637/database/nosql-chips-away-at-oracle-ibm-and-microsoft-dominance.html

CS265, Spring 2017 Stratos Idreos /557

more data

more applications

more h/w

the need for new systems

CS265, Spring 2017 Stratos Idreos /558

soon everyone will need to be a “data scientist”hmm,

my data is too big :(

1

new applications/requirements

how far away are we from a future where a data system sits in

the critical path of everything we do?

CS265, Spring 2017 Stratos Idreos /559

not always sure what we are looking for (until we find it)

data exploration2

CS265, Spring 2017 Stratos Idreos /5510

years

daily

dat

a[IBMbigdata]

years

daily

dat

a

[StratosGuess]

data

* ski

lls

data system design, set-up, tune, use

3

CS265, Spring 2017 Stratos Idreos /5511

system where db runs

sql

caches

cpu - cpu - cpu - cpu

memory

disk - disk - disk - disk

cpu registers

smal

ler/f

aste

r

+ flash

+ non volatile memory

memory hierarchy

CS265, Spring 2017 Stratos Idreos /5512

registers

on chip cache2x

on board cache10x

memory100x

disk100Kx

Jim Gray, IBM, Tandem, DEC, Microsoft ACM Turing award ACM SIGMOD Edgar F. Codd Innovations award

Pluto2 years

New York1.5 hours

this building10 min

this room1 min

my head~0

CS265, Spring 2017 Stratos Idreos /5513

registers

on chip cache

on board cache

memory

disk

CPU

memory wall

chea

per

fast

er

SRAM

DRAM

~1ns

~10ns

~100ns

cache miss: looking for something which is not in the cache

memory miss: looking for something which is not in memory

time

speed cpu

mem

The term static differentiates SRAM from DRAM (dynamic random-access memory) which must be periodically refreshed. SRAM is faster and more expensive than DRAM; it is typically used for CPU cache while DRAM is used for a computer's main memory.

CS265, Spring 2017 Stratos Idreos /5514

design of storage/access methods/algorithms should minimize:

and touch/access only what you need

data misses+

instruction misses

CS265, Spring 2017 Stratos Idreos /5515

random access & page-based access

need to only read x… but have to read all of page 1

page1 page2 page3

data value x

registers

on chip cache

on board cachememory

disk

CPU

data

mov

e

CS265, Spring 2017 Stratos Idreos /5516

…5 10 6 4 12 2 8 9 7 6 7 11 3 9 6

memory level N

memory level N-1

query x<5

page size: 5x8 bytes

scan scan

40 bytes

5 10 6 4 12

(size=120 bytes)

2 8 9 7 6

4 2

7 11 3 9 6

scan

3

80 bytes120 bytes

try to read something and use it fully and never ever read it again to avoid data transfer costs - reading data from disk has been the major effort

CS265, Spring 2017 Stratos Idreos /5517

…5 10 6 4 12 2 8 9 7 6 7 11 3 9 6

memory level N

memory level N-1

query x<5

page size: 5x8 bytes

oracle oracle

40 bytes

5 10 6 4 12

(size=120 bytes)

2 8 9 7 6

4 2

7 11 3 9 6

oracle

3

80 bytes120 bytesan oracle gives us the positions

it does not make a difference - in fact we have to query and maintain the oracle

CS265, Spring 2017 Stratos Idreos /5518

when does it make sense to have an oraclehow can we minimize the cost

…5 10 6 4 12 2 8 9 7 6 7 11 3 9 65 10 6 4 12 2 8 9 7 6 7 11 3 9 6

e.g., query x<5

CS265, Spring 2017 Stratos Idreos /5519

it all starts with how we store data every bit matters

design space

CS265, Spring 2017 Stratos Idreos /5520

sequential access: read one block; consume it completely; discard it; read next

what is next?

in parallel/prefetching

hardware can better predict/buffer sequential pages to be read

e.g., 2MB buffers in modern DRAM amortize cost of moving disk arms

1 2 3 4

CS265, Spring 2017 Stratos Idreos /5521

random access: read one block; consume it partially; discard it; might have to read it again in future; read “random” next;

disk mechanical arms - high cost to move arms - so when you move them exploit it and read everything in there memory buffers

CS265, Spring 2017 Stratos Idreos /5522

C/C++

CS265, Spring 2017 Stratos Idreos /5523

MAIN-MEMORY OPTIMIZED DATA SYSTEMS

CS265, Spring 2017 Stratos Idreos /5524

a “simple” example

assume an array of N integers: find all positions where value>x

data

qualifying positions

exists in all systems: sql, nosql, newsqlselect operator

make it like a key-value store

CS265, Spring 2017 Stratos Idreos /5525

datamemory

assume an array of N integers: find all positions where value>x

res=new array[data.size]

j=0 for (i=0; i<data.size; i++)

if data[i]>x res[j++]=i

what if only 1% qualifies?

rescopy res

but how can we know?

CS265, Spring 2017 Stratos Idreos /5526

assume an array of N integers: find all positions where value>x

res=new array[data.size]

j=0 for (i=0; i<data.size; i++)

if data[i]>x res[j++]=i

what if 90% qualifies?

result size= qualifying values*x bytes

bit vector for res? 1 0 0 0 1 1 1 0 1

vs

00000001 00000101 00000110 00000111 00001001

and we haven’t even started discussing about how to find the qualifying values…

if statements= bad, bad, bad

CS265, Spring 2017 Stratos Idreos /5527

assume an array of N integers: find all positions where value>x

res=new array[data.size]

j=0 for (i=0; i<data.size; i++)

if data[i]>x res[j++]=i

thread1, core1

thread2, core2

thread3, core3

logically partition

NUMA architectures? SIMD functionality?

& what about result writing?

not as simple as spinning off N threads…

CS265, Spring 2017 Stratos Idreos /5528

assume an array of N integers: find all positions where value>x

res=new array[data.size]

j=0 for (i=0; i<data.size; i++)

if data[i]>x res[j++]=i

q1,q2,q3

N>>1 queries in parallel

q4

q4

CS265, Spring 2017 Stratos Idreos /5529

assume an array of N integers: find all positions where value>x

res=new array[10]

j=0 for (i=0; i<10; i++)

if data[i]>x res[j++]=i

res=new array[10] j=0

if data[0]>x res[j++]=i if data[1]>x res[j++]=i if data[2]>x res[j++]=i if data[3]>x res[j++]=i if data[4]>x res[j++]=i if data[5]>x res[j++]=i if data[6]>x res[j++]=i if data[7]>x res[j++]=i if data[8]>x res[j++]=i if data[9]>x res[j++]=i

vs

30vs50

CS265, Spring 2017 Stratos Idreos /5530

assume an array of N integers: find all positions where value>x

option 1: scan all data

option 2: use a tree (do not consider tree generation costs)

which one is best

CS265, Spring 2017 Stratos Idreos /5531

cost: data touched & computation

time

speed cpu

mem

CS265, Spring 2017 Stratos Idreos /5532

a “simple” example

data

build a key-value store similar to the ones Facebook, Google, etc use

interface supported: put, get, scan, count, get range, load unique key-value pairs, r>>w but w>>0

how to store and access

CS265, Spring 2017 Stratos Idreos /5533

design

logical design

physical design

system design

CS265, Spring 2017 Stratos Idreos /5534

essential steps in using a database system

clean schema load tune

query

experts/system admins

user/apps

CS265, Spring 2017 Stratos Idreos /5535

relational model+SQL

professors(id,name,…)

courses(id,name, profId,…)

students(id,name,…)

database

table/relationcolumn/attribute

give me the names of all students: select name from students

create table for professors: create table professors (id:integer, name: char(40), telephone: char(10), …) insert into professors (76897689, “john smith”, …)

where GPA>3.0

key

CS265, Spring 2017 Stratos Idreos /5536

employee(id:int, name:varchar(50), office:char(5), telephone:char(10), city:varchar(30), salary:int)

(1, name1, office1, tel1, city1, salary1) (2, name2, office2, tel2, city2, salary2) (3, name3, office3, tel3, city3, salary3) (4, name4, office4, tel4, city4, salary4) (5, name5, office5, tel5, city5, salary5) (6, name6, office6, tel6, city6, salary6) (7, name7, office7, tel7, city7, salary7) (8, name8, office8, tel8, city8, salary8) (9, name9, office9, NULL, city9, salary9)

schema

data

cardinality=9

SQL:insert into employee (1, name1, office1, tel1, city1, salary1)

value does not exist

CS265, Spring 2017 Stratos Idreos /5537

relational model+SQL

professors(id,name,…)

courses(id,name, profId,…)

students(id,name,…)

database

give me all students enrolled in cs265select student.name from students, enrolled, courses where courses.name=“cs165” and enrolled.courseId=course.id and student.id=enrolled.studentId

enrolled(studentId,

courseId,…) foreign key

join

CS265, Spring 2017 Stratos Idreos /5538

fact table(id1,id2,…)

dimension table 1(id1,…)

dimension table 2(id2,…)

star schema

CS265, Spring 2017 Stratos Idreos /5539

key-value store vs relational

can we store a document collection in a relational systems? can we store a relational database in a key-value store?

CS265, Spring 2017 Stratos Idreos /5540

design

logical design

physical design

system design

CS265, Spring 2017 Stratos Idreos /5541

essential steps in using a database system

clean schema load tune

query

experts/system admins

user/apps

CS265, Spring 2017 Stratos Idreos /5542

so do db systems “just work”?

declarative interface ask what you want

db system

CS265, Spring 2017 Stratos Idreos /5543

declarative interface ask what you want

db system

DBAindexes/views/tuning knobsbut … db cracking, adaptive* ideas

CS265, Spring 2017 Stratos Idreos /5544

design

logical design

physical design

system design

CS265, Spring 2017 Stratos Idreos /5545

database kernel

data data dataal

gorit

hms/

oper

ator

s

select min(A) from R where B<10 and C<80

optimizer

parser

execution

storage

CS265, Spring 2017 Stratos Idreos /5546

logisticshttp://daslab.seas.harvard.edu/classes/cs265/

CS265, Spring 2017 Stratos Idreos /5547

projects

option 1: systems projectbasic key-value store functionality - work individually single machine - multi-core design basic design as in Facebook, LinkedIn, Mongo, etc. can lead to research

option 2: research projectself-designing data systems + shape-shifting access methods research with DASlab researchers - groups of 3 available for cs165 students or otherwise advanced studentsnext generation adaptive Key-value stores (with Facebook)

CS265, Spring 2017 Stratos Idreos /5548

C/C++no libraries unless we explicitly allow it

we expect you build everything from scratch so you can control storage and access 100%

CS265, Spring 2017 Stratos Idreos /5549

midway check-in (10%)

special class (2-3 hour?) in mid March: 1) design docs 2) at least one performance example 3) presentation/poster

CS265, Spring 2017 Stratos Idreos /5550

two reading sessions and two hacking sessions per week? so maybe a minimum of 15 hours per week

(if you already have decent hacking and data structure/algorithms experience)

workload?

CS265, Spring 2017 Stratos Idreos /5551

how can I prepare?

Get familiar with the very basics of traditional database architectures:Architecture of a Database System. By J. Hellerstein, M. Stonebraker and J. Hamilton. Foundations and Trends in Databases, 2007

Get familiar with very basics of modern database architectures:The Design and Implementation of Modern Column-store Database Systems.  By D. Abadi, P. Boncz, S. Harizopoulos, S. Idreos, S. Madden. Foundations and Trends in Databases, 2013

Get familiar with the very basics of modern large scale systems:Massively Parallel Databases and MapReduce Systems. By Shivnath Babu and Herodotos Herodotou. Foundations and Trends in Databases, 2013

1) start browsing some basic texts

2) play with basic data structures implementation in C (linked list/hash table/tree)

CS265, Spring 2017 Stratos Idreos /5552

two more lectures next week and then we go into discussion mode

wed: db architectures basics fri: projects

next week: paper signup + systems project is already online

CS265, Spring 2017 Stratos Idreos /5553

web site: http://daslab.seas.harvard.edu/classes/cs265/ piazza: piazza.com/harvard/spring2017/cs265/homeoffice hours: Stratos: Wed/Thur/Fri, 3-4pm, MD139 TF office hours: Mon ?, Tue, 3-4pm, MD 136textbook: nope research papers will be available from the Harvard network

Action steps: 1) Read the syllabus & website carefully, 2) Register to Piazza, 3) Do P0 if you have not taken CS165 and check self-test, 4) Register for paper presentation (week 2), 5) Start submitting your paper reviews (week 3)

data systems 101BIG DATA SYSTEMSprof. Stratos Idreos

class 2

modern main-memory optimized data systems

next time

Recommended