Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
data systems 101prof. Stratos Idreos
HTTP://DASLAB.SEAS.HARVARD.EDU/CLASSES/CS265/
class 2
CS265, Spring 2017 Stratos Idreos /552
big data V’s (it is not about size only)
volume velocity variety veracity
actually none of that is really new…
new: our ability to gather and store machine generated data broad understanding that we cannot just manually get value out of data
CS265, Spring 2017 Stratos Idreos /553
a data system stores data and provides access to data
& makes knowledge generation easy
data system analysisdata
knowledgeX
CS265, Spring 2017 Stratos Idreos /554
database kernel
data data data
algo
rithm
s/op
erat
ors
applications
sql
disk
memory
cpu
CS265, Spring 2017 Stratos Idreos /555
declarative interface ask ‘’what’’ you want
data system
the system decides “how” to best store and access data
CS265, Spring 2017 Stratos Idreos /556
SQL noSQL
as apps become more complex
as apps need to be more scalable
complex legacy tuning
expensive …
simple clean
just enough …
newSQL
hadoop spark flink
goal: understand why, how and what is next
Gartner: DBMS Market = ~36Billion
noSQL/Hadoop =~700Million SQL = ~rest
http://www.infoworld.com/article/3056637/database/nosql-chips-away-at-oracle-ibm-and-microsoft-dominance.html
CS265, Spring 2017 Stratos Idreos /557
more data
more applications
more h/w
the need for new systems
CS265, Spring 2017 Stratos Idreos /558
soon everyone will need to be a “data scientist”hmm,
my data is too big :(
1
new applications/requirements
how far away are we from a future where a data system sits in
the critical path of everything we do?
CS265, Spring 2017 Stratos Idreos /559
not always sure what we are looking for (until we find it)
data exploration2
CS265, Spring 2017 Stratos Idreos /5510
years
daily
dat
a[IBMbigdata]
years
daily
dat
a
[StratosGuess]
data
* ski
lls
data system design, set-up, tune, use
3
CS265, Spring 2017 Stratos Idreos /5511
system where db runs
sql
caches
cpu - cpu - cpu - cpu
memory
disk - disk - disk - disk
cpu registers
smal
ler/f
aste
r
+ flash
+ non volatile memory
memory hierarchy
CS265, Spring 2017 Stratos Idreos /5512
registers
on chip cache2x
on board cache10x
memory100x
disk100Kx
Jim Gray, IBM, Tandem, DEC, Microsoft ACM Turing award ACM SIGMOD Edgar F. Codd Innovations award
Pluto2 years
New York1.5 hours
this building10 min
this room1 min
my head~0
CS265, Spring 2017 Stratos Idreos /5513
registers
on chip cache
on board cache
memory
disk
CPU
memory wall
chea
per
fast
er
SRAM
DRAM
~1ns
~10ns
~100ns
cache miss: looking for something which is not in the cache
memory miss: looking for something which is not in memory
time
speed cpu
mem
The term static differentiates SRAM from DRAM (dynamic random-access memory) which must be periodically refreshed. SRAM is faster and more expensive than DRAM; it is typically used for CPU cache while DRAM is used for a computer's main memory.
CS265, Spring 2017 Stratos Idreos /5514
design of storage/access methods/algorithms should minimize:
and touch/access only what you need
data misses+
instruction misses
CS265, Spring 2017 Stratos Idreos /5515
random access & page-based access
…
need to only read x… but have to read all of page 1
page1 page2 page3
data value x
registers
on chip cache
on board cachememory
disk
CPU
data
mov
e
CS265, Spring 2017 Stratos Idreos /5516
…5 10 6 4 12 2 8 9 7 6 7 11 3 9 6
memory level N
memory level N-1
query x<5
page size: 5x8 bytes
scan scan
40 bytes
5 10 6 4 12
(size=120 bytes)
2 8 9 7 6
4 2
7 11 3 9 6
scan
3
80 bytes120 bytes
try to read something and use it fully and never ever read it again to avoid data transfer costs - reading data from disk has been the major effort
CS265, Spring 2017 Stratos Idreos /5517
…5 10 6 4 12 2 8 9 7 6 7 11 3 9 6
memory level N
memory level N-1
query x<5
page size: 5x8 bytes
oracle oracle
40 bytes
5 10 6 4 12
(size=120 bytes)
2 8 9 7 6
4 2
7 11 3 9 6
oracle
3
80 bytes120 bytesan oracle gives us the positions
it does not make a difference - in fact we have to query and maintain the oracle
CS265, Spring 2017 Stratos Idreos /5518
when does it make sense to have an oraclehow can we minimize the cost
…5 10 6 4 12 2 8 9 7 6 7 11 3 9 65 10 6 4 12 2 8 9 7 6 7 11 3 9 6
e.g., query x<5
CS265, Spring 2017 Stratos Idreos /5519
it all starts with how we store data every bit matters
design space
CS265, Spring 2017 Stratos Idreos /5520
sequential access: read one block; consume it completely; discard it; read next
what is next?
in parallel/prefetching
hardware can better predict/buffer sequential pages to be read
e.g., 2MB buffers in modern DRAM amortize cost of moving disk arms
1 2 3 4
CS265, Spring 2017 Stratos Idreos /5521
random access: read one block; consume it partially; discard it; might have to read it again in future; read “random” next;
disk mechanical arms - high cost to move arms - so when you move them exploit it and read everything in there memory buffers
CS265, Spring 2017 Stratos Idreos /5522
C/C++
CS265, Spring 2017 Stratos Idreos /5523
MAIN-MEMORY OPTIMIZED DATA SYSTEMS
CS265, Spring 2017 Stratos Idreos /5524
a “simple” example
assume an array of N integers: find all positions where value>x
data
qualifying positions
exists in all systems: sql, nosql, newsqlselect operator
make it like a key-value store
CS265, Spring 2017 Stratos Idreos /5525
datamemory
assume an array of N integers: find all positions where value>x
res=new array[data.size]
j=0 for (i=0; i<data.size; i++)
if data[i]>x res[j++]=i
what if only 1% qualifies?
rescopy res
but how can we know?
CS265, Spring 2017 Stratos Idreos /5526
assume an array of N integers: find all positions where value>x
res=new array[data.size]
j=0 for (i=0; i<data.size; i++)
if data[i]>x res[j++]=i
what if 90% qualifies?
result size= qualifying values*x bytes
bit vector for res? 1 0 0 0 1 1 1 0 1
vs
00000001 00000101 00000110 00000111 00001001
and we haven’t even started discussing about how to find the qualifying values…
if statements= bad, bad, bad
CS265, Spring 2017 Stratos Idreos /5527
assume an array of N integers: find all positions where value>x
res=new array[data.size]
j=0 for (i=0; i<data.size; i++)
if data[i]>x res[j++]=i
thread1, core1
thread2, core2
thread3, core3
…
logically partition
NUMA architectures? SIMD functionality?
& what about result writing?
not as simple as spinning off N threads…
CS265, Spring 2017 Stratos Idreos /5528
assume an array of N integers: find all positions where value>x
res=new array[data.size]
j=0 for (i=0; i<data.size; i++)
if data[i]>x res[j++]=i
q1,q2,q3
N>>1 queries in parallel
q4
q4
CS265, Spring 2017 Stratos Idreos /5529
assume an array of N integers: find all positions where value>x
res=new array[10]
j=0 for (i=0; i<10; i++)
if data[i]>x res[j++]=i
res=new array[10] j=0
if data[0]>x res[j++]=i if data[1]>x res[j++]=i if data[2]>x res[j++]=i if data[3]>x res[j++]=i if data[4]>x res[j++]=i if data[5]>x res[j++]=i if data[6]>x res[j++]=i if data[7]>x res[j++]=i if data[8]>x res[j++]=i if data[9]>x res[j++]=i
vs
30vs50
CS265, Spring 2017 Stratos Idreos /5530
assume an array of N integers: find all positions where value>x
option 1: scan all data
option 2: use a tree (do not consider tree generation costs)
which one is best
CS265, Spring 2017 Stratos Idreos /5531
cost: data touched & computation
time
speed cpu
mem
CS265, Spring 2017 Stratos Idreos /5532
a “simple” example
data
build a key-value store similar to the ones Facebook, Google, etc use
interface supported: put, get, scan, count, get range, load unique key-value pairs, r>>w but w>>0
how to store and access
CS265, Spring 2017 Stratos Idreos /5533
design
logical design
physical design
system design
CS265, Spring 2017 Stratos Idreos /5534
essential steps in using a database system
clean schema load tune
query
experts/system admins
user/apps
CS265, Spring 2017 Stratos Idreos /5535
relational model+SQL
professors(id,name,…)
courses(id,name, profId,…)
students(id,name,…)
database
table/relationcolumn/attribute
give me the names of all students: select name from students
create table for professors: create table professors (id:integer, name: char(40), telephone: char(10), …) insert into professors (76897689, “john smith”, …)
where GPA>3.0
key
CS265, Spring 2017 Stratos Idreos /5536
employee(id:int, name:varchar(50), office:char(5), telephone:char(10), city:varchar(30), salary:int)
(1, name1, office1, tel1, city1, salary1) (2, name2, office2, tel2, city2, salary2) (3, name3, office3, tel3, city3, salary3) (4, name4, office4, tel4, city4, salary4) (5, name5, office5, tel5, city5, salary5) (6, name6, office6, tel6, city6, salary6) (7, name7, office7, tel7, city7, salary7) (8, name8, office8, tel8, city8, salary8) (9, name9, office9, NULL, city9, salary9)
schema
data
cardinality=9
SQL:insert into employee (1, name1, office1, tel1, city1, salary1)
value does not exist
CS265, Spring 2017 Stratos Idreos /5537
relational model+SQL
professors(id,name,…)
courses(id,name, profId,…)
students(id,name,…)
database
give me all students enrolled in cs265select student.name from students, enrolled, courses where courses.name=“cs165” and enrolled.courseId=course.id and student.id=enrolled.studentId
enrolled(studentId,
courseId,…) foreign key
join
CS265, Spring 2017 Stratos Idreos /5538
fact table(id1,id2,…)
dimension table 1(id1,…)
dimension table 2(id2,…)
…
…
…
star schema
CS265, Spring 2017 Stratos Idreos /5539
key-value store vs relational
can we store a document collection in a relational systems? can we store a relational database in a key-value store?
CS265, Spring 2017 Stratos Idreos /5540
design
logical design
physical design
system design
CS265, Spring 2017 Stratos Idreos /5541
essential steps in using a database system
clean schema load tune
query
experts/system admins
user/apps
CS265, Spring 2017 Stratos Idreos /5542
so do db systems “just work”?
declarative interface ask what you want
db system
CS265, Spring 2017 Stratos Idreos /5543
declarative interface ask what you want
db system
DBAindexes/views/tuning knobsbut … db cracking, adaptive* ideas
CS265, Spring 2017 Stratos Idreos /5544
design
logical design
physical design
system design
CS265, Spring 2017 Stratos Idreos /5545
database kernel
data data dataal
gorit
hms/
oper
ator
s
select min(A) from R where B<10 and C<80
optimizer
parser
execution
storage
CS265, Spring 2017 Stratos Idreos /5546
logisticshttp://daslab.seas.harvard.edu/classes/cs265/
CS265, Spring 2017 Stratos Idreos /5547
projects
option 1: systems projectbasic key-value store functionality - work individually single machine - multi-core design basic design as in Facebook, LinkedIn, Mongo, etc. can lead to research
option 2: research projectself-designing data systems + shape-shifting access methods research with DASlab researchers - groups of 3 available for cs165 students or otherwise advanced studentsnext generation adaptive Key-value stores (with Facebook)
CS265, Spring 2017 Stratos Idreos /5548
C/C++no libraries unless we explicitly allow it
we expect you build everything from scratch so you can control storage and access 100%
CS265, Spring 2017 Stratos Idreos /5549
midway check-in (10%)
special class (2-3 hour?) in mid March: 1) design docs 2) at least one performance example 3) presentation/poster
CS265, Spring 2017 Stratos Idreos /5550
two reading sessions and two hacking sessions per week? so maybe a minimum of 15 hours per week
(if you already have decent hacking and data structure/algorithms experience)
workload?
CS265, Spring 2017 Stratos Idreos /5551
how can I prepare?
Get familiar with the very basics of traditional database architectures:Architecture of a Database System. By J. Hellerstein, M. Stonebraker and J. Hamilton. Foundations and Trends in Databases, 2007
Get familiar with very basics of modern database architectures:The Design and Implementation of Modern Column-store Database Systems. By D. Abadi, P. Boncz, S. Harizopoulos, S. Idreos, S. Madden. Foundations and Trends in Databases, 2013
Get familiar with the very basics of modern large scale systems:Massively Parallel Databases and MapReduce Systems. By Shivnath Babu and Herodotos Herodotou. Foundations and Trends in Databases, 2013
1) start browsing some basic texts
2) play with basic data structures implementation in C (linked list/hash table/tree)
CS265, Spring 2017 Stratos Idreos /5552
two more lectures next week and then we go into discussion mode
wed: db architectures basics fri: projects
next week: paper signup + systems project is already online
CS265, Spring 2017 Stratos Idreos /5553
web site: http://daslab.seas.harvard.edu/classes/cs265/ piazza: piazza.com/harvard/spring2017/cs265/homeoffice hours: Stratos: Wed/Thur/Fri, 3-4pm, MD139 TF office hours: Mon ?, Tue, 3-4pm, MD 136textbook: nope research papers will be available from the Harvard network
Action steps: 1) Read the syllabus & website carefully, 2) Register to Piazza, 3) Do P0 if you have not taken CS165 and check self-test, 4) Register for paper presentation (week 2), 5) Start submitting your paper reviews (week 3)
data systems 101BIG DATA SYSTEMSprof. Stratos Idreos
class 2
modern main-memory optimized data systems
next time