Yannis Ioannidis MaDgIK Lab University of Athens & ATHENA Research Center

Embed Size (px)

Citation preview

  • Slide 1
  • Yannis Ioannidis MaDgIK Lab University of Athens & ATHENA Research Center
  • Slide 2
  • Motivation Relevant Systems & Languages The ADP System Optimal Dataflow Scheduling Conclusions & Future work
  • Slide 3
  • Slide 4
  • High-level Data Languages Languages to easily express data operations Semantics Big Data Processing TB or PB of data (scientific, sensors, ) Efficiency (Query) optimization Reconciling efficiency and semantics
  • Slide 5
  • Op1 Op-i Data1 Data-j Q C-k A Cloud Processing Containers Data Registry Operator Registry
  • Slide 6
  • select name, department, salary from employee e, dept d where e.position = director and e.did = d.id employee dept select join project dept join partition project join employee 1 employee 2 partition project select partition union
  • Slide 7
  • Query:graph of relational algebra operators Optimality:response time or completion time Environment:cluster of dedicated distributed / parallel hosts
  • Slide 8
  • Query:graph of arbitrary operators Optimality:response time or completion time and money Environment:cloud of hosts (elasticity)
  • Slide 9
  • 1 2 3 4 5 Graph of arbitrary operators Non-relational data analytics Query log analysis Data mining Simulation model composition User behavior analysis for European national libraries One of sixteen flows
  • Slide 10
  • select name, department, salary from employee e, dept d where haircolor(e.picture) = gray and e.did = d.id employee dept haircolor join project dept join partition project join employee 1 employee 2 partition project haircolor partition union haircolor Operator Registry f f f f f f
  • Slide 11
  • dept join partition project join employee 1 employee 2 partition project haircolor partition union Cloud
  • Slide 12
  • Virtualized IT resources offered as on-demand service Software as a Service (SaaS) Platform as a Service (PaaS) Infrastructure as a Service (IaaS) Variety of charging and use policies
  • Slide 13
  • Cloud of hosts (elasticity) Virtual resources (virtual hosts = containers) Available on demand Used for as much time needed Leased on a per quantum pricing scheme Illusion of infinite resources Arbitrary # of choices of price/performance ratio
  • Slide 14
  • Slide 15
  • Hadoop Open source software for reliable, scalable, distributed computing Won Jim Grays Terabyte Sort Benchmark in 2008 (209 seconds) Hive (Facebook) Data warehouse build on top of Hadoop Condor Designed for CPU intensive applications Pegasus Scientific workflows on the Grid Cassandra (Facebook) P2P database with linear scalability
  • Slide 16
  • Map-Reduce (Google) Jim Grays Terabyte Sort Benchmark in 68 seconds in 2009 PNUTS (Yahoo! Research) Massively parallel & geographically distributed database system Dryad (Microsoft Research) General-purpose distributed execution engine for coarse-grain data-parallel applications Greenplum Massively parallel processing (MPP) database Hadapt SQL processing on top of Hadoop
  • Slide 17
  • SQL Hive-QL Pig-Latin Dataflow language Datalog Mashups Yahoo! pipes MashQL
  • Slide 18
  • Hive-QL is based on SQL CREATE TABLE page_view( viewTime INT, userid BIGINT, page_url STRING, referrer_url STRING, ip STRING COMMENT 'IP Address of the User') COMMENT 'This is the page view table' PARTITIONED BY( dt STRING, country STRING) STORED AS SEQUENCEFILE; INSERT OVERWRITE TABLE xyz_com_page_views SELECT page_views.* FROM page_views WHERE page_views.date >= '2008-03-01' AND page_views.date