21
CLyDE MidFlight: What we have learned about the SSDBased IO Stack Philippe Bonnet – [email protected] Joint work with Luc Bouganim (INRIA), Niv Dayan (ITU), MaQas Bjørling (ITU), Jesper Madsen (ITU) In LyllaboraQon with Jens Axboe (Facebook), David Nellans (Nvidia), Zvonimir Bandic (HGST), Qingbo Wang (HGST), Aviad Zuck (Tel Aviv Univ.) hYp://clyde.itu.dk A project funded by the Danish Council for Independent Research

CLyDE&Mid*Flight:& - WordPress.comCLyDE&Mid*Flight:& Whatwe&have&learned&aboutthe&& SSD*Based&IO&Stack& Philippe&Bonnet– [email protected]& & Jointwork&with&Luc& Bouganim&(INRIA),&Niv&Dayan&(ITU),&Maas

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: CLyDE&Mid*Flight:& - WordPress.comCLyDE&Mid*Flight:& Whatwe&have&learned&aboutthe&& SSD*Based&IO&Stack& Philippe&Bonnet– phbo@itu.dk& & Jointwork&with&Luc& Bouganim&(INRIA),&Niv&Dayan&(ITU),&Maas

CLyDE  Mid-­‐Flight:    What  we  have  learned  about  the    

SSD-­‐Based  IO  Stack  

Philippe  Bonnet  –  [email protected]    

Joint  work  with  Luc  Bouganim  (INRIA),  Niv  Dayan  (ITU),  MaQas  Bjørling  (ITU),  Jesper  Madsen  (ITU)  

 In  LyllaboraQon  with  Jens  Axboe  (Facebook),  David  Nellans  (Nvidia),  Zvonimir  

Bandic  (HGST),  Qingbo  Wang  (HGST),  Aviad  Zuck  (Tel  Aviv  Univ.)  

hYp://clyde.itu.dk  A  project  funded  by  the  Danish  Council  for  Independent  Research  

Page 2: CLyDE&Mid*Flight:& - WordPress.comCLyDE&Mid*Flight:& Whatwe&have&learned&aboutthe&& SSD*Based&IO&Stack& Philippe&Bonnet– phbo@itu.dk& & Jointwork&with&Luc& Bouganim&(INRIA),&Niv&Dayan&(ITU),&Maas

The  Advent  of  SSDs  

Balanced  High-­‐End  Systems   Energy  Efficient  Storage  

Jason  Taylor,  Flash  Memory  Summit,  08/13  

High  throughput/Low  latency    reduces  the  gap  between  RAM  and  IO  performance  

Fewer  WaYs/TB  drives  higher  data  density  on  the  cloud  (e.g.,  OCZ  vertex  4:  5W/TB;  Toshiba  MBF:  22W/TB)  

0  

200000  

400000  

600000  

800000  

1000000  

1200000  

1400000  

4K  Read  IOPS  

Page 3: CLyDE&Mid*Flight:& - WordPress.comCLyDE&Mid*Flight:& Whatwe&have&learned&aboutthe&& SSD*Based&IO&Stack& Philippe&Bonnet– phbo@itu.dk& & Jointwork&with&Luc& Bouganim&(INRIA),&Niv&Dayan&(ITU),&Maas

Focus  of  CLyDE  

The  Challenges  

Cloud  provider  How  to  design  and  characterize  

the  IO  service?  

DBMS  designer    What  is  the  impact  of  this  new  IO  stack  on  DBMS  design?  

Database  Administrator  How  to  tune  for  performance?  

hYp://docs.aws.amazon.com/AWSEC2/latest/UserGuide/i2-­‐instances.html  

AWS  I2.8xlarge  instance:  32  vCPU,  244GiB  RAM,  8x800  GB  SSD,  $6.82/h      

Page 4: CLyDE&Mid*Flight:& - WordPress.comCLyDE&Mid*Flight:& Whatwe&have&learned&aboutthe&& SSD*Based&IO&Stack& Philippe&Bonnet– phbo@itu.dk& & Jointwork&with&Luc& Bouganim&(INRIA),&Niv&Dayan&(ITU),&Maas

How  to  design  for  SSDs?  Business  as  usual  

•  Layered  design  •  SSD  as  block  device  •  SSD  as  a  black  box  •  Performance  model  for  SSD  

to  drive  design  decisions  

Lessons  learnt  [CIDR’09][Sigmod’10][DEB’10]:  -­‐  Performance  varies  across  SSDs,  in  

Qme  for  a  SSD  (with  firmware  updates,  depending  on  IO  history)  

-­‐  Performance  varies  for  a  given  IO  paYern  with  target  size,  with  concurrency,  with  submission  rate  

ApplicaQon  

DBMS  

OS  

SSD  driver  

SSD  controller  

block  device  

Page 5: CLyDE&Mid*Flight:& - WordPress.comCLyDE&Mid*Flight:& Whatwe&have&learned&aboutthe&& SSD*Based&IO&Stack& Philippe&Bonnet– phbo@itu.dk& & Jointwork&with&Luc& Bouganim&(INRIA),&Niv&Dayan&(ITU),&Maas

SSD  Internals  

LUN  

LUN  

LUN  

…  

LUN  

LUN  

LUN  

…  

LUN  

LUN  

LUN  

…  

LUN  

LUN  

LUN  

…  

Read  Write  Trim    

Logical  add

ress  sp

ace  

Physical  add

ress  sp

ace  

Mapping  

Wear  Leveling  Garbage  collecFon  

Shared  Internal  data  structures  

Read  Program  Erase  

Flash  memory  array  

Flash  TranslaQon  Layer  (FTL)  Implemented  on  SSD  controller  or  SSD  driver  

Page 6: CLyDE&Mid*Flight:& - WordPress.comCLyDE&Mid*Flight:& Whatwe&have&learned&aboutthe&& SSD*Based&IO&Stack& Philippe&Bonnet– phbo@itu.dk& & Jointwork&with&Luc& Bouganim&(INRIA),&Niv&Dayan&(ITU),&Maas

How  to  design  for  SSDs?  

New  approaches  •  Layers  revisited:  

–  OS  bypassed  (e.g.,  Moneta-­‐D  [asplos12])  –  Avoid  duplicaQon  of  work  (e.g.,  from  

ARIES  to  MARS  [SOSP13])  

•  Cross-­‐layer  opQmizaQons  –  Trim  command  [HPCA11]  –  Extending  fadvise  [NVM12]  

•  New  SSD  interface:  –  CommunicaQon  abstracQon  to  

replace  memory  abstracQon  –  More  complex  address  space  Moneta-­‐D  [asplos12]  

Caulfield  et  al.,  UCSD  

Page 7: CLyDE&Mid*Flight:& - WordPress.comCLyDE&Mid*Flight:& Whatwe&have&learned&aboutthe&& SSD*Based&IO&Stack& Philippe&Bonnet– phbo@itu.dk& & Jointwork&with&Luc& Bouganim&(INRIA),&Niv&Dayan&(ITU),&Maas

CLyDE  Approach  

1.  Keep  tradiQonal  layers  2.  Focus  on  SSD  interface  to  support  cross-­‐

layers  opQmizaQon  

Talk  outline:  – SSD  SimulaQon:  EagleTree  and  LightNVM  – MulQ-­‐queue  design  for  the  Linux  Block  Layer    – Tree-­‐Based  SSD  interface  for  FS/SSD  co-­‐design  

Page 8: CLyDE&Mid*Flight:& - WordPress.comCLyDE&Mid*Flight:& Whatwe&have&learned&aboutthe&& SSD*Based&IO&Stack& Philippe&Bonnet– phbo@itu.dk& & Jointwork&with&Luc& Bouganim&(INRIA),&Niv&Dayan&(ITU),&Maas

Exploring  the  Design  Space  •  Fundamental  quesQons:  

–  What  is  the  relaQve  importance  of  data  placement  and  scheduling  within  the  FTL?  

–  Should  scheduling  adapt  to  HW  configuraQon?  to  app  workload?  •  Design  choices:  

–  within  each  layer  –  across  layers  

•  Two  complementary  approaches:  –  EagleTree  [VLDB’13]:  Simulated  SSD  driver/controller  to  experiment  with  

actual  OS/apps  •  Wall-­‐Qme  clock  

–  LightNVM  [NVMW’14]:    Simulated  SSD/OS/apps  to  experiment  with    •  Dynamic  run-­‐Qme  clock  

Page 9: CLyDE&Mid*Flight:& - WordPress.comCLyDE&Mid*Flight:& Whatwe&have&learned&aboutthe&& SSD*Based&IO&Stack& Philippe&Bonnet– phbo@itu.dk& & Jointwork&with&Luc& Bouganim&(INRIA),&Niv&Dayan&(ITU),&Maas

EagleTree  [VLDB’13]  hYps://github.com/nivdayan/EagleTree   Niv  Dayan:  [email protected]  

Page 10: CLyDE&Mid*Flight:& - WordPress.comCLyDE&Mid*Flight:& Whatwe&have&learned&aboutthe&& SSD*Based&IO&Stack& Philippe&Bonnet– phbo@itu.dk& & Jointwork&with&Luc& Bouganim&(INRIA),&Niv&Dayan&(ITU),&Maas

LightNVM  [NVMW’14]  

MaQas  Bjørling:  [email protected]  

Page 11: CLyDE&Mid*Flight:& - WordPress.comCLyDE&Mid*Flight:& Whatwe&have&learned&aboutthe&& SSD*Based&IO&Stack& Philippe&Bonnet– phbo@itu.dk& & Jointwork&with&Luc& Bouganim&(INRIA),&Niv&Dayan&(ITU),&Maas

Linux  Block  IO  

Page 12: CLyDE&Mid*Flight:& - WordPress.comCLyDE&Mid*Flight:& Whatwe&have&learned&aboutthe&& SSD*Based&IO&Stack& Philippe&Bonnet– phbo@itu.dk& & Jointwork&with&Luc& Bouganim&(INRIA),&Niv&Dayan&(ITU),&Maas

Scalability  Problem  

Page 13: CLyDE&Mid*Flight:& - WordPress.comCLyDE&Mid*Flight:& Whatwe&have&learned&aboutthe&& SSD*Based&IO&Stack& Philippe&Bonnet– phbo@itu.dk& & Jointwork&with&Luc& Bouganim&(INRIA),&Niv&Dayan&(ITU),&Maas

MulQqueue  Approach  Jens  Axboe’s  design  

Page 14: CLyDE&Mid*Flight:& - WordPress.comCLyDE&Mid*Flight:& Whatwe&have&learned&aboutthe&& SSD*Based&IO&Stack& Philippe&Bonnet– phbo@itu.dk& & Jointwork&with&Luc& Bouganim&(INRIA),&Niv&Dayan&(ITU),&Maas

Experimental  Results  [Systor13]  

Page 15: CLyDE&Mid*Flight:& - WordPress.comCLyDE&Mid*Flight:& Whatwe&have&learned&aboutthe&& SSD*Based&IO&Stack& Philippe&Bonnet– phbo@itu.dk& & Jointwork&with&Luc& Bouganim&(INRIA),&Niv&Dayan&(ITU),&Maas
Page 16: CLyDE&Mid*Flight:& - WordPress.comCLyDE&Mid*Flight:& Whatwe&have&learned&aboutthe&& SSD*Based&IO&Stack& Philippe&Bonnet– phbo@itu.dk& & Jointwork&with&Luc& Bouganim&(INRIA),&Niv&Dayan&(ITU),&Maas

Next  BoYleneck  within  Linux  IO  Stack  

•  Page  cache  

Page 17: CLyDE&Mid*Flight:& - WordPress.comCLyDE&Mid*Flight:& Whatwe&have&learned&aboutthe&& SSD*Based&IO&Stack& Philippe&Bonnet– phbo@itu.dk& & Jointwork&with&Luc& Bouganim&(INRIA),&Niv&Dayan&(ITU),&Maas

Beyond  block  device  interface  

•  File  System  and  SSD  today:  –  File  systems  has  no  clear  performance  model  for  SSD  –  SSD  makes  un-­‐informed  assumpQons  about  file  system  

•  Goal:  File  System  /  SSD  Co-­‐Design  –  SSD  exposes  an  abstracQon  that  is  well  suited  for  File  System  –  SSD  takes  leverages  this  abstracQon  to  opQmize  data  placement/scheduling  and  thus  minimize  GC  operaQons  

Page 18: CLyDE&Mid*Flight:& - WordPress.comCLyDE&Mid*Flight:& Whatwe&have&learned&aboutthe&& SSD*Based&IO&Stack& Philippe&Bonnet– phbo@itu.dk& & Jointwork&with&Luc& Bouganim&(INRIA),&Niv&Dayan&(ITU),&Maas

SSD  AbstracQon  •  SSD  must  provide  a  

mapping  from  logical  to  physical  address  spaces  

•  File  systems  in  Linux  abstracted  by  VFS  (files,  dentries,  inodes,  superblocks)  

•  Btrfs  insight:  What  if  every  internal  data  structure  in  a  file  system  was  an  entry  in  a  B-­‐tree?    

•  Idea:  What  if  SSD  provides  a  Tree-­‐Based  interface?  

•  Approach:  Rely  on  LightNVM  as  experimentaQon  plaqorm  –  Tree-­‐interface  supported  within  the  LightNVM  driver  (server-­‐side)  

–  B-­‐link  as  tree  structure  (for  a  start)  

–  CLyDE  FS  layer  mapping  VFS  abstracQons  onto  Tree-­‐based  SSD  interface  

Page 19: CLyDE&Mid*Flight:& - WordPress.comCLyDE&Mid*Flight:& Whatwe&have&learned&aboutthe&& SSD*Based&IO&Stack& Philippe&Bonnet– phbo@itu.dk& & Jointwork&with&Luc& Bouganim&(INRIA),&Niv&Dayan&(ITU),&Maas

Tree-­‐Based  SSD  Architecture  

•  Tree-­‐Based  interface  –  CreateTree  –  RemoveTree  –  InsertNode  –  RemoveNode  – WriteNode  –  ReadNode  –  TruncateNode  

Jesper  Madsen:  [email protected]  

Page 20: CLyDE&Mid*Flight:& - WordPress.comCLyDE&Mid*Flight:& Whatwe&have&learned&aboutthe&& SSD*Based&IO&Stack& Philippe&Bonnet– phbo@itu.dk& & Jointwork&with&Luc& Bouganim&(INRIA),&Niv&Dayan&(ITU),&Maas

IniQal  Results  

Buffered  IOs  on  RAM-­‐disk(ext2)  and  in-­‐RAM  B+-­‐tree  (CLyDEFS)  

Future  Work:  Mapping  onto  NVM,  comparison  with  Btrfs  

Constant  throughput  due  to  no  IO  scheduling  in  current  CLyDEFS  incarnaQon  

Page 21: CLyDE&Mid*Flight:& - WordPress.comCLyDE&Mid*Flight:& Whatwe&have&learned&aboutthe&& SSD*Based&IO&Stack& Philippe&Bonnet– phbo@itu.dk& & Jointwork&with&Luc& Bouganim&(INRIA),&Niv&Dayan&(ITU),&Maas

Mid-­‐Flight  Conclusions  •  Infrastructure  in  place  to  explore  cross-­‐layer  design  

–  EagleTree  and  LightNVM  •  BoYleneck  chasing  throughout  the  Linux  IO  stack  (on  mulQcore)  – Much  similar  to  boYleneck  chasing  in  Shore-­‐MT  –  DBMS  storage  manager  is  next  

•  Room  for  cross-­‐layer  opQmizaQon  –  Tree-­‐based  structure  is  a  good  candidate  as  replacement  for  block  device  interface  (i)  to  convey  appropriate  informaQon  from  DBMS  to  OS  to  SSD,  and  (ii)  to  manage  contenQon  

•  Need  to  address  challenges  of  DBMS  running  on  SSD-­‐based  virtualized  environments