PyTables: An on-disk binary data container, query engine and computational kernel
Francesc Alted
Tutorial for the PyData Conference, October 2012, New York City


10th anniversary of PyTables

"Hi!, PyTables is a Python package which allows dealing with HDF5 tables. Such a table is defined as a collection of records whose values are stored in fixed-length fields. PyTables is intended to be easy-to-use, and tries to be a high-performance interface to HDF5. To achieve this, the newest improvements introduced in Python 2.2 (like generators or slots and metaclasses in new-brand classes) has been used. Pyrex creation extension tool has been chosen to access the HDF5 library. This package should be platform independent, but until now I've tested it only with Linux. It's the first public release (v 0.1), and it is in alpha state."
-- Francesc Alted, announcing PyTables 0.1, October 2002


Overview

• What is PyTables?
• Data structures in PyTables
• Compressing data
• Advanced capabilities in PyTables

Notebooks for the tutorial

http://pytables.org/download/PyData2012-NYC.tar.gz


What it is

• A binary data container for on-disk, structured data
• Can perform operations with data *on-disk*
• Based on the de facto standard HDF5 format
• Free software (BSD license)


About HDF5 (Hierarchical Data Format version 5)

• A versatile data model that can represent complex data objects as well as associated metadata
• A portable file format with no limit on the number or size of data objects in the collection
• Implements a high-level API with C, C++, Fortran 90, and Java interfaces
• Free software (BSD/MIT-style license)


PyTables distinctive features

• Supports a good range of compressors: Zlib, bzip2, LZO and Blosc
• Powerful query capabilities for Table objects, including indexing
• Can perform out-of-core operations very efficiently


What it is not

• Not a relational database replacement
• Not a distributed database
• Not extremely secure or safe (it's more about speed!)
• Not a mere HDF5 wrapper

DATA STRUCTURES

Data structures

• High level of flexibility for structuring your data:
  – Datatypes: scalars (numerical & strings), records, enumerated, time…
  – Tables support multidimensional cells and nested records
  – Multidimensional arrays
  – Variable length arrays


The Array object

• Easy to create:
  – file.createArray(mygroup, 'array', numpy_arr)
• Shape cannot change
• Cannot be compressed
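The call on this slide uses the camelCase names of PyTables 2.x. The sketch below assumes PyTables 3.x, where the same operations are spelled `open_file`, `create_group` and `create_array`. The file name, the group name and the array contents are made up for the illustration.

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical file location and data, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "array_demo.h5")
numpy_arr = np.arange(12, dtype=np.int32).reshape(3, 4)

with tables.open_file(path, mode="w") as h5file:
    mygroup = h5file.create_group("/", "mygroup")
    # PyTables 3.x spelling of the slide's file.createArray(...)
    h5file.create_array(mygroup, "array", numpy_arr)

with tables.open_file(path, mode="r") as h5file:
    node = h5file.root.mygroup.array
    stored_shape = node.shape          # fixed when the Array was created
    first_row = node[0, :].tolist()    # slicing reads only the requested part
```

Slicing the node returns a NumPy array, so only the requested region needs to be held in memory.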


The CArray object

• Data is stored in chunks
• Each chunk can be compressed independently
• Shape cannot change
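A minimal sketch of a compressed CArray, assuming the PyTables 3.x API (`create_carray`, `tables.Filters`). The array size, the zlib compressor and the compression level are arbitrary choices for the example, not recommendations from the talk.

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical file location and data, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "carray_demo.h5")
data = np.zeros((1000, 100), dtype=np.float64)   # very compressible

# Compression settings are applied chunk by chunk.
filters = tables.Filters(complevel=5, complib="zlib")

with tables.open_file(path, mode="w") as h5file:
    carray = h5file.create_carray(
        h5file.root, "carray",
        atom=tables.Float64Atom(),
        shape=data.shape,          # the shape is fixed from now on
        filters=filters,
    )
    carray[:] = data

with tables.open_file(path, mode="r") as h5file:
    node = h5file.root.carray
    chunkshape = node.chunkshape           # chosen automatically here
    in_memory_bytes = node.size_in_memory  # uncompressed size
    on_disk_bytes = node.size_on_disk      # size after compression
```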


The EArray object

• Data is stored in chunks
• Can be compressed
• Shape can change (either enlarged or shrunk)
• Shape must be kept regular
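The following sketch, assuming the PyTables 3.x API, shows an EArray growing and shrinking along its extendable dimension. That dimension is the one declared with length 0; the 3-column layout and the values are invented for the example.

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical file location, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "earray_demo.h5")

with tables.open_file(path, mode="w") as h5file:
    # The 0 marks the extendable dimension; the other one is fixed at 3.
    earray = h5file.create_earray(
        h5file.root, "earray", atom=tables.Int32Atom(), shape=(0, 3)
    )
    # Appended blocks must match the fixed dimensions ("regular" shape).
    earray.append(np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int32))
    earray.append(np.array([[7, 8, 9]], dtype=np.int32))
    shape_after_append = tuple(earray.shape)

    earray.truncate(1)                      # shrink to a single row
    shape_after_truncate = tuple(earray.shape)
    remaining = earray[:].tolist()
```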

 


The VLArray object

• Data is stored in variable length rows
• Can be enlarged or shrunk
• Data cannot be compressed
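A short sketch of variable-length rows, assuming the PyTables 3.x `create_vlarray` call. The integer atom and the row contents are illustrative only.

```python
import os
import tempfile

import tables

# Hypothetical file location, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "vlarray_demo.h5")

with tables.open_file(path, mode="w") as h5file:
    vlarray = h5file.create_vlarray(
        h5file.root, "vlarray", atom=tables.Int32Atom()
    )
    # Each appended row may have a different number of elements.
    vlarray.append([1, 2, 3])
    vlarray.append([4])
    vlarray.append([5, 6])
    vlarray.flush()

    row_lengths = [len(row) for row in vlarray]
    second_row = vlarray[1].tolist()
```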

 


The Table object

• Data is stored in chunks
• Can be compressed
• Can be enlarged or shrunk
• Fields cannot be of variable length

Col1 (int32) | Col2 (string 10) | Col3 (bool) | Col4 (complex64) | Col5 (float32)
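The column layout on this slide could be declared roughly as follows. This is a sketch assuming the PyTables 3.x API; the class name `Record` and the row values are hypothetical, while the column names and types follow the slide.

```python
import os
import tempfile

import tables

# Hypothetical file location, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "table_demo.h5")


class Record(tables.IsDescription):
    """Fixed-length fields, mirroring the five columns on the slide."""
    Col1 = tables.Int32Col(pos=0)
    Col2 = tables.StringCol(itemsize=10, pos=1)
    Col3 = tables.BoolCol(pos=2)
    Col4 = tables.ComplexCol(itemsize=8, pos=3)   # 8 bytes = complex64
    Col5 = tables.Float32Col(pos=4)


with tables.open_file(path, mode="w") as h5file:
    table = h5file.create_table(h5file.root, "table", Record)

    # Enlarge the table by appending rows.
    row = table.row
    for i in range(5):
        row["Col1"] = i
        row["Col2"] = ("name%d" % i).encode("ascii")
        row["Col3"] = (i % 2 == 0)
        row["Col4"] = complex(i, -i)
        row["Col5"] = i * 0.5
        row.append()
    table.flush()

    column_names = list(table.colnames)
    nrows = int(table.nrows)
    col1_values = table.col("Col1").tolist()

    # Shrink the table by removing the first two rows.
    table.remove_rows(0, 2)
    nrows_after_removal = int(table.nrows)
```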


Dataset hierarchy

root
  group1
    table1
    table2
  group2
    array
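A hierarchy like the one on this slide can be built and traversed as sketched below, assuming the PyTables 3.x API. The node names follow the slide; the `Point` description and the array contents are placeholders.

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical file location, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "hierarchy_demo.h5")


class Point(tables.IsDescription):
    """Placeholder table layout."""
    x = tables.Float64Col()


with tables.open_file(path, mode="w") as h5file:
    group1 = h5file.create_group("/", "group1")
    group2 = h5file.create_group("/", "group2")
    h5file.create_table(group1, "table1", Point)
    h5file.create_table(group1, "table2", Point)
    h5file.create_array(group2, "array", np.arange(4))

    # Walk every leaf (dataset) in the file, whatever its depth.
    leaf_paths = sorted(
        node._v_pathname for node in h5file.walk_nodes("/", classname="Leaf")
    )

    # Nodes can be reached by path or by attribute-style access.
    by_path = h5file.get_node("/group1/table1")
    by_attribute = h5file.root.group1.table1
    same_node = by_path._v_pathname == by_attribute._v_pathname
```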


Attributes: Metadata about data

table1
  Date: Jul 24 2006
  Observations: 555
  CF: [0.1, 0.3, 0.6]
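The metadata shown on this slide could be attached to a node through its `attrs` accessor, as in this sketch (PyTables 3.x API assumed). The lowercase attribute names and the `Point` description are choices made for the example.

```python
import os
import tempfile

import tables

# Hypothetical file location, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "attrs_demo.h5")


class Point(tables.IsDescription):
    """Placeholder table layout."""
    x = tables.Float64Col()


with tables.open_file(path, mode="w") as h5file:
    table1 = h5file.create_table("/", "table1", Point)
    # Attributes travel with the node inside the same file.
    table1.attrs.date = "Jul 24 2006"
    table1.attrs.observations = 555
    table1.attrs.cf = [0.1, 0.3, 0.6]

with tables.open_file(path, mode="r") as h5file:
    attrs = h5file.root.table1.attrs
    date = str(attrs.date)
    observations = int(attrs.observations)
    cf = list(attrs.cf)
```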

COMPRESSION CAPABILITIES

Why compression?

• Lets you store more data using the same space
• Uses more CPU, but CPU time is cheap compared with disk access
• Different compressors for different uses: bzip2, zlib, lzo, blosc
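One way to compare the compressors is to store the same data once per library and look at the space used on disk. The sketch below assumes the PyTables 3.x API and skips any compressor that is not available in the local build. The test data and the compression level are arbitrary, so the resulting sizes say nothing general about the libraries.

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical file location and data, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "compressors_demo.h5")
data = np.tile(np.arange(1000, dtype=np.int64), 100)   # repetitive, ~800 KB
raw_bytes = data.nbytes

# Keep only the compressors present in this PyTables build.
complibs = [
    lib for lib in ("zlib", "bzip2", "lzo", "blosc")
    if tables.which_lib_version(lib) is not None
]

with tables.open_file(path, mode="w") as h5file:
    for lib in complibs:
        filters = tables.Filters(complevel=5, complib=lib)
        h5file.create_carray(h5file.root, lib, obj=data, filters=filters)

sizes = {}
with tables.open_file(path, mode="r") as h5file:
    for lib in complibs:
        sizes[lib] = h5file.get_node("/", lib).size_on_disk
```

Comparing `sizes` with `raw_bytes` shows the space saved; timing the writes and reads would show the CPU cost side of the trade-off.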

 

Why Blosc?

Accelerating I/O
[Figure: Blosc compared with other compressors]

Memory access vs CPU cycle time

Laptop computer back in 2005

State of the art computer in 2012 (single node)

OUT-OF-CORE COMPUTATIONS

Operating with disk-based arrays

• tables.Expr is an optimized evaluator for expressions of disk-based arrays.
• It is a combination of the Numexpr advanced computing capabilities with the high I/O performance of PyTables.
• Similarly to Numexpr, disk-temporaries are avoided, and multi-threaded operation is preserved.
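A minimal sketch of `tables.Expr` evaluating an expression over on-disk arrays and writing the result to another on-disk array, assuming the PyTables 3.x API. The expression, the array names and the sizes are invented for the example.

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical file location and sizes, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "expr_demo.h5")
n = 100_000

with tables.open_file(path, mode="w") as h5file:
    a = h5file.create_carray(h5file.root, "a", obj=np.arange(n, dtype=np.float64))
    b = h5file.create_carray(h5file.root, "b", obj=np.ones(n, dtype=np.float64))
    out = h5file.create_carray(
        h5file.root, "out", atom=tables.Float64Atom(), shape=(n,)
    )

    # Operands are passed explicitly; they are disk-based nodes, not
    # in-memory arrays.
    expr = tables.Expr("3 * a + 2 * b", uservars={"a": a, "b": b})
    expr.set_output(out)   # results go straight to the on-disk array
    expr.eval()

    result_head = out[:5].tolist()
```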


Avoiding temporaries with Numexpr

tables.Expr follows the same approach, but with disk instead of memory
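As an in-memory illustration of the idea, the sketch below evaluates the same expression with plain NumPy, which materializes intermediate arrays, and with `numexpr.evaluate`, which processes the operands in blocks. The expression and array sizes are made up for the example.

```python
import numexpr as ne
import numpy as np

# Hypothetical operands, used only for this illustration.
a = np.arange(1_000_000, dtype=np.float64)
b = np.ones_like(a)

# NumPy builds full-size temporaries for 3 * a and 2 * b before adding them.
numpy_result = 3 * a + 2 * b

# Numexpr compiles the expression and evaluates it block by block,
# so no full-size temporaries are needed.
numexpr_result = ne.evaluate("3 * a + 2 * b", local_dict={"a": a, "b": b})
```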

Performing out-of-core computations with PyTables

Virtual machine: numexpr

ADVANCED QUERY CAPABILITIES

Different query modes

Regular query:
• [ r['c1'] for r in table if r['c2'] > 2.1 and r['c3'] == True ]

In-kernel query:
• [ r['c1'] for r in table.where('(c2>2.1)&(c3==True)') ]

Indexed query:
• table.cols.c2.createIndex()
• table.cols.c3.createIndex()
• [ r['c1'] for r in table.where('(c2>2.1)&(c3==True)') ]
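The three query modes can be tried on a small table, as sketched below. The slide uses the PyTables 2.x name `createIndex`; this sketch assumes PyTables 3.x, where it is `create_index`. The table layout and values are hypothetical, and a table this small says nothing about relative performance.

```python
import os
import tempfile

import tables

# Hypothetical file location, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "query_demo.h5")


class Row(tables.IsDescription):
    """Column names follow the slide; the types are assumptions."""
    c1 = tables.Int32Col()
    c2 = tables.Float64Col()
    c3 = tables.BoolCol()


with tables.open_file(path, mode="w") as h5file:
    table = h5file.create_table(h5file.root, "table", Row)
    row = table.row
    for i in range(100):
        row["c1"] = i
        row["c2"] = i / 4.0
        row["c3"] = (i % 2 == 0)
        row.append()
    table.flush()

    # Regular query: the condition is evaluated in Python, row by row.
    regular = [int(r["c1"]) for r in table if r["c2"] > 2.1 and r["c3"] == True]

    # In-kernel query: the condition string is evaluated by the query engine.
    condition = "(c2 > 2.1) & (c3 == True)"
    in_kernel = [int(r["c1"]) for r in table.where(condition)]

    # Indexed query: same call, but indexes can now narrow the search.
    table.cols.c2.create_index()
    table.cols.c3.create_index()
    indexed = [int(r["c1"]) for r in table.where(condition)]
```

All three forms select the same rows; they differ only in where the condition is evaluated.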

Regular and in-kernel queries

Customizable indexes

Indexed query performance
[Figure: PyTables Pro query performance, from "Large Data Analysis with Python" by Francesc Alted]

Concepts to take home

• PyTables is optimized to deal with data on disk
• Most of the operations use the iterator/generator machinery in Python: the goal is not to bloat memory with data
• Queries, indexes and out-of-core operations are good examples of the above


Thank you!

• Questions?