DESCRIPTION
Reporting and analysis systems rely on coherent, reliable data, often drawn from disparate sources. To that end, a set of well-established data warehousing practices has emerged for extracting data and producing a consistent data store. This talk will look at some options for composing such workflows using Python. In particular, we'll explore beyond Celery's asynchronous task processing into its workflow (aka Canvas) system, and how it can be used in conjunction with SQLAlchemy to provide the building blocks for data stream processing.
Building data flows with Celery and SQLAlchemy
PyCon Australia 2013
Roger Barnes
@mindsocket
roger@mindsocket.com.au
http://slideshare.net/mindsocket
Coming up
● Data warehousing
– AKA data integration
● Processing data flows
– SQLAlchemy
– Celery
● Tying it all together
About me
● 15 years doing all things software
● 11 years at a Business Intelligence vendor
● Currently contracting
– This talk based on a real reporting system
Why Data Warehousing?
But we need reports that are
● Timely
● Unambiguous
● Accurate
● Complete
● … and don't impact production systems
What is a Data Warehouse
"… central repository of data which is created by integrating
data from one or more disparate sources" - Wikipedia
Extract, Transform, Load
[Diagram of the extract, transform, load (ETL) pipeline. Source: www.imc.com]
Python can help!
● Rapid prototyping
● Code re-use
● Existing libraries
● Decouple
– data flow management
– data processing
– business logic
Existing solutions
● Not a lot available in the Python space
● People roll their own
● Bubbles (Brewery 2)
– Framework for Python 3
– "Focus on the process, not the data technology"
Ways to move data around
● Flat files
● NoSQL data stores
● RDBMS
SQLAlchemy is...
Python SQL toolkit
&
Object Relational Mapper
About SQLAlchemy
● Full featured
● Mature, robust, documented, maintained
● Flexible
Enterprise!
DB support
● SQLite
● PostgreSQL
● MySQL
● Oracle
● MS-SQL
● Firebird
● Sybase
● ...
Python support
CPython 2.5+
CPython 3+
Jython 2.5+
PyPy 1.5+
Structure
SQLAlchemy Core
● Abstraction over Python's DBAPI
● SQL language via generative Python expressions
SQLAlchemy Core
● Good for DB performance
– bulk operations
– complex queries
– fine-tuning
– connection/tx management (see the sketch after the query example below)
Create a table
from sqlalchemy import *

engine = create_engine('sqlite:///:memory:')
metadata = MetaData()

vehicles_table = Table('vehicles', metadata,
    Column('model', String),
    Column('registration', String),
    Column('odometer', Integer),
    Column('last_service', Date),
)

vehicles_table.create(bind=engine)
Insert data
values = [
    {'model': 'Ford Festiva', 'registration': 'HAX00R', 'odometer': 3141},
    {'model': 'Lotus Elise', 'registration': 'DELEG8', 'odometer': 31415},
]

rows = engine.execute(
    vehicles_table.insert(), list(values)
).rowcount
Query data
query = select([vehicles_table]).where(
    vehicles_table.c.odometer < 100
)

results = engine.execute(query)

for row in results:
    print row
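The slides execute statements directly on the engine; for the connection/tx management mentioned earlier, a minimal sketch (not from the slides) could use engine.begin(), which commits on success and rolls back on error:

with engine.begin() as conn:
    # All statements on conn run in a single transaction
    conn.execute(vehicles_table.insert(),
                 {'model': 'Mini Cooper',
                  'registration': 'ITLJOB',
                  'odometer': 1964})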
Encapsulating a unit of work
Example Processor Types
● Extract
– Extract from CSV
– Extract from DB table
– Scrape web page
● Transform
– Copy table from extract layer
– Derive column
– Join tables
Abstract Processor
class BaseProcessor(object):
    def dispatch(self):
        return self._run()

    def _run(self):
        return self.run()

    def run(self):
        raise NotImplementedError
Abstract Database Processor

class DatabaseProcessor(BaseProcessor):
    db_class = None
    engine = None
    metadata = None

    @contextlib.contextmanager
    def _with_session(self):
        with self.db_class().get_engine() as engine:
            self.engine = engine
            self.metadata = MetaData(bind=engine)
            yield

    def _run(self):
        with self._with_session():
            return self.run()
CSV Extract Mixin
class CSVExtractMixin(object):
    input_file = None

    def _run(self):
        # Reuse the engine context defined on DatabaseProcessor
        with self._with_session():
            self.reader = csv.DictReader(self.input_file)
            return self.run()
A Concrete Extract

class SalesHistoryExtract(CSVExtractMixin, DatabaseProcessor):
    target_table_name = 'SalesHistoryExtract'
    input_file = SALES_FILENAME

    def run(self):
        target_table = Table(self.target_table_name, self.metadata)
        columns = self.reader.next()
        for column in columns:
            if column:
                target_table.append_column(Column(column, ...))
        target_table.create()
        insert = target_table.insert()
        new_record_count = self.engine.execute(
            insert, list(self.reader)).rowcount
        return new_record_count
An Abstract Derive Transform

class AbstractDeriveTransform(DatabaseProcessor):
    table_name = None
    key_columns = None
    select_columns = None
    target_columns = None

    def process_row(self, row):
        raise NotImplementedError

    ...  # Profit!
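The slides elide the rest of the class. Purely as a hypothetical sketch of what the driver loop might look like (none of this is from the talk, and it assumes the target columns already exist on the transform table), the class could continue with:

    def run(self):
        # Reflect the transform table, read key and input columns,
        # derive new values per row, write them back keyed on key_columns
        table = Table(self.table_name, self.metadata, autoload=True)
        rows = self.engine.execute(
            select([table.c[name]
                    for name in self.key_columns + self.select_columns]))
        count = 0
        for row in rows:
            self.engine.execute(
                table.update()
                     .where(and_(*(table.c[k] == row[k]
                                   for k in self.key_columns)))
                     .values(self.process_row(row)))
            count += 1
        return count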
A Concrete Transform

from business_logic import derive_foo

class DeriveFooTransform(AbstractDeriveTransform):
    table_name = 'SalesTransform'
    key_columns = ['txn_id']
    select_columns = ['location', 'username']
    target_columns = [Column('foo', FOO_TYPE)]

    def process_row(self, row):
        foo = derive_foo(row.location, row.username)
        return {'foo': foo}
Introducing Celery
Distributed Task Queue
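For orientation, here is a minimal generic Celery task (a sketch, not from the talk; the broker URL is an assumption). Work is defined as plain functions registered with an app; calling .delay() queues a message for a worker process to execute:

import celery

app = celery.Celery('tasks', broker='amqp://localhost')

@app.task
def add(x, y):
    return x + y

add.delay(2, 3)  # runs asynchronously on a worker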
A Processor Task
class AbstractProcessorTask(celery.Task):
    abstract = True
    processor_class = None

    def run(self, *args, **kwargs):
        processor = self.processor_class(*args, **kwargs)
        return processor.dispatch()

class DeriveFooTask(AbstractProcessorTask):
    processor_class = DeriveFooTransform

DeriveFooTask().apply_async()  # Run it!
Canvas: Designing Workflows
● Combines a series of tasks
● Groups run in parallel
● Chains run in series
● Can be combined in different ways
>>> new_user_workflow = (create_user.s() | group(
...     import_contacts.s(),
...     send_welcome_email.s()))
>>> new_user_workflow.delay(username='artv',
...                         first='Art',
...                         last='Vandelay',
...                         email='art@vandelay.com')
Sample Data Processing Flow

[Flow diagram: Extract sales / Extract customers / Extract products → Copy sales to transform (then Normalise currency) / Copy customers to transform / Copy products to transform → Join tables → Customer data exception report / Aggregate sales by customer / Aggregate sales by region]
Sample Data Processing Flow
extract_flow = group((
    ExtractSalesTask().si(),
    ExtractCustTask().si(),
    ExtractProductTask().si()))

transform_flow = group((
    CopySalesTask().si() | NormaliseCurrencyTask().si(),
    CopyCustTask().si(),
    CopyProductTask().si())) | JoinTask().si()

load_flow = group((
    QualityTask().si(),
    AggregateTask().si('cust_id'),
    AggregateTask().si('region_id')))

all_flow = extract_flow | transform_flow | load_flow
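Presumably the composed flow is then launched like any other signature; a minimal sketch:

result = all_flow.apply_async()  # queue the whole pipeline

Note that chaining a group into a following task is promoted to a chord, so a result backend must be configured for that synchronisation step.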
Monitoring – celery events
Monitoring – celery flower
Turning it up to 11
● A requires/depends structure
● Incremental data loads
● Parameterised flows (sketched below)
● Tracking flow history
● Hooking into other libraries
– NLTK
– SciPy/NumPy
– ...
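As a rough illustration of parameterised, incremental flows (the build function and the since parameter are assumptions, not from the talk):

def build_extract_flow(since):
    # Pass the incremental low-water mark into each immutable
    # signature so a scheduled run extracts only newer rows
    return group((
        ExtractSalesTask().si(since=since),
        ExtractCustTask().si(since=since),
        ExtractProductTask().si(since=since)))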
Summary
● Intro to data warehousing
● Process data with SQLAlchemy
● Task dependencies with Celery canvas
Resources
● SQLAlchemy core: http://bit.ly/10FdYZo
● Celery Canvas: http://bit.ly/MOjazT
● http://databrewery.org
– Bubbles: http://bit.ly/14hNsV0
– Pipeline: http://bit.ly/15RXvWa
● http://schoolofdata.org
Thank You!
Questions?
http://slideshare.net/mindsocket