GC3: Grid Computing Competence Center
Introduction to Python programming, II (with a hint of MapReduce)
Riccardo Murri, Grid Computing Competence Center, University of Zurich
Oct. 10, 2012
Today’s class
Explain more Python constructs and semantics by looking at John Arley Burns’ MapReduce in 98 lines of Python.
These slides are available for download from: http://www.gc3.uzh.ch/teaching/lsci2012/lecture03.pdf
LSCI2012 Python II Oct. 10, 2012
References
See the course website for an extensive and commented list.
– Dean, J., and Ghemawat, S.: MapReduce: Simplified Data Processing on Large Clusters, OSDI’04
– Greiner, J., Wong, S.: Distributed Parallel Processing with MapReduce
– Carter, J.: Simple MapReduce with Ruby and Rinda
What is MapReduce?
MapReduce is:
1. a programming model
2. an associated implementation
Both are important!!
MapReduce
The Map function processes a key/value pair to produce intermediate key/value pairs.
Image source: Greiner, J., Wong, S.: Distributed Parallel Processing with MapReduce
MapReduce
The Reduce function merges all intermediate values associated with a given key.
Image source: Greiner, J., Wong, S.: Distributed Parallel Processing with MapReduce
MapReduce: advantages of the model
Programs written in this style are automatically parallelized and executed on a large cluster of machines . . .
Quoted from: Dean and Ghemawat: MapReduce: Simplified Data Processing on Large Clusters
Example: word count
Input is a text file, to be split at line boundaries.
Image source: http://blog.jteam.nl/2009/08/04/introduction-to-hadoop/
Example: word count
The Map function scans an input line and outputs a pair (word, 1) for each word in the text line.
Image source: http://blog.jteam.nl/2009/08/04/introduction-to-hadoop/
Example: word count
The pairs are shuffled and sorted so that each reducer gets all pairs (word, 1) with the same word part.
Image source: http://blog.jteam.nl/2009/08/04/introduction-to-hadoop/
Example: word count
The Reduce function gets all pairs (word, 1) with the same word part, and outputs a single pair (word, count) where count is the number of input items received.
Image source: http://blog.jteam.nl/2009/08/04/introduction-to-hadoop/
Example: word count
The global output is a list of pairs (word, count) where count is the number of occurrences of word in the input text.
Image source: http://blog.jteam.nl/2009/08/04/introduction-to-hadoop/
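The word count steps (split, map, shuffle, reduce) can be tried out in a few lines of plain Python. This is only a single-process sketch under assumptions: the function names map_fn, shuffle, reduce_fn and word_count are made up for illustration, and nothing here is distributed or taken from a real MapReduce implementation.

```python
from collections import defaultdict

def map_fn(line):
    """Map: emit a (word, 1) pair for each word in one input line."""
    for word in line.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle: group the values by key, one list of values per word."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(word, counts):
    """Reduce: merge all values for one key into a single (word, count)."""
    return (word, sum(counts))

def word_count(text):
    lines = text.splitlines()                      # split at line boundaries
    pairs = [pair for line in lines for pair in map_fn(line)]
    groups = shuffle(pairs)
    return sorted(reduce_fn(w, c) for w, c in groups.items())

result = word_count("the quick fox\nthe lazy dog\nthe end")
print(result)
```

Each call to map_fn and reduce_fn depends only on its own arguments, which is what allows a real run-time system to execute them on different machines.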
MapReduce: features of the implementation
The run-time system takes care of the details:
– partitioning the input data,
– scheduling the program execution,
– handling machine failures,
– managing the required inter-machine communication.
Quoted from: Dean and Ghemawat: MapReduce: Simplified Data Processing on Large Clusters
These are all highly nontrivial tasks to handle!
The quality of a MapReduce implementation should be judged by how effective it is at handling the non-Map/Reduce part.
Back to Python!
mapreduce.py by John Arley Burns is a simple Python class that simulates running a MapReduce algorithm using in-memory data structures.
A MapReduce algorithm is specified by subclassing the MapReduce class and overriding methods to provide the Split, Map, and Reduce functions.
(There’s no Partition/Shuffle function because all the data is kept in memory and sorted there, so no locality issues.)
    import operator
    import re

    from mapreduce import MapReduce

    class WordCount(MapReduce):

        def __init__(self, data):
            MapReduce.__init__(self)
            self.data = data

        def split_fn(self, data):
            def line_to_tuple(line):
                return (None, line)
            data_list = [
                line_to_tuple(line)
                for line in data.splitlines() ]
            return data_list

        def map_fn(self, key, value):
            for word in re.split(r'\W+', value.lower()):
                bareword = re.sub(r"[^A-Za-z0-9]*", r"", word)
                if len(bareword) > 0:
                    yield (bareword, 1)

        def reduce_fn(self, word, count_list):
            return [(word, sum(count_list))]

        def output_fn(self, output_list):
            # sort by count; this needs the operator module imported above
            sorted_list = sorted(output_list, key=operator.itemgetter(1))
            for word, count in sorted_list:
                print(word, count)

The word count example using mapreduce.py
Importing modules
    import re
    from mapreduce import MapReduce

    class WordCount(MapReduce):
        # ...
        def map_fn(self, key, value):
            for word in re.split(...):
                bareword = re.sub(...)
                if len(bareword) > 0:
                    yield (bareword, 1)
        # ...

This imports the re (regular expressions) module.
All names defined in that module are now visible under the re namespace, e.g., re.sub, re.split.
Importing names
    import re
    from mapreduce import MapReduce

    class WordCount(MapReduce):
        def __init__(self, data):
            MapReduce.__init__(self)
            self.data = data
        # ...

This imports the MapReduce name, defined in the mapreduce module, into this module’s namespace.
So you need not use a prefix to qualify it.
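The two import forms can be compared side by side. A minimal sketch that uses only the standard re module; importing sub by name is chosen here purely for illustration.

```python
import re                 # binds the module object to the name "re"
from re import sub        # binds just that function to the name "sub"

# Qualified access through the module namespace
words = re.split(r"\W+", "Hello, MapReduce world")

# Unqualified access to a name imported with "from ... import"
squeezed = sub(r"\s+", " ", "too    many   spaces")

print(words)
print(squeezed)
print(sub is re.sub)      # both names refer to the same function object
```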
Defining objects
    class WordCount(MapReduce):
        def __init__(self, data):
            MapReduce.__init__(self)
            self.data = data
        # ...

The class keyword starts the definition of a class (in the OOP sense).
The class definition is indented.
Inheritance
    class WordCount(MapReduce):
        def __init__(self, data):
            MapReduce.__init__(self)
            self.data = data
        # ...

This tells Python that the WordCount class inherits from the MapReduce class.
Every class must inherit from some other class; the root of the class hierarchy is the built-in object class.
Declaring methods
    class WordCount(MapReduce):
        def __init__(self, data):
            MapReduce.__init__(self)
            self.data = data
        # ...

A method declaration looks exactly like a function definition.
Every method must have at least one argument, named self.
(Why the double underscore? More on this later!)
The self argument
    class WordCount(MapReduce):
        def __init__(self, data):
            MapReduce.__init__(self)
            self.data = data
        # ...

self is a reference to the object instance (like, e.g., this in Java).
It is used to access attributes and invoke methods of the instance itself.
The self argument
Every method of a Python object always has self as first argument.
However, you do not specify it when calling a method: it’s automatically inserted by Python:

    >>> class ShowSelf(object):
    ...     def show(self):
    ...         print(self)
    ...
    >>> x = ShowSelf()  # construct instance
    >>> x.show()        # 'self' automatically inserted!
    <__main__.ShowSelf object at 0x299e150>

The self variable is a reference to the object instance itself. You need to use self when accessing methods or attributes of this instance.
The self argument
    class WordCount(MapReduce):
        def __init__(self, data):
            MapReduce.__init__(self)
            self.data = data
        # ...

Q (1): Why is the data identifier qualified with the self. namespace?
The self argument
    class WordCount(MapReduce):
        def __init__(self, data):
            MapReduce.__init__(self)
            self.data = data
        # ...

Q (2): Why do we explicitly write self in the call MapReduce.__init__(self)?
Name resolution rules
Within a function/method body, names are resolved according to the LEGB rule:
L Local scope: any names defined in the current function;
E Enclosing function scope: names defined in enclosing functions (outermost last);
G Global scope: names defined in the toplevel of the current module;
B Built-in names (i.e., Python’s builtins module).
Any name that is not in one of the above scopes must be qualified.
So you have to write self.data to access an attribute of this instance, re.sub to mean a function defined in module re, MapReduce.__init__ to reference a method defined in the MapReduce class, etc.
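A small sketch can make the four LEGB scopes visible in one place. The names scale, make_scaler and scaler are invented for this illustration.

```python
scale = 10            # G: global name, defined at the module top level

def make_scaler(offset):
    # "offset" is local here, and an enclosing (E) name for scaler()
    def scaler(values):
        total = sum(values)             # "total" is L; "sum" is B (built-in)
        return total * scale + offset   # "scale" is G; "offset" is E
    return scaler

scaler = make_scaler(offset=1)
print(scaler([1, 2, 3]))
```

None of these names needs a prefix, because each one is found in one of the four scopes.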
Object attributes
A Python object is (in particular) a key-value mapping: attributes (keys) are valid identifiers, values can be any Python object.
Any object has attributes, which you can access (create, read, overwrite) using the dot notation:

    # create or overwrite the 'name' attribute of 'w'
    w.name = "Joe"
    # get the value of 'w.name' and print it
    print(w.name)

So, in the constructor you create the required instance attributes using self.var = ...
Note: methods are attributes, too!
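The key-value nature of attributes can be inspected directly. A minimal sketch, assuming a made-up Worker class; vars() and getattr() are standard built-ins.

```python
class Worker(object):          # hypothetical class, for illustration only
    def __init__(self, name):
        self.name = name       # create an instance attribute

    def greet(self):
        return "Hello, " + self.name

w = Worker("Joe")
w.name = "Jane"                # overwrite an existing attribute
w.role = "mapper"              # create a new attribute on the fly

# Instance attributes are stored in a per-object dictionary
print(sorted(vars(w).items()))

# getattr() is the dot notation with the name given as a string
print(getattr(w, "role"))

# Methods are attributes too: look one up now, call it later
greet = w.greet
print(greet())
```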
No access control
There are no “public”/“private”/etc. qualifiers for object attributes.
Any code can create/read/overwrite/delete any attribute on any object.
There are conventions, though:
– “protected” attributes: _name
– “private” attributes: __name
(But again, note that this is not enforced by the system in any way.)
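That the naming conventions are not enforced can be checked in a few lines. This sketch uses a made-up Account class; the only language detail it relies on beyond the slide is that Python stores a double-underscore attribute under a mangled name of the form _ClassName__name.

```python
class Account(object):           # hypothetical class, for illustration only
    def __init__(self):
        self._balance = 100      # "protected" by convention only
        self.__pin = "1234"      # "private": stored under a mangled name

acct = Account()

# Nothing stops outside code from touching a "protected" attribute
acct._balance = 0
print(acct._balance)

# The double-underscore name was rewritten to _Account__pin ...
print(sorted(vars(acct)))

# ... and is still reachable under that name
print(acct._Account__pin)
```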
Class attributes, I
Classes are Python objects too, hence they can have attributes.
Class attributes can be created with the variable assignment syntax in a class definition block:

    class A(object):
        class_attr = value
        def __init__(self):
            # ...

Class attributes are shared among all instances of the same class!
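Sharing is easy to observe with a counter kept on the class. A minimal sketch with a made-up Counter class; it also shows the common pitfall that assigning through an instance creates a separate instance attribute.

```python
class Counter(object):        # hypothetical class, for illustration only
    instances = 0             # class attribute: one value for the whole class

    def __init__(self):
        # Assign through the class, so every instance sees the update
        Counter.instances += 1

a = Counter()
b = Counter()
print(a.instances, b.instances, Counter.instances)

# Assigning through an instance creates an instance attribute that
# shadows the class attribute for that one object only
a.instances = 99
print(a.instances, b.instances, Counter.instances)
```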
Class attributes, II
Methods are class attributes, too.
However, looking up a method attribute on an instance returns a bound method, i.e., one for which self is automatically inserted.
Looking up the same method on a class returns an unbound method, which is just like a regular function, i.e., you must pass self explicitly.
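The difference between the two lookups can be seen by calling both results. A sketch assuming Python 3, where the class-level lookup gives back the plain function; the Greeter class is made up for the example.

```python
class Greeter(object):        # hypothetical class, for illustration only
    def __init__(self, name):
        self.name = name

    def greet(self):
        return "Hi, " + self.name

g = Greeter("Ada")

bound = g.greet               # looked up on the instance: self is g
print(bound())

plain = Greeter.greet         # looked up on the class: no self attached
print(plain(g))               # so the instance is passed explicitly

# The bound method remembers which instance it belongs to
print(bound.__self__ is g)
```

This is the same mechanism used by MapReduce.__init__(self): the method is looked up on the class, so self has to be written out.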
Constructors, I
    class WordCount(MapReduce):
        def __init__(self, data):
            MapReduce.__init__(self)
            self.data = data
        # ...

The __init__ method is the instance constructor.
It should never return any value (other than None).
Constructors, II
The __init__ method is the instance constructor. It should never return any value (other than None).
However, you call a constructor by class name:

    # make 'wc' an instance of 'WordCount'
    wc = WordCount("some text")

(Again, note that the self part is automatically inserted by Python.)
No overloading
Python does not allow overloading of functions.
Any function.
Hence, no overloading of constructors.
So: a class can have one and only one constructor.
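Two common ways to live with a single constructor are default argument values and class methods that act as alternative constructors. The sketch below is one possible approach, not something the slides prescribe; Point, from_string and area are made-up names.

```python
class Point(object):                 # hypothetical class, for illustration only
    def __init__(self, x=0, y=0):    # defaults cover the "no arguments" case
        self.x = x
        self.y = y

    @classmethod
    def from_string(cls, text):
        """Alternative way to build a Point, e.g. from the string "3,4"."""
        x, y = text.split(",")
        return cls(int(x), int(y))

origin = Point()
p = Point(1, 2)
q = Point.from_string("3,4")
print((origin.x, origin.y), (p.x, p.y), (q.x, q.y))

# A second "def" with the same name does not overload: it replaces the first
def area(side):
    return side * side

def area(width, height):
    return width * height

print(area(2, 3))
```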
Constructor chaining
When a class is instantiated, Python only calls the first constructor it can find in the class inheritance chain.
If you need to call a superclass’ constructor, you need to do it explicitly:

    class WordCount(MapReduce):
        def __init__(self, ...):
            # do WordCount-specific stuff here
            MapReduce.__init__(self, ...)
            # some more WordCount-specific stuff

Calling a superclass constructor is optional, and it can happen anywhere in the __init__ method body.
Multiple-inheritance
Python allows multiple inheritance.
Just list all the parent classes:
    class C(A, B):
        # class definition

With multiple inheritance, it is your responsibility to call all the needed superclass constructors.
Python uses the C3 algorithm to determine the call precedence in an inheritance chain.
You can always query a class for its “method resolution order”, via the __mro__ attribute:

    >>> C.__mro__
    (<class 'ex.C'>, <class 'ex.A'>, <class 'ex.B'>, <type 'object'>)
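A complete toy hierarchy shows both points: the parent constructors run only if called, and __mro__ lists the lookup order. The attribute names from_a and from_b are made up for the example.

```python
class A(object):
    def __init__(self):
        self.from_a = True

class B(object):
    def __init__(self):
        self.from_b = True

class C(A, B):
    def __init__(self):
        # Neither parent constructor runs unless it is called here
        A.__init__(self)
        B.__init__(self)

c = C()
print(c.from_a, c.from_b)

# Method resolution order: C first, then its parents left to right
print([cls.__name__ for cls in C.__mro__])
```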
Nested functions
    import re

    class WordCount(MapReduce):
        # ...
        def split_fn(self, data):
            def line_to_tuple(line):
                return (None, line)
            data_list = [
                line_to_tuple(line)
                for line in data.splitlines() ]
            return data_list
        # ...

You can define functions (and classes) within functions.
The nested functions are only visible within the enclosing function.
(But they can capture any variable from the enclosing function environment by name.)
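Capturing a variable from the enclosing function looks like this in practice. A minimal sketch; make_tagger and tagger are made-up names.

```python
def make_tagger(tag):
    # "tag" belongs to make_tagger; the nested function captures it by name
    def tagger(line):
        return (tag, line)
    return tagger

tag_input = make_tagger("input")
print(tag_input("first line"))

# The nested name is not visible outside the enclosing function
try:
    tagger
except NameError:
    print("tagger is not defined at module level")
```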
List comprehensions, I
    class WordCount(MapReduce):
        # ...
        def split_fn(self, data):
            def line_to_tuple(line):
                return (None, line)
            data_list = [
                line_to_tuple(line)
                for line in data.splitlines() ]
            return data_list
        # ...

Q: What is the bracketed expression assigned to data_list?
An easy exercise
A dotfile is a file whose name starts with a dot character “.”.
How can you list the full pathname of all dotfiles in a given directory?
(The Python library call for listing the entries in a directory is os.listdir(), which returns a list of file names.)
A very basic solution
Use a for loop to accumulate the results into a list:

    dotfiles = [ ]
    for entry in os.listdir(path):
        if entry.startswith('.'):
            dotfiles.append(os.path.join(path, entry))
List comprehensions, II
Python has a better and more compact syntax for filtering elements of a list and/or applying a function to them:

    dotfiles = [ os.path.join(path, entry)
                 for entry in os.listdir(path)
                 if entry.startswith('.') ]
This is called a list comprehension.
List comprehensions, III
The general syntax of a list comprehension is:
    [ expr for var in iterable if condition ]

where:
expr is any Python expression;
iterable is a (generalized) sequence;
condition is a boolean expression, depending on var;
var is a variable that will be bound in turn to each item in iterable which satisfies condition.
The ‘if condition’ part is optional.
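The general syntax can be read as shorthand for a loop with an append. A small sketch on a made-up list of words:

```python
words = ["map", "Reduce", "", "Split", "shuffle"]

# expr = w.lower(), var = w, iterable = words, condition = len(w) > 0
lowered = [w.lower() for w in words if len(w) > 0]

# The equivalent explicit loop
lowered_loop = []
for w in words:
    if len(w) > 0:
        lowered_loop.append(w.lower())

# Without the optional condition, every item is kept
lengths = [len(w) for w in words]

print(lowered)
print(lowered == lowered_loop)
print(lengths)
```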
Generator expressions
List comprehensions are a special case of generator expressions:

    ( expr for var in iterable if condition )

A generator expression is a valid iterable and can be used to initialize tuples, sets, dicts, etc.:

    # the set of square numbers < 100
    squares = set(n*n for n in range(10))

Generator expressions are valid expressions, so they can be nested:

    # cartesian product of sets A and B
    C = set( (a,b) for a in A for b in B )
Generators
Generator expressions are a special case of generators.
A generator is like a function, except it uses yield instead of return:

    def squares():
        n = 0
        while True:
            yield n*n
            n += 1

At each iteration, execution resumes with the statement logically following yield in the generator’s execution flow.
There can be multiple yield statements in a generator.
Reference: http://wiki.python.org/moin/Generators
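Because squares() never ends, it has to be consumed step by step. The sketch below drives it with the built-in next() and with itertools.islice; the countdown generator is a made-up example of several yield statements.

```python
from itertools import islice

def squares():
    n = 0
    while True:
        yield n * n      # execution pauses here until the next value is asked for
        n += 1

gen = squares()
print(next(gen), next(gen), next(gen))

# The generator is infinite, so take a finite slice of it
print(list(islice(squares(), 5)))

def countdown():
    # Several yield statements in one generator
    yield "ready"
    yield "set"
    yield "go"

print(list(countdown()))
```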
Generators in action
    class WordCount(MapReduce):
        # ...
        def map_fn(self, key, value):
            for word in re.split(r'\W+', value.lower()):
                bareword = re.sub(r"[^A-Za-z0-9]*", r"", word)
                if len(bareword) > 0:
                    yield (bareword, 1)
        # ...

This makes map_fn into a generator that yields pairs (word, 1).
The Iterator Protocol
An object can function as an iterator iff it implements a next() method (named __next__() in Python 3), that:
either returns the next value in the iteration,
or raises StopIteration to signal the end of the iteration.
An object can be iterated over with for if it implements an __iter__() method.
Reference: http://www.python.org/dev/peps/pep-0234/
    class WordIterator(object):

        def __init__(self, text):
            self._words = text.split()

        def next(self):
            if len(self._words) > 0:
                return self._words.pop(0)
            else:
                raise StopIteration

        def __iter__(self):
            return self

Iterate over the words in the given text: split the text at white spaces, and return the parts one by one.
Source code available at: http://www.gc3.uzh.ch/teaching/lsci2011/lecture08/worditerator.py
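In Python 3 the protocol method is spelled __next__() rather than next(). The variant below is a sketch of one way to support both spellings; it is an adaptation written for this note, not the file linked above.

```python
class WordIterator(object):
    """Iterate over the whitespace-separated words of a text."""

    def __init__(self, text):
        self._words = text.split()

    def __next__(self):               # Python 3 name of the protocol method
        if len(self._words) > 0:
            return self._words.pop(0)
        raise StopIteration

    next = __next__                   # Python 2 looks for next() instead

    def __iter__(self):
        return self

it = WordIterator("a nice sunny day")
print(next(it))                       # the built-in next() calls the method
print(list(it))                       # consumes what is left
print(list(it))                       # an exhausted iterator stays empty
```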
    class WordIterator(object):
        # ...

Every class must inherit from a parent class.
If there’s no other class, inherit from the object class. (Root of the class hierarchy.)
Using iterators
Iterators can be used in a for loop:
    >>> for word in WordIterator("a nice sunny day"):
    ...     print '*'+word+'*',
    ...
    *a* *nice* *sunny* *day*

They can be composed with other iterators for effect:

    >>> for n, word in enumerate(WordIterator("a ...")):
    ...     print str(n)+':'+word,
    ...
    0:a 1:nice 2:sunny 3:day

See also: http://docs.python.org/library/itertools.html
    class WordIterator(object):
        # ...
        def next(self):
            if len(self._words) > 0:
                return self._words.pop(0)
            else:
                raise StopIteration
        # ...

Q: What is this raise StopIteration statement?
Exceptions
Exceptions are objects that inherit from the built-in Exception class.
To create a new exception just make a new class:

    class NewKindOfError(Exception):
        """
        Do use the docstring to document
        what this error is about.
        """
        pass

Exceptions are handled by class name, so they usually do not need any new methods (although you are free to define some if needed).
See also: http://docs.python.org/library/exceptions.html
    try:
        # code that might raise an exception
    except SomeException:
        # handle some exception
    except AnotherException as ex:
        # the actual Exception instance
        # is available as variable 'ex'
    else:
        # performed on normal exit from 'try'
    finally:
        # performed on exit in any case

The optional else clause is executed if and when control flows off the end of the try clause.
The optional finally clause is executed on exit from the try or except block in any case.
Reference: http://docs.python.org/reference/compound_stmts.html#try
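Which clauses run, and in what order, can be recorded in a list. A minimal sketch; parse_port is a made-up helper that only exists to trigger or avoid a ValueError.

```python
def parse_port(text):
    """Hypothetical helper: record which clauses run, in order."""
    trace = []
    try:
        port = int(text)
    except ValueError as ex:
        trace.append("except: " + type(ex).__name__)
        port = None
    else:
        trace.append("else")          # only when try raised nothing
    finally:
        trace.append("finally")       # always, whatever happened above
    return port, trace

print(parse_port("8080"))
print(parse_port("http"))
```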
Raising exceptions
Use the raise statement with an Exception instance:

    if an_error_occurred:
        raise AnError("Spider sense is tingling.")

Within an except clause, you can use raise with no arguments to re-raise the current exception:

    try:
        something()
    except ItDidntWork:
        do_cleanup()
        # re-raise exception to caller
        raise
Exception handling example
Read lines from a CSV file, ignoring those that do not have the required number of fields. If other errors occur, abort. Close the file when done.

    job_state = { }  # empty dict
    # open outside 'try': if this fails, there is no file to close
    csv_file = open('jobs.csv', 'r')
    try:
        for line in csv_file:
            line = line.strip()  # remove trailing newline
            try:
                name, jobid, state = line.split(",")
            except ValueError:
                continue  # ignore line
            job_state[jobid] = state
    except IOError:
        raise  # up to caller
    finally:
        csv_file.close()
A common case
The “cleanup” pattern is so common that Python has a special statement to deal with it:

    with open('jobs.csv', 'r') as csv_file:
        for line in csv_file:
            line = line.strip()  # remove trailing newline
            try:
                name, jobid, state = line.split(",")
            except ValueError:
                continue  # ignore line
            job_state[jobid] = state

The with statement ensures that the file is closed upon exit from the with block (for whatever reason).
Reference: http://docs.python.org/reference/compound_stmts.html#with
The “context manager” protocol
Any object can be used in a with statement, provided it defines the following two methods:
__enter__()
Called upon entrance of the with block; its return value is assigned to the variable following as (if any).
__exit__(exc_cls, exc_val, exc_tb)
Called with three arguments upon exit from the block. If an exception occurred, the three arguments are the exception type, value and traceback; otherwise, the three arguments are all set to None.
Q: Can you think of other examples where this could be useful?
See also: http://www.python.org/dev/peps/pep-0343/
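A hand-written context manager makes the two calls visible. This is a minimal sketch with a made-up Tracer class that only records what happens; returning False from __exit__ lets any exception continue to propagate.

```python
class Tracer(object):
    """Hypothetical context manager that records enter/exit events."""

    def __init__(self):
        self.events = []

    def __enter__(self):
        self.events.append("enter")
        return self                      # bound to the name after "as"

    def __exit__(self, exc_cls, exc_val, exc_tb):
        if exc_cls is None:
            self.events.append("exit: ok")
        else:
            self.events.append("exit: " + exc_cls.__name__)
        return False                     # do not swallow the exception

with Tracer() as t:
    pass
print(t.events)

t2 = Tracer()
try:
    with t2:
        raise ValueError("boom")
except ValueError:
    pass
print(t2.events)
```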