Upload
pargery
View
278
Download
1
Embed Size (px)
DESCRIPTION
PyGotham 09:45 AM - 10:45 AM on August 17, 2014. If you're new to Python, you might find that you're using Python as if it were C. This talk will demonstrate how to take advantage of Python's special data structures to build tools for analyzing and creating nicely-formatted reports from CSV data.
Citation preview
Building flexible tools to store sums and report on CSV data
Presented by
Margery Harrison
Audience level: Novice09:45 AM - 10:45 AM
August 17, 2014Room 704
Python Flexibility
● Basic, Fortran, C, Pascal, Javascript,...● At some point, there's a tendency to think
the same way, and just translate it● You can write Python as if it were C● Or you can take advantage of Python's
special data structures.● The second option is a lot more fun.
Using Python data structures to report on CSV data
● Lists● Sets● Tuples● Dictionaries● CSV Reader
● DictReader● Counter
Also,
● Using tuples as dictionary keys● Using enumerate() to count how many
times you've looped– See “Loop like a Native”
http://nedbatchelder.com/text/iter.html
Code Development Method
● Start with simplest possible version● Test and validate● Iterative improvements
– Make it prettier
– Make it do more
– Make it more general
This is a CSV file
color,size,shape,number
red,big,square,3
blue,big,triangle,5
green,small,square,2
blue,small,triangle,1
red,big,square,7
blue,small,triangle,3
https://c1.staticflickr.com/3/2201/2469586703_cfdaf88195.jpg
http://i239.photobucket.com/albums/ff263/peacelovebones/two-pandas-rolling-1.jpg
CSV DictReader
>>> import csv
>>> import os
>>> with open("simpleCSV.txt") as f:
... r=csv.DictReader(f)
... for row in r:
... print row
...
Running DictReader
DictReader is sequential
Tabulate All Possible Values
How many of each?
● It's nice to have a listing that shows the variety of objects that can appear in each column.
● Next, we'd like to count how many of each● And guess what? Python has a special data
structure for that.
collections.Counter
Playing with Counters
Index into Counters
Counter + DictReader
Let's use counters to tell us how many of each value was in each column.
Print number of each value
Output
colorblue : 3green : 1red : 2
shapesquare : 3triangle: 3
number1 : 13 : 22 : 15 : 17 : 1
sizesmall : 3big : 3
You might ask, why not this?
for row in r: for head in r.fieldnames: field_value = row[head] possible_values[head].add(field_value) #count_of_values[field_value]+=1 count_of_values.update(field_value) print count_of_values
Because, Counter likes to count
Counter({'e': 13, 'l': 12, 'a': 9, 'r': 9, 'g': 7, 'b': 6, 'i': 6, 's': 6, 'u': 6, 'n': 4, 'm': 3, 'q': 3, 't': 3, 'd': 2, '3': 2, '1': 1, '2': 1, '7': 1, '5': 1})
color
blue : 0
green : 0
red : 0
shapesquare : 0triangle: 0
number1 : 13 : 22 : 15 : 17 : 1
sizesmall : 0big : 0
Output
colorblue : 3green : 1red : 2
shapesquare : 3triangle: 3
number1 : 13 : 22 : 15 : 17 : 1
sizesmall : 3big : 3
How many red squares?
● We can use tuples as an index into the counter
– (red,square)
– (big,red,square)
– (small,blue,triangle)
– (small,square)
Let's use a simpler CSV
color,size,shape
red,big,square
blue,big,triangle
green,small,square
blue,small,triangle
red,big,square
blue,small,triangle
Counting Tuplestrying to use magic update()
>>> c=collections.Counter([('a,b'),('c,d,e')])>>> cCounter({'a,b': 1, 'c,d,e': 1})>>> c.update(('a','b'))>>> cCounter({'a': 1, 'b': 1, 'a,b': 1, 'c,d,e': 1})>>> c.update((('a','b'),))>>> cCounter({'a': 1, ('a', 'b'): 1, 'b': 1, 'a,b': 1, 'c,d,e': 1})
Oh well>>> c.update([(('a','b'),)])>>> cCounter({'a': 2, 'b': 2, (('a', 'b'),): 1, 'c,d,e': 1, 'a,b': 1, ('a', 'b'): 1})>>> c[('a','b')]1>>> c[('a','b')]+=5>>> cCounter({('a', 'b'): 6, 'a': 2, 'b': 2, (('a', 'b'),): 1, 'c,d,e': 1, 'a,b': 1})
Combo Count Part 1: Initialize
Combo Count 2: Counting
Combo Count 3: Printing
Combo Count Outputcolorblue : 33 blue in 1 combinations:('blue', 'big'): 1('blue', 'small'): 23 blue in 2 combinations:('blue', 'big', 'triangle'): 1('blue', 'small', 'triangle'): 2green : 11 green in 1 combinations:('green', 'small'): 11 green in 2 combinations:('green', 'small', 'square'): 1red : 22 red in 1 combinations:('red', 'big'): 22 red in 2 combinations:('red', 'big', 'square'): 2
shapesquare : 33 square in 1 combinations:3 square in 2 combinations:('red', 'big', 'square'): 2('green', 'small', 'square'): 1triangle: 33 triangle in 1 combinations:3 triangle in 2 combinations:('blue', 'big', 'triangle'): 1('blue', 'small', 'triangle'): 2sizesmall : 33 small in 1 combinations:('blue', 'small'): 2('green', 'small'): 13 small in 2 combinations:('green', 'small', 'square'): 1('blue', 'small', 'triangle'): 2
big : 33 big in 1 combinations:('blue', 'big'): 1('red', 'big'): 23 big in 2 combinations:('red', 'big', 'square'): 2('blue', 'big', 'triangle'): 1
Well, that's ugly
● We need to make it prettier● We need to write out to a file● We need to break things up into Classes
Printing Combination Levels
Number of Squares
Number of Red Squares
Number of Blue Squares
Number of Triangles
Number of Red Triangles
Number of Blue Triangles
Total Red
Total Blue
Indentation per level
● If we're indexing by tuple, then the indentation level could correspond to the number of items in the tuple.
● Let's have general methods to format the indentation level, given the number of items in the tuple, or input 'level' integer
A class write_indent() methodIf part of class with counter and msgs dict,
just pass in the tuple:
def write_indent(self, tup_index):''' :param tup_index: tuple index into counter''' indent = ' ' * len(tup_index) msg = self.msgs[tup_index] sum = self.counts[tup_index] indented_msg = ('{0:s}{1:s}'.format( indent, msg, sum) return indented_msg
class-less indent_message()
def indent_message(level, msg, sum,\
space_per_indent=2, space=' '):
num_spaces = self.space_per_indent * level
indent = space * num_spaces
# We'll want to tune the formatting..
indented_msg = ('{0:s}{1:s}:{2:d}'.format(
indent, msg, sum)
return indented_msg
Adjustable field widths
Depending on data, we'll want different field widths
red squares 5
Blue squares 21
Large Red Squares in the Bronx 987654321
Using format to format a format string
>>> f='{{0:{0:d}s}}'.format(3)
>>> f
'{0:3s}'
>>> f='{{0:{0:d}s}}{{1:{1:d}d}}'.format(3,5)
>>> f
'{0:3s}{1:5d}'
>>> f='{{0:s}}{{1:{0:d}s}}{{2:{1:d}d}}'.format(3,5)
>>> f
'{0:s}{1:3s}{2:5d}'
Format 3 values
● Our formatting string will print 3 values:– String of space chars: {0:s}
– Message: {1:[msg_width]s}
– Sum: Right justified {2:-[sum_width]d}
Class For Flexible Indentation
Flexible Indent Class Variables
Flexible Indent Method
Testing IndentMessages class
SimpleCSVReporter
● Open a CSV File● Create
– Set of possible values
– Set of possible tuples
– Counter indexed by each value & tuple
● Use IndentMessages to format output lines
SimpleCSVReporter class vars
readCSV() beginsinitialize sets..
readCSV() continued: Loop to collect & sum
Write to Report File
Using recursion for limitless indentation
Recursive print sub-levels
Word transform stubs
General method to test
Test with simpler CSV
Output for simpler CSV
A bigger CSV file
"CCN","REPORTDATETIME","SHIFT","OFFENSE","METHOD","BLOCKSITEADDRESS","WARD","ANC","DISTRICT","PSA","NEIGHBORHOODCLUSTER","BUSINESSIMPROVEMENTDISTRICT","VOTING_PRECINCT","START_DATE","END_DATE"
4104147,"4/16/2013 12:00:00 AM","MIDNIGHT","HOMICIDE","KNIFE","1500 - 1599 BLOCK OF 1ST STREET SW",6,"6D","FIRST",105,9,,"Precinct 127","7/27/2004 8:30:00 PM","7/27/2004 8:30:00 PM"
5047867,"6/5/2013 12:00:00 AM","MIDNIGHT","SEX ABUSE","KNIFE","6500 - 6599 BLOCK OF PINEY BRANCH ROAD NW",4,"4B","FOURTH",402,17,,"Precinct 59","4/15/2005 12:30:00 PM",
● From http://data.octo.dc.gov/
Deleted all but 4 columns
"SHIFT","OFFENSE","METHOD","DISTRICT"
"MIDNIGHT","HOMICIDE","KNIFE","FIRST"
"MIDNIGHT","SEX ABUSE","KNIFE","FOURTH"
...
"DAY","THEFT/OTHER","OTHERS","SECOND"
"MIDNIGHT","SEX ABUSE","OTHERS","THIRD"
"MIDNIGHT","SEX ABUSE","OTHERS","THIRD"
"EVENING","BURGLARY","OTHERS","FIFTH"
...
Method to run crime report
Output - top
Output - bottom
Improvements
● Allow user-specified order for values, e.g. FIRST, SECOND, THIRD
● Other means of tabulating● Keeping track of blank values● Summing counts in columns● ...
https://c1.staticflickr.com/3/2201/2469586703_cfdaf88195.jpg
LinksThis talk: http://www.slideshare.net/pargery/mnh-csv-python
● https://github.com/pargery/csv_utils2
● Also some notes in http://margerytech.blogspot.com/
Info on Data Structures
● http://rhodesmill.org/brandon/slides/2014-04-pycon/data-structures/
● http://nedbatchelder.com/text/iter.html
DC crime stats
● http://data.octo.dc.gov/“The data made available here has been modified for use from its original source, which is the Government of the District of Columbia. Neither the District of Columbia Government nor the Office of the Chief Technology Officer (OCTO) makes any claims as to the completeness, accuracy or content of any data contained in this application; makes any representation of any kind, including, but not limited to, warranty of the accuracy or fitness for a particular use; nor are any such warranties to be implied or inferred with respect to the information or data furnished herein. The data is subject to change as modifications and updates are complete. It is understood that the information contained in the web feed is being used at one's own risk."