Building flexible tools to store sums and report on CSV data
Presented by Margery Harrison
Audience level: Novice
09:45 AM - 10:45 AM, August 17, 2014, Room 704


DESCRIPTION

PyGotham 09:45 AM - 10:45 AM on August 17, 2014. If you're new to Python, you might find that you're using Python as if it were C. This talk will demonstrate how to take advantage of Python's special data structures to build tools for analyzing and creating nicely-formatted reports from CSV data.

Page 1: Mnh csv python

Building flexible tools to store sums and report on CSV data

Presented by

Margery Harrison

Audience level: Novice

09:45 AM - 10:45 AM

August 17, 2014

Room 704

Page 2: Mnh csv python

Python Flexibility

● Basic, Fortran, C, Pascal, Javascript,...
● At some point, there's a tendency to think the same way, and just translate it
● You can write Python as if it were C
● Or you can take advantage of Python's special data structures.
● The second option is a lot more fun.

Page 3: Mnh csv python

Using Python data structures to report on CSV data

● Lists
● Sets
● Tuples
● Dictionaries
● CSV Reader
● DictReader
● Counter

Page 4: Mnh csv python

Also,

● Using tuples as dictionary keys
● Using enumerate() to count how many times you've looped
– See “Loop like a Native” http://nedbatchelder.com/text/iter.html
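As a rough illustration of the two ideas on this slide, the sketch below uses tuples as dictionary keys and lets enumerate() supply the loop count. The data and the names `rows` and `totals` are made up for the example, and the syntax is Python 3 rather than the Python 2 used elsewhere in the slides.

```python
# Sketch only: the rows below are invented for illustration.
rows = [("red", "square"), ("blue", "triangle"), ("red", "square")]

# A tuple is hashable, so it can serve as a dictionary key.
totals = {}
for line_number, key in enumerate(rows, start=1):
    # enumerate() hands back the loop count alongside each item,
    # so there is no separate counter variable to maintain.
    totals[key] = totals.get(key, 0) + 1
    print(line_number, key, totals[key])
```

Starting enumerate() at 1 is a choice made here so the count reads like a line number; the default start is 0.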

Page 5: Mnh csv python

Code Development Method

● Start with simplest possible version
● Test and validate
● Iterative improvements
– Make it prettier
– Make it do more
– Make it more general

Page 6: Mnh csv python

This is a CSV file

color,size,shape,number

red,big,square,3

blue,big,triangle,5

green,small,square,2

blue,small,triangle,1

red,big,square,7

blue,small,triangle,3

Page 7: Mnh csv python

https://c1.staticflickr.com/3/2201/2469586703_cfdaf88195.jpg

Page 8: Mnh csv python

http://i239.photobucket.com/albums/ff263/peacelovebones/two-pandas-rolling-1.jpg

Page 9: Mnh csv python

CSV DictReader

>>> import csv
>>> import os
>>> with open("simpleCSV.txt") as f:
...     r = csv.DictReader(f)
...     for row in r:
...         print(row)
...

Page 10: Mnh csv python

Running DictReader

Page 11: Mnh csv python

DictReader is sequential
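The slide body is not part of this transcript. One reading of the title is that a DictReader hands out rows one at a time and is used up after a single pass. The sketch below demonstrates that behaviour under that assumption, with an in-memory stand-in for simpleCSV.txt (Python 3 syntax).

```python
import csv
import io

# Stand-in for the slides' simpleCSV.txt, kept in memory so the sketch is self-contained.
CSV_TEXT = """color,size,shape,number
red,big,square,3
blue,big,triangle,5
green,small,square,2
"""

f = io.StringIO(CSV_TEXT)
reader = csv.DictReader(f)

first_pass = [row["color"] for row in reader]
second_pass = [row["color"] for row in reader]  # the reader is already exhausted

print(first_pass)
print(second_pass)

# To read the rows again, rewind the file and build a new reader.
f.seek(0)
third_pass = [row["color"] for row in csv.DictReader(f)]
print(third_pass)
```

If the rows are needed more than once, collecting them into a list on the first pass is usually simpler than rewinding.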

Page 12: Mnh csv python

Tabulate All Possible Values
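The code for this slide is not in the transcript. A minimal sketch of the idea, assuming one set per column and reusing the `possible_values` name that appears later in the talk, might look like this (Python 3 syntax, with the sample CSV held in memory):

```python
import csv
import io

# The sample CSV from the earlier slide, inlined so the sketch runs on its own.
CSV_TEXT = """color,size,shape,number
red,big,square,3
blue,big,triangle,5
green,small,square,2
blue,small,triangle,1
red,big,square,7
blue,small,triangle,3
"""

reader = csv.DictReader(io.StringIO(CSV_TEXT))

# One set per column; a set keeps each distinct value only once.
possible_values = {head: set() for head in reader.fieldnames}
for row in reader:
    for head in reader.fieldnames:
        possible_values[head].add(row[head])

for head in reader.fieldnames:
    print(head, sorted(possible_values[head]))
```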

Page 13: Mnh csv python

How many of each?

● It's nice to have a listing that shows the variety of objects that can appear in each column.

● Next, we'd like to count how many of each

● And guess what? Python has a special data structure for that.

Page 14: Mnh csv python

collections.Counter

Page 15: Mnh csv python

Playing with Counters

Page 16: Mnh csv python

Index into Counters
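The slide's own session is not in the transcript. As a small sketch of what indexing into a Counter looks like, using the colors from the sample CSV (the name `colors` is only for this example):

```python
from collections import Counter

colors = Counter(["red", "blue", "green", "blue", "red", "blue"])

print(colors["blue"])     # count for a key that was seen
print(colors["purple"])   # a missing key reads as 0 instead of raising KeyError
colors["purple"] += 1     # so incrementing an unseen key just works
print(colors.most_common(1))
```

Reading a missing key does not add it to the Counter; only the assignment does.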

Page 17: Mnh csv python

Counter + DictReader

Let's use counters to tell us how many of each value was in each column.
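A sketch of how that could be done, assuming a single Counter keyed by field value plus one set of values per column (the names `possible_values` and `count_of_values` are taken from a later slide; the rest is illustrative). It relies on no value appearing in more than one column, which holds for the sample data. Python 3 syntax.

```python
import csv
import io
from collections import Counter

# The sample CSV from the earlier slide, inlined so the sketch runs on its own.
CSV_TEXT = """color,size,shape,number
red,big,square,3
blue,big,triangle,5
green,small,square,2
blue,small,triangle,1
red,big,square,7
blue,small,triangle,3
"""

reader = csv.DictReader(io.StringIO(CSV_TEXT))
possible_values = {head: set() for head in reader.fieldnames}
count_of_values = Counter()

for row in reader:
    for head in reader.fieldnames:
        field_value = row[head]
        possible_values[head].add(field_value)
        count_of_values[field_value] += 1   # increment the whole value as one key

# Print each column name followed by its values and their counts.
for head in reader.fieldnames:
    print(head)
    for value in sorted(possible_values[head]):
        print('{0:8s}: {1:d}'.format(value, count_of_values[value]))
```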

Page 18: Mnh csv python

Print number of each value

Page 19: Mnh csv python

Output

color
blue : 3
green : 1
red : 2

shape
square : 3
triangle: 3

number
1 : 1
3 : 2
2 : 1
5 : 1
7 : 1

size
small : 3
big : 3

Page 20: Mnh csv python

You might ask, why not this?

for row in r:
    for head in r.fieldnames:
        field_value = row[head]
        possible_values[head].add(field_value)
        #count_of_values[field_value]+=1
        count_of_values.update(field_value)
print(count_of_values)

Page 21: Mnh csv python

Because, Counter likes to count

Counter({'e': 13, 'l': 12, 'a': 9, 'r': 9, 'g': 7, 'b': 6, 'i': 6, 's': 6, 'u': 6, 'n': 4, 'm': 3, 'q': 3, 't': 3, 'd': 2, '3': 2, '1': 1, '2': 1, '7': 1, '5': 1})

color
blue : 0
green : 0
red : 0

shape
square : 0
triangle: 0

number
1 : 1
3 : 2
2 : 1
5 : 1
7 : 1

size
small : 0
big : 0
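The output above follows from how Counter.update() treats its argument: it iterates over it, and iterating over a string yields single characters. That is why only the one-character numbers were counted correctly. A short sketch of the pitfall and two ways around it (the names `wrong` and `right` are only for this example):

```python
from collections import Counter

wrong = Counter()
wrong.update("blue")      # a string is an iterable of characters
print(wrong)              # counts 'b', 'l', 'u', 'e' separately

right = Counter()
right.update(["blue"])    # wrap the value in a list: one element, one count
right["blue"] += 1        # or simply increment the key directly
print(right)
```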

Page 22: Mnh csv python

Output

color
blue : 3
green : 1
red : 2

shape
square : 3
triangle: 3

number
1 : 1
3 : 2
2 : 1
5 : 1
7 : 1

size
small : 3
big : 3

Page 23: Mnh csv python

How many red squares?

● We can use tuples as an index into the counter

– (red,square)

– (big,red,square)

– (small,blue,triangle)

– (small,square)
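A minimal sketch of tuple keys in a Counter, assuming the sample CSV from the earlier slide and a hypothetical `combo_counts` name. Which column combinations to count is a choice made here to match the examples in the list above.

```python
import csv
import io
from collections import Counter

# The sample CSV from the earlier slide, inlined so the sketch runs on its own.
CSV_TEXT = """color,size,shape,number
red,big,square,3
blue,big,triangle,5
green,small,square,2
blue,small,triangle,1
red,big,square,7
blue,small,triangle,3
"""

combo_counts = Counter()
for row in csv.DictReader(io.StringIO(CSV_TEXT)):
    # Each tuple is a single key; the Counter does not look inside it.
    combo_counts[(row["color"], row["shape"])] += 1
    combo_counts[(row["size"], row["color"], row["shape"])] += 1
    combo_counts[(row["size"], row["shape"])] += 1

print(combo_counts[("red", "square")])
print(combo_counts[("big", "red", "square")])
print(combo_counts[("small", "blue", "triangle")])
print(combo_counts[("small", "square")])
```

Note that this counts rows containing each combination; it does not sum the number column.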

Page 24: Mnh csv python

Let's use a simpler CSV

color,size,shape

red,big,square

blue,big,triangle

green,small,square

blue,small,triangle

red,big,square

blue,small,triangle

Page 25: Mnh csv python

Counting Tuples: trying to use magic update()

>>> c=collections.Counter([('a,b'),('c,d,e')])
>>> c
Counter({'a,b': 1, 'c,d,e': 1})
>>> c.update(('a','b'))
>>> c
Counter({'a': 1, 'b': 1, 'a,b': 1, 'c,d,e': 1})
>>> c.update((('a','b'),))
>>> c
Counter({'a': 1, ('a', 'b'): 1, 'b': 1, 'a,b': 1, 'c,d,e': 1})

Page 26: Mnh csv python

Oh well

>>> c.update([(('a','b'),)])
>>> c
Counter({'a': 2, 'b': 2, (('a', 'b'),): 1, 'c,d,e': 1, 'a,b': 1, ('a', 'b'): 1})
>>> c[('a','b')]
1
>>> c[('a','b')]+=5
>>> c
Counter({('a', 'b'): 6, 'a': 2, 'b': 2, (('a', 'b'),): 1, 'c,d,e': 1, 'a,b': 1})

Page 27: Mnh csv python

Combo Count Part 1: Initialize

Page 28: Mnh csv python

Combo Count 2: Counting

Page 29: Mnh csv python

Combo Count 3: Printing

Page 30: Mnh csv python

Combo Count Output

color
blue : 3
3 blue in 1 combinations:
('blue', 'big'): 1
('blue', 'small'): 2
3 blue in 2 combinations:
('blue', 'big', 'triangle'): 1
('blue', 'small', 'triangle'): 2
green : 1
1 green in 1 combinations:
('green', 'small'): 1
1 green in 2 combinations:
('green', 'small', 'square'): 1
red : 2
2 red in 1 combinations:
('red', 'big'): 2
2 red in 2 combinations:
('red', 'big', 'square'): 2

shape
square : 3
3 square in 1 combinations:
3 square in 2 combinations:
('red', 'big', 'square'): 2
('green', 'small', 'square'): 1
triangle: 3
3 triangle in 1 combinations:
3 triangle in 2 combinations:
('blue', 'big', 'triangle'): 1
('blue', 'small', 'triangle'): 2

size
small : 3
3 small in 1 combinations:
('blue', 'small'): 2
('green', 'small'): 1
3 small in 2 combinations:
('green', 'small', 'square'): 1
('blue', 'small', 'triangle'): 2
big : 3
3 big in 1 combinations:
('blue', 'big'): 1
('red', 'big'): 2
3 big in 2 combinations:
('red', 'big', 'square'): 2
('blue', 'big', 'triangle'): 1
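The three "Combo Count" slides that produced this output are images and are not in the transcript. Judging from the output alone, the tuples appear to be prefixes of each row: (color, size) and (color, size, shape). The sketch below counts them on that assumption; `value_counts` and `combo_counts` are hypothetical names, and this is not the talk's actual code.

```python
import csv
import io
from collections import Counter

# The simpler CSV from the earlier slide, inlined so the sketch runs on its own.
CSV_TEXT = """color,size,shape
red,big,square
blue,big,triangle
green,small,square
blue,small,triangle
red,big,square
blue,small,triangle
"""

reader = csv.DictReader(io.StringIO(CSV_TEXT))
value_counts = Counter()   # keyed by a single value, e.g. 'blue'
combo_counts = Counter()   # keyed by a tuple, e.g. ('blue', 'small')

for row in reader:
    values = tuple(row[head] for head in reader.fieldnames)
    # update() iterates the tuple, so each whole string is counted once.
    value_counts.update(values)
    # Count every prefix of two or more columns.
    for length in range(2, len(values) + 1):
        combo_counts[values[:length]] += 1

# Show the combinations that involve 'blue'.
for combo in sorted(combo_counts):
    if 'blue' in combo:
        print(combo, combo_counts[combo])
```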

Page 31: Mnh csv python

Well, that's ugly

● We need to make it prettier
● We need to write out to a file
● We need to break things up into Classes

Page 32: Mnh csv python

Printing Combination Levels

Number of Squares
  Number of Red Squares
  Number of Blue Squares
Number of Triangles
  Number of Red Triangles
  Number of Blue Triangles
Total Red
Total Blue

Page 33: Mnh csv python

Indentation per level

● If we're indexing by tuple, then the indentation level could correspond to the number of items in the tuple.

● Let's have general methods to format the indentation level, given the number of items in the tuple, or input 'level' integer

Page 34: Mnh csv python

A class write_indent() method

If part of a class with counter and msgs dict, just pass in the tuple:

def write_indent(self, tup_index):
    '''
    :param tup_index: tuple index into counter
    '''
    # One space of indentation per item in the tuple
    indent = ' ' * len(tup_index)
    msg = self.msgs[tup_index]
    sum = self.counts[tup_index]
    # Three placeholders, one for each of the three arguments
    indented_msg = '{0:s}{1:s}:{2:d}'.format(indent, msg, sum)
    return indented_msg

Page 35: Mnh csv python

class-less indent_message()

def indent_message(level, msg, sum,
                   space_per_indent=2, space=' '):
    # Plain function, so use the parameter directly (there is no self)
    num_spaces = space_per_indent * level
    indent = space * num_spaces
    # We'll want to tune the formatting..
    indented_msg = '{0:s}{1:s}:{2:d}'.format(indent, msg, sum)
    return indented_msg

Page 36: Mnh csv python

Adjustable field widths

Depending on data, we'll want different field widths

red squares 5

Blue squares 21

Large Red Squares in the Bronx 987654321

Page 37: Mnh csv python

Using format to format a format string

>>> f='{{0:{0:d}s}}'.format(3)

>>> f

'{0:3s}'

>>> f='{{0:{0:d}s}}{{1:{1:d}d}}'.format(3,5)

>>> f

'{0:3s}{1:5d}'

>>> f='{{0:s}}{{1:{0:d}s}}{{2:{1:d}d}}'.format(3,5)

>>> f

'{0:s}{1:3s}{2:5d}'

Page 38: Mnh csv python

Format 3 values

● Our formatting string will print 3 values:
– String of space chars: {0:s}
– Message: {1:[msg_width]s}
– Sum: Right justified {2:-[sum_width]d}
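Putting the three fields together might look like the sketch below. `build_format` is a hypothetical helper, the widths 14 and 4 are arbitrary, and the sample messages are made up; the doubled-brace technique is the one from the previous slide.

```python
def build_format(msg_width, sum_width):
    # Doubled braces survive the first format() call as literal braces,
    # leaving a format string with the widths filled in.
    return '{{0:s}}{{1:{0:d}s}}{{2:{1:d}d}}'.format(msg_width, sum_width)

fmt = build_format(14, 4)
print(repr(fmt))

samples = [(0, 'squares', 26), (1, 'red squares', 5), (1, 'blue squares', 21)]
for level, msg, total in samples:
    indent = ' ' * (2 * level)   # two spaces per level, an arbitrary choice
    print(fmt.format(indent, msg, total))
```

Strings are left-justified and integers right-justified by default, so a width alone is enough to line up the two columns.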

Page 39: Mnh csv python

Class For Flexible Indentation

Page 40: Mnh csv python

Flexible Indent Class Variables

Page 41: Mnh csv python

Flexible Indent Method

Page 42: Mnh csv python

Testing IndentMessages class

Page 43: Mnh csv python

SimpleCSVReporter

● Open a CSV File
● Create
– Set of possible values
– Set of possible tuples
– Counter indexed by each value & tuple
● Use IndentMessages to format output lines

Page 44: Mnh csv python

SimpleCSVReporter class vars

Page 45: Mnh csv python

readCSV() begins: initialize sets..

Page 46: Mnh csv python

readCSV() continued: Loop to collect & sum

Page 47: Mnh csv python

Write to Report File

Page 48: Mnh csv python

Using recursion for limitless indentation

Page 49: Mnh csv python

Recursive print sub-levels
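The code for this slide is not in the transcript. As a sketch of the general idea only, the function below walks a Counter keyed by prefix tuples and indents each line by the depth of its tuple, recursing into longer tuples that extend the current one. The name `report_lines`, the sample counts, and the two-space indent are assumptions, not the talk's implementation.

```python
from collections import Counter

# Hypothetical counts keyed by prefix tuples, mirroring the 'blue' rows
# of the combination output shown earlier in the talk.
counts = Counter({
    ('blue',): 3,
    ('blue', 'big'): 1,
    ('blue', 'small'): 2,
    ('blue', 'big', 'triangle'): 1,
    ('blue', 'small', 'triangle'): 2,
})

def report_lines(counts, prefix=()):
    """Return indented lines for every key that extends prefix, depth first."""
    lines = []
    children = sorted(
        key for key in counts
        if len(key) == len(prefix) + 1 and key[:len(prefix)] == prefix
    )
    for key in children:
        indent = '  ' * len(prefix)
        lines.append('{0:s}{1:s}: {2:d}'.format(indent, ' '.join(key), counts[key]))
        # Recurse one level deeper; the recursion ends when no key extends this one.
        lines.extend(report_lines(counts, key))
    return lines

print('\n'.join(report_lines(counts)))
```

Because the depth comes from the tuple length, the same function handles any number of columns without a fixed set of nested loops.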

Page 50: Mnh csv python

Word transform stubs

Page 51: Mnh csv python

General method to test

Page 52: Mnh csv python

Test with simpler CSV

Page 53: Mnh csv python

Output for simpler CSV

Page 54: Mnh csv python

A bigger CSV file

"CCN","REPORTDATETIME","SHIFT","OFFENSE","METHOD","BLOCKSITEADDRESS","WARD","ANC","DISTRICT","PSA","NEIGHBORHOODCLUSTER","BUSINESSIMPROVEMENTDISTRICT","VOTING_PRECINCT","START_DATE","END_DATE"

4104147,"4/16/2013 12:00:00 AM","MIDNIGHT","HOMICIDE","KNIFE","1500 - 1599 BLOCK OF 1ST STREET SW",6,"6D","FIRST",105,9,,"Precinct 127","7/27/2004 8:30:00 PM","7/27/2004 8:30:00 PM"

5047867,"6/5/2013 12:00:00 AM","MIDNIGHT","SEX ABUSE","KNIFE","6500 - 6599 BLOCK OF PINEY BRANCH ROAD NW",4,"4B","FOURTH",402,17,,"Precinct 59","4/15/2005 12:30:00 PM",

● From http://data.octo.dc.gov/

Page 55: Mnh csv python

Deleted all but 4 columns

"SHIFT","OFFENSE","METHOD","DISTRICT"

"MIDNIGHT","HOMICIDE","KNIFE","FIRST"

"MIDNIGHT","SEX ABUSE","KNIFE","FOURTH"

...

"DAY","THEFT/OTHER","OTHERS","SECOND"

"MIDNIGHT","SEX ABUSE","OTHERS","THIRD"

"MIDNIGHT","SEX ABUSE","OTHERS","THIRD"

"EVENING","BURGLARY","OTHERS","FIFTH"

...

Page 56: Mnh csv python

Method to run crime report

Page 57: Mnh csv python

Output - top

Page 58: Mnh csv python

Output - bottom

Page 59: Mnh csv python

Improvements

● Allow user-specified order for values, e.g. FIRST, SECOND, THIRD

● Other means of tabulating

● Keeping track of blank values

● Summing counts in columns

● ...

Page 60: Mnh csv python

https://c1.staticflickr.com/3/2201/2469586703_cfdaf88195.jpg

Page 61: Mnh csv python

Links

This talk: http://www.slideshare.net/pargery/mnh-csv-python

● https://github.com/pargery/csv_utils2

● Also some notes in http://margerytech.blogspot.com/

Info on Data Structures

● http://rhodesmill.org/brandon/slides/2014-04-pycon/data-structures/

● http://nedbatchelder.com/text/iter.html

DC crime stats

● http://data.octo.dc.gov/

“The data made available here has been modified for use from its original source, which is the Government of the District of Columbia. Neither the District of Columbia Government nor the Office of the Chief Technology Officer (OCTO) makes any claims as to the completeness, accuracy or content of any data contained in this application; makes any representation of any kind, including, but not limited to, warranty of the accuracy or fitness for a particular use; nor are any such warranties to be implied or inferred with respect to the information or data furnished herein. The data is subject to change as modifications and updates are complete. It is understood that the information contained in the web feed is being used at one's own risk.”