Using a Simple Python Script to Download Data Rob Letzler Goldman School of Public Policy July 2005

Using a Simple Python Script to Download Data

Rob Letzler

Goldman School of Public Policy

July 2005


Overview

• Explain the problem

• Talk about the solution strategy

• Then walk through the code line by line, explaining the tools and ideas in the solution

What’s not here that we might want to discuss in the future

• High-speed numerical Python: a slow language with fast libraries

• Writing your own objects

• Good program structure

• Functional programming: the map, filter, lambda, and reduce commands. Good short overview at: http://scott.andstuff.org/FunctionalPython

• (Stata generate / replace commands are roughly map, and Stata drop if ~X is roughly filter)
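As a quick taste of the functional commands named above, here is a minimal sketch in Python 3 syntax (the slides themselves use Python 2). The price list and all variable names are invented for illustration.

```python
from functools import reduce  # reduce lives in functools in Python 3

# Hypothetical hourly prices in $/MWh, invented for this example.
prices = [32.5, -1.0, 48.0, 51.25]

# map: apply a function to every element (roughly Stata's generate / replace).
prices_in_cents = list(map(lambda p: p * 100, prices))

# filter: keep only the elements that pass a test (roughly Stata's drop if ~X).
valid_prices = list(filter(lambda p: p >= 0, prices))

# reduce: fold the list into a single value, here a sum.
total = reduce(lambda a, b: a + b, valid_prices, 0.0)

print(prices_in_cents)
print(valid_prices)
print(total)
```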

The Challenge

• Download more than 1,000 daily and monthly electricity market database files from the California Independent System Operator (CAISO) website.


Overview

• Explain the problem

• Talk about the solution strategy

• Then walk through the code line by line, explaining the tools and ideas in the solution

Solution Strategy

• Research the http:// location (URL) of each database

• Write Python Code that executes once for each month t from the sample period

• Generate strings for the locations of the webpage and local disk file for month t

• Open the web page

• Create a local disk file

• Read the web page and save it in the local disk file

Disclaimer

• This is my first Python program.

• I fear that I’ve reinvented a lot of wheels. This program uses lots of basic Python functions rather than tapping into libraries and extensions in ways that would create a shorter program.

• This program structure, which has a main loop that is not in a function or object, is fine for a simple program but is dangerous for large, complex programs


Overview

• Explain the problem

• Talk about the solution strategy

• Then walk through the code line by line, explaining the tools and ideas in the solution


Python Syntax We’ll Need

• Loops

• Conditional Statements

• Functions

• File / web reading and writing

• Exception Handling

For Loops in Python

• Python loops over the elements of a list, not by updating an integer.

• Python requires a colon (:) between a conditional / loop / function declaration and the block of additional statements it affects

    for item in list:
        do stuff

• Other programming languages would approach this as: for integer i = start to stop {do stuff}

• Python’s range(start, stop+1) covers the same integers as other languages’ start to stop
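A minimal sketch of both loop styles, written in Python 3 syntax (the slides use Python 2). The list contents and variable names are made up for illustration.

```python
# Loop directly over the elements of a list.
months = ["Jan", "Feb", "Mar"]
seen = []
for name in months:
    seen.append(name.upper())

# range(start, stop + 1) visits start, start + 1, ..., stop.
start, stop = 1, 12
month_numbers = []
for i in range(start, stop + 1):
    month_numbers.append(i)

print(seen)
print(month_numbers)
```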

Solution Strategy

• Research the database’s http:// location (URL)

• Write Python Code that executes once for each month t from the sample period

• Generate strings for the webpage and local disk file for month t

• Open the web page

• Create a local disk file

• Read the web page and save it in the local file

The Main Loop Part I

    month_length = [31,28,31,30,31,30,31,31,30,31,30,31]  # list of number of days in each month

    for year in range(2001,2005):  # years 2001 to 2004 -- notice ranges include the
                                   # first num, but are strictly less than the last num
        for month in range(1,13):
            if ((year in range(2002,2004)) or
                (year == 2001 and month > 3) or
                (year == 2004 and month < 10)):
                # only begins executing the main block if we are in
                # the sample period

Highlights (shown in red on the slide):
– Logical operators are the words and and or, not & and |
– To test whether a and b are the same, use a == b with two equal signs; to put b in a, use a = b with one equal sign.
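To check which months the condition lets through, the test can be pulled into a small function. This is a sketch in Python 3 syntax; the names in_sample and sample_months are illustrative, not part of the original script. It also shows calendar.monthrange from the standard library, which accounts for leap years (the hard-coded month_length list gives February 28 days, and 2004 was a leap year). Whether that difference matters for the download is not shown in the slides.

```python
import calendar

def in_sample(year, month):
    """Return True for April 2001 through September 2004, as in the slide."""
    return ((year in range(2002, 2004)) or
            (year == 2001 and month > 3) or
            (year == 2004 and month < 10))

# Collect every (year, month) pair that passes the sample-period test.
sample_months = [(year, month)
                 for year in range(2001, 2005)
                 for month in range(1, 13)
                 if in_sample(year, month)]

# calendar.monthrange returns (weekday of first day, number of days in month).
feb_2004_days = calendar.monthrange(2004, 2)[1]

print(len(sample_months), sample_months[0], sample_months[-1])
print(feb_2004_days)
```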

Functions

• Functions are groups of statements that other parts of the code can call

    def FunctionName(parameters):
        statements
        return optional return value

• Functions may return a value. If the function returns a value, you can call it in an assignment statement, like

    result = FunctionName(inputs)

• Functions and objects are crucial tools for designing large programs that are modular, flexible, and reliable. See McConnell, Code Complete for more detail.
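A minimal sketch of one function that returns a value and one that does not, in Python 3 syntax. The function names and inputs are made up for illustration.

```python
def days_in_sample(month_lengths):
    """Add up a list of month lengths and return the total."""
    total = 0
    for length in month_lengths:
        total = total + length
    return total

def announce(message):
    """Print a message; with no return statement, the call yields None."""
    print(message)

result = days_in_sample([30, 31, 30])   # call inside an assignment statement
nothing = announce("done")              # a function without a return value gives None
```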

• Python passes every parameter as a reference to an object. Numbers and strings cannot be changed in place, so they behave as if they were passed by value; more complex things such as lists can be changed through the reference. Each function works with its own local names for the values / references, which can protect values from being accidentally changed.

• If you create a new object in the function, the original will be unaffected: list_var = list_var + ["C", "D"]

• If you modify the original object without changing its memory address, the original will be changed: list_var.extend(["C", "D"]) or list_var[1] = "C"

• Any variable that is defined outside of a function or object is global and can get changed by any part of the code. Avoid using global variables because it can be difficult to find and fix errors involving changes in them.
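The difference between creating a new object and modifying the original can be seen side by side. This is a small sketch in Python 3 syntax; the function names rebind and mutate are made up for illustration.

```python
def rebind(list_var):
    # Builds a NEW list and points the local name at it; the caller's list is untouched.
    list_var = list_var + ["C", "D"]
    return list_var

def mutate(list_var):
    # Changes the SAME list object that the caller holds.
    list_var.extend(["C", "D"])
    return list_var

original = ["A", "B"]
rebound = rebind(original)
after_rebind = list(original)             # snapshot taken before any mutation
same_object = mutate(original) is original

print(after_rebind)
print(original)
```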

Passing by Value and Reference

Output:

    notice that test_list has changed to
    ['A', 'B', 'C', 'D']
    but that test_integer is still 5 but the copy we returned has changed to 5000

Code:

    import fpformat  # formats numbers as strings (Python 2 library)

    def python_copies_numbers_but_shares_lists_and_objects(list_input, integer_input):
        integer_input = integer_input*1000  # rebinds the local name only
        list_input.extend(["C","D"])        # changes the caller's list in place
        return integer_input

    def main():
        test_list = ["A","B"]
        test_integer = 5
        updated_integer = python_copies_numbers_but_shares_lists_and_objects(test_list, test_integer)
        print "notice that test_list has changed to "
        print test_list
        print "but that test_integer is still " + fpformat.fix(test_integer,0) + " but the copy we returned has changed to " + fpformat.fix(updated_integer,0)
        return

    main()

Solution Strategy

• Research the http:// location (URL) of each database

• Write Python Code that executes once for each month t from the sample period

• Generate strings for the webpage and local disk file for month t

• Open the web page

• Create a local disk file

• Read the web page and save it in the local file

Main Loop then Calls a Function

    month_string = make_two_dgt_string(month)

    import fpformat  # fpformat formats floating point numbers into strings

    def make_two_dgt_string(n):
        # takes a number and adds a leading zero if the number is less than 10
        # assumes that the input number is < 100
        if n > 9:  # check whether we need to pad the date with a leading zero
            n_string = fpformat.fix(n,0)  # if we don't need to pad, convert the number directly to a string
        else:  # pad low numbers with a leading zero
            n_string = "0"+fpformat.fix(n,0)  # otherwise convert to string and add a leading zero to the string.
        return n_string  # either way, return the results.
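For comparison, here is a shorter alternative that leans on built-in string formatting instead of fpformat. It is a sketch of a different technique rather than the original script's code, and it runs under both Python 2 and Python 3.

```python
def make_two_dgt_string(n):
    """Return n as a string, padded to at least two digits with a leading zero."""
    return "%02d" % n

also_padded = str(7).zfill(2)   # str.zfill is another built-in way to pad

print(make_two_dgt_string(3))
print(make_two_dgt_string(12))
print(also_padded)
```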

Main Loop then creates strings and calls more functions

    # now, for each month in the sample, request a price data file

    # generate caiso URL
    load_url = "http://oasis.caiso.com/…&dstartdate="+fpformat.fix(year,0)+month_string…

    # generate file name for my hard disk
    load_file_name = "caiso_price_"+fpformat.fix(year,0)+"-"+month_string+"-"+"1-"+fpformat.fix(end_date,0)+".zip"

    # download and save the requested files.
    get_save_file(load_url,load_file_name)

    # continue looping until we go through every month in the sample...
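The file-name pattern can also be produced with a single format string. This sketch uses a hypothetical helper, build_file_name, and assumes end_date is the last day of the month (the slides do not show how end_date is computed). The URL is left out because its query string is elided in the slides.

```python
def build_file_name(year, month, end_date):
    """Mirror the slide's pattern: caiso_price_YYYY-MM-1-DD.zip"""
    return "caiso_price_%d-%02d-1-%d.zip" % (year, month, end_date)

example_name = build_file_name(2001, 4, 30)
print(example_name)
```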

Solution Strategy

• We have:
– Researched the http:// location (URL) of each database
– Written Python Code that executes once for each month t from the sample period
– Generated strings for the webpage and local disk file for month t

• We’ve called but not seen the code that:
– Opens the web page
– Creates a local disk file
– Reads the web page and saves it in the local file

Connect to the webpage

    import urllib  # provides urllib.urlopen (Python 2)

    def get_save_file(url, file_name):
        # this function gets the file specified in URL from the web and then saves it in
        # location FILE_NAME

        # Designates the location in which to save the file
        path = "C:\\rjl\\ca_amp\\download\\price\\"+file_name
        try:
            web_data = urllib.urlopen(url)  # attempt to create a shortcut / handle to the desired web page / web file
        except IOError, msg:
            print "didn't open URL %s: %s" % (url, str(msg))  # fill in the %s placeholders
            return  # nothing to save if the page did not open

Creating and Using Objects

• Many Python libraries are object oriented

• An object bundles a kind of data with “member functions” for manipulating that data.

• Steps: 1) create (“instantiate”) objects 2) use their functions.

objectName = libName.constructor(initial values)

objectName.doSomething(parameters)
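A concrete instance of the two steps, using the standard library's datetime module as the example library (chosen for illustration; it is not used in the original script). Python 3 syntax.

```python
import datetime

# 1) Create ("instantiate") an object: libName.constructor(initial values)
first_day = datetime.date(2004, 2, 1)

# 2) Use its member functions: objectName.doSomething(parameters)
iso_text = first_day.isoformat()
stamp = first_day.strftime("%Y%m%d")

print(iso_text)
print(stamp)
```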

Exceptions

• try/except sequences handle routine problems like file-not-found errors ("exceptions") gracefully rather than ending the whole program.

    try:
        SomethingThatMightNotWork  # this will either work, or it will fail and generate an exception of some failureType
    except failureType1:
        {If we get failure type 1, do this and continue from here}

• Dividing by zero or inverting a singular matrix might throw exceptions.

• Limited goto statement: if there is an exception, the program stops executing the try block and jumps immediately to the next except statement that handles that error
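A minimal runnable sketch of the try/except pattern, using division by zero as the failure. Python 3 syntax; the function name safe_divide is made up for illustration.

```python
def safe_divide(numerator, denominator):
    """Return the quotient, or None if the denominator is zero."""
    try:
        result = numerator / denominator   # may raise ZeroDivisionError
    except ZeroDivisionError as msg:
        # Execution jumps here as soon as the division fails.
        print("could not divide: %s" % msg)
        return None
    return result

ok = safe_divide(10, 4)
failed = safe_divide(10, 0)
```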

Create a local file and save the downloaded page

        # (still inside get_save_file)
        try:
            f = open(path, "wb")  # create a handle to a new file for "wb": _w_riting in _b_inary
            f.write(web_data.read())  # write into the new file the results from downloading the webpage
            f.close()  # complete writing process.
            print "saved %s" % path
        except IOError, msg:
            print "didn't save %s: %s" % (path, str(msg))
        return  # end the routine
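For readers on Python 3, where urllib.urlopen has moved to urllib.request.urlopen, the whole routine might look roughly like the sketch below. It is an adaptation under stated assumptions, not the original code: it takes a full path instead of prepending a hard-coded directory, and it returns True or False so the caller can tell whether the download succeeded. The demonstration at the end uses a local file:// URL as a stand-in for the real web address so that it runs without a network connection.

```python
import pathlib
import tempfile
import urllib.request

def get_save_file(url, path):
    """Fetch url and write its bytes to path; return True on success."""
    try:
        with urllib.request.urlopen(url) as web_data:
            contents = web_data.read()
    except OSError as msg:   # urllib.error.URLError is a subclass of OSError
        print("didn't open URL %s: %s" % (url, msg))
        return False
    try:
        with open(path, "wb") as f:   # "wb": writing in binary
            f.write(contents)
    except OSError as msg:
        print("didn't save %s: %s" % (path, msg))
        return False
    print("saved %s" % path)
    return True

# Offline demonstration: a file:// URL stands in for the real web address.
work_dir = pathlib.Path(tempfile.mkdtemp())
source = work_dir / "source.bin"
source.write_bytes(b"example bytes")
target = work_dir / "copy.bin"

saved = get_save_file(source.as_uri(), str(target))
missing = get_save_file((work_dir / "no_such_file").as_uri(),
                        str(work_dir / "unused.bin"))
```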


File Manipulation in Python

Details on files: Python Tutorial Section 7.2

Start: Construct a file object using the open command

file_object_name = open(filename, mode)

Read/write

    string_or_data = file_object_name.read()

    file_object_name.write(data to write)

Finish using the file

    file_object_name.close()
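The same open / write / read / close cycle as a runnable sketch in Python 3 syntax. It uses the with statement, which closes the file automatically; the temporary file name and its contents are made up for illustration.

```python
import os
import tempfile

# A temporary file name, chosen only for this example.
filename = os.path.join(tempfile.mkdtemp(), "example.txt")

# Write: the with statement closes the file automatically, even on errors.
with open(filename, "w") as file_object:
    file_object.write("hour,price\n1,32.5\n")

# Read the whole file back into one string.
with open(filename, "r") as file_object:
    data = file_object.read()

lines = data.splitlines()
print(lines)
```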

Possible extensions

• Unzip the files that we downloaded (easy?)

    import os
    os.system('unzip ' + file_name)

  (See http://docs.python.org/lib/module-zipfile.html)

• Test that downloaded data have expected characteristics (e.g. four fields per line) using regular expressions

• Read in and manipulate the XML databases (harder?)

• Enter these file names into SAS or Stata import / analysis code and run SAS / Stata
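The first two extensions could be sketched with the standard zipfile and re modules, which avoids depending on an external unzip program. This is an illustration in Python 3 syntax under assumptions: the helper names, the demo archive, and the comma-separated four-field layout are all hypothetical, since the slides do not describe the real file format.

```python
import os
import re
import tempfile
import zipfile

def unzip(zip_path, out_dir):
    """Extract every member of zip_path into out_dir; return the member names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
        return archive.namelist()

# Four comma-separated, non-empty fields per line (a hypothetical layout).
FOUR_FIELDS = re.compile(r"^[^,]+(,[^,]+){3}$")

def all_lines_have_four_fields(text):
    """Return True if every line of text matches the four-field pattern."""
    return all(FOUR_FIELDS.match(line) for line in text.splitlines())

# Self-contained demonstration with a small archive built on the spot.
work_dir = tempfile.mkdtemp()
zip_path = os.path.join(work_dir, "demo.zip")
with zipfile.ZipFile(zip_path, "w") as archive:
    archive.writestr("prices.csv", "2001,04,01,32.5\n2001,04,02,40.0\n")

names = unzip(zip_path, work_dir)
with open(os.path.join(work_dir, names[0])) as f:
    looks_ok = all_lines_have_four_fields(f.read())
```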


Python can do far more with webpages

• Details on web: http://docs.python.org/lib/module-urllib.html

• Python’s sample programs include:
– Webchecker.py (checks for broken links on a website)
– Websucker.py (downloads a whole website)

• I found their code a bit hard to follow.

• I used snippets of those programs as examples for this program