EXTRACTING & ANALYZING DATA FROM MUNICIPAL FINANCIAL DISCLOSURES, Marc Joffe. OPEN DATA SCIENCE CONFERENCE, BOSTON 2015. @opendatasci

Keeping Governments Accountable with Open Data Science: Extracting and Analyzing Municipal Financial Data

Extracting and Analyzing Data from Municipal Financial Disclosures

Marc Joffe
Public Sector Credit Solutions

Open Data Science Conference
Boston, May 2015

The Research Question

• How is the cost of funding public employee pensions affecting California cities?
• I hoped to answer the question by gathering pension expenditure data for all cities in the state.
• Main data points:
  • Current and future contribution amounts
  • Funded ratio

Data on City Pensions

• The best sources for information on local government pension costs are (1) the municipality’s audited financial statements (Comprehensive Annual Financial Reports, or CAFRs) and (2) actuarial valuation reports published by the pension fund.
• In California (and some other states), most cities rely on a multi-employer pension system. The system in California, CalPERS, publishes one actuarial report for each local government pension plan it administers, about 3,000 in all.
• I was interested only in the roughly 1,400 plans covering city employees.
• CalPERS publishes a separate PDF for each plan.
• The main challenge is thus to get the 1,400 PDFs and extract key data points (such as future actuarially required contributions) from them.

Gathering the Pension Data (1 of 2)

• Found a web page that had links to all the actuarial valuation PDFs.
  • In this case: http://www.calpers.ca.gov/index.jsp?bc=/about/forms-pubs/calpers-reports/actuarial-reports/home.xml
• Downloaded this page and scraped all the links.
  • This can be done with a Python script (ideally leveraging an HTML processing library like BeautifulSoup) or by copying/pasting to Excel. When copying content from a web page to Excel, it is better to use Internet Explorer than other browsers.
• Ran a command-line script to download all the links. This shell script or Windows command file can use curl or wget to retrieve the PDFs.
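The link-scraping step can be sketched in Python roughly as follows. This is a minimal illustration, not the script used for the project: it swaps BeautifulSoup for the standard library's `html.parser` so that it runs without extra installs, and the sample markup, the `example.org` base URL, and the names `PdfLinkParser` and `extract_pdf_links` are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkParser(HTMLParser):
    """Collect absolute URLs of every <a href> that points at a PDF."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore any query string when checking the file extension.
        if href and href.lower().split("?")[0].endswith(".pdf"):
            self.links.append(urljoin(self.base_url, href))


def extract_pdf_links(html, base_url):
    """Return the PDF links found in an HTML string, in document order."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical sample markup; the real index page will look different.
SAMPLE_HTML = """
<ul>
  <li><a href="/reports/plan-a-2013.pdf">Plan A</a></li>
  <li><a href="plan-b-2013.PDF">Plan B</a></li>
  <li><a href="/about/contact.html">Contact</a></li>
</ul>
"""

links = extract_pdf_links(SAMPLE_HTML, "http://example.org/reports/index.html")
for url in links:
    print(url)
```

Saving the printed URLs to a text file gives a list that a download script can work through, for example with `wget -i urls.txt`.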

Gathering the Pension Data (2 of 2)

• Because the valuation PDFs have embedded text, no OCR was necessary. I pulled out the text with Poppler’s pdftotext command-line executable, using the -layout option to make the outputs more readable.
• Because the PDFs had very consistent formats (they appear to have been output by a report generator), I could take advantage of patterns in the text. I wrote Python scripts to read each file and extract just the portions I needed. I output the strings I captured to a CSV file.
• I loaded the CSV file into Excel for further analysis.
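A pattern-based extraction of this kind might look like the sketch below. It assumes each PDF has already been converted with something like `pdftotext -layout report.pdf report.txt`. The label wording ("Funded Ratio", "Employer Contribution"), the sample text, and the field names are assumptions made for illustration; the real reports would need patterns written against their actual layout.

```python
import csv
import io
import re

# Hypothetical patterns: the label wording and spacing are assumptions,
# not taken from an actual CalPERS valuation report.
PATTERNS = {
    "funded_ratio": re.compile(r"Funded Ratio\s+([\d.]+)%"),
    "required_contribution": re.compile(r"Employer Contribution\s+\$\s*([\d,]+)"),
}


def parse_report(text):
    """Return one dict of captured strings; missing fields become ''."""
    row = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(text)
        row[field] = match.group(1).replace(",", "") if match else ""
    return row


def write_csv(rows, out):
    """Write the captured rows to any file-like object as CSV."""
    writer = csv.DictWriter(out, fieldnames=["plan"] + list(PATTERNS))
    writer.writeheader()
    writer.writerows(rows)


# Made-up stand-in for the text that pdftotext would produce.
SAMPLE_TEXT = """
    Plan's Funded Status
    Funded Ratio                                   78.4%
    Employer Contribution                     $  1,234,567
"""

row = {"plan": "sample-plan", **parse_report(SAMPLE_TEXT)}
buffer = io.StringIO()
write_csv([row], buffer)
print(buffer.getvalue())
```

In practice the script would loop over every converted text file and write to a real CSV file instead of an in-memory buffer. Leaving unmatched fields blank makes it easy to spot reports whose layout differs from the rest.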

Answering the “So What Question” with Revenue Data

• The raw pension numbers are not that interesting unless placed into some context. I wanted to calculate the ratio of pension costs to total revenue for each city because that is a fiscal health measure. A ranking of cities by this measure is interesting, especially to cities near the top of the ranking!
• The actuarial valuation reports provide actuarially required contributions for the upcoming fiscal year. I could get revenue data from CAFRs but these are published on a delayed basis.
• A more timely source proved to be a data set provided by the State Controller via a Socrata Open Data platform. See http://bythenumbers.sco.ca.gov.

Mashing up the Data and Analyzing

• I now had two data sets: pension costs and revenues.
• The remaining steps needed to calculate the pension cost/revenue ratios are as follows:
  • Add up all the plans for each city to get total city pension costs.
  • Map the city names in the CalPERS data set to the city names in the State Controller data set. This was generally straightforward, but there were a couple of oddities (such as Paso Robles = El Paso de Robles).
  • Using the common key (i.e., standardized city name), combine the two data sets.
  • Calculate the ratio.
  • Sort in descending order.
• I did the above in Excel and Google Sheets. I could have used Python or another scripting language but I find spreadsheets easier.
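For readers who prefer a script to a spreadsheet, the same steps could be sketched in Python as below. All figures are made up for illustration and are not actual CalPERS or State Controller data; "Exampleville" and the function name `pension_burden` are hypothetical. Only the Paso Robles name mapping comes from the slide.

```python
from collections import defaultdict

# Illustrative, made-up figures; not actual CalPERS or State Controller data.
pension_plans = [
    ("Paso Robles", "Miscellaneous", 2_000_000),
    ("Paso Robles", "Safety", 1_000_000),
    ("Exampleville", "Miscellaneous", 500_000),
]
revenues = {
    "El Paso de Robles": 30_000_000,
    "Exampleville": 20_000_000,
}
# Maps a name in the pension data to its name in the revenue data.
NAME_MAP = {"Paso Robles": "El Paso de Robles"}


def pension_burden(plans, revenue_by_city, name_map):
    """Return (city, pension cost / revenue) pairs, highest ratio first."""
    totals = defaultdict(int)
    for city, _plan, cost in plans:
        # Standardize the city name, then add up all of the city's plans.
        totals[name_map.get(city, city)] += cost
    ratios = [
        (city, total / revenue_by_city[city])
        for city, total in totals.items()
        if city in revenue_by_city  # skip cities with no revenue record
    ]
    return sorted(ratios, key=lambda pair: pair[1], reverse=True)


ranking = pension_burden(pension_plans, revenues, NAME_MAP)
for city, ratio in ranking:
    print(f"{city}: {ratio:.1%}")
```

Cities that fail to match after name mapping are silently dropped here. A real run would want to list them so the mapping table can be extended.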

The Results…

Our next project: govwiki.us

URL: http://govwiki.us
Repo: https://github.com/govwiki/govwiki.us

Online database of all US local governments.

• Obtained a list of 91,000 local governments from the US Census
• Performed rough geocoding
• Now gathering additional data from public sources in California
• Hope to launch in August
• Also hope to create a Wikipedia interface
• Environment: MySQL, Node.js, CoffeeScript

Original PDF Liberation Presentation – 1/2014

• In January 2014, I worked with the Sunlight Foundation to host the “PDF Liberation Hackathon” in New York, Washington, Chicago and San Francisco.
• A list of PDF extraction solutions and sample PDF extraction problems is available at: http://pdfliberation.wordpress.com/
• Following are some slides related to that event.

An Example of How PDF Liberation Can Generate News

• Working with Mortgage Resolution Partners, the City of Richmond has proposed to use its power of eminent domain to refinance mortgages for underwater homeowners.
• In July, the media reported that 624 properties had been chosen.
• I wanted to know which ones, so I filed a California Public Records Act request . . .

The Request… (Make it Very Specific)

Dear Ms. Holmes,

Pursuant to my rights under the California Public Records Act (Government Code Section 6250 et seq.), I ask to obtain a copy of the following, which I understand to be held by your agency:

Attachments A, B and C to letters sent to mortgage servicers offering to purchase mortgage loans dated on or about July 31, 2013. The form letter is available on the internet at http://www.contracostatimes.com/west-county-times/ci_23760190/document-city-richmond-letter-mortgage-lenders?source=pkg. I understand that 32 such letters have been sent, so this request involves as many as 96 unique documents.

The purpose of this request is to obtain a list of 624 mortgages which Richmond is offering to purchase containing the property addresses, mortgage amounts, appraised values, servicer names, and, if possible, the name of the Residential Mortgage Backed Securities (RMBS) deal holding each mortgage. If you can provide this listing in a more concise format, I will accept it in lieu of the attachments described in the previous paragraph.

I ask for a determination on this request within 10 days of your receipt of it, and an even prompter reply if you can make that determination without having to review the record[s] in question. If you determine that some but not all of the information is exempt from disclosure and that you intend to withhold it, I ask that you redact it for the time being and make the rest available as requested. In any event, please provide a signed notification citing the legal authorities on which you rely if you determine that any or all of the information is exempt and will not be disclosed.

If I can provide any clarification that will help expedite your attention to my request, please contact me by phone at 415-578-0558 or by email at [email protected]. I ask that the requested documents be sent to me in electronic format via return email. If you must provide paper documents, I ask that you notify me of any duplication costs exceeding $50 before you duplicate the records so that I may decide which records I want copied. I can visit your office to collect the documents once they have been duplicated.

Thank you for your time and attention to this matter.

Sincerely,
Marc D. Joffe
1655 North California Blvd. Unit 162
Walnut Creek, CA 94596

The Response…

• Four PDFs

Processing

• Loaded the four PDFs into Able2Extract, a commercial PDF conversion tool that costs about $100*
• Converted the PDFs to Microsoft Excel
• I now had multiple lists of properties with different fields
• I sorted the lists into the same order and then joined them together into one master spreadsheet
• I found that three properties had mortgage balances over $800,000 and was able to connect the balances to the addresses
• This made it possible to map the properties and to see the houses themselves on Google Street View

* Tabula, an open source tool, is reaching the point at which it could perform the same function.
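The joining step was done by sorting lists in a spreadsheet. A scripted alternative is to merge the lists on a shared key, as in the rough sketch below. It assumes every list carries a property address that can serve as that key, which the slides do not confirm; the records, field names, and helper names are made up for illustration.

```python
def normalize(address):
    """Crude join key: uppercase the address and collapse whitespace."""
    return " ".join(address.upper().split())


def merge_lists(*lists):
    """Combine several lists of records into one dict keyed by address."""
    master = {}
    for records in lists:
        for record in records:
            key = normalize(record["address"])
            merged = master.setdefault(key, {"address": key})
            # Copy every field except the raw address into the master row.
            merged.update({k: v for k, v in record.items() if k != "address"})
    return master


# Made-up records standing in for the converted spreadsheets.
balances = [
    {"address": "1 Example St", "balance": 850_000},
    {"address": "2 Sample Ave", "balance": 300_000},
]
appraisals = [
    {"address": "2  sample ave", "appraised_value": 250_000},
    {"address": "1 example st", "appraised_value": 900_000},
]

master = merge_lists(balances, appraisals)
large = [row for row in master.values() if row.get("balance", 0) > 800_000]
for row in large:
    print(row)
```

Joining on a key rather than on row position avoids misaligned rows when one list is missing a property, at the cost of needing a key that is spelled consistently across lists.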

The Results …

• Lead story in the business section of the Chronicle
• Wall Street Journal blog post
• Finding raised at City Council meeting
• In December, Mayor Gayle McLaughlin altered the program to exclude mortgages above the conforming loan limit ($729,500) and to focus on blighted neighborhoods.

By the way: The owner of the house on the right was apparently unaware that her home had been included in the program. So my initial theory that this had been a case of cronyism was not borne out.

Some of Our Challenges

• Government Financial Statements
• IRS Form 990s (Non-Profit Disclosures)
• House of Representatives Financial Disclosures
• Compiling a History of Torture

Government Financial Statements: Finding the Next Detroit

IRS Form 990s: Finding members of the 1% who work at not-for-profits

. . . And finding the 1% in Congress by dissecting House Financial Disclosures

This project was taken on by our second place prize winner. Their best results came from using Captricity.com.

Documenting a History of Torture: Parsing Amnesty International Annual Reports

This project was taken on by our first place prize winner.

Three Inter-Related Problems …

• Extracting data from PDFs that contain embedded text

• Using Optical Character Recognition (OCR) to generate text from PDFs of scans or photographs

• Transforming unstructured text and numbers into a form that can be readily analyzed. A related IT term is ETL (Extract-Transform-Load)
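As a small illustration of the third problem, the sketch below turns amounts as they often appear in extracted text into numbers that can be analyzed. The accepted formats (currency symbols, thousands separators, parentheses for negatives) and the function name `parse_amount` are assumptions chosen for the example, not a description of any particular document set.

```python
import re


def parse_amount(text):
    """Convert strings such as '$1,234,567' or '(1,234)' to a float.

    Returns None when no number can be recovered.
    """
    cleaned = text.strip()
    # Accounting convention: a value in parentheses is negative.
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    digits = re.sub(r"[^\d.]", "", cleaned)
    try:
        value = float(digits)
    except ValueError:
        return None
    return -value if negative else value


for raw in ["$1,234,567", "(1,234)", "78.4%", "n/a"]:
    print(raw, "->", parse_amount(raw))
```

Returning `None` for unparseable cells keeps bad values visible in the output instead of silently turning them into zeros.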

… and some Open Source Solutions

• Extracting data from PDFs that contain embedded text: PDFBox, Poppler
• Using Optical Character Recognition (OCR) to generate text from PDFs of scans or photographs: Tesseract
• Transforming unstructured text and numbers into a form that can be readily analyzed. A related IT term is ETL (Extract-Transform-Load): Tabula (for table identification), OpenRefine

… or Licensed Solutions

• Extracting data from PDFs that contain embedded text: PDFLib Text Extraction Tool
• Using Optical Character Recognition (OCR) to generate text from PDFs of scans or photographs: ABBYY (FineReader or Cloud SDK)
• Transforming unstructured text and numbers into a form that can be readily analyzed. A related IT term is ETL (Extract-Transform-Load): SIMX Text Converter