
capstone project

  • Upload
    yu-du


Overview

Data Source

Data Scrape

DataLoad.py

DataClean.py

Clean/Normalize

DataWrite.py


Data Source

The home value data comes from the HTML of the Zillow page, in the <div> tag with class “subfooter-region-info zsg-g_gutterless”.
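A minimal sketch of how text could be pulled out of that <div>, using only Python's standard-library html.parser. The RegionInfoParser class, the extract_region_info helper, and the sample markup are hypothetical and for illustration only; the slides do not show the actual page layout or the scraping code. The class attribute is matched as an exact string, which is an assumption.

```python
from html.parser import HTMLParser

# Class string quoted on the slide; matched exactly (an assumption).
TARGET_CLASS = "subfooter-region-info zsg-g_gutterless"


class RegionInfoParser(HTMLParser):
    """Collect the text inside the first <div> whose class equals TARGET_CLASS."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # <div> nesting depth inside the target div (0 = outside)
        self.chunks = []    # text fragments found inside the target div
        self.done = False   # set once the target div has been closed

    def handle_starttag(self, tag, attrs):
        if tag != "div" or self.done:
            return
        if self.depth > 0:
            self.depth += 1  # nested div inside the target
        elif dict(attrs).get("class") == TARGET_CLASS:
            self.depth = 1   # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_region_info(html):
    """Return the whitespace-joined text of the target div, or '' if absent."""
    parser = RegionInfoParser()
    parser.feed(html)
    return " ".join(parser.chunks)


# Made-up markup for illustration; not the real page.
sample = (
    '<html><body>'
    '<div class="subfooter-region-info zsg-g_gutterless">'
    '<span>Example ZIP</span> <span>home value: 123,456</span>'
    '</div><div>other</div></body></html>'
)
print(extract_region_info(sample))
```

Turning the extracted text into a ZIP code and a numeric home value would need a further parsing step that depends on the real page text.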


Data Retrieving

DataLoad.py loops through the ZIP codes in a ZIP code to county FIPS code relationship text file from the US Census. It uses each ZIP code to query the Zillow website, retrieves the ZIP code and average home value from the page, and stores the values in a database created up front. The county FIPS code is used to group the average home values and, later, for data visualization.
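The loading loop described above can be sketched roughly as follows. This is not the actual DataLoad.py: the table name home_values, the column names ZCTA5 and GEOID, the comma-separated layout, and the fetch_home_value callback are all assumptions. The web query is passed in as a function so the sketch runs without network access, and SQLite stands in for whatever database the project used.

```python
import csv
import io
import sqlite3


def load_home_values(rel_file, conn, fetch_home_value):
    """Read ZIP/FIPS pairs, look up each ZIP's home value, and store the result.

    rel_file: open text file with a header row containing ZCTA5 (ZIP) and
              GEOID (county FIPS) columns (assumed layout).
    fetch_home_value: callable taking a ZIP code and returning a float,
              or None when no value could be retrieved (hypothetical).
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS home_values "
        "(zip TEXT PRIMARY KEY, fips TEXT, avg_value REAL)"
    )
    for row in csv.DictReader(rel_file):
        zip_code, fips = row["ZCTA5"], row["GEOID"]
        value = fetch_home_value(zip_code)
        if value is None:
            continue  # nothing retrieved for this ZIP; leave it out
        conn.execute(
            "INSERT OR REPLACE INTO home_values VALUES (?, ?, ?)",
            (zip_code, fips, value),
        )
    conn.commit()


# Usage with made-up data and a stub in place of the real web query.
sample_file = io.StringIO("ZCTA5,GEOID\n00001,01001\n00002,01001\n00003,01003\n")
fake_values = {"00001": 100000.0, "00002": 200000.0}
conn = sqlite3.connect(":memory:")
load_home_values(sample_file, conn, fake_values.get)
rows = conn.execute(
    "SELECT zip, fips, avg_value FROM home_values ORDER BY zip"
).fetchall()
print(rows)
```

A ZIP code that spans several counties appears on several rows of a relationship file; with the primary key on zip, this sketch keeps only the last county seen for it.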


Data Clean/Reformat/Visualization

A “Final Table” is created in the database to store two values for visualization: ID and rate. ID is the county FIPS code, and rate is a normalized value of the average house price. The data is then written to a TSV file and visualized with D3. 1773 data points were retrieved; around 1300 are still missing, as the picture shows.
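A rough sketch of the group, normalize, and write steps. The slides do not say which normalization was used, so min-max scaling to the range 0 to 1 is an assumption here, as are the function names county_rates and write_tsv and the lowercase id/rate header.

```python
import csv
import io
from collections import defaultdict


def county_rates(rows):
    """rows: iterable of (fips, avg_value) pairs, one per ZIP code.

    Returns {fips: rate}, where rate is the county mean rescaled to [0, 1]
    with min-max scaling (assumed; the original method is not stated).
    """
    by_county = defaultdict(list)
    for fips, value in rows:
        by_county[fips].append(value)
    means = {fips: sum(v) / len(v) for fips, v in by_county.items()}
    if not means:
        return {}
    lo, hi = min(means.values()), max(means.values())
    span = hi - lo
    # If every county has the same mean, fall back to 0.0 to avoid dividing by zero.
    return {fips: (m - lo) / span if span else 0.0 for fips, m in means.items()}


def write_tsv(rates, out):
    """Write one id/rate row per county, tab-separated, sorted by FIPS code."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["id", "rate"])
    for fips in sorted(rates):
        writer.writerow([fips, f"{rates[fips]:.4f}"])


# Made-up values for illustration.
rows = [
    ("01001", 100000.0),
    ("01001", 200000.0),
    ("01003", 350000.0),
    ("01005", 250000.0),
]
rates = county_rates(rows)
buf = io.StringIO()
write_tsv(rates, buf)
print(buf.getvalue())
```

Counties with no retrieved values never appear in the output, which is how the missing data points would show up as gaps on the map.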


Data Clean/Reformat/Visualization

The top 50 ZIP codes by average house price were also marked on Google Maps. The 50 most expensive ZIP codes are concentrated in CA and NY, with two exceptions: Miami, FL and Aspen, CO. To my surprise, the average home value in Aspen, CO is around $1.5 million according to Zillow.com.
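Picking the most expensive ZIP codes could look something like this. The function name top_zip_codes and the sample values are hypothetical; placing the markers on the map is not shown.

```python
import heapq


def top_zip_codes(values, n=50):
    """Return the n (zip, avg_value) pairs with the highest average value.

    values: dict mapping ZIP code to average home value.
    """
    return heapq.nlargest(n, values.items(), key=lambda item: item[1])


# Made-up values for illustration.
sample_values = {"00001": 100000.0, "00002": 900000.0, "00003": 450000.0}
print(top_zip_codes(sample_values, n=2))
```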
