HIST*4170 Data: Big and Small 29 January 2013


Page 1: HIST*4170 Data :  Big and Small

HIST*4170 Data: Big and Small
29 January 2013

Page 2: HIST*4170 Data :  Big and Small

Today’s Agenda
• Blog Updates
• A Short Introduction to Databases
• A Big Data Project: People In Motion
• Special Guest: Dr. Rebecca Lenihan

Page 3: HIST*4170 Data :  Big and Small

Blog Highlights
• Ambition
• Consider scalability
• Consider source availability – local advantage?
• Keep your eye on the academic value
• What do you want to teach? Learn?
• Themes: war, sport, family, mapping
• Intellectual property/privacy
• Resources:
• Google Sketchup
• To make 3D buildings

Page 4: HIST*4170 Data :  Big and Small

Data Deluge
• Bit, byte, kilobyte (kB), megabyte (MB), gigabyte, terabyte, petabyte, exabyte, zettabyte...
• Library of Congress = 200 terabytes
• “Transferring ‘Libraries of Congress’ of Data”
• IP traffic is around 667 exabytes
• It’s a deluge...
• Ian Milligan, “Preparing for the Infinite Archive: Social Historians and the Looming Digital Deluge” (Mar 23, Tri-U history conference)
• “Big Data”
• too large for current software to handle
• Don’t be intimidated
• Not all DH sources (yet)
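To put the two figures on this slide side by side, here is a rough unit conversion in Python. It assumes decimal (SI) units, where each step up is a factor of 1000; the helper name is made up for illustration.

```python
# Decimal (SI) byte units: each step up the list is a factor of 1000.
# Binary units (KiB, MiB, ...) would use powers of 1024 instead.
UNITS = ["B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB"]
BYTES = {unit: 1000 ** power for power, unit in enumerate(UNITS)}

LIBRARY_OF_CONGRESS = 200 * BYTES["TB"]  # figure quoted on the slide
IP_TRAFFIC = 667 * BYTES["EB"]           # figure quoted on the slide


def in_libraries_of_congress(n_bytes: int) -> float:
    """Express a byte count as a multiple of the 200 TB estimate."""
    return n_bytes / LIBRARY_OF_CONGRESS


print(f"{in_libraries_of_congress(IP_TRAFFIC):,.0f} Libraries of Congress")
```

On these assumptions, 667 exabytes comes to a little over 3.3 million "Libraries of Congress".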

Page 5: HIST*4170 Data :  Big and Small

Introduction to Databases
• Database – a system that allows for the efficient storage and retrieval of information
• We associate with...
• Computers changed a lot
• Problems: organization and efficient retrieval
• Organization = requires data structure
• Efficient retrieval = requires algorithms
• Potential for Humanities?
• ...new problems, questions, visualizations, and objects worthy of study and reflection.

Page 6: HIST*4170 Data :  Big and Small

Database Design
• The purpose of a database is to store information about a particular domain and to allow one to ask questions about the state of that domain.
• Relational databases are more efficient because they store information separately
• Attributes
• Relationships
• Quamen reading is a nice introduction
• Not as complicated as you might think, but following rules is important
• We will apply...
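As a small illustration of "storing information separately", the sketch below uses Python's built-in sqlite3 module. The two tables, their columns, and the sample rows are all invented for this example; they are not taken from the course materials or the Quamen reading.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: people are stored once, and each census appearance
# is a separate row that points back to the person (the relationship).
conn.executescript("""
CREATE TABLE person (
    person_id  INTEGER PRIMARY KEY,
    last_name  TEXT NOT NULL,
    first_name TEXT NOT NULL
);
CREATE TABLE residence (
    residence_id INTEGER PRIMARY KEY,
    person_id    INTEGER NOT NULL REFERENCES person(person_id),
    census_year  INTEGER NOT NULL,
    township     TEXT NOT NULL
);
""")

# Made-up sample rows.
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [(1, "Smith", "Mary"), (2, "Roy", "Jean")],
)
conn.executemany(
    "INSERT INTO residence (person_id, census_year, township) VALUES (?, ?, ?)",
    [(1, 1871, "Logan"), (1, 1881, "Guelph"), (2, 1871, "Guelph")],
)


def moved_between(conn, year_a, year_b):
    """Ask a question of the domain: who lived somewhere else in year_b?"""
    rows = conn.execute(
        """
        SELECT p.first_name, p.last_name, a.township, b.township
        FROM person p
        JOIN residence a ON a.person_id = p.person_id AND a.census_year = ?
        JOIN residence b ON b.person_id = p.person_id AND b.census_year = ?
        WHERE a.township <> b.township
        ORDER BY p.person_id
        """,
        (year_a, year_b),
    )
    return rows.fetchall()


print(moved_between(conn, 1871, 1881))
```

Because a person's name is stored only once, correcting a spelling means changing one row, and every census appearance linked to that person picks up the correction.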

Page 7: HIST*4170 Data :  Big and Small

New approach: Crowdsourcing
• An “online, distributed problem-solving and production model.”
• Daren C. Brabham (2008), "Crowdsourcing as a Model for Problem Solving: An Introduction and Cases", Convergence: The International Journal of Research into New Media Technologies 14 (1): 75–90
• Cited in Wikipedia, where “Anyone with Internet access can write and make changes to Wikipedia articles...”
• reCAPTCHA
• Luis von Ahn
• Others...
• Google?

Page 8: HIST*4170 Data :  Big and Small
Page 9: HIST*4170 Data :  Big and Small
Page 10: HIST*4170 Data :  Big and Small

There are limitations...
• Organization
• Quality Control
• Selection

Page 11: HIST*4170 Data :  Big and Small

A Database for Your Project?
• Think about how you might use a database
• but perhaps not too big!
• Databases can be very small and still be DH-worthy
• Are there public docs out there that you can digest?
• Google Refine
• Incorporate a search function into your website?
• Resources
• MS Excel (spreadsheet)
• MS Access (relational database)
• Google Refine
• Cleaning data

Page 12: HIST*4170 Data :  Big and Small

Assignment for Next Week
• Reading: TBD (3D guns?)
• Help someone else out with their project
• Read their blog
• Comment and provide detailed feedback
• Find a collaborator?

Page 13: HIST*4170 Data :  Big and Small

People in Motion: Creating Longitudinal Data from Canadian Historical Census

Page 14: HIST*4170 Data :  Big and Small

What we are working towards
A comprehensive infrastructure of longitudinal data
‘Unbiased’ links connecting individuals/households over several census years
[Diagram: linked censuses: 1851, 1871, 1881, 1891, 1901, 1906, 1911 and 1916 Census; US 1880 and US 1900 Census]

Page 15: HIST*4170 Data :  Big and Small

Current Work
[Diagram: Automatic Linking between 100% of 1871 Census and 100% of 1881 Census; record counts shown: 4,277,807 records and 3,601,663 records]
Partners and collaborators: FamilySearch (Church of Latter Day Saints), Minnesota Population Center, Université de Montréal, Université Laval/CIEQ, University of Alberta

Page 16: HIST*4170 Data :  Big and Small

Existing (True) Links

• Ontario Industrial Proprietors – 8429 links
• Logan Township – 1760 links
• St. James Church, Toronto – 232 links
• Quebec City Boys – 1403 links
• Bias concerns
  – family context
  – others?
[Map labels: Logan Twp, Guelph]

Page 17: HIST*4170 Data :  Big and Small

Attributes for Automatic Linking

• Last Name – string
• First Name – string
• Gender – binary
• Birthplace – code
• Age – number
• Marital status – single, married, divorced, widowed, unknown
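The attribute list above maps naturally onto a typed record. The following Python sketch is one possible representation; the class names, field names, and the choice of an integer for the birthplace code are assumptions for illustration, not the project's actual schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Gender(Enum):
    # "binary" on the slide; this encoding is an assumption.
    MALE = "M"
    FEMALE = "F"


class MaritalStatus(Enum):
    SINGLE = "single"
    MARRIED = "married"
    DIVORCED = "divorced"
    WIDOWED = "widowed"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class CensusRecord:
    """One person as enumerated in one census (hypothetical layout)."""
    last_name: str
    first_name: str
    gender: Gender
    birthplace_code: int       # standardized code, not free text
    age: Optional[int]         # None when the age could not be read
    marital_status: MaritalStatus = MaritalStatus.UNKNOWN


# Made-up example record.
example = CensusRecord("Smith", "Mary", Gender.FEMALE, 15, 24, MaritalStatus.MARRIED)
print(example)
```

Using enumerations for the closed categories means a misspelled status fails when the record is built rather than silently creating a new category.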

Page 18: HIST*4170 Data :  Big and Small

Automatic Linkage

• The challenges:
  1) Identify the same person
  2) Deal with attribute characteristics
  3) Manage computational expense

• The system:

Page 19: HIST*4170 Data :  Big and Small

Data Cleaning and Standardization
• Cleaning
  – Names – remove non-alphanumeric characters; remove titles
  – Age – transform non-numerical representations to corresponding numbers (e.g. 3 months)
  – All attributes – deal with English/French notations (e.g. days/jours, married/mariee)
• Standardization
  – Birthplace codes and granularity
  – Marital status
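The cleaning steps listed above could be sketched in Python roughly as follows. The title list, the unit table, and the marital-status spellings are illustrative assumptions; the project's actual rules are not given on the slide.

```python
import re
from typing import Optional

# Hypothetical title list; a real project would curate its own.
TITLES = {"mr", "mrs", "miss", "dr", "rev", "sir"}

# How many of each unit make up one year (English and French spellings).
UNITS_PER_YEAR = {
    "year": 1, "years": 1, "an": 1, "ans": 1,
    "month": 12, "months": 12, "mois": 12,
    "week": 52, "weeks": 52, "semaine": 52, "semaines": 52,
    "day": 365, "days": 365, "jour": 365, "jours": 365,
}

# A few English/French spellings mapped onto the standard categories.
MARITAL_STATUS = {
    "single": "single", "celibataire": "single", "célibataire": "single",
    "married": "married", "marie": "married", "mariee": "married",
    "marié": "married", "mariée": "married",
    "divorced": "divorced",
    "widowed": "widowed", "veuf": "widowed", "veuve": "widowed",
}


def clean_name(raw: str) -> str:
    """Drop punctuation and titles; keep letters (including accents)."""
    stripped = re.sub(r"[^\w\s]|_", "", raw)
    words = stripped.lower().split()
    return " ".join(word for word in words if word not in TITLES)


def parse_age(raw: str) -> Optional[float]:
    """Turn '45' or '3 months' into an age in years; None if unreadable."""
    match = re.fullmatch(r"\s*(\d+)\s*([^\W\d_]*)\.?\s*", raw)
    if match is None:
        return None
    value, unit = int(match.group(1)), match.group(2).lower()
    if not unit:
        return float(value)
    per_year = UNITS_PER_YEAR.get(unit)
    return None if per_year is None else value / per_year


def standardize_marital_status(raw: str) -> str:
    """Map a raw entry to a standard category, defaulting to 'unknown'."""
    return MARITAL_STATUS.get(raw.strip().lower(), "unknown")


print(clean_name("Mrs. Mary O'Neil"), parse_age("3 months"),
      standardize_marital_status("Mariee"))
```

Returning None or "unknown" for unreadable values keeps the cleaning step from inventing data; those records can then be reviewed or excluded later.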

Page 20: HIST*4170 Data :  Big and Small

Computational Expense

• Very expensive to compare all the possible pairs of records

• Computing similarity between 3.5 million records (1871 census) and 4 million records (1881 census)

• Run-time estimate: ((3.5M x 4M) record pairs x 2 attributes being compared) / (4M comparisons per second) / 60 (sec/min) / 60 (min/hour) / 24 (hours/day) ≈ 81 days, or about 40.5 days per attribute compared. (Big Data)
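The run-time arithmetic can be checked with a few lines of Python. The function name and parameters are made up for illustration; the inputs are the rounded figures from the slide. With those inputs, each attribute compared across all record pairs costs roughly 40.5 days of computation, so two attributes come to roughly 81 days.

```python
SECONDS_PER_DAY = 60 * 60 * 24


def linkage_runtime_days(records_a: int, records_b: int,
                         attributes: int, comparisons_per_second: int) -> float:
    """Days needed to compare every record in A with every record in B."""
    comparisons = records_a * records_b * attributes
    return comparisons / comparisons_per_second / SECONDS_PER_DAY


one_attribute = linkage_runtime_days(3_500_000, 4_000_000, 1, 4_000_000)
two_attributes = linkage_runtime_days(3_500_000, 4_000_000, 2, 4_000_000)
print(f"{one_attribute:.1f} days for one attribute, "
      f"{two_attributes:.1f} days for two")
```

The cost grows with the product of the two census sizes, which is why reducing the number of pairs matters more than speeding up each comparison.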

Page 21: HIST*4170 Data :  Big and Small

Managing Computational Expense

• Blocking
  – By first letter of last name
  – By birthplace
• Using HPC
  – Running the system on multiple processors in parallel
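Blocking can be sketched as grouping records by a key and only pairing records that share that key. The Python example below is a minimal illustration with made-up records stored as dictionaries; the key functions and field names are assumptions, and whether the project combines the two blocking keys or uses them separately is not stated on the slide.

```python
from collections import defaultdict


def by_initial(record):
    """Blocking key: first letter of the last name."""
    return record["last_name"][:1].upper()


def by_initial_and_birthplace(record):
    """Blocking key: first letter of the last name plus birthplace."""
    return (by_initial(record), record["birthplace"])


def candidate_pairs(records_a, records_b, key):
    """Yield only the pairs whose blocking keys agree."""
    blocks = defaultdict(list)
    for record in records_b:
        blocks[key(record)].append(record)
    for record_a in records_a:
        for record_b in blocks.get(key(record_a), []):
            yield record_a, record_b


# Made-up sample records.
census_a = [
    {"last_name": "Smith", "birthplace": "ON"},
    {"last_name": "Roy", "birthplace": "QC"},
]
census_b = [
    {"last_name": "Smyth", "birthplace": "ON"},
    {"last_name": "Roy", "birthplace": "QC"},
    {"last_name": "Stone", "birthplace": "NS"},
]

all_pairs = len(census_a) * len(census_b)
initial_pairs = list(candidate_pairs(census_a, census_b, by_initial))
strict_pairs = list(candidate_pairs(census_a, census_b, by_initial_and_birthplace))
print(all_pairs, len(initial_pairs), len(strict_pairs))
```

The trade-off is that a true match is never compared if its blocking key was recorded differently in the two censuses, for example a surname whose first letter was mistranscribed. Because the blocks do not depend on one another, they are also a natural unit for running on multiple processors in parallel.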

Page 22: HIST*4170 Data :  Big and Small

Record Comparison

• Comparing Strings
  – Jaro-Winkler
  – Edit Distance
  – Double Metaphone
• Age
  – +/- 2 years
• Exact matches
  – Gender
  – Birthplace
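The comparison rules above can be combined into a single test for a pair of records. The Python sketch below implements only edit distance (Levenshtein) from the three string measures listed; Jaro-Winkler and Double Metaphone are not shown. The similarity threshold of 0.75, the field names, and the reading of "+/- 2 years" as a tolerance around the expected ten-year gap between censuses are all assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete from a
                current[j - 1] + 1,                    # insert into a
                previous[j - 1] + (char_a != char_b),  # substitute
            ))
        previous = current
    return previous[-1]


def name_similarity(a: str, b: str) -> float:
    """Scale edit distance to 0..1, where 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest


def is_candidate_match(rec_a, rec_b, years_apart=10,
                       age_tolerance=2, name_threshold=0.75) -> bool:
    """Apply exact, age, and string rules to one pair of records."""
    if rec_a["gender"] != rec_b["gender"]:
        return False
    if rec_a["birthplace"] != rec_b["birthplace"]:
        return False
    age_gap = rec_b["age"] - rec_a["age"]
    if abs(age_gap - years_apart) > age_tolerance:
        return False
    return (name_similarity(rec_a["last_name"], rec_b["last_name"]) >= name_threshold
            and name_similarity(rec_a["first_name"], rec_b["first_name"]) >= name_threshold)


# Made-up records for illustration.
in_1871 = {"last_name": "smith", "first_name": "mary",
           "gender": "F", "birthplace": "ON", "age": 20}
in_1881 = {"last_name": "smyth", "first_name": "mary",
           "gender": "F", "birthplace": "ON", "age": 31}
print(is_candidate_match(in_1871, in_1881))
```

Checking the cheap exact-match rules first means the more expensive string comparison only runs on pairs that could still be a match.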

Page 23: HIST*4170 Data :  Big and Small

Linkage Results

Province         Linkage Rate (%)
New Brunswick    24.45
Nova Scotia      21.50
Ontario          18.36
Quebec           17.45