A Robust Open-source GEDCOM Parser
Dallan Quass, Ryan Knight


A Robust Open-source GEDCOM Parser, presented by Dallan Quass and Ryan Knight at RootsTech 2012. Parses GEDCOM files into a "de facto" object model; includes round-tripping for the vast majority of GEDCOM files.


Page 1: A Robust Open-source GEDCOM Parser

A Robust Open-source GEDCOM Parser

Dallan Quass
Ryan Knight

Page 2: A Robust Open-source GEDCOM Parser

What's a GEDCOM?

0 HEAD
1 SOUR PAF
2 NAME Personal Ancestral File
2 VERS 5.2.18.0
2 CORP The Church of Jesus Christ of Latter-day Saints
3 ADDR 50 East North Temple Street
4 CONT Salt Lake City, UT 84150
4 CONT USA
1 DEST Other
1 DATE 9 Aug 2006
2 TIME 19:57:47
1 FILE temp-paf.ged
1 GEDC
2 VERS 5.5
2 FORM LINEAGE-LINKED
1 CHAR UTF-8
1 LANG English
1 SUBM @SUB1@
0 @SUB1@ SUBM
1 NAME Dallan Quass
0 @I1@ INDI
1 NAME Dallan /Quass/
2 SURN Quass
2 GIVN Dallan

If this looks unfamiliar to you, you may not get a lot out of this talk.

On the other hand, the purpose of this project is to handle this for you, so you can develop cool projects in genealogy and let this be unfamiliar to you!
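Every line in the sample file has the same shape: a level number, an optional @id@, a tag, and an optional value. A minimal Java sketch of splitting one line into those parts follows. The class name GedcomLine and its fields are invented for illustration and are not the project's API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of the project's API.
final class GedcomLine {
    // level, optional @id@, tag, optional value
    private static final Pattern LINE =
        Pattern.compile("^\\s*(\\d+)\\s+(?:(@[^@]+@)\\s+)?(\\S+)(?: (.*))?$");

    final int level;
    final String xref;   // null when the line carries no @id@
    final String tag;
    final String value;  // null when the line carries no value

    private GedcomLine(int level, String xref, String tag, String value) {
        this.level = level;
        this.xref = xref;
        this.tag = tag;
        this.value = value;
    }

    static GedcomLine parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a GEDCOM line: " + line);
        }
        return new GedcomLine(Integer.parseInt(m.group(1)), m.group(2), m.group(3), m.group(4));
    }
}
```

For example, GedcomLine.parse("1 NAME Dallan /Quass/") yields level 1, tag NAME and the value "Dallan /Quass/", with no id.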

Page 3: A Robust Open-source GEDCOM Parser

Why is parsing GEDCOMs so hard?

Page 4: A Robust Open-source GEDCOM Parser

Challenge #1 – Character set detection

0 HEAD
1 SOUR PAF
2 NAME Personal Ancestral File
2 VERS 5.2.18.0
2 CORP The Church of Jesus Christ of Latter-day Saints
3 ADDR 50 East North Temple Street
4 CONT Salt Lake City, UT 84150
4 CONT USA
1 DEST Other
1 DATE 9 Aug 2006
2 TIME 19:57:47
1 FILE temp-paf.ged
1 GEDC
2 VERS 5.5
2 FORM LINEAGE-LINKED
1 CHAR UTF-8
1 LANG English
1 SUBM @SUB1@
0 @SUB1@ SUBM
1 NAME Dallan Quass
0 @I1@ INDI
1 NAME Dallan /Quass/
2 SURN Quass
2 GIVN Dallan

Should be easy, except...

Page 5: A Robust Open-source GEDCOM Parser

Challenge #1 – Character set detection

GeneWeb ASCII → ANSI

Geni.com ANSEL → UTF8

Geni.com UNICODE → UTF8

GENJ UNICODE → UTF8

All others UNICODE → UTF16

ASCII/MacOS Roman → x-MacRoman
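One way to read the list above: for a given generating program (the HEAD SOUR value) and declared CHAR value, the encoding on the right is the one to actually use. A minimal Java sketch of such a lookup follows. CharsetGuesser and correctedCharset are hypothetical names, mapping the slide's "ANSI" to windows-1252 is an assumption, and the last row is assumed to apply to any program. Encoding names are returned as strings because ANSEL has no built-in Java charset.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not the project's code.
final class CharsetGuesser {
    // generator: the HEAD SOUR value; declared: the HEAD CHAR value
    static String correctedCharset(String generator, String declared) {
        String gen = generator == null ? "" : generator.trim();
        String chr = declared == null ? "" : declared.trim().toUpperCase(Locale.ROOT);

        if (gen.equals("GeneWeb") && chr.equals("ASCII")) {
            return "windows-1252";            // "ANSI" on the slide (assumption)
        }
        if (gen.equals("Geni.com") && (chr.equals("ANSEL") || chr.equals("UNICODE"))) {
            return "UTF-8";
        }
        if (gen.equals("GENJ") && chr.equals("UNICODE")) {
            return "UTF-8";
        }
        if (chr.equals("UNICODE")) {
            return "UTF-16";                  // all other programs
        }
        if (chr.equals("ASCII/MACOS ROMAN")) {
            return "x-MacRoman";
        }
        return declared;                      // otherwise trust the header
    }
}
```

The order of the checks matters: the program-specific exceptions must be tested before the general UNICODE rule.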

Page 6: A Robust Open-source GEDCOM Parser

Challenge #1 – Character set detection

ANSEL

Page 7: A Robust Open-source GEDCOM Parser

Challenge #2 – Custom tags

The GEDCOM specification hasn't been updated in a LONG time

Page 8: A Robust Open-source GEDCOM Parser

Challenge #3 – Misused tags

Page 9: A Robust Open-source GEDCOM Parser

Shout out

Tim Forsythe

VGed - GEDCOM validator

http://ancestorsnow.blogspot.com/2011/07/vged.html

Page 10: A Robust Open-source GEDCOM Parser

ALIA

1 SEX M
1 ALIA /Ted/
1 BIRT
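GEDCOM 5.5 defines ALIA as a pointer to another individual record, yet the example above stores a name in it. A tolerant reader can tell the two apart by the shape of the value. A minimal Java sketch follows; AliaValue and isPointer are hypothetical names, not the project's API.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class AliaValue {
    // A pointer looks like @I12@: an id wrapped in @ signs, with no spaces inside
    private static final Pattern XREF = Pattern.compile("^@[^@\\s]+@$");

    // true when the ALIA value is a record pointer, false when it is something else (often a name)
    static boolean isPointer(String value) {
        return value != null && XREF.matcher(value.trim()).matches();
    }
}
```

A value that is not a pointer, such as /Ted/, could then be kept as an alternate name instead of being dropped.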

Page 11: A Robust Open-source GEDCOM Parser

SOUR

0 @N6@ NOTE
1 CONT adopted surname Termaat
2 SOUR @S9@

Page 12: A Robust Open-source GEDCOM Parser

DATA

2 SOUR @S2149874917@
3 DATA
4 DATE 11 Sep 1924
3 NOTE ...
3 DATA
4 TEXT ...

2 SOUR @S99@
3 DATA
4 TEXT William Donald ...
4 DATE 1 Sep 1997

2 SOUR @S28@
3 PAGE Indian Prarie...
3 QUAY 3
3 DATE 28 Feb 2005

Page 13: A Robust Open-source GEDCOM Parser

Challenge #4 – Unused tags

Event Phone

Event Agency

Source Citation Event Type

Page 14: A Robust Open-source GEDCOM Parser

Challenge #5 – Names

Page 15: A Robust Open-source GEDCOM Parser

GEDCOM Standard?

The code is more what you'd call

"guidelines" than actual rules.

Page 16: A Robust Open-source GEDCOM Parser

Two goals

Page 17: A Robust Open-source GEDCOM Parser

Goal #1 – Parse GEDCOMs into a de facto object model

De Facto:

In fact or in practice; in actual use or existence, regardless of official or legal status. – Wiktionary.org

Model should be straightforward, easy to use and understand

Page 18: A Robust Open-source GEDCOM Parser

Goal #2 – Round-trip

From GEDCOM

To Object Model

Back to GEDCOM without information loss
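One simple way to check a round trip is to list the lines of the original file that no longer appear in the exported file. The Java sketch below does only that, under stated assumptions: RoundTripCheck and lostLines are invented names, the comparison ignores the hierarchy and ordering of lines, and it is not how the project measures its round-trip rate.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only.
final class RoundTripCheck {
    // Returns the original lines (trimmed) that have no counterpart in the exported lines.
    static List<String> lostLines(List<String> original, List<String> exported) {
        // Count how many times each non-blank line occurs in the export
        Map<String, Integer> remaining = new HashMap<>();
        for (String line : exported) {
            String key = line.trim();
            if (!key.isEmpty()) {
                remaining.merge(key, 1, Integer::sum);
            }
        }
        // Use up one exported occurrence per original line; anything left over was lost
        List<String> lost = new ArrayList<>();
        for (String line : original) {
            String key = line.trim();
            if (key.isEmpty()) {
                continue;
            }
            Integer count = remaining.get(key);
            if (count == null || count == 0) {
                lost.add(key);
            } else {
                remaining.put(key, count - 1);
            }
        }
        return lost;
    }
}
```

An empty result means every original line survived the trip through the object model.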

Page 19: A Robust Open-source GEDCOM Parser

Nirvana

Page 20: A Robust Open-source GEDCOM Parser

There is no Nirvana

Page 21: A Robust Open-source GEDCOM Parser

But we can get pretty close

94%

Page 22: A Robust Open-source GEDCOM Parser

How is it done?

???

Page 23: A Robust Open-source GEDCOM Parser

Object model

Page 24: A Robust Open-source GEDCOM Parser

People

Page 25: A Robust Open-source GEDCOM Parser

Extensions

Page 26: A Robust Open-source GEDCOM Parser

GedML

Originally by Michael Kay
http://users.breathe.com/mhkay/gedml/

Enhanced by Lynn Monson
http://lmonson.com/blog/?page_id=64

Further enhanced by Nathan Powell & Dallan Quass
part of this project

GEDCOM → SAX events
ANSEL reader & writer

Page 27: A Robust Open-source GEDCOM Parser

Parser

Written in Java

~1500 LoC for parser + ~4000 LoC for POJOs

Handles SAX events emitted by GedML

Separate functions called to handle each tag

Maintains a stack of model objects

Attach unexpected tags to model objects as extensions

Fast

Easily extendible

Tree parser also available
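The points above (a stack of model objects, unexpected tags attached as extensions) can be sketched roughly as follows. This is a toy illustration in Java, not the project's code: ModelNode, MiniParser and the EXPECTED tag table are invented, a lookup table stands in for the per-tag handler functions, and the sketch reads plain lines rather than SAX events from GedML.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model object; every name here is invented for this sketch.
final class ModelNode {
    final String tag;
    final String value;                                    // null when the line has no value
    final List<ModelNode> children = new ArrayList<>();    // tags the model expects here
    final List<ModelNode> extensions = new ArrayList<>();  // everything else, kept for export

    ModelNode(String tag, String value) {
        this.tag = tag;
        this.value = value;
    }
}

final class MiniParser {
    // Toy schema: which child tags each parent tag expects
    private static final Map<String, Set<String>> EXPECTED = new HashMap<>();
    static {
        EXPECTED.put("ROOT", new HashSet<>(Arrays.asList("HEAD", "INDI")));
        EXPECTED.put("INDI", new HashSet<>(Arrays.asList("NAME", "SEX", "BIRT")));
        EXPECTED.put("NAME", new HashSet<>(Arrays.asList("GIVN", "SURN")));
    }

    static ModelNode parse(List<String> lines) {
        ModelNode root = new ModelNode("ROOT", null);
        Deque<ModelNode> stack = new ArrayDeque<>();
        stack.push(root);
        for (String line : lines) {
            String[] levelAndRest = line.trim().split(" ", 2);
            int level = Integer.parseInt(levelAndRest[0]);
            String rest = levelAndRest[1];
            if (rest.startsWith("@")) {
                rest = rest.substring(rest.indexOf(' ') + 1);  // record id is ignored in this sketch
            }
            String[] tagAndValue = rest.split(" ", 2);
            ModelNode node = new ModelNode(tagAndValue[0],
                                           tagAndValue.length > 1 ? tagAndValue[1] : null);

            // Pop back to the parent: the stack holds the root plus one object per open level
            while (stack.size() > level + 1) {
                stack.pop();
            }
            ModelNode parent = stack.peek();
            Set<String> expected = EXPECTED.containsKey(parent.tag)
                ? EXPECTED.get(parent.tag)
                : Collections.<String>emptySet();
            if (expected.contains(node.tag)) {
                parent.children.add(node);
            } else {
                parent.extensions.add(node);  // unexpected tag: keep it rather than drop it
            }
            stack.push(node);
        }
        return root;
    }
}
```

Keeping unexpected tags on the object where they were found is what lets an exporter write them back out in place later.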

Page 28: A Robust Open-source GEDCOM Parser

GEDCOM Export

Visitor pattern

600 LoC
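With the visitor pattern, the model objects only know how to walk themselves, and the GEDCOM writer is one visitor among others. A minimal Java sketch follows; ModelVisitor, PersonRec, NameRec and GedcomWriter are invented for illustration and are far smaller than the project's model.

```java
import java.util.ArrayList;
import java.util.List;

// All names below are hypothetical; this is a sketch, not the project's exporter.
interface ModelVisitor {
    void visit(PersonRec person);
    void visit(NameRec name);
}

final class NameRec {
    final String value;

    NameRec(String value) {
        this.value = value;
    }

    void accept(ModelVisitor visitor) {
        visitor.visit(this);
    }
}

final class PersonRec {
    final String id;
    final List<NameRec> names = new ArrayList<>();

    PersonRec(String id) {
        this.id = id;
    }

    // The model drives the traversal; the visitor decides what to do at each object
    void accept(ModelVisitor visitor) {
        visitor.visit(this);
        for (NameRec name : names) {
            name.accept(visitor);
        }
    }
}

// One visitor that writes GEDCOM lines; another could write JSON or collect statistics
final class GedcomWriter implements ModelVisitor {
    private final StringBuilder out = new StringBuilder();

    @Override
    public void visit(PersonRec person) {
        out.append("0 ").append(person.id).append(" INDI\n");
    }

    @Override
    public void visit(NameRec name) {
        out.append("1 NAME ").append(name.value).append('\n');
    }

    String result() {
        return out.toString();
    }
}
```

Usage: create a PersonRec, add NameRec objects to its names list, call accept with a new GedcomWriter, then read result().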

Page 29: A Robust Open-source GEDCOM Parser

JSON

GEDCOM → POJO → JSON → POJO → GEDCOM

Simple model persistence using Google GSON

Page 30: A Robust Open-source GEDCOM Parser

Further thoughts

Page 31: A Robust Open-source GEDCOM Parser

Do we need a radically-different data-exchange model for genealogy?

Page 32: A Robust Open-source GEDCOM Parser

I don't know

A new proposed object model could use this project to migrate existing GEDCOMs to the de facto model, then translate the de facto model objects to the new model

Page 33: A Robust Open-source GEDCOM Parser

Do we need GEDCOM validation tools?

Page 34: A Robust Open-source GEDCOM Parser

Definitely!

A list of “standard” custom tags would also be pretty helpful

Page 35: A Robust Open-source GEDCOM Parser

We live in the real world

Page 36: A Robust Open-source GEDCOM Parser

Purpose of this project

Page 37: A Robust Open-source GEDCOM Parser

Demonstration of Gedcom Server

Demonstrates GEDCOM → model → JSON → model → GEDCOM

Built with Play 1.2.4 - A Java Web framework

Allows for rapid development of web applications with a fully integrated stack

Deployed to Heroku – Cloud Application Platform

Heroku allows one step deployment with git

Page 38: A Robust Open-source GEDCOM Parser

Demonstration of Gedcom Server

Page 39: A Robust Open-source GEDCOM Parser

Demonstration of Gedcom Server

Page 40: A Robust Open-source GEDCOM Parser

Conclusion

Images appearing on these slides are copyrighted by the contributors to http://commons.wikimedia.org and are used under license

Parsing GEDCOMs is hard

• it's like parsing HTML in the 1990s

But getting it right is pretty important

especially if you want to retain existing information

Open source algorithm is now freely available

http://github.com/DallanQ/Gedcom

simple object model with extensions, 94% round-trip

Hopefully others will benefit from this effort

Page 41: A Robust Open-source GEDCOM Parser