A Robust Open-source GEDCOM Parser, presented by Dallan Quass and Ryan Knight at RootsTech 2012. Parses GEDCOM files into a "de facto" object model; includes round-tripping for the vast majority of GEDCOM files.
A Robust Open-source GEDCOM Parser
Dallan Quass [email protected]
Ryan Knight [email protected]
What's a GEDCOM?
0 HEAD
1 SOUR PAF
2 NAME Personal Ancestral File
2 VERS 5.2.18.0
2 CORP The Church of Jesus Christ of Latter-day Saints
3 ADDR 50 East North Temple Street
4 CONT Salt Lake City, UT 84150
4 CONT USA
1 DEST Other
1 DATE 9 Aug 2006
2 TIME 19:57:47
1 FILE temp-paf.ged
1 GEDC
2 VERS 5.5
2 FORM LINEAGE-LINKED
1 CHAR UTF-8
1 LANG English
1 SUBM @SUB1@
0 @SUB1@ SUBM
1 NAME Dallan Quass
0 @I1@ INDI
1 NAME Dallan /Quass/
2 SURN Quass
2 GIVN Dallan
If this looks unfamiliar to you, you may not get a lot out of this talk
On the other hand, the purpose of this project is to
handle this for you,
so you can develop cool projects in genealogy and let this be unfamiliar to you!
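For readers who have not met the format, each line of the sample file has the same shape: a level number, an optional @XREF@ identifier, a tag, and an optional value. The sketch below splits one line into those four parts. It is written in Python for brevity and is not taken from the project (which is written in Java); the names parse_line and GedcomLine are made up for this illustration, and it ignores the many irregularities that real files contain.

```python
import re
from typing import NamedTuple, Optional

# level, optional @XREF@, tag, optional value (illustrative pattern only)
LINE_RE = re.compile(r"^\s*(\d+)(?:\s+(@[^@]+@))?\s+(\S+)(?:\s(.*))?$")


class GedcomLine(NamedTuple):
    level: int
    xref: Optional[str]
    tag: str
    value: Optional[str]


def parse_line(text: str) -> GedcomLine:
    """Split one GEDCOM line into level, xref, tag and value."""
    match = LINE_RE.match(text.rstrip("\r\n"))
    if match is None:
        raise ValueError(f"not a GEDCOM line: {text!r}")
    level, xref, tag, value = match.groups()
    return GedcomLine(int(level), xref, tag, value)


if __name__ == "__main__":
    for raw in ["0 @I1@ INDI", "1 NAME Dallan /Quass/", "2 SURN Quass"]:
        print(parse_line(raw))
```

The level numbers are what give the file its tree structure: a line at level 2 belongs to the nearest preceding line at level 1.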
Why is parsing GEDCOMs so hard?
Challenge #1 – Character set detection
0 HEAD
1 SOUR PAF
2 NAME Personal Ancestral File
2 VERS 5.2.18.0
2 CORP The Church of Jesus Christ of Latter-day Saints
3 ADDR 50 East North Temple Street
4 CONT Salt Lake City, UT 84150
4 CONT USA
1 DEST Other
1 DATE 9 Aug 2006
2 TIME 19:57:47
1 FILE temp-paf.ged
1 GEDC
2 VERS 5.5
2 FORM LINEAGE-LINKED
1 CHAR UTF-8
1 LANG English
1 SUBM @SUB1@
0 @SUB1@ SUBM
1 NAME Dallan Quass
0 @I1@ INDI
1 NAME Dallan /Quass/
2 SURN Quass
2 GIVN Dallan
Should be easy, except...
GeneWeb ASCII → ANSI
Geni.com ANSEL → UTF8
Geni.com UNICODE → UTF8
GENJ UNICODE → UTF8
All others UNICODE → UTF16
ASCII/MacOS Roman → x-MacRoman
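One way to read the list above is that each row names a generating program and the character set its header declares, and the arrow points to the encoding the file should actually be decoded with. Under that reading (an assumption, since the slide gives no column headings), the overrides amount to a small lookup table. The Python sketch below is illustrative only; effective_charset and OVERRIDES are hypothetical names, not the project's API, and the last row is treated as applying to any generator.

```python
from typing import Optional

# (generator from HEAD.SOUR, declared HEAD.CHAR) -> encoding to decode with.
# None as the generator means "any generator without its own rule".
# The rows mirror the slide; how they are keyed is this sketch's assumption.
OVERRIDES = {
    ("GENEWEB", "ASCII"): "ANSI",
    ("GENI.COM", "ANSEL"): "UTF8",
    ("GENI.COM", "UNICODE"): "UTF8",
    ("GENJ", "UNICODE"): "UTF8",
    (None, "UNICODE"): "UTF16",
    (None, "ASCII/MACOS ROMAN"): "x-MacRoman",
}


def effective_charset(generator: Optional[str], declared: str) -> str:
    """Return the encoding to use, falling back to the declared one."""
    gen = generator.strip().upper() if generator else None
    char = declared.strip().upper()
    # A generator-specific rule wins over a generic rule.
    for key in ((gen, char), (None, char)):
        if key in OVERRIDES:
            return OVERRIDES[key]
    return declared.strip()


if __name__ == "__main__":
    print(effective_charset("Geni.com", "ANSEL"))
    print(effective_charset("PAF", "UTF-8"))
```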
ANSEL
Challenge #2 – Custom tags
The GEDCOM specification hasn't been updated in a LONG time
Challenge #3 – Misused tags
Shout out
Tim Forsythe
VGed - GEDCOM validator
http://ancestorsnow.blogspot.com/2011/07/vged.html
ALIA
1 SEX M
1 ALIA /Ted/
1 BIRT
SOUR
0 @N6@ NOTE
1 CONT adopted surname Termaat
2 SOUR @S9@
DATA
2 SOUR @S2149874917@
3 DATA
4 DATE 11 Sep 1924
3 NOTE ...
3 DATA
4 TEXT ...

2 SOUR @S99@
3 DATA
4 TEXT William Donald ...
4 DATE 1 Sep 1997

2 SOUR @S28@
3 PAGE Indian Prarie...
3 QUAY 3
3 DATE 28 Feb 2005
Challenge #4 – Unused tags
Event Phone
Event Agency
Source Citation Event Type
Challenge #5 – Names
GEDCOM Standard?
The code is more what you'd call
"guidelines" than actual rules.
Two goals
Goal #1 – Parse GEDCOMs into a de facto object model
De Facto:
In fact or in practice; in actual use or existence, regardless of official or legal status. – Wiktionary.org
Model should be straightforward, easy to use and understand
Goal #2 – Round-trip
From GEDCOM
To Object Model
Back to GEDCOM without information loss
Nirvana
There is no Nirvana
But we can get pretty close
94%
How is it done?
???
Object model
People
Extensions
GedML
Originally by Michael Kay
http://users.breathe.com/mhkay/gedml/
Enhanced by Lynn Monson
http://lmonson.com/blog/?page_id=64
Further enhanced by Nathan Powell & Dallan Quass, part of this project
GEDCOM → SAX events
ANSEL reader & writer
Parser
Written in Java
~1500 LoC for parser + ~4000 LoC for POJOs
Handles SAX events emitted by GedML
Separate functions called to handle each tag
Maintains a stack of model objects
Attach unexpected tags to model objects as extensions
Fast
Easily extendible
Tree parser also available
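The points above (one handler per tag, a stack of model objects, unexpected tags kept as extensions) can be sketched in a few lines. This is a Python illustration of the idea, not the project's Java code: the input is a plain list of (level, tag, value) tuples standing in for the SAX events, every class and function name is hypothetical, and the handful of tags given handlers is arbitrary.

```python
from typing import List, Optional


class Node:
    """Base model object; tags with no handler end up in `extensions`."""
    def __init__(self) -> None:
        self.extensions: List["Extension"] = []


class Extension(Node):
    def __init__(self, tag: str, value: Optional[str]) -> None:
        super().__init__()
        self.tag, self.value = tag, value


class Name(Node):
    def __init__(self, value: Optional[str]) -> None:
        super().__init__()
        self.value, self.surname = value, None


class Person(Node):
    def __init__(self) -> None:
        super().__init__()
        self.names: List[Name] = []
        self.sex: Optional[str] = None


class Gedcom(Node):
    def __init__(self) -> None:
        super().__init__()
        self.people: List[Person] = []


def on_indi(root, value):
    person = Person()
    root.people.append(person)
    return person


def on_name(person, value):
    name = Name(value)
    person.names.append(name)
    return name


def on_surn(name, value):
    name.surname = value
    return name


def on_sex(person, value):
    person.sex = value
    return person


# One handler per (type of enclosing object, tag) pair.
HANDLERS = {
    (Gedcom, "INDI"): on_indi,
    (Person, "NAME"): on_name,
    (Name, "SURN"): on_surn,
    (Person, "SEX"): on_sex,
}


def parse(events) -> Gedcom:
    root = Gedcom()
    stack: List[Node] = [root]      # stack[n + 1] is the object open at level n
    for level, tag, value in events:
        del stack[level + 1:]       # close everything at this level or deeper
        parent = stack[-1]
        handler = HANDLERS.get((type(parent), tag))
        if handler is not None:
            obj = handler(parent, value)
        else:                       # unexpected tag: keep it as an extension
            obj = Extension(tag, value)
            parent.extensions.append(obj)
        stack.append(obj)
    return root
```

Because an extension is itself pushed on the stack, anything nested under an unexpected tag is preserved beneath it, which is what makes writing it back out possible later.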
GEDCOM Export
Visitor pattern
600 LoC
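As a rough picture of what a visitor-based export looks like, the Python sketch below has each model object accept a visitor, and the visitor emits GEDCOM lines while tracking the current level. The two-class model and the names GedcomWriter, visit_person and visit_name are invented for the example; the project's actual Java classes are not shown on the slides.

```python
from typing import List, Optional


class Name:
    def __init__(self, value: str, surname: Optional[str] = None) -> None:
        self.value, self.surname = value, surname

    def accept(self, visitor) -> None:
        visitor.visit_name(self)


class Person:
    def __init__(self, xref: str, names: List[Name]) -> None:
        self.xref, self.names = xref, names

    def accept(self, visitor) -> None:
        visitor.visit_person(self)


class GedcomWriter:
    """Visitor that turns model objects back into GEDCOM lines."""

    def __init__(self) -> None:
        self.lines: List[str] = []
        self.level = 0

    def _emit(self, tag: str, value: Optional[str] = None,
              xref: Optional[str] = None) -> None:
        parts = [str(self.level)]
        if xref:
            parts.append(xref)
        parts.append(tag)
        if value:
            parts.append(value)
        self.lines.append(" ".join(parts))

    def visit_person(self, person: Person) -> None:
        self._emit("INDI", xref=person.xref)
        self.level += 1             # children are one level deeper
        for name in person.names:
            name.accept(self)
        self.level -= 1

    def visit_name(self, name: Name) -> None:
        self._emit("NAME", name.value)
        if name.surname:
            self.level += 1
            self._emit("SURN", name.surname)
            self.level -= 1


if __name__ == "__main__":
    writer = GedcomWriter()
    Person("@I1@", [Name("Dallan /Quass/", "Quass")]).accept(writer)
    print("\n".join(writer.lines))
```

Keeping the output logic in the visitor means the model classes stay free of any knowledge of the GEDCOM syntax.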
JSON
GEDCOM → POJO → JSON → POJO → GEDCOM
Simple model persistence using Google GSON
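The project does this step with Google GSON in Java. To show the idea without claiming anything about that code, here is a Python stand-in that uses dataclasses and the standard json module: plain model objects are serialized to JSON and rebuilt, and the rebuilt objects compare equal to the originals. The Person and Name classes and the to_json and from_json helpers are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Name:
    value: str
    surname: Optional[str] = None


@dataclass
class Person:
    xref: str
    names: List[Name] = field(default_factory=list)


def to_json(person: Person) -> str:
    """Serialize a model object (and its nested names) to JSON text."""
    return json.dumps(asdict(person), sort_keys=True)


def from_json(text: str) -> Person:
    """Rebuild the model object from JSON text."""
    data = json.loads(text)
    return Person(xref=data["xref"],
                  names=[Name(**item) for item in data["names"]])


if __name__ == "__main__":
    original = Person("@I1@", [Name("Dallan /Quass/", "Quass")])
    text = to_json(original)
    print(text)
    print(from_json(text) == original)
```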
Further thoughts
Do we need a radically-different data-exchange model for genealogy?
I don't know
A new proposed object model could use this project to migrate existing GEDCOMs to the de facto model,
then translate the de facto model objects to the new model
Do we need GEDCOM validation tools?
Definitely!
A list of “standard” custom tags would also be pretty helpful
We live in the real world
Purpose of this project
Demonstration of Gedcom Server
Demonstrates GEDCOM -> model -> json -> model -> GEDCOM
Built with Play 1.2.4 - A Java Web framework
Allows for rapid development of web applications with a fully integrated stack
Deployed to Heroku – Cloud Application Platform
Heroku allows one-step deployment with git
Conclusion
Images appearing on these slides are copyrighted by the contributors to http://commons.wikimedia.org and are used under license
Parsing GEDCOMs is hard
• it's like parsing HTML in the 1990s
But getting it right is pretty important
especially if you want to retain existing information
Open source algorithm is now freely available
http://github.com/DallanQ/Gedcom
simple object model with extensions, 94% round-trip
Hopefully others will benefit from this effort