Globally Unique Identifiers and Life Science Identifiers

Dave Thau [email protected]

University of Kansas
California Academy of Sciences

www.learningsite.com

Outline

1. Describe Globally Unique Identifiers

2. Show how they’re relevant

3. Describe one GUID system (LSIDs)

4. Outline some issues around using GUIDs for TDWG-related activities

5. Provide some resources

6. Open discussion

GUID Is Not An Ugly Word

It’s guid to be merry and wise,
It’s guid to be honest and true,
    Robert Burns, "Here’s a Health to Them that’s Awa’"

Pteroptochos tarnii AKA Guidguid

Image From: animaldiversity.ummz.umich.edu

GUID: Globally Unique Identifier

• A short name for a complex entity

• Useful for locating information about the entity

• Each name identifies only one entity

• There is some sense of permanence

Some things which fit this description

• GenBank accession numbers: AP006480.1

• US Patent numbers: 5443036 (laser guided cat exercise)

• Digital Object Identifier: 10.121/3212

In Our Domain

SDD Document – Representing some data set.

<ClassName id="1">
  <Label>
    <Representation language="en">
      <Text>Cypselurus heterurus (Rafinesque, 1810)</Text>
    </Representation>
  </Label>
  <Link>
    <LSID>lsid.gbif.net:www.fishbase.org:1029</LSID>
  </Link>
  <Rank>sp</Rank>
</ClassName>

Napier Schema Document – Representing some taxon.

<TaxonConcept id="urn:lsid:bioguid.org:seek:121212" type="original">
  <Name type="scientific">
    <NameSimple>Canis lupus</NameSimple>
  </Name>
  …
  <Relationships>
    <Relationship type="is child of">
      <ToTaxonConcept ref="urn:lsid:bioguid.org:seek:5743" />
    </Relationship>
  </Relationships>
</TaxonConcept>
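As a rough illustration of how a client might pull the identifiers out of a record like the taxon concept example, here is a minimal Python sketch using only the standard library. The XML is a trimmed copy of the slide's fragment (the elided middle section is simply left out), and the function name `concept_links` is hypothetical, not part of any schema or toolkit.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the taxon concept example; the elided part is omitted.
RECORD = """
<TaxonConcept id="urn:lsid:bioguid.org:seek:121212" type="original">
  <Name type="scientific">
    <NameSimple>Canis lupus</NameSimple>
  </Name>
  <Relationships>
    <Relationship type="is child of">
      <ToTaxonConcept ref="urn:lsid:bioguid.org:seek:5743" />
    </Relationship>
  </Relationships>
</TaxonConcept>
"""


def concept_links(xml_text):
    """Return (concept GUID, simple name, [(relationship type, target GUID)])."""
    root = ET.fromstring(xml_text)
    name = root.findtext("Name/NameSimple")
    links = [
        (rel.get("type"), target.get("ref"))
        for rel in root.iter("Relationship")
        for target in rel.iter("ToTaxonConcept")
    ]
    return root.get("id"), name, links


lsid, name, links = concept_links(RECORD)
print(lsid, name, links)
```

The point of the sketch is that the relationship is expressed purely through GUIDs: the parent concept is referenced by its identifier, so it can live in a different document or a different database.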

Features of a GUID system

• Global uniqueness scoped to the Internet

• Should be easily resolvable by a computer or human

• Should identify things down to whatever level of granularity necessary

• Should not be limited to proprietary systems

• Should serve up all sorts of data
  – Database records
  – Text files
  – Images

• It would be nice if the identifier had associated metadata

Life Science Identifiers

• Official standard of the Object Management Group (OMG)

• Support for metadata and authentication

• Supports multiple protocols (e.g. HTTP, SOAP)

• Can serve up data in any format

• Decentralized – anyone can issue an LSID

• LSID code available in Java and Perl

• A young standard, but increasingly used

Organizations Using LSIDs

• National Center for Biotechnology Information (NCBI)
  – PubMed
  – GenBank

• European Bioinformatics Institute (EBI)

• US Long Term Ecological Research Network (LTER)

• BioMOBY – a biological database interoperability program (biomoby.org)

• Open Bioinformatics Foundation (open-bio.org)

• myGrid – a BioGRID project (mygrid.org.uk)

A Small Pause For More Squid Humor

LSID Format

urn:lsid:bioguid.org:seek:117866:v1

• urn – indicates that this is a URN

• lsid – indicates that it’s an LSID-type URN

• bioguid.org – the authority who issued the LSID
  – Doesn’t have to be a domain name – but for now probably should be.
  – bioguid.org does not necessarily have the data or metadata.
  – There may not even be a machine called bioguid.org.

• seek – a name space id internal to that authority
  – The name space is meaningless to systems outside that authority.

• 117866 – the local identifier within that authority
  – Also internal to the authority

• v1 – an optional version number
  – If no version, no trailing colon either.
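The colon-separated layout described on this slide can be sketched as a small parser. This is a minimal Python illustration under the assumptions stated on the slide only (five or six fields, optional trailing version, no trailing colon); the names `LSID` and `parse_lsid` are hypothetical, and the full specification may impose further rules on allowed characters that are not checked here.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str             # who issued the LSID, e.g. bioguid.org
    namespace: str             # name space id, internal to the authority
    object_id: str             # local identifier, internal to the authority
    revision: Optional[str] = None  # optional version


def parse_lsid(text: str) -> LSID:
    """Split an LSID string into its parts, rejecting malformed input."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"expected 5 or 6 colon-separated fields: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID-type URN: {text!r}")
    fields = parts[2:]
    # An empty field also catches a trailing colon with no version after it.
    if any(not f for f in fields):
        raise ValueError(f"empty field in LSID: {text!r}")
    return LSID(*fields)


print(parse_lsid("urn:lsid:bioguid.org:seek:117866:v1"))
print(parse_lsid("urn:lsid:bioguid.org:seek:117866"))
```

Note that the parser treats the name space and local identifier as opaque strings, which matches the slide's point that they are meaningful only inside the issuing authority.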

Data and Metadata

• An LSID has data
  – Examples
    • The gene sequence in GenBank
    • The actual LTER data set, maybe in Excel, or in a text file
  – The data should never change

• An LSID also has metadata
  – Example metadata
    • The format of the data
    • A display title for clients displaying the LSID
    • Dublin Core metadata
    • Anything you want
  – The metadata can change
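One way to picture the split between fixed data and changeable metadata is the following Python sketch. It is only an illustration: the `LsidRecord` class, the checksum approach, and the sample bytes are all assumptions made for the example, and nothing here is prescribed by the LSID specification.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class LsidRecord:
    lsid: str
    data: bytes                                   # should never change
    metadata: dict = field(default_factory=dict)  # free to change
    digest: str = field(init=False)               # fingerprint of the data

    def __post_init__(self):
        # Remember what the data looked like when the LSID was issued.
        self.digest = hashlib.sha256(self.data).hexdigest()

    def data_unchanged(self) -> bool:
        """True while the data still matches its original fingerprint."""
        return hashlib.sha256(self.data).hexdigest() == self.digest


record = LsidRecord(
    lsid="urn:lsid:limnology.wisc.edu:dataset:ntlfi02",
    data=b"placeholder bytes standing in for the data set",
    metadata={"title": "Fish abundance", "format": "text/plain"},
)

record.metadata["title"] = "Fish abundance data set"  # allowed
print(record.data_unchanged())
```

Editing the metadata leaves the check passing, while any change to `data` would make `data_unchanged()` return `False`, which is the situation that should instead get a new LSID or a new version.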

Example LSIDs

• An LTER fish abundance data set
  – urn:lsid:limnology.wisc.edu:dataset:ntlfi02

• A PubMed reference:
  – urn:lsid:ncbi.nlm.nih.gov.lsid.biopathways.org:pubmed:12441808

• A GenBank sequence:
  – urn:lsid:ncbi.nlm.nih.gov.lsid.biopathways.org:genbank_gi:30350027

How LSIDs work

LSID Client (maybe Launchpad, maybe Haystack, maybe BioFerret, maybe myGRID, maybe yours!)

1. Find the authority for this LSID
   – The client finds the DNS record and resolves it to get the address of the authority.
   – Returns the LSID Authority Server.

2. Query the authority for available services
   – Returns WSDL for this LSID.

3. Choose a service, get the goods
   – HTTP, SOAP, FTP, others
   – The data comes from the Data Store, the metadata from the Metadata Store.
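The three resolution steps can be sketched in Python with in-memory stand-ins for the network pieces. Everything here is an assumption made for illustration: the host `authority.example.org`, the service lists, and all function names are invented, and the `_lsid._tcp` DNS service label reflects the LSID resolution convention as commonly described, so check the specification before relying on it. No real DNS or web service call is made.

```python
# Stand-ins for DNS and for the authority's service description.
# All values are invented for this example.
FAKE_DNS = {"_lsid._tcp.bioguid.org": ("authority.example.org", 80)}
FAKE_SERVICES = {
    "urn:lsid:bioguid.org:seek:117866": {
        "data": ["HTTP", "SOAP"],
        "metadata": ["HTTP"],
    }
}


def srv_query_name(lsid):
    """Build the DNS service name to look up for the LSID's authority."""
    authority = lsid.split(":")[2]
    return "_lsid._tcp." + authority


def find_authority(lsid, dns=FAKE_DNS):
    """Step 1: find the address of the authority server."""
    return dns[srv_query_name(lsid)]


def available_services(lsid, services=FAKE_SERVICES):
    """Step 2: ask the authority which services it offers for this LSID."""
    return services[lsid]


def choose_protocol(offered, preferred=("HTTP", "SOAP", "FTP")):
    """Step 3: pick the first protocol we support that is on offer."""
    for protocol in preferred:
        if protocol in offered:
            return protocol
    raise LookupError("no supported protocol offered")


lsid = "urn:lsid:bioguid.org:seek:117866"
host, port = find_authority(lsid)
offered = available_services(lsid)
print(host, port, choose_protocol(offered["data"]))
```

The indirection is the important part: the authority named in the LSID is only looked up, never assumed to be the machine that holds the data, which is why the data can move without the identifier changing.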

LSID Promises

• I promise to never change the data behind an LSID.

• I will make sure my LSIDs are being served, or give them to someone who can do it.

• I will give my LSIDs metadata – at least give them a title and a format

Other GUID systems

• URLs
  – Files move
  – The data change
  – Unstructured metadata

• UUIDs
  – 128-bit value, unique for all practical purposes
  – 58f202ac-22cf-11d1-b12d-002035b29092
  – No resolution
  – No metadata

• Handle System / DOIs (10.12/2312)
  – Non-standard protocol
  – Centralized resolution
  – Unstructured metadata (for Handle System)
  – High costs (for DOI)
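To make the UUID bullet concrete, the example value from the slide can be inspected with Python's standard `uuid` module. This is only a small sketch: it shows that a UUID is a bare 128-bit number that can be generated anywhere, with nothing in it that says where to find the data or metadata.

```python
import uuid

# The example UUID from the slide.
example = uuid.UUID("58f202ac-22cf-11d1-b12d-002035b29092")

bits = len(example.bytes) * 8   # size of the identifier in bits
print(bits, example.version)

# Anyone can mint a new one locally, with no authority involved.
fresh = uuid.uuid4()
print(fresh)
```

Compare this with an LSID, where the authority field gives a client a starting point for resolution.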

Issues For This Community

• What gets a GUID?

• For each of those things, what’s the data, what’s the metadata?

• One GUID per item?

• Centralization – who issues GUIDs?

What Gets a GUID?

• These things probably should get GUIDs
  – Taxonomic concepts
  – Specimens
  – Publications
  – People

• These things might get GUIDs
  – Taxonomic names
  – Journals
  – Data providers
  – Observations

Specimen Data? Metadata?

• If specimens get a GUID – what does it identify?
  – The physical specimen?
  – A collection’s database record of the specimen?
  – What about multiple labels?
  – Main question – what doesn’t change about a specimen?
  – Other main question – how should the data be represented?
    • Darwin Core includes the current institution location. That is not a good idea for the data of a GUID, since it may change.

One GUID Per Item?

• No GUID system inherently enforces a 1:1 mapping between GUID and data.

• Everyone should TRY to limit the number of GUIDs per item.

• Should there be any centralization to help achieve this?

Degrees of Centralization

• An index
  – List your GUID authority in an index so your GUIDs are easy to find.

• A central authority
  – One authority could be responsible for issuing GUIDs to the community for specific types of information – you’d have to get one from here.
    • GBIF?
    • The IC_Ns? (ICZN, ICBN….)
    • lsidauthority.org?
  – This would help enforce a 1:1 mapping of GUIDs and data items.
  – It would also relieve data providers of the need to maintain their own authorities.
  – It MAY also reduce the likelihood of GUIDs becoming unresolvable.
  – It may also be infeasible technically, or socially.

• A respected authority
  – With LSIDs, an authority can be set up to serve its own GUIDs and proxy other authorities.
  – This would help enforce a 1:1 mapping for those who use the authority.
  – It may also be more feasible.

LSID Resources

• LSID articles and code from IBM
  – http://www-124.ibm.com/developerworks/oss/lsid/#whatislsid

• Current LSID specification
  – http://www.omg.org/cgi-bin/doc?dtc/04-05-01

• Launchpad – an LSID resolver for Windows IE
  – available from the first link

• A website which resolves LSIDs
  – http://lsid.biopathways.org/resolver/

• URN specification
  – http://www.ietf.org/rfc/rfc2141.txt

Acknowledgements

• My work on GUIDs has been funded by the SEEK project – seek.ecoinformatics.org.

• SEEK is funded by National Science Foundation award 0225676.

• Thanks to Ben Szekely at IBM for his LSID articles, his LSID Java code, and for answering all my questions.

Questions for Discussion

• Do we need GUIDs?

• What gets a GUID?

• One GUID per item?

• Centralization?