LTER CONTROLLED VOCABULARY WORKSHOP MAY 26-27, 2011

OBJECTIVES

“Scientists seeking data should be able to efficiently and reliably locate LTER datasets through searching, browsing …”
Get feedback on general direction of working group activities
Resolve some specific issues
Decide on “Next Steps”
Products
  Comments to be acted on
  White paper concerning specific issues and “next steps”

Time      Activity
9:00 AM   Introductions, Review of Agenda
9:15 AM   Introduction to the LTER Controlled Vocabulary – Past and Future
10:00 AM  Break
10:15 AM  Discussion: Locating LTER Data – around-the-room experiences
            What are your experiences with finding LTER data?
            What would be most helpful in finding data in the future?
            Review of “use cases”
11:15 AM  Tour of draft LTER Controlled Vocabulary
12-noon   Lunch
1:30 PM   Feedback to entire group on things in the controlled vocabulary that need improvement
            Things to be removed
            Things to be added
            Things to be reorganized
2:30 PM   Break
2:45 PM   Discussion of specific issues
            Core areas
            Are related-terms needed, or is a hierarchy sufficient?
            Management of the vocabulary – role of researchers
3:00 PM   Next Steps
            How do we engage larger LTER community?
            How much, and what sort of engagement is needed?
4:00 PM   Adjourn

THE CHALLENGE

Eclectic use of terms for discovering LTER data makes it difficult to perform reliable or efficient searches
  Often several terms for one concept
    One site uses CO2, another Carbon Dioxide, another Carbon-dioxide
    Carbon to Nitrogen Ratio, C:N, C:N Ratio, Carbon-to-nitrogen Ratio
  No way to relate broader terms with narrower terms
    Searching on “Landscape Change” doesn’t find data sets related to “desertification” even though desertification is a kind of landscape change
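
The two problems above (several spellings for one concept, and no link from broader to narrower terms) can be illustrated with a minimal Python sketch. The dictionaries and the function names `preferred` and `expand` are hypothetical and only hold the examples from this slide; they are not the working group's vocabulary or software.

```python
# Hypothetical synonym ring: every variant points at one preferred term.
SYNONYMS = {
    "co2": "carbon dioxide",
    "carbon-dioxide": "carbon dioxide",
    "c:n": "carbon to nitrogen ratio",
    "c:n ratio": "carbon to nitrogen ratio",
    "carbon-to-nitrogen ratio": "carbon to nitrogen ratio",
}

# Hypothetical hierarchy: broader term -> narrower terms.
NARROWER = {
    "landscape change": ["desertification"],
}


def preferred(term):
    """Normalize case and spacing, then map a variant to its preferred term."""
    key = " ".join(term.lower().split())
    return SYNONYMS.get(key, key)


def expand(term):
    """Return the preferred term followed by all of its narrower terms."""
    result = [preferred(term)]
    for current in result:  # the list grows while it is being walked
        for child in NARROWER.get(current, []):
            if child not in result:
                result.append(child)
    return result


print(preferred("CO2"))
print(expand("Landscape Change"))
```

With a structure like this, a search on "Landscape Change" could also be run against "desertification", and all three spellings of carbon dioxide would resolve to the same term.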

2006 ANALYSIS OF LTER KEYWORDS

Source               Number of Terms  Number used at 5 or more sites  Most Frequently used
EML Keywords*        2,711            86                              LTER (1002), Temperature (701)
EML Titles           2,480            921                             And (768), Data (394), LTER (350)
DTOC Keywords*       2,774            103                             ARC (1645), Temperature (732)
Bibliography Titles  13,538           1,855                           Of (12,611), Forest (2,050)
* Allows multi-word terms

Only 3.2%!
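
The 3.2% callout appears to correspond to the EML Keywords row (86 of 2,711 terms used at 5 or more sites); that reading is an inference from the numbers, not something the slide states. A short Python sketch of the arithmetic, using the counts from the 2006 analysis and a hypothetical helper name `shared_percent`:

```python
# (source, number of terms, number used at 5 or more sites) from the 2006 analysis
rows = [
    ("EML Keywords", 2711, 86),
    ("EML Titles", 2480, 921),
    ("DTOC Keywords", 2774, 103),
    ("Bibliography Titles", 13538, 1855),
]


def shared_percent(total, shared):
    """Percentage of terms that are used at 5 or more sites."""
    return round(100 * shared / total, 1)


for source, total, shared in rows:
    print(f"{source}: {shared_percent(total, shared)}%")
```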

PAST

We started off by surveying what terms were already being used in a variety of LTER documents

Our goal was to see if there were any existing lexical resources that we could simply adopt

TEST OF LIST VS NBII THESAURUS - 2008

58% of LTER terms were not found in the NBII Thesaurus

Results suggested that we needed to develop our own resource

GOALS FOR DEVELOPMENT OF KEYWORD LIST

Identify a list of preferred terms that would be used by sites in creating metadata documents
Focus on LTER-wide searches
  Want to facilitate cross-site synthesis
  People searching LTER Metacat rather than individual sites are interested in relevant data from multiple sites
Want to hit the “sweet spot” for the number of terms
  Too many terms make keywording documents difficult, and result in searches with too few datasets
  Too few terms make it hard to locate usably small numbers of datasets

STEPS TAKEN

Assembled list of words already in LTER Metadata (EML documents)
Selected using criteria:
  Keywords shared with GCMD and NBII, or
  Keywords used at more than one LTER site
Reviewed by Information Managers
  Removals and additions were suggested
  Edited based on voting
Created a Draft set of Taxonomies
  Included some additions and deletions
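
The selection criteria in the steps above can be sketched in Python as follows. This is an illustration only: the function name `select_keywords`, the input shapes, and the sample data are hypothetical, and "shared with GCMD and NBII" is read here as "present in both lists", which is an assumption.

```python
def select_keywords(site_keywords, gcmd_terms, nbii_terms):
    """Keep keywords found in both external lists, or used at more than one site.

    site_keywords: dict mapping a site code to the set of keywords used there.
    """
    sites_per_keyword = {}
    for site, keywords in site_keywords.items():
        for keyword in keywords:
            sites_per_keyword.setdefault(keyword.lower(), set()).add(site)

    # Assumption: "shared with GCMD and NBII" means present in both lists.
    external = {t.lower() for t in gcmd_terms} & {t.lower() for t in nbii_terms}

    return sorted(
        keyword
        for keyword, sites in sites_per_keyword.items()
        if keyword in external or len(sites) > 1
    )


# Illustrative data only, not real site metadata.
sample = {
    "SITE_A": {"Temperature", "Lichens"},
    "SITE_B": {"temperature", "Marsh"},
    "SITE_C": {"Tundra"},
}
print(select_keywords(sample, gcmd_terms={"Tundra", "Marsh"}, nbii_terms={"tundra"}))
```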

STRUCTURING THE CONTROLLED VOCABULARY

Goal: Improve Searching & Browsing
  Reliability (of all the suitable target documents, what percentage did you find)
  Efficiency (of the documents your search returned, what percentage were suitable)
A list alone is not sufficient to support browsing and sophisticated searching of data – more structure is needed
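
The two measures defined on this slide match what information retrieval calls recall (reliability) and precision (efficiency). A minimal Python sketch, with hypothetical function names and made-up document identifiers:

```python
def reliability(returned, suitable):
    """Of all suitable documents, the percentage the search found (recall)."""
    returned, suitable = set(returned), set(suitable)
    if not suitable:
        return 0.0  # nothing to find; treated as 0 here by convention
    return 100 * len(returned & suitable) / len(suitable)


def efficiency(returned, suitable):
    """Of the documents returned, the percentage that were suitable (precision)."""
    returned, suitable = set(returned), set(suitable)
    if not returned:
        return 0.0  # nothing returned; treated as 0 here by convention
    return 100 * len(returned & suitable) / len(returned)


# Made-up example: the search returns four datasets, three are actually suitable.
returned = {"d1", "d2", "d3", "d4"}
suitable = {"d1", "d2", "d5"}
print(round(reliability(returned, suitable), 1))
print(efficiency(returned, suitable))
```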

STRUCTURES

[Figure: structures arranged by increasing complexity: List, Synonym Ring, Taxonomy, Thesaurus, Ontology]

Multiple taxonomies are a Polytaxonomy

TAXONOMIES – RULES OF THE ROAD

Relationships should be independent of context
  Must pass “Some-not-all test”
Each taxonomy should include only one type of entity (listed in Z39.19 section 6.3.2)
  Things and their physical parts (birds, trees, leaves)
  Materials (wood, nitrogen, sand)
  Activities or processes (acidification, production)
  Events or occurrences (germination, death)
  Properties or states of persons, things, materials or actions (age, speed, nitrogen content)
  Disciplines or subject fields (ecology, ornithology)
  Units of measurement (m, km, miles)
  Unique entities (LTER, HJ Andrews Forest)
You can get into trouble if you start “mixing and matching” things within a single taxonomy!

EXAMPLES

Good:
  Forests
    Boreal Forest
    Hardwood Forest
  Grassland
    Tallgrass Prairie
  Tundra
  OK – these are all the same type of entity – all are THINGS

Bad:
  Forests
    Fire
    Ecology
  Mixing THINGS and PROCESSES and DISCIPLINES

Good:
  Rodents
    Mice
    Rats
  OK – Is not dependent on context. Mice and rats are ALWAYS rodents

Bad:
  Desert Plants
    Cacti
    Grasses
  Problem: Context dependent, not all cacti or grasses are desert plants. Some occur in other systems. Fails “Some-not-all” test.
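
The "one type of entity per taxonomy" rule lends itself to a simple automated check. The Python sketch below is illustrative only: the `ENTITY_TYPE` labels are hypothetical hand-assigned tags for the terms in the examples above, and the "Some-not-all" test is not covered because it needs human judgment.

```python
# Hypothetical hand-assigned entity types for the example terms.
ENTITY_TYPE = {
    "forests": "thing",
    "boreal forest": "thing",
    "hardwood forest": "thing",
    "fire": "process",
    "ecology": "discipline",
}


def entity_types(terms, type_of=ENTITY_TYPE):
    """Return the set of entity types found among the given terms."""
    return {type_of[term.lower()] for term in terms}


def is_single_type(terms, type_of=ENTITY_TYPE):
    """True when every term in the taxonomy has the same entity type."""
    return len(entity_types(terms, type_of)) == 1


print(is_single_type(["Forests", "Boreal Forest", "Hardwood Forest"]))
print(is_single_type(["Forests", "Fire", "Ecology"]))
```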

ACTIVITIES

The VOCAB Working Group has created a draft set of 10 taxonomies containing 713 terms
  Includes additional “broader” terms needed for grouping
  Includes synonyms (non-preferred terms)
Some terms originally in the list have been removed because they were perceived to be too ambiguous or context-sensitive to be useful for the purposes of searching or browsing
  E.g., “Aboveground”
Some “related” terms have also been identified

APPROVALS

In 2010 a request for information was forwarded to the LTER Executive Board:

“The Information Management Committee has studied how keywords are used at LTER sites, how LTER keywords relate to external lexicographical resources, and compiled a draft keyword list. We request guidance from the LTER Executive Board on how a controlled vocabulary might be implemented within the context of LTER to improve the reliability of data searches.”

The EB generally endorsed the idea of an LTER Controlled Vocabulary, and agreed to help have scientists participate in vetting the list and deciding on next steps (THIS WORKSHOP)

HOW LIST AND POLYTAXONOMY WILL BE USED

Permit use of a browse interface
Make searches more sophisticated
  See “Use case” for searching
  Search includes synonyms plus narrower terms and/or related terms
Develop tools to help in adding keywords to LTER metadata documents
  Prototype versions of a couple are already available
  See Keywording “Use Case”
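
One way a keywording aid could work is to scan a dataset title or abstract for preferred terms and their synonyms, then suggest the preferred forms. The Python sketch below is a guess at that idea, not a description of the prototype tools mentioned above; `VOCAB` and `suggest_keywords` are hypothetical.

```python
import re

# Hypothetical vocabulary: preferred term -> non-preferred synonyms.
VOCAB = {
    "carbon dioxide": ["co2", "carbon-dioxide"],
    "temperature": [],
    "desertification": [],
}


def suggest_keywords(text, vocab=VOCAB):
    """Return preferred terms whose name or synonyms occur in the text."""
    lowered = text.lower()
    suggestions = []
    for preferred, synonyms in vocab.items():
        for variant in [preferred, *synonyms]:
            # Match the variant only as a whole word or phrase.
            pattern = r"(?<!\w)" + re.escape(variant) + r"(?!\w)"
            if re.search(pattern, lowered):
                suggestions.append(preferred)
                break
    return sorted(suggestions)


print(suggest_keywords("Soil CO2 flux and air temperature at the site"))
```

Suggesting only preferred forms means a title that says "CO2" still gets keyworded as "carbon dioxide", which keeps new metadata consistent with the list.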

TASK 1: LOCATING DATA

What are your experiences with finding LTER data?

What would be most helpful in finding data in the future?

Review of “Use Cases”

TASK 2: REVIEWING THE LIST & TAXONOMY

Evaluate the utility of the draft polytaxonomy
  Is it better than the existing LTER Metacat interfaces?
Are there large changes that need to be made?
  Elimination of specific taxonomies?
  Creation of new taxonomies?
  Addition of related terms to make a thesaurus?
Are there small changes needed?
  Removal or replacement of terms

CHANGES: IMPLICATIONS FOR SITES

Improvement of existing documents
  Review existing keywords and change to preferred forms
  Note: even without doing this the synonym ring will help improve searching and browsing
Use preferred terms for new documents
  Ideally at least one term from each of the relevant taxonomies
Note: addition of new terms to the list should require review of all existing documents to see if the new terms should be added to them, so term additions should be rare
Changes in taxonomies and term relationships do not require re-keywording of existing documents

GROUP FEEDBACK – TODD & MARGARET

Focus on INTERFACE
  Ways to present the data
    Allow “query within result set”
    Intersect query sets
    Group options – by site, by time, side by side comparisons
  Be able to find where different types of data intersect
    Can be very difficult due to missing data etc.
    Problem extends beyond query interface
Interface needs to be a higher priority – sooner rather than later
  Recommendation to IMC/NISAC/EB

GROUP FEEDBACK – RODGER & KRISTIN

Highest level of hierarchy
  Found some things to change or add: “root production”, “belowground productivity”
  Were generally happy with overall organization
  Need system for adding new keywords – this is just a start
Intrigued by theory and where we go from here
  How does it matter what is in one place or another?
  Want to make sure things are well-organized
  Data vs research question
  Does not matter where it is when adding to keyword list
Need to have “best practices” for adding keywords
  How will that affect sites?
  How many data sets have no preferred terms?

BEST PRACTICE

At least one word from list
At least one from at least 5 of the 10 taxonomies
Signature datasets should be flagged with “signature dataset” tag
Should include Core area(s)
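
The first two best-practice rules can be checked mechanically for each dataset. The Python sketch below assumes a hypothetical data layout (a dict mapping each taxonomy name to its set of lowercase preferred terms); the taxonomy names and terms in the example are invented, and the signature-dataset and core-area rules are left out.

```python
def check_best_practice(keywords, taxonomies, min_taxonomies=5):
    """Check a dataset's keywords against the first two best-practice rules.

    taxonomies: dict mapping a taxonomy name to its set of lowercase preferred terms.
    """
    normalized = {keyword.lower() for keyword in keywords}
    covered = sorted(name for name, terms in taxonomies.items() if normalized & terms)
    return {
        "has_preferred_term": bool(covered),
        "taxonomies_covered": covered,
        "meets_coverage": len(covered) >= min_taxonomies,
    }


# Invented three-taxonomy example; the draft vocabulary has 10 taxonomies.
example_taxonomies = {
    "ecosystems": {"forests", "tundra"},
    "substances": {"carbon dioxide", "nitrogen"},
    "methods": {"dendrometers"},
}
print(check_best_practice(["Forests", "Nitrogen", "LTER"], example_taxonomies))
```

A report like this could also answer the question raised in group feedback about how many datasets currently have no preferred terms.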

CORE AREAS

Core area - Problems with definitions
  Some datasets are either none, or all core areas
    Weather data
Change entities to core areas?
  People will want to look for this
  Would not have hierarchy?
    That would be OK – can have related terms
  Could link to signature datasets
Need “signature dataset” keyword – used to weight
  Or prioritize signature datasets for adding preferred terms
Treat as unique: Primary Production (core area)
Data can be applied to MANY core areas - won’t map, e.g. Climate
Try adding core area taxonomy and then add core areas and related terms?
May not be needed or appropriate – we are asking the data catalog to do too much – need catalog of research topics

“SIGNATURE DATA” CONSENSUS

Want to search for signature datasets at top level of the hierarchy
  Needs to be one click away

GROUP REPORTS

Julia and Don
Would be interesting to tally the number of hits for each keyword for each site
  Tally of number of datasets for each site
GIS should be preferred term
  Can mean Geographical Information Science
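
The tallies suggested in this report (datasets per site, and hits per keyword per site) could be computed along these lines. This is a sketch with made-up input; the function name `tally` and the `(site, keywords)` record shape are assumptions.

```python
from collections import Counter


def tally(datasets):
    """Count datasets per site and keyword hits per (keyword, site) pair.

    datasets: iterable of (site, keywords) pairs, one per dataset.
    """
    datasets_per_site = Counter()
    hits_per_keyword_site = Counter()
    for site, keywords in datasets:
        datasets_per_site[site] += 1
        # Count each keyword once per dataset, ignoring case.
        for keyword in {k.lower() for k in keywords}:
            hits_per_keyword_site[(keyword, site)] += 1
    return datasets_per_site, hits_per_keyword_site


# Made-up records for illustration.
records = [
    ("SITE_A", ["Temperature", "GIS"]),
    ("SITE_A", ["temperature"]),
    ("SITE_B", ["GIS"]),
]
per_site, per_keyword_site = tally(records)
print(per_site["SITE_A"], per_keyword_site[("temperature", "SITE_A")])
```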

GROUP REPORTS – JULIA & DON STRUCTURAL CHANGES

Atmospheric processes cross listed under hydrologic properties
Evapotranspiration should be above transpiration and evaporation
Snow not under precipitation
Geographical Properties -> Spatial Properties
  Move imagery under that with satellite and photos under that – deprecate landsat
Methods – field, spatial, lab, analytical subcategories
  Also cores, dendrometers etc. tools could go under this
Entities
  For detailed ones, tried to find other homes
  Diseases to disease and move under bio processes
Levels of organization for communities, populations, species
  Are these useful terms? How often used
Biomes instead of Ecosystems

TASK 3: SPECIFIC ISSUES

Core areas
  Do we need a special taxonomy for core areas?
Are related-terms needed, or is a polytaxonomy (hierarchy) sufficient?
Management of the vocabulary – role of researchers?
Preferred terms – are all really preferred?
  E.g., Permanent forest plots

TASK 4: NEXT STEPS

How do we engage larger LTER community?
How much, and what sort of engagement is needed?
Requests we should make to the EB or IMC?
Managing the controlled vocabulary
What technology development is needed, and who should pursue it?

PROPOSED MANAGEMENT PLAN

Anyone can propose adding, editing, deleting or moving terms within the hierarchy, with justification.

Proposals would be evaluated by the Controlled Vocabulary Working Group according to the following criteria:
  The proposed terms should provide clear utility for searching and browsing, and not introduce ambiguity
  The proposed terms should be suitable for inclusion (e.g., not locations or specific taxonomic identifiers)
  Proposed terms should not be redundant with existing term(s) already in the vocabulary
  Terms and their proposed places in taxonomies or thesauri should conform in form with NISO Z39.19 2005 and successor documents (e.g., sections 6.5.1, 8.3)

RECOMMENDATIONS

Best Practices for adding keywords
  Preferred terms (and preferred preferred terms)
Presentation to PIs
  Statistics on numbers of hits
Add workshop participants to VOCAB
Put in supplement proposal for development of search interface
  Write it up now – Shovel Ready!
  Like MALS – need to have all sites sign up with letters of endorsement