A Field Linguist’s Guide to Making Long Lasting Texts and Databases LSA Organized Session January 4, 2007 Anaheim, California


A Field Linguist’s Guide to Making Long Lasting Texts and Databases

LSA Organized Session

January 4, 2007

Anaheim, California


Organized by:

Jeff Good and Heidi Johnson

Open Language Archives Community (OLAC)

Outreach Committee

Moderator:

Laura Welcher

Speakers:

Debbie Anderson, Michael Appleby, Jessica Boynton, Naomi Fox, Connie Dickinson


Presentations from this session will be posted at

http://www.language-archives.org/news.html#olac07


Best Practice in Your Back Pocket: Getting the Most Out of the Tools You Have

Laura Welcher

The Rosetta Project / Long Now Foundation


A great way to freak out a linguist

“To be in compliance with best practice recommendations (ahem), your interlinear glossed text needs to be in XML format with morphosyntactic tags that reference the GOLD ontology.”


Reality Check

• There’s a difference between ideal best practice resources (still somewhat of a moving target) and a good, sufficient approximation.
• Some common practices are far from ideal or sufficient (like saving the dictionary you worked on for 5 years as a Microsoft Word document).
• We can easily modify these practices to produce archivable resources that will last.
• And this can be done using tools that you already have, and knowledge that is easy to acquire.
• Hence the title: Best Practice in Your Back Pocket: Getting the Most Out of the Tools You Have.


Best Practice

• E-MELD project (Electronic Metastructure for Endangered Languages Data)
• Goals:
  – Help preserve endangered languages data
  – Develop infrastructure for electronic archives
• Defining best practice:
  – E-MELD summer workshops http://www.emeld.org
• Promoting best practice:
  – “School of Best Practice” at http://www.emeld.org/school/index.html


Good, Better, Best Practice

• The information presented here comes from presentations of the E-MELD team, particularly the following:

• Simons and Dry (2006), “Good, Better, and Best Practice: The Experience of the E-MELD Project”, http://www.linguistlist.org/emeld/documents/Bielefeld-Dry-Simons.pdf


The first consideration: working, presentation and archival formats

• The process of creating digital language resources usually involves creating files in different formats:
  – Working format
  – Presentation format
  – Archival format


Working Format

• The saved format of whatever program you are working in:
  – .doc (MS Word)
  – .xls (Excel)
  – .fp7 (FileMaker Pro)
• This format is what you use for your own convenience and productivity
  – Typically this format is proprietary
  – Less typically, people may work in programs whose native format is not proprietary, automatically saving in .txt (plain text), .xml or .html (types of formatted plain text)
• A proprietary working file format is not the only format you should have!


Archival Format

• A very important format -- this format helps ensure that your resource will last and be usable well into the future
• An archival format has LOTS of good qualities (Simons, 2004):
  – Lossless
  – Open Standard
  – Transparent
  – Supported by multiple vendors


Archival Format: Lossless

• Avoid compressed formats that lose content
• A good rule-of-thumb is to use uncompressed formats:
  – Text: .txt, .html, .xml
  – Images: .tiff, .bmp
  – Audio: .wav (Windows), .aiff (Apple), .au (Sun, Java, Unix), but make sure it is PCM (uncompressed)
  – Video: .avi (some codecs), .rtv
• Most compressed formats lose content, but some are lossless (.zip for text, black and white .gif for images, Apple Lossless Encoding (ALE) for audio, the JPEG 2000 video codec in its lossless mode) -- use with caution!
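The advice to make sure a .wav file is PCM can be checked mechanically. The following is a minimal Python sketch, not part of the original talk: it reads the format tag from the file's 'fmt ' chunk, where the value 1 conventionally marks uncompressed PCM. The function names and the demo file are assumptions made for illustration, and files that use the "extensible" format tag would need a more careful check.

```python
import struct
import tempfile
import wave
from pathlib import Path

WAVE_FORMAT_PCM = 1  # format tag conventionally used for uncompressed linear PCM


def wav_format_tag(path):
    """Return the format tag stored in the 'fmt ' chunk of a RIFF/WAVE file."""
    with open(path, "rb") as f:
        riff, _size, wave_id = struct.unpack("<4sI4s", f.read(12))
        if riff != b"RIFF" or wave_id != b"WAVE":
            raise ValueError("not a RIFF/WAVE file")
        while True:
            header = f.read(8)
            if len(header) < 8:
                raise ValueError("no 'fmt ' chunk found")
            chunk_id, chunk_size = struct.unpack("<4sI", header)
            if chunk_id == b"fmt ":
                (tag,) = struct.unpack("<H", f.read(2))
                return tag
            # Skip this chunk; chunks are padded to an even number of bytes.
            f.seek(chunk_size + (chunk_size & 1), 1)


def is_pcm(path):
    """True if the file declares plain PCM audio (hypothetical helper)."""
    return wav_format_tag(path) == WAVE_FORMAT_PCM


# Demo: write one second of 16-bit mono silence, then check it.
demo_path = Path(tempfile.mkdtemp()) / "demo.wav"
with wave.open(str(demo_path), "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(44100)
    w.writeframes(b"\x00\x00" * 44100)

print(demo_path.name, "is PCM:", is_pcm(demo_path))
```

A check like this only inspects what the header declares; it does not tell you whether the audio was converted from a lossy source earlier in its life.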


Archival Format: Open

• Avoid proprietary formats like .doc, .xls, .fp7
• The company that produces the software may stop supporting the format, rendering your file unreadable
• For your archival format, choose a file format that is “open standard” like .xml, .html, .pdf or .rtf
• “Open standard” means that the specification of the format is publicly available, and anyone can implement it.


Archival Format: Transparent

• Use a file format that is easy to interpret
• Example: text files (.txt)
  – Have common characters like letters, numbers, punctuation
  – Virtually no formatting (tabs, returns)
  – Because of the simplicity of this file type, many programs can read it and make use of the data
• Other transparent formats: .wav, .aiff can be read by any audio program
• Not transparent: .zip, .mp3 (require a special algorithm for interpretation)


Archival Format: Supported

• Prefer formats that are widely supported

• If more vendors support it, it is less likely to become obsolete

• This is another reason to prefer an open standard format to a proprietary one


Presentation Format

• Presentation formats are those you choose for the convenience and ease of accessibility and display

• It is fine for presentation formats to be compressed, so long as you make a lossless archival copy as well

• Examples of presentation formats include .pdf files, .mp3 files, .jpg images, MPEG-2 video


So far, so good?

• As a responsible linguist creating digital language documentation that will last well into the future you…
  – Know the difference between a working, presentation, and archival file format
  – Know what makes a good archival format (LOTS)
  – Maintain an archival format of your data
• Anything beyond this? Yes, a bit more…


Best Practice Digital Resources are…

• Preservable in formats that are not vulnerable to decay or obsolescence (see LOTS)

• Intelligible so that content is easily understood by future scholars

• Accessible so that resources are easily discovered and accessed

• They are also interoperable, but this is mostly a concern of archives and services

(Simons and Dry, 2006)


Create Preservable Resources

• Linguists are responsible for making preservable resources

• That is, creating archival formats that follow the principles of LOTS


Create Intelligible Resources

• In order to create resources that are intelligible to others, you must document your practices!

• Documentation includes:
  – Your markup practices
  – The encoding you use
  – Metadata about your resources
• This information should be kept in a file or files in an archival format, and archived along with your resources.


Presentational Markup

• Many people use presentational markup, particularly in working formats like Microsoft Word.

• Presentational markup means that aspects of the presentation (like bold, italics, indenting) are themselves meaningful

• For example…


Example of Presentational Markup

AS_5.2.1978_audio: Alice Spear, Potawatomi, “Crane Boy”, May 2, 1978, Mayetta, Kansas.

<bold>AS_5.2.1978_audio</bold>
<plain.text>Alice Spear</plain.text>
<italics>“Crane Boy”</italics>
<plain.text>May 2, 1978</plain.text>
<plain.text>Mayetta, Kansas</plain.text>


Presentational Markup

• Presentational markup is not recommended. BUT if you do use it, describe all meaningful aspects (e.g. “bold” means head word, “italics” is used for the part of speech)


Descriptive Markup

• It is better practice to use descriptive markup, like XML
• XML is basically text with “tags” that provide information about what is between the tags
  – <headword>mnomen</headword>
  – <gloss>rice</gloss>
• Tags can also be used to group information, much like you would group information in a database record, and have a whole set of information in a database


Example of Descriptive Markup

AS_5.2.1978_audio: Alice Spear, Potawatomi, “Crane Boy”, May 2, 1978, Mayetta, Kansas.

<ID>AS_5.2.1978_audio</ID>
<speaker>Alice Spear</speaker>
<description>“Crane Boy”</description>
<recording.date>May 2, 1978</recording.date>
<location>Mayetta, Kansas</location>
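Descriptive tags like these do not have to be typed by hand. As a minimal sketch (an illustration added here, not something shown in the talk), Python's standard xml.etree.ElementTree module can build the same fields and write them out as XML. The wrapping <record> element is an assumption, added because an XML document needs a single root element.

```python
import xml.etree.ElementTree as ET

# Field names and values taken from the slide's descriptive-markup example.
fields = [
    ("ID", "AS_5.2.1978_audio"),
    ("speaker", "Alice Spear"),
    ("description", "\u201cCrane Boy\u201d"),
    ("recording.date", "May 2, 1978"),
    ("location", "Mayetta, Kansas"),
]

# Hypothetical wrapper element: XML requires exactly one root.
record = ET.Element("record")
for tag, text in fields:
    ET.SubElement(record, tag).text = text

# Serialize to a text string; non-ASCII characters are kept as-is.
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)

# Any XML-aware program can now find a field by its meaning, not its styling.
parsed = ET.fromstring(xml_text)
print(parsed.findtext("speaker"))
```

Because each value is labeled by what it is, a later reader (or program) does not need to know that bold once meant "identifier".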


Descriptive Markup: XML

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="archive.xsl"?>
<my.archive>
  <record>
    <identifier>AS_5.2.1978_audio</identifier>
    <subject.language code="x-sil-POT"/>
    <language code="en"/>
    <format>Analog audio recording on Cassette tape</format>
    <contributor refine="speaker">Alice Spear</contributor>
    <contributor refine="researcher">Laura Buszard-Welcher</contributor>
    <description>“Crane Boy” narrative told in Potawatomi and in English</description>
    <date code="1978-05-02"/>
    <coverage>Mayetta, Kansas</coverage>
    <relation>digital audio: AS_5.2.1978_audio.wav, interlinear text: AS_5.2.1978_audio.txt</relation>
    <type.linguistic code="primary_text"/>
    <rights>Some restrictions; contact field linguist</rights>
  </record>
</my.archive>


Descriptive Markup: XML

• It is a good practice to use standard tags where they are available.
  – OLAC has a set of tags that you would use for metadata to describe your resources
  – GOLD has a set of tags used for morphosyntactic description
• Otherwise, be sure to document the meaning of the tags that you use
• Although some people feel comfortable working in XML, many don’t like to use it as a working format.
• Fortunately many common programs now allow you to save your work as an XML file.


The Advantage of XML

• Besides creating an archival data file, XML has other advantages

• By creating stylesheets, you can give the same XML file different presentation forms

• For example…
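One way to picture this: a single XML record can be turned into a one-line citation for a handout and into an HTML list for a web page, without touching the source file. The sketch below is an assumption-laden illustration, not the talk's own example. It uses plain Python functions as a stand-in for stylesheets, since XSLT processing is not part of Python's standard library, and the function names and the trimmed-down record are hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape

# A trimmed-down record using element names from the metadata example.
SOURCE = """<record>
  <identifier>AS_5.2.1978_audio</identifier>
  <contributor refine="speaker">Alice Spear</contributor>
  <description>Crane Boy narrative told in Potawatomi and in English</description>
  <coverage>Mayetta, Kansas</coverage>
</record>"""


def as_citation(record):
    """Presentation 1: a one-line plain-text citation."""
    speaker = record.find("contributor[@refine='speaker']").text
    return "{}: {}, {}, {}.".format(
        record.findtext("identifier"),
        speaker,
        record.findtext("description"),
        record.findtext("coverage"),
    )


def as_html(record):
    """Presentation 2: an HTML definition list, one entry per element."""
    rows = "".join(
        "<dt>{}</dt><dd>{}</dd>".format(escape(child.tag), escape(child.text or ""))
        for child in record
    )
    return "<dl>" + rows + "</dl>"


record = ET.fromstring(SOURCE)
print(as_citation(record))
print(as_html(record))
```

The archival file stays the same in both cases; only the rendering step changes, which is the same division of labor a real XSL stylesheet provides.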


Delimited Text

• Another kind of markup that you might find yourself using is delimited text.
• Spreadsheet and database programs allow you to export your data as text, delimited by a particular character
  – Comma separated text (.csv)
  – Tab separated text (.tab)
• To help with intelligibility, create an initial record where the name of each field / cell is given inside the record itself. That way, the names of your fields / cells will be exported and saved along with the rest of your data.
• Text data exported this way is good practice, particularly if you are careful about documenting your practices inside your fields / cells (for more on this see following slides).
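The "initial record with field names" idea can be sketched with Python's standard csv module. This is an illustration under stated assumptions: the two column names are hypothetical, and the single entry reuses the headword and gloss from the XML example elsewhere in this talk.

```python
import csv
import io

# Hypothetical field names; the first exported row will carry them.
FIELDNAMES = ["headword", "gloss"]
entries = [{"headword": "mnomen", "gloss": "rice"}]

# Write tab-separated text with the field names as the initial record.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, delimiter="\t", lineterminator="\n")
writer.writeheader()
writer.writerows(entries)
text = buffer.getvalue()
print(text)

# Reading it back: the header row tells any program what each column means.
rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
print(rows[0]["gloss"])
```

Without the header row, a future reader would see only "mnomen" and "rice" separated by a tab and would have to guess which column is which.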


Other aspects of markup

• Document any special conventions that you use
• What do your morpheme boundary markers mean (+ / - / = …any others?)
• What glossing conventions do you use? Give the full names of abbreviations (e.g. POS means ‘possessive’, PV means ‘preverb’).
• Describe grammatical terms that you use (like ‘aorist’, or ‘preverb’) and what they mean for the language you are describing. (You don’t have to write a grammar -- a sentence or two describing the term is sufficient.)
• Also note if you are using standard terminology sets, like Leipzig Glossing Rules, or GOLD terminology
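A documented abbreviation list also makes it possible to check your glosses against it. The following is a rough Python sketch, offered as an assumption rather than an established tool: it treats any run of capital letters in a gloss line as an abbreviation and reports those missing from the list. The gloss lines are invented placeholders, not real language data, and the pattern would need adjusting for conventions such as person-number labels that begin with a digit.

```python
import re

# Abbreviations documented so far (the two examples given in the talk).
ABBREVIATIONS = {
    "POS": "possessive",
    "PV": "preverb",
}


def undocumented(gloss_lines, documented):
    """Return abbreviations used in the glosses but absent from the documentation."""
    used = set()
    for line in gloss_lines:
        # Assumption: abbreviations are written as two or more capitals/digits,
        # starting with a capital letter.
        used.update(re.findall(r"\b[A-Z][A-Z0-9]+\b", line))
    return sorted(used - set(documented))


# Invented gloss lines for illustration only.
glosses = ["PV-walk", "POS-house", "AOR-see"]
print(undocumented(glosses, ABBREVIATIONS))  # ['AOR']
```

Here the check flags AOR, a reminder to add a line explaining what 'aorist' means for the language being described.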


Document the Encoding

• Identify the character set you are using
• Document any non-standard characters
• Best practice is to use Unicode
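When a text is already in Unicode, listing its special characters for your documentation can be partly automated. This is a minimal Python sketch using the standard unicodedata module; the function name and the sample string are assumptions chosen for illustration.

```python
import unicodedata


def non_ascii_inventory(text):
    """Map each non-ASCII character to its code point and official Unicode name."""
    inventory = {}
    for ch in text:
        if ord(ch) > 127:
            inventory[ch] = ("U+{:04X}".format(ord(ch)), unicodedata.name(ch, "UNNAMED"))
    return inventory


# Sample string written with escapes so the code points are unambiguous.
sample = "\u014b \u025b \u00e9"
for ch, (code_point, name) in non_ascii_inventory(sample).items():
    print(ch, code_point, name)

# Saving as UTF-8 and reading back returns exactly the same characters.
assert sample.encode("utf-8").decode("utf-8") == sample
```

The resulting list (character, code point, name) is the kind of information worth keeping in a plain-text file alongside the data.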


Create Metadata

• You will need to create some additional information about your resources
• Metadata usually includes information about:
  – The setting (time, date, participants, location)
  – The language (ISO 639-3)
  – Linguistic type (text, grammar, lexicon) and subject
  – Access restrictions
• There are metadata standards for language resources: OLAC and IMDI
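A simple completeness check can catch a record that is missing one of these kinds of information before it goes to an archive. The sketch below is hypothetical: the field names are chosen for illustration and are not OLAC or IMDI element names, the values come from the recording described in this talk, and "pot" is the ISO 639-3 code for Potawatomi.

```python
# Hypothetical field names covering setting, language, linguistic type, and access.
REQUIRED_FIELDS = [
    "date",
    "participants",
    "location",
    "language",
    "linguistic_type",
    "access",
]


def missing_fields(record):
    """Return the required fields that are absent or empty in a metadata record."""
    return [
        name for name in REQUIRED_FIELDS
        if not str(record.get(name, "")).strip()
    ]


complete = {
    "date": "1978-05-02",
    "participants": "Alice Spear; Laura Buszard-Welcher",
    "location": "Mayetta, Kansas",
    "language": "pot",
    "linguistic_type": "primary_text",
    "access": "Some restrictions; contact field linguist",
}
incomplete = dict(complete, access="")

print(missing_fields(complete))    # []
print(missing_fields(incomplete))  # ['access']
```

An archive's own metadata form or schema remains the authority on which fields are actually required.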


OLAC Metadata Elements

Contributor (content)               Language (audience)
Coverage (e.g. location)            Publisher
Creator (content)                   Relation (to another resource)
Date                                Rights (controlled vocab.)
Description                         Source (say, for re-elicited data)
Format                              Subject (controlled vocab.)
Encoding Format (character set)     Subject Language (ISO 639-3 code)
Markup Format (XML schema)          Title
Identifier (file name, URL)         Linguistic Type

http://www.language-archives.org/OLAC/olacms.html


Create Metadata

• Keep a metadata record for each of your resources.
• The records should themselves be in an archival format. This could be:
  – A text file (good)
  – Delimited text, exported from a simple database file (good)
  – An XML file (better)
  – An OLAC or IMDI formatted XML file (best)
• Your archivist may have a preference about metadata formats, and prefer something relatively simple (like a paper form) if the archive will be manually entering the metadata.
• Archive this file along with the rest of your resources.


Make your resources accessible

• Archive, archive, archive! (Not just on your own or your departmental server: archives are committed to the long-term preservation and availability of your resources.)
• Before you leave to do fieldwork, or when you are writing your grant, establish contact with the archive where you intend to deposit your resources
• Archivists will
  – give you guidelines for creating archival files
  – help you select the best metadata set
  – give you information about setting access levels
• When you return, the first thing to do is send your files, along with the metadata and markup descriptions, to the archive
• Most archives will then give you an ID number for your resources that you can cite in your publications


A Community Responsibility

• Best practice involves what individual field linguists do, but also how we collectively use and care for these resources
• This broader community involves
  – Other researchers like yourself who create resources
  – A growing set of interconnected digital language archives that care for, protect, and disseminate your resources
  – People who develop tools and services to make your resources locatable, searchable, and reusable
  – Others: linguistics organizations, organizations like OLAC and DELAMAN, funding agencies who promote the work of this community


Unicode

• Debbie Anderson – “A field linguists’ guide to Unicode”
• Michael Appleby – “How to use Unicode on your computer”


Field Case Studies: Texts and Databases

• Jessica Boynton – “Transcription, Time-Alignment and Annotation”
• Naomi Fox – “Using Filemaker Pro to produce archivable language documentation”
• Connie Dickinson – “The Tsafiki Text Factory”


Panel Session

• Talks are 25 minutes, consecutive.

• Please remember or write down your questions!

• We will field them in a panel session after the talks.