FITS Demo Digital Preservation 2012 Andrea Goethals File Information Tool Set

Page 1: FITS Demo

FITS Demo

Digital Preservation 2012

Andrea Goethals

File Information Tool Set

Page 2: FITS Demo

Why FITS? Original motivation:

  • Offset the risk of accepting any format
    ◦ Web archives, email attachments, opaque objects
  • No single format identification tool can suffice (format support varies, accuracy varies)
  • Difficult to use multiple tools together (language differs)
  • Unsustainable to use only “library” tools; want to incorporate tools from any domain

Page 3: FITS Demo

FITS Strategy

  • Develop a tool manager instead of a tool
  • Include open source tools from any domain
  • Make highly configurable, tweak over time as experience & knowledge are gained
  • Account for tool inaccuracy in the design
  • Check the tools against each other
    ◦ Do any disagree?
    ◦ How many are in agreement?
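The cross-checking idea above can be sketched in a few lines of Python. This is only an illustration, not FITS's actual consolidation code: the function name `summarize_opinions` and the sample tool reports are hypothetical.

```python
from collections import Counter


def summarize_opinions(opinions):
    """Tally format opinions from several tools.

    `opinions` maps a tool name to the format name that tool reported.
    """
    counts = Counter(opinions.values())
    return {
        # Do any disagree?
        "conflict": len(counts) > 1,
        # How many are in agreement on each reported format?
        "agreement": dict(counts),
    }


# Hypothetical reports for a single file.
opinions = {
    "Jhove": "Plain text",
    "Droid": "Rich Text Format",
    "ffident": "Rich Text Format",
}
summary = summarize_opinions(opinions)
print(summary)
```

A real tool manager would also need to normalize format names before counting, since tools rarely spell them the same way.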

Page 4: FITS Demo

What does it do?

  • Identify many file formats
  • Validate a few file formats
  • Extract technical metadata
  • Calculate basic file info (file size, MD5, etc.)
  • Output technical metadata
    ◦ Community-standard metadata schemas
  • Identify problem files
    ◦ Conflicting opinions on format, metadata values
    ◦ Unidentifiable file formats
    ◦ Empty files
    ◦ Technical metadata can’t be generated
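As a rough illustration of the "basic file info" step, the following Python sketch computes a file's name, size, and MD5 checksum and flags empty files. It is an assumption-laden stand-in, not the FITS implementation; `basic_file_info` and the `empty` key are hypothetical, and the other keys simply mirror element names seen in FITS output.

```python
import hashlib
import os
import tempfile


def basic_file_info(path, chunk_size=65536):
    """Return name, size, MD5, and an empty-file flag for `path`."""
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
    size = os.path.getsize(path)
    return {
        "filename": os.path.basename(path),
        "size": size,
        "md5checksum": md5.hexdigest(),
        "empty": size == 0,  # empty files are one kind of problem file
    }


# Demonstration on a throwaway file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
info = basic_file_info(tmp.name)
os.remove(tmp.name)
print(info)
```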

Page 5: FITS Demo

The process

[Diagram: any file is passed to six tools (JHOVE, DROID, NLNZ ME, ExifTool, File utility, FFIdent). Each tool sits behind a FITS wrapper + XSL that produces FITS XML. A consolidator merges the per-tool FITS XML into a single FITS XML document, and an exporter can turn that into standard XML.]

Page 6: FITS Demo

FITS output

<fits>
  <identification>// format name, version, registry IDs</identification>
  <fileinfo>// file name, size, MD5, etc.</fileinfo>
  <filestatus>// validity info</filestatus>
  <metadata>// normalized, combined metadata</metadata>
  <toolOutput>// native tool output</toolOutput>
</fits>

Page 7: FITS Demo

Demos: basic command line

cmd                                      (open up a shell)
cd ..\..\Program Files\Fits\fits-0.6.1   (navigate to install)
.\fits.bat -h                            (see parameters)
.\fits.bat -i RELEASE.txt                (FITS metadata only)

Page 8: FITS Demo

<?xml version="1.0" encoding="UTF-8"?>
<fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://hul.harvard.edu/ois/xml/ns/fits/fits_output http://hul.harvard.edu/ois/xml/xsd/fits/fits_output.xsd"
      version="0.6.1" timestamp="7/20/12 5:01 PM">
  <identification>
    <identity format="Plain text" mimetype="text/plain" toolname="FITS" toolversion="0.6.1">
      <tool toolname="Jhove" toolversion="1.5" />
      <tool toolname="file utility" toolversion="5.03" />
      <tool toolname="Droid" toolversion="3.0" />
      <externalIdentifier toolname="Droid" toolversion="3.0" type="puid">x-fmt/111</externalIdentifier>
    </identity>
  </identification>
  <fileinfo>
    <size toolname="Jhove" toolversion="1.5">7838</size>
    <filepath toolname="OIS File Information" toolversion="0.1" status="SINGLE_RESULT">C:\Program Files\Fits\fits-0.6.1\RELEASE.txt</filepath>
    <filename toolname="OIS File Information" toolversion="0.1" status="SINGLE_RESULT">RELEASE.txt</filename>
    <md5checksum toolname="OIS File Information" toolversion="0.1" status="SINGLE_RESULT">7dc74a990c85006fa028ec8fbdbc0d20</md5checksum>
    <fslastmodified toolname="OIS File Information" toolversion="0.1" status="SINGLE_RESULT">1335359242000</fslastmodified>
  </fileinfo>
  <filestatus>
    <well-formed toolname="Jhove" toolversion="1.5" status="SINGLE_RESULT">true</well-formed>
    <valid toolname="Jhove" toolversion="1.5" status="SINGLE_RESULT">true</valid>
  </filestatus>
  <metadata>
    <text>
      <linebreak toolname="Jhove" toolversion="1.5">CR/LF</linebreak>
      <charset toolname="Jhove" toolversion="1.5">US-ASCII</charset>
    </text>
  </metadata>
</fits>
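Output like the above is straightforward to consume downstream. The following Python sketch, using only the standard library, pulls the identified format, the agreeing tools, the checksum, and the validity flag out of a FITS document. The `summarize` function is hypothetical and `SAMPLE` is a trimmed copy of the output shown on this slide; the namespace URI is taken from that output.

```python
import xml.etree.ElementTree as ET

# Namespace URI as it appears in the FITS output on the slide.
NS = {"fits": "http://hul.harvard.edu/ois/xml/ns/fits/fits_output"}

# Trimmed copy of the slide's output, for demonstration only.
SAMPLE = """
<fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output" version="0.6.1">
  <identification>
    <identity format="Plain text" mimetype="text/plain" toolname="FITS" toolversion="0.6.1">
      <tool toolname="Jhove" toolversion="1.5" />
      <tool toolname="file utility" toolversion="5.03" />
      <tool toolname="Droid" toolversion="3.0" />
    </identity>
  </identification>
  <fileinfo>
    <md5checksum toolname="OIS File Information" toolversion="0.1" status="SINGLE_RESULT">7dc74a990c85006fa028ec8fbdbc0d20</md5checksum>
  </fileinfo>
  <filestatus>
    <valid toolname="Jhove" toolversion="1.5" status="SINGLE_RESULT">true</valid>
  </filestatus>
</fits>
"""


def summarize(fits_xml):
    """Extract a few headline values from a FITS XML string."""
    root = ET.fromstring(fits_xml)
    identities = [
        {
            "format": identity.get("format"),
            "mimetype": identity.get("mimetype"),
            "tools": [t.get("toolname") for t in identity.findall("fits:tool", NS)],
        }
        for identity in root.findall("fits:identification/fits:identity", NS)
    ]
    return {
        "identities": identities,
        "md5": root.findtext("fits:fileinfo/fits:md5checksum", namespaces=NS),
        "valid": root.findtext("fits:filestatus/fits:valid", namespaces=NS),
    }


result = summarize(SAMPLE)
print(result)
```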

Page 9: FITS Demo

Demos: basic command line

.\fits.bat -i RELEASE.txt -x             (standard technical metadata only)

Page 10: FITS Demo

<?xml version="1.0" encoding="UTF-8"?>
<textMD:textMD xmlns:textMD="info:lc/xmlns/textMD-v3"
               xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
               xsi:schemaLocation="info:lc/xmlns/textMD-v3 http://www.loc.gov/standards/textMD/textMD-v3.01a.xsd">
  <textMD:character_info>
    <textMD:charset>US-ASCII</textMD:charset>
    <textMD:linebreak>CR/LF</textMD:linebreak>
  </textMD:character_info>
</textMD:textMD>

Page 11: FITS Demo

Demos: basic command line

.\fits.bat -i RELEASE.txt -xc            (FITS metadata + standard technical metadata)

Page 12: FITS Demo

<fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://hul.harvard.edu/ois/xml/ns/fits/fits_output http://hul.harvard.edu/ois/xml/xsd/fits/fits_output.xsd"
      version="0.6.1" timestamp="7/20/12 5:11 PM">
  <identification>
    <identity format="Plain text" mimetype="text/plain" toolname="FITS" toolversion="0.6.1">
      <tool toolname="Jhove" toolversion="1.5" />
      <tool toolname="file utility" toolversion="5.03" />
      <tool toolname="Droid" toolversion="3.0" />
      <externalIdentifier toolname="Droid" toolversion="3.0" type="puid">x-fmt/111</externalIdentifier>
    </identity>
  </identification>
  .
  (snip)
  .
  <metadata>
    <text>
      <linebreak toolname="Jhove" toolversion="1.5">CR/LF</linebreak>
      <charset toolname="Jhove" toolversion="1.5">US-ASCII</charset>
      <standard>
        <textMD:textMD xmlns:textMD="info:lc/xmlns/textMD-v3">
          <textMD:character_info>
            <textMD:charset>US-ASCII</textMD:charset>
            <textMD:linebreak>CR/LF</textMD:linebreak>
          </textMD:character_info>
        </textMD:textMD>
      </standard>
    </text>
  </metadata>
</fits>

Page 13: FITS Demo

Demos: basic command line

.\fits.bat -i RELEASE.txt -o demo\RELEASE_out1.txt   (FITS metadata only, written to a file)
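For batch work it can help to assemble these command lines programmatically. The sketch below only builds the argument list and does not run FITS. The flags `-i`, `-o`, `-x`, and `-xc` are the ones shown on the demo slides; the function name, the mode names, and the assumption that `-x`/`-xc` can be combined with `-o` are hypothetical.

```python
# Flags taken from the demo slides: -i (input), -o (output file),
# -x (standard metadata only), -xc (FITS + standard metadata).
MODE_FLAGS = {"fits": [], "standard": ["-x"], "combined": ["-xc"]}


def build_fits_command(input_path, output_path=None, mode="fits",
                       launcher=r".\fits.bat"):
    """Return the argument list for one FITS run (hypothetical helper)."""
    if mode not in MODE_FLAGS:
        raise ValueError(f"unknown mode: {mode!r}")
    command = [launcher, "-i", input_path, *MODE_FLAGS[mode]]
    if output_path is not None:
        command += ["-o", output_path]
    return command


command = build_fits_command("RELEASE.txt", output_path=r"demo\RELEASE_out1.txt")
print(command)
# To actually run it from the install directory (not executed here):
#   import subprocess; subprocess.run(command, check=True)
```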

Page 14: FITS Demo

In our AIPs

  • od_1000012.xml
    ◦ premis:fixity (MD5)
    ◦ premis:size (file size)
    ◦ premis:format
    ◦ premis:creatingApplication
    ◦ premis:objectCharacteristicsExtension (documentMD)
    ◦ hulDrsAdmin:fileIdentification
    ◦ hulDrsAdmin:formatValidation
    ◦ hulDrsAdmin:suppliedFilename
    ◦ hulDrsAdmin:suppliedDirectory

Page 15: FITS Demo

Main configuration: fits.xml

  • In fits-0.6.1/xml directory
  • Key items
    ◦ Enable/disable tools
    ◦ Add new tools
    ◦ Tools to prefer
    ◦ Prevent tools from processing files by file extension
    ◦ Option to include tools’ native output
    ◦ Report or ignore conflicts

Page 16: FITS Demo

Configuration: fits_format_tree.xml

  • In fits-0.6.1/xml directory
  • To indicate more specific formats

<branch format="JPEG 2000">
  <branch format="JPEG 2000 JP2"/>
  <branch format="JPEG 2000 JPX"/>
</branch>
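One way to read the format tree is as a "more specific than" relation: if one tool says "JPEG 2000" and another says "JPEG 2000 JP2", the second is a refinement rather than a disagreement. The Python sketch below checks that relation against the fragment shown on the slide. It assumes only that fragment (the real file's root element is not shown here), and `is_more_specific` is a hypothetical helper, not part of FITS.

```python
import xml.etree.ElementTree as ET

# The fragment shown on the slide; the full file's structure is not assumed.
TREE_XML = """
<branch format="JPEG 2000">
  <branch format="JPEG 2000 JP2"/>
  <branch format="JPEG 2000 JPX"/>
</branch>
"""


def is_more_specific(tree_root, general, specific):
    """True if `specific` appears somewhere below `general` in the tree."""
    for branch in tree_root.iter("branch"):
        if branch.get("format") == general:
            return any(
                child.get("format") == specific
                for child in branch.iter("branch")
                if child is not branch
            )
    return False


root = ET.fromstring(TREE_XML)
print(is_more_specific(root, "JPEG 2000", "JPEG 2000 JP2"))
print(is_more_specific(root, "JPEG 2000 JP2", "JPEG 2000"))
```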

Page 17: FITS Demo

Conflict reports

C:\Program Files\Fits\fits-0.6.1>.\fits.bat -i demo\Acknowledgements.rtf

<?xml version="1.0" encoding="UTF-8"?>
<fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://hul.harvard.edu/ois/xml/ns/fits/fits_output http://hul.harvard.edu/ois/xml/xsd/fits/fits_output.xsd"
      version="0.6.1" timestamp="7/21/12 3:51 PM">
  <identification status="CONFLICT">
    <identity format="Plain text" mimetype="text/plain" toolname="FITS" toolversion="0.6.1">
      <tool toolname="Jhove" toolversion="1.5" />
    </identity>
    <identity format="Rich Text Format" mimetype="application/rtf, text/rtf" toolname="FITS" toolversion="0.6.1">
      <tool toolname="Droid" toolversion="3.0" />
      <version toolname="Droid" toolversion="3.0" status="CONFLICT">1.5</version>
      <version toolname="Droid" toolversion="3.0" status="CONFLICT">1.6</version>
      <externalIdentifier toolname="Droid" toolversion="3.0" type="puid">fmt/50</externalIdentifier>
      <externalIdentifier toolname="Droid" toolversion="3.0" type="puid">fmt/51</externalIdentifier>
    </identity>
    <identity format="Rich Text Format" mimetype="text/rtf" toolname="FITS" toolversion="0.6.1">
      <tool toolname="ffident" toolversion="0.2" />
    </identity>
  </identification>
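A conflict like this one can be detected mechanically by looking for `status="CONFLICT"` on the `identification` element and listing each competing identity with the tools behind it. The Python sketch below does that for a trimmed, closed copy of the report above. The `find_conflict` function is hypothetical, and the sample adds a closing `</fits>` tag only so that it parses.

```python
import xml.etree.ElementTree as ET

NS = {"fits": "http://hul.harvard.edu/ois/xml/ns/fits/fits_output"}

# Trimmed copy of the conflict report, closed so that it is well-formed.
CONFLICT_SAMPLE = """
<fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output" version="0.6.1">
  <identification status="CONFLICT">
    <identity format="Plain text" mimetype="text/plain" toolname="FITS" toolversion="0.6.1">
      <tool toolname="Jhove" toolversion="1.5" />
    </identity>
    <identity format="Rich Text Format" mimetype="application/rtf, text/rtf" toolname="FITS" toolversion="0.6.1">
      <tool toolname="Droid" toolversion="3.0" />
    </identity>
    <identity format="Rich Text Format" mimetype="text/rtf" toolname="FITS" toolversion="0.6.1">
      <tool toolname="ffident" toolversion="0.2" />
    </identity>
  </identification>
</fits>
"""


def find_conflict(fits_xml):
    """Return (format, mimetype, tools) per identity if a conflict is flagged."""
    root = ET.fromstring(fits_xml)
    identification = root.find("fits:identification", NS)
    if identification is None or identification.get("status") != "CONFLICT":
        return []
    return [
        (
            identity.get("format"),
            identity.get("mimetype"),
            [t.get("toolname") for t in identity.findall("fits:tool", NS)],
        )
        for identity in identification.findall("fits:identity", NS)
    ]


conflicts = find_conflict(CONFLICT_SAMPLE)
for fmt, mimetype, tools in conflicts:
    print(f"{fmt} ({mimetype}) reported by {', '.join(tools)}")
```

Note that the two "Rich Text Format" identities differ only in MIME type, which is why they are reported separately.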

Page 18: FITS Demo

Conflict reports

  • Indicate tool inaccuracies and/or areas for educating ourselves
  • To resolve
    ◦ Is Rich Text Format a more specific form of Plain Text? If so, adjust fits_format_tree.xml
    ◦ What should the MIME media type for Rich Text Format be? (Consult the specification if possible.) Normalize the tool output to this MIME media type

Page 19: FITS Demo

Value normalization

  • Different values for the same metadata
    ◦ “inches” vs “2” vs “in.”
    ◦ “Grayscale” vs “Greyscale”
  • Different names for the same format
    ◦ ‘JPEG2000’ vs ‘JPEG 2000’ vs ‘JPEG 2000 image’
  • Different ways of saying it can’t identify it
    ◦ ‘Unknown Binary’ vs ‘bytestream’ vs ‘data’ vs no value
    ◦ ‘application/octet-stream’ vs ‘application/unknown’ vs no value
  • Different ways metadata is output
    ◦ Ex: bits per sample (single or multiple values)
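The normalization problem can be illustrated with a small lookup-based sketch in Python. This is not how FITS itself implements normalization; the function name, the alias tables, and the choice of "JPEG 2000" as the canonical spelling are assumptions made for the example. The variant strings come from the slide.

```python
# Variant spellings from the slide, mapped to one assumed canonical name.
FORMAT_ALIASES = {
    "jpeg2000": "JPEG 2000",
    "jpeg 2000": "JPEG 2000",
    "jpeg 2000 image": "JPEG 2000",
}

# Different ways tools say "I can't identify this".
UNKNOWN_FORMATS = {"unknown binary", "bytestream", "data", ""}


def normalize_format(raw):
    """Map a tool's format name to a canonical name, or None if unknown."""
    cleaned = (raw or "").strip()
    key = cleaned.lower()
    if key in UNKNOWN_FORMATS:
        return None
    # Fall back to the tool's own (trimmed) value when no alias is known.
    return FORMAT_ALIASES.get(key, cleaned)


for value in ["JPEG2000", "JPEG 2000 image", "bytestream", None, "Plain text"]:
    print(repr(value), "->", repr(normalize_format(value)))
```

Once every tool's output passes through the same mapping, spelling differences no longer register as conflicts.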

Page 20: FITS Demo

11 OS releases since July 2009

[Chart: release count (y-axis, 0 to 12) by date (x-axis, 7/1/2009 through 4/22/2012).]

Page 21: FITS Demo

Code home

  • http://fits.googlecode.com
    ◦ Downloads: download the newest version
    ◦ Mailing list: fits-users (new releases announced here)
    ◦ Issues: file any bugs, upload patches

Page 22: FITS Demo

Future plans

  • Support for container files & ContainerMD
    ◦ arc.gz, .gz, .zip
  • Improved video support
  • Additional tools as needed
    ◦ Apache Tika (docs, pdf, mbox, rtf, containers)
    ◦ JHOVE 2 (shapefiles)
    ◦ Mediainfo (audio, video)
    ◦ Aduna Aperture (docs, pdf, email)
  • Analysis of tool overlaps and “niches”
  • Performance efficiencies
  • Better documentation!