DSpace 4.2 Advanced Training – Content Transmission
DSpace 4.2 Advanced Training by James Creel is licensed under a Creative Commons Attribution 4.0 International License. Special thanks to the DuraSpace Foundation and the Texas Digital Library for making this course possible.
Module Outline
• Harvesting and Disseminating with OAI-PMH
• Reading content with REST
• Export and Import with SIPs
• Depositing content with SWORD
• Importing content with the Simple Archive Format (SAF)
Introduction to Harvesting
• Open Archives Initiative
• Protocol for Metadata Harvesting
• Object Reuse and Exchange
• Harvesting with DSpace XMLUI
• Choice of collection source
• Replicate metadata (OAI-PMH) or metadata + data (OAI-PMH + OAI-ORE)
• What an excellent way to rapidly populate one’s repository!
Introduction to Harvesting
• Go ahead and create a new collection wherever you please.
• We will be harvesting content from remote DSpace repositories.
• Having created the collection, one is taken to the edit view. Click the tab for Content Source
How do we learn about the harvest source?
• Point your browser to http://repository.tamu.edu/dspace-oai/request?verb=ListSets to see a list of collections at TAMU.
• An OAI server will grant requests for several interesting verbs.
• Point your browser to http://www.openarchives.org/OAI/openarchivesprotocol.html for details
• In the 1.8.x days, one would need to keep that page open when trying to craft queries to OAI. Under 3.x and higher, there is a lovely stylesheet courtesy of Lyncode that makes typical queries easy and automatic.
Configuring the Content Source
• A sample OAI Provider – OAK Trust: The Texas A&M Digital Repository: http://repository.tamu.edu/dspace-oai/request
• OAI Set spec: com_1969.1_5670
• Test the settings to make sure things are copasetic, then save.
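A set spec such as the one above can be read out of a ListSets response programmatically. The following is a minimal Python sketch under stated assumptions: the sample XML is a hypothetical, abbreviated response (the set names and the second set spec are invented for illustration), and only the OAI-PMH 2.0 namespace and element names are taken from the protocol itself.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 protocol.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hypothetical, abbreviated ListSets response; names are illustrative only.
SAMPLE_LIST_SETS = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set>
      <setSpec>com_1969.1_5670</setSpec>
      <setName>Example Community</setName>
    </set>
    <set>
      <setSpec>col_1969.1_0000</setSpec>
      <setName>Example Collection</setName>
    </set>
  </ListSets>
</OAI-PMH>"""


def list_set_specs(response_xml):
    """Return (setSpec, setName) pairs found in a ListSets response."""
    root = ET.fromstring(response_xml)
    pairs = []
    for set_el in root.findall("./oai:ListSets/oai:set", OAI_NS):
        spec = set_el.findtext("oai:setSpec", default="", namespaces=OAI_NS)
        name = set_el.findtext("oai:setName", default="", namespaces=OAI_NS)
        pairs.append((spec, name))
    return pairs


if __name__ == "__main__":
    for spec, name in list_set_specs(SAMPLE_LIST_SETS):
        print(spec, "-", name)
```

In practice the response text would be fetched from the provider's ListSets URL rather than taken from a constant.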
Your oai webapp provides a machine-readable dissemination service.
• Try some requests:
• http://localhost:8080/oai/request?verb=Identify
• http://localhost:8080/oai/request?verb=ListMetadataFormats
• http://localhost:8080/oai/request?verb=ListSets
• http://localhost:8080/oai/request?verb=ListRecords&metadataPrefix=ore
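Request URLs like those above can also be assembled in code. This is a small sketch, assuming only the base URL and verbs shown on the slide; the helper name `oai_request_url` is hypothetical.

```python
from urllib.parse import urlencode


def oai_request_url(base_url, verb, **arguments):
    """Build an OAI-PMH request URL from a verb and optional arguments.

    Arguments whose value is None are left out of the query string.
    """
    query = {"verb": verb}
    query.update({k: v for k, v in arguments.items() if v is not None})
    return base_url + "?" + urlencode(query)


if __name__ == "__main__":
    base = "http://localhost:8080/oai/request"
    print(oai_request_url(base, "Identify"))
    print(oai_request_url(base, "ListRecords", metadataPrefix="ore"))
```

Because `from` is a reserved word in Python, a date-range argument of that name would have to be passed by unpacking a dictionary, for example `oai_request_url(base, "ListRecords", **{"from": "2014-01-01", "metadataPrefix": "oai_dc"})`.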
We can experiment with harvesting from each other’s repositories
• From your command line, run ipconfig
• Your IP address will be listed as the IPv4 address
• You can craft an OAI request URL for your server using the IP address as the host name.
• If you like, invite a neighbor to harvest one of your collections.
Automating Harvesting (1/3)
• Requests to harvest large collections can easily time out.
• This calls for a scheduler that runs independently of the browser.
• Find it in the XMLUI under the control panel.
Automating Harvesting (2/3)
• When automated, the harvester will conduct its activity on all collections that are configured to harvest.
• Once started, the harvester will operate at regular intervals as specified by harvester.harvestFrequency in modules/oai.cfg.
Automating Harvesting (3/3)
• Start – initiate the periodic process
• Pause – wait for the current operation to complete, then suspend further operations
• Stop – wait for the currently harvested item to complete, then suspend further operations (which will likely break further harvests of the containing collection)
• Reset Harvest Status – clears the status of each harvested collection so that they may be initiated anew
Which formats are available to your harvester?
• This is configurable in [dspace-install-dir]\config\modules\oai.cfg under the harvester.oai.metadataformats.[declared-metadata-format-name] values
• Where [declared-metadata-format-name] is declared in your xoai.xml
• Let’s add “rdf” to that list and try harvesting with it.
Dissemination – Metadata Crosswalks
• Metadata in DSpace exist in key-value pairs with field names given by the metadata registry.
• Fields may be exported in the formats that the OAI server reports for the ListMetadataFormats verb.
• Dissemination crosswalks are encoded as XSL files inside the [dspace-install-dir]\config\crosswalks directory
• The .properties files appear to have gone unused for OAI dissemination since DSpace 3.x
• The crosswalks are active in specific contexts that can be configured.
Configuring Metadata Crosswalks – XOAI Configuration Entities
• Open up the C:\dspace\config\crosswalks\oai\xoai.xml file with jEdit.
• The top level Configuration element contains <Contexts>, <Formats>, <Transformers>, <Filters>, and <Sets>.
• Each of these contains, in turn, what you would expect: <Context> elements, <Format> elements, <Transformer> elements, <Filter> elements, and <Set> elements.
• Each kind of element plays its own role in the configuration.
Configuring Metadata Crosswalks – XOAI Configuration – Setting up Contexts
• The <Context> element refers to instances of all the other elements.
• The baseurl attribute determines how to address the context in your url path
• The <Format> elements name the crosswalks to be available
• The <Transformer> element names a stylesheet to apply to the final XML output
• The <Filter> elements name Java classes that will eliminate results unacceptable to the context
• The <Set> element appears simply to alias the set of all records in the context.
Configuring Metadata Crosswalks – XOAI Configuration – Setting up Formats
• The <Format> elements have an id attribute which allows them to be referenced in the <Context>
• They also contain, minimally, a
• <Prefix> by which they are addressed in OAI requests
• <XSLT> designating the xsl file doing the crosswalk
• And should include
• <Namespace> designating the namespace of XML output
• <SchemaLocation> designating the schema specification of that XML
Configuring Metadata Crosswalks – XOAI Configuration – Setting up Transformers
• The <Transformer> element contains an id attribute by which it is referenced in the <Context> and an <XSLT> element designating its XSL file.
Configuring Metadata Crosswalks – XOAI Configuration – Setting up Filters
• The <Filter> elements contain an id attribute by which they are referenced in the <Context> and
• <Class> which names the java class doing the filtering
• <Parameter> with a key attribute and one or more <Value> elements that are used to parameterize the filtering method.
Configuring Metadata Crosswalks – XOAI Configuration – Setting up Sets
• The <Set> element has the usual id attribute and
• <Pattern> which renders as the set spec in the OAI response
• <Name> which renders as the set’s name
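To see how contexts and formats fit together, it can help to read the configuration with a short script. The sketch below is only an illustration under assumptions: the XML fragment is a simplified, namespace-free stand-in for xoai.xml (the real file declares an XML namespace and more attributes), the `refid` attribute name used for references inside a <Context> is an assumption to verify against your own file, and the "partner" context and "partnerFilter" id are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical stand-in for xoai.xml (no namespace).
SAMPLE_XOAI = """
<Configuration>
  <Contexts>
    <Context baseurl="request">
      <Format refid="oaidc"/>
      <Format refid="mets"/>
    </Context>
    <Context baseurl="partner">
      <Filter refid="partnerFilter"/>
      <Format refid="oaidc"/>
    </Context>
  </Contexts>
  <Formats>
    <Format id="oaidc">
      <Prefix>oai_dc</Prefix>
      <XSLT>metadataFormats/oai_dc.xsl</XSLT>
    </Format>
    <Format id="mets">
      <Prefix>mets</Prefix>
      <XSLT>metadataFormats/mets.xsl</XSLT>
    </Format>
  </Formats>
</Configuration>
"""


def prefixes_by_context(xml_text, ref_attr="refid"):
    """Map each context's baseurl to the metadata prefixes it offers.

    ref_attr is the attribute a <Context> child uses to point at a
    <Format> id; "refid" is an assumption, adjust it to match your file.
    """
    root = ET.fromstring(xml_text)
    prefix_by_id = {
        fmt.get("id"): fmt.findtext("Prefix")
        for fmt in root.findall("./Formats/Format")
    }
    result = {}
    for ctx in root.findall("./Contexts/Context"):
        refs = [fmt.get(ref_attr) for fmt in ctx.findall("Format")]
        result[ctx.get("baseurl")] = [prefix_by_id.get(ref) for ref in refs]
    return result


if __name__ == "__main__":
    for baseurl, prefixes in prefixes_by_context(SAMPLE_XOAI).items():
        print(baseurl, "->", ", ".join(prefixes))
```

Each key in the result corresponds to a URL path segment, so the "partner" context here would answer requests with metadataPrefix=oai_dc only.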
Exercise – A Custom Context
• Let’s imagine a use case where there is a requirement to be harvested by a vendor or partner.
• Only items with certain fields are suitable for their index (for example, those with a title, author, and type)
• Create a new context with an appropriate filter.
Configuring Metadata Crosswalks – Styling for Human Readability
• The webapps\oai\static\style.xsl stylesheet is used to render the OAI responses in a nice readable format with the links of interest also provided.
• One may also change the stylesheet being used by OAI by changing the stylesheet attribute of the <Configuration> root element of xoai.xml.
• Let’s experiment with some changes to the style –
• New branding
• Links to each of the contexts
The REST Webapp (1/4)
• Representational State Transfer – A scalable, simple approach to web services.
• Stateless on the server side – client maintains any session data
• Cacheable – responses should indicate whether the client can save them in a web cache
• Layerable – Client need not know or care whether the server is behind a proxy
• Simple, Uniform Requests – resources identifiable by URI, responses report their format and their cacheability
The REST Webapp (2/4)
• Read Only in 4.x
• JSON or XML depending on your HTTP Header: Accept
• Possible values are application/xml and application/json
• Your browser may default to one or the other, but your application code (or developer’s browser) can specify.
• Communities, Collections, Items and Bitstreams are queryable resources
• The ?expand query parameter followed by a comma delimited list will provide more detail than the default queries
The REST Webapp (3/4)
• Communities
• /rest/communities lists all
• /rest/communities/:id gets one
• ?expand possibilities: parentCommunity, collections, subCommunities, logo, all
• Collections
• /rest/collections lists all
• /rest/collections/:id gets one
• ?expand possibilities: parentCommunityList, parentCommunity, items, license, logo, all
The REST Webapp (4/4)
• Items
• /rest/items/:id gets one
• ?expand possibilities: metadata, parentCollection, parentCollectionList, parentCommunityList, bitstreams, all
• Bitstreams
• /rest/bitstreams/:bitstreamID gets one
• /rest/bitstreams/:bitstreamID/retrieve to download
• ?expand possibilities: parent, all
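The endpoints, the ?expand parameter, and the Accept header can be combined in application code. The following sketch builds (but does not send) such a request with Python's standard library; the base URL http://localhost:8080/rest and the item id 5 are assumptions for illustration, and `rest_request` is a hypothetical helper name.

```python
from urllib.parse import urlencode
from urllib.request import Request


def rest_request(base_url, resource, resource_id=None, expand=(), as_json=True):
    """Build a GET request for a REST resource.

    expand is a sequence of expansion names joined with commas;
    as_json selects application/json, otherwise application/xml.
    """
    parts = [str(p) for p in (resource, resource_id) if p is not None]
    url = base_url.rstrip("/") + "/" + "/".join(parts)
    if expand:
        # safe="," keeps the comma-delimited list readable in the URL.
        url += "?" + urlencode({"expand": ",".join(expand)}, safe=",")
    accept = "application/json" if as_json else "application/xml"
    return Request(url, headers={"Accept": accept})


if __name__ == "__main__":
    req = rest_request("http://localhost:8080/rest", "items", 5,
                       expand=("metadata", "bitstreams"))
    print(req.get_method(), req.full_url, req.get_header("Accept"))
```

Passing the returned object to `urllib.request.urlopen` would perform the request against a running server.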
The DSpace Packager
• Utilized with the dspace packager command-line script
• Submission Information Packages
• Dissemination Information Packages
Submission Packages (SIPs)
• Four package formats supported by default:
• DSpace Archival Information Package (AIP) – used for backing up and restoring DSpace repository content
• DSPACE-ROLES – used for backing up and restoring DSpace groups and epersons
• METS – A zipfile containing MODS descriptive metadata and designating content bitstreams and their disposition
• PDF – A single PDF file can be considered a package (supposing its embedded metadata are suitable)
Submission Packages (SIPs)
• An example – importing a PDF as a package
• Track down a pdf on the interwebs – here’s one!
• http://hdl.handle.net/1969.1/2313
• Copy it to [dspace-install-dir] i.e. C:\dspace
• Learn about the packager with the C:\dspace\bin\dspace packager --help --type PDF command
• Can you craft the command to make the submission?
Submission Packages (SIPs) – PDF example
• We need -t for type, -p for parent collection, -e for eperson email, and finally the name of the “package”
• Once this succeeds, however, the quality of the metadata is likely to be very poor indeed! Embedded metadata are seldom well populated.
Submission Packages (SIPs)
• An example – importing a METS package
• Of interest as this is also the package used by default for SWORD deposits
• Find the file mets-sip-example.zip in the W:\Development\resources directory.
• Copy it to [dspace-install-dir] i.e. C:\dspace
• Learn about the packager with the C:\dspace\bin\dspace packager --help --type METS command
• Can you craft the command to make the submission?
Submission Packages (SIPs) – METS example
• We need at least the -t flag for type, -p for parent collection, -e for eperson, and finally the filename of the package.
• C:\dspace\bin\dspace packager -t METS -p [collection-handle] -e [email protected] mets-sip-example.zip
Dissemination Packages (DIPs)
• DSpace Archival Information Package
• DSPACE-ROLES
• METS
• No need to export PDFs, we might suppose.
• As a final packaging exercise, use the packager to disseminate an item. This will require the additional -i (identifier, i.e. handle of the object) and -d (disseminate instead of the default, submit)
• Can you craft the command?
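When packaging is scripted, the submission and dissemination commands described above can be assembled as argument lists. This is a sketch under assumptions: it uses only the flags named on the slides (-t, -p, -e, -i, -d), the C:\dspace install path from the exercises, and placeholder handles and email addresses; `packager_command` is a hypothetical helper.

```python
import subprocess


def packager_command(package_type, eperson, package_file, parent=None,
                     identifier=None, disseminate=False,
                     dspace_bin=r"C:\dspace\bin\dspace"):
    """Return the argument list for a dspace packager invocation.

    Submission (the default) may name a parent collection with -p;
    dissemination needs -d and the handle of the object via -i.
    """
    args = [dspace_bin, "packager"]
    if disseminate:
        if identifier is None:
            raise ValueError("dissemination requires an identifier (-i)")
        args += ["-d", "-i", identifier]
    elif parent is not None:
        args += ["-p", parent]
    args += ["-t", package_type, "-e", eperson, package_file]
    return args


def run_packager(args):
    """Run the command; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(args, check=True)


if __name__ == "__main__":
    # Placeholder handle and email, shown without executing anything.
    print(packager_command("METS", "admin@example.org", "item.zip",
                           identifier="123456789/10", disseminate=True))
```

Keeping the arguments in a list avoids shell quoting problems when handles or filenames contain unusual characters.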
Dissemination Packages (DIPs)
• A successful dissemination:
• Let’s complete the circle by submitting this package to another (or even the same) collection.
SWORD
• Simple Web Service Offering Repository Deposit
• DSpace comes with servers for v1 and v2
• Big innovation of v2 is ability to update items, but client support is currently limited
• Accessible via a client or (e.g.) a cURL command.
• Accepts deposits via METS packages by default
• Requires an administrative eperson account
SWORD – accessing via cURL command
• A cURL executable is provided at W:\Development\curl-7.37.0-win32\
• Copy that directory to your own C:\Development\.
• cURL is a robust tool for transferring data over many protocols, with and without encryption; today we are interested only in HTTP.
SWORD – accessing via cURL command – getting the service document
• Clues to the meaning may be found at http://curl.haxx.se/docs/manpage.html
SWORD – accessing via cURL command – Making a deposit
• A long, long command indeed…
• curl
• -i
• --data-binary "@mets-sip-example.zip"
• -H "Content-Disposition: filename=mets-sip-example.zip"
• -H "Content-Type: application/zip"
• -H "X-Packaging: http://purl.org/net/sword-types/METSDSpaceSIP"
• -H "X-No-Op: false"
• -H "X-Verbose: true"
• --user "[email protected]:admin" http://localhost:8080/sword/deposit/123456789/26
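The same deposit can be expressed in application code. The sketch below mirrors the headers of the cURL command using Python's standard library and prepares the request without sending it; the credentials are placeholders, and the use of HTTP Basic authentication reflects cURL's default behaviour for --user rather than anything stated on the slide.

```python
import base64
from urllib.request import Request


def sword_deposit_request(deposit_url, package_bytes, filename, user, password):
    """Prepare a SWORD v1 deposit of a zipped METS package as a POST."""
    credentials = f"{user}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    headers = {
        "Content-Disposition": f"filename={filename}",
        "Content-Type": "application/zip",
        "X-Packaging": "http://purl.org/net/sword-types/METSDSpaceSIP",
        "X-No-Op": "false",      # "true" would request a dry run
        "X-Verbose": "true",
        "Authorization": f"Basic {token}",
    }
    return Request(deposit_url, data=package_bytes, headers=headers,
                   method="POST")


if __name__ == "__main__":
    # Placeholder bytes stand in for the contents of the zip file.
    req = sword_deposit_request(
        "http://localhost:8080/sword/deposit/123456789/26",
        b"placeholder", "mets-sip-example.zip",
        "admin@example.org", "secret")
    print(req.get_method(), req.full_url)
```

Sending the request with `urllib.request.urlopen(req)` would return the server's response, which the -i flag of the cURL command prints together with its headers.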
SWORD – accessing via cURL command – Making a deposit
• Find that text in the W:\Development\resources\curl-deposit-notes file.
• In an amusing turn of events, this deposit will fail from most of our localhost machines: behind the scenes, the SWORD server attempts to write a temporary file named after your IP address, which contains colon characters that are illegal in Windows filenames.
• This can be gleaned from the C:\Development\tomcat\logs\localhost.[today].log
• Instead, let’s experiment with deposits to other servers in the room.
SWORD – Bringing up the DSpace Client
• Activate the aspect in xmlui.xconf
• Target repositories are configured in the [dspace-install-dir]\config\modules\sword-client.cfg file
SWORD – Utilizing the DSpace SWORD Client
• Serves at this time only to copy existing items to another SWORD-enabled repository.
• To utilize, navigate to the item’s page while logged in as an administrator.
• Let’s try some deposits to localhost and our neighbors.
SWORD – Looking Forward to SWORD v2 in Practice
• SWORD v2 offers the capability to change the content and metadata of previously deposited items
• Java libraries for the client are available, but I have not seen an implemented GUI.
• cURL usage is theoretically possible as well, but looks like a bit of heavy lifting.
Batch Imports
• DSpace Simple Archive Format (SAF)
• The DSpace import script
• Adding items
• Replacing items
• Deleting items
• Importing from real sources
• Example: CSV
• Example: MARC XML
DSpace SAF (1/3) - Overview
• The top level directory contains one directory for each item in the batch.
• Each item directory must contain:
• The bitstream files
• A contents manifest, a file named contents
• A metadata file dublin_core.xml
• Optionally, other metadata files with names like metadata_[schema].xml where [schema] is the schema’s name.
Scott Phillips provides a fine guide at http://www.scottphillips.com/2009/05/howto-dspace-batch-ingest/
DSpace SAF (2/3) – Contents Manifest
• The contents manifest (the file named contents) names each bitstream that will be in the item as well as its disposition:
• Bundle
• Permissions
• Primacy
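A manifest can be generated rather than typed by hand. The sketch below assumes the tab-separated option syntax (`bundle:`, `permissions:`, `primary:true`) as recalled from the DSpace import documentation; check it against the documentation for your version. The filenames are hypothetical, and the permissions value is passed through as whatever string the caller supplies.

```python
def manifest_line(filename, bundle=None, permissions=None, primary=False):
    """Build one line of a SAF contents manifest.

    Options follow the filename, separated by tabs. The option keywords
    are an assumption based on the DSpace import documentation.
    """
    parts = [filename]
    if bundle:
        parts.append(f"bundle:{bundle}")
    if permissions:
        parts.append(f"permissions:{permissions}")
    if primary:
        parts.append("primary:true")
    return "\t".join(parts)


def manifest_text(entries):
    """Join manifest lines for a list of keyword-argument dictionaries."""
    return "\n".join(manifest_line(**entry) for entry in entries) + "\n"


if __name__ == "__main__":
    print(manifest_text([
        {"filename": "thesis.pdf", "bundle": "ORIGINAL", "primary": True},
        {"filename": "license.txt", "bundle": "LICENSE"},
    ]), end="")
```

The resulting text would be written to the file named contents inside each item directory.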
DSpace SAF (3/3) – Metadata
• The SAF uses a specific XML format for the encoding of Dublin Core style metadata.
• dublin_core.xml
• metadata_[schema].xml where [schema] is another metadata schema in your repository’s registry
• The containing element is dublin_core with a schema attribute.
• The field elements are dcvalue with element and qualifier attributes (and, optionally, a language attribute).
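The metadata file is simple enough to generate with a few lines of code. Below is a minimal sketch that renders a dublin_core element with dcvalue children carrying element and qualifier attributes; the field values are invented for illustration, and writing "none" for an absent qualifier follows common SAF examples rather than anything stated on the slide.

```python
import xml.etree.ElementTree as ET


def saf_metadata_xml(fields, schema="dc"):
    """Render SAF metadata XML from (element, qualifier, value) tuples.

    For schema "dc" the result belongs in dublin_core.xml; for any other
    schema it belongs in metadata_[schema].xml.
    """
    root = ET.Element("dublin_core", {"schema": schema})
    for element, qualifier, value in fields:
        dcvalue = ET.SubElement(root, "dcvalue", {
            "element": element,
            "qualifier": qualifier or "none",
        })
        dcvalue.text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(saf_metadata_xml([
        ("title", None, "An Example Title"),
        ("contributor", "author", "Doe, Jane"),
    ]))
```

ElementTree escapes characters such as & and < in the values, which avoids a common source of malformed import files.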
Example imports…
• Provided are some rough code examples that will parse a CSV metadata file (and associated content files) or a MARC XML file (and associated content files).
• The code examples are in Java and best comprehended in a nicely configured development environment, but we can work with them using jEdit and the command line.
• We will conduct these imports into the repository and consider the advantages and disadvantages of the approach.
An example import: CSV
• Create the import processor application in your C:\Development\SAFCreator directory
• mvn clean package
• Run it with java -jar target\SAFCreator-0.0.1-SNAPSHOT.one-jar.jar
• You will be presented with a Java Swing interface where you can specify a CSV metadata file, a directory for source files, a directory for SAF output, and other details for the batch.
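The core of any CSV-driven import is mapping column headers to metadata fields. The sketch below is not the SAFCreator implementation (which is written in Java and whose CSV layout is not described here); it is an independent Python illustration that assumes a hypothetical layout with a "filename" column and headers of the form schema.element or schema.element.qualifier.

```python
import csv
import io


def parse_field_name(header):
    """Split 'dc.title' or 'dc.contributor.author' into its parts."""
    parts = header.strip().split(".")
    if len(parts) == 2:
        return parts[0], parts[1], None
    if len(parts) == 3:
        return parts[0], parts[1], parts[2]
    raise ValueError(f"unrecognised metadata column: {header!r}")


def rows_to_items(csv_text, filename_column="filename"):
    """Turn CSV rows into dicts holding a file name and metadata fields.

    Each field is a (schema, element, qualifier, value) tuple; empty
    cells are skipped.
    """
    items = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        fields = []
        for header, value in row.items():
            if header is None or header == filename_column or not value:
                continue
            schema, element, qualifier = parse_field_name(header)
            fields.append((schema, element, qualifier, value))
        items.append({"file": row[filename_column], "fields": fields})
    return items


if __name__ == "__main__":
    sample = ('filename,dc.title,dc.contributor.author\n'
              'report.pdf,Annual Report,"Doe, Jane"\n')
    for item in rows_to_items(sample):
        print(item["file"], item["fields"])
```

Grouping the resulting fields by schema shows which metadata files each item directory needs.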
An example import: CSV
• Import the SAF as follows:
• c:\dspace\bin\dspace import -a -e [email protected] -s c:\Development\SAF\test-output -c 123456789/2 -m c:\Development\SAF\test-output\map.map
An example import: MARC XML
• This example may be found in the import/marc directory
• Create the program with
• javac -sourcepath . *.java
• jar cfm xslimporter.jar manifest.mf *.class
• Run with
• java -jar xslimporter.jar
To see a common import difficulty, attempt an import as we did for the CSV example.
• This will result in some schema-related errors, a very common problem when doing imports.
An example import: MARC XML
• Add the following to a new thesis metadata schema and re-attempt the import.
• degree.name
• degree.level
• degree.discipline
• degree.department
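Schema-related errors of this kind can be caught before running the import by comparing the fields in each metadata file against the fields known to the repository. The following sketch assumes the registry is supplied by the caller as a plain set of dotted field names (how that set is obtained is left open), and that a missing schema attribute means "dc"; the sample XML is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata_thesis.xml content for one item.
SAMPLE_THESIS_XML = """
<dublin_core schema="thesis">
  <dcvalue element="degree" qualifier="name">Master of Science</dcvalue>
  <dcvalue element="degree" qualifier="level">Masters</dcvalue>
</dublin_core>
"""


def unregistered_fields(metadata_xml, registry):
    """List dotted field names used in the XML but absent from registry."""
    root = ET.fromstring(metadata_xml)
    schema = root.get("schema", "dc")  # assumption: default schema is dc
    missing = []
    for dcvalue in root.findall("dcvalue"):
        name = f"{schema}.{dcvalue.get('element')}"
        qualifier = dcvalue.get("qualifier")
        if qualifier and qualifier != "none":
            name += f".{qualifier}"
        if name not in registry and name not in missing:
            missing.append(name)
    return missing


if __name__ == "__main__":
    known = {"thesis.degree.name"}
    print(unregistered_fields(SAMPLE_THESIS_XML, known))
```

An empty result for every metadata file in the batch suggests the registry already covers the fields being imported.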
Consider the Import Results
• Idiosyncrasies of certain field values are more apparent in different syntactic contexts.
• Different metadata origins entail different complexities in the processing.
• Importation into a digital repository is a crucial step in the life of a digital resource, as it is a chance to refine metadata, after which it can be easily transmitted via crosswalks.
• However, it is a time when metadata are at risk of loss for lack of care.
Final Thoughts on Content Transmission
• Along with preservation, one of the greatest services provided by digital repositories
• Yet, like preservation, good transmission requires constant work
• Crosswalks must be maintained to standards as well as local practices
• Our means of importing content are constantly improving but face a moving target
• New collection types inevitably require new development work if their ingestion is to be automated