Language Resource Processing Configuration and Run


Intro

This memo describes the steps to configure and run language resource processing. It is intended for internal use only.

Architecture overview

Main components

There are three main components involved in language resource processing:

The Resource Server (hereafter RS) manages information about resources, their status, and associated files.

The Workflow Server (hereafter WS) is responsible for processing resource input files into output files that are loaded into the Virtuoso server. The WS is implemented using Oozie and Hadoop.

Processing components provided by DERI and the other participants.

Data and Processing Flow

The following diagram shows the communication between the WS and the RS while a resource is being processed:

[Figure: RS-WS-coop_v3.png]

The flow is started by the administrator with an HTTP call to the RS REST API. The call URL contains the resource ID as a parameter. Example:

POST /resources/48957c5d-456c-4d7a-abc9-3062c91dafdd/processed

The first step in the processing is done by the RS. It downloads the resource input file and uploads it to the SCP server under the name ${resource_id}.ext.

The resource server then selects the flow by resource type, sets the flow properties, and starts the flow using the Oozie web-service API.

Oozie executes the flow, which contains data-moving steps and the execution of the resource processing components. The penultimate step of the flow is the loading of the data into the Virtuoso server, which is done by the miniLoader java action.

The last step in the Oozie flow is the notification of the resource server about the Virtuoso load status. The resource server then notifies the LRPMA about the processing status.

Processing set-up overview

The whole processing is configured by the following steps:

resource type definition

registration of resource

definition of workflow

Processing set-up

Definition of the resource type

First, it is necessary to create a resource type using the resource server. Creating the resource type is done with an HTTP POST request, so it is possible to do it either with a command-line HTTP tool like curl or with a REST client. The following text contains screenshots from the Postman REST client for illustration. Besides the screenshots, the request parameters are also given in tables, because they are easier to read (and copy & paste). The HTTP header Content-Type should be set to application/json.

The resource server address is http://54.201.101.125:9999. Suppose that it is necessary to process resources provided by Paradigma ltd. that contain a lexicon, so the result of the processing will be one graph.

Request

POST http://54.201.101.125:9999/resourcestypes

Example body

{
  "id": "paradigma",
  "description": "type intended for processing of resources provided by Paradigma",
  "graphsSuffixes": ["lexicon"]
}

Example response

{
  "id": "paradigma"
}
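The same request can be issued from the command line with curl, for example:

    curl -X POST -H "Content-Type: application/json" \
         -d '{"id":"paradigma","description":"type intended for processing of resources provided by Paradigma","graphsSuffixes":["lexicon"]}' \
         http://54.201.101.125:9999/resourcestypes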

The resource type defines which workflow is used for processing the resource, and the resource type ID is used as the name of the subfolder on HDFS that holds the Oozie workflow.

Registration of the resource

The language resource should be registered in the resource server. Normally this is done via the LRPMA, but it is possible to do it manually for test purposes using the resource server REST API.

Request

POST http://54.201.101.125:9999/resources

Example body

{
  "id": "48957c5d-456c-4d7a-abc9-3062c91dafE0",
  "resourceType": "paradigma",
  "downloadUri": "scp://[email protected]/home/ubuntu/ParadigmaData/hotel_ca_tricks.csv",
  "credentials": "-----BEGIN RSA PRIVATE KEY----- ...",
  "language": "ca",
  "domain": "hotel",
  "provider": "Paradigma ltd",
  "licence": "LRGPL",
  "graphNamesPrefix": "http://www.eurosentiment.com/hotel/ca/lexicon/paradigma/"
}

Example response

{
  "id": "48957c5d-456c-4d7a-abc9-3062c91dafE0"
}
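Again with curl, assuming the example body above is saved in a file named resource.json (a hypothetical file name):

    curl -X POST -H "Content-Type: application/json" -d @resource.json http://54.201.101.125:9999/resources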

Definition of Workflow

Processing steps are defined by an XML workflow file that should be copied to the Hadoop Distributed File System (HDFS), to the location configured in the Resource Server configuration. The flow contains actions; every action defines the next action in case of its success. Properties populated by the resource server are used in the workflow definition XML files.

Properties of flows populated by the Resource Server:

Properties calculated or retrieved from the resource properties:

Property                              Description
rsresourceid                          ID of the resource
rsgraphprefix                         prefix for graphs; please see the miniLoader java action description below
rsgraphsufix0, [rsgraphsufix1], ...   graph suffixes, one for each file produced by the flow
rsdomain                              domain of the processed resource
rslanguage                            language of the processed resource
rsprovider                            provider
rslicense                             license
oozie.wf.application.path             ${hdfs-folder-uri}/${resourceTypeId}; hdfs-folder-uri is specified in conf.properties of the RS, resourceTypeId is a property of the resource on the RS

The resource server also copies properties from the resource server configuration file conf/job.properties to the flow properties. This can be used for properties common to all flows, such as:

Property                Description
nameNode                HDFS name node address
jobTracker              MapReduce job tracker address
queueName               MapReduce jobs queue name
user.name               user used to run the Oozie flow
inputfolder             folder where downloaded resource files are stored
rspfilesdir             folder for processed files
rsvirtuosoloadfolder    absolute path to the folder where files for loading are stored
rsvirtuosohost          hostname or address of the Virtuoso server
rsvirtuosojdbcport      JDBC port
rsvirtuosojdbcuser      JDBC user
rsvirtuosojdbcpasswd    JDBC password
rsprocessedurl          URL to send the result of the Virtuoso load to

Example:
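A minimal conf/job.properties could look like the following sketch. All values are illustrative placeholders (1111 and dba are Virtuoso's defaults); note that trailing slashes matter, because the flows concatenate these values directly with file names:

    nameNode=hdfs://namenode.example.com:8020
    jobTracker=jobtracker.example.com:8021
    queueName=default
    user.name=ubuntu
    inputfolder=/home/ubuntu/inputs/
    rspfilesdir=/home/ubuntu/processed
    rsvirtuosoloadfolder=/home/ubuntu/virtuoso-load/
    rsvirtuosohost=virtuoso.example.com
    rsvirtuosojdbcport=1111
    rsvirtuosojdbcuser=dba
    rsvirtuosojdbcpasswd=secret
    rsprocessedurl=http://54.201.101.125:9999/resources/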

Configuring Actions

Workflows usually contain the following sequence (a skeleton is sketched after the list):

Move of data to a place where it can be reached by the first processing component

Processing by the first component

Move of data to a place where it can be reached by the second processing component

Processing by the second component

...

Load to the Virtuoso triple store
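In the Oozie workflow XML this sequence becomes a chain of actions; a skeleton could look like the following sketch (the action names are illustrative assumptions, and the kill message is the one used in Appendix A):

    <workflow-app name="paradigma-flow" xmlns="uri:oozie:workflow:0.2">
        <start to="move2processing"/>

        <action name="move2processing">
            <!-- ssh, java, or shell action definition goes here -->
            <ok to="processing"/>
            <error to="fail"/>
        </action>

        <!-- further actions, each pointing to its successor via <ok to="..."/> -->

        <kill name="fail">
            <message>SSH action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
        </kill>
        <end name="end"/>
    </workflow-app>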

Moving the resource file to the processing components

The following snippet shows an example configuration of the first step in the flow, which moves the resource files to a folder where they can be picked up by a processing component.

ubuntu@ptwf ${moveScriptPath} -onlyCopy ${inputfolder}${rsresourceid}* ubuntu@ptnuig:/home/ubuntu/data/${rsresourceid}.csv
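The snippet above lists the ssh action's host, the command, and its arguments; wrapped in the standard Oozie ssh-action markup it could look like this sketch (the action and transition names are illustrative assumptions):

    <action name="move2processing">
        <ssh xmlns="uri:oozie:ssh-action:0.1">
            <host>ubuntu@ptwf</host>
            <command>${moveScriptPath}</command>
            <args>-onlyCopy</args>
            <args>${inputfolder}${rsresourceid}*</args>
            <args>ubuntu@ptnuig:/home/ubuntu/data/${rsresourceid}.csv</args>
        </ssh>
        <ok to="processing"/>
        <error to="fail"/>
    </action>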

Configuring processing

The following XML snippet shows an example of processing by the Lemon Marl generator.

ubuntu@ptnuig ~/bin/runLemonMarlGeneratorParadigma.sh /home/ubuntu/data/${rsresourceid}.csv /home/ubuntu/data/outputs/${rsresourceid}.ttl ${rsdomain} ${rslanguage} ${rsgraphprefix}${rsgraphsufix0}

Moving data to the Virtuoso Server

The following XML snippet shows an action which moves the output of the previous step to the Virtuoso server.

ubuntu@ptnuig ${moveScriptPath} /home/ubuntu/data/outputs/${rsresourceid}.ttl ${virtuosoUser}@${rsvirtuosohost}:${rsvirtuosoloadfolder}${rsresourceid}.ttl

Load data to the Virtuoso Server

The following XML snippet shows an example configuration of the miniLoader component, which is used to load the processed resource files into the Virtuoso server.

${jobTracker} ${nameNode} mapred.job.queue.name ${queueName} com.sindice.miniloader.Miniloader ${rsvirtuosohost} ${rsvirtuosojdbcport} ${rsvirtuosojdbcuser} ${rsvirtuosojdbcpasswd} ${rsvirtuosoloadfolder}${rsresourceid}.ttl ${rsgraphprefix}${rsgraphsufix0}
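In the standard Oozie java-action markup the values above could look like the following sketch. The action name load2virtuoso and the <capture-output/> element are implied by the notification step below, which reads wf:actionData('load2virtuoso'); the transition names are illustrative assumptions:

    <action name="load2virtuoso">
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <main-class>com.sindice.miniloader.Miniloader</main-class>
            <arg>${rsvirtuosohost}</arg>
            <arg>${rsvirtuosojdbcport}</arg>
            <arg>${rsvirtuosojdbcuser}</arg>
            <arg>${rsvirtuosojdbcpasswd}</arg>
            <arg>${rsvirtuosoloadfolder}${rsresourceid}.ttl</arg>
            <arg>${rsgraphprefix}${rsgraphsufix0}</arg>
            <capture-output/>
        </java>
        <ok to="notify-rs"/>
        <error to="fail"/>
    </action>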

Notifying the resource server

The last step notifies the RS that the data was loaded into the Virtuoso server.

${jobTracker} ${nameNode} curl -H Content-Type:application/json -X POST -d ${wf:actionData('load2virtuoso')['miniloader_json4rs']} ${rsprocessedurl}${rsresourceid}/processed
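Since this step runs curl on the cluster, it is presumably a shell action; a sketch of the markup (the action and transition names are illustrative assumptions):

    <action name="notify-rs">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>curl</exec>
            <argument>-H</argument>
            <argument>Content-Type:application/json</argument>
            <argument>-X</argument>
            <argument>POST</argument>
            <argument>-d</argument>
            <argument>${wf:actionData('load2virtuoso')['miniloader_json4rs']}</argument>
            <argument>${rsprocessedurl}${rsresourceid}/processed</argument>
        </shell>
        <ok to="end"/>
        <error to="fail"/>
    </action>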

Copy the configuration to HDFS

The property hdfs-folder-uri in the conf.properties RS configuration file defines the path where the configuration should be stored.

The resource type ID (paradigma) is part of the HDFS path, so it is first necessary to check whether the corresponding folder exists:
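For example, by listing the flows folder used in the put commands below:

    hadoop fs -ls /user/ubuntu/nuig-flows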

If the folder for the given resource type does not exist yet, it is necessary to create it, for example as sketched below. Then it is necessary to copy the workflow and the required jars. In this case only the miniloader jar is required, and it should be copied to the lib subfolder.
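A sketch of creating the folder together with its lib subfolder (on Hadoop 1.x, -mkdir creates missing parent folders automatically; on newer versions add the -p flag):

    hadoop fs -mkdir /user/ubuntu/nuig-flows/paradigma/lib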

hadoop fs -put workflow.xml /user/ubuntu/nuig-flows/paradigma
hadoop fs -put ~/virtuoso-miniloader-0.0.1-SNAPSHOT.jar /user/ubuntu/nuig-flows/paradigma/lib

Processing Resources

Processing is started by an HTTP POST request to the RS server with an empty body.
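For example, with curl, for the resource registered above:

    curl -X POST http://54.201.101.125:9999/resources/48957c5d-456c-4d7a-abc9-3062c91dafE0/processed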

It is possible to check the status of the processing using the Oozie web console:

Clicking the line of the running job opens the detail window.

When the processing has finished, all steps should have status OK.

When the resource is processed successfully, it is possible to make a SPARQL request to verify the content.
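For example, one can count the triples in the loaded graph via the Virtuoso SPARQL endpoint. This is a sketch: 8890 is Virtuoso's default HTTP port, <rsvirtuosohost> stands for the configured Virtuoso host, and the graph name is the graphNamesPrefix of the resource followed by the graph suffix, exactly as passed to the miniLoader:

    curl -G 'http://<rsvirtuosohost>:8890/sparql' \
         --data-urlencode 'query=SELECT (COUNT(*) AS ?triples) FROM <http://www.eurosentiment.com/hotel/ca/lexicon/paradigma/lexicon> WHERE { ?s ?p ?o }'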

Appendix A: example of the whole flow definition

The listing below shows the parameters of each action of the example flow, in execution order.

ubuntu@ptwf ${moveScriptPath} -onlyCopy ${inputfolder}${rsresourceid}* ubuntu@ptnuig:/home/ubuntu/data/${rsresourceid}.csv

ubuntu@ptnuig ~/bin/runLemonMarlGeneratorParadigma.sh /home/ubuntu/data/${rsresourceid}.csv /home/ubuntu/data/outputs/${rsresourceid}.ttl ${rsdomain} ${rslanguage} ${rsgraphprefix}${rsgraphsufix0}

ubuntu@ptnuig ${moveScriptPath} /home/ubuntu/data/outputs/${rsresourceid}.ttl ${virtuosoUser}@${rsvirtuosohost}:${rsvirtuosoloadfolder}${rsresourceid}.ttl

${jobTracker} ${nameNode} mapred.job.queue.name ${queueName} com.sindice.miniloader.Miniloader ${rsvirtuosohost} ${rsvirtuosojdbcport} ${rsvirtuosojdbcuser} ${rsvirtuosojdbcpasswd} ${rsvirtuosoloadfolder}${rsresourceid}.ttl ${rsgraphprefix}${rsgraphsufix0}

${jobTracker} ${nameNode} curl -H Content-Type:application/json -X POST -d ${wf:actionData('load2virtuoso')['miniloader_json4rs']} ${rsprocessedurl}${rsresourceid}/processed

${jobTracker} ${nameNode} mkdir ${rspfilesdir}/${rsresourceid}

${jobTracker} ${nameNode} mv ${rsvirtuosoloadfolder}${rsresourceid}.ttl ${rspfilesdir}/${rsresourceid}

The flow ends with a kill node with the message: SSH action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]