Working Group: Practical Policy
Rainer Stotzka, Reagan Moore
Agenda

Thursday March 27, 2014, 3:30-5:00 PM
- Introduction to policy-based data management
- Discussion of data policy manager for EUDAT (Mark van de Sanden)
- Presentation on natural language rule processing (Chitta Baral)
- Initial presentation of summary of policies across data centers and research projects (Jewel Ward)

Friday March 28, 2014, 11:00 AM-12:30 PM
- Discussion of policy summary
- Identification of best practices
- Discussion of policy testing – interoperability testbed
- Integration with deliverables from other working groups
  - Persistent identifiers
  - Linked-data – HIVE
  - Type registry
  - Data Foundation and Terminology
  - Preservation interest group
Practical Policy Working Group Focuses:
- Identify the most important policies
- Practical implementations for managing research data collections
- Provide recommendations for a “starter kit”
- Testbeds:
  - Evaluate standard policies
  - Test interoperability across WGs

Policy: Assertion or assurance that is enforced about a collection or a dataset
Concept Graph by Reagan Moore

[Figure: concept graph of policy-based data management, shown in a short and an expanded version.]
- Relation labels: Defines, Controls, Updates, Has, HasFeature, HasSubType, Isa, Chains, Invokes
- Core concepts: Collection, Purpose, Property, Policy, Procedure, Persistent State Information
- Property-related labels: Completeness, Correctness, Consensus, Consistency, Integrity, Authenticity, Access control, Attribute
- Policy-related labels: Client Action, Periodic Assessment, Criteria, Policy Enforcement Point, Replication Policy, Checksum Policy, Quota Policy, Data Type Policy
- Procedure-related labels: Workflow, Function, Operation, GetUserACL, SetDataType, SetQuota, DataObjRepl, SysChksumDataObj
- State-related labels: Digital Object, DATA_ID, DATA_REPL_NUM, DATA_CHECKSUM
Policy Categories

[Figure: "Collection-based Policies" at the center, surrounded by the policy categories below.]
- Integrity
- Data Lifecycle Management
- Data Staging
- Federation
- Description
- Publication
- Compliance
- Data Management Plans
- Access Control
- Preservation
- Provenance
- Replication
- Regulatory
- Management
- Administrative
- Assessment
Management
- List of policies in the RDA Wiki
- Monthly telephone conferences (RDA)
  - “Policy of the month”
  - Review of policies that have been submitted
- 54 persons registered

Testbeds
- iRODS: Renaissance Computing Institute; E-iRODS; DataNet Federation Consortium (DFC)
- dCache: Institute of Physics of the Academy of Sciences; CESNET
- Dataverse: Odum Institute
Interactions with other WGs
- Data Foundation and Terminology WG: Discussion of a vocabulary for operations
- Preservation Infrastructure IG: Policies for preservation
- Persistent Identifiers: Properties versus operations on identifiers
- Data Citation WG
- Type registry
- Metadata: Linked-data vocabularies
Peisar – Storage Policies at CESNET
EUDAT Data Policy Manager
Natural Language Rule Processing

Why?
- Users or domain experts need not learn the syntax of the rule language. They specify their rules using natural language.

How?
- A natural language specification of rules is translated to rules in the syntax of the rule language, in two steps:
  - Step 1: Natural language to an intermediate language (the focus is on correct translation of natural language and on dealing with the challenges and quirkiness of natural language)
  - Step 2: Intermediate language to rule language (this should be more straightforward, as both languages are formal languages and the intermediate language has a very restricted vocabulary)
- Our focus in this presentation is on Step 1.
Underlying Technical Approach

Montague’s approach:
- The meanings of words and phrases are lambda calculus formulas
- The meaning (or translation) of a sentence is obtained by combining the meanings of its words and phrases
  - Usually as dictated by a grammar
  - Categorial grammars (especially CCG) are often used, as they give directionality regarding how to combine
NL to Policy Example

Print financial report [S]: print(report(finance))
  Print [S/NP]: λz. print(z)
  financial report [NP]: report(finance)
    financial [NP/N]: λy. y@finance
    report [N]: λx. report(x)

Reduction of "financial report":
  (λy. y@finance) @ (λx. report(x))
  = (λx. report(x)) @ finance
  = report(finance)
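The derivation above can be mimicked with ordinary function application. The following is a minimal Python sketch, assuming constants are modelled as strings and functional meanings as Python lambdas; the helper `app` and the variable names are illustrative only and are not part of NL2KR.

```python
# Constants are strings; functional word meanings are Python lambdas.
# app(f, a) stands for the slide notation f@a (apply f to a).

def app(f, a):
    return f(a)

print_sem = lambda z: f"print({z})"          # Print [S/NP]: λz. print(z)
financial_sem = lambda y: app(y, "finance")  # financial [NP/N]: λy. y@finance
report_sem = lambda x: f"report({x})"        # report [N]: λx. report(x)

# financial report [NP]: (λy. y@finance) @ (λx. report(x)) = report(finance)
financial_report = app(financial_sem, report_sem)

# Print financial report [S]: (λz. print(z)) @ report(finance)
sentence = app(print_sem, financial_report)
print(sentence)
```

The CCG categories appear only in comments here; a real system would use them to decide which constituent is the function and which is the argument.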
Illustration of Montague’s approach using CCG and λ-calculus

Every boxer walks.

Phrase | CCG category | Meaning
Every_boxer_walks | S | ∀x. (boxer(x) → walk(x))
Every_boxer | S/(S\NP) | λv. ∀x. (boxer(x) → v@x)
Every | (S/(S\NP))/NP | λu. λv. ∀x. (u@x → v@x)
boxer | Noun | λy. boxer(y)
walks | S\NP | λz. walk(z)
The Key Issue(s)

Where do we get the lambda expressions from?
- Handcrafting them is not scalable
  - Lambda expressions get complex in a hurry, and handcrafting creates a bottleneck
  - Too many words
  - Since the target language is not unique, we cannot painstakingly make new dictionaries for each target language
  - Target languages evolve

Other standard issues
- Ambiguity: multiple meanings of words; word sense disambiguation; etc.
Inverse Lambda operators

How to get the lambda expressions?
- How did we learn natural languages? Often:
  - We know the meaning of a sentence
  - We know the meaning of most of the individual words in that sentence
  - But we do not know a priori the meaning of some particular word(s) in that sentence
  - We are able to correctly guess the meaning of those words
- Follow a similar approach
  - Given a set of training examples and an initial dictionary, learn the lambda expressions for the words in those examples that are not in the dictionary
Inverse λ Example

Every boxer walks.

Phrase | CCG category | Meaning
Every_boxer_walks | S | ∀x. (boxer(x) → walk(x))
Every_boxer | S/(S\NP) | λv. ∀x. (boxer(x) → v@x)
Every | (S/(S\NP))/NP | λu. λv. ∀x. (u@x → v@x)
boxer | Noun | λy. boxer(y)
walks | S\NP | λz. walk(z)
Inverse λ – another Example

Print financial report [S]: print(report(finance))
  Print [S/NP]: λz. print(z)
  financial report [NP]: report(finance)
    financial [NP/N]: λy. y@finance
    report [N]: λx. report(x)
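The idea of working backwards from a known sentence meaning can be illustrated with a small first-order sketch. This is an assumption-laden simplification and not NL2KR's actual Inverse-λ operators: terms are nested tuples, a function body marks its bound variable with the placeholder `VAR`, and `solve_argument` (a hypothetical helper) finds the argument F such that applying the function to F gives the target meaning. It does not cover the higher-order case where the unknown meaning is itself a function, such as λy. y@finance.

```python
VAR = "?"  # placeholder for the bound variable inside a function body

def contains_var(term):
    """True if the placeholder occurs anywhere in the term."""
    if term == VAR:
        return True
    return isinstance(term, tuple) and any(contains_var(a) for a in term[1:])

def solve_argument(body, target):
    """Find F such that replacing VAR by F in `body` yields `target`.
    Returns None when no such F exists."""
    if body == VAR:
        return target
    if not (isinstance(body, tuple) and isinstance(target, tuple)):
        return None
    if len(body) != len(target) or body[0] != target[0]:
        return None
    found = None
    for b, t in zip(body[1:], target[1:]):
        if contains_var(b):
            f = solve_argument(b, t)
            if f is None or (found is not None and f != found):
                return None
            found = f
        elif b != t:
            return None
    return found

def show(term):
    """Render a tuple term in the slide notation, e.g. report(finance)."""
    if isinstance(term, str):
        return term
    return f"{term[0]}({', '.join(show(a) for a in term[1:])})"

# Known: sentence meaning print(report(finance)) and Print = λz. print(z).
sentence = ("print", ("report", "finance"))
print_body = ("print", VAR)
print(show(solve_argument(print_body, sentence)))
```

Given the sentence meaning and the meaning of "Print", the sketch recovers the meaning that "financial report" must have.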
Another Example

Send email to curator of the collection [S]: send(email, curator(collection))
  Send email [S/NP]: λz. send(email, z)
    Send [(S/NP)/NP]: λy. λz. send(y, z)
    email [NP]
  to curator of the collection [NP]: curator(collection)
    to [NP/NP]: λx. x
    curator of the collection [NP]: curator(collection)
      curator [NP]: λx. curator(x)
      of the collection [NP\NP]: λy. y@collection
        of [(NP\NP)/NP]: λx. λy. y@x
        the collection [NP]: collection
NL2KR System Architecture

[Figure: NL2KR system architecture with two components, NL2KR-L and NL2KR-T.]
NL2KR-L System Learning Process
- Generate all parse trees of the sentences
- Learn lexicon using Inverse-λ and Generalization
- Generalize complete lexicon
- Parameter Estimation
NL2KR-T System Translation Process
- Generate all parse trees of the sentences
- Generalize the missing meanings of words and recompute the parse trees
- Use PCCG to rank the translations
Current Status
- We have a prototype that translates English descriptions of policy rules to a formal representation
- Working towards making it usable in iRODS
  - Step 1: English to a formal policy specification (in an intermediate language)
  - Step 2: Formal policy specification to rules (in a lower-level language)
Illustration: Training Data Set

Policy | IPDL Translation
Generate audit_trail for all changes to rules | generate(audit_trail(changes(rules)))
Transfer ownership to rods | transfer(ownership, rods)
Generate report listing all preservation_attributes | generate(report(list(preservation_attributes)))
Migrate files to new storage | migrate(files, storage(new))
Protect the integrity of Data_folder | protect(integrity(data_folder))
Generate audit_trail for notifications on problems | generate(audit_trail(notifications(problems)))
Create AIP template from SIP template | create(template(aip); template(sip))
Create rule based-on AIP template | create(rule; template(aip))
On deletion of files from collection erase metadata | When deletion(collection(files)); do erase(metadata)
Generate report summarizing information of micro_services | generate(report(summary(information(micro_services))))
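Processing such translations further (Step 2) first requires reading them into a tree. The sketch below is a small recursive-descent parser for the nested functional terms in the table. The slides do not give the IPDL grammar, so this assumes only the `name(arg, ...)` shape seen in the rows, with `,` or `;` as argument separators; the "When ...; do ..." form is not handled, and `parse_term` is a hypothetical helper name.

```python
def parse_term(text):
    """Parse a nested term such as 'migrate(files, storage(new))' into
    (name, [arguments]). Arguments may be separated by ',' or ';'."""
    pos = 0

    def skip_ws():
        nonlocal pos
        while pos < len(text) and text[pos].isspace():
            pos += 1

    def term():
        nonlocal pos
        skip_ws()
        start = pos
        while pos < len(text) and (text[pos].isalnum() or text[pos] == "_"):
            pos += 1
        name = text[start:pos]
        if not name:
            raise ValueError(f"expected a name at position {start}")
        args = []
        skip_ws()
        if pos < len(text) and text[pos] == "(":
            pos += 1  # consume '('
            while True:
                args.append(term())
                skip_ws()
                if pos < len(text) and text[pos] in ",;":
                    pos += 1  # consume separator and read the next argument
                    continue
                if pos < len(text) and text[pos] == ")":
                    pos += 1  # consume ')'
                    break
                raise ValueError(f"expected ',', ';' or ')' at position {pos}")
        return (name, args)

    result = term()
    skip_ws()
    if pos != len(text):
        raise ValueError(f"unexpected trailing input at position {pos}")
    return result

print(parse_term("migrate(files, storage(new))"))
```

Each node is a `(name, arguments)` pair, so a later translation step could walk the tree and emit rule-language text for each operation name.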
Illustration: Initial Lexicon

Word | CCG category | Semantics
Transfer | (S\NP)/NP | λx. λy. transfer(x,y)
ownership | N | ownership
rods | N | rods
all | NP/N; N/N; NP/NP | λx. x
the | NP/N; N/N; NP/NP | λx. x
Generate | (S\NP)/NP | λx. generate(x)
Illustration: Iteration 1 of Inverse λ
Illustration: Lexicon after parameter estimation

Word | CCG category | Semantics | Weight
Transfer | (S\NP)/NP | λx. λy. transfer(x, y) | 0.07646726
Transfer | (S\NP)/NP | λx. λy. transfer(x@y) | -0.024746018
Transfer | (S\NP)/NP | λx. transfer(x) | -0.024746014
ownership | N | ownership | 0.07570592
ownership | N | λx. ownership(x) | -0.023981703
rods | N | rods | 0.07493635
rods | N | λx. rods(x) | -0.023250459
to | (NP\(S\NP))/NP | λy. λx. x@y | 0.10719467
to | (NP\NP)/NP | λy. λx. x@y | -0.0895291
report | N | report | -0.08859752
report | N | λx. report(x) | 0.105146274
Protect | (S\NP)/NP | λx. λy. protect(x, y) | 0.07548905
Protect | (S\NP)/NP | λx. λy. protect(x@y) | -0.024013432
Protect | (S\NP)/NP | λx. protect(x) | -0.024013432
listing | (NP\(S\NP))/NP | λy. λx. x@list(y) | 0.009448431
… | … | … | …
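To show how such weights could be used, the sketch below picks the highest-weight semantics for each word. This is a deliberate simplification: per the slides, NL2KR-T ranks whole translations with PCCG rather than choosing words independently. The pairing of semantics and weights assumes they are listed in the same order on the slide, and `best_semantics` is a hypothetical helper.

```python
# A subset of the lexicon after parameter estimation:
# word -> list of (semantics, weight) candidates.
lexicon = {
    "Transfer": [
        ("λx. λy. transfer(x, y)", 0.07646726),
        ("λx. λy. transfer(x@y)", -0.024746018),
        ("λx. transfer(x)", -0.024746014),
    ],
    "ownership": [
        ("ownership", 0.07570592),
        ("λx. ownership(x)", -0.023981703),
    ],
    "rods": [
        ("rods", 0.07493635),
        ("λx. rods(x)", -0.023250459),
    ],
    "report": [
        ("report", -0.08859752),
        ("λx. report(x)", 0.105146274),
    ],
}

def best_semantics(word):
    """Return the candidate semantics with the largest weight for a word."""
    semantics, _weight = max(lexicon[word], key=lambda entry: entry[1])
    return semantics

for word in lexicon:
    print(word, "->", best_semantics(word))
```

Note that for "report" the functional reading λx. report(x) outweighs the constant reading, which matches its use in report(finance) and report(list(...)) in the training data.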
NL2KR Webpage
[Screenshot]

NL2KR Download Page
[Screenshot]

From the NL2KR manual
[Screenshot]
Natural Language Rule Processing: Conclusion
- Described an approach to translate natural language (NL) specifications to an intermediate (formal) language, which can then be translated to rules.
- Theory: Augmented Montague’s lambda-calculus-based approach with Inverse-Lambda-based learning.
- System: Developed the NL2KR system.
  - Used the NL2KR system to build a translation system from NL to Intermediate Policy Description Language.
  - The NL2KR system can be used for developing translation systems from natural language to other formal languages.
    - It has been evaluated in domains such as Geoquery, Robocup language, puzzles, and biology questions.
Invitation

We are seeking data experts and domain scientists!
- Provide policies already in use: RDA Wiki
  - Description
  - Implementation
- Express wishes about policies you might need
- Discuss and analyze policies
- Enhance the cross-over to other WGs, IGs and initiatives
Survey of 30 Institutions for Highest Priority Policies

Policy | Importance
Integrity | 217
Preservation | 150
Access control | 126
Provenance | 108
Data Management plans | 99
Publication | 75
Replication | 66
Data staging | 52
Federation | 37
Metadata sharing | 23
Regulatory | 16
Collection properties | 7
Identifiers | 7
Data sharing | 7
Versioning | 7
Licensing | 6
Format | 6
Data Life Cycle | 6
Arrangement | 5
Processing | 5
Summary of policies in production use

1. Policy for data retention. How long, how short? Need preservation, or not? (5) Retention and disposition
2. Notification policies. (Ex. must warn data researcher that their data will be deleted at X time.) (6) Notification on event
3. Transferability policies. The data must be transferable from the repository back to the researcher and the repository of origin. Or, in the event of defunding, the data must be de-accessioned and moved to another repository (or not, depending on relevant SOPs, agreements, etc.).
4. Policies re: costs and who pays for all of this data storage (8)
5. Policies around context. Sometimes the original data and additional metadata are needed. Sometimes, the context or derived data is what matters, and not the data itself. (7)
6. Policies re: tagging/annotating data
7. Search/Information Retrieval policies. What parts of the data will you search on, or not search on? (4) Controlling search
8. Standard Sys Admin policies: (1) replication, back up, (2) integrity checks, syncing with back ups.
9. Content policies: do we care what content and file formats users upload? Some do, some don't. (3) Transformative migration
10. Policy to educate researchers about all of the different policies relevant to the data repository. For example, a user agreement/Terms & conditions statement that researchers must check off.
Best Practices for production policies

Consensus on a policy
- Use at multiple institutions
- Generality

Best practice policy components
- Name of operation that policy controls
- Constraints that policy implements
- State information that policy uses or modifies
- Verification policy
- Example of running code
- Documentation
Operations managed by policies

Paper posted that lists 70 operations
- Policy-verification.docx

Candidate operations
- Access control
- Backups
- Data retention
- Descriptive metadata
- Format creation
- Integrity checks
- Notification
- Policy constraints
- Replication
- Restricted search
- Storage cost
- Tags
- Use agreements
Types of policies

Policy type | Operations
Access | Set access control; Check access control; Audit access control
Backups (time-stamped copies) | Create copy; Set timestamp; Verify timestamps
Contextual metadata | Extract metadata; Register metadata; Verify metadata
Data retention | Set retention period; Check retention; Verify retention
Disposition | Define migration location; Migrate data; Verify migration
Format requirements | Specify required format; Create format; Verify formats
Integrity checks | Set checksum; Verify checksum
Notification | Define events; Send e-mail on event; Log notices
Policy constraints by collection, researcher, funding | Select constraint; Apply constraint to policy; Verify constraints
Restricted searching | Set search limits; Execute restricted search
Signing of use agreements | Generate use form; Store agreement; Verify agreement
Storage cost tracking | Record usage; Audit usage; Generate storage cost report
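As an illustration of the "Integrity checks" row, the sketch below implements the two operations "Set checksum" and "Verify checksum" with Python's standard `hashlib`. It is a generic sketch, not the iRODS implementation: the function names, the use of SHA-256, and the dictionary `state` standing in for persistent state information are all assumptions.

```python
import hashlib

def compute_checksum(path, algorithm="sha256"):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def set_checksum(path, state):
    """'Set checksum': record the current checksum as state information."""
    state[path] = compute_checksum(path)

def verify_checksum(path, state):
    """'Verify checksum': True only if a checksum was recorded and still matches."""
    return path in state and state[path] == compute_checksum(path)
```

A periodic assessment policy could call `verify_checksum` for every file in a collection and log the paths for which it returns False.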
Replication Policy

Operation that is being controlled
- Replicate a file

Controls
- When is replication done?
  - When file is ingested
  - When file is changed
- Which files are replicated? Choose based on:
  - Collection
  - User
  - Size
- Replication properties
  - Choice of replication location
  - Choice of access controls on replica
  - Requirement for checksum
  - Verification of checksum on replica creation
- Variants:
  - Versioning of changes vs replication
  - Backups vs replication (time-stamped copy)

Verification
- When should replica existence be verified?
Policy : Operation : Constraints : State Information

Policy type: Replication

Operation | Constraints | State information
Set replica properties | When? | Default policy enforcement points
Set replica properties | Number of replicas | Default number
Set replica properties | Where is replica put? | Default replica location
Set replica properties | Which files (collection/user/size)? | Default policy selection criteria; Default criterion value
Set replica properties | Set replica access controls? | Default access control
Set replica properties | Require checksum? | Replica checksum flag
Set replica properties | When audit? | Default time period
Replicate | Delayed or immediate | Replica location; Replica creation time; Replica access control; Replica name; Replica owner; Replica number
Verify replica numbers | Periodic rule | Audit time stamp; Log of problems and actions
Replace missing replicas | | Replica location; Replica creation time; Replica access control; Replica name; Replica owner; Replica number
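The split into operation, constraints, and state information can be sketched as plain data structures. The following Python sketch is illustrative only: the class and field names are hypothetical, it performs bookkeeping without copying any data, and it does not reflect the API of iRODS or any other testbed.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ReplicationPolicy:
    number_of_replicas: int = 2             # constraint: number of replicas
    replica_location: str = "backup-store"  # hypothetical default replica location
    collection: Optional[str] = None        # selection criterion; None selects every collection

@dataclass
class DataObject:
    name: str
    collection: str
    replica_locations: List[str] = field(default_factory=list)  # state information

def applies_to(policy, obj):
    """Constraint 'Which files?': select by collection only in this sketch."""
    return policy.collection is None or obj.collection == policy.collection

def verify_replica_numbers(policy, objects):
    """'Verify replica numbers': names of selected objects with too few replicas."""
    return [o.name for o in objects
            if applies_to(policy, o)
            and len(o.replica_locations) < policy.number_of_replicas]

def replace_missing_replicas(policy, obj):
    """'Replace missing replicas': record new replica locations until the
    required number is reached (bookkeeping only, no data is copied)."""
    while len(obj.replica_locations) < policy.number_of_replicas:
        index = len(obj.replica_locations)
        obj.replica_locations.append(f"{policy.replica_location}-{index}")
```

A periodic rule would call `verify_replica_numbers` over a collection and then `replace_missing_replicas` for each object it reports, logging both steps.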
Interactions with other Working Groups
- Interoperability testbed
  - Demonstrate that RDA recommendations can be jointly implemented
- Control policies
  - Demonstrate that a desired practice can be applied consistently
- Assessment policies
  - Verify that a recommended practice is followed
- Integration
  - Demonstrate semantic consistency across systems-level integration
  - Example: are data objects considered to be immutable?
Practical Policy WG Interfaces with the other WGs

Interoperability testbed provided by Practical Policy WG
- Persistent identifiers
  - Handle system
- Metadata
  - HIVE linked-data vocabularies
- Type registry
  - Expect implementation for integration
- Data Foundation and Terminology
  - Exchange of concepts based on use cases
- Preservation interest group
  - ISO 16363 assessment policies
Proposal - Special Interest Group on Interoperability Testbeds

- The new interest group is driven by the need to have testbeds with a longer lifetime than the Practical Policy working group.
- Current testbeds
  - Dataverse
  - dCache
  - iRODS
- Testbed functions
  - Demonstrate interoperability
  - Provide a platform to evaluate proposed best practices / software
- We need working groups to provide software systems or policies for testing.
  - Need a liaison to each working group
Special Interest Group on Interoperability Testbeds

Interested participants include:
- David Antos (CESNET)
- Jon Crabtree (Dataverse)
- Marcio Faerman (OSU)
- Patrick Fuhrmann (dCache testbed, DESY)
- Thomas Jejkal (KIT Data Manager repository)
- Tibor Kalman (Persistent identifier consortium)
- Reagan Moore (DataNet Federation Consortium)
- Jakub Peisar (dCache testbed)
- Raphael Ritz (MPG)