
Implementing Database Access Control Policy from Unconstrained Natural Language Text

LAS Research Presentation

John Slankas June 24th, 2015

1

Relation Extraction slides are from Dan Jurafsky’s NLP Course on Coursera

Research Path & Publications

2

Research phases: Feasibility, Classification, Access Control Extraction, Database Model Extraction

Publications (timeline figure): POLICY 2012, NaturaLiSE 2013, ICSE Doctoral Symposium 2013, PASSAT 2013, ASE Science Journal 2013, ACSAC 2014, ESEM 2014 (2nd author), RE 2014 (3rd author), ESEM 2015 (to be submitted)

Agenda

• Motivation

• Research Goal

• Background and Related Work – focus on Relation Extraction

• Solution - Role Extraction and Database Enforcement

• Studies

• Classification

• Access Control Extraction

• Database Model Extraction & End to End Implementation

• Limitations

• Future Work

• Research Goal Evaluation & Contributions

3


2015 – The Year of the Healthcare Hack [Peterson 2015]

Two major breaches

Anthem – 80 million records

Premera – 11 million records

Experts fault Anthem for lack of robust access control [Bennett 2015] [Husain 2015] [Redhead 2015] [Westin 2015]

4

5

A Possibility…

Research Goal

Improve security and compliance by ensuring access control rules (ACRs) explicitly and implicitly defined within unconstrained natural language product artifacts are appropriately enforced within a system’s relational database.

6


Background

Access Control Rules (ACRs)

Regulate who can perform actions on resources

(subject, action, object)

Database Model Elements (DMEs)

Organization of stored data

Entities: “thing” in the real world

Attributes: property that describes an entity

Relationships: association between two entities

7

Extracting relations from text

Company report: “International Business Machines Corporation (IBM or the company) was incorporated in the State of New York on June 16, 1911, as the Computing-Tabulating-Recording Co. (C-T-R)…”

Extracted Complex Relation: Company-Founding

Company: IBM; Location: New York; Date: June 16, 1911; Original-Name: Computing-Tabulating-Recording Co.

But we will focus on the simpler task of extracting relation triples

Founding-year(IBM,1911)

Founding-location(IBM,New York)

Extracting Relation Triples from Text

The Leland Stanford Junior University, commonly referred to as Stanford University or Stanford, is an American private research university located in Stanford, California … near Palo Alto, California… Leland Stanford…founded the university in 1891

Stanford EQ Leland Stanford Junior University

Stanford LOC-IN California

Stanford IS-A research university

Stanford LOC-NEAR Palo Alto

Stanford FOUNDED-IN 1891

Stanford FOUNDER Leland Stanford

Why Relation Extraction?

Create new structured knowledge bases, useful for any app

Augment current knowledge bases

Adding words to the WordNet thesaurus, facts to FreeBase or DBPedia

Support question answering

The granddaughter of which actor starred in the movie “E.T.”?

(acted-in ?x “E.T.”) (is-a ?y actor) (granddaughter-of ?x ?y)

But which relations should we extract?

10

Automated Content Extraction (ACE)

17 relations from the 2008 “Relation Extraction Task”:

PHYSICAL: Located, Near
PART-WHOLE: Geographical, Subsidiary
PERSON-SOCIAL: Business, Family, Lasting Personal
ORG-AFFILIATION: Employment, Founder, Ownership, Student-Alum, Sports-Affiliation, Investor, Membership
GENERAL-AFFILIATION: Citizen-Resident-Ethnicity-Religion, Org-Location-Origin
ARTIFACT: User-Owner-Inventor-Manufacturer

Automated Content Extraction (ACE)

Physical-Located PER-GPE

He was in Tennessee

Part-Whole-Subsidiary ORG-ORG

XYZ, the parent company of ABC

Person-Social-Family PER-PER

John’s wife Yoko

Org-AFF-Founder PER-ORG

Steve Jobs, co-founder of Apple…

12

Databases of Wikipedia Relations

13

Relations extracted from Infobox

Stanford state California

Stanford motto “Die Luft der Freiheit weht”

Wikipedia Infobox

Relation databases that draw from Wikipedia

Resource Description Framework (RDF) triples

subject predicate object

Golden Gate Park location San Francisco

dbpedia:Golden_Gate_Park dbpedia-owl:location dbpedia:San_Francisco

DBPedia: 1 billion RDF triples, 385 million from English Wikipedia

Frequent Freebase relations: people/person/nationality, location/location/contains, people/person/profession, people/person/place-of-birth, biology/organism_higher_classification, film/film/genre

14

Ontological relations

IS-A (hypernym): subsumption between classes

Giraffe IS-A ruminant IS-A ungulate IS-A mammal

IS-A vertebrate IS-A animal…

Instance-of: relation between individual and class

San Francisco instance-of city

Examples from the WordNet Thesaurus

How to build relation extractors

1. Hand-written patterns

2. Supervised machine learning

3. Semi-supervised and unsupervised

Bootstrapping (using seeds)

Distant supervision

Unsupervised learning from the web

Rules for extracting IS-A relation

Early intuition from Hearst (1992)

“Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use”

What does Gelidium mean?

How do you know?


Hearst’s Patterns for extracting IS-A relations

(Hearst, 1992): Automatic Acquisition of Hyponyms

“Y such as X ((, X)* (, and|or) X)”

“such Y as X”

“X or other Y”

“X and other Y”

“Y including X”

“Y, especially X”

Hearst’s Patterns for extracting IS-A relations

Hearst pattern Example occurrences

X and other Y ...temples, treasuries, and other important civic buildings.

X or other Y Bruises, wounds, broken bones or other injuries...

Y such as X The bow lute, such as the Bambara ndang...

Such Y as X ...such authors as Herrick, Goldsmith, and Shakespeare.

Y including X ...common-law countries, including Canada and England...

Y , especially X European countries, especially France, England, and Spain...
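To make the pattern idea concrete, here is a small illustrative Java sketch (not from the presentation) that applies the “Y such as X” pattern with a regular expression; real extractors match over noun-phrase chunks rather than raw words.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy illustration of one Hearst pattern ("Y such as X").
public class HearstPatternDemo {
    // Up to two words before an optional comma and "such as", then one capitalized word.
    private static final Pattern Y_SUCH_AS_X =
        Pattern.compile("(\\w+(?: \\w+)?),? such as (\\w+)");

    public static void main(String[] args) {
        String text = "Agar is a substance prepared from a mixture of red algae, "
                + "such as Gelidium, for laboratory or industrial use";
        Matcher m = Y_SUCH_AS_X.matcher(text);
        while (m.find()) {
            // Hearst (1992): X is a hyponym of Y
            System.out.println(m.group(2) + " IS-A " + m.group(1));
        }
    }
}

Running it on the Hearst example above prints “Gelidium IS-A red algae”.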

Hand-built patterns for relations

Plus:

Human patterns tend to be high-precision

Can be tailored to specific domains

Minus

Human patterns are often low-recall

A lot of work to think of all possible patterns!

Don’t want to have to do this for every relation!

We’d like better accuracy

Supervised machine learning for relations

Choose a set of relations we’d like to extract

Choose a set of relevant named entities

Find and label data

Choose a representative corpus

Label the named entities in the corpus

Hand-label the relations between these entities

Break into training, development, and test

Train a classifier on the training set

22

How to do classification in supervised relation extraction

1. Find all pairs of named entities (usually in same sentence)

2. Decide if 2 entities are related

3. If yes, classify the relation

Why the extra step?

Faster classification training by eliminating most pairs

Can use distinct feature-sets appropriate for each task.

23

Relation Extraction

Classify the relation between two entities in a sentence

American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said.

Candidate relation labels: SUBSIDIARY, FAMILY, EMPLOYMENT, NIL, FOUNDER, CITIZEN, INVENTOR, …

Word Features for Relation Extraction

American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said
(Mention 1 = American Airlines, Mention 2 = Tim Wagner)

Headwords of M1 and M2, and combination: Airlines; Wagner; Airlines-Wagner

Bag of words and bigrams in M1 and M2: {American, Airlines, Tim, Wagner, American Airlines, Tim Wagner}

Words or bigrams in particular positions left and right of M1/M2: M2 -1: spokesman; M2 +1: said

Bag of words or bigrams between the two entities: {a, AMR, of, immediately, matched, move, spokesman, the, unit}
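As a rough illustration (not code from the slides), a few of these word features can be computed directly from a pre-tokenized sentence and the two mention spans:

import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Illustrative computation of a few word features for relation extraction.
public class WordFeaturesDemo {
    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("American", "Airlines", ",", "a", "unit", "of",
                "AMR", ",", "immediately", "matched", "the", "move", ",",
                "spokesman", "Tim", "Wagner", "said");
        int m1Start = 0, m1End = 2;   // "American Airlines"
        int m2Start = 14, m2End = 16; // "Tim Wagner"

        // Headword approximated as the last token of each mention (real systems use a head finder).
        String headM1 = tokens.get(m1End - 1);
        String headM2 = tokens.get(m2End - 1);
        System.out.println("headwords: " + headM1 + ", " + headM2 + ", " + headM1 + "-" + headM2);

        // Bag of words between the two mentions (punctuation dropped).
        Set<String> between = new TreeSet<>();
        for (int i = m1End; i < m2Start; i++) {
            if (tokens.get(i).matches("\\w+")) between.add(tokens.get(i));
        }
        System.out.println("bag of words between mentions: " + between);

        System.out.println("M2 -1: " + tokens.get(m2Start - 1)); // spokesman
        System.out.println("M2 +1: " + tokens.get(m2End));       // said
    }
}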

Named Entity Type and Mention Level

Features for Relation Extraction

Named-entity types

M1: ORG

M2: PERSON

Concatenation of the two named-entity types

ORG-PERSON

Entity Level of M1 and M2 (NAME, NOMINAL, PRONOUN)

M1: NAME [it or he would be PRONOUN]

M2: NAME [the company would be NOMINAL]

American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said

Mention 1 Mention 2

Parse Features for Relation Extraction

Base syntactic chunk sequence from one to

the other

NP NP PP VP NP NP

Constituent path through the tree from one to

the other

NP NP S S NP

Dependency path

Airlines matched Wagner said

American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said

Mention 1 Mention 2

Gazetteer and trigger word features for relation extraction

Trigger list for family: kinship terms

parent, wife, husband, grandparent, etc. [from WordNet]

Gazetteer:

Lists of useful geo or geopolitical words

Country name list

Other sub-entities

American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said.

Classifiers for supervised methods

Now you can use any classifier you like

MaxEnt

Naïve Bayes

SVM

...

Train it on the training set, tune on the dev set, test on the test set

Evaluation of Supervised Relation Extraction

Compute P/R/F1 for each relation

31

P = (# of correctly extracted relations) / (total # of extracted relations)

R = (# of correctly extracted relations) / (total # of gold relations)

F1 = 2PR / (P + R)
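A minimal Java helper for these three measures (the counts in main are illustrative only):

// Simple P/R/F1 helper for relation extraction evaluation.
public class RelationEvaluation {

    static double precision(int correct, int extracted) { return correct / (double) extracted; }
    static double recall(int correct, int gold)          { return correct / (double) gold; }
    static double f1(double p, double r)                 { return 2 * p * r / (p + r); }

    public static void main(String[] args) {
        // e.g., 60 correctly extracted relations out of 80 extracted, 100 in the gold standard
        double p = precision(60, 80);   // 0.75
        double r = recall(60, 100);     // 0.60
        System.out.printf("P=%.2f R=%.2f F1=%.2f%n", p, r, f1(p, r));
    }
}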

Summary: Supervised Relation Extraction

+ Can get high accuracies with enough hand-labeled training data, if the test data is similar enough to the training data

- Labeling a large training set is expensive

- Supervised models are brittle, don’t generalize well to different genres


Selected Related Work

Access Control Extraction

• Requirements-based Access Control Analysis and Policy

Specification [He 2009]

• Automated Extraction and Validation of Security Policies from

Natural Language Documents [Xiao 2009]

Database Model Extraction

• English Sentence Structures and Entity-Relationship Diagrams

[Chen 1983]

• Heuristics-based Entity-Relationship Modeling through NLP

[Omar 2004]

• Conceptual Modeling of Natural Language Functional

Requirements [Sagar 2014]

33

Role Extraction and Database Enforcement

34

Inputs: Text Documents, Database Design, Domain Knowledge

Outputs: Generated SQL Commands for access control, Completeness and Conflict Report, Traceability Report

1) Parse natural language product artifacts
2) Classify sentence
3) Extract access control elements
4) Extract database model elements
5) Map data model to physical database schema
6) Implement access control


Step 1: Parse Natural Language Product Artifacts

Generate intermediate representation from text

35

“A nurse can order a lab procedure for a patient.”

Named Entities:

A action

R resource

S subject

Parts of Speech:

NN noun

VB verb

Relationships:

dobj direct object

nn noun compound modifier

nsubj nominal subject

prep_for prepositional modifier – for
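As an illustration of this step, a minimal sketch using the Stanford CoreNLP Java API (assuming CoreNLP 3.9+ on the classpath; newer versions emit Universal Dependencies labels rather than the collapsed prep_for style shown on these slides, and this is not the REDE tool’s code):

import java.util.Properties;

import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.CoreSentence;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class ParseDemo {
    public static void main(String[] args) {
        // Pipeline with tokenization, POS tagging, lemmatization, and dependency parsing.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,depparse");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        CoreDocument doc = new CoreDocument("A nurse can order a lab procedure for a patient.");
        pipeline.annotate(doc);
        CoreSentence sentence = doc.sentences().get(0);

        System.out.println(sentence.posTags());          // e.g., [DT, NN, MD, VB, DT, NN, NN, IN, DT, NN, .]
        System.out.println(sentence.dependencyParse());  // typed dependencies rooted at "order"
    }
}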


Step 2: Classify Sentence

Performs two classifications on each sentence:

1. Does the sentence contain ACRs?

2. Does the sentence contain DMEs?

Example 1: “A nurse can order a lab procedure for a patient.”

36

ACRs – Yes

DMEs – Yes

Example 2:

“Lab procedures have a date-ordered, lab-type,

and current status.”

ACRs – No

DMEs – Yes


Step 3: Semantic Relation Extraction

37

Seed pattern (figure): a specific action verb (VB, A) with an nsubj edge to a noun subject (NN, S*) and a dobj edge to a noun resource (NN, R*)

Bootstrapping process (figure components): Generate Seed Patterns; Match Subjects and Resources; Apply Patterns; Known Subjects & Resources; Access Control Rules; Subject & Resource Search; Pattern Extraction and Transformation; Classify Patterns; Pattern Set; Inject Patterns


Step 3: Extract Access Control Elements

38

Semantic Relations:

(order, nurse, lab procedure)

(order_for, nurse, patient)

Relational Pattern:

order – nsubj – nurse, – dobj – lab procedure

order – nsubj – nurse, - prep_for - patient

(Dependency graph) order (VB, A): nsubj → nurse (NN, S); dobj → lab procedure (NN, R); prep_for → patient (NN, R); aux → can (MD)

Use semantic relations to extract information

Access Control Rules

(nurse, order, lab procedure, create)

(nurse, order_for, patient, read)
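The extraction above can be pictured with a toy sketch (hypothetical code assuming Java 16+, not the REDE implementation) that walks typed-dependency triples and pairs an nsubj with a dobj or prep_ edge under the same governing verb:

import java.util.ArrayList;
import java.util.List;

// Toy sketch of turning typed dependencies into (subject, action, resource) triples,
// in the spirit of the seed pattern VB -nsubj-> NN, VB -dobj-> NN.
public class AcrExtractionDemo {

    record Dependency(String relation, String governor, String dependent) {}
    record Acr(String subject, String action, String resource) {}

    static List<Acr> extract(List<Dependency> deps) {
        List<Acr> rules = new ArrayList<>();
        for (Dependency subj : deps) {
            if (!subj.relation().equals("nsubj")) continue;
            for (Dependency obj : deps) {
                // The same governing verb links the subject and the object/prepositional object.
                if (obj.governor().equals(subj.governor())
                        && (obj.relation().equals("dobj") || obj.relation().startsWith("prep_"))) {
                    rules.add(new Acr(subj.dependent(), subj.governor(), obj.dependent()));
                }
            }
        }
        return rules;
    }

    public static void main(String[] args) {
        // Dependencies hard-coded for "A nurse can order a lab procedure for a patient."
        List<Dependency> deps = List.of(
                new Dependency("nsubj", "order", "nurse"),
                new Dependency("dobj", "order", "lab procedure"),
                new Dependency("prep_for", "order", "patient"));
        // Prints two rules: nurse/order/lab procedure and nurse/order/patient.
        extract(deps).forEach(System.out::println);
    }
}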

Step 4: Extract Database Model Elements

39

Semantic Relations:

(order, nurse, lab procedure)

(order_object_for, lab procedure, patient)

Semantic Relational Pattern:

order – nsubj – nurse, – dobj – lab procedure

order – dobj – lab procedure, - prep_for - patient

(Dependency graph) order (VB, A): nsubj → nurse (NN, S); dobj → lab procedure (NN, R); prep_for → patient (NN, R); aux → can (MD)

Use semantic relations to extract information

Database Elements:

Entities: lab procedure, patient

Relationship: nurse orders lab procedure

lab procedure for patient


Step 5: Map Data Model to Physical Database Schema

• Merge ACRs and database model elements

• Map subjects to roles

• Map objects to tables

40

Access Control Rules

(nurse, order, lab procedure, create)

(nurse, order_for, patient, read)

Database Elements:

Entities: lab procedure, patient

Relationship: nurse orders lab procedure

lab procedure for patient

Physical Database Schema:

lab_procedure_tbl

patient_tbl

lab_procedure_patient_tbl

role: nurse_rl

Merged ACRs

(nurse, order, lab procedure, create)

(nurse, order_for, patient, read)

(nurse, order, lab procedure_patient, create)

(nurse, order_for, lab procedure_patient, read)

Database Access Rules

(nurse, lab procedure, create)

(nurse, patient, read)

(nurse, lab procedure_patient, create/read)


Step 6: Implement Access Control

• Perform Sanity Checks

• Conflict detection

• Unmapped subjects and resources

• Generate SQL Commands

create role nurse_rl;

grant insert on lab_procedure_tbl to nurse_rl;

grant select on patient_tbl to nurse_rl;

grant insert, select on lab_procedure_patient_tbl to nurse_rl;
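A small illustrative sketch (assuming Java 16+, not the REDE generator) of how database access rules like those above could be turned into such GRANT statements, mapping CRUD operations to SQL privileges:

import java.util.List;
import java.util.Map;

// Illustrative generation of SQL grants from database access rule tuples.
public class GrantGenerator {

    record DatabaseAccessRule(String role, String table, List<String> operations) {}

    // CRUD operation -> SQL privilege
    private static final Map<String, String> PRIVILEGES = Map.of(
            "create", "insert",
            "read", "select",
            "update", "update",
            "delete", "delete");

    public static void main(String[] args) {
        List<DatabaseAccessRule> rules = List.of(
                new DatabaseAccessRule("nurse_rl", "lab_procedure_tbl", List.of("create")),
                new DatabaseAccessRule("nurse_rl", "patient_tbl", List.of("read")),
                new DatabaseAccessRule("nurse_rl", "lab_procedure_patient_tbl", List.of("create", "read")));

        System.out.println("create role nurse_rl;");
        for (DatabaseAccessRule r : rules) {
            String privs = String.join(", ", r.operations().stream().map(PRIVILEGES::get).toList());
            System.out.printf("grant %s on %s to %s;%n", privs, r.table(), r.role());
        }
    }
}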

41


Process Challenges

• Ambiguity

• Pronouns

• Missing elements

• “Generic” words – (e.g., list, item, data)

• Synonyms

• Negativity

• Schema mismatches

• Names

• Cardinality

42


Study 1: Classification Study [NaturaLiSE 2013]

43

Research ability to classify sentences

Why?

• What needs to be processed further

• Prevent false positives

Focus

• Processing activities needs

• Determine appropriate sentence representation(s)

• Classifier and feature performance

Study Documents: 11 Healthcare related documents

44


Study 1: What Classification Algorithm to Use?

Classifier P R 𝑭𝟏

Weighted Random .047 .060 .053

50% Random .044 .502 .081

Naïve Bayes .227 .347 .274

SVM .728 .544 .623

NFR Locator k-NN .691 .456 .549

Classifying Non Functional Requirements

45


Study 1: What Classification Algorithm to Use?

Similarity Ratio 𝑭𝟏 % Classified

0.5 .85 56%

0.6 .82 63%

0.7 .78 74%

0.8 .75 86%

0.9 .71 96%

1.0 .70 96%

∞ .63 100%

Security Requirements: k-NN with Similarity Check

Conclusion: Use ensemble-based classifier


Study 2: Access Control Extraction [PASSAT 2013] [ASE Science Journal 2013] [ACSAC 2014]

46

Research ability to identify and extract access control rules

Why?

• Determine access control to implement

Focus

• Bootstrap knowledge to find ACRs

• Extend pattern set while preventing false positives

Study Documents

Document | Domain | Document Type | # of Sentences | # of ACR Sentences | # of ACRs | Fleiss’ Kappa
iTrust | Healthcare | Use Case | 1160 | 550 | 2274 | 0.58
iTrust for Text2Policy | Healthcare | Use Case | 471 | 418 | 1070 | 0.73
IBM Course Mgmt | Education | Use Case | 401 | 169 | 375 | 0.82
CyberChair | Conf. Mgmt | Seminar Paper | 303 | 139 | 386 | 0.71
Collected ACP Docs | Multiple | Sentences | 142 | 114 | 258 | n/a

47


Study 2: What Properties do ACR Sentences Have?

iTrust iTrust_t2p IBM CM CyberChair Collected Text2Policy Pattern – Modal Verb 210 130 46 71 93 Text2Policy Pattern – Passive voice w/ to Infinitive

66 21 10 39 9

Text2Policy Pattern – Access Expression 32 7 5 1 18 Text2Policy Pattern – Ability Expression 45 21 14 11 3

Number of sentences with multiple types of ACRs

383 146 77 105 36

Number of patterns appearing once or twice

680 173 162 184 97

ACRs with ambiguous subjects (e.g. “system”, “user”, etc.)

193 119 139 1 13

ACRs with blank subjects 557 206 29 187 5 ACRs with pronouns as subjects 109 28 5 11 11 ACRs with ambiguous objects (e.g., entry, list, name,etc.)

422 228 45 47 34

Total Number of ACR Sentences 550 418 169 139 114 Total Number of ACR Rules 2274 1070 375 386 258

48


Study 2: Identifying ACR Sentences

Document Precision Recall F1

iTrust for Text2Policy .96 .99 .98

iTrust .90 .86 .88

IBM Course Management .83 .92 .87

CyberChair .63 .64 .64

Collected ACP .83 .96 .89

All documents, 10-fold .81 .84 .83

49


Study 2: Identifying ACR Sentences without Training Sets

Classification Performance (F1) by Completion %

50


Study 2: Extracting ACRs

Precision Recall F1

iTrust for Text2Policy .80 .75 .77

iTrust for ACRE .75 .60 .67

IBM Course Management .81 .62 .70

CyberChair .75 .30 .43

Collected ACP .68 .18 .29


Study 3: Database Model Extraction [ESEM 2015 (to submit)]

51

Research ability to extract database model and implement

process from start to finish

Why?

• Need to map ACRs to environment

Challenges

• Patterns

• Completeness

Case Study: Open Conference System

52


Study 3: Classification Results

Precision | Recall | F1
OCS (train using CyberChair), ACR | .82 | .29 | .42
CyberChair (train using OCS), ACR | .75 | .61 | .67
OCS, 10-fold self-validation, ACR | .81 | .78 | .79
OCS (train using CyberChair), DME | .82 | .29 | .42
CyberChair (train using OCS), DME | .75 | .61 | .67
OCS, 10-fold self-validation, DME | .83 | .78 | .79

Does a sentence have ACRs and/or DMEs?

53


Study 3: Extracting DMEs

Precision | Recall | F1
Perfect Knowledge from ACRs | 1.00 | .89 | .94
Results from ACR Process | 1.00 | .81 | .90


Study 3: Database Design Extraction

54

Number of | System | Oracle | Process
ACRs | 52 | 730 | 686
resolved subjects | 524 | 686
resolved objects | 39 | 35
merged rules | 272 | 481
discovered roles | 7 | 21 | 1
discovered entities | 52 | 223 | 213

55


Limitations

• Text-based process

• Conditional access

• Rule-based resolution

• Only considered access at a table level

• Mapping discovered roles and entities to the actual database

manually performed

• Only examined one system within a given problem domain for

end-to-end validation

• System implementation may not match documentation

• Different functionality

• Effective dating/status in place of deletes/updates


Future Work

• Access control rules

• Temporal orderings

• Conditions / constraints

• Database model elements

• Field types

• Values / ranges

• Human computer interaction

56

Research Goal Evaluation

Improve security and compliance by

• identifying and extracting access control rules (ACRs)

• identifying and extracting database model elements (DMEs)

• implementing defined access control rules in a system’s database

Confirmation

• Identify ACR Sentences: .83 𝐹1

• Extract ACRs: .29 to .77 𝐹1

• Identify DME Sentences: .79 𝐹1

• Extract DMEs: .90 𝐹1

• Generated # of ACRs: 272

57


Contributions

• Approach and supporting tool *

• Sentence similarity algorithm

• Bootstrapping algorithms

• Labeled corpora*

• Pattern distributions

* https://github.com/RealsearchGroup/REDE

58

References

[Bennett 2015] Bennett, Cory. Weak Login Security at Heart of Anthem Breach. http://thehill.com/policy/cybersecurity/232158-weak-login-security-at-heart-of-anthem-breach. Accessed: 3/15/2015.

[Chen 1983] Chen, Peter. English Sentence Structure and Entity-Relationship Diagrams. Information Sciences 29: 127-149, 1983.

[He 2009] He, Q. and Antón, A.I. Requirements-based Access Control Analysis and Policy Specification (ReCAPS). Information and Software Technology, vol. 51, no. 6, pp. 993-1009, 2009.

[Husain 2015] Husain, Azam. What the Anthem Breach Teaches Us About Access Control. http://www.healthitoutcomes.com/doc/what-the-anthem-breach-teaches-us-about-access-control-0001. Accessed: 3/15/2015.

[Omar 2004] Omar, Nazlia. Heuristics-Based Entity-Relationship Modelling through Natural Language Processing. PhD Dissertation, University of Ulster, 2004.

[Peterson 2015] Peterson, Andrea. 2015 is already the year of the health-care hack – and it’s only going to get worse. Washington Post, Washington D.C., 3/20/2015.

[Redhead 2015] Redhead, C. Stephen. Anthem Data Breach: How Safe Is Health Information Under HIPAA? http://fas.org/sgp/crs/misc/IN10235.pdf. Congressional Research Service Report. Accessed: 3/16/2015.

[Sagar 2014] Sagar, Vidhu Bhala R. Vidya and Abirami, S. Conceptual Modeling of Natural Language Functional Requirements. Journal of Systems and Software, vol. 88, pp. 25-41, 2014.

[Westin 2015] Westin, Ken. How Anthem Could be Breached. http://www.tripwire.com/state-of-security/incident-detection/how-the-anthem-breach-could-have-happened/. Accessed: 3/15/2015.

[Xiao 2009] Xiao, X., Paradkar, A., Thummalapenta, S. and Xie, T. Automated Extraction of Security Policies from Natural-Language Software Documents. International Symposium on the Foundations of Software Engineering (FSE), Raleigh, North Carolina, USA, 2012.

59

References

[Slankas 2015] Slankas, John and Williams, Laurie. "Relation Extraction for Inferring Database Models from Natural Language Artifacts." 2015 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2015), to be submitted.

[Slankas 2014] Slankas, John, Xiao, Xusheng, Williams, Laurie, and Xie, Tao. "Relation Extraction for Inferring Access Control Rules from Natural Language Artifacts." 2014 Annual Computer Security Applications Conference (ACSAC 2014), New Orleans, LA.

[Riaz 2014b] Riaz, Maria, Slankas, John, King, Jason, and Williams, Laurie. "Using Templates to Elicit Implied Security Requirements from Functional Requirements − A Controlled Experiment." ACM/IEEE 8th International Symposium on Empirical Software Engineering and Measurement (ESEM 2014), Torino, Italy, September 18-19, 2014.

[Riaz 2014a] Riaz, Maria, King, Jason, Slankas, John, and Williams, Laurie. "Hidden in Plain Sight: Automatically Identifying Security Requirements from Natural Language Artifacts." 2014 Requirements Engineering (RE 2014), Karlskrona, Sweden, August 25-29, 2014.

[Slankas 2013d] Slankas, John and Williams, Laurie. "Access Control Policy Identification and Extraction from Project Documentation." Academy of Science and Engineering Science Journal, Volume 2, Issue 3, pp. 145-159, 2013.

[Slankas 2013c] Slankas, John and Williams, Laurie. "Access Control Policy Extraction from Unconstrained Natural Language Text." 2013 ASE/IEEE International Conference on Privacy, Security, Risk, and Trust (PASSAT 2013), Washington D.C., USA, September 8-14, 2013.

[Slankas 2013b] Slankas, John and Williams, Laurie. "Automated Extraction of Non-functional Requirements in Available Documentation." 1st International Workshop on Natural Language Analysis in Software Engineering (NaturaLiSE 2013), San Francisco, CA.

[Slankas 2013a] Slankas, John. "Implementing Database Access Control Policy from Unconstrained Natural Language Text." 35th International Conference on Software Engineering - Doctoral Symposium (ICSE DS 2013), San Francisco, CA.

[Slankas 2012] Slankas, John and Williams, Laurie. "Classifying Natural Language Sentences for Policy." IEEE International Symposium on Policies for Distributed Systems and Networks (POLICY 2012).

60

Backup slides

61

Additional Information

Other Solutions to Inappropriate Data

Access

62

• Auditing

• Intrusion detection

• Manually establish access control

• Completeness

• Correctness

• Effort

Additional Information

Machine Learning Background

63

• Combines computer science and statistics

• Supervised vs. Unsupervised

• Sample algorithms

• k-nearest neighbor (k-NN)

• Naïve Bayes

• Decision trees

• Regression

• k-means clustering

Additional Information

Semantic Relation Related Work

64

1992 Hearst – Automatic Acquisition of Hyponyms from Large Text

Corpora

2004 Snow et al., Learning Syntactic Patterns for Automatic

Hypernym Discovery

2005 Zhou et al., Exploring Various Knowledge in Relation

Extraction

Additional Information

Natural Language Parsers

65

Apache OpenNLP: http://opennlp.apache.org/

Berkeley Parser: http://nlp.cs.berkeley.edu/

BLLIP (Charniak-Johnson): http://bllip.cs.brown.edu/

GATE: https://gate.ac.uk

MALLET: http://mallet.cs.umass.edu/

Python Natural Language Toolkit: http://www.nltk.org/

Stanford Natural Language Parser: http://nlp.stanford.edu/

Criteria:

• Performs well

• Open-source, maintained, well-documented

• Java

Additional Information

NLP Outputs

66

POS Tagging:

The/DT nurse/NN can/MD order/VB a/DT lab/NN procedure/NN

for/IN a/DT patient/NN ./.

Parse: (ROOT

(S

(NP (DT The) (NN nurse))

(VP (MD can)

(VP (VB order)

(NP (DT a) (NN lab) (NN procedure))

(PP (IN for)

(NP (DT a) (NN patient)))))

(. .)))

Typed Dependency: det(nurse-2, The-1)

nsubj(order-4, nurse-2)

aux(order-4, can-3)

root(ROOT-0, order-4)

det(procedure-7, a-5)

nn(procedure-7, lab-6)

dobj(order-4, procedure-7)

prep(order-4, for-8)

det(patient-10, a-9)

pobj(for-8, patient-10)

Additional Information

Precision, Recall, F1 Measure

Precision (P) is the proportion of correctly predicted access control

statements: 𝑃 = 𝑇𝑃/(𝑇𝑃 + 𝐹𝑃).

Recall (R) is the proportion of access control statements found:

𝑅 = 𝑇𝑃/(𝑇𝑃 + 𝐹𝑁)

F1 Measure is the harmonic mean between P and R:

F1 = 2 × 𝑃 × 𝑅 /(𝑃 + 𝑅)

67

Predicted Classification \ Expected Classification | Yes | No
Yes | True Positive | False Positive
No | False Negative | True Negative

Additional Information

Inter-rater Agreement (Fleiss’ Kappa)

How well do multiple raters agree beyond what’s possible by

chance?

κ = (P − Pe) / (1 − Pe)

Degree of agreement attained above chance divided by the degree

of agreement possible above chance

68

Fleiss’ Kappa Agreement Interpretation

<= 0 Less than chance

0.01 – 0.20 Slight

0.21 – 0.40 Fair

0.41 – 0.60 Moderate

0.61 – 0.80 Substantial

0.81 – 0.99 Almost perfect
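For reference, a small Java implementation of the standard Fleiss’ kappa formula above (illustrative, with made-up ratings; it assumes the same number of raters for every subject):

// Fleiss' kappa: ratings[i][j] = number of raters assigning subject i to category j.
public class FleissKappa {

    static double kappa(int[][] ratings) {
        int subjects = ratings.length;
        int categories = ratings[0].length;
        int raters = 0;
        for (int j = 0; j < categories; j++) raters += ratings[0][j];

        double[] categoryTotals = new double[categories];
        double pBar = 0.0;                       // mean per-subject agreement
        for (int[] row : ratings) {
            int sumSquares = 0;
            for (int j = 0; j < categories; j++) {
                categoryTotals[j] += row[j];
                sumSquares += row[j] * row[j];
            }
            pBar += (sumSquares - raters) / (double) (raters * (raters - 1));
        }
        pBar /= subjects;

        double pe = 0.0;                         // expected agreement by chance
        for (int j = 0; j < categories; j++) {
            double p = categoryTotals[j] / (subjects * (double) raters);
            pe += p * p;
        }
        return (pBar - pe) / (1 - pe);
    }

    public static void main(String[] args) {
        // 4 sentences, 3 raters, 2 categories (ACR / not ACR); values are illustrative.
        int[][] ratings = { {3, 0}, {2, 1}, {0, 3}, {1, 2} };
        System.out.println(kappa(ratings));   // ~0.33 (fair agreement)
    }
}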

Step 3: Semantic Relations

Use semantic relation extraction to extract access control

elements from natural language text.

Semantic relation: underlying meaning between two concepts

Examples:

69

Hypernymy (is-a) users, such as nurses authenticate …

Meronymy (whole-part) a patient’s vital signs

Verb Phrases

customers rent cars

A(s, a, r, n, l, c, H, p)

𝑠 vertices composing the subject

𝑎 vertices composing the action

𝑟 vertices composing the resource

𝑛 vertex representing negativity

𝑙 vertex representing limitation to a specific role

𝑐 vertices providing context to the access control policy

𝐻 subgraph required to connect all previous vertices

𝑝 set of permission associated with the current policy

A(nurse, order, lab procedure, , , (V: nurse, order, lab procedure; E: (order, nurse, nsubj); (order, lab procedure, dobj)), create)

A(nurse, order, patient, , , (V: nurse, order, patient; E: (order, nurse, nsubj); (order, patient, prep_for)), read)

70

Additional Information

Access Control Rule Representation

Additional Information

Database Model Element Representation

71

Entities

𝐷𝑒({𝑒}, 𝐻)

Attributes of Entities

𝐷𝑎({𝑒}, {𝑎}, 𝐻)

Relationships

𝐷𝑟({𝑟}, {𝑒1}, {𝑒2}, 𝐻)
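As an illustration only, the three element forms above could be mirrored by Java records along these lines (assuming Java 16+; the names are hypothetical, not the REDE tool’s types, and the supporting subgraph H is simplified to a list of dependency edges):

import java.util.List;
import java.util.Set;

// Hypothetical Java mirror of the database model element tuples above.
public class DatabaseModelElements {

    record Edge(String governor, String dependent, String relation) {}

    record Entity(Set<String> entityWords, List<Edge> subgraph) {}                    // De({e}, H)

    record EntityAttribute(Set<String> entityWords, Set<String> attributeWords,
                           List<Edge> subgraph) {}                                    // Da({e}, {a}, H)

    record Relationship(Set<String> relationshipWords, Set<String> entity1Words,
                        Set<String> entity2Words, List<Edge> subgraph) {}             // Dr({r}, {e1}, {e2}, H)

    public static void main(String[] args) {
        Entity labProcedure = new Entity(Set.of("lab", "procedure"), List.of());
        System.out.println(labProcedure);
    }
}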

Additional Information

Step 4: Database Model Patterns

72

(Pattern template figures; % denotes a wildcard verb, E = entity, A = attribute, R = relationship)

Relationship: Association — % (VB, R): nsubj → Entity_1 (NN, E); dobj → Entity_2 (NN, E)

Relationship: Aggregation / Composition — have (VB, R): nsubj → whole (NN, E); dobj → part (NN, E)

Relationship: Inheritance — be (VB, R): nsubj → SpecificEntity (NN, E); prep_of → GeneralEntity (NN, E)

Entity-attributes — Entity (NN, E): poss → Attribute (NN, A); and Attribute (NN, A): prep_of → Entity (NN, E)

Entity — a noun (NN, E) attached to a wildcard verb (% VB) via nsubj, dobj, or prep_%

Additional Information

Step 4: Extract Database Design

73

(Bootstrapping process figure) Components: Generate patterns from templates; Manually Identified Patterns; Extracted Access Control Rules; Transform Patterns; Classify Patterns; Inject Additional Patterns; Wildcard Patterns; Pattern Set; Pattern Search; Known Entities and Relationships; Extract Database Design Elements

74

Additional Information

REDE Application

75

Additional Information

Negativity

• Specific adjectives (unable)

• Adverbs (not, never)

• Determiners (no, zero, neither)

• Nouns (none, nothing).

• Negative verbs (stop, prohibit, forbid)

• Negative prefixes for verbs
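A toy sketch of how such cues could be checked against a sentence’s lemmas (abbreviated word lists, illustrative only; negative verb prefixes and scope handling are omitted):

import java.util.List;
import java.util.Set;

// Toy negativity check over lemmas, using the cue categories listed above.
public class NegativityCheck {

    private static final Set<String> NEGATIVE_CUES = Set.of(
            "unable",                      // adjectives
            "not", "never",                // adverbs
            "no", "zero", "neither",       // determiners
            "none", "nothing",             // nouns
            "stop", "prohibit", "forbid"); // negative verbs

    static boolean isNegative(List<String> lemmas) {
        return lemmas.stream().anyMatch(NEGATIVE_CUES::contains);
    }

    public static void main(String[] args) {
        // true: the sentence contains "not"
        System.out.println(isNegative(
                List.of("a", "patient", "can", "not", "order", "a", "lab", "procedure")));
    }
}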

1. What document types contain NFRs in each of the 14

different categories?

2. What characteristics, such as keywords or entities do

sentences assigned to each NFR category have in

common?

3. What machine learning classification algorithm has the

best performance to identify NFRs?

4. What sentence characteristics affect classifier

performance?

76

Additional Information

Study 1: Research Questions

• Started from Cleland-Huang, et al.

• Combined performance and scalability

• Separated access control and audit from security

• Added privacy, recoverability, reliability, and other

77

Additional Information

Study 1: Non-functional Requirement

Categories

J. Cleland-Huang, R. Settimi, X. Zou, and P. Solc, “Automated Classification of Non-functional Requirements,”

Requirements Engineering, vol. 12, no. 2, pp. 103–120, Mar. 2007.

Access Control, Audit, Availability, Legal, Look & Feel, Maintenance, Operational, Privacy, Recoverability, Performance & Scalability, Reliability, Security, Usability, Other

Lawrence Chung’s NFRs

accessibility, accountability, accuracy, adaptability, agility, auditability, availability, buffer space performance,

capability, capacity, clarity, code-space performance, cohesiveness, commonality, communication cost,

communication time, compatibility, completeness, component integration time, composability, comprehensibility,

conceptuality, conciseness, confidentiality, configurability, consistency, coordination cost, coordination time,

correctness, cost, coupling, customer evalutation time, customer loyalty, customizability, data-space performance,

decomposability, degradation of service, dependability, development cost, development time, distributivity, diversity,

domain analysis cost, domain analysis time, efficiency, elasticity, enhanceability, evolvability, execution cost,

extensibility, external consistency,fault-tolerance, feasibility, flexibility, formality, generality, guidance, hardware cost,

impact analyzability, independence, informative-ness, inspection cost, inspection time, integrity, inter-operable,

internal consistency, intuitiveness, learnability, main-memory performance, maintainability, maintenance cost,

maintenance time, maturity, mean performance, measurability, mobility, modifiability, modularity, naturalness,

nomadicity, observability, off-peak-period performance, operability, operating cost, peakperiod performance,

performability, performance, planning cost, planning time, plasticity, portability, precision, predictability, process

management time, productivity, project stability, project tracking cost, promptness, prototyping cost, prototyping

time, reconfigurability, recoverability, recovery, reengineering cost, reliability, repeat ability, replaceability,

replicability, response time, responsiveness, retirement cost, reusability, risk analysis cost, risk analysis time,

robustness, safety, scalability, secondary storage performance, security, sensitivity, similarity, simplicity, software

cost, software production time, space boundedness, space performance, specificity, stability, standardizability,

subjectivity, supportability, surety, survivability, susceptibility, sustainability, testability, testing time, throughput, time

performance, timeliness, tolerance, traceability, trainabilìty, transferability, transparency, understandability, uniform

performance, uniformity, usability, user-friendliness, validity, variability, verifìabiìity, versatility, visibility, wrappability

78

79

Additional Information

Study 1: Documents

Document | Document Type | Size | AC | AU | AV | LG | LF | MT | OP | PR | PS | RC | RL | SC | US | OT | FN | NA
CCHIT Ambulatory Requirements | Requirement | 306 | 12 | 27 | 1 | 2 | 0 | 10 | 0 | 0 | 1 | 5 | 2 | 28 | 4 | 8 | 228 | 6
iTrust | Requirement, Use Case | 1165 | 439 | 44 | 0 | 2 | 2 | 18 | 2 | 9 | 0 | 9 | 9 | 55 | 2 | 0 | 734 | 376
PromiseData | Requirement | 792 | 164 | 20 | 36 | 10 | 50 | 26 | 89 | 7 | 75 | 4 | 12 | 71 | 101 | 19 | 340 | 0
Open EMR Install Manual | Installation Manual | 225 | 3 | 0 | 0 | 0 | 0 | 0 | 5 | 1 | 0 | 6 | 1 | 25 | 0 | 0 | 2 | 184
Open EMR User Manual | User Manual | 473 | 169 | 0 | 0 | 0 | 14 | 0 | 0 | 0 | 0 | 0 | 0 | 8 | 4 | 0 | 286 | 95
NC Public Health DUA | DUA | 62 | 1 | 0 | 0 | 20 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 41
US Medicare/Medicaid DUA | DUA | 140 | 1 | 0 | 0 | 26 | 0 | 0 | 0 | 17 | 0 | 0 | 0 | 0 | 0 | 5 | 2 | 108
California Correctional Health Care | RFP | 1893 | 94 | 120 | 9 | 85 | 0 | 133 | 94 | 52 | 13 | 16 | 13 | 193 | 14 | 38 | 987 | 409
Los Angeles County EHR | RFP | 1268 | 58 | 37 | 8 | 3 | 2 | 28 | 19 | 3 | 11 | 8 | 13 | 108 | 21 | 10 | 639 | 380
HIPAA Combined Rule | CFR | 2642 | 28 | 8 | 3 | 0 | 0 | 78 | 0 | 213 | 0 | 9 | 0 | 41 | 1 | 0 | 317 | 2018
Meaningful Use Criteria | CFR | 1435 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8 | 0 | 0 | 116 | 1311
Health IT Standards | CFR | 1475 | 10 | 20 | 0 | 0 | 0 | 119 | 0 | 1 | 0 | 2 | 2 | 71 | 1 | 2 | 164 | 1146
Total | | 11876 | 979 | 276 | 57 | 152 | 68 | 413 | 207 | 300 | 100 | 50 | 43 | 563 | 148 | 82 | 3568 | 6076

80

Study 1/RQ1: What document types contain what categories of

NFRs?

• All evaluated documents contained NFRs

• RFPs had a wide variety of NFRs except look and feel

• DUAs contained high frequencies of legal and privacy

• Access control and/or security NFRs appeared in all of

the documents.

• The low frequency of functional requirements and NFRs within the CFRs exemplifies why tool support is critical to efficiently extract requirements from those documents.

1. What patterns exist among sentences with access

control rules?

2. How frequently do different forms of ambiguity occur in

sentences with access control rules?

3. How effectively does our process detect sentences with

access control rules?

4. How effectively can the subject, action, and resources

elements of ACRs be extracted?

81

Additional Information

Study 2: Research Questions

82

Document Domain

Number of Sentences

Number of ACR Sentences

Number of ACRs

Fleiss’ Kappa

iTrust Healthcare 1160 550 2274 0.58

iTrust for Text2Policy Healthcare 471 418 1070 0.73

IBM Course Management Education 401 169 375 0.82

CyberChair Conf. Mgmt 303 139 386 0.71

Collected ACP Documents Multiple 142 114 258 n/a

Additional Information

Study 2: Investigated Documents

83

Additional Information

Study 2: ACR Patterns

Top ACR Patterns

Pattern Num. of Occurrences

(VB root(NN nsubj)(NN dobj)) 465 (14.1%)

(VB root(NN nsubjpass)) 122 (3.7%)

(VB root(NN nsubj)(NN prep)) 116 (3.5%)

(VB root(NN dobj)) 72 (2.2%)

(VB root(NN prep_%)) 63 (1.9%)

84

Additional Information

Study 2: Ambiguity

Ambiguity Occurrence % in

ACR Sentences

Pronouns 3.2%

“System” / “user” 11.0%

No explicit subject 17.3%

Other ambiguous terms 21.5%

Missing objects 0.2%

Ambiguous terms: “list”, “name”, “record”, “data”, …

85

Additional Information

Study 3: Case Study

System Open Conference System

Version 2.3.6, released May 28th, 2014

Language PHP

Supported DBMSs MySQL, PostgreSQL

Architecture Web-based application

Number of PHP files 1557

Number lines in PHP files 22198

Number of application defined roles 7

Number of database tables 52

Number of fields in database tables 369

86

Additional Information

Study 3: Case Study

Number of sentences 708

Number of ACR sentences 327

Number of ACRs 630

Number of DDE sentences 329

Number of DDEs 1002

Number of Entity DDEs 748 (287 unique)

Number of Entity-Attribute DDEs 99 (75 unique)

Number of Relationship DDEs 155 (82 unique)

Number of DDE sentences with no ACRs 2

87

Additional Information

Study 3: Research Questions

88


Study 3: Extracting ACRs

Precision Recall F1

OCS 0.53 0.27 0.35

Top 10 ACR Extraction Errors

Number of Times Missed | Error Type | Pattern

89 FN ( % VB root ( % NN dobj ))

36 FN ( % VB root ( % PRP nsubj )( % NN dobj ))

20 FN ( % VB root ( % NN prep_% ))

18 FN ( % VB root ( % NN nsubj )( % NN dobj ))

17 FP ( % VB root ( % NN nsubjpass ))

12 FN ( % VB root ( % PRP nsubj )( % NN prep_% ))

8 FP ( % VB root ( % PRP nsubj )( % NN dobj ))

5 FN ( allow VB root ( % PRP dobj )( % VB dep ( % NN dobj )))

5 FN ( % VB root ( % NN nsubj )( % NN prep_% ))

5 FN ( % VB root ( % NN nsubjpass ))

Modified version of Levenshtein String Edit Distance

Use words (vertices) instead of characters

89

Additional Information

Sentence Similarity Algorithm

computeVertexDistance(Vertex a, Vertex b)

    if a = NULL or b = NULL return 1
    if a.partOfSpeech <> b.partOfSpeech return 1
    if a.parentCount <> b.parentCount return 1
    for each parent in a.parents
        if not b.parents.contains(parent) return 1
    if a.lemma = b.lemma return 0
    if a and b are numbers return 0
    if NER classes match return 0
    wnValue = wordNetSynonyms(a.lemma, b.lemma)
    if wnValue > 0 return wnValue
    return 1
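For illustration, a word-level Levenshtein distance in Java where the substitution cost comes from a per-word comparison in the spirit of computeVertexDistance (simplified here to lemma equality; not the REDE implementation):

import java.util.List;

// Word-level edit distance with a pluggable per-word cost function.
public class SentenceSimilarity {

    static double wordCost(String a, String b) {
        // Stand-in for computeVertexDistance: 0 if the words match, 1 otherwise.
        return a.equalsIgnoreCase(b) ? 0.0 : 1.0;
    }

    static double editDistance(List<String> s, List<String> t) {
        double[][] d = new double[s.size() + 1][t.size() + 1];
        for (int i = 0; i <= s.size(); i++) d[i][0] = i;          // deletions
        for (int j = 0; j <= t.size(); j++) d[0][j] = j;          // insertions
        for (int i = 1; i <= s.size(); i++) {
            for (int j = 1; j <= t.size(); j++) {
                double sub = d[i - 1][j - 1] + wordCost(s.get(i - 1), t.get(j - 1));
                d[i][j] = Math.min(sub, Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
            }
        }
        return d[s.size()][t.size()];
    }

    public static void main(String[] args) {
        List<String> a = List.of("a", "nurse", "can", "order", "a", "lab", "procedure");
        List<String> b = List.of("a", "doctor", "can", "order", "a", "lab", "procedure");
        System.out.println(editDistance(a, b));   // 1.0: one word substitution
    }
}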

Why is This Problem Difficult?

90

• Ambiguity

• Multiple ways to express the same

meaning

• Resolution issues

Motivation: U.S. Data Breaches

91

Source: Privacy Rights Clearinghouse[2]

Motivation: Healthcare Documentation

• HIPAA

• HITECH ACT

• Meaningful Use Stage 1 Criteria

• Meaningful Use Stage 2 Criteria

• Certified EHR (45 CFR Part 170)

• ASTM

• HL7

• NIST FIPS PUB 140-2

• HIPAA Omnibus

• NIST Testing Guidelines

• DEA Electronic Prescriptions for Controlled Substances (EPCS)

• Industry Guidelines: CCHIT, EHRA, HL7

• State-specific requirements

• North Carolina General Statute § 130A-480 – Emergency Departments

• Organizational policies and procedures

• Project requirements, use cases, design, test scripts, …

• Payment Card Industry: Data Security Standard

93

The Scream, Edvard Munch, 1895

Dissertation Thesis

Access control rules explicitly and implicitly defined within unconstrained natural language product artifacts can be effectively identified and extracted;

Database design elements can be effectively identified and extracted;

Mappings can be identified among the access control rules, database design elements, and the physical database implementation; and

Role-based access control can be established within a system’s relational database.

94