DATA LEAK PREVENTION – IDENTIFYING SENSITIVE INFORMATION IN TEXT USING DEEP LEARNING

OPENSCIENCE SEMINAR, 31 OCTOBER 2018
IRA ASSENT, PROFESSOR
DEPARTMENT OF COMPUTER SCIENCE, AARHUS UNIVERSITY


DATA-INTENSIVE SYSTEMS

• Data Management
  • complex query processing and indexing
  • anytime and progressive algorithms
• Data Mining and Machine Learning
  • automatically extracting patterns from data
  • clustering, anomaly / outlier detection, classification
  • active learning
• Focus:
  • efficient and scalable algorithms for large data volumes
  • formal definitions and analyses
  • empirical prototypes and studies

[Figure labels: Big Data Analysis; Relevant patterns, query results]


Sensitive information

• Sensitivity is domain specific
• Business processes
• Inventions / new projects
• Customer / citizen information or cases
• Leak: high cost
• Data Leak Prevention
• Detect sensitive information in text
• Allows redacting prior to publication
• A form of classification task
• Supervised machine learning


The Enron Case

• Enron: US energy company
• Major accounting fraud scandal in 2001
• In the following investigation, FERC (Federal Energy Regulatory Commission) forced Enron to release its internal documents, including emails, memos, etc.
• Excellent case for the study of sensitive information detection


COMPLEX SENSITIVE INFORMATION

Keyword based approaches

• Define sensitivity using keyword(s)
• Observe frequent co-occurrence of words with keywords

Complex sensitivity

• Sensitivity depends on more than keywords
• Keyword based approaches cannot capture how people express sensitive topics in natural language phrases

Example

• A description of a sensitive financial transaction may use the same vocabulary as a non-sensitive one, but phrase it differently in natural language.

“We may have to move to cash margining if necessary”
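The keyword based approach above can be sketched in a few lines of Python. This is only an illustration: the keyword set, the helper names and the three extra sentences are hypothetical and do not come from the talk. It shows keyword matching and co-occurrence counting, and that the slide's example sentence contains none of the chosen keywords, so a matcher of this kind passes over it however it is phrased.

```python
import re
from collections import Counter

# Hypothetical keyword list; a real one would be defined per domain.
KEYWORDS = {"prepay"}

def tokenize(sentence):
    """Lower-case a sentence and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", sentence.lower())

def is_flagged(sentence, keywords=KEYWORDS):
    """Keyword rule: flag a sentence if it contains any keyword."""
    return any(token in keywords for token in tokenize(sentence))

def cooccurring_words(sentences, keywords=KEYWORDS):
    """Count words that appear in the same sentence as a keyword."""
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        if any(token in keywords for token in tokens):
            counts.update(t for t in tokens if t not in keywords)
    return counts

# Made-up sentences, for illustration only.
SENTENCES = [
    "The prepay deal closes Friday",
    "Prepay deal terms are attached",
    "Lunch is on Friday",
]
cooccurrence = cooccurring_words(SENTENCES)
slide_example = "We may have to move to cash margining if necessary"
print(is_flagged(slide_example), cooccurrence.most_common(1))
```

Because the rule looks only at vocabulary, two sentences with the same words but different phrasing always receive the same decision.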


Parse trees

• In the NLP (Natural Language Processing) community, parse trees capture the structure of phrases / sentences
• The tree structure captures the relationship / role of words

“We may have to move to cash margining if necessary”

[Figure: parse tree of the example sentence, with phrase labels VP, PP, NP and SBAR]
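A parse tree can be represented as nested tuples of the form (label, children...). The sketch below uses a hand-written bracketing of the example sentence. The phrase labels follow those visible in the slide's figure, but the exact attachment is an assumption and not the output of a parser, and the helper names `leaves` and `phrases` are made up for this illustration.

```python
# Hand-written bracketing (an assumption, not parser output).
TREE = (
    "S",
    ("NP", "We"),
    ("VP", "may",
     ("VP", "have",
      ("VP", "to",
       ("VP", "move",
        ("PP", "to", ("NP", "cash", "margining")),
        ("SBAR", "if", "necessary"))))),
)

def leaves(node):
    """Return the words below a node, in sentence order."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words

def phrases(node, label):
    """Return the text of every phrase in the tree that carries the given label."""
    if isinstance(node, str):
        return []
    found = [" ".join(leaves(node))] if node[0] == label else []
    for child in node[1:]:
        found.extend(phrases(child, label))
    return found

print(" ".join(leaves(TREE)))
print(phrases(TREE, "PP"), phrases(TREE, "SBAR"))
```

The same words appear as leaves, while the internal nodes record which words group together and in what role.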


NEURAL NETS / DEEP LEARNING

• As in any classifier: use training data x, where the correct “class label” c(x) is known
  • Here: sensitive document yes/no
• Initialize the network with random weights wij on the edges
• Feed training data x to the network (forward, without c(x))
• Output is the “predicted label” y, possibly with some error e(c(x), y) compared to the true class label
• “Backpropagation” layerwise from output to input
  • adjust weights based on the error
  • efficient automated derivative computation

[Figure: network with inputs x1, x2, x3 and output y; forward pass: prediction; backward pass: learning]
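The training loop described above can be sketched with a tiny network in plain Python. Everything concrete here is an assumption made for illustration: the toy data (three inputs, the third a constant acting as a bias), the two hidden units, the sigmoid activations, the cross-entropy error and the learning rate. The derivatives are written out by hand rather than computed automatically.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, w2):
    """Forward pass: hidden activations h and predicted label y."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    y = sigmoid(sum(w * hj for w, hj in zip(w2, h)))
    return h, y

def mean_error(data, W1, w2, eps=1e-12):
    """Mean cross-entropy error e(c(x), y) over the training data."""
    total = 0.0
    for x, c in data:
        _, y = forward(x, W1, w2)
        y = min(max(y, eps), 1.0 - eps)  # guard against log(0)
        total -= c * math.log(y) + (1 - c) * math.log(1 - y)
    return total / len(data)

def train(data, hidden=2, epochs=2000, lr=0.5, seed=0):
    rng = random.Random(seed)
    n_in = len(data[0][0])
    # Initialize the network with random weights.
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(hidden)]
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    initial = mean_error(data, W1, w2)
    for _ in range(epochs):
        for x, c in data:
            h, y = forward(x, W1, w2)      # forward: prediction
            delta_out = y - c              # error signal at the output
            # Backward: propagate the error to the hidden layer
            # (computed before w2 is updated).
            delta_hid = [delta_out * w2[j] * h[j] * (1 - h[j])
                         for j in range(hidden)]
            for j in range(hidden):
                w2[j] -= lr * delta_out * h[j]
                for i in range(n_in):
                    W1[j][i] -= lr * delta_hid[j] * x[i]
    return W1, w2, initial, mean_error(data, W1, w2)

# Toy data (x1, x2, constant 1.0) with class label c(x) = x1.
DATA = [([0.0, 0.0, 1.0], 0), ([0.0, 1.0, 1.0], 0),
        ([1.0, 0.0, 1.0], 1), ([1.0, 1.0, 1.0], 1)]
W1, w2, initial_error, final_error = train(DATA)
print(initial_error, final_error)
```

Deep learning frameworks replace the hand-written `delta_out` and `delta_hid` lines with automatic differentiation, which is what makes larger architectures practical.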


CLASSIFYING WITH RECURSIVE NEURAL NETS

Recursive neural network model

• Can take phrase trees of arbitrary length/complexity as input
• Trained using example sensitive documents
• For the Enron case
  • Prepay: all documents related to Enron’s engagement in structured commodity transactions known as “prepay transactions”
  • Labeled by law students for the TREC competition (Text REtrieval Conference)
• Used to predict sensitive content

[Figure: recursive neural network over a phrase tree, with weight labels Wl, Wr, Ul, Ur and V]
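A minimal sketch of how a recursive network can encode a binary phrase tree bottom-up and score it follows. It is a simplification under stated assumptions, not the model from the talk: only a left matrix, a right matrix and an output vector are used (named after Wl, Wr and V in the figure; the roles of Ul and Ur cannot be recovered from the slide), the weights are random and untrained, and the embedding size and tree binarization are made up.

```python
import math
import random

DIM = 4  # hypothetical embedding size
rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

W_l = rand_matrix(DIM, DIM)   # applied to the left child
W_r = rand_matrix(DIM, DIM)   # applied to the right child
V = [rng.uniform(-0.5, 0.5) for _ in range(DIM)]  # output weights
embeddings = {}               # word -> vector, created on first use

def embed(word):
    if word not in embeddings:
        embeddings[word] = [rng.uniform(-0.5, 0.5) for _ in range(DIM)]
    return embeddings[word]

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def encode(tree):
    """Encode a word (str) or a (left, right) pair as a DIM-sized vector."""
    if isinstance(tree, str):
        return embed(tree)
    left, right = tree
    l = matvec(W_l, encode(left))
    r = matvec(W_r, encode(right))
    return [math.tanh(a + b) for a, b in zip(l, r)]

def predict_sensitive(tree):
    """Score in (0, 1) read off the root encoding (untrained weights)."""
    z = sum(w * h for w, h in zip(V, encode(tree)))
    return 1.0 / (1.0 + math.exp(-z))

short_tree = ("cash", "margining")
long_tree = ("We", ("may", ("have", ("to", ("move",
             (("to", ("cash", "margining")), ("if", "necessary")))))))
print(predict_sensitive(short_tree), predict_sensitive(long_tree))
```

Because the same matrices are reused at every node, trees of any depth map to a vector of the same size, which is why input length and complexity are not restricted.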


Encoding Visualization

[Figure: t-SNE plot of the encodings, showing clusters]


ONGOING WORK

Training is often prohibitively slow, especially for complex networks and large data volumes

• Cluster (group) the representations in the hidden layers of the network (prior to actual prediction)
• Clusters with a “pure” class label are considered “learnt” and excluded from further training
• Reduces training time
• Even slightly improves prediction accuracy
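The exclusion step can be sketched as follows. The slide does not say which clustering method or purity criterion is used, so this illustration assumes the cluster assignments of the hidden representations are already computed and treats a cluster as “pure” only when all its examples share one class label. The function name and the sample values are hypothetical.

```python
from collections import defaultdict

def indices_still_to_train(cluster_ids, labels):
    """Return indices of examples whose cluster mixes class labels.

    cluster_ids[i] is the cluster of example i's hidden representation,
    labels[i] is its true class label.
    """
    labels_in_cluster = defaultdict(set)
    for cid, label in zip(cluster_ids, labels):
        labels_in_cluster[cid].add(label)
    pure = {cid for cid, seen in labels_in_cluster.items() if len(seen) == 1}
    return [i for i, cid in enumerate(cluster_ids) if cid not in pure]

# Six examples in three clusters; only cluster 1 mixes labels.
cluster_ids = [0, 0, 1, 1, 2, 2]
labels = [1, 1, 0, 1, 0, 0]
remaining = indices_still_to_train(cluster_ids, labels)
print(remaining)
```

Only the examples in mixed clusters are fed to the next training round, which is where the saving in training time comes from.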


Data structures are important for IR systems.

[Figure: possible expansions of “IR”: Information Retrieval, Infrared, International Relations, immunoreactivity, immediate release, infection rate, irradiation, ionizing radiation, insulin receptor, iterative reconstruction, insulin resistance, inversion recovery, ionotropic receptors, inflammatory response, interquartile range, information ratio, incidence rate, together with spelling and plural variants such as infra-red, inter-quartile ranges and insulin receptors]


HOW WORD EMBEDDINGS CAN DISAMBIGUATE

● Learning one vector for each abbreviation results in a mixture of semantics
  ○ E.g. IR: a mixture of Information Retrieval, Iterative Reconstruction, International Relations, Insulin Receptors…
● Idea: replace each abbreviation with a special token, and learn separately for each of them
● “Information retrieval (IR) is the activity of obtaining…” is rewritten to “__ABB__IR/Information_retrieval is the activity of obtaining…”
  ○ Thus, learn the context for each meaning!

Joint work with Manuel Ciosici, PhD student

[Figure: embedding space with the context words information, data, retrieval, databases and receptor, insulin, pig, shown once around a single IR vector and once around the separate tokens __ABB__IR/InformationRetrieval and __ABB__IR/Insulin Receptor]
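The rewriting step can be sketched with a regular expression. This is an assumed, simplified rule rather than the authors' pipeline: it handles only the pattern “long form (ABBR)” where the initials of the preceding words spell the abbreviation, and it leaves later bare occurrences of the abbreviation untouched. The function name `tag_abbreviations` is made up; the token format follows the slide.

```python
import re

ABBR_IN_PARENS = re.compile(r" \(([A-Z]{2,})\)")

def tag_abbreviations(text):
    """Rewrite 'long form (ABBR)' to '__ABB__ABBR/long_form'."""
    result = []
    cursor = 0
    for match in ABBR_IN_PARENS.finditer(text):
        abbr = match.group(1)
        words = text[cursor:match.start()].split(" ")
        # Candidate long form: as many preceding words as the abbreviation has letters.
        candidate = words[-len(abbr):]
        initials = "".join(w[:1].upper() for w in candidate)
        if len(candidate) == len(abbr) and initials == abbr:
            kept = " ".join(words[:-len(abbr)])
            token = "__ABB__" + abbr + "/" + "_".join(candidate)
            result.append(kept + (" " if kept else "") + token)
        else:
            # Initials do not match: leave the text unchanged.
            result.append(text[cursor:match.end()])
        cursor = match.end()
    result.append(text[cursor:])
    return "".join(result)

print(tag_abbreviations(
    "Information retrieval (IR) is the activity of obtaining information."))
print(tag_abbreviations(
    "Mice lacking the insulin receptor (IR) were studied."))
```

After this rewriting, an embedding model sees two distinct tokens for the two meanings, so each gets a vector shaped by its own contexts.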


Conclusion and discussion

• Data leak prevention is of high practical importance
• Detecting complex sensitive information
• A model built using phrase trees from NLP
• Train a recursive neural network that can learn from phrase trees to predict sensitivity
• The Enron case provides an interesting use case
• We are currently working on Monsanto
• Abbreviation disambiguation
• Speeding up training
• Transparency and explanations


Thank you! Questions?

Acknowledgements

Jan [email protected] [email protected]
