AARHUS UNIVERSITY
DEPARTMENT OF COMPUTER SCIENCE
OPENSCIENCE SEMINAR, 31 OCTOBER 2018
IRA ASSENT, PROFESSOR
DATA LEAK PREVENTION – IDENTIFYING SENSITIVE INFORMATION IN TEXT USING DEEP LEARNING
DATA-INTENSIVE SYSTEMS
• Data Management
  • complex query processing and indexing
  • anytime and progressive algorithms
• Data Mining and Machine Learning
  • automatically extracting patterns from data
  • clustering, anomaly / outlier detection, classification
  • active learning
• Focus:
  • efficient and scalable algorithms for large data volumes
  • formal definitions and analyses
  • empirical prototypes and studies
[Figure labels: Big Data Analysis; relevant patterns, query results]
Sensitive information
• Sensitivity is domain specific
  • Business processes
  • Inventions / new projects
  • Customer / citizen information or cases
• Leak: high cost
• Data Leak Prevention
  • Detect sensitive information in text
  • Allows redacting prior to publication
• A form of classification task
  • Supervised machine learning
The Enron Case
• Enron: US energy company
• Major accounting fraud scandal in 2001
• In the following investigation, FERC (Federal Energy Regulatory Commission) forced Enron to release its internal documents, including emails, memos, etc.
• Excellent case for the study of sensitive information detection
COMPLEX SENSITIVE INFORMATION
Keyword-based approaches
• Define sensitivity using keyword(s)
• Observe frequent co-occurrence of words with keywords
Complex sensitivity
• Sensitivity does not depend only on keywords
• Keywords cannot capture how people express sensitive topics in natural language phrases
Example
• A description of sensitive financial transactions might use the same vocabulary as the non-sensitive case, but different natural language expressions.
”We may have to move to cash margining if necessary”
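To make the limitation concrete, the following is a minimal sketch of a keyword-based detector, not the approach used in the talk. The keyword set, the sample documents, and the function names `keyword_sensitive` and `cooccurring_words` are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def keyword_sensitive(text, keywords):
    """Flag a text as sensitive if it contains any keyword."""
    return any(tok in keywords for tok in tokenize(text))

def cooccurring_words(documents, keywords, top_n=3):
    """Count words that appear in the same document as a keyword."""
    counts = Counter()
    for doc in documents:
        tokens = set(tokenize(doc))
        if tokens & keywords:
            counts.update(tokens - keywords)
    return counts.most_common(top_n)

# Hypothetical keyword set and documents, for illustration only.
KEYWORDS = {"prepay"}
DOCS = [
    "The prepay deal needs approval",
    "Prepay approval is pending",
    "Lunch at noon",
]
EXAMPLE = "We may have to move to cash margining if necessary"

flagged = keyword_sensitive(EXAMPLE, KEYWORDS)
top_word = cooccurring_words(DOCS, KEYWORDS, top_n=1)
```

The example sentence from the slide contains none of the assumed keywords, so a detector of this kind does not flag it, whatever the phrasing implies.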
Parse trees
• In the NLP (Natural Language Processing) community, parse trees capture the structure of phrases / sentences
• The tree structure captures the relationship / role of words
”We may have to move to cash margining if necessary”
[Figure: parse tree over "move to cash margining if necessary", with phrase labels VP, PP, NP and SBAR]
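A parse tree can be held in a small nested data structure. The sketch below reads a bracketed tree into `(label, children)` tuples. The bracketing is written by hand as an assumption for illustration; it is not the output of a parser and need not match the tree drawn on the slide.

```python
def parse_bracketed(s):
    """Parse '(LABEL child ...)' into (label, [children]); leaves are strings."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] == "(":
            label = tokens[pos + 1]
            pos += 2
            children = []
            while tokens[pos] != ")":
                children.append(read())
            pos += 1  # skip the closing ")"
            return (label, children)
        word = tokens[pos]
        pos += 1
        return word

    return read()

def leaves(tree):
    """Return the words of the tree from left to right."""
    if isinstance(tree, str):
        return [tree]
    return [w for child in tree[1] for w in leaves(child)]

def labels(tree):
    """Return all phrase labels in the tree (pre-order)."""
    if isinstance(tree, str):
        return []
    return [tree[0]] + [l for child in tree[1] for l in labels(child)]

# Hand-written bracketing (an assumption, not parser output).
BRACKETED = ("(S (NP We) (VP may (VP have (VP to (VP move "
             "(PP to (NP cash margining)) (SBAR if necessary))))))")
tree = parse_bracketed(BRACKETED)
```

Reading the leaves from left to right recovers the sentence, while the labels record which phrase each word belongs to.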
NEURAL NETS / DEEP LEARNING
• As in any classifier: use training data x for which the correct ”class label” c(x) is known
  • Here: sensitive document yes/no
• Initialize the network with random weights w_ij on the edges
• Feed training data x to the network (forward, without c(x))
• The output is the ”predicted label” y, possibly with some error e(c(x), y) compared to the true class label
• ”Backpropagation” layerwise from output to input
  • adjust weights based on the error
  • efficient automated derivative computation
[Figure: network with inputs x1, x2, x3 and output y; forward: prediction, backward: learning]
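The forward and backward steps can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: one hidden layer, sigmoid units, squared error, and a made-up data set in which the label equals the first input. None of these choices come from the talk.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, W2):
    """Forward pass: hidden activations h and predicted label y."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    y = sigmoid(sum(w * hj for w, hj in zip(W2, h)))
    return h, y

def train(data, n_hidden=2, lr=0.5, epochs=2000, seed=0):
    """Start from random weights and adjust them by backpropagation."""
    rng = random.Random(seed)
    n_in = len(data[0][0])
    W1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    W2 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    for _ in range(epochs):
        for x, c in data:
            h, y = forward(x, W1, W2)
            # Error e = 0.5 * (y - c)^2; gradient at the output unit.
            delta_out = (y - c) * y * (1 - y)
            # Propagate the error back to the hidden layer.
            delta_hid = [delta_out * W2[j] * h[j] * (1 - h[j])
                         for j in range(n_hidden)]
            for j in range(n_hidden):
                W2[j] -= lr * delta_out * h[j]
                for i in range(n_in):
                    W1[j][i] -= lr * delta_hid[j] * x[i]
    return W1, W2

def mean_error(data, W1, W2):
    """Average squared error of the network on the data."""
    return sum(0.5 * (forward(x, W1, W2)[1] - c) ** 2
               for x, c in data) / len(data)

# Toy data (x1, x2, constant 1.0) -> label; the label equals x1.
DATA = [((0.0, 0.0, 1.0), 0.0), ((0.0, 1.0, 1.0), 0.0),
        ((1.0, 0.0, 1.0), 1.0), ((1.0, 1.0, 1.0), 1.0)]

W1_init, W2_init = train(DATA, epochs=0)   # random initial weights
W1_fit, W2_fit = train(DATA, epochs=2000)  # same start, after training
```

In practice a deep learning framework computes the derivatives automatically; the manual updates here only make the layerwise flow of the error visible.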
CLASSIFYING WITH RECURSIVE NEURAL NETS
Recursive neural network model
• Can take phrase trees of arbitrary length/complexity as input
• Trained using example sensitive documents
  • For the Enron case
    • Prepay: all documents related to Enron’s engagement in structured commodity transactions known as “prepay transactions”
    • Labeled by law students for the TREC competition (Text REtrieval Conference)
• Used to predict sensitive content
[Figure: recursive neural network over a phrase tree, with weight matrices Wl, Wr, Ul, Ur and V]
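The idea of composing a tree bottom-up can be sketched as follows. This is a simplified illustration, not the model from the talk: it uses only a left matrix `Wl`, a right matrix `Wr` and an output vector `V`, and their roles, the dimension `DIM`, the random word vectors and the binary tree are all assumptions. The weights are untrained, so the score carries no meaning beyond showing the data flow.

```python
import math
import random

DIM = 4                      # assumed embedding size
rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

Wl = rand_matrix(DIM, DIM)   # applied to the left child
Wr = rand_matrix(DIM, DIM)   # applied to the right child
V = [rng.uniform(-0.5, 0.5) for _ in range(DIM)]  # output weights
word_vectors = {}

def embed(word):
    """Look up (or create) a random vector for a word."""
    if word not in word_vectors:
        word_vectors[word] = [rng.uniform(-0.5, 0.5) for _ in range(DIM)]
    return word_vectors[word]

def encode(tree):
    """Encode a binary tree: leaves are words, inner nodes are (left, right)."""
    if isinstance(tree, str):
        return embed(tree)
    left, right = tree
    l, r = encode(left), encode(right)
    return [math.tanh(a + b) for a, b in zip(matvec(Wl, l), matvec(Wr, r))]

def sensitivity_score(tree):
    """Map the root encoding to a score in (0, 1)."""
    z = sum(v * x for v, x in zip(V, encode(tree)))
    return 1.0 / (1.0 + math.exp(-z))

# The same weights handle trees of different size and shape.
short_tree = ("cash", "margining")
deep_tree = ("move", (("to", ("cash", "margining")), ("if", "necessary")))
scores = [sensitivity_score(short_tree), sensitivity_score(deep_tree)]
```

Because the same matrices are reused at every inner node, the network accepts phrase trees of arbitrary depth.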
Encoding Visualization
[Figure: t-SNE visualization of the encodings, showing clusters]
ONGOING WORK
Training is often prohibitively slow, especially for complex networks and large data volumes
• Cluster (group) the representations in the hidden layers of the network (prior to actual prediction)
• Clusters with a ”pure” class label are considered ”learnt” and are excluded from further training
• Reduces training time
• Even slightly improves prediction accuracy
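The exclusion step can be sketched as below, assuming that cluster assignments for the hidden representations are already available from some clustering method. The function names and the purity threshold `min_purity` are hypothetical; the talk does not specify them.

```python
from collections import Counter, defaultdict

def pure_clusters(cluster_ids, labels, min_purity=1.0):
    """Return ids of clusters whose majority label share reaches min_purity."""
    by_cluster = defaultdict(list)
    for cid, lab in zip(cluster_ids, labels):
        by_cluster[cid].append(lab)
    pure = set()
    for cid, labs in by_cluster.items():
        majority = Counter(labs).most_common(1)[0][1]
        if majority / len(labs) >= min_purity:
            pure.add(cid)
    return pure

def remaining_training_indices(cluster_ids, labels, min_purity=1.0):
    """Indices of examples that stay in training (not in a pure cluster)."""
    pure = pure_clusters(cluster_ids, labels, min_purity)
    return [i for i, cid in enumerate(cluster_ids) if cid not in pure]

# Made-up cluster assignments and class labels for six training examples.
CLUSTER_IDS = [0, 0, 1, 1, 2, 2]
LABELS = [1, 1, 0, 1, 0, 0]
keep = remaining_training_indices(CLUSTER_IDS, LABELS)
```

In this made-up example clusters 0 and 2 are pure, so only the two examples in the mixed cluster 1 remain in the training set.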
Data structures are important for IR systems.
[Word cloud: candidate long forms of the abbreviation IR]
Information Retrieval, Infrared, International Relations, Infra-red, infra-red, International Relation, Inter-national Relations, immunoreactivity, immediate release, infection rate, irradiation, ionizing radiation, insulin receptor, iterative reconstruction, insulin resistance, inversion recovery, ionotropic receptors, inflammatory response, interquartile range, information ratio, incidence rate, immunoreactivities, infection rates, immediate releases, ionotropic receptor, inflammatory responses, information ratios, insulin receptors, inter-quartile range, inter-quartile ranges, incidence rates, iterative reconstructions
HOW WORD EMBEDDINGS CAN DISAMBIGUATE
● Learning one vector for each abbreviation results in a mixture of semantics
  ○ E.g. IR: mixture of Information Retrieval, Iterative Reconstruction, International Relations, Insulin Receptors…
● Idea: replace each abbreviation with a special token, learn separately for each of them
● “Information retrieval (IR) is the activity of obtaining…” rewritten to “__ABB__IR/Information_retrieval is the activity of obtaining…”
  ○ Thus, learn the context for each meaning!
Joint work with Manuel Ciosici, PhD student
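[Figure: embedding space. With a single vector, IR sits among the words information, data, retrieval, databases and receptor, insulin, pig. With separate tokens, __ABB__IR/InformationRetrieval sits with information, data, retrieval, databases and __ABB__IR/Insulin Receptor sits with receptor, insulin, pig]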
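The rewriting step can be sketched with a regular expression. This is a simplified illustration that assumes the long form directly precedes the parenthesised abbreviation and that the initials of its words spell the abbreviation; the function name `rewrite_abbreviations` is hypothetical, and real text needs a more tolerant matcher.

```python
import re

ABBR = re.compile(r"\(([A-Z]{2,})\)")

def rewrite_abbreviations(text):
    """Replace 'long form (ABBR)' with the token '__ABB__ABBR/long_form'."""
    out, pos = [], 0
    for m in ABBR.finditer(text):
        abbr = m.group(1)
        before = text[pos:m.start()]
        # Take as many words before the parenthesis as the abbreviation has letters.
        long_form = re.search(r"((?:\w+\s+){%d}\w+)\s*$" % (len(abbr) - 1), before)
        if long_form:
            words = long_form.group(1).split()
            if all(w[0].upper() == a for w, a in zip(words, abbr)):
                out.append(before[:long_form.start()])
                out.append("__ABB__%s/%s" % (abbr, "_".join(words)))
                pos = m.end()
                continue
        # No matching long form: keep the text unchanged.
        out.append(text[pos:m.end()])
        pos = m.end()
    out.append(text[pos:])
    return "".join(out)

rewritten = rewrite_abbreviations(
    "Information retrieval (IR) is the activity of obtaining")
```

After this preprocessing, an embedding model trained on the rewritten corpus sees one token per meaning and can learn a separate vector for each.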
Conclusion and discussion
• Data leak prevention is of high practical importance
• Detecting complex sensitive information
• A model built using phrase trees from NLP
• Train a recursive neural network that can learn from phrase trees to predict sensitivity
• The Enron case provides an interesting use case
• We are currently working on Monsanto
• Abbreviation disambiguation
• Speeding up training
• Transparency and explanations
Thank you! Questions?
Acknowledgements
Jan [email protected] [email protected]