12/02/2015 Big Data & Security Aljosa Pasic

Big Data & Security - Trust in Digital Life · 2017-10-16

12/02/2015

Big Data & Security

Aljosa Pasic

2

Welcome to Madrid !!!

▶Big Data AND security: what is on our minds?

▶Big Data tools and technologies

▶Big Data T&T chain and security/privacy concern mappings

▶From Strategies to concrete solutions

▶Future research topics

▶SECCORD and PHEME

▶Conclusions

3

BD 4 SEC; SEC 4 BD or BD & SEC?

Big Data for security: SIEM can do it!!!

4

Need for privacy & security FOR BD!!!!

5

Plethora of “Big Data” related tools

http://www.bigdata-startups.com/open-source-tools/

6

Technology landscape

Big Data Baseline technologies

• Batch processing: Apache Hadoop, Spark…

• Real-time processing: Apache Storm, S4, Spark…

• Messaging & queues: Apache Flume, Kafka…

• Visualization frameworks: HTML5, D3, Tableau, QlikView, 3D data visualization, mobile visualization…

• Related technologies: CEP, pipelines, sensor and machine acquisition APIs, Social Networks APIs…

Advanced Data Analytics

• Machine Learning, Deep Learning, Data mining, Web mining

• Statistical methods, pattern recognition…

• Decision Support Systems, predictive & prescriptive analytics…

• Libraries: MADlib, R libraries; Languages: R, Python, Java; Analytical Processes: KNIME, RapidMiner; Statistical Software: R, Weka, Mahout, CLUTO, Octave, gretl, Racket, SPSS; Methodology: CRISP-DM; Standards: Predictive Model Markup Language (PMML)

Language technologies

• Natural Language Processing

• Named Entity Recognition, PoS tagging, language detection…

• (Semi)Automatic categorization and annotation

Big Data storage

• NoSQL: HBase, Cassandra, MongoDB, Neo4j…

• Triplestores: Sesame, GraphDB…

• NewSQL

• In-memory processing (SAP HANA…)

Semantics

• Ontology engineering

• Linked Data

• Formal Semantics (DL, OWL, FOL…)

• Semantic Interoperability

Big Data architectures

• Big Data reference architectures

• Lambda architecture

• Scalable solutions, fit-for-purpose solutions

• Standards

7

Security and privacy technologies in the Big Data Value Chain

[Figure: Big Data value chain. Stages: Data Acquisition, Data pre-processing, Data Curation, Data Storage, Data Analysis, Data Usage; alongside them: Big Data Architecture and Data science.]

Labels in the diagram:

• HDFS, HBase, Cassandra, MongoDB, ElephantDB, Neo4j, triplestores

• Models, veracity, matching, cleansing, validation, update

• Social Networks, IoT, Web, CEP, messaging, pub-sub, Apache Kafka, Apache Flume

• Visualization: 2D, 3D, mobile, APIs, D3, Tableau

• Text mining, text analysis, sentiment analysis, case-based reasoning

• Real-time data analytics, data mining, machine learning, deep learning, Hadoop, Storm, Spark…

• Data scientist, statistics, data mining, machine learning

• Reference architecture, Lambda Arch., pipelines

• Filtering, cleansing, aggregation, fusion, annotation, categorization, NLP, NER, R, Octave, ML frameworks, Weka, Apache Mahout

Consent in M2M?

Anonymization? Access and usage policies??

8

Need to map strategies to BD value chain

9

Strategy analysis

▶ MINIMIZE (Collection stage):

– data posted on SN, collected as a service requirement, collected as a legal requirement, collected automatically and unknowingly (e.g. location), inferred by previous processing, bought and added from external sources, shared with external sources

– Recommendation: move from consent to reputational penalties (trust index)

▶ HIDE (Pre-processing):

– side-information, meta-data leakage (e.g. location)

– anonymization/de-identification is not feasible in the long term

– add “noise”, use an intermediary (trusted privacy proxy), publish “epsilons”

▶ HIDE (Processing):

– functions over encrypted data

– hybrid or onion encryption
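The "noise" and "epsilons" items under HIDE (Pre-processing) read like a reference to differential privacy; that reading is an assumption, since the slide does not name the technique. Under that assumption, a minimal sketch of the Laplace mechanism for a count query could look like this. The names `laplace_noise` and `noisy_count` are hypothetical and chosen only for illustration.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    # The difference of two i.i.d. Exp(rate=1/scale) variables is
    # Laplace-distributed with the given scale.
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def noisy_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a count with Laplace noise calibrated to epsilon.

    Assumption: adding or removing one record changes the count by at
    most 1 (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random()
    records = [{"city": "Madrid"}, {"city": "Madrid"}, {"city": "Paris"}]
    # Smaller epsilon means more noise and stronger privacy.
    print(noisy_count(records, lambda r: r["city"] == "Madrid", 0.5, rng))
```

Publishing the epsilon used for each release lets a data subject or auditor reason about the cumulative privacy loss across queries.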

10

Strategy analysis

▶ CONTROL (Data usage and analysis)

– Express purpose, context, usage… in a data policy

– Associate policy with data (“sticky” policy) and the processing component (monitor and enforcement)

– From natural-language (NL) policy to machine-readable (MR) policy and data-tags (transformation, refinement)

– From input policies to (computed) output policies

– Build EU regulation library of NL2MR patterns

▶ BD results and Post-use impact

– Discrimination

– Data divide

– Power imbalance

– Echo chambers

▶ Strategy: cultural and societal awareness and capacity building
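The CONTROL items on this slide (a policy that travels with the data, enforcement at the processing component, and output policies computed from input policies) can be sketched roughly as follows. This is an illustration under stated assumptions, not a description of any particular policy engine: the classes `Policy` and `TaggedData`, the functions `enforce`, `combine` and `process`, and the rule "the output may only be used for purposes allowed by every input" are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    purposes: frozenset  # purposes for which the data may be used


@dataclass(frozen=True)
class TaggedData:
    value: object
    policy: Policy  # the "sticky" policy travels with the value


def enforce(item: TaggedData, purpose: str) -> object:
    """Release the value only if the declared purpose is allowed."""
    if purpose not in item.policy.purposes:
        raise PermissionError(f"purpose {purpose!r} not allowed by policy")
    return item.value


def combine(policies) -> Policy:
    """Compute an output policy as the intersection of the input policies."""
    policies = list(policies)
    if not policies:
        raise ValueError("at least one input policy is required")
    return Policy(frozenset.intersection(*(p.purposes for p in policies)))


def process(items, func, purpose: str) -> TaggedData:
    """Run func over the inputs and attach the computed output policy."""
    values = [enforce(item, purpose) for item in items]
    return TaggedData(func(values), combine(item.policy for item in items))


if __name__ == "__main__":
    a = TaggedData(3, Policy(frozenset({"research", "billing"})))
    b = TaggedData(4, Policy(frozenset({"research"})))
    result = process([a, b], sum, "research")
    print(result.value, sorted(result.policy.purposes))
```

Intersection is the most restrictive choice; a real deployment would also need context and obligations in the policy, which this sketch leaves out.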

11

Need to define FUTURE security research topics for BD

▶ Secure data conditioning

▶ Tamper resistant logs (e.g. TR-Flume, privacy in auditing)

▶ Secure object storage

▶ Secure “divide and conquer” computation approach

▶ Secure stream processing

▶ SW and ontologies for policy conflict resolution, policy transformation and refinement

▶ Extracting and sharing cybersecurity linked data

▶ Security metadata and tagging (e.g. machine readable certificates, use BD to semi-automate tagging)

▶ Security and machine learning e.g. recommendation engines, prediction, intelligent agents, risk assessment, distributed ML, simulation games…

▶ Threats from unsupervised machine learning algorithms

▶ Secure infomediaries, data value added resellers (VAR), data marketplaces (https://gnip.com)

▶ Statistical models, correlation rules, logic etc

▶ Security and creativity!!!!
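One common building block for the "tamper resistant logs" topic above is a hash chain, in which every record commits to the one before it. The sketch below is a generic illustration and an assumption on our part; it does not describe TR-Flume, and the helper names `append` and `verify` are hypothetical. Strictly speaking it makes tampering evident rather than impossible.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash that precedes the first record


def _digest(prev_hash: str, entry: str) -> str:
    """Hash the entry together with the previous record's hash."""
    payload = json.dumps({"prev": prev_hash, "entry": entry}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append(log: list, entry: str) -> None:
    """Append an entry whose hash chains to the previous record."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"entry": entry, "hash": _digest(prev_hash, entry)})


def verify(log: list) -> bool:
    """Recompute the chain and report whether every record still matches."""
    prev_hash = GENESIS
    for record in log:
        if record["hash"] != _digest(prev_hash, record["entry"]):
            return False
        prev_hash = record["hash"]
    return True


if __name__ == "__main__":
    log = []
    for line in ["login alice", "read file-17", "logout alice"]:
        append(log, line)
    print(verify(log))          # chain is intact
    log[1]["entry"] = "read file-99"
    print(verify(log))          # modification is detected
```

An attacker who can rewrite the whole log can also recompute the chain, so the latest hash has to be stored or signed somewhere the attacker cannot reach.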

12

SECCORD trend analysis

▶ SECCORD D5.4 : Big Data impact on security

▶ Opening up security data repositories (ACDC)

▶ Role of unstructured text in SN-based botnet C&C

▶ Patterns of abnormal behavior

▶ Pattern recognition = discovery (data mining to find patterns) + detection (apply pattern to find e.g. anomaly)

▶ False alarm reduction

▶ Veracity engines
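The split of pattern recognition into discovery and detection can be illustrated with a deliberately simple statistical model. This is a sketch under assumptions: the "pattern" here is just the mean and standard deviation of a baseline, and the function names `discover` and `detect` and the z-score threshold are hypothetical choices, not something the slides prescribe.

```python
import statistics


def discover(baseline):
    """Discovery: mine a pattern (mean, standard deviation) from baseline data."""
    return statistics.mean(baseline), statistics.stdev(baseline)


def detect(pattern, observations, threshold: float = 3.0):
    """Detection: apply the pattern and return observations that deviate from it."""
    mean, stdev = pattern
    return [x for x in observations if abs(x - mean) > threshold * stdev]


if __name__ == "__main__":
    # Hypothetical data: requests per minute from one host.
    baseline = [10, 12, 11, 9, 10, 11, 10, 12]
    pattern = discover(baseline)
    print(detect(pattern, [11, 50, 9]))
```

Raising `threshold` trades missed anomalies for fewer false alarms, which is one simple lever for the false alarm reduction item above.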

13

The 3+ V’s of Big Data

▶ 1. Volume (lots of data, zettabytes)

▶ 2. Variety (complexity, dimensionality)

▶ 3. Velocity (fast data)

+

▶ 4. Veracity (truthfulness, curation)

▶ 5. Venue (location)

▶ 6. Vocabulary (semantics)

▶ 7. Variability …

From “Understanding Big Data” by IBM

14

Conclusion nr. 1: is the future Orwell or Huxley like??

15

Conclusion nr. 2 (not definitive): your privacy will be in hands of “data curator”


Thank you

Atos Research & [email protected]

Atos, the Atos logo, Atos Consulting, Atos Worldline, Atos Sphere, Atos Cloud and Atos WorldGrid are registered trademarks of Atos SA. June 2011

© 2011 Atos. Confidential information owned by Atos, to be used by the recipient only. This document, or any part of it, may not be reproduced, copied, circulated and/or distributed nor quoted without prior written approval from Atos.