Welcome to Madrid !!!
▶Big Data AND security: what is on our minds?
▶Big Data tools and technologies
▶Big Data T&T chain and security/privacy concern mappings
▶From Strategies to concrete solutions
▶Future research topics
▶SECCORD and PHEME
▶Conclusions
Technology landscape
Big Data Baseline technologies
• Batch processing: Apache Hadoop, Spark...
• Real-time processing: Apache Storm, S4, Spark...
• Messaging & queues: Apache Flume, Kafka...
• Visualization frameworks: HTML5, D3, Tableau, QlikView, 3D data visualization, mobile visualization...
• Related technologies: CEP, pipelines, sensor and machine acquisition APIs, social network APIs...
Advanced Data Analytics
• Machine Learning, Deep Learning, Data mining, Web mining
• Statistical methods, pattern recognition...
• Decision Support Systems, predictive & prescriptive analytics...
• Libraries: MADlib, R libraries; Languages: R, Python, Java; Analytical processes: KNIME, RapidMiner; Statistical software: R, Weka, Mahout, Cluto, Octave, gretl, Racket, SPSS; Methodology: CRISP-DM; Standards: Predictive Model Markup Language (PMML)
Language technologies
• Natural Language Processing
• Named Entity Recognition, PoS tagging, language detection...
• (Semi-)automatic categorization and annotation
Big Data storage
• NoSQL: HBase, Cassandra, MongoDB, Neo4j...
• Triplestores: Sesame, GraphDB...
• NewSQL
• In-memory processing (SAP HANA...)
Semantics
• Ontology engineering
• Linked Data
• Formal Semantics (DL, OWL, FOL...)
• Semantic Interoperability
Big Data architectures
• Big Data reference architectures
• Lambda architecture
• Scalable solutions, fit-for-purpose solutions
• Standards
Security and privacy technologies in the Big Data Value Chain
[Diagram: the Big Data Value Chain and the tools and techniques attached to each stage]
• Data Acquisition: Social Networks, IoT, Web, …; CEP, messaging, pub-sub, Apache Kafka, Apache Flume, …
• Data pre-processing: filtering, cleansing, aggregation, fusion, annotation, categorization, NLP, NER
• Data Curation: models, veracity, matching, cleansing, validation, update
• Data Storage: NOHDFS, HBase, Cassandra, MongoDB, ElephantDB, Neo4j, triplestores, …
• Data Analysis: text mining, text analysis, sentiment analysis, case-based reasoning, real-time data analytics, data mining, machine learning, deep learning, …; Hadoop, Storm, Spark…
• Data Usage: visualization (2D, 3D, mobile, APIs), D3, Tableau
• Big Data Architecture: Reference, Lambda Arch., pipelines, …
• Data science: data scientist, statistics, data mining, machine learning; R, Octave, ML frameworks, Weka, Apache Mahout
Consent in M2M? Anonymization? Access and usage policies?
Strategy analysis
▶ MINIMIZE (Collection stage):
– data posted on SN, collected as a service requirement, collected as a legal requirement, collected automatically and unknowingly (e.g. location), inferred by previous processing, bought and added from external sources, shared with external sources
– Recommendation: move from consent to reputational penalties (trust index)
▶ HIDE (Pre-processing):
– side information, metadata leakage (e.g. location)
– anonymization/de-identification is not feasible in the long term
– add “noise”, use an intermediary (trusted privacy proxy), publish “epsilons”
▶ HIDE (Processing):
– functions over encrypted data
– hybrid or onion encryption
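The “noise” and “epsilons” items under HIDE (Pre-processing) can be illustrated with a minimal sketch of the Laplace mechanism from differential privacy. This is an assumption about what the slide has in mind, not a method it names: the function names, the counting query, and the sample data are all hypothetical.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def noisy_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a count perturbed with Laplace noise calibrated to epsilon.

    A counting query changes by at most 1 when one record is added or
    removed, so its sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for v in values if predicate(v))
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(42)          # fixed seed only to make the demo repeatable
    ages = [23, 35, 41, 52, 67, 29]  # hypothetical records
    for eps in (0.1, 1.0, 10.0):     # the published "epsilon"
        print(eps, noisy_count(ages, lambda a: a >= 40, eps, rng))
```

A smaller epsilon means more noise and stronger privacy; publishing the epsilon lets data consumers judge how much accuracy was traded away.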
Strategy analysis
▶ CONTROL (Data usage and analysis)
– Express purpose, context, usage… in a data policy
– Associate the policy with the data (“sticky” policy) and with the processing component (monitoring and enforcement)
– From natural-language (NL) policy to machine-readable (MR) policy and data tags (transformation, refinement)
– From input policies to (computed) output policies
– Build EU regulation library of NL2MR patterns
▶ BD results and Post-use impact
– Discrimination
– Data divide
– Power imbalance
– Echo chambers
▶ Strategy: cultural and societal awareness and capacity building
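The CONTROL items above (a policy that sticks to the data, an enforcement point in the processing component, and output policies computed from input policies) can be sketched as follows. The `Policy` fields, the “most restrictive wins” combination rule, and all names are illustrative assumptions; the slides do not prescribe a policy model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    purposes: frozenset   # purposes the data may be used for
    retention_days: int   # maximum retention period


@dataclass(frozen=True)
class TaggedData:
    value: object
    policy: Policy        # the "sticky" policy travels with the value


def combine(a: Policy, b: Policy) -> Policy:
    """Assumed rule: the output policy is the most restrictive of the inputs."""
    return Policy(a.purposes & b.purposes, min(a.retention_days, b.retention_days))


def process(inputs, purpose, func):
    """Enforcement point: check the declared purpose, then derive the output policy."""
    for item in inputs:
        if purpose not in item.policy.purposes:
            raise PermissionError(f"purpose {purpose!r} not allowed by input policy")
    out_policy = inputs[0].policy
    for item in inputs[1:]:
        out_policy = combine(out_policy, item.policy)
    return TaggedData(func([item.value for item in inputs]), out_policy)


if __name__ == "__main__":
    a = TaggedData(10, Policy(frozenset({"billing", "research"}), 365))
    b = TaggedData(5, Policy(frozenset({"research"}), 30))
    print(process([a, b], "research", sum))
```

Intersecting purposes and taking the shortest retention is only one possible refinement rule; a real deployment would derive it from the applicable regulation.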
11
Need to define FUTURE security research topics for BD
▶ Secure data conditioning
▶ Tamper-resistant logs (e.g. TR-Flume, privacy in auditing)
▶ Secure object storage
▶ Secure “divide and conquer” computation approach
▶ Secure stream processing
▶ SW and ontologies for policy conflict resolution, policy transformation and refinement
▶ Extracting and sharing cybersecurity linked data
▶ Security metadata and tagging (e.g. machine-readable certificates, use BD to semi-automate tagging)
▶ Security and machine learning, e.g. recommendation engines, prediction, intelligent agents, risk assessment, distributed ML, simulation games…
▶ Threats from unsupervised machine learning algorithms
▶ Secure infomediaries, data value-added resellers (VAR), data marketplaces (https://gnip.com)
▶ Statistical models, correlation rules, logic etc
▶ Security and creativity!!!!
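As an illustration of the tamper-resistant logs topic above, here is a minimal hash-chain sketch. It is an assumption about one common building block, not a description of TR-Flume, and it is only tamper-evident: an attacker who can rewrite the whole log can recompute the chain unless the latest hash is also stored somewhere trusted. All function names are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, record: dict) -> str:
    """Hash the record together with the previous entry's hash."""
    payload = json.dumps({"prev": prev_hash, "record": record}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append(log: list, record: dict) -> None:
    """Append a record, linking it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"record": record, "prev": prev_hash,
                "hash": _digest(prev_hash, record)})


def verify(log: list) -> bool:
    """Recompute the chain; any edited, removed, or reordered entry breaks it."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True


if __name__ == "__main__":
    log = []
    append(log, {"event": "login", "user": "alice"})
    append(log, {"event": "read", "user": "alice"})
    print(verify(log))
    log[0]["record"]["user"] = "mallory"  # simulate tampering
    print(verify(log))
```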
SECCORD trend analysis
▶ SECCORD D5.4 : Big Data impact on security
▶ Opening up security data repositories (ACDC)
▶ Role of unstructured text in SN-based botnet C&C
▶ Patterns of abnormal behavior
▶ Pattern recognition = discovery (data mining to find patterns) + detection (apply pattern to find e.g. anomaly)
▶ False alarm reduction
▶ Veracity engines
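The split of pattern recognition into discovery and detection can be shown with a deliberately simple sketch: discovery learns a baseline from historical data, and detection applies it to new observations. The mean/standard-deviation model, the threshold, and the sample numbers are illustrative assumptions, not the approach analysed in SECCORD D5.4.

```python
import statistics


def discover(baseline):
    """Discovery: mine a pattern (here, mean and standard deviation) from history."""
    return statistics.mean(baseline), statistics.stdev(baseline)


def detect(pattern, observations, threshold=3.0):
    """Detection: flag observations further than `threshold` deviations from the mean."""
    mean, std = pattern
    return [x for x in observations if abs(x - mean) > threshold * std]


if __name__ == "__main__":
    history = [10, 12, 11, 9, 10, 11, 10, 12]  # hypothetical requests per minute
    pattern = discover(history)
    print(detect(pattern, [10, 11, 50, 9]))
```

Raising `threshold` reduces false alarms at the cost of missing subtler anomalies, which is the trade-off behind the false alarm reduction item.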
The 3+ V’s of Big Data
▶ 1. Volume (lots of data, zettabytes)
▶ 2. Variety (complexity, dimensionality)
▶ 3. Velocity (fast data)
+
▶ 4. Veracity (truthfulness, curation)
▶ 5. Venue (location)
▶ 6. Vocabulary (semantics)
▶ 7. Variability …
From “Understanding Big Data” by IBM
12/02/2015
Thank you
Atos Research & [email protected]