big-data-spain
Finding the needle in the haystack: How the food industry is leveraging big data to defeat cyber attacks
Rafael Villoria-Ferrer - Rafael San Miguel Carrasco
Nestle: facts and figures
• Present in 196 countries
• 447 factories in 86 countries
• 333.000 employees
• 92.158M CHF in sales
Odds of you (or your pet) having enjoyed Nestle products: 99,99%
Behind the scenes
Source: M-Trends Report
63% of companies learnt they were breached from an external entity
100% of victims had up-to-date antivirus signatures
Why we are here ... in 4 emojis
Key constraints of current security technologies
• Real-time monitoring: The attack is no longer the tree, but a forest. Multi-vector and multi-stage cyber attacks go unnoticed by SOC operators focused on real-time detection.
• Reliance on signatures: Welcome to the ‘managing false positives’ world. To detect a ‘never-seen-before’ attack, it must have been seen before in order to produce a signature (lol).
• Excessive time-to-insights: Investigations take too long with so much data available. It can take up to 1 day to generate a report containing an IP address’ activity across the network.
The optical effect at play
EVENT: Multiple failed logons
WHAT IT COULD MEAN
• Someone who is back from holidays and forgot his password (99%)
• Someone brute forcing an account (1%)
BUT ...
• An attacker might obtain the password through other means and never fail a logon
EVENT: Reconnaissance activity from Internet
WHAT IT COULD MEAN
• True scans or false positives that will never turn into an attack (99%)
• Activities launched prior to a true targeted attack (1%)
BUT ...
• An attack might not need to generate noisy reconnaissance activity
EVENT: Critical SAP transaction executed
WHAT IT COULD MEAN
• Legitimate transaction executed by an authorized user as part of his BaU (99%)
• The last step in the kill chain (1%)
BUT ...
• A non-critical transaction could also be a serious attack
Our environment
• Aggregation of logs from multiple sources with proprietary technology from HP
• Correlation rules with common intrusion scenarios and use cases of known cyber attacks
• Hundreds of alerts every day
• Real-time processing under control: the SOC can effectively handle alerts as they are triggered by using predefined response procedures
• The goal is to respond fast
• Data lake holding only relevant events that can become part of an analytical data model
• Focus on detection of anomalies, not predefined attack patterns
• Expectation of few alerts every day
• Visualizations that let us understand the network and its assets’ activity
• Forest-based data processing that allows us to spot multi-vector and multi-stage attacks
• The goal is to detect complex attacks
6 billion security events generated every week from ~250.000 employees across the globe
Principles and challenges
Cyber attack
Activity that differs from legitimate activity, which must be the most frequent
Anomaly Detection
Leveraging knowledge about the kill chain:
• multiple traces left by the attacker,
• unlikely to be recognized as an attack,
• that differ from regular activity
* Statistical significance is less relevant than prediction performance
Security domains
• User authentication: Microsoft & VPN
• Network connections: Firewall & IPS
• Business transactions: SAP
• Internet browsing: Content filtering proxy
• Internet domains: Content filtering proxy
• Web applications: WAF
• Server behavior: Microsoft
• Threat intelligence: HP & OSINT
• Asset modeling: All
Why asset modeling is key for security
Risk prioritization
Malicious activity can be critical if it affects a system with key business privileges, or almost innocuous if it affects a kiosk PC for guests
Multiple decoupled IT inventories
Can’t trust multiple IT inventories that are:
• conflicting
• often outdated
• not correlated
• lacking key security details
• using different semantics for features
Our questions
• “Does this PC have privileged access to Treasury?”
• “Does this webserver run unauthorized or insecure services?”
Available information
• “This application is part of the Treasury environment”
• “This system is a webserver”
In an incident investigation, an asset’s details tell us how big the impact is and what other assets might be affected.
Principles and challenges
Understanding of the (complex) SIEM data model
• Multiple sources
• Thousands of event types
Servers and Domain Controllers
Content-filtering proxy
Intrusion Prevention System
• Hadoop-based deployment
• Spark cluster for parallel in-memory execution of data processing routines
• Scales up and down as required
• Advanced capabilities for events ingestion
• Fully customizable approach to data mining, statistical analysis and modeling
• Integration with most Hadoop components, including Spark
• RevoScale R libraries for distributed computing on the cluster
• Excel-like programming language
• Interactive dashboards
• Relational data model with point-and-click data blending
• Extensive library of advanced visuals
Technology stack
The (initial) feature set
No labeled data other than signatures (false positives and negatives) and IOCs (ever-changing)
• Log source
• Timestamp
• Source Address and Hostname
• Target Address and Hostname
• Action
• Parameters (URL, category, etc.)
• Username
• Error code
Building blocks
Ingestion
• Pipeline
• Activities
• Linked Services
• Dataset for FileServer
• Dataset for BlobStorage
Transformation
• Update and retention policies
• Enrichment: geo, external sources, flags from existing knowledge about network assets
• Target features: behaviour vectors, maps, ratios, volumes
• Typical transformation: aggregation, filtering, reshaping
Exploration
• Search of potentially useful models
• Kibana, download of samples to local, tests in cluster
Modeling
• Behavioral analytics
• Classifiers
• Multivariate outliers detection
• Scoring
• Super Correlation
Prediction
• Applied to ingested data based on trained models
Visualization
• Dashboards with prediction results
• Dashboard with high level overview
• Dashboard per domain: threat intelligence, transactions, connections, vulnerabilities
Feedback management
Data Model
INGESTION > TRANSFORMATION > EXPLORATION > MODELING > PREDICTION > VISUALIZATION > FEEDBACK
[Diagram: ingestion architecture. Components shown:]
• Log sources: Proxy, Firewall, Domain controllers, WAF, IPS, SAP
• ArcSight SIEM: ArcSight Logger, ArcSight ESM (Base Events, Correlated Events)
• Scheduled reports, CSV files
• Data Management Gateway, Internet, Blob Storage, Data Factory
• MRS, Spark, RevoScaleR
• PowerBI
• Spark Data Frames
• Data Sources, XDF
• data frames
• CSV
• Parquet files
Heavy-lifting: aggregation, filtering, reshaping
• Behaviour records
• Metrics
• Ratios (of successful vs error)
• Time-based measures (binning) for histograms
• Useful features from enrichment from external sources
• Log source
• Timestamp
• Source Address and Hostname
• Target Address and Hostname
• Action
• Parameters (URL, category, etc.)
• Username
• Error code
... so now we can develop, test and apply algorithms
Enrichment: geo, external sources, flags from existing knowledge about network assets
Storage of formatted events: apply update and retention policies
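As an illustration of the transformation step, here is a minimal sketch of how raw events could be aggregated into per-user behaviour records with volumes, error ratios and hourly bins. It is written in plain Python rather than the Spark/RevoScaleR stack used in the talk, and the event layout, the sample data and the function name build_behaviour_records are assumptions made for this example only.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical formatted events: (timestamp, username, action, error_code).
# The layout and the values are illustrative, not the schema from the talk.
EVENTS = [
    ("2016-03-01 08:15:00", "alice", "logon", 0),
    ("2016-03-01 08:17:00", "alice", "logon", 0),
    ("2016-03-01 23:50:00", "alice", "logon", 1),
    ("2016-03-01 09:00:00", "bob", "logon", 1),
    ("2016-03-01 09:01:00", "bob", "logon", 1),
    ("2016-03-01 09:02:00", "bob", "logon", 0),
]


def build_behaviour_records(events):
    """Aggregate raw events into one behaviour record per user."""
    totals = Counter()
    errors = Counter()
    hourly = defaultdict(lambda: [0] * 24)  # 24 one-hour bins per user
    for timestamp, user, _action, error_code in events:
        hour = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").hour
        totals[user] += 1
        if error_code != 0:
            errors[user] += 1
        hourly[user][hour] += 1
    return {
        user: {
            "volume": totals[user],                      # metric
            "error_ratio": errors[user] / totals[user],  # ratio of errors
            "hourly_histogram": hourly[user],            # time-based bins
        }
        for user in totals
    }


records = build_behaviour_records(EVENTS)
```

Each record can then serve as one row of a prediction-oriented dataset.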
• Understanding of data and search of potentially useful models
• Kibana
• Download of samples to local (Tableau)
• Tests in cluster
• Connections volume between suspicious and legitimate workstations
• Distribution of users’ visits to websites hosted in each country
Generate prediction-oriented datasets from datasets with formatted events and support datasets
Behavior analytics
• Deviations from previous behavior
• Spearman / Kendall / Cosine similarity
• Graph analysis for clustering
• Granger’s Causality Test
Models trained on labels from signature-based systems
• Logistic regression
• LDA
• QDA
• Naive Bayes
• KNN
Univariate/Multivariate outliers detection
• N standard deviations from the mean
• Mahalanobis distance
• High leverage points
• Cook’s distance
Other building blocks
• Scoring: calculate risk factor from anomalies
• Super Correlation: factor in relationships between assets and past predictions
• Cascading algorithms
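Two of the outlier techniques listed above can be sketched in a few lines. The example below is an illustrative pure-Python version (not the R code used in the talk): a univariate check for values N standard deviations from the mean, and a Mahalanobis distance restricted to two features so that the covariance matrix can be inverted by hand. The sample data, thresholds and function names are assumptions.

```python
import math
import statistics


def univariate_outliers(values, n_std=3.0):
    """Indices of values more than n_std standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) > n_std * std]


def mahalanobis_2d(points):
    """Mahalanobis distance of each 2-D point from the sample mean."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]]
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det == 0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # d^2 = [dx dy] * inverse(cov) * [dx dy]^T, expanded for the 2x2 case
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(math.sqrt(d2))
    return distances


# Hypothetical daily failed-logon counts: one day clearly stands out.
failed_logons = [10] * 20 + [100]
outlier_days = univariate_outliers(failed_logons)

# Hypothetical (connections, bytes) pairs: the last point breaks the usual
# relationship between the two features without being extreme on either one.
points = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.1),
          (6, 5.9), (7, 7.0), (8, 8.1), (2, 7.0)]
distances = mahalanobis_2d(points)
most_anomalous = distances.index(max(distances))
```

The second case shows why the multivariate view matters: the last point is unremarkable on each axis taken separately, yet it is the farthest from the joint distribution.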
Example: User Authentication
Histogram of user auth requests from 00:00 to 23:59
Histogram of user auth requests from Monday to Sunday
Matrix of user authentications in servers
Matrix of user auth requests grouped by location
Clusters of users showing similar behaviour
Malicious Authentications
New data
Working with RevoScaleR
• Standard clustering techniques not feasible because we can’t figure out a proper number of clusters in advance
• Replacing cor() by rxCor(): 45 minutes to 6 seconds
• The graph package helps us identify cluster membership based on the correlation between workstations authenticating into servers
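The idea of deriving clusters from correlations, without fixing the number of clusters in advance, can be sketched as follows. This is an illustrative pure-Python version, not the RevoScaleR and graph-package code from the talk: workstations whose authentication vectors correlate above a threshold are linked, and each connected component of the resulting graph is a cluster. The vectors, the 0.9 threshold and the function names are assumptions.

```python
import math
from itertools import combinations


def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0  # a constant vector carries no correlation signal
    return cov / math.sqrt(va * vb)


def correlation_clusters(vectors, threshold=0.9):
    """Connected components of the graph linking highly correlated hosts."""
    neighbours = {name: set() for name in vectors}
    for a, b in combinations(vectors, 2):
        if pearson(vectors[a], vectors[b]) >= threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)
    clusters, seen = [], set()
    for start in vectors:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first walk over the component
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters


# Hypothetical authentication counts per workstation against four servers.
AUTH_VECTORS = {
    "ws1": [10, 0, 5, 0],
    "ws2": [20, 1, 9, 0],
    "ws3": [0, 8, 0, 7],
    "ws4": [1, 16, 0, 15],
    "ws5": [3, 3, 3, 3],
}
clusters = correlation_clusters(AUTH_VECTORS)
```

The number of clusters falls out of the data and the threshold, which is the property the slide asks for.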
Example: Malicious Internet Domains
Training the algorithm to build profiles of malicious domains, based on agnostic variables
TRAINING
• Marketing metrics
• HTML metrics
• SEO metrics
prcomp()
Example: Malicious Internet Domains
http://www.datasciencecentral.com/profiles/blogs/dimensionality-reduction-with-r-to-uncover-malicious-internet
4 malicious domains get displayed below the line, along with the rest of malicious domains previously identified
• Generate prediction-oriented features: behaviour vectors, maps, ratios, counts
• Apply models to fresh data to detect anomalies
IN THE LAB
Leveraging our ethical hackers to create scenarios that must be detected as anomalies by the platform
More labeled data!
• Dashboards with prediction results (anomalies)
• Dashboard with high level overview
• Dashboard per domain: threat intelligence, transactions, connections, vulnerabilities
• Graphs for asset-based correlation, timelines for time-based correlation
• Dashboards for incident investigation: asset modeling
IN THE LAB
Send e-mails to users to confirm whether they performed a given suspicious activity
Generate exceptions or fine-tune models from wrong predictions
More labeled data!
Super Correlation
After each run of prediction routines:
• Each asset has a risk level (scoring) and a list of anomalies
• List of pairs of related assets is built (risk level = sum of risk levels)
• Graphs are built from related assets and clusters are identified (risk level is the sum of risk levels)
• Highest-risk clusters are displayed as anomalies
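The four steps above can be sketched as a small routine. This is a hedged illustration, not the authors' implementation: it assumes integer risk scores per asset, assumes every asset in a pair also has a score, uses a union-find structure to turn the pairs of related assets into clusters, and ranks clusters by the sum of their members' risk levels. The asset names, scores and the function name super_correlation are hypothetical.

```python
def super_correlation(risk, related_pairs, top=3):
    """Group related assets into clusters and rank them by summed risk."""
    parent = {asset: asset for asset in risk}

    def find(asset):
        # Follow parent links to the cluster representative (with path halving).
        while parent[asset] != asset:
            parent[asset] = parent[parent[asset]]
            asset = parent[asset]
        return asset

    for a, b in related_pairs:
        parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for asset in risk:
        clusters.setdefault(find(asset), []).append(asset)

    scored = [
        (sum(risk[a] for a in members), sorted(members))
        for members in clusters.values()
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top]


# Hypothetical risk levels produced by one run of the prediction routines.
RISK = {"user_a": 3, "server_b": 2, "domain_c": 4, "kiosk_d": 5, "printer_e": 1}
# Hypothetical relationships: the user logged on to the server, which
# contacted the domain.
PAIRS = [("user_a", "server_b"), ("server_b", "domain_c")]

highest_risk = super_correlation(RISK, PAIRS, top=2)
```

In this toy data no single asset in the first cluster outranks the kiosk PC, yet the cluster as a whole does, which is the point of scoring related assets together.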
ASSETS-BASED
[Diagram: anomalies linked across Asset A (User), Asset B (Server) and Asset C (Internet Domain)]
Anomalies shown: Behavioral analytics (suspicious logon); Behavioral analytics (suspicious process); Univariate outlier (volume of connections to single IP); Multivariate outliers (unusual combination of Security Audit events); Behavioral analytics (user account not linked to workstation); Classifiers (suspicious domain); Behavioral analytics (never contacted before)
• Independence from predefined correlation scenarios: A > B > C
• Accurately quantify the risk level of a given alert, that can be low if isolated but critical considering the context
• Identify other assets impacted by a given cyber attack: quantify business impact
Super Correlation
After each run of prediction routines:
• New anomalies are stored
• Previous anomalies are matched with them based on common assets/clusters
• A timeline of anomalies related to each asset/cluster can be displayed.
TIME-BASED
• Identify multi-stage attacks that are performed over the course of days, weeks or months
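A minimal sketch of the time-based variant: anomalies from successive runs are grouped per asset into a timeline, and assets with anomalies on several distinct days become candidates for a multi-stage attack. The run dates reuse the three dates shown on the slide; the anomaly descriptions, the two-day threshold and the function names are assumptions for illustration.

```python
from collections import defaultdict
from datetime import date


def build_timelines(anomaly_runs):
    """Map each asset to its anomalies, ordered by the date of the run."""
    timelines = defaultdict(list)
    for run_date, anomalies in anomaly_runs:
        for asset, description in anomalies:
            timelines[asset].append((run_date, description))
    for entries in timelines.values():
        entries.sort(key=lambda entry: entry[0])
    return dict(timelines)


def multi_stage_candidates(timelines, min_distinct_days=2):
    """Assets whose anomalies span at least min_distinct_days different days."""
    return sorted(
        asset
        for asset, entries in timelines.items()
        if len({day for day, _ in entries}) >= min_distinct_days
    )


# Hypothetical output of three prediction runs, deliberately out of order.
RUNS = [
    (date(2016, 4, 12), [("user_a", "suspicious process")]),
    (date(2016, 3, 1), [("user_a", "suspicious logon"),
                        ("server_b", "volume of connections to single IP")]),
    (date(2016, 5, 20), [("user_a", "domain never contacted before")]),
]

timelines = build_timelines(RUNS)
candidates = multi_stage_candidates(timelines)
```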
[Diagram: timeline of anomalies for Asset A (User) on 01/03/16, 12/04/16 and 20/05/16, each linked to other assets (B, C, D, E, F, G)]
Our vision: analytics-based SOC monitoring
Datalake arch & deployment
• Architectural design
• Integration of log sources
Datalake administration
Data mining & Statistical Analysis
Visual Data Discovery
Statistical modeling
• Data preparation
• Development of advanced features
• Pattern Discovery
• Anomaly detection
• Time Series analysis
• Outliers analysis
• Data and pattern discovery
• Trend analysis
• Anomaly detection
• Unsupervised training
• Supervised training
• Development of ML algorithms
• Configuration management
• System monitoring
• Backup and restore
• Capacity management
SOC Manager
Reports
Team Coordination
• Process design & continuous improvements
• Coordination of resources
• Performance monitoring
• Reporting
Event Monitoring & Correlation
Threat Intelligence
New Intel from offline analysis
Updates on architecture
Request Changes / Execute Changes
Advanced Visual Analytics
• Development of scenarios
• Development of interactive charts
• Development of dashboards
Pains we have felt so far ...
• Integration of multiple components that often hardly integrate: 30% of time transforming data structures
• Standard algorithms don’t fit: creativity is a must for modeling
• Real data is not like academic datasets: 50% of time generating useful features
Lessons learned & Takeaways
Think big, start slow
Or be prepared to suffer scope creep and never deliver
Think big data from the beginning
Or be prepared to rewrite code multiple times
Focus on results, not sophistication
Detecting cyber attacks is the ultimate goal and the reason for the investment
Just a few seconds of advertisement ...
• Join the community of cyber specialists developing ML-based open-source anomaly detection systems!
Stay in touch: [email protected]