Finding the needle in the haystack: how Nestle is leveraging big data to defeat cyber attacks, by Rafael Villoria and Rafael San Miguel

Finding the needle in the haystack: How the food industry is leveraging big data to defeat cyber attacks

Rafael Villoria-Ferrer - Rafael San Miguel Carrasco

• Present in 196 countries

• 447 factories in 86 countries

• 333.000 employees

• 92.158M CHF in sales

Odds of you (or your pet) having enjoyed Nestle products: 99,99%

Nestle: facts and figures

Behind the scenes

Source: M-Trends Report

63% of companies learnt they were breached from an external entity

100% of victims had up-to-date antivirus signatures

Why we are here ... in 4 emojis

Key constraints of current security technologies

Real-time monitoring: The attack is no longer the tree, but a forest

• Multi-vector and multi-stage cyber attacks go unnoticed by SOC operators focused on real-time detection

Reliance on signatures: Welcome to the ‘managing false positives’ world

• To detect a ‘never-seen-before’ attack, it must have been seen before in order to produce a signature (lol)

Excessive time-to-insights: Investigations take too long with so much data available

• Up to 1 day to generate a report containing an IP address’ activity across the network

The optical effect at play

EVENT: Multiple failed logons
WHAT IT COULD MEAN:
• Someone who is back from holidays and forgot their password (99%)
• Someone brute-forcing an account (1%)
BUT ...
• An attacker might obtain the password through other means and never fail a logon

EVENT: Reconnaissance activity from the Internet
WHAT IT COULD MEAN:
• True scans or false positives that will never turn into an attack (99%)
• Activities launched prior to a true targeted attack (1%)
BUT ...
• An attack might not need to generate noisy reconnaissance activity

EVENT: Critical SAP transaction executed
WHAT IT COULD MEAN:
• Legitimate transaction executed by an authorized user as part of their BaU (99%)
• The last step in the kill chain (1%)
BUT ...
• A non-critical transaction could also be a serious attack

Our environment

• Aggregation of logs from multiple sources with proprietary technology from HP

• Correlation rules with common intrusion scenarios and use cases of known cyber attacks

• Hundreds of alerts every day

• Real-time processing under control: the SOC can effectively handle alerts as they are triggered by using predefined response procedures

• The goal is to respond fast

• Data lake with only the relevant events that can become part of an analytical data model

• Focus on detection of anomalies, not predefined attacks patterns

• Expectation of few alerts every day

• Visualizations that let us understand the network and its assets’ activity

• Forest-based data processing that allows us to spot multi-vector and multi-stage attacks

• The goal is to detect complex attacks

6 billion security events generated every week from ~250.000 employees across the globe

Principles and challenges

Cyber attack

Activity different from legitimate activity, that must be the most frequent

Anomaly Detection

Leveraging knowledge about the kill chain

• multiple traces left by the attacker
• unlikely to be recognized as an attack
• that differ from regular activity

* Statistical significance less relevant than prediction performance

Asset modeling: All

User authentication: Microsoft & VPN

Network connections: Firewall & IPS

Business transactions: SAP

Internet browsing: Content filtering proxy

Internet domains: Content filtering proxy

Web applications: WAF

Server behavior: Microsoft

Threat intelligence: HP & OSINT

Security domains

Why Assets Modeling is key for security

Risk prioritization

Malicious activity can be critical if it affects a system with key business privileges, or almost innocuous if it affects a kiosk PC for guests

Multiple decoupled IT inventories

Can’t trust multiple IT inventories that are:

• conflicting in their information
• often outdated
• not correlated
• lacking key security details
• using different semantics for features

Our questions

• “Does this PC have privileged access to Treasury?”

• “Does this webserver run unauthorized or insecure services?”

Available information

• “This application is part of the Treasury environment”

• “This system is a webserver”

In an incident investigation, an asset’s details tell us how big the impact is and what other assets might be affected.


Understanding of the (complex) SIEM data model

• Multiple sources

• Thousands of event types

Servers and Domain Controllers



Content-filtering proxy

Intrusion Prevention System

• Hadoop-based deployment

• Spark cluster for parallel in-memory execution of data processing routines

• Scales up and down as required

• Advanced capabilities for events ingestion

• Fully customizable approach to data mining, statistical analysis and modeling

• Integration with most Hadoop components, including Spark

• RevoScale R libraries for distributed computing on the cluster

• Excel-like programming language

• Interactive dashboards

• Relational data model with point-and-click data blending

• Extensive library of advanced visuals

Technology stack

The (initial) feature set

No labeled data other than signatures (false positives and negatives) and IOCs (ever-changing)

• Log source

• Timestamp

• Source Address and Hostname

• Target Address and Hostname

• Action

• Parameters (URL, category, etc.)

• Username

• Error code


Building blocks

• Update and retention policies

• Enrichment: geo, external sources, flags from existing knowledge about network assets

• Target features: behaviour vectors, maps, ratios, volumes

• Typical transformation: aggregation, filtering, reshaping

Ingestion

Exploration Modeling

Prediction Visualization

Feedback management

• Applied to ingested data based on trained models

• Search of potentially useful models

• Kibana, download of samples to local, tests in cluster

Transformation

Data Model

• Dashboards with prediction results

• Dashboard with high level overview

• Dashboard per domain: threat intelligence, transactions, connections, vulnerabilities

• Behavioral analytics

• Classifiers

• Multivariate outliers detection

• Scoring

• Super Correlation

• Pipeline
• Activities
• Linked Services
• Dataset for FileServer
• Dataset for BlobStorage

INGESTION > TRANSFORMATION > EXPLORATION > MODELING > PREDICTION > VISUALIZATION > FEEDBACK

[Architecture diagram]

• Log sources: Proxy, Firewall, Domain controllers, WAF, IPS, SAP
• Arcsight SIEM: Arcsight Logger, Arcsight ESM (base events, correlated events)
• Scheduled reports (CSV files), Data Management Gateway, Internet
• Blob Storage, Data Factory
• MRS, Spark, RevoScaleR, PowerBI

• Spark Data Frames
• Data Sources, XDF
• data frames

• CSV
• Parquet files

Heavy-lifting: Aggregation, filtering, reshaping


• Behaviour records

• Metrics

• Ratios (of successful vs error)

• Time-based measures (binning) for histograms

• Useful features from enrichment from external sources

• Log source

• Timestamp

• Source Address and Hostname

• Target Address and Hostname

• Action

• Parameters (URL, category, etc.)

• Username

• Error code

... so now we can develop, test and apply algorithms

Enrichment: Geo, external sources, flags from existing knowledge about network assets

Storage of formatted events: Apply update and retention policies
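The transformation step above (behaviour records, ratios of successful vs error, time-based binning) can be illustrated with a minimal sketch. The talk ran this on Spark and RevoScaleR; the version below is plain Python for readability, and the event tuples, the `LOGON_FAILED` code and the `behaviour_records` helper are all hypothetical names chosen for the example.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical formatted events: (timestamp, username, error_code).
# An empty error code is treated as a successful action.
EVENTS = [
    ("2016-03-01 08:15:00", "alice", ""),
    ("2016-03-01 08:47:00", "alice", ""),
    ("2016-03-01 23:05:00", "alice", "LOGON_FAILED"),
    ("2016-03-01 09:30:00", "bob", "LOGON_FAILED"),
    ("2016-03-01 09:31:00", "bob", "LOGON_FAILED"),
    ("2016-03-01 09:32:00", "bob", ""),
]


def behaviour_records(events):
    """Aggregate raw events into one behaviour record per user."""
    hourly = defaultdict(lambda: [0] * 24)  # 24 one-hour bins per user
    outcomes = defaultdict(Counter)         # success / error counts per user
    for timestamp, user, error in events:
        hour = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").hour
        hourly[user][hour] += 1
        outcomes[user]["error" if error else "success"] += 1

    records = {}
    for user, histogram in hourly.items():
        total = sum(outcomes[user].values())
        records[user] = {
            "volume": total,
            "error_ratio": outcomes[user]["error"] / total,
            "hourly_histogram": histogram,
        }
    return records


RECORDS = behaviour_records(EVENTS)
```

Each record is a fixed-length vector per user, which is the shape that the modeling stage needs regardless of how many raw events a user produced.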

• Understanding of data and search of potentially useful models

• Kibana

• Download of samples to local (Tableau)

• Tests in cluster

• Connections volume between suspicious and legitimate workstations

• Distribution of users’ visits to websites hosted in each country


Generate prediction-oriented datasets from datasets with formatted events and support datasets


Behavior analytics

• Deviations from previous behavior

• Spearman / Kendall / Cosine similarity

• Graph analysis for clustering

• Granger’s Causality Test

Models trained on signature-based systems

• Logistic regression

• LDA

• QDA

• Naive Bayes

• KNN

Univariate/Multivariate outliers detection

• N standard deviations from the mean

• Mahalanobis distance

• High leverage points

• Cook’s distance

• Scoring: calculate risk factor from anomalies

• Super Correlation: factor in relationships between assets and past predictions

• Cascading algorithms
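Two of the outlier techniques listed above, "N standard deviations from the mean" and Mahalanobis distance, can be sketched in a few lines. This is an illustration in plain Python rather than the R code used in the project; the sample data, the function names and the restriction to two variables (so the covariance matrix can be inverted by hand) are assumptions made for the example.

```python
import math
from statistics import mean, pstdev


def univariate_outliers(values, n_sigmas=3.0):
    """Indices of values more than n_sigmas standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) > n_sigmas * sigma]


def mahalanobis_2d(points):
    """Mahalanobis distance of each (x, y) point from the sample mean."""
    n = len(points)
    mx = mean(p[0] for p in points)
    my = mean(p[1] for p in points)
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]]
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # d^2 = [dx dy] * inverse(cov) * [dx dy]^T, with the 2x2 inverse written out
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(math.sqrt(d2))
    return distances


# Hypothetical connection volumes per workstation to a single IP.
VOLUMES = [12, 15, 11, 14, 13, 12, 16, 14, 13, 15,
           12, 11, 14, 13, 16, 15, 12, 14, 13, 400]
UNIVARIATE = univariate_outliers(VOLUMES)

# Hypothetical (logons, distinct servers) pairs: the last point is within the
# normal range on each axis but breaks the usual relationship between them.
POINTS = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.9), (5, 5.1),
          (6, 5.8), (7, 7.1), (8, 8.0), (2, 7.0)]
DISTANCES = mahalanobis_2d(POINTS)
MOST_UNUSUAL = max(range(len(POINTS)), key=DISTANCES.__getitem__)
```

The second data set shows why the multivariate view matters: a per-variable threshold would not flag the last point, while its Mahalanobis distance is the largest in the sample.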

Example: User Authentication

Histogram of user auth requests from 00:00 to 23:59

Histogram of user auth requests from Monday to Sunday

Matrix of user authentications in servers

Matrix of user auth requests grouped by

location

Clusters of users showing similar behaviour

Malicious Authentications

New data


Working with RevoScaleR

• Standard clustering techniques not feasible because we can’t figure out a proper number of clusters in advance

• Replacing cor() by rxCor(): 45 minutes to 6 seconds

• graph package helps us identify cluster membership based on correlation between workstations authenticating into servers
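The idea in the bullets above, deriving clusters from correlation instead of fixing the number of clusters in advance, can be sketched as follows. The project used R (`rxCor()` and a graph package); this plain Python version is only an illustration, and the workstation names, authentication counts and the 0.8 threshold are assumptions.

```python
from itertools import combinations
from math import sqrt

# Hypothetical data: workstation -> authentication counts against servers S1..S4.
AUTH_COUNTS = {
    "ws-01": [9, 1, 0, 2],
    "ws-02": [8, 2, 1, 1],
    "ws-03": [0, 7, 9, 1],
    "ws-04": [1, 6, 8, 0],
    "ws-05": [2, 1, 0, 9],
}


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / sqrt(va * vb)


def correlation_clusters(vectors, threshold=0.8):
    """Link workstations whose correlation reaches the threshold,
    then return the connected components of the resulting graph."""
    neighbours = {name: set() for name in vectors}
    for a, b in combinations(vectors, 2):
        if pearson(vectors[a], vectors[b]) >= threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)

    clusters, seen = [], set()
    for start in vectors:
        if start in seen:
            continue
        stack, members = [start], set()
        while stack:  # depth-first walk over one component
            node = stack.pop()
            if node in members:
                continue
            members.add(node)
            stack.extend(neighbours[node] - members)
        seen |= members
        clusters.append(sorted(members))
    return clusters


CLUSTERS = correlation_clusters(AUTH_COUNTS)
```

The number of clusters falls out of the graph structure, so nothing like k has to be guessed beforehand; the threshold is the only tuning parameter in this sketch.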

Example: Malicious Internet Domains


Training the algorithm to build profiles of malicious domains, based on agnostic variables

TRAINING

• Marketing metrics

• HTML metrics

• SEO metrics

prcomp()



http://www.datasciencecentral.com/profiles/blogs/dimensionality-reduction-with-r-to-uncover-malicious-internet

4 malicious domains get displayed below the line, along with the rest of malicious domains previously identified
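The dimensionality reduction behind this plot was done with R's `prcomp()`. As a rough illustration of the same idea, the sketch below standardises a few metrics per domain and projects each domain onto the first principal component, found by power iteration. It is plain Python, not the authors' code; the domain names, the three metric columns and their values are invented for the example.

```python
from math import sqrt

# Hypothetical metrics per domain (e.g. one marketing, one HTML, one SEO metric).
DOMAINS = {
    "legit-a.example": [120, 45, 30],
    "legit-b.example": [150, 50, 35],
    "legit-c.example": [110, 40, 28],
    "legit-d.example": [140, 55, 33],
    "bad-a.example": [5, 8, 2],
    "bad-b.example": [8, 6, 3],
}


def standardise(rows):
    """Centre each column and scale it to unit sample standard deviation."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [sqrt(sum((v - m) ** 2 for v in c) / (len(c) - 1))
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def first_component(rows, iterations=200):
    """Leading eigenvector of the covariance matrix, by power iteration.
    Assumes the start vector is not orthogonal to that eigenvector."""
    n, p = len(rows), len(rows[0])
    cov = [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(p)]
           for i in range(p)]
    vec = [1.0 / sqrt(p)] * p
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    return vec


NAMES = list(DOMAINS)
ROWS = standardise([DOMAINS[name] for name in NAMES])
COMPONENT = first_component(ROWS)
# Score of each domain along the first principal component.
SCORES = {name: sum(v * w for v, w in zip(row, COMPONENT))
          for name, row in zip(NAMES, ROWS)}
```

In this toy data the two made-up malicious domains end up on the opposite side of zero from the legitimate ones, which mirrors the separating line on the slide; real data would need more components and a proper validation step.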


• Generate prediction-oriented features: behaviour vectors, maps, ratios, counts

• Apply models to fresh data to detect anomalies


IN THE LAB

Leveraging our ethical hackers to create scenarios that must be detected as anomalies by the platform

More labeled data!

• Dashboards with prediction results (anomalies)

• Dashboard with high level overview

• Dashboard per domain: threat intelligence, transactions, connections, vulnerabilities

• Graphs for asset-based correlation, timelines for time-based correlation

• Dashboards for incident investigation: asset modeling



IN THE LAB

Send e-mails to users to confirm whether they performed a given suspicious activity

Generate exceptions or fine-tune models from wrong predictions

More labeled data!

Super Correlation

After each run of prediction routines:

• Each asset has a risk level (scoring) and a list of anomalies

• List of pairs of related assets is built (risk level = sum of risk levels)

• Graphs are built from related assets and clusters are identified (risk level is the sum of risk levels)

• Highest-risk clusters are displayed as anomalies
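The asset-based steps above can be sketched as follows: pairs of related assets get the sum of their risk levels, related assets are merged into clusters, and clusters are ranked by total risk. This is a minimal illustration in plain Python using a union-find structure; the asset names, risk scores and relationships are hypothetical, and the actual scoring used in the platform is not described in the slides.

```python
from collections import defaultdict

# Hypothetical per-asset risk levels produced by the scoring step.
ASSET_RISK = {
    "user:jdoe": 3,
    "server:sap01": 5,
    "domain:bad.example": 4,
    "user:asmith": 1,
    "server:web02": 1,
}

# Hypothetical relationships between assets (e.g. shared anomalies).
RELATED = [
    ("user:jdoe", "server:sap01"),
    ("user:jdoe", "domain:bad.example"),
    ("user:asmith", "server:web02"),
]


def pair_risk(asset_risk, related_pairs):
    """Risk level of each related pair = sum of the two asset risk levels."""
    return [(asset_risk[a] + asset_risk[b], a, b) for a, b in related_pairs]


def cluster_risk(asset_risk, related_pairs):
    """Group related assets into clusters and rank them by summed risk."""
    parent = {asset: asset for asset in asset_risk}

    def find(asset):
        while parent[asset] != asset:
            parent[asset] = parent[parent[asset]]  # path halving
            asset = parent[asset]
        return asset

    for a, b in related_pairs:
        parent[find(a)] = find(b)  # merge the two clusters

    members = defaultdict(list)
    for asset in asset_risk:
        members[find(asset)].append(asset)

    ranked = [(sum(asset_risk[a] for a in group), sorted(group))
              for group in members.values()]
    return sorted(ranked, reverse=True)


PAIRS = pair_risk(ASSET_RISK, RELATED)
RANKED = cluster_risk(ASSET_RISK, RELATED)
```

Here three moderately risky assets combine into one cluster with a total of 12, which would be surfaced ahead of anything an individual alert on one of them would suggest.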

ASSETS-BASED

[Diagram: anomalies linking Asset A (User), Asset B (Server) and Asset C (Internet Domain)]

• Behavioral analytics: Suspicious logon
• Suspicious process
• Univariate outlier: Volume of connections to single IP
• Multivariate outliers: Unusual combination of Security Audit events
• User account not linked to workstation
• Classifiers: Suspicious domain
• Behavioral analytics: Never contacted before

• Independence from predefined correlation scenarios: A > B > C

• Accurately quantify the risk level of a given alert, that can be low if isolated but critical considering the context

• Identify other assets impacted by a given cyber attack: quantify business impact

Super Correlation

After each run of prediction routines:

• New anomalies are stored

• Previous anomalies are matched with them based on common assets/clusters

• A timeline of anomalies related to each asset/cluster can be displayed.

TIME-BASED

• Identify multi-stage attacks that are performed over the course of days, weeks or months
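The time-based variant can be sketched as a per-asset timeline that is updated after each prediction run: new anomalies are stored, and any asset that already has earlier anomalies is reported as a candidate multi-stage case. This is an illustrative Python sketch; the function name, asset identifiers and anomaly descriptions are hypothetical, and only the three dates are taken from the slide.

```python
from collections import defaultdict
from datetime import date


def update_timelines(timelines, run_date, new_anomalies):
    """Store the anomalies of one prediction run and return the assets
    that already had anomalies recorded in previous runs."""
    recurring = {asset for asset, _ in new_anomalies if timelines.get(asset)}
    for asset, description in new_anomalies:
        timelines[asset].append((run_date, description))
    return recurring


TIMELINES = defaultdict(list)

# Three prediction runs, weeks apart.
RUN_1 = update_timelines(TIMELINES, date(2016, 3, 1),
                         [("user:A", "suspicious logon")])
RUN_2 = update_timelines(TIMELINES, date(2016, 4, 12),
                         [("user:A", "domain never contacted before"),
                          ("server:B", "unusual audit events")])
RUN_3 = update_timelines(TIMELINES, date(2016, 5, 20),
                         [("user:A", "critical SAP transaction")])
```

After the third run, `TIMELINES["user:A"]` holds three dated anomalies that no single real-time alert would have connected.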

[Timeline diagram: anomalies (A) linking Asset A (User) to related assets B to G on 01/03/16, 12/04/16 and 20/05/16]

Our vision: analytics-based SOC monitoring

Datalake arch & deployment

• Architectural design

• Integration of log sources

Datalake administration

Data mining & Statistical Analysis

Visual Data Discovery

Statistical modeling

• Data preparation

• Development of advanced features

• Pattern Discovery

• Anomaly detection

• Time Series analysis

• Outliers analysis

• Data and pattern discovery

• Trend analysis

• Anomaly detection

• Unsupervised training

• Supervised training

• Development of ML algorithms

• Configuration management

• System monitoring

• Backup and restore

• Capacity management

SOC Manager

Reports

Team Coordination

• Process design & continuous improvements

• Coordination of resources

• Performance monitoring

• Reporting

Event Monitoring & Correlation

Threat Intelligence

New Intel from offline analysis

Updates on architecture

Request Changes / Execute Changes

Advanced Visual Analytics

• Development of scenarios

• Development of interactive charts

• Development of dashboards

Pains we have felt so far ...

Integration of multiple components which often hardly integrate

30% of time transforming data structures

Standard algorithms don’t fit

Creativity is a must for modeling

Real data is not like academic datasets

50% of time generating useful features

Lessons learned & Takeaways

Think big, start slow

Or be prepared to suffer scope creep and never deliver

Think big data from the beginning

Or be prepared to rewrite code multiple times

Focus on results, not sophistication

Detecting cyber attacks is the ultimate goal and the reason for the investment

Just a few seconds of advertisement ...

• Join the community of cyber specialists developing ML-based open-source anomaly detection systems!

Stay in touch!