59
1 Big Data at CERN 2nd International PhD School on Open Science Cloud 20 September 2018 Vaggelis Motesnitsalis

Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

1

Big Data at CERN

2nd International PhD School on Open Science Cloud

20 September 2018

Vaggelis Motesnitsalis

Page 2: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

2Vaggelis Motesnitsalis - 2nd International PhD School Open Science Cloud

Page 3: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

3

Who am I?

Page 4: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

4

I am…

Big Data Engineer as openlab Intel

Research Fellow

Research Fellow from 2017 - today

4+ years of Big Data Experience

Former Big Data Devops Support Engineer

and Escalation Engineer @AWS

MSc Distributed Systems @Imperial College London

Studies at King’s College London and Aristotle

University of Thessaloniki

[email protected]

Vaggelis Motesnitsalis - 2nd International PhD School Open Science Cloud

Page 5: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

5

Physics Analysis at CERN

Page 6: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

6

Until today, the vast majority of high energy physics analysis is done with

the ROOT Framework which uses physics data that are stored in

ROOT format files, within the EOS Storage Service of CERN.

ROOT Data Analysis Framework

A modular scientific software framework which provides all the functionalities neededto deal with big data processing, statistical analysis, visualization and file storage. It ismainly written in C++ but also integrated with Python and R.

EOS Storage Service

A disk-based, low-latency storage service with a highly-scalable hierarchical namespace,which enables data access through the XRootD protocol. It provides storage for bothphysics and user use cases via different service instances such as EOSPUBLIC,EOSATLAS, EOSCMS.

Physics Analysis at CERN

Vaggelis Motesnitsalis - 2nd International PhD School Open Science Cloud

Page 7: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

7

Operating the LHC

Page 8: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

8

Operating the LHC

2 Beams

7 TeV per proton

350 MJ per beam – equivalent to a

TGV train travelling with 150 km/h

Only 2 nanograms of Hydrogen per day

Ultra-high Vacuum

11245 circuits every second

Vaggelis Motesnitsalis - 2nd International PhD School Open Science Cloud

Page 9: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

9

How many collisions?

Page 10: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

10

Collisions

2 beams

4 collision points – 4 experiments

2808 bunches (max)

100 billions protons per bunch

20 – 40 collisions per bunch interaction

1 collision every 25 nanoseconds

Vaggelis Motesnitsalis - 2nd International PhD School Open Science Cloud

Page 11: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

11

4 Collision Points

Page 12: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

12

1 Billion collisions per second(at each collision point)

Page 13: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

13

1-2 MB to store a collision

Page 14: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

14

What is the data production rate of the LHC?

Page 15: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

15

4 – 8 PBs

Page 16: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

16

Filtering and Triggers

Page 17: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

17

Filtering and Triggers

Trigger Level 1: 100K collisions out of 1B

Trigger Level 2: 10K events out of 100K

Trigger Level 3: 100-200 events out of 10K

Vaggelis Motesnitsalis - 2nd International PhD School Open Science Cloud

Page 18: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

18

Data

Page 19: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

19

Data

12.3 PB in a single month

(October 2017)

>250 PB at CERN Data Center

Data formats include: ROOT for Physics

Parquet, JSON, AVRO in Hadoop

Different types of data: physics,

accelerator metrics,

infrastructure logging,etc.

Stored on a custom storage service [EOS]

Tapes are used for long-term storage

Vaggelis Motesnitsalis - 2nd International PhD School Open Science Cloud

Page 20: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

20

The Worldwide LHC Computing Grid

Page 21: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

21

WLCG

The Worldwide LHC Computing Grid

(WLCG) is a global collaboration of

more than 170 institutions

in 42 countries which provide

resources to store, distribute and

analyse the PBs of LHC Data

Worldwide LHC Computing Grid

Vaggelis Motesnitsalis - 2nd International PhD School Open Science Cloud

Page 22: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

22

EOS Storage Service

Page 23: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

23

EOS Storage ServiceService Characteristics

Disk-based

Low Latency

Highly Scalable

Based on the XRootD Protocol

Storage for both physics and user use cases

Different service instances such as

EOSPUBLIC, EOSATLAS, EOSCMS

Vaggelis Motesnitsalis - 2nd International PhD School Open Science Cloud

Page 24: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

24

Hadoop Service at CERN

Page 25: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

25

Hadoop and Spark Service at CERNThe Service in Numbers

6 Clusters

110+ Nodes

14+ PBs Storage

20+ TB Memory

3100+ Cores

4 TBs/day Data Growth

Vaggelis Motesnitsalis - 2nd International PhD School Open Science Cloud

Page 26: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

26

Hadoop and Spark Service at CERNRunning the service

“vanilla”Apache Hadoop Distribution

Puppet for Administration

Kerberos for Authentication

High Availability

Daily backups to Tapes

Accessible by client nodes and Jupyter Notebooks

Vaggelis Motesnitsalis - 2nd International PhD School Open Science Cloud

Page 27: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

27

Hadoop and Spark Service at CERNService Components

Kaf

kaSt

ream

ing

and

In

gest

ion

Vaggelis Motesnitsalis - 2nd International PhD School Open Science Cloud

Page 28: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

28

SWAN Service at CERN

Page 29: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

29

SWAN ServiceHosted Jupyter Notebooks for Data Analysis

Web-based interactive analysisusing PySpark in the cloud

Direct access to the EOS and HDFSFully Integrated with IT Spark and

Hadoop Clusters

No need to install softwareCombines code, equations, text and

visualisations

Collaboration between

EP-SFT, IT-ST, and IT-DB

https://swan.web.cern.ch/

Vaggelis Motesnitsalis - 2nd International PhD School Open Science Cloud

Page 30: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

30

SWAN Service

Vaggelis Motesnitsalis - 2nd International PhD School Open Science Cloud

Page 31: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

31

Other Tools

Page 32: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

32

Bridging the GapProblem Definition

EOS Storage Service

Vaggelis Motesnitsalis - 2nd International PhD School Open Science Cloud

Page 33: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

33

Hadoop – XRootD ConnectorConnecting XrootD-based Storage Systems with Hadoop and Spark

https://github.com/cerndb/hadoop-xrootd

JNI

HadoopClusters(HDFS)Hadoop-

XRootDConnector

EOSStorageService XRootD

Client

C++ Java

Vaggelis Motesnitsalis - 2nd International PhD School Open Science Cloud

A Java library that connects to the XRootD client via JNI

It reads files from the EOS Storage

Service directly

Slower Read – Broader Data Access

Supports Kerberos and GRID

Certificate Authentication

Page 34: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

34

spark-rootImporting Root files in Apache Spark

A Scala library which implements

DataSource for Apache Spark

Spark can read ROOT TTrees and

infer their schema

Root files are imported to

Spark Dataframes/Datasets/RDDsDeveloped by DIANA-HEP

https://github.com/diana-hep/spark-root/

Vaggelis Motesnitsalis - 2nd International PhD School Open Science Cloud

Page 35: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

35

Spark on Kubernetes Service

Page 36: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

36

Spark on Kubernetes ServiceLeveraging the Kubernetes support in Spark 2.3

Prototype of Spark on Kubernetes over

OpenStack and built spark images and tooling

Under active development

Based on already separated storage solution (EOS)

Vaggelis Motesnitsalis - 2nd International PhD School Open Science Cloud

EOS

Storage

Service

Storage Compute

Page 37: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

37

Future Plans

Page 38: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

38

Analytics Platform

EOS Storage Service

runs on

access data from

access data from

accessed by

runs on

Vaggelis Motesnitsalis - 2nd International PhD School Open Science Cloud

Page 39: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

39

Selected Projects

Page 40: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

40

Next Accelerator Logging (NXCALS)

A control system with:

Streaming

Online System

API for Data Extraction

Critical for LHC

Operations

Runs on a dedicated

cluster

Credits: BE-CO-DS

Vaggelis Motesnitsalis - 2nd International PhD School Open Science Cloud

Page 41: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

41

Data Center and WLCG Monitoring Systems

Kafka cluster

(buffering) *

Processing

Data enrichment

Data aggregation

Batch Processing

Transport

Flume

Kafka

sink

Flume

sinks

FTS

Data

Sources

Rucio

XRootD

Jobs

Lemon

syslog

app log

DB

HTTP

feed

AMQFlume

AMQ

Flume

DB

Flume

HTTP

Flume

Log GW

Flume

Metric

GW

Logs

Lemon

metrics

HDFS

ElasticS

earch

Storage &

Search

Others

(influxdb)

Data

Access

CLI, API

Credits: IT-CM-MM

Critical for Data

Center operations

and WLCG

200M events/day

500 GB/day

Proved effective in

several occasions

Vaggelis Motesnitsalis - 2nd International PhD School Open Science Cloud

Page 42: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

42

Computer Security Intrusion Detection

Credits: CERN security team, IT-DI

Vaggelis Motesnitsalis - 2nd International PhD School Open Science Cloud

Page 43: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

43

Physics Analysis with Spark

Page 44: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

44

The CMS Data Reduction Facility

Page 45: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

45

CMS Data Reduction FacilityPerforming Physics Analysis and Data Reduction with Apache Spark

Investigate new ways to analyse physics data and improve resource utilization and time-to-physics

We now have fully functioning Analysis

and Reduction examples tested over

CMS Open Data

We started scaling – goal is 1 PB in 5 hours

Root files are imported with « spark-root »

Files are accessed from the EOS Storage Service

with the « Hadoop-XRootD Connector »

A Project by openlab, Intel, Fermilab and the CMS Experiment

Vaggelis Motesnitsalis - 2nd International PhD School Open Science Cloud

Page 46: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

46

CMS Data Reduction Facility

Produce reduced data based on potential

complicated user queries

If sucessful, this type of facility could be a big

shift for High Energy Physics

Make High Energy Physics more open to the Big

Data community

Vaggelis Motesnitsalis - 2nd International PhD School Open Science Cloud

Page 47: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

47

CMS Data Reduction Facility

Vaggelis Motesnitsalis - 2nd International PhD School Open Science Cloud

EOS

Storage

Service

Driver[Dimuon Invariant

Mass Calculation]

Executor

1

ROOTfile

ROOTfile

Executor 2

ROOTfile

ROOTfile

Executor

3

ROOTfile

ROOTfile

Executor

N

ROOTfile

ROOTfile

Task 2

Task 1

Task X

Hadoop-XRootD Connector

spark-root

Page 48: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

48

CMS Data Reduction Facility

Vaggelis Motesnitsalis - 2nd International PhD School Open Science Cloud

Input Size Time

22 TB 13m

44 TB 25m

66 TB 38m

88 TB 53m

110 TB 59m

500 TB 4h50m

1 PB 9h54m

13 25 38 53 59

290

596

0

100

200

300

400

500

600

700

22 TB 44 TB 66 TB 88 TB 110 TB 500 TB 1 PB

Results for different input size in minutes for Dimuon Mass Calculation

Page 49: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

49

Example

Page 50: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

50

Example on Physics Analysis with SWAN

Vaggelis Motesnitsalis - 2nd International PhD School Open Science Cloud

Page 51: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

51

Example on Physics Analysis with SWAN

Vaggelis Motesnitsalis - 2nd International PhD School Open Science Cloud

Page 52: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

52

Final Result

Page 53: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

53Vaggelis Motesnitsalis - 2nd International PhD School Open Science Cloud

Page 54: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

54

OK, OK, one with better graphics

Page 55: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

55

Final Result

More on: https://swan.web.cern.ch/content/apache-spark

Vaggelis Motesnitsalis - 2nd International PhD School Open Science Cloud

Page 56: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

56

The observed mass distribution of two Z bosons in four leptons events. The points are the data, while the blue histogram is the distribution expected from direct ZZ production in the Standard Model. The red line with a central value around 125 GeV is the Higgs boson signal.

Final Result

Page 57: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

57

Summary

Page 58: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

58

ConclusionsCERN deals with an unprecedented amount of data:

Several PBs at the collisions12+ PBs of filtered data in a single month250 PBs capacity at the Data Center

We now have fully functioning Analysis and Reduction examples tested over 1 PB of CMS Open Data

CERN utilizes a variety of frameworks from the Hadoop Ecosystem:

Apache SparkApache Hadoop/YARNKafkaJupyter Notebooks

We have a number of critical and data intensive use cases:

CMS Data Reduction FacilityNext Accelerator Logging ServiceWLCG and Data Center Monitoring SystemSecurity Intrusion Detection

Vaggelis Motesnitsalis - 2nd International PhD School Open Science Cloud

Page 59: Big Data at CERN - agenda.infn.it · BigData Engineer as openlab Intel Research Fellow Research Fellow from 2017 - today 4+ years of Big Data Experience Former Data Devops Support

59

Thank [email protected]

www.linkedin.com/in/evangelos-motesnitsalis

Vaggelis Motesnitsalis - 2nd International PhD School Open Science Cloud