SoC HPC: Design and Optimization
Mark Delgado
BS in Nuclear Engineering from NC State
Python User
Presentation Topics
Why?
SoC Choices and Economics
Software and IT Stack
Cluster Design and Decisions
Optimizations and Improvements
What has been done today
What will be done tomorrow
Not Presentation Topics
Calculation and Data Decisions
Data Acquisition
Parameter Selections
Application Strategies
Broker Selection and Integration
Heavy Quantitative Finance
Cluster Application Strategies
Source Code
Why?
Professional Curiosity, New Challenges, New Technologies
Project Hypothesis
Can I build a system that can perform massive amounts of calculations?
Can I then use this system to solve problems, find relationships, and find strategies?
Can I build or modify the system to take any strategies and apply them?
What Kind of System? What is HPC?
Titan Supercomputer
Titan Economics
18,688 Nodes with 16 Cores per Node
299,008 Total CPU Cores
18,688 GPUs
Total Cost: $97,000,000
Individual Unit? Only $15,000!
SoC Choices and Economics
Raspberry Pi 3 1.2 GHz Quad Core ARM, 1GB RAM, $35
Parallella Dual Core ARM, 16 Core RISC CPU, $150
Odroid-XU4 Quad Core ARM 1.5 GHz, Quad Core ARM 2.0 GHz, 2GB RAM, $75
SoC vs Server?
XU4 Energy Requirements: 20 Watts
Server Energy Requirements: 750 Watts
Total Yearly Cost of Server ~$725
Total Yearly Cost of XU4 ~$89
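The slides do not say how the yearly figures were derived. As a rough sketch, the energy part of such a comparison could be computed as below. The electricity rate of $0.11 per kWh and the assumption of continuous operation are illustrative guesses, not values from the presentation, and the slide totals may include costs beyond energy.

```python
HOURS_PER_YEAR = 24 * 365


def yearly_energy_cost(watts, usd_per_kwh):
    """Cost in USD of running a constant load of `watts` for one year."""
    kwh_per_year = watts / 1000 * HOURS_PER_YEAR
    return kwh_per_year * usd_per_kwh


ASSUMED_RATE = 0.11  # USD per kWh, a hypothetical rate chosen for illustration

server_cost = yearly_energy_cost(750, ASSUMED_RATE)
xu4_cost = yearly_energy_cost(20, ASSUMED_RATE)
print(f"server: ${server_cost:.2f}/year, XU4: ${xu4_cost:.2f}/year")
```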
Software and IT Stack
1 Gbps switch
Cat6 Cables
1 Gbps Supporting SoC and Laptop
Configured and Mounted NFSv4 Folders and Partitions
SSH access
Software and IT Stack
What is the System Being Designed for?
Ease of Use and Support?
Less Ease, Less Support, More Performance?
Software and IT Stack
Pure and Raw Performance?
Less Support, More Difficult to Use
Difficult to Set Up, Difficult to Hand Off
5-10% increase, modern software
Software and IT Stack
Languages Used: Python, Cython, C/C++
Message Passing: OpenMPI, 0mq
Networking: 0mq
Database: MongoDB
Software and IT Stack
Python Modules:
Message Passing: pyzmq, mpi4py
Networking: pyzmq
Database: PyMongo
All modules are available via pip!
Cluster Design and Decisions
The Buy Strategy: MACD Crossover
The Sell Strategy: TP/SL (Take Profit / Stop Loss)
Timeframe: Weekly
Data Resolution: Minute
Question: Using a MACD crossover as a buy strategy and a TP/SL as a sell strategy, is there a parameter combination that yields a higher ROI than the weekly ROI of that equity?
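The presentation does not include source code, so the following is only a minimal sketch of what a MACD crossover entry and a TP/SL exit could look like. The function names, the use of percentage thresholds for TP and SL, and the textbook EMA smoothing factor are all assumptions made for illustration.

```python
def ema(values, period):
    """Exponential moving average with smoothing factor 2 / (period + 1)."""
    alpha = 2.0 / (period + 1)
    out = [values[0]]
    for v in values[1:]:
        prev = out[-1]
        out.append(prev + alpha * (v - prev))
    return out


def macd_buy_signals(prices, fast, slow, signal):
    """Indices where the MACD line crosses above its signal line."""
    macd = [f - s for f, s in zip(ema(prices, fast), ema(prices, slow))]
    sig = ema(macd, signal)
    return [
        i for i in range(1, len(prices))
        if macd[i - 1] <= sig[i - 1] and macd[i] > sig[i]
    ]


def exit_index(prices, entry, tp_pct, sl_pct):
    """First index after `entry` where take profit or stop loss is hit.

    Returns None if neither threshold is reached (hypothetical convention).
    """
    entry_price = prices[entry]
    for i in range(entry + 1, len(prices)):
        change = (prices[i] - entry_price) / entry_price * 100
        if change >= tp_pct or change <= -sl_pct:
            return i
    return None


# Toy price series: flat, then falling, then recovering.
prices = [10] * 5 + [9, 8, 7, 6, 5] + [6, 7, 8, 9, 10, 11]
for entry in macd_buy_signals(prices, fast=2, slow=4, signal=2):
    print("buy at", entry, "exit at", exit_index(prices, entry, 10, 10))
```

Here the three MACD periods and the two thresholds play the role of the five parameters being searched; how the original system maps them is not stated.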
Cluster Design and Decisions
Hypothesis: YES!
Problem: f(a,b,c,tp,sl)
a=2..100, b=2..100, c=2..100, tp=1..10, sl=1..10
98**3*10**2*37s = ~6600 years of calculations
Solution: Parallelization, Network Optimization, Algo Optimization
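The size of the brute-force search can be checked with a few lines of Python. This sketch assumes the ranges exclude their upper bound for a, b, and c (as Python's `range(2, 100)` does), which is what gives 98 values per axis; the helper names are hypothetical. It reproduces only the combination count, since the year totals on the slides may depend on factors not shown in the formula.

```python
A_VALUES = range(2, 100)   # 98 values (assumption: upper bound excluded)
B_VALUES = range(2, 100)
C_VALUES = range(2, 100)
TP_VALUES = range(1, 11)   # 10 values
SL_VALUES = range(1, 11)   # 10 values


def grid_size(*axes):
    """Number of combinations in the Cartesian product of the axes."""
    total = 1
    for axis in axes:
        total *= len(axis)
    return total


def total_seconds(n_calculations, seconds_per_calculation):
    """Serial run time if every combination costs the same."""
    return n_calculations * seconds_per_calculation


n = grid_size(A_VALUES, B_VALUES, C_VALUES, TP_VALUES, SL_VALUES)
print(f"{n:,} combinations")
```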
Cluster Design and Decisions
Lesson 0: Memory > Database
15-second query done every calculation
New time: ~4000 years
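One common way to apply this lesson is to fetch the data once and reuse it from memory for every parameter combination. The sketch below uses `functools.lru_cache` around a stand-in loader; `load_minute_bars`, its arguments, and the placeholder data are hypothetical and do not reflect the actual MongoDB schema or the approach used in the project.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def load_minute_bars(symbol, week):
    """Stand-in for the expensive database query (hypothetical).

    In a real system this is where the database would be queried; the
    cache ensures it runs only once per (symbol, week).
    """
    return tuple(100.0 + 0.01 * i for i in range(5))  # placeholder data


def evaluate(symbol, week, params):
    """Evaluate one parameter combination against in-memory data."""
    bars = load_minute_bars(symbol, week)  # cache hit after the first call
    return sum(bars) * params[0]           # placeholder calculation


for params in [(2,), (3,), (4,)]:
    evaluate("XYZ", "week-1", params)

print(load_minute_bars.cache_info())
```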
Cluster Design and Decisions
0mq Pub/Sub Network Architecture
Cluster Design and Decisions
Lesson 1: Avoid ‘Pre-Processing’ Data
More Gbps = More Time
Cluster Design and Decisions
New Calculation time: ~2s
New Total time: 11.1 years
Cluster Design and Decisions
Lesson 2: Memory > Network
Cluster Design and Decisions
Lesson 3: Parallelize Everything
Cluster Design and Decisions
Different Designs Yield Different Results
Control time = 0.6s
Pub/Sub = ~1s = 11.1 years
Pub/Sub/Modified = 0.83s = 10.2 years
Pub/Sub/Modified/Parallel = 0.78s = 9.5 years
Cluster Design and Decisions
Lesson 4: Cython isn’t always the answer
Still slow, worth exploring?
Cluster Design and Decisions
Different types of clusters for different problems
Previous cluster designs = Centralized Streaming and Centralized Storage
Cluster Design and Decisions
Introducing Decentralized Streaming and Centralized Storage
Cluster Design and Decisions
Lesson 5: Good Memory Management = Good Results
Cluster Design and Decisions
Removing the network stream reduces the data transmission time to 0s
New Calculation time = 1s
New Total time = 5.56 years
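One way to take the parameter stream off the network is to let every node derive its own slice of the parameter grid locally from just its rank and the worker count. The slides do not describe the actual mechanism, so this is only a sketch under that assumption; `parameter_grid` and `local_shard` are hypothetical names.

```python
from itertools import islice, product


def parameter_grid():
    """Lazily yield every (a, b, c, tp, sl) combination."""
    return product(range(2, 100), range(2, 100), range(2, 100),
                   range(1, 11), range(1, 11))


def local_shard(grid, rank, n_workers):
    """Every n_workers-th combination, starting at offset `rank`.

    Each node calls this with its own rank, so no parameters need to be
    sent over the network and the shards never overlap.
    """
    return islice(grid, rank, None, n_workers)


# Example: the first three combinations that worker 1 of 4 would process.
for combo in islice(local_shard(parameter_grid(), rank=1, n_workers=4), 3):
    print(combo)
```

Striding by rank keeps the shards balanced without any coordination; results can still be written to shared storage, in line with the centralized storage design.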
Optimizations and Improvements
Lesson 6: Profile, Profile, Profile
What are the pain points in the algo?
Given the current algo design, what can be ported to C/Cython?
Are the parameters ‘good’?
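Python's standard library ships with `cProfile`, which is one way to find those pain points before porting anything to C or Cython. The workload below (`slow_step`, `run_calculation`) is a made-up stand-in, not the project's algorithm.

```python
import cProfile
import io
import pstats


def slow_step(n):
    """Deliberately heavy placeholder for a hot loop."""
    return sum(i * i for i in range(n))


def run_calculation():
    """Hypothetical single calculation made of several steps."""
    return slow_step(200_000) + slow_step(1_000)


def profile(func, *args):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats()
    return result, stream.getvalue()


result, report = profile(run_calculation)
print(report)
```

Functions that dominate the cumulative time column are the natural candidates for a C/Cython port.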
Optimizations and Improvements
Choosing ‘good’ parameters = .5s, new time = 2.78 years
Exporting math to C/Cython = .2s, new time = 1.1 years
Combining C/Cython and PyPy = .09s, new time = 0.5 years
Choosing ‘actually good’ parameters = .06s ***Speculating***, new time = .33 years
Optimizations and Improvements
The problem: 98**3*10**2*.06s = Total Time
98**3*10**2 = C
0.06 = t
C*t = Total Time
Optimizations and Improvements
Total Calculation number = 98**3*10**2 = 94,119,200 = C
Decrease Resolution of C = Cn
Cn = C*.99
New time after Cn = .021 years
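The slides do not say how the resolution was decreased. One possible reading is to sample the a, b, and c axes with a coarser step, which shrinks the number of combinations sharply. The sketch below assumes that interpretation; the step of 5 is an arbitrary illustrative choice, not a value from the presentation.

```python
def combination_count(step):
    """Grid size when a, b, c are sampled every `step` values.

    tp and sl keep their full 10 values each (assumption).
    """
    per_axis = len(range(2, 100, step))
    return per_axis ** 3 * 10 * 10


full = combination_count(1)
coarse = combination_count(5)
print(f"full grid:   {full:,}")
print(f"coarse grid: {coarse:,} ({coarse / full:.2%} of the full grid)")
```

A coarse pass like this could be followed by a finer search around the best candidates, though whether the original project did so is not stated.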
Optimizations and Improvements
Lesson 7: IT Automation is Awesome! Especially when applied to math!
Use IT automation to determine new values of Cn and automatically parallelize calculations
New time = ***0.005-0.01 years***
Optimizations and Improvements
New Estimated Total Time: .01 years
.01 years = 3.6 days
From 6600 years to 3.6 days
Optimizations and Improvements
What did we just do?
S(M1,M2,M3,TP,SL)
M1=x1→x1*
M2=x2→x2*
M3=x3→x3*
TP=x4→x4*
SL=x5→x5*
What Has Been Done Today
Everything except PyPy and C/Cython merge, IT Automation, and IT Automation + Math
What can I show you?
Fully functioning cluster without automation
Real performance differences between Python and PyPy
NFS to aggregate the results
What Will Be Done Tomorrow?
PyPy and C/Cython merge, IT Automation, and IT Automation + Math
Pandas to handle data
Matplotlib to graph potential strategies
Questions?
Thanks