First principles modeling with Octopus: massive parallelization towards petaflop computing and more...


A. Castro, J. Alberdi and A. Rubio


Outline

- Theoretical Spectroscopy
- The octopus code
- Parallelization


Theoretical Spectroscopy


Electronic excitations:
- Optical absorption
- Electron energy loss
- Inelastic X-ray scattering
- Photoemission
- Inverse photoemission
- …


Goal: a first-principles (i.e. from the electronic structure) theoretical description of the various spectroscopies (“theoretical beamlines”).

Role: interpretation of (complex) experimental findings.

Example: theoretical atomistic structures, and the corresponding TEM images.


The European Theoretical Spectroscopy Facility (ETSF)


- Networking
- Integration of tools (formalism, software)
- Maintenance of tools
- Support, service, training


The octopus code is a member of a family of free-software codes developed, to a large extent, within the ETSF:
- abinit
- octopus
- dp


Outline

- Theoretical Spectroscopy
- The octopus code
- Parallelization


The octopus code

Targets:
- Optical absorption spectra of molecules, clusters, nanostructures, and solids.
- Response to lasers (non-perturbative response to high-intensity fields).
- Dichroic spectra, and other mixed electric-magnetic responses.
- Adiabatic and non-adiabatic molecular dynamics (for, e.g., infrared and vibrational spectra, or photochemical reactions).
- Quantum optimal control theory for molecular processes.


Physical approximations and techniques:
- Density-functional theory (DFT) and time-dependent density-functional theory (TDDFT) to describe the electronic structure.
  - Comprehensive set of functionals through the libxc library.
- Mixed quantum-classical systems.
- Both real-time and frequency-domain response (“Casida” and “Sternheimer” formulations).


Numerics:
- Basic representation: real-space grid.
- Usually regular and rectangular, occasionally curvilinear.
- Plane waves for some procedures (especially for periodic systems).
- Atomic orbitals for some procedures.


Derivative at a point: a sum over neighboring points. The coefficients c_ij depend on the points used: the stencil. More points -> more precision. It is a semi-local operation.
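As an illustration (a minimal C sketch, not code from Octopus), a fourth-order, five-point stencil for the second derivative on a uniform grid; the coefficients play the role of the c_ij above:

```c
#include <stdio.h>

/* Fourth-order finite-difference second derivative on a uniform 1D grid.
 * The stencil coefficients multiply the neighboring values f[i+j];
 * a wider stencil (more points) gives higher accuracy. */
void second_derivative(const double *f, double *d2f, int n, double h)
{
    /* Five-point stencil coefficients for d2/dx2, accurate to order h^4 */
    const double c[5] = {-1.0/12.0, 4.0/3.0, -5.0/2.0, 4.0/3.0, -1.0/12.0};
    for (int i = 2; i < n - 2; i++) {
        double sum = 0.0;
        for (int j = -2; j <= 2; j++)
            sum += c[j + 2] * f[i + j];
        d2f[i] = sum / (h * h);   /* semi-local: only nearby points are needed */
    }
}

int main(void)
{
    enum { N = 101 };
    double f[N], d2f[N], h = 0.1;
    for (int i = 0; i < N; i++) {
        double x = i * h;
        f[i] = x * x;             /* test function: f(x) = x^2, so f'' = 2 */
    }
    second_derivative(f, d2f, N, h);
    printf("d2f/dx2 at midpoint: %f (expected 2)\n", d2f[N / 2]);
    return 0;
}
```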


The key equations:
- Ground-state DFT: the Kohn-Sham equations.
- Time-dependent DFT: the time-dependent Kohn-Sham equations.
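The equations themselves were shown as images on the slide; in atomic units, the standard forms are:

```latex
% Ground-state Kohn-Sham equations
\left[-\tfrac{1}{2}\nabla^{2} + v_{\mathrm{ext}}(\mathbf{r})
     + v_{\mathrm{H}}[n](\mathbf{r}) + v_{\mathrm{xc}}[n](\mathbf{r})\right]
\varphi_{i}(\mathbf{r}) = \varepsilon_{i}\,\varphi_{i}(\mathbf{r}),
\qquad
n(\mathbf{r}) = \sum_{i} \lvert \varphi_{i}(\mathbf{r}) \rvert^{2}

% Time-dependent Kohn-Sham equations
i\,\frac{\partial}{\partial t}\,\varphi_{i}(\mathbf{r},t) =
\left[-\tfrac{1}{2}\nabla^{2} + v_{\mathrm{KS}}[n](\mathbf{r},t)\right]
\varphi_{i}(\mathbf{r},t)
```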


Key numerical operations:
- Linear systems with sparse matrices.
- Eigenvalue systems with sparse matrices.
- Non-linear eigenvalue systems.
- Propagation of “Schrödinger-like” equations.
- The dimension can go up to 10 million points.
- The storage needs can go up to 10 GB.
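A common way to carry out such a propagation (stated here as standard background, not taken from the slides) is to apply the time-evolution operator over short time steps:

```latex
\varphi_{i}(\mathbf{r}, t+\Delta t)
  = \hat{U}(t+\Delta t,\, t)\,\varphi_{i}(\mathbf{r}, t)
  \;\approx\;
  \exp\!\left[-\,i\,\hat{H}_{\mathrm{KS}}(t)\,\Delta t\right]
  \varphi_{i}(\mathbf{r}, t)
```

Approximating the exponential only requires applying the sparse Hamiltonian to the orbitals repeatedly, which is why the sparse-matrix operations above dominate.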


Use of libraries:
- BLAS, LAPACK
- GNU GSL mathematical library
- FFTW
- NetCDF
- ETSF input/output library
- libxc exchange and correlation library
- Other optional libraries


www.tddft.org/programs/octopus/


Outline

- Theoretical Spectroscopy
- The octopus code
- Parallelization


Objective

- Reach petaflop computing with a scientific code.
- Simulate the absorption of light by chlorophyll in photosynthesis.


Multi-level parallelization

MPI:
- Kohn-Sham states
- Real-space domains

In node:
- OpenMP threads (CPU)
- OpenCL tasks (GPU)
- Vectorization


Target systems: a massive number of execution units
- Multi-core processors with vector FPUs
- IBM Blue Gene architecture
- Graphics processing units


High Level Parallelization

MPI parallelization


Parallelization by states/orbitals

- Assign each processor a group of states.
- Time propagation is independent for each state.
- Little communication is required.
- Limited by the number of states in the system.
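As a rough illustration (a minimal MPI sketch in C, not Octopus's actual distribution code; the state count is made up), states can be assigned to ranks in contiguous blocks, and each rank then propagates only its own block:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Hypothetical number of Kohn-Sham states in the system */
    const int nstates = 1000;

    /* Split the states into nearly equal contiguous blocks, one per rank */
    int base  = nstates / nprocs;
    int extra = nstates % nprocs;
    int mine  = base + (rank < extra ? 1 : 0);
    int first = rank * base + (rank < extra ? rank : extra);

    printf("rank %d propagates states %d..%d\n", rank, first, first + mine - 1);

    /* Each rank now time-propagates only its own states; since each state
     * propagates independently, little communication is needed (mainly
     * reductions to rebuild the density). */

    MPI_Finalize();
    return 0;
}
```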


Domain parallelization

- Assign each processor a set of grid points.
- Partitioning libraries: Zoltan or METIS.
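A minimal sketch of the idea, assuming a simple 1D slab split with one ghost point per side for the stencil; Octopus's real partitions come from METIS or Zoltan and are far more general:

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int N = 1 << 20;          /* total number of grid points (illustrative) */
    const int local = N / nprocs;   /* points owned by this rank (assume divisible) */

    /* storage: local points plus one ghost point at each end */
    double *f = calloc(local + 2, sizeof(double));

    int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

    /* exchange boundary values so the stencil can be applied at slab edges */
    MPI_Sendrecv(&f[1],         1, MPI_DOUBLE, left,  0,
                 &f[local + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&f[local],     1, MPI_DOUBLE, right, 1,
                 &f[0],         1, MPI_DOUBLE, left,  1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d owns %d grid points\n", rank, local);

    free(f);
    MPI_Finalize();
    return 0;
}
```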


Main operations in domain parallelization

Low-level parallelization and vectorization

OpenMP and GPU


Two approaches

OpenMP:
- Thread programming based on compiler directives
- In-node parallelization
- Little memory overhead compared to MPI
- Scaling limited by memory bandwidth
- Multithreaded BLAS and LAPACK

OpenCL:
- Hundreds of execution units
- High memory bandwidth, but with long latency
- Behaves like a vector processor (vector length > 16)
- Separate memory: data must be copied from/to main memory
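To make the OpenMP approach concrete, a minimal sketch (not Octopus source) of a directive-parallelized loop over grid points:

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double rho[N], out[N];

    /* fill with some data */
    for (int i = 0; i < N; i++)
        rho[i] = 1.0 / (i + 1);

    /* The loop over grid points is split among the threads of one node by a
     * compiler directive; no explicit message passing is involved. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        out[i] = 2.0 * rho[i];

    printf("used up to %d threads\n", omp_get_max_threads());
    return 0;
}
```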


Supercomputers

- Corvo cluster: x86_64
- VARGAS (at IDRIS): Power6, 67 teraflops
- MareNostrum: PowerPC 970, 94 teraflops
- Jugene (pictured): Blue Gene/P, 1 petaflop


Test Results


Laplacian operator

Comparison of the performance of the finite-difference Laplacian operator:
- The CPU uses 4 threads
- The GPU is 4 times faster
- Cache effects are visible


Time propagation

Comparison of the performance for a time propagation:
- Fullerene molecule
- The GPU is 3 times faster
- Limited by copying and by non-GPU code


Multi-level parallelization

- Chlorophyll molecule: 650 atoms
- Jugene Blue Gene/P
- Sustained throughput: > 6.5 teraflops
- Peak throughput: 55 teraflops

Scaling


Scaling (II)

Comparison of two atomic systems on Jugene


Target system

Jugene, all nodes:
- 294,912 processor cores = 73,728 nodes
- Maximum theoretical performance of about 1 petaflop

5879-atom chlorophyll system:
- The complete molecule, from spinach


Test systems

Smaller molecules:
- 180 atoms
- 441 atoms
- 650 atoms
- 1365 atoms

Partitions of the machines:
- Jugene and Corvo


Profiling

- Profiled within the code
- Profiled with the Paraver tool: www.bsc.es/paraver

Paraver trace views (figures): one TD iteration; some “inner” iterations; one “inner” iteration (Irecv, Isend, Iwait); the Poisson solver (2x Alltoall, Allgather, Allgather, Scatter).


Improvements

Memory improvements in the ground state (GS):
- Split the memory among the nodes
- Use of ScaLAPACK

Improvements in the Poisson solver for time propagation (TD):
- Pipelined execution: run the Poisson solver while continuing with an approximation
- Use of new algorithms such as FFM
- Use of parallel FFTs


Conclusions

- The Kohn-Sham scheme is inherently parallel.
- This can be exploited for parallelization and vectorization.
- It is well suited to current and future computer architectures.
- Theoretical improvements for large-system modeling.
