LLNL-PRES-652730
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC
Power, Resilience, Capacity, Oh My!
I/O-integrated computing with NVRAM
Maya Gokhale, LLNL
Amount of DRAM per core is severely constrained in an exascale node
• Many cores per node
• Number of nodes
• Total power requirements dominated by DRAM
Scale of an exaflop computer makes frequent checkpointing to global file system infeasible
• Peer-to-peer checkpoints (SCR) harder due to less memory per core
NVRAM in the compute fabric is required for energy, capacity, resilience
Exascale node architectures add NVRAM to memory hierarchy
Exascale machine incorporates NVRAM at every level
X-Stack I NoLoSS
The entire software stack treats memory and storage as different in kind: direct loads and stores to memory, but read and write operations to storage
Emerging persistent memories support efficient random, byte-level read/write access, filling a gap in the memory/storage hierarchy
• Disk access latency is roughly a factor of 100,000 slower than DRAM
• PCM is only a small factor slower than DRAM in read latency (see table below)
NVRAM bridges memory/storage latency gap
            Read Latency   Write Latency
DDR3        0.01 µs        0.01 µs
PCM         0.05 µs        1 µs
NAND        10 µs          100 µs
Disk        1000 µs        1000 µs
With this new hardware technology, can we …
• directly operate on arrays and tables that are too big to fit in main memory
• make permanent updates to arrays and tables without file I/O
• traverse large graphs “as if” in memory
• save lightweight checkpoints by flushing cache to persistent memory (see the flush sketch after this list)
• enable scalable applications to transition to exascale
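One of these capabilities can be sketched concretely: saving a lightweight checkpoint by flushing the cache to persistent memory. The fragment below is a minimal, illustrative C++ sketch assuming an x86 machine and a destination region already mapped onto byte-addressable persistent memory (e.g. via mmap); it is not code from this project.

#include <immintrin.h>   // x86 cache-flush intrinsics
#include <cstdint>
#include <cstring>

// Illustrative only: make a region durable, assuming `addr` lies in
// byte-addressable persistent memory. Flushing each cache line and then
// fencing persists the stores without any file I/O.
void persist_region(const void* addr, size_t len) {
    const size_t line = 64;                                   // typical cache-line size
    const char* p   = static_cast<const char*>(addr);
    const char* end = p + len;
    for (const char* c = reinterpret_cast<const char*>(
             reinterpret_cast<uintptr_t>(p) & ~(line - 1)); c < end; c += line)
        _mm_clflush(c);                                       // write back one cache line
    _mm_sfence();                                             // order flushes before continuing
}

// Hypothetical lightweight checkpoint: copy application state into the
// persistent region, then flush it.
void checkpoint(void* pmem_dst, const void* state, size_t len) {
    std::memcpy(pmem_dst, state, len);
    persist_region(pmem_dst, len);
}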
Outline
Out-of-core memory map runtime for block device
• Customized page cache management for persistent memory data store
• Applications
— Graph traversal framework
— Metagenomic analysis and classification
Active storage
• Future generations of byte-addressable NVRAM storage
• Compute logic co-located with storage controller
Systems software for data intensive computing: data intensive memory map
Brian Van Essen, LLNL
Views of node-local NVRAM
[Figure: NVRAM attached over the I/O bus is presented as a block device/file system; attached through the memory hardware interface it is presented as persistent/extended memory]
Optimize system software for application-specific access patterns
Performance measurements of the system mmap were disappointing
We want the abstraction of memory mapping, but with an implementation tuned for data-centric application classes
Data-intensive memory mapping (DI-MMAP); a minimal usage sketch follows this list
• DRAM acts as a cache for persistent memory
• The data-intensive runtime manages the cache in a pre-allocated memory buffer
• Differentiates between frequently used locations and random accesses
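To make the access model concrete, here is a minimal C++ sketch of how an application might use a DI-MMAP-backed mapping: it memory-maps a file on node-local NVRAM and dereferences it like an ordinary array, with the runtime's buffer standing in for the kernel page cache. The file path, data layout, and the use of a plain mmap() call are illustrative assumptions, not the exact DI-MMAP interface.

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>

int main() {
    // Hypothetical data file residing on node-local NVRAM; with DI-MMAP the
    // mapping is served from its own buffer rather than the default Linux
    // page cache.
    const char*  path = "/l/ssd/edges.bin";        // hypothetical path
    const size_t n    = 1000000000ULL;             // one billion 64-bit values

    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    // Same mmap() abstraction as ordinary files: no explicit read() calls.
    void* base = mmap(nullptr, n * sizeof(uint64_t), PROT_READ, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    const uint64_t* data = static_cast<const uint64_t*>(base);
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i += 4096 / sizeof(uint64_t))
        sum += data[i];                            // touch one value per page

    printf("checksum: %llu\n", (unsigned long long)sum);
    munmap(base, n * sizeof(uint64_t));
    close(fd);
    return 0;
}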
Buffer management applies caching techniques to NVRAM pages
Minimize the amount of effort needed to find a page to evict:
In the steady state a page is evicted on each page fault
Track recently evicted pages to maintain temporal reuse
Allow bulk TLB operations to reduce inter-processor interrupts
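The following is a much-simplified C++ model of the buffer policy described above (illustrative only, not the DI-MMAP kernel code): a fixed-size buffer with a FIFO of resident pages, so a victim is found in constant time on each fault, plus a small table of recently evicted pages so that reused pages can be recognized.

#include <cstdint>
#include <deque>
#include <unordered_set>

// Simplified model of the buffer policy described above (illustrative only).
struct PageBuffer {
    size_t capacity;                        // pre-allocated buffer size, in pages
    std::deque<uint64_t> fifo;              // resident pages, oldest first
    std::unordered_set<uint64_t> resident;  // fast membership test
    std::unordered_set<uint64_t> evicted;   // recently evicted pages (bounded in a real implementation)

    explicit PageBuffer(size_t cap) : capacity(cap) {}

    // Called on a page fault for page `p`; returns the page to evict, or
    // UINT64_MAX if the buffer still has free slots.
    uint64_t fault(uint64_t p) {
        uint64_t victim = UINT64_MAX;
        if (fifo.size() >= capacity) {
            // Steady state: exactly one eviction per fault, found in O(1).
            victim = fifo.front();
            fifo.pop_front();
            resident.erase(victim);
            evicted.insert(victim);          // remember it briefly
            // (a real implementation would also queue the victim's TLB
            //  invalidation so shootdowns can be issued in bulk)
        }
        // Pages seen again soon after eviction are treated as "hot";
        // here we only note the reuse, a real policy would protect them.
        bool hot = evicted.erase(p) > 0;
        (void)hot;
        fifo.push_back(p);
        resident.insert(p);
        return victim;
    }
};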
DI-MMAP is being integrated with the Livermore Computing software stack
Scalable out-of-core performance provides an opportunity for working with node-local storage in HPC
• Catalyst system
— 150 teraflop/s: 324 nodes, 7,776 cores
— 40 TB DRAM + 253 TB Flash
— Each node: 128 GiB DRAM + 800 GiB Flash
DI-MMAP is integrated with SLURM
• Can be loaded by the user
• Allocated per job
• User-specified total memory used for buffering external Flash
https://bitbucket.org/vanessen/di-mmap/
Parallel graph traversal framework
High-visibility data-intensive supercomputing application class
Traverse scale-free graphs with trillions of edges
– Social networks, the web
Developed a parallel, latency-tolerant asynchronous traversal framework
Uses a visitor abstraction (a sketch follows this list)
– Visitor is an application-specific kernel that the framework applies to each graph vertex
– A visit results in traversal of graph edges to queue work on target vertices
– Visitors are queued to vertices using a priority queue
Scalable: from a single server with NVRAM, to a data-intensive cluster, to BG/P
– 15% faster than the best Intrepid result
Demonstrated with multiple traversal algorithms: breadth-first search, single-source shortest path, (strongly) connected components, triangle counting, k-th core
Multiple Graph500 submissions
Code has been open sourced (https://bitbucket.org/PerMA/havoqgt) and will continue to be updated
Roger Pearce, LLNL
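The visitor abstraction can be illustrated with a short C++ sketch of breadth-first search (hypothetical types and names, not the HavoqGT API): the application-specific visitor is applied to each vertex, and each visit queues new visitors on the vertex's neighbors through a priority queue.

#include <cstdint>
#include <limits>
#include <queue>
#include <vector>

// Illustrative-only sketch of a visitor-style BFS; names and types are hypothetical.
struct BFSVisitor {
    uint64_t vertex;
    uint64_t level;
    // Lower level = higher priority, so the frontier expands level by level.
    bool operator<(const BFSVisitor& o) const { return level > o.level; }
};

void bfs(const std::vector<std::vector<uint64_t>>& adj, uint64_t source,
         std::vector<uint64_t>& level) {
    level.assign(adj.size(), std::numeric_limits<uint64_t>::max());
    std::priority_queue<BFSVisitor> q;
    q.push({source, 0});
    while (!q.empty()) {
        BFSVisitor v = q.top(); q.pop();
        if (v.level >= level[v.vertex]) continue;   // already visited at a lower level
        level[v.vertex] = v.level;                  // the "visit": apply the kernel to this vertex
        for (uint64_t nbr : adj[v.vertex])          // traverse edges, queue work on target vertices
            q.push({nbr, v.level + 1});
    }
}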
BFS graph analysis performance shows DI-MMAP advantages
DI-MMAP provides …
High performance
• 7.44× faster than mmap
• with 256 threads and 40 GiB of DRAM
Scalable out-of-core performance
• 50% less DRAM (20 GiB vs 40 GiB)
• only 23% slower
Application: Livermore Metagenomics Analysis Toolkit (LMAT)
• Challenge: given an “anonymous” sample, give as complete a report on the pathogen contents as possible
• Standard method: sequence alignment – align every read in the sample against every known genome
Our approach: use highly concurrent search and large persistent memory to store and access reference data in a “pre-processed” form that supports efficient runtime queries
375 GB k-mer database mapped into memory with DI-MMAP (24.5 billion k-mer keys), with tens to hundreds of threads querying the database (see the sketch below)
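A minimal C++ sketch of that access pattern, under illustrative assumptions (a hypothetical lookup routine and a database already memory-mapped read-only as in the DI-MMAP example earlier): because queries only read the mapped data, many threads can classify k-mers concurrently without locking the database.

#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical handle to the memory-mapped, read-only k-mer database.
struct KmerDB { const uint8_t* base; size_t bytes; };

// Illustrative stub: a real lookup walks the two-level index described on
// the next slide.
uint32_t lookup_taxid(const KmerDB& db, uint64_t kmer) {
    if (db.bytes == 0) return 0;
    return static_cast<uint32_t>(db.base[kmer % db.bytes]);   // placeholder
}

void classify(const KmerDB& db, const std::vector<uint64_t>& read_kmers,
              unsigned nthreads, std::vector<uint32_t>& taxids) {
    taxids.assign(read_kmers.size(), 0);
    std::atomic<size_t> next{0};
    auto worker = [&]() {
        // Threads pull k-mers from a shared counter; the mapped database is
        // read-only, so no locking is needed on the data itself.
        for (size_t i = next.fetch_add(1); i < read_kmers.size();
             i = next.fetch_add(1))
            taxids[i] = lookup_taxid(db, read_kmers[i]);
    };
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nthreads; ++t) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
}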
Jonathan Allen, LLNL
We’ve designed a two-level index optimized for NVRAM
COTS hash tables aren’t suitable for page-access storage
Solution: split the k-mer key (see the sketch below)
• Manage the prefix with an array
• Suffixes are managed in lists that typically span no more than one page
Results: up to 74× speedup over the GNU hash map
Need to determine the split parameter
– can depend on the input data set and k-mer length
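A minimal C++ sketch of the split-key idea, under illustrative assumptions (k-mers packed into 64 bits, an in-memory suffix list per prefix slot, and a tunable prefix width as the split parameter); the real index lays the suffix lists out in the NVRAM-resident database so that a lookup typically touches about one page.

#include <cstdint>
#include <vector>

// Illustrative two-level split-key index: prefix -> array slot, suffix -> short list.
struct SplitKmerIndex {
    unsigned prefix_bits;                                  // the "split parameter" (1..63)
    struct Entry { uint64_t suffix; uint32_t taxid; };
    std::vector<std::vector<Entry>> first_level;           // indexed directly by prefix

    explicit SplitKmerIndex(unsigned bits)
        : prefix_bits(bits), first_level(1ull << bits) {}

    void insert(uint64_t kmer, uint32_t taxid) {
        uint64_t prefix = kmer >> (64 - prefix_bits);
        uint64_t suffix = kmer & ((1ull << (64 - prefix_bits)) - 1);
        first_level[prefix].push_back({suffix, taxid});
    }

    // A lookup touches one first-level slot plus a short suffix list that
    // typically fits within a single page of the mapped database.
    bool find(uint64_t kmer, uint32_t& taxid) const {
        uint64_t prefix = kmer >> (64 - prefix_bits);
        uint64_t suffix = kmer & ((1ull << (64 - prefix_bits)) - 1);
        for (const Entry& e : first_level[prefix])
            if (e.suffix == suffix) { taxid = e.taxid; return true; }
        return false;
    }
};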
Sasha Ames, LLNL
LMAT + DI-MMAP performance
[Chart: LMAT on ramdisk and flash vs. other methods, 1 GB first-level size]
[Chart: two-level index vs. hash on the ERR data set; the index configuration (below) includes the first-level table size]
General experimental setup
• 22 GB original reference index containing microbial (with fungi + protozoa) and human sequences
• LMAT (version 1.1) runs with 160 threads on a 40-core Intel node with 1.2 TB FusionIO, and DI-MMAP with a 16 GB buffer
[Chart: LMAT with DI-MMAP vs. Linux mmap, 8 GB first-level size]
Analyze human genomes consisting of samples from 2,948 individuals collected as part of the NIH 1000 Genomes Project, 100 TB+ compressed
LMAT can answer science questions
• Can we build a “more complete” human reference for screening samples such as the microbiome or cancer sequences?
• Are there any interesting, previously undetected microbial components in the samples?
• What is the statistical profile of k-mer uniqueness across individuals?
Replicate database on node-local flash, distribute dataset across nodes
LMAT experiment runs at scale on the Catalyst cluster
Orders of magnitude range in query rate for different samples
Modest database increase with additional human genomes
Database 1 – original LMAT database
Database 2 – addition of the GenBank human reference
Database 3 – (2) + addition of k-mers from 1000 Genomes Project samples
Expanded database with human variants improves classification
https://computation-rnd.llnl.gov/lmat/
Active storage with Minerva (UCSD and LLNL)
Designed for future generations of NVRAM that are byte-addressable with very low read latency
Based on computation close to storage
Moves data- or I/O-intensive computations to the storage to minimize data transfer between the host and the storage
Processing scales as storage expands
Power efficiency and performance gain
Arup De, UCSD/LLNL
Prototyped on the BEE-3 system
• 4 Xilinx Virtex-5 FPGAs, each with 16 GB memory, ring interconnect
Emulates a PCM controller by inserting appropriate latency on reads and writes, and emulates wear leveling when storing blocks
Gets the advantage of fast PCM and FPGA-based hardware acceleration
Minerva architecture provides concurrent access, in-storage processing units
Minerva Programming Model
Normal reads and writes bypass the storage processor
Remote procedure call-like interface for computational “reads” and “writes” (a sketch follows below)
The storage processor generates an interrupt whenever a request requires host intervention to resolve
The software driver is responsible for setting up, managing, monitoring, and finally releasing the storage processor on behalf of a given application
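A hypothetical host-side sketch of such an RPC-like computational read (the device path, command structure, and ioctl number are invented for illustration and are not Minerva's actual driver interface): the host asks the storage processor to run a kernel, such as a string search, over a range of storage, and only the small result crosses the I/O bus.

#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Hypothetical command block for a computational "read": run kernel `kernel_id`
// over [offset, offset + length) in storage and return a small result.
struct offload_cmd {
    uint32_t kernel_id;      // e.g. 0 = grep, 1 = word count (illustrative)
    uint64_t offset;         // byte offset into the storage region
    uint64_t length;         // bytes to scan
    char     arg[64];        // kernel argument, e.g. the search string
    uint64_t result;         // filled in by the storage processor
};

// Hypothetical ioctl number; a real driver would define its own.
#define MINERVA_OFFLOAD _IOWR('M', 1, offload_cmd)

int main() {
    int fd = open("/dev/minerva0", O_RDWR);         // hypothetical device node
    if (fd < 0) { perror("open"); return 1; }

    offload_cmd cmd{};
    cmd.kernel_id = 0;                               // streaming string search
    cmd.offset = 0;
    cmd.length = 1ull << 30;                         // scan 1 GiB in storage
    std::strncpy(cmd.arg, "GATTACA", sizeof(cmd.arg) - 1);

    // The driver sets up the storage processor, services its interrupts if
    // host intervention is needed, then returns the match count.
    if (ioctl(fd, MINERVA_OFFLOAD, &cmd) < 0) { perror("ioctl"); close(fd); return 1; }
    printf("matches found in storage: %llu\n", (unsigned long long)cmd.result);

    close(fd);
    return 0;
}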
Interaction with storage processor
[Figure: storage processor block diagram – control processor, splitter, request manager with command/status queues, data manager with DMA engine and data scheduler, compute manager with compute scheduler and kernels K0…Kn, local memory, and local/ring/memory interfaces; the example compute kernel contains a request extractor, response generator, word counter, grep FSM, comparator, and request-response interface]
[Figure: command flow – the storage processor receives and reads a command, loads data to local memory by sending DMA requests via the request manager, initiates computations via the request manager, receives the result, updates status, and sends back a completion notification]
Minerva Prototype
Built on BEE3 board
PCIe 1.1 x8 host connection
Clock frequency 250 MHz
Virtex-5 FPGA implements
• Request scheduler
• Network
• Storage processor
• Memory controller
DDR2 DRAM emulates PCM
Minerva FPGA emulator allows us to estimate performance, power
Compare standard I/O with computational offload approach
Design alternative storage processor architectures and evaluate trade-offs
Run benchmarks and mini-apps in real-time
Estimate power draw and energy for different technology nodes
Streaming string search• 8+× performance• 10+× energy efficiency
Saxpy• 4× performance• 4.5× energy efficiency
Key-Value store (key lookup)• 1.6× performance• 1.7× energy efficiency
B+ tree search• 4+× performance• 5× energy efficiency
Contributors
Active storage
• Arup De (Lawrence Scholar, UCSD)
Data-intensive memory map
• Brian Van Essen
Metagenomic analysis
• Sasha Ames
• Jonathan Allen – PI of Metagenomics LDRD
Graph algorithms
• Roger Pearce
Collaborators
• UCSD – active storage
• SDSC – data intensive architectures
• UCSC – Resistive RAM as Content Addressable Memory
• TAMU – graph algorithms
• Argonne, ASCR – data intensive memory map for the exascale
Further reading
• Arup De, Maya Gokhale, Rajesh Gupta, and Steven Swanson, Minerva: Accelerating Data Analysis in Next-Generation SSDs, Proceedings of Field-Programmable Custom Computing Machines (FCCM), April 2013.
• Brian Van Essen, Henry Hsieh, Sasha Ames, Roger Pearce, Maya Gokhale, DI-MMAP: a scalable memory map runtime for out-of-core data-intensive applications, Cluster Computing, Oct. 2013 (open access).
• Sasha K. Ames, David A. Hysom, Shea N. Gardner, G. Scott Lloyd, Maya B. Gokhale, and Jonathan E. Allen, Scalable metagenomic taxonomy classification using a reference genome database, Bioinformatics, July 2, 2013 (open access).
• Roger Pearce, Maya Gokhale, Nancy Amato, Scaling Techniques for Massive Scale-Free Graphs in Distributed (External) Memory, IEEE International Parallel and Distributed Processing Symposium, May 2013.
In-storage computing can improve performance and energy-delay product (EDP), particularly for low-latency, byte-addressable persistent memory on the attached I/O bus.
Extending the in-memory model to present-day flash-based block devices has demonstrated significant performance gains.
Highly concurrent throughput-driven algorithms can achieve excellent performance when accessing NVRAM through the optimized software stack.
In-cluster persistent memory addresses capacity, power, and latency/bandwidth gaps