Perseus Design

Lockheed Martin and Government Use Only


Perseus Design

2

[Diagram: Perseus design. Labels: Original Application, Components, Instrumentation Engine, Instrumented Code, Execution Engine, Behavioral Analysis, Components + Behavioral Meta-data, Design Optimization Engine, Platform Models, Optimization Heuristics, Configuration Plan]

Architecture

Behavioral “signatures” are extracted from a baseline execution

Prototype will focus on support for x86 binaries on a Linux platform

Configuration plans define application-triggered system control (affinity & power)

The plethora of variables presents a huge solution space ideal for Genetic Algorithm approaches

Platform models define number of cores and cache characteristics

Second phase of instrumentation “hooks” configuration plan into application

3

Behavioral Analysis Sub-system

Behavioral analysis is split in two:

TEG (Temporal Execution Graph)

TMAM (Temporal Memory Access Map)

Precise data is collected on a per-thread, per-call site basis

Binary instrumentation is facilitated by Dyninst (University of Wisconsin-Madison)

Accurate counting (e.g., processor cycles) and timing is facilitated through PAPI (University of Tennessee)

[Diagram: behavioral analysis flow. Labels: Binary Code Instrumentation, Components Instrumented with Measurement Probes, Real Platform Execution, Raw Trace Data, Data Distillation and Model Construction, Behavioral Profile]

4

TEG Collection

TEG collects information about how much time the application spends executing its different functions. Both cycle counts and timestamps are collected so that potential “slow-downs” can be identified

Per-thread, per-call site timing and cycle count information is collected for selected function calls

Results provide timing distributions for functions, as opposed to the averages and counts reported by tools such as gprof and callgrind

Overhead depends on the density of instrumentation (i.e., number of functions + calls); in most cases it is negligible
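To illustrate what a per-thread, per-call site timing distribution could look like, here is a minimal sketch in C. It is not the Perseus probe code: the names (`teg_record`, `teg_enter`, `teg_exit`), the table limits, and the power-of-two bucketing are all assumptions made for illustration, and it uses the C11 wall clock where a real probe would more likely read a cycle counter (for example through PAPI).

```c
#include <stdint.h>
#include <time.h>

#define TEG_MAX_THREADS 8   /* hypothetical limits for this sketch */
#define TEG_MAX_SITES   16
#define TEG_BUCKETS     32  /* bucket b counts durations in [2^b, 2^(b+1)) ns */

typedef struct {
    uint64_t count[TEG_BUCKETS]; /* histogram of call durations */
    uint64_t total_ns;           /* sum of durations, for an average if wanted */
    uint64_t calls;              /* number of recorded calls */
} teg_hist;

/* One histogram per (thread, call site) pair. */
static teg_hist teg_table[TEG_MAX_THREADS][TEG_MAX_SITES];

/* Wall-clock time in nanoseconds (C11 timespec_get; not monotonic). */
static uint64_t teg_now_ns(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

/* floor(log2(ns)), clamped to the last bucket; 0 for ns <= 1. */
static int teg_bucket(uint64_t ns) {
    int b = 0;
    while (ns > 1 && b < TEG_BUCKETS - 1) { ns >>= 1; b++; }
    return b;
}

/* Record one call duration for the given thread and call site. */
void teg_record(int thread, int site, uint64_t ns) {
    teg_hist *h = &teg_table[thread][site];
    h->count[teg_bucket(ns)]++;
    h->total_ns += ns;
    h->calls++;
}

/* Probe pair: teg_enter() before the call site, teg_exit() after it. */
uint64_t teg_enter(void) { return teg_now_ns(); }

void teg_exit(int thread, int site, uint64_t start) {
    teg_record(thread, site, teg_now_ns() - start);
}
```

Keeping a histogram rather than a running mean is what lets an occasional slow call show up as a separate bucket instead of disappearing into the average.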

[Diagram: TEG collection pipeline. Labels: Application Binary, TEG Instrumentor (teg.exe), Components Instrumented with Measurement Probes, Real Platform Execution, Shared Memory, Event Data, Data Logger (logger.exe), TEG Binary File (.teg)]

5

TMAM Collection

All application reads and writes to memory are captured via probes instrumented at the binary level. This data is essential for cache false-sharing identification

Data is collected via a shared memory logger

Overhead is very expensive: on the order of 100x slower

At these levels we have to be careful not to affect normal behavior. Dynamic probe placement and sampling could be used to alleviate this problem

Massive volumes of data result (e.g., a 20-second program can generate 100 Gb+)

Two modes of operation: off-line analysis, real-time analysis
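The kind of check that false-sharing identification relies on can be sketched as follows. This is an illustrative assumption, not the conflicts.exe algorithm: the record layout (`mem_access`), the 64-byte line size, and the pairwise scan are hypothetical, and the sketch ignores access width and timing.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64  /* assumed line size; a platform model would supply this */

typedef struct {
    int       thread;    /* logical thread id */
    uintptr_t addr;      /* address touched by the access */
    bool      is_write;  /* true for a store, false for a load */
} mem_access;

/* Two accesses are a false-sharing candidate when they come from different
   threads, fall in the same cache line, touch different addresses, and at
   least one of them is a write. */
bool false_share_pair(const mem_access *a, const mem_access *b) {
    return a->thread != b->thread
        && a->addr / CACHE_LINE == b->addr / CACHE_LINE
        && a->addr != b->addr
        && (a->is_write || b->is_write);
}

/* Count candidate pairs in a trace. The scan is O(n^2), which is only
   workable for a small sample of the trace. */
size_t count_false_sharing(const mem_access *trace, size_t n) {
    size_t hits = 0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (false_share_pair(&trace[i], &trace[j]))
                hits++;
    return hits;
}
```

Given the data volumes quoted above, a practical analysis would presumably bucket accesses by cache line first instead of comparing every pair.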

[Diagrams: TMAM collection in its two modes. Off-line labels: Application Binary, TMAM Instrumentor (tmam.exe), Components Instrumented with Measurement Probes, Real Platform Execution, Shared Memory, Data Logger (logger.exe), Event Data (.tmam), Conflicts Analysis (conflicts.exe), Conflicts File. Real-time labels: Application Binary, TMAM Instrumentor (tmam.exe), Components Instrumented with Measurement Probes, Real Platform Execution, Event Data, Real-time Data Distiller (distiller.exe), Conflicts File]

6

Platform Analysis

Micro-benchmarks implemented as part of the current solution empirically measure:

Number of processors, number (and values) of frequency steppings

Cost of thread migration (i.e., affinity change)

Ratios of power-to-cycles at different frequencies

Cost (in cycles) of frequency modulation

Core topology

7

Example Platform Information

Example data empirically collected through fine-grained on-chip timing and a micro-benchmark program

Data collected from Dual-processor Quad-core Xeon running Debian Linux. Each matrix element is shaded according to measured latencies of the migration (darker is slower).

[Figure: two 8 x 8 core-to-core migration latency matrices, cores 1 to 8 on each axis (From Core, To Core)]

Figure annotation: the dark regions represent “slow” migration across cores on different processors

Figure annotation: migration between two of the four cores is faster within a single processor (suggesting that the Xeon Quad is two Dual-cores linked together)
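One way such a latency matrix could be turned into a core topology is to group cores that are connected by cheap migrations. The sketch below is an assumption about how that might be done, not the Perseus implementation; `group_cores`, the threshold parameter, and the fixed 8-core size are hypothetical.

```c
#define NCORES 8  /* matches the 8-core example platform */

/* Assign a group id to every core. Cores i and j end up in the same group
   when a chain of migrations, each cheaper than `threshold`, connects them.
   lat[i][j] is the measured cost of migrating from core i to core j.
   Returns the number of groups found. */
int group_cores(double lat[NCORES][NCORES], double threshold,
                int group[NCORES]) {
    int ngroups = 0;
    for (int i = 0; i < NCORES; i++)
        group[i] = -1;                       /* -1 means "not yet assigned" */

    for (int i = 0; i < NCORES; i++) {
        if (group[i] != -1)
            continue;
        group[i] = ngroups;
        int changed = 1;
        while (changed) {                    /* grow the group until stable */
            changed = 0;
            for (int a = 0; a < NCORES; a++) {
                if (group[a] != ngroups)
                    continue;
                for (int b = 0; b < NCORES; b++) {
                    if (group[b] == -1 && lat[a][b] < threshold) {
                        group[b] = ngroups;
                        changed = 1;
                    }
                }
            }
        }
        ngroups++;
    }
    return ngroups;
}
```

Running it with a low threshold and then a higher one would, on a platform like the one described, be expected to separate the fast core pairs first and the two processors second; the actual thresholds would have to come from the measured latencies.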

8

Design Optimization Engine

[Diagram: GA-based Optimization Engine. Input labels: Temporal Execution Graph, Temporal Memory Access Map, Platform Models & GA Parameters, XML (x2). Stage labels: Data Transformation and Compilation, Random Population Generation, Fitness Evaluation, Tournament Selection, Reproduction, Cross-over (2-point), Mutation, Random Population Infusion, Next Generation. Output label: Deployment Plan (source code & trigger point description)]
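The genetic operators named in the diagram (tournament selection, 2-point cross-over, mutation) can be sketched generically as below. The chromosome encoding (one gene per thread, holding a core index), the function names, and the small random number generator are assumptions for illustration; the slides do not describe the actual encoding or fitness function.

```c
#include <stddef.h>
#include <stdint.h>

#define GENES 8  /* hypothetical: one gene per thread, value = core index */

/* Small deterministic LCG so runs are repeatable
   (multiplier and increment from Knuth's MMIX generator). */
static uint64_t ga_next(uint64_t *state) {
    *state = *state * 6364136223846793005ULL + 1442695040888963407ULL;
    return *state >> 33;
}

/* 2-point cross-over: the child takes genes [lo, hi) from parent b
   and all other genes from parent a. */
void crossover_2pt(const int *a, const int *b, size_t lo, size_t hi,
                   int *child) {
    for (size_t i = 0; i < GENES; i++)
        child[i] = (i >= lo && i < hi) ? b[i] : a[i];
}

/* Tournament selection: draw k random individuals and return the index
   of the fittest one (higher fitness is better). */
size_t tournament(const double *fitness, size_t pop, size_t k,
                  uint64_t *rng) {
    size_t best = (size_t)(ga_next(rng) % pop);
    for (size_t i = 1; i < k; i++) {
        size_t c = (size_t)(ga_next(rng) % pop);
        if (fitness[c] > fitness[best])
            best = c;
    }
    return best;
}

/* Mutation: each gene is reassigned to a random core with
   probability 1 / rate_inv. */
void mutate(int *genes, int ncores, unsigned rate_inv, uint64_t *rng) {
    for (size_t i = 0; i < GENES; i++)
        if (ga_next(rng) % rate_inv == 0)
            genes[i] = (int)(ga_next(rng) % (uint64_t)ncores);
}
```

Random population infusion, also shown in the diagram, would amount to replacing some individuals with freshly generated ones each generation to keep diversity up.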

9

Example Deployment Data

Deployment results are made up of trigger locations and auto-generated trigger source code

libControl.so
8048C07,Before_CS_8048C07
8048C98,Before_CS_8048C98
8048D92,Before_CS_8048D92
8048DB0,Before_CS_8048DB0

#include <pthread.h>
#include "affinity.h"
#include "fvctrl.h"
#include "triggeraux.h"

void Init_Frequency()
{
    modulate_cpu(0, 1, 0);
    modulate_cpu(1, 1, 0);
    modulate_cpu(2, 1, 0);
    modulate_cpu(3, 0, 0);
    modulate_cpu(4, 0, 0);
    modulate_cpu(5, 0, 0);
    modulate_cpu(6, 0, 0);
    modulate_cpu(7, 1, 0);
}

void Before_CS_8048D92()
{
    switch(GetThreadInstanceId()) {
    case 1: { affinize_thread(0, pthread_self()); break; }
    case 2: { affinize_thread(3, pthread_self()); break; }
    case 3: { affinize_thread(1, pthread_self()); break; }
    case 4: { affinize_thread(1, pthread_self()); break; }
    }
}
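For illustration, a trigger location entry such as `8048D92,Before_CS_8048D92` could be read back as follows. This assumes each entry is a hexadecimal call-site address, a comma, and the name of the generated handler; the slides show the data but do not specify the format, and `trigger` and `parse_trigger` are hypothetical names.

```c
#include <stdio.h>

#define TRIGGER_NAME_MAX 64

typedef struct {
    unsigned long addr;              /* call-site address, parsed as hex */
    char handler[TRIGGER_NAME_MAX];  /* generated function to run there */
} trigger;

/* Parse one "ADDR,HANDLER" entry. Returns 1 on success and 0 when the
   line does not match (for example the leading library name). */
int parse_trigger(const char *line, trigger *out) {
    /* %63s leaves room for the terminating NUL in handler[64]. */
    return sscanf(line, "%lx,%63s", &out->addr, out->handler) == 2;
}
```

A second instrumentation pass could then use each parsed address as the point at which to hook the named handler from libControl.so into the application.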

10

Power Measurement

Server-style ATX power feeds two 12V lines to each processor.

Data is streamed to a host via USB.