CNL2 WP Architecture

Embed Size (px)

Citation preview

  • 8/12/2019 CNL2 WP Architecture

    1/20

    Whitepaper: Distributed Numerical Computing

    with Microsoft Codename Cloud NumericsMicrosoft Corporation

    Published: January 2012

    This content is preliminary content. It might be incomplete and is subject to change.

    AbstractThis white paper introduces Microsoft Codename Cloud Numerics lab (referred to as Cloud Numerics

    in the following content), a numerical and data analytics library. It provides guidelines for data scientists

    and others who write C# applications to enable their applications to be scaled out, deployed, and run on

    Windows Azure.

  • 8/12/2019 CNL2 WP Architecture

    2/20

    Copyright Information

    This document supports a preliminary release of a software product that may be changed substantially

    prior to final commercial release, and is the confidential and proprietary information of Microsoft

    Corporation. It is disclosed pursuant to a non-disclosure agreement between the recipient and Microsoft.

    This document is provided for informational purposes only and Microsoft makes no warranties, either

    express or implied, in this document. Information in this document, including URL and other Internet Web

    site references, is subject to change without notice. The entire risk of the use or the results from the use

    of this document remains with the user. Unless otherwise noted, the companies, organizations, products,

    domain names, e-mail addresses, logos, people, places, and events depicted in examples herein are

    fictitious. No association with any real company, organization, product, domain name, e-mail address,

    logo, person, place, or event is intended or should be inferred. Complying with all applicable copyright

    laws is the responsibility of the user. Without limiting the rights under copyright, no part of this document

    may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any

    means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the

    express written permission of Microsoft Corporation.

    Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property

    rights covering subject matter in this document. Except as expressly provided in any written license

    agreement from Microsoft, the furnishing of this document does not give you any license to these patents,

    trademarks, copyrights, or other intellectual property.

    2011-2012 Microsoft Corporation. All rights reserved.

    Microsoft, Windows, Windows HPC Server, Visual Studio, and Windows Azure are trademarks of theMicrosoft group of companies.

    All other trademarks are property of their respective owners.

  • 8/12/2019 CNL2 WP Architecture

    3/20

    Contents

    1 Introduction ......................................................................................................................................... 4

    1.1 A Hello, World example in Cloud Numerics............................................................................. 5

    1.2 What Cloud Numerics is not ....................................................................................................... 5

    1.3 What this document does and does not cover .............................................................................. 6

    1.4 High-level overview ....................................................................................................................... 6

    2 The Cloud Numerics Programming Model.................................................................................... 8

    2.1 Arrays in Cloud Numerics ........................................................................................................... 9

    2.1.1 Array element types .............................................................................................................. 9

    2.1.2 Indexing and iteration ............................................................................................................ 9

    2.1.3 Array broadcasting ................................................................................................................ 9

    2.2 Local arrays in Cloud Numerics................................................................................................ 10

    2.2.1 Underlying storage .............................................................................................................. 102.2.2 Array factory type ................................................................................................................ 10

    2.2.3 Interoperability with .NET arrays ......................................................................................... 10

    2.2.4 Indexing and iteration with local arrays ............................................................................... 11

    2.2.5 Other operations.................................................................................................................. 11

    2.3 Distributed Arrays ........................................................................................................................ 12

    2.3.1 Partitioning of arrays ........................................................................................................... 12

    2.3.2 Propagation of distributions ................................................................................................. 12

    2.4 Distributed array types ................................................................................................................ 13

    2.5 Working with distributed arrays ................................................................................................... 13

    2.5.1 Conversion between local and distributed arrays ............................................................... 13

    2.5.2 Indexing distributed arrays .................................................................................................. 13

    2.5.3 Other distributed array operations ....................................................................................... 14

    2.6 Expressing parallelism in Cloud Numerics............................................................................... 14

    3 The Cloud Numerics Runtime Execution Model........................................................................ 15

    3.1 Application development, debugging and deployment ............................................................... 15

    3.2 Process model............................................................................................................................. 15

    4 Conclusions ....................................................................................................................................... 19

    5 References ......................................................................................................................................... 20

  • 8/12/2019 CNL2 WP Architecture

    4/20

    1 Introduction

    Cloud Numericsis a new .NET programming framework tailored towards performing numerically-

    intensive computations on large distributed data sets. It consists of

    a programming model that exposes the notion of apartitionedor distributedarray to the user

    an execution framework or runtime that efficiently maps operations on distributed arrays to a

    collection of nodes in a cluster

    an extensive library of pre-existing operations on distributed arrays and tools that simplify the

    deployment and execution of a Cloud Numericsapplication on the Windows Azureplatform

    Programming frameworks such as Map/Reduce [2, 3](and its open-source counterpart, Hadoop [9]) have

    evolved to greatly simplify the processing of large datasets. These frameworks expose a very simple end-

    user programming model and the underlying execution engine abstracts away most of the details of

    running applications in a highly scalable manner on large commodity clusters. These simplified modelsare adequate for performing relational operations and implementing clustering and machine learning

    algorithms on data that is too large to fit into the main memory of all the nodes in a cluster. However, they

    are often not optimal for cases when the data does fit into the main memory of the cluster nodes.

    Additionally, algorithms that are inherently iterative in nature or that are most conveniently expressed in

    terms of computations on arrays are difficult to express using these simplified programming models.

    Finally, while the Hadoop ecosystem is extremely vibrant and has developed multiple libraries such as

    Mahout [7], Pegasus [5]and HAMA for data analysis and machine learning, these currently do not

    leverage existing, mature scalable linear algebra libraries such as PBLAS and ScaLAPACK that have

    been developed and refined over several years.

    In contrast, libraries such as the Message Passing Interface or MPI [7]are ideally suited for efficiently

    processing in-memory data on large clusters, but are incredibly difficult to program correctly andefficiently. A user of the library has to carefully orchestrate the movement of data between the various

    parallel processes. Unless this is done with a lot of care, the resulting high performanceapplication will

    exhibit extremely poor scalability or worse, crash or hang in a non-deterministic manner.

    The programming abstractions provided by Cloud Numericsdo not expose any low-level parallel-

    processing constructs. Instead, parallelism is implicit and automatically derived from operations on data

    types that are provided by the framework such as distributed matrices. These parallel operations in turn

    map to efficient implementations that in turn utilize existing high-performance libraries such as the PBLAS

    and ScaLAPACK libraries mentioned above.

    The rest of this document provides an overview of the programming and runtime execution models in

    Cloud Numericsand is intended to complement the quick start guide and the library API reference.

  • 8/12/2019 CNL2 WP Architecture

    5/20

    1.1 A Hello, World example in Cloud Numerics

    To briefly illustrate the parallel programming model in Cloud Numerics, the following C# sample code

    first loads a distributed matrix, computes its singular values in parallel and prints out its two-norm and

    condition number with respect to the two-norm:

    varA = Distributed.IO.Loader.LoadData(csvReader);varS = Decompositions.SvdValues(A);vars0 = ArrayMath.Max(S);vars1 = ArrayMath.Min(S);

    Console.WriteLine("Norm: {0}, Condition Number: {1}", s0, s0 / s1);

    1.2 What Cloud Numericsis not

    While the rest of this document provides more details on the programming and execution model in Cloud

    Numerics, it is important to briefly emphasize what it is not.

    First, the Cloud Numericsprogramming model is primarily based around distributed array operations(c.f. data parallelor SIMD-style of programming). Certain relational operations such as selects with user-

    defined functions or complex joins are simpler to express on top of languages such as Pig, Hive and

    SCOPE. Similarly, while Cloud Numericsis designed to deal with large data sets, it is currently

    constrained to operate on arrays that can fit in the main memory of a cluster. On the other hand, data on

    disk can be pre-processed via existing bigdata processing tools and ingested into a Cloud Numerics

    application for further processing.

    Second, Cloud Numericsis not just a convenient C# wrapper around message-passing libraries such as

    MPI, for example MPI.NET [3]; all aspects of parallelism are expressed via operations on distributed

    arrays and the Cloud Numericsruntime transparently handles the efficient execution of these high-level

    array operations on a cluster.

    A key aspect that distinguishes Cloud Numericsfrom parallelization techniques such as PLINQ and

    DryadLINQ [10], that are based on implementing a custom LINQ provider, is that parallelization in Cloud

    Numericsoccurs purely at runtime and does not involve any code generation from (say) LINQ expression

    trees; a usersapplication can be developed as a regular .NET application by referencing the Cloud

    Numericsruntime and library DLLs and executed on the cluster in Azure.

    Finally, the underlying communication layer in Cloud Numericsis built on top of the message passing

    interface (MPI) and does inherit some of the limitations in the underlying implementation such as:

    1. The process model is currently inelastic; once a Cloud Numericsapplication has been launched

    on (say)

    cores in a cluster, it is not possible to dynamically grow or shrink the resources as the

    application is running.

    2. The implementation is not resilient against hardware failure. Unlike frameworks like Hadoop that

    are designed explicitly to operate on unreliable hardware, if one or more nodes in a cluster fails, it

    is not possible for a Cloud Numericsapplication to automatically recover and continue

    executing.

    On the other hand, having MPI as the underlying communication layer in the Cloud Numericsruntime

    does endow it with certain advantages. For instance, Cloud Numericsapplications can automatically

    take advantage of high-speed interconnects such as Infiniband between nodes in a cluster and

  • 8/12/2019 CNL2 WP Architecture

    6/20

    optimizations such as zero-copy memory transfers and shared-memory-aware collectives within a single

    multi-core node. More importantly, array operators in Cloud Numericscan leverage the vast ecosystem

    of high-performance distributed memory numerical libraries such as ScaLAPACK built on top of MPI.

    1.3 What this document does and does not cover

    The intent of this document is to provide an overview of the programming model in Cloud Numerics

    along with a high-level description of how a Cloud Numericsapplication runs in parallel on a cluster of

    machines. It does not for example cover the extensive library of numerical functions in Cloud Numerics

    for operating on local and distributed arrays, the programming interfaces provided to import and export

    data in parallel and the details of how a Cloud Numericsapplication is deployed on a cluster on

    Windows Azure from the client and executed.

    1.4 High-level overview

    Figure 1 illustrates the main components in a distributed Cloud Numericsapplication. The end-user or

    clientapplication is itself written in a CLI (Common Language Infrastructure) language such as C#,

    VisualBasic.NET or F# against a set of .NET APIs provided by the Cloud Numericslibraries. Theseinterfaces primarily consist of a set of localand distributedhigh-performance array data types along with

    a large collection of array operations; there are no additional interfaces for parallel programming such as

    task creation and synchronization, resource management, etc. Once an application is running on a cluster

    of machines, the runtime automatically partitions the data between the nodes in the cluster and executes

    operations on partitioned data in parallel.

    Operations on distributed or partitioned arrays in turn depend upon a set of runtime services for handling

    distributed function execution, error handling and input/output operations. In addition, the Cloud

    Numericssystem provides a set of tools to simplify development, deployment and execution of a

    distributed application on Azure from a local client machine.

    Under the covers, the programming interfaces are implemented as thin wrappers around a native runtimeand library infrastructure. These consist of lower-level local and distributed array objects along with

    interfaces for performing operations on them. Several operations on distributed arrays leverage existing

    high-performance numerical libraries such as BLAS and LAPACK on a single CPU core and PBLAS,

    BLACS and ScaLAPACK across CPU cores. At the lowest level are inter-process communication libraries

    such as MPI and C and C++ runtime libraries for memory management and exception handling.

  • 8/12/2019 CNL2 WP Architecture

    7/20

    Figure 1. The components in a Cloud Numericsapplication.

    CloudNumerics Application

    CloudNumerics.NET Runtime

    (Local and

    distributed types,remote function

    invocation, I/O)

    CloudNumerics.NET Library

    (User-facing

    mathematical,

    statistical and

    signal processing

    functions)

    CloudNumerics

    Native Runtime

    CloudNumerics

    Native Libraries

    System Libraries

    (Process and Thread Management, Message Passing,

    Memory Management, Exception Handling, )

    Lower-level

    Numerical

    Libraries(BLAS, BLACS,

    PBLAS, LAPACK,

    ScaLAPACK, FFTW)

    CloudNumerics

    Azure Deployment

    Tools

  • 8/12/2019 CNL2 WP Architecture

    8/20

    2 The Cloud NumericsProgramming Model

    As mentioned previously, Cloud Numericsis ideally suited for algorithms that can be expressed in terms

    of high-level operations on multi-dimensional arrays.

    For instance, consider the power iterationalgorithm for computing the largest eigenvector of a matrix.

    Power iteration is at the heart of several fundamental graph-theoretic computations such as thePageRank for a collection of web pages. If the collection of pages is represented as a matrix where is non-zero when page is connected to page, the principal eigenvector can be computed roughly asfollows:

    Arrays such as the web-connectivity matrix described above are fundamental types in Cloud Numerics.

    Further, vector operations such as and array operations such as can beexpressed succinctly using the Cloud Numericslibrary and executed efficiently even when the vectors

    and matrices are very large and are therefore partitioned between several nodes in a cluster.

    On the flip side, the Cloud Numericsprogramming model may not be a good fit for algorithms that

    cannot be easily expressed in array notation (such as relational joins between two data sets) or for

    applications such as parsing and simplifying petabytes of text data where numerical computation is not

    the main bottleneck. However, once the data has been boiled-down and transformed to its core

    numerical structure, it can be further analyzed using Cloud Numerics.

    , , , T /* Start with an initial guess */

    /* Normalize it */

    /* Compute the next iteration */

    While > do /* Repeat the process until it converges */

  • 8/12/2019 CNL2 WP Architecture

    9/20

    2.1 Arrays in Cloud Numerics

    All array types in Cloud Numerics, be they dense or sparse, local or distributed implement the

    Microsoft.Numerics.IArrayinterface, where the generic parameter Trefers to the type of

    elements in the array. The IArrayinterface provides fundamental properties such as Shape,

    NumberOfDimensionsand NumberOfElements, indexers and basic methods such as Transpose,

    Reshapeand Flattenthat are supported by all array sub-types. Note that arrays in Cloud Numericsare

    always multi-dimensional (i.e., there is no notion of distinct Matrix or Vector types).

    2.1.1 Array element types

    Both local and distributed arrays support the following element types: boolean, one-byte, two-byte, four-

    byte and eight-byte signed and unsigned integers, single and double-precision real floating point numbers

    and single and double-precision complex floating point numbers. The corresponding .NET types are listed

    below:

    Array element type Description Range of values

    System.Boolean Single-byte logical values true/false

    System.UInt8 Single-byte unsigned integers ,System.Int8 Singled-byte signed integers ,System.UInt16 Two-byte unsigned integers , System.Int16 Two-byte signed integers , System.UInt32 Four-byte unsigned integers , System.Int32 Four-byte signed integers , System.UInt64 Eight-byte unsigned integers , System.Int64 Eight-byte signed integers , System.Single Single-precision IEEE 754 real floating

    point values,

    System.Double Double-precision IEEE 754 real floatingpoint values

    ,Microsoft.Numerics.Complex64 Single-precision complex valuesMicrosoft.Numerics.Complex128 Double-precision complex values

    2.1.2 Indexing and iteration

    An important interface that IArrayin turn implements is the .NET enumerable interface

    IEnumerable. Therefore, methods in .NET, for instance LINQ operators that accept IEnumerable

    can be passed instances of Cloud Numericsarrays.

    2.1.3 Array broadcasting

    In NumPy [2], broadcasting refers to two related concepts: (a) the application of a scalar function to every

    element of an array producing an array of the same shape and (b) the set of rules governing how two or

    more input arrays of different shapes are combined under an element-wise operation. In environmentssuch as NumPy and Cloud Numericsthat support a rich set of operations on multi-dimensional arrays,

    broadcasting is often used as a convenient mechanism for implementing certain forms of iteration.

    Several element-wise operators and functions in Cloud Numericssupport broadcasting using

    conventions similar to NumPy. For example, given a local or distributed array Aand a scalar s,

    Microsoft.Numerics.BasicMath.Sin(A)applies the sin function and A + sadds sto each element of

  • 8/12/2019 CNL2 WP Architecture

    10/20

    A. Similarly, for a vector vcontaining the same number of rows of A, A vsubtracts vfrom every column

    of A (in other words, Cloud Numericsbroadcaststhe vector valong the columnsof A)1.

    2.2 Local arrays in Cloud Numerics

    Before describing the programming model for distributed arrays, it is instructive to first understand the

    model for interacting with local arrays in Cloud Numericssince distributed arrays support practically the

    same semantics as local arrays except with different performance characteristics.

    The most important local array type in Cloud Numericsis

    Microsoft.Numerics.Local.NumericDenseArray which represents a dense multi-dimensional

    array with numerical values. As mentioned previously, dense arrays can have an arbitrary number of

    dimensions (so it is possible to create and operate on say, a 10-dimensional array just as easily as a two-

    or three-dimensional array).

    2.2.1 Underlying storage

    Dense local arrays in Cloud Numericsare not backed by arrays allocated in the .NET/CLR heap. This is

    primarily because arrays allocated in the CLR are currently constrained to be smaller than two Gigabyteswhereas the ones allocated on the native heap do not have this limitation. Additionally, the underlying

    native array types in the Cloud Numericsruntime are sufficiently flexible that they can be easily wrapped

    in environments other than .NET (say Python or R) with very little additional effort.

    Unlike arrays in .NET that are internally stored in row-major order, multi-dimensional arrays in Cloud

    Numericsare represented in column-major order that allows for efficient interoperability with high-

    performance numerical libraries such as BLAS and LAPACK.

    2.2.2 Array factory type

    The local NumericDenseArrayclass does not provide any public constructors; users are expected to

    create instances of local arrays using a factory class, NumericDenseArrayFactorythat provides anumber of static array creation methods. For example, to create a local array of double-precision values filled with zero, one would use:

    vardoubleArray = NumericDenseArrayFactory.Zeros(4, 10, 12);

    2.2.3 Interoperability with .NET arrays

    As mentioned above, arrays in Cloud Numericsare not backed by .NET arrays; instead, the

    NumericDenseArrayFactoryprovides a method to create a new Cloud Numericsarray from a .NET

    array. For example, to create a new local array from a .NET array, one would use:varaDotNetArray = newdouble[,]{{1.0, 2.0}, {3.0, 4.0}};varcnArray = NumericDenseArrayFactory .CreateFromSystemArray(aDotNetArray);

    1In the first version Cloud Numerics only supports broadcasting for scalars, but this deficiency will be

    addressed for the next release.

  • 8/12/2019 CNL2 WP Architecture

    11/20

    Conversely, the NumericDenseArrayclass provides two methods that return .NET arrays: one for

    returning a multi-dimensional .NET array of the same shape and with the same elements as the

    NumericDenseArrayinstance and other for returning a one-dimensional .NET array that contains the

    elements flattened in row-majororder. These are shown in the code snippets below:

    double[,] anotherDotNetArray = (double[,]) cnArray.ToSystemArray();

    double[] flatDotNetArray = cnArray.ToFlatSystemArray();

    2.2.4 Indexing and iteration with local arrays

    A local array in Cloud Numericscan be indexed by a scalar just like any other multi-dimensional array in

    .NET. Currently, indices are constrained to be scalars and the number of indices must be the same as the

    number of dimensions of the array as shown in the following examples:

    doubletwoDValue = twoDArray[3, 4];

    doublethreeDValue = threeDArray[2, 3, 4];

    Similarly, local arrays support iteration over the elements in flattened column-major order. For instance,

    one could compute the Frobenius norm of a matrix defined as: ( )

    using the following LINQ expression:

    varfroNorm = Math.Sqrt(doubleMatrix.Select((value) => value * value).Sum());

    2.2.5 Other operations

    In addition to the primitive operations shown in this section, local arrays in Cloud Numericsalso supportoperations such as element-wise arithmetic and logical operators, reshaping an existing array to a

    different shape while preserving the order of elements in memory, transposing an array and filling the

    elements of an array using a generator function. For example, given a general matrix, one coulddecompose it into a sum of a symmetric and a skew-symmetric matrix:

    using:

    varsym = 0.5 * (matrix + matrix.Transpose());varskewSym = 0.5 * (matrix - matrix.Transpose());

  • 8/12/2019 CNL2 WP Architecture

    12/20

    2.3 Distributed Arrays

    In Cloud Numerics, the end-user creates and interacts with a single instance of a distributed array.

    However, under the covers, the runtime creates several arrays on a number of locales2and partitions the

    global distributed array between them. Every aggregate operation on a global distributed array (say

    computing the sum of its columns or permuting its elements) is transparently transformed by the Cloud

    Numericsruntime into one or more parallel operations on these array partitions.

    2.3.1 Partitioning of arrays

    This section briefly summarizes the partitioning scheme for distributed arrays in Cloud Numerics. Note

    however that the Cloud Numericsruntime automatically handles the partitioning of arrays between

    several locales and the user normally does not need to specify any aspects of data distribution.

    A distributed array is constrained to be partitioned across the locales only along one of its dimensions.

    The dimension along with an array is partitioned is referred to as its distributeddimension. Each locale

    then gets a contiguous set of slices of the partitioned array along the distributed dimension; the set of

    indices along the distributed dimension corresponding to the slices assigned to a particular locale

    constitute the spanof the array on that locale. Finally, each locale always gets complete slices of thedistributed array and the slices between two different locales never overlap.

    The partitioning scheme described above is illustrated inFigure 2 for a matrix that is distributed across

    three locales, first by rows and second by columns.

    By default, a multidimensional array is always distributed along its last non-singleton dimension. For

    instance, a matrix is distributed by columns whereas a vector is distributed byrows.

    Figure 2. Row and column partitioned arrays.

    2.3.2 Propagation of distributions

    Generally, methods and static functions in Cloud Numericsthat return arrays produce distributed

    outputs given distributed inputs (the ToLocalArraymethod and functions that return scalars such as SUM

    are important exceptions to this rule). The precise manner in which the distribution propagates is as

    2A locale is defined in the section on the execution model, but it can be thought of as a process

    containing a portion of a distributed array.

    Span 0

    Span 1

    Span 2

    Span

    0

    Span

    1

    Span

    2

  • 8/12/2019 CNL2 WP Architecture

    13/20

    follows: if one or more input arguments to a function are arrays, the distributed dimension of the output is

    the maximum of the distributed dimension of all the inputs. If this distributed dimension is greater than the

    last non-singleton dimension of the output, the output is distributed along its last non-singleton dimension.

    2.4 Distributed array types

    All distributed arrays in Cloud Numericsimplement the

    Microsoft.Numerics.Distributed.IArrayinterface where Tis the type of elements in the

    distributed array. Besides the properties and methods that the base Microsoft.Numerics.IArray

    interface provides, the distributed array interface provides two important properties:

    DistributedDimensionand Spanas well as a Redistributemethod that redistributes the array along

    a different dimension.

    2.5 Working with distributed arrays

    The fundamental distributed array type in Cloud Numericsis

    Microsoft.Numerics.Distributed.NumericDenseArray. Unlike their local counterparts,

    distributed arrays provide explicit constructors. For example to create a new distributed double-precision

    dense array of shape one would use:vardArray = newDistributed.NumericDenseArray(100, 100, 500);

    2.5.1 Conversion between local and distributed arrays

    Local arrays are implicitly converted to distributed arrays by assignment. As a consequence any function

    that only takes a distributed array as an input can be passed a local array and the runtime will

    transparently promote the local array to be distributed.

    varlArray = Local.NumericDenseArrayFactory .Create(400, 400);

    Distributed.NumericDenseArray dArray = lArray;

    On the other hand, a distributed array cannot be implicitly converted to a local array since they are

    assumed to be much larger than what can fit into the memory of a local machine. Therefore, converting a

    distributed array to a local array requires the user to explicitly call the

    NumericDenseArray.ToLocalArray()method. In a similar fashion, a distributed array can be

    converted to a .NET array via the NumericDenseArray.ToSystemArraymethod.

    vardArray = newDistributed.NumericDenseArray(100, 100, 500);

    varlArray = dArray.ToLocalArray();

    varcsArray = dArray.ToSystemArray();

    2.5.2 Indexing distributed arrays

    Distributed arrays can be indexed by scalars just like their local counterparts. For example, the following

    snippet shows indexing into two- and three-dimensional arrays:

    doubletwoDValue = twoDDistArray[3, 4];

    doublethreeDValue = threeDDistArray[2, 3, 4];

  • 8/12/2019 CNL2 WP Architecture

    14/20

    Unlike local arrays, distributed arrays do not support iterating over the elements. This is primarily because

    there is no built-in facility in .NET for iterating over partitioned collections3in parallel.

    2.5.3 Other distributed array operations

    Distributed arrays support almost exactly the same set of operations that are supported by local arrays.

    For instance, the earlier example of decomposing an array into a symmetric and skew-symmetric matrix:

    varsym = 0.5 * (matrix + matrix.Transpose());

    varskewSym = 0.5 * (matrix - matrix.Transpose());

    works unmodified when the argument is a distributed matrix.

    2.6 Expressing parallelism in Cloud Numerics

    As mentioned several times in this document, the primary mechanism in Cloud Numericsfor expressing

    parallelism is via aggregate (data-parallel) operations on distributed arrays; Cloud Numericscurrently

    does not support other modes of parallelism such as task or control parallelism.

    As an example, the following code segment illustrates an alternate implementation of decomposing amatrix into its symmetric and skew-symmetric components:

    vardims = matrix.Shape.ToArray();varsym = newDistributed.NumericDenseArray(dims);varskewSym = newDistributed.NumericDenseArray(dims);for(longj = 0; j < dims[1]; j++)

    {

    for(longi = 0; i < dims[0]; i++){

    sym[i, j] = 0.5 * (matrix[i, j] + matrix[j, i]);sym[i, j] = 0.5 * (matrix[i, j] - matrix[j, i]);

    }

    }

    While this code is syntactically correct and will produce the correct answer for a distributed array, it will

    not execute in parallel due to its use of the C# (sequential) loop construct. Further, the scalar indexing

    operations (matrix[i, j], sym[i, j]and sym[i, j]) are highly sub-optimal for distributed arrays

    since each scalar access generates communication steps where is the number of locales. Theapproach based on array operators such as 0.5 * (matrix + matrix.Transpose())besides being

    more concise is also more efficient, possibly by several orders of magnitude.

    It is therefore always advantageous in Cloud Numericsapplications to try and translate fine-grained

    array accesses into coarse-grained aggregate array operations using mechanisms such as slicing and

    broadcasting.

    3This could potentially be done by implementing a custom LINQ provider.

  • 8/12/2019 CNL2 WP Architecture

    15/20

    3 The Cloud NumericsRuntime Execution Model

    Having described the programming model in Cloud Numericsbased around operations on local and

    distributed arrays, this section briefly explains how a Cloud Numericsapplication is developed and

    executed on a cluster. The intent of this section is to once again only provide a high-level overview of the

    execution model; the Cloud NumericsQuick Start Guide provides a detailed walkthrough of developing

    an application and deploying on to a cluster in Azure.

    3.1 Application development, debugging and deployment

    A Cloud Numericsapplication is initially developed and debugged on a single workstation using the

    Cloud Numerics.NET API. The user can then choose to run the application locally on the workstation

    either in sequential or parallel modes. When an application is run in the sequential mode, it behaves as

    any other .NET application and for instance, can be debugged and profiled inside the Visual Studio IDE.

    Once an application is developed and debugged on a single workstation, it can be deployed and

    executed on a Windows HPC (High-Performance Computing) cluster on Windows Azure using the Cloud

    Numericsdeployment wizard.

    3.2 Process model

    Figure 3 illustrates the process model in a Cloud Numericsapplication when executed on a cluster on

    Azure. Briefly, when a Cloud Numericsapplication is, it starts up a given number, say copies of theCloud Numericsruntime. These runtime instances are referred to as locales. InFigure 3,each locale is

    shown on a different physical node, but in reality, the mapping of locales to actual hardware is somewhat

    arbitrary. For example, on a multi-core node or virtual machine, there can be as many locales as cores in

    the node. Normally, a user only specifies the number of nodes on which the application must be

    executed; the HPC Scheduler takes of allocating as many physical cores as required.

    The notion of a locale in Cloud Numericsis intimately tied to the notion of an operating systemprocess;each locale in fact runs on a separate operating system process even within a single physical node. For

    this reason, the terms locale and process are interchangeably in the rest of this document.

    Of the locales created, one is designated to be the masterand the other are designated to beworkerlocales. The users application only runs on the master process (in other words, a parallel

    application in Cloud Numericshas a single thread of control) and the remaining processes wait inan infinite loop for notification from the master to do work on their portions of a distributed array. A single

    thread of control from the users perspective is an important aspect that makes parallel programming

    convenient in Cloud Numerics; the user does not have to explicitly write or manage the execution of any

    parallel constructs (such as launching tasks or passing messages between instances) and therefore it is

    not possible to programmatically run into pitfalls such as starvation or deadlock that are present in other

    parallel programming models.

  • 8/12/2019 CNL2 WP Architecture

    16/20

    Figure 3. A distributed Cloud Numericsapplication executing on a master and three worker

    processes. The processes are interconnected in a topology referred to as a hypercube.

    Worker 1

    User Application,

    CloudNumerics Runtime

    CloudNumerics Libraries

    System Libraries

    Master

    User Application,

    CloudNumerics Runtime

    CloudNumerics Libraries

    System Libraries

    Worker 2

    User Application,

    CloudNumerics Runtime

    CloudNumerics Libraries

    System Libraries

    Worker 3

    User Application,

    CloudNumerics Runtime

    CloudNumerics Libraries

    System Libraries

    Interprocess Comm

    Interprocess Comm

  • 8/12/2019 CNL2 WP Architecture

    17/20

    Figures 4 through 6 illustrate the steps involved in performing a distributed operation such as computing

    the element-wise sum of two partitioned arrays and.First, the users application dispatches the operation to the master process via the overloaded addition

    operator.

    Figure 4. First step: the user invokes a method on a distributed array on the master process.

    Then, the master process broadcasts the operation to all workers. The payload consists of both the

    operation to be performed as well as references to the distributed input arguments.

    Figure 5. Second step: The command is broadcast from the master to all workers.

    A+B

    Master

    Master

    C=Add(A,B)

    C=Add(A,B)

    Worker 0

    Worker 1Worker 2

    C=Add(A,B)

  • 8/12/2019 CNL2 WP Architecture

    18/20

    Next, the master and workers perform a part of the underlying computation in parallel. Often, but not

    always this involves communicating with other locales to exchange data. For example, in the case of

    element-wise addition, if the two arguments are partitioned such that each locale has all the data

    necessary to compute its portion of the output, the operation is done element-wise without any

    communication. On the other hand if say the first argument is partitioned by rows and the second is

    partitioned by columns, the master and workers first collectively exchange pieces of the first matrix in

    order to compute the result.

    Figure 6. The master and workers work on a part of the computation in parallel. In is particularexample, the computation can be performed without requiring any communication.

    Once each worker is finished with its portion of the computation, it sends back a successful completion

    flag to the master. On the other hand, if it fails (say because of an exception), it sends back a specific

    failure code along with a serialized exception object describing the failure.

    Figure 7. Once the master and workers are done with the computation, the master collects anyerrors during computation. In this particular case, all work is completed successfully.

    Worker 0

    Worker 1Worker 2

    Master

    C = Add(A,B)C = Add(A,B)

    C = Add(A,B)C = Add(A,B)

    Master Worker 0

    Worker 1Worker 2

    OK

    OK

    OK

  • 8/12/2019 CNL2 WP Architecture

    19/20

    The master process waits for an acknowledgement from each of the workers. If all processes (including it)

    completed successfully, the master returns a reference to the result back to the users application. On the

    other hand if one or more workers encountered an error, the master process rethrows the exception in the

    payload into the application where it can be handled like any other exception in .NET.

    Figure 8. The master then returns the result of the computation to the user as a distributed result.

    4 Conclusions

    The programming model in Cloud Numerics, based around operations on distributed arrays, simplifiesthe implementation of algorithms that are most naturally expressed in matrix notation and introduces no

    additional parallel programming or synchronization constructs. Further, the model, based on a single-

    thread of execution, greatly simplifies application development and debugging and reduces the possibility

    of conditions such as races and deadlocks due to bugs in the usersapplication. The Cloud Numerics

    distributed runtime transparently executes operations on distributed arrays in parallel on nodes in a

    cluster, leveraging existing mature high-performance libraries when possible.

    Although the general principles behind the design and implementation of Cloud Numericsare based on

    well-established fundamentals in high-performance scientific computing, it is still a new and rapidly-

    evolving framework. It is anticipated that future versions of Cloud Numericswill support:

    interoperability with Map/Reduce execution frameworks

    a richer set of functionality on distributed arrays

    operations on distributed sparse matrices

    better performance on existing functions

    more user-friendly application deployment wizard

    and most importantly, additional features based on end-user feedback.

    Master

    C

  • 8/12/2019 CNL2 WP Architecture

    20/20

    5 References

    1. Blackford, L.S, Choi, J., Cleary, A., et al. ScaLAPACK UsersGuide, SIAM, Philadelphia, PA, 1997.

    2. Dean, J. and Ghemawat, S., MapReduce: Simplified data processing on large clusters, Symposium on

    Operating System Design and Implementation (OSDI), 2004.

    3. Gregor, D. and Lumsdaine, A. Design and implementation of a high-performance MPI for C# and the

    Common Language Infrastructure,ACM Principles and Practice of Parallel Programming, pages 133-142,February 2008.

    4. Lin, J. and Dyer, C., Data-intensive Text Processing with MapReduce, Morgan-Claypool, 2010.

    5. Kang, U., Tsourakakis, C. E. and Faloutsos, C. PEGASUS: A Peta-Scale Graph Mining System -

    Implementation and Observations, IEEE Conference on Data Mining, 2009.

    6. Oliphant, T., A Guide to NumPy, Trelgol Publishing, 2005.

    7. Owen, S., Anil, R. and Dunning, T. Mahout in Action, Manning Press, 2011.

    8. Snir, M. and Gropp, W., MPI The complete reference, MIT Press, Cambridge, MA, 1998.

    9. White, T., Hadoop: The Definitive Guide, 2e, OReilly, Sebastopol, CA, 2008.

    10. Yu, Y., Isard M., Fetterly, D., et al. DryadLINQ: A System for General-Purpose Distributed Data-Parallel

    Computing Using a High-Level Language. Symposium on Operating System Design and Implementation

    (OSDI), San Diego, CA, 2008.