9
Advanced Review CXXR: an extensible R interpreter Andrew Runnalls This paper describes CXXR, a project to refactor the R interpreter from C into C++, with a view to making the internals of the interpreter more readily accessible to researchers and developers. The continued growth of CRAN is a testament to the increasing number of developers engaged in R development. But far fewer researchers have experimented with the R interpreter itself. The code of the interpreter, written for the most part in C, is structured in a way that will be foreign to students brought up with object-oriented programming, and the available documentation, though giving a general understanding of how the interpreter works, does not really enable a newcomer to start modifying the code with any confidence. The CXXR project is progressively refactoring the interpreter into C++, whilst all the time preserving existing functionality. By restructuring the code into tightly encapsulated and carefully documented classes, CXXR aims to open up the interpreter to more ready experimentation by statistical computing researchers. This paper focuses on the example task of providing, as a package, a new type of data vector, and shows how CXXR greatly facilitates such a task by internal changes to the structure of the interpreter, and by offering a higher-level interface for packages to exploit. © 2013 Wiley Periodicals, Inc. How to cite this article: WIREs Comput Stat 2013, 5:181–189. doi: 10.1002/wics.1251 Keywords: R; CXXR; serialization CXXR T he object of the CXXR project (http:// www.cs.kent.ac.uk/projects/cxxr) is gradually to reengineer the fundamental parts of the R interpreter from C into C++ with the intention that: Full functionality of the standard distribution of R (including the recommended packages) is preserved; The behavior of R code is unaffected (unless it probes into the interpreter internals); There is no change to the existing interfaces for calling out from R to other languages (.C, .Fortran, .Call and .External). Likewise there is no change to the main application program interfaces (APIs) (R.h and Correspondence to: [email protected] School of Computing, University of Kent, Canterbury, Kent, UK S.h) for calling into R. However, a broader API is made available to external C++ code. Work on CXXR started in May 2007, at that time shadowing R-2.5.1. Subsequently it has been upgraded to parallel at least every minor release of R, usually synching on the 0.1 maintenance release; the current CXXR release (as of December 2012) shadows R-2.15.1. a Recent changes entailed by this upgrade process include the deployment of the bytecode compiler. An earlier description of the project appeared in Ref 1. In the interests of portability, CXXR endeavors to adhere to the 2003 standard for the C++ language ISO14882:2003 (which in its turn is almost identical to the original 1998 standard), rather than to the latest 2011 standard. CXXR does however make use of some of the library functions that now form part of the 2011 standard, but only in cases where such functions are readily available from free and open source repositories, in particular the Boost libraries Volume 5, May/June 2013 © 2013 Wiley Periodicals, Inc. 181

CXXR: an extensible R interpreter

  • Upload
    andrew

  • View
    217

  • Download
    1

Embed Size (px)

Citation preview

Page 1: CXXR: an extensible R interpreter

Advanced Review

CXXR: an extensible R interpreterAndrew Runnalls∗

This paper describes CXXR, a project to refactor the R interpreter from C into C++,with a view to making the internals of the interpreter more readily accessible toresearchers and developers.

The continued growth of CRAN is a testament to the increasing numberof developers engaged in R development. But far fewer researchers haveexperimented with the R interpreter itself. The code of the interpreter, written forthe most part in C, is structured in a way that will be foreign to students broughtup with object-oriented programming, and the available documentation, thoughgiving a general understanding of how the interpreter works, does not really enablea newcomer to start modifying the code with any confidence. The CXXR project isprogressively refactoring the interpreter into C++, whilst all the time preservingexisting functionality. By restructuring the code into tightly encapsulated andcarefully documented classes, CXXR aims to open up the interpreter to more readyexperimentation by statistical computing researchers.

This paper focuses on the example task of providing, as a package, a newtype of data vector, and shows how CXXR greatly facilitates such a task by internalchanges to the structure of the interpreter, and by offering a higher-level interfacefor packages to exploit. © 2013 Wiley Periodicals, Inc.

How to cite this article:WIREs Comput Stat 2013, 5:181–189. doi: 10.1002/wics.1251

Keywords: R; CXXR; serialization

CXXR

The object of the CXXR project (http://www.cs.kent.ac.uk/projects/cxxr) is gradually to

reengineer the fundamental parts of the R interpreterfrom C into C++ with the intention that:

• Full functionality of the standard distributionof R (including the recommended packages) ispreserved;

• The behavior of R code is unaffected (unless itprobes into the interpreter internals);

• There is no change to the existing interfacesfor calling out from R to other languages (.C,.Fortran, .Call and .External).

• Likewise there is no change to the mainapplication program interfaces (APIs) (R.h and

∗Correspondence to: [email protected]

School of Computing, University of Kent, Canterbury, Kent, UK

S.h) for calling into R. However, a broader APIis made available to external C++ code.

Work on CXXR started in May 2007, at thattime shadowing R-2.5.1. Subsequently it has beenupgraded to parallel at least every minor releaseof R, usually synching on the 0.1 maintenancerelease; the current CXXR release (as of December2012) shadows R-2.15.1.a Recent changes entailedby this upgrade process include the deployment ofthe bytecode compiler. An earlier description of theproject appeared in Ref 1.

In the interests of portability, CXXR endeavorsto adhere to the 2003 standard for the C++ languageISO14882:2003 (which in its turn is almost identicalto the original 1998 standard), rather than to thelatest 2011 standard. CXXR does however make useof some of the library functions that now form partof the 2011 standard, but only in cases where suchfunctions are readily available from free and opensource repositories, in particular the Boost libraries

Volume 5, May/June 2013 © 2013 Wiley Per iodica ls, Inc. 181

Page 2: CXXR: an extensible R interpreter

Advanced Review wires.wiley.com/compstats

of peer-reviewed, portable C++.2 Until now CXXRhas been built and tested only on various Linuxdistributions and (to a lesser extent) on MacOS X,but a port to Microsoft Windows is scheduled.

This paper will refer to the standard R interpreteras CR, in contradistinction to CXXR. Bare R will beused to refer to the language and user interface thatboth these interpreters implement.

The original motivation for embarking on theCXXR project was to introduce provenance trackingfacilities into R, so that for any R data object itwill be possible to determine exactly which originaldata sources it was derived from, and exactly whichsequence of operations was used to produce it;progress on this front is described elsewhere.3,4 ButCXXR has a broader mission, which is to makethe R interpreter more accessible to developers andresearchers, by various means. For example, the C++classes and other code structures within CXXR arecarefully documented. Second, CXXR tightens up theencapsulation boundaries within the interpreter, sothat a developer can be more confident that modifyingcode in one place will not unexpectedly break thingselsewhere. This is part-and-parcel of moving towardan object-oriented software structure, something thatgraduate students are familiar with, and which the Rlanguage itself has moved toward. Finally CXXR aimsto express the internal algorithms of the interpreter ata higher level of abstraction, and make them availableto external code through CXXR’s own API, and weshall see some examples of this in this article.

The objectives of CXXR are therefore somewhatdifferent from that of the powerful Rcpp package,5

which provides extensive support for developerswishing to write R packages in C++ for integrationwith the standard (CR) interpreter.b CXXR, incontrast, is progressively refactoring the interpreteritself into C++, and in the process opening it upto more ready experimentation by developers. A sideeffect of this is to render some—but by no meansall—of the facilities of the Rcpp package redundant.

LayersCXXR code falls into three categories, which can beconsidered roughly to form three concentric layers.At the center is the CXXR core, which is written asfar as possible in idiomatic C++. Everything in thecore layer is placed in the C++ namespace CXXR. Thefollowing is a summary of the functionality currentlywithin the core:

• Memory allocation and garbage collection.

• SEXPREC union replaced by an extensible classhierarchy rooted at class RObject (describedbelow).

• Environments (i.e., variable to object mappings),with hooks to support provenance tracking.

• Expression evaluation. (S3 method dispatchpartially refactored; S4 dispatch not yetrefactored.)

• Handling of R errors and indirect flows ofcontrol in R, e.g. return and break, havebeen reengineered to eliminate the use of Csetjmp/longjmp, which are incompatible withC++. (Some loose ends remain.)

• Unary and binary function generics. [-sub-scripting.

• Object duplication is now handled by C++ copyconstructors.

On the outside is the packages and moduleslayer. Packages and modules (including graphicsfacilities) which form part of the standard CRdistribution are adapted and tested as an integralpart of the CXXR development process. A projectobjective is that existing R packages should workwith CXXR with minimal alteration—usually noneat all. Runnalls6 explored the extent to which thishad been achieved by testing CXXR with 50 keypackages (other than those forming part of thestandard distribution) from the central R archiveCRAN, namely the packages on which most otherpackages in the archive depend. The upshot was that,apart from fixing some latent bugs in the packages(and bugs in CXXR itself), only three lines of packagecode needed to be modified for all the tests included inthese 50 packages to pass. All these changes were inthe C code which some packages include; in no casedid a package’s R code need to be changed.

In between the core and the packages layer isthe transition layer, which consists of C files fromthe CR interpreter adapted to work with the CXXRcore. In almost all cases these C source files have beenredesignated as C++, but the programming idiomsare largely those of CR (which in addition to C idiomsfrequently exhibits those of Fortran and especiallyLISP).

It will be noted that the refactorization processwithin CXXR has focused on the fundamentalworkings of the interpreter. There is much else withinR that could profit from reworking in an object-oriented language, for example graphics facilities andthe implementation of R connections, but it has notthus far been possible to address these with theavailable effort.

182 © 2013 Wiley Per iodica ls, Inc. Volume 5, May/June 2013

Page 3: CXXR: an extensible R interpreter

WIREs Computational Statistics CXXR: an extensible R interpreter

The RObject Class HierarchyOne of the major changes in CXXR is the way thatobjects visible in R are implemented. In CR, theseobjects are implemented using a C union, which canbe thought of as a way of telling the C compiler thata particular memory address may hold any one ofseveral specified datatypes: in fact this union (namedSEXPREC) comprises no fewer than 23 distinct types,corresponding to the different types of R object. Allof these types have in common an integer-value fieldknown as the SEXPTYPE which is used at runtime todetermine which type of object it is; the various valuesof the SEXPTYPE are given mnemonic names such asINTSXP to identify an integer vector, or BUILTINSXPto identify an R function implemented directly withinthe interpreter.

This approach has several disadvantages. First ofall, the compiler does not know which of the 23 typesis occupying a particular address: this is determinedonly at runtime. Consequently all type checking of Robjects must be done at runtime, by examining theSEXPTYPE field: the implementation simply does notleverage the compiler’s own considerable capabilitiesfor type checking.c This type ambiguity can also makedebugging at the C level difficult.

Another snag is that this approach in effectturns the set of R data types into a closed list. If youwanted to add a new type of R object, that wouldmean delving into the heart of the interpreter andmaking fundamental modifications. This is somethingwhich package developers have rightly been reluctantto do, because the interpreter thus modified wouldneed to be maintained alongside the standard Rdistribution. The adverse effect of having a closedset of data types can be gauged by studying thecode of the powerful Matrix package7: the codeis constantly having to convert matrices from aformat representable using CR data types into thedata structures required by third-party libraries suchas LAPACK, and then—having applied a functionfrom a third-party library—to convert the result backinto a CR-compatible format. As we shall now see,CXXR allows this to be short-circuited, by allowingany desired C/C++ data structure to be incorporateddirectly into an R object.d

In CXXR, the different types of R object areimplemented as a C++ class inheritance hierarchy,rooted at an abstract class unimaginatively namedRObject, and shown in Figure 1. Readers familiarwith R will have no difficulty mapping mostd of theC++ class names in this figure onto the correspondingtype of object in the R language.f A particular pointto note is that all of R’s built-in vector types areimplemented using a single C++ class template called

FixedVector: this includes vectors of integers,vectors of reals, vectors of complex numbers, vectorsof strings (R’s character type), as well as R lists.

Along with this change to a C++ class hierarchy,CXXR is carrying out a wholesale restructuringof the interpreter code. First of all, the project isendeavoring gradually to move all code relating toa particular data type into one place, and then touse C++’s public/protected/private mechanisms toconceal implementational details and to defend classinvariants. Second, CXXR uses C++ templates toexpress key algorithms at a higher level of abstraction.

Most important of all, CXXR aims to make iteasy for developers to extend the class hierarchy.

EXAMPLE: MYGMP

We will now give an extended example illustrating theease with which CXXR enables new sorts of objectsto be introduced into R at the C++ code level, byextending the RObject class hierarchy.

R of course has the ability to handle vectorsof integers, but the range of values that can berepresented by these built-in integers is limited bythe word-length of the computer; in fact currently R’sbuilt-in integers are limited to 32 bits, even on a 64-bitarchitecture.

Suppose a developer wanted to write a package,which we shall call MyGMP, adding to R thecapability of handling arbitrarily large integers. Thisdeveloper is aware that there is a free GNU library,the GNU MP library8 that provides ‘bigints’, as theyare called, for C and C++, and would like to buildon that.

Some readers will be aware that there alreadyis such an R package in CRAN, the gmp package,9

based on the GNU MP library, which offers bigintsand much else besides. The rudimentary packagedescribed in this paper does not in any way match thecapabilities of gmp: its purpose is simply to illustratehow relatively easy it is to get such a package off theground using CXXR.g

The GNU MP library defines a C++ classmpz_class to represent an arbitrarily large integer.But one of the attractive features of R for statisticalanalysis is that it can flag individual data points asbeing ‘not available’, represented NA in R. As itstands mpz_class doesn’t have this capability: it canrepresent positive integers, negative integers, zero, butnot NA.

Fortunately, in CXXR we can put this rightessentially in one line of C++ code, using the classtemplate NAAugment, which does what the namesuggests:

Volume 5, May/June 2013 © 2013 Wiley Per iodica ls, Inc. 183

Page 4: CXXR: an extensible R interpreter

Advanced Review wires.wiley.com/compstats

FIGURE 1 | The RObject class hierarchy.

This type definition gives us a new C++ classBigInt which can represent an arbitrarily large integeror NA, and this is all set up in such a way that thegeneric algorithms in CXXR can handle NAs withlittle or no attention from the package writer.

Each object of the new class BigInt representsa single bigint. But of course R works primarilywith data vectors—or matrices or higher dimensionalarrays. This is not a problem: we can introduce vectorsof BigInts into CXXR essentially with one furtherline of code, which reinstantiates the same C++ classtemplate FixedVector that is used for the built-invector types:

The FixedVector class template takes threetemplate parameters. The first is the type of elementthe vector is to contain, here BigInt. The second isthe SEXPTYPE to be ascribed to objects of this type.CXXSXP is a special SEXPTYPE value used in CXXRfor any type of RObject that does not correspondto a CR object type: the intention is that when codeinherited from CR encounters this SEXPTYPE value,it will normally raise an error, unless the code hasbeen more specifically adapted to handle the case.The third template parameter must be the name of an‘initializer’ class, that is to say a class which defines astatic initialize() function which is to be used toinitialize newly-created objects. In the present case thisis the class ApplyBigIntClass, defined as follows:

184 © 2013 Wiley Per iodica ls, Inc. Volume 5, May/June 2013

Page 5: CXXR: an extensible R interpreter

WIREs Computational Statistics CXXR: an extensible R interpreter

The effect of this is to ensure that newly createdBigIntVector objects are given the R class attributeBigInt.

With the few lines of code listed above,BigInt Vectors have now joined the RObjectclass hierarchy. We can now assign BigIntVectorsto R variables, and facilities such as garbage collection,and the dimensioning of matrices and arrays—all theseare automatically in place.

Contrast this with the approach taken bythe CRAN gmp package, working as it mustwith CR’s limited range of object types. In gmp,a vector of bigints is held as an R object bypacking it into an array of char, using CR’sRAWSXP object type. Every operation on vectorsof bigints requires the gmp package to unpackeach operand into a (C++) vector of mpz_classobjects, perform the required operation using theGNU MP library, and then pack the result backinto a RAWSXP object. Moreover, because bigints ofdifferent magnitudes will occupy different numbersof bytes in the packed array, it is necessary tounpack the entire vector even to access a singleelement.

Binary FunctionsObviously the MyGMP package needs to have the capa-bility to carry out arithmetic on BigIntVectors:multiplication for example. Let us first review howbinary operations such as multiplication work forR vectors. Basically each element of the result isdetermined by multiplying together the correspond-ing elements of the two operands, so for example thefirst element of the result is the product of the firstelements of the two operands.

But there are some complications. For example,if either of the operand elements is NA, then normallythe corresponding result element must be set to NA.If the operands are of unequal length, the elementsof the shorter operand are reused in rotation. But Rgives a warning if the longer operand is not an exactmultiple of the length of the shorter operand. It is alsonecessary to consider attributes: for example in R theuser can give names to individual elements of vectors,or to the rows and columns of matrices. Somehowthe attributes of the result of a binary operation mustbe inferred from the attributes of the operands. Thereare some further complications if the operands arematrices or arrays.

CXXR defines a generic algorithm for imple-menting R binary functions, and makes it availableto package C++ code via the CXXR API. To usethe algorithm the programmer needs to specify fourthings:

• The elementwise operation that is to beperformed. In the case of multiplication thisoperation is given to us directly by theGNU MP library, which overloads the C++multiplication operator ‘*’ for operands ofmpz_class. This appears in the code belowas the template parameter to the class templateBinaryFunction.

• The two operands (in the present caseBigIntVectors).

• The type of vector (or other vector-likecontainer) to be produced as the result: inthe present case again a BigIntVector.This appears in the code below as thetemplate parameter to the templated functionBinaryFunction::apply().

• Finally, it is necessary to specify how theattributes of the result are inferred from thoseof the operands, and this is determined byan optional second template parameter to theBinaryFunction class template. Fortunately,this is better standardized in R for binaryoperations than it is for unary operations, sousually—as here—this parameter can be left totake its default value.

Using this generic algorithm, introducing multi-plication of BigIntVectors to the MyGMP packagerequires just the following C++ code:

which is invoked via a simple R function in thepackage:

(The function as.bigint invoked here needsto address a number of cases according to thetype of its argument, and is consequently too longto include its implementation here. However, thefunction print.BigInt discussed below gives theflavour of what is involved.)

All in all, the developer does not need to do muchprogramming at all in the MyGMP package before

Volume 5, May/June 2013 © 2013 Wiley Per iodica ls, Inc. 185

Page 6: CXXR: an extensible R interpreter

Advanced Review wires.wiley.com/compstats

it becomes possible, for example, to compute largefactorials in R. Here is an example session:

This example illustrates the fact that NAspropagate as expected without needing specialconsideration within the package code.

Printing of the vector f in the above exampleimplicitly invokes the S3 method print.BigInt,which works by converting bigints into decimal char-acter strings using the function BigInt2String()provided by the GNU MP package. This function isvectorized in the package’s C++ code by a simplewrapper as follows:

which illustrates how unary functions are handledin a similar way to binary functions. Using this,print.BigInt can be defined simply as follows:

SubscriptingThe factorial example above included the useof square brackets to perform subscripting on aBigIntVector, and this capability is something thatneeds to be provided explicitly by the MyGMP package.

R is rightly renowned for the power of itssubscripting operations: subscript expressions can beused to access or replace elements of vectors, rowsand/or columns of matrices, and slices of higherdimensional arrays. To give a few examples: a rowof a matrix (for example) can be specified eitherby its position, or by its name (if it has one).Negated row numbers signify ‘include all rows exceptthese’. Odd numbered rows of a matrix m can beselected by specifying a logical expression as theindex: m[c(TRUE, FALSE), ], the elements of the

‘logical’ vector c(TRUE, FALSE) being recycled asnecessary up to the total number of rows in m.

In fact the R Language Definition documentdevotes over four of its 51 pages to describing R’ssubscripting facilities, and even that glosses over someedge cases. In the CR interpreter there are some 2000C language statements implementing these facilities.But in a sense this C code is ‘locked up’: it is not madeavailable to external code via a documented API, andit is written exclusively to handle CR’s built-in datatypes. Consequently, as it stands, we cannot directlyexploit this code to bring subscripting facilities toBigIntVectors.

CXXR takes a more extensible approach. Itmakes subscripting facilities available through its APIusing a class with the obvious name: Subscripting.This class works using C++ generic algorithms,which abstract away from the type of the elementsof the vector (or matrix or higher-dimensional array).Consequently the algorithms are not restricted toR’s built-in data types; in particular they can handleBigInt elements without problem.

But not only do the algorithms abstract awayfrom the type of elements of the vector, they alsoabstract away from the particular data structure usedto implement the vector. So this opens the way to usingthe algorithms with packed data: perhaps a developerimplements a new vector object type in which 32 DNAbases are packed into a 64-bit word. Or a developermay introduce into the RObject class hierarchy anew vector implementation for large datasets in whichmost of the data are held on disk, after the fashion ofthe ff package.11 In either case the developer can useCXXR’s subscripting algorithms to index into thesenew vector types.

As far as BigIntVectors are concerned, usingCXXR’s Subscripting class means that we cancarry across the full power of R subscripting withjust a few lines of package code. Here for exampleis the package C++ code needed to implementsubassignment, i.e. the replacement operation thatoccurs when square brackets appear on the left-handside of an assignment:

186 © 2013 Wiley Per iodica ls, Inc. Volume 5, May/June 2013

Page 7: CXXR: an extensible R interpreter

WIREs Computational Statistics CXXR: an extensible R interpreter

which is invoked via the following R function in thepackage:

In the C++ code above, the templated functionSEXP downcast() is used to cast a pointer to sometype in the RObject class hierarchy to a pointerto a type further down the hierarchy (i.e. further tothe right in Figure 1). According to the configurationoptions with which CXXR is built, this cast maybe either checked (using a C++ dynamic-cast) orunchecked (static-cast).

SerializationFor our class MyGMP to be of full value, it is necessarythat objects of this class can be saved at the end ofan R session alongside other types of R object, andrestored at the start of the next session. In other wordsthis class must participate in session serialization anddeserialization.

The current CR code for serialization is notreadily extensible. It is built around the limitednumber of datatypes comprised by SEXPREC union,and the serialization code is concentrated into a fewhuge functions. Moreover, the serialization format isinherently binary in nature, which makes debuggingvery difficult: it is difficult even to determine whethera bug is happening during serialization or duringdeserialization.

CXXR now has the capability to save andrestore user sessions using an XML serializationformat. This capability builds upon the facilities ofthe ingenious Boost C++ Serialization Library.12

This library handles well the task of serializing adirected graph in which the nodes are heap-borneC++ objects of various types, and the edges arepointers. Each node in the graph will be serialized onlyonce, even if it is pointed to by several pointers, andon deserialization all pointers will be set up correctly,even though the graph nodes will now be at differentmemory addresses. The library also takes care ofthe possibility that a pointer of type T*, say, mayactually point to an object of some type inheritingfrom T.

All of this is achieved using C++ templates, andin a way which delegates to individual classes thedetails of how objects of that class will be serialized.So alongside the code defining the functionality ofclass MyGMP, some of which we reviewed above, wecan place code defining how BigIntVector objectsare serialized. In fact the following code suffices:

To understand this code, recall that BigIntVector is typedefed to an instantiation of theclass template FixedVector. This class templatealready defines a (templated) member functionserialize() which defines how FixedVectorobjects should be serialized using the facilities ofboost::serialization. This function delegatesthe task of serializing the individual elements ofthe FixedVector according to the type of theseelements. In the present case (BigIntVector) thetype of the elements is of course BigInt, andas we saw earlier BigInt is typedefed to aninstantiation of the class template NAAugment<T>.This class template in its turn defines a memberfunction serialize() defining how NAAugmentobjects should be serialized, and this in its turndelegates the task of serializing its payload, accordingto the type T of that payload.

In the present case the payload type ismpz_class, so all that remains is to specifyhow objects of mpz_class should be serialized.This is a class from a third-party library whichdoes not define a serialize() member function.That is not a problem, however, since theboost::serialization library will also accepta free-standing serialize() function as defined inthe code above.

This serialize() function calls the (inlined)library function split_free(), which simplyforwards the call either to save(), if the templateparameter Archive is an output archive type (i.e. forserialization), or to load() if Archive is an inputarchive type (deserialization). The save() functionis passed a reference to the mpz_class object tobe serialized, and represents this object as a decimal

Volume 5, May/June 2013 © 2013 Wiley Per iodica ls, Inc. 187

Page 8: CXXR: an extensible R interpreter

Advanced Review wires.wiley.com/compstats

string within the output archive; load() performsthe converse operation.

The code excerpt above calls the macroBOOST_SERIALIZATION_NVP, which is provided aspart of the serialization library and handles ‘name-value pairs’; this call causes the string d to beserialized within an XML element with name ‘‘d’’.The following snippet shows how a single element ofa BigIntVector is represented within an archive:

The reader may be puzzled by the unused parameter‘version’ of the functions above. This reflectsanother feature of the Boost serialization librarythat is very valuable within CXXR, namely that theserialization format is versioned on a class by classbasis. That is to say, whenever an archive includesthe serialized form of one or more objects of aparticular class Cls, the archive records which versionof serialization for this class was used: by default thisis Version 0. If in the course of time, developersdecide that a new serialization format is necessaryfor this class, perhaps because a new data memberhas been added to Cls, then they can change theoutput serialization code accordingly, and designatethe resulting format as version 1. They will also needto amend the code for input archives so that it canhandle both archives in the new Version 1 and inthe superseded Version 0 format. But note that thesechanges affect only the code associated with class Cls:there is no need for any changes to the higher-levelserialization code, nor in the ‘magic numbers’ at thestart of an archive file.

Once development of the new serializationapproach is complete, it is intended that CXXR willswitch to a serialization format based on compressedXML as its default serialization format; however,CXXR will of course retain the ability to load sessionsserialized in CR format, and to save CR-compatibledata in CR format. Among the issues that remainto be addressed is the fact that, for example, theMyGMP package—and in particular its dynamically-loaded C++ library—must already have been loadedbefore any attempt is made to deserialize an archivecontaining the serialized forms of MyGMP objects.(Ideally it would also be possible to load the non-MyGMP-dependent content from an archive withoutloading the package; whilst this is feasible in principle,providing this capability is not seen as a priority.)

FUTURE WORK

Much work remains to be done in the development ofCXXR. Among work scheduled for 2013 is portingto Microsoft Windows (a conspicuous gap untilnow), and the integration of provenance-trackingfacilities into the development trunk, along with thedecentralized approach to serialization.

There is considerable scope for further restruc-turing of the interpreter code using C++ generics,along the lines illustrated above for binary func-tions and subscripting. For example [[-subscriptingand binding operations such as cbind and rbindwould readily lend themselves to this treatment.

Another area being held in review is the speedof CXXR in comparison with CR. The performanceof CXXR depends very much on the R script beingrun. In particular CXXR at present has a higheroverhead than CR in setting up R function calls,and in the housekeeping related to garbage collection.Consequently CXXR fares particularly badly runningscripts that involve many R manipulations on smalldatasets; many of the R test scripts are of thiskind, and on such scripts CXXR currently runsat down to about three-quarters the speed of CR,sometimes even less. On the other hand, CXXR hasa more aggressive approach to garbage collectionthan CR, which means that it has leaner memoryrequirements, and makes more effective use of theprocessor caches. Consequently it tends to come intoits own working with larger datasets, and in manysuch cases CXXR runs somewhat faster than CR.Indeed Jens Oehlschlagel (private communication) hasproduced an example where CXXR runs over twiceas fast as CR.

It is perhaps worth remarking that in decidingfuture directions for development within CXXR,it is always necessary to consider what impact aproposed change may have on the task of subsequentlymaintaining CXXR in parallel to CR. Already some10% of the development effort in CXXR is incurredby tracking new CR releases, and it is important thatthis ratio is held within bounds.

CONCLUSION

CXXR aims to open up the R interpreter to developers,providing a workbench on which they can developtheir own ideas regarding extensions, modifications,and adaptations to R. As this paper has illustrated,one important way in which CXXR achieves this is byimplementing objects visible in R using a C++ classinheritance hierarchy which developers can extend.This is complemented by a drive to rewrite key

188 © 2013 Wiley Per iodica ls, Inc. Volume 5, May/June 2013

Page 9: CXXR: an extensible R interpreter

WIREs Computational Statistics CXXR: an extensible R interpreter

algorithms within the R interpreter at a higher levelof abstraction, and to make them available via theCXXR API.

NOTESaThe CXXR initiative is not part of the official Rproject; however, the author wishes to acknowledgethe help and encouragement of many members of theR Core Team.bAt the time of writing, the Rcpp package is incom-patible with the CXXR interpreter because it relieson internal features of CR. This should not be toodifficult to rectify.cEven in CXXR it is inevitable that some type check-ing needs to be done at runtime because R variablesdo not have a declared type. However, once a checkhas been performed, the relevant pointer can be down-cast to the appropriate CXXR type, thus exposing itsusage to compile-time checks.dR itself incorporates an EXTPTRSXP data type(implemented in CXXR as class ExternalPointer)which contains a void* pointer, with the intention

that developers can configure this pointer to point toany desired data object. However, it is left to the pro-grammer to cast this pointer down to the desired typewhenever it is used, and the interpreter gives the pro-grammer little or no support in creating, duplicating,destroying, or serializing the external object.eWith the exception of the BailOut classes, whichare not visible to R users. These are used in CXXRto implement indirect flows of control in R (e.g.,return, break) without incurring the overhead ofthrowing C++ exceptions.f Although the C++ language contains a number ofbuilt-in mechanisms for determining which type anobject belongs to within a class inheritance hierarchy,CXXR nevertheless also retains the SEXPTYPE field ofCR. This is for backwards compatibility with existingcode.gIt perhaps needs to be emphasized that the packagewe go on to describe is specifically a CXXR pack-age: though structured identically to a CR package asdescribed in ‘Writing R Extensions’,10 it also utilizescalls to CXXR’s C++ API, and is therefore unusablewith CR.

REFERENCES1. Runnalls AR. Aspects of CXXR internals. Comp Stat

2010, 26:427–42. ISSN 0943-4062.

2. Boost. Boost C++ libraries. Available at http://www.boost.org. (Accessed February 15, 2013).

3. Runnalls AR, Silles CA. Provenance tracking inR. From International Provenance and Anno-tation Workshop (IPAW) 2012, Santa BarbaraCA. Forthcoming in the conference proceed-ings, Springer LNCS 7525. Available in posterform at: http://www.cs.kent.ac.uk/projects/cxxr/pubs/IPAW2012_poster.pdf. (Accessed February 2, 2013).

4. Silles CA, Runnalls AR. Provenance-awareness in R.In: McGuinness, D, Michaelis J, Moreau L, eds.Provenance and Annotation of Data and Processes,Lecture Notes in Computer Science. Berlin/Heidelberg:Springer; 2010. ISBN 978-3-642-17818-4.

5. Eddelbuettel D, Francois R. Seamless R and C++ inte-gration. Available as package Rcpp from CRAN: http://cran.r-project.org. (Accessed February 15, 2013).

6. Runnalls AR. CXXR and add-on packages. FromuseR! 2010 conference. Available at: http://user2010.org/slides/Runnalls.pdf. (Accessed February 15, 2013).

7. Bates D, Maechler M. Matrix: Sparse and dense matrixclasses and methods (This package is included in thestandard distributions of CR and CXXR). Available aspackage Matrix from CRAN: http://cran.r-project.org.(Accessed February 15, 2013).

8. GMP. The GNU multiple precision arithmetic library.Available at: http://gmplib.org. (Accessed February 15,2013).

9. Lucas A, Scholz I, Boehme R, Jasson S, Maechler M.RBigIntegers. Available as package gmp from CRAN:http://cran.r-project.org, and from http://mulcyber.toulouse.inra.fr/projects/gmp. (Accessed February 15,2013).

10. The R Core Team Writing R extensions. Available at:www.r-project.org. (Accessed February 15, 2013).

11. Adler D, Glaser C, Nenadic O, Oehlschlagel J, Zuc-chini W. ff: memory-efficient storage of large data ondisk and fast access functions. Available as package fffrom CRAN: http://cran.r-project.org. (Accessed Febru-ary 15, 2013).

12. Ramey R. Boost serialization. Available at: http://www.boost.org. (Accessed February 15, 2013).

Volume 5, May/June 2013 © 2013 Wiley Per iodica ls, Inc. 189