Migrating a CISC Computer Family onto RISC via Object Code Translation

Kristy Andrews, Duane Sand

Technical Report 92.1
November 1992
Part Number: 94885

Kristy Andrews and Duane Sand
Tandem Computers Incorporated

19333 Vallco Parkway
Cupertino, California 95014

Tandem Technical Report 92.1

Abstract

A minicomputer/mainframe family (Tandem NonStop Systems) and all of its machine-dependent vendor and user software have been moved from a proprietary CISC instruction set onto a generic RISC instruction set. Translation of programs' CISC object code into optimized RISC object code is a migration path that is easy, gives greatly improved performance, and provides all the benefits of traditional object code compatibility. These benefits include no reprogramming, no retraining, fast time to market, and debugging of optimized programs as if they were still running on the CISC platform. This paper shares our experience in implementing this migration scheme, with measurements of the resulting performance and code size.

Introduction

"Object code translation" is the automated conversion of programs from one instruction set to a significantly different instruction set. A special compiler translates binary machine code (object code) rather than high level source code. It generates an equivalent program that is optimized for the target machine. Object code translation is more efficient and less labor-intensive than traditional methods of migrating machine-dependent software onto different machines. This migration technique played a crucial role in the smooth and rapid evolution of the TNS (Tandem NonStop Series) CISC-based computer family into the TNS/R (Tandem NonStop Series/RISC) computer family based on MIPS RISC architecture. The new RISC machines run all the software of the CISC machines without recompilation, including programs depending on unique machine features. Object code translation delivers much higher performance than traditional emulation techniques. The fidelity and performance of this emulation technique mean that applications need no reprogramming and programmers need little additional training, even for debugging.

This paper was presented at the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-V), held in Boston, Massachusetts on October 12-15, 1992. Copyright 1992, Association for Computing Machinery, Inc. Reprinted by permission.

Besides being easy for users, this migration technique enabled us to bring the RISC machines to market years earlier than otherwise possible. All existing TNS system software and third-party software packages were immediately available on TNS/R machines, with good performance. Our proprietary operating system (Guardian 90) contained some machine-dependent TNS coding. Recoding of these pieces in a portable language dialect had to follow other major ongoing developments. To expedite the introduction and exploitation of RISC technology, Tandem built the first RISC release of the operating system through translation of CISC object code. (Future releases will be built from the revised, portable source code, using optimizing "RISC native mode" compilers.)

For this migration to be successful, our object code translator (the Accelerator) needed to be easy to use, reliable, and fully compatible with our CISC machines, and yet generate very efficient RISC code. The ability to debug accelerated programs in a familiar environment was a significant benefit. This paper describes the resulting design and compares the resulting code speed and size to that of our CISC machines and to emulation via a conventional interpreter.

Related Work

Object code translation is a decades-old migration and emulation technique. The advent of RISC machines and optimizing compilers has led to many new applications of the technique.

Tandem's use of object code translation is similar to that of Hewlett Packard. HP used object code translation as one of three software migration paths from their HP3000 stack-oriented CISC computer family to their Precision Architecture RISC computers [HP3000]:

• CISC emulation via interpretation.

• CISC emulation via object code translation and interpretation.

• RISC native mode via recompilation from source code.

HP's first MPE XL operating system release included major software restructurings tied in with their native mode. This delayed their first release long after their RISC hardware was ready.

DEC similarly uses object code ("binary") translation to port VAX VMS programs and MIPS ULTRIX programs onto ALPHA RISC machines [VEST]. The assembler-coded portions of VMS itself were ported by translating from modified VAX macro source files [Alpha].

Apple used a translator from M68000 to Am29000 assembler to port the Macintosh QuickDraw routines onto a coprocessor card (the 8-24 GC card). Apple plans to use M68000 object and assembler code translators to port Macintosh applications and system software onto the PowerPC RISC computers. These translators will supplement interpreted M68000 code and recompiled native code [Apple]. Apple also has a translator for moving M68000 code onto M88110 CPUs [BYTE].

Hunter Systems also uses object code translation ("binary porting") to port MS-DOS applications onto UNIX machines [Hunter]. This translator has multiple back ends for different target machines (M68030 and various RISC CPUs). Its program analyzer is only semi-automated; it requires extensive manual hints about each specific 8086 program to handle common problems with self-modifying code, computed jumps, and indirect calls.

Besides porting to new architectures, object code translation can also be used to simulate future machines on existing ones efficiently. MIPS used translation from R2000 object code to VAX code to test MIPS compilers on substantial programs before the first R2000 chips were running [Moxie].

The translators described above operate like compilers, before program execution. An important variation is to translate chunks of object code during program execution and discard the accumulated code when the program terminates. The following are recent uses of this "dynamic translation" emulation technique.

Insignia Solutions Limited's SoftPC directly executes unmodified MS-DOS application codefiles on various UNIX machines. It interpretively executes sections of 8086 code some number of times before deciding to translate that section.

IBM researchers prototyped a fast emulator for a subset of System/370, hosted on a PC RT workstation. Dynamic translation overcame a major problem with static analysis of simple branches in S/370 code [IBM].

None of these related works discussed the feasibility of offering a debugging environment for converted programs that is similar to that offered for unconverted programs.

In addition to porting and simulation applications, object code translation has also been useful when the source and target machines are the same. MIPS Computer Systems' Pixie tool analyzes and augments MIPS code. The instrumented program collects its own address trace and path-counting information with minimal overhead [Pixie]. The Purify tool augments programs to detect data initialization errors and heap usage errors. It inserts code at each load, store, malloc, and free [Purify]. The Mahler compilers do post-linker optimizations of register usage across an entire program [Mahler].

Tandem's CISC Migration Challenge

The TNS computer family is a range of fault-tolerant multicomputers with 2 to 16 independent CPUs per system and up to 224 CPUs per fiber optic cluster [TNS]. The Guardian 90 operating system and NonStop SQL database software are especially efficient for OLTP (online transaction processing). Processes cooperate via messages rather than shared memory. This confines failures, avoids shared memory bottlenecks, and enables throughput to scale up linearly as CPUs are added or replaced by faster CPUs. The system and software architecture are very effective, but the underlying proprietary CISC instruction set gave opportunities for improvement.

It is now clear that pipeline-optimized RISC CPUs have significant performance and cost/performance advantages over microprogrammed CISC CPUs, and that this advantage will persist.

In general, all CISCs suffer from a reliance on microprogramming and inefficient use of instruction execution pipelines. Beyond that, each CISC architecture has its own particular features which limit performance. The following are aspects of the TNS CISC instruction set that were improved by switching to RISC.

The TNS instruction set was derived from HP3000 in 1975. It was designed for simple compilers and for minimal memory cost for modest minicomputer applications of the mid-seventies. It is basically a 16-bit stack-oriented machine with extensions for 32-bit addressing. Instructions and registers are 16 bits, as are most addresses and data. 32-bit addressing requires many more steps than 16-bit addressing. 256 words of global variables and 192 words of local stack frame are directly addressable by load or store instructions; everything else requires extra indirection or indexing steps that are often redundant. As our software evolved, it outgrew these limits and increasingly had to use less efficient forms of addressing.

Most TNS instructions have implied stack operands and results rather than explicitly designated registers. In contrast to most stack machines, the TNS expression evaluation stack is not an automatically buffered copy of the memory stack. Instead, it is a barrel of eight 16-bit registers that is also accessed explicitly as index registers and sometimes as a conventional register file. The current "topmost" stack register is identified by a special 3-bit register called RP (Register Pointer). Statically predicting RP posed significant problems for our object code translator.

As with pure stack machines, this hybrid register architecture precludes many kinds of compiler optimizations. The registers are too few, too narrow, and require extra steps to access. It is rarely profitable to employ register variables, to factor out small common subexpressions, or to move code out of loops, so the TNS compilers never even try. The rigid operand order prevents static rescheduling of instructions to fill predictable pipeline stalls.

These instruction-level inefficiencies are eliminated by translating TNS object code to RISC.

Translation Teamed With Interpretation

TNS/R RISC machines execute TNS CISC object code either by interpretation or by translation.

The Accelerator is a post-compilation tool which augments TNS codefiles with equivalent optimized RISC code prior to execution. It is invoked explicitly rather than as a side effect of RUN commands or process creation. It requires no information from the user.

Before generating efficient RISC code, object code translators must somehow resolve various puzzles about the unpredictable dynamic effects of the original CISC code. The particular puzzles that arise in TNS code are described later. The three popular methods for resolving translation puzzles are:

• Get very detailed advice from an "authority", and fail if they were wrong.

• Run the program until the puzzle point is reached and the answer is revealed, and then dynamically generate new code before resuming.

• Make a best guess based on static analysis, and fall into interpretive execution mode for a short time if and when this guess turns out to be incorrect.

The first method is clearly too unreliable for general use. The second method is used by dynamic translators. Our Accelerator uses the third method. The second and third methods, if implemented carefully, are totally reliable for all programs.

We chose static rather than dynamic translation for the following reasons:

• The first operating system released would itself execute via translation.

• Our performance goals were high.

• The necessary translation algorithms require significant time and memory.

• Tandem machines are primarily used for months-long execution of a few applications.

The third method has several major consequences. It requires the presence of a CISC Interpreter for use as a backup in situations where the RISC code is based on incomplete flow information. The program must retain all of the original CISC code for potential use by the Interpreter. The potential switches to interpreter mode place strict requirements on the structure of the program's data space and call/return stack: it cannot be any different from the original CISC program. The stack frames cannot be widened to include RISC return addresses or RISC register spill areas. Variables cannot be repositioned to be word-aligned. All expected CISC register values must be generated at potential switch points.

These requirements and restrictions limit the potential speed and size efficiency of translated code, but they also have some benefits. The presence of unmodified CISC code segments in accelerated files enables distribution and execution of identical codefiles on all TNS and TNS/R machines on a customer's network. Also, the Accelerator need not try to extract and relocate the read-only data tables intermixed in the CISC code. The fidelity of memory layout greatly simplifies source-level debugging and helps user understanding. The mechanism for faithful emulation of CISC registers makes CISC machine-level debugging possible.

The Accelerator allows the programmer to give some optional detailed translation advice, or "hints," but this is used only to tune a program by improving the guesses to avoid interpreter mode fully; it is never needed for a program to execute correctly. The Accelerator points out subroutines that may benefit from hints. Users can also easily measure how much time an accelerated program is spending in interpreter mode. In our experience, most accelerated programs spend less than 1% of their time in interpreter mode, even without hints.

Of 199 Tandem programs released to customers in accelerated form, only 7 used a hint command to override the guessed result size of a function. The other kinds of detailed hints were only used by the system library.

Besides accidental entry into interpreter mode at puzzle points, deliberate changes of execution mode are important at calls and returns. Calls into the shared system library proceed at full speed even if the caller was unaccelerated. This call/return design allows the future possibility of selectively accelerating just the most time-consuming subroutines of a program and thus minimizing code expansion.

Many TNS programs are I/O intensive or accomplish most of their work via calls on system-provided services. The speed of these programs depends mostly on the speed of the operating system, and so they automatically benefit from accelerated or native mode execution of the system code. Such application programs would benefit little from acceleration or direct use of native mode themselves; the CISC Interpreter is adequate. The time spent within the application's own driver code is so small that the overall execution time would be unaffected even if that code were made infinitely fast.
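The last point is an instance of Amdahl's law. The following minimal sketch shows the arithmetic; the 2% figure is purely illustrative and is not a measurement from this paper, and the function name is hypothetical.

```python
def overall_speedup(fraction, local_speedup):
    """Amdahl's law: whole-program speedup when `fraction` of the run time
    is made `local_speedup` times faster (float('inf') = infinitely fast)."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


# Illustrative assumption: the application's own driver code accounts for
# 2% of run time and the rest is spent in system code.
best_case = overall_speedup(0.02, float("inf"))
print(f"best possible overall speedup: {best_case:.4f}x")
```

Under that assumption, even infinitely fast driver code improves the whole program by only about 2%, which is why interpreting such driver code is adequate.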

Structure of Accelerated Code

TNS CISC code is composed of simple stack instructions, simple instructions with complex addressing modes, and occasional complex long-running instructions such as MOVB (move bytes, with possible smear and right-to-left option). Most instructions are simple.

The Accelerator translates unconditional sequences of simple TNS instructions into a functionally equivalent sequence of RISC instructions. This is analogous to inline macro expansion. However, the RISC code need not follow the original order and many optimizations are possible. The most important one is elimination of unneeded side effects that would require extra RISC instructions. In particular, literal operands often disappear, and the condition code, carry out, and overflow flags are calculated and saved only when subsequent TNS code will require them. The Accelerator translates each complex or long running TNS instruction into a fast call to a "millicode" routine hand-coded in RISC assembler.

The translated RISC code calculates the same answers as the TNS code, and does exactly the same sequence of stores into memory. It also mimics the changes to TNS register state if those values are needed later.

The TNS register stack of eight 16-bit registers is held in eight dedicated RISC registers. These emulated register values must behave as if they were only 16 bits wide. For some operations they are joined into 32-bit or 64-bit values. This leads to a variety of possible ways to represent a TNS register value within its 32-bit RISC register. All of these possible "formats" are used at different points in the translated RISC code. The most common formats are right justified, with sign bit extension, zero fill, or unknown fill. When doing 16-bit add or subtract with overflow detection, the operands are shifted into left justified format. (The MIPS instruction set lacks a direct way to trap on 16-bit overflows.) When doing 32-bit TNS operations or extended addressing, pairs of emulated 16-bit registers are packed together into one RISC register, with nothing in the paired RISC register.
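The left justified format works because a 16-bit signed overflow of the operands coincides with a 32-bit signed overflow once both operands sit in the upper halfword. The sketch below models that idea in Python; the helper names are hypothetical, and the 32-bit range check stands in for the trap that a MIPS signed add would raise. It is an illustration of the principle, not the Accelerator's actual code sequence.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x):
    """Interpret the low 32 bits of x as a two's-complement value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def left_justify(v16):
    """Place a 16-bit value in the upper half of a 32-bit register."""
    return (v16 & 0xFFFF) << 16


def add16_with_overflow(a16, b16):
    """Return (16-bit result, overflow flag) using one 32-bit signed add."""
    a = to_signed32(left_justify(a16))
    b = to_signed32(left_justify(b16))
    total = a + b
    # A 32-bit signed add of left-justified operands overflows exactly
    # when the original 16-bit signed add would overflow.
    overflow = not (-(1 << 31) <= total < (1 << 31))
    result16 = ((total & MASK32) >> 16) & 0xFFFF
    return result16, overflow


print(add16_with_overflow(0x7FFF, 0x0001))  # largest positive + 1
print(add16_with_overflow(0xFFFF, 0x0001))  # -1 + 1
```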

Besides these registers, another seven dedicated RISC registers hold special TNS register states such as the memory stack pointers, and the condition code, carry out, and overflow flags. Fourteen of the remaining RISC registers are Accelerator temporaries. These hold indirect addresses, shifted addresses, common subexpressions, and millicode arguments. Temporarily unused TNS registers can also be borrowed as temporaries.

Register-Exact Points

The extra registers of the RISC CPU are key to many optimizations. However, there are moments when these additional resources must not be used. These are called register-exact points. At these points, the correspondence between CISC register state and RISC register state must be exact, unambiguous, and independent of how the program arrived there. Live TNS registers must be represented in right justified format in their own RISC register. The temporary registers must be empty. Pending stores to memory must have been completed, and subsequent loads must not have begun. In short, optimizations cannot cross register-exact points.

The most important register-exact points are the call sites of TNS subroutine calls. No hidden RISC state can be maintained across the call, in registers or in the stack. Furthermore, the called subroutine may execute in interpreter mode. The call site is therefore a re-entry point from interpreter mode. Other register-exact points include:

• TNS subroutine entry points, for calls through pointer variables or calls from interpreter mode.

• TNS subroutine exit points, to return to an untranslated caller, or to resume full-speed execution after a transition to interpreter mode.

• Statements that are the likely targets of jumps through pointer variables.

• Instructions which potentially fall into interpreter mode ("puzzle points").

• The beginning of every statement, when translated with the "Statement Debug" option.

Subroutine Calls

TNS CISC subroutine calls do a number of table lookups in tables that never change during execution. The accelerated replacement of this is usually a direct jump from the caller to the target subroutine's RISC code. The frame building steps and "segment switching" steps of the call instruction are done in the subroutine's prologue.

Exits back to the caller are less direct. The CISC stack frame only contains the 16-bit TNS return address, not the 32-bit address of the corresponding RISC code. The EXIT millicode routine converts the TNS address into a RISC address by looking it up in the target code segment's PMap (Program Address Map) table. This lookup takes 11 R3000 cycles.

Program Address Map

Besides containing accelerated RISC code, an augmented TNS codefile also contains one or more PMap tables. These tables map 16-bit TNS instruction addresses to the corresponding 32-bit RISC instruction addresses. The mapping is sparse. It includes only register-exact points (used by indirect jumps) and memory-exact points (used by the debugger to mark statement boundaries). The PMap is compressed into an array with one byte per TNS instruction word, plus a shorter array with one base address per 8 TNS words. This totals 12 bits of table per mapped or unmapped TNS instruction.
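The 12-bit figure follows from 8 bits for the per-word byte plus one eighth of a 32-bit base address. A minimal sketch of such a two-level table is shown below. The paper does not give the byte encoding, so the sentinel value, the offset unit (4-byte RISC instructions), and all names here are assumptions made for illustration; the distinction between register-exact and memory-exact entries is not modeled.

```python
UNMAPPED = 0xFF  # hypothetical sentinel: this TNS word has no map entry


class PMap:
    """Two-level TNS-to-RISC address map (illustrative encoding only)."""

    def __init__(self, bases, offsets):
        self.bases = bases      # one 32-bit RISC address per 8 TNS words
        self.offsets = offsets  # one byte per TNS instruction word

    def lookup(self, tns_addr):
        """Return the RISC address for a TNS word address, or None."""
        if not 0 <= tns_addr < len(self.offsets):
            return None
        off = self.offsets[tns_addr]
        if off == UNMAPPED:
            return None
        # Assumed encoding: offset counts RISC instructions from the base.
        return self.bases[tns_addr >> 3] + 4 * off


def bits_per_tns_word(base_bits=32, group=8, offset_bits=8):
    """Table cost per TNS word: one byte plus a share of the base address."""
    return offset_bits + base_bits / group


offsets = [UNMAPPED] * 16
offsets[0], offsets[2], offsets[9] = 0, 3, 1
pmap = PMap(bases=[0x1000, 0x1040], offsets=offsets)
print(pmap.lookup(2), pmap.lookup(1), bits_per_tns_word())
```

A failed lookup (None) corresponds to the case where the caller must fall back to interpreter mode.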

The PMap is used primarily during returns from accelerated or interpreted subroutines back to accelerated callers. It also supports other dynamic jumps using TNS program counters.

If the target address is not found or is not "register-exact", then the target has no translated RISC code, or its RISC code has been optimized together with surrounding statements and cannot be arbitrarily entered from outside. Thwarted jumps are completed by switching over to interpreter mode for a short time.

Because the PMap is monotonic, it also provides the inverse mapping from RISC instruction addresses back to CISC instruction addresses. The debugger uses this to implement a CISC view of the running program. This lookup is a slower binary search, but lookup time is not critical for the debugger.
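The inverse lookup can be sketched as a binary search over the mapped entries, relying only on the monotonic property stated above. The flat parallel lists and the function name are assumptions for illustration, not the debugger's actual data layout.

```python
import bisect


def risc_to_tns(mapped_tns, mapped_risc, risc_addr):
    """Return the TNS address of the last mapped point at or before
    risc_addr, or None if risc_addr precedes every mapped point.

    mapped_tns and mapped_risc are parallel lists of mapped points,
    both in ascending order (the map is monotonic).
    """
    i = bisect.bisect_right(mapped_risc, risc_addr) - 1
    return mapped_tns[i] if i >= 0 else None


tns_points = [0, 3, 7]
risc_points = [0x1000, 0x1010, 0x1030]
print(risc_to_tns(tns_points, risc_points, 0x1014))
```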

Entry To and Recovery From Interpreter Mode

Accelerated programs enter interpreter execution mode at the following puzzle points:

• A call or return to untranslated subroutines.

• A jump indirectly to some unexpected place.

• An Accelerator guess about a function's result size turns out to be wrong.

• Two or more control flow paths predict different values of RP at a point of convergence.

The unplanned stumbles do not occur often. But when they do occur, the accelerated program does not stay in interpreter mode very long. The Interpreter automatically returns control back to Accelerator-generated RISC code at the next call or return instruction that finds a register-exact point in the PMap.

Interpretive execution could persist for a long time if the Interpreter encounters a long running loop that makes no calls, but such loops are uncommon on TNS machines.

Compatibility Tradeoffs

The TNS/R Interpreter and Accelerator are very compatible with TNS machines and TNS programs. One very minor difference was unavoidable. In both execution modes, a trap handler can no longer alter the CISC register state in arbitrary ways before resuming at the point of the trap. Few trap handlers attempt to do this.

For sophisticated users, the Accelerator has options which generate faster, smaller code when exact emulation of certain TNS features is not needed. They are primarily used in the operating system kernel and file system. These are collectively known as the "Fast" option:

• Omit trapping on 16-bit overflows. The extra code for this is pointless within programs which ignore overflow traps anyhow.

• Omit truncation of indexed 16-bit addresses. This overhead is needed only when arrays with large lower bounds are located near address zero.

• Assume pointers are unaffected by indirect or indexed single-byte stores coded inline. This allows more reuse of common subexpressions. The assumption is unsafe in some C programs.

The Accelerator optionally implements all TNS instructions like "Add to Memory" in an atomic, non-interruptible way. The default is to do this only on specially marked occurrences of those instructions.

The Accelerator, An Optimizing Compiler

Except for its unusual front end, the Accelerator is much like other optimizing compilers for RISC machines.

The Accelerator is very successful at improving on machine code originally generated for a 16-bit stack machine. Its major optimization effects include:

• Determining absolute register numbers in the TNS code.

• Avoiding calculation of unneeded hardware side effects.

• Undoing the splitting of 32-bit values across pairs of registers.

• Eliminating redundant data fetches and address calculations (the most frequent forms of common subexpressions).

• Omitting unneeded 16-bit truncation effects.

• Replacing address arithmetic by RISC addressing modes.

• Scheduling independent steps into delay slots of data load, indirect address load, and branch instructions.

TNS emulation requires complete control over registers, stack frame, and allowed optimizations, so the Accelerator has its own unique code generator and optimizer rather than using those components of the MIPS Ucode compiler system.

TNS Code Analysis

The Accelerator begins by disassembling the CISC program's binary code and working out all of the branch paths that it can predict. This is similar to object code analysis tools such as Pixie [postCompiler, Postload]. For some machines such as IBM S/370, this step is surprisingly hard, because their simplest branch instructions depend on base registers that are manipulated elsewhere in the program. For TNS code, branch flow analysis is fairly easy, except for CASE jumps and for jumps through hand-built tables containing unmarked code addresses.

The size of CASE tables is determined by doing a "depth first" search of paths reached via the first words of the table. This search marks the instructions or data words following the table, before they become considered as possible table words.

Subroutines containing unanalyzable jumps through pointer variables or calculated addresses are handled by marking every explicitly-labelled statement as a potential target of those jumps. In codefiles lacking debugger information about statement labels, jumps through hand-built tables will cause the subroutine to fall into interpreter mode.

Exact control flow analysis is important for safe and effective optimizations. For every possible path, the Accelerator must follow one of the following strategies:

• Analyze the exact data flow on that path.

• Assume worst case data flow and suppress nonlocal optimizations.

• Switch to interpreter mode if and when that path is taken.

RP Analysis

The next phase in the Accelerator is preliminary to determining which registers are used and set by each CISC instruction. This determination is trivial for most machines, but is surprisingly hard for TNS machines. Some references to the register stack use absolute register numbers, but most are dynamically relative to the current value of RP, which identifies the currently-topmost register. TNS compilers know the value of RP at every instruction but this information is not saved in the codefile. The Accelerator must recover this lost RP information, sometimes by global analysis across the entire program.

Most TNS instructions have a simple obvious effect on RP. Subroutine calls do not. The final value of RP depends on the size of the returned function result. When possible, the Accelerator retrieves result size information from summaries in the object codefile or from standard library descriptions. In other cases, the Accelerator determines the result size by recursively analyzing all RP changes within the called subroutine, and within the subroutines called from there, and so on. This analysis does terminate; every practical function has at least one path to a nonrecursive return statement.
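The recursive analysis and its termination argument can be sketched as follows. The program model is a deliberate simplification invented for this example: each subroutine is a list of paths to a return, each path is a list of steps, and a step is either an integer RP change or the name of a called subroutine. The names `SUBS` and `result_size` are hypothetical.

```python
def result_size(name, subs, known=None, active=None):
    """Net RP change of subroutine `name`, or None if every path recurses.

    A path that calls a subroutine still under analysis is skipped; the
    answer then comes from some path that reaches a nonrecursive return.
    """
    known = {} if known is None else known
    active = set() if active is None else active
    if name in known:
        return known[name]
    if name in active:
        return None  # recursive call: size not yet known on this path
    active.add(name)
    size = None
    for path in subs[name]:
        total = 0
        for step in path:
            if isinstance(step, int):
                total += step            # simple instruction: fixed RP change
            else:
                callee = result_size(step, subs, known, active)
                if callee is None:
                    total = None         # abandon this path, try the next
                    break
                total += callee
        if total is not None:
            size = total
            break
    active.discard(name)
    if size is not None:
        known[name] = size
    return size


SUBS = {
    "leaf": [[+1]],                      # pushes a one-word result
    "fact": [["fact", -1, +1], [+1]],    # recursive path, then base case
    "main": [["fact", "leaf"]],
}
print(result_size("fact", SUBS), result_size("main", SUBS))
```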

When subroutines are indirectly called via dynamic pointers, the identity of the target is unknowable and the result size is uncertain. TNS compilers now follow such indirect calls with an additional instruction specifying the expected RP. For older codefiles that do not have this clue, the Accelerator guesses the result size, based on patterns in the subsequent stack code. The Accelerator also guesses the result size for direct calls on unknown subroutines that are external to a codefile. This guess works well in most cases, but it is fallible.

If subsequent TNS code depends on the returned RP value but the guess is wrong, the corresponding optimized RISC code is incorrect and must not be executed. The translated call site therefore includes a run-time check which confirms that the returned RP has the expected value. If this check fails, the program switches over to interpreter mode for a short time. Programmers can override poor guesses by supplying optional "ReturnValSize" hints at translation time.
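The guarded call site can be pictured as a guess-and-verify pattern. In this minimal sketch the three callables are hypothetical stand-ins for the called subroutine, the optimized RISC continuation, and a temporary switch to the Interpreter; none of the names come from the Accelerator itself.

```python
def accelerated_call(callee, expected_rp, fast_path, interpreter_path):
    """Run `callee`, then choose the continuation from the returned RP.

    callee           -- hypothetical callable returning the 3-bit RP on exit
    expected_rp      -- RP value guessed at translation time
    fast_path        -- stands in for the optimized RISC code after the call
    interpreter_path -- stands in for a temporary switch to interpreter mode
    """
    returned_rp = callee() & 0b111
    if returned_rp == (expected_rp & 0b111):
        return fast_path()               # guess confirmed: stay in RISC code
    return interpreter_path(returned_rp)  # guess wrong: interpret for a while


print(accelerated_call(lambda: 1, 1, lambda: "risc", lambda rp: ("interp", rp)))
print(accelerated_call(lambda: 2, 1, lambda: "risc", lambda rp: ("interp", rp)))
```

The check costs a comparison on every such call, but it is what makes a fallible guess safe.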

Other ambiguities in RP analysis include:

• Exception-handling subroutines which sometimes exit through ancestor stack frames, with different result size.

• Other TNS instructions with statically uncertain effect on RP.

• Real or imagined paths that join with conflicting predictions for RP.

Identifying Unneeded Effects

With RP analysis completed, it is clear which registers are used or changed by each TNS stack instruction. The Accelerator then applies standard iterative data flow analysis methods to determine whether the values in the eight TNS stack registers are "live" or "dead" at every point. The three instruction side effect indicators ENV.CC (condition code), ENV.K (carry out), and ENV.V (overflow) are also tracked as if they were separate registers.

Data flow analysis covers only these eleven "variables"; it does not cover any memory resident variables nor does it cover any host machine registers.
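A textbook backward liveness iteration over these eleven variables might look like the sketch below. The block representation (use, def, and successor sets) and the variable names are assumptions for illustration; the Accelerator's internal data structures are not described in this paper.

```python
# The eleven tracked "variables": eight stack registers plus three flags.
VARS = [f"R{i}" for i in range(8)] + ["CC", "K", "V"]


def liveness(blocks):
    """Iterate live-in/live-out sets to a fixed point.

    blocks maps a block name to {"use": set, "def": set, "succ": list},
    where "use" holds variables read before being written in the block.
    """
    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        for name, blk in blocks.items():
            out = set()
            for succ in blk["succ"]:
                out |= live_in[succ]
            new_in = blk["use"] | (out - blk["def"])
            if out != live_out[name] or new_in != live_in[name]:
                live_out[name], live_in[name] = out, new_in
                changed = True
    return live_in, live_out


blocks = {
    "entry": {"use": set(), "def": {"R0", "CC"}, "succ": ["exit"]},
    "exit": {"use": {"R0"}, "def": {"R1"}, "succ": []},
}
live_in, live_out = liveness(blocks)
print(sorted(live_out["entry"]))
```

In this example the condition code set in the first block is dead on exit from it, so the code that would compute it can be omitted.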

Independent (non-nested) subroutines can be analyzed separately, since these TNS registers are normally dead across calls. A few exotic library subroutines have register arguments and require hints in the standard "translation hints" file.

RISC Code Selection

The next phase of the Accelerator expands each TNS instruction into preliminary sequences of RISC instructions. The live/dead information about TNS registers drives the choice between alternative opcode expansions, such as including or omitting overflow detection.

As the Accelerator translates TNS instructions of a basic block, it keeps track of the latest format (left or right justified, etc.) of emulated CISC registers. It optimizes load instructions to match the format preferred by the operation subsequently using that value. The format of CISC registers is forced to be consistent at register-exact points and when paths join.

RISC Code Optimization

This phase improves the program by optimizing RISC instructions within and across basic blocks, one subroutine at a time.

This phase applies standard global optimization techniques. These include live/dead analysis of RISC registers, value numbering, constant folding, copy propagation, common subexpression elimination, movement of address-offset arithmetic into load/store ops, and dead code elimination. The optimizer also includes peephole optimizations for index scaling, for register format conversions, and for conversions from TNS 16-bit word addresses to 32-bit byte addresses.

An important difference from standard optimizing techniques is our emulation requirement that RISC registers never be spilled into memory because there is no safe place for additional register-save areas. Values are assigned to specific registers relatively early, and subsequent optimizations never exceed the available pool of registers. Another required difference is that optimizations never cross register-exact points. To support debugging, stores are never moved, eliminated, or replicated. This precludes loop-specific optimizations such as loop unrolling.

After global optimization, the final phase of the Accelerator reorders RISC instructions within each basic block to fill delay slots, eliminate NOPs, and reduce pipeline stalls. It also replaces multiplications by a constant with a faster sequence of shifts and adds.
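As a minimal illustration of the last step, a multiplication by a non-negative constant can be decomposed into one shift per set bit of the constant. This is the simplest form of the technique and the function names are hypothetical; production code generators typically also use subtraction and reuse of partial results, which this sketch does not attempt.

```python
def shift_add_plan(c):
    """Shift amounts s such that x * c == sum(x << s), for c >= 0."""
    return [i for i in range(c.bit_length()) if (c >> i) & 1]


def multiply_by_constant(x, c):
    """Multiply using only shifts and adds, following shift_add_plan."""
    return sum(x << s for s in shift_add_plan(c))


print(shift_add_plan(10), multiply_by_constant(7, 10))
```

For the constant 10, the plan is (x << 1) + (x << 3).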

Debugging

Debugging in a familiar environment makes moving to a new instruction set much easier. Our debugger supports debugging of accelerated programs at the source level, RISC machine level, and CISC machine level, much as if the program were still running on a microcoded TNS machine. The debugger looks very much like the TNS debugger. As with prior Tandem compilers, reliable source level debugging is always possible, even on the full-speed production version of an application. Crippling of code optimization prior to debugging is not required.

The underlying mechanics of debugging accelerated programs are based on the notions of register-exact points, memory-exact points, and allowed optimizations.

"Memory-exact points" are statement boundaries at which CISC code and optimized RISC code correspond closely enough that most source-level debugger operations can be used reliably. Stepping and breakpointing of execution flow is precise and unconfusing at these points. The memory changes and overflow traps of prior statements have all completed, and the memory changes and overflow traps of subsequent statements have not yet begun. The debugger reliably shows the expected value of variables in memory. The Accelerator's optimizations are designed such that most statements have their own memory-exact points and hence are individually debuggable. The allowed optimizations all retain the original sequencing of stores and of calculations that may trap.

However, RISC pipeline scheduling may move a store instruction into the delay slot of a subsequent branch instruction. To the debugger, this means that an assignment statement and following branching statement have been intermixed and welded together into a single unit; they complete simultaneously rather than sequentially. The statement pair has a single memory-exact point, representing the completion of both.

The memory sequencing rules do not apply to assignments to CISC register variables. The RISC code for such statements may either disappear entirely or migrate to anywhere in the current basic block. This causes additional cases of welding together of statements.

The main limitation of memory-exact points is that the debugger cannot reliably modify variables. That is, a manual change to memory values or CISC registers will not necessarily have any effect on subsequent statements using those variables. The operand fetch steps of those statements may have been optimized away or scheduled to much earlier in the statement sequence. A secondary limitation of memory-exact points is that the debugger cannot reliably display CISC register values.

These limitations and the welding effect can be circumvented by re-accelerating with the Statement Debug option, which turns most statement boundaries into register-exact points. The Accelerator then translates each statement independently, and only optimizes across statement boundaries to eliminate unneeded side effects. The programmer can reliably inspect and modify the program's machine register state in purely TNS CISC terms. Learning the RISC instruction set is never necessary!
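The difference between the two kinds of statement boundary can be sketched with a toy model. This is only an illustration under stated assumptions: the two-statement program, the `poke` callback standing in for a debugger user, and the `statement_debug` flag are all hypothetical and do not reflect the Accelerator's internals.

```python
def run(memory, poke, statement_debug):
    """Execute two statements: `y = x + 1`, then `z = x * 2`.

    `poke` is called at the boundary between the statements, as a
    debugger user stopped at a breakpoint there might modify memory.
    """
    reg_x = memory["x"]        # operand fetch for statement 1
    memory["y"] = reg_x + 1    # store of statement 1 (order is preserved)
    poke(memory)               # breakpoint at the statement boundary
    if statement_debug:
        # Register-exact boundary: statement 2 is translated on its own,
        # so it fetches x from memory again.
        reg_x = memory["x"]
    # Otherwise (memory-exact boundary) the fetch of x for statement 2
    # was optimized away and the earlier register copy is reused.
    memory["z"] = reg_x * 2    # store of statement 2
    return memory


def set_x_to_10(memory):
    memory["x"] = 10


if __name__ == "__main__":
    print(run({"x": 1}, set_x_to_10, statement_debug=False)["z"])
    print(run({"x": 1}, set_x_to_10, statement_debug=True)["z"])
```

In both modes the stores to `y` and `z` occur in source order, so memory looks as expected at the boundary; only the Statement Debug variant lets the manual change to `x` affect the following statement.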

Another possibility when debugging is to execute the entire accelerated program in interpreter mode, ignoring its RISC translation.

Speed Measurements

The main points of interest about translating CISC object code to RISC are the speed and size of the resulting programs. How fast is the RISC code, compared to the speed on CISC hardware? How efficient is it, in terms of machine cycles? How big is it, compared to the CISC version?

The programs measured here are:

• Transaction Application Language (TAL) compiler for TNS machines. TAL is a low-level systems language similar to C, Bliss, and HP SPL.

• TAL-coded Dhrystone, in versions using 16-bit addressing or 32-bit addressing. It combines features of C and Pascal Dhrystone benchmarks in ways typical of our software.

• Axcel (the Accelerator).

• ET1, an ATM debit/credit benchmark characterizing the OLTP market. ET1 mostly measures work occurring within the OS kernel, file system, SQL data base, and transaction monitor, which are the essence of most TNS applications. About 10% of the code executed per transaction varies, depending on processor or disk speed; that part is omitted here to get identical paths.

The machines measured here include three microcoded CISC hardware implementations of the TNS stack architecture:

• NonStop CLX 800 System: introduced 1991, 16.5 MHz cache cycle rate, peak execution rate two cycles per instruction, CMOS, one board per self-checking processor including cache, channel, and memory.

• NonStop VLX System: introduced 1986, 12 MHz cycle rate, peak execution rate one instruction per cycle, ECL with TTL wiring, 4 boards per processor.

• NonStop Cyclone System: introduced 1989, 22.3 MHz cycle rate, superscalar, peak rate two instructions per cycle, ECL, 4 boards per processor.

These are compared against software execution modes on a TNS/R RISC system using the MIPS R3000 chip:

• NonStop Cyclone/R System: introduced 1991, 25 MHz cycle rate, CMOS, one board per self-checking processor.

The RISC machine's software execution modes include:

• Interpreted TNS code.

• Accelerated TNS code, with 3 degrees of optimization: Statement Debug, Default, and Fastest.

The following gives the code execution speed of each machine or mode, relative to the CLX 800. (Bigger is better.) This is based on the amount of "CPU busy" time needed to execute some program or path. It includes cache misses, TLB misses, and I/O cycle stealing, but not I/O wait time or queuing effects.

                               Program
                    Dhrystone
Machine           16-bit  32-bit    TAL   Axcel    ET1
CLX 800            1.00    1.00    1.00    1.00   1.00
VLX                1.24    1.21    1.22    1.22   1.16
Cyclone            4.01    4.39    3.66    3.99   3.73
Cyclone/R
  Interpreted      0.50    0.52    0.55    0.48    n/a
  A-StmtDebug      2.61    3.10    2.41    3.41    n/a
  A-Default        2.65    3.14    2.52    4.05    n/a
  A-Fast opts      2.89    3.41    2.71    4.32   2.10

[Note: This ET1 code execution speed metric differs from the usual ET1 metric of throughput in transactions per second used by Tandem. Throughput is affected by peripherals, queuing delays, and reserved CPU capacity to meet required response times. In balanced systems a doubling of CPU speed gives more than a doubling of OLTP throughput.]

[Figure: Relative Code Execution Speed. Bar chart comparing CLX 800 (CISC), VLX (CISC), Cyclone (superscalar CISC), Cyclone/R Interpreted, Cyclone/R Accelerated StmtDebug, Cyclone/R Accelerated Default, and Cyclone/R Accelerated Fast opts.]

Accelerated TNS code runs 5 to 8 times faster than interpreted code. The time spent in interpretive interludes is 1% or less. The Statement Debug option slows down code by 1 to 16%.

Using the Accelerator, Cyclone/R performs 2 to 4 times faster than its contemporary CISC of similar size (CLX 800). In CPU-intensive programs, it gives 70-100% of the performance of Cyclone, with just one fourth the components.

The speeds given above are the speeds seen by real users of these machines, but much of the difference is due to different clock rates and circuit technology. The designers of a new machine would instead want to know how cycle efficient the alternatives would be. If hypothetical new CISC and RISC engines had identical cache cycle times, how would they compare in terms of cycles executed? To explore this, the following table rescales the previous table. It gives the work accomplished per cycle by each machine or mode, relative to CLX 800. (Bigger is better.)

                               Program
                    Dhrystone
Machine           16-bit  32-bit    TAL   Axcel    ET1
CLX 800            1.00    1.00    1.00    1.00   1.00
VLX                1.71    1.67    1.68    1.68   1.60
Cyclone            2.96    3.24    2.71    2.96   2.70
Cyclone/R
  Interpreted      0.33    0.34    0.37    0.32    n/a
  A-StmtDebug      1.72    2.04    1.59    2.25    n/a
  A-Default        1.75    2.07    1.66    2.67    n/a
  A-Fast opts      1.91    2.25    1.79    2.85   1.39
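The rescaling can be sketched as follows. This assumes that each relative speed is simply multiplied by the ratio of the CLX 800 clock rate to the measured machine's clock rate, using the rates listed in the machine descriptions; that reading reproduces most table entries to within rounding, but it is an inference rather than a stated procedure, and the names below are hypothetical.

```python
# Cycle rates in MHz, taken from the machine descriptions above.
CLOCK_MHZ = {
    "CLX 800": 16.5,
    "VLX": 12.0,
    "Cyclone": 22.3,
    "Cyclone/R": 25.0,
}


def work_per_cycle(relative_speed, machine, baseline="CLX 800"):
    """Rescale a speed relative to the baseline into work per cycle.

    Assumed formula: relative_speed * (baseline clock / machine clock).
    """
    return relative_speed * CLOCK_MHZ[baseline] / CLOCK_MHZ[machine]


if __name__ == "__main__":
    # 16-bit Dhrystone speeds from the previous table.
    for machine, speed in [("VLX", 1.24), ("Cyclone", 4.01), ("Cyclone/R", 2.89)]:
        print(f"{machine}: {work_per_cycle(speed, machine):.2f}")
```

Under this assumption the Cyclone/R Fast opts result for 16-bit Dhrystone, 2.89 times the CLX 800 speed at 25 MHz, corresponds to roughly 1.9 times the work per cycle.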

Accelerated CISC code on this generic RISC engine is roughly as cycle efficient as the CISC engine organized for peak rates of one cycle/instruction. The superscalar CISC machine is better, but superscalar or superpipelined RISC machines would be more practical at higher clock rates.

Size Measurements

The biggest potential concern with switching to RISC code is the expansion in code size. The following table gives the number of RISC instructions generated inline for each CISC instruction, for each Accelerator option. (Lower is better.)

                               Program
                    Dhrystone
Accel Option      16-bit  32-bit    TAL   Axcel    ET1
A-StmtDebug        2.46    2.02    1.72    1.88    n/a
A-Default          2.28    1.84    1.62    1.78    n/a
A-Fast opts        1.93    1.60    1.48    1.72   1.30

The Statement Debug option expands code by 6 to 15%.

MIPS RISC instructions are twice the size of TNS CISC instructions. Accelerated programs also frequently access the PMap, whose size is 75% as big as the original CISC code. The running program's increased need for RAM or cache storage for "code" is therefore (2i + 0.75) times the original code size, where i is the instruction count expansion. The following table gives this dynamic size expansion:

                               Program
                    Dhrystone
Accel Option      16-bit  32-bit    TAL   Axcel    ET1
A-StmtDebug         5.7     4.8     4.2     4.5    n/a
A-Default           5.3     4.4     4.0     4.3    n/a
A-Fast opts         4.6     4.0     3.7     4.2    3.3
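The (2i + 0.75) relation can be checked against the instruction count table with a short sketch. The function and parameter names are hypothetical; the factor of 2 and the 0.75 PMap fraction are the figures stated above.

```python
def dynamic_code_expansion(i, instr_size_ratio=2.0, pmap_fraction=0.75):
    """Dynamic "code" size relative to the original CISC code size.

    i                 RISC instructions generated per CISC instruction
    instr_size_ratio  size of a RISC instruction relative to a CISC one
    pmap_fraction     PMap size as a fraction of the original CISC code
    """
    return instr_size_ratio * i + pmap_fraction


if __name__ == "__main__":
    # 16-bit Dhrystone instruction count expansions from the earlier table.
    for option, i in [("A-StmtDebug", 2.46), ("A-Default", 2.28), ("A-Fast opts", 1.93)]:
        print(f"{option}: {dynamic_code_expansion(i):.2f}")
```

For example, the Statement Debug expansion of 2.46 instructions per CISC instruction gives 2 × 2.46 + 0.75 = 5.67, which the table reports as 5.7.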

[Figure: Relative Cycle Efficiency. Bar chart (vertical axis 0.0 to 3.5) comparing CLX 800 (CISC), VLX (CISC), Cyclone (superscalar CISC), Cyclone/R Interpreted, Cyclone/R Accelerated StmtDebug, Cyclone/R Accelerated Default, and Cyclone/R Accelerated Fast opts.]

Data memory usage is unchanged by acceleration, and most TNS applications use much more data memory than code. A program's overall memory increase is therefore less than indicated above. To accommodate this expected amount of code and table expansion, the caches on Cyclone/R were made very large: 256 Kbytes each of code and data. This is 4 times more than Cyclone and most multi-user R3000 UNIX systems.

The accelerated codefile holds a complete image of the original CISC code segments. The static size expansion of the code portion of accelerated codefiles is therefore 1.0 greater than the numbers above. If an accelerated program uses the Interpreter in many places or uses lots of scattered clumps of inline data tables, then this +1.0 effect may also show up in RAM usage.

Conclusions

Object code translation is a very effective tool for bridging computer architectures. The combination of optimized object code translation before execution and occasional run-time interpretation gives excellent performance and excellent reliability, with acceptable code expansion. Translator optimizations can preserve full debugging.

The Accelerator was fundamental to the success of Tandem's evolution from CISC to RISC processor architecture.

Translation will be widely used in the computer industry to solve migration or porting problems.

Acknowledgments

Special thanks to Hank Maurer, who designed and implemented major parts of the Accelerator.

References

[Alpha] Rich Witek, Alpha panel session of COMPCON 92, February 1992.

[Apple] Ron Hochsprung, PowerPC panel session of COMPCON 92, February 1992.

[BYTE] Kenneth Sheldon, Owen Linderholm, and Trevor Marshall, "The Future of Personal Computing?", BYTE, February 1992.

[HP3000] Arndt Bergh, Keith Keilman, Daniel Magenheimer, and James Miller, "HP3000 Emulation on HP Precision Architecture Computers," Hewlett-Packard Journal, December 1987.

[Hunter] Colin Hunter and John Banning, "DOS at RISC," BYTE, November 1989.

[IBM] Cathy May, "MIMIC: A Fast System/370 Simulator," SIGPLAN '87 Symposium on Interpreters and Interpretive Techniques, June 1987.

[Mahler] David Wall and Michael Powell, "The Mahler Experience: Using an intermediate language as the machine description," Second International Symposium on Architectural Support for Programming Languages and Operating Systems, October 1987.

[Moxie] Fred Chow, Mark Himelstein, Earl Killian, and Larry Weber, "Engineering a RISC Compiler System," Proceedings of IEEE Compcon, March 1986.

[pixie] MIPS Computer Systems, "RISCompiler and C Programmer's Guide," 1989.

[Purify] Reed Hastings and Bob Joyce, "Purify: Fast detection of memory leaks and access errors," Proceedings of the Winter 1992 USENIX Conference, January 1992.

[PostCompiler] David Wall, "Post-Compiler Code Transformation," tutorial at SIGPLAN '92 Conference on Programming Language Design and Implementation, June 1992.

[postload] S. C. Johnson, "Postloading for Fun and Profit," 1990 Winter USENIX Conference.

[TNS] Daniel Siewiorek and Robert Swarz, editors, "Reliable Computer Systems: Design and Evaluation", Second edition, chapter 8, Digital Press, 1992.

[VEST] Digital Equipment Corporation, "VEST User's Guide," 1992.
