Index [booksite.elsevier.com] · Index Note: Online information is listed by chapter and section number followed by page numbers (OL3.11-7).Page references preceded by a single letter

Index

Note: Online information is listed by chapter and section number followed by page numbers (OL3.11-7). Page references preceded by a single letter with hyphen refer to appendices.

1-bit ALU, A-614–617. See also Arithmetic logic unit (ALU)

adder, A-615CarryOut, A-616for most signifi cant bit, A-621illustrated, A-617logical unit for AND/OR, A-615performing AND, OR, and addition,

A-619, A-62164-bit ALU, A-617–626. See also

Arithmetic logic unit (ALU)defi ning in Verilog, A-623–626from 31 copies of 1-bit ALU, A-622illustrated, A-624ripple carry adder, A-617tailoring to MIPS, A-619–623with 32 1-bit ALUs, A-618

7090/7094 hardware, OL3.12-7

A

AArch32, 73AArch64, 73Absolute references, 131Abstractions

hardware/soft ware interface, 22principle, 22to simplify design, 11

Accumulator architectures, OL2.22-2Acronyms, 9Active matrix, 18ADD (add), 64ADDI (add immediate), 64ADDIS (add immediate and set fl ags), 64Addition, 188–191. See also Arithmetic

binary, 188–189fl oating-point, 212–215, 220operands, 189signifi cands, 211speed, 191

Address interleaving, 395Address select logic, C-24, C-25Address space, 442, 445

extending, 493fl at, 493ID (ASID), 460inadequate, OL5.17-6shared, 533–534single physical, 533–534virtual, 460

Address translationfor ARM cortex-A8, 483defi ned, 443fast, 452–454for Intel core i7, 483TLB for, 452–454

Address-control lines, C-26Addresses

base, 69byte, 70defi ned, 69memory, 79virtual, 442, 462, 463

Addressingbase, 120in branches, 117–120displacement, 120immediate, 120LEGv8 modes, 120–121PC-relative, 118, 120register, 120x86 modes, 158

Addressing modesdesktop architectures, D-6

ADDS (add and set fl ags), 64, 164addu (Add Unsigned), 64Advanced Encryption Standard (AES)

encryption, 488Advanced Vector Extensions (AVX), 232,

240AGP, B-9Algol-60, OL2.22-7Aliasing, 458, 459Alignment restriction, 71All-pairs N-body algorithm, B-65Alpha architecture

bit count instructions, D-29fl oating-point instructions, D-28instructions, D-27–29no divide, D-28PAL code, D-28unaligned load-store, D-28VAX fl oating-point formats, D-29

ALU control, 271–273. See also Arithmetic logic unit (ALU)

bits, 272logic, C-6mapping to gates, C-4–7truth tables, C-5

ALU control block, 275defi ned, C-4generating ALU control bits, C-6

ALUOp, 272, C-6bits, 272, 273control signal, 275

Amazon Web Services (AWS), 439AMD Opteron X4 (Barcelona), 559, 560AMD64, 155, 156, 232, OL2.22-6Amdahl’s law, 415, 519

corollary, 49defi ned, 49fallacy, 572

and (AND), 64AND gates, A-600, C-7AND operation, 91AND operation, A-594andi (And Immediate), 65Annual failure rate (AFR), 432

versus MTTF of disks, 433–434Antidependence, 348Antifuse, A-666Apple computer, OL1.12-7Apple iPad 2 A1395, 20

logic board of, 20processor integrated circuit of, 21

Application binary interface (ABI), 22Application programming interfaces (APIs)

defi ned, B-4graphics, B-14

I-1

I-2 Index

Architectural registers, 358Arithmetic, 186–248

addition, 188–191addition and subtraction, 188–191division, 197–204fallacies and pitfalls, 242–245fl oating-point, 205–230historical perspective, 248multiplication, 191–197parallelism and, 230–232Streaming SIMD Extensions and

advanced vector extensions in x86, 232–233

subtraction, 188–191subword parallelism, 230–232subword parallelism and matrix

multiply, 238–242Arithmetic instructions. See also

Instructionsdesktop RISC, D-11embedded RISC, D-14logical, 263operands, 67–74

Arithmetic intensity, 557Arithmetic logic unit (ALU). See also

ALU control; Control units1-bit, A-614–61764-bit, A-617–626before forwarding, 321branch datapath, 266hardware, 190memory-reference instruction use, 257for register values, 264R-format operations, 265signed-immediate input, 323

ARM Cortex-A53, 256, 355–358address translation for, 483caches in, 484data cache miss rates for, 485memory hierarchies of, 482performance of, 485–488specifi cation, 356TLB hardware for, 483

ARM instructions, 152–15412-bit immediate fi eld, 153brief history, OL2.22-5condition fi eld, 334unique, D-36–37

ARMv7, 62ARMv8, 62, 163–169

common features between MIPS and, 152

ARPANET, OL1.12-10

Arrays, 429logic elements, A-606–607multiple dimension, 226pointers versus, 146–150procedures for setting to zero, 147

ASCIIbinary numbers versus, 111character representation, 110defi ned, 110symbols, 113

Assemblers, 129–131defi ned, 14function, 129microcode, C-30number acceptance, 130object fi le, 130

Assembly language, 15defi ned, 14, 129fl oating-point, 221illustrated, 15LEGv8, 64, 86programs, 129translating into machine language,

86Asserted signals, 262, A-592Associativity

in caches, 419degree, increasing, 418, 466increasing, 423set, tag size versus, 423

Atomic compare and swap, 127Atomic exchange, 126Atomic fetch-and-increment, 127Atomic memory operation, B-21Attribute interpolation, B-43–44Automobiles, computer application in, 4Average memory access time (AMAT),

416calculating, 416

B

Bandwidth, 30bisection, 551external to DRAM, 412memory, 394–395, 412network, 549

Barrier synchronization, B-18defi ned, B-20for thread communication, B-34

Base addressing, 69, 120Base registers, 70Basic block, 96

Benchmarks, 554–556defi ned, 46Linpack, 554, OL3.12-4multicores, 538–545multiprocessor, 554–556NAS parallel, 556parallel, 555PARSEC suite, 556SPEC CPU, 46–48SPEC power, 48–49SPECrate, 554Stream, 564

Biased notation, 82, 209Big-endian byte order, 70Binary numbers, 83

ASCII versus, 111conversion to decimal numbers, 78defi ned, 75

Bisection bandwidth, 551Bit maps

defi ned, 18, 73goal, 18storing, 18

Bit-Interleaved Parity (RAID 3), OL5.11-5

BitsALUOp, 272, 273defi ned, 14dirty, 452guard, 228patterns, 228–229reference, 450rounding, 228sign, 77state, C-8sticky, 228valid, 397

Blocking assignment, A-612Blocking factor, 428Block-Interleaved Parity (RAID 4),

OL5.11-5–5.11-6Blocks

combinational, A-592defi ned, 390fi nding, 467fl exible placement, 416–418least recently used (LRU), 423locating in cache, 421–422miss rate and, 405multiword, mapping addresses to, 404placement locations, 466placement strategies, 418replacement selection, 423

Index I-3

replacement strategies, 468spatial locality exploitation, 405state, A-592valid data, 400

Bonding, 28Boolean algebra, A-594Bounds check shortcut, 98Branch address, 168Branch datapath

ALU, 266operations, 266

Branch delay slotsBranch instructions

pipeline impact, 329Branch not taken

assumption, 328–329defi ned, 266

Branch predictionas control hazard solution, 295buff ers, 331, 333defi ned, 294dynamic, 295, 331–334static, 345

Branch predictorsaccuracy, 333correlation, 333information from, 333tournament, 334

Branch register, 168Branch table, 169Branch taken

cost reduction, 330defi ned, 266

Branch targetaddresses, 266buff ers, 333

Branches. See also Conditional branchesaddressing in, 117–120compiler creation, 94decision, moving up, 330delayed, 295, 330–331, 295ending, 96execution in ID stage, 330pipelined, 330target address, 330unconditional, 318

Branch-on-zero instruction, 280B-type instruction format, 113Bubble Sort, 145Bubbles, 326Bus-based coherent multiprocessors,

OL6.15-7Buses, A-607

Bytesaddressing, 70order, 70

C

C.mmp, OL6.15-4C language

assignment, compiling into LEGv8, 66compiling, 150, OL2.15-2–2.15-3compiling assignment with registers,

68compiling while loops in, 95–96sort algorithms, 146translation hierarchy, 128translation to LEGv8 assembly

language, 66variables, 106

C++ language, OL2.15-27, OL2.22-8Cache blocking and matrix multiply,

489–490Cache coherence, 477–481

coherence, 477consistency, 477enforcement schemes, 479implementation techniques, OL5.12-

5–5.12-12migration, 479problem, 477, 478, 481protocol example, OL5.12-12–5.12-16protocols, 479replication, 479snooping protocol, 479–481snoopy, OL5.12-16–5.12-17state diagram, OL5.12-16

Cache coherency protocol, OL5.12-12–5.12-16

fi nite-state transition diagram, OL5.12-15

functioning, OL5.12-14mechanism, OL5.12-14state diagram, OL5.12-16states, OL5.12-13write-back cache, OL5.12-15

Cache controllers, 482coherent cache implementation

techniques, OL5.12-5–5.12-12implementing, OL5.12-2snoopy cache coherence, OL5.12-

16–5.12-17SystemVerilog, OL5.12-2

Cache hits, 458Cache misses

block replacement on, 468capacity, 470, 471compulsory, 470confl ict, 470defi ned, 406direct-mapped cache, 418fully associative cache, 420handling, 406–407memory-stall clock cycles, 413reducing with fl exible block placement,

416–418set-associative cache, 419steps, 407in write-through cache, 407

Cache performance, 412–431calculating, 414hit time and, 415–416impact on processor performance, 414

Cache-aware instructions, 496Caches, 397–412. See also Blocks

accessing, 400–403in ARM cortex-A53, 484associativity in, 419–420bits in, 404bits needed for, 404contents illustration, 401defi ned, 21, 397–398direct-mapped, 398, 399, 404, 416empty, 400–401FSM for controlling, 472fully associative, 417GPU, B-38inconsistent, 407index, 402in Intel Core i7, 484Intrinsity FastMATH example,

409–412locating blocks in, 421–422locations, 399multilevel, 412, 424nonblocking, 483physically addressed, 458, 459physically indexed, 458physically tagged, 458primary, 424, 431secondary, 424, 431set-associative, 417simulating, 491size, 403split, 411summary, 411–412tag fi eld, 402tags, OL5.12-3, OL5.12-11

I-4 Index

virtual memory and TLB integration, 457–459

virtually addressed, 458virtually indexed, 458virtually tagged, 458write-back, 408, 409, 469write-through, 407, 409, 469writes, 407–409

Callee, 101, 103Caller, 101Capabilities, OL5.17-8Capacity misses, 470Carry lookahead, A-626–635

4-bit ALUs using, A-633adder, A-627fast, with fi rst level of abstraction,

A-627–628fast, with “infi nite” hardware,

A-626–627fast, with second level of abstraction,

A-628–634plumbing analogy, A-630, A-631ripple carry speed versus, A-634summary, A-634–635

Carry save adders, 197Cause registerCDC 6600, OL1.12-7, OL4.16-3Cell phones, 7Central processor unit (CPU). See also

Processorsclassic performance equation,

36–40defi ned, 19execution time, 32, 33–34performance, 33–35system, time, 32time, 413time measurements, 33–34user, time, 32

Cg pixel shader program, B-15–17Characters

ASCII representation, 110in Java, 113

Chips, 19, 25, 26manufacturing process, 26

Classesdefi ned, OL2.15-15packages, OL2.15-21

Clear exclusive instruction (CLREX), 488Clock cycles

defi ned, 33

memory-stall, 413number of registers and, 67worst-case delay and, 283

Clock cycles per instruction (CPI), 35, 293

one level of caching, 424two levels of caching, 424

Clock ratedefi ned, 33frequency switched as function of, 41power and, 40

Clocking methodology, 261–263, A-636edge-triggered, 261, A-636, A-661level-sensitive, A-662, A-663–664for predictability, 261

Clocks, A-636–638edge, A-636, A-638in edge-triggered design, A-661skew, A-662specifi cation, A-645synchronous system, A-636–637

Cloud computing, 549defi ned, 7

Cluster networking, 553–554, OL6.9-12Clusters, OL6.15-8–6.15-9

defi ned, 516, 546, OL6.15-8isolation, 547organization, 515scientifi c computing on, OL6.15-8

Cm*, OL6.15-4CMOS (complementary metal oxide

semiconductor), 41Coarse-grained multithreading, 530Cobol, OL2.22-7Code generation, OL2.15-13Code motion, OL2.15-7Cold-start miss, 470Collision misses, 470Column major order, 427Combinational blocks, A-592Combinational control units, C-4–8Combinational elements, 260Combinational logic, 261, A-591,

A-597–608arrays, A-606–607decoders, A-597defi ned, A-593don’t cares, A-605–606multiplexors, A-598ROMs, A-602–604two-level, A-599–602Verilog, A-611–14

Commercial computer development, OL1.12-4–1.12-10

Commit unitsbuff er, 350defi ned, 350in update control, 355

Common case fast, 11Common subexpression elimination,

OL2.15-6Communication, 23–24

overhead, reducing, 44–45thread, B-34

Compact code, OL2.22-4Compare and branch zero, 330Comparisons

constant operands in, 73signed versus unsigned, 97

Compilers, 129branch creation, 95brief history, OL2.22-8–2.22-9conservative, OL2.15-7defi ned, 14front end, OL2.15-3function, 14, 129high-level optimizations,

OL2.15-4ILP exploitation, OL4.16-5Just In Time (JIT), 137optimization, 146, OL2.22-9speculation, 344–345structure, OL2.15-2

CompilingC assignment statements, 66C language, 95, 150, OL2.15-2–2.15-3fl oating-point programs, 222–225if-then-else, 94in Java, OL2.15-19procedures, 102, 104–105recursive procedures, 104–105while loops, 95–96

Compressed sparse row (CSR) matrix, B-55, B-56

Compulsory misses, 470, 471Computer architects, 11–12

abstraction to simplify design, 11common case fast, 11dependability via redundancy, 12hierarchy of memories, 12Moore’s law, 11parallelism, 12pipelining, 12prediction, 12

Caches (Continued)

Index I-5

Computersapplication classes, traditional, 5–6applications, 4arithmetic for, 186–248characteristics, OL1.12-12commercial development, OL1.12-

4–1.12-10component organization, 17components, 17, 177design measure, 53desktop, 5embedded, 5fi rst, OL1.12-2–1.12-4in information revolution, 4instruction representation, 82–89performance measurement,

OL1.12-10post-PC era, 6–7principles, 86servers, 5

Condition codes/fl ags, 97Conditional branches

changing program counter with, 333compiling if-then-else into, 94defi ned, 93desktop RISC, D-16embedded RISC, D-16implementation, 99in loops, 119PA-RISC, D-34, D-35PC-relative addressing, 118RISC, D-10–16SPARC, D-10–12

Conditional move instructions, 334Confl ict misses, 470Constant memory, B-40Constant operands, 73–74

frequent occurrence, 73Content Addressable Memory (CAM), 422Context switch, 460Control

ALU, 271–273challenge, 336fi nishing, 281forwarding, 320FSM, C-8–21implementation, optimizing, C-27–28mapping to hardware, C-2–32memory, C-26organizing, to reduce logic, C-31–32pipelined, 311–315

Control fl ow graphs, OL2.15-9–2.15-10illustrated examples, OL2.15-9, OL2.15-10

Control functionsALU, mapping to gates, C-4–7defi ning, 276PLA, implementation, C-7,

C-20–21ROM, encoding, C-18–19for single-cycle implementation, 281

Control hazards, 292–295, 328–329branch delay reduction, 330branch not taken assumption, 328branch prediction as solution, 295delayed decision approach, 295dynamic branch prediction, 331logic implementation in Verilog,

OL4.13-8pipeline stalls as solution, 293pipeline summary, 335–336simplicity, 328solutions, 293static multiple-issue processors and,

345–346Control lines

asserted, 276in datapath, 275execution/address calculation, 312fi nal three stages, 314instruction decode/register fi le read,

312instruction fetch, 312memory access, 312setting of, 276values, 312write-back, 312

Control signalsALUOp, 275defi ned, 262eff ect of, 276multi-bit, 276pipelined datapaths with, 311–315truth tables, C-14

Control units, 259. See also Arithmetic logic unit (ALU)

address select logic, C-24, C-25combinational, implementing, C-4–8with explicit counter, C-23illustrated, 277logic equations, C-11main, designing, 273–276as microcode, C-28MIPS, C-10next-state outputs, C-10, C-12–13output, 271–273, C-10

Cooperative thread arrays (CTAs), B-30

Coprocessorsdefi ned, 226

Core LEGv8 instruction set, 248. See also MIPS

abstract view, 258desktop RISC, D-9–11implementation, 256–260implementation illustration, 259overview, 257–260subset, 256

Coresdefi ned, 43number per chip, 43

Correlation predictor, 333Cosmic Cube, OL6.15-7CPU, 9Cray computers, OL3.12-5–3.12-6Critical word fi rst, 406Crossbar networks, 551CTSS (Compatible Time-Sharing

System), OL5.18-9CUDA programming environment, 539,

B-5barrier synchronization, B-18, B-34development, B-17, B-18hierarchy of thread groups, B-18kernels, B-19, B-24key abstractions, B-18paradigm, B-19–23parallel plus-scan template, B-61per-block shared memory, B-58plus-reduction implementation, B-63programs, B-6, B-24scalable parallel programming with,

B-17–23shared memories, B-18threads, B-36

Cyclic redundancy check, 437Cylinder, 396

D

D fl ip-fl ops, A-639, A-641D latches, A-639, A-640Data bits, 435Data fl ow analysis, OL2.15-11Data hazards, 289–292, 316–328.

See also Hazardsforwarding, 289, 316–328load-use, 290, 329stalls and, 324–328

Data parallel problem decomposition, B-17, B-18

I-6 Index

Data race, 125Data selectors, 258Data transfer instructions.

See also Instructionsdefi ned, 68, 69load, 69off set, 70store, 71–72

Datacenters, 7Data-level parallelism, 524Datapath elements

defi ned, 263sharing, 268

Datapathsbranch, 266building, 263–271control signal truth tables, C-14control unit, 277defi ned, 19design, 263exception handling, 339for fetching instructions, 265for hazard resolution via forwarding, 323for LEGv8 architecture, 269for memory instructions, 267in operation for branch-on-zero

instruction, 280in operation for load instruction, 279in operation for R-type instruction,

277, 278operation of, 276–280pipelined, 297–315for R-type instructions, 278,

276–277single, creating, 267single-cycle, 296static two-issue, 347

Deasserted signals, 262, A-592DEC PDP-8, OL2.22-3Decimal numbers

binary number conversion to, 78defi ned, 75

Decision-making instructions, 93–99Decoders, A-597

two-level, A-653Decoding machine language, 121–125Defect, 26Delayed branches, 295. See also Branches

as control hazard solution, 295embedded RISCs and, D-23for fi ve-stage pipelines, 323–324reducing, 330

Delayed decision, 295DeMorgan’s theorems, A-599

Denormalized numbers, 230Dependability via redundancy, 12Dependable memory hierarchy, 432–437

failure, defi ning, 432Dependences

between pipeline registers, 319between pipeline registers and ALU

inputs, 319bubble insertion and, 326detection, 318name, 348sequence, 316

Designcompromises and, 85datapath, 263digital, 366logic, 260–263, B-1–79main control unit, 273–276memory hierarchy, challenges, 472pipelining instruction sets, 288

Desktop and server RISCs. See also Reduced instruction set computer (RISC) architectures

addressing modes, D-6architecture summary, D-4arithmetic/logical instructions, D-11conditional branches, D-16constant extension summary, D-9control instructions, D-11conventions equivalent to MIPS core,

D-12data transfer instructions, D-10features added to, D-45fl oating-point instructions, D-12instruction formats, D-7multimedia extensions, D-16–18multimedia support, D-18types of, D-3

Desktop computers, defi ned, 5Device driver, OL6.9-5DGEMM (Double precision General

Matrix Multiply), 238, 363, 365, 427, 553

cache blocked version of, 429optimized C version of, 241, 363, 490performance, 365, 430

Dicing, 27Dies, 26, 26–27Digital design pipeline, 366Digital signal-processing (DSP)

extensions, D-19DIMMs (dual inline memory modules),

OL5.17-5Direct Data IO (DDIO), OL6.9-6

Direct memory access (DMA), OL6.9-4Direct3D, B-13Direct-mapped caches. See also Caches

address portions, 421choice of, 422defi ned, 398, 416illustrated, 399memory block location, 417misses, 419single comparator, 421total number of bits, 404

Dirty bit, 452Dirty pages,Disk memory, 395–397Displacement addressing, 120Distributed Block-Interleaved Parity

(RAID 5), OL5.11-6Divide algorithm, 200Dividend, 198Division, 197–203

algorithm, 199dividend, 198divisor, 198

Divisor, 198divu (Divide Unsigned). See also

Arithmeticfaster, 202–203fl oating-point, 220hardware, 198–201hardware, improved version, 201in LEGv8, 203operands, 198quotient, 198remainder, 198signed, 201–202SRT, 203

Don’t cares, A-605–606example, A-605–606term, 273

Double data rate (DDR), 393Double Data Rate (DDR) SDRAM,

393–394, A-653Double precision. See also Single

precisiondefi ned, 207FMA, B-45–46GPU, B-45–46, B-74representation, 206–207

Doubleword, 66, 158Dual inline memory modules (DIMMs), 395Dynamic branch prediction, 331–334.

See also Control hazardsbranch prediction buff er, 331loops and, 333

Index I-7

Dynamic hardware predictors, 295Dynamic multiple-issue processors, 343,

349–352. See also Multiple issuepipeline scheduling, 350–352superscalar, 349

Dynamic pipeline scheduling, 350–352commit unit, 350concept, 350hardware-based speculation, 352primary units, 351reorder buff er, 355reservation station, 350

Dynamic random access memory (DRAM), 392, 393–395, A-651–653

bandwidth external to, 412cost, 23defi ned, 19, A-651DIMM, OL5.17-5Double Date Rate (DDR), 393–394early board, OL5.17-4GPU, B-37–38growth of capacity, 25history, OL5.17-2internal organization of, 394pass transistor, A-651SIMM, OL5.17-5, OL5.17-6single-transistor, A-652size, 412speed, 23synchronous (SDRAM), 393–394,

A-648, A-653two-level decoder, A-653

Dynamically linked libraries (DLLs), 134–136

defi ned, 134lazy procedure linkage version, 135

E

Early restart, 406Edge-triggered clocking methodology,

261, 262, A-636, A-661advantage, A-637clocks, A-661drawbacks, A-662illustrated, A-638rising edge/falling edge, A-636

EDSAC (Electronic Delay Storage Automatic Calculator), OL1.12-3, OL5.17-2

Eispack, OL3.12-4Electrically erasable programmable

read-only memory (EEPROM), 395

Elementscombinational, 260datapath, 263, 268memory, A-638–646state, 260, 262, 264, A-636, A-638

Embedded computers, 5application requirements, 6design, 5growth, OL1.12-12–1.12-13

Embedded Microprocessor Benchmark Consortium (EEMBC), OL1.12-12

Embedded RISCs. See also Reduced instruction set computer (RISC) architectures

addressing modes, D-6architecture summary, D-4arithmetic/logical instructions, D-14conditional branches, D-16constant extension summary, D-9control instructions, D-15data transfer instructions, D-13delayed branch and, D-23DSP extensions, D-19general purpose registers, D-5instruction conventions, D-15instruction formats, D-8multiply-accumulate approaches, D-19types of, D-4

Encodingdefi ned, C-31LEGv8 instruction, 86, 122ROM control function, C-18–19ROM logic function, A-603x86 instruction, 161–162

ENIAC (Electronic Numerical Integrator and Calculator), OL1.12-2, OL1.12-3, OL5.17-2

EPIC, OL4.16-5Error correction, A-653–655Error Detecting and Correcting Code

(RAID 2), OL5.11-5Error detection, A-654Error detection code, 434Ethernet, 23EX stage

load instructions, 303overfl ow exception detection, 338, 341store instructions, 305

Exabyte, 6Exception enable, 461Exception link register (ELR), 337, 459, 461

address capture, 340defi ned, 338in restart determination, 337

Exception program counters (EPCs), 326address capture, 331copying, 181defi ned, 181, 327in restart determination, 326–327transferring, 182

Exception Syndrome Register (ESR), 337, 461

Exceptions, 336–342association, 342datapath with controls for handling,

339defi ned, 207, 336detecting, 336event types and, 336imprecise, 342interrupts versus, 336in LEGv8 architecture, 337–338overfl ow, 339pipelined computer example, 339in pipelined implementation, 338–342precise, 342reasons for, 337–338result due to overfl ow in add

instruction, 341saving/restoring stage on, 462

Executable fi lesdefi ned, 131

Execute or address calculation stage, 303Execute/address calculation

control line, 312load instruction, 303store instruction, 303

Execution timeas valid performance measure, 51CPU, 32, 33–34pipelining and, 297

Explicit counters, C-23, C-26Exponents, 206Extended-register instructions, 164

F

Failures, synchronizer, A-665Fallacies. See also Pitfalls

add immediate unsigned, 227Amdahl’s law, 572arithmetic, 242–245assembly language for performance,

169commercial binary compatibility

importance, 170defi ned, 49GPUs, B-72–74, B-75

I-8 Index

low utilization uses little power, 50peak performance, 572pipelining, 366powerful instructions mean higher

performance, 169right shift , 242

False sharing, 480Fast carry

with “infi nite” hardware, A-626–627with fi rst level of abstraction,

A-627–628with second level of abstraction,

A-628–634Fast Fourier Transforms (FFT), B-53Fault avoidance, 433Fault forecasting, 433Fault tolerance, 433Fermi architecture, 539, 568Field programmable devices (FPDs),

A-666Field programmable gate arrays (FPGAs),

A-666Fields

defi ned, 84format, C-31LEGv8, 84–86names, 84

Files, register, 264, 269, A-638, A-642–644

Fine-grained multithreading, 530Finite-state machines (FSMs), 472–477,

A-655–660control, C-8–22controllers, 475for multicycle control, C-9for simple cache controller, 476–477implementation, 474, A-658Mealy, 475Moore, 475next-state function, 474, A-655output function, A-655, A-657state assignment, A-658state register implementation, A-659style of, 475synchronous, A-655SystemVerilog, OL5.12-7traffi c light example, A-656–658

Flash memory, 395characteristics, 23defi ned, 23

Flat address space, 493

Flip-fl opsD fl ip-fl ops, A-639, A-641defi ned, A-639

Floating point, 205–230, 232assembly language, 221backward step, OL3.12-4–3.12-5binary to decimal conversion, 211branch, 220challenges, 246diversity versus portability,

OL3.12-3–3.12-4division, 220fi rst dispute, OL3.12-2–3.12-3form, 206fused multiply add, 228guard digits, 226–227history, OL3.12-3IEEE 754 standard, 207–211intermediate calculations, 226LEGv8 instruction frequency for, 248LEGv8 instructions, 220–226machine language, 221operands, 221overfl ow, 206packed format, 232precision, 243procedure with two-dimensional

matrices, 223–225programs, compiling, 222–225registers, 226representation, 206–211rounding, 226sign and magnitude, 206SSE2 architecture, 232, 233subtraction, 220underfl ow, 206units, 227in x86, 233

Floating vectors, OL3.12-3Floating-point addition, 212–215

arithmetic unit block diagram, 216binary, 213illustrated, 214instructions, 220steps, 212

Floating-point arithmetic (GPUs), B-41–46basic, B-42double precision, B-45–46, B-74performance, B-44specialized, B-42–44supported formats, B-42texture operations, B-44

Floating-point instructionsdesktop RISC, D-12SPARC, D-31

Floating-point multiplication, 215–219binary, 219illustrated, 218instructions, 220signifi cands, 215steps, 215, 217

Flow-sensitive information, OL2.15-15

Flushing instructions, 329, 330exceptions and, 340

For loops, 147, OL2.15-26inner, OL2.15-24SIMD and, OL6.15-2

Format fi elds, C-31Fortran, OL2.22-7Forwarding, 316–328

ALU before, 321control, 320datapath for hazard resolution, 323defi ned, 289functioning, 317graphical representation, 290illustrations, OL4.13-26multiple results and, 292multiplexors, 322pipeline registers before, 321with two instructions, 289–290Verilog implementation,

OL4.13-2–4.13-4Fractions, 206, 207Frame buff er, 18Frame pointers, 106Front end, OL2.15-3Fully associative caches. See also Caches

block replacement strategies, 468–469

choice of, 422defi ned, 417memory block location, 417misses, 420

Fully connected networks, 551Fused-multiply-add (FMA) operation,

228, B-45–46

G

Galois/Counter Mode ( GCM ) encryption, 488

Game consoles, B-9

Fallacies (Continued)

Index I-9

Gates, A-591, A-596AND, A-600, C-7delays, A-634–635mapping ALU control function to,

C-4–7NAND, A-596NOR, A-596, A-638

Gather-scatter, 527, 568General Purpose GPUs (GPGPUs),

B-5General-purpose registers, 154

architectures, OL2.22-3embedded RISCs, D-5

Generatedefi ned, A-628example, A-632super, A-629

Gigabyte, 6Global common subexpression

elimination, OL2.15-6Global memory, B-21, B-39Global miss rates, 430Global optimization, OL2.15-5

code, OL2.15-7implementing, OL2.15-8–2.15-11

Global pointers, 106GPU computing. See also Graphics

processing units (GPUs)defi ned, B-5visual applications, B-6–7

GPU system architectures, B-7–12graphics logical pipeline, B-10heterogeneous, B-7–9implications for, B-24interfaces and drivers, B-9unifi ed, B-10–12

Graph coloring, OL2.15-12Graphics displays

computer hardware support, 18LCD, 18

Graphics logical pipeline, B-10Graphics processing units (GPUs),

538–543. See also GPU computingas accelerators, 538attribute interpolation, B-43–44defi ned, 46, 522, B-3evolution, B-5fallacies and pitfalls, B-72–75fl oating-point arithmetic, B-17,

B-41–46, B-74GeForce 8-series generation, B-5general computation, B-73–74

General Purpose (GPGPUs), B-5graphics mode, B-6graphics trends, B-4history, B-3–4logical graphics pipeline, B-13–14mapping applications to, B-55–72memory, 538multilevel caches and, 538N-body applications, B-65–72NVIDIA architecture, 539–541parallel memory system, B-36–41parallelism, 539, B-76performance doubling, B-4perspective, 543–545programming, B-12–24programming interfaces to, B-17real-time graphics, B-13summary, B-76

Graphics shader programs, B-14–15Gresham’s Law, 248, OL3.12-2Grid computing, 549Grids, B-19GTX 280, 564–569Guard digits

defi ned, 226rounding with, 227

H

Half precision, B-42Halfwords, 114Hamming, Richard, 434Hamming distance, 434Hamming Error Correction Code (ECC),

434–435calculating, 434–435

Hard disksaccess times, 23defi ned, 23

Hardwareas hierarchical layer, 13language of, 14–16operations, 63–67supporting procedures in, 100–110synthesis, A-609translating microprograms to, C-28–32virtualizable, 440

Hardware description languages. See also Verilog

defi ned, A-608using, A-608–614VHDL, A-608–609

Hardware multithreading, 530–533coarse-grained, 530options, 531simultaneous, 531

Hardware-based speculation, 352Harvard architecture, OL1.12-4Hazard detection units, 324

functions, 324pipeline connections for, 327

Hazards. See also Pipeliningcontrol, 292–293, 328–336data, 289, 316–328forwarding and, 323structural, 288–289, 305

Heapallocating space on, 107–110defi ned, 107

Heterogeneous systems, B-4–5architecture, B-7–9defi ned, B-3

Hexadecimal numbers, 83binary number conversion to,

83, 84Hierarchy of memories, 12High-level languages, 14–16

benefi ts, 16computer architectures, OL2.22-5importance, 16

High-level optimizations, OL2.15-4–2.15-5

Hit rate, 390Hit time

cache performance and, 415–416defi ned, 390

Hit under miss, 483Hold time, A-642Horizontal microcode, C-32Hot-swapping, OL5.11-7Human genome project, 4

I

I/O, OL6.9-2, OL6.9-3on system performance, OL5.11-2

I/O benchmarks.See BenchmarksIBM 360/85, OL5.17-7IBM 701, OL1.12-5IBM 7030, OL4.16-2IBM ALOG, OL3.12-7IBM Blue Gene, OL6.15-9–6.15-10IBM Personal Computer, OL1.12-7,

OL2.22-6

I-10 Index

IBM System/360 computers, OL1.12-6, OL3.12-6, OL4.16-2

IBM z/VM, OL5.17-8ID stage

branch execution in, 330, 331load instructions, 303store instruction in, 302

IEEE 754 fl oating-point standard, 207–211, 208, OL3.12-8–3.12-10. See also Floating point

fi rst chips, OL3.12-8–3.12-9in GPU arithmetic, B-42–43implementation, OL3.12-10rounding modes, 227today, OL3.12-10

If statements, 118I-format, 87If-then-else, 94Immediate addressing, 120Immediate instructions, 73Imprecise interrupts, 342, OL4.16-4Index-out-of-bounds check, 98Induction variable elimination, OL2.15-7Inheritance, OL2.15-15In-order commit, 351Input devices, 16Inputs, 273Instances, OL2.15-15Instruction count, 36, 38Instruction decode/register fi le read

stagecontrol line, 311–312load instruction, 300store instruction, 305

Instruction execution illustrations, OL4.13-16–4.13-17

clock cycle 9, OL4.13-24clock cycles 1 and 2, OL4.13-21clock cycles 3 and 4, OL4.13-22clock cycles 5 and 6, OL4.13-23clock cycles 7 and 8, OL4.13-24examples, OL4.13-20–4.13-25forwarding, OL4.13-26–4.13-31no hazard, OL4.13-17pipelines with stalls and forwarding,

OL4.13-26, OL4.13-20Instruction fetch stage


Instruction formats, 161ARMv7, 151

defi ned, 83desktop/server RISC architectures, D-7embedded RISC architectures, D-8I-type, 85LEGv8, 151MIPS, 151R-type, 85, 273x86, 161

Instruction latency, 367Instruction mix, 39, OL1.12-10Instruction set architecture

ARM, 152–154branch address calculation, 266defi ned, 22, 52history, 173–174maintaining, 52protection and, 441thread, B-31–34virtual machine support, 440–441

Instruction sets, B-49ARMv8, 171design for pipelining, 228LEGv8, 247MIPS-32, 151x86 growth, 170

Instruction-level parallelism (ILP), 365. See also Parallelism

compiler exploitation, OL4.16-5–4.16-6defi ned, 43, 344exploitation, increasing, 354and matrix multiply, 363–365

Instructions, 60–174, D-25–27, D-40–42. See also Arithmetic instructions; MIPS; Operands

add immediate, 73addition, 190Alpha, D-27–29arithmetic-logical, 263ARM, 152–154, D-36–37assembly, 66basic block, 96cache-aware, 496conditional branch, 93, 94conditional move, 334core, 246data transfer, 68decision-making, 93–99defi ned, 14, 62desktop RISC conventions, D-12as electronic signals, 82embedded RISC conventions, D-15encoding, 86

fetching, 265fi elds, 83fl oating-point (x86), 232, 233fl oating-point, 220–221fl ushing, 329, 330, 340immediate, 73introduction to, 62–63jumpleft -to-right fl ow, 298load, 69logical operations, 90–93M32R, D-40memory access, B-33–34memory-reference, 257multiplication, 197nop, 325–326PA-RISC, D-34–36performance, 35–36pipeline sequence, 325PowerPC, D-12–13, D-32–34PTX, B-31, B-32representation in computer, 82–89restartable, 462resuming,R-type, 263, 268SPARC, D-29–32store, 72store exclusive register (STXR), 126subtraction, 190SuperH, D-39–40thread, B-30–31Th umb, D-38vector, 524as words, 62x86, 154–159

Instructions per clock cycle (IPC), 343Integrated circuits (ICs), 19. See also

specifi c chipscost, 27defi ned, 25manufacturing process, 26very large-scale (VLSIs), 25

Intel Core i7, 46–49, 256, 517, 564–569address translation for, 483architectural registers, 358caches in, 484memory hierarchies of, 482–488microarchitecture, 358performance of, 485–486SPEC CPU benchmark, 46–48SPEC power benchmark, 48–49TLB hardware for, 483

Index I-11

Intel Core i7 920, 358–360microarchitecture, 358

Intel Core i7 960benchmarking and roofl ines of,

564–569Intel Core i7 Pipelines, 354, 358–360

memory components, 359performance, 361–362program performance, 362specifi cation, 356

Intel IA-64 architecture, OL2.22-3Intel Paragon, OL6.15-8Intel Th reading Building Blocks, B-60Intel x86 microprocessors

clock rate and power for, 40Interference graphs, OL2.15-12Interleaving, 412Interprocedural analysis, OL2.15-14Interrupt enable, 461Interrupt-driven I/O, OL6.9-4Interrupts

defi ned, 207, 336event types and, 336exceptions versus, 336imprecise, 342, OL4.16-4precise, 342vectored, 337

Intrinsity FastMATH processor, 409–412caches, 410data miss rates, 411, 421read processing, 456TLB, 454–457write-through processing, 456

Inverted page tables, 451Issue packets, 345

J

Javabytecode, 136bytecode architecture, OL2.15-17characters in, 113–115compiling in, OL2.15-19–2.15-20goals, 136interpreting, 136, 150, OL2.15-15keywords, OL2.15-21method invocation in, OL2.15-21pointers, OL2.15-26primitive types, OL2.15-26programs, starting, 136–137reference types, OL2.15-26sort algorithms, 146

strings in, 113–115translation hierarchy, 136while loop compilation in, OL2.15-

18–2.15-19Java Virtual Machine (JVM), 150,

OL2.15-16Jump instructions, 254, D-26

branch instruction versus, 270control and datapath for, 271implementing, 270instruction format, 270

Just In Time (JIT) compilers, 137, 576

K

Karnaugh maps, A-606Kernel mode, 459Kernels

CUDA, B-19, B-24defi ned, B-19

Kilobyte, 6

L

LAPACK, 243Large-scale multiprocessors, OL6.15-7,

OL6.15-9–6.15-10Latches

D latch, A-639, A-640defi ned, A-639

Latencyinstruction, 367memory, B-74–75pipeline, 297use, 346

LDUR (load register), 64LDURB (load byte), 64LDURH (load half), 64LDURSW (load signed word), 64LDXR (load exclusive register), 64, 122Leaf procedures. See also Procedures

defi ned, 104example, 113

Least recently used (LRU)as block replacement strategy, 468–469defi ned, 423pages, 448

Least signifi cant bits, A-620defi ned, 75SPARC, D-31

Left -to-right instruction fl ow, 298–299LEGv8, 62, 64, 86

architecture, 204arithmetic core, 246arithmetic instructions, 63arithmetic/logical instructions not in,

D-21, D-23assembly instruction, mapping, 82–83common extensions to, D-20–25compiling C assignment statements

into, 66compiling complex C assignment into,

66control instructions not in, D-21control registers, 461control unit, C-10data transfer instructions not in, D-20,

D-22divide in, 203exceptions in, 337–338fi elds, 84–85fl oating-point instructions not in,

D-22fl oating-point instructions, 220–221instruction classes, 173instruction encoding, 86, 122instruction formats, 124, 151instruction set, 62, 171, 246, 247,

256–260, D-9–10machine language, 88memory addresses, 71memory allocation for program and

data, 108multiply in, 197operands, 64Pseudo, 246register conventions, 109static multiple issue with, 345–347

Level-sensitive clocking, A-662, A-663–664

defi ned, A-662two-phase, A-663

Link, OL6.9-2Linkers, 131–134

defi ned, 131executable fi les, 131steps, 131using, 131–134

Linking object fi les, 132–134Linpack, 554, OL3.12-4Liquid crystal displays (LCDs), 18LISP, SPARC support, D-30Live range, OL2.15-11Livermore Loops, OL1.12-11

I-12 Index

Load balancing, 521–522Load byte, 167Load halfword, 167Load instructions. See also Store

instructionsaccess, B-41base register, 274compiling with, 71–72datapath in operation for, 279defi ned, 69EX stage, 303halfword unsigned, 114ID stage, 302IF stage, 302load byte unsigned, 79load half, 114MEM stage, 304move wide with keep, 115move wide with zeros, 115pipelined datapath in, 307signed, 79unit for implementing, 267unsigned, 79WB stage, 304

Load register, 69, 72Loaders, 134Load-store architectures, OL2.22-3Load-use data hazard, 290, 329Load-use stalls, 329Local area networks (LANs), 24. See also

NetworksLocal memory, B-21, B-40Local miss rates, 430Local optimization, OL2.15-5. See also

Optimizationimplementing, OL2.15-8

Localityprinciple, 388spatial, 388, 391temporal, 388, 391

Lock synchronization, 125Locks, 534Logic

address select, C-24, C-25ALU control, C-6combinational, 262, A-593, A-597–608components, 261control unit equations, C-11design, 260–263, B-1–79equations, A-595minimization, A-606programmable array (PAL), A-666

sequential, A-593, A-644–646two-level, A-599–602

Logical operations, 90–93AND, 91ARM, 154desktop RISC, D-11embedded RISC, D-14EOR, 92NOT, 91OR, 91shift s, 90

Long instruction word (LIW), OL4.16-5Lookup tables (LUTs), A-667Loop unrolling

defi ned, 348, OL2.15-4for multiple-issue pipelines, 348register renaming and, 348

Loops, 95–96conditional branches in, 118for, 147prediction and, 333–334test, 147, 148while, compiling, 95–96

M

M32R, D-15, D-40Machine code, 83Machine instructions, 83Machine language, 15

branch off set in, 119decoding, 121–124defi ned, 14, 83fl oating-point, 221illustrated, 15LEGv8, 88SRAM, 21translating MIPS assembly language

into, 86Main memory, 442. See also Memory

defi ned, 23page tables, 451physical addresses, 442

Mapping applications, B-55–72Mark computers, OL1.12-14Matrix multiply, 238–242, 569–571Mealy machine, 475, A-656, A-659, A-660Mean time to failure (MTTF), 432

improving, 433versus AFR of disks, 433–434

Media Access Control (MAC) address, OL6.9-7

Megabyte, 6Memory

addresses, 79affi nity, 562atomic, B-21bandwidth, 394–395, 411cache, 21, 397–412, 412–431CAM, 422constant, B-40control, C-26defi ned, 19DRAM, 19, 393–394, A-651–653fl ash, 23global, B-21, B-39GPU, 538instructions, datapath for, 267local, B-21, B-40main, 23nonvolatile, 22operands, 68–69parallel system, B-36–41read-only (ROM),A-602–604SDRAM, 393–394secondary, 23shared, B-21, B-39–40spaces, B-39SRAM, A-646–650stalls, 414technologies for building, 24–28texture, B-40virtual, 441–465volatile, 22

Memory access instructions, B-33–34Memory access stage


Memory bandwidth, 565, 573Memory consistency model, 481Memory elements, A-638–646

clocked, A-639D fl ip-fl op, A-639, A-641D latch, A-640DRAMs, A-651–655fl ip-fl op, A-639hold time, A-642latch, A-639setup time, A-641, A-642SRAMs, A-646–650unclocked, A-639

Memory hierarchies, 559of ARM cortex-A8, 482–488

Index I-13

block (or line), 390cache performance, 412–431caches, 397–431common framework, 465–472defi ned, 389design challenges, 472development, OL5.17-6–5.17-8exploiting, 386–513of Intel Core i7, 482–488level pairs, 390multiple levels, 389overall operation of, 457–458parallelism and, 477–481, OL5.11-2pitfalls, 491–495program execution time and, 431quantitative design parameters, 466redundant arrays and inexpensive

disks, 481reliance on, 390structure, 389structure diagram, 392variance, 431virtual memory, 441–465

Memory rank, 395Memory technologies, 392–397

disk memory, 395–397DRAM technology, 392, 393–395fl ash memory, 395SRAM technology, 392, 393

Memory-mapped I/O, OL6.9-3Memory-stall clock cycles, 413Message passing

defi ned, 543multiprocessors, 543–548

Metastability, A-664Methods

defi ned, OL2.15-5invoking in Java, OL2.15-20–2.15-21

Microarchitectures, 358Intel Core i7 920, 358

Microcodeassembler, C-30control unit as, C-28defi ned, C-27dispatch ROMs, C-30–31horizontal, C-32vertical, C-32

Microinstructions, C-31Microprocessors

design shift , 517multicore, 8, 43, 517

Microprograms

as abstract control representation, C-30fi eld translation, C-29translating to hardware, C-28–32

Migration, 479Million instructions per second (MIPS),

51Minterms

defi ned, A-600, C-20in PLA implementation, C-20

MIP-map, B-44MIPS and ARMv8

common features beween, 152MIPS-16

16-bit instruction set, D-41–42immediate fi elds, D-41instructions, D-40–42MIPS core instruction changes, D-42PC-relative addressing, D-41

MIPS-32 instruction set, 151MIPS-64 instructions, 151, D-25–27

conditional procedure call instructions, D-27

constant shift amount, D-25jump/call not PC-relative, D-26move to/from control registers, D-26nonaligned data transfers, D-25NOR, D-25parallel single precision fl oating-point

operations, D-27reciprocal and reciprocal square root,

D-27SYSCALL, D-25TLB instructions, D-26–27

Mirroring, OL5.11-5Miss penalty

defi ned, 390determination, 405–406multilevel caches, reducing, 424

Miss ratesblock size versus, 406data cache, 467defi ned, 390global, 430improvement, 405–406Intrinsity FastMATH processor, 411local, 430miss sources, 471split cache, 411

Miss under miss, 483MMX (MultiMedia eXtension), 232Moore machines, 475, A-656, A-659,

A-660

Moore’s law, 11, 393, 538, OL6.9-2, B-72–73

Most signifi cant bit1-bit ALU for, A-621defi ned, 75

move (Move), 144MOVK (move wide with keep), 64, 115MOVZ (move wide with zero), 64, 115MS-DOS, OL5.17-11Multicore, 533–537Multicore multiprocessors, 8, 43

defi ned, 8, 517MULTICS (Multiplexed Information

and Computing Service), OL5.17-9–5.17-10

Multilevel caches. See also Cachescomplications, 430defi ned, 412, 430miss penalty, reducing, 424performance of, 424summary, 431–432

Multimedia extensionsdesktop/server RISCs, D-16–18as SIMD extensions to instruction sets,

OL6.15-4vector versus, 525–526

Multiple dimension arrays, 226Multiple instruction multiple data

(MIMD), 574defi ned, 523, 524fi rst multiprocessor, OL6.15-14

Multiple instruction single data (MISD), 523

Multiple issue, 343–350code scheduling, 347–348dynamic, 343, 349–350issue packets, 345loop unrolling and, 348processors, 343, 344static, 343, 345–349throughput and, 353

Multiple processors, 569–571Multiple-clock-cycle pipeline diagrams, 308

fi ve instructions, 309illustrated, 309

Multiplexors, A-598controls, 473in datapath, 275defi ned, 258forwarding, control values, 322selector control, 271two-input, A-598

I-14 Index

Multiplicand, 192Multiplication, 191–197. See also

Arithmeticfast, hardware, 196faster, 196–197fi rst algorithm, 194fl oating-point, 215–217hardware, 192–194instructions, 197in MIPS, 197multiplicand, 197multiplier, 197operands, 197product, 197sequential version, 192–194signed, 196

Multiplier, 192Multiply algorithm, 195Multiply-add (MAD), B-42Multiprocessors

benchmarks, 554–556bus-based coherent, OL6.15-7defi ned, 516historical perspective, 577large-scale, OL6.15-7–6.15-8,

OL6.15-9–6.15-10message-passing, 545–550multithreaded architecture, B-26–27,

B-35–36organization, 515, 545for performance, 573shared memory, 517, 533–537soft ware, 517TFLOPS, OL6.15-6UMA, 534

Multistage networks, 551Multithreaded multiprocessor

architecture, B-25–36conclusion, B-36ISA, B-31–34massive multithreading, B-25–26multiprocessor, B-26–27multiprocessor comparison,

B-35–36SIMT, B-27–30special function units (SFUs), B-35streaming processor (SP), B-34thread instructions, B-30–31threads/thread blocks management,

B-30Multithreading, B-25–26

coarse-grained, 530defi ned, 522

fi ne-grained, 530hardware, 530–533simultaneous (SMT), 531–533

Must-information, OL2.15-5Mutual exclusion, 125

N

Name dependence, 348NAND gates, A-596NAS (NASA Advanced Supercomputing),

556N-body

all-pairs algorithm, B-65GPU simulation, B-71mathematics, B-65–67multiple threads per body, B-68–69optimization, B-67performance comparison, B-69–70results, B-70–72shared memory use, B-67–68

Negation shortcut, 79Nested procedures, 104–105

compiling recursive procedure showing, 104–105

NetFPGA 10-Gigagit Ethernet card, OL6.9-2, OL6.9-3

Network of Workstations, OL6.15-8–6.15-9

Network topologies, 550–553implementing, 552multistage, 553

Networking, OL6.9-4operating system in, OL6.9-4–6.9-5performance improvement, OL6.9-

7–6.9-10Networks, 23–24

advantages, 23bandwidth, 549crossbar, 551fully connected, 551local area (LANs), 24multistage, 551wide area (WANs), 24

Newton’s iteration, 226Next state

nonsequential, C-24sequential, C-23

Next-state function, 474, A-655defi ned, 474implementing, with sequencer,

C-22–28Next-state outputs, C-10, C-12–13

example, C-12–13implementation, C-12logic equations, C-12–13truth tables, C-15

No Redundancy (RAID 0), OL5.11-4No write allocation, 408Nonblocking assignment, A-612Nonblocking caches, 355, 483Nonuniform memory access (NUMA), 534Nonvolatile memory, 22Nops, 326NOR gates, A-596

cross-coupled, A-638D latch implemented with, A-640

NOR operation, D-25NOT operation, 91, A-594Not-A-Number (NaN), 235–236Numbers

binary, 75computer versus real-world, 229decimal, 75, 78denormalized, 230hexadecimal, 84signed, 75–82unsigned, 75–82

NVIDIA GeForce 8800, B-46–55all-pairs N-body algorithm, B-71dense linear algebra computations,

B-51–53FFT performance, B-53instruction set, B-49performance, B-51rasterization, B-50ROP, B-50–51scalability, B-51sorting performance, B-54–55special function approximationstatistics, B-43special function unit (SFU), B-50streaming multiprocessor (SM), B-48–49streaming processor, B-49–50streaming processor array (SPA), B-46texture/processor cluster (TPC),

B-47–48NVIDIA GPU architecture, 539–541NVIDIA GTX 280, 565, 566NVIDIA Tesla GPU, 564–569

O

Object fi les, 132debugging information, 131header, 130

Index I-15

linking, 132–134relocation information, 130static data segment, 130symbol table, 130text segment, 130

Object-oriented languages. See also Javabrief history, OL2.22-8defi ned, 150, OL2.15-5

One’s complement, 82, A-617Opcodes

control line setting and, 276defi ned, 84, 274

OpenGL, B-13OpenMP (Open MultiProcessing), 536,

556Operands, 67–72. See also Instructions

32-bit immediate, 115–116adding, 189arithmetic instructions, 67compiling assignment when in

memory, 69constant, 73–74division, 197fl oating-point, 221LEGv8, 64memory, 68–69multiplication, 191

Operating systemsbrief history, OL5.17-9–5.17-12defi ned, 13encapsulation, 22in networking, OL6.9-4–6.9-5

Operationsatomic, implementing, 126hardware, 63–67logical, 90–93x86 integer, 157, 158

Optimizationclass explanation, OL2.15-14compiler, 146control implementation,

C-27–28global, OL2.15-5high-level, OL2.15-4–2.15-5local, OL2.15-5, OL2.15-8manual, 150

OR operation, 91, A-594Out-of-order execution

defi ned, 351performance complexity, 430processors, 355

Output devices, 16Overfl ow

defi ned, 76, 206detection, 190exceptions, 339fl oating-point, 207occurrence, 77saturation and, 191subtraction, 189

P

P+Q redundancy (RAID 6), OL5.11-7Packed fl oating-point format, 232Page faults, 448. See also Virtual memory

for data access, 463defi ned, 442handling, 443, 461–464virtual address causing, 457, 458

Page tables, 468defi ned, 446illustrated, 449indexing, 446inverted, 451levels, 451main memory, 451register, 446storage reduction techniques, 451updating, 446VMM, 463

Pages. See also Virtual memorydefi ned, 442dirty, 452fi nding, 446–447LRU, 448off set, 443physical number, 443placing, 432–434size, 444virtual number, 443

Parallel bus, OL6.9-3Parallel execution, 125Parallel memory system, B-36–41.

See also Graphics processing units (GPUs)

caches, B-38constant memory, B-40DRAM considerations, B-37–38global memory, B-39load/store access, B-41local memory, B-40memory spaces, B-39MMU, B-38–39ROP, B-41shared memory, B-39–40

surfaces, B-41texture memory, B-40

Parallel processing programs, 518–523creation diffi culty, 518–523defi ned, 516for message passing, 533great debates in, OL6.15-5for shared address space, 533–534use of, 573

Parallel reduction, B-62Parallel scan, B-60–63

CUDA template, B-61inclusive, B-60tree-based, B-62

Parallel soft ware, 517Parallelism, 12, 43, 342–355

and computers arithmetic, 230–232data-level, 246, 524debates, OL6.15-5–6.15-7GPUs and, 538, B-76instruction-level, 43, 342, 354memory hierarchies and, 477–481,

OL5.11-2multicore and, 533multiple issue, 342–349multithreading and, 531performance benefi ts, 44process-level, 516redundant arrays and inexpensive

disks, 481subword, D-17task, B-24task-level, 516thread, B-22

Paravirtualization, 495PA-RISC, D-14, D-17

branch vectored, D-35conditional branches, D-34, D-35debug instructions, D-36decimal operations, D-35extract and deposit, D-35instructions, D-34–36load and clear instructions, D-36multiply/add and multiply/subtract,

D-36nullifi cation, D-34nullifying branch option, D-25store bytes short, D-36synthesized multiply and divide,

D-34–35Parity, OL5.11-5

bits, 435code, 434, A-653

I-16 Index

PARSEC (Princeton Application Repository for Shared Memory Computers), 556

Pass transistor, A-651PCI-Express (PCIe), 553, B-8, OL6.9-2PC-relative addressing, 118, 120Peak fl oating-point performance, 558Pentium bug morality play, 244Performance, 28–40

assessing, 28classic CPU equation, 36–40components, 38CPU, 33–35defi ning, 29–32equation, using, 36improving, 34–35instruction, 35–36measuring, 33–35, OL1.12-10program, 39–40ratio, 31relative, 31–32response time, 30–31sorting, B-54–55throughput, 30–31time measurement, 32

Personal computers (PCs), 7defi ned, 5

Personal mobile device (PMD)defi ned, 7

Petabyte, 6Physical addresses, 442

mapping to, 442–443space, 533, 535

Physically addressed caches, 458Pipeline registers

before forwarding, 320dependences, 319forwarding unit selection, 323

Pipeline stalls, 291avoiding with code reordering, 291data hazards and, 324–328insertion, 326load-use, 329as solution to control hazards, 293

Pipelined branches, 331Pipelined control, 311–315. See also

Controlcontrol lines, 311, 312overview illustration, 327specifying, 312

Pipelined datapaths, 297–315with connected control signals, 315

with control signals, 311–315corrected, 307illustrated, 300in load instruction stages, 307

Pipelined dependencies, 317Pipelines

branch instruction impact, 329eff ectiveness, improving, OL4.16-

4–4.16-5execute and address calculation stage,

301, 303fi ve-stage, 285, 301, 309graphic representation, 290, 307–311instruction decode and register fi le

read stage, 300, 303instruction fetch stage, 301, 303instructions sequence, 325latency, 297memory access stage, 301, 303multiple-clock-cycle diagrams, 308performance bottlenecks, 353single-clock-cycle diagrams, 308stages, 285static two-issue, 345write-back stage, 301, 305

Pipelining, 12, 283–297advanced, 354–355benefi ts, 283control hazards, 292–293data hazards, 289exceptions and, 338–342execution time and, 297fallacies, 366–367hazards, 288instruction set design for, 288laundry analogy, 284overview, 283–297paradox, 285performance improvement, 288pitfall, 366–367simultaneous executing instructions,

297speed-up formula, 285structural hazards, 288, 305summary, 296throughput and, 297

Pitfalls. See also Fallaciesaddress space extension, 493arithmetic, 242–245associativity, 492defi ned, 49GPUs, B-74–75

ignoring memory system behavior, 491memory hierarchies, 491–495out-of-order processor evaluation, 492performance equation subset, 50–51pipelining, 366–367pointer to automatic variables, 171sequential word addresses, 171simulating cache, 491soft ware development with

multiprocessors, 570VMM implementation, 495

Pixel shader example, B-15–17Pixels, 18Pointers

arrays versus, 146–150frame, 106global, 106incrementing, 148Java, OL2.15-26stack, 101, 105

Polling, OL6.9-8Pop, 101Power

clock rate and, 40critical nature of, 53effi ciency, 354–355relative, 41

PowerPCalgebraic right shift , D-33branch registers, D-32–33condition codes, D-12instructions, D-12–13instructions unique to, D-31–33load multiple/store multiple, D-33logical shift ed immediate, D-33rotate with mask, D-33

Precise interrupts, 342Prediction, 12

2-bit scheme, 333accuracy, 333dynamic branch, 331–333loops and, 333–334steady-state, 333

Prefetching, 496, 560Primitive types, OL2.15-26Procedure calls

preservation across, 106Procedures, 100–110

compiling, 102compiling, showing nested procedure

linking, 102–104execution steps, 100

Index I-17

frames, 106leaf, 104nested, 104–106recursive, 108for setting arrays to zero, 147sort, 140–145strcpy, 112–113string copy, 112–113swap, 138

Process identifi ers, 460Process-level parallelism, 516Processors, 254–368

as cores, 43control, 19datapath, 19defi ned, 17, 19dynamic multiple-issue, 343multiple-issue, 343out-of-order execution, 355, 430performance growth, 44ROP, B-12, B-41speculation, 344–345static multiple-issue, 343, 345–349streaming, B-34superscalar, 349, 531–532, OL4.16-5technologies for building, 24–28two-issue, 346vector, 523–524VLIW, 345

Product, 192Product of sums, A-599Program counters (PCs), 263

changing with conditional branch, 334

defi ned, 101, 263exception, 459, 461incrementing, 263, 265instruction updates, 300

Program performanceelements aff ecting, 39understanding, 9

Programmable array logic (PAL), A-666Programmable logic arrays (PLAs)

component dots illustration, A-604control function implementation, C-7,

C-20–21defi ned, A-600example, A-601–602illustrated, A-601ROMs and, A-603–604size, C-20truth table implementation, A-601

Programmable logic devices (PLDs), A-666

Programmable ROMs (PROMs), A-602Programming languages. See also specifi c

languagesbrief history of, OL2.22-7–2.22-8object-oriented, 150variables, 67

Programsassembly language, 129Java, starting, 136–137parallel processing, 516–523starting, 128–137translating, 128–137

Propagatedefi ned, A-628example, A-632super, A-629

Protected keywords, OL2.15-21Protection

defi ned, 442implementing, 459–460mechanisms, OL5.17-9VMs for, 438

Protection group, OL5.11-5Pseudo MIPS

defi ned, 246instruction set, 248

Pseudoinstructionsdefi ned, 129summary, 130

Pthreads (POSIX threads), 556PTX instructions, B-31, B-32Public keywords, OL2.15-21Push

defi ned, 101using, 104

Q

Quad words, 158Quicksort, 425, 426Quotient, 198

R

Race, A-661Radix sort, 425, 426, B-63–65

CUDA code, B-64implementation, B-63–65

RAID, See Redundant arrays of inexpensive disks (RAID)

RAM, 9Raster operation (ROP) processors, B-12,

B-41, B-50–51fi xed function, B-41

Raster refresh buff er, 18Rasterization, B-50Ray casting (RC), 568Read-only memories (ROMs),

A-602–604control entries, C-16–17control function encoding, C-18–19dispatch, C-25implementation, C-15–19logic function encoding, A-603overhead, C-18PLAs and, A-603–604programmable (PROM), A-602total size, C-16

Read-stall cycles, 413Read-write head, 395Receive message routine, 545Recursive procedures, 108. See also

Proceduresclone invocation, 104

Reduced instruction set computer (RISC) architectures, D-2–45, OL2.22-5, OL4.16-4. See also Desktop and server RISCs; Embedded RISCs

group types, D-3–4instruction set lineage, D-44

Reduction, 535Redundant arrays of inexpensive disks

(RAID), OL5.11-2–5.11-8history, OL5.11-8RAID 0, OL5.11-4RAID 1, OL5.11-5RAID 2, OL5.11-5RAID 3, OL5.11-5RAID 4, OL5.11-5–5.11-6RAID 5, OL5.11-6–5.11-7RAID 6, OL5.11-7spread of, OL5.11-6summary, OL5.11-7–5.11-8use statistics, OL5.11-7

Reference bit, 450References

absolute, 131types, OL2.15-26

Register 31, 74, 102, 175Register addressing, 120Register allocation, OL2.15-11–2.15-13

I-18 Index

Register fi les, A-638,A-642–644defi ned, 264, A-638, A-642in behavioral Verilog, A-645single, 269two read ports implementation, A-643with two read ports/one write port,

A-643write port implementation, A-644

Register-memory architecture, OL2.22-3Registers, 156, 157–158

architectural, 336–342base, 69clock cycle time and, 67compiling C assignment with, 68defi ned, 67destination, 85, 274fl oating-point, 226left half, 301LEGv8 conventions, 108mapping, 82number specifi cation, 264page table, 446pipeline, 319, 321, 323primitives, 67renaming, 348right half, 301spilling, 72Status, 337temporary, 68, 102variables, 67

Relative performance, 31–32Relative power, 41Reliability, 432Remainder

defi ned, 198Reorder buff ers, 355Replication, 479Requested word fi rst, 406Request-level parallelism, 548Reservation stations

buff ering operands in, 350defi ned, 350

Response time, 30–31Restartable instructions, 462Return address, 100Return from exception (ERET), 459R-format, 274

ALU operations, 265defi ned, 86

Ripple carryadder, A-617carry lookahead speed versus,

A-634–635

Roofl ine model, 558–559, 560, 561with ceilings, 561computational roofl ine, 559illustrated, 557Opteron generations, 558with overlapping areas shaded, 563peak fl oating-point performance, 562peak memory performance, 562with two kernels, 563

Rotational delay.See Rotational latencyRotational latency, 397Rounding, 226

accurate, 226bits, 228with guard digits, 227IEEE 754 modes, 227

Row-major order, 225, 427R-type instructions, 264

datapath for, 276datapath in operation for, 278

S

Saturation, 191SCALAPAK, 244Scaling

strong, 521weak, 521

Scientifi c notationadding numbers in, 213defi ned, 205for reals, 205

Search engines, 4Secondary memory, 23Sectors, 395Secure Hash Algorithm ( SHA )

encryption, 488Seek, 396Segmentation, 445Selector values, A-598Semiconductors, 25Send message routine, 545Sensitivity list, A-612Sequencers

explicit, C-32implementing next-state function with,

C-22–28Sequential logic, A-593Servers, OL5. See also Desktop and server

RISCscost and capability, 5

Service accomplishment, 432Service interruption, 432

Set instructions, 97Set-associative caches, 417. See also

Cachesaddress portions, 421block replacement strategies, 468choice of, 467four-way, 418, 421memory-block location, 417misses, 419–420n-way, 417two-way, 418

Setup time, A-641, A-642Shaders

defi ned, B-14fl oating-point arithmetic, B-14graphics, B-14–15pixel example, B-15–17

Shading languages, B-14Shadowing, OL5.11-5Shared memory. See also Memory

as low-latency memory, B-21caching in, B-58–60CUDA, B-58N-body and, B-67–68per-CTA, B-39SRAM banks, B-40

Shared memory multiprocessors (SMP), 531–535

defi ned, 517, 531single physical address space, 531synchronization, 534

Shift amount, 84Shift instructions, 90Sign and magnitude, 206Sign bit, 78Sign extension, 266

defi ned, 78shortcut, 80

Signalsasserted, 262, A-592control, 262, 274–275deasserted, 262, A-592

Signed division, 201–202Signed multiplication, 196Signed numbers, 75–82

sign and magnitude, 77treating as unsigned, 98

Signifi cands, 207addition, 212multiplication, 215

Silicon, 25as key hardware technology, 53crystal ingot, 26

Index I-19

defi ned, 26wafers, 26

Silicon crystal ingot, 26SIMD (Single Instruction Multiple Data),

522, 574computers, OL6.15-2–6.15-4data vector, B-35extensions, OL6.15-4for loops and, OL6.15-3massively parallel multiprocessors,

OL6.15-2small-scale, OL6.15-4vector architecture, 524–525in x86, 524

SIMMs (single inline memory modules), OL5.17-5, OL5.17-6

Simple programmable logic devices (SPLDs), A-666

Simplicity, 171Simultaneous multithreading (SMT),

531–533support, 531thread-level parallelism, 531unused issue slots, 531

Single error correcting/Double error correcting (SEC/DEC), 434–436

Single instruction single data (SISD), 523Single precision. See also Double

precisionbinary representation, 210defi ned, 207

Single-clock-cycle pipeline diagrams, 308illustrated, 310

Single-cycle datapaths. See also Datapathsillustrated, 298instruction execution, 299

Single-cycle implementationcontrol function for, 281defi ned, 281nonpipelined execution versus

pipelined execution, 287non-use of, 284penalty, 283pipelined performance versus, 285

Single-instruction multiple-thread (SIMT), B-27–30

overhead, B-35multithreaded warp scheduling, B-28processor architecture, B-28warp execution and divergence,

B-29–30Single-program multiple data (SPMD),

B-22

Smalltalk-80, OL2.22-8Smart phones, 7Snooping protocol, 479–481Snoopy cache coherence, OL5.12-7Soft ware optimization

via blocking, 427–432Sort algorithms, 146Soft ware

layers, 13multiprocessor, 516parallel, 517as service, 7, 547, 574systems, 13

Sort procedure, 140–144. See also Procedures

code for body, 140–142full procedure, 143–144passing parameters in, 143preserving registers in, 143procedure call, 143register allocation for, 140

Sorting performance, B-54–55Space allocation

on heap, 107–110on stack, 106

SPARCannulling branch, D-23CASA, D-31conditional branches, D-10–12fast traps, D-30fl oating-point operations, D-31instructions, D-29–32least signifi cant bits, D-31multiple precision fl oating-point

results, D-32nonfaulting loads, D-32overlapping integer operations, D-31quadruple precision fl oating-point

arithmetic, D-32register windows, D-29–30support for LISP and Smalltalk, D-30

Sparse matrices, B-55–58Sparse Matrix-Vector multiply (SpMV),

B-55, B-57, B-58CUDA version, B-57serial code, B-57shared memory version, B-59

Spatial locality, 388large block exploitation of, 405tendency, 392

SPEC, OL1.12-11–1.12-12CPU benchmark, 46–48power benchmark, 48–49

SPEC2000, OL1.12-12SPEC2006, 246, OL1.12-12SPEC89, OL1.12-11SPEC92, OL1.12-12SPEC95, OL1.12-12SPECrate, 554SPECratio, 47

Special function units (SFUs), B-35, B-50

defi ned, B-43Speculation, 344–345

hardware-based, 352implementation, 344performance and, 344problems, 344recovery mechanism, 344

Speed-up challenge, 518balancing load, 518–519bigger problem, 520–521

Spilling registers, 72, 101Split algorithm, 568Split caches, 411Stack architectures, OL2.22-4Stack pointers

adjustment, 104defi ned, 101values, 104

Stacksallocating space on, 106for arguments, 145defi ned, 101pop, 101push, 101, 104

Stalls, 291as solution to control hazard, 292avoiding with code reordering, 291behavioral Verilog with detection,

OL4.13-6data hazards and, 324–328illustrations, OL4.13-23, OL4.13-30insertion into pipeline, 326load-use, 329memory, 414write-back scheme, 413write buff er, 413

Standby spares, OL5.11-8State

in 2-bit prediction scheme, 333assignment, A-658, C-27bits, C-8exception, saving/restoring, 462logic components, 261specifi cation of, 446

I-20 Index

State elementsclock and, 262combinational logic and, 262defi ned, 260, A-636inputs, 261in storing/accessing instructions, 264register fi le, A-638

Static branch prediction, 345Static data

segment, 107Static multiple-issue processors, 343,

345–349. See also Multiple issuecontrol hazards and, 345instruction sets, 345with LEGv8 ISA, 345–348

Static random access memories (SRAMs), 392, 393, A-646–650

array organization, A-650basic structure, A-649defi ned, 21, A-646fi xed access time, A-646large, A-647read/write initiation, A-647synchronous (SSRAMs), A-648three-state buff ers, A-647, A-648

Static variables, 106Steady-state prediction, 333Sticky bits, 228Store buff ers, 355Store instructions. See also Load

instructionsaccess, B-41base register, 274compiling with, 71conditional, 126defi ned, 71EX stage, 305ID stage, 302IF stage, 302instruction dependency, 323MEM stage, 304unit for implementing, 267WB stage, 304

Store register, 72Stored program concept, 63

as computer principle, 88illustrated, 89principles, 171

Strcpy procedure, 112–113. See also Procedures

as leaf procedure, 113pointers, 113

Stream benchmark, 564Streaming multiprocessor (SM),

B-48–49Streaming processors, B-34, B-49–50

array (SPA), B-41, B-46Streaming SIMD Extension 2 (SSE2)

fl oating-point architecture, 232Streaming SIMD Extensions (SSE) and

advanced vector extensions in x86, 232–233

Stretch computer, OL4.16-2Strings

defi ned, 111in Java, 113–115representation, 111

Strip mining, 526Striping, OL5.11-4Strong scaling, 521Structural hazards, 288–289, 305STUR (store register), 64STURB (store byte), 64STURH (store half), 64STURW (store word), 64STXR (store exclusive register), 64, 126SUB (subtract), 64SUBI (subtract immediate), 64SUBIS (subtract immediate and set fl ags),

64SUBS (subtract and set fl ags), 64Subnormals, 230Subtraction, 188–191. See also Arithmetic

binary, 188–189fl oating-point, 220negative number, 190overfl ow, 190

Subword parallelism, 230–232, 365, D-17

and matrix multiply, 238–242Sum of products, A-599, A-600Supercomputers, OL4.16-3

defi ned, 5SuperH, D-15, D-39–40Superscalars

defi ned, 349, OL4.16-5dynamic pipeline scheduling, 349multithreading options, 516

Surfaces, B-41Swap procedure, 138. See also Procedures

body code, 138full, 139–140, 143–145register allocation, 138

Swap space, 448

Symbol tables, 130Synchronization, 125–127, 568

barrier, B-18, B-20, B-34defi ned, 534lock, 125overhead, reducing, 44–45unlock, 125

Synchronizersdefi ned, A-664failure, A-665from D fl ip-fl op, A-664

Synchronous DRAM (SRAM), 393–394, A-648, A-653

Synchronous SRAM (SSRAM), A-648Synchronous system, A-636Syntax tree, OL2.15-3System calls

defi ned, 459Systems soft ware, 13SystemVerilog

cache controller, OL5.12-2cache data and tag modules, OL5.12-6FSM, OL5.12-7simple cache block diagram, OL5.12-4type declarations, OL5.12-2

T

Tablets, 7Tags

defi ned, 398in locating block, 421page tables and, 448size of, 423

Tail call, 109Task identifi ers, 460Task parallelism, B-24Task-level parallelism, 516Tebibyte (TiB), 5Telsa PTX ISA, B-31–34

arithmetic instructions, B-33barrier synchronization, B-34GPU thread instructions, B-32memory access instructions, B-33–34

Temporal locality, 388tendency, 392

Temporary registers, 68, 102Terabyte (TB) , 6

defi ned, 5Texture memory, B-40Texture/processor cluster (TPC), B-47–48TFLOPS multiprocessor, OL6.15-6

Index I-21

Th rashing, 464Th read blocks, 542

creation, B-23defi ned, B-19managing, B-30memory sharing, B-20synchronization, B-20

Th read parallelism, B-22Th reads

creation, B-23CUDA, B-36ISA, B-31–34managing, B-30memory latencies and, B-74–75multiple, per body, B-68–69warps, B-27

Th ree Cs model, 459–461Th ree-state buff ers, A-647, A-648Th roughput

defi ned, 30–31multiple issue and, 342pipelining and, 286, 342

Th umb, D-15, D-38Timing

asynchronous inputs, A-664–665level-sensitive, A-663–664methodologies, A-660–665two-phase, A-663

TLB misses, 453. See also Translation-lookaside buff er (TLB)

handling, 461–465occurrence, 461problem, 464

Tomasulo’s algorithm, OL4.16-3Touchscreen, 19Tournament branch predicators, 334Tracks, 395–396Transfer time, 397Transistors, 25Translation Table Base Register (TTBR),

449Translation-lookaside buff er (TLB),

452–454, D-26–27, OL5.17-6. See also TLB misses

associativities, 454illustrated, 453integration, 457Intrinsity FastMATH, 454–457typical values, 454

Transmit driver and NIC hardware time versus receive driver and NIC hardware time, OL6.9-8

Tree-based parallel scan, B-62Truth tables, A-593

ALU control lines, C-5for control bits, 272datapath control outputs, C-17datapath control signals, C-14defi ned, 272example, A-593next-state output bits, C-15PLA implementation, A-601

Two’s complement representation, 77–78advantage, 78negation shortcut, 79–80rule, 81sign extension shortcut, 80–81

Two-level logic, A-599–602Two-phase clocking, A-663TX-2 computer, OL6.15-4

U

Unconditional branches, 94Underfl ow, 206Unicode

alphabets, 113defi ned, 113example alphabets, 114

Unifi ed GPU architecture, B-10–12illustrated, B-11processor array, B-11–12

Uniform memory access (UMA), 534, B-9

multiprocessors, 534Units

commit, 350, 355control, 259, 271–273, C-4–8, C-10,

C-12–13defi ned, 227fl oating point, 227hazard detection, 324, 327–328for load/store implementation, 267special function (SFUs), B-35, B-43,

B-50UNIVAC I, OL1.12-5UNIX, OL2.22-8, OL5.17-9–5.17-12

AT&T, OL5.17-10Berkeley version (BSD), OL5.17-10genius, OL5.17-12history, OL5.17-9–5.17-12

Unlock synchronization, 125Unscaled signed immediate off set, 166Unsigned numbers, 75–82

Use latencydefi ned, 346one-instruction, 346

V

Vacuum tubes, 25Valid bit, 400Variables

C language, 106programming language, 67register, 67static, 106storage class, 106type, 106

VAX architecture, OL2.22-4, OL5.17-7Vector lanes, 526Vector processors, 523–524. See also

Processorsconventional code comparison,

525–526instructions, 524multimedia extensions and, 524–525scalar versus, 526–527

Vectored interrupts, 337Verilog

behavioral defi nition of MIPS ALU, A-613

behavioral defi nition with bypassing, OL4.13-4–4.13-6

behavioral defi nition with stalls for loads, OL4.13-6

behavioral specifi cation, A-609, OL4.13-2–4.13-4

behavioral specifi cation of multicycle MIPS design, OL4.13-12–4.13-13

behavioral specifi cation with simulation, OL4.13-2

behavioral specifi cation with stall detection, OL4.13-6

behavioral specifi cation with synthesis, OL4.13-11–4.13-16

blocking assignment, A-612branch hazard logic implementation,

OL4.13-8–4.13-10combinational logic, A-611–614datatypes, A-609–610defi ned, A-608forwarding implementation, OL4.13-4MIPS ALU defi nition in, A-623–626modules, A-611multicycle MIPS datapath, OL4.13-14

I-22 Index

nonblocking assignment, A-612operators, A-610program structure, A-611reg, A-609–610sensitivity list, A-612sequential logic

specifi cation,A-644–646structural specifi cation, A-609wire, A-609–610

Vertical microcode, C-32Very large-scale integrated (VLSI)

circuits, 25Very Long Instruction Word (VLIW)

defi ned, 345fi rst generation computers, OL4.16-5processors, 345

VHDL, A-608–609Video graphics array (VGA) controllers,

B-3–4Virtual addresses

causing page faults, 462defi ned, 442mapping from, 442–443size, 444

Virtual machine monitors (VMMs)defi ned, 438implementing, 494laissez-faire attitude, 494page tables, 463in performance improvement, 441requirements, 440

Virtual machines (VMs), 438–441benefi ts, 438illusion, 463instruction set architecture support,

441performance improvement, 441for protection improvement, 438

Virtual memory, 441–465. See also Pagesaddress translation, 443, 452–454integration, 457–459for large virtual addresses, 450–451mechanism, 464motivations, 427–442page faults, 442, 448protection implementation, 459–460segmentation, 445summary, 463virtualization of, 463writes, 452

Virtualizable hardware, 440Virtually addressed caches, 458Visual computing, B-3Volatile memory, 22

W

Wafers, 26defects, 26dies, 26–27yield, 27

Warehouse Scale Computers (WSCs), 7, 545–550, 574

Warps, 544, B-27Weak scaling, 521Wear levelling, 395While loops, 95Whirlwind, OL5.17-2Wide area networks (WANs), 24. See also

NetworksWide immediate operands, 115–117Words

accessing, 69defi ned, 67double, 158load, 69, 71quad, 158store, 71

Working set, 464World Wide Web, 4Worst-case delay, 283Write buff ers

defi ned, 408stalls, 413write-back cache, 409

Write invalidate protocols, 479Write serialization, 479Write-back caches. See also Caches

advantages, 469cache coherency protocol, OL5.12-5complexity, 409defi ned, 408, 469stalls, 413write buff ers, 409

Write-back stagecontrol line, 313load instruction, 303store instruction, 305

Writescomplications, 408expense, 464

handling, 407–409memory hierarchy handling of,

469–470schemes, 408virtual memory, 451write-back cache, 408, 409write-through cache, 408, 409

Write-stall cycles, 414Write-through caches. See also Caches

advantages, 469defi ned, 407, 469tag mismatch, 408

X

x86, 154–162Advanced Vector Extensions in, 232brief history, OL2.22-6conclusion, 162data addressing modes, 157–158evolution, 154–157fi rst address specifi er encoding, 162instruction encoding, 161–162instruction formats, 161instruction set growth, 170instruction types, 160integer operations, 158–160registers, 157SIMD in, 522Streaming SIMD Extensions in,

232–233typical instructions/functions, 161typical operations, 161

Xerox Alto computer, OL1.12-8XMM, 232

Y

Yahoo! Cloud Serving Benchmark (YCSB), 556

Yield, 27YMM, 232

Z

Zettabyte, 6

Verilog (Continued)

Documents

Index [booksite.elsevier.com] · Index Note: Online information is listed by chapter and section number followed by page numbers (OL3.11-7).Page references preceded by a single letter