Index
Note: Online information is listed by chapter and section number followed by page numbers (OL3.11-7). Page references preceded by a single letter with hyphen refer to appendices.
1-bit ALU, A-614–617. See also Arithmetic logic unit (ALU)
    adder, A-615
    CarryOut, A-616
    for most significant bit, A-621
    illustrated, A-617
    logical unit for AND/OR, A-615
    performing AND, OR, and addition, A-619, A-621
64-bit ALU, A-617–626. See also Arithmetic logic unit (ALU)
    defining in Verilog, A-623–626
    from 31 copies of 1-bit ALU, A-622
    illustrated, A-624
    ripple carry adder, A-617
    tailoring to MIPS, A-619–623
    with 32 1-bit ALUs, A-618
7090/7094 hardware, OL3.12-7
A
AArch32, 73
AArch64, 73
Absolute references, 131
Abstractions
    hardware/software interface, 22
    principle, 22
    to simplify design, 11
Accumulator architectures, OL2.22-2
Acronyms, 9
Active matrix, 18
ADD (add), 64
ADDI (add immediate), 64
ADDIS (add immediate and set flags), 64
Addition, 188–191. See also Arithmetic
    binary, 188–189
    floating-point, 212–215, 220
    operands, 189
    significands, 211
    speed, 191
Address interleaving, 395
Address select logic, C-24, C-25
Address space, 442, 445
    extending, 493
    flat, 493
    ID (ASID), 460
    inadequate, OL5.17-6
    shared, 533–534
    single physical, 533–534
    virtual, 460
Address translation
    for ARM cortex-A8, 483
    defined, 443
    fast, 452–454
    for Intel core i7, 483
    TLB for, 452–454
Address-control lines, C-26
Addresses
    base, 69
    byte, 70
    defined, 69
    memory, 79
    virtual, 442, 462, 463
Addressing
    base, 120
    in branches, 117–120
    displacement, 120
    immediate, 120
    LEGv8 modes, 120–121
    PC-relative, 118, 120
    register, 120
    x86 modes, 158
Addressing modes
    desktop architectures, D-6
ADDS (add and set flags), 64, 164
addu (Add Unsigned), 64
Advanced Encryption Standard (AES) encryption, 488
Advanced Vector Extensions (AVX), 232, 240
AGP, B-9
Algol-60, OL2.22-7
Aliasing, 458, 459
Alignment restriction, 71
All-pairs N-body algorithm, B-65
Alpha architecture
    bit count instructions, D-29
    floating-point instructions, D-28
    instructions, D-27–29
    no divide, D-28
    PAL code, D-28
    unaligned load-store, D-28
    VAX floating-point formats, D-29
ALU control, 271–273. See also Arithmetic logic unit (ALU)
    bits, 272
    logic, C-6
    mapping to gates, C-4–7
    truth tables, C-5
ALU control block, 275
    defined, C-4
    generating ALU control bits, C-6
ALUOp, 272, C-6
    bits, 272, 273
    control signal, 275
Amazon Web Services (AWS), 439
AMD Opteron X4 (Barcelona), 559, 560
AMD64, 155, 156, 232, OL2.22-6
Amdahl’s law, 415, 519
    corollary, 49
    defined, 49
    fallacy, 572
and (AND), 64
AND gates, A-600, C-7
AND operation, 91
AND operation, A-594
andi (And Immediate), 65
Annual failure rate (AFR), 432
    versus MTTF of disks, 433–434
Antidependence, 348
Antifuse, A-666
Apple computer, OL1.12-7
Apple iPad 2 A1395, 20
    logic board of, 20
    processor integrated circuit of, 21
Application binary interface (ABI), 22
Application programming interfaces (APIs)
    defined, B-4
    graphics, B-14
Architectural registers, 358
Arithmetic, 186–248
    addition, 188–191
    addition and subtraction, 188–191
    division, 197–204
    fallacies and pitfalls, 242–245
    floating-point, 205–230
    historical perspective, 248
    multiplication, 191–197
    parallelism and, 230–232
    Streaming SIMD Extensions and advanced vector extensions in x86, 232–233
    subtraction, 188–191
    subword parallelism, 230–232
    subword parallelism and matrix multiply, 238–242
Arithmetic instructions. See also Instructions
    desktop RISC, D-11
    embedded RISC, D-14
    logical, 263
    operands, 67–74
Arithmetic intensity, 557
Arithmetic logic unit (ALU). See also ALU control; Control units
    1-bit, A-614–617
    64-bit, A-617–626
    before forwarding, 321
    branch datapath, 266
    hardware, 190
    memory-reference instruction use, 257
    for register values, 264
    R-format operations, 265
    signed-immediate input, 323
ARM Cortex-A53, 256, 355–358
    address translation for, 483
    caches in, 484
    data cache miss rates for, 485
    memory hierarchies of, 482
    performance of, 485–488
    specification, 356
    TLB hardware for, 483
ARM instructions, 152–154
    12-bit immediate field, 153
    brief history, OL2.22-5
    condition field, 334
    unique, D-36–37
ARMv7, 62
ARMv8, 62, 163–169
    common features between MIPS and, 152
ARPANET, OL1.12-10
Arrays, 429
    logic elements, A-606–607
    multiple dimension, 226
    pointers versus, 146–150
    procedures for setting to zero, 147
ASCII
    binary numbers versus, 111
    character representation, 110
    defined, 110
    symbols, 113
Assemblers, 129–131
    defined, 14
    function, 129
    microcode, C-30
    number acceptance, 130
    object file, 130
Assembly language, 15
    defined, 14, 129
    floating-point, 221
    illustrated, 15
    LEGv8, 64, 86
    programs, 129
    translating into machine language, 86
Asserted signals, 262, A-592
Associativity
    in caches, 419
    degree, increasing, 418, 466
    increasing, 423
    set, tag size versus, 423
Atomic compare and swap, 127
Atomic exchange, 126
Atomic fetch-and-increment, 127
Atomic memory operation, B-21
Attribute interpolation, B-43–44
Automobiles, computer application in, 4
Average memory access time (AMAT), 416
    calculating, 416
B
Bandwidth, 30bisection, 551external to DRAM, 412memory, 394–395, 412network, 549
Barrier synchronization, B-18defi ned, B-20for thread communication, B-34
Base addressing, 69, 120Base registers, 70Basic block, 96
Benchmarks, 554–556defi ned, 46Linpack, 554, OL3.12-4multicores, 538–545multiprocessor, 554–556NAS parallel, 556parallel, 555PARSEC suite, 556SPEC CPU, 46–48SPEC power, 48–49SPECrate, 554Stream, 564
Biased notation, 82, 209Big-endian byte order, 70Binary numbers, 83
ASCII versus, 111conversion to decimal numbers, 78defi ned, 75
Bisection bandwidth, 551Bit maps
defi ned, 18, 73goal, 18storing, 18
Bit-Interleaved Parity (RAID 3), OL5.11-5
BitsALUOp, 272, 273defi ned, 14dirty, 452guard, 228patterns, 228–229reference, 450rounding, 228sign, 77state, C-8sticky, 228valid, 397
Blocking assignment, A-612Blocking factor, 428Block-Interleaved Parity (RAID 4),
OL5.11-5–5.11-6Blocks
combinational, A-592defi ned, 390fi nding, 467fl exible placement, 416–418least recently used (LRU), 423locating in cache, 421–422miss rate and, 405multiword, mapping addresses to, 404placement locations, 466placement strategies, 418replacement selection, 423
replacement strategies, 468spatial locality exploitation, 405state, A-592valid data, 400
Bonding, 28Boolean algebra, A-594Bounds check shortcut, 98Branch address, 168Branch datapath
ALU, 266operations, 266
Branch delay slots
Branch instructions
    pipeline impact, 329
Branch not taken
    assumption, 328–329
    defined, 266
Branch prediction
    as control hazard solution, 295
    buffers, 331, 333
    defined, 294
    dynamic, 295, 331–334
    static, 345
Branch predictors
    accuracy, 333
    correlation, 333
    information from, 333
    tournament, 334
Branch register, 168
Branch table, 169
Branch taken
    cost reduction, 330
    defined, 266
Branch target
    addresses, 266
    buffers, 333
Branches. See also Conditional branches
    addressing in, 117–120
    compiler creation, 94
    decision, moving up, 330
    delayed, 295, 330–331
    ending, 96
    execution in ID stage, 330
    pipelined, 330
    target address, 330
    unconditional, 318
Branch-on-zero instruction, 280
B-type instruction format, 113
Bubble Sort, 145
Bubbles, 326
Bus-based coherent multiprocessors, OL6.15-7
Buses, A-607
Bytes
    addressing, 70
    order, 70
C
C.mmp, OL6.15-4C language
assignment, compiling into LEGv8, 66compiling, 150, OL2.15-2–2.15-3compiling assignment with registers,
68compiling while loops in, 95–96sort algorithms, 146translation hierarchy, 128translation to LEGv8 assembly
language, 66variables, 106
C++ language, OL2.15-27, OL2.22-8Cache blocking and matrix multiply,
489–490Cache coherence, 477–481
coherence, 477consistency, 477enforcement schemes, 479implementation techniques, OL5.12-
5–5.12-12migration, 479problem, 477, 478, 481protocol example, OL5.12-12–5.12-16protocols, 479replication, 479snooping protocol, 479–481snoopy, OL5.12-16–5.12-17state diagram, OL5.12-16
Cache coherency protocol, OL5.12-12–5.12-16
fi nite-state transition diagram, OL5.12-15
functioning, OL5.12-14mechanism, OL5.12-14state diagram, OL5.12-16states, OL5.12-13write-back cache, OL5.12-15
Cache controllers, 482coherent cache implementation
techniques, OL5.12-5–5.12-12implementing, OL5.12-2snoopy cache coherence, OL5.12-
16–5.12-17SystemVerilog, OL5.12-2
Cache hits, 458Cache misses
block replacement on, 468capacity, 470, 471compulsory, 470confl ict, 470defi ned, 406direct-mapped cache, 418fully associative cache, 420handling, 406–407memory-stall clock cycles, 413reducing with fl exible block placement,
416–418set-associative cache, 419steps, 407in write-through cache, 407
Cache performance, 412–431calculating, 414hit time and, 415–416impact on processor performance, 414
Cache-aware instructions, 496Caches, 397–412. See also Blocks
accessing, 400–403in ARM cortex-A53, 484associativity in, 419–420bits in, 404bits needed for, 404contents illustration, 401defi ned, 21, 397–398direct-mapped, 398, 399, 404, 416empty, 400–401FSM for controlling, 472fully associative, 417GPU, B-38inconsistent, 407index, 402in Intel Core i7, 484Intrinsity FastMATH example,
409–412locating blocks in, 421–422locations, 399multilevel, 412, 424nonblocking, 483physically addressed, 458, 459physically indexed, 458physically tagged, 458primary, 424, 431secondary, 424, 431set-associative, 417simulating, 491size, 403split, 411summary, 411–412tag fi eld, 402tags, OL5.12-3, OL5.12-11
virtual memory and TLB integration, 457–459
virtually addressed, 458virtually indexed, 458virtually tagged, 458write-back, 408, 409, 469write-through, 407, 409, 469writes, 407–409
Callee, 101, 103Caller, 101Capabilities, OL5.17-8Capacity misses, 470Carry lookahead, A-626–635
4-bit ALUs using, A-633adder, A-627fast, with fi rst level of abstraction,
A-627–628fast, with “infi nite” hardware,
A-626–627fast, with second level of abstraction,
A-628–634plumbing analogy, A-630, A-631ripple carry speed versus, A-634summary, A-634–635
Carry save adders, 197
Cause register
CDC 6600, OL1.12-7, OL4.16-3
Cell phones, 7
Central processor unit (CPU). See also Processors
    classic performance equation, 36–40
    defined, 19
    execution time, 32, 33–34
    performance, 33–35
    system, time, 32
    time, 413
    time measurements, 33–34
    user, time, 32
Cg pixel shader program, B-15–17Characters
ASCII representation, 110in Java, 113
Chips, 19, 25, 26manufacturing process, 26
Classesdefi ned, OL2.15-15packages, OL2.15-21
Clear exclusive instruction (CLREX), 488Clock cycles
defi ned, 33
memory-stall, 413number of registers and, 67worst-case delay and, 283
Clock cycles per instruction (CPI), 35, 293
one level of caching, 424two levels of caching, 424
Clock ratedefi ned, 33frequency switched as function of, 41power and, 40
Clocking methodology, 261–263, A-636edge-triggered, 261, A-636, A-661level-sensitive, A-662, A-663–664for predictability, 261
Clocks, A-636–638edge, A-636, A-638in edge-triggered design, A-661skew, A-662specifi cation, A-645synchronous system, A-636–637
Cloud computing, 549defi ned, 7
Cluster networking, 553–554, OL6.9-12Clusters, OL6.15-8–6.15-9
defi ned, 516, 546, OL6.15-8isolation, 547organization, 515scientifi c computing on, OL6.15-8
Cm*, OL6.15-4
CMOS (complementary metal oxide semiconductor), 41
Coarse-grained multithreading, 530
Cobol, OL2.22-7
Code generation, OL2.15-13
Code motion, OL2.15-7
Cold-start miss, 470
Collision misses, 470
Column major order, 427
Combinational blocks, A-592
Combinational control units, C-4–8
Combinational elements, 260
Combinational logic, 261, A-591, A-597–608
    arrays, A-606–607
    decoders, A-597
    defined, A-593
    don’t cares, A-605–606
    multiplexors, A-598
    ROMs, A-602–604
    two-level, A-599–602
    Verilog, A-611–614
Commercial computer development, OL1.12-4–1.12-10
Commit unitsbuff er, 350defi ned, 350in update control, 355
Common case fast, 11Common subexpression elimination,
OL2.15-6Communication, 23–24
overhead, reducing, 44–45thread, B-34
Compact code, OL2.22-4Compare and branch zero, 330Comparisons
constant operands in, 73signed versus unsigned, 97
Compilers, 129branch creation, 95brief history, OL2.22-8–2.22-9conservative, OL2.15-7defi ned, 14front end, OL2.15-3function, 14, 129high-level optimizations,
OL2.15-4ILP exploitation, OL4.16-5Just In Time (JIT), 137optimization, 146, OL2.22-9speculation, 344–345structure, OL2.15-2
CompilingC assignment statements, 66C language, 95, 150, OL2.15-2–2.15-3fl oating-point programs, 222–225if-then-else, 94in Java, OL2.15-19procedures, 102, 104–105recursive procedures, 104–105while loops, 95–96
Compressed sparse row (CSR) matrix, B-55, B-56
Compulsory misses, 470, 471Computer architects, 11–12
abstraction to simplify design, 11common case fast, 11dependability via redundancy, 12hierarchy of memories, 12Moore’s law, 11parallelism, 12pipelining, 12prediction, 12
Computersapplication classes, traditional, 5–6applications, 4arithmetic for, 186–248characteristics, OL1.12-12commercial development, OL1.12-
4–1.12-10component organization, 17components, 17, 177design measure, 53desktop, 5embedded, 5fi rst, OL1.12-2–1.12-4in information revolution, 4instruction representation, 82–89performance measurement,
OL1.12-10post-PC era, 6–7principles, 86servers, 5
Condition codes/fl ags, 97Conditional branches
changing program counter with, 333compiling if-then-else into, 94defi ned, 93desktop RISC, D-16embedded RISC, D-16implementation, 99in loops, 119PA-RISC, D-34, D-35PC-relative addressing, 118RISC, D-10–16SPARC, D-10–12
Conditional move instructions, 334Confl ict misses, 470Constant memory, B-40Constant operands, 73–74
frequent occurrence, 73Content Addressable Memory (CAM), 422Context switch, 460Control
ALU, 271–273challenge, 336fi nishing, 281forwarding, 320FSM, C-8–21implementation, optimizing, C-27–28mapping to hardware, C-2–32memory, C-26organizing, to reduce logic, C-31–32pipelined, 311–315
Control fl ow graphs, OL2.15-9–2.15-10illustrated examples, OL2.15-9, OL2.15-10
Control functionsALU, mapping to gates, C-4–7defi ning, 276PLA, implementation, C-7,
C-20–21ROM, encoding, C-18–19for single-cycle implementation, 281
Control hazards, 292–295, 328–329branch delay reduction, 330branch not taken assumption, 328branch prediction as solution, 295delayed decision approach, 295dynamic branch prediction, 331logic implementation in Verilog,
OL4.13-8pipeline stalls as solution, 293pipeline summary, 335–336simplicity, 328solutions, 293static multiple-issue processors and,
345–346Control lines
asserted, 276in datapath, 275execution/address calculation, 312fi nal three stages, 314instruction decode/register fi le read,
312instruction fetch, 312memory access, 312setting of, 276values, 312write-back, 312
Control signalsALUOp, 275defi ned, 262eff ect of, 276multi-bit, 276pipelined datapaths with, 311–315truth tables, C-14
Control units, 259. See also Arithmetic logic unit (ALU)
address select logic, C-24, C-25combinational, implementing, C-4–8with explicit counter, C-23illustrated, 277logic equations, C-11main, designing, 273–276as microcode, C-28MIPS, C-10next-state outputs, C-10, C-12–13output, 271–273, C-10
Cooperative thread arrays (CTAs), B-30
Coprocessorsdefi ned, 226
Core LEGv8 instruction set, 248. See also MIPS
abstract view, 258desktop RISC, D-9–11implementation, 256–260implementation illustration, 259overview, 257–260subset, 256
Coresdefi ned, 43number per chip, 43
Correlation predictor, 333Cosmic Cube, OL6.15-7CPU, 9Cray computers, OL3.12-5–3.12-6Critical word fi rst, 406Crossbar networks, 551CTSS (Compatible Time-Sharing
System), OL5.18-9CUDA programming environment, 539,
B-5barrier synchronization, B-18, B-34development, B-17, B-18hierarchy of thread groups, B-18kernels, B-19, B-24key abstractions, B-18paradigm, B-19–23parallel plus-scan template, B-61per-block shared memory, B-58plus-reduction implementation, B-63programs, B-6, B-24scalable parallel programming with,
B-17–23shared memories, B-18threads, B-36
Cyclic redundancy check, 437Cylinder, 396
D
D fl ip-fl ops, A-639, A-641D latches, A-639, A-640Data bits, 435Data fl ow analysis, OL2.15-11Data hazards, 289–292, 316–328.
See also Hazardsforwarding, 289, 316–328load-use, 290, 329stalls and, 324–328
Data parallel problem decomposition, B-17, B-18
Data race, 125Data selectors, 258Data transfer instructions.
See also Instructionsdefi ned, 68, 69load, 69off set, 70store, 71–72
Datacenters, 7Data-level parallelism, 524Datapath elements
defi ned, 263sharing, 268
Datapathsbranch, 266building, 263–271control signal truth tables, C-14control unit, 277defi ned, 19design, 263exception handling, 339for fetching instructions, 265for hazard resolution via forwarding, 323for LEGv8 architecture, 269for memory instructions, 267in operation for branch-on-zero
instruction, 280in operation for load instruction, 279in operation for R-type instruction,
277, 278operation of, 276–280pipelined, 297–315for R-type instructions, 278,
276–277single, creating, 267single-cycle, 296static two-issue, 347
Deasserted signals, 262, A-592DEC PDP-8, OL2.22-3Decimal numbers
binary number conversion to, 78defi ned, 75
Decision-making instructions, 93–99Decoders, A-597
two-level, A-653Decoding machine language, 121–125Defect, 26Delayed branches, 295. See also Branches
as control hazard solution, 295embedded RISCs and, D-23for fi ve-stage pipelines, 323–324reducing, 330
Delayed decision, 295DeMorgan’s theorems, A-599
Denormalized numbers, 230Dependability via redundancy, 12Dependable memory hierarchy, 432–437
failure, defi ning, 432Dependences
between pipeline registers, 319between pipeline registers and ALU
inputs, 319bubble insertion and, 326detection, 318name, 348sequence, 316
Designcompromises and, 85datapath, 263digital, 366logic, 260–263, B-1–79main control unit, 273–276memory hierarchy, challenges, 472pipelining instruction sets, 288
Desktop and server RISCs. See also Reduced instruction set computer (RISC) architectures
addressing modes, D-6architecture summary, D-4arithmetic/logical instructions, D-11conditional branches, D-16constant extension summary, D-9control instructions, D-11conventions equivalent to MIPS core,
D-12data transfer instructions, D-10features added to, D-45fl oating-point instructions, D-12instruction formats, D-7multimedia extensions, D-16–18multimedia support, D-18types of, D-3
Desktop computers, defi ned, 5Device driver, OL6.9-5DGEMM (Double precision General
Matrix Multiply), 238, 363, 365, 427, 553
cache blocked version of, 429optimized C version of, 241, 363, 490performance, 365, 430
Dicing, 27
Dies, 26–27
Digital design pipeline, 366
Digital signal-processing (DSP) extensions, D-19
DIMMs (dual inline memory modules), OL5.17-5
Direct Data IO (DDIO), OL6.9-6
Direct memory access (DMA), OL6.9-4Direct3D, B-13Direct-mapped caches. See also Caches
address portions, 421choice of, 422defi ned, 398, 416illustrated, 399memory block location, 417misses, 419single comparator, 421total number of bits, 404
Dirty bit, 452
Dirty pages,
Disk memory, 395–397
Displacement addressing, 120
Distributed Block-Interleaved Parity (RAID 5), OL5.11-6
Divide algorithm, 200
Dividend, 198
Division, 197–203
    algorithm, 199
    dividend, 198
    divisor, 198
Divisor, 198
divu (Divide Unsigned). See also Arithmetic
    faster, 202–203
    floating-point, 220
    hardware, 198–201
    hardware, improved version, 201
    in LEGv8, 203
    operands, 198
    quotient, 198
    remainder, 198
    signed, 201–202
    SRT, 203
Don’t cares, A-605–606example, A-605–606term, 273
Double data rate (DDR), 393Double Data Rate (DDR) SDRAM,
393–394, A-653Double precision. See also Single
precisiondefi ned, 207FMA, B-45–46GPU, B-45–46, B-74representation, 206–207
Doubleword, 66, 158Dual inline memory modules (DIMMs), 395Dynamic branch prediction, 331–334.
See also Control hazardsbranch prediction buff er, 331loops and, 333
Dynamic hardware predictors, 295Dynamic multiple-issue processors, 343,
349–352. See also Multiple issuepipeline scheduling, 350–352superscalar, 349
Dynamic pipeline scheduling, 350–352commit unit, 350concept, 350hardware-based speculation, 352primary units, 351reorder buff er, 355reservation station, 350
Dynamic random access memory (DRAM), 392, 393–395, A-651–653
    bandwidth external to, 412
    cost, 23
    defined, 19, A-651
    DIMM, OL5.17-5
    Double Data Rate (DDR), 393–394
    early board, OL5.17-4
    GPU, B-37–38
    growth of capacity, 25
    history, OL5.17-2
    internal organization of, 394
    pass transistor, A-651
    SIMM, OL5.17-5, OL5.17-6
    single-transistor, A-652
    size, 412
    speed, 23
    synchronous (SDRAM), 393–394, A-648, A-653
    two-level decoder, A-653
Dynamically linked libraries (DLLs), 134–136
defi ned, 134lazy procedure linkage version, 135
E
Early restart, 406Edge-triggered clocking methodology,
261, 262, A-636, A-661advantage, A-637clocks, A-661drawbacks, A-662illustrated, A-638rising edge/falling edge, A-636
EDSAC (Electronic Delay Storage Automatic Calculator), OL1.12-3, OL5.17-2
Eispack, OL3.12-4Electrically erasable programmable
read-only memory (EEPROM), 395
Elementscombinational, 260datapath, 263, 268memory, A-638–646state, 260, 262, 264, A-636, A-638
Embedded computers, 5application requirements, 6design, 5growth, OL1.12-12–1.12-13
Embedded Microprocessor Benchmark Consortium (EEMBC), OL1.12-12
Embedded RISCs. See also Reduced instruction set computer (RISC) architectures
addressing modes, D-6architecture summary, D-4arithmetic/logical instructions, D-14conditional branches, D-16constant extension summary, D-9control instructions, D-15data transfer instructions, D-13delayed branch and, D-23DSP extensions, D-19general purpose registers, D-5instruction conventions, D-15instruction formats, D-8multiply-accumulate approaches, D-19types of, D-4
Encodingdefi ned, C-31LEGv8 instruction, 86, 122ROM control function, C-18–19ROM logic function, A-603x86 instruction, 161–162
ENIAC (Electronic Numerical Integrator and Calculator), OL1.12-2, OL1.12-3, OL5.17-2
EPIC, OL4.16-5Error correction, A-653–655Error Detecting and Correcting Code
(RAID 2), OL5.11-5Error detection, A-654Error detection code, 434Ethernet, 23EX stage
load instructions, 303overfl ow exception detection, 338, 341store instructions, 305
Exabyte, 6Exception enable, 461Exception link register (ELR), 337, 459, 461
address capture, 340defi ned, 338in restart determination, 337
Exception program counters (EPCs), 326address capture, 331copying, 181defi ned, 181, 327in restart determination, 326–327transferring, 182
Exception Syndrome Register (ESR), 337, 461
Exceptions, 336–342association, 342datapath with controls for handling,
339defi ned, 207, 336detecting, 336event types and, 336imprecise, 342interrupts versus, 336in LEGv8 architecture, 337–338overfl ow, 339pipelined computer example, 339in pipelined implementation, 338–342precise, 342reasons for, 337–338result due to overfl ow in add
instruction, 341saving/restoring stage on, 462
Executable fi lesdefi ned, 131
Execute or address calculation stage, 303Execute/address calculation
control line, 312load instruction, 303store instruction, 303
Execution timeas valid performance measure, 51CPU, 32, 33–34pipelining and, 297
Explicit counters, C-23, C-26Exponents, 206Extended-register instructions, 164
F
Failures, synchronizer, A-665Fallacies. See also Pitfalls
add immediate unsigned, 227Amdahl’s law, 572arithmetic, 242–245assembly language for performance,
169commercial binary compatibility
importance, 170defi ned, 49GPUs, B-72–74, B-75
low utilization uses little power, 50peak performance, 572pipelining, 366powerful instructions mean higher
performance, 169right shift , 242
False sharing, 480Fast carry
with “infi nite” hardware, A-626–627with fi rst level of abstraction,
A-627–628with second level of abstraction,
A-628–634Fast Fourier Transforms (FFT), B-53Fault avoidance, 433Fault forecasting, 433Fault tolerance, 433Fermi architecture, 539, 568Field programmable devices (FPDs),
A-666Field programmable gate arrays (FPGAs),
A-666Fields
defi ned, 84format, C-31LEGv8, 84–86names, 84
Files, register, 264, 269, A-638, A-642–644
Fine-grained multithreading, 530Finite-state machines (FSMs), 472–477,
A-655–660control, C-8–22controllers, 475for multicycle control, C-9for simple cache controller, 476–477implementation, 474, A-658Mealy, 475Moore, 475next-state function, 474, A-655output function, A-655, A-657state assignment, A-658state register implementation, A-659style of, 475synchronous, A-655SystemVerilog, OL5.12-7traffi c light example, A-656–658
Flash memory, 395characteristics, 23defi ned, 23
Flat address space, 493
Flip-fl opsD fl ip-fl ops, A-639, A-641defi ned, A-639
Floating point, 205–230, 232assembly language, 221backward step, OL3.12-4–3.12-5binary to decimal conversion, 211branch, 220challenges, 246diversity versus portability,
OL3.12-3–3.12-4division, 220fi rst dispute, OL3.12-2–3.12-3form, 206fused multiply add, 228guard digits, 226–227history, OL3.12-3IEEE 754 standard, 207–211intermediate calculations, 226LEGv8 instruction frequency for, 248LEGv8 instructions, 220–226machine language, 221operands, 221overfl ow, 206packed format, 232precision, 243procedure with two-dimensional
matrices, 223–225programs, compiling, 222–225registers, 226representation, 206–211rounding, 226sign and magnitude, 206SSE2 architecture, 232, 233subtraction, 220underfl ow, 206units, 227in x86, 233
Floating vectors, OL3.12-3Floating-point addition, 212–215
arithmetic unit block diagram, 216binary, 213illustrated, 214instructions, 220steps, 212
Floating-point arithmetic (GPUs), B-41–46basic, B-42double precision, B-45–46, B-74performance, B-44specialized, B-42–44supported formats, B-42texture operations, B-44
Floating-point instructionsdesktop RISC, D-12SPARC, D-31
Floating-point multiplication, 215–219binary, 219illustrated, 218instructions, 220signifi cands, 215steps, 215, 217
Flow-sensitive information, OL2.15-15
Flushing instructions, 329, 330exceptions and, 340
For loops, 147, OL2.15-26inner, OL2.15-24SIMD and, OL6.15-2
Format fi elds, C-31Fortran, OL2.22-7Forwarding, 316–328
ALU before, 321control, 320datapath for hazard resolution, 323defi ned, 289functioning, 317graphical representation, 290illustrations, OL4.13-26multiple results and, 292multiplexors, 322pipeline registers before, 321with two instructions, 289–290Verilog implementation,
OL4.13-2–4.13-4Fractions, 206, 207Frame buff er, 18Frame pointers, 106Front end, OL2.15-3Fully associative caches. See also Caches
block replacement strategies, 468–469
choice of, 422defi ned, 417memory block location, 417misses, 420
Fully connected networks, 551Fused-multiply-add (FMA) operation,
228, B-45–46
G
Galois/Counter Mode (GCM) encryption, 488
Game consoles, B-9
Gates, A-591, A-596AND, A-600, C-7delays, A-634–635mapping ALU control function to,
C-4–7NAND, A-596NOR, A-596, A-638
Gather-scatter, 527, 568General Purpose GPUs (GPGPUs),
B-5General-purpose registers, 154
architectures, OL2.22-3embedded RISCs, D-5
Generatedefi ned, A-628example, A-632super, A-629
Gigabyte, 6Global common subexpression
elimination, OL2.15-6Global memory, B-21, B-39Global miss rates, 430Global optimization, OL2.15-5
code, OL2.15-7implementing, OL2.15-8–2.15-11
Global pointers, 106GPU computing. See also Graphics
processing units (GPUs)defi ned, B-5visual applications, B-6–7
GPU system architectures, B-7–12graphics logical pipeline, B-10heterogeneous, B-7–9implications for, B-24interfaces and drivers, B-9unifi ed, B-10–12
Graph coloring, OL2.15-12Graphics displays
computer hardware support, 18LCD, 18
Graphics logical pipeline, B-10Graphics processing units (GPUs),
538–543. See also GPU computingas accelerators, 538attribute interpolation, B-43–44defi ned, 46, 522, B-3evolution, B-5fallacies and pitfalls, B-72–75fl oating-point arithmetic, B-17,
B-41–46, B-74GeForce 8-series generation, B-5general computation, B-73–74
General Purpose (GPGPUs), B-5graphics mode, B-6graphics trends, B-4history, B-3–4logical graphics pipeline, B-13–14mapping applications to, B-55–72memory, 538multilevel caches and, 538N-body applications, B-65–72NVIDIA architecture, 539–541parallel memory system, B-36–41parallelism, 539, B-76performance doubling, B-4perspective, 543–545programming, B-12–24programming interfaces to, B-17real-time graphics, B-13summary, B-76
Graphics shader programs, B-14–15Gresham’s Law, 248, OL3.12-2Grid computing, 549Grids, B-19GTX 280, 564–569Guard digits
defi ned, 226rounding with, 227
H
Half precision, B-42Halfwords, 114Hamming, Richard, 434Hamming distance, 434Hamming Error Correction Code (ECC),
434–435calculating, 434–435
Hard disksaccess times, 23defi ned, 23
Hardwareas hierarchical layer, 13language of, 14–16operations, 63–67supporting procedures in, 100–110synthesis, A-609translating microprograms to, C-28–32virtualizable, 440
Hardware description languages. See also Verilog
defi ned, A-608using, A-608–614VHDL, A-608–609
Hardware multithreading, 530–533coarse-grained, 530options, 531simultaneous, 531
Hardware-based speculation, 352Harvard architecture, OL1.12-4Hazard detection units, 324
functions, 324pipeline connections for, 327
Hazards. See also Pipeliningcontrol, 292–293, 328–336data, 289, 316–328forwarding and, 323structural, 288–289, 305
Heapallocating space on, 107–110defi ned, 107
Heterogeneous systems, B-4–5architecture, B-7–9defi ned, B-3
Hexadecimal numbers, 83binary number conversion to,
83, 84Hierarchy of memories, 12High-level languages, 14–16
benefi ts, 16computer architectures, OL2.22-5importance, 16
High-level optimizations, OL2.15-4–2.15-5
Hit rate, 390Hit time
cache performance and, 415–416defi ned, 390
Hit under miss, 483Hold time, A-642Horizontal microcode, C-32Hot-swapping, OL5.11-7Human genome project, 4
I
I/O, OL6.9-2, OL6.9-3
    on system performance, OL5.11-2
I/O benchmarks. See Benchmarks
IBM 360/85, OL5.17-7
IBM 701, OL1.12-5
IBM 7030, OL4.16-2
IBM ALOG, OL3.12-7
IBM Blue Gene, OL6.15-9–6.15-10
IBM Personal Computer, OL1.12-7, OL2.22-6
IBM System/360 computers, OL1.12-6, OL3.12-6, OL4.16-2
IBM z/VM, OL5.17-8ID stage
branch execution in, 330, 331load instructions, 303store instruction in, 302
IEEE 754 fl oating-point standard, 207–211, 208, OL3.12-8–3.12-10. See also Floating point
fi rst chips, OL3.12-8–3.12-9in GPU arithmetic, B-42–43implementation, OL3.12-10rounding modes, 227today, OL3.12-10
If statements, 118I-format, 87If-then-else, 94Immediate addressing, 120Immediate instructions, 73Imprecise interrupts, 342, OL4.16-4Index-out-of-bounds check, 98Induction variable elimination, OL2.15-7Inheritance, OL2.15-15In-order commit, 351Input devices, 16Inputs, 273Instances, OL2.15-15Instruction count, 36, 38Instruction decode/register fi le read
stagecontrol line, 311–312load instruction, 300store instruction, 305
Instruction execution illustrations, OL4.13-16–4.13-17
clock cycle 9, OL4.13-24clock cycles 1 and 2, OL4.13-21clock cycles 3 and 4, OL4.13-22clock cycles 5 and 6, OL4.13-23clock cycles 7 and 8, OL4.13-24examples, OL4.13-20–4.13-25forwarding, OL4.13-26–4.13-31no hazard, OL4.13-17pipelines with stalls and forwarding,
OL4.13-26, OL4.13-20Instruction fetch stage
control line, 312load instruction, 300store instruction, 305
Instruction formats, 161ARMv7, 151
defi ned, 83desktop/server RISC architectures, D-7embedded RISC architectures, D-8I-type, 85LEGv8, 151MIPS, 151R-type, 85, 273x86, 161
Instruction latency, 367Instruction mix, 39, OL1.12-10Instruction set architecture
ARM, 152–154branch address calculation, 266defi ned, 22, 52history, 173–174maintaining, 52protection and, 441thread, B-31–34virtual machine support, 440–441
Instruction sets, B-49ARMv8, 171design for pipelining, 228LEGv8, 247MIPS-32, 151x86 growth, 170
Instruction-level parallelism (ILP), 365. See also Parallelism
compiler exploitation, OL4.16-5–4.16-6defi ned, 43, 344exploitation, increasing, 354and matrix multiply, 363–365
Instructions, 60–174, D-25–27, D-40–42. See also Arithmetic instructions; MIPS; Operands
    add immediate, 73
    addition, 190
    Alpha, D-27–29
    arithmetic-logical, 263
    ARM, 152–154, D-36–37
    assembly, 66
    basic block, 96
    cache-aware, 496
    conditional branch, 93, 94
    conditional move, 334
    core, 246
    data transfer, 68
    decision-making, 93–99
    defined, 14, 62
    desktop RISC conventions, D-12
    as electronic signals, 82
    embedded RISC conventions, D-15
    encoding, 86
    fetching, 265
    fields, 83
    floating-point (x86), 232, 233
    floating-point, 220–221
    flushing, 329, 330, 340
    immediate, 73
    introduction to, 62–63
    jump
    left-to-right flow, 298
    load, 69
    logical operations, 90–93
    M32R, D-40
    memory access, B-33–34
    memory-reference, 257
    multiplication, 197
    nop, 325–326
    PA-RISC, D-34–36
    performance, 35–36
    pipeline sequence, 325
    PowerPC, D-12–13, D-32–34
    PTX, B-31, B-32
    representation in computer, 82–89
    restartable, 462
    resuming,
    R-type, 263, 268
    SPARC, D-29–32
    store, 72
    store exclusive register (STXR), 126
    subtraction, 190
    SuperH, D-39–40
    thread, B-30–31
    Thumb, D-38
    vector, 524
    as words, 62
    x86, 154–159
Instructions per clock cycle (IPC), 343Integrated circuits (ICs), 19. See also
specifi c chipscost, 27defi ned, 25manufacturing process, 26very large-scale (VLSIs), 25
Intel Core i7, 46–49, 256, 517, 564–569address translation for, 483architectural registers, 358caches in, 484memory hierarchies of, 482–488microarchitecture, 358performance of, 485–486SPEC CPU benchmark, 46–48SPEC power benchmark, 48–49TLB hardware for, 483
Intel Core i7 920, 358–360microarchitecture, 358
Intel Core i7 960benchmarking and roofl ines of,
564–569Intel Core i7 Pipelines, 354, 358–360
memory components, 359performance, 361–362program performance, 362specifi cation, 356
Intel IA-64 architecture, OL2.22-3
Intel Paragon, OL6.15-8
Intel Threading Building Blocks, B-60
Intel x86 microprocessors
    clock rate and power for, 40
Interference graphs, OL2.15-12
Interleaving, 412
Interprocedural analysis, OL2.15-14
Interrupt enable, 461
Interrupt-driven I/O, OL6.9-4
Interrupts
defi ned, 207, 336event types and, 336exceptions versus, 336imprecise, 342, OL4.16-4precise, 342vectored, 337
Intrinsity FastMATH processor, 409–412caches, 410data miss rates, 411, 421read processing, 456TLB, 454–457write-through processing, 456
Inverted page tables, 451Issue packets, 345
J
Java
    bytecode, 136
    bytecode architecture, OL2.15-17
    characters in, 113–115
    compiling in, OL2.15-19–2.15-20
    goals, 136
    interpreting, 136, 150, OL2.15-15
    keywords, OL2.15-21
    method invocation in, OL2.15-21
    pointers, OL2.15-26
    primitive types, OL2.15-26
    programs, starting, 136–137
    reference types, OL2.15-26
    sort algorithms, 146
    strings in, 113–115
    translation hierarchy, 136
    while loop compilation in, OL2.15-18–2.15-19
Java Virtual Machine (JVM), 150, OL2.15-16
Jump instructions, 254, D-26
    branch instruction versus, 270
    control and datapath for, 271
    implementing, 270
    instruction format, 270
Just In Time (JIT) compilers, 137, 576
K
Karnaugh maps, A-606Kernel mode, 459Kernels
CUDA, B-19, B-24defi ned, B-19
Kilobyte, 6
L
LAPACK, 243Large-scale multiprocessors, OL6.15-7,
OL6.15-9–6.15-10Latches
D latch, A-639, A-640defi ned, A-639
Latencyinstruction, 367memory, B-74–75pipeline, 297use, 346
LDUR (load register), 64LDURB (load byte), 64LDURH (load half), 64LDURSW (load signed word), 64LDXR (load exclusive register), 64, 122Leaf procedures. See also Procedures
defi ned, 104example, 113
Least recently used (LRU)as block replacement strategy, 468–469defi ned, 423pages, 448
Least significant bits, A-620
    defined, 75
    SPARC, D-31
Left-to-right instruction flow, 298–299
LEGv8, 62, 64, 86
architecture, 204arithmetic core, 246arithmetic instructions, 63arithmetic/logical instructions not in,
D-21, D-23assembly instruction, mapping, 82–83common extensions to, D-20–25compiling C assignment statements
into, 66compiling complex C assignment into,
66control instructions not in, D-21control registers, 461control unit, C-10data transfer instructions not in, D-20,
D-22divide in, 203exceptions in, 337–338fi elds, 84–85fl oating-point instructions not in,
D-22fl oating-point instructions, 220–221instruction classes, 173instruction encoding, 86, 122instruction formats, 124, 151instruction set, 62, 171, 246, 247,
256–260, D-9–10machine language, 88memory addresses, 71memory allocation for program and
data, 108multiply in, 197operands, 64Pseudo, 246register conventions, 109static multiple issue with, 345–347
Level-sensitive clocking, A-662, A-663–664
defi ned, A-662two-phase, A-663
Link, OL6.9-2Linkers, 131–134
defi ned, 131executable fi les, 131steps, 131using, 131–134
Linking object fi les, 132–134Linpack, 554, OL3.12-4Liquid crystal displays (LCDs), 18LISP, SPARC support, D-30Live range, OL2.15-11Livermore Loops, OL1.12-11
Load balancing, 521–522Load byte, 167Load halfword, 167Load instructions. See also Store
instructionsaccess, B-41base register, 274compiling with, 71–72datapath in operation for, 279defi ned, 69EX stage, 303halfword unsigned, 114ID stage, 302IF stage, 302load byte unsigned, 79load half, 114MEM stage, 304move wide with keep, 115move wide with zeros, 115pipelined datapath in, 307signed, 79unit for implementing, 267unsigned, 79WB stage, 304
Load register, 69, 72Loaders, 134Load-store architectures, OL2.22-3Load-use data hazard, 290, 329Load-use stalls, 329Local area networks (LANs), 24. See also
NetworksLocal memory, B-21, B-40Local miss rates, 430Local optimization, OL2.15-5. See also
Optimizationimplementing, OL2.15-8
Localityprinciple, 388spatial, 388, 391temporal, 388, 391
Lock synchronization, 125Locks, 534Logic
address select, C-24, C-25ALU control, C-6combinational, 262, A-593, A-597–608components, 261control unit equations, C-11design, 260–263, B-1–79equations, A-595minimization, A-606programmable array (PAL), A-666
sequential, A-593, A-644–646two-level, A-599–602
Logical operations, 90–93AND, 91ARM, 154desktop RISC, D-11embedded RISC, D-14EOR, 92NOT, 91OR, 91shift s, 90
Long instruction word (LIW), OL4.16-5Lookup tables (LUTs), A-667Loop unrolling
defi ned, 348, OL2.15-4for multiple-issue pipelines, 348register renaming and, 348
Loops, 95–96conditional branches in, 118for, 147prediction and, 333–334test, 147, 148while, compiling, 95–96
M
M32R, D-15, D-40Machine code, 83Machine instructions, 83Machine language, 15
branch off set in, 119decoding, 121–124defi ned, 14, 83fl oating-point, 221illustrated, 15LEGv8, 88SRAM, 21translating MIPS assembly language
into, 86Main memory, 442. See also Memory
defi ned, 23page tables, 451physical addresses, 442
Mapping applications, B-55–72Mark computers, OL1.12-14Matrix multiply, 238–242, 569–571Mealy machine, 475, A-656, A-659, A-660Mean time to failure (MTTF), 432
improving, 433versus AFR of disks, 433–434
Media Access Control (MAC) address, OL6.9-7
Megabyte, 6Memory
    addresses, 79
    affinity, 562
    atomic, B-21
    bandwidth, 394–395, 411
    cache, 21, 397–412, 412–431
    CAM, 422
    constant, B-40
    control, C-26
    defined, 19
    DRAM, 19, 393–394, A-651–653
    flash, 23
    global, B-21, B-39
    GPU, 538
    instructions, datapath for, 267
    local, B-21, B-40
    main, 23
    nonvolatile, 22
    operands, 68–69
    parallel system, B-36–41
    read-only (ROM), A-602–604
    SDRAM, 393–394
    secondary, 23
    shared, B-21, B-39–40
    spaces, B-39
    SRAM, A-646–650
    stalls, 414
    technologies for building, 24–28
    texture, B-40
    virtual, 441–465
    volatile, 22
Memory access instructions, B-33–34Memory access stage
control line, 313load instruction, 303store instruction, 303
Memory bandwidth, 565, 573Memory consistency model, 481Memory elements, A-638–646
clocked, A-639D fl ip-fl op, A-639, A-641D latch, A-640DRAMs, A-651–655fl ip-fl op, A-639hold time, A-642latch, A-639setup time, A-641, A-642SRAMs, A-646–650unclocked, A-639
Memory hierarchies, 559of ARM cortex-A8, 482–488
    block (or line), 390
    cache performance, 412–431
    caches, 397–431
    common framework, 465–472
    defined, 389
    design challenges, 472
    development, OL5.17-6–5.17-8
    exploiting, 386–513
    of Intel Core i7, 482–488
    level pairs, 390
    multiple levels, 389
    overall operation of, 457–458
    parallelism and, 477–481, OL5.11-2
    pitfalls, 491–495
    program execution time and, 431
    quantitative design parameters, 466
    redundant arrays of inexpensive disks, 481
    reliance on, 390
    structure, 389
    structure diagram, 392
    variance, 431
    virtual memory, 441–465
Memory rank, 395Memory technologies, 392–397
disk memory, 395–397DRAM technology, 392, 393–395fl ash memory, 395SRAM technology, 392, 393
Memory-mapped I/O, OL6.9-3Memory-stall clock cycles, 413Message passing
defi ned, 543multiprocessors, 543–548
Metastability, A-664Methods
defi ned, OL2.15-5invoking in Java, OL2.15-20–2.15-21
Microarchitectures, 358Intel Core i7 920, 358
Microcodeassembler, C-30control unit as, C-28defi ned, C-27dispatch ROMs, C-30–31horizontal, C-32vertical, C-32
Microinstructions, C-31Microprocessors
design shift , 517multicore, 8, 43, 517
Microprograms
as abstract control representation, C-30fi eld translation, C-29translating to hardware, C-28–32
Migration, 479Million instructions per second (MIPS),
51Minterms
defi ned, A-600, C-20in PLA implementation, C-20
MIP-map, B-44
MIPS and ARMv8
    common features between, 152
MIPS-16
16-bit instruction set, D-41–42immediate fi elds, D-41instructions, D-40–42MIPS core instruction changes, D-42PC-relative addressing, D-41
MIPS-32 instruction set, 151MIPS-64 instructions, 151, D-25–27
conditional procedure call instructions, D-27
constant shift amount, D-25jump/call not PC-relative, D-26move to/from control registers, D-26nonaligned data transfers, D-25NOR, D-25parallel single precision fl oating-point
operations, D-27reciprocal and reciprocal square root,
D-27SYSCALL, D-25TLB instructions, D-26–27
Mirroring, OL5.11-5Miss penalty
defi ned, 390determination, 405–406multilevel caches, reducing, 424
Miss ratesblock size versus, 406data cache, 467defi ned, 390global, 430improvement, 405–406Intrinsity FastMATH processor, 411local, 430miss sources, 471split cache, 411
Miss under miss, 483MMX (MultiMedia eXtension), 232Moore machines, 475, A-656, A-659,
A-660
Moore’s law, 11, 393, 538, OL6.9-2, B-72–73
Most signifi cant bit1-bit ALU for, A-621defi ned, 75
move (Move), 144MOVK (move wide with keep), 64, 115MOVZ (move wide with zero), 64, 115MS-DOS, OL5.17-11Multicore, 533–537Multicore multiprocessors, 8, 43
defi ned, 8, 517MULTICS (Multiplexed Information
and Computing Service), OL5.17-9–5.17-10
Multilevel caches. See also Cachescomplications, 430defi ned, 412, 430miss penalty, reducing, 424performance of, 424summary, 431–432
Multimedia extensionsdesktop/server RISCs, D-16–18as SIMD extensions to instruction sets,
OL6.15-4vector versus, 525–526
Multiple dimension arrays, 226Multiple instruction multiple data
(MIMD), 574defi ned, 523, 524fi rst multiprocessor, OL6.15-14
Multiple instruction single data (MISD), 523
Multiple issue, 343–350code scheduling, 347–348dynamic, 343, 349–350issue packets, 345loop unrolling and, 348processors, 343, 344static, 343, 345–349throughput and, 353
Multiple processors, 569–571Multiple-clock-cycle pipeline diagrams, 308
fi ve instructions, 309illustrated, 309
Multiplexors, A-598controls, 473in datapath, 275defi ned, 258forwarding, control values, 322selector control, 271two-input, A-598
Multiplicand, 192Multiplication, 191–197. See also
Arithmeticfast, hardware, 196faster, 196–197fi rst algorithm, 194fl oating-point, 215–217hardware, 192–194instructions, 197in MIPS, 197multiplicand, 197multiplier, 197operands, 197product, 197sequential version, 192–194signed, 196
Multiplier, 192Multiply algorithm, 195Multiply-add (MAD), B-42Multiprocessors
benchmarks, 554–556bus-based coherent, OL6.15-7defi ned, 516historical perspective, 577large-scale, OL6.15-7–6.15-8,
OL6.15-9–6.15-10message-passing, 545–550multithreaded architecture, B-26–27,
B-35–36organization, 515, 545for performance, 573shared memory, 517, 533–537soft ware, 517TFLOPS, OL6.15-6UMA, 534
Multistage networks, 551Multithreaded multiprocessor
architecture, B-25–36conclusion, B-36ISA, B-31–34massive multithreading, B-25–26multiprocessor, B-26–27multiprocessor comparison,
B-35–36SIMT, B-27–30special function units (SFUs), B-35streaming processor (SP), B-34thread instructions, B-30–31threads/thread blocks management,
B-30Multithreading, B-25–26
coarse-grained, 530defi ned, 522
fi ne-grained, 530hardware, 530–533simultaneous (SMT), 531–533
Must-information, OL2.15-5Mutual exclusion, 125
N
Name dependence, 348NAND gates, A-596NAS (NASA Advanced Supercomputing),
556N-body
all-pairs algorithm, B-65GPU simulation, B-71mathematics, B-65–67multiple threads per body, B-68–69optimization, B-67performance comparison, B-69–70results, B-70–72shared memory use, B-67–68
Negation shortcut, 79Nested procedures, 104–105
compiling recursive procedure showing, 104–105
NetFPGA 10-Gigabit Ethernet card, OL6.9-2, OL6.9-3
Network of Workstations, OL6.15-8–6.15-9
Network topologies, 550–553implementing, 552multistage, 553
Networking, OL6.9-4
    operating system in, OL6.9-4–6.9-5
    performance improvement, OL6.9-7–6.9-10
Networks, 23–24
advantages, 23bandwidth, 549crossbar, 551fully connected, 551local area (LANs), 24multistage, 551wide area (WANs), 24
Newton’s iteration, 226Next state
nonsequential, C-24sequential, C-23
Next-state function, 474, A-655defi ned, 474implementing, with sequencer,
C-22–28Next-state outputs, C-10, C-12–13
example, C-12–13implementation, C-12logic equations, C-12–13truth tables, C-15
No Redundancy (RAID 0), OL5.11-4No write allocation, 408Nonblocking assignment, A-612Nonblocking caches, 355, 483Nonuniform memory access (NUMA), 534Nonvolatile memory, 22Nops, 326NOR gates, A-596
cross-coupled, A-638D latch implemented with, A-640
NOR operation, D-25NOT operation, 91, A-594Not-A-Number (NaN), 235–236Numbers
binary, 75computer versus real-world, 229decimal, 75, 78denormalized, 230hexadecimal, 84signed, 75–82unsigned, 75–82
NVIDIA GeForce 8800, B-46–55all-pairs N-body algorithm, B-71dense linear algebra computations,
B-51–53FFT performance, B-53instruction set, B-49performance, B-51rasterization, B-50ROP, B-50–51scalability, B-51sorting performance, B-54–55special function approximationstatistics, B-43special function unit (SFU), B-50streaming multiprocessor (SM), B-48–49streaming processor, B-49–50streaming processor array (SPA), B-46texture/processor cluster (TPC),
B-47–48NVIDIA GPU architecture, 539–541NVIDIA GTX 280, 565, 566NVIDIA Tesla GPU, 564–569
O
Object fi les, 132debugging information, 131header, 130
linking, 132–134relocation information, 130static data segment, 130symbol table, 130text segment, 130
Object-oriented languages. See also Javabrief history, OL2.22-8defi ned, 150, OL2.15-5
One’s complement, 82, A-617Opcodes
control line setting and, 276defi ned, 84, 274
OpenGL, B-13OpenMP (Open MultiProcessing), 536,
556Operands, 67–72. See also Instructions
32-bit immediate, 115–116adding, 189arithmetic instructions, 67compiling assignment when in
memory, 69constant, 73–74division, 197fl oating-point, 221LEGv8, 64memory, 68–69multiplication, 191
Operating systemsbrief history, OL5.17-9–5.17-12defi ned, 13encapsulation, 22in networking, OL6.9-4–6.9-5
Operationsatomic, implementing, 126hardware, 63–67logical, 90–93x86 integer, 157, 158
Optimizationclass explanation, OL2.15-14compiler, 146control implementation,
C-27–28global, OL2.15-5high-level, OL2.15-4–2.15-5local, OL2.15-5, OL2.15-8manual, 150
OR operation, 91, A-594Out-of-order execution
defi ned, 351performance complexity, 430processors, 355
Output devices, 16Overfl ow
defi ned, 76, 206detection, 190exceptions, 339fl oating-point, 207occurrence, 77saturation and, 191subtraction, 189
P
P+Q redundancy (RAID 6), OL5.11-7Packed fl oating-point format, 232Page faults, 448. See also Virtual memory
for data access, 463defi ned, 442handling, 443, 461–464virtual address causing, 457, 458
Page tables, 468defi ned, 446illustrated, 449indexing, 446inverted, 451levels, 451main memory, 451register, 446storage reduction techniques, 451updating, 446VMM, 463
Pages. See also Virtual memorydefi ned, 442dirty, 452fi nding, 446–447LRU, 448off set, 443physical number, 443placing, 432–434size, 444virtual number, 443
Parallel bus, OL6.9-3Parallel execution, 125Parallel memory system, B-36–41.
See also Graphics processing units (GPUs)
caches, B-38constant memory, B-40DRAM considerations, B-37–38global memory, B-39load/store access, B-41local memory, B-40memory spaces, B-39MMU, B-38–39ROP, B-41shared memory, B-39–40
surfaces, B-41texture memory, B-40
Parallel processing programs, 518–523creation diffi culty, 518–523defi ned, 516for message passing, 533great debates in, OL6.15-5for shared address space, 533–534use of, 573
Parallel reduction, B-62Parallel scan, B-60–63
CUDA template, B-61inclusive, B-60tree-based, B-62
Parallel software, 517
Parallelism, 12, 43, 342–355
    and computer arithmetic, 230–232
    data-level, 246, 524
    debates, OL6.15-5–6.15-7
    GPUs and, 538, B-76
    instruction-level, 43, 342, 354
    memory hierarchies and, 477–481, OL5.11-2
    multicore and, 533
    multiple issue, 342–349
    multithreading and, 531
    performance benefits, 44
    process-level, 516
    redundant arrays of inexpensive disks, 481
    subword, D-17
    task, B-24
    task-level, 516
    thread, B-22
Paravirtualization, 495PA-RISC, D-14, D-17
branch vectored, D-35conditional branches, D-34, D-35debug instructions, D-36decimal operations, D-35extract and deposit, D-35instructions, D-34–36load and clear instructions, D-36multiply/add and multiply/subtract,
D-36nullifi cation, D-34nullifying branch option, D-25store bytes short, D-36synthesized multiply and divide,
D-34–35Parity, OL5.11-5
bits, 435code, 434, A-653
PARSEC (Princeton Application Repository for Shared Memory Computers), 556
Pass transistor, A-651PCI-Express (PCIe), 553, B-8, OL6.9-2PC-relative addressing, 118, 120Peak fl oating-point performance, 558Pentium bug morality play, 244Performance, 28–40
assessing, 28classic CPU equation, 36–40components, 38CPU, 33–35defi ning, 29–32equation, using, 36improving, 34–35instruction, 35–36measuring, 33–35, OL1.12-10program, 39–40ratio, 31relative, 31–32response time, 30–31sorting, B-54–55throughput, 30–31time measurement, 32
Personal computers (PCs), 7defi ned, 5
Personal mobile device (PMD)defi ned, 7
Petabyte, 6Physical addresses, 442
mapping to, 442–443space, 533, 535
Physically addressed caches, 458Pipeline registers
before forwarding, 320dependences, 319forwarding unit selection, 323
Pipeline stalls, 291avoiding with code reordering, 291data hazards and, 324–328insertion, 326load-use, 329as solution to control hazards, 293
Pipelined branches, 331Pipelined control, 311–315. See also
Controlcontrol lines, 311, 312overview illustration, 327specifying, 312
Pipelined datapaths, 297–315with connected control signals, 315
with control signals, 311–315corrected, 307illustrated, 300in load instruction stages, 307
Pipelined dependencies, 317Pipelines
    branch instruction impact, 329
    effectiveness, improving, OL4.16-4–4.16-5
    execute and address calculation stage, 301, 303
    five-stage, 285, 301, 309
    graphic representation, 290, 307–311
    instruction decode and register file read stage, 300, 303
    instruction fetch stage, 301, 303
    instructions sequence, 325
    latency, 297
    memory access stage, 301, 303
    multiple-clock-cycle diagrams, 308
    performance bottlenecks, 353
    single-clock-cycle diagrams, 308
    stages, 285
    static two-issue, 345
    write-back stage, 301, 305
Pipelining, 12, 283–297advanced, 354–355benefi ts, 283control hazards, 292–293data hazards, 289exceptions and, 338–342execution time and, 297fallacies, 366–367hazards, 288instruction set design for, 288laundry analogy, 284overview, 283–297paradox, 285performance improvement, 288pitfall, 366–367simultaneous executing instructions,
297speed-up formula, 285structural hazards, 288, 305summary, 296throughput and, 297
Pitfalls. See also Fallaciesaddress space extension, 493arithmetic, 242–245associativity, 492defi ned, 49GPUs, B-74–75
ignoring memory system behavior, 491memory hierarchies, 491–495out-of-order processor evaluation, 492performance equation subset, 50–51pipelining, 366–367pointer to automatic variables, 171sequential word addresses, 171simulating cache, 491soft ware development with
multiprocessors, 570VMM implementation, 495
Pixel shader example, B-15–17Pixels, 18Pointers
arrays versus, 146–150frame, 106global, 106incrementing, 148Java, OL2.15-26stack, 101, 105
Polling, OL6.9-8Pop, 101Power
clock rate and, 40critical nature of, 53effi ciency, 354–355relative, 41
PowerPCalgebraic right shift , D-33branch registers, D-32–33condition codes, D-12instructions, D-12–13instructions unique to, D-31–33load multiple/store multiple, D-33logical shift ed immediate, D-33rotate with mask, D-33
Precise interrupts, 342Prediction, 12
2-bit scheme, 333accuracy, 333dynamic branch, 331–333loops and, 333–334steady-state, 333
Prefetching, 496, 560Primitive types, OL2.15-26Procedure calls
preservation across, 106Procedures, 100–110
compiling, 102compiling, showing nested procedure
linking, 102–104execution steps, 100
frames, 106leaf, 104nested, 104–106recursive, 108for setting arrays to zero, 147sort, 140–145strcpy, 112–113string copy, 112–113swap, 138
Process identifi ers, 460Process-level parallelism, 516Processors, 254–368
as cores, 43control, 19datapath, 19defi ned, 17, 19dynamic multiple-issue, 343multiple-issue, 343out-of-order execution, 355, 430performance growth, 44ROP, B-12, B-41speculation, 344–345static multiple-issue, 343, 345–349streaming, B-34superscalar, 349, 531–532, OL4.16-5technologies for building, 24–28two-issue, 346vector, 523–524VLIW, 345
Product, 192Product of sums, A-599Program counters (PCs), 263
changing with conditional branch, 334
defi ned, 101, 263exception, 459, 461incrementing, 263, 265instruction updates, 300
Program performanceelements aff ecting, 39understanding, 9
Programmable array logic (PAL), A-666Programmable logic arrays (PLAs)
component dots illustration, A-604control function implementation, C-7,
C-20–21defi ned, A-600example, A-601–602illustrated, A-601ROMs and, A-603–604size, C-20truth table implementation, A-601
Programmable logic devices (PLDs), A-666
Programmable ROMs (PROMs), A-602Programming languages. See also specifi c
languagesbrief history of, OL2.22-7–2.22-8object-oriented, 150variables, 67
Programsassembly language, 129Java, starting, 136–137parallel processing, 516–523starting, 128–137translating, 128–137
Propagatedefi ned, A-628example, A-632super, A-629
Protected keywords, OL2.15-21Protection
defi ned, 442implementing, 459–460mechanisms, OL5.17-9VMs for, 438
Protection group, OL5.11-5Pseudo MIPS
defi ned, 246instruction set, 248
Pseudoinstructionsdefi ned, 129summary, 130
Pthreads (POSIX threads), 556PTX instructions, B-31, B-32Public keywords, OL2.15-21Push
defi ned, 101using, 104
Q
Quad words, 158Quicksort, 425, 426Quotient, 198
R
Race, A-661Radix sort, 425, 426, B-63–65
CUDA code, B-64implementation, B-63–65
RAID. See Redundant arrays of inexpensive disks (RAID)
RAM, 9Raster operation (ROP) processors, B-12,
B-41, B-50–51fi xed function, B-41
Raster refresh buff er, 18Rasterization, B-50Ray casting (RC), 568Read-only memories (ROMs),
A-602–604control entries, C-16–17control function encoding, C-18–19dispatch, C-25implementation, C-15–19logic function encoding, A-603overhead, C-18PLAs and, A-603–604programmable (PROM), A-602total size, C-16
Read-stall cycles, 413Read-write head, 395Receive message routine, 545Recursive procedures, 108. See also
Proceduresclone invocation, 104
Reduced instruction set computer (RISC) architectures, D-2–45, OL2.22-5, OL4.16-4. See also Desktop and server RISCs; Embedded RISCs
group types, D-3–4instruction set lineage, D-44
Reduction, 535Redundant arrays of inexpensive disks
(RAID), OL5.11-2–5.11-8history, OL5.11-8RAID 0, OL5.11-4RAID 1, OL5.11-5RAID 2, OL5.11-5RAID 3, OL5.11-5RAID 4, OL5.11-5–5.11-6RAID 5, OL5.11-6–5.11-7RAID 6, OL5.11-7spread of, OL5.11-6summary, OL5.11-7–5.11-8use statistics, OL5.11-7
Reference bit, 450References
absolute, 131types, OL2.15-26
Register 31, 74, 102, 175Register addressing, 120Register allocation, OL2.15-11–2.15-13
Register files, A-638, A-642–644
    defined, 264, A-638, A-642
    in behavioral Verilog, A-645
    single, 269
    two read ports implementation, A-643
    with two read ports/one write port, A-643
    write port implementation, A-644
Register-memory architecture, OL2.22-3Registers, 156, 157–158
architectural, 336–342base, 69clock cycle time and, 67compiling C assignment with, 68defi ned, 67destination, 85, 274fl oating-point, 226left half, 301LEGv8 conventions, 108mapping, 82number specifi cation, 264page table, 446pipeline, 319, 321, 323primitives, 67renaming, 348right half, 301spilling, 72Status, 337temporary, 68, 102variables, 67
Relative performance, 31–32Relative power, 41Reliability, 432Remainder
defi ned, 198Reorder buff ers, 355Replication, 479Requested word fi rst, 406Request-level parallelism, 548Reservation stations
buff ering operands in, 350defi ned, 350
Response time, 30–31Restartable instructions, 462Return address, 100Return from exception (ERET), 459R-format, 274
ALU operations, 265defi ned, 86
Ripple carryadder, A-617carry lookahead speed versus,
A-634–635
Roofl ine model, 558–559, 560, 561with ceilings, 561computational roofl ine, 559illustrated, 557Opteron generations, 558with overlapping areas shaded, 563peak fl oating-point performance, 562peak memory performance, 562with two kernels, 563
Rotational delay. See Rotational latency
Rotational latency, 397
Rounding, 226
    accurate, 226
    bits, 228
    with guard digits, 227
    IEEE 754 modes, 227
Row-major order, 225, 427R-type instructions, 264
datapath for, 276datapath in operation for, 278
S
Saturation, 191
ScaLAPACK, 244
Scaling
strong, 521weak, 521
Scientifi c notationadding numbers in, 213defi ned, 205for reals, 205
Search engines, 4
Secondary memory, 23
Sectors, 395
Secure Hash Algorithm (SHA)
encryption, 488Seek, 396Segmentation, 445Selector values, A-598Semiconductors, 25Send message routine, 545Sensitivity list, A-612Sequencers
explicit, C-32implementing next-state function with,
C-22–28Sequential logic, A-593Servers, OL5. See also Desktop and server
RISCscost and capability, 5
Service accomplishment, 432Service interruption, 432
Set instructions, 97Set-associative caches, 417. See also
Cachesaddress portions, 421block replacement strategies, 468choice of, 467four-way, 418, 421memory-block location, 417misses, 419–420n-way, 417two-way, 418
Setup time, A-641, A-642Shaders
defi ned, B-14fl oating-point arithmetic, B-14graphics, B-14–15pixel example, B-15–17
Shading languages, B-14Shadowing, OL5.11-5Shared memory. See also Memory
as low-latency memory, B-21caching in, B-58–60CUDA, B-58N-body and, B-67–68per-CTA, B-39SRAM banks, B-40
Shared memory multiprocessors (SMP), 531–535
defi ned, 517, 531single physical address space, 531synchronization, 534
Shift amount, 84Shift instructions, 90Sign and magnitude, 206Sign bit, 78Sign extension, 266
defi ned, 78shortcut, 80
Signalsasserted, 262, A-592control, 262, 274–275deasserted, 262, A-592
Signed division, 201–202Signed multiplication, 196Signed numbers, 75–82
sign and magnitude, 77treating as unsigned, 98
Signifi cands, 207addition, 212multiplication, 215
Silicon, 25as key hardware technology, 53crystal ingot, 26
defi ned, 26wafers, 26
Silicon crystal ingot, 26SIMD (Single Instruction Multiple Data),
522, 574computers, OL6.15-2–6.15-4data vector, B-35extensions, OL6.15-4for loops and, OL6.15-3massively parallel multiprocessors,
OL6.15-2small-scale, OL6.15-4vector architecture, 524–525in x86, 524
SIMMs (single inline memory modules), OL5.17-5, OL5.17-6
Simple programmable logic devices (SPLDs), A-666
Simplicity, 171Simultaneous multithreading (SMT),
531–533support, 531thread-level parallelism, 531unused issue slots, 531
Single error correcting/Double error detecting (SEC/DED), 434–436
Single instruction single data (SISD), 523Single precision. See also Double
precisionbinary representation, 210defi ned, 207
Single-clock-cycle pipeline diagrams, 308illustrated, 310
Single-cycle datapaths. See also Datapathsillustrated, 298instruction execution, 299
Single-cycle implementationcontrol function for, 281defi ned, 281nonpipelined execution versus
pipelined execution, 287non-use of, 284penalty, 283pipelined performance versus, 285
Single-instruction multiple-thread (SIMT), B-27–30
overhead, B-35multithreaded warp scheduling, B-28processor architecture, B-28warp execution and divergence,
B-29–30Single-program multiple data (SPMD),
B-22
Smalltalk-80, OL2.22-8Smart phones, 7Snooping protocol, 479–481Snoopy cache coherence, OL5.12-7Soft ware optimization
via blocking, 427–432Sort algorithms, 146Soft ware
layers, 13multiprocessor, 516parallel, 517as service, 7, 547, 574systems, 13
Sort procedure, 140–144. See also Procedures
code for body, 140–142full procedure, 143–144passing parameters in, 143preserving registers in, 143procedure call, 143register allocation for, 140
Sorting performance, B-54–55Space allocation
on heap, 107–110on stack, 106
SPARCannulling branch, D-23CASA, D-31conditional branches, D-10–12fast traps, D-30fl oating-point operations, D-31instructions, D-29–32least signifi cant bits, D-31multiple precision fl oating-point
results, D-32nonfaulting loads, D-32overlapping integer operations, D-31quadruple precision fl oating-point
arithmetic, D-32register windows, D-29–30support for LISP and Smalltalk, D-30
Sparse matrices, B-55–58Sparse Matrix-Vector multiply (SpMV),
B-55, B-57, B-58CUDA version, B-57serial code, B-57shared memory version, B-59
Spatial locality, 388large block exploitation of, 405tendency, 392
SPEC, OL1.12-11–1.12-12CPU benchmark, 46–48power benchmark, 48–49
SPEC2000, OL1.12-12SPEC2006, 246, OL1.12-12SPEC89, OL1.12-11SPEC92, OL1.12-12SPEC95, OL1.12-12SPECrate, 554SPECratio, 47
Special function units (SFUs), B-35, B-50
defi ned, B-43Speculation, 344–345
hardware-based, 352implementation, 344performance and, 344problems, 344recovery mechanism, 344
Speed-up challenge, 518balancing load, 518–519bigger problem, 520–521
Spilling registers, 72, 101Split algorithm, 568Split caches, 411Stack architectures, OL2.22-4Stack pointers
adjustment, 104defi ned, 101values, 104
Stacksallocating space on, 106for arguments, 145defi ned, 101pop, 101push, 101, 104
Stalls, 291as solution to control hazard, 292avoiding with code reordering, 291behavioral Verilog with detection,
OL4.13-6data hazards and, 324–328illustrations, OL4.13-23, OL4.13-30insertion into pipeline, 326load-use, 329memory, 414write-back scheme, 413write buff er, 413
Standby spares, OL5.11-8State
in 2-bit prediction scheme, 333assignment, A-658, C-27bits, C-8exception, saving/restoring, 462logic components, 261specifi cation of, 446
State elementsclock and, 262combinational logic and, 262defi ned, 260, A-636inputs, 261in storing/accessing instructions, 264register fi le, A-638
Static branch prediction, 345Static data
segment, 107Static multiple-issue processors, 343,
345–349. See also Multiple issuecontrol hazards and, 345instruction sets, 345with LEGv8 ISA, 345–348
Static random access memories (SRAMs), 392, 393, A-646–650
array organization, A-650basic structure, A-649defi ned, 21, A-646fi xed access time, A-646large, A-647read/write initiation, A-647synchronous (SSRAMs), A-648three-state buff ers, A-647, A-648
Static variables, 106Steady-state prediction, 333Sticky bits, 228Store buff ers, 355Store instructions. See also Load
instructionsaccess, B-41base register, 274compiling with, 71conditional, 126defi ned, 71EX stage, 305ID stage, 302IF stage, 302instruction dependency, 323MEM stage, 304unit for implementing, 267WB stage, 304
Store register, 72Stored program concept, 63
as computer principle, 88illustrated, 89principles, 171
Strcpy procedure, 112–113. See also Procedures
as leaf procedure, 113pointers, 113
Stream benchmark, 564Streaming multiprocessor (SM),
B-48–49Streaming processors, B-34, B-49–50
array (SPA), B-41, B-46Streaming SIMD Extension 2 (SSE2)
fl oating-point architecture, 232Streaming SIMD Extensions (SSE) and
advanced vector extensions in x86, 232–233
Stretch computer, OL4.16-2Strings
defi ned, 111in Java, 113–115representation, 111
Strip mining, 526Striping, OL5.11-4Strong scaling, 521Structural hazards, 288–289, 305STUR (store register), 64STURB (store byte), 64STURH (store half), 64STURW (store word), 64STXR (store exclusive register), 64, 126SUB (subtract), 64SUBI (subtract immediate), 64SUBIS (subtract immediate and set fl ags),
64SUBS (subtract and set fl ags), 64Subnormals, 230Subtraction, 188–191. See also Arithmetic
binary, 188–189fl oating-point, 220negative number, 190overfl ow, 190
Subword parallelism, 230–232, 365, D-17
and matrix multiply, 238–242Sum of products, A-599, A-600Supercomputers, OL4.16-3
defi ned, 5SuperH, D-15, D-39–40Superscalars
defi ned, 349, OL4.16-5dynamic pipeline scheduling, 349multithreading options, 516
Surfaces, B-41Swap procedure, 138. See also Procedures
body code, 138full, 139–140, 143–145register allocation, 138
Swap space, 448
Symbol tables, 130Synchronization, 125–127, 568
barrier, B-18, B-20, B-34defi ned, 534lock, 125overhead, reducing, 44–45unlock, 125
Synchronizersdefi ned, A-664failure, A-665from D fl ip-fl op, A-664
Synchronous DRAM (SDRAM), 393–394, A-648, A-653
Synchronous SRAM (SSRAM), A-648
Synchronous system, A-636
Syntax tree, OL2.15-3
System calls
    defined, 459
Systems software, 13
SystemVerilog
    cache controller, OL5.12-2
    cache data and tag modules, OL5.12-6
    FSM, OL5.12-7
    simple cache block diagram, OL5.12-4
    type declarations, OL5.12-2
T
Tablets, 7
Tags
    defined, 398
    in locating block, 421
    page tables and, 448
    size of, 423
Tail call, 109
Task identifiers, 460
Task parallelism, B-24
Task-level parallelism, 516
Tebibyte (TiB), 5
Tesla PTX ISA, B-31–34
    arithmetic instructions, B-33
    barrier synchronization, B-34
    GPU thread instructions, B-32
    memory access instructions, B-33–34
Temporal locality, 388
    tendency, 392
Temporary registers, 68, 102
Terabyte (TB), 6
    defined, 5
Texture memory, B-40
Texture/processor cluster (TPC), B-47–48
TFLOPS multiprocessor, OL6.15-6
Thrashing, 464
Thread blocks, 542
    creation, B-23
    defined, B-19
    managing, B-30
    memory sharing, B-20
    synchronization, B-20
Thread parallelism, B-22
Threads
    creation, B-23
    CUDA, B-36
    ISA, B-31–34
    managing, B-30
    memory latencies and, B-74–75
    multiple, per body, B-68–69
    warps, B-27
Three Cs model, 459–461
Three-state buffers, A-647, A-648
Throughput
    defined, 30–31
    multiple issue and, 342
    pipelining and, 286, 342
Thumb, D-15, D-38
Timing
    asynchronous inputs, A-664–665
    level-sensitive, A-663–664
    methodologies, A-660–665
    two-phase, A-663
TLB misses, 453. See also Translation-lookaside buffer (TLB)
    handling, 461–465
    occurrence, 461
    problem, 464
Tomasulo’s algorithm, OL4.16-3
Touchscreen, 19
Tournament branch predictors, 334
Tracks, 395–396
Transfer time, 397
Transistors, 25
Translation Table Base Register (TTBR), 449
Translation-lookaside buffer (TLB), 452–454, D-26–27, OL5.17-6. See also TLB misses
    associativities, 454
    illustrated, 453
    integration, 457
    Intrinsity FastMATH, 454–457
    typical values, 454
Transmit driver and NIC hardware time versus receive driver and NIC hardware time, OL6.9-8
Tree-based parallel scan, B-62
Truth tables, A-593
    ALU control lines, C-5
    for control bits, 272
    datapath control outputs, C-17
    datapath control signals, C-14
    defined, 272
    example, A-593
    next-state output bits, C-15
    PLA implementation, A-601
Two’s complement representation, 77–78
    advantage, 78
    negation shortcut, 79–80
    rule, 81
    sign extension shortcut, 80–81
Two-level logic, A-599–602
Two-phase clocking, A-663
TX-2 computer, OL6.15-4
U
Unconditional branches, 94
Underflow, 206
Unicode
    alphabets, 113
    defined, 113
    example alphabets, 114
Unified GPU architecture, B-10–12
    illustrated, B-11
    processor array, B-11–12
Uniform memory access (UMA), 534, B-9
    multiprocessors, 534
Units
    commit, 350, 355
    control, 259, 271–273, C-4–8, C-10, C-12–13
    defined, 227
    floating point, 227
    hazard detection, 324, 327–328
    for load/store implementation, 267
    special function (SFUs), B-35, B-43, B-50
UNIVAC I, OL1.12-5
UNIX, OL2.22-8, OL5.17-9–5.17-12
    AT&T, OL5.17-10
    Berkeley version (BSD), OL5.17-10
    genius, OL5.17-12
    history, OL5.17-9–5.17-12
Unlock synchronization, 125
Unscaled signed immediate offset, 166
Unsigned numbers, 75–82
Use latency
    defined, 346
    one-instruction, 346
V
Vacuum tubes, 25
Valid bit, 400
Variables
    C language, 106
    programming language, 67
    register, 67
    static, 106
    storage class, 106
    type, 106
VAX architecture, OL2.22-4, OL5.17-7
Vector lanes, 526
Vector processors, 523–524. See also Processors
    conventional code comparison, 525–526
    instructions, 524
    multimedia extensions and, 524–525
    scalar versus, 526–527
Vectored interrupts, 337
Verilog
    behavioral definition of MIPS ALU, A-613
    behavioral definition with bypassing, OL4.13-4–4.13-6
    behavioral definition with stalls for loads, OL4.13-6
    behavioral specification, A-609, OL4.13-2–4.13-4
    behavioral specification of multicycle MIPS design, OL4.13-12–4.13-13
    behavioral specification with simulation, OL4.13-2
    behavioral specification with stall detection, OL4.13-6
    behavioral specification with synthesis, OL4.13-11–4.13-16
    blocking assignment, A-612
    branch hazard logic implementation, OL4.13-8–4.13-10
    combinational logic, A-611–614
    datatypes, A-609–610
    defined, A-608
    forwarding implementation, OL4.13-4
    MIPS ALU definition in, A-623–626
    modules, A-611
    multicycle MIPS datapath, OL4.13-14
    nonblocking assignment, A-612
    operators, A-610
    program structure, A-611
    reg, A-609–610
    sensitivity list, A-612
    sequential logic specification, A-644–646
    structural specification, A-609
    wire, A-609–610
Vertical microcode, C-32
Very large-scale integrated (VLSI) circuits, 25
Very Long Instruction Word (VLIW)
    defined, 345
    first generation computers, OL4.16-5
    processors, 345
VHDL, A-608–609
Video graphics array (VGA) controllers, B-3–4
Virtual addresses
    causing page faults, 462
    defined, 442
    mapping from, 442–443
    size, 444
Virtual machine monitors (VMMs)
    defined, 438
    implementing, 494
    laissez-faire attitude, 494
    page tables, 463
    in performance improvement, 441
    requirements, 440
Virtual machines (VMs), 438–441
    benefits, 438
    illusion, 463
    instruction set architecture support, 441
    performance improvement, 441
    for protection improvement, 438
Virtual memory, 441–465. See also Pages
    address translation, 443, 452–454
    integration, 457–459
    for large virtual addresses, 450–451
    mechanism, 464
    motivations, 427–442
    page faults, 442, 448
    protection implementation, 459–460
    segmentation, 445
    summary, 463
    virtualization of, 463
    writes, 452
Virtualizable hardware, 440
Virtually addressed caches, 458
Visual computing, B-3
Volatile memory, 22
W
Wafers, 26
    defects, 26
    dies, 26–27
    yield, 27
Warehouse Scale Computers (WSCs), 7, 545–550, 574
Warps, 544, B-27
Weak scaling, 521
Wear levelling, 395
While loops, 95
Whirlwind, OL5.17-2
Wide area networks (WANs), 24. See also Networks
Wide immediate operands, 115–117
Words
    accessing, 69
    defined, 67
    double, 158
    load, 69, 71
    quad, 158
    store, 71
Working set, 464
World Wide Web, 4
Worst-case delay, 283
Write buffers
    defined, 408
    stalls, 413
    write-back cache, 409
Write invalidate protocols, 479
Write serialization, 479
Write-back caches. See also Caches
    advantages, 469
    cache coherency protocol, OL5.12-5
    complexity, 409
    defined, 408, 469
    stalls, 413
    write buffers, 409
Write-back stage
    control line, 313
    load instruction, 303
    store instruction, 305
Writes
    complications, 408
    expense, 464
    handling, 407–409
    memory hierarchy handling of, 469–470
    schemes, 408
    virtual memory, 451
    write-back cache, 408, 409
    write-through cache, 408, 409
Write-stall cycles, 414
Write-through caches. See also Caches
    advantages, 469
    defined, 407, 469
    tag mismatch, 408
X
x86, 154–162
    Advanced Vector Extensions in, 232
    brief history, OL2.22-6
    conclusion, 162
    data addressing modes, 157–158
    evolution, 154–157
    first address specifier encoding, 162
    instruction encoding, 161–162
    instruction formats, 161
    instruction set growth, 170
    instruction types, 160
    integer operations, 158–160
    registers, 157
    SIMD in, 522
    Streaming SIMD Extensions in, 232–233
    typical instructions/functions, 161
    typical operations, 161
Xerox Alto computer, OL1.12-8
XMM, 232
Y
Yahoo! Cloud Serving Benchmark (YCSB), 556
Yield, 27
YMM, 232
Z
Zettabyte, 6