Index 0-9, and symbols - Elsevier

Index

Note: Online information is listed by print page number and a period followed by “e” with online page number (54.e1). Page references preceded by a single letter with hyphen refer to appendices. Page references followed by “f,” “t,” and “b” refer to figures, tables, and boxes, respectively.

0-9, and symbols

1-bit ALU , A-26–A-29 . See also Arithmetic logic unit (ALU)

adder , A-27 f CarryOut , A-28 for most significant bit , A-33 f illustrated , A-29 f logical unit for AND/OR , A-27 f performing AND, OR, and addition ,

A-31 , A-33 f 64-bit ALU , A-29–A-31 . See also

Arithmetic logic unit (ALU) from 63 copies of 1-bit ALU , A-34 f with 64 1-bit ALUs , A-30 f defining in Verilog , A-36–A-37 illustrated , A-35 f ripple carry adder , A-29

7090/7094 hardware , 248.e6

A

Absolute references , 127 Abstractions

hardware/software interface , 22 principle , 22 to simplify design , 11

Accumulator architectures , 173.e1–173.e2 Acronyms , 9 Active matrix , 18 add (add) , 64 f addi (add immediate) , 64 f , 72 , 84 Addition , 172–175 . See also Arithmetic

binary , 172 b –173 b floating-point , 196–199 , 204 operands , 173 , 173 significands , 195 b –196 b speed , 175 b

Address interleaving , 370–371 Address select logic , C-24 , C-25 f Address space , 418 , 421 b

extending , 467 b flat , 467

ID (ASID) , 436 inadequate , 497.e5–497.e6 shared , 507–508 single physical , 507 , 507–508 virtual , 436

Address translation for ARM Cortex-A53 , 458 f defined , 418–419 fast , 428–430 for Intel Core i7 , 458 f TLB for , 428–430

Address-control lines , C-26 f Addresses

base , 69 byte , 70 defined , 68 memory , 78 b virtual , 418–419 , 438 , 439 b

Addressing base , 118 f in branches , 115–117 displacement , 118 immediate , 118 f PC-relative , 115–116 , 118 f register , 118 f RISC-V modes , 117–118 x86 modes , 151

Addressing modes desktop architectures , D-5–D-6

Advanced Vector Extensions (AVX) , 216 , 217

AGP , B-9–B-10 Algol-60 , 173.e6 Aliasing , 434 Alignment restriction , 70 All-pairs N-body algorithm , B-65 Alpha architecture

bit count instructions , D-29 floating-point instructions , D-28–D-29 instructions , D-27–D-29 no divide , D-28 PAL code , D-28 unaligned load-store , D-28

VAX floating-point formats , D-29 ALU control , 249–251 . See also

Arithmetic logic unit (ALU) bits , 250–251 , 250 f logic , C-6–C-7 mapping to gates , C-4–C-7 truth tables , C-5 f , C-5 f

ALU control block , 253 defined , C-4–C-6 generating ALU control bits , C-6 f

ALUOp , 250 , C-6 b –C-7 b bits , 250 , 251 control signal , 253

Amazon Web Services (AWS) , 415 b AMD Opteron X4 (Barcelona) , 533 , 534 f AMD64 , 148 , 148 , 215 , 173.e5 Amdahl’s law , 391 , 493–494

corollary , 49 defined , 49 fallacy , 546

and (and) , 64 f AND gates , A-12–A-13 , C-7 AND operation , 90 , A-6 andi (and immediate) , 64 f Annual failure rate (AFR) , 408–409

versus MTTF of disks , 408 b –409 b Antidependence , 325 Antifuse , A-77 Apple computer , 54.e6 Apple iPad 2 A1395 , 20 f

logic board of , 20 f processor integrated circuit of , 21 f

Application binary interface (ABI) , 22 Application programming interfaces

(APIs) defined , B-4 graphics , B-14

Architectural registers , 335–336 Arithmetic , 170

addition , 172–175 addition and subtraction , 172–175 division , 181–189 fallacies and pitfalls , 220–223


floating-point , 189–214 historical perspective , 225 multiplication , 175–181 parallelism and , 214–215 Streaming SIMD Extensions and

advanced vector extensions in x86 , 215–216

subtraction , 172–175 subword parallelism , 214–215 subword parallelism and matrix

multiply , 216–220 Arithmetic instructions . See also

Instructions desktop RISC , D-11 f , D-11 f embedded RISC , D-13 f logical , 241–242 operands , 67–74

Arithmetic intensity , 531–532 Arithmetic logic unit (ALU) . See also

ALU control ; Control units 1-bit , A-26–A-29 64-bit , A-29–A-31 before forwarding , 297 f branch datapath , 244–245 hardware , 174 memory-reference instruction

use , 235 for register values , 242 R-format operations , 243 f signed-immediate input , 300

ARM Cortex-A53 , 234 , 332–340 address translation for , 458 f caches in , 459 f data cache miss rates for , 460 f memory hierarchies of , 457 performance of , 460–462 specification , 333 f TLB hardware for , 458 f

ARPAnet , 54.e9 Arrays , 405 f

logic elements , A-18–A-20 multiple dimension , 210 pointers versus, 141–144 procedures for setting to zero , 141 f

ASCII binary numbers versus, 109 b character representation , 108 f defined , 108–109 symbols , 111

Assemblers , 125–127 defined , 14 function , 125–127

microcode , C-30 number acceptance , 126 object file , 126

Assembly language , 15 f defined , 14 , 125 floating-point , 205 f illustrated , 15 f programs , 125 RISC-V , 64 f , 85 b –86 b translating into machine language ,

85 b –86 b Asserted signals , 240 , A-4 Associativity

in caches , 395 b –396 b degree, increasing , 394–396 , 442 increasing , 399–400 set, tag size versus, 399 b –400 b

Atomic compare and swap , 123 b Atomic exchange , 122 Atomic fetch-and-increment , 123 b Atomic memory operation , B-21 Attribute interpolation , B-43–B-44 auipc’s effect , 156 Automobiles, computer application in , 4 Average memory access time (AMAT) ,

392 calculating , 392 b

B

Bandwidth , 29–30 bisection , 525 external to DRAM , 388 memory , 388 network , 523–524

Barrier synchronization , B-18 defined , B-20 for thread communication , B-34

Base addressing , 69 , 118 Base registers , 69 Basic block , 95 b Benchmarks , 528–538

defined , 46 Linpack , 528 , 248.e2–248.e3 ,

248.e3 multiprocessor , 528–538 NAS parallel , 530 parallel , 529 f PARSEC suite , 530 SPEC CPU , 46–48 SPEC power , 48–49 SPECrate , 528 Stream , 538 b

Biased notation , 81 , 193 Binary numbers , 82

ASCII versus, 109 b conversion to decimal numbers , 77 b defined , 74

Bisection bandwidth , 525 Bit maps

defined , 18 goal , 18 storing , 18

Bit-Interleaved Parity (RAID 3) , 481.e4 Bits

ALUOp , 250 , 251 defined , 14 dirty , 428 b guard , 212 patterns , 212 b –213 b reference , 426 b rounding , 212 sign , 75 state , C-8–C-10 sticky , 212 valid , 374–376

Blocking assignment , A-24 Blocking factor , 404 Block-Interleaved Parity (RAID 4) , 481.e4–481.e5 Blocks

combinational , A-4–A-5 defined , 365–366 finding , 442–443 flexible placement , 392–396 least recently used (LRU) , 399 locating in cache , 397–399 miss rate and , 381 f multiword, mapping addresses to ,

380 b –381 b placement locations , 441 placement strategies , 394 replacement selection , 399 replacement strategies , 444 spatial locality exploitation , 381 state , A-4–A-5 valid data , 374–376

Bonding , 28 Boolean algebra , A-6–A-7 Bounds check shortcut , 96 Branch datapath

ALU , 244–245 operations , 244–245

Branch if Equal (beq) , A-32 Branch if greater than or equal, unsigned

(bgeu) , 95–96


Branch if less than (blt) instruction , 95–96

Branch if less than, unsigned (bltu) , 95–96

Branch instructions pipeline impact , 306 f

Branch not taken assumption , 305–306 defined , 244

Branch prediction buffers , 308 as control hazard solution , 272 defined , 271–272 dynamic , 272 , 308–312 static , 322

Branch predictors accuracy , 310 correlation , 310–311 information from , 310–311 tournament , 311–312

Branch table , 97–98 Branch taken

cost reduction , 306–307 defined , 244

Branch target addresses , 244 buffers , 310

Branches . See also Conditional branches

addressing in , 115–117 compiler creation , 93–94 decision, moving up , 306–307 delayed , 272 , 306–308 ending , 95 b execution in ID stage , 307 pipelined , 308 b target address , 306–307

Branch-on-zero instruction , 258–259 Bubble Sort , 140 Bubbles , 303 Bus-based coherent multiprocessors ,

577.e1 Buses , A-18–A-19 Bytes

addressing , 70 order , 70

C

C language assignment, compiling into RISC-V ,

65 b compiling , 144–145 , 150.e1–150.e2

compiling assignment with registers , 67 b –68 b

compiling while loops in , 94 b –95 b sort algorithms , 141 f translation hierarchy , 124 f translation to RISC-V assembly

language , 65 variables , 104 b

C.mmp , 577.e3–577.e4 C++ language , 173.e7 , 150.e26 Cache blocking and matrix multiply ,

463–466 Cache coherence , 452–456

coherence , 452 consistency , 452 enforcement schemes , 454 implementation techniques , 482.e10–482.e11 migration , 454 problem , 452 , 453 f , 456 b protocol example , 482.e11–482.e15 protocols , 454 replication , 454 snooping protocol , 454–456 snoopy , 482.e16 state diagram , 482.e15 f

Cache coherency protocol , 482.e11–482.e15

finite-state transition diagram , 482.e14 f functioning , 482.e13 f mechanism , 482.e13 f state diagram , 482.e15 f states , 482.e12 write-back cache , 482.e14 f

Cache controllers , 457 coherent cache implementation

techniques , 482.e10–482.e11 implementing , 482.e1 snoopy cache coherence , 482.e16 SystemVerilog , 482.e1–482.e4

Cache hits , 458 Cache misses

block replacement on , 443–444 capacity , 445 , 446 compulsory , 445 conflict , 445 defined , 382 direct-mapped cache , 394 fully associative cache , 396 handling , 382–383 memory-stall clock cycles , 389 reducing with flexible block placement ,

392–396

set-associative cache , 395 steps , 383 in write-through cache , 383

Cache performance , 388–408 calculating , 390 b –391 b hit time and , 391–392 impact on processor performance ,

390–391 Cache-aware instructions , 470 Caches , 373–388 . See also Blocks

accessing , 376–382 in ARM Cortex-A53 , 459 f associativity in , 395 b –396 b bits in , 380 b bits needed for , 380 contents illustration , 377 f defined , 19–22 , 373–374 direct-mapped , 374 , 375 f , 380 , 392 empty , 376 FSM for controlling , 447–452 fully associative , 393 GPU , B-38 inconsistent , 383 index , 378 in Intel Core i7 , 459 f Intrinsity FastMATH example ,

385–387 locating blocks in , 397–399 locations , 375 f multilevel , 388 , 400–403 nonblocking , 458 physically addressed , 434–435 physically indexed , 434 b –435 b physically tagged , 434 b –435 b primary , 400 , 407–408 secondary , 400 , 407–408 set-associative , 393 simulating , 466 b size , 379–381 split , 387 b summary , 387–388 tag field , 378 tags , 482.e1 f , 482.e10–482.e11 , 482.e11 virtual memory and TLB integration ,

433–435 virtually addressed , 434 virtually indexed , 434 virtually tagged , 434 write-back , 384 , 385 , 444 write-through , 383 , 385 , 444 writes , 383–385

Callee , 99 , 101 Caller , 99


Capabilities , 497.e12 Capacity misses , 445 Carry lookahead , A-37–A-47

4-bit ALUs using , A-43 f adder , A-38 fast, with first level of abstraction ,

A-38–A-40 fast, with “infinite” hardware , A-38 fast, with second level of abstraction ,

A-40–A-45 plumbing analogy , A-41 f , A-42 f ripple carry speed versus, A-45 b summary , A-45–A-47

Carry save adders , 181 CDC 6600 , 368.e2 , 54.e6 Cell phones , 6–7 Central processor unit (CPU) . See also

Processors classic performance equation , 36–40 defined , 19 execution time , 32 , 33–34 performance , 33–35 system, time , 32 time , 389 time measurements , 33–34 user, time , 32

Cg pixel shader program , B-15 Characters

ASCII representation , 108–109 in Java , 111–113

Chips , 19 , 25–26 , 26 manufacturing process , 26

Classes defined , 150.e14 packages , 150.e20

Clock cycles defined , 33 memory-stall , 389 number of registers and , 67 worst-case delay and , 260

Clock cycles per instruction (CPI) , 35–36 , 270

one level of caching , 400 two levels of caching , 400

Clock rate defined , 33 frequency switched as function of , 41 power and , 40

Clocking methodology , 239–241 , A-47 edge-triggered , 239 , A-47 , A-72–A-73 level-sensitive , A-73–A-74 , A-74–A-75 for predictability , 239

Clocks , A-47–A-49 edge , A-47 , A-49 b in edge-triggered design , A-72 f skew , A-73 specification , A-56 f synchronous system , A-47–A-48

Cloud computing , 522–523 defined , 7

Cluster networking , 527–528 , 553.e1 , 553.e3–553.e5 , 553.e6–553.e9

Clusters , 577.e7–577.e8 defined , 490 , 520 , 577.e7 isolation , 521 organization , 489 scientific computing on , 577.e7

Cm* , 577.e3–577.e4 CMOS (complementary metal oxide

semiconductor) , 41 Coarse-grained multithreading , 504–505 Cobol , 173.e6 Code generation , 150.e12 Code motion , 150.e6 Cold-start miss , 445 Collision misses , 445 Column major order , 403 Combinational blocks , A-4–A-5 Combinational control units , C-4–C-8 Combinational elements , 238 Combinational logic , 239 , A-3–A-4 ,

A-9–A-20 arrays , A-18–A-19 decoders , A-9–A-10 defined , A-4–A-5 don’t cares , A-17–A-18 multiplexors , A-10 ROMs , A-14–A-16 two-level , A-11–A-14 Verilog , A-23–A-26

Commercial computer development , 54.e3–54.e9

Commit units buffer , 327 defined , 327 in update control , 332 b

Common case fast , 11 Common subexpression elimination ,

150.e5 Communication , 23–24

overhead, reducing , 44–45 thread , B-34

Compact code , 173.e3–173.e4 Compare and branch zero , 307

Comparisons constant operands in , 72–74 signed versus unsigned , 95–96

Compilers , 125 branch creation , 94 b brief history , 173.e7–173.e8 conservative , 150.e6 defined , 14 front end , 150.e2 function , 14 , 125 high-level optimizations , 150.e3–150.e4 ILP exploitation , 368.e4–368.e5 Just In Time (JIT) , 133 optimization , 141 , 173.e8 speculation , 321–322 structure , 150.e1 f

Compiling C assignment statements , 65 b C language , 94 b –95 b , 144–145 , 150.e1 , 150.e2 floating-point programs , 206 b –207 b if-then-else , 93 b in Java , 150.e18–150.e19 procedures , 100 b –101 b , 102 b –103 b recursive procedures , 102 b –103 b while loops , 94 b –95 b

Compressed sparse row (CSR) matrix , B-55 , B-56

Compulsory misses , 445 , 446–447 Computer architects , 11–13

abstraction to simplify design , 11 common case fast , 11 dependability via redundancy , 12 hierarchy of memories , 12 Moore’s law , 11 parallelism , 12 pipelining , 12 prediction , 12

Computers application classes, traditional , 3 applications , 4 arithmetic for , 170 characteristics , 54.e12 f commercial development , 54.e3–54.e9 component organization , 17 f components , 17 f design measure , 53 desktop , 5 embedded , 5–6 first , 54.e2 in information revolution , 4 instruction representation , 81–89


performance measurement , 54.e1–54.e3 post-PC era , 6–7 servers , 5

Condition codes/flags , 96 Conditional branches

changing program counter with , 310 b compiling if-then-else into , 93 b defined , 92–93 desktop RISC , D-16 f embedded RISC , D-16 f implementation , 97 b in loops , 117 PA-RISC , D-34–D-36 , D-35 f PC-relative addressing , 115–116 RISC , D-10–D-16 SPARC , D-10–D-12

Conditional move instructions , 311 b –312 b

Conflict misses , 445 Constant memory , B-40 Constant operands , 72–74

frequent occurrence , 72 Content Addressable Memory (CAM) ,

398 b –399 b Context switch , 436 b Control

ALU , 249–251 challenge , 313 finalizing , 259 forwarding , 298 FSM , C-8–C-22 implementation, optimizing , C-27 mapping to hardware , C-3–C-4 ,

C-4–C-8 , C-8–C-22 , C-22–C-28 , C-28–C-32 , C-32–C-33

memory , C-26 f organizing, to reduce logic , C-31–C-32 pipelined , 288–292

Control and status register (CSR) access instructions , 462–463

Control flow graphs , 150.e8 , 150.e9 illustrated examples , 150.e8 f , 150.e9 f ,

150.e11 f Control functions

ALU, mapping to gates , C-4–C-7 defining , 254 PLA, implementation , C-7 , C-20 ROM, encoding , C-19 for single-cycle implementation ,

259–260 Control hazards , 269–272 , 305–313

branch delay reduction , 306–308

branch not taken assumption , 305–306 branch prediction as solution , 272 delayed decision approach , 272 b dynamic branch prediction , 308–312 logic implementation in Verilog , 366.e7–366.e10 pipeline stalls as solution , 270 f pipeline summary , 312–313 solutions , 270 f static multiple-issue processors and , 322

Control lines asserted , 254 in datapath , 253 f execution/address calculation , 289 final three stages , 291 f instruction decode/register file read ,

289 instruction fetch , 289 memory access , 289 setting of , 254 values , 289 write-back , 289

Control signals ALUOp , 253 defined , 240 effect of , 254 f multi-bit , 254 pipelined datapaths with , 288–292 truth tables , C-14 f

Control units , 237–238 . See also Arithmetic logic unit (ALU)

address select logic , C-24 , C-25 f combinational, implementing ,

C-4–C-8 with explicit counter , C-23 f illustrated , 255 f logic equations , C-11–C-12 main, designing , 251–254 as microcode , C-28 f next-state outputs , C-10 , C-12 b –C-13 b output , 249–251 , C-10 RISC-V , C-10 f

Cooperative thread arrays (CTAs) , B-30 Coprocessors

defined , 210 b Core RISC-V instruction set . See also MIPS

abstract view , 236 f desktop RISC , D-9 f implementation , 234–235 implementation illustration , 237 f overview , 235–238 subset , 234

Cores defined , 43 number per chip , 43

Correlation predictor , 310–311 Cosmic Cube , 577.e6–577.e7 CPU , 9 Cray computers , 248.e4 , 248.e5 Critical word first , 382 Crossbar networks , 525–526 CTSS (Compatible Time-Sharing

System) , 497.e13 CUDA programming environment , 513 ,

B-5–B-6 barrier synchronization , B-18 , B-34 development , B-17 , B-17–B-18 hierarchy of thread groups , B-18 kernels , B-19 , B-24 key abstractions , B-18 paradigm , B-19–B-22 parallel plus-scan template , B-61 f per-block shared memory , B-58 plus-reduction implementation ,

B-63 f programs , B-6 , B-24 , B-24 scalable parallel programming with ,

B-17–B-23 shared memories , B-18 threads , B-36

Cyclic redundancy check , 413 b –414 b Cylinder , 372

D

D flip-flops , A-50–A-51 , A-52 D latches , A-50–A-51 , A-51 Data bits , 411 f Data flow analysis , 150.e8 Data hazards , 266–269 , 292–305 . See also

Hazards forwarding , 266–267 , 292–305 load-use , 267–269 , 306 stalls and , 301–305

Data parallel problem decomposition , B-17 , B-18 f

Data race , 121 Data selectors , 235–236 Data transfer instructions . See also

Instructions defined , 68 , 69 load , 69 offset , 69 store , 70–71


Datacenters , 7 Data-level parallelism , 498 Datapath elements

defined , 241 sharing , 246–247

Datapaths branch , 244–245 building , 241–249 control signal truth tables , C-14 f control unit , 255 f defined , 19 design , 241 exception handling , 316 f for fetching instructions , 243 f for hazard resolution via forwarding ,

300 f for memory instructions , 245 in operation for branch-if-equal

instruction , 258–259 in operation for load instruction , 257 f in operation for R-type instruction ,

256 f operation of , 254–259 pipelined , 274–292 for RISC-V architecture , 247 for R-type instructions , 254–257 single, creating , 245–249 single-cycle , 273 static two-issue , 324 f

Deasserted signals , 240 , A-4 DEC PDP-8 , 173.e2 f Decimal numbers

binary number conversion to , 77 b defined , 74

Decision-making instructions , 92–98 Decoders , A-9–A-10

two-level , A-64 Decoding machine language , 118–120 Defect , 26–27 Delayed branches , 272 . See also Branches

as control hazard solution , 272 embedded RISCs and , D-23 reducing , 306–308

Delayed decision , 272 b DeMorgan’s theorems , A-11 Denormalized numbers , 214 Dependability via redundancy , 12 Dependable memory hierarchy , 408–414

failure, defining , 408–410 Dependences

between pipeline registers , 236–237 between pipeline registers and ALU

inputs , 295–296

bubble insertion and , 303 detection , 295 b name , 325 sequence , 293

Design compromises and , 84 datapath , 241 digital , 343 logic , 238–241 main control unit , 251–254 memory hierarchy, challenges , 447 f pipelining instruction sets , 265

Desktop and server RISCs . See also Reduced instruction set computer (RISC) architectures

addressing modes , D-6 architecture summary , D-4 f , D-4 f arithmetic/logical instructions , D-11 f conditional branches , D-16 constant extension summary , D-9 f ,

D-9 f control instructions , D-11 f conventions equivalent to MIPS core ,

D-12 f data transfer instructions , D-10 f features added to , D-45 f floating-point instructions , D-12 f instruction formats , D-7 f multimedia extensions , D-16–D-18 multimedia support , D-18 f

Desktop computers, defined , 5 Device driver , 553.e4 DGEMM (Double precision General

Matrix Multiply) , 216–217 , 340 , 342–343 , 403 , 528

cache blocked version of , 405 f optimized C version of , 218 f , 340 f , 464 f performance , 342 f , 406 f

Dicing , 27 Dies , 26–27 Digital design pipeline , 343 Digital signal-processing (DSP)

extensions , D-19 DIMMs (dual inline memory modules) ,

497.e4 Direct Data IO (DDIO) , 553.e6 Direct memory access (DMA) , 553.e2 f ,

553.e3 Direct3D , B-13 Direct-mapped caches . See also Caches

address portions , 397 f choice of , 398–399 defined , 374 , 392

illustrated , 375 f memory block location , 393 f misses , 395 b –396 b single comparator , 397 total number of bits , 380

Dirty bit , 428 b Dirty pages , 428 b Disk memory , 371–373 Displacement addressing , 118 Distributed Block-Interleaved Parity

(RAID 5) , 481.e5–481.e6 Divide algorithm , 184 b Dividend , 182 Division , 181–189

algorithm , 183 f dividend , 182 divisor , 182

Divisor , 182 divu (Divide Unsigned) . See also Arithmetic

faster , 186–187 floating-point , 204–210 hardware , 182–185 hardware, improved version , 185 f operands , 182 quotient , 182 remainder , 182 in RISC-V , 187 signed , 185–186 SRT , 187

Don’t cares , A-17–A-18 example , A-17 b –A-18 b term , 251

Double data rate (DDR) , 369–370 Double Data Rate (DDR) SDRAM ,

369–370 , A-64 Double precision . See also Single

precision defined , 191 FMA , B-45 , B-45–B-46 GPU , B-45 , B-74 b representation , 210–212

Doubleword , 67 , 151 Dual inline memory modules (DIMMs) ,

371 Dynamic branch prediction , 308–312 . See

also Control hazards branch prediction buffer , 308 loops and , 310 b

Dynamic hardware predictors , 272 Dynamic multiple-issue processors , 320 ,

326–331 . See also Multiple issue pipeline scheduling , 327–331 superscalar , 326


Dynamic pipeline scheduling , 327–331 commit unit , 327 concept , 327 hardware-based speculation , 329–331 primary units , 328 f reorder buffer , 332 b reservation station , 327

Dynamic random access memory (DRAM) , 368 , 369–371 , A-62–A-64

bandwidth external to , 388 cost , 23 defined , 19 , A-62 DIMM , 497.e4 Double Data Rate (DDR) , 369–370 early board , 497.e4 f GPU , B-37–B-38 growth of capacity , 25 f history , 497.e1 internal organization of , 370 f pass transistor , A-62 b –A-64 b SIMM , 497.e5 f , 497.e4 single-transistor , A-63 f size , 388 speed , 23–24 synchronous (SDRAM) , 369–370 , A-59 , A-64 two-level decoder , A-64

Dynamically linked libraries (DLLs) , 130–132

defined , 130 lazy procedure linkage version , 130

E

Early restart , 382 b Edge-triggered clocking methodology ,

239 , 240 , A-47 , A-72–A-73 advantage , A-48 clocks , A-72–A-73 drawbacks , A-73–A-74 illustrated , A-49 f rising edge/falling edge , A-47

EDSAC (Electronic Delay Storage Automatic Calculator) , 497.e2 f , 497.e1 , 54.e2

Eispack , 248.e2–248.e3 , 248.e3 Electrically erasable programmable read-only memory (EEPROM) , 371 Elements

combinational , 238 datapath , 241 , 246–247 memory , A-49–A-57 state , 238 , 240 , 242 f , A-47 , A-49 b

Embedded computers , 5–6 application requirements , 6 design , 5 growth , 54.e11

Embedded Microprocessor Benchmark Consortium (EEMBC) , 54.e11

Embedded RISCs . See also Reduced instruction set computer (RISC) architectures

addressing modes , D-6 architecture summary , D-4 f , D-4 f arithmetic/logical instructions , D-14 f conditional branches , D-16 constant extension summary , D-9 f ,

D-9 f control instructions , D-15 f data transfer instructions , D-13 f delayed branch and , D-23 DSP extensions , D-19 general purpose registers , D-5 instruction conventions , D-15 f instruction formats , D-8 f multiply-accumulate approaches ,

D-19 f Encoding

defined , C-31 RISC-V instruction , 85 f , 119 f ROM control function , C-18 ROM logic function , A-15 x86 instruction , 153–154

ENIAC (Electronic Numerical Integrator and Calculator) , 497.e1 , 54.e2–54.e3 , 54.e2 , 54.e3

EPIC , 368.e4 Error correction , A-64–A-66 Error Detecting and Correcting Code

(RAID 2) , 481.e4 Error detection , A-65–A-66 Error detection code , 410 Ethernet , 23–24 EX stage

load instructions , 280 f overflow exception detection , 315 , 318 f store instructions , 282 f

Exabyte , 6 f Exception enable , 437 b Exceptions , 313–319

association , 319 b datapath with controls for handling ,

316 f defined , 191 , 313 detecting , 313 event types and , 313

imprecise , 319 b interrupts versus, 313 pipelined computer example , 316 b –317 b in pipelined implementation , 315–319 precise , 319 b reasons for , 314–315 result due to overflow in add

instruction , 318 f in RISC-V architecture , 314–315 saving/restoring stage on , 438

Executable files defined , 127–129

Execute or address calculation stage , 280 Execute/address calculation

control line , 289 load instruction , 280 store instruction , 280

Execution time CPU , 32 , 33–34 pipelining and , 274 as valid performance measure , 50–51

Explicit counters , C-23–C-24 , C-26 f Exponents , 190

F

Failures, synchronizer , A-75–A-76 Fallacies . See also Pitfalls

Amdahl’s law , 546 arithmetic , 220 assembly language for performance ,

158 b commercial binary compatibility

importance , 158 b defined , 49 GPUs , B-72 , B-75 low utilization uses little power , 50 b peak performance , 546 b pipelining , 343 powerful instructions mean higher

performance , 157 right shift , 220 b

False sharing , 455 Fast carry

with first level of abstraction , A-38–A-40

with “infinite” hardware , A-38 with second level of abstraction ,

A-40–A-45 Fast Fourier Transforms (FFT) , B-53 Fault avoidance , 409 Fault forecasting , 409


Fault tolerance , 409 Fermi architecture , 513 , 542 Field programmable devices (FPDs) ,

A-77–A-78 Field programmable gate arrays (FPGAs) ,

A-77 Fields

defined , 83 format , C-31 names , 83 RISC-V , 83–89

Files, register , 242 , 247 , A-49 b , A-53–A-55

Fine-grained multithreading , 504 Finite-state machines (FSMs) , 447–452 ,

A-66–A-71 control , C-8–C-22 controllers , 450 f for multicycle control , C-9 f for simple cache controller , 451–452 implementation , 449 , A-69 Mealy , 450 Moore , 450 b –451 b next-state function , 449 , A-66 output function , A-66 , A-68 state assignment , A-69 state register implementation , A-70 f style of , 450 b –451 b synchronous , A-66 SystemVerilog , 482.e6 f traffic light example , A-67

Flash memory , 371 defined , 23

Flat address space , 467 Flip-flops

D flip-flops , A-50–A-51 , A-52 defined , A-50–A-51

Floating point , 189–214 assembly language , 205 f backward step , 248.e3–248.e4 binary to decimal conversion , 195 b branch , 204 challenges , 224 diversity versus portability , 248.e2–248.e3 division , 204 first dispute , 248.e1–248.e2 form , 190 fused multiply add , 212 b guard digits , 211 b history , 248.e2 IEEE 754 standard , 191–196

intermediate calculations , 210–211 operands , 205 f overflow , 190 packed format , 216 precision , 221 procedure with two-dimensional

matrices , 80 b programs, compiling , 79 b –80 b registers , 210 b representation , 190–191 RISC-V instruction frequency for , 224 f RISC-V instructions , 204–210 rounding , 210–211 sign and magnitude , 190 SSE2 architecture , 215 , 215 f subtraction , 204 underflow , 190 units , 211–212 in x86 , 215 f

Floating vectors , 248.e2 Floating-point addition , 196–199

arithmetic unit block diagram , 200 f binary , 197 b –199 b illustrated , 198 f instructions , 204–210 steps , 196 , 196 , 196 , 196–197

Floating-point arithmetic (GPUs) , B-41–B-46

basic , B-42 double precision , B-45–B-46 , B-74 b performance , B-44 specialized , B-42–B-44 supported formats , B-42 texture operations , B-44

Floating-point control and status register (fcsr) , 191

Floating-point instructions desktop RISC , D-12 f SPARC , D-31–D-32

Floating-point multiplication , 199–204 binary , 203 b –204 b illustrated , 202 f instructions , 204 significands , 199–203 steps , 199–201 , 201 , 201 , 201 , 201–203

Flow-sensitive information , 150.e13 b –150.e14 b

Flushing instructions , 306 , 307–308 exceptions and , 317 b

For loops , 142 , 150.e25 inner , 150.e23 SIMD and , 577.e2

Format fields , C-31 Fortran , 173.e6 Forwarding , 292–305

ALU before , 297 f control , 298 datapath for hazard resolution , 300 f defi ned , 266–267 graphical representation , 267 f illustrations , 366.e25 multiple results and , 269 multiplexors , 298 f pipeline registers before , 297 f with two instructions , 266 b –267 b Verilog implementation , 366.e3–366.e5

Fractions , 190 , 191 Frame buffer , 18 Frame pointers , 104 Front end , 150.e2 Fully associative caches . See also Caches

block replacement strategies , 443–444 choice of , 443 defined , 393 memory block location , 393 f misses , 396

Fully connected networks , 525 Fused-multiply-add (FMA) operation ,

212 b , B-45

G

Game consoles , B-9 Gates , A-3–A-4 , A-4–A-9

AND , A-12–A-13 , C-7 delays , A-45 mapping ALU control function to ,

C-4–C-7 NAND , A-8–A-9 NOR , A-8–A-9 , A-49 f

Gather-scatter , 501 , 542 General Purpose GPUs (GPGPUs) , B-5 General-purpose registers , 147

architectures , 173.e2 f embedded RISCs , D-5

Generate defined , A-39 example , A-44 b –A-45 b super , A-40

Gigabyte , 6 f Global common subexpression

elimination , 150.e5 Global memory , B-21 , B-39 Global miss rates , 406 b


Global optimization , 150.e4–150.e10 code , 150.e6 implementing , 150.e7

Global pointers , 104 b GPU computing . See also Graphics

processing units (GPUs) defined , B-5–B-6 visual applications , B-6

GPU system architectures , B-7–B-12 graphics logical pipeline , B-10 heterogeneous , B-7–B-9 implications for , B-24–B-25 interfaces and drivers , B-9–B-10 unified , B-10–B-11

Graph coloring , 150.e11 Graphics displays

computer hardware support , 18 LCD , 18

Graphics logical pipeline , B-10 Graphics processing units (GPUs) , 512–519 . See also GPU computing as accelerators , 512 attribute interpolation , B-43–B-44 defined , 46 , 496–497 , B-3 evolution , B-5 fallacies and pitfalls , B-72–B-75 floating-point arithmetic , B-16 ,

B-41–B-46 , B-74 GeForce 8-series generation , B-5 general computation , B-73 b General Purpose (GPGPUs) , B-5 graphics mode , B-6 graphics trends , B-4 history , B-3–B-4 logical graphics pipeline , B-13–B-14 mapping applications to , B-55–B-72 memory , 512 multilevel caches and , 512 N-body applications , B-65–B-68 NVIDIA architecture , 513–515 parallel memory system , B-36–B-41 parallelism , 513 , B-76 performance doubling , B-4 perspective , 517–519 programming , B-12–B-25 programming interfaces to , B-17 real-time graphics , B-13

Graphics shader programs , B-14–B-15 Gresham’s Law , 225 , 248.e1 Grid computing , 523 b –524 b Grids , B-19 GTX 280 , 538–539

Guard digits defined , 210–211 rounding with , 211 b

H

Half precision , B-42 Halfwords , 112 Hamming, Richard , 410 Hamming distance , 410 Hamming Error Correction Code (ECC) ,

410–411 calculating , 410

Hard disks access times , 23 defined , 23

Hardware as hierarchical layer , 13 f language of , 14–16 operations , 63–67 supporting procedures in , 98–108 synthesis , A-21 translating microprograms to ,

C-28–C-32 virtualizable , 416

Hardware description languages . See also Verilog

defined , A-20 using , A-20–A-26 VHDL , A-20–A-21

Hardware multithreading , 504–507 coarse-grained , 504–505 options , 505 f simultaneous , 505

Hardware-based speculation , 329–331 Harvard architecture , 54.e3 Hazard detection units , 301

pipeline connections for , 304–305 Hazards . See also Pipelining

control , 269–272 , 305–313 data , 266–269 , 292–305 forwarding and , 300 b structural , 265–266 , 282

Heap allocating space on , 104 defined , 105

Heterogeneous systems , B-4–B-5 architecture , B-7–B-12 defined , B-3

Hexadecimal numbers , 82 binary number conversion to , 82 f , 83 b

Hierarchy of memories , 12

High-level languages , 14–16 benefits , 16 computer architectures , 173.e4 importance , 16

High-level optimizations , 150.e3–150.e4 Hit rate , 366 Hit time

cache performance and , 391–392 defined , 366–367

Hit under miss , 458 Hold time , A-52–A-53 Horizontal microcode , C-32 Hot-swapping , 481.e6–481.e7 Human genome project , 4

I

I/O , 553.e1–553.e2 , 553.e2 , 553.e2 on system performance , 481.e1 b –481.e2 b

I/O benchmarks . See Benchmarks IBM 360/85 , 497.e5 IBM 701 , 54.e4 IBM 7030 , 368.e1 IBM ALOG , 248.e6 IBM Blue Gene , 577.e8 , 577.e8–577.e9 IBM Personal Computer , 173.e5 , 54.e7 IBM System/360 computers , 54.e5 f , 368.e1 , 248.e5 , 248.e6 IBM z/VM , 497.e12 ID stage

branch execution in , 307 , 308 load instructions , 280 f store instruction in , 279 f

IEEE 754 floating-point standard , 191–196 , 192 f , 248.e7–248.e9 . See also Floating point

first chips , 248.e7–248.e9 in GPU arithmetic , B-42 implementation , 248.e9 rounding modes , 211–212 today , 248.e9

If statements , 115–116 If-then-else , 93 b Imagination Technologies , 145 Immediate addressing , 118 Immediate instructions , 72 Imprecise interrupts , 319 b , 368.e2–368.e3 Index-out-of-bounds check , 96 Induction variable elimination , 150.e6 Inheritance , 150.e14 In-order commit , 328–329

Input devices , 16–17 Inputs , 251 Instances , 150.e14 Instruction count , 36 , 38 Instruction decode/register file read stage

control line , 288–292 load instruction , 277 store instruction , 282

Instruction execution illustrations , 366.e16–366.e25

clock cycle 9 , 366.e24 f clock cycles 1 and 2 , 366.e20 f clock cycles 3 and 4 , 366.e21 f clock cycles 5 and 6 , 366.e22 f clock cycles 7 and 8 , 366.e23 f examples , 366.e24–366.e25 forwarding , 366.e25 , 366.e25 no hazard , 366.e16 pipelines with stalls and forwarding , 366.e25

Instruction fetch stage

control line , 289 load instruction , 277 store instruction , 282

Instruction formats , 153 defined , 82 desktop/server RISC architectures , D-7 f embedded RISC architectures , D-8 f I-type , 84 MIPS , 146 f RISC-V , 146 f R-type , 84 , 251–252 x86 , 153–154

Instruction latency , 344–345 Instruction mix , 39–40 , 54.e9 Instruction set architecture

branch address calculation , 244 defined , 22 , 52 history , 161 maintaining , 52 protection and , 417 thread , B-31–B-34 virtual machine support , 416

Instruction sets , B-49 MIPS-32 , 146 f RISC-V , 160 x86 growth , 156 f

Instruction-level parallelism (ILP) , 342–343 . See also Parallelism

compiler exploitation , 368.e4–368.e5 defined , 43 b , 319–320 exploitation, increasing , 331 and matrix multiply , 340–343

Instructions , 60 , D-25–D-27 , D-40 , D-40–D-43 . See also Arithmetic instructions ; MIPS ; Operands

add immediate , 72–74 addition , 174 Alpha , D-27–D-29 arithmetic-logical , 241–242 ARM , D-36–D-38 assembly , 65 basic block , 95 b cache-aware , 470 conditional branch , 92–93 , 93 b conditional move , 311 b –312 b data transfer , 68 decision-making , 92–98 defined , 14 , 62 desktop RISC conventions , D-12 f as electronic signals , 81–82 embedded RISC conventions , D-15 f encoding , 85 f fetching , 243 f floating-point , 204–210 floating-point (x86) , 215 f flushing , 306 , 307–308 immediate , 72 introduction to , 62–63 left-to-right flow , 275 load , 69 logical operations , 89–92 M32R , D-40 memory access , B-33–B-34 memory-reference , 235 multiplication , 181 nop , 302–303 PA-RISC , D-34–D-36 performance , 35–36 pipeline sequence , 302 f PowerPC , D-12–D-13 , D-32–D-34 PTX , B-31 , B-32 f representation in computer , 81–89 restartable , 438–439 resuming , 438 b –439 b R-type , 241–242 , 246–247 SPARC , D-29–D-32 store , 71 store-conditional doubleword , 122–123 subtraction , 174 SuperH , D-39–D-40 thread , B-30–B-31 Thumb , D-38–D-39 vector , 498–500 as words , 62 x86 , 146–155

Instructions per clock cycle (IPC) , 320 Integrated circuits (ICs) , 19 . See also specific chips

cost , 27 defined , 25 manufacturing process , 26 very large-scale (VLSIs) , 25

Intel Core i7 , 46–49 , 234 , 491 , 538–543 address translation for , 458 f architectural registers , 335–336 caches in , 459 f memory hierarchies of , 457–462 microarchitecture , 335 performance of , 460 SPEC CPU benchmark , 46–48 SPEC power benchmark , 48–49 TLB hardware for , 458 f

Intel Core i7 920 , 335–338 microarchitecture , 335

Intel Core i7 960 benchmarking and rooflines of , 538–543

Intel Core i7 Pipelines , 332–340 , 335–338

memory components , 336 f performance , 338–340 program performance , 339 b specification , 333 f

Intel IA-64 architecture , 173.e2 f Intel Paragon , 577.e6–577.e7 Intel Threading Building Blocks , B-60 Intel x86 microprocessors

clock rate and power for , 40 f Interference graphs , 150.e10 Interleaving , 388 Interprocedural analysis , 150.e13 b –150.e14 b Interrupt enable , 437 b Interrupt-driven I/O , 553.e3 Interrupts

defined , 191 , 313 event types and , 313 exceptions versus , 313 imprecise , 319 b , 368.e2–368.e3 precise , 319 b vectored , 314

Intrinsity FastMATH processor , 385–387 caches , 386 f data miss rates , 387 f , 397 f read processing , 432 f TLB , 430–433 write-through processing , 432 f

Inverted page tables , 427 Issue packets , 322–323 I-type , 87 b

J

Java bytecode , 132 bytecode architecture , 150.e10–150.e12 characters in , 111–113 compiling in , 150.e18–150.e19 goals , 132 interpreting , 132 , 144–145 , 150.e14 keywords , 150.e20 method invocation in , 150.e20 pointers , 150.e25–150.e26 primitive types , 150.e25 programs, starting , 132–133 reference types , 150.e25 sort algorithms , 141 f strings in , 111–113 translation hierarchy , 132 f while loop compilation in , 150.e17 b –150.e18 b

Java Virtual Machine (JVM) , 145 , 150.e15 Jump-and-link register instruction (jalr) , 97–98 , 99 Jump instructions , D-26

branch instruction versus, 248 f control and datapath for , 249 implementing , 235–238 instruction format , 248

Just In Time (JIT) compilers , 133 , 550

K

Karnaugh maps , A-18 Kernel mode , 435 Kernels

CUDA , B-19 , B-24 defined , B-19–B-22

Kilobyte , 6 f

L

LAPACK , 221–222 Large-scale multiprocessors , 577.e6 , 577.e6–577.e7 Latches

D latch , A-50–A-51 , A-51 defined , A-50–A-51

Latency instruction , 344–345 memory , B-74 b pipeline , 274 b use , 323–325

lb (load byte) , 64 f

lbu (load byte, unsigned) , 64 f ld (load doubleword) , 64 f Leaf procedures . See also Procedures

defined , 102 example , 112 f

Least recently used (LRU) as block replacement strategy , 443–444 defined , 399 pages , 424–426

Least significant bits defined , 74 SPARC , D-31

Left-to-right instruction flow , 275 Level-sensitive clocking , A-73–A-74 , A-74–A-75

defined , A-73–A-74 two-phase , A-74

lh (load halfword) , 64 f lhu (load halfword, unsigned) , 64 f Link , 553.e1–553.e2 Linkers , 127–129

defined , 127 executable files , 127–129 steps , 127

Linking object files , 128 b –129 b Linpack , 528 , 248.e2–248.e3 , 248.e3 Liquid crystal displays (LCDs) , 18 LISP, SPARC support , D-30 Live range , 150.e10 Livermore Loops , 54.e10 Load balancing , 495 b –496 b Load byte , 109 Load doubleword , 69 , 71–72 Load instructions . See also Store instructions

access , B-41 base register , 252 compiling with , 71 b datapath in operation for , 257 f defined , 69 EX stage , 280 f halfword unsigned , 112 ID stage , 279 f IF stage , 279 f load byte unsigned , 78 load half , 112 MEM stage , 281 f pipelined datapath in , 284 f signed , 78 b unit for implementing , 245 f unsigned , 78 b WB stage , 281 f

Loaders , 130

Load-reserved doubleword , 122–123 Load-store architectures , 173.e2 Load upper immediate , 113–114 Load-use data hazard , 267–269 , 306 Load-use stalls , 306 Load word , 113 b Load word unsigned , 113 b Local area networks (LANs) , 24 . See also Networks

Local memory , B-21 , B-40 Local miss rates , 406 b Local optimization , 150.e4 . See also Optimization

implementing , 150.e7

Locality principle , 364–365 spatial , 364 , 367 b temporal , 364 , 367 b

Lock synchronization , 121 Locks , 508–511 Logic

address select , C-24 , C-25 f ALU control , C-6–C-7 combinational , 240 , A-5 , A-9–A-20 components , 239 control unit equations , C-11 f design , 238–241 equations , A-7 b minimization , A-18 programmable array (PAL) , A-77 sequential , A-4–A-5 , A-55–A-57 two-level , A-11–A-14

Logical operations , 89–92 AND , 90 desktop RISC , D-11 f , D-11 f embedded RISC , D-13 f NOT , 91 OR , 91 shifts , 90 xor , 91

Long instruction word (LIW) , 368.e4 Lookup tables (LUTs) , A-77–A-78 Loop unrolling

defined , 325–326 , 150.e3–150.e4 for multiple-issue pipelines , 325 b –326 b register renaming and , 325

Loops , 94–96 conditional branches in , 115–116 for , 142 prediction and , 310 b test , 142 , 143 while, compiling , 94 b –95 b

lr.d (load reserved) , 64 f

lui (load upper immediate) , 64 f lw (load word) , 64 f lwu (load word, unsigned) , 64 f

M

M32R , D-15 , D-40 Machine code , 82 Machine instructions , 82 Machine language , 15 f

branch offset in , 116 b –117 b decoding , 118–120 defined , 14 , 82 illustrated , 15 f RISC-V , 87–89 SRAM , 19–22 translating RISC-V assembly language into , 85 b –86 b

Main memory , 418 . See also Memory

defined , 23 page tables , 427 physical addresses , 418

Mapping applications , B-55–B-72 Mark computers , 54.e3 Matrix multiply , 216–220 , 543–546 Mealy machine , 450 , A-67 , A-70–A-71 , A-71 b Mean time to failure (MTTF) , 408–409

versus AFR of disks , 408 b –409 b improving , 409–410

Media Access Control (MAC) address , 553.e6

Megabyte , 6 f Memory

addresses , 78 b affinity , 536 f atomic , B-21 bandwidth , 369–370 , 387 b cache , 19–22 , 373–388 , 388–408 CAM , 398 b –399 b constant , B-40 control , C-26 defined , 19 DRAM , 19 , 369–371 , A-62–A-64 flash , 23 global , B-21 , B-39 GPU , 512 instructions, datapath for , 245 local , B-21 , B-40 main , 23 nonvolatile , 22–23 operands , 68–72 parallel system , B-36–B-41

read-only (ROM) , A-14–A-16 SDRAM , 369–370 secondary , 23 shared , B-17 , B-39–B-40 spaces , B-39 SRAM , A-57–A-59 stalls , 390 technologies for building , 24–28 texture , B-40 virtual , 417–441 volatile , 22–23

Memory access instructions , B-33–B-34 Memory access stage

control line , 290 f load instruction , 280 f store instruction , 280

Memory bandwidth , 538–539 , 547 b Memory consistency model , 456 b Memory elements , A-49–A-57

clocked , A-50 D flip-flop , A-50–A-51 , A-52 D latch , A-51 DRAMs , A-62–A-64 flip-flop , A-50 hold time , A-52–A-53 latch , A-50 setup time , A-52–A-53 , A-53 f SRAMs , A-57–A-59 unclocked , A-50

Memory hierarchies , 535 of ARM cortex-A53 , 457–462 block (or line) , 365–366 cache performance , 388–408 caches , 373–388 common framework , 441–447 defined , 365 design challenges , 447 b development , 497.e5–497.e7 exploiting , 362 of Intel Core i7 , 457–462 level pairs , 366 f multiple levels , 365 overall operation of , 433 b –434 b parallelism and , 452–456 , 481.e1–481.e2 pitfalls , 466–470 program execution time and , 407 quantitative design parameters , 441 f redundant arrays and inexpensive disks , 456 reliance on , 367 structure , 365 f structure diagram , 368 f variance , 407 b virtual memory , 417–441

Memory rank , 371 Memory technologies , 368–373

disk memory , 371–373 DRAM technology , 368 , 369–371 flash memory , 371 SRAM technology , 368 , 369

Memory-mapped I/O , 553.e2 Memory-stall clock cycles , 389 Message passing

defined , 519 multiprocessors , 519–524

Metastability , A-75–A-76 Methods

defined , 150.e14 invoking in Java , 150.e19–150.e20

Microarchitectures , 335 Intel Core i7 920 , 335–338

Microcode assembler , C-30 control unit as , C-28 f defined , C-27 dispatch ROMs , C-30 , C-30 f horizontal , C-32 vertical , C-32

Microinstructions , C-31 Microprocessors

design shift , 491 multicore , 8 , 43 , 490–491

Microprograms as abstract control representation , C-30–C-31 field translation , C-28–C-29 translating to hardware , C-28–C-32

Migration , 454 Million instructions per second (MIPS) , 51 Minterms

defined , A-12–A-13 , C-20 in PLA implementation , C-20

MIP-map , B-44 MIPS and RISC-V

common features between , 145 MIPS-16

16-bit instruction set , D-41–D-42 immediate fields , D-41 instructions , D-40–D-43 MIPS core instruction changes , D-42–D-43 PC-relative addressing , D-41

MIPS-32 instruction set , 145 MIPS-64 instructions , 145 , D-25–D-27

conditional procedure call instructions , D-27

constant shift amount , D-25 jump/call not PC-relative , D-26 move to/from control registers , D-26 nonaligned data transfers , D-25 NOR , D-25 parallel single precision floating-point operations , D-27 reciprocal and reciprocal square root , D-27 SYSCALL , D-25 TLB instructions , D-26–D-27

Mirroring , 481.e4 Miss penalty

defined , 366–367 determination , 381–382 multilevel caches, reducing , 400–403

Miss rates block size versus, 381–382 data cache , 442 f defined , 366 global , 406 b improvement , 381–382 Intrinsity FastMATH processor , 387 local , 406 b miss sources , 446 split cache , 387 b

Miss under miss , 458 MMX (MultiMedia eXtension) , 215 Moore machines , 450 , A-67 , A-70–A-71 , A-71 b Moore’s law , 11 , 369 , 512 , 553.e1–553.e2 , B-72 b Most significant bit

1-bit ALU for , A-33 f defined , 74

MS-DOS , 497.e15 , 497.e15 Multicore , 507–512 Multicore multiprocessors , 8 , 43

defined , 8 , 490–491

MULTICS (Multiplexed Information and Computing Service) , 497.e8 Multilevel caches . See also Caches

complications , 406 b defined , 388 , 406 b miss penalty, reducing , 400–403 performance of , 400 b –401 b summary , 407–408

Multimedia extensions desktop/server RISCs , D-16–D-18 as SIMD extensions to instruction sets , 577.e3 vector versus, 499 b –500 b

Multiple dimension arrays , 210 Multiple instruction multiple data (MIMD) , 548–549

defined , 497 , 498 first multiprocessor , 577.e3–577.e4

Multiple instruction single data (MISD) , 497–498

Multiple issue , 320 code scheduling , 324 b –325 b dynamic , 320 , 326–331 issue packets , 322–323 loop unrolling and , 325 b –326 b processors , 320 static , 320 , 322–326 throughput and , 330 b

Multiple processors , 543–546 Multiple-clock-cycle pipeline diagrams , 284–285

five instructions , 286 f illustrated , 285–288

Multiplexors , A-10 controls , 449 in datapath , 253 f defined , 235–236 forwarding, control values , 298 f selector control , 249 two-input , A-10

Multiplicand , 176 Multiplication , 175–181 . See also Arithmetic

fast, hardware , 180 faster , 180–181 first algorithm , 178 f floating-point , 199–204 hardware , 176–180 instructions , 181 operands , 181 product , 181 sequential version , 176–180 signed , 180

Multiplier , 176 Multiply algorithm , 176–180 Multiply-add (MAD) , B-42 Multiprocessors

benchmarks , 528–538 bus-based coherent , 577.e6 defined , 490 historical perspective , 551 large-scale , 577.e6 , 577.e6–577.e7 message-passing , 519–524 multithreaded architecture , B-26–B-27 , B-36 organization , 489 , 519 for performance , 547 shared memory , 490–491 , 507–512 software , 491 f TFLOPS , 577.e5 UMA , 508

Multistage networks , 525–526 Multithreaded multiprocessor

architecture , B-25–B-36 conclusion , B-36 ISA , B-31–B-34 massive multithreading , B-25–B-26 multiprocessor , B-26–B-27 multiprocessor comparison , B-35–B-36 SIMT , B-27–B-29 special function units (SFUs) , B-35 streaming processor (SP) , B-34 thread instructions , B-30–B-31 threads/thread blocks management , B-30

Multithreading , B-25–B-26 coarse-grained , 504–505 defined , 496–497 fine-grained , 504 hardware , 504–507 simultaneous (SMT) , 505

Must-information , 150.e13 b –150.e14 b Mutual exclusion , 121

N

Name dependence , 325 NAND gates , A-8–A-9 NAS (NASA Advanced Supercomputing) , 530 N-body

all-pairs algorithm , B-65 GPU simulation , B-71 mathematics , B-65–B-66 multiple threads per body , B-68–B-72 optimization , B-67 performance comparison , B-69–B-70 results , B-70–B-72 shared memory use , B-67–B-68

Negation shortcut , 78–79 Nested procedures , 102–104

compiling recursive procedure showing , 102 b –103 b

NetFPGA 10-Gigabit Ethernet card , 553.e1 f , 553.e2 f

Network of Workstations , 577.e7–577.e8 Network topologies , 524–527

implementing , 526–527 multistage , 527 f

Networking , 553.e3–553.e4 operating system in , 553.e3–553.e5 performance improvement , 553.e6–553.e9

Networks , 23–24

advantages , 23 bandwidth , 525 crossbar , 525–526 fully connected , 525 local area (LANs) , 23–24 multistage , 525–526 wide area (WANs) , 23–24

Newton’s iteration , 210 b Next state

nonsequential , C-24 sequential , C-23–C-24

Next-state function , 449 , A-66 defined , 449 implementing, with sequencer , C-22–C-28

Next-state outputs , C-27 , C-12 b –C-13 b

example , C-12 implementation , C-12–C-13 logic equations , C-12 b –C-13 b truth tables , C-13–C-15

No Redundancy (RAID 0) , 481.e3 No write allocation , 384 Nonblocking assignment , A-24 Nonblocking caches , 332 b , 458 Nonuniform memory access (NUMA) , 508 Nonvolatile memory , 22–23 Nops , 302–303 NOR gates , A-8–A-9

cross-coupled , A-49 f D latch implemented with , A-51 f

NOR operation , D-25 NOT operation , 91 , A-6 Numbers

binary , 74 computer versus real-world , 213 decimal , 74 , 77 b denormalized , 214 hexadecimal , 83 signed , 74–81 unsigned , 74–81

NVIDIA GeForce 8800 , B-46–B-55 all-pairs N-body algorithm , B-71 dense linear algebra computations , B-51–B-53 FFT performance , B-53 instruction set , B-49 performance , B-51 rasterization , B-50 ROP , B-50–B-51 scalability , B-51 sorting performance , B-54–B-55 special function approximation statistics , B-43 f special function unit (SFU) , B-50 streaming multiprocessor (SM) , B-48–B-49 streaming processor , B-49–B-50 streaming processor array (SPA) , B-46 texture/processor cluster (TPC) , B-47

NVIDIA GPU architecture , 513–515 NVIDIA GTX 280 , 539 f , 540 f NVIDIA Tesla GPU , 538–543

O

Object files , 128 b –129 b debugging information , 127 header , 126 linking , 128 b –129 b relocation information , 126 static data segment , 126 symbol table , 127 text segment , 126

Object-oriented languages . See also Java brief history , 173.e7 defined , 145 , 150.e14

One’s complement , 81 , A-29 Opcodes

control line setting and , 254 defined , 83 , 252

OpenGL , B-13 OpenMP (Open MultiProcessing) , 510 b –511 b , 530 Operands , 67–74 . See also Instructions

32-bit immediate , 113–114 adding , 173 arithmetic instructions , 67 compiling assignment when in memory , 69 b constant , 72–74 division , 181–189 floating-point , 205 f memory , 68–72 multiplication , 175–181 RISC-V , 64 f

Operating systems brief history , 497.e8 defined , 13 encapsulation , 22 in networking , 553.e3–553.e5

Operations atomic, implementing , 122 hardware , 63–67 logical , 89–92 x86 integer , 151–152

Optimization class explanation , 150.e13 f compiler , 141 f control implementation , C-27 global , 150.e4–150.e10 high-level , 150.e3–150.e4 local , 150.e4–150.e10 , 150.e7 manual , 144

or (inclusive or) , 64 f OR operation , 174 , A-6 ori (inclusive or immediate) , 64 f Out-of-order execution

defined , 328 performance complexity , 406 b –407 b processors , 332 b

Output devices , 16–17 Overflow

defined , 75 , 190 detection , 174 exceptions , 316 f floating-point , 191 occurrence , 173 saturation and , 175 b subtraction , 173

P

P + Q redundancy (RAID 6) , 481.e6 Packed floating-point format , 216 Page faults , 424 . See also Virtual memory

for data access , 459 defined , 418–419 handling , 420 , 437–439 virtual address causing , 430–433

Page tables , 443 defined , 422–423 illustrated , 425 f indexing , 422–423 inverted , 427 levels , 427 main memory , 427 register , 422–423 storage reduction techniques , 427 updating , 422 VMM , 439 b

Pages . See also Virtual memory defined , 418–419 dirty , 428 b

finding , 422–423 LRU , 424–426 offset , 419 physical number , 419 placing , 422–423 size , 420 f virtual number , 419

Parallel bus , 553.e1–553.e2 Parallel execution , 121 Parallel memory system , B-36–B-41 . See also Graphics processing units (GPUs)

caches , B-38 constant memory , B-40 DRAM considerations , B-37–B-38 global memory , B-39 load/store access , B-41 local memory , B-40 memory spaces , B-39 MMU , B-38–B-39 ROP , B-41 shared memory , B-39–B-40 surfaces , B-41 texture memory , B-40

Parallel processing programs , 492–497 creation difficulty , 492–497 defined , 490 for message passing , 509 b –511 b great debates in , 577.e4–577.e6 for shared address space , 509 b –511 b use of , 547

Parallel reduction , B-62 Parallel scan , B-60–B-63

CUDA template , B-61 f inclusive , B-60 tree-based , B-62 f

Parallel software , 491 Parallelism , 12 , 43 b , 319–332

and computer arithmetic , 214–215 data-level , 224 , 498 debates , 577.e4–577.e6 GPUs and , 512 , B-76 instruction-level , 43 , 319–320 , 331 memory hierarchies and , 452–456 , 481.e1–481.e2 multicore and , 507 b multiple issue , 320 b multithreading and , 505 performance benefits , 44 process-level , 490 redundant arrays and inexpensive disks , 456 subword , D-17 task , B-24 task-level , 490 thread , B-22

Paravirtualization , 470 PA-RISC , D-14 , D-17

branch vectored , D-35 conditional branches , D-34 , D-35 f debug instructions , D-36 decimal operations , D-35 extract and deposit , D-35 instructions , D-34–D-36 load and clear instructions , D-36 multiply/add and multiply/subtract , D-36 nullification , D-34 nullifying branch option , D-25 store bytes short , D-36 synthesized multiply and divide , D-34–D-35

Parity , 481.e4

bits , 410–411 code , 418 , A-64–A-65

PARSEC (Princeton Application Repository for Shared Memory Computers) , 530

Pass transistor , A-62 b –A-64 b PCI-Express (PCIe) , 527 , B-7–B-8 , 553.e1–553.e2 PC-relative addressing , 115–116 , 118 Peak floating-point performance , 532 Pentium bug morality play , 222 f Performance , 28–40

assessing , 28 classic CPU equation , 36–40 components , 38 f CPU , 33–35 defining , 29–32 equation, using , 36–40 improving , 34 b –35 b instruction , 35–36 measuring , 32–33 , 54.e9 program , 9–10 ratio , 31 relative , 31 b response time , 30 b sorting , B-49–B-50 throughput , 30 b time measurement , 32

Personal computers (PCs) , 7 f defined , 5

Personal mobile device (PMD) defined , 6–7

Petabyte , 6 f

Physical addresses , 418 mapping to , 418–419 space , 507 , 509 b –511 b

Physically addressed caches , 434–435 Pipeline registers

before forwarding , 296–298 dependences , 295–296 , 296 f forwarding unit selection , 300

Pipeline stalls , 268 avoiding with code reordering , 268 b –269 b data hazards and , 301–305 insertion , 303 f load-use , 306 as solution to control hazards , 270 f

Pipelined branches , 308 b Pipelined control , 288–292 . See also Control

control lines , 288–289 , 289 overview illustration , 304 f specifying , 289

Pipelined datapaths , 274–292 with connected control signals , 292 f with control signals , 288–292 corrected , 284 f illustrated , 277 f in load instruction stages , 284 f

Pipelined dependencies , 294 f Pipelines

branch instruction impact , 306 f effectiveness, improving , 368.e3–368.e4 execute and address calculation stage , 278 , 280 five-stage , 262 , 278 , 286 b –288 b graphic representation , 267 f , 284–288 instruction decode and register file read stage , 276 f , 280 instruction fetch stage , 277 f , 280 instructions sequence , 302 f latency , 274 b memory access stage , 278 , 280 multiple-clock-cycle diagrams , 284–285 performance bottlenecks , 330–331 single-clock-cycle diagrams , 284–285 stages , 262–263 static two-issue , 323 f write-back stage , 277 , 282

Pipelining , 12 , 260–274 advanced , 331–332 benefi ts , 260 control hazards , 269–272 data hazards , 266–269

exceptions and , 315–319 execution time and , 274 b fallacies , 343–344 hazards , 265–269 instruction set design for , 265 laundry analogy , 261 f overview , 260–274 paradox , 261–262 performance improvement , 265 pitfall , 343–344 simultaneous executing instructions , 274 b speed-up formula , 263 structural hazards , 265–266 , 282 summary , 312–313 throughput and , 274 b

Pitfalls . See also Fallacies address space extension , 382–383 arithmetic , 220–223 associativity , 467 b defined , 49 GPUs , B-74 ignoring memory system behavior , 466 b memory hierarchies , 466–470 out-of-order processor evaluation , 467 b performance equation subset , 50 b pipelining , 343–344 pointer to automatic variables , 159 b sequential word addresses , 159 b simulating cache , 466 software development with multiprocessors , 546 b VMM implementation , 468–470

Pixel shader example , B-15–B-17 Pixels , 18 Pointers

arrays versus, 141–144 frame , 104 global , 104 b incrementing , 143 Java , 150.e25–150.e26 stack , 99 , 102–104

Polling , 553.e6 Pop , 99 Power

clock rate and , 40 critical nature of , 53 efficiency , 331–332 relative , 41 b –42 b

PowerPC algebraic right shift , D-33 branch registers , D-32–D-33 condition codes , D-12–D-13 instructions , D-12–D-13 instructions unique to , D-32–D-34 load multiple/store multiple , D-33 logical shifted immediate , D-33 rotate with mask , D-33

Precise interrupts , 319 b Prediction , 12

2-bit scheme , 310 accuracy , 310 dynamic branch , 308–312 loops and , 310 b steady-state , 310

Prefetching , 470 , 534 Primitive types , 150.e25 Procedure calls

preservation across , 104 Procedures , 98–108

compiling , 100 b –101 b compiling, showing nested procedure linking , 100 b –101 b execution steps , 98 frames , 104 leaf , 102 nested , 102 b –103 b recursive , 106 b for setting arrays to zero , 141 f sort , 135 strcpy , 110 b –111 b string copy , 110 b –111 b swap , 134

Process identifiers , 436 Process-level parallelism , 490 Processors , 232

control , 19 as cores , 43 datapath , 19 defined , 17 b , 19 dynamic multiple-issue , 320 multiple-issue , 320 out-of-order execution , 332 b , 406 b –407 b performance growth , 44 f ROP , B-12 , B-41 speculation , 321–322 static multiple-issue , 320 , 322–326 streaming , B-34 superscalar , 326 , 505–506 , 368.e4 technologies for building , 24–28 two-issue , 323–325 vector , 497–498 VLIW , 322

Product , 176 Product of sums , A-11 Program counters (PCs) , 241

changing with conditional branch , 311 b –312 b

defined , 99 , 241 exception , 435 , 437 incrementing , 241 , 243 f instruction updates , 277

Program performance elements affecting , 39 t understanding , 9

Programmable array logic (PAL) , A-77 Programmable logic arrays (PLAs)

component dots illustration , A-16 f control function implementation , C-7 f , C-20 defined , A-12–A-13 example , A-13 b –A-14 b illustrated , A-13 f ROMs and , A-15–A-16 size , C-20 truth table implementation , A-13

Programmable logic devices (PLDs) , A-77 Programmable ROMs (PROMs) , A-14 Programming languages . See also specific languages

brief history of , 173.e6–173.e7 object-oriented , 145 variables , 67

Programs assembly language , 125 Java, starting , 132–133 parallel processing , 490 starting , 124–133 translating , 124–133

Propagate defined , A-39 example , A-44 b –A-45 b super , A-40

Protected keywords , 150.e20 Protection

defined , 418 implementing , 435–437 mechanisms , 497.e12 VMs for , 414–415

Protection group , 481.e4 Pseudo MIPS

defined , 225

Pseudoinstructions defined , 125 summary , 126

Pthreads (POSIX threads) , 530 PTX instructions , B-31 , B-32 f Public keywords , 150.e20 Push

defined , 99 using , 102–104

Q

Quad words , 151 Quicksort , 401 b –403 b , 402 f Quotient , 182

R

Race , A-72–A-73 Radix sort , 401 b –403 b , 402 f , B-63–B-65

CUDA code , B-64 f implementation , B-63–B-65

RAID . See Redundant arrays of inexpensive disks (RAID)

RAM , 9 Raster operation (ROP) processors , B-12 , B-41 , B-50–B-51

fixed function , B-41

Raster refresh buffer , 18 Rasterization , B-50 Ray casting (RC) , 542 Read-only memories (ROMs) , A-14–A-16

control entries , C-16 b –C-18 b control function encoding , C-19 dispatch , C-25 f implementation , C-15–C-19 logic function encoding , A-15 overhead , C-18 PLAs and , A-15–A-16 , A-16 programmable (PROM) , A-14 total size , C-15–C-16

Read-stall cycles , 389 Read-write head , 371 Receive message routine , 519 Recursive procedures , 106 b . See also Procedures

clone invocation , 102

Reduced instruction set computer (RISC) architectures , D-3–D-5 , D-5–D-9 , D-9–D-16 , D-16–D-18 , D-19 , D-20–D-25 , D-25–D-27 ,

D-27–D-29 , D-29–D-32 , D-32–D-34 , D-34–D-36 , D-36–D-38 , D-38–D-39 , D-39–D-40 , D-40 , D-40–D-43 , D-43–D-45 , 368.e3 , 173.e4 . See also Desktop and server RISCs ; Embedded RISCs

group types , D-3–D-4 instruction set lineage , D-44 f

Reduction , 509 Redundant arrays of inexpensive disks

(RAID) , 481.e1–481.e2 history , 481.e6–481.e7 RAID 0 , 481.e3 RAID 1 , 481.e4 RAID 2 , 481.e4 RAID 3 , 481.e4 RAID 4 , 481.e4–481.e5 RAID 5 , 481.e5–481.e6 RAID 6 , 481.e6 spread of , 481.e5 summary , 481.e6–481.e7 use statistics , 481.e6 f

Reference bit , 426 b References

absolute , 127 types , 150.e25

Register addressing , 118 f Register allocation , 150.e10–150.e12 Register files , A-49 b , A-53–A-55

in behavioral Verilog , A-56 defined , 242 , A-49 b , A-53 single , 247 , 247 two read ports implementation , A-54 f with two read ports/one write port , A-54 f write port implementation , A-55 f

Register-memory architecture , 173.e2 Registers , 148 , 149–151

architectural , 314 , 335–336 base , 69 clock cycle time and , 67 compiling C assignment with , 67 b –68 b defined , 67 destination , 252 floating-point , 210 b left half , 278 number specification , 242 page table , 422–423 pipeline , 295–296 , 295–296 , 296 f , 300 primitives , 67 renaming , 325 right half , 278 RISC-V conventions , 253 f spilling , 71 Status , 314 temporary , 68 , 100 variables , 68

Relative performance , 31 b Relative power , 41 b –42 b Reliability , 408–409 Remainder, defined , 182 Reorder buffers , 332 b Replication , 454 Requested word first , 382 Request-level parallelism , 522 Reservation stations

buffering operands in , 327 defined , 327

Response time , 30 b Restartable instructions , 438–439 Return address , 99 R-format

ALU operations , 243 f Ripple carry

adder , A-29 carry lookahead speed versus, A-45 b

RISC-V , 62 , 85–87

architecture , 188 f arithmetic instructions , 63 arithmetic/logical instructions not in , D-21 f , D-23 f assembly instruction, mapping , 81 b –82 b common extensions to , D-20–D-25 compiling C assignment statements into , 65 b compiling complex C assignment into , 66 b control instructions not in , D-21 f control registers , 437 b control unit , C-10 data transfer instructions not in , D-20 f , D-22 f divide in , 187 exceptions in , 314–315 fields , 83–89 floating-point instructions , 204–210 floating-point instructions not in , D-22 f instruction classes , 157 f instruction encoding , 85 f , 119 f instruction formats , 120 , 146 f

instruction set , 62 , 159–160 , 224 , 234 , D-9–D-16

machine language , 87–89 memory addresses , 70 f memory allocation for program and data , 106 f multiply in , 181 Pseudo , 224 f register conventions , 107 f static multiple issue with , 322–326

Roofline model , 532–533 , 534 f , 535 with ceilings , 536 f computational roofline , 533 , 535 illustrated , 532 f Opteron generations , 533 with overlapping areas shaded , 537 f peak floating-point performance , 536 f peak memory performance , 540 f with two kernels , 537 f

Rotational delay . See Rotational latency Rotational latency , 373 Rounding , 210–211

accurate , 210–211 bits , 212 with guard digits , 211 b IEEE 754 modes , 211–212

Row-major order , 209 b –210 b , 403 R-type, defined , 87 b R-type instructions , 246 b –247 b

datapath for , 254–257 datapath in operation for , 256 f

RV32 , 73 b RV64 , 73 b

S

Saturation , 175 b sb (store byte) , 64 f SB-type instruction format , 115 sc.d (store conditional) , 64 f ScaLAPACK , 221–222 Scaling

strong , 495 weak , 495

Scientific notation adding numbers in , 197 defined , 189 for reals , 189

sd (store doubleword) , 64 f Search engines , 4 Secondary memory , 23 Sectors , 371–372

Seek , 372 Segmentation , 421 b Selector values , A-10 Semiconductors , 25–26 Send message routine , 519 Sensitivity list , A-23–A-24 Sequencers

explicit , C-32 implementing next-state function

with , C-22–C-28 Sequential logic , A-4 Servers , 481.e6 . See also Desktop and

server RISCs cost and capability , 5

Service accomplishment , 408–409 Service interruption , 408 Set-associative caches , 393 . See also

Caches address portions , 397 f block replacement strategies , 443 choice of , 442 four-way , 394 f , 397 memory-block location , 393 f misses , 395 b –396 b n -way , 393 two-way , 394 f

Set less than instruction (slt) , A-31 Setup time , A-52–A-53 , A-53 f sh (store halfword) , 64 f Shaders

defined , B-14 floating-point arithmetic , B-14 graphics , B-14–B-15 pixel example , B-15–B-17

Shading languages , B-14 Shadowing , 481.e4 Shared memory . See also Memory

as low-latency memory , B-21 caching in , B-58–B-60 CUDA , B-58 N-body and , B-66 f per-CTA , B-39 SRAM banks , B-40

Shared memory multiprocessors (SMP) , 507–512

defined , 490–491 , 507–508 single physical address space , 507 synchronization , 508–511

Shift left logical immediate (slli) , 90 Shift right arithmetic (srai) , 90 Shift right logical immediate (srli) , 90 Sign and magnitude , 190 Sign bit , 77

Sign extension , 244 defi ned , 78 b shortcut , 78–79

Signals asserted , 240 , A-4 control , 240 , 253 deasserted , 240 , A-4

Signed division , 185–186 Signed multiplication , 180 Signed numbers , 74–81

sign and magnitude , 75 treating as unsigned , 96

Significands , 191–192 addition , 196–197 multiplication , 199–203

Silicon , 25–26 as key hardware technology , 53 crystal ingot , 26 defined , 25–26 wafers , 26

Silicon crystal ingot , 26 SIMD (Single Instruction Multiple

Data) , 496–497 , 548–549 computers , 577.e1–577.e3 data vector , B-35 extensions , 577.e3 for loops and , 577.e2 massively parallel multiprocessors ,

577.e1 small-scale , 577.e3 vector architecture , 498–500 in x86 , 498

SIMMs (single inline memory modules) , 497.e5 f , 497.e4

Simple programmable logic devices (SPLDs) , A-77

Simplicity , 65–67 Simultaneous multithreading (SMT) ,

505 support , 505 f thread-level parallelism , 505 unused issue slots , 505 f

Single error correcting/Double error detecting (SEC/DED) , 410–414

Single instruction single data (SISD) , 498 , 502–504

Single precision . See also Double precision

binary representation , 194 b defined , 191

Single-clock-cycle pipeline diagrams , 285–288

illustrated , 287 f

Single-cycle datapaths . See also Datapaths illustrated , 275 f instruction execution , 276 f

Single-cycle implementation control function for , 259 nonpipelined execution versus

pipelined execution , 264 f non-use of , 259–260 penalty , 260 pipelined performance versus,

262 b –263 b Single-instruction multiple-thread

(SIMT) , B-27–B-29 overhead , B-35 multithreaded warp scheduling , B-28 f processor architecture , B-28–B-29 warp execution and divergence ,

B-29–B-30 Single-program multiple data (SPMD) ,

B-22 sll (shift left logical) , 64 f slli (shift left logical immediate) , 64 f Smalltalk-80 , 173.e7 Smart phones , 7 Snooping protocol , 454–456 Snoopy cache coherence , 482.e16 Software optimization

via blocking , 403–407 Software

layers , 13 f multiprocessor , 490 parallel , 491 as service , 7 , 522 , 548 systems , 13

Sort algorithms , 141 f Sort procedure , 135 . See also Procedures

code for body , 136–138 full procedure , 139–140 passing parameters in , 138 preserving registers in , 138–139 procedure call , 138 register allocation for , 136–141

Sorting performance , B-54–B-55 Space allocation

on heap , 105–108 on stack , 104

SPARC annulling branch , D-23–D-25 CASA , D-31–D-32 conditional branches , D-10–D-16 fast traps , D-30 floating-point operations , D-31 instructions , D-29–D-32 least significant bits , D-31 f multiple precision floating-point results , D-32 nonfaulting loads , D-32 overlapping integer operations , D-31 quadruple precision floating-point arithmetic , D-36 register windows , D-29–D-30 support for LISP and Smalltalk , D-30

Sparse matrices , B-55–B-58 Sparse Matrix-Vector multiply (SpMV) ,

B-55 , B-57 f , B-58 CUDA version , B-57 f serial code , B-57 f shared memory version , B-59 f

Spatial locality , 364 large block exploitation of , 381 tendency , 367

SPEC , 54.e10–54.e11 CPU benchmark , 46–48 power benchmark , 48–49 SPEC89 , 54.e10 SPEC92 , 54.e11 SPEC95 , 54.e11 SPEC2000 , 54.e11 SPEC2006 , 54.e11 SPECrate , 528 SPECratio , 47–48

Special function units (SFUs) , B-35 , B-50

defined , B-42–B-43 Speculation , 321–322

hardware-based , 329–331 implementation , 321 performance and , 321 , 322 problems , 321 recovery mechanism , 321

Speed-up challenge balancing load , 495 b –496 b bigger problem , 494 b –495 b

Spilling registers , 71 b –72 b , 99 Split algorithm , 542 Split caches , 387 b sra (shift right arithmetic) , 64 f srai (shift right arithmetic immediate) ,

64 f srl (shift right logical) , 64 f srli (shift right logical immediate) , 64 f Stack architectures , 173.e3–173.e4 Stack pointers

adjustment , 102–104 defined , 99 values , 101 f

Stacks allocating space on , 104 for arguments , 99 defined , 99 pop , 99 push , 99 , 102–104

Stalls , 268 avoiding with code reordering ,

268 b –269 b behavioral Verilog with detection ,

366.e5–366.e7 data hazards and , 301–305 illustrations , 366.e25 insertion into pipeline , 303 f load-use , 306 memory , 389 as solution to control hazard , 269 write-back scheme , 390 write buffer , 389

Standby spares , 481.e7 State

in 2-bit prediction scheme , 310 assignment , A-69 , C-27 bits , C-8–C-10 exception, saving/restoring , 438 logic components , 239 specification of , 422 b

State elements clock and , 239 combinational logic and , 239 defined , 238–239 , A-47 inputs , 239 register file , A-49 b in storing/accessing instructions , 242 f

Static branch prediction , 322 Static data

segment , 105 Static multiple-issue processors , 320 ,

322–326 . See also Multiple issue control hazards and , 322–323 instruction sets , 322 with RISC-V ISA , 322–326

Static random access memories (SRAMs) , 368 , 369 , A-57–A-66

array organization , A-61 f basic structure , A-60 f defined , 19–22 , A-57 fixed access time , A-57 large , A-58 read/write initiation , A-58 synchronous (SSRAMs) , A-59 three-state buffers , A-58 , A-59 f

Static variables , 104 b Steady-state prediction , 310 Sticky bits , 212 Store buffers , 332 b Store byte , 109 Store-conditional doubleword , 122–123

Store doubleword , 70–71 Store instructions . See also

Load instructions access , B-41 base register , 252 compiling with , 71 conditional , 122–123 defined , 71 b EX stage , 282 f ID stage , 279 f IF stage , 279 f instruction dependency , 300 b MEM stage , 281 f unit for implementing , 245 f WB stage , 281 f

Store word , 113 b Stored program concept , 63

as computer principle , 88 b illustrated , 88 f principles , 159–160

Strcpy procedure , 110 b –111 b . See also Procedures

as leaf procedure , 111 pointers , 111

Stream benchmark , 538 b Streaming multiprocessor (SM) , B-13 Streaming processors , B-34 , B-49–B-50

array (SPA) , B-41 , B-46 Streaming SIMD Extension 2 (SSE2)

floating-point architecture , 215 Streaming SIMD Extensions (SSE) and

advanced vector extensions in x86 , 215

Stretch computer , 368.e1 f , 368.e1 Strings

defined , 109–111 in Java , 111–113 representation , 108 f

Strip mining , 500 b Striping , 481.e3 Strong scaling , 495 Structural hazards , 265 , 282 STXR (store exclusive register) , 122–123 sub (subtract) , 64 f Subnormals , 214

Subtraction , 172–175 . See also Arithmetic binary , 172 b –173 b floating-point , 204 negative number , 174 overflow , 174

Subword parallelism , 214–215 , 342 f , D-17 and matrix multiply , 216–220

Sum of products , A-11 , A-12 b Supercomputers , 368.e2

defined , 5 SuperH , D-15 , D-39–D-40 Superscalars

defined , 326 , 368.e3–368.e4 dynamic pipeline scheduling , 326–327 multithreading options , 492

Supervisor Exception Cause Register (SCAUSE) , 314

Supervisor exception program counter (SEPC) , 314 , 362 , 437

address capture , 317–319 defined , 315–317 in restart determination , 314

Supervisor exception return (sret) , 435 Supervisor Page Table Base Register

(SPTBR) , 425 f Supervisor Trap Vector (STVEC) , 319 b Surfaces , B-41 sw (store word) , 64 f Swap procedure , 134 . See also

Procedures body code , 134–135 full , 135 , 139–140 register allocation , 134–135

Swap space , 424 Symbol tables , 126 Synchronization , 121–124 , 542

barrier , B-18 , B-20 , B-34 defined , 508–511 lock , 121 overhead, reducing , 44–45 unlock , 121

Synchronizers from D flip-flop , A-75 f defined , A-75 failure , A-75–A-76

Synchronous DRAM (SDRAM) , 369 , A-59 , A-64

Synchronous SRAM (SSRAM) , A-59 Synchronous system , A-47–A-48 Syntax tree , 150.e2 System calls, defined , 362 Systems software , 13

SystemVerilog cache controller , 482.e1–482.e4 cache data and tag modules , 482.e16 FSM , 482.e6 f simple cache block diagram , 482.e3 f type declarations , 482.e1 f

T

Tablets , 7 f Tags

defined , 374 in locating block , 397 page tables and , 424 size of , 399 b –400 b

Tail call , 107 Task identifiers , 436 Task parallelism , B-24 Task-level parallelism , 490 Tebibyte (TiB) , 5 Tesla PTX ISA , B-31

arithmetic instructions , B-33 barrier synchronization , B-34 GPU thread instructions , B-32 f memory access instructions , 206

Temporal locality , 364 tendency , 367

Temporary registers , 68 , 100 Terabyte (TB) , 6 f

defined , 5 Texture memory , B-40 Texture/processor cluster (TPC) , B-47 TFLOPS multiprocessor , 577.e4–577.e5 , 577.e5

Thrashing , 440 Thread blocks , 516 f

creation , B-23 defined , B-19 managing , B-30 memory sharing , B-20–B-21 synchronization , B-20–B-21

Thread parallelism , B-22 Threads

creation , B-23 CUDA , B-36 ISA , B-31–B-34 managing , B-30 memory latencies and , B-74 b multiple, per body , B-68–B-72 warps , B-27–B-28

Three Cs model , 445 b Three-state buffers , A-58 , A-59 f

Throughput defined , 29–30 multiple issue and , 320 pipelining and , 262

Thumb , D-15 f , D-38–D-39 Timing

asynchronous inputs , A-75–A-76 level-sensitive , A-74–A-75 methodologies , A-71–A-77 two-phase , A-74 f

TLB misses , 429 . See also Translation-lookaside buffer (TLB)

handling , 437–439 occurrence , 437 problem , 440

Tomasulo’s algorithm , 368.e2 Touchscreen , 19 Tournament branch predictors , 311–312

Tracks , 371–372 Transfer time , 373 Transistors , 25 Translation-lookaside buffer (TLB) ,

428–430 , D-26–D-27 , 497.e5 . See also TLB misses

associativities , 430 illustrated , 429 f integration , 433 Intrinsity FastMATH , 430–433 typical values , 430

Transmit driver and NIC hardware time versus receive driver and NIC hardware time , 553.e7 f

Tree-based parallel scan , B-62 f Truth tables , A-5

ALU control lines , C-5 f for control bits , 251 datapath control outputs , C-17 f datapath control signals , C-14 f defined , 251 example , A-5 b next-state output bits , C-15 f PLA implementation , A-13

Two’s complement representation , 76

advantage , 77 negation shortcut , 78 b –79 b rule , 80 b sign extension shortcut , 79 b –80 b

Two-level logic , A-11–A-14 Two-phase clocking , A-74 , A-74 f TX-2 computer , 577.e3

U

Unconditional branches , 93 Underflow , 190 Unicode

alphabets , 111 defined , 111 example alphabets , 112 f

Unified GPU architecture , B-10–B-11 illustrated , B-11 f processor array , B-11–B-12

Uniform memory access (UMA) , 508 , B-9 multiprocessors , 508

Units commit , 327 , 332 b control , 237–238 , 249–251 , C-4–C-8 ,

C-10 f , C-12–C-13 defined , 211–212 floating point , 211–212 hazard detection , 301 , 304–305 for load/store implementation , 245 f special function (SFUs) , B-35 ,

B-42–B-43 , B-50 UNIVAC I , 54.e4 f , 54.e3–54.e4 UNIX , 173.e7 , 497.e10 , 497.e13 ,

497.e14 AT&T , 497.e14 Berkeley version (BSD) , 497.e14 genius , 497.e16 history , 497.e13 , 497.e14

Unlock synchronization , 121 Unsigned numbers , 74–81 Use latency

defined , 323–325 one-instruction , 323–325

V

Vacuum tubes , 25 f Valid bit , 374–376 Variables

C language , 104 b programming language , 67 register , 67 static , 104 b storage class , 104 b type , 104 b

VAX architecture , 173.e3 , 497.e6 Vector lanes , 500 Vector processors , 497–504 .

See also Processors conventional code comparison ,

499 b –500 b

instructions , 499 multimedia extensions and ,

498–500 scalar versus, 500–501

Vectored interrupts , 314 Verilog

behavioral definition of RISC-V ALU , A-25 f

behavioral definition with bypassing , 366.e4 f

behavioral definition with stalls for loads , 366.e6 f

behavioral specification , A-21 , 366.e1–366.e3

behavioral specification of multicycle MIPS design , 366.e11 f

behavioral specification with simulation , 366.e1–366.e3

behavioral specification with stall detection , 366.e5–366.e7

behavioral specification with synthesis , 366.e10–366.e16

blocking assignment , A-24 branch hazard logic implementation , 366.e7–366.e10 combinational logic , A-23–A-26 datatypes , A-21–A-23 defined , A-20–A-21 forwarding implementation , 366.e3–366.e5 modules , A-23 f multicycle MIPS datapath , 366.e13 f nonblocking assignment , A-24 operators , A-22–A-23 program structure , A-23 reg , A-21 RISC-V ALU definition in , A-36–A-37 sensitivity list , A-23–A-24 sequential logic specification , A-55–A-57 structural specification , A-21 wire , A-21 , A-22

Vertical microcode , C-32 Very large-scale integrated (VLSI)

circuits , 25 Very Long Instruction Word (VLIW)

defined , 322 first generation computers ,

368.e4 processors , 322

VHDL , A-20–A-21 Video graphics array (VGA) controllers ,

B-3–B-4

Virtual addresses causing page faults , 438 defined , 418–419 mapping from , 418–419 size , 420–421

Virtual machine monitors (VMMs) defined , 414 implementing , 468 b laissez-faire attitude , 468 page tables , 439 b in performance improvement , 417 requirements , 416

Virtual machines (VMs) , 414–417 benefits , 414–415 illusion , 439 b instruction set architecture support ,

417 performance improvement , 417 for protection improvement , 414–415

Virtual memory , 417–441 . See also Pages address translation , 418–419 , 428–430 integration , 433–435 for large virtual addresses , 426–427 mechanism , 440 motivations , 417–418 page faults , 418–419 , 424 protection implementation , 435–437 segmentation , 421 b summary , 439–441 virtualization of , 439 b writes , 428

Virtualizable hardware , 416 Virtually addressed caches , 434 Visual computing , B-3 Volatile memory , 22

W

Wafers , 26 defects , 26–27 dies , 27 , 28 yield , 27

Warehouse Scale Computers (WSCs) , 7 , 519–524 , 548

Warps , B-27–B-28 Weak scaling , 495 Wear levelling , 371 While loops , 94 b –95 b Whirlwind , 497.e1 Wide area networks (WANs) , 24 . See also

Networks Wide immediate operands , 113–114 Words

accessing , 68 defined , 67 double , 151 load , 69 , 71 quad , 151 store , 71 b

Working set , 440 World Wide Web , 4 Worst-case delay , 260 Write buffers

defined , 385 stalls , 381 write-back cache , 385

Write invalidate protocols , 454 Write serialization , 453–454 Write-back caches . See also Caches

advantages , 444 cache coherency protocol , 482.e4 complexity , 385 defined , 384 , 444 stalls , 389 write buffers , 385

Write-back stage control line , 290 f load instruction , 280 store instruction , 282

Writes complications , 384 b –385 b expense , 440 handling , 383–385 memory hierarchy handling of ,

331–332 schemes , 384 virtual memory , 427 write-back cache , 384 , 385

write-through cache , 384 , 385 Write-stall cycles , 389 Write-through caches . See also Caches

advantages , 444 defined , 383 , 444 tag mismatch , 384

X

x86 , 146–155 Advanced Vector Extensions in ,

215–216 brief history , 173.e5–173.e6 conclusion , 154–155 data addressing modes , 149–151 evolution , 96 first address specifier encoding , 155 f instruction encoding , 153–154 instruction formats , 154 f instruction set growth , 156 f instruction types , 152 f integer operations , 151–152 registers , 149–151 SIMD in , 496–497 Streaming SIMD Extensions in ,

215–216 typical instructions/functions , 154 f typical operations , 153 f unique , D-36–D-38

Xerox Alto computer , 54.e7–54.e9 XMM , 215 xor (exclusive or) , 64 f xori (exclusive or immediate) , 64 f

Y

Yahoo! Cloud Serving Benchmark (YCSB) , 530

Yield , 27 YMM , 216

Z

Zettabyte , 6 f
