Dept. of ECE, GMU ECE 511 - 20011 System Organization Dr. Qiang Lin


Why Study System Organization

• To write even a modest 80x86 assembly language program requires considerable familiarity with 80x86 family

• To write good assembly language programs requires a strong knowledge of underlying hardware

• Unfortunately, underlying hardware is not consistent

– Techniques crucial for 8088 programs may not be useful on 80486 systems

– Programming techniques to provide big performance boosts on an 80486 chip may not help at all on an 80286

– Fortunately, some programming techniques work well whatever microprocessor you are using


Basic System Components

John Von Neumann, a pioneer in computer design, is given credit for architecture of most computers in use today, including 80x86 family. A typical Von Neumann Architecture (VNA) system has three major components:

CPU, memory, and I/O. How a system designer combines them impacts system performance.


System Bus

• System bus connects various components of a VNA machine

• 80x86 family system bus consists of 3 major busses:

– address bus, data bus, and control bus

– They vary from processor to processor, e.g., the data bus may have a different implementation on 80386 than on 8088, but both carry data between the processor, I/O, and memory

• A typical 80x86 system uses standard TTL logic levels

– Each wire on a bus uses a standard voltage level to represent zero and one

– These slides specify only zero and one rather than electrical levels, because the levels vary on different processors, especially in laptops


Data Bus

• Shuffle data between various components in a computer system. The bus size varies widely in 80x86 family

• On 80x86s, data bus contains 8, 16, 32, or 64 lines (bits)

– 8088, 8 bits; 8086 and 80286, 16 bits

– 80386DX and 80486, 32 bits (80386SX, 16 bits)

– Pentium, Pentium II, Pentium III and Pentium 4, 64 bits

– Itanium, 64 bits

• 8 bit data bus does not limit processor to 8 bit data types, but processor can only access one byte of data per memory cycle

• Processors with a 16 bit bus are naturally faster at moving data than processors with an 8 bit bus. Likewise, processors with a 32 bit bus are faster than those with a 16 or 8 bit data bus
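The effect of bus width on transfer time can be sketched with a little arithmetic. This is a minimal sketch, assuming the data is aligned and the bus moves one full bus width per memory cycle; the function name `memory_cycles` is hypothetical and not from the slides.

```python
def memory_cycles(num_bytes, bus_width_bits):
    """Memory cycles needed to move num_bytes over a bus of the given width.

    Assumes aligned data and one bus-wide transfer per memory cycle.
    """
    bytes_per_cycle = bus_width_bits // 8
    # Ceiling division: a partial transfer still costs a whole cycle.
    return -(-num_bytes // bytes_per_cycle)


# A 32 bit double word on 8, 16, and 32 bit data busses.
for width in (8, 16, 32):
    print(width, "bit bus:", memory_cycles(4, width), "memory cycle(s)")
```

Under these assumptions a double word costs four cycles on an 8 bit bus, two on a 16 bit bus, and one on a 32 bit bus.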


Address Bus

• Question - Which memory location or I/O device?

• To differentiate memory locations and I/O devices, a unique memory address is assigned to each memory element and I/O device

– When software wants to access particular memory location or I/O device, it places the corresponding address on address bus

– Circuitry associated with the memory or I/O device recognizes this address and instructs memory or I/O device to read the data from or place data on data bus

– In either case, all other memory locations ignore the request

• With n address lines, processor can provide 2^n unique addresses
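The 2^n relationship is easy to check for the address bus sizes used in the 80x86 family. A minimal sketch; the function name `addressable_locations` is hypothetical.

```python
def addressable_locations(address_lines):
    """Number of unique addresses a bus with this many lines can select."""
    return 2 ** address_lines


# Address bus sizes that appear in the 80x86 family, plus the 16 bit I/O space.
for lines in (16, 20, 24, 32):
    print(f"{lines} lines -> {addressable_locations(lines):,} addresses")
```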


80x86 Family Address Bus Sizes

Processor   Address Bus Size (bits)   Max Addressable Memory (bytes)   In English!
8088        20                        1,048,576                        One Megabyte
8086        20                        1,048,576                        One Megabyte
80188       20                        1,048,576                        One Megabyte
80186       20                        1,048,576                        One Megabyte
80286       24                        16,777,216                       Sixteen Megabytes
80386sx     24                        16,777,216                       Sixteen Megabytes
80386dx     32                        4,294,967,296                    Four Gigabytes
80486       32                        4,294,967,296                    Four Gigabytes
Pentium     32                        4,294,967,296                    Four Gigabytes
Itanium     64


Control Bus

• A collection of signals that control how processor communicates with the rest of the system

• Exact makeup of the control bus varies across the 80x86 family; common signals are:

– read and write: control direction of data on data bus. When read line is low (logic zero), CPU is reading data from memory; when write line is low, the system transfers data from the CPU to memory

– byte enable: allow 16, 32, and 64 bit CPUs to deal with smaller chunks of data

– Two distinct address spaces: one for memory and one for I/O. I/O addresses on all 80x86 CPUs are 16 bits wide, but memory and I/O share the same address bus, so control lines indicate which address space an address belongs to

– Others like system timing, waits, etc


Memory Subsystem

• Question - What exactly is a memory location?

– 80x86s support segmented, byte-addressable memory, so the basic memory unit is a byte located within a segment

– A word uses 2 consecutive memory addresses or 2 bytes. Similarly, a double word uses 4 consecutive memory locations

– Examples of memory locations and 8 bit, 16 bit and 32 bit data bus, respectively
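How words and double words occupy consecutive byte addresses can be sketched with a small byte array. The slides do not state the byte order; the sketch assumes the 80x86 convention of storing the low-order byte at the lowest address (little-endian). The 16-byte `memory` array and the helper names are hypothetical.

```python
# A tiny byte-addressable memory: one byte per address.
memory = bytearray(16)


def write_word(addr, value):
    """Store a 16 bit value in two consecutive bytes, low byte first."""
    memory[addr:addr + 2] = value.to_bytes(2, "little")


def read_word(addr):
    """Read 2 consecutive bytes as one 16 bit value."""
    return int.from_bytes(memory[addr:addr + 2], "little")


def read_dword(addr):
    """Read 4 consecutive bytes as one 32 bit value."""
    return int.from_bytes(memory[addr:addr + 4], "little")


write_word(2, 0x1234)
print(hex(memory[2]), hex(memory[3]))  # the two bytes that hold the word
print(hex(read_word(2)))
```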


I/O Subsystem

• 80x86 family provides a 16 bit I/O address bus

– Other than separate control lines and a smaller bus, I/O addressing behaves exactly like memory addressing

– Memory and I/O devices both share the same data bus

• Three limitations

– Require special instructions to access I/O devices

– IBM PC designers used the "best" I/O locations for their own purposes, forcing third party developers to use less accessible locations

– Can address no more than 65,536 I/O addresses


System Timing and Clock

• On Von Neumann machines, such as 80x86s, most operations are serialized

– Computer executes commands in a prescribed order

– To ensure all operations occur at just the right moment, 80x86 CPUs use an alternating signal called the system clock

• At the most basic level, system clock handles all synchronization within a computer system


System Clock (Continued)

• The frequency with which system clock alternates between zero and one is system clock frequency

• The time for system clock to switch from zero to one and back to zero is clock period, also called a clock cycle

• Most CPUs start an operation on falling or rising edge

• Since all CPU operations are synchronized around clock, CPU cannot run tasks faster than clock

• Many operations take multiple clock cycles to complete so CPU often performs operations at a significantly lower rate
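The relationship between clock frequency, clock period, and the rate at which operations complete can be sketched as follows. The function names and the 4-cycle figure in the usage lines are illustrative assumptions, not measurements of a particular CPU.

```python
def clock_period_ns(frequency_mhz):
    """Clock period in nanoseconds for a clock frequency in MHz."""
    return 1000.0 / frequency_mhz


def max_operations_per_second(frequency_mhz, cycles_per_operation):
    """Upper bound on operations per second when each takes a fixed number of cycles."""
    return frequency_mhz * 1_000_000 / cycles_per_operation


print(clock_period_ns(50), "ns per cycle at 50 MHz")
print(clock_period_ns(20), "ns per cycle at 20 MHz")
# An operation needing 4 cycles runs at a quarter of the clock rate.
print(max_operations_per_second(20, 4), "operations per second")
```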


Memory Access and System Clock

• Memory access is an operation synchronized around the system clock

– 80x86 processors take from one to four clock cycles to access a memory location

– 8088 and 8086 CPUs require four clock cycles to access memory, while 80486 requires only one

• Memory access time is the time between a memory operation request (read or write) and the completion of that operation

– Examples of read and write


Wait States

• A wait state is nothing more than an extra clock cycle to give some device time to complete an operation

– In the example above, system loses 10 ns to buffering and decoding. If CPU needs data back in 20 ns, as on a 50 MHz 80486, memory must respond in less than 10 ns

– If cost-effective memory won't work with a fast CPU, an answer is a wait state

– For example, a 20 MHz processor has a memory cycle time of 50 ns; after losing 10 ns to buffering and decoding, it needs 40 ns memory. If you can only afford 80 ns memory, adding a wait state to extend the memory cycle to 100 ns (two clock cycles) solves the problem:

• Subtracting 10 ns for decoding and buffering leaves 90 ns. Therefore, 80 ns memory will respond well before CPU requires the data
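The wait state arithmetic above can be sketched as a small calculation. It assumes a memory access normally takes one clock cycle, a fixed 10 ns buffering and decoding overhead, and that memory meeting the deadline exactly is acceptable; the function name and parameters are hypothetical.

```python
import math


def wait_states_needed(clock_mhz, memory_ns, overhead_ns=10.0, base_cycles=1):
    """Extra clock cycles required so memory of the given speed can respond in time.

    Assumes a normal access takes base_cycles clock cycles and that
    overhead_ns is lost to buffering and decoding on every access.
    """
    period_ns = 1000.0 / clock_mhz
    required_ns = memory_ns + overhead_ns
    total_cycles = max(base_cycles, math.ceil(required_ns / period_ns))
    return total_cycles - base_cycles


print(wait_states_needed(20, 40))  # 40 ns memory on a 20 MHz CPU
print(wait_states_needed(20, 80))  # 80 ns memory on a 20 MHz CPU
```

With these assumptions, 40 ns memory needs no wait state at 20 MHz, while 80 ns memory needs one, matching the example on the slide.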


Cache Memory

• A typical program tends to access the same memory locations repeatedly and to access adjacent memory locations, known as temporal locality of reference and spatial locality of reference, respectively

• Generally, a typical program might only use 10-20% of memory allotted to it

• Cache memory sits between CPU and main memory

– A small amount of very fast (zero wait state) memory

– Bytes appearing within a cache do not have fixed addresses. Instead, cache memory can reassign the address of a data object, which allows it to keep recently accessed values in cache

– Addresses that CPU has never accessed or hasn't accessed in some time remain in main (slow) memory. Since most memory accesses are to recently accessed variables, the data generally appears in cache memory

– A cache hit occurs whenever CPU accesses cache and finds data in cache. A cache miss occurs if CPU accesses cache and data is not there
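Hits and misses can be counted with a toy cache model. This is only a sketch: it assumes a fully associative cache holding one value per address with least-recently-used replacement, none of which the slides specify. The class name `TinyCache` and the dictionary standing in for main memory are hypothetical.

```python
from collections import OrderedDict


class TinyCache:
    """Toy cache: keeps the most recently accessed addresses (LRU replacement)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # address -> cached value
        self.hits = 0
        self.misses = 0

    def read(self, address, main_memory):
        if address in self.lines:
            self.hits += 1
            self.lines.move_to_end(address)  # mark as most recently used
        else:
            self.misses += 1
            if len(self.lines) >= self.capacity:
                self.lines.popitem(last=False)  # evict least recently used
            self.lines[address] = main_memory[address]
        return self.lines[address]


main_memory = {addr: addr * 2 for addr in range(100)}
cache = TinyCache(capacity=4)

# A loop that keeps touching the same three locations (temporal locality).
for _ in range(10):
    for addr in (10, 11, 12):
        cache.read(addr, main_memory)

print("hits:", cache.hits, "misses:", cache.misses)
```

Only the first pass through the loop misses; every later access finds its data already in the cache.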


Stack Memory

• A special type of data structure in which items are removed in the reverse order from that in which they are added

• The most recently added item is the first one removed, also called last-in, first-out (LIFO)

• Adding an item to a stack is called a push

• Removing an item from a stack is called a pop
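The last-in, first-out behavior can be sketched with a Python list standing in for stack memory. This models only the data structure, not how the 8086 manages SS and SP; the helper names `push` and `pop` are hypothetical.

```python
stack = []


def push(item):
    """Add an item to the top of the stack."""
    stack.append(item)


def pop():
    """Remove and return the most recently added item."""
    return stack.pop()


for value in (1, 2, 3):
    push(value)

# Items come back in the reverse of the order in which they were added.
popped = [pop(), pop(), pop()]
print(popped)
```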


8086 CPU Registers

• Special memory locations implemented on CPU chip

• Registers on 8086

– Four segment registers that partition memory into 64 Kbyte segments, so the CPU can address a maximum of 256 Kbytes at a time:

• CS (code segment) for program

• SS (stack segment) for stack memory

• DS (data segment) for data

• ES (extra segment) for extra data

– Four general-purpose registers:

• AX, accumulator register for word multiply, divide and I/O

• BX, base register for translation

• CX, count register for string operations and loops

• DX, data register for word multiply, divide and indirect I/O


8086 CPU Registers (Continued)

– One instruction pointer register, IP, to identify location of next instruction to be fetched

– Two pointer registers to store offset addresses for SS:

• SP, stack pointer

• BP, base pointer

– Two index registers to store offset addresses for DS or ES:

• SI, source index

• DI, destination index

– One status register, also known as flags register:

• six status bits: CF for carry flag, PF for parity flag, AF for auxiliary carry, ZF for zero, SF for sign, OF for overflow

• three control flags: IF for interrupt enable, DF for direction, and TF for trap


Segmented Address Space

• Segmentation, Pros:

– Provides a powerful memory management mechanism

– Allows programmers to partition their programs into modules that operate independently of one another

– Allows two processes to easily share data and to extend CPU addressability

• Segmentation, Cons:

– Limited addressability within a segment

– Harder to run program with huge memory requirement


8086 Instruction Set

• Data-Transfer

– Move data either between its internal registers or between an internal register and a storage location in memory

• Arithmetic

– Include instructions for addition, subtraction, multiplication, and division

• Logic

– Perform logic operations AND, OR, XOR and NOT

• Shift

– Perform logical shift and arithmetic shift

• Rotate

– Include rotate left, right, left through carry and right through carry


8086 Physical Address Calculation

• Segment base address (SBA)

– Identifies starting location of the segment in memory

• Effective address (EA)

– Represents offset of the operand from beginning of the segment

• To reference an operand in memory, 8086 must calculate physical address (PA)

– PA = SBA {CS, SS, DS, ES} : EA

– EA can be any combination or all of

• Base {BX, BP} + Index {SI, DI} + Displacement {8-bit, 16-bit}
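The SBA : EA notation can be made concrete. On the 8086 the 16-bit segment register value is shifted left four bits (multiplied by 16) and added to the 16-bit effective address, giving a 20-bit physical address that wraps around at one megabyte. A minimal sketch; the function name `physical_address` and the sample register values are hypothetical.

```python
def physical_address(segment, offset):
    """8086 physical address: segment * 16 + offset, truncated to 20 bits."""
    return (((segment & 0xFFFF) << 4) + (offset & 0xFFFF)) & 0xFFFFF


# Example: DS = 0x1234, EA = 0x0022
print(hex(physical_address(0x1234, 0x0022)))

# Different segment:offset pairs can name the same physical location.
print(physical_address(0x1234, 0x0022) == physical_address(0x1236, 0x0002))
```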


8086 Addressing Modes

• Support six addressing modes:

– immediate addressing, encode operand’s data directly in instruction without involving memory or other registers

– direct addressing, encode operand’s effective address directly in instruction

– register indirect addressing, PA = {CS, DS, SS, ES}:{BX, BP, SI, DI}

– based addressing, EA = {BX, BP} + {direct, indirect} displacement

– indexed addressing, similar to based addressing but using displacement value as a pointer to starting point of an array in memory

– based-indexed addressing for two dimensional arrays

• PA = {CS, DS, SS, ES} : {BX, BP} + { SI, DI} + displacement

• Base register for m coordinate and index register for n coordinate
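Based-indexed addressing of a two dimensional array can be sketched by computing the effective address from its three parts. The sketch assumes a byte array stored row by row, with the base register holding the row's byte offset, the index register holding the column, and the displacement holding the array's starting offset; `ROW_BYTES`, `ARRAY_OFFSET`, and the function names are hypothetical.

```python
def effective_address(base=0, index=0, displacement=0):
    """EA = base + index + displacement, kept to 16 bits as on the 8086."""
    return (base + index + displacement) & 0xFFFF


ROW_BYTES = 8          # assumed: 8 one-byte elements per row
ARRAY_OFFSET = 0x0100  # assumed: array starts at offset 0x0100 in its segment


def element_offset(m, n):
    """Offset of element [m][n]: base register selects the row, index the column."""
    return effective_address(base=m * ROW_BYTES, index=n, displacement=ARRAY_OFFSET)


print(hex(element_offset(0, 0)))
print(hex(element_offset(2, 3)))
```

The resulting offset is then combined with a segment register value to form the physical address.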


Step-by-Step Instruction Execution

• x86 CPUs may not complete execution of an instruction in a single clock cycle

• CPU often executes several steps for each instruction

– For example, executing the following instruction involves these steps:

mov reg, reg/memory/constant

• Fetch byte from memory

• Update IP register to point at next byte

• Decode instruction to see what it does

• If required, fetch a 16-bit instruction operand from memory

• If required, update IP to point beyond operand

• Compute operand address, if required

• Fetch the operand

• Store fetched value into destination register


Differences Between 80x86 Processors

• 8086 - slowest, no instruction can complete in one clock cycle

– The above mov instruction takes between 5 and 12 clock cycles to execute depending upon the operands

– The longer the instruction, the more time it takes

– Instructions without memory access execute faster

– Instructions using complex addressing modes run slower

• 80286 - major improvement is the pre-fetch queue

– Whenever CPU is not using Bus Interface Unit (BIU), BIU can fetch additional bytes from the instruction stream

– Whenever CPU needs an instruction or operand byte, it grabs the next available byte from the pre-fetch queue


Differences Between 80x86 Processors (Cont)

• 80486 - incorporates pipeline to improve performance

– With a six stage pipeline it can overlap the execution of six separate instructions

– Once CPU fills pipeline, it completes one instruction on each cycle

– This is true even if there are complex addressing modes to be computed, memory operands to fetch, or other operations which use cycles on a non-pipelined processor
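The benefit of an ideal pipeline can be sketched by counting cycles. The sketch assumes no stalls or flushes and, for comparison, a non-pipelined CPU that spends one cycle per stage on every instruction; the function names are hypothetical.

```python
def pipelined_cycles(num_instructions, stages=6):
    """Cycles for an ideal pipeline: fill it once, then finish one instruction per cycle."""
    if num_instructions == 0:
        return 0
    return stages + num_instructions - 1


def unpipelined_cycles(num_instructions, cycles_per_instruction=6):
    """Cycles when each instruction must finish before the next one starts."""
    return num_instructions * cycles_per_instruction


n = 100
print("pipelined:  ", pipelined_cycles(n))
print("unpipelined:", unpipelined_cycles(n))
```

For long instruction sequences the pipelined count approaches one cycle per instruction.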


Stalls In a Pipeline

• Unfortunately, previous scenario is too simplistic

– Two problems, bus contention among instructions and non-sequential execution, may increase average execution time in pipeline

– For example, suppose one instruction needs to store data in memory while another reads data from memory. Contention for the address and data bus may then develop, since the CPU will be trying to fetch data from and write data to memory simultaneously

– One simplistic way to handle bus contention is through a pipeline stall

• The CPU, when faced with contention for the bus, gives priority to the instruction furthest along in the pipeline

• The CPU suspends fetching opcodes until current instruction fetches (or stores) its operand

• This causes the new instruction in the pipeline to take two cycles to execute


Flush a Pipeline

• What happens when an instruction modifies IP register?

– E.g., by the time the instruction jmp 1000 completes execution, the CPU has already started five other instructions and is only one clock cycle away from completing the first. Obviously, CPU must not execute those instructions or it will compute improper results

– The only reasonable solution is to flush the entire pipeline and begin fetching opcodes anew!

– Doing so causes a severe execution time penalty since it will take 6 clock cycles before next instruction completes execution

– This is why pipelines are kept short, and why control transfer instructions should be kept to a minimum in a pipelined system
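The cost of flushing can be sketched by extending the ideal cycle count. The model is an assumption, not a description of real 80486 timing: after each flush the next instruction completes 6 cycles later instead of 1, so each flush adds (stages - 1) extra cycles. The function name is hypothetical.

```python
def cycles_with_flushes(num_instructions, flushes, stages=6):
    """Cycle count for a pipeline that is flushed a given number of times.

    Assumed model: an ideal pipeline needs stages + n - 1 cycles, and every
    flush costs (stages - 1) extra cycles while the pipeline refills.
    """
    if num_instructions == 0:
        return 0
    ideal = stages + num_instructions - 1
    return ideal + flushes * (stages - 1)


print(cycles_with_flushes(100, flushes=0))
print(cycles_with_flushes(100, flushes=10))
```

Under this model ten jumps in a 100-instruction sequence add 50 cycles, which is why frequent control transfers erode the pipeline's advantage.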


Prefetch Queue of 80486

• Bus contention can be reduced by prefetch

– The prefetch queue simply acts as a buffer between the instruction stream in memory and the opcode fetching circuitry

– But the 80486 cannot take advantage of the prefetch queue as the 80286 does, since it constantly accesses memory

– BIU can fetch additional opcode bytes for prefetch queue only when CPU is not accessing memory. Since the 80486 leaves no "dead" bus cycles, its prefetch queue cannot use them to fetch additional opcode bytes


Cache of 80486

• Bus contention can also be reduced by cache

– Be aware of how cache operates to write best programs

• On-chip instruction caches are generally quite small. So, the shorter the instructions, the more of them will fit in the cache.

• The more instructions are in cache, the less often bus contention will occur

• Using registers to hold temporary results places less strain on data cache so it does not need to flush data to memory or retrieve data from memory so often. Use the registers wherever possible!


I/O (Input/Output)

• I/O-mapped I/O

– CPU uses special instructions to transfer data between the computer system and outside world

– For example, a printer uses three ports: a read/write port, an input port and an output port

• Memory-mapped I/O

– CPU uses special memory locations in normal address space of CPU to communicate with real-world devices


I/O (continued)

• Direct Memory Access (DMA)

– To input ten bytes (words) from an input port and store them into memory, CPU must read each value and store it into memory

– For very high-speed I/O devices CPU may be too slow when processing data one byte (word) at a time

– A special form of memory-mapped I/O where peripheral device reads and writes memory without going through CPU is known as direct memory access

– Allows I/O operation to proceed in parallel with other CPU operations to increase overall system speed


Polled I/O and Interrupt

• I/O devices may not accept data at an arbitrary rate, e.g.:

– A Pentium PC can send several million characters/s to a printer, but printer cannot print that many characters in a second

– A keyboard is unable to provide several million keystrokes/s

• CPU needs some way to coordinate data transfer between the computer system and its peripheral devices

• Polled I/O

– CPU constantly tests a status I/O port to see if data is available for receiving or device is ready for sending; inherently inefficient

• Interrupt

– An external hardware event that causes CPU to stop current instruction sequence and call a special interrupt service routine

– Need to save the current CPU contents so they can be restored later
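Why polled I/O is inefficient can be sketched with a simulated device. Everything here is hypothetical: `SimulatedPrinter` stands in for a device with a status port and a data port, and `busy_polls` is an assumed number of status checks during which the device stays busy after accepting a character.

```python
class SimulatedPrinter:
    """Hypothetical device: after each character it stays busy for a while."""

    def __init__(self, busy_polls):
        self.busy_polls = busy_polls
        self._countdown = 0
        self.printed = []

    def read_status(self):
        """Status port: True when the device can accept another character."""
        if self._countdown > 0:
            self._countdown -= 1
            return False
        return True

    def write_data(self, ch):
        """Data port: accept one character, then become busy."""
        self.printed.append(ch)
        self._countdown = self.busy_polls


def polled_print(device, text):
    """Send text by polling; returns how many status checks found the device busy."""
    wasted_polls = 0
    for ch in text:
        while not device.read_status():
            wasted_polls += 1  # CPU does nothing useful while it waits
        device.write_data(ch)
    return wasted_polls


printer = SimulatedPrinter(busy_polls=3)
wasted = polled_print(printer, "HI")
print("".join(printer.printed), "wasted polls:", wasted)
```

With an interrupt, the device would instead signal the CPU when it is ready, and the cycles counted as wasted here could go to other work.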
