Gator uProcessor

Summa Cum Laude Thesis
Sopagna Kheang

Bachelor of Science in Computer Engineering
Fall 2007




Gator Microprocessor
Sopagna Kheang
[email protected]

ABSTRACT

The Gator Microprocessor, also known as the GuP, was developed by Kevin Phillipson in Spring 2007 with the goal of serving as a tool for both the Microprocessors class and the Digital Design class taught at the University of Florida. It is a soft-core processor inspired by the Motorola/Freescale 68xx microprocessors, mainly the 68HC11 and 68HC12 that UF has used to teach the Microprocessors class in past years. The GuP is machine-code compatible with the 6800, 6801, and 68HC11 microprocessors. Speed and efficiency were a main focus, and benchmark testing shows that the Gator Microprocessor is much faster than the old HC11 and compares very well with the HC12.

By making the GuP instruction-set and machine-code compatible with the 68HC series microprocessors, we hope to give students access to the benefits of already-existing assemblers and tools. Furthermore, we believe that students will learn more about microprocessors through the use of the GuP than with the existing HC12 development kit. The GuP allows greater flexibility than the commercial processors because students have access to all of the microcode and architecture of the microprocessor. They can explore the architecture of the microprocessor and gain a better understanding of how microprocessors work. There is much greater potential for learning in being able to see the intricate details that are not available in proprietary products.

INTRODUCTION

In past years the University of Florida has been using two very different boards to teach the Microprocessors class and the Digital Design class. For the Microprocessors class, the board is backed by a Motorola/Freescale 68HC12 processor. The Digital Design board is powered by an Altera Cyclone II FPGA. With the GuP, we have both designs combined into a single board that can be used by both classes.

Not only will students save money; we believe that being able to use the same board for the two classes will give them a better and deeper understanding of the design and architecture of a microprocessor. Designed with student use in mind, the hierarchical layout of the GuP makes it simple to trace signals and understand the processes involved in the execution of each instruction. The proprietary nature of the 68HC12 limits the scope of learning, whereas the GuP overcomes this and will serve as a great educational tool for teaching how microprocessors work.

Page 3: Gator Microprocessor Final2 · 68HC11 and 68HC12 that UF has been using to teach the Microprocessor class in the ... The test code used a large amount of indexed addressing to load

Overview of the features of the Gator Microprocessor:

Architecture
- 16-bit Wide Data Paths
- Variable Micro-cycle Length
- Fixed Cycle Counts per Instruction
- Compact microcode allows fast memory access
- Overlapping Pre-fetch to increase efficiency
- Dual ALUs
- Advanced Math Unit

Industry Standard Compatibility
- Fully source-code and machine-code compatible with the 68HC11
- Compatible with all HC11 C/C++ compilers, including GNU GCC
- Memory bus interfaces with Synchronous and Asynchronous Memory

Speed
- Up to 100 MHz operation in modern FPGAs
- 2.5 times faster per clock cycle than an HC11 on average
- 1.25 times faster per clock cycle than an HC12 on average

BENCHMARKING & VERIFICATION

In order to compare the performance of the three systems, a benchmarking test was performed on the HC11, the HC12, and the GuP. Six simple 7400-series 4-bit counter chips were cascaded to form a 24-bit counter, with the outputs connected to LEDs to display the counter value. This counter was attached to the system clock of the microprocessor under test, and its reset and enable signals were connected to I/O ports, allowing the processor to clear the counter and start and stop the count. Using this system we were able to count the number of system clock cycles it took the different processors to perform any given algorithm. All of the processors ran the test program from external 8-bit SRAM with no memory wait states. Both 16-bit microprocessors, the GuP and the 68HC12, were tested using their external data bus in 8-bit mode. The photo below shows the GuP Development Board being tested.
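The counting harness described above can be modeled in a few lines of software. The following Python sketch (class and method names are my own, not from the project) mirrors a 24-bit counter with reset and enable lines driven from processor I/O ports:

```python
class BenchmarkCounter:
    """Models six cascaded 4-bit counters forming a 24-bit cycle counter."""
    def __init__(self):
        self.count = 0
        self.enabled = False

    def reset(self):           # processor pulses the reset I/O line
        self.count = 0

    def enable(self, on):      # processor toggles the enable I/O line
        self.enabled = on

    def clock(self):           # one system-clock edge
        if self.enabled:
            self.count = (self.count + 1) & 0xFFFFFF   # wraps at 24 bits

counter = BenchmarkCounter()
counter.reset()
counter.enable(True)
for _ in range(938307):        # e.g. the GuP's dataset 1 FFT run
    counter.clock()
counter.enable(False)
counter.clock()                # further edges are ignored once disabled
print(counter.count)           # 938307
```

Because the counter only advances while enabled, the final LED value is exactly the number of system clocks spent inside the enabled region of the test program.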


In order to get a comprehensive test, we decided on a Fast Fourier Transform algorithm because it is long enough and uses a variety of instructions. The test code used a large amount of indexed addressing to load and store data from a 512-byte data table and a sine lookup table. The algorithm used mostly simple sequences of basic arithmetic instructions (e.g. ADDA, SUBA, RORA) with a few multiply instructions (MUL). A copy of the test program is provided in the appendix. The following tables display the results of the test.

FFT Benchmark: GuP vs. HC11

          Dataset 1   Dataset 2
HC11      2459932     2343324    clocks
GuP        938307      952161    clocks
Ratio:      2.622       2.461

FFT Benchmark: GuP vs. HC12

          Dataset 1   Dataset 2
HC12      1190768     1138662    clocks
GuP        938307      952161    clocks
Ratio:      1.269       1.196

The GuP performs about 2.5 times faster than the HC11 processor and 1.25 times faster than the HC12 at the same clock rate. The first data set was a square wave. The second data set was the reverse transform which correctly reassembled the square input wave.
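As a sanity check, the speedup ratios quoted above are simply quotients of the published clock-cycle counts:

```python
# Clock-cycle counts from the benchmark tables, per dataset
hc11 = [2459932, 2343324]
hc12 = [1190768, 1138662]
gup  = [938307, 952161]

# Ratio of each competitor's cycle count to the GuP's, rounded to 3 places
hc11_ratios = [round(a / b, 3) for a, b in zip(hc11, gup)]
hc12_ratios = [round(a / b, 3) for a, b in zip(hc12, gup)]

print(hc11_ratios)  # [2.622, 2.461]
print(hc12_ratios)  # [1.269, 1.196]
```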

[Figure: Fast Fourier Transform Benchmark — bar chart of clock cycles (0 to 2,500,000) for the HC11, HC12, and GuP, with one bar per data set.]


For this past semester our team has been working rigorously to test the efficiency of the GuP. I was responsible for verifying that the entire instruction set works as intended, as well as for writing the microcode for several instructions. We were able to verify that the GuP can accomplish all of the tasks that the HC12 is expected to perform in the Microprocessors class, and we demonstrated all of the labs from that class on the GuP development board. Currently, the GuP development board runs the BUFFALO monitor program written for the HC11 flawlessly. It can handle all the instructions that the HC11 processor can handle, and it performs those instructions more efficiently.

GuP ARCHITECTURE

The GuP is very efficient in executing instructions compared to the HC11 and HC12 processors. Its efficiency can be attributed to many factors, including wider 16-bit data paths, small and compact microcode, an overlapping pre-fetch, and separate ALUs for address and data. The GuP also uses variable-length microcycles instead of the fixed E-Clk of the HC1x, which further results in compressed and faster microcycles, and it implements dedicated logic for advanced math instructions. Given all these enhancements, the GuP still has a fixed and easily predictable instruction cycle count, unlike the HC12.

16-bit Data Paths

Whereas the 68HC11 processor has 8-bit internal data paths, the GuP is very similar to the 68HC12 in that it has internal 16-bit data paths for ALU operations. The GuP's data ALU can operate on both 16-bit and 8-bit data by switching modes. The memory controller used in the current version interfaces to an 8-bit external data bus and therefore implements the necessary cycles to move 16-bit data in and out of the processor. Because the GuP's microcode already assumes everything is 16 bits wide, future versions of the GuP could easily be upgraded to a full 16-bit external data bus by changing only the memory controller.

The main advantage of a 16-bit data path is the reduced number of cycles needed to perform 16-bit operations. In an HC11, an internal 16-bit operation takes two micro-cycles: one to handle the lower byte and the next to handle the upper byte. The added ALU delay of a 16-bit operation over an 8-bit operation is very small compared to the delay of an extra micro-cycle, so 16-bit wide data paths result in a much faster processor. Examples of the benefit can be seen in 16-bit operations such as ABX and ADDD, where the GuP takes one less micro-cycle than the HC11.

Faster Microcycles

Since the GuP is FPGA based, and since FPGAs are synchronous devices with dedicated clock paths, we must design the GuP to synthesize efficiently into an FPGA. If we were to simply divide the clock using a counter or a state machine and then use that clock to drive the circuitry inside the microprocessor, we would have unpredictable race conditions due to clock skew and gated clocks. This design choice led us to use synchronous enable signals for data synchronization rather than dividing the system clock as the HC1x processors do. The use of such synchronous enable signals results in variable


length microcycles, as compared to the fixed microcycle length of the HC11 and HC12. As a result, we supply the circuitry directly with the system clock rather than the E-Clk, which is one fourth of the system clock frequency for the HC11 and one half for the HC12. This gives the GuP an added advantage over the HC11 and HC12: once data is ready to be clocked into registers, it can be clocked in on the next system clock edge rather than waiting for the fixed E-Clk signal.

HC11 Read and Write Cycle

HC12 Read and Write Cycle

GuP Read and Write Cycle
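The contrast between a divided E-Clk and the GuP's synchronous enables can be illustrated with a small Python sketch. The names and enable patterns below are illustrative only, not actual GuP timing:

```python
def count_updates(sys_clocks, enable):
    """Clock a register sys_clocks times; it updates only when enabled."""
    updates = 0
    for i in range(sys_clocks):
        if enable(i):         # enable is evaluated on every system clock edge
            updates += 1      # synchronous update on this edge
    return updates

# HC11-style: data is committed once per fixed E-Clk period (sys_clk / 4)
hc11_updates = count_updates(16, lambda i: i % 4 == 0)
# GuP-style: commit as soon as the data is ready, here every other cycle
gup_updates = count_updates(16, lambda i: i % 2 == 0)

print(hc11_updates, gup_updates)  # 4 8
```

The point of the sketch is that with a full-rate clock plus enables, the commit moment can move to whichever edge the data is ready on, instead of being quantized to a slower divided clock.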


In the HC11 and HC12 processors, the user must combine several signals for address decoding when interfacing devices and memory. Even though these processors have only one signal to determine whether a cycle is a read or a write (R/~W), this signal must be combined with others. The R/~W signal shown above for the HC11 and HC12 is high during a read cycle and low during a write cycle. In an HC12 read cycle, R/~W must be high and the active-low DBE signal must be low; for a write cycle on the same processor, R/~W must be low to indicate a write. In addition, due to a ringing problem on the HC12 development board, this signal must be AND'ed with the E-CLK signal for the decoding to work as expected. Similarly, on the HC11, the R/~W signal must be used along with the E-CLK when interfacing devices. Furthermore, the use of the address strobe (AS) signal in the HC1x complicates matters even more: the data bus is shared with the lower 8 bits of the address bus, and the active-high AS signal is used to latch the lower address from the multiplexed address/data bus.

To simplify everything, the GuP takes a different approach that is straightforward and makes the concept clearer to users. Instead of an R/~W signal, the GuP uses separate read and write enables (rd_en and wr_en). During a read cycle the rd_en signal is asserted, and during a write cycle the wr_en signal is asserted; the two are never true at the same time. Because these signals do not have to be combined with any other signals (i.e. E-CLK or DBE), the GuP requires less support logic to interface memory devices. The design of the GuP's memory controller also allows the processor to interface seamlessly with both synchronous and asynchronous memory devices. The internal data bus on the GuP uses separate read and write data buses, and the bi-directional external bus is easily formed using three-state buffers.
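The decoding contrast can be sketched as two select functions. Only the signal names (R/~W, E-CLK, rd_en, wr_en) come from the text; the helper functions themselves are illustrative:

```python
def hc11_read_select(r_w: bool, e_clk: bool) -> bool:
    # HC1x-style: a valid read needs R/~W high AND E-CLK high
    return r_w and e_clk

def gup_read_select(rd_en: bool) -> bool:
    # GuP-style: rd_en alone qualifies the read cycle
    return rd_en

def bus_driver(rd_en, wr_en, write_data):
    """Three-state buffer model: drive the bus only on a write cycle."""
    assert not (rd_en and wr_en)            # never asserted together
    return write_data if wr_en else None    # None models high impedance (Z)

print(hc11_read_select(True, False))  # False: must still wait for E-CLK
print(gup_read_select(True))          # True
print(bus_driver(False, True, 0x3F))  # 63
```

Because rd_en and wr_en are mutually exclusive one-hot strobes, the external decode reduces to a single wire per direction, which is the "less support logic" claim above.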
Advanced Math Unit

The GuP processor uses dedicated logic to perform the division and multiplication instructions. For the multiplication instruction, MUL, the GuP takes advantage of the FPGA's ability to perform fast multiplication and can complete the instruction in a total of two micro-cycles (corresponding to three system clock cycles). The HC12 takes a total of 3 E-CLK cycles, which corresponds to 6 system clock cycles, and the HC11 MUL instruction takes 10 E-Clocks, which corresponds to 40 system clock cycles. For the IDIV and FDIV instructions, the GuP uses a balance of combinatorial and sequential logic. The division is split into five stages, each taking one system clock cycle; the decode state adds another clock cycle, for a total of 6 system clock cycles, which is a major improvement over the HC11 and the HC12. The HC12 performs those instructions in 12 E-CLK cycles, which equates to 24 system clock cycles, and the HC11 takes 41 E-CLK cycles, which is 164 system clock cycles.
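Restating the cycle counts above in system clocks makes the math-unit speedups explicit:

```python
# System-clock counts taken from the text
mul_sys_clocks = {"GuP": 3, "HC12": 6, "HC11": 40}    # MUL
div_sys_clocks = {"GuP": 6, "HC12": 24, "HC11": 164}  # IDIV / FDIV

mul_speedup = mul_sys_clocks["HC11"] / mul_sys_clocks["GuP"]
div_speedup = div_sys_clocks["HC11"] / div_sys_clocks["GuP"]

print(round(mul_speedup, 2), round(div_speedup, 2))  # 13.33 27.33
```

So at the same clock frequency, the dedicated hardware makes MUL roughly 13x and IDIV/FDIV roughly 27x faster than on the HC11, and twice/four times faster than on the HC12.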


Instruction Timing

The timing of the GuP is quite different from that of the HC11 or HC12 because it uses variable-length micro-cycles. In the HC11 and HC12, memory accesses are framed by one E-CLK cycle, which corresponds to 4 and 2 system clocks respectively. The GuP can do memory accesses in 2 system clock cycles, similar to the HC12 and twice as fast as the HC11. The additional speed of the GuP is a result of its pre-fetch; for instructions that are 16 bits wide and instructions that use the advanced math unit, an extra one or two cycles are needed.

In general, for all microprocessors, the fastest instructions are the inherent instructions. For instance, instructions such as ANDA and ASLA take two E-CLK cycles (8 system clocks) on an HC11: one cycle to fetch and decode the opcode and a second to perform the operation. The GuP can perform such an instruction in three system clock cycles: one cycle in the decode microstate and another two cycles in the execution and pre-fetch state. Instructions that use memory addressing modes such as direct and extended addressing take longer. For instance, consider a LDAA instruction with extended addressing: the HC11 takes four E-CLK cycles (16 system clocks), with one microcycle to fetch and decode, another two to fetch the extended address, and a fourth to load the data into the A register. The GuP requires the same number of microcycles but is far more efficient, as it needs only nine system clock cycles. This is also faster than the HC12, which takes 10 system clocks.
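The E-CLK-to-system-clock conversions used in these examples can be tabulated directly:

```python
# System clocks per E-CLK cycle, from the text
E_CLK_TO_SYS = {"HC11": 4, "HC12": 2}

inherent_hc11 = 2 * E_CLK_TO_SYS["HC11"]   # ANDA / ASLA on the HC11
ldaa_ext_hc11 = 4 * E_CLK_TO_SYS["HC11"]   # LDAA extended on the HC11

print(inherent_hc11, ldaa_ext_hc11)  # 8 16
# versus 3 system clocks (inherent) and 9 (LDAA extended) on the GuP,
# and 10 system clocks for LDAA extended on the HC12
```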


MICROPROGRAMMED CONTROLLER

All microprocessors perform some form of the sequence of steps below:

1. fetch instruction
2. decode
3. addressing mode (optional)
4. execute

The diagram depicts the simplest form of a computer, attributed to the Von Neumann architecture. In this case, the CPU implements the stored program by fetching instructions from memory in the external architecture and decoding and executing them. The algorithms in the controller section of the CPU control the functioning of the internal architecture, which contains the registers, ALU, etc.
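The four-step loop can be made concrete with a toy stored-program machine. This is entirely illustrative: the two-instruction ISA and names below are invented, not the GuP's:

```python
# Program in memory: load 5 into the accumulator, add 7, halt
program = {0: "LDA", 1: 5, 2: "ADD", 3: 7, 4: "HLT"}

def execute(memory):
    pc, acc = 0, 0
    while True:
        opcode = memory[pc]        # 1. fetch instruction
        pc += 1
        if opcode == "HLT":        # 2. decode
            return acc
        operand = memory[pc]       # 3. addressing mode (immediate here)
        pc += 1
        if opcode == "LDA":        # 4. execute
            acc = operand
        elif opcode == "ADD":
            acc += operand

print(execute(program))  # 12
```

In the GuP, this same loop is realized not as software but as sequences of microinstructions stepped through by the controller described below.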

[Figure: Von Neumann architecture block diagram. The CPU contains a controller (clk, inputs, outputs) driving the internal architecture (registers, ALU, etc.); the external architecture comprises the program & data memory and the input/output section, connected to the CPU by the data, address, and control buses.]


In the GuP, the main components of the controller are the microsequencer, the microprogram memory, and the mapper. Each instruction (e.g. STAA, ABA) is implemented by a series of microinstruction states in a state-machine manner. These microinstruction states are stored in a file called gucode.asm and have the following format:

LABEL: <set_map_x> <DATA_OPERATION> OPERAND <DATA_WRITE> OPERAND <CCR_OP> OPERAND <ADDR_OPERATION> OPERAND <LOAD_OPCODE> <JUMP> OPERAND end_state

For instance, one of the execution states looks as follows:

ASRB: set_map_a 57
      DATA8_RSHIFT ASHIFT,B
      DATA_WRITE B
      CCR_OP OP_ooooXXXX
      ADDR_POST_INC PC
      LOAD_OPCODE
      JUMP DECODE
      end_state

The gucode.asm file is then assembled into the raw machine code format shown below:

These controls are stored in the microprogram memory in a file called gucode.mif. The operands in each microcode word control the circuitry and carry out the microinstructions.

LABEL: OPERATION OPERAND1, OPERAND2, …, OPERANDn COMMENT


The diagram below shows in detail the steps involved in generating the microprogram memory and the mapper logic.

The two main files are gucode.asm and gucode.mac. The gucode.mac file is a macro file: it defines all the variables used in the gucode.asm file, from the CCR_OP variables to the various JUMP_MAP variables, and it also defines macro functions such as DATA16_SUB, DATA8_OR, and so forth. In the end, it is responsible for tying the instructions used in the gucode.asm file to the various control vectors that are assembled into raw machine code. For each of the states in the gucode.asm file, the macros defined in gucode.mac set the necessary control variables, and the macro assembler outputs those control bits into an S-record, in our case gucode.s. The s2mif56.c utility converts the S-record into the gucode.mif file, with the same microcode word format as shown earlier. The gucode.mif file is in turn used by Quartus to initialize the microprogram ROM inside the microprocessor core.

Let's consider the function of the map.c utility and answer the question of what a mapper is. In general, many instructions use the exact same pieces of microcode. For instance, in a simple case, the immediate instructions LDAA and LDAB both require the immediate loading of 8 bits, so to cut down on redundancy and save microprogram memory we can use the same piece of microcode for both. The mappers allow us to decode the opcode into the necessary sequence of microinstructions quickly using combinatorial logic: a mapper is just a set of vectors that point to the next execution position in the microcode for a given opcode. In our example, once we decode and find that the instruction is an immediate LDAA, the mapper points to the microstate that implements the addressing mode for loading immediate 8-bit data. Once that is done, a different mapper points to the next location necessary to complete the instruction, either the location that loads the A register or a different location that loads the B register.

The function of the map.c utility is therefore to generate a VHDL file (mapper.vhd), which can be optimized by Quartus and contains all the vectors necessary for completing every opcode instruction. The HC11's opcodes are mapped out over four pages; the second, third, and fourth pages are accessed via the pre-bytes $18, $1A, and $CD respectively. We decided to use a compatible map to make the GuP source-code and machine-code compatible with the HC11. The pre-bytes are shown in orange on the following page. All of the instructions that the GuP is capable of performing are shown on the next two pages.
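A mapper, then, behaves like a combinatorial lookup table from opcode to microprogram address. In the minimal Python sketch below, the $0D vector comes from the SUBA example discussed later in the text; LDAA and LDAB immediate share it because the LOAD1_IMM8 state serves all the 8-bit immediate load-class opcodes:

```python
# Mapper A modeled as a lookup: opcode -> next microprogram address
MAPPER_A = {
    0x80: 0x0D,  # SUBA immediate -> LOAD1_IMM8 addressing-mode microstate
    0x86: 0x0D,  # LDAA immediate -> same shared microstate
    0xC6: 0x0D,  # LDAB immediate -> same microstate, different execution map
}

def jump_map_a(opcode):
    """Decode: return the microstate address mapper A points at."""
    return MAPPER_A[opcode]

print(hex(jump_map_a(0x80)))  # 0xd
```

In hardware this dictionary is flattened by Quartus into pure combinatorial logic, so the "lookup" costs no memory access at all.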

[Figure: microcode build flow — gucode.asm and gucode.mac are assembled by upasm into gucode.s, which s2mif56 converts into gucode.mif; map.c generates mapper.vhd.]

[Opcode map pages 1-4 (figures) appear here, with the pre-bytes highlighted in orange.]

To implement these instructions, eight different mappers were used, two for each of the four pages of opcodes. Mapper A is responsible for the addressing modes and inherent opcode executions of opcode page 1; in addition, it is responsible for the pre-byte decoding of opcode pages 2, 3, and 4. Mapper B is responsible for the execution code of the page 1 memory-addressing opcodes. Mapper C takes care of the addressing modes of the page 2 opcodes and the execution code for the inherent page 2 opcodes. Mapper D is used for the execution code of the page 2 memory-addressing opcodes. Mappers E, F, G, and H are responsible for the addressing modes and execution code of the memory-addressing opcodes of pages 3 and 4 respectively.

Let's consider a few examples to get a better understanding of what is going on. Assume an instruction has been read and we see that the opcode is $80, which is SUBA immediate. In the microcode decode stage we see the following:

DECODE: IF IRQ_COND,FALSE
        JUMP_MAP_A
        end_state

Since the IRQ condition is false, the state says we should jump to the mapper A vector. JUMP_MAP_A is a macro defined in the gucode.mac file which ultimately causes the microsequencer to send out an address vector that sends us to the next state corresponding to the given opcode in the gucode.asm file. If we look at the mapper.vhd file, when the opcode is $80, mapper_a has a vector value of $0D, which corresponds to the 13th state in the gucode.asm file counting from the top. This leads us to the following location in the gucode:

LOAD1_IMM8: set_map_a 80 81 82      ;SUBA CMPA SBCA
            set_map_a 84 85 86      ;ANDA BITA LDAA
            set_map_a 88 89 8A 8B   ;EORA ADCA ORAA ADDA
            set_map_a C0 C1 C2      ;SUBB CMPB SBCB
            set_map_a C4 C5 C6      ;ANDB BITB LDAB
            set_map_a C8 C9 CA CB   ;EORB ADCB ORAB ADDB
            DATA16_PASS PC
            DATA_WRITE EA
            ADDR_POST_INC PC
            LOAD_DATA8
            JUMP_MAP_B
            end_state


Notice that this is the correct location because this is a SUBA immediate instruction. In this one state we set the controls to write the current program counter address to the effective address register, to load an 8-bit piece of data, and to increment the program counter, in addition to jumping to the mapper B vector. For the opcode $80, the corresponding mapper B vector is $D0, which corresponds to:

SUBA: set_map_b 80 90 A0 B0
      set_map_d A0
      DATA8_SUB A,MEM_U8
      DATA_WRITE A
      CCR_OP OP_ooooXXXX
      ADDR_POST_INC PC
      LOAD_OPCODE
      JUMP DECODE
      end_state

The above segment of code executes the instruction. DATA8_SUB is a macro that subtracts parameter 2 from parameter 1; the result is written back to register A on the next line. The CCR_OP line passes a parameter that is used by the Condition Code Register block. In addition, since this is the last state, we also load in a new opcode and jump back to the decode state, where the cycle repeats. Thus, in the last state of many instructions, the GuP performs a pre-fetch in parallel with the execution of the current opcode to increase efficiency and speed.
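The work done by this execution state can be modeled in a few lines. The sketch below performs an 8-bit subtract with the standard 6800-family N, Z, V, and C flag equations; the flag formulas come from the HC11 family definition, not from the GuP sources:

```python
def suba(a, m):
    """8-bit A - M with 6800-family condition-code results."""
    r = (a - m) & 0xFF
    flags = {
        "N": (r & 0x80) != 0,                       # result is negative
        "Z": r == 0,                                # result is zero
        "V": ((a ^ m) & (a ^ r) & 0x80) != 0,       # signed overflow
        "C": m > a,                                 # unsigned borrow
    }
    return r, flags

print(suba(0x40, 0x10))
# (48, {'N': False, 'Z': False, 'V': False, 'C': False})
```

The OP_ooooXXXX bit mask in the microcode plays the role of the flag selection here: it tells the CCR block which of these computed flags are actually updated.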


To consider a more involved instruction, let's look at the page 3 instruction CPD with extended addressing, which corresponds to opcode $B3. Since this instruction is on a page other than the first, there will be a pre-byte when it is assembled. This pre-byte is $1A, a vector on the page 1 opcode map that sends execution to page 3. In the first decode state, the opcode is $1A; in the mapper.vhd file, the mapper A vector for opcode $1A corresponds to microstate 9:

PAGE_3: set_map_a 1A
        ADDR_POST_INC PC
        LOAD_OPCODE
        end_state
        JUMP_MAP_E
        end_state

In the above snippet, we load the opcode again in the first state. Now the opcode register holds $B3, corresponding to the CPD instruction, and we jump through mapper E. According to the mapper.vhd file, this sends us to state number $38 in the gucode.asm file, which leads us to the following:

LOAD3_EXT16: set_map_e B3   ;CPD
             ADDR_POST_INC2 PC
             LOAD_DATA16
             end_state
             DATA16_PASS MEM_U16
             DATA_WRITE EA
             ADDR_PASS MEM_U16
             LOAD_DATA16
             JUMP_MAP_F
             end_state

In this case, we load a 16-bit address and pass it through the data ALU to be stored in the EA register. We also load the 16-bit data word pointed to by the extended address. We then jump through mapper F to execute the instruction.


The mapper.vhd map_f vector points us to $D4, the location for CPD:

CPD: set_map_f 83 93 A3 B3
     set_map_h A3
     DATA16_SUB D,MEM_U16
     CCR_OP OP_ooooXXXX
     ADDR_POST_INC PC
     LOAD_OPCODE
     JUMP DECODE
     end_state

We do the subtraction and set the bit mask for the CCR_OP. At the end, we do a pre-fetch and jump back to the decode state. As you have probably noticed, in the last state of most instructions we perform a pre-fetch and set the necessary control vectors to jump to the decode state for the next instruction. This pre-fetching allows us to pack the execution steps very tightly: there are no wasted states, and as a result we gain an edge in efficiency over the HC11 and HC12 processors. This tight packing also shows in the fact that the microprogram memory is only 256 microinstructions long; such small and compact microcode can be addressed with only an 8-bit wide address, which furthermore results in shorter access times.

As discussed before, the three main pieces of the controller are the mapper, the microsequencer, and the microprogram memory. The objective of the microsequencer is to generate the addresses used to step through the microprogram. In our case, the microsequencer takes as inputs sys_rst, clk, sync, condition, true_false, micro_op, branch_vector, and the map vectors for mappers A through H, and it outputs micro_prog_addr. The sync input synchronizes the microsequencer with the memory controller: it tells the microsequencer when a memory access has completed and it may continue to the next microstate. The micro_op is used by the microsequencer to generate the next address for the microprogram ROM. It should be noted that it is possible to pipeline the microinstructions to create a faster processor. However, this would make the design more complex, and therefore the internal design of the microprocessor would be more difficult for a student to understand, so it was avoided in the initial design.
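The microsequencer's job reduces to a small next-address function. This sketch uses the signal names from the text, but the micro_op encodings are invented for illustration:

```python
def next_addr(current, micro_op, condition_true, branch_vector, map_vector):
    """Pick the next microprogram address, microsequencer-style."""
    if micro_op == "JUMP":              # unconditional jump (e.g. JUMP DECODE)
        return branch_vector
    if micro_op == "JUMP_MAP":          # decode via the mapper's combinatorial logic
        return map_vector
    if micro_op == "BRANCH" and condition_true:
        return branch_vector            # conditional branch taken
    return current + 1                  # default: step to the next microstate

# e.g. the decode state with the IRQ condition false jumps through mapper A:
print(next_addr(0x01, "JUMP_MAP", False, 0x00, 0x0D))  # 13
```

In the real design the sync input additionally gates this step, holding the current microstate until the memory controller reports that the access has completed.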


SUMMARY OF BENEFITS

The Gator Microprocessor attempts to speed up instruction execution in the most efficient ways. The idea of an E-Clk is thrown out; the clock used for all clocked events comes directly from the oscillator. Furthermore, the microcode implements each instruction in the minimum number of states possible; extraneous states are avoided to ensure the fastest possible outcome. Similar to the HC12, the Gator Microprocessor uses 16-bit internal buses, which means that 16-bit operations can be done in one state instead of two.

Our use of mappers to quickly decode instructions into a sequence of microstates has many benefits. For example, new instructions can be added by mapping new opcodes without adding extra microcode. We could easily implement extended addressing mode versions of the instructions BSET, BCLR, BRSET, and BRCLR, which would make the HC11 instruction set more orthogonal.

We also demonstrated the real-world benefits of the Gator Microprocessor through benchmarking. The Gator Microprocessor is much faster than the HC11; on average it executes code 2.5 times faster at any given clock frequency. The GuP is also faster than the far more complex HC12, by a factor of 1.25 when running HC11 opcodes.

CONCLUSION

We believe that the Gator Microprocessor will serve as a very useful tool for the Microprocessors and Digital Design classes. The non-proprietary nature of the project will aid the learning process, as we will no longer be limited in teaching how a microprocessor works. We have made the GuP as fast and efficient as possible while keeping the architecture clean and concise yet powerful. Furthermore, the fact that the GuP is instruction-set compatible with existing industry-standard products will make the transition much easier, as students will have access to existing analogous literature and tools. The GuP is flexible enough to meet the needs of current and future students and fast enough to be used for a variety of research endeavors.

ACKNOWLEDGMENTS

Working on this project has given me very valuable experience. I am very grateful for the enthusiasm and knowledge that Kevin Phillipson was able to share. My thanks go out also to Peter Flores for his work on the GuP, and to Dr. Schwartz for having paired me with this group and for his help and support. I would also like to thank Dr. Crane and Dr. Gugel for their time and input. Furthermore, this report would not be possible without the materials that Dr. Lynch has provided us. Thanks to Tucknology for sponsoring some of the test equipment and for support with the paper.

REFERENCES

Lynch, Michel A. Microprogrammed State Machine Design. CRC Press, 1993.

M68HC11 Reference Manual. Freescale Semiconductor, Inc., 2002, 2007.

Phillipson, Kevin. "Gator Microprocessor, Senior Design Final Report." May 2007.

Schwartz, Eric. "HC12 Bus Timing & Interfacing." May 2006.


APPENDIX

******************************************************************************
*
*  Benchmarking Test Program
*
*  This is the main portion that we will use to benchmark the
*  HC11, HC12, and GuP. This main portion will first enable the
*  counter and then call sample functions to do some calculations.
*
*  The counter will then be disabled at the end so we can get an
*  accurate count of how many cycles it takes to do the operations.
*
******************************************************************************

************************************************************************
*
* FFT for Interactive C
* version 1.1
*
* Public functions:
*
* int power_spectrum(char *data)... accepts pointer to 512-byte array,
*   the first 256 bytes of which represents signed 8-bit real-valued
*   data; performs simple power-spectrum estimation, without windowing
*   or binning; on return, array holds unsigned 8-bit spectral compo-
*   nents; return value is the index to the peak frequency
*
* int fft(char *data)... accepts pointer to 512-byte array, the first
*   half of which represents signed 8-bit real components, the second
*   half of which represents imaginary components; performs in-place
*   FFT; return value is number of times data had to be divided by 2
*
* int inverse_fft(char *data)... same as fft() but array contains
*   complex-valued inverse transform
*
* The function argument is the address of the first byte of the array.
* Because of the nonstandard way IC handles arrays, you'll need to call
* using a form such as power_spectrum((int)&data[0]).
*
************************************************************************
*
* Version history:
*
* 1.0  7/31/01  George Musser ([email protected])
*      Adapted from fft.c11 by
*
*      Ron Williams
*      Department of Chemistry
*      Ohio University
*      Athens, OH 45701
*
*      This, in turn, was a modification of the 6800 FFT presented by
*
*      Richard Lord
*      Byte Magazine, pp. 108-119
*      February 1979
*
* 1.1  8/4/01  gsm
*      power_spectrum() now returns peak frequency
*
************************************************************************
*
* Williams's FFT algorithm is written in ROMable code for the HC11. It
* uses a sine look-up table for speed and stores its dynamic variables
* in the machine stack. The user passes the address of a 512-byte array
* of 8-bit signed numbers, with the real components in the first 256
* bytes and the imaginary components in the second. power_spectrum()
* zeroes out the imaginary components.
*
* The fft() and inverse_fft() return value is a normalization exponent
* equal to the number of times the data was divided by two to keep it
* within 8-bit range during the transform. The power_spectrum() return
* value is the index of the array component with the maximum absolute
* value. This is easily related to the peak frequency. For data at
* timesteps of


*
* 0, dt, ..., 127 dt, 128 dt, 129 dt, ..., 255 dt
*
* the values represent the power at periods of
*
* DC, 256 dt, ..., 256/127 dt, 2 dt, -256/127 dt, ..., -256 dt
*
* For real-valued data, the transform components at positive and nega-
* tive periods are complex conjugates, so the power depends only on
* the absolute value of the period. Accordingly, the return value is
* the absolute value of the array index. Array component 128, which
* corresponds to the Nyquist frequency, is a special case.
*
* Thus a power_spectrum() return value of 0 indicates that the DC
* component dominates, a return value of 1 indicates that a frequency
* of 1/256th of the sampling frequency dominates, and a return value
* of 128 indicates that the Nyquist frequency dominates.
*
* To ensure that the determination of the peak is as accurate as possi-
* ble, the user may want to apply a window function to the data before
* passing it to power_spectrum(). I haven't implemented windowing here,
* because, in fixed-point math, it would destroy precision and create
* spurious harmonics. The calling routine can minimize the loss of
* precision by scaling the data to fill the full 8-bit range -- an
* operation that is easier to do in Interactive C.
*
* According to Williams's benchmarking, the computation takes 350
* milliseconds -- pretty fast, but remember that it hogs the processor,
* since IC will not preempt a machine-language routine. I have not done
* a great deal of testing, apart from some simple cases.
*
* Clever packing and unpacking of the data array could double the power-
* spectrum computation, at the price of added complexity.
************************************************************************
*
* offsets for local variables in FFT routine
*
REAL    EQU  0
TOP     EQU  2     ; gsm-added
SIGN    EQU  4     ; gsm-added
CELNM   EQU  5
CELCT   EQU  6
PAIRNM  EQU  7
CELDIS  EQU  8
DELTA   EQU  9
SCLFCT  EQU  $0A
COSA    EQU  $0B
SINA    EQU  $0C
SINPT   EQU  $0D
REAL1   EQU  $0F
REAL2   EQU  $11
TREAL   EQU  $13
TIMAG   EQU  $14
TMP     EQU  $15
TMP2    EQU  $16

MAIN_START  EQU  $A000
DATA_TABLE  EQU  $B000
STACK_START EQU  $BFFF

        ORG  DATA_TABLE
        DC.B 00,50,00,50,00,50,00,50,00,50,00,50,00,50,00,50
        DC.B 00,50,00,50,00,50,00,50,00,50,00,50,00,50,00,50
        DC.B 00,50,00,50,00,50,00,50,00,50,00,50,00,50,00,50
        DC.B 00,50,00,50,00,50,00,50,00,50,00,50,00,50,00,50
        DC.B 00,50,00,50,00,50,00,50,00,50,00,50,00,50,00,50
        DC.B 00,50,00,50,00,50,00,50,00,50,00,50,00,50,00,50
        DC.B 00,50,00,50,00,50,00,50,00,50,00,50,00,50,00,50
        DC.B 00,50,00,50,00,50,00,50,00,50,00,50,00,50,00,50
        DC.B 00,50,00,50,00,50,00,50,00,50,00,50,00,50,00,50
        DC.B 00,50,00,50,00,50,00,50,00,50,00,50,00,50,00,50
        DC.B 00,50,00,50,00,50,00,50,00,50,00,50,00,50,00,50
        DC.B 00,50,00,50,00,50,00,50,00,50,00,50,00,50,00,50


        DC.B 00,50,00,50,00,50,00,50,00,50,00,50,00,50,00,50
        DC.B 00,50,00,50,00,50,00,50,00,50,00,50,00,50,00,50
        DC.B 00,50,00,50,00,50,00,50,00,50,00,50,00,50,00,50
        DC.B 00,50,00,50,00,50,00,50,00,50,00,50,00,50,00,50
OUTPUT  DC.B 00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00
        DC.B 00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00
        DC.B 00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00
        DC.B 00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00
        DC.B 00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00
        DC.B 00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00
        DC.B 00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00
        DC.B 00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00
        DC.B 00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00
        DC.B 00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00
        DC.B 00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00
        DC.B 00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00
        DC.B 00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00
        DC.B 00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00
        DC.B 00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00
        DC.B 00,00,00,00,00,00,00,00,00,00,00,00,00,00,00,00

*BENCHMARK
PORTC   EQU  $1003
DDRC    EQU  $1007

        ORG  MAIN_START
        LDS  #STACK_START
        LDAA #$F0
        STAA DDRC
        LDAA #$00
        STAA PORTC
        LDAA #$80      ;CLEAR 7
        STAA PORTC
        LDAA #$C0
        STAA PORTC
******** Start of Benchmark code
        LDD  #DATA_TABLE
        LDY  #0
        JSR  subroutine_fft
******** End of Benchmark code
        LDAA #$80      ; 7
        STAA PORTC     ; 12
        SWI

************************************************************************
*
* power_spectrum
*
* I split the original routine into two pieces: the actual FFT and the
* power-spectrum calculation. That way, we can call the FFT directly if
* we have complex-valued data (as we might if using the FFT for filter-
* ing or waveform synthesis).
*
* The argument -- the address of the real-valued data array -- is passed
* in the accumulator.
*
subroutine_power_spectrum
* clear imaginary part of data array
*
* slight optimization saves 1019 cycles (gsm):


* + use STAA instead of CLR, saving 2 cycles per iteration
* + use X rather than Y, saving 2 cycles per iteration
        ADDD #$100     ; get address of imaginary part [4]
        XGDX           ; load data address into X [3]
        CLRA           ; A register is zero [2]
        CLRB           ; B register holds count (0=256) [2]
zero    STAA $FF,X     ; [4]
        DEX            ; [3]
        DECB           ; [2]
        BNE  zero      ; [3]
* do the FFT
        PSHX           ; save the data address
        XGDX           ; put data address back in accumulator
        BSR  subroutine_fft
        PULX           ; grab the data address
* calculate sum of absolute values
*
* formally, we should be calculating the sum of squares, but this is
* close enough for government work
*
* also, remember where the maximum value is, at a cost of 7437 cycles;
* the B register is the maximum value, and the location of that value
* is stored temporarily on the stack (gsm-added 8/4/01)
*
* slight optimization saves 2046 cycles (gsm):
* + use X rather than Y, saving 3 cycles per iteration
* + use Y for the byte counter, saving 5 cycles per iteration
smsq    LDY  #0        ; clear byte counter
        CLRB           ; initialize maximum value [2]
        PSHY           ; initialize index [5]
sum     LDAA 0,X       ; get real component
        BPL  sm1       ; force positive
        NEGA
        BVC  sm1       ; watch for $80
        CLRA           ; which is really 0
sm1     STAA 0,X       ; store real component [4]
        INX            ; get imaginary component
        LDAA $FF,X
        DEX
        BPL  sm2       ; force positive
        NEGA
        BVC  sm2       ; watch for $80 again
        CLRA
sm2     ADDA 0,X       ; compute sum [4]
        STAA 0,X       ; save back in real part of array
        CBA            ; is this the new maximum? [2]
        BLS  sm3       ; no [3]
        TAB            ; yes [2]
        INS            ; ...so pop old index [3]
        INS
        PSHY           ; ...and push new one [5]
sm3     INX            ; inc X for next round
        INY            ; done when byte counter reaches 256
        CPY  #$100     ; [5]
        BNE  sum
* done
        PULX           ; move index to accumulator
        XGDX
        CMPB #$80      ; take absolute value of LSB
        BLS  smex
        NEGB
smex    RTS

************************************************************************
*
* fft
*
* Here's the main body of the FFT. The argument -- the address of the
* complex-valued data array -- is passed in the accumulator. The
* direction of the transform (0 for forward, 1 for reverse) is passed
* in Y.


*
* fixed local-memory management, which didn't allow room on the stack
* for nested calls or interrupt service (gsm)
*
* public entry points, which set direction of transform
subroutine_fft
        LDY  #0
        BRA  fftc11
subroutine_inverse_fft
        LDY  #1
* initialize local variables
fftc11  TSX            ; top of stack for frame pointer
        XGDX           ; to be placed in X
        SUBD #$17      ; subtract offset to make room
        XGDX           ; X now has frame pointer
        TXS            ; relocate stack underneath frame (gsm-added)
        STD  REAL,X    ; save data address (passed in accumulator)
        ADDD #$1FF     ; needed in scale subroutine (gsm-added)
        STD  TOP,X
        XGDY           ; store direction of transform
        STAB SIGN,X
        CLR  SCLFCT,X  ; zero out the scale factor
* do bit sorting
*
* added swapping of the imaginary part, too, in case the data is complex-
* valued (gsm)
*
* slight optimization saves ~29000 cycles (gsm):
* + expand rev1 loop and use A instead of TMP, saving 72 cycles
* + store B in TMP2 rather than on the stack, saving 2 cycles
* + use both index pointers during byte swap, saving 40 cycles
        LDAB #$FE      ; setup start for bit reversal
revbit  STAB TMP2,X    ; save copy of address [4]
        RORB           ; rotate B right - bit 1 to carry
        ROLA           ; rotate left - carry bit in
        RORB           ; rotate B right - bit 2 to carry
        ROLA           ; rotate left - carry bit in
        RORB           ; rotate B right - bit 3 to carry
        ROLA           ; rotate left - carry bit in
        RORB           ; rotate B right - bit 4 to carry
        ROLA           ; rotate left - carry bit in
        RORB           ; rotate B right - bit 5 to carry
        ROLA           ; rotate left - carry bit in
        RORB           ; rotate B right - bit 6 to carry
        ROLA           ; rotate left - carry bit in
        RORB           ; rotate B right - bit 7 to carry
        ROLA           ; rotate left - carry bit in
        RORB           ; rotate B right - bit 8 to carry
        ROLA           ; rotate left - carry bit in
        LDAB TMP2,X    ; retrieve unbitreversed address [4]
        CBA            ; make sure we only swap once [2]
        BHS  noswap
swap    PSHX           ; save frame pointer
        LDY  REAL,X    ; load base address
        LDX  REAL,X
        ABY            ; add unbitreversed address
        TAB            ; add bitreversed address
        ABX
        LDAA 0,Y       ; swap real components
        LDAB 0,X
        STAA 0,X
        STAB 0,Y
        INX            ; prepare for 256-byte offset
        INY
        LDAA $FF,Y     ; swap imaginary components
        LDAB $FF,X


        STAA $FF,X
        STAB $FF,Y
        PULX           ; restore frame pointer
        LDAB TMP2,X    ; get current address back
noswap  DECB           ; decrement address
        BNE  revbit    ; do next if not done
* special case of first pass of FFT
*
* added imaginary part, for which the Danielson-Lanczos formula is the
* same as for the positive component (gsm)
*
* slight optimization saves 1658 cycles (gsm):
* + use X instead of Y, netting 5x2 cycles per iteration
* + use Y instead of TMP, netting 3 cycles per iteration
        JSR  scale
        PSHX           ; save frame pointer [4]
        LDX  REAL,X    ; set up data pointer [5]
        LDY  #128      ; get number of cells [4]
fpss    LDAA 0,X       ; get rm [4]
        LDAB 1,X       ; get rn [4]
        PSHA           ; make copy
        ABA            ; rm'=rm+rn
        STAA 0,X       ; save back in data array [4]
        PULA           ; get rm again
        SBA            ; rn'=rm-rn
        STAA 1,X       ; put away [4]
        INX            ; point to next pair [3]
        INX            ; [3]
        LDAA $FE,X     ; get im [4]
        LDAB $FF,X     ; get in [4]
        PSHA           ; make copy
        ABA            ; im'=im+in
        STAA $FE,X     ; save back in data array [4]
        PULA           ; get im again
        SBA            ; in'=im-in
        STAA $FF,X     ; put away [4]
        DEY            ; decrement # cells [4]
        BNE  fpss      ; go back if not done
        PULX           ; restore frame pointer [5]
* now, the FFT proper for passes 2 through N
*
* for inverse transforms, it should be enough just to negate the sine value
four    LDAA #64       ; # of cells is now 64
        STAA CELNM,X   ; store
        STAA DELTA,X   ; so is delta
        LDAA #02       ; number of pairs is 2
        STAA PAIRNM,X
        STAA CELDIS,X  ; so is distance between
npass   JSR  scale     ; check for over-range
        LDAA CELNM,X   ; get current cell #
        STAA CELCT,X   ; store at cell counter
        LDY  REAL,X
        STY  REAL1,X   ; get copy of data
ncell   LDY  #sintab   ; get address of sines
        STY  SINPT,X   ; save copy
        LDAA PAIRNM,X  ; get current pairnm
np1     PSHA           ; save pair counter
        LDAB 64,Y      ; get sine
        LDAA SIGN,X    ; negate if inverse transform (gsm-added)
        BEQ  ncos
        NEGB
ncos    LDAA 0,Y       ; get cosine
        STAA COSA,X    ; save copy
        STAB SINA,X    ; ditto
        LDY  REAL1,X   ; point to top of data
        LDAB CELDIS,X  ; get current offset
        ABY            ; add to y for current
        STY  REAL2,X   ; copy it
        LDAA 0,Y       ; get data point rn
        PSHA           ; copy it
        LDAB COSA,X    ; get cosine
        JSR  smul      ; rn*cos(a)
        STAA TREAL,X


        PULA           ; get copy of rn
        LDAB SINA,X    ; get sin(a)
        JSR  smul      ; rn*sin(a)
        STAA TIMAG,X   ; store imaginary tmp
        INY
        LDAA $FF,Y     ; get imaginary data
        PSHA           ; save it
        LDAB SINA,X    ; get sin(a)
        JSR  smul      ; in*sin(a)
        ADDA TREAL,X
        STAA TREAL,X   ; tr=rn*cos+in*sin
        PULA           ; get data back
        LDAB COSA,X    ; get cosine
        JSR  smul      ; in*cos(a)
        SUBA TIMAG,X   ; ti=in*cos-rn*sin
        STAA TIMAG,X
        LDY  REAL1,X
        LDAA 0,Y       ; get rm
        TAB            ; save a copy
        ADDA TREAL,X   ; rm'=rm+tr
        STAA 0,Y       ; store new rm
        SUBB TREAL,X   ; rn'=rm-tr
        LDY  REAL2,X
        STAB 0,Y       ; store new rn
        LDY  REAL1,X
        INY
        STY  REAL1,X   ; save real1 for nxt
        LDAA $FF,Y     ; get im
        TAB            ; save copy
        ADDA TIMAG,X   ; im'=im+ti
        STAA $FF,Y     ; put back in array
        LDY  REAL2,X
        INY
        SUBB TIMAG,X   ; in'=im-ti
        STAB $FF,Y     ; put back in array
        LDY  SINPT,X
        LDAB DELTA,X   ; increment sine pntr
        ABY
        STY  SINPT,X   ; save away
        PULA
        DECA           ; dec pair counter
        BEQ  ar1       ; gsm-added
        JMP  np1       ; gsm-added
ar1     LDY  REAL1,X
        LDAB CELDIS,X
        ABY
        STY  REAL1,X
        DEC  CELCT,X
        BEQ  ar3
        JMP  ncell
ar3     LSR  CELNM,X   ; half cells
        BEQ  done      ; done when all cells
        ASL  PAIRNM,X  ; double pairs
        ASL  CELDIS,X  ; twice as far apart
        LSR  DELTA,X   ; delta is half
        JMP  npass     ; one more time!
* return to calling program
done    LDAB SCLFCT,X  ; scale factor is return value (in accumulator)
        CLRA           ; zero MSB for IC's benefit (gsm-added)
        XGDX           ; pop off all the local variables (gsm-added)
        ADDD #$17
        XGDX
        TXS
        RTS

************************************************************************
*
* subroutine for rescaling out-of-range data
*
* fixed array overflow, incorrect branches; moved top-of-data calculation
* to main FFT routine (gsm)
*
* made out-of-range detection a subroutine, at a cost of 13 cycles (gsm)
*
* slight optimization saves 3584 cycles in each loop (gsm):
* + reverse roles of X and Y, saving 3 cycles per iteration
* + replaced ADDA #$80, LSRA, SUBA #$40 sequence with ASRA, saving 4
*   cycles per iteration
*


scale   XGDX           ; move frame pointer to accumulator [3]
        XGDY           ; thence to Y [4]
* first, check whether any value lies outside the range (-64,64]
scdow   LDAA #$C0      ; -64
        LDAB #$40      ; +64
        BSR  range     ; sets carry if out of range
        BCC  sexit     ; if not, check whether we need to scale up
* divide the whole array by 2
scl     INC  SCLFCT,Y  ; keep track of scaling [7]
        LDX  TOP,Y     ; reset pointer [6]
scl1    LDAA 0,X       ; get data [4]
        ASRA           ; divide by two [2]
        STAA 0,X       ; store away [4]
        DEX            ; bump pointer [3]
        CPX  REAL,Y    ; done when both [7]
        BHS  scl1      ; imag and real done (gsm: was BNE)
sexit   XGDY           ; put frame pointer [4]
        XGDX           ; back into X [3]
        RTS

#if 1==0
* check whether any value lies outside the range (-32,32]
*
* this code, as it stands, doesn't do any good -- multiplying everything
* by 2 won't increase the precision of subsequent calculations; but it
* may provide inspiration for a cleverer rescaling algorithm someday
*
scup    LDAA #$E0      ; -32
        LDAB #$20      ; +32
        BSR  range     ; sets carry if out of range
        BCS  sexit     ; if set, all is well
* multiply the whole array by 2
sc2     DEC  SCLFCT,Y  ; keep track of scaling [7]
        LDX  TOP,Y     ; reset pointer [6]
sc21    LDAA 0,X       ; get data [4]
        LSLA           ; multiply by two [2]
        STAA 0,X       ; store away [4]
        DEX            ; bump pointer [3]
        CPX  REAL,Y    ; done when both [7]
        BHS  sc21      ; imag and real done
        BRA  scup      ; keep scaling up until data is in midrange
#endif

************************************************************************
*
* subroutine for checking whether any data is out of range: if any value
* is outside the range (A,B], set the carry bit
*
* extracted from scale subroutine (gsm)
*
* slight optimization saves 1536 cycles (gsm):
* + reverse roles of X and Y, saving 3 cycles per iteration
*
range   LDX  TOP,Y     ; start at top of data [6]
rtop    CMPA 0,X       ; check for minimum [4]
        BLO  rnxt      ; if less negative than A, don't fix
        CMPB 0,X       ; check for maximum [4]
        BLO  rexit     ; if > B, go fix it
rnxt    DEX            ; bump pointer [3]
        CPX  REAL,Y    ; done when both [7]
        BHS  rtop      ; imag and real done (gsm: was BNE)
        CLC
rexit   RTS

************************************************************************
*
* subroutine for signed 8-bit multiplication
*
* A <- A*B
* B represents number from -1 to 1, so only hi byte counts.
*


smul    STAA TMP,X     ; copy multiplier
        STAB TMP2,X    ; ditto multiplicand
* get absolute values
        TSTA           ; check sign of multiplier
        BPL  sk1       ; skip negation
        NEGA
        BVS  sko       ; check for $80
        BEQ  sko       ; check for zero
sk1     TSTB           ; check multiplier sign
        BPL  sk2
        NEGB
        BVS  sko       ; check for $80
        BEQ  sko
* do the multiplication
sk2     MUL            ; do multiplication
        ADCA #0        ; 8 bit conversion
        ASLA           ; and correct for sine (so that 127=1.0)
        LDAB TMP2,X    ; get original multiplicand
        EORB TMP,X     ; check sign of result
        BPL  out
        NEGA           ; result is negative
out     RTS
sko     CLRA           ; return zero to main
        RTS

************************************************************************
*
* now for the sine lookup table
*
sintab  FCB  127, 127, 127, 127, 126, 126, 126, 125, 125, 124
        FCB  123, 122, 122, 121, 120, 118, 117, 116, 115, 113
        FCB  112, 111, 109, 107, 106, 104, 102, 100,  98,  96
        FCB   94,  92,  90,  88,  85,  83,  81,  78,  76,  73
        FCB   71,  68,  65,  63,  60,  57,  54,  51,  49,  46
        FCB   43,  40,  37,  34,  31,  28,  25,  22,  19,  16
        FCB   12,   9,   6,   3,   0,  -3,  -6,  -9, -12, -16
        FCB  -19, -22, -25, -28, -31, -34, -37, -40, -43, -46
        FCB  -49, -51, -54, -57, -60, -63, -65, -68, -71, -73
        FCB  -76, -78, -81, -83, -85, -88, -90, -92, -94, -96
        FCB  -98,-100,-102,-104,-106,-107,-109,-111,-112,-113
        FCB -115,-116,-117,-118,-120,-121,-122,-122,-123,-124
        FCB -125,-125,-126,-126,-126,-127,-127,-127,-127,-127
        FCB -127,-127,-126,-126,-126,-125,-125,-124,-123,-122
        FCB -122,-121,-120,-118,-117,-116,-115,-113,-112,-111
        FCB -109,-107,-106,-104,-102,-100, -98, -96, -94, -92
        FCB  -90, -88, -85, -83, -81, -78, -76, -73, -71, -68
        FCB  -65, -63, -60, -57, -54, -51, -49, -46, -43, -40
        FCB  -37, -34, -31, -28, -25, -22, -19, -16, -12,  -9
        FCB   -6,  -3,   0,   3,   6,   9,  12,  16,  19,  22
        FCB   25,  28,  31,  34,  37,  40,  43,  46,  49,  51
        FCB   54,  57,  60,  63,  65,  68,  71,  73,  76,  78
        FCB   81,  83,  85,  88,  90,  92,  94,  96,  98, 100
        FCB  102, 104, 106, 107, 109, 111, 112, 113, 115, 116
        FCB  117, 118, 120, 121, 122, 122, 123, 124, 125, 125
        FCB  126, 126, 126, 127, 127, 127
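As an editorial cross-check of the smul routine above (not part of the original benchmark program), its arithmetic can be modeled in Python: take the sign from the XOR of the operands, form the unsigned 8x8 product (MUL), round the high byte with the carry (ADCA #0), and double it so that 127 represents 1.0 (ASLA); operands of 0 or -128 take the sko path and return zero.

```python
def smul(a, b):
    """Python model of the listing's smul: signed 8-bit multiply where
    +/-127 represents +/-1.0, mirroring the HC11 instruction sequence."""
    # sko path: zero operands, and $80 (-128) whose NEG overflows, yield 0
    if a == 0 or b == 0 or a == -128 or b == -128:
        return 0
    sign = -1 if (a < 0) != (b < 0) else 1    # EORB TMP,X sign check
    prod = abs(a) * abs(b)                    # MUL: unsigned 8x8 -> 16 bits
    hi = (prod >> 8) + ((prod >> 7) & 1)      # ADCA #0: round via carry bit
    return sign * (hi << 1)                   # ASLA: rescale so 127 ~ 1.0
```

For example, smul(-100, 64) models multiplying -100 by a sine value of 64/128 = 0.5 and returns -50, matching what the assembly leaves in accumulator A.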