
DEGREE PROJECT IN ELECTRICAL ENGINEERING, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2018

Hardware accelerator for SOM based DNA sequencing Algorithm

PRASHANT SHARMA

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Author: Prashant Sharma

Supervisor: Ahmed Hemani


Abstract

The prevalent experience-based diagnosis of health problems is often incorrect. Aspects of this problem include microorganisms' adaptation to antibiotics and the varying effectiveness of generic medicines on each individual.

DNA-sequencing-based diagnosis is evolving to deal with this problem. The algorithmic part of these techniques is difficult to speed up and therefore has a high latency. As a solution to this problem, machine-learning-based methods such as BioNN use self-organizing maps (SOMs), which do not need an explicit assembly process and can categorize bacteria with smaller sample data.

A memory- and computation-intensive process such as BioNN is undesirable to implement on CPUs. Generic architectures, such as GPUs, are designed to handle a varying range of needs and thus may not be the most power- and performance-efficient. Furthermore, cloud-based solutions will provide even worse results. Therefore, customized hardware has to be designed. However, designing and verifying an architecture from scratch with the ASIC methodology requires considerable engineering effort.

This project plans to use a coarse-grain reconfigurable architecture (CGRA) platform known as SiLago. The SiLago methodology aims to reduce the design and verification effort of a custom design relative to the ASIC methodology, without compromising much on the design trade-offs. The algorithm has been implemented on two coarse-grain reconfigurable fabrics, providing a kick start to the ambitious project. The parametric SiLago implementation of BioSOMs was trained for two E. coli strains of bacteria with 40K training vectors. The results of the SiLago implementation were benchmarked against a GPU (GTX 1070) implementation in the CUDA framework. The comparison reveals a 4x to 140x speedup and an improvement of 4 to 5 orders of magnitude in energy-delay product compared to the GPU implementation.


Acknowledgements

I would first like to thank my thesis examiner, Professor Ahmed Hemani of the Electronic Systems department at KTH, for showing confidence in me. The door to his office was always open whenever I needed his guidance.

I would like to acknowledge Dr. Asad Jafri and Yu Yang at KTH for their precious contribution and guidance to the development of my thesis work. I also thank Dimitirios Stathis, as this thesis was enriched significantly through helpful discussions and support, and Professor Manfred Grabherr, who contributed towards developing the algorithm and shared his experience to optimize the work.

I would like to thank Professor Per Larsson Edefors at Chalmers University for his valuable guidance and for helping me in the thesis process. Special thanks to Professor Lars Svensson, who has supported me throughout my master's studies by keeping me motivated from the very beginning.

Finally, I must express my very profound gratitude to my grandparents, parents and my sister for providing me with unfailing support and continuous encouragement throughout my years of study. This accomplishment would not have been possible without them. Thank you.


Contents

Acknowledgements

List of Figures

1 Introduction
  1.1 Background
  1.2 Problem
  1.3 Purpose
  1.4 Work

2 Theory
  2.1 DNA sequencing
  2.2 Neural Networks
    2.2.1 Self-Organizing Maps
  2.3 Algorithm
  2.4 The SiLago Platform
    2.4.1 Motivation
    2.4.2 The platform
    2.4.3 Vesyla

3 Implementation
  3.1 Algorithmic resource estimation
    3.1.1 Code flow
  3.2 Optimization of Algorithm
  3.3 Parallelization of Algorithm
  3.4 Memory and data management
    3.4.1 DRAM
    3.4.2 Register File
    3.4.3 SRAM
  3.5 Architecture
    3.5.1 Memory
    3.5.2 DPU
    3.5.3 Vesyla
    3.5.4 Scheduling

4 Results
  4.1 Verification
    4.1.1 Functional verification
    4.1.2 Scheduling report
    4.1.3 Silago results and reference model
  4.2 Logic synthesis
    4.2.1 Timing report
    4.2.2 Power report
    4.2.3 Area report

5 Comparative analysis
    5.0.1 Survey of TPU architecture
    5.0.2 An ASIC based architecture
    5.0.3 BioSOM on GPU

6 Conclusion

7 Future work

A Appendix A
  A.1 Appendix A
  A.2 Appendix B
  A.3 Appendix C
  A.4 Appendix D
  A.5 Appendix E
  A.6 Appendix F
  A.7 Appendix G
  A.8 Appendix H

Bibliography


List of Figures

1.1 Amdahl's law: B is the part of an algorithm which is non-scalable.

2.1 A typical process of DNA sequencing and analysis [5]
2.2 A neuron and its parts [15]
2.3 Simplified model of a neuron [4]
2.4 A two dimension mapped SOM
2.5 Working of Novel SOM algorithm
2.6 Reduced abstraction gap in SiLago methodology
2.7 Composition of SiLago fabric [6]
2.8 Different inputs to the options block of line pragma
2.9 List of DPU modes in detail

3.1 Selection of appropriate distance; path in red is shortest, hence appropriate, distance but green is not.
3.2 Curve fitting; blue represents plot of $2^{-x}$, red represents $e^{-x}$ and green represents $2^{-1.5x}$
3.3 Block representation showing memory architecture in the SiLago fabric; red is the DRAM, orange is an individual SRAM block, green is the DiMarch which is composed of SRAM blocks, blue is RFile which together with DPUs (black) makes a SiLago block, yellow is DRRA composed of SiLago blocks.
3.4 A simple block representation showing DPU.

5.1 Architectural floormap of TPU
5.2 Working of TPR with accumulation of result to enable variable precision
5.3 Massively parallel SOM architecture by prof. M. Porrmann et al.


Chapter 1

Introduction

1.1 Background

The aim of this thesis work is to contribute towards a cross-disciplinary attempt to accelerate DNA sequencing efficiently. The prevalent experience-based diagnosis of health problems is often inaccurate [21]. Multiple factors can hinder the identification of the exact problem [7]. For example:

• Lack of understanding on the part of the patient or the medical practitioner.

• Inability of both parties to communicate.

• Common symptoms, complexity and rarity of the disease.

• Multiple bacteria can be present in the body.

• The response to a medicine is specific to the individual patient, whereas medicines are generic.

• Antibiotic adaptation and evolution of microorganisms.

Due to the aforementioned problems, DNA-sequencing-based diagnosis is evolving [11]. Sequencing is a technique to extract information from lengthy DNA codes [13]. In this process the DNA samples are examined to check for the presence of bacteria and for individual characteristics of the patient.

Several traditional algorithms are used to model sequencing. For example:

• Dynamic programming.

• Hidden Markov Models.

• Graph theory formulations.


The total time taken to sequence the data, including the algorithm and the complete setup, ranges from a few days (currently) to a few months (when sequencing first started). It depends upon the goal, the technology [18] and the cost [14]. Clearly, the process takes a considerable amount of time to complete before its advantages can be exploited. Continuous advancements have been made in this area towards smarter, more time- and cost-effective solutions [8]. The common problem in the algorithmic part is that these algorithms saturate when scaled, due to their unscalable serial part. This well-known property of an algorithm is described by Amdahl's law [1].

FIGURE 1.1: Amdahl's law: B is the part of an algorithm which is non-scalable.
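The saturation described by Amdahl's law can be illustrated with a short sketch. It uses the standard textbook form of the law, speedup = 1 / (B + (1 - B) / n), where B is the non-scalable fraction from Figure 1.1 and n is the number of parallel units. The function name and the value B = 0.1 are hypothetical and chosen only for illustration; they are not taken from the thesis.

```python
def amdahl_speedup(serial_fraction: float, n_units: int) -> float:
    """Speedup predicted by Amdahl's law for n_units parallel units.

    serial_fraction is B, the share of the run time that cannot be
    parallelized; the remaining (1 - B) is assumed to scale perfectly.
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must lie in [0, 1]")
    if n_units < 1:
        raise ValueError("n_units must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_units)


if __name__ == "__main__":
    B = 0.1  # hypothetical: 10 % of the run time is serial
    for n in (1, 2, 8, 64, 1024):
        print(f"n = {n:5d}  speedup = {amdahl_speedup(B, n):6.2f}")
    # The speedup approaches, but never reaches, 1 / B = 10.
```

However many units are added, the speedup stays below 1 / B, which is why the serial part of a sequencing algorithm limits the benefit of scaling.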

With the continuous efforts of society, smart DNA sequencing methods are accelerating and advancing [20]. This thesis work is a contribution to a multi-university project that aims to develop a complete product for the welfare of society.

1.2 Problem

As discussed in the previous section, with continuous effort towards effective DNA sequencing, several smart algorithms have been proposed to analyze the data in minimum time [17]. These algorithms are complex and difficult to parallelize because of the nature of the sequencing application. The problems with DNA sequencing are:

• It is a lengthy setup and analysis process.

• DNA sequencing files are themselves huge to read.

• Each DNA sequence is uncorrelated with that of other microorganisms, which makes it difficult to find analogies.


As a solution to this problem, neural network algorithms could be a good fit [20]. An artificial neural network (ANN) is, by nature, a huge parallel distributed network. It is made up of simple processing units, which execute and use the data fed to them to define the network [9]. These properties make it potentially fast for the computation of certain tasks. The self-organizing map (SOM), a type of unsupervised learning, is a competitive learning model that reduces data dimensions and helps to cluster similar data together [9]. However, since a DNA sequence is characteristic of an individual and training the network is an iterative process, it is difficult to fit a typical SOM model to the DNA sequencing application. The novel SOM-based algorithm not only adapts to the application but also eliminates the need to read the whole characteristic sequence. This report discusses the terminology and theory from the basics before describing the implementation in detail.

1.3 Purpose

The proposed BioSOM employs a novel SOM-based algorithm to address the aforementioned problem. Conventional algorithms try to re-assemble and match the whole DNA sequence, which consumes a lot of computation. In BioSOM, by contrast, the total execution time is reduced by reducing the length of the DNA sequence that is read and the effort to assemble it, without compromising on quality. This is true for both the training and the testing phase. Thus, the execution time is cut by a factor of, say, 10, depending upon the purpose of the sequencing.

To execute such an algorithm efficiently, the traditional CPU implementation might not be the best approach. For this reason, giant players such as Intel and Qualcomm are pushing towards neural-network-specific architectures. A GPU implementation, on the other hand, is a good option: the latest GPUs are quite powerful for highly parallel applications. However, having evolved as general-purpose hardware, a GPU may not be the fastest choice. GPUs are still too generic (and mostly aimed at DNN applications) in terms of execution, architecture, memory and so on for application-specific algorithms. Moreover, the deep neural networks (DNNs) that suit GPUs are not a suitable or natural way to describe problems in the DNA sequencing domain. GPUs specific to artificial neural networks (ANNs) are a trending push in the market. This impacts the area-time-energy budget. A novel SOM-based algorithm for DNA sequencing, such as this one, might not be supported by any specific GPU product line-up in the market. This argument is quite reasonable, as it is currently not economically feasible to alter GPUs for each application unless it is a widely adopted application such as self-driving.

In addition, a cloud-based solution is undesirable, since it suffers from the additional latency of uploading the huge data set to the cloud. A couple of other options are available. FPGA and soft-core implementations may not be the fastest and most efficient ones compared to GPUs, because they lack support for the streaming data-parallel applications that this kind of algorithm typically needs [2]. This is especially true for energy- and area-constrained hand-held devices. Therefore, for such a hardware-constrained, memory- and computation-intensive process, custom hardware support is desired. Developing it through the custom ASIC methodology requires considerable design, verification and engineering effort. Furthermore, as the complexity of algorithms increases, the abstraction levels stack up. Thus, while developing hardware through the ASIC methodology, it becomes increasingly difficult to predict the number of solutions and, thus, the cost metrics. Clearly, this adds to the engineering effort, especially if the hardware needs to be upgraded. Thus, a smart hardware development platform must be analyzed. A detailed view of the ASIC development aspects can be found in Section 2.4.

As a hardware platform, SiLago is a good fit for such an application. SiLago is a Coarse Grain Reconfigurable Architecture (CGRA) platform. It is similar to the custom ASIC flow but is disciplined and raises the level of standardization beyond that of the standard-cell era. The fabric layout is developed and fixed by a one-time engineering effort, while the individual blocks, with disciplined input and output pins, are customized according to the problem domain. This solution has already been proven for applications such as DSP. The thesis report discusses SiLago, its advantages and its sequencing-specific development in later sections. A detailed view of the SiLago platform can be found in Section 2.4.

The purpose of the thesis work is to:

1. provide customized hardware support to the cross-disciplinary and multi-university research project, for the common overall aim of reducing the time required for DNA sequencing. This in turn helps to automate and reduce the overall diagnosis time.

2. upgrade and demonstrate the capability of the SiLago platform to handle parametric, parallelizable, SOM-specific neural network algorithms.


1.4 Work

The thesis work aims to provide hardware support for the product under development. Achieving this objective requires cross-disciplinary effort. The focus areas are the algorithmic architecture, the memory plan, the hardware architecture, inputs to the compiler/scheduler under development, and verification of the individual parts and of the whole.

The focus areas are described in detail as follows:

• The algorithm is a set of C++ sources (around 4 main and 8 other files), but the SiLago platform works with a MATLAB interface. The code therefore has to be understood, reverse engineered and translated from C++ to MATLAB.

• The MATLAB code should contain only the commands needed to achieve the objectives, not the whole algorithm (because we are designing custom hardware, not the algorithm).

• The algorithm has to be optimized for the hardware platform, because the C++ code was written for a single processing unit (CPU).

• The algorithm is serial and has to be parallelized parametrically. Parametric parallelization is important to let the user select the number of computational units.

• Algorithmic optimization with respect to hardware and memory, which improves product performance.

• Memory plan and access minimization, but not its development, as the contribution of KTH is focused on hardware development.

• A detailed architecture plan of how the algorithm will execute and how the hardware will behave.

• Specification of the hardware requirements by reverse engineering the algorithm. If functionality is missing in the platform, it is added and taken all the way to logic synthesis.

• The compiler-cum-scheduler, which translates the code from MATLAB to low level, is not fully developed, so this work helps to develop it.

• Verification of the SiLago-generated results against the golden reference model generated in MATLAB.

In chapter two, the report briefly describes the DNA sequencing process and relevant information. It then explains the basics of artificial neural networks, followed by a brief introduction to self-organizing maps, a specific type of ANN. Once the basics and the problem statement are clear, it describes the novel algorithm. For the custom hardware design of the algorithm, it explores the CGRA-based SiLago platform and compares it in theory with other popular hardware options.

Chapter three then describes the implementation part of the thesis. Since the design considerations were manifold and at different abstraction levels, the chapter is divided into sections, each of which covers a different aspect of designing the hardware.

After the implementation part, a dedicated chapter lists the results, which verify the custom hardware design at different abstraction levels.

Finally, one chapter each on comparative analysis, conclusion and future work concludes the report. In addition, the appendices list relevant information that may be interesting to read for a complete understanding of the topic.


Chapter 2

Theory

2.1 DNA sequencing

DNA sequencing is a technique used to determine the nucleotide sequence of DNA (deoxyribonucleic acid). The nucleotide sequence is the most fundamental level of knowledge of a gene or genome. It is the blueprint that contains the instructions for building an organism, and no understanding of genetic function or evolution could be complete without obtaining this information. DNA consists of four nucleotide monomers: adenine (A), guanine (G), cytosine (C) and thymine (T). The specific order in which these nucleotides are arranged essentially encodes all biological phenomena [9].

DNA sequencing has a very strong impact in various biological fields, including human genetics, plants and agriculture, bioinformatics, studies of microbial species, cancer and viruses, and many others. Progress in DNA sequencing always leads to revolutionary advances in the diagnosis and treatment of various diseases. Therefore the ability to measure or infer such sequences is imperative to biological research [11].

Over the years, researchers have addressed the problem of how to sequence DNA, and the characteristics of their solutions define each generation of sequencing methodologies. There are three generations of DNA sequencing, each defined by a considerable technology benchmark.

The large size, genetic variation and complexity of biological genomes require efficient sequencing methods in order to characterize them properly. For example, the human genome consists of approximately three billion bases. Sequencing such a large number of bases is time-consuming and quite expensive using current methods. The increasing demand for faster and cheaper genome sequencing has resulted in the development of advanced sequencing technologies. However, the existing methods of DNA sequencing are still not well optimized in terms of cost and speed.

FIGURE 2.1: A typical process of DNA sequencing and analysis [5]

Once the sequencing is finished, the data becomes available for download as "fastq" text files, in which each short read takes up four lines. FASTQ files are ASCII text files that encode both the nucleotide calls and 'quality information', which indicates the confidence of each nucleotide call. A typical file looks something like this:

@SRR566546.970 HWUSI-EAS1673_11067_FC7070M:4:1:2299:1109 length=50
TTGCCTGCCTATCATTTTAGTGCCTGTGAGGTGGAGATGTGAGGATCAGT
+SRR566546.970 HWUSI-EAS1673_11067_FC7070M:4:1:2299:1109 length=50
hhhhhhhhhhghhghhhhhfhhhhhfffffe`ee[`X]b[d[ed`[Y[^Y
@SRR566546.971 HWUSI-EAS1673_11067_FC7070M:4:1:2374:1108 length=50
GATTTGTATGAAAGTATACAACTAAAACTGCAGGTGGATCAGAGTAAGTC
+SRR566546.971 HWUSI-EAS1673_11067_FC7070M:4:1:2374:1108 length=50

As the example above shows, the FASTQ format uses four lines for each read produced by the sequencer. Each read has:

1. Header line - This line must start with "@", followed by the name of the read (which should be unique).

2. The nucleotide sequence.

3. A second header line - This line must start with "+". Usually the information is the same as in the first header line, but it can also be blank (the "+" is still required, though).

4. Quality information - For each nucleotide in the sequence, an ASCII-encoded quality score is reported. The idea is that better quality scores indicate that the base is reliably reported, while poor quality scores reflect uncertainty about the true identity of the base.
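The four-line record structure listed above can be read with a few lines of code. The following is a minimal sketch, not the code used in the thesis: the names `FastqRead`, `parse_fastq` and `phred_scores` are hypothetical, blank lines are skipped as a convenience, and the ASCII offset of the quality encoding is left as a parameter because it depends on the sequencing pipeline (33 and 64 are the commonly used values).

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastqRead(NamedTuple):
    name: str       # header text without the leading "@"
    sequence: str   # nucleotide calls
    quality: str    # one ASCII-encoded score per nucleotide


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRead]:
    """Yield one FastqRead per four-line FASTQ record."""
    # Drop line endings and skip blank lines (an assumption of this sketch).
    content = (line.rstrip("\r\n") for line in lines)
    content = (line for line in content if line)
    for header in content:
        sequence = next(content, None)
        separator = next(content, None)
        quality = next(content, None)
        if sequence is None or separator is None or quality is None:
            raise ValueError(f"truncated record after {header!r}")
        if not header.startswith("@"):
            raise ValueError(f"header must start with '@': {header!r}")
        if not separator.startswith("+"):
            raise ValueError(f"third line must start with '+': {separator!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRead(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int) -> List[int]:
    """Decode ASCII quality symbols into integer scores."""
    return [ord(symbol) - offset for symbol in quality]


if __name__ == "__main__":
    example = ["@read1 length=4\n", "ACGT\n", "+\n", "hhgf\n"]
    for read in parse_fastq(example):
        # With an offset of 64, 'h' decodes to 40, 'g' to 39 and 'f' to 38.
        print(read.name, read.sequence, phred_scores(read.quality, offset=64))
```

Reading four lines at a time, rather than searching for "@", avoids misinterpreting a quality line that happens to begin with "@".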

For easy manipulation, FASTQ files can be converted to the FASTA format. This is easily done with a simple script, such as the one in appendix x. The FASTA file format is widely used because it is easy to analyze with computer algorithms. An example sequence is as follows.

>NM_012514
CGCTGGTGCAACTCGAAGACCTATCTCCTTCCCGGGGGGGCTTCTCCGGCATTTAG
CGGCGTTTGGAAGTACGGAGGTTTTTCTCGGAAGAAAGTTCACTGGAAGTGGAAGA
GATTTATCTGCTGTTCGAATTCAAGAAGTACAAAATGTCCTTCATGCTATGCAGAA
TTGGAGTGTCCAATCTGTTTGGAACTGATCAAAGAACCGGTTTCCACACAGTGCGA
ATATTTTGCAAATTTTGTATGCTGAAACTCCTTAACCAGAAGAAAGGACCTTCCCA
CCTTTGTGTAAGAATGAGATAACCAAAAGGAGCCTACAAGGAAGTGCAAGG
>NM_012515
TGTGGATCTTTCCAGAACAGCAGTTGCAATCACTATGTCTCAATCCTGGGTACCCG

As the example above shows, sequences are expected to be represented in the standard IUPAC amino acid and nucleic acid codes, with some exceptions. The supported amino acid and nucleic acid codes can be found in appendix x and appendix y respectively.
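The conversion from FASTQ to FASTA mentioned above amounts to keeping the read name and the nucleotide sequence and discarding the quality information. The sketch below is a stand-in for illustration only and is not the script from the appendix; the function name `fastq_to_fasta` and the default line width of 60 characters are assumptions, and the whole input is held in memory for simplicity.

```python
from typing import Iterable, Iterator


def fastq_to_fasta(fastq_lines: Iterable[str], width: int = 60) -> Iterator[str]:
    """Yield FASTA lines for every four-line FASTQ record."""
    if width < 1:
        raise ValueError("width must be positive")
    # Strip whitespace and ignore blank lines (an assumption of this sketch).
    lines = [line.strip() for line in fastq_lines if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per read")
    for i in range(0, len(lines), 4):
        header, sequence, separator, _quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record number {i // 4 + 1}")
        # A FASTA description line starts with ">" instead of "@".
        yield ">" + header[1:]
        # Wrap the sequence into lines of at most `width` characters.
        for start in range(0, len(sequence), width):
            yield sequence[start:start + width]


if __name__ == "__main__":
    fastq = ["@read1\n", "ACGTACGTAC\n", "+\n", "hhhhhhhhhh\n"]
    print("\n".join(fastq_to_fasta(fastq, width=4)))
```

For a large file, the same logic can be applied record by record while streaming, so that the input does not have to fit in memory.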

After the sequencing step, the data is available in the required format. The next step is to process this huge amount of data. Several smart algorithms have been proposed to analyze the data in minimum time. As already discussed, these algorithms are complex and difficult to parallelize. Thus, even after considerable effort, they are still too slow to execute. As a solution to this problem, neural network algorithms could be a good fit. An artificial neural network (ANN) is, by nature, a huge parallel distributed network. It is made up of simple processing units connected in a huge network. To simplify the design of the algorithm, the network is defined with simple and ordered connections. The neurons execute and use the data fed to them based upon their localized connection biases, known as weights. These properties make it potentially fast for the computation of certain tasks. The following section discusses neural networks in detail.

2.2 Neural Networks

The brain is a highly complex, nonlinear and parallel information processing system. It has the capability to organize its structural constituents, known as neurons, so as to perform certain computations (e.g. pattern recognition, perception and motor control) many times faster than the fastest digital computer in existence today. A neuron is an electrically excitable cell that takes up, processes and transmits information through electrical and chemical signals. In the brain, neurons never work alone. There are on the order of 100 billion neurons (1000 km in line) and 1 quadrillion synapses in a human brain [9].


FIGURE 2.2: A neuron and its parts [15]

A neuron is composed of the cell body, the dendrites and the axon. Neurons influence decisions by firing electrical impulses. Whether a neuron fires an electrical signal depends on:

1. The strength of the synaptic connections, called weights.

2. The input signal.

3. The interpretation (execution) by the cell body.

A neuron receives inputs from other neurons, and if the sum (the reference function) of those inputs exceeds a certain threshold, the neuron "fires". The electrical impulse arrives at the dendrites, is processed in the cell body and then moves along the axon. It then reaches the end of the axon, at the synapses, which connect to the next neurons.

Synapses are elementary structural and functional units that mediate the in-teractions between neurons.

An artificial neural network (ANN) is a massively parallel distributed processing model made up of simple processing units, which naturally learns from experiential knowledge and makes it available for use. It resembles the brain in two respects:

1. Knowledge is acquired by the network from its environment through a learning process.

2. Inter-neuron connection strengths, known as synaptic weights, are used to store the acquired knowledge.


FIGURE 2.3: Simplified model of a neuron[4].

As algorithms grow more complex, a simplified model helps in expressing them well. Therefore, when modelling an artificial neural network, it is advisable to keep the basic units and the structure of the model simple. This makes the model easy to realize and implement. The complexity of the algorithm is thus handled not by a single neuron but by thousands together.

1. The connection strength can be represented as a real number in the range [0,1] or [-1,+1]. Here 0 means no connection, a positive number means a facilitating connection and a negative number means a hindering connection.

2. The inputs can be represented as integers in the range [0,+1], i.e. 0 or 1. The benefit of this representation is its resemblance to binary digital processing.

3. The simplest processing function is the multiply-accumulate (MAC) function.

4. The decision is made according to a desired threshold level. The simplest decision function is a step function. Because the step function is not differentiable, smoother versions such as tanh or sigmoid are more common. Their advantage is that they are differentiable everywhere.
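The four modelling choices listed above can be combined into a very small neuron model. The following sketch is illustrative only: the function names and the example weights and inputs are hypothetical, and the default step threshold of 0.0 is an assumption rather than a value taken from the thesis.

```python
import math
from typing import Callable, Sequence


def step(x: float, threshold: float = 0.0) -> float:
    """Hard decision: fire (1.0) once the accumulated input reaches the threshold."""
    return 1.0 if x >= threshold else 0.0


def sigmoid(x: float) -> float:
    """Smooth, differentiable alternative to the step function."""
    # Two branches keep math.exp from overflowing for large |x|.
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def neuron(inputs: Sequence[float],
           weights: Sequence[float],
           activation: Callable[[float], float] = sigmoid) -> float:
    """Multiply-accumulate the inputs with the weights, then apply the decision function."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    accumulator = 0.0
    for x, w in zip(inputs, weights):
        accumulator += x * w  # the MAC operation
    return activation(accumulator)


if __name__ == "__main__":
    inputs = [1, 0, 1]            # binary inputs
    weights = [0.5, -1.0, 0.25]   # facilitating, hindering, facilitating
    print("step   :", neuron(inputs, weights, activation=step))
    print("tanh   :", neuron(inputs, weights, activation=math.tanh))
    print("sigmoid:", neuron(inputs, weights))
```

Passing the decision function as a parameter keeps the MAC part identical for all three activations, which mirrors the separation between processing function and decision function in the list above.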


The weights are initially random. They become ordered as input data is fed to the network. This procedure is known as the learning process. The adjusted weights affect the output of the functions.

The most popular types of learning include:

1) Supervised: Learning with the help of a teacher is referred to as supervised learning. For example, a child is taught the alphabet of a language. The popular deep neural network type uses supervised learning with back-propagation. Here, back-propagation is similar to correcting the child when he or she identifies a character wrongly.

2) Unsupervised: This corresponds to the self-categorizing capability of the network, for example the self-teaching capability of a child. If the child is introduced to a different species of tree, he or she can still identify that it is a tree (but may not be able to name it). In this type of learning, the input data is analyzed and grouped together based on similarity. This is done by finding the winning neuron. The weights of the winning and neighbouring neurons are adjusted so that, if similar input data repeats, the probability of hitting the winning neuron (or cluster) is improved.

3) Reinforcement learning: This is different from the previous two types, as the weights never get updated. The previous two types use a permanent learning approach, while an RNN processes information based on the current computational unit while also taking advantage of the recent past. For example, daily diet tracking need not be remembered for a long time.

Uses of artificial neural networks include big data analysis, data mining, feature extraction, training- or experience-based tasks, data correlation, automation, strategic tasks and so on. Some common examples include software automation, image recognition, fingerprint detection, gesture detection, object identification, image correction and quality improvement (denoising), data compression, real-time animation and rendering, strategy games, and handwriting or language recognition.

There are a number of different models for each type of learning process, and variations and advancements can be applied to them depending on the application. The preliminary version of BioSOM is based upon self-organizing maps (SOMs), which are a type of unsupervised learning. This thesis aims to configure the architecture for the unsupervised part. The following section therefore describes SOMs and their variations in detail.


2.2.1 Self-Organizing Maps

The Self-Organizing Map, developed by Professor Teuvo Kohonen, is one of the most popular neural network models [16]. It belongs to the category of competitive learning networks. The Self-Organizing Map is based on unsupervised learning, which means that no human intervention is needed during learning and that little needs to be known about the characteristics of the input data. We could, for example, use the SOM for clustering data without knowing the class memberships of the input data. The SOM can be used to detect features inherent to the problem and thus has also been called the SOFM, the Self-Organizing Feature Map.

The SOM architecture generally has two layers: an input space and a SOM mesh. Each neuron in the input space is connected to the neurons in the SOM mesh by weighted connections. The SOM mapping starts by initializing the weight vectors randomly. After that, a sample vector is selected at random and the map of neurons is searched to find the neuron whose weight vector best represents that sample. Each weight vector has neighbouring weights that are close to it. The winning neuron adjusts its weights for better correlation with the input, which increases its chance of winning the next time that input appears. The winning neuron also influences its neighbouring neurons by a small amount. This influence lets the SOM cluster like data together. The whole process is repeated a large number of times.

To sum up, learning occurs in several steps over many iterations, as follows:

1) All weights are initialized randomly.

2) An input vector is chosen at random from the set of training data.

3) Every node is examined to calculate which one's weights are most like the input vector. The winning node is commonly known as the Best Matching Unit (BMU).

4) Then the neighbourhood of the BMU is calculated. The number of neighbours can decrease over time.

5) The winning weight is rewarded by becoming more like the sample vector. The neighbours also become more like the sample vector. The closer a node is to the BMU, the more its weights are altered; the farther away a neighbour is from the BMU, the less it learns.


FIGURE 2.4: A two dimension mapped SOM.

6) Repeat from step 2 for N iterations.

The Best Matching Unit is found by running through all weight vectors and calculating the distance from each weight vector to the sample vector. The weight vector with the shortest distance is the winner. There are numerous ways to determine the distance; the most commonly used is the Euclidean distance.
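The BMU search described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `euclidean_distance` and `best_matching_unit` and the small example map are assumptions made here, not part of the BioSOM scripts.

```python
import math


def euclidean_distance(weights, sample):
    """Euclidean distance between one weight vector and the sample vector."""
    return math.sqrt(sum((w - x) ** 2 for w, x in zip(weights, sample)))


def best_matching_unit(weight_vectors, sample):
    """Return (index, distance) of the weight vector closest to the sample."""
    best_index, best_distance = -1, math.inf
    for index, weights in enumerate(weight_vectors):
        distance = euclidean_distance(weights, sample)
        if distance < best_distance:  # strict "<" keeps the first of equal winners
            best_index, best_distance = index, distance
    return best_index, best_distance


# Hypothetical three-neuron map with two weights per neuron.
som = [[0.0, 0.0], [1.0, 1.0], [0.9, 0.1]]
bmu_index, bmu_distance = best_matching_unit(som, [1.0, 0.0])
```

In this example the third neuron (index 2) is the closest to the sample and is therefore the BMU.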

The general algorithm can be varied in many ways to suit a specific kind of application. For example, a time-varying parameter can control the size of the neighbourhood and the strength of the update over time. Other variations concern the type of network (connection): the neurons can also be connected in a one-dimensional or three-dimensional space. The DNA sequencing algorithm used in this thesis uses a hypertoroidal structure, which is explained in the next section.

The SOM has proven useful in many applications. It provides a topology-preserving mapping from the high-dimensional space to map units. Map units, or neurons, usually form a two-dimensional lattice, and thus the mapping is a mapping from a high-dimensional space onto a plane. The property of topology preservation means that the mapping preserves the relative distance between the points. Points that are near each other in the input space are mapped to nearby map units in the SOM. The SOM can thus serve as a cluster analysis tool for high-dimensional data. The SOM also has the capability to generalize, which means that the network can recognize or characterize inputs it has never encountered before. A new input is assimilated with the map unit it is mapped to.

Applications of the SOM include classification of sample data, topology preservation, clustering, visualization, ordering, feature extraction and data compression. The SOM is also used together with other AI-based applications to mimic the behaviour of the human cortex, for example in a self-adaptive system for dynamic object detection [x].

2.3 Algorithm

The algorithm proposed for project BioSOM employs a neural network based approach. The Neural Network (NN) based approach provides the following critical benefits:

• It obviates the need to invent bioinformatics algorithms: New bioinformatics algorithms are invented by researchers for every specific problem and need to be reinvented whenever the nature of the data changes. These inventions, by definition, are not mechanical, but rather rely on human ingenuity, which happens sporadically, making progress slow and unpredictable. By using the learning ability of ANNs to solve a large class of bioinformatics problems, we demonstrate the possibility of developing a method that obviates the need to invent algorithms.

• Antibiotic Resistance (ABR): Rapid point-of-care diagnostic tests are a central part of the solution to this demand problem, as studies show that up to 70 percent of antibiotics are prescribed incorrectly. This is due to physicians not diagnosing patients accurately in real time, leading to prescription of broad-spectrum antibiotics that should ideally be kept in reserve. A neural network based approach not only identifies bacteria but can also retrain itself over time to follow the evolution of bacterial antibiotic resistance.

• It automates the process, which removes the need for the traditional experience-based approach.

• Neural network algorithms can also classify (cluster) data that is not yet known.

• Scalable Parallelism: The present generation of bioinformatics algorithms is largely based on dynamic programming, Hidden Markov Models, and graph theory formulations. These algorithms have large parts that can be parallelized, though not in an embarrassingly parallel fashion, while other parts are serial and, following Amdahl's law, quickly saturate the benefits of parallelism. The ANN based approach, on the other hand, is embarrassingly parallel in nature.

• Algorithmic Superiority: NN based solutions, applied to bioinformatics problems, promise to be faster, more accurate, and less biased than the conventional solutions.

The question is what type of algorithm a DNA sequencing application should use. As discussed before, a SOM can cluster data of similar type. Since the DNA of each individual is characteristic of that individual, a SOM can successfully differentiate between samples by reading only a part of the file. This saves time, since only around 10 percent of the file needs to be read.

On top of this, a DNN can be trained to answer specific questions according to what is known about phenotype, symptoms, drug resistance etc. This thesis is specifically aimed at configuring the hardware and scheduling the algorithm for the unsupervised part only. Once complete, the SOM based algorithm implemented on the SiLago platform is intended to kick-start the efforts to accelerate diagnosis and show the way forward in project development.

The next question is how to incorporate the SOM model into the algorithm. As discussed before, a typical traditional SOM model has the following characteristics:

• A network of neurons arranged in a 2-D mesh.

• It takes advantage of correlation between inputs to slowly form clusters.

• Neurons on the edge of the mesh have fewer neighbouring neurons, which influences clustering.


FIGURE 2.5: Working of Novel SOM algorithm

The characteristics of DNA sequences, on the other hand, are as follows:

• DNA samples of two different microorganisms are highly uncorrelated.

• DNA samples of the same species of microorganism show some correlation.

• A DNA file is large, even if less than the whole file is read.

• The number of microorganisms present in the world could be difficult to fit in one 2-D mesh.

The algorithm developed at Uppsala University by prof. Manfred thus differs in architecture from the traditional SOM. It has the following properties:

• Each microorganism species has an individual SOM layer, so the data of one microorganism does not interfere with that of another while clustering. In addition, it is easier to maintain separate SOM nets for as many microorganism species as needed.

• Each SOM has a number of neurons (nodes). This helps in holding large DNA data and in incorporating many individuals of the same species. The update happens only inside a specific SOM.

• The SOM has a hypertoroidal shape. This means that all neurons inside a SOM have an equal number of neighbours and there are no edges.

The working of the algorithm is divided into two stages: the training phase and the testing phase. In the training phase, the individual SOM nets are trained with microorganism data. In the testing phase, the sample data of the patient is checked against each SOM to find its correlation.

The steps in the training phase are as follows:

• Each neuron of the SOM layer (or net) is connected to each neuron in the input layer (or space) by a weighted connection. Initially the weights are random.

• One DNA line is read from the whole file. It is randomly divided into smaller fragments, which serve as inputs to the SOM.

• The correlation (distance) of each neuron in the SOM is calculated with respect to one input fragment sequence.

• The winner neuron is determined, which is the neuron with the least distance, i.e. the best correlation.

• The winner neuron updates its weights. It also updates its neighbours with a factor that decreases exponentially with their physical distance (not to be confused with correlation distance).

• Thereby, over a few iterations with different fragments, the whole SOM net is trained.

• Similarly, the other SOM nets are trained with data from different microorganisms.

The steps in the testing phase are as follows:

• Each neuron of the SOM layer (or net) is connected to each neuron in the input layer (or space) by a weighted connection. The weights are not random but are the values obtained in the training phase.

• Several DNA lines are read from the whole file. They are not fragmented, as we do not want to lose data of multiple possible microorganisms. They are fed one by one as inputs to the SOM. Not all DNA lines need to be fed because, by their nature, instances of them are present all over the DNA file.

• The correlation (distance) of each neuron in the SOM is calculated with respect to one input sequence.

• The winner neuron is determined, which is the neuron with the least distance, i.e. the best correlation. This data is stored.

• The update process must not happen, as it would influence the database.

• The stored data is multiplied with each feed to accumulate the result. This accumulated result describes the correlation of the sample data to a specific SOM or microorganism.

• Similarly, the correlation with the other SOM nets is checked.

• Finally, the data is compared for analysis.

To compute such an algorithm with accuracy and performance, an appropriate platform must be chosen. As previously discussed, the SiLago platform shows superiority over other platforms for this application. The following section discusses the SiLago platform in detail.

2.4 The SiLago Platform

2.4.1 Motivation

For BioSOM, it is proposed to implement the NN algorithms entirely in hardware, using a method that provides a hundred-fold improvement in computational efficiency compared to GPUs tailored for NN algorithms, while also providing a radical improvement in engineering efficiency by reducing the cost of custom hardware design to a level comparable to that of compiling code for GPUs.

It is well known that designing custom hardware from scratch takes a lot of effort. In the early 90s, standardization up to the transistor level therefore became common in industry. Afterwards, ever-evolving technology and requirements pushed the custom design flow to what is known today as the ASIC flow. With continuously increasing software requirements, it takes a lot of engineering effort not only to design but also to verify and validate.

The increase in the number of abstraction levels is due to the increase in complexity of the algorithms. This is compounded by the fact that the number of solutions in terms of cost metrics (among which the optimal solution is to be found) increases exponentially with the addition of each layer. Thus, it is becoming difficult to find or predict a solution early, deep down at the transistor level. Therefore, it is becoming exponentially more difficult to design an architecture through the ASIC methodology. Furthermore, each design engineering effort requires a verification effort that is many times larger.

FIGURE 2.6: Reduced abstraction gap in SiLago methodology

The SiLago methodology was conceived to reduce this effort by raising the level of standardization to the micro-architecture level (RTL level), succeeding the standard cell era [6][10][12].

The main advantages of the SiLago methodology are as follows:

• Reduced custom hardware development effort.

• Cost metrics are predictable because it is a hard-IP platform. The small standard SiLago blocks are hardened with a one-time engineering effort.

• The MATLAB/vsim testbench of the compiler, Vesyla, is designed for debugging purposes and is not part of the SiLago design flow. The results of the scripts can also be compared with a golden reference model, which is automatically incorporated through the MATLAB interface to Vesyla.

• For the SiLago design flow, functional verification is in fact not needed; the design is correct by construction. The correctness is guaranteed by the automation process.

• Re-usability (library development of supported functions).

• Innovation at the topmost abstraction level, as the total number of solutions below the topmost level decreases.

In the following section, the SiLago platform is discussed in detail.

2.4.2 The platform

The SiLago platform is based on coarse grain reconfigurable computation and storage fabrics. SiLago (Silicon Large Grain Object) blocks are micro-architecture level design units that replace standard cells. The SiLago blocks are, for example, DRRA cells or DiMArch cells. DRRA cells are made up of arithmetic logic units, register files, sequencers, interconnect elements or combinations of them, while DiMArch cells are basically SRAM memory elements put together. The SiLago blocks are synthesized down to the physical level and characterized with post-layout data to know, with the highest possible degree of confidence, the cost metrics of the micro-architectural operations that they support. A virtual grid defines the two-dimensional pitch in the SiLago framework, and all SiLago blocks are hardened to occupy exactly one or more contiguous SiLago grid cells. The infrastructural and functional nets of the SiLago blocks are brought to the periphery on the grid at the right positions and metal layers to enable composition by abutment with the neighbouring SiLago blocks. The SiLago grid is divided into regions, and each region caters to a specific type of functionality such as inner-modem computation, outer-modem computation, scratchpad memory, system control, infrastructure etc. These SiLago block types and regions are similar to a class library in an object-oriented programming environment. The number and types of class objects instantiated is a decision made by the programmer and constitutes a specific instance of the program. In the SiLago framework, the number and types of SiLago blocks that are instantiated and their interconnection are decided as part of the synthesis process that maps applications to the SiLago fabric [6].

The SiLago design flow has three components. The first two components are one-time engineering efforts.

1. The first component involves development of the SiLago physical design platform. This step is similar to the standard cell library development phase.

2. The second component is a function library development phase and is similar to the library of micro-architecture level blocks that system-level synthesis tools have access to.

3. The third component is system-level synthesis, which transforms digital signal processing systems modeled in Simulink into SiLago designs, i.e. GDSII-level SiLago design instances composed of SiLago blocks.

FIGURE 2.7: Composition of SiLago fabric [6].

The SiLago method provides an end-to-end automated design framework from MATLAB Simulink to GDSII. The entire flow has been exercised on a number of DSP applications, and the benefit of the whole SiLago flow has been quantified against the standard cell based EDA flow. The SiLago method outperforms standard cell based synthesis by 2-3 orders of magnitude in synthesis speed and 1-2 orders of magnitude in accuracy of predicting cost metrics. The real benefit, however, lies in eliminating functional and constraints verification and in more effective higher-level design space exploration.


2.4.3 Vesyla

Vesyla is a high-level synthesis tool that maps the algorithm to the DRRA and DiMArch fabric. It calls an HDL simulator, namely Mentor Graphics ModelSim. The verification part, through Vesyla, is only for debugging purposes.

Vesyla does not have an automated resource allocation and binding mechanism. Instead, allocation and binding are done manually using pragmas. The pragmas are identified by the "%!" symbol in the code.

There are two types of pragmas:

1. Line pragmas are written as a comment, after the aforementioned symbol, on the same line as a statement. They are used for storage and processor allocation.

2. Block pragmas are written on a line before and after the desired groupof statements. They are mainly used for defining parallel regions.

Figure 2.8 shows the various option types associated with the line pragma. Line pragmas are used as follows:

1. A storage allocation pragma can have different options, which are placed within angle brackets. Currently there are three options for storage allocation: distribution type, variable type, and indirect addressing mode.

2. All arithmetic statements in the MATLAB code that need DPU operations should be instrumented with processor allocation pragmas in order to specify which part of the DRRA fabric should perform that arithmetic computation.

3. A processor allocation pragma can have different options, which are placed within angle brackets. Currently there are three options for processor allocation: DPU mode, output port, and saturation mode.

Below is a small yet comprehensive example of describing an algorithm using the Vesyla platform.

1. A = zeros(1, 64); %! MEM<options> [0,0]

2. B = zeros(1, 512); %! RFILE<options> [1,1]

3. %! RESOURCE_SHARING_BEGIN

4. C(1) = C(1) + D(2); %! DPU [1,0]

5. E(1) = E(1) + F(2); %! DPU [0,1]

6. %! RESOURCE_SHARING_END

FIGURE 2.8: Different inputs to the options block of line pragma.

All of the statement lines in the example above (lines 1, 2, 4 and 5) show how to use line pragmas. Different types of line pragma are used: lines 1 and 2 show how to store data at different levels of memory, and lines 4 and 5 show how to execute a statement on a specific DPU. The space for "options" can either list one of the options in Figure 2.8 or be left empty. Lines 4 and 5 execute in parallel because they are enclosed in the block pragma.

There are several modes of operation of the DPU. In the example above, both DPUs (i.e. [1,0] and [0,1]) add and accumulate the results. Here, the mode of operation of the DPU is determined automatically. It is also possible to specify the mode of operation explicitly through "options". The table in Figure 2.9 lists all the modes of the DPU and describes how each performs its computation.

FIGURE 2.9: List of DPU modes in detail.

Currently Vesyla is in the testing and development stages, so it may have several limitations and problems. Some of the important limitations are the following:

1. Limited support for MATLAB functions: the supported functions are zeros, ones, min, max, sum, fi and abs.

2. No support for multidimensional arrays.

3. Only one DPU operation is supported in each assignment statement.

4. Limited support for arithmetic operations to be mapped to the DPU: the supported operations are addition, subtraction, multiplication, MAC, symmetric MAC, preserving MAC, comparison, subtract-accumulation and absolute subtract-accumulation.

5. Nested loops are supported up to a maximum of 4 levels.

6. Conditional statement limitations: limited support for conditional statements inside loops, and no nested conditional statements.

Chapter 3

Implementation

This chapter goes through the steps and considerations involved in designing the customized hardware for the SOM. The considerations include memory and data management, architecture planning, scheduling and parallelization of the algorithm, register transfer level changes and logic synthesis. The following sections discuss these steps in detail.

3.1 Algorithmic resource estimation

This section discusses the algorithmic requirements set by the scripts developed specifically for accelerating DNA sequencing. The following sections focus on reverse engineering the code to deduce the specification from it.

3.1.1 Code flow

The purpose of reverse engineering the code is twofold: firstly, to understand the structure and nature of the code, which is later used in scheduling the algorithm onto the hardware; secondly, to define the hardware requirements needed to support the algorithm.

As previously described, the algorithm is divided into two parts, namely training and testing. In the first step, different SOM nets are trained with samples of different bacteria and viruses. In the second step, the trained nets are compared with the patient's sample data to check for footprints, and the probability (score) is written to an output file. The following paragraphs explain the algorithm and its steps briefly.

• The algorithm starts by reading data samples of different bacteria/viruses. The sample data is assumed to be perfect, so it is only required to read a single specific part (sequence) of a DNA file and train the neural network with it. Since it takes a few iterations for the net to bring its random weights into order, multiple subsequences are fed to the network. To sum up, a specific (generally the first) sequence of a DNA file is chosen.

• The starting addresses of the subsequences (10 in total) are chosen randomly, and the subsequences are then fed to the net one by one. Since there is one individual SOM net dedicated to each bacterium/virus, this process is carried out for each bacterium/virus on its respective SOM net. The nature of the net is as described in the previous chapter.

• In the process of reading and feeding the data, the data needs to be converted from the FASTA format to a binary format. Note that each unit of information is represented by an integer from {-1, 0, 1}, which means:

1. 0: no input.

2. +1: positive input.

3. -1: negative input.

The difference between a positive and a negative input is that of a facilitating or a hindering influence on the neuron to fire.

Each letter of a DNA subsequence is represented by two such integers. Of these two connections, one is 0 and the other is either a positive or a negative input. Two bits are required to represent any of these three values, so a total of 4 bits are needed to represent a letter in the net.

• The conversion step is followed by the learning rule. In learning, the distance between the randomized weight set and each subsequence is calculated in each SOM. This is done simply by summing the squares of the differences between the weights and the input subsequence data, i.e.

$d = \sum_{i}^{N} W_i x_{ji}$ (3.1)

where N is the total number of connections of the jth neuron in the SOM net to the input layer, d is the distance, and W and x are the associated weight and input respectively.

• The winner neuron is the neuron in the net whose distance value is the least.

• The weights are updated according to the location and neighbourhood of the winning neuron, following a Gaussian curve. The update is aimed at increasing the probability of the neuron firing if the same input is repeated. In other words, the distance value should decrease further. This is done simply as:

$W_i^{new} = W_i^{old} + \alpha (W_i^{old} - x_{ji})$ (3.2)

where α is the update factor, which is a function of the physical distance between neurons in the ring.

• The physical distance can be calculated from both sides in the simple hypertoroid, as shown in figure 3.1. If the physical distance is negative, it is converted to a positive integer. There are two possible paths, and hence two physical distances, between two neurons in the ring.

To check whether a neuron falls in the neighbourhood of the winning neuron, the shortest distance is calculated. To check whether the calculated distance is the shortest one, it is compared with half the size of the ring, i.e.

$\text{if } (d > size/2) \; \{ d_{new} = size - d_{old} \}$ (3.3)

If the calculated distance is larger than half the maximum distance, the other path is the shortest and therefore the appropriate one.

• This distance value is used as the parameter of an exponential neighbourhood function to update the weights, i.e.

$\alpha = a e^{-bx}$ (3.4)

where a and b are fitting parameters.


FIGURE 3.1: Selection of appropriate distance; the path in red is the shortest, hence appropriate, distance but the path in green is not.

The neighbourhood update function can be extended with other capabilities, including a time-dependent size of the neighbourhood, as mentioned in the previous chapter.

• The process repeats for each sequence and SOM.

• The next part of the algorithm analyzes the test (sample) file. This part is similar to the previous one, with a few modifications. The best neuron is calculated as in the previous part, but the update process is not carried out, because it is undesirable for the sample to influence the SOM net. Also, multiple iterations over subsequences of the same sequence are not appropriate because the task is not to train. Rather, multiple sequences are read from all over the file to find the closest match (and also multiple infecting organisms). The total score for each net is returned as an output, which indicates the probability of a match with that net. The final scores determine the infection.
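Three of the building blocks described in this list can be sketched in Python: the input encoding, the ring distance of Eq. 3.3 and the exponential neighbourhood factor of Eq. 3.4. This is a minimal sketch under stated assumptions. In particular, the mapping from letters to value pairs in `ENCODING` is hypothetical (the text only states that one of the two values is 0 and the other is +1 or -1), and the function names and the default values of `a` and `b` are chosen here for illustration.

```python
import math

# Hypothetical letter-to-pair mapping (an assumption): each letter uses two
# connections, one of which is 0 while the other is +1 or -1.
ENCODING = {"A": (1, 0), "C": (-1, 0), "G": (0, 1), "T": (0, -1)}


def encode(subsequence):
    """Flatten a DNA string into a list of values from {-1, 0, 1}."""
    return [value for letter in subsequence for value in ENCODING[letter]]


def ring_distance(i, j, size):
    """Shortest physical distance between neurons i and j on a ring (Eq. 3.3)."""
    d = abs(i - j)       # a negative difference is converted to a positive one
    if d > size / 2:     # the other way round the ring is shorter
        d = size - d
    return d


def neighbourhood_factor(distance, a=1.0, b=1.0):
    """Exponential neighbourhood factor of Eq. 3.4; a and b are fitting parameters."""
    return a * math.exp(-b * distance)


encoded = encode("ACGT")            # two values per letter
hops = ring_distance(1, 9, 10)      # going round the ring takes 2 hops, not 8
```

On a ring of 10 neurons, neurons 1 and 9 are 2 hops apart, so the neighbourhood factor is evaluated at 2 rather than at 8.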

3.2 Optimization of Algorithm

1. Improving performance by reducing execution cycles: After reverse engineering, the original algorithm is as follows:

For(1 of S)
    For(1 of M)
        For(1 Neuron of N)
            For(1 Weight of 64)
                calculate distance
                if(best)
                    store index and score
                end
            end
        end
        For(1 Neuron of N)
            if(prev. best)
                update weight
            end
        end
    end
end

Clearly, the winning neuron must be determined before the update can be carried out. The first optimization is to perform the update belonging to the previous execution cycle during the current one. This does not change the computation, but it allows the two second-innermost loops (iterating over neurons) to be merged, which increases performance considerably. It is now only required to iterate once over all the neurons and their weights in each execution cycle, whereas earlier it was required to iterate over the neurons twice. Thus, the performance improves about twofold. The new algorithm is as follows:

For(1 of S)
    For(1 of M)
        For(1 Neuron of N)
            For(1 Weight of 64)
                if(prev. best)
                    update weight
                end
                calculate distance
                if(best)
                    store index and score
                end
            end
        end
    end
end

2. Replacing the exponential neighbourhood function by a shift function: The neighbourhood influence is computed by the following mathematical operation:

$\alpha = a e^{-bx}$ (3.5)

where a and b are fitting parameters.

To calculate the exponential function, the DPU would need to be extended with one of the following:

(a) The CORDIC algorithm, which impacts performance.

(b) A look-up table, which impacts area.

(c) Other similar approaches, which impact both area and performance.

Any decreasing exponential function can, however, be replaced with:

$\alpha = \alpha\, 2^{-\beta x} + \zeta$ (3.6)

where α, β and ζ are fitting parameters.

This is possible if and only if the exact boundaries of the data set and the precision are known. The fitting parameters can then be used to approximate the older function by the new one within a specific region. Moreover, it is not a hard rule to use only the $e^{-x}$ function for the neighbourhood; other functions can also be tried to achieve the desired results.

How does the new function help? The new function is nothing but a right shift ($2^{-x}$). So, the function is represented as:

$\alpha = \alpha \cdot \mathrm{right\_shift}(x) + \zeta$ (3.7)

FIGURE 3.2: Curve fitting; blue represents the plot of $2^{-x}$, red represents $e^{-x}$ and green represents $2^{-1.5x}$

The right shift is a general function that every hardware platform provides. It is favourable in terms of area, performance and power, especially when compared with the aforementioned implementation techniques. Another advantage is that there is no need to upgrade SiLago and Vesyla, as the right shift is already present.
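The relation between the exponential and the base-2 form can be checked numerically. Since $e^{-bx} = 2^{-(b/\ln 2)x}$, choosing β = b/ln 2 (about 1.44 for b = 1, close to the 1.5 used for the green curve in Figure 3.2) reproduces the exponential in real arithmetic, and for β = 1 and integer x the factor reduces to a plain right shift. The sketch below is illustrative only: the function names are assumptions, and the amplitude on the right-hand side of Eq. 3.6 is called `a` here to keep it apart from the result.

```python
import math


def exp_factor(x, a=1.0, b=1.0):
    """Exponential neighbourhood factor, as in Eq. 3.5."""
    return a * math.exp(-b * x)


def pow2_factor(x, a=1.0, beta=1.0, zeta=0.0):
    """Base-2 replacement in the style of Eq. 3.6 (amplitude named a here)."""
    return a * 2.0 ** (-beta * x) + zeta


def shift_scaled(value, distance):
    """Scale a non-negative integer (fixed-point) value by 2**-distance using a right shift."""
    return value >> distance


BETA_EXACT = 1.0 / math.log(2.0)   # about 1.4427; makes 2**(-beta*x) equal e**(-x)

scaled = shift_scaled(1024, 3)     # same as multiplying 1024 by 2**-3
```

With integer distances the multiplication by the neighbourhood factor therefore needs no multiplier or look-up table, at the price of truncating the low-order bits.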

3.3 Parallelization of Algorithm

Generally, an algorithm can be parallelized in multiple ways. After parallelization, the serial algorithm, which was written to run on a single processing element, runs on an architecture with multiple processing elements. Algorithms are parallelized based upon their scalable part. Accordingly, the maximum speedup that can be achieved by the algorithm can be predicted by Amdahl's law.
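As a reminder, Amdahl's law states that if a fraction p of the execution time can be parallelized over n processing elements, the speedup is 1 / ((1 - p) + p / n). A small Python helper makes the saturation effect visible; the function name and the example fractions are chosen here for illustration and are not measurements from this work.

```python
def amdahl_speedup(parallel_fraction, processing_elements):
    """Speedup predicted by Amdahl's law for a given parallelizable fraction."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processing_elements)


# Illustrative values only: 90 percent parallelizable code on 2 and on 1000 elements.
speedup_two = amdahl_speedup(0.9, 2)
speedup_many = amdahl_speedup(0.9, 1000)
```

Even with 1000 processing elements, a 10 percent serial part limits the speedup to just under 10, which is why the scalable part of the algorithm matters most.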

The thesis aims to exploit the scalable part of the algorithm to run in parallel with appropriate load sharing. A requirement is that the parallelization of the algorithm should be parametric. In addition, the platform should be easy to modify in order to scale the parallelism.

To parallelize an algorithm, it is required to evaluate the algorithm as well as its architecture. The hardware resources should be idle for the least possible time, with appropriate load sharing. Apart from that, the connections between different blocks, the pin placement and the synchronization between different blocks should be kept in mind while defining the architecture. On the algorithmic side, the algorithm is analyzed to exploit different options for parallelism by analyzing loops and computations.

To start the analysis of parallelization, let us review the DRRA architecture. A SiLago block (DRRA block) can store at least 32 different data registers in its RFile, each with a precision of 16 bits. The data is processed in the DPU, which is the functional element of every block and contains hardware support for several functions. A DPU can thus execute several functions, including addition, subtraction, multiply-accumulate (MAC), shift etc. Most of the functional sub-blocks have a throughput of one. Note that, since it is a CGRA architecture, it does not support bit-level (binary) functionality.

Let us now briefly look at our algorithm. To simplify and generalize, let us assume that there are M input subsequences, C characters, T test sequences, N neurons and S SOM nets. Then the linear algorithm can be expressed as follows.

For(1 of S)
    For(1 of M)
        For(1 Neuron of N)
            For(1 Weight of G)
                if(prev. best)
                    update weight
                end
                calculate distance
                if(best)
                    store index and score
                end
            end
        end
    end
end
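The merged loop can be compared against the original two-pass loop with a small Python sketch for a single SOM ring. Several things here are assumptions made for illustration: the update uses the textbook Kohonen form that moves the weights toward the input, `factor_for` is a hypothetical power-of-two neighbourhood factor, and the deferred update of the last input is applied in an explicit final pass, which the pseudocode above does not show.

```python
def distance(weights, x):
    """Sum of squared differences between a weight vector and an input."""
    return sum((w - xi) ** 2 for w, xi in zip(weights, x))


def update(weights, x, factor):
    """Textbook Kohonen update (an assumption): move the weights toward the input."""
    return [w + factor * (xi - w) for w, xi in zip(weights, x)]


def factor_for(n, winner, size):
    """Hypothetical neighbourhood factor based on the ring distance to the winner."""
    d = abs(n - winner)
    d = min(d, size - d)
    return 2.0 ** -(d + 1)


def train_two_pass(net, inputs):
    """Original scheme: find the winner over all neurons, then update in a second loop."""
    net = [list(w) for w in net]
    for x in inputs:
        dists = [distance(w, x) for w in net]
        winner = dists.index(min(dists))
        for n in range(len(net)):
            net[n] = update(net[n], x, factor_for(n, winner, len(net)))
    return net


def train_merged(net, inputs):
    """Merged scheme: apply the previous cycle's update while scanning for the new winner."""
    net = [list(w) for w in net]
    prev = None  # (winner index, input) of the previous cycle
    for x in inputs:
        best_n, best_d = None, None
        for n in range(len(net)):
            if prev is not None:
                net[n] = update(net[n], prev[1], factor_for(n, prev[0], len(net)))
            d = distance(net[n], x)
            if best_d is None or d < best_d:
                best_n, best_d = n, d
        prev = (best_n, x)
    if prev is not None:  # final pass: apply the update deferred from the last input
        for n in range(len(net)):
            net[n] = update(net[n], prev[1], factor_for(n, prev[0], len(net)))
    return net
```

Because each neuron receives the pending update before its distance to the new input is computed, both functions return the same weights, while the merged version scans the neurons only once per input.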

As seen from the basic linear algorithm, there are a couple of options available for parallelization. A few of these simple techniques are:

1. To divide elementary computations among different SiLago blocks.

2. To divide the task of the innermost loop, i.e. the weights, among different SiLago blocks.

3. To divide the task of the second-innermost loop, i.e. the neurons, among different SiLago blocks.

4. To divide the task of the two outermost loops.

All of the above-mentioned approaches have their own advantages and drawbacks. The first technique might be the ideal one for load sharing (at DPU level), and the hardware pipeline would rarely be short of resources. But at the end of the computation for each neuron (and thereby for each weight), all the blocks need to share their results. For this resource sharing, the architecture would need all SiLago blocks to be connected. If the physical distance between the blocks is too large, the interconnections become undesirable. In addition, in the case of complex functions, the synchronization overhead would result in a performance drop.

The second technique, dividing the weights among different SiLago blocks, would use the RFile space ineffectively. Hence, SRAM and DRAM access time might contribute a high overhead. The synchronization overhead and the resource sharing area cost still persist.

Before we consider the third approach, let us discuss the fourth. If we distribute the tasks of different SOMs to different SiLago blocks, multiple sequences need to be read in the case of training. This puts pressure on the synchronization between the different levels of memory for optimum performance. The same problem persists if the division is based on the number of sequences. It should also be noted that, for a typical SOM, reading multiple sequences in parallel is not appropriate, as it would interfere with the update function.

Consider instead parallelizing the algorithm by the third technique. The computations of each neuron are assigned to one SiLago block. That means that all the weights associated with the neuron are stored in the large, parametrically sized RFile along with the other data. The associated DPU executes the computations related to each neuron serially. If there are B SiLago blocks, the data of B neurons can be calculated simultaneously. Thus, execution remains highly symmetrical in each block and the number of blocks can be parametrically increased or decreased without too much engineering cost. The next point to consider is load sharing. The load sharing among the DRRA blocks, assuming all the blocks are filled with the appropriate data, remains highly symmetrical. As the DPU is specifically designed for the SOM application, underutilized functional units are not a concern. The synchronization effort at the end of a computation cycle is thus minimal. The resource sharing after the end of the computation (to calculate the best neuron in the case of multiple blocks) remains low, since it is a small part of the algorithm. Moreover, the comparison of neighbouring units is calculated at the same time to enable further parallelism. The SiLago architecture allows resource sharing in a window fashion to communicate the results at the end. The only remaining drawback is that the computation of an individual neuron, which is executed serially inside every block, is not parallelized. But with the present algorithm, most of the calculations remain simple and serial by nature. If the possibility arises in the future, the number of adjoining blocks can be increased to accelerate the independently parallelizable part.

The above analysis shows that the SOM application for DNA sequencing is inherently better supported by the SiLago platform than by a computationally generic GPU, and can be engineered more easily than an ASIC architecture. This is the inherent advantage of the SiLago methodology over the other approaches.

For(1 of many S)
    For(1 of many M)
        For(B Neuron of N)
            For(1 no. of Weights of G(in individual blocks))
                if(prev. best)
                    update weight
                end
                calc distance
                if(best(local to block))
                    store index+SOM
                end
            end
            if(best(global to block))
                store index+SOM
            end
        end
    end
end

The complete algorithm for two SiLago blocks can be found in Appendix A. Each block computes the data of one neuron in parallel with the other block. The thesis aims to compare the implementation of the algorithm on one and two SiLago blocks to observe the efficiency and scalability. From this, the performance of the algorithm can be predicted for a higher number of blocks.
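The neuron-per-block scheme can be illustrated with a small software model. The following is only a sketch under stated assumptions, not the scheduled SiLago code: the function names (`squared_distance`, `find_best_neuron`, `steps_needed`) are hypothetical, the distance is assumed to be a sum of squared differences, and the B blocks are modelled by an ordinary loop over a group of B neurons followed by a global comparison.

```python
from typing import List, Tuple


def squared_distance(weights: List[int], inputs: List[int]) -> int:
    """Serial work done inside one block: distance of one neuron to the input."""
    return sum((w - x) ** 2 for w, x in zip(weights, inputs))


def steps_needed(num_neurons: int, num_blocks: int) -> int:
    """Number of parallel steps when each block handles one neuron per step."""
    return -(-num_neurons // num_blocks)  # ceiling division


def find_best_neuron(neurons: List[List[int]], inputs: List[int],
                     num_blocks: int) -> Tuple[int, int]:
    """Return (index, distance) of the best matching neuron.

    Assumption: each group of `num_blocks` neurons is computed in one step,
    one neuron per block; the blocks then share only their local results.
    """
    best_index, best_dist = -1, None
    for start in range(0, len(neurons), num_blocks):
        group = range(start, min(start + num_blocks, len(neurons)))
        # Each entry models the result produced by one block in this step.
        local = [(squared_distance(neurons[n], inputs), n) for n in group]
        step_dist, step_index = min(local)  # comparison between the blocks
        if best_dist is None or step_dist < best_dist:
            best_index, best_dist = step_index, step_dist
    return best_index, best_dist
```

In this model the result is independent of the number of blocks; only the number of steps changes, which is the property that makes the scheme scale parametrically.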


3.4 Memory and data management

As we know, the coarse-grain SiLago architecture provides three levels of memory hierarchy. The first level is the register file (RFile), with a variable number of registers (generally 32). The RFile stores the data to be computed by the associated DPU. Each register in the RFile is 16 bits wide. As discussed earlier, the RFile, the DPU and a few other computation units constitute a DRRA block. The second level is SRAM, which stores data according to the needs of the DRRA block and the algorithm. Like the computational blocks, the SRAM blocks follow a CGRA layout, known as DiMarch. Finally, the third level is DRAM, which stores all the data in a defined format.

FIGURE 3.3: Block representation showing the memory architecture in the SiLago fabric; red is the DRAM, orange is an individual SRAM block, green is the DiMarch, which is composed of SRAM blocks, blue is the RFile, which together with the DPUs (black) makes a SiLago block, yellow is the DRRA, composed of SiLago blocks.

Given the aforementioned memory architecture, the aim is to minimize, in theory, the number of memory accesses to make the architecture energy efficient. The sizes of the SRAM and DRAM are defined by cost and functional constraints. First, the data size is calculated from the specifications. The calculations are parametric for easy modification:


3.4.1 DRAM

Let there be M input subsequences, each with C characters, T test sequences, S SOMs and N neurons per SOM. As an example, assume the specifications M=25, C=10, S=100, N=1024 and T=4.

1) Calculations for input subsequences: Each character requires 2 connections and each connection is represented by 2 bits (values from -1, 0, 1). So the input subsequences take a total of 4*M*C bits, which is 1000 bits.

2) Calculations for test subsequences: There are T test sequences, each with C characters. As in the previous calculation, the total number of bits required is 4*T*C, which is 160 bits.

3) Calculations for weights: As assumed in the specifications, the precision of the weights is 16 bits. The total number of bits required to store the weights is (S*N)*(2*C)*(16) = 32*S*C*N, which is 32768000 bits for the values above (the figure of 655360 bits corresponds to S=2).

Note that the values assumed for the variables were chosen only to show their extent. In reality, the formulas can be used to predict the size of the memory based on the algorithmic requirements.
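The three DRAM formulas can be evaluated with a few lines of code. The sketch below simply restates 4*M*C, 4*T*C and 32*S*C*N; the function name `dram_bits` and its keyword parameters are hypothetical and chosen for illustration only.

```python
def dram_bits(M: int, C: int, T: int, S: int, N: int,
              connections_per_char: int = 2,
              bits_per_connection: int = 2,
              weight_precision: int = 16) -> dict:
    """Evaluate the parametric DRAM size formulas (all results in bits)."""
    bits_per_char = connections_per_char * bits_per_connection  # 4 bits
    return {
        "input_subsequences": bits_per_char * M * C,
        "test_subsequences": bits_per_char * T * C,
        # One weight per connection: 2*C weights per neuron, S*N neurons.
        "weights": S * N * connections_per_char * C * weight_precision,
    }


example = dram_bits(M=25, C=10, T=4, S=100, N=1024)
# example["input_subsequences"] is 1000, example["test_subsequences"] is 160,
# example["weights"] is 32768000; with S=2 the weights take 655360 bits.
```

The weights dominate the total by several orders of magnitude, which is why the placement of weights in the hierarchy matters most for the number of memory accesses.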

3.4.2 Register File

Since the design is parametric, the SRAM and register file (RFile) space is calculated from the data required in one step of the parallelization. The RFile holds the data of a single step, so the calculations vary with the number of DRRA blocks. The data size for the RFile for different numbers of DRRA blocks is calculated below:

1) Single DRRA block(or one neuron at one time/ no parallelism):

The RFile needs to store the data for one neuron: one input subsequence of C characters (2*C inputs of 2 bits each), 2*C weights of 16-bit precision, and intermediate data for execution.

2) Two DRRA blocks( two degree parallelization):

In this case, there are two RFiles to compute two neurons individually, so the amount of input sequence and weight data remains the same in each RFile. The input subsequence is multicast to each RFile, as it is the same for both. The intermediate data may increase slightly, to store individual data as well as the global data needed to calculate and compare the global winning neuron.


In the next section, the different types of data stored inside the RFile are described in detail.

3.4.3 SRAM

The SRAM is used in ping-pong fashion to boost execution efficiency. This is done by filling the SRAM with the data for the forthcoming execution while the RFile computes on the current data. This technique is implemented to hide the slower data transfer rate between DRAM and SRAM as compared to that between SRAM and RFile. The size of the SRAM is approximately double the size needed for computations in the RFile:

1) Single Silago block:

In this case, it should be approximately double that of the RFile, that is, 4*C + 2*(2*C*16 + intermediate data) bits.

2) Two Silago blocks:

In this case, it is a little less than double that of the two RFiles together. This is because the subsequence remains the same for one complete training iteration. That is, 4*C + 4*(2*C*16 + intermediate data) bits.

Likewise, the SRAM size can also be calculated for a higher number of SiLago cells.
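The sizing rules for the RFile and the ping-pong SRAM can be written as a small parametric model. This is a sketch under assumptions: the names `rfile_bits` and `sram_bits` are hypothetical, the size of the intermediate data is left as a free parameter because the text does not quantify it, and the extension to B blocks (a factor 2*B on the per-neuron data) is an extrapolation of the one-block and two-block formulas above.

```python
def rfile_bits(C: int, intermediate_bits: int) -> int:
    """Bits held in one RFile for one neuron.

    4*C bits of input (2*C inputs of 2 bits), 2*C weights of 16 bits,
    plus intermediate data.
    """
    return 4 * C + 2 * C * 16 + intermediate_bits


def sram_bits(C: int, intermediate_bits: int, num_blocks: int = 1) -> int:
    """Ping-pong SRAM size in bits.

    The input subsequence (4*C bits) is stored once because it stays the
    same during a training iteration; the per-neuron data is doubled for
    ping-pong operation and multiplied by the number of blocks (assumed).
    """
    per_neuron = 2 * C * 16 + intermediate_bits
    return 4 * C + 2 * num_blocks * per_neuron


# Example with C = 10 and the intermediate data ignored (0 bits):
# rfile_bits(10, 0) is 360, sram_bits(10, 0, 1) is 680,
# sram_bits(10, 0, 2) is 1320.
```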

3.5 Architecture

This section describes the architecture definition of the hardware.

3.5.1 Memory

The memory size is calculated as in the previous section. As already mentioned, the SRAM is used in ping-pong fashion to hide the slower DRAM access. Also, it is assumed that the DRAM stores the converted and reduced data for the computations. For example, the algorithm operates on characters, which are translated into inputs to the weights. Each character requires two connections and hence two inputs. Each input is represented by two bits. This process is carried out in numerous steps for each iteration of the subsequent input. The subsequences stored in the DRAM are already converted into inputs for each connection. This saves expensive memory space. The reduced data set takes less access time and is hence faster. Also, the reduced number of bits translates to lower energy usage and hence a low-power design.

Note that each weight requires 16 bits, but each input to those weights from the subsequences requires only two bits. Thus, for 20 weights (10 characters), the 20 inputs (2 bits each) would need as many registers if stored one per register. This wastes considerable register space in the RFile, as only 2 of the 16 bits in each register are used to represent an input. As a solution, the input subsequences are packed together in the registers: a total of 40 bits needs 3 registers. As the SiLago platform does not allow fine-grain operations, two bits must be masked out in each iteration. Masking takes at least one cycle on the SiLago platform. The approach can be chosen according to the priority of performance versus area. We decided to spend one cycle on masking the bits for better packing.
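The packing of 2-bit inputs into 16-bit registers can be sketched as follows. The encoding is an assumption made for illustration: the inputs -1, 0 and 1 are stored as 2-bit two's complement codes (-1 becomes binary 11), and the first input of each register occupies the two most significant bits. The function name `pack_inputs` is hypothetical.

```python
from typing import List

REGISTER_BITS = 16
FIELD_BITS = 2
FIELDS_PER_REGISTER = REGISTER_BITS // FIELD_BITS  # 8 inputs per register


def pack_inputs(values: List[int]) -> List[int]:
    """Pack inputs from {-1, 0, 1} into 16-bit registers, 8 per register."""
    registers = []
    for start in range(0, len(values), FIELDS_PER_REGISTER):
        register = 0
        for pos, value in enumerate(values[start:start + FIELDS_PER_REGISTER]):
            if value not in (-1, 0, 1):
                raise ValueError("inputs must be -1, 0 or 1")
            code = value & 0b11  # 2-bit two's complement code (assumed)
            register |= code << (REGISTER_BITS - FIELD_BITS * (pos + 1))
        registers.append(register)
    return registers


# 20 inputs (10 characters) occupy 3 registers instead of 20.
packed = pack_inputs([1, -1, 0] + [0] * 17)
```

With this layout the last register is only half used (4 of 8 fields), which matches the count of 40 bits in 3 registers.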

Next, the structure of the RFile used to store the data is described. The contents are as follows:

1. Weights.

2. Input subsequence.

3. Local best.

4. Global best.

5. Local location.

6. Global location.

7. Current location.

8. Temporary variables.

9. Update factor.

10. Masked current input.

As can be seen from the above, a total of 34 registers are needed. The scratchpad size can be 8, 16, 32, 64 or 128 registers. The following cases are considered to decide upon an appropriate RFile size:

1) There is an overhead of 2 registers for scratchpad size of 32.


2) Only half the weights and subsequences are taken, i.e. an iteration is divided into two. With a scratchpad size of 16, the overhead of the other variables increases.

3) As in the previous case, but keeping the RFile size at 32: a considerable number of registers are empty, which wastes area and memory space.

4) With a scratchpad size of 64, operating on one neuron at a time: there is a lot of wastage, as in the previous case.

6) With a scratchpad size of 64, computing the data of two neurons: the overhead is reduced, as the temporary variables, global location/best, local location/best and the input subsequences are shared. Only the other variables are duplicated. This results in a total of 56 registers being used.

Of the above cases, case 6 is theoretically the best in performance, while case 3 and case 4 are viable as cost-effective solutions. Case 4 is the simplest to implement, requiring the least synchronization effort. These options will be explored later for their ease of implementation with the Vesyla/SiLago platform.

3.5.2 DPU

As deduced from the reverse-engineered algorithm, the DPU needs to perform the following operations:

1) Addition/Subtraction operation: different modes of addition and subtrac-tion operation.

2) Comparator: for branches.

3) Shift right: for neighbourhood update process(as a replacement of expo-nential function).

4) Shift for masking: for appropriate packing and utilization of memoryspace.

5) Multiply-Accumulate(MAC): for better multiply-summation performanceand utilization of memory space.

6) Negation: for calculating neighbourhood function.
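Operation 3 in the list above, the right shift used in place of the exponential function, can be illustrated with a short model. It follows the update sequence that appears in Appendix A (difference, arithmetic right shift by the distance, subtraction); the function names are hypothetical, and the interpretation of the shift as a neighbourhood factor of 2 to the power of minus the distance is an assumption.

```python
def update_weight(weight: int, sample: int, distance: int) -> int:
    """Move a weight towards the input sample using only subtract and shift.

    Mirrors the sequence C(4) = W - seq; C(4) = bitsra(C(4), distance);
    W = W - C(4). Python's >> on integers is an arithmetic shift, like bitsra.
    """
    delta = weight - sample
    return weight - (delta >> distance)


def update_weight_reference(weight: float, sample: float, distance: int) -> float:
    """Floating-point reference with the factor 2**(-distance) (assumed form)."""
    return weight - (weight - sample) * 2.0 ** (-distance)


# The winning neuron (distance 0) jumps to the sample; neurons further away
# on the map move by a fraction that halves with every step of distance.
```

The shift-based version needs no multiplier and no exponential unit, which is why it fits the existing DPU without new hardware.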

As can be seen from the above requirements and the given DPU structure, the DPU needs to be updated with only one functionality: the shift functionality is modified to mask the input subsequence, which helps in saving area and memory space (cost). It costs only one cycle of performance, which is a fair trade-off.

FIGURE 3.4: A simple block representation showing DPU.

The masking can be done either by using the shift function or by implementing a multiplexer. We decided to use the shift, since both take only one cycle and the shift function is already implemented in the hardware; it only needs a small modification for this specific case. In addition, it does not change the hardware chain of functionality, so the verification effort is minimal.

There are two modes of shift operation available in the DPU, right shift and left shift. In this particular algorithm, the right shift is needed for implementing the neighbourhood update process, but the left shift serves no purpose. Thus, the left-shift mode can be modified to mask the bits.

The masking algorithm works as follows:


1. The input subsequence is loaded in the RFile.

2. Select one of the three input sequence registers.

3. The iteration count 'i', pertaining to the calculations associated with the weights and inputs, is stored.

4. Copy the input subsequence to the masking register. The masking register is shifted right by (16 - 2*i), where i = [1,8]. Afterwards, it is shifted left 14 times.

5. Repeat from step 3.

6. Repeat from step 2.
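Step 4 of the masking algorithm can be modelled on a 16-bit register as follows. This is a sketch of one reading of the text, not the DPU implementation: the register is shifted right by (16 - 2*i) and then left by 14 with truncation to 16 bits, which leaves input i in the two most significant bits. The helper names and the final signed interpretation are assumptions.

```python
MASK_16 = 0xFFFF


def mask_field(register: int, i: int) -> int:
    """Isolate 2-bit input i (1..8, counted from the most significant bits)."""
    if not 1 <= i <= 8:
        raise ValueError("i must be in the range 1..8")
    shifted_right = (register & MASK_16) >> (16 - 2 * i)
    # The left shift by 14 pushes all higher fields out of the 16-bit word.
    return (shifted_right << 14) & MASK_16


def to_signed16(word: int) -> int:
    """Interpret a 16-bit word as a two's complement number."""
    return word - 0x10000 if word & 0x8000 else word


def field_value(register: int, i: int) -> int:
    """Recover -1, 0 or 1, assuming a 2-bit two's complement encoding."""
    return to_signed16(mask_field(register, i)) >> 14


# Example: 0x7000 holds the inputs 1, -1, 0, 0, ... in its top fields.
# mask_field(0x7000, 1) is 0x4000 and mask_field(0x7000, 2) is 0xC000.
```

Under this reading, leaving the field in the top two bits preserves its sign when the word is treated as a signed 16-bit value, which may be why the text shifts left by 14 instead of masking the low bits.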

3.5.3 Vesyla

Vesyla, the compiler-cum-scheduler platform, is under development. Because the graph problems it solves are algorithmically complex, Vesyla needs appropriate attention. But, as already proven, it reduces verification effort and improves the scheduling of complex algorithms compared with manual scheduling. It also makes it easy to compare SiLago results with the MATLAB reference model. Thus, spending development effort on Vesyla rather than hand-coding the schedule was a well-justified decision.

There were many instances when Vesyla failed to execute or produced buggy schedules. We spent a considerable amount of time reproducing, backtracking and analyzing the bugs to improve the platform, specifically for the SOM application. At times, the Manus code that Vesyla should produce was written by hand. Although this was a tedious and error-prone task, it contributed to the development of the Vesyla platform, so that there would be no need to hand-code complex algorithms in future.

3.5.4 Scheduling

The MATLAB code and the instruction list for scheduling on two SiLago cells can be found in the appendix.


Chapter 4

Results

4.1 Verification

4.1.1 Functional verification

Functional verification is focused on checking that the DPU remains functionally correct after the changes made to it. There is no need to re-check the whole functionality of the DPU. This is because the SiLago architecture, and especially the DPU, is designed so that it can later incorporate changes in its modes of functionality to meet the custom requirements of a new algorithm. The new mode was carefully added without interfering with the other parts of the design. Also, any interference is automatically checked when the complete algorithm is scheduled from Vesyla and compared with the golden reference model.

The custom testbench for checking the DPU mode is also created using Vesyla. It automatically creates a testbench from the schedule and cross-checks the complex algorithm against the reference model.

The result report of the testbench is clean and without any errors. The results from the testbench and from the algorithm can be found in Appendix B and Appendix C, respectively.

4.1.2 Scheduling report

The total number of cycles with 2 SiLago blocks was reported to be 816 cycles (10220 ns). As part of future work, it would be worthwhile to see how performance actually improves as the number of blocks doubles. The scheduling summary can be found in Appendix D.


4.1.3 Silago results and reference model

The scheduling instruction report is close to the scheduling by the testbench. The scheduling report is created by a low-level scheduling script, known as the Manus script. The Manus script is in turn created from high-level MATLAB code through the Vesyla interface. The algorithm scheduled with Vesyla executes in the correct order at all levels of abstraction, from MATLAB to the verification testbench.

The verification testbench creates a result file by using the SiLago platform. The MATLAB script, when executed in MATLAB, creates the reference result. The two sets of results matched exactly for the algorithm, which validates the custom-designed hardware.

4.2 Logic synthesis

Logic synthesis is the step in which the RTL description of a design is converted to an optimized gate-level representation. Logic synthesis uses a standard cell library for the conversion. The library contains information on logic gates such as AND, OR and NOR, as well as on adders, multiplexers and flip-flops. The translation from RTL to a gate-level netlist is done by industry-standard synthesis tools, which also optimize the design as a whole. Logic synthesis tools allow technology-independent design, and design reuse is possible for technology-independent descriptions. SiLago uses a 40 nm library; in this thesis, the library tcbn40lpbwptc is used for logic synthesis.

The typical script for performing logic synthesis specific to the SiLago methodology is structured as follows:

• Set Directory

• Set Environment

• Define the libraries and search path

• Analyze and read design

• Elaborate the design

• Set timing exceptions. This includes setting operating conditions, wire load mode, wire load selection group, false path.

• Set Clock. Which includes clock name, period and waveform.


• Compile map effort

• Create area, power and timing reports.

• Create netlist file and support file such as .sdf, .sdc, .ddc.

The complete tcl file for logic synthesis can be found in A.8.

The following subsections describe the outcomes of the synthesis, i.e. the timing, power and area reports. To verify that synthesis was done properly, it is important to analyze these reports, along with the netlist file, which is verified as well.

4.2.1 Timing report

The timing report clearly indicates a timing-clean design. The slack was found to be 0.01, which is acceptable. The positive slack indicates that timing is met and the signal arrives before it is required.

The detailed timing report can be found in appendix g.

4.2.2 Power report

The power report shows power values in line with expectations from previous design experience with SiLago cells, within acceptable margins. This verifies the power report as well. The details are as follows:

• Internal : 3.4409 mW

• Switching : 0.2305 mW

• Leakage : 3.1191e+03 nW

• Total : 3.6745 mW

The detailed power report can be found in appendix f.
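As a quick consistency check, the reported total can be recomputed from its components. The only subtlety is the unit of the leakage power, which is reported in nW while the other values are in mW. The variable names below are chosen for illustration.

```python
# Values copied from the power report above.
internal_mw = 3.4409
switching_mw = 0.2305
leakage_nw = 3.1191e+03

# 1 nW = 1e-6 mW, so the leakage contributes about 0.0031 mW.
leakage_mw = leakage_nw * 1e-6
total_mw = internal_mw + switching_mw + leakage_mw

# total_mw is about 3.6745 mW, in agreement with the reported total.
leakage_share = leakage_mw / total_mw  # below 0.1 % of the total
```

The check also shows that the design is dominated by internal power, with leakage contributing less than 0.1 % of the total.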

4.2.3 Area report

The area report shows area values in line with expectations from previous design experience with SiLago cells, within acceptable margins. The black-box cell count and area are reported as 0, which verifies the area report. The details are as follows:


• Combinational area: 24026.562259

• Buf/Inv area: 3176.964100

• Noncombinational area: 13988.696388

• Macro/Black Box area: 0.000000

• Total cell area: 38015.258646

The detailed area report can be found in appendix e.


Chapter 5

Comparative analysis

Artificial neural networks are a booming area of research. With their highly parallel, self-analytic ability, many organizations are actively engaged in the development of generic processors specific to ANNs. Intel recently expanded its organization to introduce ANN-capable processors, produced by Intel Nervana, in its lineup. These are produced with the ASIC methodology, and an extensive architectural report is not yet public. Similarly, Nvidia is well ahead of its competitors in terms of technology support for ANNs. But, as already discussed, it does not have any SOM-specific architecture. Its GPUs are currently the most powerful ones on the market in terms of performance. So, it is interesting to compare the performance of the algorithm with a GPU implementation.

5.0.1 Survey of TPU architecture

Recently, Google updated its Tensor Processing Unit (TPU), an embarrassingly parallel architecture for ANNs [3], as can be seen from figure 5.1.

Interestingly, the TPU instruction set is based on the complex instruction set computer (CISC) style. A CISC design focuses on implementing high-level instructions that run more complex tasks (such as calculating multiply-and-add many times) with each instruction. This feature fits the needs of ANN algorithms. Moreover, it is an adjustable-precision architecture, which allows application-specific data fitting. The working process of the TPU is shown in figure 5.2. The company has published only brief details of this architecture, and lets users program it through its cloud-based service. This is a downside for the BioSOM project in two ways: firstly, individual-specific DNA data is vulnerable to security threats, and secondly, the cloud-based service has high latency.

FIGURE 5.1: Architectural floor plan of the TPU

FIGURE 5.2: Working of the TPU with accumulation of results to enable variable precision

FIGURE 5.3: Massively parallel SOM architecture by Prof. M. Porrmann et al.

5.0.2 An ASIC based architecture

An interesting architecture specific to the SOM algorithm is presented by Prof. M. Porrmann et al. in "A massively parallel architecture for self-organizing feature maps" [19], as shown in figure 5.3.

The architecture is quite close to the expectations of a traditional SOM algorithm. It includes data pre- and post-processing functions, a crucial and often ignored aspect of neurocomputer design. SiLago also provides this functionality, which this thesis used effectively. SiLago potentially outperforms this architecture in terms of custom design effort, as this architecture is made using the ASIC methodology. There are several other noteworthy architectures, both presented and under development, in the technical community. It will be interesting to see how these platforms shape up in future.

5.0.3 BioSOM on GPU

The BioSOM was implemented in the CUDA framework on an Nvidia GTX 1070 with 64 GB DRAM (this work is not part of the thesis). For further comparison, the performance is compared across different numbers of parallel streams and neurons. Notably, the latency was reported as 51 and the energy as 571 J for 1024 neurons. The SiLago implementation is reported to be 80x better in latency, with 4 orders of magnitude improvement in energy. The reason for this large difference is the custom architecture of SiLago (both the SiLago blocks and the memory) as opposed to a general GPU architecture.


Chapter 6

Conclusion

The thesis presents a custom hardware architecture for an ambitious project, BioSOM, aimed at DNA sequencing. The hardware is based on the SiLago platform, which is a coarse-grain reconfigurable architecture. With the completion of this thesis work, the platform proves its potential to handle highly parallel neural network algorithms. It also demonstrates its ability to create a working custom architecture with far less effort than the ASIC methodology, without compromising on design trade-offs. The logic synthesis generates an error-free netlist along with a timing-clean report, with acceptable area and power. As can be seen from the thesis work, the verification effort was minimal, whereas it is generally a major burden with ASIC-based platforms. The verification effort is supported by the Vesyla platform, which is a compiler-cum-scheduler. The verification covers both functional verification and comparison against the golden reference.

Due to the restricted number of solutions at the lower abstraction levels (RTL and below), it was quite easy to incorporate algorithmic optimizations. These optimizations lead to at least a 2x improvement in performance on the SiLago platform over the unoptimized version. Furthermore, a few other optimizations, such as the mathematical conversion of the exponential function to a shift function, simplify the design while saving area, time and energy. The area-saving masking mechanism demonstrates the adaptability of the SiLago architecture to compute efficiently at multiple precisions.

With the motivating result from the demo algorithm, the thesis work shows the way forward for the BioSOM project towards a complete working prototype. It also provides important feedback to the algorithm and memory design groups to optimize their platforms. In particular, the algorithm can now possibly incorporate a DNN architecture as well; the platform is already proven for DNN-based algorithms from other related project work. To sum up, the project work provides a sound motivation for the complete product development effort of an ambitious project for the welfare of a progressive society.


Chapter 7

Future work

With the completion of the thesis work, the inter-dependent work must ac-celerate. This includes:

• Addition of a top-level script to MATLAB that can automate easy parametric algorithm generation.

• Schedule algorithm to higher number of Silago blocks for scalabilityanalysis.

• Scheduling with Vesyla is limited and under development.

• Possible GPU, custom ASIC and FPGA comparison.

• Test and realistic data specifications for the algorithm. Addition ofDNN to the Algorithm.

• Update of the capabilities of the algorithm to enable more powerful features and applications.

• Inputs from other parts of the ambitious project such as memory tech-nology.

• Acceleration to project BioSOM as a whole.


Appendix A

Appendix A

A.1 Appendix A

Appendix A lists the MATLAB algorithm for the BioSOM implementation on SiLago. The algorithm executes on two SiLago blocks in parallel.

% DNA sequencing algorithm through Self Organizing Algorithms
% Algorithm translated from C++ to Matlab for Vesyla
% Translated by Prashant Sharma
% For the Silago Platform
% Concept-parallelized for 2 Silago blocks

W1 = [1, 1, 1, 1]; %! RFILE<> [0,0]
% regfile space for storing weights of a specific neuron on
% silago block(0,0)
W2 = [1, 1, 1, 1]; %! RFILE<> [0,1]
% regfile space for storing weights of a specific neuron on
% silago block(0,1)
In1 = [1]; %! RFILE<> [0,0]
% input seq neuron on silago block(0,0)
In2 = [1]; %! RFILE<> [0,1]
% input seq neuron on silago block(0,1)
C = [1 1 1 0 1 0]; %! RFILE<> [0,0]
% [curr. prev_best. seq_t temp2] store weight sum sq
D = [1 1 1 0 1 0]; %! RFILE<> [0,1]
% [curr. prev_best. seq_t temp2]
Cd = [0 1 0 0 1 0 0 1]; %! RFILE<> [0,0]
% [global_best_location current_location physical_distance
%  loop_ctrl_seq_unwrap neighbourhood
%  notused temp_find_best notused]
Dd = [0 1 0 0 1 0 0 1]; %! RFILE<> [0,1]
% [global_best_location current_location physical_distance
%  loop_ctrl_seq_unwrap neighbourhood notused temp_find_best
%  notused]

for i = 1:2
    % pick new SOM
    for j = 1:2
        % pick new neuron
        Cd(3)=sum(abs(Cd(2)-Cd(1))); %! DPU [0,0]
        % compute difference between locations
        Dd(3)=sum(abs(Dd(2)-Dd(1))); %! DPU [0,1]
        % repeat for other block
        if Cd(3)>8 %! DPU [0,0]
            % check if other path of the ring is shorter
            Cd(3)=Cd(3)+16; %! DPU [0,0]
            % store new path
        end
        if Dd(3)>8 %! DPU [0,1]
            Dd(3)=Dd(3)+16; %! DPU [0,1]
        end
        for l=1:2
            % load from SRAM to RFILE for input sequence.
            Cd(4) = 2; %! DPU [0,0]
            Dd(4) = 2; %! DPU [0,1]
            for k=1:2 % each weight
                Cd(4)= Cd(4)+0; %! DPU [0,0]
                % *2 increment position
                Dd(4)= Dd(4)+0; %! DPU [0,1]
                C(3) =bitsll(In1(1), Cd(4)); %! DPU [0,0]
                % 1 seq unwrapping from input
                D(3) =bitsll(In2(1), Dd(4)); %! DPU [0,1]
                if Cd(3)< Cd(5) %! DPU [0,0]
                    % check for previous best neuron and then update
                    C(4) = W1(k) - C(3); %! DPU [0,0]
                    % equ. (data-seq)
                    C(4) = bitsra(C(4),Cd(3)); %! DPU [0,0]
                    % equ. bitsra(a,2); %! DPU [0,0]
                    % alpha
                    W1(k) = W1(k) - C(4); %! DPU [0,0]
                    % equ.
                end
                if Dd(3)< Dd(5) %! DPU [0,1]
                    D(4) = W2(k) - D(3); %! DPU [0,1]
                    D(4) = bitsra(D(4),Dd(3)); %! DPU [0,1]
                    W2(k) = W2(k) - D(4); %! DPU [0,1]
                end
                W1(k)=W1(k)-C(3); %! DPU [0,0]
                % equ. (W-seq)
                W2(k)=W2(k)-D(3); %! DPU [0,1]
            end
            C(6)=sum(W1.*W1); %! DPU [0,0]
            % distance
            C(1) = C(1)+C(6); %! DPU [0,0]
            % distance
            D(6)=sum(W2.*W2); %! DPU [0,1]
            D(1) = D(1)+D(6); %! DPU [0,1]
        end
        if C(1) < C(2) %! DPU [0,0]
            % find best from current Silago block
            C(2)=C(1);
            % save value
            Cd(7)=Cd(2);
            % save location
        end
        if D(1) < D(2) %! DPU [0,1]
            D(2)=D(1);
            Dd(7)=Dd(2);
        end
    end
    if C(2) > D(2) %! DPU [0, 0:1]
        % find best from both blocks(global best),
        % window size is 2.
        Cd(1)=Dd(7);
        Dd(1)=Dd(7);
    else
        % repeat for opposite case
        Cd(1)=Cd(7);
        % save value
        Dd(1)=Cd(7);
        % cast to other block
    end
end


A.2 Appendix B

Appendix B contains the SystemVerilog output from SiLago. Note that the results must be read horizontally, line by line.

Cd( 3)= 1 Dd( 3)= 1 Cd( 4)= 2

Dd( 4)= 2 Cd( 4)= 2 Dd( 4)= 2

C( 3)= 0 D( 3)= 0 W1( 1)= 1

W2( 1)= 1 Cd( 4)= 2 Dd( 4)= 2

C( 3)= 0 D( 3)= 0 W1( 2)= 1

W2( 2)= 1 C( 6)= 4 C( 1)= 5

D( 6)= 4 D( 1)= 5 Cd( 4)= 2

Dd( 4)= 2 Cd( 4)= 2 Dd( 4)= 2

C( 3)= 0 D( 3)= 0 W1( 1)= 1

W2( 1)= 1 Cd( 4)= 2 Dd( 4)= 2

C( 3)= 0 D( 3)= 0 W1( 2)= 1

W2( 2)= 1 C( 6)= 4 C( 1)= 9

D( 6)= 4 D( 1)= 9 Cd( 3)= 1

Dd( 3)= 1 Cd( 4)= 2 Dd( 4)= 2

Cd( 4)= 2 Dd( 4)= 2 C( 3)= 0

D( 3)= 0 W1( 1)= 1 W2( 1)= 1

Cd( 4)= 2 Dd( 4)= 2 C( 3)= 0

D( 3)= 0 W1( 2)= 1 W2( 2)= 1

C( 6)= 4 C( 1)=13 D( 6)= 4

D( 1)=13 Cd( 4)= 2 Dd( 4)= 2

Cd( 4)= 2 Dd( 4)= 2 C( 3)= 0

D( 3)= 0 W1( 1)= 1 W2( 1)= 1

Cd( 4)= 2 Dd( 4)= 2 C( 3)= 0

D( 3)= 0 W1( 2)= 1 W2( 2)= 1

C( 6)= 4 C( 1)=17 D( 6)= 4

D( 1)=17 Cd( 1)= 1 Dd( 1)= 1

Cd( 3)= 0 Dd( 3)= 0 Cd( 4)= 2

Dd( 4)= 2 Cd( 4)= 2 Dd( 4)= 2

C( 3)= 0 D( 3)= 0 C( 4)= 1

C( 4)= 1 W1( 1)= 0 D( 4)= 1

D( 4)= 1 W2( 1)= 0 W1( 1)= 0

W2( 1)= 0 Cd( 4)= 2 Dd( 4)= 2

C( 3)= 0 D( 3)= 0 C( 4)= 1

C( 4)= 1 W1( 2)= 0 D( 4)= 1


D( 4)= 1 W2( 2)= 0 W1( 2)= 0

W2( 2)= 0 C( 6)= 2 C( 1)=19

D( 6)= 2 D( 1)=19 Cd( 4)= 2

Dd( 4)= 2 Cd( 4)= 2 Dd( 4)= 2

C( 3)= 0 D( 3)= 0 C( 4)= 0

C( 4)= 0 W1( 1)= 0 D( 4)= 0

D( 4)= 0 W2( 1)= 0 W1( 1)= 0

W2( 1)= 0 Cd( 4)= 2 Dd( 4)= 2

C( 3)= 0 D( 3)= 0 C( 4)= 0

C( 4)= 0 W1( 2)= 0 D( 4)= 0

D( 4)= 0 W2( 2)= 0 W1( 2)= 0

W2( 2)= 0 C( 6)= 2 C( 1)=21

D( 6)= 2 D( 1)=21 Cd( 3)= 0

Dd( 3)= 0 Cd( 4)= 2 Dd( 4)= 2

Cd( 4)= 2 Dd( 4)= 2 C( 3)= 0

D( 3)= 0 C( 4)= 0 C( 4)= 0

W1( 1)= 0 D( 4)= 0 D( 4)= 0

W2( 1)= 0 W1( 1)= 0 W2( 1)= 0

Cd( 4)= 2 Dd( 4)= 2 C( 3)= 0

D( 3)= 0 C( 4)= 0 C( 4)= 0

W1( 2)= 0 D( 4)= 0 D( 4)= 0

W2( 2)= 0 W1( 2)= 0 W2( 2)= 0

C( 6)= 2 C( 1)=23 D( 6)= 2

D( 1)=23 Cd( 4)= 2 Dd( 4)= 2

Cd( 4)= 2 Dd( 4)= 2 C( 3)= 0

D( 3)= 0 C( 4)= 0 C( 4)= 0

W1( 1)= 0 D( 4)= 0 D( 4)= 0

W2( 1)= 0 W1( 1)= 0 W2( 1)= 0

Cd( 4)= 2 Dd( 4)= 2 C( 3)= 0

D( 3)= 0 C( 4)= 0 C( 4)= 0

W1( 2)= 0 D( 4)= 0 D( 4)= 0

W2( 2)= 0 W1( 2)= 0 W2( 2)= 0

C( 6)= 2 C( 1)=25 D( 6)= 2

D( 1)=25 Cd( 1)= 1 Dd( 1)= 1


A.3 Appendix C

Appendix C contains the output of the algorithm upon execution in MATLAB. Note that the results must be read horizontally, line by line.

Cd( 3)= 1 Dd( 3)= 1 Cd( 4)= 2

Dd( 4)= 2 Cd( 4)= 2 Dd( 4)= 2

C( 3)= 0 D( 3)= 0 W1( 1)= 1

W2( 1)= 1 Cd( 4)= 2 Dd( 4)= 2

C( 3)= 0 D( 3)= 0 W1( 2)= 1

W2( 2)= 1 C( 6)= 4 D( 6)= 4

C( 1)= 5 D( 1)= 5 Cd( 4)= 2

Dd( 4)= 2 Cd( 4)= 2 Dd( 4)= 2

C( 3)= 0 D( 3)= 0 W1( 1)= 1

W2( 1)= 1 Cd( 4)= 2 Dd( 4)= 2

C( 3)= 0 D( 3)= 0 W1( 2)= 1

W2( 2)= 1 C( 6)= 4 D( 6)= 4

C( 1)= 9 D( 1)= 9 Cd( 3)= 1

Dd( 3)= 1 Cd( 4)= 2 Dd( 4)= 2

Cd( 4)= 2 Dd( 4)= 2 C( 3)= 0

D( 3)= 0 W1( 1)= 1 W2( 1)= 1

Cd( 4)= 2 Dd( 4)= 2 C( 3)= 0

D( 3)= 0 W1( 2)= 1 W2( 2)= 1

C( 6)= 4 D( 6)= 4 C( 1)=13

D( 1)=13 Cd( 4)= 2 Dd( 4)= 2

Cd( 4)= 2 Dd( 4)= 2 C( 3)= 0

D( 3)= 0 W1( 1)= 1 W2( 1)= 1

Cd( 4)= 2 Dd( 4)= 2 C( 3)= 0

D( 3)= 0 W1( 2)= 1 W2( 2)= 1

C( 6)= 4 D( 6)= 4 C( 1)=17

D( 1)=17 Cd( 1)= 1 Dd( 1)= 1

Cd( 3)= 0 Dd( 3)= 0 Cd( 4)= 2

Dd( 4)= 2 Cd( 4)= 2 Dd( 4)= 2

C( 3)= 0 D( 3)= 0 C( 4)= 1

D( 4)= 1 C( 4)= 1 D( 4)= 1

W1( 1)= 0 W2( 1)= 0 W1( 1)= 0

W2( 1)= 0 Cd( 4)= 2 Dd( 4)= 2

C( 3)= 0 D( 3)= 0 C( 4)= 1

D( 4)= 1 C( 4)= 1 D( 4)= 1


W1( 2)= 0 W2( 2)= 0 W1( 2)= 0

W2( 2)= 0 C( 6)= 2 D( 6)= 2

C( 1)=19 D( 1)=19 Cd( 4)= 2

Dd( 4)= 2 Cd( 4)= 2 Dd( 4)= 2

C( 3)= 0 D( 3)= 0 C( 4)= 0

D( 4)= 0 C( 4)= 0 D( 4)= 0

W1( 1)= 0 W2( 1)= 0 W1( 1)= 0

W2( 1)= 0 Cd( 4)= 2 Dd( 4)= 2

C( 3)= 0 D( 3)= 0 C( 4)= 0

D( 4)= 0 C( 4)= 0 D( 4)= 0

W1( 2)= 0 W2( 2)= 0 W1( 2)= 0

W2( 2)= 0 C( 6)= 2 D( 6)= 2

C( 1)=21 D( 1)=21 Cd( 3)= 0

Dd( 3)= 0 Cd( 4)= 2 Dd( 4)= 2

Cd( 4)= 2 Dd( 4)= 2 C( 3)= 0

D( 3)= 0 C( 4)= 0 D( 4)= 0

C( 4)= 0 D( 4)= 0 W1( 1)= 0

W2( 1)= 0 W1( 1)= 0 W2( 1)= 0

Cd( 4)= 2 Dd( 4)= 2 C( 3)= 0

D( 3)= 0 C( 4)= 0 D( 4)= 0

C( 4)= 0 D( 4)= 0 W1( 2)= 0

W2( 2)= 0 W1( 2)= 0 W2( 2)= 0

C( 6)= 2 D( 6)= 2 C(1)=23

D( 1)=23 Cd( 4)= 2 Dd( 4)= 2

Cd( 4)= 2 Dd( 4)= 2 C( 3)= 0

D( 3)= 0 C( 4)= 0 D( 4)= 0

C( 4)= 0 D( 4)= 0 W1( 1)= 0

W2( 1)= 0 W1( 1)= 0 W2( 1)= 0

Cd( 4)= 2 Dd( 4)= 2 C( 3)= 0

D( 3)= 0 C( 4)= 0 D( 4)= 0

C( 4)= 0 D( 4)= 0 W1( 2)= 0

W2( 2)= 0 W1( 2)= 0 W2( 2)= 0

C( 6)= 2 D( 6)= 2 C( 1)=25

D( 1)=25 Cd( 1)= 1 Dd( 1)= 1


A.4 Appendix D

The appendix D contains the scheduling information of the instructions. Thisfile is further used for the automatic creation of the VHDL testbench.

.DATA

$W1 FULL_DISTR [<0,0>] [1, 1, 1, 1]

$W2 FULL_DISTR [<0,1>] [1, 1 ,1, 1]

$In1 FULL_DISTR [<0,0>] [1]

$In2 FULL_DISTR [<0,1>] [1]

$C FULL_DISTR [<0,0>] [1, 1, 1, 0, 1, 0]

$D FULL_DISTR [<0,1>] [1, 1, 1, 0, 1, 0]

$Cd FULL_DISTR [<0,0>] [0, 1, 0, 0, 1, 0, 0, 1]

$Dd FULL_DISTR [<0,1>] [0, 1, 0, 0, 1, 0, 0, 1]

.CODE

CELL <0,0>

# for i=1:2

# for j=1:2

LOOPHEADER 0, 1, 0, 2 /* 0 */

LOOPHEADER 1, 1, 0, 2 /* 1 */

# Cd(3)=abs(Cd(2)-Cd(1))

SWB REFI, 0, 3, LATA, 0, 0 /* 2 */

SWB REFI, 0, 2, LATA, 0, 1 /* 3 */

SWB LATA, 0, 1, REFI, 0, 1 /* 4 */

SWB LATA, 0, 0, REFI, 0, 0 /* 5 */

REFI_1 3, 0, 0, 12, 0, 0, 0, 2 /* 6 */

REFI_1 2, 0, 0, 11, 0, 0, 0, 1 /* 7 */

DPU 7, 0, 3, 3, 1, 0, 0, 0 /* 8 */

REFI_1 0, 0, 0, 13, 0, 0, 0, 0 /* 9 */

# if Cd(3)>8

# Cd(3)=Cd(3)+16

# end

REFI_1 3, 0, 0, 13, 0, 0, 0, 1 /* 10 */

DPU 13, 0, 3, 3, 1, 0, 8, 0 /* 11 */

BRANCH 1, 17 /* 12 */

REFI_1 3, 0, 0, 13, 0, 0, 0, 1 /* 13 */

DPU 12, 2, 3, 3, 1, 0, 16, 0 /* 14 */

REFI_1 1, 0, 0, 13, 0, 0, 0, 0 /* 15 */


JUMP 18 /* 16 */

DELAY 0, 3 /* 17 */

DELAY 0, 0 /* 18 */

# for l=1:2

LOOPHEADER 2, 1, 0, 2 /* 19 */

# Cd(4) = 2

DPU 12, 0, 3, 3, 1, 0, 2, 0 /* 20 */

REFI_1 0, 0, 0, 14, 0, 0, 0, 0 /* 21 */

# for k=1:2

RACCU 3, 0, 0, 0, 0, 0 /* 22 */

LOOPHEADER 3, 1, 0, 2 /* 23 */

# Cd(4) = Cd(4) + 0

REFI_1 3, 0, 0, 14, 0, 0, 0, 1 /* 24 */

DPU 12, 2, 3, 3, 1, 0, 0, 0 /* 25 */

REFI_1 1, 0, 0, 14, 0, 0, 0, 0 /* 26 */

# C(3) =bitsll(In1(1), Cd(4));

REFI_1 3, 0, 0, 4, 0, 0, 0, 2 /* 27 */

REFI_1 2, 0, 0, 14, 0, 0, 0, 1 /* 28 */

DPU 21, 0, 3, 3, 1, 0, 0, 0 /* 29 */

REFI_1 0, 0, 0, 7, 0, 0, 0, 0 /* 30 */

# if Cd(3)< Cd(5)

# C(4) = W1(k) - C(3)

# C(4) = bitsra(C(4),Cd(3))

# W1(k) = W1(k) - C(4)

# end

REFI_1 3, 0, 0, 15, 0, 0, 0, 2 /* 31 */

REFI_1 2, 0, 0, 13, 0, 0, 0, 1 /* 32 */

DPU 11, 0, 3, 3, 1, 0, 0, 0 /* 33 */

BRANCH 1, 48 /* 34 */

REFI_1 3, 0, 1, 0, 0, 0, 0, 2 /* 35 */

REFI_1 2, 0, 0, 7, 0, 0, 0, 1 /* 36 */

DPU 10, 2, 3, 3, 1, 0, 0, 0 /* 37 */

REFI_1 1, 0, 0, 8, 0, 0, 0, 0 /* 38 */

REFI_1 3, 0, 0, 8, 0, 0, 0, 2 /* 39 */


REFI_1 2, 0, 0, 13, 0, 0, 0, 1 /* 40 */

DPU 9, 0, 3, 3, 1, 0, 0, 0 /* 41 */

REFI_1 1, 0, 0, 8, 0, 0, 0, 0 /* 42 */

REFI_1 3, 0, 1, 0, 0, 0, 0, 2 /* 43 */

REFI_1 2, 0, 0, 8, 0, 0, 0, 1 /* 44 */

DPU 10, 2, 3, 3, 1, 0, 0, 0 /* 45 */

REFI_1 1, 0, 1, 0, 0, 0, 0, 0 /* 46 */

JUMP 49 /* 47 */

DELAY 0, 12 /* 48 */

DELAY 0, 0 /* 49 */

# W1(k)=W1(k)-C(3)

REFI_1 3, 0, 1, 0, 0, 0, 0, 2 /* 50 */

REFI_1 2, 0, 0, 7, 0, 0, 0, 1 /* 51 */

DPU 10, 2, 3, 3, 1, 0, 0, 0 /* 52 */

REFI_1 1, 0, 1, 0, 0, 0, 0, 0 /* 53 */

RACCU 3, 1, 0, 0, 1, 0 /* 54 */

LOOPTAIL 1, 23, 3 /* 55 */

# C(6) = sum(W1.*W1)

# C(1) = C(1) + C(6)

SWB REFI, 0, 2, LATA, 0, 2 /* 56 */

REFI_1 3, 0, 0, 0, 0, 3, 0, 2 /* 57 */

REFI_1 2, 0, 0, 0, 0, 3, 0, 1 /* 58 */

DPU 2, 2, 3, 3, 1, 0, 3, 0 /* 59 */

REFI_1 0, 0, 0, 10, 0, 0, 0, 4 /* 60 */

DELAY 0, 3 /* 61 */

SWB REFI, 0, 3, LATA, 0, 3 /* 62 */

REFI_1 2, 0, 0, 5, 0, 0, 0, 2 /* 63 */

REFI_1 3, 0, 0, 10, 0, 0, 0, 1 /* 64 */

DPU 10, 2, 3, 3, 1, 0, 0, 0 /* 65 */

REFI_1 0, 0, 0, 5, 0, 0, 0, 0 /* 66 */

LOOPTAIL 1, 18, 2 /* 67 */

# if C(1) < C(2)

# C(2) = C(1)

# Cd(7) = Cd(2)

# end

SWB REFI, 0, 2, LATA, 0, 0 /* 68 */

SWB REFI, 0, 3, LATA, 0, 1 /* 69 */

REFI_1 3, 0, 0, 5, 0, 0, 0, 2 /* 70 */


REFI_1 2, 0, 0, 6, 0, 0, 0, 1 /* 71 */

DPU 11, 0, 3, 3, 1, 0, 0, 0 /* 72 */

BRANCH 1, 81 /* 73 */

SWB REFI, 0, 3, REFI, 0, 1 /* 74 */

REFI_1 3, 0, 0, 5, 0, 0, 0, 1 /* 75 */

REFI_1 1, 0, 0, 6, 0, 0, 0, 0 /* 76 */

SWB REFI, 0, 2, REFI, 0, 0 /* 77 */

REFI_1 2, 0, 0, 12, 0, 0, 0, 1 /* 78 */

REFI_1 0, 0, 0, 17, 0, 0, 0, 0 /* 79 */

JUMP 82 /* 80 */

DELAY 0, 6 /* 81 */

DELAY 0, 0 /* 82 */

LOOPTAIL 1, 1, 1 /* 83 */

# if C(2) > D(2)

# Cd(1) = Dd(2)

# Dd(1) = Dd(2)

# else

# Cd(1) = Cd(2)

# Dd(1) = Cd(2)

# end

SWB REFI, 0, 3, LATA, 0, 0 /* 84 */

SWB REFI, 2, 3, LATA, 0, 1 /* 85 */

REFI_1 3, 0, 0, 6, 0, 0, 0, 1 /* 86 */

DPU 11, 0, 3, 3, 1, 0, 0, 0 /* 87 */

BRANCH 1, 93 /* 88 */

SWB REFI, 2, 2, REFI, 0, 1 /* 89 */

REFI_1 1, 0, 0, 11, 0, 0, 0, 1 /* 90 */

DELAY 0, 0 /* 91 */

JUMP 96 /* 92 */

SWB REFI, 0, 2, REFI, 0, 1 /* 93 */

REFI_1 2, 0, 0, 12, 0, 0, 0, 1 /* 94 */

REFI_1 1, 0, 0, 11, 0, 0, 0, 0 /* 95 */

DELAY 0, 0 /* 96 */

LOOPTAIL 1, 0, 0 /* 97 */

CELL <0,1>

LOOPHEADER 0, 1, 0, 2 /* 0 */

LOOPHEADER 1, 1, 0, 2 /* 1 */

SWB REFI, 2, 3, LATA, 2, 0 /* 2 */

SWB REFI, 2, 2, LATA, 2, 1 /* 3 */

SWB LATA, 2, 1, REFI, 2, 1 /* 4 */


SWB LATA, 2, 0, REFI, 2, 0 /* 5 */

REFI_1 3, 0, 0, 12, 0, 0, 0, 2 /* 6 */

REFI_1 2, 0, 0, 11, 0, 0, 0, 1 /* 7 */

DPU 7, 0, 3, 3, 1, 0, 0, 0 /* 8 */

REFI_1 0, 0, 0, 13, 0, 0, 0, 0 /* 9 */

REFI_1 3, 0, 0, 13, 0, 0, 0, 1 /* 10 */

DPU 13, 0, 3, 3, 1, 0, 8, 0 /* 11 */

BRANCH 1, 17 /* 12 */

REFI_1 3, 0, 0, 13, 0, 0, 0, 1 /* 13 */

DPU 12, 2, 3, 3, 1, 0, 16, 0 /* 14 */

REFI_1 1, 0, 0, 13, 0, 0, 0, 0 /* 15 */

JUMP 18 /* 16 */

DELAY 0, 3 /* 17 */

DELAY 0, 0 /* 18 */

LOOPHEADER 2, 1, 0, 2 /* 19 */

DPU 12, 0, 3, 3, 1, 0, 2, 0 /* 20 */

REFI_1 0, 0, 0, 14, 0, 0, 0, 0 /* 21 */

RACCU 3, 0, 0, 0, 0, 0 /* 22 */

LOOPHEADER 3, 1, 0, 2 /* 23 */

REFI_1 3, 0, 0, 14, 0, 0, 0, 1 /* 24 */

DPU 12, 2, 3, 3, 1, 0, 0, 0 /* 25 */

REFI_1 1, 0, 0, 14, 0, 0, 0, 0 /* 26 */

REFI_1 3, 0, 0, 4, 0, 0, 0, 2 /* 27 */

REFI_1 2, 0, 0, 14, 0, 0, 0, 1 /* 28 */

DPU 21, 0, 3, 3, 1, 0, 0, 0 /* 29 */

REFI_1 0, 0, 0, 7, 0, 0, 0, 0 /* 30 */

REFI_1 3, 0, 0, 15, 0, 0, 0, 2 /* 31 */

REFI_1 2, 0, 0, 13, 0, 0, 0, 1 /* 32 */

DPU 11, 0, 3, 3, 1, 0, 0, 0 /* 33 */

BRANCH 1, 48 /* 34 */

REFI_1 3, 0, 1, 0, 0, 0, 0, 2 /* 35 */

REFI_1 2, 0, 0, 7, 0, 0, 0, 1 /* 36 */

DPU 10, 2, 3, 3, 1, 0, 0, 0 /* 37 */

REFI_1 1, 0, 0, 8, 0, 0, 0, 0 /* 38 */

REFI_1 3, 0, 0, 8, 0, 0, 0, 2 /* 39 */

REFI_1 2, 0, 0, 13, 0, 0, 0, 1 /* 40 */

DPU 9, 0, 3, 3, 1, 0, 0, 0 /* 41 */

REFI_1 1, 0, 0, 8, 0, 0, 0, 0 /* 42 */


REFI_1 3, 0, 1, 0, 0, 0, 0, 2 /* 43 */

REFI_1 2, 0, 0, 8, 0, 0, 0, 1 /* 44 */

DPU 10, 2, 3, 3, 1, 0, 0, 0 /* 45 */

REFI_1 1, 0, 1, 0, 0, 0, 0, 0 /* 46 */

JUMP 49 /* 47 */

DELAY 0, 12 /* 48 */

DELAY 0, 0 /* 49 */

REFI_1 3, 0, 1, 0, 0, 0, 0, 2 /* 50 */

REFI_1 2, 0, 0, 7, 0, 0, 0, 1 /* 51 */

DPU 10, 2, 3, 3, 1, 0, 0, 0 /* 52 */

REFI_1 1, 0, 1, 0, 0, 0, 0, 0 /* 53 */

RACCU 3, 1, 0, 0, 1, 0 /* 54 */

LOOPTAIL 1, 23, 3 /* 55 */

SWB REFI, 2, 2, LATA, 2, 2 /* 56 */

REFI_1 3, 0, 0, 0, 0, 3, 0, 2 /* 57 */

REFI_1 2, 0, 0, 0, 0, 3, 0, 1 /* 58 */

DPU 2, 2, 3, 3, 1, 0, 3, 0 /* 59 */

REFI_1 0, 0, 0, 10, 0, 0, 0, 4 /* 60 */

DELAY 0, 3 /* 61 */

SWB REFI, 2, 3, LATA, 2, 3 /* 62 */

REFI_1 2, 0, 0, 5, 0, 0, 0, 2 /* 63 */

REFI_1 3, 0, 0, 10, 0, 0, 0, 1 /* 64 */

DPU 10, 2, 3, 3, 1, 0, 0, 0 /* 65 */

REFI_1 0, 0, 0, 5, 0, 0, 0, 0 /* 66 */

LOOPTAIL 1, 18, 2 /* 67 */

SWB REFI, 2, 2, LATA, 2, 0 /* 68 */

SWB REFI, 2, 3, LATA, 2, 1 /* 69 */

REFI_1 3, 0, 0, 5, 0, 0, 0, 2 /* 70 */

REFI_1 2, 0, 0, 6, 0, 0, 0, 1 /* 71 */

DPU 11, 0, 3, 3, 1, 0, 0, 0 /* 72 */

BRANCH 1, 81 /* 73 */

SWB REFI, 2, 3, REFI, 2, 1 /* 74 */

REFI_1 3, 0, 0, 5, 0, 0, 0, 1 /* 75 */

REFI_1 1, 0, 0, 6, 0, 0, 0, 0 /* 76 */

SWB REFI, 2, 2, REFI, 2, 0 /* 77 */

REFI_1 2, 0, 0, 12, 0, 0, 0, 1 /* 78 */

REFI_1 0, 0, 0, 17, 0, 0, 0, 0 /* 79 */

JUMP 82 /* 80 */

DELAY 0, 6 /* 81 */

DELAY 0, 0 /* 82 */


LOOPTAIL 1, 1, 1 /* 83 */

SWB REFI, 0, 3, LATA, 2, 0 /* 84 */

SWB REFI, 2, 3, LATA, 2, 1 /* 85 */

REFI_1 3, 0, 0, 6, 0, 0, 0, 1 /* 86 */

DPU 11, 0, 3, 3, 1, 0, 0, 0 /* 87 */

BRANCH 1, 93 /* 88 */

SWB REFI, 2, 2, REFI, 2, 1 /* 89 */

REFI_1 2, 0, 0, 12, 0, 0, 0, 1 /* 90 */

REFI_1 1, 0, 0, 11, 0, 0, 0, 0 /* 91 */

JUMP 96 /* 92 */

SWB REFI, 0, 2, REFI, 2, 1 /* 93 */

REFI_1 1, 0, 0, 11, 0, 0, 0, 1 /* 94 */

DELAY 0, 0 /* 95 */

DELAY 0, 0 /* 96 */

LOOPTAIL 1, 0, 0 /* 97 */
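The # comments in the CELL <0,0> listing carry the MATLAB-style pseudo-code that the instructions implement. Collected in one place, they can be sketched in Python as below. This is a transliteration under stated assumptions, not the author's reference model: the function name, the argument names and the 0-based list indexing are hypothetical, the loop bounds of 2 are taken from the "for ...=1:2" comments, and bitsll/bitsra are mapped to Python's << and >> on integers.

```python
def som_kernel_cell(W1, In1, C, Cd, D2, Dd2, n_i=2, n_j=2, n_l=2, n_k=2):
    """Transliteration of the commented pseudo-code for one cell.

    Lists are 0-based copies of the 1-based arrays in the listing,
    so C(1) is C[0], Cd(3) is Cd[2], and so on. D2 and Dd2 stand for
    D(2) and Dd(2), which belong to the neighbouring cell.
    """
    W1, C, Cd = list(W1), list(C), list(Cd)
    for _i in range(n_i):
        for _j in range(n_j):
            Cd[2] = abs(Cd[1] - Cd[0])          # Cd(3) = abs(Cd(2) - Cd(1))
            if Cd[2] > 8:
                Cd[2] += 16                     # Cd(3) = Cd(3) + 16
            for _l in range(n_l):
                Cd[3] = 2                       # Cd(4) = 2
                for k in range(n_k):
                    Cd[3] += 0                  # Cd(4) = Cd(4) + 0
                    C[2] = In1[0] << Cd[3]      # C(3) = bitsll(In1(1), Cd(4))
                    if Cd[2] < Cd[4]:           # if Cd(3) < Cd(5)
                        C[3] = W1[k] - C[2]     # C(4) = W1(k) - C(3)
                        C[3] >>= Cd[2]          # C(4) = bitsra(C(4), Cd(3))
                        W1[k] -= C[3]           # W1(k) = W1(k) - C(4)
                    W1[k] -= C[2]               # W1(k) = W1(k) - C(3)
                C[5] = sum(w * w for w in W1)   # C(6) = sum(W1 .* W1)
                C[0] += C[5]                    # C(1) = C(1) + C(6)
            if C[0] < C[1]:                     # if C(1) < C(2)
                C[1] = C[0]                     # C(2) = C(1)
                Cd[6] = Cd[1]                   # Cd(7) = Cd(2)
        # Final comparison between the two cells; the listing writes the
        # same selected value to Cd(1) and Dd(1).
        Cd[0] = Dd2 if C[1] > D2 else Cd[1]
    return W1, C, Cd


if __name__ == "__main__":
    # Illustrative inputs only; they are not taken from the .DATA section.
    W1, C, Cd = som_kernel_cell([0, 0, 1, 1], [0], [17, 0, 0, 0, 0, 0],
                                [1, 1, 0, 2, 1, 0, 0, 1], D2=0, Dd2=1, n_i=1)
    print(C[0], C[5], Cd[0])
```

CELL <0,1> runs the same instruction sequence on W2, In2, D and Dd, so the same sketch applies with the roles of the two cells swapped.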


A.5 Appendix E

Appendix E contains the area report produced by the logic synthesis of the cell.

****************************************

Report : area

Design : silego

Version: M-2016.12

Date : Fri Feb 2 17:58:22 2018

****************************************

Library(s) Used: tcbn40lpbwptc

(File: /mnt/storage2/stathis/full_40_nm_lib/

synopsys/tcbn40lpbwp_200a/tcbn40lpbwptc.db)

Number of ports: 13287

Number of nets: 35261

Number of cells: 22713

Number of combinational cells: 19257

Number of sequential cells: 3374

Number of macros/black boxes: 0

Number of buf/inv: 4639

Number of references: 10

Combinational area: 24026.562259

Buf/Inv area: 3176.964100

Noncombinational area: 13988.696388

Macro/Black Box area: 0.000000

Net Interconnect area: undefined (Wire load has zero net area)

Total cell area: 38015.258646

Total area: undefined

1
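As a quick consistency check on the report above, the combinational and noncombinational areas should add up to the total cell area. The sketch below assumes only the three figures printed in the report; the function name area_shares is hypothetical, and the values are left in the library's area units because the report does not state them.

```python
# Values copied from the area report above (library area units).
COMBINATIONAL = 24026.562259
NONCOMBINATIONAL = 13988.696388
TOTAL_CELL = 38015.258646


def area_shares(comb, noncomb):
    """Return (total, combinational share, noncombinational share)."""
    total = comb + noncomb
    return total, comb / total, noncomb / total


if __name__ == "__main__":
    total, comb_share, seq_share = area_shares(COMBINATIONAL, NONCOMBINATIONAL)
    print(f"sum of parts   : {total:.6f}")
    print(f"reported total : {TOTAL_CELL:.6f}")
    print(f"combinational  : {comb_share:.1%}")
    print(f"sequential     : {seq_share:.1%}")
```

The buf/inv area is a subset of the combinational area, so it is not added separately.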


A.6 Appendix F

Appendix F contains the power report produced by the logic synthesis of the cell.

Loading db file

’/mnt/storage2/stathis/full_40_nm_lib/

synopsys/tcbn40lpbwp_200a/tcbn40lpbwptc.db’

Information: Propagating switching activity

(medium effort zero delay simulation). (PWR-6)

Warning: Design has unannotated primary inputs.

(PWR-414)

Warning: Design has unannotated sequential

cell outputs. (PWR-415)

****************************************

Report : power

-analysis_effort medium

Design : silego

Version: M-2016.12

Date : Fri Feb 2 17:58:28 2018

****************************************

Library(s) Used:

tcbn40lpbwptc (File: /mnt/storage2/stathis/full_40

_nm_lib/synopsys/tcbn40lpbwp_200a/tcbn40lpbwptc.db)

Operating Conditions: NCCOM Library: tcbn40lpbwptc

Wire Load Model Mode: segmented

Design Wire Load Model Library

------------------------------------------------

silego TSMC32K_Lowk_Conservative tcbn40lpbwptc

Global Operating Voltage = 1.1

Power-specific unit information :

Voltage Units = 1V

Capacitance Units = 1.000000pf

Time Units = 1ns

Dynamic Power Units = 1mW

(derived from V,C,T units)


Leakage Power Units = 1nW

Cell Internal Power = 3.4409 mW (94%)

Net Switching Power = 230.4670 uW (6%)

---------

Total Dynamic Power = 3.6714 mW (100%)

Cell Leakage Power = 3.1192 uW

Power Group     Internal Power  Switching Power  Leakage Power  Total Power  (   %   )  Attrs
-------------------------------------------------------------------------------------------
io_pad          0.0000          0.0000           0.0000         0.0000       (  0.00%)
memory          0.0000          0.0000           0.0000         0.0000       (  0.00%)
black_box       0.0000          0.0000           0.0000         0.0000       (  0.00%)
clock_network   0.0000          0.0000           0.0000         0.0000       (  0.00%)
register        3.3268          8.6423e-03       907.0184       3.3364       ( 90.80%)
sequential      1.4468e-07      0.0000           13.7090        1.3886e-05   (  0.00%)
combinational   0.1141          0.2218           2.1984e+03     0.3381       (  9.20%)
-------------------------------------------------------------------------------------------
Total           3.4409 mW       0.2305 mW        3.1191e+03 nW  3.6745 mW

1
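The totals in the report can be cross-checked with a few lines of arithmetic: dynamic power is internal plus switching power, and total power adds the leakage. The sketch below also derives an energy per clock cycle, assuming the 5 ns clock period that appears in the timing report and synthesis script. The function name power_summary is hypothetical, and because the report warns about unannotated switching activity (PWR-414, PWR-415), the result should be read as an estimate.

```python
def power_summary(internal_mw, switching_uw, leakage_uw, period_ns):
    """Return (dynamic power in mW, total power in mW, energy per cycle in pJ)."""
    dynamic_mw = internal_mw + switching_uw / 1000.0
    total_mw = dynamic_mw + leakage_uw / 1000.0
    energy_pj = total_mw * period_ns  # mW * ns = pJ
    return dynamic_mw, total_mw, energy_pj


if __name__ == "__main__":
    # Figures copied from the power report above; 5 ns is the clock period.
    dynamic, total, energy = power_summary(3.4409, 230.4670, 3.1192, 5.0)
    print(f"dynamic power    : {dynamic:.4f} mW")
    print(f"total power      : {total:.4f} mW")
    print(f"energy per cycle : {energy:.2f} pJ")
```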


A.7 Appendix G

Appendix G contains the timing report produced by the logic synthesis of the cell.

Information: Updating design information... (UID-85)

Warning: Design ’silego’ contains 1 high-fanout nets.

A fanout number of 1000 will be used for delay

calculations involving these nets. (TIM-134)

Information: Timing loop detected. (OPT-150)

MTRF_cell/dpu_gen/U205/A2 MTRF_cell/dpu_gen/U205/ZN

MTRF_cell/dpu_gen/U251/I MTRF_cell/dpu_gen/U251/Z

MTRF_cell/dpu_gen/U1633/B MTRF_cell/dpu_gen/U1633/ZN

swb/U32/I swb/U32/Z swb/Mux_Generate0_2/U79/C1

swb/Mux_Generate0_2/U79/Z swb/Mux_Generate0_2/U88/A1

swb/Mux_Generate0_2/U88/Z U104/I0 U104/Z MTRF_cell/

dpu_gen/U39/I

MTRF_cell/dpu_gen/U39/ZN MTRF_cell/dpu_gen/U176/A1


MTRF_cell/dpu_gen/U121/ZN MTRF_cell/dpu_gen/U175/I

















MTRF_cell/dpu_gen/U770/Z MTRF_cell/dpu_gen/U771/I1






MTRF_cell/dpu_gen/U635/Z


Warning: Disabling timing arc between pins ’A2’ and

’ZN’ on cell ’MTRF_cell/dpu_gen/U205’

to break a timing loop. (OPT-314)

****************************************

Report : timing

-path full

-delay max

-max_paths 1

Design : silego

Version: M-2016.12

Date : Fri Feb 2 17:58:22 2018

****************************************

# A fanout number of 1000 was used for high fanout net

computations.

Operating Conditions: NCCOM Library: tcbn40lpbwptc

Wire Load Model Mode: segmented

Startpoint: MTRF_cell/seq_gen/reg_port_type_reg[1]

(rising edge-triggered flip-flop clocked by clk)

Endpoint: MTRF_cell/dpu_gen/mul_out_reg[15]

(rising edge-triggered flip-flop clocked by clk)

Path Group: clk

Path Type: max

Des/Clust/Port              Wire Load Model              Library
------------------------------------------------------------------------
register_row_31             TSMC8K_Lowk_Conservative     tcbn40lpbwptc
AGU_3                       TSMC8K_Lowk_Conservative     tcbn40lpbwptc
register_file               TSMC8K_Lowk_Conservative     tcbn40lpbwptc
RACCU                       TSMC8K_Lowk_Conservative     tcbn40lpbwptc
register_file_top           TSMC16K_Lowk_Conservative    tcbn40lpbwptc
DPU                         TSMC8K_Lowk_Conservative     tcbn40lpbwptc
sequencer_M4                TSMC16K_Lowk_Conservative    tcbn40lpbwptc
MTRF_cell                   TSMC32K_Lowk_Conservative    tcbn40lpbwptc
silego                      TSMC32K_Lowk_Conservative    tcbn40lpbwptc
cell_config_swb             TSMC8K_Lowk_Conservative     tcbn40lpbwptc
switchbox_BITWIDTH16_M5     TSMC8K_Lowk_Conservative     tcbn40lpbwptc
InputMux_BITWIDTH16_M5_11                                tcbn40lpbwptc
DPU_DW01_sub_4              TSMC8K_Lowk_Conservative     tcbn40lpbwptc
DPU_DW01_add_6              TSMC8K_Lowk_Conservative     tcbn40lpbwptc
DPU_DW_mult_tc_2            TSMC8K_Lowk_Conservative     tcbn40lpbwptc

Point Incr Path

--------------------------------------------------------------------------

clock clk (rise edge) 0.00 0.00

clock network delay (ideal) 0.00 0.00

MTRF_cell/seq_gen/reg_port_type_reg[1]/CP (EDFCNQD4BWP)

0.00 # 0.00 r

MTRF_cell/seq_gen/reg_port_type_reg[1]/Q (EDFCNQD4BWP)

0.12 0.12 f

MTRF_cell/seq_gen/reg_port_type[1] (sequencer_M4)

0.00 0.12 f

MTRF_cell/reg_top/reg_port_type[1] (register_file_top)

0.00 0.12 f

MTRF_cell/reg_top/U13/ZN (CKND2D1BWP) 0.05 0.16 r

MTRF_cell/reg_top/U2/ZN (INVD3BWP) 0.04 0.20 f

MTRF_cell/reg_top/U175/Z (AN2XD1BWP) 0.05 0.25 f

MTRF_cell/reg_top/AGU_Rd_1_instantiate/instr_initial_delay[1]

(AGU_1) 0.00 0.25 f


MTRF_cell/reg_top/AGU_Rd_1_instantiate/U203/ZN (NR3D2BWP)

0.06 0.31 r

MTRF_cell/reg_top/AGU_Rd_1_instantiate/U44/ZN (ND2D3BWP)

0.04 0.35 f

MTRF_cell/reg_top/AGU_Rd_1_instantiate/U36/ZN (INR2XD4BWP)

0.04 0.39 r

MTRF_cell/reg_top/AGU_Rd_1_instantiate/U38/ZN (MUX2ND1BWP)

0.12 0.51 r

MTRF_cell/reg_top/AGU_Rd_1_instantiate/U37/ZN (CKND2BWP)

0.05 0.56 f


0.10 0.67 f

MTRF_cell/reg_top/AGU_Rd_1_instantiate/U71/ZN (CKND0BWP)

0.03 0.69 r

MTRF_cell/reg_top/AGU_Rd_1_instantiate/U412/ZN (ND2D1BWP)

0.03 0.72 f


0.06 0.78 r

MTRF_cell/reg_top/AGU_Rd_1_instantiate/U410/Z (AO21D1BWP)

0.05 0.83 r

MTRF_cell/reg_top/AGU_Rd_1_instantiate/U586/Z (AO221D1BWP)

0.07 0.90 r

MTRF_cell/reg_top/AGU_Rd_1_instantiate/addr_out[3] (AGU_1)

0.00 0.90 r

MTRF_cell/reg_top/RegisterFile/rd_addr_1[3] (register_file)

0.00 0.90 r

MTRF_cell/reg_top/RegisterFile/U191/ZN (CKND0BWP)

0.04 0.94 f

MTRF_cell/reg_top/RegisterFile/U184/Z (AN2D1BWP) 0.06 1.00 f

MTRF_cell/reg_top/RegisterFile/U664/Z (AN2XD1BWP) 0.05 1.05 f

MTRF_cell/reg_top/RegisterFile/U180/Z (CKBD3BWP) 0.07 1.12 f

MTRF_cell/reg_top/RegisterFile/U2429/ZN (AOI22D1BWP)

0.07 1.19 r

MTRF_cell/reg_top/RegisterFile/U2428/ZN (ND4D1BWP) 0.07 1.25 f

MTRF_cell/reg_top/RegisterFile/U89/ZN (NR2XD0BWP) 0.06 1.31 r

MTRF_cell/reg_top/RegisterFile/U19/ZN (MUX2ND4BWP) 0.08 1.39 f

MTRF_cell/reg_top/RegisterFile/U18/Z (AN2D4BWP) 0.04 1.43 f

MTRF_cell/reg_top/RegisterFile/data_out_reg_1_left[1]

(register_file) 0.00 1.43 f

MTRF_cell/reg_top/data_out_reg_1_left[1] (register_file_top)

0.00 1.43 f

MTRF_cell/data_out_reg_1_left[1] (MTRF_cell) 0.00 1.43 f

swb/inputs_reg[2][1][1] (switchbox_BITWIDTH16_M5) 0.00 1.43 f


swb/U48/Z (BUFFD4BWP) 0.06 1.48 f

swb/Mux_Generate0_3/inputs_reg[2][1][1]

(InputMux_BITWIDTH16_M5_9) 0.00 1.48 f

swb/Mux_Generate0_3/U77/Z (AO222D1BWP) 0.13 1.61 f

swb/Mux_Generate0_3/U78/ZN (OAI31D1BWP) 0.06 1.67 r

swb/Mux_Generate0_3/U79/ZN (IOA21D1BWP) 0.04 1.71 f

swb/Mux_Generate0_3/U81/Z (OR4D1BWP) 0.11 1.82 f

swb/Mux_Generate0_3/output[1]

(InputMux_BITWIDTH16_M5_9) 0.00 1.82 f

swb/int_v_input_bus[3][1] (switchbox_BITWIDTH16_M5) 0.00 1.82 f

U149/Z (CKMUX2D1BWP) 0.09 1.91 f

MTRF_cell/dpu_in_1[1] (MTRF_cell) 0.00 1.91 f

MTRF_cell/dpu_gen/dpu_in_1[1] (DPU) 0.00 1.91 f

MTRF_cell/dpu_gen/U118/ZN (IND2D2BWP) 0.06 1.97 f


MTRF_cell/dpu_gen/U576/Z (OR2D1BWP) 0.06 2.08 f

MTRF_cell/dpu_gen/U577/Z (OR2D1BWP) 0.06 2.14 f


MTRF_cell/dpu_gen/U591/ZN (CKND2BWP) 0.02 2.22 r

MTRF_cell/dpu_gen/U589/ZN (CKND2D2BWP) 0.02 2.24 f





MTRF_cell/dpu_gen/U113/ZN (ND2D2BWP) 0.03 2.44 f







MTRF_cell/dpu_gen/U31/ZN (ND2D2BWP) 0.02 2.64 r

MTRF_cell/dpu_gen/U23/Z (XOR2D2BWP) 0.11 2.75 f


MTRF_cell/dpu_gen/U765/Z (BUFFD16BWP) 0.06 2.83 r

MTRF_cell/dpu_gen/add_446/B[16] (DPU_DW01_add_6) 0.00 2.83 r

MTRF_cell/dpu_gen/add_446/U137/ZN (CKND2BWP) 0.02 2.86 f

MTRF_cell/dpu_gen/add_446/U194/Z (OR2D1BWP) 0.06 2.92 f

MTRF_cell/dpu_gen/add_446/U139/ZN (ND2D1BWP) 0.03 2.94 r

MTRF_cell/dpu_gen/add_446/U138/ZN (XNR2D1BWP) 0.08 3.02 f

MTRF_cell/dpu_gen/add_446/SUM[15] (DPU_DW01_add_6)

0.00 3.02 f

MTRF_cell/dpu_gen/U165/Z (OA21D4BWP) 0.07 3.09 f



MTRF_cell/dpu_gen/U37/ZN (INVD4BWP) 0.02 3.13 f

MTRF_cell/dpu_gen/U159/ZN (AOI22D4BWP) 0.04 3.17 r

MTRF_cell/dpu_gen/U158/ZN (OAI211D4BWP) 0.06 3.23 f

MTRF_cell/dpu_gen/mult_486/a[0] (DPU_DW_mult_tc_2)

0.00 3.23 f

MTRF_cell/dpu_gen/mult_486/U736/Z (CKBD16BWP) 0.09 3.32 f

MTRF_cell/dpu_gen/mult_486/U386/ZN (XNR2D1BWP) 0.06 3.38 f

MTRF_cell/dpu_gen/mult_486/U777/Z (OR2XD1BWP) 0.06 3.44 f

MTRF_cell/dpu_gen/mult_486/U778/ZN (CKND2D2BWP) 0.03 3.46 r

MTRF_cell/dpu_gen/mult_486/U324/S (CMPE42D1BWP) 0.29 3.75 f

MTRF_cell/dpu_gen/mult_486/U322/CO (CMPE42D1BWP) 0.30 4.06 f

MTRF_cell/dpu_gen/mult_486/U1020/ZN (ND2D1BWP) 0.04 4.10 r

MTRF_cell/dpu_gen/mult_486/U698/ZN (OAI21D2BWP) 0.04 4.14 f

MTRF_cell/dpu_gen/mult_486/U1035/ZN (AOI21D2BWP) 0.05 4.19 r

MTRF_cell/dpu_gen/mult_486/U702/ZN (OAI21D4BWP) 0.05 4.25 f

MTRF_cell/dpu_gen/mult_486/U796/Z (AO21D0BWP) 0.08 4.32 f

MTRF_cell/dpu_gen/mult_486/U1094/CO (FA1D2BWP) 0.07 4.39 f

MTRF_cell/dpu_gen/mult_486/U1093/CO (FA1D2BWP) 0.07 4.46 f

MTRF_cell/dpu_gen/mult_486/U733/S (FA1D0BWP) 0.11 4.57 r

MTRF_cell/dpu_gen/mult_486/product[30] (DPU_DW_mult_tc_2)

0.00 4.57 r

MTRF_cell/dpu_gen/sub_abs_489/B[30]

(DPU_DW01_sub_7) 0.00 4.57 r

MTRF_cell/dpu_gen/sub_abs_489/U107/ZN (INVD1BWP) 0.03 4.60 f

MTRF_cell/dpu_gen/sub_abs_489/U106/S (HA1D0BWP) 0.08 4.69 f

MTRF_cell/dpu_gen/sub_abs_489/DIFF[30] (DPU_DW01_sub_7)

0.00 4.69 f

MTRF_cell/dpu_gen/U374/Z (CKMUX2D1BWP) 0.08 4.77 f

MTRF_cell/dpu_gen/U90/ZN (ND2D4BWP) 0.03 4.80 r

MTRF_cell/dpu_gen/U89/ZN (INVD8BWP) 0.04 4.83 f

MTRF_cell/dpu_gen/U91/Z (AO221D0BWP) 0.15 4.98 f

MTRF_cell/dpu_gen/mul_out_reg[15]/D (DFCND1BWP) 0.00 4.98 f

data arrival time 4.98

clock clk (rise edge) 5.00 5.00

clock network delay (ideal) 0.00 5.00

MTRF_cell/dpu_gen/mul_out_reg[15]/CP (DFCND1BWP) 0.00 5.00 r

library setup time -0.01 4.99

data required time 4.99

------------------------------------------------------

data required time 4.99

data arrival time -4.98

------------------------------------------------------

slack (MET) 0.01

1
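The last lines of the report reduce to a short calculation: the data required time is the clock period minus the library setup time, and the slack is the required time minus the data arrival time. A minimal sketch of that arithmetic follows, using the rounded figures printed above. The function name setup_slack is hypothetical, and the derived maximum frequency is only indicative because it comes from values rounded to two decimals and an ideal clock network.

```python
def setup_slack(period_ns, setup_ns, arrival_ns, clock_delay_ns=0.0):
    """Return (data required time, slack) for a setup check, in ns."""
    required = period_ns + clock_delay_ns - setup_ns
    return required, required - arrival_ns


def max_frequency_mhz(arrival_ns, setup_ns):
    """Smallest period that still meets setup, expressed as a frequency in MHz."""
    return 1000.0 / (arrival_ns + setup_ns)


if __name__ == "__main__":
    # Figures copied from the timing report: 5.00 ns clock, 0.01 ns setup,
    # 4.98 ns data arrival time.
    required, slack = setup_slack(5.00, 0.01, 4.98)
    print(f"data required time : {required:.2f} ns")
    print(f"slack              : {slack:.2f} ns")
    print(f"indicative f_max   : {max_frequency_mhz(4.98, 0.01):.1f} MHz")
```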


A.8 Appendix H

Appendix H contains the Tcl script used for the logic synthesis of the fabric.

set SYNDIR /afs/kth.se/home/p/r/prasha//mnt/storage3/som/som_logic_syn/wc/SYN

set SYNDB /afs/kth.se/home/p/r/prasha//mnt/storage3/som/som_logic_syn/write

set RPTDIR /afs/kth.se/home/p/r/prasha//mnt/storage3/som/som_logic_syn/report

######Environment############

define_design_lib work -path $SYNDIR/work

set write_name_nets_same_as_ports "true"

#set compile_preserve_sync_resets "true"

#set verilogout_equation false

#set verilogout_no_tri true

set sdc_write_unambiguous_names false

set search_path "$search_path /mnt/storage2/farahini/Documents/myLibraryFiles/TSMC_40/tcbn40lpbwp_120b/FrontEnd/tcbn40lpbwp_120b/ /afs/kth.se/pkg/vol/contrib/ict/nveg/pkg/vol2/synopsys/syn_M-2016.12/libraries/syn"

set synlib "/afs/kth.se/pkg/vol/contrib/ict/nveg/pkg/vol2/synopsys/syn_M-2016.12/libraries/syn/dw_foundation.sldb"

set target_library tcbn40lpbwptc.db

set symbol_library tcbn40lpbwptc.sdb

set link_path "$search_path"

set synthetic_library "$synthetic_library $synlib /afs/kth.se/pkg/vol/contrib/ict/nveg/pkg/vol2/synopsys/syn_M-2016.12/libraries/syn/standard.sldb"

set link_library "* $target_library $synthetic_library"

cd wc

define_design_lib WORK -path ./SYN/WORK

read_file -format vhdl {../../fabric/mtrf/misc.vhd}


read_file -format vhdl {../../fabric/dimarch/SINGLEPORT_SRAM_AGU_types_n_constants.vhd}

read_file -format vhdl {../../fabric/mtrf/util_package.vhd}

read_file -format vhdl {../../fabric/hw_setting.vhd}

read_file -format vhdl {../../fabric/mtrf/top_consts_types_package.vhd}

read_file -format vhdl {../../fabric/mtrf/tb_instructions.vhd}


noc_types_n_constants.vhd}

read_file -format vhdl {../../fabric/mtrf/register_row.vhd}

read_file -format vhdl {../../fabric/mtrf/InputMux.vhd}


crossbar_types_n_constants.vhd}

read_file -format vhdl {../../fabric/mtrf/seq_functions_package.vhd}

read_file -format vhdl {../../fabric/dimarch/segmented_bus.vhd}

read_file -format vhdl {../../fabric/mtrf/bus_selector.vhd}

read_file -format vhdl {../../fabric/dimarch/selector.vhd}

read_file -format vhdl {../../fabric/dimarch/source_decoder.vhd}

read_file -format vhdl {../../fabric/dimarch/source_fsm.vhd}


source_decoder_n_fsm.vhd}

read_file -format vhdl {../../fabric/dimarch/iSwitch.vhd}

read_file -format vhdl {../../fabric/dimarch/SRAM.vhd}

read_file -format vhdl {../../fabric/dimarch/data_crossbar.vhd}

read_file -format vhdl {../../fabric/dimarch/sram_agu.vhd}

read_file -format vhdl {../../fabric/mtrf/cell_config_swb.vhd}

read_file -format vhdl {../../fabric/mtrf/switchbox.vhd}

read_file -format vhdl {../../fabric/mtrf/RACCU.vhd}

read_file -format vhdl {../../fabric/mtrf/sequencer.vhd}

read_file -format vhdl {../../fabric/mtrf/DPU.prashant.vhd}

read_file -format vhdl {../../fabric/mtrf/register_row.vhd}

read_file -format vhdl {../../fabric/mtrf/register_file.vhd}

read_file -format vhdl {../../fabric/mtrf/AGU.vhd}

read_file -format vhdl {../../fabric/mtrf/register_file_top.vhd}

read_file -format vhdl {../../fabric/mtrf/MTRF_cell.vhd}

read_file -format vhdl {../../fabric/dimarch/SRAMTile.vhd}

read_file -format vhdl {../../fabric/mtrf/silego.vhd}

analyze -format vhdl {../../fabric/mtrf/fabric.vhd}


set_dont_touch [get_nets h_bus*] true

#set_dont_touch [get_cells {SILEGO*}] true

elaborate fabric -lib work

set_dont_touch [get_nets h_bus*] true

set_operating_conditions NCCOM

set_wire_load_mode top

set_wire_load_selection_group WireAreaLowkCon

#set_false_path -from [find / -port rst_n]

########Set Clock#############

create_clock -name "clk" -period 5 -waveform { 0 2.5 } { clk }

compile -map_effort medium

#######Report####################

report_timing > $RPTDIR/report_timing1.rpt

report_area > $RPTDIR/area1.rpt

report_power -analysis_effort medium > $RPTDIR/power1.rpt

#######Result####################

write -format db -hier -o $SYNDB/fabric1.db

write -format verilog -hier -o $SYNDB/fabric1.v

write_sdf -version 2.1 $SYNDB/fabric1.sdf

write_sdc $SYNDB/fabric1.sdc

write -hierarchy -format ddc -output $SYNDB/fabric1.ddc

write -hierarchy -format ddc


TRITA TRITA-EECS-EX-2018:59

www.kth.se