
Modular Switched Multi-ported SRAM-based Memories

AMEER M.S. ABDELHADI, The University of British Columbia
GUY G.F. LEMIEUX, The University of British Columbia

Multi-ported RAMs are essential for high-performance parallel computation systems. VLIW and vector processors, CGRAs, DSPs, CMPs and other processing systems often rely upon multi-ported memories for parallel access. Although memories with a large number of read and write ports are important, their high implementation cost means they are used sparingly. As a result, FPGA vendors only provide dual-ported block RAMs (BRAMs) to handle the majority of usage patterns. Furthermore, recent attempts to create FPGA-based multi-ported memories suffer from low storage utilization. While most approaches provide simple unidirectional ports with a fixed read or write, others propose true bidirectional ports where each port dynamically switches between read and write. True RAM ports are useful for systems with transceivers and provide high RAM flexibility; however, this flexibility incurs high BRAM consumption. In this paper, a novel, modular and BRAM-based switched multi-ported RAM architecture is proposed. In addition to unidirectional ports with fixed read/write, this switched architecture allows a group of write ports to switch with another group of read ports dynamically, hence altering the number of active ports. The proposed switched-ports architecture is less flexible than a true-multi-ported RAM where each port is switched individually. Nevertheless, switched memories can dramatically reduce BRAM consumption compared to true-ports for systems with alternating port requirements. Previous live-value-table (LVT) and XOR approaches are merged and optimized into a generalized and modular structure we call an invalidation-based live-value-table (I-LVT). Like a regular LVT, the I-LVT determines the correct bank to read from, but it differs in how updates to the table are made; the LVT approach requires multiple write ports, often leading to an area-intensive register-based implementation, while the XOR approach suffers from excessive storage overhead since wider memories are required to accommodate the XOR-ed data. Two specific I-LVT implementations are proposed and evaluated: binary and thermometer coding. The I-LVT approach is especially suitable for deep memories because the table is implemented only in SRAM cells. The I-LVT method gives higher performance while occupying fewer BRAMs than earlier approaches: for several configurations, BRAM usage is reduced by over 44% and clock speed is improved by over 76%. The I-LVT can be used with fixed ports, true-ports or the proposed switched-ports architectures. Formal proofs for the suggested methods, resource consumption analysis, usage guidelines and an analytic comparison to other methods are provided. A fully parameterized Verilog implementation is released as an open source library. The library has been extensively tested using Altera’s EDA tools.

Categories and Subject Descriptors: B.3.2 [MEMORY STRUCTURES]: Design Styles: Cache memories, Shared memory; C.1.2 [PROCESSOR ARCHITECTURES]: Multiple Data Stream Architectures (Multiprocessors): Interconnection architectures, Parallel processors

General Terms: Design, Algorithms, Performance

Additional Key Words and Phrases: Embedded memory, programmable memory, block RAM, multi-ported memory, shared memory, cache memory, register-file, parallel memory access

ACM Reference Format:
Ameer M.S. Abdelhadi and Guy G.F. Lemieux, 2014. Modular Switched Multi-ported SRAM-based Memories. ACM Trans. Embedd. Comput. Syst. 0, 0, Article 0 (November 2014), 25 pages.
DOI: http://dx.doi.org/10.1145/0000000.0000000

Author’s addresses: Ameer M.S. Abdelhadi and Guy G.F. Lemieux, Department of Electrical and Computer Engineering, The University of British Columbia, Vancouver, B.C., V6T 1Z4, Canada.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or [email protected]
© 2014 ACM 1539-9087/2014/11-ART0 $15.00
DOI: http://dx.doi.org/10.1145/0000000.0000000

ACM Transactions on Embedded Computing Systems, Vol. 0, No. 0, Article 0, Publication date: November 2014.


1. INTRODUCTION
Multi-ported memories are the cornerstone of all high-performance CPU designs. They are often used in the register files, but also in other shared-memory structures such as caches and coherence tags. Hence, high-bandwidth memories with multiple parallel reading and writing ports are required. In particular, multi-ported RAMs are often used by wide superscalar processors [Tseng and Asanovic 2003], VLIW processors [Tseng and Asanovic 2003], [Fisher 1983], multi-core processors [Fetzer and Orton 2002], [Bajwa and Chen 2007], vector processors, coarse-grain reconfigurable arrays (CGRAs) [Kwok and Wilton 2005], and digital signal processors (DSPs). For example, the second generation of the Itanium processor employs a 20-port register file constructed from SRAM bit cells with 12 read ports and 8 write ports [Fetzer and Orton 2002]. The key requirement for all of these designs is fast, single-cycle, and concurrent access from multiple requesters for performance reasons.

One way of synthesizing a multi-ported RAM is to build it from registers and logic. However, this is only feasible for very small memories. Another way is to alter the basic SRAM bit cell to provide extra access ports, but area growth is quadratic with the number of ports, so this requires a custom design for each unique set of parameters (number of ports, width and depth of RAM). Since FPGAs must fix their RAM block designs for generic designs, it is too costly to provide highly specialized RAMs with a large number of ports. A multi-ported RAM can also be emulated through banking or multi-pumping. Banking uses hashing and arbitration to provide access, but it leads to unpredictable (multi-cycle) access latencies under collisions; this complicates system design and compromises performance. Multi-pumping provides a few extra ports, but it is limited by the amount of overclocking. Hence, a method of composing arbitrary, multi-ported RAMs from simpler RAM blocks is required.

Recently, a few FPGA-based modular simple/true-multi-ported RAM designs have been proposed. A live-value-table (LVT) is used together with multi-banking to create a simple-multi-ported RAM [LaForest and Steffan 2010]. While each writing port writes to a different bank, the LVT tracks the latest written bank for each memory address, allowing the latest data to be read. However, the LVT is composed of registers, a limited and area-intensive resource. Alternatively, the XOR-based cancellation method retrieves the latest written data by utilizing logical XOR properties [Laforest et al. 2012]. The XOR-based method overcomes the register-based memory limitation by storing all data in BRAMs; however, this incurs excessive memory overhead since wide memories are required to accommodate all the XOR-ed data. Instead, an SRAM-based LVT is proposed by utilizing an invalidation-table data structure [Abdelhadi and Lemieux 2014]. Similar to a regular LVT, the invalidation-based live-value-table (I-LVT) determines the correct bank to read from, but it differs in how updates to the table are made. In contrast to register-based LVTs, the I-LVT is composed of BRAMs only, hence it is capable of constructing deep LVTs. Choi et al. changed the data bank connectivity in the LVT-based approach to support bidirectional true-ports [Choi et al. 2012]. Their design has two main shortcomings. First, the LVT is still register-based and does not scale well for deep memories. Second, all ports are true bidirectional ports; this flexibility incurs high BRAM consumption, especially for memories with mixed simple/true-port requirements. Hence, a method of composing modular BRAM-based and storage-efficient multi-ported RAM supporting both simple and switched ports is required.

As described in Figure 1 (left), RAM ports can be classified into three categories based on their functionality: (1) simple/unidirectional ports with fixed read/write functionality, (2) true/bidirectional ports, where each port can switch read/write functionality, and (3) switched ports, where a group of write ports can switch with another group of read ports dynamically, hence altering the number of active ports. Figure 1 (right) provides an example of a RAM module with mixed simple and switched port requirements; specifically, one simple write (W1) and three simple reads (R1..3) have fixed functionality, while two write ports (W2..3) are switched with three read ports (R4..6).

Fig. 1. (left) RAM port classification. (right) Example of mixed simple and switched ports.

In this paper, a novel, modular and parametric switched multi-ported RAM is constructed out of basic dual-ported BRAMs while keeping minimal area and performance overhead. The proposed method provides a modular architecture that supports mixed simple/switched port requirements and significantly reduces BRAM consumption and improves performance compared to previous attempts. Despite being less flexible than true RAM ports, switched ports dramatically reduce BRAM consumption if mixed-ports are required. The suggested switched data banks employ an SRAM-based invalidation-live-value-table (I-LVT) [Abdelhadi and Lemieux 2014] to track the latest written data banks for each RAM address, hence this architecture is purely SRAM-based. To verify correctness, the proposed architecture is fully implemented in Verilog, simulated using Altera’s ModelSim, and compiled using Quartus II. A large variety of different architectures and parameters, e.g. bypassing, memory depth, width and number of ports, are simulated in batch, each with over a million random memory cycles. Stratix V, Altera’s high-end performance-oriented FPGA, is used to implement and compare the proposed approach with previous techniques. In addition to the suggested switched multi-ported RAM architecture, major contributions of this paper are:

- A bypassing circuitry for both simple (unidirectional) and true (bidirectional) ports. The bypassing circuit is capable of producing new data for read-after-write (RAW) and read-during-write (RDW) data dependencies.

- A detailed analytic comparison of the proposed and previous methods. A guideline for choosing the most efficient architecture based on design parameters is also provided.

- A fully parameterized Verilog implementation of the suggested methods, together with previous approaches. A flow manager to simulate and synthesize designs with various parameters in batch using Altera’s ModelSim and Quartus II is also provided. The Verilog modules and the flow manager are available online [Abdelhadi and Lemieux 2015].

Notation and abbreviations used for the rest of the paper are listed in Table I. The rest of this paper is organized as follows. In section 2, conventional RAM multi-porting techniques in embedded systems are reviewed. Previous attempts to provide multi-ported memories are reviewed in section 3. The proposed invalidation-based live-value-table method with switched ports is described in detail and compared to previous methods in section 4. The experimental framework, including simulation, synthesis and results, is discussed in section 5, and conclusions are drawn in section 6.

2. RAM MULTI-PORTING TECHNIQUES IN EMBEDDED SYSTEMS
This section provides a review of current methods of creating multi-ported RAMs in embedded systems. Creating multi-ported access to register-based memories is described in subsection 2.1. Multi-pumping is described in subsection 2.2. Replicating a memory bank to increase the number of read and write ports is described in subsections 2.3 and 2.4, respectively.

Table I. List of notations and abbreviations

Notation                  Description
$n_W$, $n_R$, $n_t$       Number of write/read/true ports
$n_{W,f}$, $n_{R,f}$      Number of simple (fixed) write/read ports
$n_{W,s}$, $n_{R,s}$      Number of switched write/read ports
$d$, $w$                  Memory depth, data width
$n_{M20K}$, $n_{BReg}$    Number of M20K blocks / bypass registers
$f_{fb}$, $f_{out}$       LVT feedback/out functions
WAddr/RAddr               Write/read address
WData/RData               Write/read data
RWData                    Bidirectional read/write data
RBankSel                  Read bank selector
LVT                       Live-value-table
I-LVT                     Invalidation LVT

Fig. 2. Register-based multi-ported RAM.

Fig. 3. Multi-pumping: RAM is overclocked, allowing multiple accesses during one pipeline cycle.

2.1. Register-based RAM
Multi-ported RAM arrays can be constructed using basic flip-flop cells and logic. As depicted in Figure 2, each writing port uses a decoder to steer the relevant written data into the addressed row. Each read port uses a mux to choose the relevant register output. This method is not practical for large memories due to area inflation, fan-out increase, performance degradation, and a decline in routability.

2.2. RAM Multi-pumping
A time-multiplexing approach can be applied to a single dual-ported SRAM block to reuse access ports and share them among several clients, each during a different time slot. As depicted in Figure 3, addresses and data from several clients are latched then given round-robin access to a dual-ported RAM. The RAM must operate at a higher frequency than the rest of the circuit. If the maximum RAM frequency is similar to the pipe frequency, or a large number of access ports are required, then multi-pumping cannot be used. A number of designs utilize multi-pumping to gain additional access ports while keeping area overhead minimal [Chappell et al. 1996], [Yokota 1990]. The 2.3GHz Wire-Speed POWER processor uses double-pumping to double the writing ports [Ditlow et al. 2011].
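The frequency constraint on multi-pumping can be illustrated with a small calculation. The sketch below assumes a 1-write/1-read dual-port RAM whose write port is shared round-robin by the write clients and whose read port is shared by the read clients, with one sub-cycle per client; the function names and the MHz figures are hypothetical and are not taken from the paper.

```python
def required_ram_clock_mhz(pipeline_mhz: float, n_write: int, n_read: int) -> float:
    """RAM clock needed so every client gets one slot per pipeline cycle.

    Assumption: the write port and the read port are time-shared
    independently, so the busier of the two sets the overclocking factor.
    """
    factor = max(n_write, n_read)
    return pipeline_mhz * factor


def multipumping_feasible(pipeline_mhz: float, ram_fmax_mhz: float,
                          n_write: int, n_read: int) -> bool:
    """True if the RAM can be clocked fast enough for the requested ports."""
    return required_ram_clock_mhz(pipeline_mhz, n_write, n_read) <= ram_fmax_mhz


if __name__ == "__main__":
    # Hypothetical numbers: a 200 MHz pipeline with 2 write and 4 read clients.
    print(required_ram_clock_mhz(200, 2, 4))
    print(multipumping_feasible(200, 600, 2, 4))
```

Under these assumptions the required RAM clock grows linearly with the port count, which is why the approach only yields a few extra ports.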

2.3. Multi-read RAM: Bank Replication
To provide more reading ports, the whole memory bank is replicated while keeping a common write address and data, as shown in Figure 4. Data will be written to all bank replicas at the same address, hence reading from any bank is equivalent. This method incurs high SRAM area and consumes more power. However, the replication approach has two strong advantages over other multi-porting approaches. The first is the simplicity and modularity of bank replication. The second is that read access time is unaffected as the number of ports increases; only write delays increase due to fan-out, but this can be hidden by pipelining and bypassing. The bank replication technique is commonly used in state-of-the-art processors to increase parallelism. The 2.3GHz Wire-Speed POWER processor replicates a 2-read SRAM bank to achieve 4 read ports [Ditlow et al. 2011]. Each of the two integer clusters of the Alpha 21264 processor has a replicated 80-entry register file, thus doubling the number of read ports to support two concurrent integer units. Similarly, the 72-entry floating-point register file is duplicated, supporting two concurrent floating-point units [Kessler 1999].

Fig. 4. (left) Replicated dual-ported banks with a common write port. (right) Multi-read RAM symbol used in this paper.

Fig. 5. Multi-banking: RAM capacity is divided into several banks. Ports access with a fixed hashing scheme.
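The bank replication scheme of subsection 2.3 can be sketched as a small behavioral model. This is only an illustration: the class name `MultiReadRAM` and its methods are hypothetical, and the model ignores timing (read latency, pipelining, bypassing).

```python
class MultiReadRAM:
    """Behavioral sketch of a 1-write / n-read RAM built by bank replication."""

    def __init__(self, depth: int, n_read: int):
        # One replica of the whole memory per read port.
        self.replicas = [[0] * depth for _ in range(n_read)]

    def write(self, addr: int, data: int) -> None:
        # The common write port updates every replica at the same address.
        for bank in self.replicas:
            bank[addr] = data

    def read(self, port: int, addr: int) -> int:
        # Each read port has a private replica, so reads never conflict.
        return self.replicas[port][addr]


if __name__ == "__main__":
    ram = MultiReadRAM(depth=8, n_read=3)
    ram.write(5, 42)
    print([ram.read(p, 5) for p in range(3)])
```

The storage cost is visible directly in the constructor: the number of stored words is the depth multiplied by the number of read ports.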

2.4. Multi-write RAM: Emulation via Multi-banking
Multi-ported memories are very expensive in terms of area, delay, and power for a large number of ports. The overhead of multi-porting can be reduced by multi-banking if one relaxes the guaranteed access delay constraint. As depicted in Figure 5, the total RAM capacity can be divided into several banks, each with few ports (e.g. dual-port). A fixed hashing scheme is used to match each access to a single bank; often, the address MSBs are used. Arbitration logic steers access from multiple ports to each bank. Since two ports can request access to data in the same bank at the same time, a conflict resolving circuit determines which port is granted access to a specific bank. The other port will miss the arbitration and is required to request access again. Not only does this approach provide unpredictable access latency due to the arbitration miss, but it also increases delay due to the additional circuitry. Several approaches have been proposed to improve multi-banking [Tseng and Asanovic 2003], [Mattausch 1997], [Ji et al. 2007], [Zuo et al. 2008]. State-of-the-art memory controllers and caches are based on multi-banking due to area and power efficiency. For example, Intel’s i486 CPU has a data cache with 8 interleaved banks and two access ports [Alpert and Avnon 1993].
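The arbitration behavior described above can be sketched for a single cycle. The sketch makes several simplifying assumptions that are not from the paper: each bank serves one access per cycle, the bank is selected by the address MSBs (integer division by the bank size), and conflicts are resolved with a fixed priority in which the lowest-numbered port wins. The function name `arbitrate` is hypothetical.

```python
def arbitrate(requests, depth, n_banks):
    """Resolve one cycle of bank accesses.

    requests: list with one address per port, or None for an idle port.
    Returns (granted, stalled): lists of port indices.
    """
    lines_per_bank = depth // n_banks  # address MSBs select the bank
    owner = {}                         # bank index -> port that won it
    granted, stalled = [], []
    for port, addr in enumerate(requests):
        if addr is None:
            continue
        bank = addr // lines_per_bank
        if bank in owner:
            stalled.append(port)       # lost arbitration, must retry later
        else:
            owner[bank] = port
            granted.append(port)
    return granted, stalled


if __name__ == "__main__":
    # Ports 0 and 1 collide on bank 0; port 2 hits bank 2.
    print(arbitrate([0, 1, 600], depth=1024, n_banks=4))
```

A stalled port has to present its request again in a later cycle, which is the source of the unpredictable access latency mentioned in the text.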

3. MULTI-PORTED SRAM-BASED MEMORIES: PREVIOUS WORK
In this section, a review of two previous multi-ported SRAM-based memories is provided. The first approach is based on multi-banking with a live-value-table (LVT) [LaForest and Steffan 2010], [Laforest et al. 2014] and is described in subsection 3.1. The second approach retrieves the latest written data by utilizing logical XOR properties [Laforest et al. 2012], [Laforest et al. 2014] and is described in subsection 3.2.

3.1. LVT-based Multi-ported RAM
For each RAM address, the LVT stores the ID of the bank replica that holds the latest data. As depicted in Figure 6 (left), an LVT-based multi-ported RAM uses a different bank replica for each writing port, while each bank has several reading ports. All banks are accessed by all read addresses in parallel; the LVT helps to steer the read data out of the correct bank since it holds the ID of the last accessed (written) bank for each address.

In fact, the LVT itself is a multi-ported RAM with the same depth and number of writing ports as the implemented multi-ported memory. However, since the LVT stores only bank IDs, the data (line) width of the LVT is only $\lceil \log_2 n_W \rceil$ bits, where the number of banks equals the number of writing ports $n_W$. As described in Figure 6 (right), the LVT doesn’t have write data; instead, it writes a fixed bank ID for each port.

Fig. 6. (left) LVT-based multi-ported RAM. (right) An LVT implemented using multi-ported RAM.

Fig. 7. XOR-based multi-ported RAM.
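The LVT mechanism of subsection 3.1 can be sketched as a behavioral model: one data bank per write port, plus a table that records which bank was written last at each address. The class `LVTRam` and its method names are hypothetical, and the model abstracts away the per-bank read-port replication and all timing.

```python
class LVTRam:
    """Behavioral sketch of an LVT-based multi-write RAM."""

    def __init__(self, depth: int, n_write: int):
        self.banks = [[0] * depth for _ in range(n_write)]  # one bank per write port
        self.lvt = [0] * depth                               # last-written bank ID per address

    def write(self, port: int, addr: int, data: int) -> None:
        self.banks[port][addr] = data  # each port writes only to its own bank
        self.lvt[addr] = port          # the LVT stores a fixed ID for this port

    def read(self, addr: int) -> int:
        # All banks are read in parallel in hardware; the LVT selects the live one.
        return self.banks[self.lvt[addr]][addr]


if __name__ == "__main__":
    ram = LVTRam(depth=8, n_write=2)
    ram.write(0, 3, 10)
    ram.write(1, 3, 20)
    print(ram.read(3))
```

Note that stale data remains in the banks that were not written last; only the LVT entry decides which copy is visible.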

Since an LVT is a narrow, multi-port memory, it is implemented as a register-based, multi-ported RAM. As explained in subsection 2.1, register-based RAM is not suitable for building deep memories. While the LVT width is only $\log_2$ of the number of writing ports, the depth of the LVT is still identical to the depth of the original RAM. This is the main cause of the area overhead. In this paper, to reduce this area overhead, two methods of constructing SRAM-based LVTs are described. The methodology of constructing SRAM-based LVTs is also generalized. To the authors’ best knowledge, this is the first attempt to build an LVT out of SRAM blocks only.

Assuming that bank IDs are binary encoded, the total number of registers required to implement the LVT is

$$ d \cdot \lceil \log_2 n_W \rceil. \tag{1} $$

For deep memories, the large number of registers and huge read multiplexers make register-based LVTs impractical. For example, a Stratix V GX A5 device (185k ALMs) can fit at most a 16k-deep memory with four write ports.

A register-based LVT with SRAM banks requires $n_W$ multi-read banks, one for each write port. Each multi-read bank supports $n_R$ reading ports, allowing the LVT to select the required data block. The total number of SRAM cells is

$$ d \cdot w \cdot n_W \cdot n_R. \tag{2} $$

Using Altera’s Stratix M20K BRAMs, the total number of required M20K blocks is

$$ n_{M20K}(d,w) \cdot n_W \cdot n_R. \tag{3} $$
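Equations (1) to (3) are simple products and can be evaluated directly. In the sketch below, the function names are hypothetical, and the number of M20K blocks per bank, $n_{M20K}(d,w)$, is passed in as a plain parameter (`blocks_per_bank`) rather than computed.

```python
def ceil_log2(n: int) -> int:
    """Smallest b with 2**b >= n, for n >= 1 (integer arithmetic only)."""
    return (n - 1).bit_length()


def lvt_registers(depth: int, n_write: int) -> int:
    """Equation (1): registers in a binary-coded, register-based LVT."""
    return depth * ceil_log2(n_write)


def lvt_sram_cells(depth: int, width: int, n_write: int, n_read: int) -> int:
    """Equation (2): SRAM cells in the data banks."""
    return depth * width * n_write * n_read


def lvt_m20k_blocks(blocks_per_bank: int, n_write: int, n_read: int) -> int:
    """Equation (3): M20K blocks, given the per-bank block count."""
    return blocks_per_bank * n_write * n_read


if __name__ == "__main__":
    # A 16k-deep memory with four write ports needs 16384 * 2 LVT registers.
    print(lvt_registers(16 * 1024, 4))
    print(lvt_sram_cells(1024, 32, 2, 3))
```

The register count scales with the memory depth, which is why the register-based LVT becomes the limiting resource for deep memories.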

Here, $n_{M20K}(d,w)$ is the number of M20K blocks required to construct a RAM with a specific depth and data width. This value, described by Equation (4), is derived from Figure 8, which shows how Altera’s M20K blocks can be configured into several RAM depth and data width configurations [Altera Corp. 2013]. The total amount of utilized SRAM bits can be either 16Kbits or 20Kbits. Assuming that the RAM packing process minimizes the number of blocks cascaded in depth to avoid additional address decoding, each 16K lines will be packed into single bit-wide blocks, and the remainder will be packed into the minimal required configuration as follows

$$
n_{M20K}(d,w) = \left\lfloor \frac{d}{16k} \right\rfloor \cdot w +
\begin{cases}
w & d \bmod 16k > 8k \\
\lfloor w/2 \rfloor & 8k \ge d \bmod 16k > 4k \\
\lfloor w/5 \rfloor & 4k \ge d \bmod 16k > 2k \\
\lfloor w/10 \rfloor & 2k \ge d \bmod 16k > 1k \\
\lfloor w/20 \rfloor & 1k \ge d \bmod 16k > \tfrac{1}{2}k \\
\lfloor w/40 \rfloor & \tfrac{1}{2}k \ge d \bmod 16k
\end{cases}
\tag{4}
$$


Fig. 8. Altera M20K configuration: (top) 20Kb, (bottom) 16Kb. Axes: data width (bits) versus BRAM depth (K lines).

Fig. 9. Data banks with bidirectional true-ports support. Each port shares a single BRAM with any other port. (left) Generalized approach. (right) A 3 true-ports example.

Two papers [Choi et al. 2012], [Laforest et al. 2014] introduce a modification to the LVT’s data banks to support bidirectional (true) ports. This is achieved by utilizing the bidirectional functionality of true-dual-port BRAMs. Figure 9 (left) illustrates a generalization of this method. Each port either writes to a set of data banks, or reads from them. Every pair of ports has one data bank in common; hence, when a port reads, it can access data written from any other port. A register-based LVT determines which bank holds the latest data. Figure 9 (right) describes an example of a RAM with 3 true-ports. This problem is identical to the mathematical handshake problem, where each port must connect (handshake) to all other ports via a RAM. Hence, a total of

$$ \frac{1}{2} \cdot n_t \cdot (n_t - 1) \tag{5} $$

data copies are required, where $n_t$ is the number of bidirectional (true) ports. Nevertheless, this true-port RAM architecture is still based on a register-based LVT, hence it suffers from the same shortcomings.
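The pairwise-shared banks behind Equation (5) can be sketched as follows. This is a simplified behavioral model under stated assumptions, not the cited designs: every unordered pair of ports owns one bank, a write updates all banks attached to the writing port, and the LVT records the writing port. The class `TruePortRam` and its methods are hypothetical, and at least two ports are assumed.

```python
from itertools import combinations


class TruePortRam:
    """Behavioral sketch of true (bidirectional) ports with pairwise-shared banks."""

    def __init__(self, depth: int, n_ports: int):
        self.n_ports = n_ports
        # One bank per unordered pair of ports: n_t * (n_t - 1) / 2 banks in total.
        self.banks = {pair: [0] * depth
                      for pair in combinations(range(n_ports), 2)}
        self.lvt = [0] * depth  # port that wrote each address last

    def write(self, port: int, addr: int, data: int) -> None:
        for pair, bank in self.banks.items():
            if port in pair:       # every bank attached to the writing port
                bank[addr] = data
        self.lvt[addr] = port

    def read(self, port: int, addr: int) -> int:
        writer = self.lvt[addr]
        # If this port wrote last, any of its own banks holds the live data.
        other = writer if writer != port else (port + 1) % self.n_ports
        pair = (min(port, other), max(port, other))
        return self.banks[pair][addr]


if __name__ == "__main__":
    ram = TruePortRam(depth=16, n_ports=3)
    print(len(ram.banks))  # number of data copies for 3 true ports
    ram.write(0, 5, 11)
    print(ram.read(1, 5), ram.read(2, 5))
```

The dictionary of banks makes the quadratic growth in data copies explicit as the number of true ports increases.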

3.2. XOR-based Multi-ported RAM
While the LVT-based multi-port RAM just shown implements its LVT as a register-based multi-ported RAM, the XOR-based multi-ported RAM is implemented using SRAM blocks [Laforest et al. 2012], [Laforest et al. 2014]. This makes it more efficient for deep memories. However, as will be shown, it is inefficient for wide memories. The XOR-based method utilizes the special properties of the XOR function to retain the latest written data for each write port. XOR is commutative ($a \oplus b = b \oplus a$), associative ($(a \oplus b) \oplus c = a \oplus (b \oplus c)$), zero is the identity ($a \oplus 0 = a$), and each element is its own inverse ($a \oplus a = 0$).

As illustrated in Figure 7, each write port has a bank with multiple read ports and a single write port. Some of the read ports are used as feedback to generate new data and rewrite a specific bank, while the other read ports generate the data outputs. To perform a write, the new data is XOR’ed together with all the old data from the other banks; the result is rewritten to the corresponding bank. Hence, if an address $A$ is written through write port $i$ with data $\mathrm{WData}_i$, $\mathrm{Bank}_i$ will be rewritten with

$$ \mathrm{Bank}_i[A] \leftarrow \mathrm{Bank}_0[A] \oplus \mathrm{Bank}_1[A] \oplus \cdots \oplus \mathrm{WData}_i \oplus \cdots \oplus \mathrm{Bank}_{n_W-1}[A]. \tag{6} $$


A read is performed by XOR’ing all the data for the corresponding read address from all the banks, hence,

$$ \mathrm{RData}_i[A] \leftarrow \mathrm{Bank}_0[A] \oplus \mathrm{Bank}_1[A] \oplus \cdots \oplus \mathrm{Bank}_{n_W-1}[A]. \tag{7} $$

Substituting $\mathrm{Bank}_i[A]$ from (6) into (7) and applying the commutative and associative properties of XOR shows that every other bank appears twice in the XOR expression and is therefore cancelled, since XOR’ing an element with itself gives 0. The only remaining term is $\mathrm{WData}_i$, the required data.
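The cancellation argument of Equations (6) and (7) can be checked with a short behavioral model. The class `XorRam` and its method names are hypothetical; the model is purely combinational and ignores the one-cycle BRAM read latency and the read-port replication of the real design.

```python
from functools import reduce
from operator import xor


class XorRam:
    """Behavioral sketch of an XOR-based multi-write RAM."""

    def __init__(self, depth: int, n_write: int):
        self.banks = [[0] * depth for _ in range(n_write)]  # one bank per write port

    def write(self, port: int, addr: int, data: int) -> None:
        # Equation (6): XOR the new data with the old contents of all other banks.
        others = reduce(xor, (bank[addr] for i, bank in enumerate(self.banks)
                              if i != port), 0)
        self.banks[port][addr] = data ^ others

    def read(self, addr: int) -> int:
        # Equation (7): XOR across all banks; the stale terms cancel pairwise.
        return reduce(xor, (bank[addr] for bank in self.banks), 0)


if __name__ == "__main__":
    ram = XorRam(depth=8, n_write=3)
    ram.write(0, 2, 0xA5)
    ram.write(2, 2, 0x3C)
    print(hex(ram.read(2)))
```

Each bank stores full-width XOR-ed words rather than bank IDs, which is the root of the storage overhead for wide memories.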

The XOR-based multi-ported RAM requires $n_W$ multi-read banks, one for each write port. Each multi-read bank supports $n_W - 1$ read ports to feed back the other ports via XORs, and $n_R$ read ports for data output. Each feedback read port is of width $w$, to match the write data, so these feedback memories can be quite large. The number of required SRAM cells is

$$ d \cdot w \cdot n_W \cdot (n_R + n_W - 1). \tag{8} $$

Using Altera’s Stratix M20K BRAMs, the total number of required M20K blocks is

$$ n_{M20K}(d,w) \cdot n_W \cdot (n_R + n_W - 1). \tag{9} $$

Since FPGA block RAM is synchronous, data feedbacks are read with a one-cycle read delay. Hence, the written data, their addresses and write-enables must be retimed to match the feedback data. This requires the following number of registers

$$ n_W \cdot (w + \lceil \log_2 d \rceil + n_W). \tag{10} $$
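Equations (8) to (10) can be evaluated in the same way as the LVT costs. The function names below are hypothetical, and the per-bank block count $n_{M20K}(d,w)$ is again supplied by the caller as `blocks_per_bank` rather than computed.

```python
def xor_sram_cells(depth: int, width: int, n_write: int, n_read: int) -> int:
    """Equation (8): SRAM cells in the XOR-based design."""
    return depth * width * n_write * (n_read + n_write - 1)


def xor_m20k_blocks(blocks_per_bank: int, n_write: int, n_read: int) -> int:
    """Equation (9): M20K blocks, given the per-bank block count."""
    return blocks_per_bank * n_write * (n_read + n_write - 1)


def xor_retiming_registers(depth: int, width: int, n_write: int) -> int:
    """Equation (10): registers that retime data, address and write-enables."""
    addr_bits = (depth - 1).bit_length()  # ceil(log2(depth)) for depth >= 1
    return n_write * (width + addr_bits + n_write)


if __name__ == "__main__":
    # Example configuration: 1024 x 32 bits, 2 write ports, 3 read ports.
    print(xor_sram_cells(1024, 32, 2, 3))
    print(xor_retiming_registers(1024, 32, 2))
```

Comparing Equation (8) with Equation (2) for the same configuration shows the extra $n_W - 1$ full-width feedback copies per bank that the XOR method pays for.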

4. INVALIDATION TABLE
As described in the previous section, the XOR-based multi-ported memory requires $n_W \cdot (n_R + n_W - 1)$ manipulated copies of the RAM content, while the LVT approach requires another register-based multi-ported memory with the same number of read and write ports for bank IDs.

This work proposes to implement LVTs using SRAM blocks only, which has a major advantage over register-based LVTs and a lower SRAM area compared to the XOR-based approach. Instead of requiring multiple write ports to each multi-read bank as in the regular LVT method, we suggest a design in which each bank has a single write port, as in the XOR method. This makes it feasible to implement the LVT using standard dual-ported RAMs. However, writing an ID to one bank also requires invalidating the IDs in the other banks, which produces the need for the multiple write ports. Instead, we suggest writing an ID to only one specific bank and invalidating all the other IDs with a single write by using an invalidation table. Since the invalidation table has the same functional behavior as an LVT, we call it an invalidation-based LVT, or I-LVT.

The I-LVT doesn’t require multiple writes to indicate the last-written bank. Instead, as described in Figure 10, the I-LVT reads all other bank IDs as feedback, embeds the new bank ID into the other values through a feedback function $f_{fb}$, then rewrites the specific bank. To extract the latest written bank ID, all banks are read and the data is processed with the output function $f_{out}$ to regenerate the required ID. Selection of the $f_{fb}$ and $f_{out}$ functions is what distinguishes different I-LVT implementations.

The I-LVT requires $n_W$ multi-read banks, each with $n_R$ read ports for output extraction. Furthermore, an additional $n_W - 1$ read ports are required in each bank for feedback rewriting. The data width of these read ports varies depending on the feedback method and the bank ID encoding. In this paper, two bank ID encoding methods are presented, binary and thermometer. The binary method employs exclusive-OR functions to embed the bank IDs, while the thermometer method uses mutually exclusive conditions to invalidate table entries and generate one-hot-coded bank selectors. The two methods are described in subsections 4.1 and 4.2, respectively.


Fig. 10. Generalized approach for building the I-LVT.

4.1. Bank ID Embedding: Binary-coded Bank IDs and Selectors

This approach attempts to reduce the SRAM cell count in the I-LVT by employing binary-coded bank IDs. The special properties of the exclusive-OR function are utilized to embed the latest written bank ID, hence invalidating all other IDs. The currently written bank ID is XOR'ed with the content of all the other banks at the same write address, as described in the following feedback function,

$$ f_{fb,k} = k \oplus \bigoplus_{0 \le i < n_W;\; i \ne k} \mathrm{Bank}_i[\mathrm{WAddr}_k], \tag{11} $$

where k is the ID of the currently written bank.

Similar to the XOR-based method described in subsection 3.2, the last written bank ID is extracted by XOR'ing the content of all the banks at the same read address, as described in the following output extraction function

$$ f_{out,k} = \bigoplus_{0 \le i < n_W} \mathrm{Bank}_i[\mathrm{RAddr}_k]. \tag{12} $$


Fig. 11. A 2W/2R SRAM-based I-LVT; identical for binary-coded or thermometer-coded bank IDs. Each of the two banks is 1 bit wide with 1 write port and 3 read ports (one feedback port and two output ports).

Without loss of generality, if address A in bank k is written with the feedback function from Equation (11), then

$$ \mathrm{Bank}_k[A] = k \oplus \bigoplus_{0 \le i < n_W;\; i \ne k} \mathrm{Bank}_i[A]. \tag{13} $$

If one of the read ports, say read port r, is trying to read from the same address, namely RAddr_r = A, then the read bank selector will be generated using the same output extraction function from (12), hence

$$ \mathrm{RBankSel}_r = \bigoplus_{0 \le i < n_W} \mathrm{Bank}_i[A]. \tag{14} $$

Due to XOR operation associativity, RBankSelr from (14) can be expressed as

$$ \mathrm{RBankSel}_r = \mathrm{Bank}_k[A] \oplus \bigoplus_{0 \le i < n_W;\; i \ne k} \mathrm{Bank}_i[A]. \tag{15} $$

Substituting Bankk[A] from (13) into (15) provides

$$ \mathrm{RBankSel}_r = k \oplus \bigoplus_{0 \le i < n_W;\; i \ne k} \mathrm{Bank}_i[A] \oplus \bigoplus_{0 \le i < n_W;\; i \ne k} \mathrm{Bank}_i[A]. \tag{16} $$

The last two series in (16) are identical and cancel each other, since x ⊕ x = 0. This reveals that RBankSel_r = k, the ID of the latest bank written at address A, as required.
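The cancellation argument above can be checked with a small behavioral model. The following Python sketch is illustrative only: the class name `BinaryILVT` and its methods are hypothetical, each I-LVT bank is modeled as a plain list, and writes are applied one at a time rather than with real port timing. It implements the feedback function of Equation (11) and the output function of Equation (12).

```python
from functools import reduce
from operator import xor


class BinaryILVT:
    """Behavioral model of a binary-coded I-LVT (hypothetical helper class)."""

    def __init__(self, n_w, depth):
        self.n_w = n_w
        # One bank per write port. All-zero contents make bank 0 the initial selection.
        self.banks = [[0] * depth for _ in range(n_w)]

    def write(self, k, addr):
        """Record that write port k wrote addr, using f_fb,k from Eq. (11)."""
        others = (self.banks[i][addr] for i in range(self.n_w) if i != k)
        self.banks[k][addr] = reduce(xor, others, k)

    def last_writer(self, addr):
        """Recover the last-written bank ID using f_out from Eq. (12)."""
        return reduce(xor, (bank[addr] for bank in self.banks), 0)


lvt = BinaryILVT(n_w=3, depth=8)
lvt.write(2, addr=5)   # port 2 writes address 5
lvt.write(0, addr=5)   # port 0 overwrites address 5
lvt.write(1, addr=3)   # port 1 writes address 3
print(lvt.last_writer(5), lvt.last_writer(3), lvt.last_writer(0))  # 0 1 0
```

Only the bank belonging to the writing port is modified on each write, which is the property that lets every bank keep a single write port.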

Figure 11 provides an example of a 2W/2R binary-coded I-LVT. As will become apparent in the next section, in the case of only 2 write ports, the binary-coded and thermometer-coded I-LVTs are identical. Figure 12 (left) shows a 3W/2R binary-coded I-LVT.

The required data width of the I-LVT SRAM blocks is ⌈log2 nW⌉. Also, nW multi-read banks are required, each with nR output ports for ID extraction and nW − 1 feedback ports for ID rewriting. Hence, the number of required SRAM cells is

$$ d \cdot \lceil \log_2 n_W \rceil \cdot n_W \cdot (n_W + n_R - 1). \tag{17} $$

Correspondingly, the number of required M20K block RAMs is

$$ n_{M20K}(d, \lceil \log_2 n_W \rceil) \cdot n_W \cdot (n_W + n_R - 1). \tag{18} $$

Similarly, the number of registers required for retiming is

$$ n_W \cdot (\lceil \log_2 d \rceil + 1). \tag{19} $$


Fig. 12. A 3W/2R SRAM-based I-LVT with (left) binary-coded bank IDs, and (right) thermometer-coded bank IDs. In both variants each of the three banks is 2 bits wide with 1 write port and 4 read ports (two feedback ports and two output ports).

4.2. Mutual-exclusive Conditions: Thermometer-coded Bank IDs with One-hot-coded Selectors

The previous binary-coded I-LVT incurs a long path delay through the feedback and output extraction functions due to the nW-wide XOR gates used to generate these functions, which reduces performance in structures with more ports. On the other hand, employing a thermometer ID encoding reduces the feedback paths to a single inverter at most, compared to the nW-wide XOR used earlier.

Mutually exclusive conditions are used to rewrite the RAM contents. A specific bank is written with data that contradicts all the other banks, hence only this specific bank will be valid and all the others will be invalid. By checking the appropriate mutually exclusive condition for each bank, only the latest written bank will hold valid data.

Equations (20) and (21) describe mutually exclusive feedback functions for nW = 2 and nW = 3, respectively. The angle brackets in these equations are used for bit selection and concatenation, while the square brackets in other equations are used for RAM addressing. As can be seen from these equations, writing to one bank will invalidate all the other banks at the same address, since one mutually negated bit is shared between each two lines. For example, writing to bank 1 when nW = 3 will write Bank1⟨0⟩ ← NOT Bank0⟨0⟩, which will invalidate bank 0, and Bank1⟨1⟩ ← Bank2⟨1⟩, which will invalidate bank 2.

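$$ n_W = 2: \quad \begin{cases} f_{fb,0}: \mathrm{Bank}_0\langle 0 \rangle \leftarrow \mathrm{Bank}_1\langle 0 \rangle \\ f_{fb,1}: \mathrm{Bank}_1\langle 0 \rangle \leftarrow \overline{\mathrm{Bank}_0\langle 0 \rangle} \end{cases} \tag{20} $$

$$ n_W = 3: \quad \begin{cases} f_{fb,0}: \mathrm{Bank}_0\langle 1{:}0 \rangle \leftarrow \langle \mathrm{Bank}_2\langle 0 \rangle,\ \mathrm{Bank}_1\langle 0 \rangle \rangle \\ f_{fb,1}: \mathrm{Bank}_1\langle 1{:}0 \rangle \leftarrow \langle \mathrm{Bank}_2\langle 1 \rangle,\ \overline{\mathrm{Bank}_0\langle 0 \rangle} \rangle \\ f_{fb,2}: \mathrm{Bank}_2\langle 1{:}0 \rangle \leftarrow \langle \overline{\mathrm{Bank}_1\langle 1 \rangle},\ \overline{\mathrm{Bank}_0\langle 1 \rangle} \rangle \end{cases} \tag{21} $$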


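Equation (22) generalizes the feedback function to

$$ f_{fb,k}\langle i \rangle \big|_{0 \le i < n_W - 1} : \ \mathrm{Bank}_k[\mathrm{WAddr}_k]\langle i \rangle \leftarrow \begin{cases} \overline{\mathrm{Bank}_i[\mathrm{WAddr}_k]\langle k-1 \rangle} & i < k \\ \mathrm{Bank}_{i+1}[\mathrm{WAddr}_k]\langle k \rangle & \text{otherwise} \end{cases} \tag{22} $$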

This equation shows that each bank uses bits from all the other banks to write its own content. To prove that every two banks are mutually exclusive, it suffices to show that they share one mutually negated bit. Let 0 ≤ k0 ≤ nW − 1 be a bank ID and 0 ≤ i0 ≤ nW − 2 a bit index. From Equation (22), if i0 ≥ k0 then there exist another bank ID k1 and a bit index i1 such that Bank_k0⟨i0⟩ ← Bank_k1⟨i1⟩, with k1 = i0 + 1 and i1 = k0. Hence i1 < k1, and from (22) Bank_k1⟨i1⟩ ← NOT Bank_k0⟨i0⟩, as required. The proof for the case i0 < k0 is identical.

The output extraction function checks, for each one-hot output selector, whether the data read from a specific bank matches its mutually exclusive case. Hence, only one case will match due to exclusivity. The output extraction function consists of an (nW − 1)-bit-wide comparator for each one-hot selector.

An example of a 2W/2R thermometer-coded I-LVT is shown in Figure 11, while a 3W/2R thermometer-coded I-LVT is depicted in Figure 12 (right).
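The invalidation mechanism can be exercised with a short behavioral model. The Python sketch below is an illustration under stated assumptions, not the authors' implementation: the class `ThermometerILVT` and its methods are hypothetical names, each bank entry is a list of nW − 1 bits, and the negation in Equation (22) is assumed to apply to the i < k case. That reading is consistent with the all-zero initial state selecting bank 0 (Section 4.5).

```python
class ThermometerILVT:
    """Behavioral model of a thermometer-coded I-LVT (hypothetical helper class)."""

    def __init__(self, n_w, depth):
        self.n_w = n_w
        # banks[k][addr] holds the n_w - 1 bits stored by bank k at addr.
        self.banks = [[[0] * (n_w - 1) for _ in range(depth)] for _ in range(n_w)]

    def _feedback(self, k, addr):
        """Bits that f_fb,k of Eq. (22) would write into bank k at addr."""
        bits = []
        for i in range(self.n_w - 1):
            if i < k:
                # Assumed negated case: inverted bit k-1 of bank i.
                bits.append(1 - self.banks[i][addr][k - 1])
            else:
                # Non-negated case: bit k of bank i+1.
                bits.append(self.banks[i + 1][addr][k])
        return bits

    def write(self, k, addr):
        """Write port k validates its own bank, invalidating all others."""
        self.banks[k][addr] = self._feedback(k, addr)

    def selectors(self, addr):
        """One-hot selectors: bank k is valid if it matches its own condition."""
        return [int(self.banks[k][addr] == self._feedback(k, addr))
                for k in range(self.n_w)]


lvt = ThermometerILVT(n_w=3, depth=8)
lvt.write(2, addr=5)
lvt.write(1, addr=5)
print(lvt.selectors(5), lvt.selectors(0))  # [0, 1, 0] [1, 0, 0]
```

Each selector is produced by an equality comparison rather than a wide XOR tree, which reflects the shorter feedback and output paths described above.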

The thermometer-coded I-LVT requires nW − 1 SRAM bits per entry to store the mutually exclusive cases. However, each feedback read port requires only one bit, since the feedback function uses only one bit from each bank. nW multi-read banks are required, each with nR output ports for one-hot selector extraction and nW − 1 feedback ports for rewriting the mutually exclusive cases. Hence, the number of required SRAM cells is

$$ d \cdot (n_W - 1) \cdot n_W \cdot n_R + d \cdot n_W \cdot (n_W - 1). \tag{23} $$

Correspondingly, the number of required M20K block RAMs is

$$ n_{M20K}(d, n_W - 1) \cdot n_W \cdot n_R + n_{M20K}(d, 1) \cdot n_W \cdot (n_W - 1). \tag{24} $$

Similarly, the number of registers required for retiming is equal to that of the binary-coded case and is described by (19).

4.3. Switched-ports Support

The I-LVT multi-ported RAM, like the earlier register-based LVT [LaForest and Steffan 2010] and XOR [Laforest et al. 2012] methods, offers a fixed number of simple write and read ports. However, the vast majority of computation applications use different numbers of read and write ports in each computation cycle. On the other hand, the multi-ported RAM architecture of Choi et al. supports bidirectional true-ports only [Choi et al. 2012]; this excessive port flexibility incurs high BRAM consumption, especially for memories with mixed simple/true-port requirements. For instance, Figure 13 (left) provides an example of a CPU and RAM pairing where the CPU operations are mutually exclusive; namely, only one operation can be active in any single cycle. The mutually exclusive operations in Figure 13 (left) are f, with 3 operands and 3 results, and g, with 6 operands and a single result. When f is active, 3 RAM writes and 3 RAM reads are required, while a single RAM write and 6 RAM reads are required when g is active. At any given cycle, a maximum of 3 writes and 6 reads are required, hence using a multi-ported RAM with fixed ports requires three write ports (nW = 3) and six read ports (nR = 6). On the other hand, true ports can be configured as writes or reads, hence using true-ports requires the maximum total of writes and reads at any given cycle, namely 7 true-ports (nt = 7). Since these write and read ports are never all active in the same cycle, a multi-ported RAM with switched write/read ports, as illustrated in Figure 13 (right), can be used to reduce SRAM consumption.
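The port counts in this example follow directly from the per-operation requirements. A minimal Python sketch of that arithmetic (the dictionary `operations` is just a hypothetical encoding of the f and g operations of Figure 13):

```python
# (writes, reads) needed by each mutually exclusive CPU operation.
operations = {"f": (3, 3), "g": (1, 6)}

# Fixed ports must cover the worst case of writes and of reads separately.
fixed_writes = max(w for w, _ in operations.values())
fixed_reads = max(r for _, r in operations.values())

# True ports are bidirectional, so only the worst-case total matters.
true_ports = max(w + r for w, r in operations.values())

print(fixed_writes, fixed_reads, true_ports)  # 3 6 7
```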

The configurability of true dual-ported BRAMs is utilized to construct RAM ports with exchangeable read and write functionality. The proposed switched-ports architecture has two sets of ports. The first set consists of nR,f read and nW,f write simple (fixed) ports. As their name suggests, these are fixed, simple, unidirectional ports. The second set consists of nR,s read and nW,s write switched ports. The functionality of these ports alternates dynamically at runtime between two modes, read and write, as follows. In write mode, the nW,s switched write ports are active, while the nR,s switched read ports are inactive. In read mode, the nR,s switched read ports are active, while the nW,s switched write ports are inactive. The suggested architecture reconfigures dual-ported BRAM write ports into reads when more reads and fewer writes are required (read mode). As depicted in Figure 15 (right), dual-ported BRAMs in switched banks are replicated nR,f times to provide nR,f reads in write mode. Each of the nR,f replicas reconfigures its write port into a read port (in addition to the other read port) in read mode. Hence, up to nR,f additional switched reads can be generated, namely nR,s ≤ nR,f.

Fig. 13. (left) Switched ports application example: CPU and multi-ported RAM pairing. (right) Symbol of the required switched-ports RAM.

Fig. 14. Switched ports example with (nW,f, nR,f) = (1, 3) and (nW,s, nR,s) = (2, 3): (left) write configuration, (right) read configuration.

The key idea behind the SRAM savings is reconfiguring unused write ports into read ports. Figure 15 illustrates the suggested architecture. Only data banks whose write ports are unused in read mode are altered, namely the nW,s switched banks. The write ports of each of these banks are redirected to serve as read ports in read mode, as depicted in Figure 15 (lower right). Other banks, which keep their write ports in read mode (the nW,f banks), must increase the number of read ports to nR,f + nR,s to match the read port requirement in read mode, as described in Figure 15 (upper right). Compared to a fixed-ports multi-ported RAM with the maximum number of write ports nW = nW,f + nW,s and the maximum number of read ports nR = nR,f + nR,s, the proposed switched architecture reduces the number of data banks by nW,s · nR,s. Hence, this reduces the number of BRAMs by

$$ n_{W,s} \cdot n_{R,s} \cdot n_{M20K}(d, w). \tag{25} $$

Figure 14 describes an example of a switched multi-ported RAM with (nW,f, nR,f) = (1, 3) fixed ports and (nW,s, nR,s) = (2, 3) switched ports. Therefore, in read mode, there is only one write port and twice as many read ports as the fixed read ports alone. Figure 14 (left) shows the write mode configuration, while Figure 14 (right) shows the read mode configuration. In this example, the upper multi-read bank keeps the minimal single write operation, while the other banks sacrifice write ports to provide additional read ports. According to Equation (25), if the switched RAM in Figure 14 is 32 bits wide (w = 32) and 32k lines deep (d = 32k), 384 BRAM blocks are saved compared to a fixed-ports RAM.
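The 384-block figure can be reproduced with a few lines of arithmetic. The Python sketch below is illustrative: the function name is hypothetical, and the per-bank cost of 64 M20K blocks for a 32k × 32 bank is not taken from a formula here but inferred from the stated result (384 / (2 · 3) = 64).

```python
def switched_bram_savings(n_ws, n_rs, m20k_per_bank):
    """Eq. (25): BRAMs saved, given n_M20K(d, w) blocks per data bank."""
    return n_ws * n_rs * m20k_per_bank


# Figure 14 example: (n_W,f, n_R,f) = (1, 3) and (n_W,s, n_R,s) = (2, 3).
n_wf, n_rf, n_ws, n_rs = 1, 3, 2, 3

# Data banks needed with fixed ports versus switched ports.
fixed_banks = (n_wf + n_ws) * (n_rf + n_rs)
switched_banks = (n_wf + n_ws) * n_rf + n_wf * n_rs

M20K_PER_BANK = 64  # assumption: inferred from the 384-block result in the text

print(fixed_banks - switched_banks)                         # 6
print(switched_bram_savings(n_ws, n_rs, M20K_PER_BANK))     # 384
```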


Fig. 15. (left) Switched ports architecture. Data banks: (upper right) simple, and (lower right) switched.

4.4. Data Dependencies and Bypassing

The new I-LVT structure and the previous XOR-based multi-ported RAMs incur data dependencies due to the feedback functions and the latency of reading the I-LVT to determine the last written bank. Data dependencies can be handled by employing bypassing, also known as forwarding.

Figure 16 illustrates two types of bypassing based on write data and address pipelining. Bypassing is necessary because Altera's dual-port block RAMs cannot internally forward new data when one port reads and the other port writes the same address on the same clock edge, constituting a read-during-write (RDW) hazard. Both bypassing techniques are functionally equivalent, allowing reading of the data that is being written on the same clock edge, similar to single register functionality. However, the fully pipelined two-stage bypassing shown in Figure 16 (lower left) can overcome an additional cycle of latency, namely an additional pipe stage on the write data and address (not shown in the figures). This capability is required if a block RAM has pipelined inputs (e.g., cascaded from another block RAM) that need to be bypassed.

The single-stage and the two-stage bypass circuitry for a block RAM that is w bits wide and d lines deep requires w registers for data bypassing, two ⌈log2 d⌉-wide address registers, and one enable register, for a total of

$$ n_{BReg,unidirectional}(d, w) = w + 2 \lceil \log_2 d \rceil + 1. \tag{26} $$

The switched multi-ported RAM described in section 4.3 utilizes true-dual-port BRAMs to switch port functionality. However, since write and read operations in true-dual-ported RAMs are exchangeable, the bypassing circuitry requires special handling. As described in Figure 16 (right), the bypass circuit of the bidirectional RAM is mirrored compared to the unidirectional RAM. Thus, it can bypass written data from either direction. However, the control logic that drives the bypassing mux selectors needs to be altered to detect the direction of writing. Since the bidirectional bypassing circuit is a mirroring of the unidirectional bypassing circuit, it requires twice the registers used for the unidirectional bypass circuit, hence

$$ n_{BReg,bidirectional}(d, w) = 2 \cdot n_{BReg,unidirectional}(d, w). \tag{27} $$

Fig. 16. Single-stage and two-stage bypassing for simple and true dual-port RAM.

The most severe data dependency the I-LVT design suffers from is write-after-write (WAW), namely, writing to the same address that was written in the previous cycle. This dependency occurs because of the feedback reading and writing latency. A single-stage bypass for the feedback banks should solve this dependency.

Two types of reading hazards are introduced by the proposed I-LVT design, read-after-write (RAW) and read-during-write (RDW). RAW occurs when data written on the previous clock edge are read on the current clock edge. RDW occurs when the same data are written and read on the same clock edge.

Due to the latency of the I-LVT, reading from the same address on the next clock edge after writing (RAW) provides the old data. To read the new data, the output banks of the I-LVT are bypassed by a single-stage bypass to overcome the I-LVT latency.

The deepest bypassing stage is reading new data on the same writing clock edge (RDW), which is similar to a single register stage latency. This can be achieved by a two-stage bypass on the output extraction ports of the I-LVT or the XOR-based design to allow reading on the same clock edge. The data banks, which work in parallel with the I-LVT, are bypassed by a single stage to provide new data. Table II summarizes the required bypassing for data banks, feedback banks and output banks for each type of bypassing of the XOR-based, binary-coded and thermometer-coded I-LVT.

Since the XOR-based multi-ported RAM requires bypassing for all the nW · (nW + nR − 1) banks to read new data under RAW or RDW, the additional registers required for the bypassing are

$$ n_W \cdot (n_W + n_R - 1) \cdot n_{BReg}(d, w). \tag{28} $$

RAW for the binary-coded method requires bypassing the I-LVT only. Since the I-LVT is built out of nW · (nW + nR − 1) blocks, each ⌈log2 nW⌉ bits wide, the following number of additional registers is required

$$ n_W \cdot (n_W + n_R - 1) \cdot n_{BReg}(d, \lceil \log_2 n_W \rceil). \tag{29} $$

RAW for the thermometer-coded method requires bypassing the whole I-LVT, that is, nW · (nW − 1) feedback banks that are 1 bit wide and nW · nR output banks that are nW − 1 bits wide, hence a total register count of

$$ n_W \cdot (n_W - 1) \cdot n_{BReg}(d, 1) + n_W \cdot n_R \cdot n_{BReg}(d, n_W - 1). \tag{30} $$

Table II. Bypassing for XOR-based and binary/thermometer-coded I-LVT multi-ported memories

| | XOR-based: feedback banks | XOR-based: output banks | I-LVT-based: data banks | I-LVT-based: feedback banks | I-LVT-based: output banks |
|---|---|---|---|---|---|
| Allow WAW | 1-stage | None | None | 1-stage | None |
| New data RAW | 1-stage | 1-stage | None | 1-stage | 1-stage |
| New data RDW | 1-stage | 2-stage | 1-stage | 1-stage | 2-stage |

Fig. 17. Initial value for (left) I-LVT-based (right) XOR-based. (0: zeros, I: initial content, U: uninitialized)

RDW for both the binary-coded and thermometer-coded methods requires bypassing the nW · nR data banks in addition to the I-LVT, hence the following number of registers is added to the previous counts in (29) and (30)

$$ n_W \cdot n_R \cdot n_{BReg}(d, w). \tag{31} $$
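The register counts of Equations (19) and (26) to (31) can be evaluated together for a concrete configuration. The Python sketch below is illustrative only; the function names are hypothetical, and the XOR-based no-bypass term nW · (w + ⌈log2 d⌉ + 1) is taken from Table V. It evaluates the 4W/8R, 8k-entry, 32-bit test case with new-data RDW bypassing.

```python
def clog2(n):
    """Ceiling of log2(n) for a positive integer n."""
    return (n - 1).bit_length()


def n_breg(d, w, bidirectional=False):
    """Bypass registers per block RAM: Eq. (26), doubled per Eq. (27)."""
    regs = w + 2 * clog2(d) + 1
    return 2 * regs if bidirectional else regs


def xor_registers(d, w, n_w, n_r):
    base = n_w * (w + clog2(d) + 1)                    # Table V, no bypass
    bypass = n_w * (n_w + n_r - 1) * n_breg(d, w)      # Eq. (28)
    return base + bypass


def binary_registers(d, w, n_w, n_r, rdw=True):
    base = n_w * (clog2(d) + 1)                                  # Eq. (19)
    raw = n_w * (n_w + n_r - 1) * n_breg(d, clog2(n_w))          # Eq. (29)
    extra = n_w * n_r * n_breg(d, w) if rdw else 0               # Eq. (31)
    return base + raw + extra


def thermometer_registers(d, w, n_w, n_r, rdw=True):
    base = n_w * (clog2(d) + 1)                                  # Eq. (19)
    raw = (n_w * (n_w - 1) * n_breg(d, 1)
           + n_w * n_r * n_breg(d, n_w - 1))                     # Eq. (30)
    extra = n_w * n_r * n_breg(d, w) if rdw else 0               # Eq. (31)
    return base + raw + extra


d, w, n_w, n_r = 8 * 1024, 32, 4, 8
print(xor_registers(d, w, n_w, n_r))          # 2780
print(binary_registers(d, w, n_w, n_r))       # 3220
print(thermometer_registers(d, w, n_w, n_r))  # 3240
```

For this configuration the binary-coded and thermometer-coded results (3220 and 3240) agree with the simple-port register counts listed in Table VI, while the XOR-based formula gives 2780, one less than the 2781 listed there.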

4.5. Initializing Multi-ported RAM Content

Due to the special structure of the proposed I-LVT-based multi-ported memories and the previously proposed XOR-based method, RAM data may have replicas in several banks. Hence, initializing the multi-ported RAM with specific content requires special handling. For the XOR-based multi-ported RAM, the first multi-read bank should be initialized to the required initial content; all the other multi-read banks should be initialized to zero. On the other hand, the binary/thermometer-coded I-LVT-based multi-ported RAM requires initializing all the I-LVT banks with zeros. The binary-coded I-LVT will generate a selector to the first data bank (indexed zero), since XOR'ing all the initial values (zeros) will generate zero. Similarly, the thermometer-coded I-LVT will be initialized to the first mutually exclusive case, hence the first bank will be selected. Only the first bank holds the initial data; the remaining banks are left uninitialized. The initial values of each bank in the binary/thermometer-coded I-LVT and XOR-based designs are shown in Figure 17.
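These initialization rules can be checked in a few lines. The following Python sketch is illustrative: the variable names and the value `initial_word` are hypothetical, nW = 3 is chosen arbitrarily, and the thermometer condition assumes the negated case of Equation (22) is the one with i < k.

```python
from functools import reduce
from operator import xor

N_W = 3
initial_word = 0xCAFE  # arbitrary example content for one address

# XOR-based RAM: bank 0 holds the initial content, all other banks hold zero,
# so XOR'ing all banks returns the initial content.
xor_banks = [initial_word] + [0] * (N_W - 1)
xor_read = reduce(xor, xor_banks)

# Binary-coded I-LVT: all banks are zero, so Eq. (12) yields bank ID 0.
binary_sel = reduce(xor, [0] * N_W)

# Thermometer-coded I-LVT: all banks are zero; evaluate each bank's condition.
therm = [[0] * (N_W - 1) for _ in range(N_W)]


def expected(k):
    """Bits bank k must hold to be valid, following Eq. (22)."""
    return [1 - therm[i][k - 1] if i < k else therm[i + 1][k]
            for i in range(N_W - 1)]


one_hot = [int(therm[k] == expected(k)) for k in range(N_W)]

print(hex(xor_read), binary_sel, one_hot)  # 0xcafe 0 [1, 0, 0]
```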

4.6. Comparison and Discussion

In this section, we compare the SRAM and register consumption of our proposed approaches with previous approaches based on RAM architecture and port functionality. A usage guideline based on the required RAM parameters is also provided. These analytical results are in agreement with the experimental results in section 5.

4.6.1. SRAM Usage based on RAM Architecture. In this section, we compare the previous LVT and XOR approaches to the new I-LVT approaches for building multi-port memories. Using the equations provided, we will illustrate why the I-LVT approach is superior in terms of the number of BRAMs required and the number of registers required.


Fig. 18. Multi-ported RAM usage guideline. α is determined by the minimum of Equations (32) and (33), while the trend f(nW) is determined by Equation (34).

Fig. 19. Port architecture usage guideline.

Also, between the two I-LVT methods proposed, we will inspect the number of BRAMs and registers used by each bypassing method.

Tables III and IV summarize SRAM resource usage for each of the three multi-ported RAM approaches: the XOR-based and the binary/thermometer-coded I-LVT. Table III gives the general SRAM cell count and Table IV the number of Altera M20K blocks. Comparing the SRAM cell counts, the XOR-based approach consumes fewer SRAM cells than the thermometer-coded I-LVT if

$$ w < n_R + 1. \tag{32} $$

This inequality is unlikely to be satisfied, since for a single-byte data width the number of read ports nR would need to be at least 8, which is very rare except for systems with multiple requesters requiring concurrent access to a few bits of data (e.g., a mutex or mailbox system). Hence, for typical use cases, the thermometer-coded I-LVT approach will consume fewer SRAM cells.

Comparing the XOR-based approach to the binary-coded approach, the XOR-based approach consumes fewer SRAM cells only if

$$ w < \left. \frac{\lceil \log_2 n_W \rceil \cdot (n_W + n_R - 1)}{n_W - 1} \right|_{n_W > 1}. \tag{33} $$

Both (32) and (33) show that the XOR-based approach will consume fewer SRAM cells only for very narrow data widths, which are uncommon. Hence, the I-LVT approach will be the choice for most applications. Comparing the two I-LVT approaches, Table III shows that the thermometer-coded I-LVT consumes fewer SRAM cells than the binary-coded I-LVT if

$$ 1 < n_W \le 3 \quad \text{OR} \quad n_R < \left. \frac{(n_W - 1) \cdot (\lceil \log_2 n_W \rceil - 1)}{(n_W - 1) - \lceil \log_2 n_W \rceil} \right|_{n_W > 3}. \tag{34} $$

Figure 18 illustrates a guideline for choosing a multi-ported RAM architecture based on area efficiency. For shallow memories, register-based RAM and LVT offer the best area efficiency. However, their area rapidly inflates as the RAM gets deeper. The use of BRAMs alleviates the problem, since SRAM-based memories have higher capacities than register-based memories. The XOR-based method is suitable for memories narrower than α, which is determined by the minimum of Equations (32) and (33). The choice between the binary-coded and thermometer-coded I-LVT is based on nW and nR only and is determined by Equation (34).
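The SRAM-based part of this guideline can be reproduced by evaluating the cell counts of Table III directly. The Python sketch below is an illustration with hypothetical function names; it compares only the three SRAM-based architectures and ignores M20K granularity and register cost.

```python
def clog2(n):
    """Ceiling of log2(n) for a positive integer n."""
    return (n - 1).bit_length()


def sram_cells(arch, d, w, n_w, n_r):
    """Total SRAM cells per Table III (data banks plus any I-LVT banks)."""
    data = d * w * n_w * n_r
    if arch == "xor":
        return d * w * n_w * (n_r + n_w - 1)
    if arch == "binary":
        return data + d * clog2(n_w) * n_w * (n_w + n_r - 1)
    if arch == "thermometer":
        return data + d * n_w * (n_w - 1) + d * (n_w - 1) * n_w * n_r
    raise ValueError(f"unknown architecture: {arch}")


def cheapest(d, w, n_w, n_r):
    """Architecture with the fewest SRAM cells (first one wins on ties)."""
    return min(("xor", "binary", "thermometer"),
               key=lambda arch: sram_cells(arch, d, w, n_w, n_r))


print(cheapest(8192, 32, 4, 8))  # binary: n_R = 8 exceeds the Eq. (34) bound of 3
print(cheapest(8192, 32, 4, 2))  # thermometer: n_R = 2 is below the bound
print(cheapest(8192, 1, 4, 8))   # xor: w = 1 satisfies Eq. (32) and Eq. (33)
```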

4.6.2. Register Usage based on RAM Architecture. Table V summarizes register usage for all multi-ported RAM architectures and bypassing types. Only the register-based LVT architecture is directly proportional to memory depth. As a consequence, it consumes many more registers than the other architectures, making register-based LVTs impractical for deep memories. With single-stage bypassing, the XOR-based design consumes fewer registers than the binary-coded I-LVT design if

$$ w < \lceil \log_2 n_W \rceil. \tag{35} $$

Equation (35) is unlikely to be satisfied. Even if the data width is just one byte (w = 8), the number of write ports nW would need to be larger than 256, which is impractical. On the other hand, with a single-stage bypass, the XOR-based design consumes fewer registers than the thermometer-coded I-LVT design if

$$ w < \left. \frac{1 + n_R}{1 + \frac{n_R}{n_W - 1}} \right|_{n_W > 1}. \tag{36} $$

In typical compute-oriented designs, nR = 2 · nW. Assuming that nR = 2 · (nW − 1), (36) requires that 3 · w − 1 < nR; even for a one-byte data width, this requires 23 < nR to satisfy (36), which is impractical. Therefore, for a single-stage bypass, the I-LVT-based designs will consume fewer registers than the XOR-based design.
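The substitution behind this bound can be written out explicitly. The following is only a worked restatement of (36) under the assumption nR = 2 · (nW − 1) made above, not an additional result:

```latex
w < \frac{1 + n_R}{1 + \frac{n_R}{n_W - 1}}
  = \frac{1 + n_R}{1 + 2}
  \;\Longrightarrow\; 3w - 1 < n_R ,
\qquad
w = 8 \;\Longrightarrow\; n_R > 23 .
```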

Considering two-stage bypassing, I-LVT-based designs will consume nW · nR · nBReg(d,w) more registers, as described in (31). In this case, the XOR-based design consumes fewer registers than the binary-coded I-LVT design only if

$$ w < \lceil \log_2 n_W \rceil \cdot \left( 1 + \frac{n_R}{n_W - 1} \right). \tag{37} $$

On the other hand, the XOR-based design consumes fewer registers than the thermometer-coded I-LVT design only if

$$ w < n_R + 1. \tag{38} $$

Similar to (32), which is equal to (38), this is unlikely to be satisfied in practical designs. Hence, in the case of two-stage bypassing, the I-LVT-based design will consume fewer bypassing registers than the XOR-based method.

4.6.3. SRAM Usage based on Port Functionality. The previous analysis provides a guideline for using XOR, LVT, binary-coded or thermometer-coded I-LVT. However, a guideline for using simple, true, or switched port architectures, as illustrated in Figure 1, is also required. A multi-ported RAM with the following mixed port requirements is used for comparison: (1) nW,f write / nR,f read simple (fixed) ports, (2) nt true-ports, and (3) nW,s write / nR,s read switched ports. To implement the mixed-port multi-ported RAM using different port architectures, the following transformations are required: (1) A true-port can be emulated as two simple ports, a write and a read sharing the same address. (2) A switched port can be emulated as nW,s simple write ports and nR,s simple read ports, or as max(nW,s, nR,s) true-ports. The M20K count of the mixed-port RAM for each port architecture is given by:

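$$ \begin{aligned} \text{Simple}: \quad & n_{M20K}(d,w) \cdot \big( (n_{R,f} + n_t + n_{R,s}) \cdot (n_{W,f} + n_t + n_{W,s}) \big) \\ \text{True}: \quad & n_{M20K}(d,w) \cdot \left( \left. \frac{n_a (n_a - 1)}{2} \right|_{n_a = n_{W,f} + n_{R,f} + n_t + \max(n_{W,s},\, n_{R,s})} \right) \\ \text{Switched}: \quad & n_{M20K}(d,w) \cdot \big( (n_{W,f} + n_t + n_{W,s})(n_{R,f} + n_t) + (n_{W,f} + n_t)\, n_{R,s} \big). \end{aligned} \tag{39} $$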

To simplify the comparison, we assume that the number of read ports is twice the number of write ports for simple and switched ports, hence nR,f = 2nW,f and nR,s = 2nW,s. Figure 19 shows a linear approximation of the BRAM consumption equilibrium. These plots form guidelines for choosing the most area-efficient architecture. Figure 19 (left) shows that the true-ports architecture is more efficient in BRAM usage only if the number of true-ports is more than √5 times the simple write port count. On the other hand, Figure 19 (right) shows that the true-ports architecture is more efficient only if the number of true-ports is more than √5 · nW,f + (√5 − 1) · nW,s.
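Equation (39) is easy to evaluate for a specific port mix. The Python sketch below is illustrative: the function name `bank_units` is hypothetical, and it returns the bank multiplier only, that is, the factor applied to nM20K(d,w).

```python
def bank_units(arch, n_wf, n_rf, n_t, n_ws, n_rs):
    """Bank multiplier of n_M20K(d, w) in Eq. (39) for one port architecture."""
    if arch == "simple":
        return (n_rf + n_t + n_rs) * (n_wf + n_t + n_ws)
    if arch == "true":
        n_a = n_wf + n_rf + n_t + max(n_ws, n_rs)
        return n_a * (n_a - 1) // 2
    if arch == "switched":
        return (n_wf + n_t + n_ws) * (n_rf + n_t) + (n_wf + n_t) * n_rs
    raise ValueError(f"unknown architecture: {arch}")


# 2 fixed + 2 switched write ports, 4 fixed + 4 switched read ports, no true ports.
ports = dict(n_wf=2, n_rf=4, n_t=0, n_ws=2, n_rs=4)
for arch in ("simple", "true", "switched"):
    print(arch, bank_units(arch, **ports))
# simple 32
# true 45
# switched 24
```

With 16 M20K blocks per data bank (the value implied by Table VI, 512 / 32), the simple and switched multipliers correspond to 512 and 384 blocks, matching the data-bank-only counts in that table.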


Table III. Summary of SRAM bits usage

| | Data banks | LVT feedback banks | LVT output banks |
|---|---|---|---|
| Register-based LVT | d · w · nW · nR | N/A | N/A |
| XOR-based | d · w · nW · (nR + nW − 1) | N/A | N/A |
| Binary-coded I-LVT | d · w · nW · nR | d · ⌈log2 nW⌉ · nW · (nW − 1) | d · ⌈log2 nW⌉ · nW · nR |
| Thermometer-coded I-LVT | d · w · nW · nR | d · nW · (nW − 1) | d · (nW − 1) · nW · nR |

Table IV. Summary of M20K blocks usage

| Architecture | Data banks | LVT feedback banks | LVT output banks |
|---|---|---|---|
| Register-based LVT | $n_{M20K}(d,w)\,n_W n_R$ | N/A | N/A |
| XOR-based | $n_{M20K}(d,w)\,n_W (n_R + n_W - 1)$ | N/A | N/A |
| Binary-coded I-LVT | $n_{M20K}(d,w)\,n_W n_R$ | $n_{M20K}(d,\lceil\log_2 n_W\rceil)\,n_W (n_W - 1)$ | $n_{M20K}(d,\lceil\log_2 n_W\rceil)\,n_W n_R$ |
| Thermometer-coded I-LVT | $n_{M20K}(d,w)\,n_W n_R$ | $n_{M20K}(d,1)\,n_W (n_W - 1)$ | $n_{M20K}(d,n_W - 1)\,n_W n_R$ |

Note: $n_{M20K}(d,w)$ is the number of Altera's M20Ks required to construct a $d$ deep by $w$ wide RAM as described in (4).
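The expressions of Table IV can be checked against the 4W/8R, 8k-entry, 32-bit test case of Table VI. The Python sketch below takes the per-RAM block counts n_M20K(d, ·) as inputs rather than computing them from Equation (4). The values passed in are assumptions: they presume that an M20K can be configured as 8k×2 or 16k×1, so that an 8k-deep RAM needs 16, 1, 1, and 2 blocks for widths of 32, 2, 1, and 3 bits. The thermometer-coded feedback banks are taken to be 1 bit wide, as Table III indicates. All names are illustrative.

```python
def m20k_totals(nW, nR, m_data, m_bin, m_fb1, m_thr):
    """Total M20K blocks per architecture, following Table IV.

    m_data = n_M20K(d, w)                 data banks
    m_bin  = n_M20K(d, ceil(log2(nW)))    binary-coded I-LVT banks
    m_fb1  = n_M20K(d, 1)                 thermometer-coded feedback banks
    m_thr  = n_M20K(d, nW - 1)            thermometer-coded output banks
    """
    data = m_data * nW * nR
    return {
        "register_lvt": data,
        "xor": m_data * nW * (nR + nW - 1),
        "binary_ilvt": data + m_bin * nW * (nW - 1) + m_bin * nW * nR,
        "thermometer_ilvt": data + m_fb1 * nW * (nW - 1) + m_thr * nW * nR,
    }

# 4W/8R, 8k x 32 bits; block counts are assumed, not derived from Eq. (4).
totals = m20k_totals(nW=4, nR=8, m_data=16, m_bin=1, m_fb1=1, m_thr=2)
print(totals)
```

Under these assumptions the sketch yields 512, 704, 556, and 588 blocks for the register-based LVT, XOR-based, binary-coded I-LVT, and thermometer-coded I-LVT designs, which are the simple-port M20K counts listed in Table VI.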

Table V. Summary of register usage

| Architecture | No bypass | Additional registers for RAW bypass | Additional registers for RDW bypass |
|---|---|---|---|
| Register-based LVT | $d\,\lceil\log_2 n_W\rceil$ | None | None |
| XOR-based | $n_W (w + \lceil\log_2 d\rceil + 1)$ | $n_W (n_R + n_W - 1)\,n_{BReg}(d,w)$ | None |
| Binary-coded I-LVT | $n_W (\lceil\log_2 d\rceil + 1)$ | $n_W (n_R + n_W - 1)\,n_{BReg}(d,\lceil\log_2 n_W\rceil)$ | $n_W n_R\,n_{BReg}(d,w)$ |
| Thermometer-coded I-LVT | Same as binary-coded | $n_W \big((n_W - 1)\,n_{BReg}(d,1) + n_R\,n_{BReg}(d,n_W - 1)\big)$ | Same as binary-coded |

Note: $n_{BReg}(d,w)$ is the number of additional registers required to bypass a $d$ deep by $w$ wide RAM as described in (26) and (28); use $n_{BReg,bidirectional}$ if the proposed switched architecture is used, otherwise use $n_{BReg,unidirectional}$.
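The "No bypass" column of Table V needs no bypass-register model, so it can be evaluated directly. The following Python sketch does this for the 4W/8R, 8k-entry, 32-bit test case used in Table VII; the helper names are illustrative.

```python
def clog2(n):
    """Ceiling of log2(n) for a positive integer n."""
    return (n - 1).bit_length()

def registers_no_bypass(d, w, nW):
    """Register counts from the 'No bypass' column of Table V."""
    addr_bits = clog2(d)
    return {
        "register_lvt": d * clog2(nW),
        "xor": nW * (w + addr_bits + 1),
        "binary_ilvt": nW * (addr_bits + 1),
        "thermometer_ilvt": nW * (addr_bits + 1),  # same as binary-coded
    }

regs = registers_no_bypass(d=8192, w=32, nW=4)
print(regs)
```

The XOR-based and I-LVT results (184 and 56 registers) equal the "No Bypass" row of Table VII. The register-based LVT expression gives 16384, slightly below the 16400 reported there; Table V does not account for the remaining 16 registers.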

Table VI. Resource consumption for a 4W/8R multi-ported RAM test-case with 8k-entries of 32-bit words

| Resource | XOR-based | Register-based LVT: Simple | Register-based LVT: Switched | Register-based LVT: % change from XOR | Binary-coded I-LVT: Simple | Binary-coded I-LVT: Switched | Binary-coded I-LVT: % change from XOR | Thermometer-coded I-LVT: Simple | Thermometer-coded I-LVT: Switched | Thermometer-coded I-LVT: % change from XOR |
|---|---|---|---|---|---|---|---|---|---|---|
| M20Ks | 704 | 512 | 384 | -45.5% | 556 | 428 | -39.2% | 588 | 460 | -34.7% |
| Registers | 2781 | 18288 | 17816 | +540.6% | 3220 | 2748 | -1.2% | 3240 | 2768 | -0.5% |
| ALMs | 2010 | 61662 | 61333 | +2951.4% | 1454 | 1290 | -35.8% | 1604 | 1445 | -28.1% |
| Fmax (MHz) | 270 | 213 | 213 | -21.1% | 325 | 338 | +25.2% | 390 | 388 | +43.7% |

Note a: This test-case consists of 2 fixed and 2 switched write ports, $(n_{W,f}, n_{W,s}) = (2, 2)$, 4 fixed and 4 switched read ports, $(n_{R,f}, n_{R,s}) = (4, 4)$, and new data RDW bypassing.

Note b: A register-based multi-ported RAM with the same parameters does not fit in the same device.
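The "% change from XOR" columns of Table VI appear to compare each switched design with the XOR-based design. The small Python sketch below reproduces the M20K row under that reading; the helper name is illustrative.

```python
def pct_change(value, baseline):
    """Relative change of value with respect to baseline, in percent."""
    return 100.0 * (value - baseline) / baseline

xor_m20k = 704  # XOR-based M20K count from Table VI
switched_m20k = {"register_lvt": 384, "binary_ilvt": 428, "thermometer_ilvt": 460}

for name, blocks in switched_m20k.items():
    print(f"{name}: {pct_change(blocks, xor_m20k):+.1f}%")
```

The same helper applied to the register, ALM, and Fmax rows gives the remaining percentages of the table, for example 17816 against 2781 registers for the switched register-based LVT.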

Table VII. Register consumption for a 4W/8R multi-ported RAM test-case with 8k-entries of 32-bit words

| Bypass mode | XOR-based | Register-based LVT: Simple | Register-based LVT: Switched | Register-based LVT: % change from XOR | Binary-coded I-LVT: Simple | Binary-coded I-LVT: Switched | Binary-coded I-LVT: % change from XOR | Thermometer-coded I-LVT: Simple | Thermometer-coded I-LVT: Switched | Thermometer-coded I-LVT: % change from XOR |
|---|---|---|---|---|---|---|---|---|---|---|
| No bypass | 184 | 16400 | 16400 | +8813.0% | 56 | 56 | -69.6% | 56 | 56 | -69.6% |
| Allow WAW | 892 | 16400 | 16400 | +1738.6% | 404 | 404 | -54.7% | 392 | 392 | -56.1% |
| New data RAW | 2780 | 16400 | 16400 | +489.9% | 1332 | 1307 | -53.0% | 1352 | 1352 | -51.4% |
| New data RDW | 2781 | 18288 | 17816 | +540.6% | 3220 | 2748 | -1.2% | 3240 | 2768 | -0.5% |

Note a: This test-case consists of 2 fixed and 2 switched write ports, $(n_{W,f}, n_{W,s}) = (2, 2)$, with 4 fixed and 4 switched read ports, $(n_{R,f}, n_{R,s}) = (4, 4)$.


5. EXPERIMENTAL RESULTS

In order to verify and simulate the suggested approach and compare it to previous attempts, fully parameterized Verilog modules have been developed. Both the previous XOR-based multi-ported RAM method and the proposed I-LVT method have been implemented. To simulate and synthesize these designs with various parameters in batch using Altera's ModelSim version 10.1e and Quartus II version 14.0, a run-in-batch flow manager has also been developed. The Verilog modules and the flow manager are available online [Abdelhadi and Lemieux 2015]. To verify correctness, the proposed architecture is simulated using Altera's ModelSim. A large variety of memory architectures and parameters are swept (e.g., bypassing, memory depth, data width, and number of ports) and simulated in batch, each with over a million random memory access cycles.

All multi-ported design modules are implemented using Altera's Quartus II on Altera's Stratix V 5SGXMA5N1F45C1 device. This is a high-performance device with 185k ALMs, 2304 M20K blocks, and 1064 I/O pins. We performed a general sweep and tested all combinations of the following parameters: $n_W$ = 2..4, $n_R$ = 3..6, d = 16k, 32k, w = 8, 16, 32, and bypassing: none, RAW, and RDW. Following this, we analyzed the full set of results. In this paper, we omit many of the in-between settings because they behaved as one would expect from interpolating between the endpoints.

Figure 20 plots the maximum frequency derived from Altera's Quartus II STA at 0.9 V and a temperature of 0 °C. The results show a higher Fmax for the binary/thermometer-coded I-LVT compared to the XOR-based approach for all design cases. With 3 or more write ports, the thermometer-coded I-LVT supports a higher frequency than all other design styles. Compared to the XOR-based approach, the thermometer-coded I-LVT improves Fmax by 38% on average over all design configurations, while the maximum Fmax improvement is 76%.

Figure 21 (top) plots the number of Altera's M20K blocks used to implement each multi-ported RAM configuration. The proposed binary/thermometer-coded I-LVT consumes the fewest BRAM blocks in all cases. The average reduction of the better of the binary-coded and thermometer-coded I-LVT compared to the XOR-based approach is 19% over all tested design configurations, and it can reach 44% for specific configurations. The difference in consumed M20Ks between the binary-coded I-LVT and the thermometer-coded I-LVT is less than 6%. To clarify the difference in BRAM consumption, Figure 21 (bottom) shows the percentage of BRAM overhead above the register-based LVT, which uses the fewest possible BRAMs overall. The XOR-based design consumes more BRAMs in all cases, up to twice the BRAMs of the register-based LVT. On the other hand, I-LVT-based methods consume only 12.5% more BRAMs in the case of 32-bit wide memories.

Figure 22 shows the number of ALMs consumed by each design with different bypassing methods. New data RAW bypassing consumes more ALMs than the non-bypassed version due to address comparators and data muxes. New data RDW bypassing requires an additional address comparator and a wider mux; hence it consumes more ALMs than a new data RAW bypass. In all bypass modes, as the memory data width increases, the XOR-based method consumes more ALMs than the I-LVT methods due to wider XOR gates.

The number of registers required for various designs and bypassing styles is shown in Figure 23. The I-LVT-based methods consume fewer registers than the XOR-based method for no bypassing or new data RAW bypass. For new data RDW bypass, the I-LVT-based methods must bypass the data banks; hence their register consumption rises above that of the XOR-based method. However, the register consumption of the register-based LVT method is the highest overall and can be three orders of magnitude higher, since it is directly proportional to memory depth. Furthermore, register-based LVT memories with 4 write ports and over 16k entries failed to synthesize on our Stratix V with 185k ALMs.

Since the register-based LVT approach is not feasible with the provided deep memory test-cases, the register-based LVT trends are derived analytically from Tables III, IV, and V and not from experimental results. Hence, the register-based LVT trend was added as a reference baseline to Figures 21 and 23 only.

The switched multi-ported RAM is demonstrated with several design cases and nominal parameters of d = 32k and w = 32. The total number of write ports is fixed at 3 ($n_W = n_{W,s} + n_{W,f} = 3$), while the number of switched write ports is swept over 1, 2, and 3 ($n_{W,s} = 1, 2, 3$); hence the number of simple write ports is 2, 1, and 0 ($n_{W,f} = 2, 1, 0$), respectively. The number of simple read ports is set to 3 ($n_{R,f} = 3$), while the number of switched read ports is swept over 1, 2, and 3 ($n_{R,s} = 1, 2, 3$); hence the total number of read ports is 4, 5, and 6 ($n_R = n_{R,s} + n_{R,f} = 4, 5, 6$), respectively. Figure 24 (top and middle) plots the Fmax increase and ALM overhead percentages, respectively, compared to a simple-ports baseline design (without the switched-ports mechanism) with the maximum available ports, namely $n_W$ write ports and $n_R$ read ports. Figure 24 (bottom) shows the M20K reduction relative to the equivalent design with simple-ports only ($n_W$ write ports and $n_R$ read ports) and relative to the equivalent design with true-ports only ($n_{W,f} + n_{R,f} + \max(n_{R,s}, n_{W,s})$ true-ports, as described in Section 4.6.3). For the given test case, using switched-ports can save up to 45% of the BRAM consumption compared to the equivalent simple-port or true-port designs. The BRAM consumption can be anticipated from Equations 25 and 39. The BRAM reduction relieves the routing resources; hence an Fmax increase is observed, as shown in Figure 24 (top). The Fmax increase is more significant for the thermometer-coded I-LVT, which achieves over 22% improvement. Additional data multiplexing increases ALM usage. However, the increase is lower than 31% in all designs, as shown in Figure 24 (middle).
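The data-bank part of this sweep can be anticipated from Equation (39) alone. The Python sketch below enumerates the nine port combinations described above with no true ports and reports the data-bank reduction of the switched design relative to the simple-port and true-port equivalents. It deliberately ignores the I-LVT banks covered by Equation 25, so its percentages are not expected to coincide with Figure 24 (bottom). All names are illustrative.

```python
def data_bank_factor(nWf, nRf, nWs, nRs):
    """Data-bank factors of Eq. (39) with no true ports (nt = 0)."""
    simple = (nRf + nRs) * (nWf + nWs)
    na = nWf + nRf + max(nWs, nRs)       # equivalent true-port count
    true = na * (na - 1) // 2
    switched = (nWf + nWs) * nRf + nWf * nRs
    return simple, true, switched

nW, nRf = 3, 3  # total write ports and fixed read ports of the test case
rows = []
for nWs in (1, 2, 3):
    for nRs in (1, 2, 3):
        simple, true, switched = data_bank_factor(nW - nWs, nRf, nWs, nRs)
        rows.append((nWs, nRs,
                     100 * (1 - switched / simple),
                     100 * (1 - switched / true)))

for nWs, nRs, vs_simple, vs_true in rows:
    print(f"nWs={nWs} nRs={nRs}: {vs_simple:5.1f}% vs simple, "
          f"{vs_true:5.1f}% vs true")
```

The reduction grows with the number of switched ports because each switched write port removes the data banks it would otherwise need for every switched read port.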

6. CONCLUSIONS AND FURTHER DIRECTIONS

In this paper, we have proposed a novel, modular, BRAM-based, switched multi-ported RAM architecture. In addition to unidirectional ports with fixed read/write, this switched architecture allows a group of write ports to switch with another group of read ports dynamically. The proposed switched-ports architecture is less flexible than a true multi-ported RAM where each port is switched individually. Nevertheless, switched memories can dramatically reduce BRAM consumption compared to true or simple ports for systems with alternating port requirements.

An invalidation-based live-value-table (I-LVT) is used to determine the latest written data banks for our suggested multi-ported RAM. The I-LVT generalizes and replaces two prior techniques, the LVT and XOR-based approaches. A general I-LVT is described, along with two specific implementations: binary-coded and thermometer-coded. Both methods are purely SRAM-based. While the original LVT demands an infeasible number of registers, the I-LVT register usage is not directly proportional to memory depth; hence it requires orders of magnitude fewer registers. Furthermore, the proposed I-LVT can reduce BRAM consumption by up to 44% and improve Fmax by up to 76% compared to previous approaches. The thermometer-coded I-LVT method exhibits the highest Fmax, while keeping BRAM consumption within 6% of the minimal required BRAM count.

A detailed analysis and comparison of the resource consumption of the suggested methods and previous methods is provided. With this information we develop a guideline for choosing the most area-efficient approach. Generally, the past XOR and LVT approaches are only recommended for narrow data widths or shallow depths, respectively.


Fig. 20. Fmax (MHz) at T = 0 °C: (top) no bypass; (bottom) new data RDW bypass. [Plots omitted. Horizontal axis grouped by data width (8, 16, 32), depth (16k, 32k), number of read ports (3, 6), and number of write ports (2, 3, 4); series: XOR-based, binary-coded I-LVT, thermometer-coded I-LVT.]

In all other cases, the new I-LVT approaches are superior. A fully parameterized and generic Verilog implementation of the suggested methods is provided as open source hardware [Abdelhadi and Lemieux 2015].

As future work, the new multi-ported memories can be tested with various other FPGA vendors' tools and devices. Furthermore, these methods can also be tested for ASIC implementation using dual-ported RAMs as building blocks, and compared against memory compiler results. Also, to improve Fmax, time-borrowing techniques can be utilized. The goal would be to recover the frequency drop due to the additional logic of the multi-ported RAM, namely the feedback and bank selection logic. One possible approach uses shifted clocks to provide more reading and writing time [Brant et al. 2012]. However, adapting this method to multi-ported memories is not trivial due to internal timing paths across the I-LVT. The true multi-ported RAMs proposed by others [Choi et al. 2012], [Laforest et al. 2014] can utilize our I-LVT method to implement a BRAM-based LVT instead of a register-based LVT, which eliminates the need for register-based memories and allows higher RAM capacities.

ELECTRONIC APPENDIX

The electronic appendix for this article can be accessed in the ACM Digital Library.


Fig. 21. M20K blocks: (top) total count; (bottom) overhead percentage relative to register-based LVT. [Plots omitted. Vertical axes: M20K total block count (100's) and M20K overhead (%); horizontal axis grouped by data width (8, 16, 32), depth (16k, 32k), number of read ports (3, 6), and number of write ports (2, 3, 4); series: XOR-based, binary-coded I-LVT, thermometer-coded I-LVT, register-based LVT.]

ACKNOWLEDGMENTS

The authors would like to thank Altera for donations of hardware and tools, as well as NSERC for funding.

REFERENCES

Ameer M.S. Abdelhadi and Guy G.F. Lemieux. 2014. Modular Multi-ported SRAM-based Memories. In Proc. of the 2014 ACM/SIGDA Internat. Symp. on Field-programmable Gate Arrays (FPGA '14). 35–44.

Ameer M.S. Abdelhadi and Guy G.F. Lemieux. 2015. Switched Multi-ported RAM Verilog source code. (2015). https://github.com/AmeerAbdelhadi/Switched-Multiported-RAM

D. Alpert and D. Avnon. 1993. Architecture of the Pentium Microprocessor. IEEE Micro 13, 3 (1993), 11–21.

Altera Corp. 2013. Stratix V Device Handbook. (May 2013).

H. Bajwa and X. Chen. 2007. Low-Power High-Performance and Dynamically Configured Multi-Port Cache Memory Architecture. In Electrical Engineering, 2007. ICEE '07. Internat. Conference on. 1–6.

A. Brant, A. Abdelhadi, A. Severance, and G.G.F. Lemieux. 2012. Pipeline frequency boosting: Hiding dual-ported block RAM latency using intentional clock skew. In Field-Programmable Technology (FPT), 2012 Internat. Conference on. 235–238.

B.A. Chappell, T.I. Chappell, M.K. Ebcioglu, and S.E. Schuster. 1996. Virtual multi-port RAM employing multiple accesses during single machine cycle. (July 30 1996). US Patent 5,542,067.


Fig. 22. ALMs: (top) no bypass; (bottom) new data RDW bypass. [Plots omitted. Vertical axis: ALMs (100's); horizontal axis grouped by data width (8, 16, 32), depth (16k, 32k), number of read ports (3, 6), and number of write ports (2, 4).]

J. Choi, K. Nam, A. Canis, J. Anderson, S. Brown, and T. Czajkowski. 2012. Impact of Cache Architecture and Interface on Performance and Area of FPGA-Based Processor/Parallel-Accelerator Systems. In Field-Programmable Custom Computing Machines (FCCM), 2012 IEEE 20th Internat. Symp. on. 17–24.

G.S. Ditlow, R.K. Montoye, S.N. Storino, S.M. Dance, S. Ehrenreich, B.M. Fleischer, T.W. Fox, K.M. Holmes, J. Mihara, Y. Nakamura, S. Onishi, R. Shearer, D. Wendel, and L. Chang. 2011. A 4R2W register file for a 2.3GHz wire-speed POWER™ processor with double-pumped write operation. In Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2011 IEEE Internat. 256–258.

E.S. Fetzer and J.T. Orton. 2002. A fully-bypassed 6-issue integer datapath and register file on an Itanium microprocessor. In Solid-State Circuits Conference, 2002. ISSCC. 2002 IEEE Internat., Vol. 1. 420–478.

Joseph A. Fisher. 1983. Very Long Instruction Word Architectures and the ELI-512. In Proc. of the 10th Annual Internat. Symp. on Computer Architecture (ISCA '83). ACM, NY, USA, 140–150.

Weixing Ji, Feng Shi, Baojun Qiao, and Hong Song. 2007. Multi-port Memory Design Methodology Based on Block Read and Write. In Control and Automation. ICCA '07. IEEE Internat. Conference on. 256–259.

R. E. Kessler. 1999. The Alpha 21264 Microprocessor. IEEE Micro 19, 2 (March 1999), 24–36.

Z. Kwok and S.J.E. Wilton. 2005. Register file architecture optimization in a coarse-grained reconfigurable architecture. In Field-Programmable Custom Comp. Machines. 13th Annual IEEE Symp. on. 35–44.

C.E. Laforest, Z. Li, T. O'rourke, M.G. Liu, and J.G. Steffan. 2014. Composing Multi-Ported Memories on FPGAs. ACM Trans. Reconfigurable Technol. Syst. 7, 3 (Sept. 2014).

C.E. Laforest, M.G. Liu, E.R. Rapati, and J.G. Steffan. 2012. Multi-ported Memories for FPGAs via XOR. In Proc. of the ACM/SIGDA Internat. Symp. on Field Programmable Gate Arrays (FPGA '12). 209–218.

Fig. 23. Registers count: (top) no bypass; (bottom) new data RDW bypass. [Plots omitted.]

Fig. 24. Switched-ports compared to simple-ports and true-ports: (top) Fmax increase (%); (middle) ALMs overhead (%); (bottom) M20K reduction (%). [Plots omitted.]

C.E. LaForest and J.G. Steffan. 2010. Efficient Multi-ported Memories for FPGAs. In Proc. of the 18th Annual ACM/SIGDA Internat. Symp. on Field Programmable Gate Arrays (FPGA '10). 41–50.

H.J. Mattausch. 1997. Hierarchical N-port memory architecture based on 1-port memory cells. In Solid-State Circuits Conference, 1997. ESSCIRC '97. Proc. of the 23rd European. 348–351.

J.H. Tseng and K. Asanovic. 2003. Banked Multiported Register Files for High-frequency Superscalar Microprocessors. In Proc. of the 30th Annual Internat. Symp. on Comp. Architecture (ISCA '03). 62–71.

H. Yokota. 1990. Multiport memory system. (May 29 1990). US Patent 4,930,066.

Wang Zuo, Wang Zuo, and Li Jiaxing. 2008. An Intelligent Multi-Port Memory. In Intelligent Information Technology Application Workshops, 2008. IITAW '08. Internat. Symp. on. 251–254.

Received November 2014; revised March 2015; accepted June 2015


Online Appendix to: Modular Switched Multi-ported SRAM-based Memories

AMEER M.S. ABDELHADI, The University of British Columbia
GUY G.F. LEMIEUX, The University of British Columbia

This appendix is a user guide for the Switched Multi-ported RAM (SMPRAM) module Verilog package provided with this paper. The SMPRAM module is compatible with Verilog-2001. The provided Verilog is generic; however, it has been tested using Altera's ModelSim (version 10.0d) only. The SMPRAM module, including its interface signals and configuration parameters, is described in Figure 25. Table VIII lists all interface ports, while Table IX lists all configuration parameters for the SMPRAM module. The code in Listing 1 describes an SMPRAM module instantiation. Furthermore, to instantiate the SMPRAM module, all *.v and *.vh files in this package should be present in your work directory. The following command line clones the package from a GitHub repository:

    git clone https://github.com/AmeerAbdelhadi/Switched-Multiported-RAM.git

Fig. 25. Switched Multi-ported RAM (SMPRAM) module block. [Block diagram omitted. Control inputs: clk, rst, rdWr. Write ports: WEnb, WAddr ((nWPF+nWPS)·log2 MEMD bits), WData ((nWPF+nWPS)·DATW bits). Read ports: RAddr ((nRPF+nRPS)·log2 MEMD bits), RData ((nRPF+nRPS)·DATW bits). Configuration parameters: MEMD (memory depth), DATW (data width), nRPF (number of fixed read ports), nWPF (number of fixed write ports), nRPS (number of switched read ports), nWPS (number of switched write ports), ARCH (architecture type), BYPS (bypass mode), FILE (initialization mif file).]

Table VIII. List of SMPRAM module interface ports

| Port | I/O | Width | Description |
|---|---|---|---|
| clk | Input | 1 | Global clock. |
| rst | Input | 1 | Global synchronous reset. |
| rdWr | Input | 1 | If high, enables switched read ports and disables switched write ports; vice versa otherwise. |
| WEnb | Input | nWPF + nWPS | Write enable for nWPF (LSB) fixed write ports and nWPS switched write ports. |
| WAddr | Input | (nWPF + nWPS) · log2(MEMD) | Write addresses: packed from nWPF (LSB) fixed write ports and nWPS switched write ports; log2(MEMD) bits each. |
| WData | Input | (nWPF + nWPS) · DATW | Write data: packed from nWPF (LSB) fixed write ports and nWPS switched write ports; DATW bits each. |
| RAddr | Input | (nRPF + nRPS) · log2(MEMD) | Read addresses: packed from nRPF (LSB) fixed read ports and nRPS switched read ports; log2(MEMD) bits each. |
| RData | Output | (nRPF + nRPS) · DATW | Read data: packed from nRPF (LSB) fixed read ports and nRPS switched read ports; DATW bits each. |
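To make the bus widths and the packing order of Table VIII concrete, the following Python sketch computes the width of each packed bus and packs per-port fields into a single word with the first fixed port in the least significant bits. It is only a bit-level model for preparing test vectors; the helper names `clog2`, `port_widths`, and `pack` are hypothetical and are not part of the SMPRAM package.

```python
def clog2(n):
    """Ceiling of log2(n) for a positive integer n."""
    return (n - 1).bit_length()

def port_widths(MEMD, DATW, nRPF, nWPF, nRPS=0, nWPS=0):
    """Bit widths of the SMPRAM buses as listed in Table VIII."""
    addr = clog2(MEMD)
    nW, nR = nWPF + nWPS, nRPF + nRPS
    return {"WEnb": nW, "WAddr": nW * addr, "WData": nW * DATW,
            "RAddr": nR * addr, "RData": nR * DATW}

def pack(fields, width):
    """Pack per-port values into one integer; fields[0] lands in the LSBs."""
    bus = 0
    for i, value in enumerate(fields):
        if not 0 <= value < (1 << width):
            raise ValueError(f"field {i} does not fit in {width} bits")
        bus |= value << (i * width)
    return bus

widths = port_widths(MEMD=8192, DATW=32, nRPF=4, nWPF=2, nRPS=4, nWPS=2)
print(widths)

# Two fixed write ports first (LSB side), then two switched write ports.
waddr = pack([0x0001, 0x0002, 0x0003, 0x0004], clog2(8192))
print(hex(waddr))
```

For the 8k-entry, 32-bit, 4W/8R configuration this gives a 4-bit WEnb, a 52-bit WAddr, a 128-bit WData, a 104-bit RAddr, and a 256-bit RData.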

© 2014 ACM 1539-9087/2014/11-ART0 $15.00 DOI: http://dx.doi.org/10.1145/0000000.0000000


Table IX. List of SMPRAM module parameters

| Parameter | Type | Default value | Value range | Description |
|---|---|---|---|---|
| MEMD | Integer | N/A | Power of 2 | Memory depth. |
| DATW | Integer | N/A | 1 ≤ DATW | Data width. |
| nRPF | Integer | N/A | 1 ≤ nRPF | Number of fixed read ports. |
| nWPF | Integer | N/A | 0 ≤ nWPF | Number of fixed write ports. |
| nRPS | Integer | 0 | 0 ≤ nRPS ≤ nRPF | Number of switched read ports. |
| nWPS | Integer | 0 | 0 ≤ nWPS; 1 ≤ nWPS + nWPF | Number of switched write ports. |
| ARCH | String | "AUTO" | "AUTO", "REG", "XOR", "LVTREG", "LVTBIN", or "LVTTHR" | Multi-port RAM architecture: use "AUTO" to choose automatically, "REG" for register-based RAM, "XOR" for XOR-based, "LVTREG" for register-based LVT, "LVTBIN" for binary-coded I-LVT-based, or "LVTTHR" for thermometer-coded I-LVT-based. |
| BYPS | String | "RAW" | "NON", "WAW", "RAW", or "RDW" | Bypassing type: use "NON" to prevent additional bypassing circuitry, "WAW" to allow Write-After-Write, "RAW" to read new data on Read-After-Write, or "RDW" to read new data on Read-During-Write. |
| FILE | String | Not initialized | "mif" file name, without extension | Initialization file in "mif" format, optional. |
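The value ranges of Table IX can be checked before launching a simulation or synthesis run. The Python sketch below is one possible pre-flight check written from the table alone; the function `check_smpram_params` is hypothetical, is not part of the SMPRAM package, and does not reflect any checks the Verilog itself may perform.

```python
ARCH_VALUES = {"AUTO", "REG", "XOR", "LVTREG", "LVTBIN", "LVTTHR"}
BYPS_VALUES = {"NON", "WAW", "RAW", "RDW"}

def check_smpram_params(MEMD, DATW, nRPF, nWPF, nRPS=0, nWPS=0,
                        ARCH="AUTO", BYPS="RAW"):
    """Return a list of violations of the value ranges in Table IX."""
    errors = []
    if MEMD < 1 or MEMD & (MEMD - 1):
        errors.append("MEMD must be a power of 2")
    if DATW < 1:
        errors.append("DATW must be at least 1")
    if nRPF < 1:
        errors.append("nRPF must be at least 1")
    if nWPF < 0:
        errors.append("nWPF must be non-negative")
    if not 0 <= nRPS <= nRPF:
        errors.append("nRPS must satisfy 0 <= nRPS <= nRPF")
    if nWPS < 0:
        errors.append("nWPS must be non-negative")
    if nWPS + nWPF < 1:
        errors.append("at least one write port is required")
    if ARCH not in ARCH_VALUES:
        errors.append(f"unknown ARCH {ARCH!r}")
    if BYPS not in BYPS_VALUES:
        errors.append(f"unknown BYPS {BYPS!r}")
    return errors

# Configuration of the Table VI test case with a thermometer-coded I-LVT.
ok = check_smpram_params(8192, 32, nRPF=4, nWPF=2, nRPS=4, nWPS=2,
                         ARCH="LVTTHR", BYPS="RDW")
bad = check_smpram_params(1000, 32, nRPF=4, nWPF=0, nRPS=0, nWPS=0)
print(ok)
print(bad)
```

Returning a list of messages instead of raising on the first violation makes it easy to report every problem of a batch configuration at once.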

Listing 1. Switched Multi-ported RAM (SMPRAM) module instantiation

    // instantiate a multiported-RAM
    smpram #(
      .MEMD (MEMD ), // integer: memory depth
      .DATW (DATW ), // integer: data width
      .nRPF (nRPF ), // integer: # fixed read ports
      .nWPF (nWPF ), // integer: # fixed write ports
      .nRPS (nRPS ), // integer: # switched read ports
      .nWPS (nWPS ), // integer: # switched write ports
      .ARCH (ARCH ), // string : multi-port RAM architecture
      .BYPS (BYPS ), // string : bypass mode
      .FILE (""   )  // string : initialization file, optional
    ) smpram_inst (
      .clk  (clk  ), // global clock
      .rst  (rst  ), // global reset
      .rdWr (rdWr ), // enables read or write switched ports
      .WEnb (WEnb ), // write enables   [(nWPF+nWPS)-1:0]
      .WAddr(WAddr), // write addresses [(nWPF+nWPS)*log2(MEMD)-1:0]
      .WData(WData), // write data      [(nWPF+nWPS)*DATW-1:0]
      .RAddr(RAddr), // read addresses  [(nRPF+nRPS)*log2(MEMD)-1:0]
      .RData(RData)  // read data       [(nRPF+nRPS)*DATW-1:0]
    );
