
TILE PROCESSOR ARCHITECTURE OVERVIEW FOR THE TILEPRO SERIES

RELEASE 1.2
DOC. NO. UG120
MARCH 2011
TILERA CORPORATION


Copyright © 2008-2011 Tilera Corporation. All rights reserved. Printed in the United States of America.

No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, except as may be expressly permitted by the applicable copyright statutes or in writing by the Publisher.

The following are registered trademarks of Tilera Corporation: Tilera and the Tilera logo.

The following are trademarks of Tilera Corporation: Embedding Multicore, The Multicore Company, Tile Processor, TILE Architecture, TILE64, TILEPro, TILEPro36, TILEPro64, TILExpress, TILExpress-64, TILExpress-20G, TILExpressPro-22G, iMesh, TileDirect, TILEmpower, TILEncore, TILE-Gx, TILE-Gx16, TILE-Gx36, TILE-Gx64, TILE-Gx100, DDC, Multicore Development Environment, Gentle Slope Programming, iLib, TMC (Tilera Multicore Components), hardwall, Zero Overhead Linux (ZOL), MiCA (Multistream iMesh Crypto Accelerator), and mPIPE (multicore Programmable Intelligent Packet Engine). All other trademarks and/or registered trademarks are the property of their respective owners.

Third-party software: The Tilera IDE makes use of the BeanShell scripting library. Source code for the BeanShell library can be found at the BeanShell website (http://www.beanshell.org/developer.html).

This document contains advance information on Tilera products that are in development, sampling, or initial production phases. The information and specifications contained herein are subject to change without notice at the discretion of Tilera Corporation.

No license, express or implied, by estoppel or otherwise, to any intellectual property is granted by this document. Tilera disclaims any express or implied warranty relating to the sale and/or use of Tilera products, including liability or warranties relating to fitness for a particular purpose, merchantability, or infringement of any patent, copyright or other intellectual property right.

Products described in this document are NOT intended for use in medical, life support, or other hazardous uses where malfunction could result in death or bodily injury.

THE INFORMATION CONTAINED IN THIS DOCUMENT IS PROVIDED ON AN “AS IS” BASIS. Tilera assumes no liability for damages arising directly or indirectly from any use of the information contained in this document.

Publishing Information:

    Document number: UG120
    Release: 1.2
    Date: 3/28/11

Contact Information:

    Tilera Corporation
    Information: [email protected]
    Site: http://www.tilera.com


Contents

CHAPTER 1 INTRODUCTION

1.1 Features .........................................................................................................................................................................................2

1.2 Document Overview ...................................................................................................................................................................2

1.3 Other Related Documents ...........................................................................................................................................................3

CHAPTER 2 TILE PROCESSOR ARCHITECTURE OVERVIEW

2.1 Programmable Multicore Processor ..........................................................................................................................................6

2.2 Tile Array ......................................................................................................................................................................................6

2.3 Tile Processor™ Architecture .....................................................................................................................................................6

2.4 The TILEPro Tile Processor Implementations ..........................................................................................................................8

2.4.1 TILEPro64 Processor ...........................................................................................................................................................8

2.4.2 TILEPro36 Processor ...........................................................................................................................................................9

CHAPTER 3 TILE PROCESSING ENGINE

3.1 Instruction Set Architecture (ISA) ...........................................................................................................................................11

3.1.1 Registers .............................................................................................................................................................................11

3.1.2 Instruction Set ............................................................................................................................................ 12

3.2 Operating System Support .......................................................................................................................................................19

3.2.1 Special Purpose Registers (SPRs) ...................................................................................................................................19

3.2.2 Interrupts and Exceptions ...............................................................................................................................................19

3.2.3 Protection Mechanisms ....................................................................................................................................................19

3.2.4 Virtual Memory ................................................................................................................................................................19

3.3 The TILEPro64 Processing Engine Pipeline ...........................................................................................................................19

3.3.1 Fetch ....................................................................................................................................................................................20

3.3.2 RegisterFile (RF) ................................................................................................................................................................20

3.3.3 Execute Stages (EX0, EX1) ...............................................................................................................................................21

3.3.4 WriteBack (WB) .................................................................................................................................................................21

3.3.5 Pipeline Latencies .............................................................................................................................................................21

CHAPTER 4 MEMORY ARCHITECTURE

4.1 Memory Architecture ................................................................................................................................................................23

4.2 Cache Architecture ....................................................................................................................................................24

4.2.1 Overview ............................................................................................................................................................................24

4.2.2 Cache Microarchitecture ..................................................................................................................................................25


4.2.2.1 Dynamic Distributed Cached Shared Memory .....................................................................................26

4.2.2.2 Coherent and Direct-to-Cache I/O ........................................................................................................28

4.2.2.3 Striped Memory .......................................................................................................................................28

CHAPTER 5 IMESH NETWORK

5.1 Overview .................................................................................................................................................................................... 29

5.2 Mesh Networks .......................................................................................................................................................................... 30

5.2.1 Static Network .................................................................................................................................................................. 30

5.2.2 Dynamic Networks .......................................................................................................................................................... 31

5.2.2.1 Support for Stream Data Transfer ..........................................................................................................31

5.2.2.2 Multicore Hardwall .................................................................................................................................32

5.3 iMesh Characteristics on the TILEPro .................................................................................................................................... 32

CHAPTER 6 INPUT/OUTPUT

6.1 Architecture Overview ............................................................................................................................................................. 33

6.1.1 Ingress Dataflow .............................................................................................................................................................. 34

6.1.2 Egress Dataflow ............................................................................................................................................................... 35

6.1.3 Direct-to-Cache I/O ......................................................................................................................................................... 36

6.1.4 UserIO ................................................................................................................................................................................ 36

6.2 TILEPro64 I/O ........................................................................................................................................................................... 37

6.2.1 Memory Interface: DDR2 ................................................................................................................................................ 37

6.2.2 10Gb Ethernet ................................................................................................................................................................... 38

6.2.3 PCIe .................................................................................................................................................................................... 38

6.2.4 Ethernet MACs ................................................................................................................................................................. 39

6.2.5 Host Port Interface ........................................................................................................................................................... 40

6.2.6 Flexible I/O ....................................................................................................................................................................... 40

6.2.7 I2C Interface ...................................................................................................................................................................... 40

6.2.8 UART Interface ................................................................................................................................................................. 41

6.2.9 SPI SERIAL ROM Interface ............................................................................................................................................ 41

CHAPTER 7 TILEPRO TEST, DEBUG, AND BOOT

7.1 TILEPro Test ............................................................................................................................................................................... 43

7.2 JTAG ............................................................................................................................................................................................ 43

7.3 Boot Support .............................................................................................................................................................................. 43

GLOSSARY ............................................................................................... 45

INDEX ................................................................................................ 47


CHAPTER 1 INTRODUCTION

In virtually every domain, application demand for computing cycles continues to increase rapidly. For example, modern video workloads require 10 to 100 times more compute power than a few years ago, due to increasing resolution, more sophisticated compression algorithms, and increasing numbers of channels. The computing demands of intelligent networking and security applications are also increasing dramatically, driven by increasing bandwidth requirements, deperimeterization of services from the periphery of enterprises into the core of the network, and increasing sophistication and deployment of security and services functions. Unfortunately, the performance delivered by conventional processors and DSPs has not kept pace with this increasing computing demand. The increased scaling disparity between the number of transistors that can be integrated on a single chip and the delivered performance of single-core processors is a phenomenon that has been labelled Moore's Gap.

The Tile Processor™ from Tilera® is a new class of multicore processing engine that simultaneously delivers unprecedented levels of performance, programmability, and power efficiency. Tilera's multicore technology is a new approach that bridges the gap between the increasing demand from applications and the performance delivered by existing processors. It addresses Moore's Gap by integrating large numbers of power-efficient processors interconnected by a scalable point-to-point mesh network, minimizing both power and wire-delay issues. The Tile Processor increases throughput by exploiting parallelism in applications. Parallelism is frequently encountered in modern-day applications, such as video (which is pixel and frame parallel), networking (which is session and packet parallel), and wireless (which is channel parallel).

Based on Tilera's iMesh™ Multicore technology, Tilera's device is fully programmable via a standard ANSI C and C++ environment and is scalable in multiple dimensions: for example, Tile Processors containing a larger number of tiles can be built using the same architecture. The device achieves the performance of an ASIC in a software-programmable solution, while reducing development time and enabling a faster time-to-market.

The TILEPro™ Series extends the Tile Processor family with enhanced support for thread-based shared memory programs. These enhancements significantly increase available bandwidth and decrease latency for access to shared data structures. The TILEPro also implements additional instructions to accelerate DSP applications that require saturating arithmetic and unaligned memory accesses.
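For readers unfamiliar with saturating arithmetic, the plain-C function below is a minimal sketch of the semantics behind a saturating 32-bit add, the behavior that the new instructions provide in a single operation. The function name is illustrative, not a Tilera API.

    #include <stdint.h>

    /* Reference semantics of a saturating 32-bit add: on overflow the
     * result clamps to the representable extreme instead of wrapping.
     * Illustrative only; on TILEPro a single saturating-add instruction
     * performs this. */
    static int32_t sat_add32(int32_t a, int32_t b)
    {
        int64_t sum = (int64_t)a + (int64_t)b;  /* widen so the sum cannot overflow */
        if (sum > INT32_MAX) return INT32_MAX;  /* clamp positive overflow */
        if (sum < INT32_MIN) return INT32_MIN;  /* clamp negative overflow */
        return (int32_t)sum;
    }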


1.1 Features

The TILEPro processor incorporates a number of industry innovations:

• Homogeneous full-featured processor tiles that allow users to run existing C and C++ software on a single tile (for example, an operating system like Linux), while simplifying the mapping of parallel applications across multiple tiles.

• Dynamic Distributed Cache (DDC™) system that provides an L3 caching architecture distributed among all of the tiles, rather than a large centralized shared L3 cache. This distributed cache is fully coherent across tiles and, because of its decentralized nature, scales into the hundreds of cores. Fine-grained, hash-based cacheline distribution across the distributed cache balances cache utilization and boosts performance without impact on software.

• Protection via its Multicore Hardwall, which allows multiple applications and operating systems to be protected from unwanted interactions with each other.

Tilera’s multicore software development environment provides the tools to realize peak system performance:

• A “gentle slope” C and C++-based programming model that enables users to get their existing software up and running quickly. Various programming models are readily supported:

• Run-to-completion, process-based programs can be replicated across tiles, with a distribution/load-balancer function to share the workload.

• Threaded programs compile and run seamlessly across many tiles.

• Clusters of tiles with cache affinity can be created to segment parallel processing into multiple tiers.

• Multi-tile pipelines may be used to deliver very high single-stream execution performance.

• For developers desiring extreme control of hardware resources or strict real-time guarantees, a Bare Metal programming environment is offered. The Bare Metal Environment allows direct access to the hardware resources and includes run-time libraries to facilitate communication with I/O devices and other tiles.

This document provides an overview of the Tile Processor architecture, paying particular attention to those aspects that are unique. A key to improving performance lies in exploiting parallelism, making full use of the multiple processor cores and of the high-performance interconnect between them. The Tilera Multicore Development Environment™ (MDE) provides a complete set of tools for developing parallel programs, including standard operating system mechanisms such as processes and threads, plus Tilera-specific APIs that achieve maximum performance by taking advantage of Tile Processor-specific hardware features.
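As a concrete illustration of the “standard operating system mechanisms” point, the sketch below is an ordinary POSIX-threads program of the kind the MDE compiles and runs across tiles. The thread count is arbitrary, and any Tilera-specific affinity APIs are deliberately not shown; this is a sketch, not a tuned example.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4   /* arbitrary; one thread per tile is a typical mapping */

    /* Each worker would carry one shard of a parallel workload. */
    static void *worker(void *arg)
    {
        long id = (long)arg;
        printf("worker %ld running\n", id);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }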

1.2 Document Overview

This document includes:

• An introduction to Tilera technology: a high-level hardware introduction to motivate and provide context for understanding the programming methodology and mechanisms.

• An overview of the Tile Processor architecture: architecture principles and programmer-visible architecture, including specific details about the TILEPro Processor family.


1.3 Other Related Documents

For a further introduction to the Tilera multicore software development environment, please see Programming the Tile Processor (UG205), included in the Multicore Development Environment software release. For further details about the Tile Processor hardware architecture, please see the Tile Processor User Architecture Manual (UG101), the Tile Processor System Architecture Manual (UG103), and the Tile Processor I/O Device Guide (UG104).


CHAPTER 2 TILE PROCESSOR ARCHITECTURE OVERVIEW

The Tile Processor™ implements Tilera’s multicore architecture, incorporating a two-dimensional array of processing elements (each referred to as a tile), connected via multiple two-dimensional mesh networks. Based on Tilera’s iMesh™ Interconnect technology, the architecture is scalable and provides high bandwidth and extremely low latency communication among tiles. The Tile Processor integrates external memory and I/O interfaces on chip and is a complete programmable multicore processor. External memory and I/O interfaces are connected to the tiles via the iMesh interconnect.

Figure 2-1 shows the 64-core TILEPro64™ Tile processor with details of an individual tile’s structure.

Figure 2-1. Tile Processor Hardware Architecture

[Figure: an 8x8 array of tiles interconnected by the six iMesh networks (UDN, STN, MDN, IDN, TDN, CDN), surrounded by I/O shims for four DDR2 memory controllers (msh0-msh3), two XAUI (10GbE) ports, two RGMII (GbE) ports, two PCIe (x4 lane) ports, FlexI/O, and the I2C, JTAG, HPI, UART, and SPI ROM interfaces. The tile detail shows the Processor Engine, Cache Engine, and Switch Engine.]

Each tile is a powerful, full-featured computing system that can independently run an entire oper-ating system, such as Linux. Each tile implements a 32-bit integer processor engine utilizing a three-way Very Long Instruction Word (VLIW) architecture with its own program counter (PC), cache, and DMA subsystem. An individual tile is capable of executing up to three operations per cycle.



The Tile Processor’s tiles have been designed to strike an optimal balance between the size of a single core and the number of cores. The result is a breakthrough in programmability together with multicore performance. Any multicore processor must make a tradeoff between the complexity of a core (and hence its size) and the number of cores. A core that is too small makes the multicore processor hard to program, because it is not able to run off-the-shelf programs written in a high-level language such as C. A core that is too large uses more power and reduces the number of cores that can be implemented on a single chip, thereby reducing the peak multiprogram performance. Tilera designed a streamlined tile to maximize the number of tiles on a single chip, while retaining sufficient performance and capability to run off-the-shelf C programs.

2.1 Programmable Multicore Processor

At the system level, employing a Tile Processor chip resembles using a conventional processor chip that has one or more processing cores, on-chip memory controllers, and I/O interfaces. As with conventional processors, integration of multiple functions onto a single chip lowers system component count by allowing direct connection to DRAM, I/O PHYs, and/or L1 electrical interfaces.

2.2 Tile Array

At the chip level, the Tile Processor contains a two-dimensional array of identical tiles or cores. In addition to traditional compute resources, each tile includes routers for on-chip data transfers. Tilera’s iMesh™ interconnect ties the array of tiles together in a scalable way that can be used to transport streams of data, memory blocks, or scalar values. Each tile in the two-dimensional array connects to other tiles using multiple mesh networks implemented by the network routers in each tile. At the edge of the tile array, the mesh connects into I/O Shims. The I/O Shims convert from the on-chip networks emanating out of the tile array perimeter to the specific data formats required, for example, by memory controllers, I/O devices, MACs, or test logic.

Because the tiles are identical and connected by short point-to-point connections in the iMesh, the tile array is both scalable and power efficient. Implementations can support arbitrary tile array sizes. The simple, clear boundary between the tile array and the I/O Shims makes it a straightforward process to connect the multicore processing capability to any arbitrary I/O protocol or device. As a result, the Tile Processor lends itself to a family of variations utilizing tile arrays of different sizes and different I/Os, all aimed at a broad range of target applications, performance levels, and cost points.

2.3 Tile Processor™ Architecture

Each tile contains a Processor Engine, a Cache Engine, and a Switch Engine, which combine to make a powerful, full-featured compute engine.

The Processor Engine is a conventional three-way VLIW processor with three instructions per bundle and full memory management, protection, and OS support. Compile-time scheduling of VLIW operations results in lower power consumption than dynamically scheduled superscalar processors. The Tile Processor includes special instructions to support commonly used embedded operations in DSP, video, and network packet processing (a plain-C sketch of one such operation follows the list), including:

• Sum of absolute differences (SAD),

• Hashing and checksums,

• Instructions to accelerate encryption,

• SIMD instructions for sub-word parallelism,

• Saturating arithmetic, and

• Unaligned access acceleration.
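As an example of how these operations surface to the programmer, the plain-C loop below computes a sum of absolute differences over two byte buffers, the kernel at the heart of video motion estimation. This is a sketch only; whether the compiler maps it onto the SAD/SADA instructions automatically or an intrinsic is used instead is a toolchain detail not covered here.

    #include <stdint.h>
    #include <stddef.h>

    /* Sum of absolute differences over two byte arrays; the loop body
     * is the operation the SAD/SADA byte instructions accelerate. */
    static unsigned sad_u8(const uint8_t *a, const uint8_t *b, size_t n)
    {
        unsigned sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += (a[i] > b[i]) ? (unsigned)(a[i] - b[i])
                                 : (unsigned)(b[i] - a[i]);
        return sum;
    }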


The Cache Engine contains the tile’s Translation Lookaside Buffers (TLBs), caches, and cache sequencers. For the TILEPro, each tile has a 16KB L1 instruction cache, an 8KB L1 data cache, and a 64KB combined L2 cache. This delivers a total of 5.5MB of on-chip cache for the TILEPro64, or 3.2MB for the TILEPro36. Each tile also contains a 2D DMA engine that orchestrates memory data streaming between tiles and external memory, and among the tiles.

The Switch Engine implements six independent networks. The Switch Engine switches scalar data between tiles through the Static Network (STN) with very low latency. Five dynamic networks (UDN, TDN, MDN, CDN and IDN) aid the Switch Engine by routing packet-based data among tiles, tile caches, external memory, and I/O controllers. Of the five dynamic networks, only the User Dynamic Network (UDN) is user-visible. The others are used to satisfy cache misses from external memory and other tiles, and for various system-related functions.

The Static Network and the five Dynamic Networks together comprise the interconnect fabric of the Tilera iMesh, as shown in Figure 2-2. The user does not need to manage these networks explicitly; rather, they are used by the software system to efficiently implement application-level API abstractions, such as user-generated inter-process socket-like streams.

Figure 2-2. Detail of a Tile within the Tile Processor

[Figure: tile detail showing the Processor Engine (register file and pipelines P0, P1, P2), the Cache Engine (ITLB, DTLB, L1I, L1D, unified L2 cache, and 2D DMA), and the Switch Engine connecting the tile to the UDN, STN, MDN, IDN, TDN, and CDN networks.]

The Tile Processor contains extensive clock gating for power-efficient operation. Further, the architecture includes a software-usable NAP instruction that can be executed on a tile to put the tile into a low-power IDLE mode until a user-selectable external event, such as an interrupt or a packet, arrives.
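A hedged sketch of how the NAP instruction might be reached from C is shown below. The inline-assembly spelling of the mnemonic is an assumption made for illustration, not a documented Tilera interface; the architectural fact is only that a NAP instruction exists.

    /* Hypothetical: drop the tile into low-power IDLE until a
     * user-selected wake-up event (interrupt, packet) arrives.
     * The asm spelling below is assumed, not taken from Tilera docs. */
    static inline void tile_nap(void)
    {
        __asm__ volatile ("nap");
    }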



2.4 The TILEPro Tile Processor Implementations

Two TILEPro processors are included in the TILEPro family: the TILEPro64™ and the TILEPro36™.

2.4.1 TILEPro64 Processor

The second-generation TILEPro family includes the TILEPro64 chip. Its sixty-four tiles are arranged in an eight-by-eight array. Each tile’s Processor Engine is a 32-bit, three-way issue, scalar VLIW machine. The Static Network and five Dynamic Networks are each full-duplex and 32 bits wide. The TILEPro64 supports four 64-bit DDR2 interfaces, two 10Gbps XAUI ports, two 4-lane PCIe ports, two 1GbE ports, one 16-bit HPI, and sixty-four Flexible I/Os. TILEPro64 performance metrics are listed in Table 2-1.

Table 2-1. TILEPro64 Performance Metrics (1)

    Clock Frequency                    -9 device: 866MHz; -7 device: 700MHz
    Tiles                              64
    Operations Per Second              8-bit: 443 BOPs; 16-bit: 222 BOPs; 32-bit: 166 BOPs
    Data I/O                           40+ Gbps
    Memory I/O                         205 Gbps
    Bisection Bandwidth                2660 Gbps
    On-chip Cache Memory Bandwidth     1774 Gbps

    (1) Performance metrics are shown at 866MHz frequency.


2.4.2 TILEPro36 Processor

The TILEPro36 processor is a scaled-down version of the 64-tile device, for applications that do not require all of the performance and I/O capability of the larger device. TILEPro36 performance metrics are listed in Table 2-2.

Table 2-2. TILEPro36 Performance Metrics (1)

    Clock Frequency                    -5 device: 500MHz
    Tiles                              36
    Operations Per Second              8-bit: 144 BOPs; 16-bit: 72 BOPs; 32-bit: 54 BOPs
    Data I/O                           20 Gbps
    Memory I/O                         102 Gbps
    Bisection Bandwidth                1152 Gbps
    On-Chip Cache Memory Bandwidth     576 Gbps

    (1) Performance metrics are shown at 500MHz frequency.


CHAPTER 3 TILE PROCESSING ENGINE

The Tile Processor uses a Very Long Instruction Word (VLIW) Processor Engine in each tile to provide a tremendous amount of raw computing power, in balance with the very large bandwidth of the on-chip network. Individual instructions are a typical RISC and DSP combination aimed at a variety of applications, such as video and network services. The VLIW architecture defines a 64-bit instruction bundle. Instructions that are common in computation can be encoded into a three-instruction bundle that issues into the scalar execution pipelines. For computationally intense applications, the Tilera compiler generates tight three-instruction-per-bundle sequences and loops to maximize compute resources and minimize the instruction-stream footprint. All instructions, including those less commonly used, encode into two-instruction bundles, often used in sequential code with less Instruction Level Parallelism (ILP).

The three-way VLIW Processor Engine in each tile contains three computing pipelines: P0, P1, and P2. Pipe P0 is capable of executing all arithmetic and logical operations, bit and byte manipulation, selects, and all multiply and fused multiply (multiply-add) instructions. P1 can execute all arithmetic and logical operations, special-purpose register (SPR) reads and writes, and control flow instructions. P2 services all memory operations, including loads, stores, and test-and-set instructions.

3.1 Instruction Set Architecture (ISA)

3.1.1 Registers

The Tile Processor provides sixty-four (64) 32-bit architecturally visible registers in each Processor Engine. Register zero reads as zero. The on-chip networks of Tilera’s iMesh™ are register-mapped using seven of the registers, enabling very low latency data transfers. LR is the Link Register, SP the Stack Pointer, and TP the Thread Pointer. The remaining fifty-three (53) registers are for general-purpose use, which gives the compiler a large register space to keep values live, unroll loops, and otherwise reduce register pressure and minimize memory traffic.


The architectural register set is listed in Table 3-3:

Table 3-3. Architectural Register Set

    Register Name    Purpose
    r0-r52           General Purpose Registers
    tp               Thread Pointer
    sp               Stack Pointer
    lr               Link Register
    sn               Static Network
    io0-io1          IDN Ports 0-1
    us0-us3          UDN Ports 0-3
    zero             Read As Zero

3.1.2 Instruction Set

The VLIW processing engine in each tile executes the instructions contained in a 64-bit instruction bundle during each cycle. Each 64-bit instruction bundle specifies either two or three instructions. The 2-instruction (X1, X0) bundle encoding is known as X-mode; the 3-instruction (Y2, Y1, Y0) bundle encoding is known as Y-mode.

X-Mode

X-mode, the two-instruction bundle format (X1, X0), supports all instructions. X1 and X0 instructions can use the following opcode groups:

• X0: Arithmetic/Compare/Logical Instructions

• X0: SIMD Instructions

• X0: Bit/Byte and Special Instructions

• X0: Multiply Instructions

• X1: Arithmetic/Compare/Logical Instructions

• X1: Control Transfer Instructions

• X1: Memory Management Instructions

• X1: Supports the X0: SIMD Instructions, excluding SAD{H,B}, SADA{H,B}, and AVG{H,B}.



X0: Arithmetic/Compare/Logical Instructions

The ISA provides a full set of RISC arithmetic, shift-and-add, select, logical, shift/rotate, mask, and index instructions, as listed in Table 3-4:

Table 3-4. X0 Instructions

    Instruction     Description                                 New for TILEPro
    ADD             Add Word
    ADDI            Add Immediate Word
    ADDLI           Add Long Immediate Word
    ADDLIS          Add Long Immediate Static Write Word
    ADDS            Add Word Saturating
    AND             And Word
    ANDI            And Immediate Word
    AULI            Add Upper Long Immediate Word
    DWORD_ALIGN     Double Word Align
    FNOP            Filler No Operation
    MM              Masked Merge Word
    MNZ             Mask Not Zero Word
    MVNZ            Move Not Zero Word
    MVZ             Move Zero Word
    MZ              Mask Zero Word
    NOP             Architectural No Operation
    NOR             Nor Word
    OR              Or Word
    ORI             Or Immediate Word
    RL              Rotate Left Word
    RLI             Rotate Left Immediate Word
    S1A             Shift Left One Add Word
    S2A             Shift Left Two Add Word
    S3A             Shift Left Three Add Word
    SEQ             Set Equal Word
    SEQI            Set Equal Immediate Word
    SHL             Logical Shift Left Word
    SHLI            Logical Shift Left Immediate Word
    SHR             Logical Shift Right Word
    SHRI            Logical Shift Right Immediate Word
    SLT             Set Less Than Word
    SLTI            Set Less Than Immediate Word
    SLTE            Set Less Than or Equal Word
    SLT_U           Set Less Than Unsigned Word
    SLTE_U          Set Less Than or Equal Unsigned Word
    SLTI_U          Set Less Than Unsigned Immediate Word
    SNE             Set Not Equal Word
    SRA             Arithmetic Shift Right Word
    SRAI            Arithmetic Shift Right Immediate Word
    SUB             Subtract Word
    SUBS            Subtract Word Saturating
    TBLIDX          Table Index
    XOR             Exclusive Or Word
    XORI            Exclusive Or Immediate Word


X0: SIMD Instructions

Byte and halfword data manipulation is supported with a full set of computational and formatting SIMD instructions:

Table 3-5. X0: SIMD Instructions

    Instruction           Description                                                                    New for TILEPro
    ADD{H,B,HS,BS_U}      Add {Half Word, Byte, Half Word Saturating, Unsigned Byte Saturating}         {HS,BS_U}
    ADDI{H,B}             Add Immediate {Half Word, Byte}
    ADIFF{H,B}            Absolute Difference {Half Word, Byte}
    INT{H,L}{H,B}         Interleave {High, Low} {Half Word, Byte}
    MAX{I}{H,B_U}         Maximum {Immediate} {Half Word, Byte Unsigned}
    MIN{I}{H,B_U}         Minimum {Immediate} {Half Word, Byte Unsigned}
    MZ{H,B}               Mask Zero {Half Word, Byte}
    MNZ{H,B}              Mask Not Zero {Half Word, Byte}
    PACK{HS,BS_U}         Pack {Half Word, Unsigned Byte} Saturating
    PACK{L,H}B            Pack {Low, High} Byte
    SAD{H,B}              Sum of Absolute Difference {Half Word, Byte}
    SADA{H,B}             Sum of Absolute Difference Accumulate {Half Word, Byte}
    SEQ{H,B}              Set Equal {Half Word, Byte}
    SEQI{H,B}             Set Equal Immediate {Half Word, Byte}
    SHL{H,B}              Logical Shift Left {Half Word, Byte}
    SHLI{H,B}             Logical Shift Left Immediate {Half Word, Byte}
    SHR{H,B}              Logical Shift Right {Half Word, Byte}
    SHRI{H,B}             Logical Shift Right Immediate {Half Word, Byte}
    SLT{H,B,H_U,B_U}      Set Less Than {Half Word, Byte, Half Word Unsigned, Byte Unsigned}
    SLTE{H,B,H_U,B_U}     Set Less Than or Equal {Half Word, Byte, Half Word Unsigned, Byte Unsigned}
    SLTI{H,B,H_U,B_U}     Set Less Than Immediate {Half Word, Byte, Half Word Unsigned, Byte Unsigned}
    SNE{H,B}              Set Not Equal {Half Word, Byte}
    SRA{H,B}              Arithmetic Shift Right {Half Word, Byte}
    SRAI{H,B}             Arithmetic Shift Right Immediate {Half Word, Byte}
    SUB{H,B,HS,BS_U}      Subtract {Half Word, Byte, Half Word Saturating, Unsigned Byte Saturating}    {HS,BS_U}
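As a reference for what the sub-word operations compute, the plain-C function below reproduces the semantics of a halfword add (the ADDH pattern): two independent 16-bit additions carried out inside one 32-bit word, which the hardware performs in a single instruction. A sketch only; the function name is illustrative.

    #include <stdint.h>

    /* Two independent 16-bit adds packed in one 32-bit word; each half
     * wraps on its own and carries do not cross the halfword boundary,
     * which is what makes this a SIMD operation. */
    static uint32_t addh_ref(uint32_t a, uint32_t b)
    {
        uint32_t lo = ((a & 0xFFFFu) + (b & 0xFFFFu)) & 0xFFFFu;
        uint32_t hi = ((a >> 16) + (b >> 16)) & 0xFFFFu;
        return (hi << 16) | lo;
    }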


X0: Bit/Byte and Special Instructions

Special-purpose arithmetic instructions include ones used for population count, counting leading and trailing zeros, and bit and byte exchanges:

Table 3-6. X0: Bit/Byte and Special Instructions

    Instruction    Description                  New for TILEPro
    BITX           Bit Exchange Word
    BYTEX          Byte Exchange Word
    CLZ            Count Leading Zeros Word
    CRC32_8        CRC32 8-bit Step
    CRC32_32       CRC32 32-bit Step
    CTZ            Count Trailing Zeros Word
    PCNT           Population Count Word

X0: Multiply Instructions

X0 encodings include a 16-bit multiplier and all multiply and fused multiply-add instructions, which can be combined to generate wider multiplies:

Table 3-7. X0: Multiply Instructions

    Instruction                Description                                                                                                              New for TILEPro
    MUL{LL,HH,HL}SA_UU         Multiply {Low Low, High High, High Low} Half Word Shift Accumulate Unsigned Unsigned
    MULHH{A}_{UU,SS,SU}        Multiply {Accumulate} High Half Word times High Half Word {Unsigned Unsigned, Signed Signed, Signed Unsigned}
    MULHL{A}_{UU,SS,SU,US}     Multiply {Accumulate} High Half Word times Low Half Word {Unsigned Unsigned, Signed Signed, Signed Unsigned, Unsigned Signed}
    MULLL{A}_{UU,SS,SU}        Multiply {Accumulate} Low Half Word times Low Half Word {Unsigned Unsigned, Signed Signed, Signed Unsigned}
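To show why a 16-bit multiplier array plus the accumulate forms is sufficient, the plain-C function below builds a full 32x32-to-64-bit unsigned multiply out of the four halfword partial products, which is the decomposition the MULLL/MULHL/MULHH family supports in hardware. The code is a reference sketch, not compiler output.

    #include <stdint.h>

    /* 32x32 -> 64-bit unsigned multiply from 16-bit partial products:
     * a*b = (hh << 32) + ((hl + lh) << 16) + ll. Comments name the
     * corresponding halfword-multiply instruction families. */
    static uint64_t mul32x32_u(uint32_t a, uint32_t b)
    {
        uint32_t al = a & 0xFFFFu, ah = a >> 16;
        uint32_t bl = b & 0xFFFFu, bh = b >> 16;
        uint64_t ll = (uint64_t)al * bl;   /* low  x low   (MULLL_UU) */
        uint64_t hl = (uint64_t)ah * bl;   /* high x low   (MULHL_UU) */
        uint64_t lh = (uint64_t)al * bh;   /* low  x high  (MULHL_UU, operands swapped) */
        uint64_t hh = (uint64_t)ah * bh;   /* high x high  (MULHH_UU) */
        return (hh << 32) + ((hl + lh) << 16) + ll;
    }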


X1: Arithmetic/Compare/Logical Instructions

The X1 instruction can encode the same arithmetic instructions as X0, except for DWORD_ALIGN. Refer to Table 3-4.

X1: Control Transfer Instructions

X1 encodes all control transfer instructions. They use compiler-directed static branch prediction and return prediction, and implicitly load lr for linkage:

Table 3-8. X1: Control Transfer Instructions

    Instruction    Description                                            New for TILEPro
    BBNS{T}        Branch Bit Not Set Word {Taken}
    BBS{T}         Branch Bit Set Word {Taken}
    BGEZ{T}        Branch Greater Than or Equal to Zero Word {Taken}
    BGZ{T}         Branch Greater Than Zero Word {Taken}
    BLEZ{T}        Branch Less Than or Equal to Zero Word {Taken}
    BNZ{T}         Branch Not Zero Word {Taken}
    BLZ{T}         Branch Less Than Zero Word {Taken}
    BZ{T}          Branch Zero Word {Taken}
    J              Jump
    JAL            Jump and Link Word
    JALR{P}        Jump and Link Register Word {Predict}
    JR{P}          Jump Register Word {Predict}
    LNK            Link

X1: Memory Management Instructions

Memory management instructions provide flush and invalidate control of the caches, and a memory fence for explicit control of memory ordering:

Table 3-9. X1: Memory Management Instructions

    Instruction    Description                       New for TILEPro
    FINV           Flush and Invalidate Cache Line
    FLUSH          Flush Cache Line
    INV            Invalidate Cache Line
    MF             Memory Fence
    WH64           Write Hint 64 Bytes


X1: Memory Instructions

The memory operations include word, halfword, and byte loads and stores, and a test-and-set synchronization primitive:

Table 3-10. X1: Memory Instructions

    Instruction                 Description                                                                                   New for TILEPro
    L{W,H,H_U,B,B_U,_NA}        Load {Word, Half Word, Half Word Unsigned, Byte, Byte Unsigned, No Alignment Trap}
    L{W,H,H_U,B,B_U,_NA}ADD     Load {Word, Half Word, Half Word Unsigned, Byte, Byte Unsigned, No Alignment Trap} and Add
    S{W,H,B}                    Store {Word, Half Word, Byte}
    S{W,H,B}ADD                 Store {Word, Half Word, Byte} and Add
    TNS                         Test and Set Word

Y-Mode

Y-mode, the 3-instruction bundle format (Y2, Y1, Y0), supports a subset of operations for each pipe. Y2, Y1, and Y0 instructions can use the following opcode groups.

Y0: Arithmetic/Compare/Logical Instructions

The Y0 instruction can encode all of the same Arithmetic/Compare/Logical instructions as X0, except for the Long Immediate encodings. Refer to “X0: Arithmetic/Compare/Logical Instructions” on page 13 for more information.

Y0: Bit/Byte Instructions

The Y0 instruction can encode all of the same Bit/Byte instructions as X0. Refer to “X0: Bit/Byte and Special Instructions” on page 16 for more information.

Y0: Multiply Instructions

The Y0 instruction can encode a subset of the X0 Multiply instructions that are useful for 3-wide issue. These instructions include: mulll{a}_{uu,ss}, mulhh{a}_{uu,ss}, and mulhlsa_uu. Refer to “X0: Multiply Instructions” on page 16 for more information.

Y1: Arithmetic/Compare/Logical Instructions

The Y1 instruction can encode all of the same Arithmetic/Compare/Logical instructions as X1, except for the Long Immediate encodings. Refer to “X1: Arithmetic/Compare/Logical Instructions” on page 17 for more information.

Y2: Memory Instructions

The Y2 instruction can encode all of the same Memory instructions as X1, except the L{W,H,H_U,B,B_U}ADD, S{W,H,B}ADD, and test-and-set (TNS) encodings. Refer to “X1: Memory Instructions” on page 18 for more information.
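The TNS (test-and-set) primitive in Table 3-10 is enough to build a simple spinlock. The sketch below shows the idea in portable C, using a GCC builtin as a stand-in for the single TNS instruction a Tilera toolchain would emit or expose; the function names are illustrative.

    #include <stdint.h>

    /* Stand-in for TNS: atomically store 1 and return the old value.
     * On TILEPro this would be one TNS instruction; here a GCC builtin
     * gives equivalent semantics for illustration. */
    static uint32_t tns_word(volatile uint32_t *addr)
    {
        return __sync_lock_test_and_set(addr, 1);
    }

    static void spin_lock(volatile uint32_t *l)
    {
        while (tns_word(l) != 0)
            ;   /* old value 1 means someone else holds the lock; retry */
    }

    static void spin_unlock(volatile uint32_t *l)
    {
        __sync_lock_release(l);   /* store 0 with release semantics */
    }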


3.2 Operating System Support

The Tile Processor includes the mechanisms needed within the tile to support a reliable operating system port, whether it is Tilera’s distributed Hypervisor, an embedded operating system, a standard operating system like Linux, or a special-purpose, lightweight runtime framework.

3.2.1 Special Purpose Registers (SPRs)

The Tile Processor provides access to architectural control and implementation-specific state via moves to and from Special Purpose Registers. SPR state is used to access test hooks, configure hardware structures and options at boot time, and monitor real-time processing state. In general, OS-level functions can be coded to poll SPRs for control state, or to configure certain conditions to cause an interrupt.

3.2.2 Interrupts and Exceptions

Each tile contains a local interrupt controller that prioritizes pending interrupts and can deliver one interrupt per cycle. The majority of the interrupts are responses to protection violations; as such, they safeguard the system from programmer errors. Each tile contains a tile timer that can be used as a source of local periodic interrupts. The local interrupt controller also processes inbound external interrupts from I/Os and system global devices.

The Tile Processor supports highly programmable external interrupts. External interrupts are signaled through Flexible I/O pins and can be vectored to any tile for handling. Interrupts are communicated to tiles as packets over the IDN. For more information about the Flexible I/O pins, refer to the TILEPro64 Data Sheet.

3.2.3 Protection Mechanisms

The Tile Processor provides a simple yet robust, fully protected environment that allows multiple operating systems, multiple levels within an operating system, and user processes to co-exist. What makes this possible are two mechanisms, the Minimum Protection Level (MPL) and the Current Protection Level (CPL), which guard memory, network, and I/O device accesses. The Processor Engine maintains an MPL needed to access each protected resource, and tracks the CPL of the executing process. Interrupts are permitted or masked depending on the comparison of the current CPL with a requested resource’s MPL; a sketch of this check appears below.
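A minimal sketch of the comparison just described, with illustrative (non-architectural) names: an access or interrupt targeting a protected resource is honored only when the current protection level reaches the resource’s minimum level.

    /* Illustrative only: field and function names are not architectural. */
    struct resource {
        unsigned mpl;   /* minimum protection level required for access */
    };

    static int access_permitted(unsigned cpl, const struct resource *r)
    {
        /* Below the minimum level, the access faults or the interrupt
         * stays masked, per the protection rules described above. */
        return cpl >= r->mpl;
    }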

3.2.4 Virtual Memory

Within a tile, the architecture provides a 32-bit virtual address space. Virtual addresses are translated to a 64-bit global physical address space. Translation includes checking for protected regions of memory and the Address Space IDentifier (ASID) for managing multiple active address spaces.

3.3 The TILEPro64 Processing Engine Pipeline

The TILEPro64 Processor Engine has three execution pipelines (P2, P1, P0) of two stages (EX0, EX1) each. Both modes of bundling instructions, X-mode and Y-mode, can issue instructions into any of the three execution pipelines. Y-mode uses all three pipelines simultaneously; one of the pipelines remains idle during X-mode issue. P0 is capable of executing all arithmetic and logical operations, bit and byte manipulation, selects, and all multiply and fused multiply instructions. P1 can execute all of the arithmetic and logical operations, SPR reads and writes, conditional transfers, and control flow instructions. P2 services all memory operations: loads, stores, and test-and-set instructions.


The TILEPro64 Processor Engine uses a short, in-order pipeline aimed at low branch and low load-to-use latencies. The basic pipeline consists of five stages: Fetch, RegisterFile, Execute0, Execute1, and WriteBack.

3.3.1 Fetch

The Fetch pipeline stage runs the complete loop from updating the Program Counter (PC), through instruction fetch, to selecting a new PC. The PC provides an index into several structures in parallel: the ICache data, tag, and merged Branch Target Buffer (BTB)/line-prediction arrays, and a small ITLB. The next-fetch-address multiplexor must then predict the next PC from many sources, including the next instruction, line or branch prediction, branch mispredict, trap, or boot address.

3.3.2 RegisterFile (RF)

There are three instruction pipelines, one for each of the instructions in a bundle, designated P0, P1, and P2. Two-instruction bundles always issue one instruction into P0, and issue the second instruction into either P1 or P2, depending on the type of the instruction.

P0 and P1 require a destination operand, and the memory load path requires a destination register operand.

The RF stage produces valid source operands for the instructions. This requires decoding the two or three instructions contained in the instruction bundle provided by the Fetch stage each cycle, accessing the source operands from the main register file, checking instruction dependencies, and bypassing operand data from previous instructions. A three-instruction bundle can require up to seven source register operands and three destination register operands: three source operands to support the fused multiply-add and conditional transfer operations, and two source operands each for the other two instruction pipelines.

Figure 3-1. TILEPro64 Processor Pipeline

[Figure: the main five-stage pipeline (Fetch, RF, EX0, EX1, WB), with the commit point after EX0. Memory accesses flow through L1 tag/data and compare stages and, on a miss, through the ARB, MAF, L2 tag/compare, and L2 data drive stages.]


3.3.3 Execute Stages (EX0, EX1)

The EX0 pipeline stage is the instruction commit point of the processor: if no exception occurs, the architectural state can be modified. The early commit point allows the processor to transmit values computed in one tile to another tile with extremely low, register-like latencies. Single-cycle operations can bypass from the output of EX0 into the input of EX0 for the next instruction. Two-cycle operations are fully pipelined and can bypass from the output of EX1 into the input of EX0.

3.3.4 WriteBack (WB)

Destination operands from P1 and P0 are written back to the Register File in the WB stage. Load data returning from memory is also written back to the Register File in the WB stage. The Register File is write-through, eliminating the need for a bypass from the output of WB into EX0.

3.3.5 Pipeline Latencies

With pipelined execution, multiple operations do not execute sequentially; their executions overlap in time. In the TILE Architecture, instructions that have longer latencies are fully pipelined.

Table 3-11. TILEPro Pipeline Latencies (1 Cycle is 1.33 ns)

    Operation                                                           Latency
    Branch Mispredict                                                   2 cycles
    Load to Use - L1 hit                                                2 cycles
    Load to Use - L1 miss, L2 hit                                       8 cycles
    Load to Use - L1/L2 miss, adjacent Dynamic Distributed Cache hit    35 cycles
    Load to Use - L1/L2 miss, DDR2 page open, typical                   69 cycles
    Load to Use - L1/L2 miss, DDR2 page miss, typical                   88 cycles
    MUL*, SAD*, ADIFF instructions                                      2 cycles
    All other instructions                                              1 cycle
    Tile-to-tile network hop                                            1 cycle
    Tile-to-tile network hop and data use                               2 cycles
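As a worked example using the table’s 1.33 ns cycle time: a load that misses both the L1 and L2 but hits an adjacent tile’s cache costs roughly 35 x 1.33 ns, about 47 ns, while a typical DDR2 open-page access costs about 69 x 1.33 ns, about 92 ns. Keeping a working set homed in nearby tile caches therefore roughly halves the observed miss latency.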


CHAPTER 4 MEMORY ARCHITECTURE

4.1 Memory Architecture

The Tile Processor architecture defines a flat, globally shared 64-bit physical address space and a 32-bit virtual address space. The TILE64™ and TILEPro™ families of processors implement a 36-bit physical address space. The globally shared physical address space provides the mechanism by which processes and threads can share instructions and data. Data memory is byte, halfword, and word addressable.

By default, hardware provides a cache-coherent view of data memory to applications: a read by a thread or process of a physical address P returns the value of the most recent write to address P. Instruction memory that is written by the process itself (self-modifying code) or by other processes is not kept coherent by hardware; special software sequences using the icoh instruction must be used to enforce coherence between data and instruction memory. In the TILE64 implementation, I/O writes are not kept coherent with on-chip caches. The TILEPro implementation provides hardware cache coherence for I/O accesses.
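A hedged sketch of the self-modifying-code sequence just described is shown below. Both helper routines are hypothetical placeholders; the exact flush/icoh sequence is defined by Tilera’s system software, not by this overview.

    /* Hypothetical helpers; the real sequence is supplied by system
     * software, not by this example. */
    extern void flush_dcache_range(void *start, unsigned len);  /* push dirty data out */
    extern void icoh_range(void *start, unsigned len);          /* run icoh over range */

    /* Make freshly written instructions visible to the fetch stream. */
    static void commit_code(void *buf, unsigned len)
    {
        flush_dcache_range(buf, len);   /* data side: write back stores */
        icoh_range(buf, len);           /* instruction side: enforce coherence */
    }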

Non-coherent and non-cacheable memory modes are also supported, as shown in Table 4-1. In addition to the memory modes, the architecture provides several memory attributes for controlling the allocation and distribution of cache lines. These are shown in Table 4-2.

The Tile Processor architecture memory attributes and modes are managed and configured through system software programming of page tables and enforced through TLB entries. Chapter 4 of the Multicore Development Environment Optimization Guide (UG105) provides the Application Programmer Interface (API) and details about memory allocation.
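
For flavor, a hedged sketch of what requesting a page attribute might look like from user space; the tmc_alloc_* names are assumed from the MDE's TMC library, and UG105 remains the authoritative reference.

    /* Sketch, not authoritative: requesting the "Hashed" attribute of
     * Table 4-2 for a shared buffer. All tmc_* identifiers are assumed
     * from the MDE's TMC library; consult UG105 for the real API. */
    #include <stddef.h>
    #include <tmc/alloc.h>   /* assumed MDE header */

    void *alloc_hash_homed(size_t bytes)
    {
        tmc_alloc_t alloc = TMC_ALLOC_INIT;
        /* Distribute the page's cache lines across tiles by hardware hash. */
        tmc_alloc_set_home(&alloc, TMC_ALLOC_HOME_HASH);
        return tmc_alloc_map(&alloc, bytes);   /* NULL on failure */
    }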

Table 4-1. Tile Processor Architecture Memory Modes

Memory Mode            Description
Coherent Memory        Hardware cache coherent memory.
Non-Coherent Memory    Hardware does not maintain coherence.
Non-Cacheable Memory   Data cache blocks are not cached in any on-chip cache. Instruction
                       cache blocks are not cached in the unified L2, but are always
                       cached in the L1 instruction cache.


4.2 Cache Architecture

4.2.1 Overview

Due to the large difference between DRAM and processor speeds, the cache subsystem is critical for delivering high performance. The cache subsystem's primary role is to prevent the processor cores from stalling due to long memory latencies. To this end, the cache subsystem implements a high performance, non-blocking, two-level cache hierarchy. The two-level design isolates the timing-critical L1 caches from complexity, allowing the L1 data and instruction cache designs to be simple, fast, and low power.

The execution engine does not stall on load or store cache misses. Rather, execution of subsequent instructions continues until the data requested by the cache miss is actually needed by another instruction. The cache subsystem is non-blocking and supports multiple concurrent outstanding memory operations. It supports hit-under-miss and miss-under-miss, allowing loads and stores to different addresses to be reordered to achieve high bandwidth and overlap miss latencies, while still ensuring that true memory dependencies are enforced.
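
A sketch of code shaped to exploit miss-under-miss (illustrative only): the four loads below are independent, so their miss latencies overlap rather than serialize, and the pipeline stalls only at the first use.

    /* Illustrative only: several independent loads issued back-to-back
     * keep multiple misses outstanding at once under stall-on-use. */
    #include <stddef.h>

    long gather4(const long *table, const size_t idx[4])
    {
        long a = table[idx[0]];   /* miss 1 issues            */
        long b = table[idx[1]];   /* miss 2 overlaps miss 1   */
        long c = table[idx[2]];   /* miss 3                   */
        long d = table[idx[3]];   /* miss 4                   */
        return a + b + c + d;     /* first use: one combined wait */
    }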

The cache subsystem provides cache-coherent shared memory, atomic instructions (test-and-set), and memory fences (MF). The TILEPro cache system maintains coherence with I/O DMA accesses to memory, and allows I/O to read and write the on-chip caches directly.

Finally, the cache subsystem implements a software-programmable direct memory access (DMA) engine and supports using portions of the L2 cache as scratchpad memory.

Table 4-2. Supported Allocation Control for Tile Architecture

Attribute           Description
No L1d allocation   Lines are not allocated in the L1d cache (TILEPro only).
No L2 allocation    Remotely homed lines are not allocated in the L2 cache.
Pinned memory       Hardware locks the requested memory page in the L2 cache.
Hashed              Lines on the page are distributed across cores according to a
                    hardware hash function (TILEPro only).


4.2.2 Cache Microarchitecture

Table 4-3 lists the most important characteristics of the TILE64 and TILEPro cache subsystems.

Figure 4-1 shows the top-level block diagram of the Tile cache subsystem. The processor engine can issue one load or one store per cycle. The L1D cache is checked for the requested data; if the L1D does not have it, the request is delivered to the L2 cache. Stores update the L1D if the targeted cache block is present, and always write through to the L2 cache. The L1I cache is supported by a hardware prefetching engine that predicts and fetches the most likely next instruction cache line. Misses in the L2 cache on a given tile are satisfied by caches in other tiles or by external memory. If the other caches do not have the requested cache line, they in turn fetch it from external memory and deliver it to the requesting core.

Table 4-3. Cache Subsystems (an entry spanning both columns applies to both processors)

                                              TILE64                         TILEPro
L1 instruction (L1I) cache                    8 KB, direct-mapped            16 KB, direct-mapped
L1 instruction translation lookaside buffer   8 entries, fully associative   16 entries, fully associative
L1 data (L1D) cache                           8 KB, two-way associative
L1 data translation lookaside buffer          16 entries, fully associative
L2 unified cache                              64 KB, two-way associative     64 KB, four-way associative
Latency (load to use)                         2 cycles L1D hit, 8 cycles local L2 hit,
                                              30-60 cycles remote L2 hit, 80 cycles L2 miss to memory
Architecture                                  Non-blocking, out-of-order, stall-on-use
DDC (Dynamic Distributed Cache) technology    No                             Yes
Line Size                                     L1I: 64B; L1D: 16B; L2: 64B
Allocate Policy                               L1I: allocate on read miss; L1D: allocate on load miss only;
                                              L2: allocate on load or store miss
Write Policy                                  L1I: N/A; L1D: write through, store update on hit; L2: writeback
Error Protection                              L1I: 64-bit parity; L1D: 8-bit parity; L2: 8-bit parity


The cache subsystem supports out-of-order retirement, meaning that instructions subsequent to a load or store miss can write the destination register before the load or store completes. Architectural state is kept consistent because the issue logic blocks subsequent instructions from using stale data. The L2 cache subsystem supports multiple outstanding memory operations and cache misses, and maintains an outstanding miss file to track transactions launched from the tile to memory or to other tiles. Each tile can have up to eight outstanding load misses to external memory, as well as four (two for TILE64) outstanding L2 writebacks.
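
As a back-of-envelope estimate (an assumption-laden application of Little's law, not a figure from Tilera), eight outstanding 64-byte line misses against the roughly 80-cycle memory latency of Table 4-3 bound a single tile's demand-miss bandwidth near (8 × 64 B) / (80 cycles × 1.33 ns/cycle), or about 4.8 GB/s. Keeping misses overlapped, rather than raw controller bandwidth, is therefore what keeps an individual tile fed.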

Figure 4-1. Cache Engine Block Diagram

4.2.2.1 Dynamic Distributed Cached Shared Memory

The TILEPro uses the Dynamic Distributed Cache (DDC™) to provide a hardware-managed, cache-coherent approach to shared memory. Applications normally access dynamic distributed cached shared memory using loads and stores. DDC allows a page of shared memory to be homed on a specific tile (or distributed across many tiles), then cached remotely by other tiles. This mechanism allows a tile to view the collection of on-chip caches of all tiles as one large, shared, dynamic distributed cache. It promotes on-chip access and avoids the bottleneck of off-chip global memory. This form of shared memory access is particularly useful when processes read and write shared data in a fine-grained, interleaved manner — such as with locks and other synchronization objects.

Figure 4-2 shows a read from tile A (the remote requesting tile) of cacheline X, where cacheline X is homed at tile B (the home tile):

1. Tile A first checks its local caches for cacheline X and, on a miss, sends a request for cacheline X to tile B.

2. Tile B receives the request for cacheline X and retrieves cacheline X from its L2.

3. Tile B then sends the full cacheline X back to tile A. Tile A installs cacheline X in its local L1 and L2 caches.


Figure 4-2. Request to Home Tile/Fill L2/L1 with Cacheline X

Figure 4-3 shows a write from tile A to a word (X[0]) in cacheline X, where cacheline X is again homed at tile B (a sketch modeling the home tile's role follows the steps):

1. Tile A sends the write address and data to tile B.

2. Tile B receives the write address and data and checks the directory information for cacheline X. The directory indicates that tile C (the sharing tile) has a copy of cacheline X. Tile B updates cacheline X with the new value for word X[0].

3. Tile B sends an invalidate message to tile C.

4. Tile C receives the invalidation and invalidates cacheline X from its caches.

5. Tile C then sends an invalidation acknowledgement back to tile B.

6. Tile B receives the invalidation acknowledgement and sends a write acknowledgement back to tile A.

7. Tile A receives the write acknowledgement message and thus knows that the write to word X[0] has completed.
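
The home tile's role in steps 2 through 6 can be summarized in a short model. This is a sketch only; the send_*/wait_* helpers are hypothetical stand-ins for hardware actions, not an API.

    /* Illustrative model of the DDC write flow above, as seen from the
     * home tile. All helper functions are hypothetical. */
    void send_invalidate(int tile);        /* hypothetical hardware action */
    void wait_for_invalidate_ack(void);    /* hypothetical */
    void send_write_ack(int tile);         /* hypothetical */

    typedef struct { int sharers[64]; int nsharers; } directory_t;

    void home_tile_write(directory_t *dir, long *line, int word, long value,
                         int requester)
    {
        line[word] = value;                      /* step 2: update home copy   */
        for (int i = 0; i < dir->nsharers; i++)  /* step 3: invalidate sharers */
            send_invalidate(dir->sharers[i]);
        for (int i = 0; i < dir->nsharers; i++)  /* step 5: collect acks       */
            wait_for_invalidate_ack();
        dir->nsharers = 0;                       /* no cached copies remain    */
        send_write_ack(requester);               /* step 6: release the writer */
    }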


Figure 4-3. Write from Tile A to Word X[0] in Cacheline X

4.2.2.2 Coherent and Direct-to-Cache I/O

TILEPro provides hardware cache coherence for I/O DMA accesses. On a write to memory from an I/O DMA engine, the hardware invalidates any cached copies of the line, and updates the cache with the newly written data.

Similarly, on a read to memory from an I/O DMA engine, the hardware checks the on-chip caches for the line and supplies it from there if found. The System Architecture Manual (UG103) describes these mechanisms in detail.

4.2.2.3 Striped Memory

TILEPro provides a boot-time option to enable a "striped main memory" mode of operation. Striped main memory mode overrides the default mapping of physical memory pages to the four main memory controllers: a physical page of memory is "striped" across the four controllers at an 8KB granularity, so the first 8KB of a page is located at memory controller 0, the next 8KB at memory controller 1, the next at memory controller 2, the next at memory controller 3, and the pattern then repeats through the page. Striped main memory mode uniformly spreads all physical memory pages across the controllers, balancing the load among the four of them.
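
Assuming a simple round-robin at the stated 8KB granularity, the controller that serves a given physical address is just two bits of the address; a hedged sketch:

    /* Illustrative arithmetic only: which controller serves a physical
     * address when 8 KB stripes rotate across four controllers. */
    #include <stdint.h>

    static inline unsigned stripe_controller(uint64_t paddr)
    {
        return (unsigned)((paddr >> 13) & 0x3);  /* 8 KB = 2^13; 4 controllers */
    }
    /* Example: bytes 0-8191 of a page map to controller 0, the next
     * 8 KB to controller 1, and so on, wrapping every 32 KB. */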


CHAPTER 5 IMESH NETWORK

5.1 Overview

All communication within the tile array and between the tiles and I/O devices takes place over the iMesh™ Interconnect. The iMesh Interconnect consists of two classes of networks, both supporting low latency and high bandwidth communication. The first class comprises a set of software-visible networks for application-level streaming and messaging, while the second consists of the networks used by the memory system to handle memory requests, exchange cache coherency commands, and support high performance shared memory communication. Dedicated Switch Engines implement the iMesh Interconnect, allowing a complete decoupling of data routing from the Processing Engines.

Due to the mesh network topology, every network intersects in every tile. Each tile contains a Switch Engine at the intersection that facilitates data transfers between tiles. The connections are logically independent, full-duplex, and flow-controlled. The Switch Engine connects directly to the Processing Engine via architecturally-defined interconnect registers. Within the tile, the Switch Engine contains all of the control and datapath for each of the network connections. The Switch Engine also implements buffering and flow control within all the networks, so the tiles do not have to run synchronously; each tile can process data asynchronously.

The Switch Engine contains six physical mesh networks. The Static Network (STN) switches scalar data between tiles with very low latency; data transfers over the static network follow routes orchestrated in advance (see Section 5.2.1). The other five are Dynamic Networks, which facilitate streaming and packet data transfer among tiles and I/O devices. Of the five dynamic networks (the UDN, TDN, MDN, CDN, and IDN), only the User Dynamic Network (UDN) is visible to the user. The others are used to satisfy cache misses from external memory and other tiles, for DMA transfers, for I/O, and for various other system-related functions.

The Tilera iMesh contains five dynamic networks. Their implementations are essentially the same; they are differentiated by usage model.

• UDN - Tilera user-level APIs are built on the User Dynamic Network (UDN). User applications directly link in C library routines to provide a variety of convenient access modes. The UDN supports extremely efficient streaming through a register-level FIFO interface and hardware supported steering of distinct streams into separate FIFOs.

• IDN - The I/O Dynamic Network (IDN) is used primarily for transfers between tiles and I/O devices and between I/O devices and memory. A privilege-level protection mechanism limits access to OS-level code. Like the UDN, the IDN supports stream steering as well.

• MDN - The Memory Dynamic Network (MDN) is used for memory data transfer (resulting from loads, stores, prefetches, cache misses or DMA) between tiles themselves, and between tiles and external memory. Only the Cache Engine has a direct hardware connection to the MDN.


• CDN - The Coherence Dynamic Network (CDN) carries cache-coherence invalidate messages.

• TDN - The Tile Dynamic Network (TDN) is used similarly to the MDN and supports memory data transfer between tiles. Only the Cache Engine has a direct hardware connection to the TDN.

Figure 5-1. Tilera iMesh

Mesh networks offer significantly lower latency and lower power than bus or ring architectures, and because the bandwidth of mesh networks scales as more tiles are added, the iMesh Interconnect allows a Tile Processor™ to scale to thousands of tiles without becoming a bottleneck. Mesh networks also allow the exploitation of locality, which offers even better power, latency, and bandwidth characteristics for applications in the media and packet processing space that are able to exploit it.

5.2 Mesh Networks

Individual tiles connect via multiple two-dimensional mesh networks. These on-chip communication networks comprise the Static Network (STN) and the Dynamic Networks: the UDN, IDN, MDN, CDN, and TDN.

5.2.1 Static Network

The Static Network (STN) is designed for efficient movement of scalar operands between tiles. Instead of using a header or other destination information, the static network uses switching specifications at each node (Switch Engine) to pass an operand node-to-node along its routing path to the final destination. The switching specification for a specific flow resides at each intermediate tile's Switch Engine. The Static Network allows extremely low latency transfer of data between tiles.


5.2.2 Dynamic Networks

All on-chip dynamic data movement uses packet-based communication over the Dynamic Networks. This includes inter-process communication, memory accesses, user-level DMA transfers, I/O, and interrupts. Any number of communication schemes, including messaging and stream-based applications, can be built on top of the architecture, inheriting all the intrinsic value of packet-based systems such as atomicity and ordering. Tilera provides the Tilera Multicore Components (TMC™) library for programmers to perform data transfers over the UDN.

The dynamic networks are reliable and packet routed; in other words, they support a "fire and forget" model. Tilera packets contain a header word that is sent in-band on the same communication channel as the subsequent data and is transparent to the user. The header contains the routing information. Each packet is routed by the hardware tile-to-tile in a dimension-ordered fashion (x-direction first, then y-direction) until it reaches the destination tile's Switch Engine. Packet routing is pipelined so that portions of a packet can be sent out over a network link even before the rest of the packet has arrived at the switch. Pipelined routing reduces communication latency significantly.
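
Dimension-ordered routing makes hop counts, and hence in-network latency, easy to reason about. A small sketch using the roughly one-cycle-per-hop figure from Section 5.3 (an approximation that ignores injection, ejection, and contention):

    /* Sketch: X-then-Y routing distance between two tiles. With about
     * one cycle per hop, hops approximate in-network transit cycles. */
    #include <stdlib.h>

    int route_hops(int src_x, int src_y, int dst_x, int dst_y)
    {
        /* The packet travels the full X distance first, then Y. */
        return abs(dst_x - src_x) + abs(dst_y - src_y);
    }
    /* Example: on an 8x8 TILEPro64, tile (0,0) to tile (7,7) is
     * 14 hops, so roughly 14 cycles of transit plus endpoint costs. */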

The network channels include hardware flow control and buffering, which allows the tiles to run asynchronously. The network buffers packets without involving the processing engines when packets encounter contention for channel resources.

The User Dynamic Network (UDN) can be used by applications to send messages between tiles. The UDN supports low latency communication over a packet-switched mesh network. The UDN is designed to be used by user processes directly and is unique to the Tile Processor™ architecture in that user processes can address each other directly via the network; tiles can use the UDN to send messages directly to other tiles without the intervention of a system-level interface.

While the behavior of the UDN is similar to the STN, a packet is written directly to network-mapped destination operand registers at the source tile and extracted from the network through network-mapped source operand registers at the destination tile. Register mapping reduces latency even further. The Tile Processor's software-accessible networks are deadlock free. The networks achieve freedom from deadlock, without the costly buffering requirements of other schemes, through an innovative technique that combines deadlock avoidance and deadlock recovery. The networks used for cache coherence and memory traffic are deadlock free by design.

5.2.2.1 Support for Stream Data Transfer

The Switch Engine supports fine-grain stream programming by providing hardware support for stream data transfers in both the UDN and IDN. The user can use streaming channels (which behave like highly efficient sockets) for communication between processes using the convenient TMC API. The hardware, compiler, and TMC run-time system combine to implement the following mechanisms to facilitate streaming channels. Data injected by a process into the source end of a channel can be read directly off the destination end of the channel. At the source tile, a tag identifying the streaming channel is embedded in network packets. At the destination tile, the Switch Engine uses the tag to demultiplex the packet into one of several hardware stream FIFOs automatically. The FIFOs are in turn readable by the destination software using a register-mapped interface. Together these mechanisms enable high-bandwidth stream data transfers with overheads of only a few cycles, compared to thousands of cycles in conventional systems.
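
For a flavor of the programming model, a hedged sketch of point-to-point UDN messaging; the tmc_udn_* names are assumed from the MDE's TMC library, and the MDE documentation is the authoritative reference:

    /* Sketch, not authoritative: sending one word to another tile's
     * demux queue 0 and blocking on receipt. All tmc_* identifiers are
     * assumed from the MDE's TMC library. */
    #include <tmc/udn.h>   /* assumed MDE header */

    void send_word(int dest_cpu, unsigned int word)
    {
        /* Build a routing header for the destination tile, then send. */
        tmc_udn_send_1(tmc_udn_header_from_cpu(dest_cpu),
                       UDN0_DEMUX_TAG, word);
    }

    unsigned int receive_word(void)
    {
        return tmc_udn0_receive();   /* blocks until FIFO 0 has a word */
    }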


Figure 5-2. Facilitating Stream Programming by Hardware Demultiplexing of Streams into Separate Data Queues

5.2.2.2 Multicore Hardwall

The Switch Engine implements Multicore Hardwall technology to protect multiple applications and operating systems running on various tiles of the Tile Processor from unwanted interactions with each other. Multicore Hardwall technology allows the Hypervisor (which runs at the highest protection level) to set mode bits in the Switch Engines' UDN routers in a way that prevents packets from entering or leaving a switch in any given direction. If a packet attempts to cross an established boundary, an interrupt is signaled and control is passed to the Hypervisor.

5.3 iMesh Characteristics on the TILEPro

The Tilera iMesh is architecturally defined. There are, however, some aspects of the iMesh that are specific to the TILEPro64 implementation:

• Network connections are 32-bit, full-duplex

• Single cycle latency from a tile’s Processor Engine to its Switch Engine

• Single cycle latency between adjacent tiles on the iMesh

• Zero cycle latency from a tile’s Switch Engine to its Processor Engine

• Packet lengths are 1 to 128 words (words are 32 bits) plus a header word


CHAPTER 6 INPUT/OUTPUT

6.1 Architecture Overview

Tilera's iMesh™ terminates at the edge of the tile array and interfaces to on-chip memory controllers and I/O devices in a highly modular fashion through I/O Shims. An I/O Shim is an interface module that bridges the internal networks to each of the memory controllers and I/O devices, and implements a packet-based API that the tile hardware or software uses to communicate with the devices.

A memory controller or I/O device that sits physically adjacent to a tile may connect to that tile's Memory Dynamic Network (MDN) and I/O Dynamic Network (IDN) through an I/O Shim, but all tiles have equivalent ability to access any I/O Shim. The I/O Shims translate the on-chip network packet protocol to the appropriate device protocols. I/O Shims contain all the hardware for:

• Connection to the iMesh

• Packet FIFOs for rate matching

• Connection to I/O protocol-specific interfaces (MACs and PHYs)

• High bandwidth data movement (for PCIe, XGBE, and GbE):

  • From memory to the I/O Shim over the MDN

  • To memory from the I/O Shim over the MDN

  • To Tile cache from the I/O Shim over the TDN

  • From Tile cache to the I/O Shim over the TDN

  • To a tile from the I/O Shim over the IDN or UDN

  • From a tile to the I/O Shim over the IDN or UDN

• Programmable bindings to tiles for operating system driver-level services, such as interrupts and error handling

Scalable systems containing multiple Tile Processors ganged together through fabric chaining can also be configured.


6.1.1 Ingress Dataflow

Link-level flow control, timing, and other L1/L2 layer actions for ingress packets arriving from off-chip are handled by the protocol-specific PHY and MAC logic. From the MAC, ingress packets are stored in FIFOs in the I/O Shim. The I/O Shim ingress DMA engine transfers packets from the packet FIFOs to external memory over the MDN, to tile memory over the TDN, or directly to the program over the UDN or IDN. The DMA transfers are flow-controlled on chip. The software-controlled transfers enable many configuration options, including forwarding a user-programmable number of packet header bytes to one location and the bulk of the packet data to another; in this way, packet flow can be directed based on content. Completion, framing, and error checking status are reported via notification packets sent on the IDN or UDN. The I/O Shim typically contains protocol-specific hardware, for example, the ability to perform a CRC on, or discard, certain portions of a packet.

Figure 6-1. Ingress I/O Processing


6.1.2 Egress Dataflow

The Tilera MDE software configures the I/O Shim egress DMA engine to initiate transfers from external memory over the MDN, or from Tile cache via the TDN, to the egress packet FIFO. Alternately, processing tiles can send some or all of the packet data directly to the I/O Shim over the IDN or UDN. Hardware controls packet flow to the protocol-specific MAC and PHY. Completion and other status are reported via packets sent on the IDN or UDN.

Figure 6-2. Transfers to External Memory


6.1.3 Direct-to-Cache I/O

I/O transactions on the TILEPro64 can take advantage of the tiles' cache coherency mechanisms. This simplifies software by eliminating the need for user-managed shared memory, and reduces the required off-chip memory bandwidth. iDMA write traffic can use the hash-for-home and direct-to-tile caching attributes, so that packet data is deposited directly into the distributed L3 cache or into a specific home tile's L2 cache. Similarly, eDMA can read packet data from the appropriate home tile's cache.

The direct-to-cache/direct-from-cache I/O mechanisms reduce memory bandwidth and ease development of packet processing software by utilizing hardware cache coherence mechanisms.

6.1.4 UserIO

The TILEPro64 supports user-managed I/O for its high speed devices (PCIe, XAUI, and GbE). When user-managed I/O is enabled, configuration and DMA control use the UDN instead of the IDN. This allows user-level drivers to communicate directly with the device while retaining the IDN for protected hypervisor and operating system communication.

Another optional feature of the TILEPro64 is virtual address (VA) translation in the I/O. When VA translation is enabled, all DMA communication between the tiles and the I/O device uses virtual rather than physical addresses. This allows user-level device drivers to be confined to a specific region of memory, and affords all of the protection and virtualization associated with virtual addressing.


6.2 TILEPro64 I/O

The TILEPro64 supports the following primary external interfaces:

• Memory: Four memory interface channels, each supporting 64-bit DDR2 DRAM up to 800 Mbps, for a peak total bandwidth of 25.6 GB/s. The memory controllers are on-chip.

• 10Gb Ethernet: Two full-duplex XAUI-based 10Gb ports with integrated MACs

• PCIe: Two 4-lane PCI Express ports configurable as 4-lane, 2-lane or 1-lane (4x, 2x, 1x) with integrated MACs, supporting both root complex and endpoint modes

• 10/100/1000 Ethernet: Two on-board RGMII 10/100/1000 Ethernet MACs

• HPI: 16-bit host port interface

• Flexible I/O: 64 bits (4 banks of 16b) of dedicated Flexible I/O for programmable I/O and interrupt support, with frequencies up to 150 MHz and streaming capability

• UART: dedicated two-pin interface

• I2C: One master I2C interface; one slave I2C interface

• Serial ROM: Dedicated SPI ROM port; shared I2C/SROM port on I2C master

• Test: IEEE 1149.1 JTAG/boundary scan port

The primary interfaces are illustrated in Figure 2-1: Tile Processor Hardware Architecture on page 5.

6.2.1 Memory Interface: DDR2

Four identical, independent memory channels (0-3) with on-chip controllers provide the following features on each channel:

• Up to 400 MHz memory clock and 800 Mbps data rate (PC2-6400 memory)

• 64 bits of data plus 8 bits optional ECC

• 132 external signal pins (data/address/command/control/clocks)

• Supports x8 devices, and x16 devices if ECC is not required

• ECC supports 1 bit scrubbing

• Supports up to 32 banks, up to 4 ranks

• Supports burst lengths of 4 or 8

• Fully programmable DRAM parameters

To optimize for memory bandwidth and latency, the memory controller uses a 16-entry CAM. The controller looks at all memory requests in the CAM and reorders them, if necessary, based on the following considerations (a simplified model of the policy follows the list):

• On the memory side, optimizing for the selected DRAM page policy (open page or close page) to reduce the overhead caused by memory access (for example: precharge, activation, turnaround, and so on)

• On the tile side, load balancing, starvation avoidance, and so on
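
The sketch below is an illustrative model of such a policy, not the hardware: it prefers requests that hit the currently open row in their bank, and ages requests so none starves.

    /* Illustrative model only: pick the next DRAM request from the CAM,
     * favoring open-page (row-buffer) hits while bounding request age. */
    #include <stdint.h>

    typedef struct { uint32_t row; unsigned bank; unsigned age; } mem_req_t;

    int pick_next(const mem_req_t req[], int n, const uint32_t open_row[],
                  unsigned starvation_limit)   /* assumes n > 0 */
    {
        int best = 0;
        for (int i = 0; i < n; i++) {
            if (req[i].age >= starvation_limit)
                return i;   /* fairness overrides the page policy */
            int hit_i = (req[i].row == open_row[req[i].bank]);
            int hit_b = (req[best].row == open_row[req[best].bank]);
            if (hit_i > hit_b || (hit_i == hit_b && req[i].age > req[best].age))
                best = i;   /* prefer open-page hits, then older requests */
        }
        return best;
    }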

To support performance tuning, performance counters are provided for statistics of various events.


6.2.2 10Gb Ethernet

The device contains two identical and independently controlled 10Gb Ethernet ports. Each 10Gb port has a standard XAUI serial interface with four differential pairs in each direction operating at 3.125GHz per pair, and a fully compliant IEEE 802.3 Ethernet MAC with support for the following features:

• Support for jumbo frames

• HW support for all “typical” L1/L2 functions including VLAN tagging, SA/DA/Broadcast addressing

• Optional promiscuous mode for custom packet formatting.

• Optional IP, TCP, UDP checksum offload

• Flexible I/O model allows software customization of all headers and packet formatting

• Independent DMA engines for ingress and egress traffic

• 16 KB of dedicated buffering in each direction to offload and decouple the I/O from tile and memory operations

• Support for direct-to-memory and direct-to-tile DMA operations

• Extensive support for interrupt dispatch and exception handling

• Automatic flow control via pause frame generation and detection with programmable buffer high-water marks

• Cut through support for low-latency applications

• TCP/IP checksum offload

• Hardware-based statistics counters

• IPG tracking with stretch support

• Reduced power and power down modes supported in PHY

• Programmable CRC for custom XAUI protocols

6.2.3 PCIe

Two identical, independent PCI Express ports support the following features:

• 4-lane full-duplex PCIe compliant interface (10Gbps line rate, 8Gbps data rate)

• Each PCIe port can be configured as 4-lane, 2-lane or 1-lane (4x, 2x, 1x)

• Full support for lane and polarity reversal to ease PCB routing

• Full support for overspeed mode to allow true 10Gbps data rate between other Tilera devices

• 100MHz or 125MHz PCIe reference clock with support for spread-spectrum clocking

• Fully compliant with the PCIe rev1.1 specification

• Each interface is configurable as root complex (host) or endpoint (device)

• Up to three base address registers (BARs): two memory and one I/O. BAR sizes and BAR enables are programmable.

• Support for OEM customization of the PCIe device via writable vendorID/deviceID configuration registers and other capability structures

• Fully programmable I/O model allows user defined custom mapping of PCIe address space into local memory or messaging space


• Independent DMA engines for ingress and egress traffic

• Dedicated buffering in each direction to offload and decouple the I/O from tile and memory operations

• Support for direct-to-memory and direct-to-tile DMA operations

• Extensive support for interrupt dispatch and exception handling

• Store-and-forward model to simplify I/O software design

• Dedicated buffering for posted/non-posted/completion flows to allow construction of fully compliant and deadlock free PCIe subsystem

• Low power and power-down modes supported in both the PHY and the MAC

6.2.4 Ethernet MACs

There are two independent and identical 10/100/1000 Ethernet ports. Each port provides an RGMII interface to external PHYs and a fully compliant IEEE 802.3 Ethernet MAC with support for the following features:

• 10/100/1000Mbps support (each port may operate at a different rate)

• Supports jumbo frames of up to 10240 bytes

• HW support for all “typical” L1/L2 functions including VLAN tagging, SA/DA/Broadcast addressing

• Optional IP, TCP, UDP checksum offload

• Supports full-duplex operation (half-duplex operation is not supported)

• MDIO interface for PHY management

• Optional promiscuous mode for custom packet formatting

• Flexible I/O model allows software customization of all headers and packet formatting

• Independent DMA engines for ingress and egress traffic

• 2KB of dedicated buffering in each direction to offload and decouple the I/O from tile and memory operations

• Support for direct-to-memory and direct-to-tile DMA operations

• Extensive support for interrupt dispatch and exception handling

• Automatic flow control via pause frame generation and detection with programmable buffer high-water marks

• Cut through support for low-latency applications

• Hardware-based statistics counters

• Programmable IPG stretch


6.2.5 Host Port Interface

The host port interface (HPI) provides a 16-bit-wide parallel port through which an external host processor (host) can directly access the TILEPro64. The HPI port shares pins with two of the Flexible I/O banks. The maximum data transfer rate for the HPI is 20MBps.

6.2.6 Flexible I/O

The Flexible I/O interface allows the programmer to configure up to 64 pins as program-addressable input pins, program-addressable output pins, or external interrupts. Flexible I/O is arranged in four groups of 16 pins with the following features:

• Pins may be configured as inputs, outputs, open-drain, or bi-directional

• Pins can be driven either via configuration writes or via packet data streaming over the I/O dynamic network

• Packet streaming allows support of high bandwidth interfaces up to 150MHz per pin

• Pin state can be sampled via configuration reads, interrupts, or data streaming over the I/O dynamic network

• Any pin can be configured to trigger interrupts

• Interrupts may be delivered to any of four programmable tile locations on a per-pin basis

• Support for three primary clocking modes:

  • Externally synchronous (via a dedicated Flexible I/O pin)

  • Internally synchronous to the core clock

  • Internally synchronous to a 100MHz reference clock

• Clock divider used to scale from one of the three clock sources

• Support for multiple drive/sample types including level, edge, pulse, pull-down, pull-up

• Controller ganging supported for adjacent Flexible I/O controllers to allow synchronous operation of more than 16 pins at high data rates

6.2.7 I2C Interface

Two I2C interfaces provide serial ports that are compliant with the I2C standard. One I2C interface operates in master mode (in a single-master system); it can be used to access configuration registers in external devices or PROM on a DIMM, and to interface to an I2C boot PROM. The second I2C interface operates in slave mode and can be used to interface with external I2C master devices (for example: a controller, a boot master, and so on). The I2C interfaces support the following features:

• 7-bit slave addressing

• 8- or 16-bit device internal addressing

• Burst read or write operations of 1 to 64 bytes

• Supports up to 400Kbps bit rate

• Two external pins per I2C interface


6.2.8 UART Interface

The two-pin UART interface is RS-232 compatible. The baud rate, data bits, stop bits, and parity types are programmable. The UART interface supports 9600 baud by default; a large range of baud rates can be supported after boot by programming the baud rate.

The UART port is one of the available level-1 boot ports: the level-1 boot loader code can be presented to the Tile Processor through this port after a hardware reset. The UART port can be accessed by any tile from inside the chip, and can also be used to access, or be accessed by, other devices on the RSHIM interface.

6.2.9 SPI Serial ROM Interface

The SPI (Serial Peripheral Interface) Serial ROM interface provides an interface from which you can boot, and through which you can write to and read from an off-chip SPI ROM. It supports the following:

• External ROM sizes between 512K bits and 128M bits

• Page programming

• Up to 15.625 MHz clock rate

• Four external pins:

  • SCK (serial clock)

  • CSN (chip select)

  • MOSI (Master Output, Slave Input)

  • MISO (Master Input, Slave Output)


CHAPTER 7 TILEPRO TEST, DEBUG, AND BOOT

The TILEPro64™ includes a rich set of Test, Debug, and Boot features.

7.1 TILEPro Test

7.2 JTAG

An IEEE 1149.1 JTAG test access port provides real-time debug, control, and test capabilities for the following:

• PLL and clock gating logic

• Memory BIST

• eFuse farm

• Boundary scan chain

7.3 Boot Support

Booting the TILEPro device is fundamentally a three-step process.

The level-0 boot code is built into a ROM in each tile and consists of a small program that receives the level-1 boot code from an external device such as a serial ROM, an I2C master (the TILEPro64 is the slave), flash memory, the HPI, or a PCIe host (one of the TILEPro64's PCIe ports is strapped as a PCIe endpoint).

The level-1 boot performs primary device initialization functions including memory controller configuration, physical address space mapping, and local I/O device discovery.

Once level-1 boot is complete, the level-2 boot (the remaining Hypervisor, OS, and application) can be performed over any of the TILEPro's interfaces, including PCIe, 10/100/1000 Ethernet, 10GbE, UART, I2C, or Flexible I/O. The level-1 boot code must initialize the level-2 boot interface to enable the level-2 boot process.

Level-1 boot from SPI ROM (serial Flash with serial peripheral interface):

• 512Kbits to 128Mbits

• Dedicated SPI boot interface (4 pins)

• Up to 12Mbps

Level-1 boot from I2C SROM (serial EEPROM with I2C interface):

• 32Kbits to 1Mbit

• Uses shared I2C master interface (2 pins)

• Up to 400Kbps


G GLOSSARY

BARs: Base address registers.

BIST: Built-in Self Test.

CAM: Content Addressable Memory.

CPLD: Complex PLD. A programmable logic device (PLD) that is made up of several simple PLDs (SPLDs) with a programmable switching matrix in between the logic blocks. CPLDs typically use EEPROM, flash memory, or SRAM to hold the logic design interconnections.

DDC™: Dynamic Distributed Cache. A system for accelerating multicore coherent cache subsystem performance. Based on the concept of a distributed L3 cache, a portion of which exists on each tile and is accessible to other tiles through the iMesh. A TLB directory structure exists on each tile, mapping the locations of pages among the other tiles and eliminating the bottlenecks of centralized coherency management.

ECC: Error-Correcting Code. A type of memory that corrects errors on the fly.

Fabric chaining: The ability to cascade multiple Tilera chips together seamlessly in order to provide more processing power, memory, and I/O for an application. The architecture is designed to allow fabric chaining to be done transparently to the application, such that major software rewrites are unnecessary.

Hardwall technology: A microcode feature that can partition a TILE processor into multiple virtual machines, allowing different instances of Linux and their applications to run on the chip and be isolated from each other.

Host port interface (HPI): A 16-bit-wide parallel port through which a host processor can directly access the CPU's memory space. The host device functions as a master to the interface, which increases ease of access. The host and CPU can exchange information via internal or external memory. The host also has direct access to memory-mapped peripherals. Connectivity to the CPU's memory space is provided through the DMA controller.

Hypervisor services: Provided to support two basic operations: installing a new page table (performed on context switch) and flushing the TLB (performed after invalidating or changing a page table entry). On a page fault, the client receives an interrupt and is responsible for taking appropriate action, such as making the necessary data available via appropriate changes to the page table, or terminating a user program that has used an invalid address.

Interpacket Gap (IPG): Ethernet devices must allow a minimum idle period between transmissions of Ethernet frames, known as the interframe gap (IFG) or interpacket gap (IPG). It provides a brief recovery time between frames to allow devices to prepare for reception of the next frame.

MDIO: Management interface I/O bidirectional pin. The management interface controls the behavior of the PHY.

Multicore Development Environment™ (MDE): Multicore software programming environment.

Promiscuous mode: In computing, refers to a configuration of a network card wherein a setting is enabled so that the card passes all traffic it receives to the CPU, rather than just packets addressed to it; a feature normally used for packet sniffing. Many operating systems require superuser privileges to enable promiscuous mode. A non-routing node in promiscuous mode can generally only monitor traffic to and from other nodes within the same collision domain (for Ethernet and wireless LAN) or ring (for Token Ring or FDDI).

RGMII: Reduced Gigabit Media Independent Interface.

SPI ROM: Serial Flash with a serial peripheral interface.

UART: Universal Asynchronous Receiver Transmitter. The electronic circuit that makes up the serial port. Also known as a "universal serial asynchronous receiver transmitter" (USART), it converts parallel bytes from the CPU into serial bits for transmission, and vice versa. It generates and strips the start and stop bits appended to each character.

VLIW architecture: VLIW (Very Long Instruction Word) is a microprocessor design technology. A chip with VLIW technology is capable of executing many operations within one clock cycle. Essentially, a compiler reduces program instructions into basic operations that the processor can perform simultaneously. The operations are put into a very long instruction word that the processor then takes apart, passing the operations off to the appropriate devices.

