Hardware Virtualization-driven Software Task Switching in Reconfigurable Multi-Processor System-on-Chip Architectures

黃翔, Dept. of Electrical Engineering

National Cheng Kung University

Tainan, Taiwan, R.O.C

2012.09.02

Outline

• Abstract
• Introduction
• Virtualization Middleware
• Interconnection Design in Virtualization Middleware
• Dynamic Mapping by Exploiting Permutation Networks
• Classification of Tasks
• Scheduling of Task Groups
• Energy and Safety Aspects
• Application Example and Discussion of Results
• Summary and Outlook

Abstract

We exploit a dedicated Virtualization Middleware (VMW) between an array of processors and independent software tasks. The usually strict and static processor-to-task binding is resolved.

By introducing a dynamically reconfigurable interconnection network based on permutation networks inside this Virtualization Middleware, an easy mapping and scheduling of software task groups may be achieved.

Introduction (1/2)

Recent FPGAs allow the integration of up to several dozen soft-core processors.

Given these large logic resources, it may at first glance appear appropriate to allocate a dedicated processor to each software task in the system in order to gain maximum performance.

However, cost as well as power constraints may render this approach unsuitable for most scenarios.

On the other hand, structuring applications into tasks and letting them share the same (multi-core) processor may lead to unwanted security-critical situations.

Therefore, an obvious solution is to use several independent processors that may share the burden of executing software tasks.

Introduction (2/2)

Despite making things easier and faster, employing many processors in a SoC raises the question of how to develop, distribute and, last but not least, schedule the software on all these processors.

In the embedded design world, a lack of experience with multi-core SoCs still causes several problems:

• Few mature design tools tailored to the needs of designers of multi-core SoCs are available.
• Methodologies aimed at designing parallel, or at least multi-core, architectures often lack a comprehensive design flow down to the hardware layer.
• Suitable hardware architectures for embedded multi-processors that natively support an easy mapping and scheduling of software tasks in a SoC are still barely available.

Thus, regarding the last point, a generic architecture is needed in order to realize complex multiprocessor SoCs that do not waste logic resources, yet assure both a safe and secure execution of software tasks.

Virtualization Middleware (1/5)

In order to shift the execution of a task from one processor to another, it is not sufficient to just reroute the connections between memories and processors.

While being executed, each task has a context residing inside the executing processor. This context consists of the program counter address and the content of the processor registers, such as the general-purpose and status registers.

Therefore, this context also has to be considered when shifting the task.

The context of a task is extracted by a Code Injection Logic (CIL) that resides inside the VMW.

Virtualization Middleware (2/5)

Within this module, a dedicated portion of the so-called Virtualization Machine Code of the attached processor is stored.

To preserve the extracted context of a software task, i.e., the internal states of processors, a dedicated memory region inside the VMW is allocated for each task. This is called the Virtualization Context Memory, as shown in Figure 2.

Virtualization Middleware (3/5)

When a shift of a task execution is triggered, the CIL containing the Virtualization Machine Code is multiplexed onto the instruction interface of the corresponding processor, as depicted in Figure 2. The connection between instruction memory and processor is interrupted.

The CIL then inserts NOP instructions in order to empty the five-stage pipeline of the processor.

After the processor has executed the last regular instruction of the task code, the connections from the data memory to the processor are also interrupted.

Virtualization Middleware (4/5)

Now, the Virtualization Context Memory is multiplexed to the data memory interface of the processor. Furthermore, the PC address register output of the processor is routed to the Context Memory. After the processor has computed the program counter address of the next instruction that would regularly be fetched, this address is stored inside the Context Memory.

As the instruction interface of the processor is now connected to the CIL, the dedicated Virtualization Machine Code stored inside the CIL is fetched by the processor. This code contains instructions which force the processor to dump all of its register contents on its data memory interface. The task context that is output by the processor is stored inside the Context Memory.

Virtualization Middleware (5/5)

After stopping the execution of a software task and extracting its context, the connections of the data and instruction memories of a task may be rerouted to some other inactive processor.

To resume software execution, the CIL again feeds a dedicated portion of Virtualization Machine Code into the processor. This code then loads the task context stored inside the Virtualization Context Memory into the register set of the processor.

After having restored the task context, the connections to the data and instruction memories are restored.

An unconditional jump to the previously saved program counter address concludes the shift of the software task and resumes task execution on the new processor instance.
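The save and restore sequence of the last slides can be illustrated with a small behavioural model in Python. This is only a sketch under assumptions: the class names `Processor` and `VirtualizationMiddleware`, the eight-entry register file, and the clearing of the source processor are illustrative choices and are not taken from the presented hardware.

```python
from dataclasses import dataclass, field


@dataclass
class Processor:
    """Toy processor model: a program counter plus a small register file."""
    pc: int = 0
    regs: list = field(default_factory=lambda: [0] * 8)  # register count is assumed


class VirtualizationMiddleware:
    """Holds one context memory region per task (hypothetical model)."""

    def __init__(self):
        self.context_memory = {}

    def extract_context(self, task, cpu):
        # Models the CIL forcing the processor to dump its PC and registers.
        self.context_memory[task] = (cpu.pc, list(cpu.regs))
        # Modelling choice: the processor keeps no state of the task afterwards.
        cpu.pc, cpu.regs = 0, [0] * len(cpu.regs)

    def restore_context(self, task, cpu):
        # Load the saved registers first, then jump to the saved program counter.
        pc, regs = self.context_memory[task]
        cpu.regs = list(regs)
        cpu.pc = pc


def shift_task(vmw, task, source, target):
    """Move a running task from one processor to another."""
    vmw.extract_context(task, source)
    vmw.restore_context(task, target)


vmw = VirtualizationMiddleware()
cpu_a = Processor(pc=0x40, regs=[1, 2, 3, 4, 5, 6, 7, 8])
cpu_b = Processor()
shift_task(vmw, "A", cpu_a, cpu_b)
print(hex(cpu_b.pc), cpu_b.regs)
```

After the call, `cpu_b` holds the program counter and register values that `cpu_a` had before the shift, which corresponds to the task resuming on the new processor instance.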

Interconnection Design In VMW (1/9)

Permutation networks have already been proposed for processor-to-software communication in the past. They consist of reconfigurable crossbar switches that are connected by a static interconnect. In the advocated architecture, software tasks are viewed as the inputs and processors as the outputs of a permutation network.

Crossbar switches are small routing elements with two inputs and two outputs each. They have two configurations, as depicted in Figure 3.

• In their first configuration, each input is forwarded to its corresponding output.
• In their second configuration, the inputs are connected to the outputs in a crossed manner.

The configuration of a switch may be changed, i.e., reconfigured, during runtime.
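The two switch configurations can be modelled in a few lines of Python. This is a minimal sketch; the class name `CrossbarSwitch` and its methods are hypothetical and only mirror the behaviour described above.

```python
class CrossbarSwitch:
    """2x2 routing element with a straight and a crossed configuration."""

    def __init__(self, crossed=False):
        self.crossed = crossed

    def reconfigure(self, crossed):
        # In this model, a runtime reconfiguration only flips this flag.
        self.crossed = crossed

    def forward(self, in0, in1):
        # Straight: in0 -> out0, in1 -> out1. Crossed: in0 -> out1, in1 -> out0.
        return (in1, in0) if self.crossed else (in0, in1)


switch = CrossbarSwitch()
straight = switch.forward("task A", "task B")
switch.reconfigure(crossed=True)
crossed = switch.forward("task A", "task B")
print(straight, crossed)
```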

Interconnection Design In VMW (2/9)

In order to generate permutation networks, crossbar switches are connected to each other using static interconnects. The design of these static interconnects determines the type of permutation network employed. In this work, three types were evaluated.

• Butterfly network
• Benes network
• Max-Min network

Interconnection Design In VMW (3/9)

Butterfly networks offer a relatively small resource consumption and short combinatorial paths.

However, due to their low interconnectivity, some input-output combinations are not feasible. They are well-suited for scenarios with harsh resource constraints and few dynamic binding configurations. A Butterfly network example is depicted in Figure 4.

Because of the severe limitations in possible I/O combinations, we did not consider Butterfly Networks for the synthesis results.

Interconnection Design In VMW (4/9)

Interconnection Design In VMW (5/9)

A Benes network appears as a Butterfly network that is doubled and mirrored. Its resource consumption is twice as high as that of a Butterfly network with the same number of inputs and outputs, but it offers more flexibility.

A Benes network example is depicted in Figure 5.

However, with increasing numbers of inputs and outputs, routing becomes difficult.

Benes networks were discarded in favor of a network allowing for simple routing.

We, therefore, consider Max-Min networks only.

Interconnection Design In VMW (6/9)

Max-Min networks may be seen as a sorting network if each crossbar switch is used as a comparator. If the value inserted into the first input of a crossbar switch is lower than the value of its second input, then both inputs are routed directly to their corresponding outputs.

In the other case, the crossbar switch is configured to output the inputs in a cross manner.

A Max-Min network example with eight inputs and outputs is depicted in Figure 6.

In doing so for the complete Max-Min network, at the output stage all values from the input ports are sorted.

Therefore, if every software task connected to an input of this network gets the number of the desired processor assigned, then routing the task to the dedicated processor is very easy by applying this sorting-like interconnect.
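The sorting behaviour can be sketched in Python as follows. This is a sketch under a stated assumption: the switches are arranged as an odd-even transposition network, chosen only because it is simple to write down. The actual Max-Min topology of Figure 6 is not reproduced in this text and may differ. The names `compare_exchange` and `route` are illustrative.

```python
def compare_exchange(first, second):
    """One crossbar switch used as a comparator.

    Each value is a (target_processor, task) pair. If the first key is lower,
    the switch stays straight; otherwise it takes its crossed configuration.
    """
    return (first, second) if first[0] < second[0] else (second, first)


def route(inputs):
    """Pass the inputs through n stages of neighbouring compare-exchange switches."""
    lanes = list(inputs)
    n = len(lanes)
    for stage in range(n):
        # Even stages pair lanes (0,1),(2,3),...; odd stages pair (1,2),(3,4),...
        for i in range(stage % 2, n - 1, 2):
            lanes[i], lanes[i + 1] = compare_exchange(lanes[i], lanes[i + 1])
    return lanes


# Four tasks, each labelled with the number of its desired processor.
outputs = route([(3, "A"), (1, "B"), (4, "C"), (2, "D")])
print(outputs)
```

With distinct target numbers 1 to n, the task labelled k ends up on output lane k, so no explicit path computation is needed. Task groups that share a target number are not covered by this sketch.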

Interconnection Design In VMW (7/9)

Interconnection Design In VMW (8/9)

The advantages of these networks are weakened by the fact that their structure is not balanced. The number of crossbar switches that have to be passed to route from an input to an output varies for different paths. This results in varying combinatorial path delays on the chip. Furthermore, they have the highest resource consumption of all networks discussed in the scope of this work.

Permutation networks usually do not offer full flexibility, i.e., they do not permit all possible input-to-output combinations at the same time. Thus, some of these combinations cause so-called blockades. Such a situation occurs if two inputs of a crossbar switch need to use the same output in order to establish their route. As visible from Figure 7, this is usually an undesired situation.

Interconnection Design In VMW (9/9)

We will demonstrate how to make use of a certain type of blockade in order to schedule tasks that are grouped to share a processor resource.

Dynamic Mapping by Exploiting Permutation Networks (1/5)

Mapping of tasks to processors in the advocated architecture is accomplished by defining a Binding Vector (BV). $BV_t$ denotes which software tasks are assigned to which processor at the point in time $t$. A BV contains a set of processors $P$ with elements $p_i$ and a set of tasks $S$ with elements $s_j$.

Software tasks may be assigned to processors using the following syntax:

$BV_t = (p_a : (s_x), p_b : (s_y), \ldots)$

Furthermore, in a BV some software tasks may be selected to form a task group $SG_i \subset S$, which is intended to be executed on the same processor:

$BV_t = (p_a : (SG_x), p_b : (s_y), \ldots)$

Thus, tasks that share a processor resource, e.g., the elements of $SG_x$, are called a task group.

An example of a BV for eight tasks being assigned to four processors in a permutation network with eight inputs and outputs each is given in the following assignment:

BV1 = (1:(A,D,F), 3:(B,C), 6:(G), 7:(E:4,H))

• Tasks A, D, and F have to share processor 1.
• Tasks B and C share processor 3.
• Processor 6 is exclusively dedicated to task G.
• Tasks E and H are assigned to processor 7.
• Task E furthermore features an optional budget value of 4. It defines how much processing time shall be granted to this task relative to the other tasks sharing the same processor.

Based on these assignments, the mapping of the tasks to their corresponding processors through the interconnection network is accomplished by Algorithm 1.
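To make the BV notation concrete, the following Python sketch parses a BV string into a dictionary. It is an illustration only: the function name `parse_bv` is hypothetical, and the default budget of 1 for tasks without an explicit budget is an assumption, since the slides do not state a default.

```python
import re


def parse_bv(text):
    """Parse '(1:(A,D,F),7:(E:4,H))' into {processor: [(task, budget), ...]}."""
    binding = {}
    # Each match is one 'processor:(task list)' entry of the binding vector.
    for proc, group in re.findall(r"(\d+)\s*:\s*\(([^()]*)\)", text):
        tasks = []
        for item in group.split(","):
            name, _, budget = item.strip().partition(":")
            # Assumption: a task without an explicit budget gets a budget of 1.
            tasks.append((name, int(budget) if budget else 1))
        binding[int(proc)] = tasks
    return binding


bv1 = parse_bv("BV1 = (1:(A,D,F),3:(B,C),6:(G),7:(E:4,H))")
unused = sorted(set(range(1, 9)) - set(bv1))
print(bv1[7], unused)
```

For the eight-processor example, the processors that do not appear in BV1 are 2, 4, 5, and 8.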

Routing is performed by a routing logic added to the VMW.

Blockades that occur while applying the example BV are shown in Figure 6 by highlighting the affected crossbar switches.

Depending on the type and complexity of the interconnection network, performing intensive backtracking whenever an unwanted blockade is detected may not be feasible in terms of computing time.

Therefore, different solutions may be applicable if no routing is found after a given time.

First, the designer may define and compute a set of BVs right from the start to determine whether they will be routable during system runtime.

The designer may also change the assignment of the software tasks or the number of processors employed.

Alternatively, the designer might switch to a permutation network with higher flexibility in order to achieve the desired binding.

• In the current implementation, however, changing the interconnection network requires a re-synthesis of the system.

Furthermore, as a fall-back alternative, the old BV may remain active, if no routing for the new BV can be found.
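The fall-back behaviour can be sketched as follows. This is a hedged illustration: `update_binding` and the `find_routing` callback are hypothetical names, and the toy router below is a stand-in for the routing logic, which is not specified in this text.

```python
def update_binding(active_bv, new_bv, find_routing):
    """Return (bv_to_use, routing); fall back to the old BV if routing fails.

    `find_routing` is a hypothetical callback that returns a switch
    configuration for a BV, or None if none was found within its time limit.
    """
    routing = find_routing(new_bv)
    if routing is None:
        return active_bv, None  # the old BV remains active
    return new_bv, routing


# Stand-in router for illustration only: it rejects any BV that uses processor 9.
def toy_router(bv):
    return None if 9 in bv else {"switches": "configured"}


old_bv = {1: ["A"], 3: ["B"]}
kept, _ = update_binding(old_bv, {9: ["A"]}, toy_router)
taken, _ = update_binding(old_bv, {2: ["A", "B"]}, toy_router)
print(kept, taken)
```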

During the operation of the system, updated BVs may be entered at any time - either by the user or by a system scheduler instance that has previously resolved detected task dependencies.

Classification of Tasks (1/2)

Within this classification, tasks being executed in embedded systems are assigned to one of three types.

1. Tasks of the first type have to run continuously to ensure correct system execution.
• These tasks often have harsh real-time constraints.
• Examples are tasks in flight management systems on airplanes or collision-avoidance systems in cars.
• Since they are time-critical, scheduling these tasks is risky; therefore, if possible, they run on a dedicated processor.

2. Tasks of the second type run periodically.
• These tasks usually have no hard timing requirements.
• Therefore, they may be scheduled with other non-critical tasks in the system or may even be completely halted for a certain amount of time.
• Examples are tasks that periodically read out temperature sensor data, such as in the engine control of a car.

Classification of Tasks (2/2)

3. Tasks of the third type are characterized by definite completion.
• These tasks may perform a calculation, a data transfer, or both, and terminate thereafter.
• Examples are initialization routines for system start-up.
• Once the dependencies of other tasks that wait for a task of this type to complete have been resolved, this kind of task may easily be scheduled.

Scheduling of Task Groups (1/8)

Scheduling of software tasks that share a processing resource in the proposed architecture is done in a time division scheme.

The basic quantum of a time division step is an integer number of cycles of the underlying clock.

The optional budget parameter given in the BV determines a multiple of this basic quantum of the processing time. The higher the budget, the more processing time is granted to a task.

A task group scheduler manages the access of independent task groups to their corresponding processor resource. The architecture enhanced for this purpose is given in Figure 8.

Scheduling of Task Groups (2/8)

Figure 8: Detailed Virtualization Middleware with Task Group Scheduler, Routing Logic, and Binding Vector Interface.

Scheduling of Task Groups (3/8)

In order to schedule tasks that are assigned to the same processor, the steps denoted in Algorithm 2 are executed.

Based on the budget value, a timer inside the task group scheduler monitors whether the task currently running has any processing time left in the current turn. If this is not the case, the timer triggers a scheduling event.

For the given BV1, the task group scheduling procedure results in the execution sequence depicted on the left-hand side of Figure 9. Processors 1 and 3 show the basic time division scheme, whereas on processor 7, the higher budget assigned to task E leads to a longer processing time for each of its turns.
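The effect of the budget value on the time division scheme can be sketched in Python. This is not Algorithm 2, which is not reproduced in this text; it is a simplified round-robin model that ignores the virtualization overhead at each task switch and assumes a budget of 1 for tasks without an explicit budget. The function name `schedule` is illustrative.

```python
from itertools import cycle, islice


def schedule(group, quanta):
    """Return the task occupying the processor in each of the first `quanta` basic quanta.

    `group` is a list of (task, budget) pairs; a budget of b grants b
    consecutive basic quanta per turn.
    """
    turn = [task for task, budget in group for _ in range(budget)]
    return list(islice(cycle(turn), quanta))


# Processor 7 of BV1: task E has budget 4, task H has the assumed default of 1.
print("".join(schedule([("E", 4), ("H", 1)], 10)))
# Processor 1 of BV1: three tasks with equal budgets.
print("".join(schedule([("A", 1), ("D", 1), ("F", 1)], 6)))
```

In this model, task E occupies processor 7 for four quanta per turn, while A, D, and F alternate evenly on processor 1.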

Scheduling of Task Groups (4/8)

At each task switch, a virtualization procedure is executed.

Scheduling of Task Groups (5/8)

Scheduling of Task Groups (6/8)

The user or a system scheduling instance, which resolves task dependencies and determines which tasks may run in parallel, may change the binding vector at any time.

Algorithm 3 is executed whenever a new BV becomes available.

Scheduling of Task Groups (7/8)

The Max-Min network of Figure 6 is updated with the new BV:

BV2 = (1:(D:2,E), 3:(B:2,G:5), 6:(A:3,H), 8:(C,F:3))

The task execution sequence then generated by the task group scheduler is given in Figure 9 b). Note that on processor 7, task E is interrupted by the BV update although it had some granted processing time left. After the BV update, tasks are executed with new time budgets and, partially, on other processors than before.

As the VMW encapsulates the memory controllers of data and instruction memories, it is possible for the VMW to read out instructions transferred from memories to processors. By exploiting this feature, a so-called self-scheduling of task groups may be enabled. The proposed architecture provides dedicated scheduling instructions that trigger scheduling events and may be inserted into the software tasks.

Scheduling of Task Groups (8/8)

Given a task graph, a linear execution order of tasks may be derived. We assume each task to be of the second or third type.

If those tasks feature the dedicated scheduling instructions, they are able to indicate the end of their current computation. This invokes the next task of the group by means of a virtualization event.

By means of this scheme, self-organizing list scheduling is achieved and the main scheduler in the VMW can be skipped completely.
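The self-scheduling idea resembles cooperative multitasking, which can be sketched with Python generators. This is an analogy under assumptions, not the hardware mechanism: each `yield` stands in for a dedicated scheduling instruction, and the names `task` and `self_schedule` are hypothetical.

```python
def task(name, steps):
    """A task that yields wherever a scheduling instruction would be inserted."""
    for step in range(steps):
        # One block of computation would run here.
        yield f"{name}{step}"  # stands in for the dedicated scheduling instruction


def self_schedule(tasks):
    """Run the tasks of one group in list order; each yield hands over to the next task."""
    trace = []
    active = list(tasks)
    while active:
        for t in list(active):
            try:
                trace.append(next(t))
            except StopIteration:
                active.remove(t)  # the task ran to completion and leaves the group
    return trace


trace = self_schedule([task("A", 2), task("B", 1), task("C", 3)])
print(trace)
```

No timer or central scheduler appears in this model; the order of execution follows only from the list order and the points at which the tasks hand over.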

Energy and Safety Aspects (1/3)

Some processors may remain unused depending on the contents of the BV. For the example BV1 in Figure 6, these are processors 2, 4, 5, and 8. Without modification, they would remain in an infinite loop trying to fetch instructions. This behavior, however, consumes energy.

Therefore, each processor that is currently unused is automatically deactivated by disabling its clock input.

If shortcomings in energy supply force the system to save energy, each of the processors currently being active may be independently scaled down to one of four clock ratios, which are user-definable.
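The clock handling described above can be sketched as a mapping from processors to clock ratios. This is an illustration only: the function name `clock_plan` is hypothetical, and the ratio values 1.0 and 0.5 are placeholders, since the four user-definable ratios are not given in this text.

```python
def clock_plan(bv, num_processors, ratios=None):
    """Map every processor to a clock ratio; 0.0 means its clock input is disabled."""
    ratios = ratios or {}
    return {
        # Processors bound in the BV run at their selected ratio (full speed
        # by default); all other processors are clock-gated.
        p: (ratios.get(p, 1.0) if p in bv else 0.0)
        for p in range(1, num_processors + 1)
    }


bv1 = {1: ["A", "D", "F"], 3: ["B", "C"], 6: ["G"], 7: ["E", "H"]}
plan = clock_plan(bv1, 8, ratios={3: 0.5})
gated = [p for p, ratio in plan.items() if ratio == 0.0]
print(gated)
```

For BV1, the clock-gated processors in this model are 2, 4, 5, and 8, matching the unused processors named above.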

Energy and Safety Aspects (2/3)

If a shortcoming in energy supply is only transient, then a temporary re-scheduling may be applied. By defining a new binding vector, less time-critical tasks may be grouped together to share a processor. Consequently, it takes more time for these tasks to be completed. Some processors will then remain unused and may temporarily be deactivated.

If the energy supply recovers to its normal level, the original BV may be restored and the temporarily unused processors may be reactivated.

Related to the method described above, an update of the BV during a temporary shortcoming in energy supply may also exclude several tasks from being executed.

Alternatively, instead of excluding them from the BV, their budget value may be decreased as well.

Energy and Safety Aspects (3/3)

One of the fundamental safety and security concepts in embedded systems design is never to run security-relevant software tasks together with other software modules on the same resource. Failures or intentionally inserted exploits in other software parts may lead to unwanted behavior of the software in terms of security.

However, by exploiting the advocated virtualization approach, software tasks are physically separated at all times. No processor is able to address data from software that is not currently bound to it. The context information of a task inside the virtualization memory is strictly bound to its task. Even a harmful task that shares its processor resource with a security-relevant task cannot access any information of that task, because sleeping tasks as well as their corresponding Virtualization Context Memory areas are not linked to any other memory block or to the processor.

Application Example and Discussion of Results (1/4)

As an example to demonstrate advantages of the proposed virtualization approach, a symmetric encryption and decryption scenario was implemented.

Various software tasks independently encrypt and decrypt data stored in memory using the AES-128 encryption scheme.

Pre-calculated results stored inside the tasks’ data memories are used for comparison in order to determine whether the result of a computation has been corrupted by the virtualization and reconfiguration procedures.

For task group scheduling, the timing overhead is given in Table 1.

For a BV update, timing overhead is marginally larger, as listed in Table 2.

The resource overhead generated by the proposed architecture is given in Table 3. For comparison, the resource consumption of a MicroBlaze soft-core processor is included. Mainly due to the rather long paths in the combinatorial permutation network, the resulting maximum clock frequency of this implementation is 41 MHz.

In contrast, without the VMW, a MicroBlaze processor may run at up to 125 MHz.
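As a quick check of what these two figures imply, the clock periods and their ratio can be worked out directly. The subscripts below are labels introduced here for illustration; only the two frequencies come from the slides.

```latex
T_{\mathrm{VMW}} = \frac{1}{41\,\mathrm{MHz}} \approx 24.4\,\mathrm{ns},
\qquad
T_{\mathrm{plain}} = \frac{1}{125\,\mathrm{MHz}} = 8\,\mathrm{ns},
\qquad
\frac{T_{\mathrm{VMW}}}{T_{\mathrm{plain}}} = \frac{125}{41} \approx 3.05
```

In terms of the achievable clock period alone, the implementation with the VMW is thus roughly three times slower than a stand-alone processor.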

The proposed approach is able to dynamically set up various software-to-processor bindings and to schedule between independent software task groups.

However, a disadvantage of the interconnection networks proposed here is the long combinatorial paths between the inputs and outputs of the network. This lowers the achievable maximum clock frequency of the system.

In order to use a dozen or more processors in the proposed architecture, presumably other interconnection types have to be considered.

Summary and Outlook

The architecture exploits a dedicated virtualization middleware between processors and tasks, featuring a dynamically reconfigurable interconnection network that is used for task group scheduling.

The execution of software tasks may be shifted to another processor at any time.

Moreover, aspects that cover the modifications of permutation networks in order to support a number of software tasks that considerably exceeds the number of processors in the system are being evaluated.
