Exploration and Implementation of Wireless Protocol Platforms

by

Suet-Fei Li

B.S. (University of Wisconsin-Madison) 1995

A dissertation submitted in partial satisfaction of the requirements for the degree of

Doctor of Philosophy

in

Engineering - Electrical Engineering and Computer Science

in the

GRADUATE DIVISION

of the

UNIVERSITY OF CALIFORNIA, BERKELEY

Committee in charge:

Professor Jan Rabaey, Chair

Professor Randy Katz

Professor Paul Wright

Fall 2003

The dissertation of Suet-Fei Li is approved:

Chair Date

Date

Date

University of California, Berkeley Fall 2003

Abstract

Exploration and Implementation of Wireless Protocol Platforms

by

Suet-Fei Li

Doctor of Philosophy in Electrical Engineering and Computer Science

University of California, Berkeley


The focus of the thesis research is on the implementation of flexible energy-efficient

wireless protocols, and the corresponding design methodologies. In the first part of the

thesis, we propose a formal top-down, platform-based design methodology, targeting

complex systems with a high level of integration and heterogeneity. Our methodology

relies on a formal Model of Computation (MOC). It supports architecture exploration and

meets the application’s need for flexibility while achieving energy-efficient solutions.

Using PicoRadio as the design driver, the proposed formal top-down design methodology

yields superior results compared to traditional bottom-up ad-hoc approaches.

In the second half of the thesis, we focus on energy-efficient management for event-

driven heterogeneous systems. Traditional Operating Systems, acting as the system

manager and scheduler, are not efficient or in many cases not sufficient for the targeted

types of complex real time, power-critical domain specific systems. Our proposed

solution utilizes a system management framework; it exploits the reactive event-driven

nature of the systems, and deploys aggressive power management. The hierarchical

structure of the framework enhances design scalability, supports concurrency, and

enables power control at various granularities. The scope of our power management

algorithm is not limited to individual nodes; instead, it aims to encompass the interest of

the network as a whole. State space partitioning is deployed to execute our power

management algorithm in two phases: network-level power management and

node-level power scheduling.

We have studied different power management algorithms for the network level.

Adaptive algorithms seem to be good solutions since they are able to exploit the

temporal correlations in the traffic streams, handle environmental changes and are

relatively simple to implement. However, simple constant threshold algorithms perform

better for critical controller nodes and systems with high wakeup overhead. Our

experimentation on the various adaptive algorithms leads us to speculate that there is a

performance limit to any adaptive algorithm that only has the knowledge of the recent

inter-arrival history. A more “global” approach that incorporates information on the

network neighborhood is needed to achieve major breakthroughs. In the future, we would

like to explore such approaches by appending dedicated power management fields to

existing packets, as well as adjusting the sleep thresholds based on known topology

information.


__________________________

1 INTRODUCTION
    1.1 CHALLENGES IN WIRELESS PROTOCOL IMPLEMENTATION
    1.2 RESEARCH STATEMENT AND MAJOR CONTRIBUTIONS
    1.3 THESIS ROADMAP

2. PLATFORM-BASED DESIGN METHODOLOGY FOR WIRELESS PROTOCOL PROCESSOR
    2.1 MODEL OF COMPUTATION
    2.2 CONCEPT OF PLATFORM
    2.3 THREE-PHASE PLATFORM-BASED DESIGN FLOW
        2.3.1 Phase I – Platform Conception
            2.3.1.1 Functional Profiling
                Network Layer Profiling
                MAC Layer Profiling
                Summary of Profiling Results
            2.3.1.2 Architecture Exploration
                Traditional Reconfigurable Architectures
                Hybrid Architecture for Protocol Processing [19]
            2.3.1.3 Architecture Library
        2.3.2 Phase II – Platform Instantiation
            2.3.2.1 PicoRadio II Case Study
        2.3.3 Phase III – Implementation
    2.4 DESIGN ITERATION
    2.5 METHODOLOGIES COMPARISON

3. REACTIVE OPERATING SYSTEMS -- THE SOFTWARE MANAGEMENT LAYER
    3.1 REACTIVE SYSTEM BEHAVIOR
    3.2 INADEQUACY OF TRADITIONAL GENERAL-PURPOSE OS’S
    3.3 EVENT-DRIVEN OS
    3.4 COMPARISON RESULTS
    3.5 REQUIRED EXTENSION OF TINYOS

4. HIERARCHICAL POWER MANAGEMENT FRAMEWORK
    4.1 THE GLOBAL POWER SCHEDULER AND SYSTEM MANAGER
    4.2 EXISTING WORKS ON POWER MANAGEMENT POLICIES
        4.2.1 Stationary Statistical Power Management Policy
        4.2.2 Adaptive Power Management Policy For Non-Stationary Traffic
        4.2.3 Dynamic Voltage Scaling (DVS)
    4.3 PROPOSED POWER MANAGEMENT ALGORITHM FOR SENSOR NETWORKS
        4.3.1 Formulating Power Control Policy For PicoRadio Network
        4.3.2 Requirements For PicoRadio Network Power Control Policy
        4.3.3 Proposed Power Control Algorithms

5. NODE LEVEL POWER MANAGEMENT
    5.1 HIERARCHICAL NODE-LEVEL POWER MANAGEMENT ARCHITECTURE
    5.2 GLOBAL POWER SCHEDULER
    5.3 POWER STATES TRANSITIONS FOR SYSTEM BLOCKS
    5.4 NODE LEVEL POWER SCHEDULING
        5.4.1 Predictive Look-Ahead Scheduling
        5.4.2 Implementation of Predictive Scheduling
        5.4.3 Power Scheduling Without Predictive Wakeups
    5.5 INCORPORATING DYNAMIC VOLTAGE SCHEDULING (DVS)
    5.6 THE STATEFLOW - SIMULINK ESTIMATION-SIMULATION FRAMEWORK

6. NETWORK LEVEL POWER MANAGEMENT
    6.1 TRAFFIC CONSIDERATIONS
    6.2 CONSTANT THRESHOLD ALGORITHM
    6.3 SINHA & CHANDRAKASAN
    6.4 MODIFIED HWANG AND WU’S [32]
    6.5 ADAPTIVE DYNAMIC THRESHOLD ALGORITHMS
        6.5.1 Improve Adaptive Algorithms By Exploiting Special Characteristics Of The Sensor Network
    6.6 EVALUATIONS OF VARIOUS POWER MANAGEMENT ALGORITHMS FOR SENSOR NETWORK APPLICATIONS
    6.7 SERVICE TIME CONSIDERATION
    6.8 IMPLEMENTATION COST OF THE PM

7. CONCLUSIONS AND FUTURE WORKS
    7.1 SUMMARY OF THESIS RESEARCH AND CONTRIBUTIONS
    7.2 LESSONS LEARNED AND FUTURE RESEARCH OPPORTUNITIES
        Platform-based design methodology for protocol processing
        Node level power management
        Network level power management

8. REFERENCES

1 Introduction

1.1 Challenges in Wireless Protocol Implementation

The implementation of small, mobile, low-cost, energy conscious devices has

created unique challenges for today’s designers. Limited battery lifetimes make

energy efficiency a most critical design metric, and the real-time nature of applications

imposes strict performance constraints. The drive for miniaturization and inexpensive

fabrication calls for an unprecedented level of integration and system

heterogeneity. Rapidly shrinking design times, driven by market pressure and

fierce competition, combined with the need to support multiple wireless standards,

favor a system architecture that is highly reusable and programmable. To meet these

conflicting and unforgiving constraints, we must optimize the designs among all the

above competing criteria: energy, performance, flexibility and cost, while managing

the ever-increasing complexity.

This thesis research is mostly concerned with the trade-off between energy and

flexibility in architecture implementation. Performance is viewed more as a design

constraint than as an optimization criterion. Figure 2 shows the vast playing ground to

be explored in the energy-flexibility trade-off game. Four different types of

architecture in ascending order of flexibility are presented: dedicated direct-mapped

ASICs, hardware-reconfigurable DSPs, software-programmable Digital Signal

Processors (DSPs) and embedded general-purpose microprocessors. Implementing the

same DSP algorithm, energy efficiency decreases by four orders of magnitude as we go

from the least flexible ASIC to the most flexible embedded processor.

Figure 1 shows the energy-flexibility trade-off for a particular wireless protocol

example. The MAC layer of a wireless protocol is implemented in ASIC, FPGA and

the ARM8 embedded processor. Similar to Figure 2, we observe a steep increase in

energy consumption from ASIC to embedded microprocessor (a ratio of 1 - 8 - >> 400).
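
The ratio quoted with Figure 1 can be checked with a few lines of arithmetic. The sketch below, in Python, uses only the per-operation energies listed with the figure (10.2 pJ/op, 81.4 pJ/op, and n*457 pJ/op with n = 13). The variable and function names are illustrative assumptions, and reading n as a plain multiplier on the ARM8 figure is our interpretation of the figure notes rather than something the text states.

```python
# Energy per operation as listed with Figure 1, in pJ/op.
# Assumption: n = 13 simply multiplies the ARM8 value of 457 pJ.
N_ARM8 = 13
energy_pj_per_op = {
    "ASIC": 10.2,
    "FPGA": 81.4,
    "ARM8": N_ARM8 * 457.0,
}


def ratios_to_asic(energy):
    """Normalize every entry to the ASIC energy per operation."""
    base = energy["ASIC"]
    return {name: value / base for name, value in energy.items()}


ratios = ratios_to_asic(energy_pj_per_op)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x the ASIC energy per operation")
```

With these inputs the FPGA comes out at about 8x and the ARM8 at well over 400x the ASIC energy per operation, in line with the 1 - 8 - >> 400 ratio listed with the figure.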

[Figure 2 plot: energy efficiency in MOPS/mW (or MIPS/mW), on a logarithmic axis from 0.1 to 1000, versus flexibility (coverage). Points shown: Dedicated HW; Reconfigurable Processor/Logic (Pleiades, 10-80 MOPS/mW); ASIPs/DSPs (2 V DSP: 3 MOPS/mW); Embedded Processors (SA110, 0.4 MIPS/mW).]

Figure 2 Energy vs. flexibility trade-off for various architectures. MOPS/mW means millions of operations per second per milliwatt. Dedicated HW is a direct-mapped ASIC. Pleiades is a reconfigurable hardware Digital Signal Processor (DSP). ASIPs/DSPs are software-programmable DSPs. SA110 is the StrongARM processor.

              ASIC          FPGA          ARM8
Power         0.26 mW       2.1 mW        114 mW
Energy        10.2 pJ/op    81.4 pJ/op    n*457 pJ/op

ASIC: 1 V, 0.25 um CMOS process
FPGA: 1.5 V, 0.25 um CMOS low-energy FPGA
ARM8: 1 V, 25 MHz processor; n = 13
Ratio: 1 - 8 - >> 400

Figure 1 Energy efficiency vs. flexibility trade-off for a wireless protocol implementation

To address these challenges, we must rethink the different aspects of the

traditional approach to system design, from methodology to architecture. This thesis

research targets the issues of energy-efficiency, flexibility and design complexity by

proposing a novel design methodology and carefully studying the software and

hardware architecture. In particular, we focus on the management layer of reactive

heterogeneous systems. In the context of power-critical systems, its role lies in system

resource scheduling and power management. The main research contributions of the

thesis are highlighted in the following section.

1.2 Research Statement and Major Contributions

A typical wireless system has an analog front-end and a digital back-end. The

digital part consists of base-band and protocol processing units. My research

concentrates on the “protocol” components, which are operations to ensure proper

delivery of the packets given the underlying network architecture and physical medium.

Relevant operations include routing, packet processing, classification and Medium

Access Control (MAC). Stream processing and direct manipulation of data in the

communication pipeline, such as data compression, decompression, encryption,

decryption and coding, are not considered. The thesis focuses on the energy-efficient

and flexible implementation of wireless protocols and the accompanying novel

design methodology.

The major contributions of this thesis can be summarized in two parts.

We propose a formal top-down platform-based design methodology for

protocol implementation. Most protocol design methodologies currently in use

are inadequate, either because they do not rely upon formal techniques and

therefore do not guarantee correctness, or because they do not provide

sufficient support for performance analysis and design exploration and

therefore often lead to sub-optimal implementations. Our methodology relies

on a formal Model of Computation (MOC). It supports architecture

exploration and meets the application’s need for flexibility while achieving

energy-efficient solutions. The methodology specifically targets complex

systems with a high level of integration and heterogeneity. Based on the

concepts of platform-based design, we divide the design process into three

distinct phases: Platform Conception, Platform Instantiation and

Implementation. A complex real time system, the PicoRadio [12] platform, is

used as the case study to help guide the entire design process. Experiments are

conducted at a variety of levels to illustrate how to apply the design

methodology to devise an architecture that is optimized for size, cost, and

most importantly, energy. The proposed formal top-down design methodology

yields superior results compared to traditional bottom-up ad-hoc approaches.

We propose a hierarchical power management framework for low-power

reactive heterogeneous systems. To achieve the ultimate energy efficiency,

power management should be implemented at every level of the design

hierarchy, from device to system levels. Power savings are normally highest

at the top level, that is, the system level. Traditional Operating Systems, acting

as the system manager and scheduler, do not usually provide power control

services. Furthermore, as they were developed for broad applications, they

are not efficient or in many cases not sufficient for the targeted types of

complex real time, power-critical domain specific systems. Our proposed

solution is developed to exploit the reactive event-driven nature of the domain

and has built-in aggressive power management. The hierarchical structure of

the framework enhances design scalability, supports concurrency in both the

application domain and architecture and enables power control at various

granularities. Our power management algorithm executes in two phases:

the network-level algorithm first treats the whole node as one entity and

decides when the whole node should go to sleep; then, once the node is turned

on, the node level algorithm decides on the scheduling of the various modules

inside the node.
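
The two-phase structure can be illustrated with a minimal sketch in Python. The class name, the idle-time threshold rule, and the module list below are hypothetical and chosen only for illustration; the algorithms actually proposed are developed in Chapters 4 to 6.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node seen as one entity by the network level (hypothetical model)."""
    modules: list = field(default_factory=lambda: ["radio", "mac", "sensor"])
    awake: bool = True
    powered: set = field(default_factory=set)


def network_level_decision(node, idle_time, sleep_threshold):
    """Phase 1: decide whether the whole node sleeps.

    Assumption: a plain idle-time threshold stands in for the real policy.
    """
    node.awake = idle_time < sleep_threshold
    if not node.awake:
        node.powered.clear()  # a sleeping node powers no modules
    return node.awake


def node_level_schedule(node, pending_events):
    """Phase 2: once the node is on, power only the modules that have work."""
    if not node.awake:
        return []
    node.powered = {m for m in node.modules if m in pending_events}
    return sorted(node.powered)


node = Node()
if network_level_decision(node, idle_time=2.0, sleep_threshold=5.0):
    print(node_level_schedule(node, pending_events={"radio", "mac"}))
```

In this sketch the node stays awake because it has been idle for less than the threshold, and the node-level phase then powers only the radio and MAC modules. The split keeps each decision small: the network level never reasons about individual modules, and the node level never reasons about neighbors.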

To validate research concepts, I have participated in the development of the

PicoRadio II and PicoRadio III chips. PicoRadio II was designed using our

proposed design methodology. PicoRadio III deploys a power manager to

demonstrate the reactive management concepts discussed in the thesis.

1.3 Thesis Roadmap

Figure 3 shows the general organization of the remainder of the thesis.

[Figure 3 diagram: Thesis Roadmap]

Chapter 2: 3-phase Platform Based Design Methodology (Phase I Platform Conception, Phase II Platform Instantiation, Phase III Implementation). Lesson learned: OS support is crucial. What is the right OS for reactive systems?

Chapter 3: Reactive OS. Inadequacy of general purpose OS; comparing event driven and general purpose OS; TinyOS needs to be extended for heterogeneous systems.

Chapter 4: Hierarchical Power Management Framework. Propose management framework based on TinyOS; OS as the global power manager and scheduler; existing power control algorithms; propose power management algorithm for PicoRadio network; 2-level power management.

Chapter 5: Node Level Power Management. Hierarchical architecture; node level power scheduling; Stateflow-Simulink simulation framework.

Chapter 6: Network Level Power Management. Sensor network traffic analysis; evaluating various power control algorithms in OMNet++.

Chapter 7: Conclusion. Recap contributions; lessons learned and future works.

Figure 3 Thesis Roadmap

Chapter 2 introduces our formal top-down platform-based design methodology

for protocol implementation. The iterative design process is divided into three distinct

phases: Platform Conception, Platform Instantiation and Implementation. The

PicoRadio platform is used as the case study to help guide the entire design process.

At the end of the implementation phase, however, we discover that our design falls

short of the original goal. It is quite inefficient and far from meeting our application

specification of low cost and low power. The inefficiency is due to two major faults in

the design process: modeling and software synthesis using ECOS, a general purpose

OS. The second fault can be greatly improved by replacing ECOS with TinyOS, a

reactive OS that better matches the application. After repairing these faults, we are

able to show that the proposed formal top-down design methodology yields superior

results compared to traditional bottom-up ad-hoc approaches.

The design flow experiment has drawn our attention to the vast impact of the OS.

We devote the rest of the thesis to the search for the “right” OS for our targeted

application: wireless communication systems.

In Chapter 3, we discuss why a general-purpose multi-tasking OS is less suitable

for our targeted application than a reactive OS, which is developed to exploit the

reactive event-driven nature of the domain. We present a comparison between the

two. Our results indicate that the event-driven OS achieves an 8x improvement in

performance, 2x and 30x reductions in instruction and data memory requirements,

respectively, and a 12x reduction in power over its general-purpose counterpart. However, the

existing TinyOS has many limitations and has to be extended.

In Chapter 4, based on the attractive concepts of TinyOS, we propose a power

management framework that specifically targets reactive heterogeneous systems. We

discuss some existing works on power management and proceed to propose our

power management algorithm for sensor networks. To handle complexity in system

modeling, the algorithm executes in two phases: the network-level algorithm first treats

the whole node as one entity and decides when the whole node should go to

sleep; then, once the node is turned on, the node-level algorithm determines the

scheduling of the various modules inside the node.

Chapter 5 covers the node level power control algorithm. We present the

hierarchical architecture and discuss power scheduling related issues. At the node

level, power scheduling makes decisions on the sequence and exact timing of the

block wakeups and sleeps. The goal is to minimize the overall power consumption

while meeting the performance and resource constraints. We use the Stateflow-

Simulink environment as our simulation framework.

Chapter 6 covers the network level power control algorithm. We first try to

understand the nature of the network traffic. Then we simulate the various power

control policies in a typical sensor network setting using the OMNet++ simulator.

From our experiments, adaptive algorithms seem to be good solutions since they are

able to exploit the temporal correlations in the traffic stream, handle environmental

changes and are relatively simple to implement. However, simple constant threshold

algorithms perform better for some “difficult” cases.
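
To make the distinction concrete, the sketch below contrasts a constant-threshold sleep policy with a simple adaptive one, in Python. Both are illustrative assumptions: the adaptive rule shown (an exponentially weighted average of recent idle periods, compared against a break-even time) is a generic stand-in and not one of the specific algorithms evaluated in Chapter 6, and all parameter values are invented.

```python
def constant_threshold_sleep(idle_time, threshold):
    """Sleep once the node has been idle for longer than a fixed threshold."""
    return idle_time > threshold


class ExponentialAveragePredictor:
    """Predict the next idle period from the recent idle-period history.

    Assumption: prediction = alpha * latest + (1 - alpha) * previous prediction.
    """

    def __init__(self, initial_prediction, alpha=0.5):
        self.prediction = initial_prediction  # predicted length of next idle period
        self.alpha = alpha                    # weight of the newest observation

    def update(self, observed_idle):
        """Fold a newly observed idle period into the prediction."""
        self.prediction = (self.alpha * observed_idle
                           + (1.0 - self.alpha) * self.prediction)

    def should_sleep(self, break_even_time):
        """Sleep only if the predicted idle period repays the wakeup overhead."""
        return self.prediction > break_even_time


predictor = ExponentialAveragePredictor(initial_prediction=10.0)
for idle in [10.0, 2.0, 2.0, 2.0]:  # traffic becomes bursty
    predictor.update(idle)
print(predictor.prediction, predictor.should_sleep(break_even_time=4.0))
```

In this run the predictor starts by favoring sleep, then tracks the shorter idle periods and stops recommending sleep once its prediction falls below the break-even time. The constant-threshold rule carries no history at all, which is what makes it cheap and predictable.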

Chapter 7 concludes the thesis. We recapitulate the major research results and

contributions. We also discuss the lessons learned and identify the open questions and

opportunities for further research.

2. Platform-based design methodology for wireless protocol processor

Following a formal design methodology is vital to protocol implementation. Most

existing methodologies are ad-hoc in nature. Without relying upon formal techniques,

they often do not provide sufficient support for performance analysis and design

exploration, and tend to lead to sub-optimal implementations. In our proposed

methodology, we capture the functional behavior of the design with a high-level

abstraction. Specifying the design with a formal Model of Computation (MOC)

enables us to apply design exploration and synthesis later in the design flow to

produce flexible and energy-efficient implementations.

Based on the concepts of platform-based design, we divide the design process into

three distinct phases: Platform Conception, Platform Instantiation and

Implementation. A complex real time system, the PicoRadio [12] platform, is used as

the case study to help guide the entire design process. Experiments are conducted at a

variety of levels to illustrate how to apply the design methodology to devise an

architecture that is optimized for size, cost, and most importantly, energy.

In this chapter, we start by introducing the concepts of Models of Computation

and platform. Then we describe in detail our three-phase platform-based design flow,

using PicoRadio as the design driver. To illustrate the effectiveness of our

methodology, we conclude this chapter by presenting a comparison between a design

implemented with the traditional ad-hoc methodology and one implemented with our

formal methodology.

2.1 Model of Computation

The process of system design starts with the correct capturing of system behavior.

Traditionally, the choice of the language used for capturing functional specifications

is often informal and application dependent. Natural languages, Matlab, C and C++

are all popular forms of design capture. However, these languages often lack the

semantic constructs to be able to specify concurrency. We promote a more formal

approach to choose the functional specification languages based on their underlying

mathematical model, which is called a model of computation (MOC) [1]. A MOC defines the

rules of interaction of components and the semantics of composition,

computation and concurrency [2]. In fact, concurrency models are the most important

differentiating factors among models of computation. A popular model of

computation is threads, where a set of sequential processes operates on the same data.

Other examples of MOCs are: Communicating Sequential Processes [3], the pi

calculus [4], dataflow [5], process networks [6], discrete events [7], Finite State

Machines (FSM) and the synchronous/reactive model [8].

The appropriate MOC for protocol processing is Concurrent Extended Finite

State Machines (CEFSM) [9]. CEFSM models a network of communicating extended

finite state machines (EFSMs), which are finite state machines that can have complex

actions on transitions (variable assignments, computation, etc.). An EFSM can effectively

express both control and the computation found in datapath operations. Figure 4

illustrates how CEFSM naturally models protocol processing. Each layer

(component) in the protocol stack is modeled as an EFSM. The communication

between EFSMs is asynchronous to accommodate components working at different

rates: Lower layers of the stack typically run much faster than the higher layers. The

asynchronous communication is supported through connecting queues between the

components.
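
The CEFSM structure described above can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the modeling environment used in the thesis: the class `EFSM`, the event names, and the two-to-one rate between layers are all hypothetical.

```python
from collections import deque


class EFSM:
    """Extended FSM: a state plus variables updated by transition actions."""

    def __init__(self, name, initial_state, transitions):
        self.name = name
        self.state = initial_state
        self.vars = {"count": 0}         # extended state (variables)
        self.inbox = deque()             # queue that decouples this EFSM from senders
        self.transitions = transitions   # (state, event) -> (next_state, action)

    def step(self):
        """Consume at most one queued event; return True if one was consumed."""
        if not self.inbox:
            return False
        event, payload = self.inbox.popleft()
        key = (self.state, event)
        if key in self.transitions:      # events with no transition are dropped
            next_state, action = self.transitions[key]
            action(self, payload)        # action may update variables or send events
            self.state = next_state
        return True


def mac_forward(mac, frame):
    """MAC action: count the frame and hand it up asynchronously."""
    mac.vars["count"] += 1
    network.inbox.append(("packet", frame))


def net_route(net, packet):
    """Network action: count the packet (routing itself is omitted)."""
    net.vars["count"] += 1


mac = EFSM("MAC", "idle", {("idle", "frame"): ("idle", mac_forward)})
network = EFSM("NET", "idle", {("idle", "packet"): ("idle", net_route)})

for i in range(4):
    mac.inbox.append(("frame", f"frame-{i}"))

# The lower layer runs faster: two MAC steps for every network step.
for _ in range(2):
    mac.step()
    mac.step()
    network.step()

print(mac.vars["count"], network.vars["count"], len(network.inbox))
```

After the loop the MAC machine has handled four frames while the network machine has handled two, and two packets are still waiting in the connecting queue. The queue is what lets the two machines run at different rates without either blocking the other.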

The formal capturing of functional behavior enables us to efficiently apply

verification and synthesis later in the design process. Verification and synthesis are

most effective if complexity is handled by formalization, abstraction and

decomposition [10]. Specifying the design with a high-level abstraction allows the

freedom to explore a wide variety of implementations.

[Figure 4 diagram. Labels: Protocol stack, Concurrent EFSMs, EFSM, C=>G.]

Figure 4 EFSM as the MOC for protocol processing. Each layer in the protocol stack is modeled as an EFSM; the EFSMs communicate through queues.

2.2 Concept Of Platform

As mentioned in Chapter 1, the need for shorter design time and greater design

complexity has made it necessary to look to new design methodologies that support

design reuse. Platform-based design [11] facilitates design reuse by abstracting

hardware to a higher level (system platform) that is visible to the application

software. A system platform has three components: hardware platform, software

platform and interconnect.

[Figure 5 diagram. Labels: Application Space, Application Instance, Platform Specification, System Platform, Platform Design-Space Exploration, Platform Instance, Architectural Space.]

Figure 5 Platform-based design methodology

The hardware platform should comprise a family of flexible (parameterizable)

architectures that adequately support the functions in the application space with

performance/power models. A software platform is needed to abstract the hardware

platform into a programmer’s model to allow effective mapping. It is usually in the

form of a Real Time Operating System (RTOS), which is responsible for the

scheduling of the computational resources and of the communication between them.

In essence, it is a hardware platform “manager”. Inter-communication strategies

designate the interconnection between the architecture modules.

Once a system platform has been identified for the application space and the

architecture space, the final chip design involves design exploration within the system

platform to determine the best mapping of application to architecture. Figure 5

graphically captures the platform-based design concept.
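
Design exploration of this kind can be pictured as a search over candidate mappings. The Python sketch below enumerates every assignment of application kernels to architecture modules and keeps the lowest-energy one that respects a flexibility requirement. The energy-per-operation values are the ones listed with Figure 1; the kernel names, operation counts, and the rule that some kernels must stay reprogrammable are invented for illustration and do not come from the PicoRadio design.

```python
from itertools import product

# Energy per operation in pJ, as listed with Figure 1 (ARM8: n * 457, n = 13).
ENERGY_PJ_PER_OP = {"ASIC": 10.2, "FPGA": 81.4, "ARM8": 13 * 457.0}

# Hypothetical kernels: (operations per packet, must remain reprogrammable?)
KERNELS = {
    "mac_timing": (1000, False),
    "packet_classify": (200, True),
    "routing_update": (50, True),
}

# Assumption: only the FPGA and the processor count as reprogrammable.
REPROGRAMMABLE = {"FPGA", "ARM8"}


def mapping_energy(mapping):
    """Total energy in pJ per packet for a kernel -> module mapping."""
    return sum(KERNELS[k][0] * ENERGY_PJ_PER_OP[m] for k, m in mapping.items())


def is_feasible(mapping):
    """Kernels flagged as flexible may only use reprogrammable modules."""
    return all(m in REPROGRAMMABLE or not KERNELS[k][1]
               for k, m in mapping.items())


def best_mapping():
    """Exhaustively search all mappings; adequate for a handful of kernels."""
    names = list(KERNELS)
    candidates = (dict(zip(names, choice))
                  for choice in product(ENERGY_PJ_PER_OP, repeat=len(names)))
    return min((m for m in candidates if is_feasible(m)), key=mapping_energy)


best = best_mapping()
print(best, mapping_energy(best))
```

Under these made-up inputs the fixed-function kernel lands on the ASIC and the two flexible kernels on the FPGA, which mirrors the heterogeneous mix a platform is meant to offer. A real exploration would add performance and area constraints and would use estimation models rather than exhaustive enumeration.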

Figure 6 shows an example of a platform. Its heterogeneous architecture combines

programmable (microprocessor), flexible (FPGA), and application-specific modules.

The mixed-mode platform processes both analog and digital signals and is DSP and

control intensive (FSM processing).

2.3 Three-Phase Platform-Based Design Flow

Our platform-based design flow can be split into three phases, as shown in Figure

7. In phase I, the system platform is conceived through consideration of the

application domain and the available architectural modules. Phase II performs the

design exploration to find a suitable platform instance for a given set of target

applications and constraints. Lastly, phase III completes the final implementation

(hardware and software synthesis) of a specific application onto the platform instance.

Note that the design flow is iterative. If the final implementation from Phase III does

not meet the design specification, we need to go back to either Phase I and/or Phase II

to refine the platform or the mapping until a satisfactory implementation is obtained.

[Figure 6 diagram. Blocks: Reconfigurable DataPath, Reconfigurable State Machines, Embedded uP + DSPs, FPGA, Dedicated DSP.]

Figure 6 Example of a platform

To facilitate further understanding of the different phases of the design methodology,

we present the design of the PicoRadio platform as a case study.

PicoRadio [12] is an ad hoc, sensor-based wireless network that comprises

hundreds of programmable and ultra-low power communicating nodes. PicoRadio

applications have the following characteristics: low-data rate, ultra-low power budget,

and mostly passive event-driven computation. Reactivity is triggered by external

events such as sensor data acquisition, transceiver I/O, timer expiration, and other

environmental occurrences. The chosen MOC for the PicoRadio protocol stack is

Concurrent Extended Finite State Machines (CEFSM).

Like most major projects, the PicoRadio design has progressed through different

versions. The second version of the PicoRadio design (PicoRadio II) is far simpler

than the ultimate design; it is intended to provide a learning experience for addressing

the methodology, tools and integration issues. PicoRadio II has been completed and

tested and will be used as the case study for the design flow.

[Figure 7 diagram. Input: Functional Specification. Phase I: Kernel Extraction via Functional Profiling, Fabric Exploration, Configurable Platform. Phase II: Mapping, Performance Evaluation. Phase III: Implementation.]

Figure 7 Platform-based design flow

2.3.1 Phase I – Platform Conception

The first step in platform-based design is to conceive a system platform that

identifies a set of architectural modules to support the class of functions in our

application domain.

A typical platform for wireless systems consists of programmable processors, reconfigurable logic, dedicated logic, memories, and peripherals. Constructing a hardware platform that supports the key functions of the application domain is a two-fold process, proceeding in lock-step: we need to (1) identify the key functions and their constraints, and (2) explore the available architecture modules and their performance behavior. The former is achieved through functional profiling of a suite of candidate applications and extraction of a set of key operations (kernels) common to these applications. The latter requires architecture exploration of existing implementation fabrics to obtain first-order performance, energy and area estimates for these basic modules. Figure 8, taken from Figure 7, depicts this duality. The output of Phase I is a library of architectural modules with corresponding performance, energy and area prediction models. In the following sections, we present these two steps in greater detail in the context of our case study.

[Figure 8 blocks: Functional Profiling; Fabric Exploration; Phase I]

Figure 8 Phase I – Platform Conception

2.3.1.1 Functional Profiling

Before starting the implementation process, we need to gain an in-depth

understanding of our application space. A highly efficient implementation can

only be realized if the performance-critical operations in the applications are

classified and specially targeted. Functional profiling explores regularity and extracts

common operations (kernel extraction) in the application. The important issues of

functional profiling are profiling granularity and classification and interpretation of

the collected data. If the granularity is either too coarse or too fine, regularity and

commonality may not be fully exposed. To reach an optimal granularity, some

reorganization of the application code (for example, insertion of some wrapper

functions) is often needed. To classify and interpret profiling data in a meaningful

fashion requires some insight into the class of application algorithms.

We use the list of critical operations identified by the high performance wired

network processor (WNP) community as the initial guideline for profiling wireless

applications. It consists of parsing, searching (table lookup), packet modifying and

re-assembly.

We have conducted experiments to profile both the network and the MAC layers

of the protocol stacks. Network layer profiling is performed on a mobile ad hoc

network application that supports different network protocols. MAC layer profiling is

performed on a distributed multi-channel MAC. For both experiments, the application

programs are written in OPNET Radio Modeler from Millennium 3 Technologies

[13].

Network Layer Profiling

The model has simple MAC and physical layers, but a rather sophisticated

network layer that enables us to explore different types of routing protocols. Node

distribution and mobility can also be specified. Routing protocols have a significant

impact on implementation parameters such as routing and forwarding table sizes.

The distribution of nodes in the network affects the network activity and hence the

protocol performance. We studied four different scenarios with two different routing

protocols and two types of node distributions. The first protocol is the Ad-hoc On-

Demand Distance Vector Routing (AODV) [14], a reactive protocol. The second is

the Dynamic Destination-Sequenced Distance-Vector Routing (DSDV) [15], a

proactive protocol. The first type of network distribution is a uniform grid of nodes in

which the neighbors can hear each other. The second is a randomly generated

distribution. The nodes have fewer neighbors on average in the random distribution

case and generate less network traffic.

OPNET only produces profiling information at the level of the leaf functions. The generated fine-grained profiling data are grouped and classified using the WNP guidelines. The results, presented in Figure 9 and Figure 10, look very similar to the WNP cases. The kernels extracted are: searching (table lookup), packet processing (parsing, modification, assembly, etc.) and memory (queue, buffer) management.

We are interested in both the total time these operations consume (Figure 9) and the number of operations performed (Figure 10). Searching consumes 20%-45% of the total time, and 26%-46% of the total operations performed are searches. Packet manipulation (parsing, modification, re-assembly, etc.) consumes 18%-28% of the total time. However, only 4%-9% of the total operations performed are packet manipulations. This implies that a packet manipulation operation takes longer than the average operation.
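
The last inference can be checked with a little arithmetic. The Python sketch below is an illustration added here, not part of the original profiling flow. It divides a kernel's share of total time by its share of total operations; a ratio above 1 means its operations last longer than the average operation. The text reports only ranges across the four scenarios and does not pair them per scenario, so the sketch combines the extremes to obtain the loosest bounds. The function name is hypothetical.

```python
def relative_duration_bounds(time_share, op_share):
    """Bound the ratio (time share / operation share) for one kernel.

    time_share, op_share: (low, high) percentage ranges quoted in the text.
    The ranges are not paired per scenario, so the extremes are combined
    to obtain the loosest possible bounds.
    """
    t_lo, t_hi = time_share
    n_lo, n_hi = op_share
    return t_lo / n_hi, t_hi / n_lo


# Ranges quoted in the text: (% of total time), (% of total operations).
search = relative_duration_bounds((20, 45), (26, 46))
packet = relative_duration_bounds((18, 28), (4, 9))

print("search: %.2fx to %.2fx the average operation" % search)
print("packet manipulation: %.2fx to %.2fx the average operation" % packet)
```

Even the loosest lower bound for packet manipulation is 2, so each packet manipulation takes at least twice as long as the average operation, whereas a search can fall on either side of the average.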

[Figure 9 chart: total time percentage breakdown (%, axis 0 to 45) from different scenarios. Categories: Search; Memory Mgmt.; Packet Disassembly; Packet Assembly; Timers; Others. Series: aodv_random45; dsdv_random35; aodv_uniform40; dsdv_uniform40]

Figure 9 Functional profiling of the network layer

[Figure 10 chart: number of operations percentage breakdown (%, axis 0 to 50). Categories: Search; Memory Mgmt.; Packet Disassembly; Packet Assembly; Timers; Others. Series: aodv_random45; dsdv_random35; aodv_uniform40; dsdv_uniform40]

Figure 10 Functional profiling of the network layer

[Figure 11 chart: number of activations (axis 0 to 120) for the distributed multi-channel MAC. Categories: Search; Queue Proc.; Packet Disassembly; Packet Assembly; Timer Proc.]

Figure 11 Profiling a multiple channel MAC

MAC Layer Profiling

The kernels identified in our MAC profiling experiment are similar to those of the network layer: searching, queue management, packet disassembly (parsing, pattern matching), packet assembly and timer processing (see Figure 11). We have made the following assumptions: the delivery request rate is 10 packets per second and there are at most three delivery attempts.

Summary of Profiling Results

Table 1 summarizes the profiling experiments conducted in the previous sections. As we go down the stack from application to physical, the processing speed increases and the processing granularity decreases. Kernels are classified as either control dominated or data dominated. Kernels are data dominated if the complexity mostly comes from data processing, and control dominated if the complexity mostly comes from control structures.

Layers | Control Kernels | Data Kernels | Processing speed
Application/Transport | - | Encryption; Decryption; Compression; Decompression | Source data; sec.
Network | Routing/forwarding table lookup; Timers | Packet processing (parsing, modification, assembly, disassembly) | Packets; millisec.
MAC | Queue management; Packet assembly/disassembly; Timers; Channel assignment; Table lookup | Localization algorithm | Packets; microsec.
Physical | Synchronization; Timers; Segmentation; Assembly/disassembly | CRC/Verification; Complex multipliers; Match-filters; FIR filters; Correlators; Magnitude-squarers; CORDICs | Bits; nanosec.

Table 1 Summary of profiling results

The classification of the functional kernels into control and data processing

operations is very important for the realization of an efficient implementation. The

different nature of control and data processing operations intuitively leads to different

“optimal” implementation structures. An efficient architecture for protocol

implementation should contain a mixture of these different implementation structures.

The exact proportion of the different structures depends on the ratio between control

and data in the application. In the following section, we will introduce a “hybrid

architecture” that is based on this concept.

From the profiling results, we can conclude that the implementation of dedicated

engines that perform packet processing, table searching and queue management may

greatly improve the overall system performance. Furthermore, since searching is

mostly routing and forwarding table lookups, a routing protocol that does not involve

the maintenance of large tables (or system states) should lead to a cheaper

implementation.

2.3.1.2 Architecture Exploration

To ensure the most usable system platform for the remainder of the design process,

the components of the hardware platform should provide sufficient coverage of the

functions of the application domain. Functional profiling identifies the most crucial

functions that we need to address. Architectural exploration generates a set of

architectural modules that best supports these functions.

To enable software reuse and allow application exploration, a wireless platform

must provide some degree of programmability. Traditional computing architectures

range from microprocessors to ASICs, but most fail to meet the energy requirement

or the flexibility requirement. There also exist configurable processors [16] [17], but

too often they are designed to specialize for applications other than communication

protocol processing. This leaves us with the family of reconfigurable logic

architectures, which are sufficiently low-level to allow low-energy circuit techniques,

while providing some degree of flexibility through reconfiguration. Also, they are not

limited to certain application sets.

Traditional Reconfigurable Architectures

Traditional reconfigurable architectures come in two flavors: field-programmable

gate array (FPGA) and programmable logic device (PLD) [18]. FPGA and PLD

architectures differ significantly in granularity. Using look-up table (LUT)

technology, FPGAs can efficiently implement any arbitrary logic with few inputs.

Since the LUTs are easily chained together to implement multilevel logic, this

architecture is well suited for complex operations such as arithmetic and signal

processing. On the other hand, the PLDs use programmable array logic (PAL) blocks

that can each implement sum-of-product logic of many inputs but limited output.

Thus, PLD structures are suitable for control FSMs. We performed experiments

mapping benchmarks from PicoRadio II to commercial FPGA and PLD chips, and

measured their utilization based on equivalent gate count. The results, shown in

Figure 12, are consistent with the theoretical claims.

Hybrid Architecture for Protocol Processing [19]

Since the PicoRadio protocol stack takes the form of EFSMs, we are constructing a reconfigurable architecture using both PAL and LUT blocks, for control and datapath respectively. By using each structure for the functions it is best suited to, we can achieve the best performance in the combined structure.

The architecture uses hybrid cells, each consisting of a small PAL block for

control, and a small array of LUTs and flip-flops (FFs) for data processing. Figure 4

shows a block diagram of this architecture. Each cell in this structure comprises a

PAL block and a small array of LUTs and FFs; thus, each cell corresponds to a small

FSM. Since protocols have many interacting FSMs, the architecture shall have an

array of these hybrid cells.

[Figure 12 chart: normalized utilization (axis 0 to 1.2) of FPGA and PLD implementations for the benchmarks PhysSend (FSM), Remote (FSM), GenSync (Data) and MergeInteger (Data)]

Figure 12 Implementation results of wireless protocol blocks.

Figure 13 shows a detailed block diagram of a hybrid cell. Since the data processing elements are isolated in the LUT portion of the cell, the FSM must generate control signals that feed into the LUTs. The data inputs go directly to the LUTs, and the control inputs go directly to the PAL. Similarly, the data outputs come from the LUTs, and the control outputs come from the PAL.

Since control and data outputs, as well as any internal control signals may be used

in the control plane, these signals must be fed back into the PAL block. Layout of the

block should be considered carefully to minimize the lengths of these feedback

signals.

[Figure 13 blocks: PAL (control); LUTs & FFs (datapath). Signals: Data Input; Data Output; Ctrl Input; Ctrl Outputs; Ctrl Signals]

Figure 13: Basic block diagram of the hybrid architecture.

As mentioned in Section 2.3.1, to simplify performance evaluation, each architecture should have estimation models that provide first-order performance numbers for a given function. Based on empirical data from SPICE simulations, we obtained cost estimates and deduced a set of prediction equations for the power costs of the FPGA and PAL structures. Figure 14 shows these equations. The estimates are based on a 0.25µm technology with a 1.0V supply. Estimation of the data portion is based on the energy consumed by LUTs and FFs. Estimation of the control processor is based on a dynamic logic PAL implementation, which has significantly lower energy consumption than a traditional sense-amp-based implementation. Note that interconnect power is not included; this is a source of error that we accept for the sake of simplicity.

With the power models, we can obtain first-order performance results to see how

well the architecture works for our applications. Using the mac_a design from the

TCI project as benchmark, we estimated its power consumption under three different

scenarios: purely FPGA implementation, purely PAL implementation and the

proposed hybrid approach. The results, shown in Figure 15, suggest that the hybrid

architecture out-performs the other two scenarios. We can expect even greater gain

when the power dissipation of the PAL reduces as our research in low-energy PAL

matures.

LUT Power Estimation

LUT Power = (LUTs * 2.2uW) + (FF * 0.5uW)

PAL Power Estimation

PAL Power = (P-terms * Inputs * 0.05uW) + (Outputs * 0.7uW)

Figure 14 Power estimation equations for LUT and PAL implementations.
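
The two equations of Figure 14 translate directly into code. The Python sketch below is an illustration only: the coefficients are those of Figure 14, while the function names and the block sizes used to exercise them are hypothetical and do not correspond to any PicoRadio block.

```python
# Coefficients from Figure 14, in microwatts.
LUT_UW = 2.2              # per LUT
FF_UW = 0.5               # per flip-flop
PAL_TERM_INPUT_UW = 0.05  # per (product term x input)
PAL_OUTPUT_UW = 0.7       # per PAL output


def lut_power_uw(luts, ffs):
    """First-order power of a LUT/FF array (datapath portion)."""
    return luts * LUT_UW + ffs * FF_UW


def pal_power_uw(p_terms, inputs, outputs):
    """First-order power of a PAL block (control portion)."""
    return p_terms * inputs * PAL_TERM_INPUT_UW + outputs * PAL_OUTPUT_UW


# Hypothetical block sizes, chosen only to exercise the equations.
datapath = lut_power_uw(luts=40, ffs=16)
control = pal_power_uw(p_terms=20, inputs=10, outputs=8)
hybrid_total = datapath + control

print(f"datapath {datapath:.1f} uW, control {control:.1f} uW, total {hybrid_total:.1f} uW")
```

As in the original models, interconnect power is left out, so the totals are lower bounds on what a placed-and-routed cell would dissipate.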

[Figure 15 chart: power dissipation (uW, axis 0 to 1800) of the FPGA, PAL and Hybrid architectures, broken down into P(PAL) and P(FPGA)]

Figure 15: Power comparison of different architectures.

2.3.1.3 Architecture Library

The output of Phase I is a library of architectural modules with corresponding performance, energy and area prediction models. Figure 16 shows an example library with hardware, software and interconnect platform components. The hardware platform components shown are a microprocessor, an ASIC and configurable logic. Each module is annotated with estimation models. The OS is a typical software platform component. Critical parameters in estimating OS performance include, but are not limited to, the type of scheduler and the task management overhead. The interconnect platform shown is a Sonics bus model.

[Figure 16 contents:
Library components: Proc.; Conf. Logic; ASIC; OS; Interconnect model (SONICS model: Initiator Core; Initiator Agent; Interconnect; Target Agent; Target Core; OCP; Arbiter).
Annotations: Delay Model; Power model; Type of scheduler; Delay model: overhead for task management.
Instruction entries (inst,name,value): LD 2; LI 1; ST 2; OP.c 2; OP.s 3; OP.i 1; OP.l 1; OP.f 1; OP.d 6; MUL.c 9; MUL.s 10; MUL.i 18; MUL.l 22; MUL.f 45; MUL.d 55; DIV.c 19; DIV.s 110; DIV.i 118; DIV.l 122; DIV.f 145; DIV.d 155; IF 5; GOTO 2; SUB 19; RET 21.
Power equations:
PAL = (3.54 + M * 6.75 + (P-M) * 1.02 + I * 4.44) * ToggleRate * Vdd^2
FPGA = (L * 4.18 + F * 0.95) * ToggleRate * Vdd^2]

Figure 16 Architecture library
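
To illustrate how a module annotated with an estimation model might be used, the Python sketch below builds a first-order processor delay estimate from a few of the "inst,<name>,<value>" entries that appear in Figure 16. Reading each value as a cost in clock cycles is an assumption, and the instruction mix, the clock frequency and the function names are hypothetical.

```python
# Values transcribed from the "inst,<name>,<value>" entries of Figure 16.
# Interpreting them as clock cycles per instruction is an assumption.
INSTRUCTION_COST = {
    "LD": 2, "LI": 1, "ST": 2, "OP.i": 1, "MUL.i": 18, "DIV.i": 118,
    "IF": 5, "GOTO": 2, "SUB": 19, "RET": 21,
}


def estimate_cycles(instruction_counts):
    """Sum the per-instruction costs over an instruction mix."""
    return sum(INSTRUCTION_COST[name] * count
               for name, count in instruction_counts.items())


def estimate_delay_s(instruction_counts, clock_hz):
    """First-order execution time of the mix at a given clock frequency."""
    return estimate_cycles(instruction_counts) / clock_hz


# Hypothetical instruction mix for one invocation of a small routine.
mix = {"LD": 12, "OP.i": 10, "IF": 6, "SUB": 1, "RET": 1}
cycles = estimate_cycles(mix)
delay = estimate_delay_s(mix, clock_hz=1_000_000)

print(f"{cycles} cycles, {delay * 1e6:.0f} us at 1 MHz")
```

A model of this kind ignores memory and OS effects, which is the kind of omission that Section 2.4 later identifies as a source of underestimation.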

2.3.2 Phase II – Platform Instantiation

Once a system platform has been defined, we

need to explore within the system platform to find

a platform instance that is suitable for a given set

of applications and constraints. To do so, we

employ the Y-chart approach [20], which involves

an iterative process of mapping functions to

parameterized architectural modules, and

evaluating the performance of the resulting

platform under the given set of functional

constraints. This process is illustrated in Figure

17.

[Figure 17 blocks: Configurable Platform; Mapping; Phase II]

Figure 17 Phase II- Platform Instantiation

To fully explore the design space, we need to re-emphasize the need for separation of functional and implementation concerns. Doing so provides the designer with the greatest degree of freedom in choosing the best solution. An important consideration is the need for a purely functional design specification with an underlying formal mathematical model, as stated in Section 2.1.

Given that a well-defined system platform is in place, platform analysis becomes a

relatively simple process. In the system platform, we have available a library of

architectural modules with corresponding performance, energy and area prediction

models. From these modules, we can construct different platforms with varying

performance for our target applications. By examining multiple platforms, we can

quickly converge to an optimal platform instance that satisfies the set of design

constraints of our applications.
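
The exploration loop described above can be sketched as an exhaustive search over mappings. The Python example below is a minimal illustration under stated assumptions: the library entries, deadlines and flexibility requirements are hypothetical numbers invented for the example, and a real flow would use the prediction models of Phase I and a tool such as VCC rather than a lookup table.

```python
from itertools import product

# Hypothetical library: per-function (delay in us, energy in uJ) for each module type.
LIBRARY = {
    "processor":    {"transport": (400, 8.0), "mac": (900, 20.0), "baseband": (5000, 90.0)},
    "config_logic": {"transport": (120, 3.0), "mac": (60, 2.5),   "baseband": (40, 6.0)},
    "asic":         {"transport": (50, 1.0),  "mac": (20, 0.8),   "baseband": (10, 1.5)},
}
FLEXIBLE = {"processor", "config_logic"}   # modules that can be reprogrammed

# Hypothetical constraints of the target application.
DEADLINE_US = {"transport": 500, "mac": 100, "baseband": 50}
NEEDS_FLEXIBILITY = {"transport", "mac"}


def feasible(mapping):
    """A mapping is feasible if every function meets its deadline and
    every function that must stay flexible sits on a flexible module."""
    for function, module in mapping.items():
        delay_us, _ = LIBRARY[module][function]
        if delay_us > DEADLINE_US[function]:
            return False
        if function in NEEDS_FLEXIBILITY and module not in FLEXIBLE:
            return False
    return True


def energy(mapping):
    """Total estimated energy of a mapping."""
    return sum(LIBRARY[module][function][1] for function, module in mapping.items())


def explore():
    """Enumerate all mappings and return the feasible one with least energy."""
    functions = sorted(DEADLINE_US)
    candidates = (dict(zip(functions, modules))
                  for modules in product(LIBRARY, repeat=len(functions)))
    return min((m for m in candidates if feasible(m)), key=energy)


best = explore()
print(best, energy(best))
```

Exhaustive enumeration is only practical for a handful of functions and module types; the point of the sketch is the separation between the functional constraints and the architecture models that are evaluated against them.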

2.3.2.1 PicoRadio II Case Study

While designing the PicoRadio II platform, we used the Virtual Component Co-design (VCC) tool from Cadence Design Systems [21] to perform architecture exploration. The behavior description of the protocol design is specified formally with CEFSMs, as shown in the upper portion of Figure 18. The design is described hierarchically: the top level consists of data (de)compression blocks (MULAW), a User Interface (UI) and the protocol stack. The protocol stack expands to include the transport layer, the MAC layers, and the physical layer (Transmit, Receive, Time base and Synchronization blocks). VCC provides an architecture library containing characterizable architecture modules such as microprocessors, ASIC blocks, operating systems and interconnects. An architecture platform constructed using the library, as shown in Figure 18, consists of an ARM7 Thumb microprocessor running the eCos operating system [22] and an ASIC block, connected by the TDMI interconnect bus. In this particular implementation, we have mapped the UI and the transport layer onto software (eCos) and the rest of the protocol stack onto hardware (ASIC).

[Figure 18 panels: Behavior; Architecture]

Figure 18 Architecture exploration with VCC design tool

The quality of the mappings can be evaluated with the VCC performance evaluation tool. Figure 19 shows an experiment with the software mapping. If we map only the UI and the transport layer onto software, the processor can run at a low frequency of 1 MHz and still be under-utilized. If we speed up the processor to 11 MHz, we can also map MULAW onto software and increase the processor utilization to 32%. However, we cannot map the entire MAC layer onto software even if we drastically increase the processor speed to 2 GHz. In fact, processor utilization drops, and event losses and timing violations occur, as we try to accommodate more of the MAC on a faster processor. The reason is that OS-related overheads dominate the processor as we increase the processing speed. The issue of OS overhead will be discussed in greater detail later in the thesis.

[Figure 19 chart: processor utilization versus clock frequency. Labeled points: ARM@1MHz, 5.46% (User Interface, Transport); ARM@11MHz, 32.7% (User Interface, Mulaw, Transport); ARM@200MHz, 2.7% (User Interface, Mulaw, Transport, 0.5 MAC); ARM@2GHz (User Interface, Mulaw, Transport, 0.9 MAC)]

Figure 19 Performance evaluation
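
The utilization figures become easier to compare when converted into busy cycles per second, that is, utilization multiplied by clock frequency. The Python sketch below performs this conversion for the three points of Figure 19 that carry a utilization label. It is an illustrative calculation added here, not a result from VCC, and the function name is hypothetical.

```python
def busy_cycles_per_second(utilization_percent, clock_hz):
    """Cycles per second during which the processor is busy."""
    return utilization_percent / 100.0 * clock_hz


# (clock frequency in Hz, utilization in percent) as labeled in Figure 19.
points = {
    "ARM@1MHz": (1e6, 5.46),
    "ARM@11MHz": (11e6, 32.7),
    "ARM@200MHz": (200e6, 2.7),
}

load = {name: busy_cycles_per_second(util, hz)
        for name, (hz, util) in points.items()}

for name, cycles in load.items():
    print(f"{name}: about {cycles / 1e6:.2f} M busy cycles per second")
```

On these figures, adding Mulaw raises the software load from roughly 0.05 million to roughly 3.6 million busy cycles per second, which is why a 1 MHz clock no longer suffices. The utilization numbers alone do not show how much of each figure is OS overhead rather than application work.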

2.3.3 Phase III – Implementation

Once we have a platform that meets all of the design constraints, we are ready for

the final implementation of the design. In this phase, we perform hardware and

software synthesis to implement a specific application

onto the platform instance. In the software synthesis

process, generation and compilation of application code

is performed to translate high-level description into

executables. A real-time operating system (RTOS) is

selected to handle the synchronization and

communication in the system. In the hardware synthesis

process, the hardware-mapped application blocks are

transformed from high-level description to synthesizable

HDL.

Currently, for most existing embedded systems, software synthesis typically means transforming an application with inherent concurrency into a sequential program running on a uni-processor. The application code is generated by turning each concurrent system component in the specification into a task. Communication wrapper functions are then generated to connect the tasks and the RTOS, which manages task communication and scheduling.
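
The transformation described above can be sketched in a few lines. In the Python illustration below, each concurrent component becomes a task with a mailbox (standing in for the generated communication wrappers), and a trivial round-robin scheduler (standing in for the RTOS) serializes the tasks onto one processor. All class, task and message names are hypothetical; this is neither the code VCC generates nor the eCos API.

```python
from collections import deque


class Task:
    """One concurrent component of the specification, turned into a task."""

    def __init__(self, name, handler):
        self.name = name
        self.handler = handler   # function(message, send)
        self.mailbox = deque()   # communication wrapper: per-task message queue


class Scheduler:
    """Stand-in for the RTOS: runs all tasks on a single processor."""

    def __init__(self):
        self.tasks = {}
        self.trace = []          # order in which messages were handled

    def add(self, task):
        self.tasks[task.name] = task

    def send(self, destination, message):
        self.tasks[destination].mailbox.append(message)

    def run(self):
        # Round-robin over the tasks until every mailbox is empty.
        progress = True
        while progress:
            progress = False
            for task in list(self.tasks.values()):
                if task.mailbox:
                    message = task.mailbox.popleft()
                    self.trace.append((task.name, message))
                    task.handler(message, self.send)
                    progress = True


def ui(message, send):
    # The user interface forwards each request to the transport layer.
    send("transport", "packet(" + message + ")")


def transport(message, send):
    # A real transport task would hand the packet to the hardware-mapped MAC.
    pass


rtos = Scheduler()
rtos.add(Task("ui", ui))
rtos.add(Task("transport", transport))
rtos.send("ui", "key_press")
rtos.run()
print(rtos.trace)
```

Every message crosses the scheduler, which is where the communication and scheduling overheads discussed in Section 2.4 arise.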

[Figure 20 blocks: From Phase II; Implementation; Phase III]

Figure 20 Phase III--Implementation

Figure 21 shows the software synthesis process for PicoRadio II using VCC. eCos was chosen as the embedded OS due to its availability and efficiency. The right hand side of the figure shows a snapshot of the generated code. This code is reached after system initialization to start up the application program. A thread is created for each component mapped to software and subsequently suspended. After the OS scheduler starts up, the application threads are resumed and the application starts running.

In hardware synthesis, high-level application specifications need to be translated

into synthesizable languages, which are the entry languages of synthesis tools that

generate silicon. Most tools take HDL as the entry description language. In PicoRadio

II, the hardware-mapped MAC and Physical layers are implemented with standard

cells.

    void cyg_user_start(void) {
        ……
        cyg_thread_create(0, task_ui_2_, 0, …);
        cyg_thread_create(0, task_transport_1_transport_bs…);
        cyg_thread_create(0, task_transport_1_transport_remote_, …);
        …..
        cyg_thread_resume(task_ui_2__handle);
        cyg_thread_resume(task_transport_1_transport_bs__handle);
        cyg_thread_resume(task_transport_1_transport_remote__handle);
        …..
    }

Figure 21 Software synthesis for PicoRadio II

2.4 Design Iteration

At the end of the implementation phase, we need to ask ourselves whether what we get is what we want; in other words, whether the implementation fulfills the design specification and meets the constraints. If not, we need to go back to Phase I and/or Phase II to iteratively refine the platform or the mapping until a satisfactory implementation is obtained.

The final chip layout of PicoRadio II is shown in Figure 22. We notice that the software portion of the architecture, including the processor and its memory blocks, occupies more than 70% of the total area. This is especially inefficient considering that the processor is greatly under-utilized (utilization < 7%). The reason is that the software-implemented UI and transport layers run at a much lower activity and rate (user request and packet level processing) than the hardware-implemented MAC and physical layers (bit level processing).

Careful analysis of the software code reveals that of the total 10K byte instruction

code size, about 50% is communication overhead. The massive data memory size of

54K is a result of the communication overhead, expensive scheduler overhead,

memory management, and stack allocations.

The above design is quite inefficient and far from meeting our application specification of low cost and low power. The inefficiency is due to two major faults in our design process. The first fault lies in Phase I: we did not include memory modeling in our processor model. Hence, in Phase II, the software components were mapped to a processor that has NO instruction and program memory, which results in a gross underestimation of the cost of the design. The second fault lies in Phase III: the inefficient software code, laden with OS overheads, is the result of a poorly chosen OS and synthesis process.

[Figure 22 labels: 3.6 mm; 2.8 mm; 64 KB Instruction SRAM; 64 KB Data SRAM; Sonics; I/O; tci_prot; Xtensa; ICache; audio data; flash interface; PPI; Wiring Channel; numeric labels 36, 387, 6 TAP, 40 I/O]

Figure 22 PicoRadio II floorplan

The obvious correction to the first fault is to incorporate memory requirements in processor modeling and performance estimation. The second fault can be largely remedied by replacing eCos, a general-purpose OS, with one that better matches the application. TinyOS, an OS developed specifically for event-driven systems, is a promising candidate. We have re-run the software synthesis process with TinyOS as the RTOS and were able to reduce the memory requirement drastically. The application code size has gone from 10K to 5K, with the original 50% overhead nearly eliminated. The OS now occupies only 3K of memory space, and the data memory size has gone from 54K to a mere 3K. The details of this experiment will be presented in the next section, Reactive Operating Systems.
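
The size of the improvement follows directly from the numbers quoted above. The short Python calculation below is added for illustration; the dictionary keys are hypothetical labels, and counting the 3K OS footprint as instruction memory is an assumption.

```python
# Memory figures quoted in the text, in kilobytes.
ECOS = {"code": 10, "data": 54}
TINYOS = {"code": 5, "data": 3, "os": 3}

code_reduction = ECOS["code"] / TINYOS["code"]
data_reduction = ECOS["data"] / TINYOS["data"]

# Assumption: the 3K occupied by the OS is instruction memory.
tinyos_instruction_total = TINYOS["code"] + TINYOS["os"]

print(f"code: {code_reduction:.0f}x smaller, data: {data_reduction:.0f}x smaller")
print(f"instruction memory with TinyOS: {tinyos_instruction_total}K")
```

Under that assumption the instruction memory totals 8K, which agrees with the value listed for PicoRadio II in Table 2.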

The software synthesis experiment of PicoRadio II has drawn our attention to the vast impact of the OS on the system. We will devote the rest of the thesis to the search for the “right” OS for our targeted application: wireless communication systems. The existing TinyOS [23] will be examined in detail and then extended to construct a hierarchical system management framework.

2.5 Methodologies Comparison

To demonstrate the effectiveness of our methodology, we will present a

comparison between a design implemented with the traditional ad-hoc methodology

and one implemented with our formal methodology.

PicoRadio I is the earliest prototype version of PicoRadio. Used to demonstrate the feasibility of sensor networking, it was an ad-hoc design built out of off-the-shelf components. The PicoRadio I test board consists of a StrongARM processor and Xilinx FPGAs, and implements a protocol very similar to that of PicoRadio. Since PicoRadio I functions as the test-bed for the PicoRadio project, design methodology and optimization are not its primary concerns.

Obviously, it is unfair to compare the cost and performance of PicoRadio I to that of PicoRadio II. However, their vast differences in implementation costs demonstrate what a formal methodology can accomplish. As shown in Table 2, there is a one to two orders of magnitude improvement in software costs. It is rather difficult to compare the hardware costs due to the differences in implementation fabrics: an ASIC gate typically “costs” less than an FPGA gate. Even if the hardware portion of PicoRadio II is three times more expensive than that of PicoRadio I, the winning edge it has in software should well compensate for it.

             | Software: Instruc. Mem | Software: Data Mem | Hardware (equivalent gate count)
PicoRadio I  | 200K                   | 225K               | 21200 FPGA
PicoRadio II | 8K                     | 3K                 | 72377 ASIC

Table 2 Implementation cost comparison between PicoRadio I and PicoRadio II
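
The claims made about Table 2 can be verified with a short calculation. The Python sketch below uses only the values in the table; the dictionary keys are hypothetical labels introduced for the example.

```python
import math

# Values from Table 2 (memory in kilobytes, hardware in equivalent gates).
PICORADIO_I = {"instr_kb": 200, "data_kb": 225, "gates": 21200}
PICORADIO_II = {"instr_kb": 8, "data_kb": 3, "gates": 72377}

instr_ratio = PICORADIO_I["instr_kb"] / PICORADIO_II["instr_kb"]
data_ratio = PICORADIO_I["data_kb"] / PICORADIO_II["data_kb"]
gate_ratio = PICORADIO_II["gates"] / PICORADIO_I["gates"]

print(f"instruction memory: {instr_ratio:.0f}x smaller "
      f"({math.log10(instr_ratio):.1f} orders of magnitude)")
print(f"data memory: {data_ratio:.0f}x smaller "
      f"({math.log10(data_ratio):.1f} orders of magnitude)")
print(f"gate count: {gate_ratio:.1f}x larger")
```

The software ratios fall between one and two orders of magnitude, and the raw gate-count ratio is a little over three, consistent with the statements in the text. The gate counts are not directly comparable, since one is an FPGA figure and the other an ASIC figure.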

3. Reactive Operating Systems -- the Software Management Layer

As we have seen in Section 2.4, OS support is crucial for the design of ultra-low

energy communication systems. These systems, reactive in nature, tend to have a high

level of integration and system heterogeneity. General-purpose operating systems

developed for broad application are increasingly less suitable for these types of

complex real time, power-critical, domain specific systems implemented on advanced

heterogeneous architectures. The current practice of developing the OS and the

application independently, in particular the paradigm of blindly treating a task as a

random process, is unlikely to yield efficient implementation [25].

What is needed is an OS that is intimately coupled to, aware of, and interactive with its managed applications: specifically, a capable but “lean” OS that is developed to target the nature of these reactive event-driven embedded systems. It is “capable” in the sense that it provides adequate support for concurrency in both the application and the architecture, and “lean” in the sense that it executes with minimal overhead. Since power is the most critical factor in the design, aggressive power management schemes should be deployed to drive down the overall system energy expenditure.

Instead of a general-purpose operating system, an OS that more closely “matches” the application greatly improves the chances of an efficient final implementation. By “match” we mean having a Model of Computation (MOC) similar to that of the application. Since our targeted applications are event-driven reactive systems, we will first introduce the basic properties of reactive systems. Then we will describe the characteristics of a traditional general-purpose multi-tasking OS and an event-driven OS in detail, and present a comparison between them in terms of MOC, generality, communication, concurrency support, and memory and performance overhead. The software implementation of PicoRadio II is used as the case study for both.

3.1 Reactive System Behavior

Reactive systems perform tasks in response to input events. A system can

generate events either actively or in response to the environment. A system is purely

reactive if it is invoked only to respond to events [24].

[Figure 23 blocks: Antenna; RF (TX/RX); Baseband; DLL (MAC); Network; Transport; App/UI; Sensor/actuator interface; Locationing; Aggregation/forwarding; User interface; Sensor/actuators; Power control; Ranging; Reactive radio; Energy train]

Figure 23 Reactive system example – PicoRadio network

PicoRadio is an example of a reactive system. Reactivity is triggered by external events such as sensor data acquisition, transceiver I/O, timer expiration, and other environmental occurrences. Communication both between nodes and within a node is predominantly asynchronous.

Figure 23 is a behavior diagram of a PicoRadio sensor node. It shows the different components in the system and the interactions between them through event handling (events are represented as arrows). Communication between components is purely reactive. External events cause the generation and propagation of internal events.

3.2 Inadequacy Of Traditional General-Purpose OS’s

The general-purpose multi-tasking OS was originally developed for the PC platform and later adapted for general embedded systems. It is well suited to supporting several mostly independent applications running in virtual concurrency. Multi-tasking and/or multi-threading is supported by suspending and resuming processes when appropriate. Inter-task communication involves context switching, which becomes an expensive overhead as the switching frequency increases. This overhead is tolerable for PC applications, since the communication frequency, and hence the switching frequency, is typically low compared to the granularity of the computation blocks. Moreover, even as these overheads grow, the wasted energy is of relatively little concern for systems with virtually unlimited energy. Because general-purpose OSs do not target low-power applications, they have no built-in energy management mechanisms; any energy management is deferred entirely to the application, with its limited view of the system.

It is apparent that the MOC of the general-purpose OS is quite different from that of the protocol stack. The processes across the layered protocol stack are not independent. They are coupled, and are activated and deactivated by events from neighboring processes. In other words, the communication frequency among neighbors is high, and high overheads are far less tolerable. As we have seen in the software synthesis experiment with eCOS, described in Section 2.4, this MOC “mismatch” results in major inefficiencies.

3.3 Event-driven OS

An event-driven OS is designed specifically to target event-driven communication systems. Its MOC is that of communicating extended finite state machines (CEFSM), which matches that of the protocol processing system. This match drastically reduces the communication overhead as well as other OS-related costs. Because it is not designed to support a broad range of general applications, it can cut down on expensive OS services such as dynamic memory allocation and virtual memory. In addition, unnecessary performance-degrading polling is eliminated, and context switching is minimized and very efficiently implemented.

TinyOS [23] is a rather successful example of an event-driven OS. In TinyOS, an

application is written as a graph of components. For the PicoRadio II example,

components would be the layers in the protocol stack. Each component has command

and event handlers that process commands and events from other components, tasks

that provide a mechanism for threaded description, and a static frame that stores

internal state and local variables.

The TinyOS system operation can be briefly described as follows: external events from the RF transceivers or sensors propagate from the lowest layers up the component graph until handled by the higher layers. To prevent event loss, the system must process incoming events faster than they arrive. Threaded behavioral description is supported via tasks, which are operations in the event or command handlers that require a “significant” number of processor cycles. Tasks are dispatched from a task queue and are pre-empted by the arrival of an incoming event. TinyOS uses a simple FIFO task scheduler. Built-in power control is exercised by shutting down the CPU when no tasks remain in the system after all event processing.
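The operation just described can be illustrated with a small Python sketch. It is a simplified model under stated assumptions, not TinyOS code: the class `EventDrivenKernel`, its methods, and the handler `on_packet` are hypothetical names, and the pre-emption of a running task by an incoming event is not modeled (events are simply handled at the moment they are signaled).

```python
from collections import deque


class EventDrivenKernel:
    """Toy event-driven kernel with a FIFO task queue (hypothetical names)."""

    def __init__(self):
        self.handlers = {}    # event name -> handler function
        self.tasks = deque()  # FIFO queue of deferred tasks
        self.log = []         # trace of what the kernel did

    def on(self, event, handler):
        """Register the handler that reacts to an event."""
        self.handlers[event] = handler

    def post(self, task):
        """Defer longer-running work; tasks run later in FIFO order."""
        self.tasks.append(task)

    def signal(self, event, payload=None):
        """Handle an event immediately; handlers are expected to be short."""
        self.log.append(f"event:{event}")
        self.handlers[event](self, payload)

    def run(self):
        """Run queued tasks to completion, then put the CPU to sleep."""
        while self.tasks:
            task = self.tasks.popleft()
            task(self)
        self.log.append("cpu:sleep")


def on_packet(kernel, payload):
    # Short handler: defer the expensive work to a task.
    kernel.post(lambda k: k.log.append(f"task:route {payload}"))


kernel = EventDrivenKernel()
kernel.on("rx_packet", on_packet)
kernel.signal("rx_packet", "pkt0")
kernel.signal("rx_packet", "pkt1")
kernel.run()
print(kernel.log)
```

The trace shows both events handled first, the two deferred tasks executed in arrival order, and the CPU put to sleep once the task queue is empty.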

3.4 Comparison Results

In this section, we will present a comparison between the general-purpose OS

(eCOS) and the event-driven OS (TinyOS) in three important performance metrics:

memory requirement, performance, and power.

Table 3 summarizes the contrast between the two OS’s as presented in Section 2.2

and 2.3. By trading off generality for performance and code size, TinyOS can better

target event-driven systems. Table 4 shows the memory requirement comparison

between the two OS’s. With the same processor selection (16 bit ARM7), TinyOS

needs half the instruction memory and one-thirtieth the data memory. Studies showed that the power consumption of SRAM scales roughly as the square root of the capacity [26]. This implies that with TinyOS, instruction memory power can be reduced by 1.6x, and data memory power by 4.2x. Using a simpler processor such as 8-bit RISC could further reduce memory size and power consumption.

                          General purpose OS    Event-driven OS
MOC                       Multi-tasking         Communicating EFSMs
Generality                General               Target event driven systems
Communication Overhead    Large                 Small
Memory Requirement        Large                 Small
Communication Frequency   Infrequent            Frequent

Table 3 General comparisons.

OS type           Processor     Application   Total instruction mem   Data mem
General Purpose   ARM7 thumb    10,096        22324                   54988
Event-driven      ARM7 thumb    5312          8000                    2800
Event-driven      8 bit RISC    2740          3176                    709

Table 4 Memory requirements comparison.

Figure 24 presents the performance comparison. The left graph compares the total processor cycle count: 16365 vs. 2554. TinyOS shows a factor of eight improvement, which translates directly to a factor of eight reduction in processor power consumption. The right graph compares the OS overhead (the lowest portion of the bars) as a percentage of the total cycles. As an indication of its inefficiency, the general-purpose OS has an OS overhead of 86% while TinyOS has 10%.

Figure 24: General-purpose versus event-driven OS. Key at right identifies system components. [Left panel: total cycle count at 1MHz for Gen. OS and TinyOS. Right panel: percentage breakdown at 1MHz for Gen. OS and TinyOS; key: transport_remote, transport_bs, merge2, merger1, data converter, UI, OS.]

Now let us calculate how much power is actually saved considering both the processor and its memory blocks. With a 0.18µm technology and a supply voltage of 1.8V, an ARM7 consumes 0.25mW/MHz. For a memory size of 64KB, a read access consumes 0.407mW/MHz and a write access consumes 0.447mW/MHz. Assuming that 10% of the instructions involve memory reads and 10% involve memory writes, and applying memory size as well as processor cycle count scaling, the power consumption for the two OSs is 0.608mW/MHz and 0.053mW/MHz, respectively. That is, TinyOS demonstrates a factor of 12 improvement in power.

It should be emphasized that TinyOS is superior to eCOS only for a very specific class of event-driven applications. eCOS, capable of supporting general applications, is fully equipped with classical OS utilities such as memory management, re-entrant interrupt services, and a multi-threaded preemptive scheduler. The TinyOS kernel, in contrast, is just a simple non-preemptive scheduler, with no support for memory management or priority interrupts. By forcing the application to be written in a “disciplined” manner, it can delegate much of the inter-component communication and synchronization to the application program itself. Compared to eCOS, TinyOS is simply more specialized and therefore more suitable for event-driven applications that may not require all the expensive OS services.

3.5 Required Extension Of TinyOS

Our driving research goal is to design an energy-efficient OS for domain-specific heterogeneous architectures. We believe that some basic TinyOS concepts are very attractive and can be adopted to reach such a goal. However, having been developed primarily for a uni-processor architecture, TinyOS has its limitations and is insufficient to fulfill the ambitious software management role demanded by low-power heterogeneous systems. It has to be properly extended to the system level to include management of not only computation on the embedded processor, but also computation on the optimized architecture modules.

TinyOS concepts are very attractive in the following aspects:

• It has a MOC that closely matches that of reactive systems. Its event-driven asynchronous characteristics can naturally support the interactions between modules of vastly different behaviors and processing speeds in a heterogeneous system.

• Its simplicity reduces overheads and leads to more power efficient implementation.

• It provides some support for multiple flows of control (through the usage of “tasks”).

The inadequacies of TinyOS, which prevent it from fully realizing the role of managing power-critical heterogeneous systems, can be summarized as follows:

• A software-centric approach does not allow full exploration of the integrated, heterogeneous system architecture. TinyOS is primarily designed for uni-processor architectures. All components except for the lowest layers of the application are implemented in software. Low-level hardware components are required to have a software wrapper to interact with the scheduler and the rest of the system. All software components need to share resources such as the CPU, whereas in a heterogeneous system components can be mapped to separate hardware blocks and run simultaneously. Limited by the sequential execution model of software, TinyOS does not take advantage of the concurrency offered by the hardware components.

• The application is described as a flat component graph. It may not scale well as the number of system components increases with design complexity.

• Limited support for concurrency. In TinyOS, concurrency is supported via tasks, which can reside in the components. A task has to run to completion and cannot be pre-empted by another task, so there is no real support for multi-threading. Essentially, only one thread is active in the system at any given time. The lack of support for multi-threading makes specifying complex concurrent behavior awkward. In addition, since the system is only pre-emptive while executing tasks, event loss will occur if events arrive faster than the processor can handle them during any given time period.

• A rudimentary power management scheme is inadequate for very power-critical applications.

The last limitation is due to the fact that TinyOS has no access to customized power-efficient blocks and can only apply power control at the highest level. To drive down the overall system energy expenditure, power management should be applied at all levels of the design hierarchy: system level, architecture module level, circuit level, and device level [12]. By carefully incorporating power management into the individual architectural modules, we can push power management down the design hierarchy.

In the next chapter, we propose a hierarchical event-driven power management

framework that incorporates the attractive TinyOS concepts and strives to overcome

its limitations.


4. Hierarchical Power Management Framework

In this chapter, we will propose our own power management framework for the PicoRadio network. Our proposed solution utilizes a system management framework;

it exploits the reactive event-driven nature of the systems, and deploys aggressive

power management. While TinyOS targets a software centric architecture and

schedules only software tasks, our vision of the OS is as the global scheduler that

manages the entire heterogeneous system. Replacing TinyOS’s flat component graph

structure, we adopt a hierarchical composition that enhances design scalability,

supports concurrency, and enables power control at various granularities. In addition,

since we have the luxury of co-designing the OS with the architecture platform, we

can integrate power management into the individual architectural modules. Given that

the OS has the global “view” of the system, it can perform global power management

to optimize the overall system power consumption.

Most state-of-the-art power management systems handle only stand-alone devices. The scope of our power management algorithm, however, is not limited to individual nodes; instead, it aims to encompass the interest of the network as a whole. Our power management algorithm executes in two phases: the network-level algorithm first treats the whole node as one entity and tries to decide when the whole node should go to sleep; then, once the node is turned on, the node-level algorithm determines the scheduling of the various modules inside the node.

The rest of this chapter is organized as follows: we will first describe our vision of the role of the global power scheduler and system manager; then we will discuss some of the existing work on power management; lastly we will propose our two-level power management algorithm. The details of the node-level and network-level power management algorithms will be presented in Chapters 5 and 6, respectively.

4.1 The Global Power Scheduler And System Manager

In a complex heterogeneous system, the OS acts like a hardware abstraction layer

[11] that manages a variety of system resources. For power critical applications,

simplicity should be the primary design philosophy. The OS should perform a

dedicated set of indispensable duties and only these duties. There are two basic OS duties: concurrency management in both the application and architecture domains, and global power management.

Wireless sensor applications typically have multiple flows of control and data. A

sensor node can sense the environment, forward packets and receive commands all at

the same time. The OS needs to support concurrency in the application as well as

explore and utilize the concurrency in the heterogeneous architecture. Since the OS

has the global “view” of the system, it can also perform global power management to

optimize the overall system power consumption. Essentially, the OS becomes a

power scheduler that schedules the power state transitions of the system modules such

that the overall power consumption is minimized, under a set of constraints. Such

intelligent power scheduling is called a power management policy.

Figure 25 illustrates the role of the OS as the global power scheduler in the reactive system. Note that in a reactive system, all system modules are “off” until powered up by the arrival of events. This is quite different from conventional power management approaches, where all the modules are assumed to be “on” and can be put to sleep to conserve energy. The OS is no longer centered on the microprocessor; it becomes a separate unit that can be implemented in software, hardware, reconfigurable logic, or some combination. All communications in the system are event-driven; event flows are denoted by arrows.

The power manager (PM) is connected to the energy train; it has knowledge of the battery level and the overall energy reserve of the system. Following the TinyOS convention, the PM issues power-control commands to the system blocks and receives events from them. To keep the design clean and simple and to avoid the need for mutual exclusion, the PM has the exclusive right to initiate all power state transitions and manage power states. Only control signals go through the power scheduler; data communications between the blocks do not involve the power manager. Figure 25 shows a behavior diagram of the system; it should be underlined that it does not imply an implementation. Individual blocks (including the OS) in the figure can be implemented in soft-, hard-, or configure-ware.


4.2 Existing Works On Power Management Policies

The recent surge in the popularity of portable and wireless devices has drawn tremendous interest in high-level power management. The range of applications varies from battery-constrained devices such as laptops and cellular phones to environmentally concerned servers and disk drives. All system-level management policies seek to reduce energy consumption by selectively placing idle components into low-power states.

The vast majority of the previous work on power management consists of studies done on X servers, hard disk drives, or laptop computers, in which the system is treated as a stand-alone device. As far as the author is aware, no studies have been conducted in the context of a network of devices, that is, on how to manage power in the interest of the whole network, not just isolated nodes.

[Figure 25 shows the following blocks: User interface, Energy train, DLL (MAC), App/UI, Network, Transport, Baseband, RF (TX/RX), Locationing, Sensor/actuators, Antenna, Power Scheduler, Reactive radio, Reactive Digital Network Processor]

Figure 25 Behavior diagram of the PicoNode architecture

Formally, a power management policy predicts when to change power states based on system performance and resource constraints. The system parameters relevant to policy formulation are: event inter-arrival and service rate distributions; power consumption in the various power states (Active / Idle / Sleep); and power management overheads. The inter-arrival time between external events is typically unknown and generally does not fall into well-known probability distributions [28][29][32]. Service time modeling is also very challenging for complex systems. Power management overheads consist of wakeup performance and energy overheads, associated communication overheads, and the actual energy consumption of the power manager. Wakeup performance and energy overheads are typically inherent to the system and are normally given to the power manager as fixed parameters. Different policies may have to be devised for different wakeup overheads. However, care should be taken to ensure that the actual implementation cost of the power manager is minimal. Ideally, it should be negligible compared to the dominating wakeup overheads. A critical power metric of the system is the break-even time Teven, defined as the ratio of the wakeup energy overhead to the idle power dissipation. Teven is also called the shutdown threshold. It is the length of time such that, if the system stays idle for Teven, the energy expended is the same as the energy required to wake up the system.
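The break-even definition can be made concrete with a short Python sketch. The function names and the numerical values are illustrative assumptions, not measured PicoNode figures, and, following the definition above, the power drawn while asleep is taken to be negligible.

```python
def break_even_time(wakeup_energy_j, idle_power_w):
    """Teven = wakeup energy overhead / idle power dissipation (seconds)."""
    if idle_power_w <= 0:
        raise ValueError("idle power must be positive")
    return wakeup_energy_j / idle_power_w


def should_shut_down(expected_idle_s, wakeup_energy_j, idle_power_w):
    """Shutting down pays off only if the idle period exceeds Teven."""
    return expected_idle_s > break_even_time(wakeup_energy_j, idle_power_w)


# Illustrative values only: 2 uJ to wake up, 1 mW burned while idle.
t_even = break_even_time(wakeup_energy_j=2e-6, idle_power_w=1e-3)
print(f"Teven = {t_even * 1e3:.1f} ms")
print(should_shut_down(5e-3, 2e-6, 1e-3))  # idle period longer than Teven
print(should_shut_down(1e-3, 2e-6, 1e-3))  # idle period shorter than Teven
```

Every policy discussed in this section is, in effect, a different way of estimating `expected_idle_s` before the idle period is over.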

The simplest and most common policy is the time-out policy implemented in most laptop operating systems. More sophisticated approaches basically fall into two schools of thought. The first is based on stochastic models and tends to be stationary in nature (the same policy applies at any point in time), except for the one presented in [29]. The non-stationary model (the policy changes over time) described in [29] yields only marginal improvement over its stationary counterparts while significantly increasing the computational complexity and memory requirement.

The second school of thought comprises predictive policies that estimate the next idle period of the device; if the period is long enough, the device is transitioned into a low-power state. The estimation is usually based on dynamically measuring event inter-arrival rates. These policies adapt to traffic changes over time and are non-stationary. In the next sections, we will first briefly present the work of Simunic et al [30], representative of the stationary stochastic approaches, and then the works of Sinha & Chandrakasan [31] and Hwang & Wu [32], representative of the dynamic predictive approaches.

4.2.1 Stationary Statistical Power Management Policy

Paleologo et al [27] first proposed a statistical algorithm based on stochastic

models. Stochastic models use distributions to describe the inter-arrival times of

requests, the service time of the request by the device and the time it takes for the

device to transition between its power states. However, their algorithm was devised

based on the assumption that the inter-arrival times of requests are exponentially

distributed. This assumption is hard to justify and may not hold in view of significant

correlation between requests.

Simunic’s [30] work generalizes the above approach and removes the assumption of an exponential distribution. Her approach is based on the Time-indexed Semi-Markov Decision Process (TISMDP) model. The system model consists of three components: the user, the device, and the queue. The power management policy is formulated as a constrained optimization problem: minimize energy under performance constraints. She has shown that the optimization problem can be solved exactly and in polynomial time, with guaranteed optimal results for general request distributions. This solving process is both computationally and resource intensive: the total number of system states equals the number of time indexes multiplied by the number of states in the queue multiplied by the number of states in the device service model. Needless to say, this computation has to be done off-line, and only the computed policy table is loaded onto the device.

The policy works like a randomized timeout and is easy to implement on the device. Table 5 shows a sample policy. The example application is a smart badge device. It has two decision states: idle and standby. From the idle state, the device can go into the standby or the off state. From standby, it can only transition to the off state. The optimal policy, as shown in Table 5, gives a table of probabilities determining when the transitions between the idle, standby, and off states should occur. In this particular example, it specifies that if the system has been idle for 50ms, the transition to the standby state occurs with a probability of 0.4, the transition to the off state with a probability of 0.2, and otherwise the device stays idle.
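A randomized timeout of this kind takes only a few lines to apply once the table has been computed offline. The Python sketch below is an illustration rather than the implementation in [30]: the name `next_state_from_idle` and the dictionary layout are assumptions, and the probabilities are the idle-state entries of Table 5.

```python
import random

# Idle time (ms) -> (P(idle -> standby), P(idle -> off)), as listed in Table 5.
POLICY = {0: (0.0, 0.0), 50: (0.4, 0.2), 100: (0.1, 0.9)}


def next_state_from_idle(idle_ms, rng):
    """Draw the next power state for a device that has been idle for idle_ms."""
    p_standby, p_off = POLICY[idle_ms]
    u = rng.random()  # uniform in [0, 1)
    if u < p_standby:
        return "standby"
    if u < p_standby + p_off:
        return "off"
    return "idle"


rng = random.Random(0)  # seeded so the experiment is repeatable
counts = {"idle": 0, "standby": 0, "off": 0}
for _ in range(10_000):
    counts[next_state_from_idle(50, rng)] += 1
print(counts)  # fractions should be close to 0.4 / 0.2 / 0.4
```

The device only needs the table and a random number source, which is why the computed policy is cheap to run even though producing it is expensive.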


The major advantage of the stochastic approach is that it guarantees optimality as long as one can characterize the distribution models correctly. However, since these models are generated using existing traces, the results are only as good as the traces. In certain applications, the traces may be hard to characterize and may not fall into any known distribution. The main drawbacks of this approach are its computational complexity and its inability to adapt to a changing environment. As mentioned earlier, the size of the global state space increases exponentially with the number of components in the system. In addition, the stationary nature of the policy implies that any significant change in the service request distribution will require expensive offline re-characterization to generate a new policy table.

The authors in [29] try to extend this approach to accommodate non-stationary service requests. However, their work yields at best an 8.5% power improvement under loose performance constraints, and only a 1% improvement under tight performance constraints, while requiring 100X more memory. Moreover, the energy saving was calculated without considering the energy overheads of either the extra memory itself or the associated control logic.

Idle Time (ms)   Idle to Standby Prob   Idle to Off Prob   Standby to Off Prob
0                0                      0                  0
50               0.4                    0.2                0
100              0.1                    0.9                0.8

Table 5 Sample power control policy

4.2.2 Adaptive Power Management Policy For Non-Stationary Traffic

Sinha & Chandrakasan [31] study power management for a sensor network similar to the PicoRadio network. They concentrate only on the individual sensor nodes, and not on the interest of the connected network as a whole.

Each sensor node has four components: a StrongARM processor, memory, a sensor with A/D converter, and the radio. Five global power states are generated by composing the power states of these components. Table 6 shows the five global sleep states S0 to S4 and their corresponding local states. From S4 to S0, the power consumption increases while the wakeup overhead decreases.

One bold assumption made by the authors is that the inter-arrival time is exponentially distributed, which is not justified by measured or simulated data. An exponential distribution implies that there is no correlation between past and future inter-arrival times. This assumption may not hold in view of the significant correlation between arrivals in typical sensor networks. The algorithm measures the packet arrival rate and, based on that rate, decides which low-power sleep state to transition into.
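To make the idea of a rate-based decision concrete, the following Python sketch shows one plausible form such a rule could take under the exponential assumption. It is not the algorithm of [31]: the per-state thresholds, the confidence level, and the name `pick_sleep_state` are all hypothetical. With arrival rate lambda, the probability that an idle period lasts longer than t is exp(-lambda t), and the rule picks the deepest state whose threshold is likely to be exceeded.

```python
import math

# Deepest state first. Thresholds (seconds) are hypothetical break-even
# times chosen for illustration; they are not values from [31].
STATES = [("S4", 0.050), ("S3", 0.020), ("S2", 0.015), ("S1", 0.008)]


def pick_sleep_state(arrival_rate_hz, min_confidence=0.5):
    """Return the deepest sleep state that is likely to pay off."""
    for name, threshold_s in STATES:
        # Exponential inter-arrival times: P(idle period > t) = exp(-rate * t).
        p_idle_long_enough = math.exp(-arrival_rate_hz * threshold_s)
        if p_idle_long_enough >= min_confidence:
            return name
    return "S0"  # traffic too dense: stay fully active


for rate in (1.0, 40.0, 60.0, 1000.0):
    print(rate, pick_sleep_state(rate))
```

Because deeper states carry larger wakeup overheads, their thresholds are longer, so lighter traffic allows deeper sleep. The rule is only as good as the exponential assumption behind it.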

Sleep State   Strong ARM   Memory   Sensor, A/D   Radio
S0            Active       Active   On            Tx, Rx
S1            Idle         Sleep    On            Rx
S2            Sleep        Sleep    On            Rx
S3            Sleep        Sleep    On            Off
S4            Sleep        Sleep    Off           Off

Table 6 Global power states for the sensor node

This is the first known work studying power management for wireless sensor networks. Its dynamic sampling of the inter-arrival rate makes it adaptive to non-stationary traffic. It is also one of the very few works that address systems with a more complex global state space (not just limited to idle/sleep/active). The simplicity of its implementation makes it feasible for sensor network applications. However, the drawbacks are also apparent. The assumption of exponential traffic eases the analysis but is unjustified and unrealistic. Furthermore, there is no analysis of the optimality of this approach, in other words, no analysis of how close the results are to optimal. As we will see in the more in-depth analysis presented in the next section, this approach can produce very poor results under certain circumstances.

Hwang and Wu’s [32] approach also dynamically samples the inter-arrival rates and bases decisions on the nature of the data and dynamic behavior. One significant difference from the previous approach is that they assume that the traffic is correlated, that is, that there is significant correlation between the recent inter-arrival rates and those of the near future. In particular, Hwang and Wu adapted the exponential-average approach used in CPU scheduling to predict the next inter-arrival time, i.e., the next idle period of the processing unit. A straightforward implementation of the algorithm, however, has a significant hardware cost. We will implement this algorithm on the PicoRadio network in the next section and present a detailed analysis.
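The exponential-average predictor borrowed from CPU scheduling can be sketched as follows. This shows only the textbook recurrence, predicted[n+1] = a * observed[n] + (1 - a) * predicted[n]; any refinements in [32] are not reproduced, and the function name, the weight `alpha`, and the sample values are illustrative assumptions.

```python
def exponential_average(observed_idle, alpha=0.5, initial=0.0):
    """Return the list of predictions, starting with the initial guess.

    predictions[k] is the estimate made before observation k arrives;
    the last element is the prediction for the next, unseen idle period.
    """
    predictions = [initial]
    for t in observed_idle:
        # New estimate blends the latest observation with the old estimate.
        predictions.append(alpha * t + (1 - alpha) * predictions[-1])
    return predictions


preds = exponential_average([8.0, 8.0, 2.0, 2.0], alpha=0.5, initial=4.0)
print(preds)  # [4.0, 6.0, 7.0, 4.5, 3.25]
```

Each update costs one multiplication per term plus an addition, which is modest in software but becomes significant when implemented directly in hardware on a sensor node. Choosing `alpha` as a power of two would reduce the multiplications to shifts.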

4.2.3 Dynamic Voltage Scaling (DVS)

In addition to idle and low-power states, a component can also transition between several active states. This is accomplished by incorporating DVS into the power control policy. DVS algorithms adjust the device speed and voltage according to the workload at run time. Since peak performance is not required at all times, the device’s processing speed and operating voltage can be dynamically adapted to increase energy efficiency. DVS is a very effective technique for reducing CPU energy and is supported by many state-of-the-art embedded processors such as StrongARM [33] and Transmeta [35]. Software scheduling for DVS is a well-published area [30][31][34].
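The benefit of DVS follows from the quadratic dependence of switching energy on supply voltage. The Python sketch below uses the standard first-order CMOS model, energy = cycles x C_eff x Vdd^2; the effective capacitance and voltages are illustrative assumptions, leakage is ignored, and the lower clock rate at the reduced voltage is assumed to still meet the deadline.

```python
def dynamic_energy(cycles, vdd, c_eff=1e-9):
    """First-order switching energy in joules: cycles * C_eff * Vdd^2."""
    return cycles * c_eff * vdd ** 2


# Same workload (one million cycles) at full and at half supply voltage.
full = dynamic_energy(1_000_000, vdd=1.8)
scaled = dynamic_energy(1_000_000, vdd=0.9)
print(f"energy ratio: {full / scaled:.1f}x")
```

Halving the voltage cuts the energy for the same number of cycles to one quarter, at the price of a longer execution time, which is why DVS is attractive whenever peak performance is not needed.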

4.3 Proposed Power Management Algorithm For Sensor Networks

Having reviewed the existing power management policies, we now proceed to study policy formulation for PicoRadio wireless sensor network applications. We first address the unique challenges and then propose our power control algorithm.

4.3.1 Formulating Power Control Policy For PicoRadio Network

The existing works discussed in Section 4.2 deal only with the power management of a single isolated device, whereas the goal of our research is to manage power in the interest of the entire network, not just isolated nodes. In other words, we are not too concerned about the energy consumed in individual nodes as long as the network is alive and well. This opens a new outlook on power management itself, one that is completely unexplored by existing works.

We first need to redefine our goals for power control in a network setting. Clearly, we should shift our focus from minimizing the energy of individual nodes to the interest of the entire network in terms of quality of service and lifetime. One simple metric that matches this interest reasonably well is the worst-case node energy consumption in the network, that is, the energy consumption of the node that burns the most energy in the network. Intuitively speaking, controller nodes at the center of the network consume far more energy than sensor nodes at the edge of the network. Since energy dissipation translates directly to node lifetime, by choosing a policy that performs well on these nodes, we avoid the “dying out” of crucial nodes in the network and thereby ensure a good quality of service. We would like to reformulate our power control policy as follows:

Given a set of performance and system constraints and network topology,

predict when to change power states such that the worst-case node energy

consumption in the network is minimized.
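The difference between this objective and the usual per-device one can be seen in a small Python sketch. The node names, energy figures, and function names are hypothetical, and both candidate policies are assumed to already satisfy the performance and system constraints.

```python
def worst_case_node_energy(node_energy):
    """Energy of the node that burns the most energy in the network."""
    return max(node_energy.values())


def pick_policy(candidates):
    """Choose the policy that minimizes the worst-case node energy.

    candidates maps a policy name to {node name: energy consumed}.
    """
    return min(candidates, key=lambda name: worst_case_node_energy(candidates[name]))


# Hypothetical energies (arbitrary units) for a three-node network.
candidates = {
    "min_total": {"controller": 9.0, "edge_a": 1.0, "edge_b": 1.0},  # total 11
    "min_worst": {"controller": 6.0, "edge_a": 4.0, "edge_b": 4.0},  # total 14
}
print(pick_policy(candidates))
```

The selected policy spends more energy in total but relieves the controller node, which would otherwise be the first to die and take the network's quality of service with it.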

4.3.2 Requirements For PicoRadio Network Power Control Policy

Most of the existing works model the processing unit as one “black box” that has only a few low-power states. In the PicoRadio system, by contrast, the processing unit is partitioned into multiple power domains that can be controlled independently. The advantage of having a finer granularity of power control is apparent: we can achieve a higher level of power saving. The disadvantage is the much larger global power state space. The size of the state space makes the implementation of the statistical approach infeasible, as its computational complexity grows exponentially with the number of components in the system. As we will see later, how to address this complexity is crucial in policy formulation.

Moreover, a policy for a sensor network has to be adaptive to changes in the environment and be able to handle non-stationary traffic. Wireless links (a wireless link is an abstraction of the ensemble of modems, transmitter, receiver, and channel) are often noisy and lossy. Transmission errors occur due to noise and multi-path fading. Moreover, distance-dependent path loss and adjacent channel interference also affect the channels. In contrast to wired connections, the underlying network cannot be assumed to be reliable, and the time-varying network conditions should be explicitly taken into account. Moving people and objects, neighborhood changes due to mobility (neighbor nodes moving in or out of range) and node lifetime, and interference from other electronic devices all contribute to the frequently changing environmental conditions. According to measurements done on wireless links [36] [37], both packet loss rates and mean bit error rates are time-varying. In [36], packet loss rates range from 2% to as high as 80%, and bit error rates vary over several orders of magnitude. This calls for the inclusion of adaptivity in both protocol implementation and power control policies.

An additional reason for adaptivity is the desirability of implementing the same policy across the network. It is cumbersome and often impractical to characterize the policy for every node in the network; it is therefore very attractive to have a “self-adaptive” policy that works for different nodes that see vastly different traffic conditions.

A further requirement is that the policy should be feasible to implement on the

sensor platforms. A sensor node typically has limited computational resources so care

should be taken to ensure the cost of policy implementation does not offset the

benefit. A good counter-example is the straightforward implementation of Hwang &


Hu: it demands a large number of multiplications and additions and is clearly

infeasible for sensor networks.

Summarizing the discussions in this section, the “right” power policy for sensor

networks has to handle complexity in system space modeling, be adaptive,

and be implementable on sensor platforms.

4.3.3 Proposed Power Control Algorithms

To handle complexity in system space modeling and avoid state space explosion,

state space partitioning is a very effective method. The question is how to do the

partitioning properly. We have carefully studied the PicoRadio system and made the

following observation: it is much more costly to wake up a node from “deep sleep”

than to turn on the blocks subsequently. The reason is the following: in the “deep

sleep” mode, the power rails and the clock lines are both turned off. Waking up the

node from “deep sleep” involves re-synchronizing the clock and transitioning the

global memory blocks from “drowsy” to normal operational mode.

Re-synchronizing the clock involves locking the Phase-Locked Loop (PLL), which

could be very expensive in both energy and performance. Once the node is

started up and the clock is running, it is relatively cheap to wake up the subsequent

blocks. For the current version of the PicoNode, it could take up to 300 clock cycles

to synchronize the clock, while waking up the MAC only takes 3 cycles. The wakeup

performance hit of the subsequent blocks can also be partially, if not totally, hidden

by deploying predictive look-ahead scheduling (Section 5.4.1).

Based on the above observation, it is sensible for us to partition the system state

space at the node and network boundary. By doing so, we essentially divide one big

problem into two smaller problems: first, treat the whole node as one entity and try to

decide when the whole node should go to sleep; then once the node is on, decide on

the scheduling of the blocks in the node. We call the first problem network level

power management and the second problem node level power management.

In network level power management, the system is modeled as a network of nodes

communicating by sending packets. Each node has an inter-arrival and service model.

At this level, the power policy decides whether and when to turn off the whole node

after it has been idle. The policy uses network traffic information and targets the most

energy-consuming nodes. The detailed implementation of the node is abstracted away

and the only relevant issues are its service rate model and system parameters such as

node idle power consumption, node wakeup overhead, etc. In node level power

management, the system is modeled as Concurrent Extended Finite State Machines.

At this level, power scheduling is to determine the sequence and timing of block

wakeups.

These two problems have different levels of abstraction and require different

modeling and simulation environments. The first requires a network simulator and the

second requires a CFSM modeling environment. We have chosen OMNet++ as the

network simulator and the Stateflow-Simulink simulation environment to model the

node architecture.

OMNet++ is a C++ based object-oriented modular discrete event simulator. Its

source code is freely available to provide the programmers maximum freedom. In

OMNet++, we have modeled the complete PicoRadio protocol stack. Interfaces

between layers are cleanly defined such that one can modify the algorithm


implemented in one layer without affecting the other layers. At the Application layer,

the sensor node is programmed to be a sensor, a controller or both. Controllers

generate interest packets and sensors generate data packets periodically in response to

the controller’s inquiry. The network layer implements a geographical routing

protocol. The MAC layer implements a simple carrier-sensing collision avoidance

protocol. The physical layer is modeled using a channel interference matrix. There is

no data aggregation. There is a dedicated power manager that allows us to explore

various power control policies. Since OMNet++ is written in C++, and the protocol

components are not modeled as concurrent FSMs, it is very fast. A simulation of

days-long activities in a 100-node network only takes minutes.

The Stateflow-Simulink integrated simulation environment has the capability to

model and implement the complete digital chip. Stateflow, whose model of

computation is CFSM, models the protocol; Simulink, whose model of

computation is data flow, models the signal processing Baseband. Moreover, there is

a direct path from the Stateflow-Simulink to implementation. Simulink can be

translated to synthesizable VHDL, and Stateflow to C and VHDL. Using this chip

design flow, we are able to do our design specification, estimation and simulation in

Stateflow-Simulink and generate synthesizable hardware and executable software

code as the end product.

In the next chapter, we will discuss in detail the node level power management

algorithm. In Chapter 6, we will present the experimental results on various network

level power management algorithms.


5. Node Level Power Management

5.1 Hierarchical Node-level Power Management Architecture

As the design complexity of embedded systems increases, so does the difficulty of

managing them. The intricacy of the scheduling problem increases considerably with

the number of system components and could soon become intractable. To better

manage complexity, we need to introduce hierarchy. Hierarchy hides low level details

and enhances modularity and scalability. It also enables us to exploit locality in the

design and apply power control at various granularities. Figure 26 shows the

hierarchical architecture of the node level power management framework. At every

Figure 26 Hierarchical node-level power management architecture (each hierarchy level contains a Power Scheduler and the power domains Domain1, Domain2 and Domain3)

hierarchy level, the system is partitioned into multiple power domains. Power

domains are the basic units of power control and can be implemented through

separate power supply rails. Each power domain can be further divided into sub-

domains. A power scheduler resides in every hierarchy level to provide power

management interface and power scheduling.

Recall from the profiling experiments conducted in Section 2.3.1.1 that timer

services are identified as critical operations (kernels) in protocol processing. There

are numerous timers running in the protocol stacks: routing table timers in the

network layer and random backoff timers in the MAC layer. It is quite cumbersome to

support multiple timers in the system, mainly for two reasons. Firstly, timers tend to

get out-of-sync over time and are very costly to re-sync. Secondly, timers have to be

running even when the rest of the block is idle and put to sleep. The

sensible solution is to export all the local timer services to be handled by one global

timer. This global timer then becomes the only timer that has to be running

throughout the lifetime of the node. The global timer resides in and is maintained by the

PM. When a block needs a timer service, it sends a Request_Timer event to the PM to

register, and then it can choose to sleep. The PM will send back a Timer_Expiry

command, which potentially can trigger a wake up, once the registered timer expires.

This global timer approach ensures a singular timing reference for the entire sensor

node. By relinquishing the timer functions to the PM, blocks have more opportunities

to go to sleep and save energy.


5.2 Global Power Scheduler

Figure 27 sketches the structure of the top-level power manager. The PM consists

of the timer services unit and the power scheduler. The PM is the owner of and the

sole writer to the system power state table. It also gathers network-wide information

such as past inter-arrival traffic rates and statistics (energy, channel qualities etc)

concerning other nodes. Based on the gathered network information, the power state

table and performance/resource constraints, the PM makes scheduling decisions. The

goal is to minimize the overall power consumption while meeting the performance

and resource constraints.


The behavior of the power scheduler is modeled as a set of concurrent Finite State

Machines, each of which corresponds to a separate power domain. The power state

table lists all the power domains and their operating voltages. The power scheduler

changes the power states of the domains through interfacing with the power state

table. A write to the power state table alters the supply voltage to the power rail of the

domain. This implies the need to have separate supply rails for different power

domains. For example, changing the table entry “Application/Network” from 0 to 3

will move the domain from the power-off state to the active state operating under the full

supply voltage of 3 V. The PM can also write an intermediate voltage between 0 and 3 V,

Figure 27 Top-level power manager. The top-level PM contains Timer Services (holding MAC timer1, MAC timer2 and Network timer1), the Power Scheduler, network states information, past inter-arrival information, and the Power States Table (Domain: Voltage) with entries Application/Network: 3, MAC Sub-domain: 0, Physical: 0, ..., Clock & Sys. Init.: 3.

e.g. 1.5 V, to the entry. At 1.5 V, the domain will run at roughly half the speed and

consume roughly a quarter of the energy.
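As a rough illustration of the power state table and the voltage figures quoted above, the following C sketch models the table as an array that only the PM writes to. It assumes a first-order model in which speed scales linearly with the supply voltage and energy per operation scales with its square. This model, the function names and the initial table contents are assumptions made for illustration and are not taken from the PicoNode implementation.

```c
#include <stddef.h>
#include <string.h>

#define FULL_SUPPLY_V 3.0   /* full supply voltage used in the example above */

typedef struct {
    const char *domain;     /* power domain name */
    double      voltage;    /* 0 = powered off, FULL_SUPPLY_V = fully active */
} PowerStateEntry;

/* Hypothetical initial contents: only clock and system init are powered. */
static PowerStateEntry power_state_table[] = {
    { "Application/Network", 0.0 },
    { "MAC Sub-domain",      0.0 },
    { "Physical",            0.0 },
    { "Clock & Sys. Init.",  FULL_SUPPLY_V },
};

#define NUM_DOMAINS (sizeof power_state_table / sizeof power_state_table[0])

/* PM-only write to the table. Returns 0 on success, -1 for an unknown
 * domain or a voltage outside [0, FULL_SUPPLY_V]. */
int pm_write_voltage(const char *domain, double volts)
{
    size_t i;
    if (volts < 0.0 || volts > FULL_SUPPLY_V)
        return -1;
    for (i = 0; i < NUM_DOMAINS; i++) {
        if (strcmp(power_state_table[i].domain, domain) == 0) {
            power_state_table[i].voltage = volts;
            return 0;
        }
    }
    return -1;
}

/* Returns the current table entry, or -1.0 for an unknown domain. */
double pm_read_voltage(const char *domain)
{
    size_t i;
    for (i = 0; i < NUM_DOMAINS; i++) {
        if (strcmp(power_state_table[i].domain, domain) == 0)
            return power_state_table[i].voltage;
    }
    return -1.0;
}

/* Assumed first-order model: speed is proportional to the supply voltage. */
double relative_speed(double volts)
{
    return volts / FULL_SUPPLY_V;
}

/* Assumed first-order model: energy per operation is proportional to V^2. */
double relative_energy(double volts)
{
    double r = volts / FULL_SUPPLY_V;
    return r * r;
}
```

Under these assumptions, writing 1.5 to an entry corresponds to a relative speed of 0.5 and a relative energy of 0.25, which is the "half the speed, a quarter of the energy" estimate given in the text.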

The top level PM also provides timer services and manages the system time

wheel. If a domain needs a timer service, it sends a Request_Timer_Service event to

the PM, specifying after what time interval the timer should expire. Upon receiving

the request, the PM will start an alarm with the specified expiry time. Once the alarm

expires, the PM will notify the domain with a Timer_Expiry command. Figure 27

shows three timers, two registered by the MAC and one by the Network.
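The timer service described above can be sketched in C as a small table of alarms driven by one global clock. This is a minimal sketch under stated assumptions: time is counted in abstract PM clock ticks, the table capacity and the integer block identifiers are hypothetical, and counter wrap-around is ignored.

```c
#include <stddef.h>

#define MAX_TIMERS 8                 /* hypothetical capacity of the timer table */

typedef struct {
    int      in_use;                 /* nonzero while the alarm is pending */
    int      owner;                  /* id of the block that registered the timer */
    unsigned expiry;                 /* absolute expiry time, in PM clock ticks */
} TimerEntry;

typedef struct {
    unsigned   now;                  /* the node's single timing reference */
    TimerEntry timers[MAX_TIMERS];
} TimerService;

void timer_service_init(TimerService *ts)
{
    size_t i;
    ts->now = 0;
    for (i = 0; i < MAX_TIMERS; i++)
        ts->timers[i].in_use = 0;
}

/* Handle a timer request: start an alarm that expires `interval` ticks
 * from now. Returns the slot index, or -1 if the table is full. */
int request_timer(TimerService *ts, int owner, unsigned interval)
{
    size_t i;
    for (i = 0; i < MAX_TIMERS; i++) {
        if (!ts->timers[i].in_use) {
            ts->timers[i].in_use = 1;
            ts->timers[i].owner  = owner;
            ts->timers[i].expiry = ts->now + interval;
            return (int)i;
        }
    }
    return -1;
}

/* Advance the global clock by one tick. The owners of all alarms that have
 * expired are written to expired[] (at most `cap` of them); each entry stands
 * for one Timer_Expiry command. Returns the number of expiries reported.
 * Alarms that do not fit in expired[] stay pending until the next tick. */
size_t timer_tick(TimerService *ts, int expired[], size_t cap)
{
    size_t i, n = 0;
    ts->now++;
    for (i = 0; i < MAX_TIMERS && n < cap; i++) {
        if (ts->timers[i].in_use && ts->timers[i].expiry <= ts->now) {
            expired[n++] = ts->timers[i].owner;
            ts->timers[i].in_use = 0;
        }
    }
    return n;
}
```

Because only this one counter has to keep running, a block that has registered an alarm holds no timing state of its own and can be put to sleep until its expiry is reported.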

Before we proceed further, we need to define the various power states of system

blocks.

5.3 Power States Transitions for System Blocks

The power states for system blocks are Awake and Sleep (see Figure 28). In the

Awake state, a block can respond to incoming events. There are two sub-states in

Awake, Active and Idle. When a block is in Active, it is actively processing incoming

Figure 28 Power states for system blocks (Sleep, and Awake with sub-states Active and Idle)

events; when it is finished with processing all events, it goes into Idle. After being

Idle for a certain amount of time, the block can choose to go to Sleep. In the Sleep state,

the block cannot respond to incoming events; it requires some external wakeup signal

to transition to the Awake state. The external wakeup signals are issued by the power

manager. Sleep states consume very little or no power. Awake states can run at

variable frequencies and voltages by implementing dynamic voltage scaling (DVS).

DVS strives to minimize energy consumption by matching module performance and

energy expenditure with workload.

There are two ways to implement the Sleep states, clock gating or power supply

rail gating. Clock gating gates the clock signal to the block; consequently there is no

dynamic power consumption, only static power resulting from leakage current. For

very low duty cycle applications such as sensor networks, even when the clock is

gated, a considerable amount of leakage energy is dissipated. On the other hand,

gating the supply rail completely shuts down the block and causes no power

consumption. Clearly, the latter is much more efficient and should be supported by

all the blocks in the system. Unlike the logic blocks, the memory blocks cannot be

totally turned off since the state information has to be preserved. Instead, the

memory blocks are put into a state-preserving, low-power drowsy mode. In this

drowsy mode, neither read nor write operations can be performed.

The power manager has the exclusive rights to initiate all state transitions. It turns

on a block by ramping up its supply rail and restoring its memory from drowsy to

active, and shuts down a block by turning off its supply rail and putting its memory into

the drowsy mode. The power state transitions have associated performance and

energy costs, which are called the power control overheads. The magnitude of the

costs is determined by certain system implementation parameters such as clock

re-sync overhead and drowsy memory restoration overhead.

Since all rights to power state changes are relinquished to the PM, a block cannot

change its own power states. When it wants to go to sleep, it has to send a

Request_to_Sleep power control event to the PM. If the PM accepts the request, it

puts the block to Sleep; otherwise it stays awake.
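The block power states of this section can be summarized as a small transition function. The C sketch below is illustrative only: the enumerator and function names are hypothetical, and it assumes that a PM wakeup takes a block directly from Sleep to Active (as the Wakeup_BlockA command is described in Section 5.4), and that a sleeping block ignores every event other than the PM's wakeup.

```c
typedef enum { STATE_SLEEP, STATE_IDLE, STATE_ACTIVE } PowerState;

typedef enum {
    EV_PM_WAKEUP,        /* wakeup command issued by the PM */
    EV_INCOMING,         /* an incoming event arrives at the block */
    EV_ALL_PROCESSED,    /* the block has finished processing all events */
    EV_SLEEP_GRANTED,    /* the PM accepts the block's Request_to_Sleep */
    EV_SLEEP_DENIED      /* the PM rejects the block's Request_to_Sleep */
} BlockEvent;

/* Returns the state a block moves to when event e occurs in state s. */
PowerState next_state(PowerState s, BlockEvent e)
{
    switch (s) {
    case STATE_SLEEP:
        /* A sleeping block cannot respond to incoming events; only the
         * PM's wakeup moves it back to the Awake state. */
        return (e == EV_PM_WAKEUP) ? STATE_ACTIVE : STATE_SLEEP;
    case STATE_ACTIVE:
        /* Active until every pending event has been processed. */
        return (e == EV_ALL_PROCESSED) ? STATE_IDLE : STATE_ACTIVE;
    case STATE_IDLE:
        if (e == EV_INCOMING)
            return STATE_ACTIVE;
        if (e == EV_SLEEP_GRANTED)
            return STATE_SLEEP;   /* only the PM can put the block to sleep */
        return STATE_IDLE;        /* includes EV_SLEEP_DENIED */
    }
    return s;
}

/* Awake is the super-state containing Active and Idle. */
int is_awake(PowerState s)
{
    return s != STATE_SLEEP;
}
```

Note that the block itself never appears as the source of a Sleep transition: the only way into STATE_SLEEP is the PM's grant, which mirrors the exclusive rights of the power manager.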

5.4 Node Level Power Scheduling

At the node level, power scheduling makes decisions on the sequence and exact

timing of the block wakeups and sleeps. The goal is to minimize the overall power

consumption while meeting the performance and resource constraints. Power

scheduling is implemented through power control events and commands. Power

control events are sent by distributed power domain blocks to the centralized power

scheduler: event Request_BlockA_On for wishing to access BlockA and event

Request_to_Sleep for wanting to sleep. Power control commands are issued by the

power scheduler to the domain blocks and may result in power state transitions:

Wakeup_BlockA to transit BlockA from sleep to active, Sleep_Request_Denied to

keep the requesting block in idle and Sleep_Request_Granted to transit the block

from idle to sleep. All the power control events and commands can carry a token that

specifies timing information. For example, Request_BlockA_On(t) means block A

will be accessed in time t.

Power scheduling is a rather complex issue. There are both the local and global

aspects to it. Locally, blocks have to decide when to send in the wakeup and sleep


requests. A block may decide to send in a sleep request right after it becomes idle or

wait for a while. Globally, the power scheduler has to decide on what power control

commands to issue and when. In both cases, decisions are made based on the

expected event inter-arrival rates, gathered statistics and implementation specific

parameters such as wakeup overhead and idle state energy dissipation.

We favor a power scheduling approach that is more global in nature: the global

power scheduler makes most of the intelligent decisions. In this scheme, a local block sends

in the wakeup requests as soon as it realizes another block is to be accessed; and

requests to sleep as soon as it becomes idle. The power scheduler then determines

when to wakeup/sleep the requested block. The justification for the global approach is

that the centralized scheduler has a much broader view than the individual domain

blocks and can potentially make better decisions. The drawback is that the global

scheduler may be overloaded by decisions that can be handled locally. The study of

localized power control is left to future work.

In a reactive system, power scheduling naturally follows the event flow. This is

best illustrated with an example. Once again, we will use the PicoRadio sensor node

platform, as shown in Figure 29. Assume every behavior block in the figure is

mapped to a separate power domain. At the initial state, the sensor node is in deep

sleep --- meaning that except for the reactive radio, the top-level PM and the drowsy

memory, the system is completely powered-down.


1. The reactive radio senses an incoming packet from the network; the RF (TX/RX) is

turned on to receive the data stream.

2. The digital processor initializes (the digital clock is re-synchronized and global states are

restored).

3. The PM wakes up Baseband; Baseband processes data from RF (TX/RX).

4. Baseband needs to send data to MAC; it sends a Request_MAC_On event to the

PM.

5. The PM looks up its power state table and notices that MAC is sleeping; it wakes up MAC

and modifies the corresponding power state table entry.

6. Baseband becomes Idle and sends a Request_to_Sleep event to the PM.

Figure 29 Example of power scheduling (blocks shown: App/UI, User interface, Transport, Network, DLL (MAC), Baseband, RF (TX/RX), Antenna, Reactive radio, Locationing, Sensor/actuators, Energy train and Power Scheduler)

7. PM grants the Request_to_Sleep and puts Baseband to sleep.

8. MAC processes data from Baseband. It needs Locationing data and sends the PM

a Request_Locationing_On event.

9. PM turns on the Locationing block.

10. MAC processes the packet. It realizes the packet has to go to network layer, and

sends a Request_Network_On event to the PM.

11. PM turns on Network.

12. MAC becomes idle and sends a Request_to_Sleep event. At this point, the PM

evaluates the situation and may or may not grant this request. In the case that the

MAC will be accessed later to forward the packet, and it is more expensive to put

it to sleep and then wake it up than to keep it idle, the PM will deny the sleep

request. This is called predictive look-ahead scheduling. We will introduce the

concepts of predictive scheduling and its implementation in the next section.

5.4.1 Predictive Look-Ahead Scheduling

Predictive scheduling determines the power state transitions of a domain

based on its future access prediction. That is, following the global event-flow in the

system, blocks are turned on before they are accessed and kept idle if they will be

accessed later. Predictive scheduling decisions are global in nature and may

contradict the local interest. For example, in Step 12 of the power-scheduling

example, MAC’s request to sleep is over-ridden by the PM due to more global

concerns. Predictive scheduling is typically used in scenarios where the performance

and/or energy penalty associated with wake-ups is significant:

1. Performance enhancement through latency hiding. A substantial performance

overhead for wake-ups may cause excessive delays in the system. To reduce this

type of delays, the PM can make predictive decisions and wake up a block

beforehand to make sure it is ready when needed. The exact predictive wakeup

timing, or the “look-ahead” window size, has to be carefully determined. If a

block is woken up too early, it will stay in Idle before needed and waste energy.

On the other hand, if the block is woken up too late, the latency can only be

partially hidden and there will be a performance penalty. Ideally the look-ahead

time window of a block should be equal to its wakeup performance overhead. In

practice, however, this is difficult to accomplish, as the PM may not receive the

relevant control events in time to cover the wakeup latency.

Even for systems where performance is not a concern and some amount of

delays are tolerated, this kind of “Just-In-Time” wake-up scheme can be used to

promote power saving. Given the asynchronous event-driven nature of the system,

there are typically a large number of data queues connecting the various system

modules. Queues are turned on and off by the sender and receiver blocks they

connect. When the sender needs to write to a queue, it turns on the queue and

populates it with data. The receiver then pulls data from the queue and turns it

off when the queue is empty. In the situation when the receiver has a large wakeup

latency, energy is wasted because the queue has to remain on for an extended

amount of time. Depending on the size of the queue and the wakeup latency of the

receiver, this wasted energy could be substantial.


2. Energy saving by the elimination of excessive sleeps and wakeups. If the

energy penalty of waking up the block is high and the block has a good chance of

being accessed in the future, the PM may decide to keep it alive even when it

wants to sleep. The decision is made based on the evaluation of wakeup energy

overheads, idle power dissipation, the predicted future access time and the future

access probability. If, upon receiving a sleep request from a block, the PM

estimates that it will be accessed later and should remain idle, the sleep request

will be denied. This is the situation where global predictive scheduling decisions

override local requests.

5.4.2 Implementation of Predictive Scheduling

Predictive wakeups are implemented by encapsulating timing information in

power control events. A token t is carried in the power control event to indicate the

look-ahead window size. Event Request_BlockA_On(t) means block A will be

accessed in time t.

Scenario 1 (latency hiding) is relatively straightforward to implement. As soon as

a block (e.g. B) realizes it needs to access another block (e.g. A), it sends out the event

Request_BlockA_On(t) to the PM, where t is the estimated process delay in B after

which the necessary input data to A is ready: the delay it takes for B to generate

proper data and put it into the queue connecting B to A. The value of the process

delay is implementation dependent and can be pre-characterized. Once the PM

receives the event Request_BlockA_On(t), it will schedule a Wakeup_BlockA command in

time (t - wakeup_latency_A). Ideally, A should finish waking up after exactly time t. However,

if t < wakeup_latency_A, there will be a delay in A’s wakeup and its wakeup latency

cannot be fully hidden. To maximize latency hiding, the block (B in this case) that

sends in the predictive scheduling event should do so as soon as it knows another

block will be accessed.
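The timing rule for latency hiding can be made concrete with a short C sketch. The type and function names below are hypothetical; the only logic taken from the text is that the wakeup command is scheduled at (t - wakeup_latency), and that any shortfall when t is smaller than the wakeup latency shows up as latency that cannot be hidden. Both arguments are assumed to be expressed in the same time unit.

```c
typedef struct {
    unsigned wakeup_delay;     /* time from now until the PM issues the wakeup */
    unsigned exposed_latency;  /* part of the wakeup latency that is not hidden */
} WakeupPlan;

/* Plan the wakeup of a block that will be accessed in time t and that
 * needs wakeup_latency to become ready. */
WakeupPlan plan_wakeup(unsigned t, unsigned wakeup_latency)
{
    WakeupPlan p;
    if (t >= wakeup_latency) {
        /* Enough look-ahead: wake up just in time, latency fully hidden. */
        p.wakeup_delay    = t - wakeup_latency;
        p.exposed_latency = 0;
    } else {
        /* Request arrived too late: wake up right now, the remainder of
         * the wakeup latency is seen as a delay. */
        p.wakeup_delay    = 0;
        p.exposed_latency = wakeup_latency - t;
    }
    return p;
}
```

As an illustration only, and assuming all quantities are counted in clock cycles: a request with t = 60 for a block with the 3-cycle MAC wakeup quoted in Section 4.3.3 yields a wakeup delay of 57 and no exposed latency, whereas the same look-ahead against a 300-cycle wakeup would leave 240 cycles exposed.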

Figure 30 shows an example of the network layer trying to wakeup MAC. Starting

from the Init state, the network receives a packet from the MAC layer. The packet can

be of type DATA or INTEREST. The network then processes the packet and decides

whether the packet is to be forwarded to another sensor-node or not. If forwarding is

necessary, it sends out an event Request_MAC_On(process_delay) to the PM, where

the actual value of the parameter process_delay depends on the types of the packet.

Process_delay can be obtained by pre-characterizing the processing speed of the

network module. If an accurate measurement cannot be acquired, an upper bound of

the delay should be used to prevent MAC from staying in Idle after being woken up.

The penalty of a pessimistic estimation of process_delay is that the queues connecting

Figure 30 Network layer example for predictive scheduling. States: Init, Parse, Check Table, InterestProc, Table Update and Data. Transition labels: MACPacket?, TimerExpiry?, Data?, Interest?, Update?, Forward? Request_MAC_On(60)!, Forward? Request_MAC_On(50)!, SendPacketToQueue!

network and MAC may have to be on longer than necessary.

Scenario 2 (energy saving) is more complicated to implement, due to the

difficulty of estimating both future access time and activation probability. As shown

in Figure 29, a system block is usually “connected” to and hence can be activated by

multiple neighboring blocks. This kind of intrinsic concurrency makes it non-trivial to

predict the next activation time of the block. For example, to estimate the next access

time of MAC, we will have to consider the cases of receiving events from Network,

Baseband and Locationing blocks.

A simplified and more conservative approach is to only consider the scenarios

when the block will be woken up for certain. The way to accomplish this is to have

the PM check for any pending predictive wakeups on a block when it receives a

Request_to_Sleep from it. If there is a pending predictive wakeup, the PM will make

shut-down decisions based on the evaluation of wakeup overheads and timing

information.

The exact algorithm for predictive wakeup scheduling is presented in Figure 31,

described in C pseudo code format.

    // Event handler: PM receives a request to wake up block B in time t
    On_event(Request_BlockB_On(t)) {
        if (t < t_wakeupLatency_blockB)
            // wake up block B right now
            WakeUp_blockB();
        else
            AddPendingWakeupList_BlockB(t - t_wakeupLatency_blockB);
        /*
         * The PendingWakeupList is a list of pending wakeup requests.
         * It is implemented as a list of alarms that go off at various
         * pending wakeup times. A wakeup_blockB command is generated
         * whenever an alarm goes off.
         */
    }

    // Event handler: PM receives a sleep request from block B
    On_event(Request_to_Sleep) {
        // check if there are any pending wakeup requests
        if (Empty(PendingWakeupList_BlockB)) {
            // no pending requests, sleep request granted
            Output_command(Sleep_Request_Granted);
        } else {
            // handle pending requests:
            // t_predicted_blockB is the smallest alarm value in the PendingWakeupList
            t_predicted_blockB = Min(PendingWakeupList_BlockB);
            if (t_predicted_blockB < t_wakeupLatency_blockB)
                // the pending wakeup time is less than the wakeup latency, do not sleep
                Output_command(Sleep_Request_Denied);
            else if (t_predicted_blockB * idle_power_blockB < Energy_wakeup_blockB)
                // the idle energy consumption is less than the wakeup overhead, do not sleep
                Output_command(Sleep_Request_Denied);
            else
                Output_command(Sleep_Request_Granted);
        }
    }

Figure 31 Predictive wakeup algorithms

Predictive wakeups do not come free. Even the current simple scheme requires

the maintenance of a pending wakeup alarms list for every domain. In a low activity

type of system like PicoRadio, typically the lengths of these lists are no more than

one; hence the maintenance tasks are rather manageable in general.
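For readers who want to experiment with the sleep-request rule of Figure 31, the following is a minimal self-contained C sketch of that branch. The struct layout, the function name and the representation of the pending wakeup list as a plain array are assumptions made for illustration; the decision logic (grant when nothing is pending, deny when the next wakeup is closer than the wakeup latency or when idling is cheaper than a wakeup) follows the pseudo code.

```c
#include <stddef.h>

typedef enum { SLEEP_REQUEST_GRANTED, SLEEP_REQUEST_DENIED } SleepDecision;

/* Per-block parameters; units are arbitrary but must be consistent. */
typedef struct {
    double wakeup_latency;   /* time needed to wake the block up */
    double idle_power;       /* energy per time unit dissipated in Idle */
    double wakeup_energy;    /* energy overhead of one wakeup */
} BlockParams;

/* Decide on a Request_to_Sleep from a block. pending[] holds the alarm
 * values (time until each pending predictive wakeup) and n is the number
 * of entries; n == 0 means that no wakeup is pending. */
SleepDecision handle_sleep_request(const BlockParams *p,
                                   const double *pending, size_t n)
{
    double t_predicted;
    size_t i;

    if (n == 0)
        return SLEEP_REQUEST_GRANTED;       /* nothing pending */

    /* The earliest pending wakeup determines the decision. */
    t_predicted = pending[0];
    for (i = 1; i < n; i++) {
        if (pending[i] < t_predicted)
            t_predicted = pending[i];
    }

    if (t_predicted < p->wakeup_latency)
        return SLEEP_REQUEST_DENIED;        /* could not wake up in time */

    if (t_predicted * p->idle_power < p->wakeup_energy)
        return SLEEP_REQUEST_DENIED;        /* idling is cheaper than a wakeup */

    return SLEEP_REQUEST_GRANTED;
}
```

Since the lists are expected to hold at most one entry in a low activity system, the linear scan for the minimum is unlikely to be a concern in practice.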

The implementation of effective predictive scheduling is a non-trivial task. The

inherent concurrency in the system makes the precise prediction of event flow

difficult. We presented a preliminary conservative scheduling scheme that favors


functional correctness and implementation simplicity. Care was taken to ensure the

proper timing sequence of power control signals such that the system does not end up

in deadlock. More refined and aggressive schemes will be studied in the future.

5.4.3 Power Scheduling Without Predictive Wakeups

If the system has negligible wakeup overheads and performance is not an

overwhelming concern, predictive wakeup schemes can be left out in favor of much

simpler wakeup mechanisms. Without predictive wakeups, the sleep request handling

is very straightforward: sleep requests are always granted by the PM. Wakeup

requests are also greatly simplified. Token t in event Request_BlockA_On(t) takes the

constant value of zero. The wakeup request timing now becomes: If a block (e.g. B)

wants to access another block (e.g. A), it issues an event Request_BlockA_On(0) to

the PM right after it has populated the queue connecting to A with the necessary

data. Upon receiving Request_BlockA_On(0), the PM will issue the command

Wakeup_BlockA instantaneously. In this robust approach, A is guaranteed to go into

active state when woken up. Otherwise, A might wake up too early, finding the

incoming queue not ready, and be forced to wait in Idle. In this unfortunate scenario,

if A decides to go to sleep after staying Idle for a period of time (sleep requests are

always granted), a deadlock may occur.

5.5 Incorporating Dynamic Voltage Scaling (DVS)

So far the scheduling algorithm we have discussed only involves turning the

domains “on” and “off”. Recall that the “on” state could have multiple

sub-states that correspond to various operating voltages. The PM can control their

operating voltages and perform dynamic voltage scaling. While incorporating

dynamic voltage scaling into the power management framework promises more

performance and energy improvement, it also adds a level of complexity into the

already complicated power scheduling problem.

Now let us evaluate the potential benefits of applying dynamic voltage scaling

(DVS) to PicoRadio type of applications. DVS is widely applied to processing units

(e.g., CPUs) that cannot be turned off completely, either due to the tremendous

wakeup overheads or the need for them to continuously process incoming requests within

a tight time constraint. On the other hand, the PicoNode architecture is specially

designed to have a finer granularity of power control and small wakeup overheads.

The PicoRadio application also has much looser time constraints and is designed to be

robust enough to tolerate a certain degree of packet loss. In addition, for this type of

low activity and low duty cycle application, leakage current is a major concern and

running at a lower voltage does not reduce the leakage energy consumption.

At first glance, DVS may not be as effective on the PicoRadio system as on

traditional CPU architectures. However, it adds levels of refinement to the

scheduling problem and should not be left un-explored. Our current version of the

power manager supports DVS, but the specific scheduling techniques for DVS will

not be studied in this thesis and are left for future work.

5.6 The Stateflow-Simulink Estimation-Simulation Framework

The entire reactive behavior of the PicoNode as shown in Figure 25 is modeled in

the Stateflow-Simulink simulation environment. This environment allows us to

explore different power control algorithms under different scenarios such as system

architecture and inter-arrival and service distributions. Power and performance

estimation are accomplished by back-annotations in state diagrams.

Figure 32 shows the top-level schematics of the PicoNode Protocol stacks. Figure

33 shows the MAC sub-domain, which has its own PM. Figure 35 shows the network

layer; Figure 34 shows the top level PM.

Figure 32 Top level schematics for PicoNode protocol stacks (blocks shown: MAC, Network and PM, connected through queues)

Figure 33 Schematics of the MAC domain. There is a second level PM, which is in charge of the receive and transmit sub-domains.

Figure 34 Top-level PM with timer services and the power scheduler.

Figure 35 Network layer consisting of four concurrent FSMs: MAC packet processing, Application packet processing, sleep request and Timer functions.

6. Network Level Power Management

6.1 Traffic Considerations

To devise the appropriate network level power management policy, we need to

understand the nature of the network traffic. While there is a large body of

literature on Ethernet and World Wide Web traffic [39][40][41] and some on high

data rate wireless LAN [42] [43], the author is not aware of any studies on low

activity, low data rate sensor networks. Most studies on power management for

sensor networks were conducted under the assumption that the inter-arrival traffic

follows some well-characterized and well-studied distribution, such as exponential

distribution. As we will see shortly, this is a gross over-simplification. Very

little has been published on either simulated or measured traffic traces in a typical

wireless sensor network.

We have conducted some experiments in our OMNet++ model to obtain accurate

information on network traffic. The information we obtain should answer the

following questions: Does sensor network traffic fall into any known distributions? Is

the traffic stream correlated? What protocol or system properties affect the network

traffic? How different are the traffic patterns seen by different nodes in the network?

To acquire meaningful and intuitive results from the experiment, we constructed a uniform 10x10 grid network of 100 nodes (see Figure 36). By controlling the transmission power, we can alter the number of neighbors a node observes. Twelve controller nodes are “sprinkled” among the sensor nodes. Intuitively speaking, different nodes should see different traffic distributions, depending on the type of the node and its position in the network. Controller nodes, due to their periodic generation of interest packets and the arrival of the requested data packets, see a higher rate of traffic. Nodes situated close to the center of the network see more forwarding traffic. In comparison, sensor nodes located at the edge of the network see much less traffic.

Figure 36 10x10 grid of sensor nodes. The green ones are sensors; the red ones are controllers. Nodes 9, 48, 55 and 72 are labeled in the figure.

We use the following parameters for our experiments. A node can only see its immediate neighbors, so the number of neighbors per node ranges from two at the edge of the network to four at the center. In the application layer, the sensor data generation period is 120 seconds; the data generation duration, which is the length of time a sensor generates data in response to a particular interest, is 1000 seconds; and the interest lifetime is 1000 seconds. In the network layer, an interest in the interest cache (routing table) expires after 1000 seconds and the routing table is updated every 120 seconds. We simulated the grid network for 180000 seconds, or 50 hours, and collected packet arrival time stamps for various nodes.
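The neighbor counts quoted above can be checked with a small sketch. It assumes that nodes are numbered 0 to 99 in row-major order and that "immediate neighbors" means the up, down, left and right grid positions; both the numbering and the function name are assumptions made for illustration, not details taken from the simulation model.

```python
GRID_SIDE = 10  # 10x10 grid, as in the experiment


def neighbor_count(node, side=GRID_SIDE):
    """Number of immediate (up/down/left/right) neighbors of a grid node.

    Assumes nodes are numbered 0..side*side-1 in row-major order; this
    numbering is an assumption made for illustration only.
    """
    row, col = divmod(node, side)
    count = 0
    for d_row, d_col in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + d_row, col + d_col
        if 0 <= r < side and 0 <= c < side:
            count += 1
    return count


# Nodes discussed in this chapter.
for node in (9, 48, 55, 72):
    print(node, neighbor_count(node))
```

Under this assumed numbering, Node 9 sits in a corner of the grid and has two neighbors, while Node 48 has four, which agrees with the neighbor counts stated for these nodes in the text.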

As expected, there is a vast discrepancy among the nodes in the total number of packets received. Sensor Node 9 is situated at the edge of the network and has only two neighbors. It is the least busy node in the network and receives a mere 46 packets in total, whereas controller Node 72, the busiest node in the network, receives 2841 packets.

To study the inter-arrival time distribution, we plot the Cumulative Distribution Function (CDF) of the inter-arrival times. Figure 37 and Figure 38 (a zoomed-in view of Figure 37) show the CDF plots for Node 9. The CDF curve is clearly not exponential and does not match any known distribution. Only 20% of the packet inter-arrival times are within 0.5 seconds, and nearly 40% are longer than 2000 seconds. The “steps” in the figures occur at multiples of the sensor data generation period (120 seconds) and the interest generation period (1000 seconds). Inspecting the traffic stream, we notice that it is somewhat correlated: sequences of long inter-arrival times (> 120 seconds) are interleaved with bursts of short inter-arrival times (< 0.5 seconds), while the long inter-arrival times dominate. The bursts of short inter-arrival times vary in length, ranging from two to five.
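As a minimal sketch of how such a comparison can be made, the following Python code computes the empirical CDF of a list of inter-arrival times and the CDF of an exponential distribution with the same mean. The function names and the short list of arrival time stamps are hypothetical and chosen only for illustration; they are not taken from the simulator.

```python
import math


def inter_arrival_times(timestamps):
    """Differences between consecutive arrival time stamps (seconds)."""
    return [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]


def empirical_cdf(samples, x):
    """Fraction of samples that are less than or equal to x."""
    return sum(1 for s in samples if s <= x) / len(samples)


def exponential_cdf(mean, x):
    """CDF of an exponential distribution with the given mean."""
    return 1.0 - math.exp(-x / mean)


# Hypothetical arrival time stamps: short bursts separated by long gaps.
arrivals = [0.0, 0.25, 0.5, 120.5, 1120.5, 1120.75]
gaps = inter_arrival_times(arrivals)
mean_gap = sum(gaps) / len(gaps)

# Compare the two CDFs at a few points of interest.
for x in (0.5, 500.0, 2000.0):
    print(x, round(empirical_cdf(gaps, x), 3), round(exponential_cdf(mean_gap, x), 3))
```

A trace that mixes very short and very long gaps, as in this toy example, shows a large empirical CDF value near zero that an exponential curve with the same mean cannot reproduce.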


Figure 37 CDF for sensor Node 9. The curve in red is that of an exponential distribution. It is quite obvious the CDF for Node 9 is NOT exponential.

Figure 38 Zoom in of the CDF plot in Figure 37. The list at the right shows the inter-arrival stream. Short inter-arrival sequences (<0.5 seconds) are shown in either red or blue.

Inter-arrival Sequence 1000.554987 18000.403200 24001.064045 29000.288000 34000.839949 47000.576000 73000.695223 77000.604800 77000.668800 77000.704000 77000.739200 77000.774400 77160.035200 77280.035200 77400.035200 77640.035200 77880.035200 82001.282127 85000.552622 85002.889998 85004.912040 85006.289814 85007.044890 85007.552949 85008.528139 85008.545739 85009.690340 85009.707940 85920.052800 97000.288000 108000.432000 108000.441373 108000.876911 108240.035200 108360.035200 108480.035200 108960.035200 108960.070400 108960.105600 109000.691421 112000.144000 126000.851251 144000.374400 148000.172800 151000.576000 156001.033472


Controller Node 72, the busiest node in the network, sees a very different traffic pattern from Node 9. Figure 39 shows the CDF plot and Figure 40 zooms in near the origin of the x-axis (0 < t < 5 s). Of the 2841 packets it receives, more than 60% have relatively short inter-arrival times (< 0.5 seconds), including 50% or more with inter-arrival times of less than 0.1 seconds. Very few packets (< 5%) have inter-arrival times greater than 120 seconds. In Figure 39, the major “jumps” in the curve occur at 120-second intervals, which correspond to the sensor data generation period. In this particular simulation, sensors generate data at 40-second offsets from each other, so we also notice minor “jumps” at multiples of 40 seconds. As we zoom in further on Figure 39 and inspect the CDF curve from t = 0 to 0.2 seconds (Figure 41), we notice “jumps” at intervals of 0.0176 seconds. It turns out that 0.0176 seconds is the packet transmission duration in our simulation. Since no data aggregation is implemented in the current MAC, once the MAC takes control of the shared channel, it does not release it until all the accumulated packets in the transmission queue have been sent.

Two modifications to the MAC protocol would make the “jumps” less prominent. The first is data aggregation, which combines several packets destined for the same node into one. The second is to introduce a random backoff after the transmission of each packet. Evidently, protocol algorithms have an impact on network traffic. However, independent of the particular algorithm chosen, there will always be some periodic components related to application parameters, such as the sensor data generation period, and system parameters, such as the packet transmission duration.


Compared to Node 9, Node 72 definitely sees a more correlated incoming packet stream. There are still sequences of long inter-arrival times (> 120 seconds) interleaved with sequences of short inter-arrival times (< 0.5 seconds), but in this case the short inter-arrival sequences dominate. The short inter-arrival sequences are also much longer on average than those of Node 9.

Other nodes have incoming traffic distributions that fall somewhere between those of Node 9 and Node 72. The general traffic pattern can be described as sequences of long inter-arrival times interleaved with sequences of short inter-arrival times. The length and the frequency of occurrence of these sequences in the traffic stream depend on the type and the position of the node. The busier the node is, the longer and the more prevalent the short inter-arrival sequences are compared to the long inter-arrival sequences.
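The interleaving of long and short inter-arrival sequences described above can be made concrete with a small sketch. The helper below is hypothetical; it labels every gap under 0.5 seconds as short and, as a simplifying assumption, every other gap as long (the text only calls gaps above 120 seconds long).

```python
def split_into_runs(gaps, short_limit=0.5):
    """Group consecutive inter-arrival times into runs of short or long gaps.

    Returns a list of (kind, length) tuples. A gap is 'short' when it is
    below short_limit seconds and 'long' otherwise (a simplification).
    """
    runs = []
    for gap in gaps:
        kind = "short" if gap < short_limit else "long"
        if runs and runs[-1][0] == kind:
            # Extend the current run.
            runs[-1] = (kind, runs[-1][1] + 1)
        else:
            # Start a new run.
            runs.append((kind, 1))
    return runs


# Hypothetical inter-arrival times in seconds.
example_gaps = [1000.0, 0.1, 0.05, 0.03, 120.0, 240.0, 0.2]
print(split_into_runs(example_gaps))
```

Counting the number and the length of the short runs for each node gives a simple way to compare how bursty the traffic is at different positions in the network.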


Inter-arrival Sequence 12000.317239 12000.612842 12002.307731 12120.035200 12120.176000 12240.176000 12360.281600 12480.246400 12600.035200 12600.646702 12720.176000 12840.176000 12960.176000 13000.172800 13000.236800 13000.737015 13001.245160 13080.062338 13080.070400 13080.500653 13080.518253 13080.531166 13080.566366 13080.979213 13080.996813 13081.014413 13081.032013 13081.205422 13081.835098 13081.852698 13081.870298 13081.887898 13081.905498 13082.024279 13082.059479 13082.584908 13082.602508 13082.603918 13082.620108 13082.637708 13082.655308 13082.656718 13082.672908 13082.690508 13082.691918 13082.727118 13082.762318 13082.797518 13083.571796 13083.694996 13083.782996 13083.818698 13083.836298 13083.853898

Figure 39 CDF for controller Node 72. The distribution is still not exponential.

Figure 40 Zoomed-in view of the CDF plot in Figure 39. The list at the right shows the inter-arrival stream. Short inter-arrival sequences (<0.5 seconds) are shown in either red or blue.

6.2 Constant Threshold Algorithm

Let us first evaluate the simplest power management scheme, the constant threshold algorithm. In this algorithm, the processing device is shut off after its idle time exceeds a constant threshold. The optimal value of the threshold depends on system power metrics, such as the break-even time Teven, and on the incoming traffic distribution. This scheme is widely used in laptop computers. Most publications on power management dismiss the constant sleep threshold algorithm as too simplistic and crude. We would like to find out how well this algorithm works in sensor networks, that is, how close it comes to the optimal power control algorithm.

Figure 41 Further zoomed-in view of the CDF plot in Figure 39. Major “jumps” in the curve occur at multiples of 0.0176 seconds, the packet transmission time.

Since different network nodes see different traffic, the constant thresholds should take on different values accordingly. As a result, every node has to be individually pre-characterized to obtain its “optimal” threshold value. Note that these values are pre-determined and do not change over time.

To obtain the optimal threshold, we “sweep” the energy dissipation related to power management over a range of threshold values to generate the “threshold curve”. The minimum of the curve is the optimal threshold point: the sleep threshold value at which the minimum amount of energy is consumed. It should be emphasized that only the energy consumed in the idle and sleep states and during power state transitions is counted. Energy dissipated while actively processing packets is excluded, as it is irrelevant to power management.
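The sweep can be sketched as follows. This is a minimal illustration under stated assumptions, not the simulator code: sleep power is taken to be zero, each wake-up costs Teven times the idle power (following the definition of Teven as the ratio of wakeup energy to idle power), and each inter-arrival time is treated as one idle period. All function and parameter names are hypothetical.

```python
def pm_energy(gaps, threshold, t_even, p_idle=1.0):
    """Power-management energy for one constant sleep threshold.

    The node idles for up to `threshold` seconds in each gap. If the gap is
    longer, it sleeps (assumed zero power) and pays one wake-up, which
    costs t_even * p_idle.
    """
    wakeup_energy = t_even * p_idle
    total = 0.0
    for gap in gaps:
        if gap <= threshold:
            total += p_idle * gap  # stayed idle for the whole gap
        else:
            total += p_idle * threshold + wakeup_energy
    return total


def sweep_thresholds(gaps, thresholds, t_even, p_idle=1.0):
    """Return (best_threshold, best_energy) over the candidate thresholds."""
    candidates = ((s, pm_energy(gaps, s, t_even, p_idle)) for s in thresholds)
    return min(candidates, key=lambda pair: pair[1])


# Hypothetical idle periods (seconds) and candidate thresholds.
gaps = [0.25, 0.25, 100.0]
print(sweep_thresholds(gaps, [0.0, 0.25, 0.5], t_even=0.5))
```

Plotting `pm_energy` against the candidate thresholds would give a curve of the same kind as the “threshold curve” described above, with its minimum at the optimal constant threshold for the given trace.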


Figure 42 shows the “sleep threshold plots” for sensor Node 9. Since we are interested in how Teven affects the value of the optimal sleep threshold (Sth), three curves are plotted, corresponding to Teven values of 0.5, 0.2 and 0.05. Recall that Teven is defined as the ratio of the wakeup energy overhead to the idle power dissipation. It is the length of time such that, if the system is idle for time Teven, the energy expended equals the energy required to wake up the system. Teven is a good indication of the wakeup overhead: if the system idle power is fixed, the greater Teven is, the greater the wakeup overhead and the overall energy expenditure. For the low activity sensor Node 9, the optimal sleep threshold is zero for all three values of Teven, which means the node should shut down immediately once it becomes idle.

Figure 42 Energy consumed versus sleep threshold for sensor Node 9. The three curves correspond to break-even times of 0.5, 0.2 and 0.05.

Now let us study a more typical sensor node in the network, Node 48. Node 48 has four neighbors, one of which is a controller. As a result, it is much more active than Node 9 and receives a total of 545 packets during the simulation. Figure 43 shows its sleep threshold plots. The optimal sleep threshold is still zero for Teven equal to 0.05 and 0.2. However, as Teven is increased to 0.5, the optimal sleep threshold increases to 0.0176 seconds.

Figure 43 Energy consumed versus sleep threshold for sensor Node 48. The three curves correspond to break-even times of 0.5, 0.2 and 0.05.

Because the expected inter-arrival times are shorter for busier nodes, it pays off for them to stay idle longer than the less busy nodes. This in turn implies that the optimal value of the sleep threshold should increase with node activity. An examination of the sleep threshold plots of controller Node 72 confirms this supposition. The optimal sleep threshold is zero for Teven = 0.05, and is 0.036 and 0.09 for Teven = 0.2 and 0.5 respectively.

However, our supposition is proven wrong when we examine the sleep threshold plots of controller Node 55. Node 55 receives only 2295 packets, compared to the 2841 packets received by Node 72, so it is less active than Node 72. However, when Teven = 0.5, its optimal threshold is 0.125 s, which is greater than the 0.09 s of Node 72. In fact, the corresponding energy expenditure of Node 55 at this threshold value is 32449, which is also greater than the 32232 of Node 72.



The explanation for this discrepancy lies in the traffic distributions. We need to find out which values of the packet inter-arrival time cause the most energy dissipation and which nodes have a higher percentage of these “worst” inter-arrival times. For the constant threshold scheme, the “worst” values fall between the sleep threshold and Teven, because in these scenarios the system stays awake for the duration of the sleep threshold and then goes to sleep, only to be woken up again soon after. The worst-case scenario occurs when the inter-arrival time equals the sleep threshold: the node goes to sleep and wakes up right after. In fact, inter-arrival times around Teven are “bad” for any power management algorithm, because either decision, to stay awake or to go to sleep, dissipates energy roughly equal to the wakeup overhead. Figure 46 and Figure 47 are the CDF plots of Node 55 and Node 72 respectively. Inspecting the plots, we notice the following: for Node 55, about 15% of the inter-arrival times fall between 0.125 (the optimal sleep threshold) and 0.5 (Teven), while for Node 72 the number is only 10%. Since Node 55 has 5% more packets falling into this undesirable range than Node 72, it consumes more energy despite being almost 20% less active.
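The share of “worst” inter-arrival times can be read off a trace directly. The helper below is a hypothetical sketch: it counts the fraction of inter-arrival times that lie above a given sleep threshold and at or below Teven. The sample values are invented for illustration.

```python
def fraction_between(gaps, low, high):
    """Fraction of inter-arrival times t with low < t <= high."""
    return sum(1 for gap in gaps if low < gap <= high) / len(gaps)


# Hypothetical inter-arrival times (seconds); sleep threshold 0.125, Teven 0.5.
sample_gaps = [0.1, 0.2, 0.3, 0.6, 5.0]
print(fraction_between(sample_gaps, 0.125, 0.5))
```

Applied to the traces of two nodes with their respective optimal thresholds, this fraction gives the kind of comparison made above between Node 55 and Node 72.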

Figure 46 CDF plot for Node 55.

Figure 47 Zoomed-in view of the CDF plot in Figure 46.

Our experimental results can be summarized as follows: in general, the optimal constant sleep threshold increases with wake-up overhead and traffic activity. However, nodes with a higher percentage of inter-arrival times that fall between the sleep threshold and Teven may have an exceptionally high optimal threshold.

We shall now proceed to evaluate the optimality of the constant threshold algorithm. This is accomplished by comparing the results of the constant threshold scheme to those of the optimal algorithm. The optimal algorithm is assumed to know the sequence of arrival requests in advance. In essence, it can make decisions based on knowledge of the future. The optimal algorithm shuts down the system immediately if the next idle period is greater than Teven, and keeps the system idle if it is less than Teven. Evidently, the results of the optimal algorithm are unobtainable in reality, and serve only as an upper bound on the performance of any practical implementation.
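The comparison can be sketched in a few lines. The code below makes several assumptions that are not stated in the text: sleep power is zero, a wake-up costs Teven times the idle power, and “optimality” is taken to be the ratio of the optimal algorithm's energy to the evaluated algorithm's energy. The function names and the sample idle periods are hypothetical.

```python
def optimal_energy(gaps, t_even, p_idle=1.0):
    """Energy of the clairvoyant policy.

    For each idle period the policy pays the cheaper of idling through it
    (p_idle * gap) and sleeping at once (one wake-up, t_even * p_idle).
    """
    return sum(p_idle * min(gap, t_even) for gap in gaps)


def constant_threshold_energy(gaps, threshold, t_even, p_idle=1.0):
    """Energy of the constant threshold policy under the same cost model."""
    total = 0.0
    for gap in gaps:
        if gap <= threshold:
            total += p_idle * gap
        else:
            total += p_idle * (threshold + t_even)
    return total


def optimality(gaps, threshold, t_even):
    """Assumed metric: optimal energy divided by constant-threshold energy."""
    return optimal_energy(gaps, t_even) / constant_threshold_energy(
        gaps, threshold, t_even
    )


gaps = [0.25, 0.25, 100.0]  # hypothetical idle periods in seconds
print(f"{100 * optimality(gaps, threshold=0.25, t_even=0.5):.1f}%")
```

Under this cost model the clairvoyant policy can never consume more than a threshold policy on the same trace, so the ratio is at most 100%.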

Node             Teven = 0.05   Teven = 0.2   Teven = 0.5
72 (Controller)  75.3%*         74.9%         73.5%
55 (Controller)  86.8%          72.4%*        72.6%*
9 (Sensor)       99.3%          98.8%         98.7%
48 (Sensor)      93.9%          86.3%         85.7%

Table 7 Optimality of the constant threshold algorithm for three values of Teven (wake-up overhead). The worst-case result in each column is marked with an asterisk.

Table 7 lists the performance of the constant threshold algorithm relative to the optimal algorithm for Nodes 72, 55, 48, and 9. The results are listed for three different values of Teven. The general observation is that the algorithm performs better for nodes with low activity and for small Teven. For Node 9, it is extremely close to optimal for all three values of Teven. Even for a typical sensor node like Node 48, it is at least 86% optimal. Nevertheless, we need to keep in mind that the worst performing nodes are the most vital to the network. If these critical nodes run out of energy and die early, the overall quality of service of the network will be adversely affected. This is due both to their central locations and to their role as controllers. The worst performer when Teven = 0.05 is Node 72, the busiest node in the network; as Teven increases to 0.2 and 0.5, the worst performer becomes Node 55.

The reason, as explained in the previous paragraphs, is that Node 55 has a higher percentage of the “worst” inter-arrival times. As the overhead for power management (Teven) increases, the performance of the constant threshold algorithm degrades further on Node 55 than on Node 72. Interestingly, even with the optimal algorithm, Node 55 consumes more energy than Node 72 for all three values of Teven. Again, we need to remind ourselves that only the non-active energy is considered here; the overall energy dissipation is the sum of the non-active and active energy. Exactly which node consumes the most energy overall depends on the active energy the nodes spend processing packets. Since Node 72 has to process more packets, it consumes more “active” energy than Node 55, which may or may not offset the comparatively lower energy it consumes during idle periods and power state transitions.


Let us evaluate the pros and cons of the simple constant threshold algorithm. The pros are twofold: it is very simple to implement, and it yields reasonably good results even for the worst performing nodes. The cons are quite apparent. First, a straightforward implementation of the algorithm implies that every node has to be pre-characterized to obtain its individual threshold value. This is quite cumbersome considering the large number of nodes in the network. Second, since the threshold value is fixed, the algorithm cannot adapt to changes in channel quality (time-varying packet loss and bit-error rates) or to changes in the neighborhood configuration. The first problem can be solved by identifying and characterizing only the busiest and most critical nodes in the network; zero can be used as the threshold value for all the other nodes. This modified algorithm is suitable for networks with a known topology, a relatively stable environment and little mobility.

For networks with a more dynamic environment, with a topology that cannot be obtained a priori, or with significant mobility, we need an adaptive algorithm that modifies the threshold based on traffic changes. In the next sections, we investigate several known adaptive algorithms and some variations.

6.3 Sinha & Chandrakasan

We briefly introduced Sinha & Chandrakasan’s power management algorithm for sensor networks in Section 4.1.2. Its obvious flaw is the incorrect assumption that sensor network traffic follows an uncorrelated exponential distribution. As we have seen in the traffic analysis section, sensor network traffic is in fact quite correlated. Even under this flawed assumption, the algorithm has some major drawbacks.

To illustrate the latter point, let us design an experiment and investigate the merit of this algorithm under its exponential traffic assumption. The CDF of an exponential distribution with arrival rate $\gamma$ is:

$$D(x) = 1 - e^{-\gamma x} \tag{1}$$

The value of $\gamma$ is dynamically measured and updated. The device shutdown probability Pth is calculated as:

$$P_{th} = 1 - e^{-\gamma T_{even}} \tag{2}$$

The shutdown decision is based on Pth: if Pth is greater than a pre-set constant a, the system shuts down immediately; otherwise it stays awake. In our experiment setup, we generate as input a packet stream with $\gamma$ = 2, vary the value of Teven from 0.05 to 0.5, and compare the energy consumption of a node implementing this algorithm to that of the optimal algorithm. The constant a is set to 0.5. Table 8 shows the comparison results.

Teven                  0.05    0.1     0.2     0.35    0.4     0.5
Pth                    0.09    0.2     0.34    0.51    0.55    0.64
Sinha & Chandrakasan   5021    10042   20095   49280   49281   49286
Optimal algorithm      4789    9092    16431   24960   27303   31349
Error %                4.9     10      22      97      80.5    58

Table 8 Performance of the Sinha & Chandrakasan algorithm compared to the optimal algorithm, assuming an exponential inter-arrival distribution with $\gamma$ = 2. The disparity quickly increases as Pth goes from 0 to 0.5, and from 1 to 0.5.

We can see that the algorithm works very well only when Pth is close to either 0 or 1, and quickly degrades as Pth approaches the value of the constant a (set to 0.5). The disparity between the two results is only 4.9% when Pth is 0.09, and becomes 97% when Pth is 0.51. The reason is that when the shutdown probability Pth is close to either 1 or 0, the system is quite certain about whether to stay awake or go to sleep, whereas when the probability approaches 0.5, the system is just making a random guess.
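The figures in Table 8 can be cross-checked with a short sketch. It evaluates equation (2) for an arrival rate of 2 and recomputes the error as the relative excess energy over the optimal algorithm; this reading of the "Error %" row is an assumption, as are the function names. The energy values are those read from Table 8.

```python
import math


def shutdown_probability(rate, t_even):
    """Equation (2): probability that an exponential gap is shorter than t_even."""
    return 1.0 - math.exp(-rate * t_even)


def error_percent(algorithm_energy, optimal_energy):
    """Excess energy over the optimal algorithm, as a percentage (assumed metric)."""
    return 100.0 * (algorithm_energy - optimal_energy) / optimal_energy


# Values as read from Table 8.
t_even_values = [0.05, 0.1, 0.2, 0.35, 0.4, 0.5]
algorithm = [5021, 10042, 20095, 49280, 49281, 49286]
optimal = [4789, 9092, 16431, 24960, 27303, 31349]

for t_even, e_alg, e_opt in zip(t_even_values, algorithm, optimal):
    p_th = shutdown_probability(2.0, t_even)
    print(f"Teven={t_even}: Pth={p_th:.2f}, error={error_percent(e_alg, e_opt):.1f}%")
```

Small differences between the computed and tabulated Pth values are to be expected if the tabulated values were obtained from a dynamically measured rate rather than the nominal rate of 2.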

It should be rather evident that Pth has a significant impact on the performance of this algorithm. According to equation (2), Pth is determined by the system metric Teven and the traffic arrival rate $\gamma$. These are system specifications and environmental parameters that are given to, and not controlled by, the power manager. The algorithm’s heavy dependence on system and environmental parameters is obviously a considerable flaw. This, combined with the false assumption of an exponential distribution, makes it quite unattractive for sensor networks.

6.4 Modified Hwang and Wu [32]

Hwang and Wu’s algorithm also dynamically samples the inter-arrival times and bases its decisions on the computation history. Unlike Sinha & Chandrakasan, they assume that the traffic is correlated. Hwang and Wu adapted the exponential-average approach used in CPU scheduling to predict the next inter-arrival time, i.e. the next idle period of the processing unit. The prediction formula is shown in (3):

$$I_{n+1} = a\,i_n + a(1-a)\,i_{n-1} + a(1-a)^2\,i_{n-2} + \cdots + a(1-a)^n\,i_0 + (1-a)^{n+1} I_0 \tag{3}$$

where $I_{n+1}$ is the new predicted value, the $i_k$ are the previous idle periods, $I_0$ is the initial prediction, and a is a constant attenuation factor in the range between 0 and 1. The formula indicates that the predicted idle period is a weighted average of the previous idle periods. Earlier idle periods have less weight, as specified by the exponential attenuation factor. The parameter a controls the relative weight of recent and past history in the prediction.

Formula (3) contains many multiplication and addition operations. This translates to a high computational complexity that is not feasible for direct implementation on sensor nodes. To solve this problem, we truncate the formula and use only the three most recent history values. The formula now becomes:

$$I_{n+1} = a\,i_n + a(1-a)\,i_{n-1} + a(1-a)^2\,i_{n-2} \tag{4}$$

The earlier idle periods carry very little weight and can be discarded with little effect on $I_{n+1}$, while reducing the computational complexity drastically. Our modified Hwang and Wu algorithm uses (4) to predict the next idle period $I_{n+1}$, where the $i_k$ are the previous inter-arrival times instead of the idle periods. If $I_{n+1}$ is greater than Teven, it shuts the system off as soon as the system becomes idle; otherwise the system stays on until the idle period reaches Teven.
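The truncated predictor and the resulting shutdown decision can be sketched as follows. The function names and the sample history are hypothetical; the arithmetic follows equation (4) and the decision rule described above.

```python
def predict_idle(history, a):
    """Equation (4): weighted sum of the three most recent inter-arrival times.

    `history` is ordered oldest first, so history[-1] is the most recent
    inter-arrival time. `a` is the attenuation factor, between 0 and 1.
    """
    i_n, i_n1, i_n2 = history[-1], history[-2], history[-3]
    return a * i_n + a * (1 - a) * i_n1 + a * (1 - a) ** 2 * i_n2


def shut_down_immediately(history, a, t_even):
    """True if the predicted idle period exceeds the break-even time."""
    return predict_idle(history, a) > t_even


# Hypothetical inter-arrival times in seconds, oldest first.
recent = [4.0, 2.0, 1.0]
print(predict_idle(recent, a=0.5))
print(shut_down_immediately(recent, a=0.5, t_even=0.5))
```

Note that the three weights in equation (4) sum to 1 - (1 - a)^3 rather than to 1, so the truncated prediction is slightly smaller than a true weighted average of the three values; the difference shrinks as a approaches 1.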

Node             Teven = 0.05   Teven = 0.2   Teven = 0.5
72 (Controller)  82.7%*         73.6%         69.0%
9 (Sensor)       91.2%          90.1%         82.6%
48 (Sensor)      93.2%          85.0%         80.1%
55 (Controller)  87.3%          70.3%*        63.2%*

Table 9 Optimality of the modified Hwang & Wu algorithm for three values of Teven (wake-up overhead). The worst-case result in each column is marked with an asterisk.

Table 9 lists the optimality of the modified Hwang & Wu algorithm. We have chosen the values of the parameter a that produce the best results. As with the constant threshold algorithm, its performance degrades as the wakeup overhead and traffic activity increase. As an exception, Node 55 is the worst performing node for Teven = 0.2 and 0.5. The reason, as explained in Section 4.3.3.1.1, is that Node 55 has a higher percentage of inter-arrival times that incur the most energy related to power management. As the wakeup overhead increases, it replaces Node 72 as the worst-case node.

Let us compare Table 9 to the results of the constant threshold algorithm listed in Table 7. When Teven = 0.05, the worst-case node under modified Hwang & Wu is 85.8% optimal, versus 75.3% under the constant threshold algorithm. When Teven = 0.2, the two perform very similarly (71.4% versus 72.4%). When Teven = 0.5, Hwang & Wu performs 8% worse than the constant threshold algorithm (63.7% versus 72.6%). Notice that the constant threshold scheme also tends to perform better for the less active nodes in the network.

6.5 Adaptive Dynamic Threshold Algorithms

We would like to investigate how different ways of executing shutdowns and weighting the inter-arrival history affect the performance of adaptive algorithms. Instead of making an immediate shutdown decision based on past history, a different approach is to use the predicted idle time to vary the sleep threshold. In other words, the system has a dynamic sleep threshold determined by a weighted average of the previous inter-arrival times. We have simulated two dynamic threshold algorithms with different weighting functions. In both cases, the system goes to sleep after remaining idle for Sth.

1. Dynamic threshold with exponential weighting (EXP). The sleep threshold Sth is defined as:

$$S_{th} = C \cdot T_{even} / I_{n+1} \tag{5}$$

where C is a constant and $I_{n+1}$ is calculated using equation (4).

2. Dynamic threshold with root mean square (RMS). The sleep threshold is defined as:

$$S_{th} = C \cdot T_{even} / I_{n+1} \tag{6}$$

where C is a constant and $I_{n+1} = \sqrt{(i_n^2 + i_{n-1}^2 + i_{n-2}^2)/3}$. Unlike the exponential weighting scheme, the previous three inter-arrival times are weighted equally.

The simulation results of EXP and RMS are displayed in Table 10 and Table 11 respectively.
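The two threshold rules can be sketched side by side. The function names are hypothetical, and the handling of a zero prediction (returning an infinite threshold, so the node never sleeps) is an assumption added only to avoid a division by zero; the text does not say how this case is treated.

```python
import math


def exp_prediction(i_n, i_n1, i_n2, a):
    """Exponentially weighted prediction, as in equation (4)."""
    return a * i_n + a * (1 - a) * i_n1 + a * (1 - a) ** 2 * i_n2


def rms_prediction(i_n, i_n1, i_n2):
    """Root mean square of the three most recent inter-arrival times."""
    return math.sqrt((i_n ** 2 + i_n1 ** 2 + i_n2 ** 2) / 3.0)


def sleep_threshold(prediction, t_even, c):
    """Equations (5) and (6): Sth = C * Teven / I(n+1).

    Assumption: a non-positive prediction yields an infinite threshold.
    """
    if prediction <= 0.0:
        return math.inf
    return c * t_even / prediction


# Hypothetical recent inter-arrival times (most recent first), Teven and C.
i_n, i_n1, i_n2 = 1.0, 2.0, 4.0
print(sleep_threshold(exp_prediction(i_n, i_n1, i_n2, a=0.5), t_even=0.5, c=1.0))
print(sleep_threshold(rms_prediction(i_n, i_n1, i_n2), t_even=0.5, c=1.0))
```

In both rules a long predicted idle period yields a short sleep threshold, so the node goes to sleep quickly when traffic is sparse and lingers in the idle state when arrivals are expected soon.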

Node             Teven = 0.05   Teven = 0.2   Teven = 0.5
72 (Controller)  83.0%*         76.5%         70.2%
9 (Sensor)       87.8%          88.3%         88.5%
48 (Sensor)      90.8%          85.1%         82.0%
55 (Controller)  83.3%          72.3%*        64.4%*

Table 10 Optimality of EXP for three values of Teven (wake-up overhead). The worst-case result in each column is marked with an asterisk.

One would expect that, of the three adaptive algorithms presented, one would yield results superior to the others. However, comparing Table 9, Table 10 and Table 11, we notice only marginal variations, and there is no clear winner. EXP seems to perform on average 8% worse than RMS on Node 9 (as we will see shortly, this can be improved). Remember that we are most interested in the worst performing node, which can be either Node 72 or Node 55 depending on Teven. The differences in performance for these nodes among the three algorithms are less than 2%. These marginal differences suggest that altering the way shutdowns are executed and the way the previous inter-arrival history is weighted has little effect on the performance of adaptive algorithms. This leads us to speculate that any type of predictive algorithm that relies on the recent inter-arrival history will not be able to significantly outperform modified Hwang & Wu, EXP or RMS. In other words, there is a limit to the performance of any algorithm that only has knowledge of the recent inter-arrival history. A considerably different approach that incorporates information about the network neighborhood is needed to achieve any major breakthrough. Such information can be passed around the network by piggybacking fields dedicated to power management on existing packets.

Node             Teven = 0.05   Teven = 0.2   Teven = 0.5
72 (Controller)  82.4%*         76.6%         70.3%
9 (Sensor)       97.1%          96.9%         97.1%
48 (Sensor)      91.6%          85.1%         82.0%
55 (Controller)  84.1%          72.4%*        64.6%*

Table 11 Optimality of RMS for three values of Teven (wake-up overhead). The worst-case result in each column is marked with an asterisk.

6.5.1 Improving Adaptive Algorithms By Exploiting Special Characteristics Of The Sensor Network

Even though major improvements are difficult to obtain, minor ones can be accomplished by making some plain observations about network attributes. First, sensor nodes are woken up not only by incoming packets, but also by periodic update timers. Unlike wakeups caused by incoming packets, timer-triggered wakeups tend not to be followed by bursts of traffic. This means that the system can shut down immediately after a timer-triggered wakeup. The second observation concerns the controller nodes, which generate interest packets periodically. They often receive bursts of data packets shortly after generating an interest. Based on this observation, controller nodes have their dynamic sleep thresholds increased right after each interest generation.
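The two adjustments can be sketched as a thin wrapper around whatever dynamic threshold the adaptive algorithm has computed. Everything named here is hypothetical: the wake-up cause labels, the function name, and in particular the `boost` factor, since the text does not state by how much the threshold is increased after an interest is generated.

```python
def adjusted_threshold(base_threshold, wake_cause, is_controller=False,
                       interest_just_sent=False, boost=2.0):
    """Apply the two network-specific adjustments to a dynamic sleep threshold.

    wake_cause: "timer" or "packet" (hypothetical labels).
    boost: assumed multiplier applied on controllers right after an
           interest generation; the actual amount is not given in the text.
    """
    if wake_cause == "timer":
        # Timer-triggered wake-ups are rarely followed by bursts of traffic,
        # so the node can go back to sleep at once.
        return 0.0
    if is_controller and interest_just_sent:
        # Data packets tend to arrive in a burst after an interest is sent.
        return base_threshold * boost
    return base_threshold


print(adjusted_threshold(0.1, "timer"))
print(adjusted_threshold(0.1, "packet", is_controller=True, interest_just_sent=True))
print(adjusted_threshold(0.1, "packet"))
```

Keeping the adjustments separate from the predictor means they can be combined with any of the adaptive algorithms discussed above.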

Node             Teven = 0.05   Teven = 0.2   Teven = 0.5
72 (Controller)  83.6%          78.2%         72.6%
9 (Sensor)       98.9%          98.9%         98.9%
48 (Sensor)      92%            86.6%         83.0%
55 (Controller)  82.7%          74.1%         65.7%

Table 12 Optimality of the improved EXP for three values of Teven (wake-up overhead).

After adding the above minor improvements to the EXP algorithm, we obtain the results listed in Table 12. Although the improvements for the critical nodes are only marginal (1~2%), there is almost a 10% improvement for Node 9.

6.6 Evaluations Of Various Power Management Algorithms For Sensor Network Applications

Adaptive algorithms seem to be more appropriate solutions, since they are able to exploit the temporal correlations in the traffic, handle environmental changes and are relatively simple to implement. They are also self-designing, which means that no traces or pre-characterization are needed. Nevertheless, we do not want to rule out the constant threshold algorithm completely, as it achieves better performance in certain cases.

Node  Algorithm   Teven = 0.05   Teven = 0.2   Teven = 0.5
48    Dyn. EXP    92%            86.6%         83.0%
48    Constant    93.9%          86.3%         85.7%
9     Dyn. EXP    98.9%          98.9%         98.9%
9     Constant    99.3%          98.8%         98.7%
55    Dyn. EXP    82.7%          74.1%         65.7%
55    Constant    86.8%          72.4%         72.6%
72    Dyn. EXP    83.6%          78.2%         72.6%
72    Constant    75.3%          74.9%         73.5%

Table 13 Performance comparison of the constant threshold algorithm versus dynamic EXP

Let us study the comparison results presented in Table 13. For low activity nodes such as 48 and 9, the two sets of results are rather comparable. For high activity nodes such as 55 and 72, EXP outperforms the constant threshold algorithm for smaller Teven. However, as Teven increases and the penalty for misprediction grows, the constant threshold algorithm becomes the winner. Ironically, the simple constant threshold algorithm seems to perform better than the more sophisticated EXP when power management becomes difficult and critical. In other words, it does better for nodes with the “worst” inter-arrival times (Node 55) and for systems with higher Teven.

One approach to address the above problem is to use the constant threshold algorithm for some critical controller nodes, on the condition that we can properly identify them. A node may be able to extract information from its location in the network (the network topology) and “guess” how busy it is. Topology-based power management is an active area of our future research.

As we have seen, the busiest node in the network may not be the “worst” node in a power management sense. At the same time, we need to keep in mind that the “worst” node in a power management sense may not be the node with the worst overall energy consumption. The overall energy consumption is the active energy plus the energy controlled by the power manager (idle energy and power state transition overheads).

6.7 Service Time Consideration

We have not considered the packet service time in our discussion so far. It is definitely a relevant parameter in power management. However, for low activity, low duty cycle applications like the PicoRadio network, the service time is significantly smaller than the average inter-arrival time. Our simulation indicated that even for the busiest node in the network, there is only a 2% chance that more than one packet is in the queue. It therefore seems reasonable not to include the service time in the modeling.

6.8 Implementation Cost Of The PM

Care should be taken when implementing the power manager itself to make sure it

does not consume an overwhelming amount of energy. The selected power control

algorithm should have low computational complexity and require modest storage

space. The original Hwang & Wu algorithm clearly violates this requirement and has

to be modified.

The behavior of the power scheduler is modeled as concurrent EFSMs, each of

which controls a power domain (Figure 27). An energy-efficient way of implementing

the PM is to map it either directly to an ASIC or to some kind of interconnected

reconfigurable FSMs. A microprocessor implementation is not recommended due to

its massive cost compared to other implementation fabrics.
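
As a rough illustration of one such EFSM, the Python sketch below models a single power domain with three states and one extended state variable (an idle counter). The state names, event names, and the idle threshold are assumptions made for this example; the actual machines are those of Figure 27, and a software model like this is only meant for reasoning about behavior, not as the recommended implementation fabric.

```python
from enum import Enum


class PowerState(Enum):
    ACTIVE = "active"
    IDLE = "idle"
    SLEEP = "sleep"


class PowerDomainEfsm:
    """One EFSM controlling one power domain (illustrative model)."""

    def __init__(self, name, idle_threshold):
        self.name = name
        self.idle_threshold = idle_threshold  # ticks to wait in IDLE
        self.state = PowerState.SLEEP
        self.idle_ticks = 0                   # extended state variable

    def step(self, event=None):
        # "work": an event arrives for this domain; wake up and serve it.
        if event == "work":
            self.state = PowerState.ACTIVE
            self.idle_ticks = 0
        # "done": the domain finished its work and starts idling.
        elif event == "done" and self.state is PowerState.ACTIVE:
            self.state = PowerState.IDLE
            self.idle_ticks = 0
        # No event: count idle ticks and sleep once the threshold is reached.
        elif event is None and self.state is PowerState.IDLE:
            self.idle_ticks += 1
            if self.idle_ticks >= self.idle_threshold:
                self.state = PowerState.SLEEP
        return self.state


# Concurrent EFSMs: one instance per (hypothetical) power domain.
domains = [PowerDomainEfsm("radio", 2), PowerDomainEfsm("baseband", 3)]
trace = [domains[0].step(e) for e in ("work", "done", None, None)]
```

Each machine keeps only a state and a small counter, which reflects the low storage and computation budget required of the PM.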

7. Conclusions and Future Work

In the last chapter of the thesis, we would like to recapitulate the major research

results and contributions. We will also discuss the lessons learned and identify the

open questions and opportunities for further research. Power management for sensor

networks is still a fledgling area that is very much unexplored. We hope that this thesis

will instigate a certain amount of interest from the research community to tackle one

of the many interesting problems left.

7.1 Summary of Thesis Research and Contributions

2. Formal top-down platform-based design methodology for protocol

implementation. Most protocol design methodologies currently in use are

inadequate, either because they do not rely upon formal techniques and

therefore do not guarantee correctness, or because they do not provide

sufficient support for performance analysis and design exploration and

therefore often lead to sub-optimal implementations. Our methodology relies

on a formal Model of Computation (MOC). It supports architecture

exploration and meets the application’s need for flexibility while achieving energy

efficient solutions. Using PicoRadio as the design driver, the proposed formal

top-down design methodology yields superior results compared to traditional

bottom-up ad-hoc approaches.

3. Reactive systems need reactive management. OS and software support is

crucial for the design of ultra-low energy communication systems. These

systems, reactive in nature, tend to have a high level of integration and system

heterogeneity. General-purpose operating systems developed for broad

application are increasingly less suitable for these types of complex real time,

power-critical, domain specific systems implemented on advanced

heterogeneous architectures. More efficient solutions are obtained with OS’s

that are developed to exploit the reactive event-driven nature of the domain.

As proof, we present a comparison between two OS’s that target this

embedded domain: one that is general-purpose multi-tasking (ECOS) and

another that is event-driven (TinyOS). Preliminary results indicate that the

event-driven OS achieves an 8x improvement in performance, 2x and 30x

reductions in instruction and data memory requirements, respectively, and a 12x

reduction in power over its general-purpose counterpart.

4. Hierarchical power management framework for reactive systems. Based

on the attractive concepts of an existing reactive OS, we proposed a power

management framework that specifically targets reactive heterogeneous

systems. Its hierarchical structure enhances design scalability, supports

concurrency and enables power control at various granularities. Most state-of-the-art

power management systems handle only stand-alone devices. The scope of

our power management algorithm, however, is not limited to individual nodes;

instead, it aims to encompass the interest of the network as a whole. Our

power management algorithm executes in two phases: the network-level

algorithm first treats the whole node as one entity and decides when the

whole node should go to sleep; then, once the node is turned on, the node-level

algorithm determines the scheduling of the various modules inside the

node.

5. Experimentation with network-level power management algorithms on

the PicoRadio network. In the OMNeT++ simulator, we have simulated the

performance of various power control policies in a typical sensor network

setting. Each sensor node models the complete PicoRadio protocol stack.

From our experiments, adaptive algorithms seem to be good solutions since

they are able to exploit the temporal correlations in the traffic stream, handle

environmental changes and are relatively simple to implement. However,

simple constant threshold algorithms perform better for nodes with the

“worst” inter-arrival times and systems with high Teven. Our experimentation

on the various adaptive algorithms leads us to speculate that there is a

performance limit to any adaptive algorithm that only has the knowledge of

the recent inter-arrival history. A more “global” approach that incorporates

information on the network neighborhood is needed to achieve major

breakthroughs.

6. PicoRadio prototype development. To validate research concepts and gain

valuable design experience, I have participated in the development of the

PicoRadio II and PicoRadio III chips. For the PicoRadio II chip, I was

responsible for the entire software implementation process, including OS

selection and porting, application code generation, etc. I have been involved in

the architecture development of both PicoRadio II and PicoRadio III.

PicoRadio III deploys a power manager to demonstrate the reactive

management concepts discussed in the thesis.

7.2 Lessons Learned and Future Research Opportunities

In completing the thesis research, we have encountered numerous problems,

which prevented us from accomplishing higher goals. We would like to itemize the

valuable lessons learned from our efforts in methodology and design flow as well as

power management for sensor networks. Future research directions will also be

suggested.

Platform-based design methodology for protocol processing

Good modeling is essential for the success of the design flow. The

methodology is only as good as its models. The processor and OS

models provided by the Cadence VCC tool are vastly inaccurate. As a result,

the PicoRadio II system performance is over-estimated and its memory

requirement under-estimated. Needless to say, this is detrimental to the design

process.

In the architecture exploration phase of the design flow, it would be highly

desirable to provide a “confidence metric” that indicates how accurate the

performance estimation is. We are not aware of any existing performance

analysis tools that have this capability. Consequently, designers are often

reluctant to adopt high-level design methodologies and use such tools.

Node level power management

Localized power control policies should be investigated in the future. In the

thesis, we presented a power scheduling approach that is global in nature:

most of the intelligence is in the global power scheduler. Certain scheduling

decisions, however, can be delegated to the local blocks to avoid overloading

the power scheduler. A block may decide on how long it should wait in Idle

before requesting to sleep, based on its own evaluation of the expected event

inter-arrival rates and wakeup overheads.
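
A local decision of this kind can be sketched with the usual break-even argument: sleeping for a time T costs the transition energy plus the sleep power times T, while staying in Idle costs the idle power times T. The sketch below uses that simplified model and ignores the wakeup latency itself. The function names and the parameter values are hypothetical and are not PicoRadio measurements.

```python
def break_even_time(transition_energy, idle_power, sleep_power):
    """Idle duration above which sleeping saves energy (simplified model)."""
    if idle_power <= sleep_power:
        raise ValueError("sleeping never pays off if Idle is not costlier")
    # Solve transition_energy + sleep_power * T = idle_power * T for T.
    return transition_energy / (idle_power - sleep_power)


def block_requests_sleep(expected_idle, transition_energy,
                         idle_power, sleep_power):
    # The block asks the power scheduler for sleep only when its own
    # estimate of the coming idle period exceeds the break-even time.
    t_be = break_even_time(transition_energy, idle_power, sleep_power)
    return expected_idle > t_be


# Hypothetical block: 10 energy units per sleep/wake cycle,
# 3 power units in Idle, 1 power unit asleep.
t_be = break_even_time(10.0, 3.0, 1.0)
```

A block with low wakeup overhead has a short break-even time and can request sleep aggressively, which keeps such routine decisions out of the global power scheduler.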

More refined and aggressive predictive look-ahead scheduling schemes should

be studied and implemented. We have presented a rather simplistic and

conservative predictive look-ahead scheduling algorithm. The study of

predictive power scheduling is still in the conceptual stage. We need more

quantitative analysis of its costs and benefits. The PicoRadio platform, due to

its very low block wakeup overheads, is not a good candidate for

demonstrating predictive scheduling. An implementation platform that has

high wakeup overheads needs to be selected as the proof-of-concept for

predictive scheduling algorithms. DVS is another open research area. Our

framework has the capability to incorporate DVS but specific scheduling

techniques for DVS are left for future work.

Network level power management

Larger scale networks with more realistic topology should be constructed to

study power management algorithms. We have conducted our experiments on

a simple 100-node grid network, whereas realistic settings typically do not have a

grid configuration and may include “walls” that block signal transmission. It

will also be very interesting to see how the performance of the various power

control algorithms scales with the size of the network.

Better wireless channel models should be incorporated in the network

simulation. We are currently using an overly simplistic channel-interference

matrix model. Channel quality has a significant impact on protocol design and

the choice of power management policy. Unfortunately, wireless link

characteristics are difficult to capture and a realistic channel model is not easily

available. The ultimate plan is to implement the protocol stack along with the

power manager on a prototype and deploy it in a real network setting.

Measurement data on network traffic and energy dissipation can then be

collected and used to aid policy development.

A global paradigm that uses network topology to devise power management

policies should be studied. This paradigm is drastically different from existing

policies that rely only on local traffic flows. Our experiments suggest that

there are limitations to algorithms with narrow local visions. We believe that

real breakthroughs can be achieved if the power manager uses the knowledge

of not only the local node but also the surrounding network. Power control

information can be passed around in the network by appending dedicated

power management fields to the existing packet format. A node’s power control

decisions may be influenced by the energy status of its neighboring nodes.

Very often in a sensor network a node knows its own location and those of

its neighbors, and power management policies can be developed based on this

information. For example, in the constant threshold algorithm, instead of pre-

characterizing each node to obtain its optimal constant sleep threshold, we

could calculate the value based on global network topology.
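
As a purely illustrative sketch of how power control information might travel with ordinary traffic, the Python fragment below appends a small power management trailer to an existing packet payload. The trailer layout (a one-byte residual energy level and a two-byte planned sleep duration) and all names are hypothetical; the PicoRadio packet format defines no such fields.

```python
import struct

# Hypothetical trailer: energy level (0-255) and planned sleep time in ms,
# in network byte order with no padding.
PM_TRAILER = struct.Struct("!BH")


def append_pm_fields(payload: bytes, energy_level: int,
                     planned_sleep_ms: int) -> bytes:
    # Leave the existing packet untouched and add the PM fields at the end.
    return payload + PM_TRAILER.pack(energy_level, planned_sleep_ms)


def split_pm_fields(packet: bytes):
    # Recover the original payload and the neighbor's power status.
    payload = packet[:-PM_TRAILER.size]
    energy_level, planned_sleep_ms = PM_TRAILER.unpack(
        packet[-PM_TRAILER.size:])
    return payload, energy_level, planned_sleep_ms


packet = append_pm_fields(b"sensor-data", energy_level=180,
                          planned_sleep_ms=500)
payload, energy, sleep_ms = split_pm_fields(packet)
```

A receiving node could then factor the `energy` value reported by its neighbors into its own sleep decisions, at the cost of three extra bytes per packet in this layout.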

8. References

1. E. Lee and A. Sangiovanni-Vincentelli, A Unified Framework for Comparing Models of

Computation, IEEE Trans. on Computer Aided Design of Integrated Circuits and Systems, Vol.

17, N. 12:1217-1229, December 1998.

2. A. Girault, B. Lee and E. Lee, A preliminary Study of Hierarchical Finite State Machines with

Multiple Concurrency models, UCB/ERL Technical Report, August 1997.

3. C. Hoare, “Communicating Sequential Processes”, Communications of the ACM, Vol. 21, No. 8,

August 1978.

4. R.Milner, J. Parrow, and D. Walker, A Calculus of Mobile Processes, I, Information and

Computation, Vol. 100, No. 1, Sep 1992.

5. J. Dennis, First Version Data Flow Procedure Language, Technical Memo MAC TM61, MIT Lab

for Computer Science, May 1975.

6. G. Kahn, The Semantics of a Simple Language for Parallel Programming, Proc. of the IFIP

Congress 74, North-Holland Publishing Co., 1974.

7. C. Cassandras, Discrete Event Systems, Modeling and Performance Analysis, Irwin, Homewood

IL, 1993.

8. A. Benveniste and G. Berry, The Synchronous Approach to Reactive and Real-Time Systems,

Proceedings of the IEEE, Vol. 79, No. 9, 1991, pp. 1270-1282.

9. L. Lavagno, A. Sangiovanni-Vincentelli & E. Sentovich, Models of Computation for Embedded

System Design, 1998 NATO ASI Proceedings on System Synthesis, Il Ciocco, Italy, 1998

10. A. Sangiovanni-Vincentelli, R. McGeer and A. Saldanha, Verification of Integrated Circuits and

Systems, Proc. Of 1996 Design Automation Conference, June 1996.

11. A. Ferrari and A. Sangiovanni-Vincentelli, System Design: Traditional Concepts and New

Paradigms, Proceedings of the 1999 Int. Conf. On Comp. Des., Austin, Oct. 1999.

12. J. Rabaey et al. PicoRadio Supports Ad Hoc Ultra-Low Power Wireless Networking. IEEE

Computer, Vol. 33, No. 7, pp. 42-48, July 2000.

13. OPNET Radio Modeler, OPNET Technologies, Inc., http://www.mil3.com

14. C. Perkins and E. Royer, Ad-hoc On-Demand Distance Vector Routing, Proceedings of the 2nd

IEEE Workshop on Mobile Comp. Sys. and Apps., pp. 90-100, Feb. 1999.

15. C. Perkins and P. Bhagwat, Highly Dynamic Destination Sequenced Distance-Vector Routing

(DSDV) for Mobile Computers, Computer Communications Review, pp. 234-44, Oct. 1994.

16. H. Zhang et al, “1V Heterogeneous Reconfigurable Processor IC for Baseband Wireless

Applications,” Proc. ISSCC Conf., 2000, pp. 68-69.

17. Darren C. Cronquist et al, "Architecture Design of Reconfigurable Pipelined

Datapaths," Twentieth Anniversary Conference on Advanced Research in VLSI 1999.

18. M. Smith, Application Specific Integrated Circuits: Chapter 5.4. Altera MAX, Addison-Wesley,

1997.

19. Tim Tuan, Suet-Fei Li, Jan Rabaey, Reconfigurable Platform Design for Wireless Protocol

Processors, Proc. Of 2001 ICASSP, May 2001.

20. B. Kienhuis et al, An Approach for Quantitative Analysis of Application-specific Dataflow

Architectures, Proceedings of International Conf. of Application-specific Systems, Architectures

and Processors, pp. 338-349, Zurich, Switzerland 1997.

21. Cadence Design Systems, http://www.cadence.com

22. ECOS Operating System, http://www.redhat.com

23. David Culler et al, The TinyOS group, Department of EECS, UC Berkeley.

24. Pai Chou; Borriello, G. Software architecture synthesis for retargetable real-time embedded

systems, Proceedings of the Fifth International Workshop on Hardware/Software Codesign.

25. K.Ramamritham and J.A. Stankovic, Scheduling Algorithms and Operating Systems Support for

Real-Time Systems, Proceedings of the IEEE, January 1994, pp. 55-67.

26. R. Evans & P. Franzon, “Energy Consumption Modeling and Optimization for SRAM’s”, Journal

of Solid-State Circuits, Vol. 30, No. 5, May 1995.

27. G. Paleologo, L. Benini, A. Bogliolo and G. De Micheli, Policy optimization for dynamic power

management, Proc. 35th Design Automation Conference, June 1998, pp. 182-187.

28. A. Karlin, M. Manasse, L. McGeoch and S. Owicki, Competitive randomized algorithms for

nonuniform problems, Algorithmica, vol.11, no.6, pp 542-571, June 1994.

29. E. Chung, L. Benini and G. De Micheli, Dynamic Power Management for non-stationary service

requests, Design, Automation and Test in Europe, pp 77-81, 1999.

30. T. Simunic, Dynamic Management of Power Consumption, Chapter 1, Power Aware Computing,

Kluwer Academic Publishers, pp. 102-125, 2002.

31. A. Sinha and A. Chandrakasan, Dynamic Power Management in Wireless Sensor Networks, IEEE

Design & Test of Computers, 2001.

32. C. H. Hwang and A. Wu, A predictive system shutdown method for energy saving of event-driven

computation, Int. Conf. in Computer Aided Design, Nov. 1997, pp. 28-32.

33. Intel StrongARM processors. http://developer.intel.com/design/strong/sa1100.html

34. T. Pering, T. Burd, R. Brodersen, The simulation and evaluation of Dynamic Voltage Scaling

algorithms, Proceedings of the IEEE International Symposium on Low Power Electronics and Design,

1998.

35. L. Geppert, T. Perry, Transmeta’s magic show, IEEE Spectrum, vol. 37, pp. 26-33, May 2000.

36. Andrea Willig, Martin Kubisch, Christian Hoene and Adam Wolisz, Measurement of a wireless

Link in an Industrial Environment using IEEE 802.11-Compliant Physical Layer, IEEE

Transactions on Industrial Electronics, 2002.

37. D. Duchamp and N. Reynolds, Measured performance of wireless LAN, Proc. Of 17th Conf. On

Local Computer Networks, Minneapolis, 1992.

38. M. Srivastava, A. Chandrakasan and R.W. Brodersen, Predictive shutdown and other architectural

techniques for energy efficient programmable computation, IEEE Trans. Very Large Scale Integration

Syst., Vol. 4, pp. 42-4, Mar. 1996.

39. Crovella and Bestavros, Self-similarity in World Wide Web traffic: Evidence and possible causes, IEEE

Trans. Networking, vol. 5, pp. 835-846, Dec 1997.

40. M. Garrett and W. Willinger, Analysis, modeling and generation of self-similar VBR video traffic,

in SIGCOMM’94, London, U.K., August 1994, pp. 269-280.

41. W. Leland et al., On the self-similar nature of Ethernet traffic, IEEE Trans. Networking, vol. 2, pp.

1-15, Feb 1994.

42. Q. Liang, Ad Hoc Wireless Network Traffic – Self-Similarity and Forecasting, IEEE

Communications Letters, Vol. 6, No. 7, July 2002.

43. J. Redi and D. Averesky, Performance of Energy-Conserving Access Protocols Under Self-similar

Traffic. IEEE Wireless Communications and Networking Conference (WCNC'99), September 21-24, 1999.
