Copyright © 2005 Department of Computer Science
May 14, 2011 Networks Conference

The Edge of Smartness

Carey Williamson
iCORE Chair and Professor
Department of Computer Science
University of Calgary
Email: [email protected]
Main Message

• Now, more than ever, we need “smart edge” devices to enhance the performance, functionality, and efficiency of the Internet
[Figure: two end-system protocol stacks (Application, Transport, Network, Data Link, Physical) communicating across a simple core network.]
Talk Outline
• The End-to-End Principle: Revisited
• The Smart Edge: Motivation and Definition
• Example 1: Redundant Traffic Elimination
• Example 2: TCP Incast Problem
• Example 3: Speed Scaling Systems
• Future Outlook and Opportunities
• Questions and Discussion
The End-to-End Principle

• Central design tenet of the Internet (simple core)
• Represented in the design of the TCP/IP protocol stack
• Wikipedia: “Whenever possible, communication protocol operations should be defined to occur at the end-points of a communications system”
• Some good reading:
  – J. Saltzer, D. Reed, and D. Clark, “End-to-End Arguments in System Design”, ACM ToCS, 1984
  – M. Blumenthal and D. Clark, “Rethinking the Design of the Internet: The end-to-end arguments vs. the brave new world”, ACM ToIT, 2001
Internet Protocol Stack
[Figure: protocol stacks (Application, Transport, Network, Data Link, Physical) of two end systems communicating through a router.]
The End-to-End Principle: Revisited

• Claim: The ongoing evolution of the Internet is blurring our notion of what an end system is
• This is true for both the client side and the server side
  – Client: mobile phones, proxies, middleboxes, WLAN
  – Server: P2P, cloud, data centers, CDNs, Hadoop
• When something breaks in the Internet protocol stack, we have to find a suitable retrofit to make it work properly
• We have done this repeatedly for decades, and will likely keep doing it again and again!
(Selected) Existing Examples

• Mobility: Mobile IP, MoM, Home/Foreign Agents
• Small devices: mobile portals, content transcoding
• Web traffic volume: proxy caching, CDNs
• Wireless: I-TCP, Proxy TCP, Snoop TCP, cross-layer
• IP address space: Network Address Translation (NAT)
• Multi-homing: smart devices, cognitive networks, SDR
• Big data: P2P file sharing, BT, download managers
• P2P file sharing: traffic classification, traffic shapers
• Security concerns: firewalls, intrusion/anomaly detection
• Intermittent connectivity: delay-tolerant networks (DTN)
• Deep space: inter-planetary IP
The Smart Edge

• Putting new functionality in a “smart edge” device seems like a logical choice, for reasons of performance, functionality, efficiency, and security
• What is meant by “smart”?
  – Interconnected: one or more networks; defines basic information units; awareness of location/context
  – Instrumented: suitably represents user activities; location, time, identity, and activity; performance metrics
  – Intelligent: provisioning, management, adaptation; appropriate decision-making in real time
Example 1: Redundant Traffic Elimination
Motivation for RTE
• A lot of the data content carried on the Internet today is (partially) redundant
• Examples:
  – Spam email that we receive (CIBC, RBC, …)
  – Regular email that we receive (drafts)
  – Web pages that we visit (U of X)
• It would be nice to avoid having to send this redundant data more than once (especially on low-bandwidth links!)
Basic Principles of RTE

• If you can “remember” what you have sent before, then you don’t have to send another copy
• Redundant Traffic Elimination (RTE)
• Done using a dictionary of chunks and their associated fingerprints
• Examples:
  – Joke telling by certain CS professors
  – Data deduplication in storage systems (90%)
  – “WAN Optimization” in networks (20%)
A Toy Example
Mary had a little lamb
Its fleece was white as snow
And everywhere that Mary went
That lamb was sure to go

It followed her to school one day
Which was against the rule
It made the children laugh and play
To see a lamb at school

Mary had a little lamb
A little pork, a little ham
Mary had a little lamb
And then she had dessert!
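To make the idea concrete, here is a minimal Python sketch of redundancy detection on this toy poem: treat each line as a chunk, fingerprint it, and count the bytes a chunk cache would save. Line-level chunks and SHA-1 fingerprints are illustrative choices for the demo, not the talk's actual parameters.

```python
import hashlib

POEM = """\
Mary had a little lamb
Its fleece was white as snow
And everywhere that Mary went
That lamb was sure to go
It followed her to school one day
Which was against the rule
It made the children laugh and play
To see a lamb at school
Mary had a little lamb
A little pork, a little ham
Mary had a little lamb
And then she had dessert!"""

def redundant_bytes(lines):
    """Bytes saved if each repeated line is replaced by its fingerprint."""
    seen, saved = set(), 0
    for line in lines:
        fp = hashlib.sha1(line.encode()).digest()  # chunk fingerprint
        if fp in seen:
            saved += len(line)   # cache hit: this chunk was sent before
        else:
            seen.add(fp)
    return saved

saved = redundant_bytes(POEM.splitlines())
print(saved)  # "Mary had a little lamb" (22 bytes) repeats twice -> 44
```

Only exact repeats at this granularity are caught, which is precisely the chunk-granularity tradeoff discussed on the next slide.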
Chunk Granularity Issue
• Object: large potential savings, but exact hits will be very rare
• Paragraph: very few repeats though
• Sentence: some repeats, some savings
• Chunk: “just right” size and savings
• Word: lots of repeats, small savings
• Letter: finite alphabet, many hits, but relatively high overhead to encode
Redundant Traffic Elimination (RTE)
• Purpose: Use the bottleneck link more efficiently
• Basic idea: Use a cache of data chunks to avoid transmitting identical chunks more than once
• RTE process:
  – Divide each IP packet into chunks
  – Select a subset of chunks
  – Store a cache of chunks at the two ends of a network link or path
  – Transfer only chunks that are not cached
• Works within and across files
• Combines caching and chunking
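The RTE process above can be sketched as a simple link protocol: the sender keeps fingerprints of chunks it has sent, the receiver keeps the actual chunk cache, and any chunk seen before crosses the link as a short fingerprint instead of its full bytes. The 8-byte chunk size, token format, and truncated SHA-1 fingerprints are all illustrative assumptions.

```python
import hashlib

CHUNK = 8  # bytes per chunk (the talk uses 64; small here for the demo)

def fp(chunk: bytes) -> bytes:
    return hashlib.sha1(chunk).digest()[:8]  # truncated fingerprint

def encode(data, sender_fps):
    """Sender side: emit ('raw', chunk) or ('fp', fingerprint) tokens."""
    out = []
    for i in range(0, len(data), CHUNK):
        c = data[i:i + CHUNK]
        f = fp(c)
        if f in sender_fps:
            out.append(("fp", f))       # chunk already cached downstream
        else:
            sender_fps.add(f)
            out.append(("raw", c))
    return out

def decode(tokens, receiver_cache):
    """Receiver side: rebuild data from raw chunks and cache lookups."""
    data = b""
    for kind, val in tokens:
        if kind == "raw":
            receiver_cache[fp(val)] = val
            data += val
        else:
            data += receiver_cache[val]  # fingerprint -> cached chunk
    return data

sender_fps, receiver_cache = set(), {}
msg = b"ABCDEFGH" * 4                      # four identical 8-byte chunks
tokens = encode(msg, sender_fps)
assert decode(tokens, receiver_cache) == msg
raw = sum(1 for k, _ in tokens if k == "raw")
print(raw)  # only 1 raw chunk sent; 3 cross the link as fingerprints
```

Note the asymmetry this enables: the sender only needs fingerprints, while the full chunk cache lives at the receiver, which matters on later slides.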
[Figure: a packet divided into Chunks A, B, and C (with chunk distance and overlap indicated); each chunk is stored in a chunk cache and indexed by its fingerprint, e.g. FP A = fingerprint(Chunk A).]
Background on RTE
• Proposed by [Spring and Wetherall 2000]
  – Intended to augment Web caching
  – Proposed for IP packet-level redundancy elimination
  – Found up to 54% redundancy in Web traffic
  – Applied to high-speed wired links (“WAN Optimization”)
• Chunking is used in storage systems to avoid storing redundant data (data deduplication)
• Can also apply this approach in a WLAN context:
  – Increasing demand for wireless broadband
  – Plenty of CPU power, cheap storage available
  – Wireless traffic content similar to wired traffic
  – More efficient use of the constrained wireless channel
Some References on RTE

• N. Spring and D. Wetherall, “A Protocol-Independent Technique for Eliminating Redundant Network Traffic”, ACM SIGCOMM 2000
• A. Anand et al., “Packet Caches on Routers: The Implications of Universal Redundant Traffic Elimination”, ACM SIGCOMM 2008
• A. Anand et al., “Redundancy in Network Traffic: Findings and Implications”, ACM SIGMETRICS 2009
• A. Anand et al., “SmartRE: An Architecture for Coordinated Network-wide Redundancy Elimination”, ACM SIGCOMM 2009
• B. Aggarwal et al., “EndRE: An End-System Redundancy Elimination Service for Enterprises”, USENIX NSDI 2010
• E. Halepovic et al., “DYNABYTE: A Dynamic Sampling Algorithm for Redundant Content Detection”, IEEE ICCCN 2011
• E. Halepovic et al., “Enhancing Redundant Network Traffic Elimination”, under review, 2011
RTE Process Pipeline
[Figure: current vs. proposed RTE pipelines. Current: packet → NIC → fingerprinting → chunking (no overlap) → chunk expansion → FIFO cache management → forwarding. Proposed: packet → NIC → bypass checks (“Large enough?”, “Content promising?”) → fingerprinting → overlap-tolerant chunking → non-FIFO cache management → forwarding; packets failing a check are forwarded directly.]

Improve traditional RTE by exploiting traffic non-uniformities:
• Packet size (bypass technique)
• Chunk popularity (new cache management scheme)
• Content type (content-aware RTE)

Result: up to 50% more detected redundancy
Fixed-size Chunks with Overlap
• Traditional RTE uses variable-sized chunks with expansion
  – After detecting a chunk match, the matching region is expanded
  – Need to store whole packets in the cache
  – Need a full packet cache at both ends of the link
  – Constrained to a FIFO replacement policy
• Replaced with fixed-size chunks (64 bytes) and overlap
  – Store only chunks in the cache, not whole packets (less overhead)
  – Full cache needed only at the receiver; fingerprints only at the sender
  – Allows alternative cache management schemes
• Benefits of fixed-size chunks with overlap:
  – Simpler technique with lower storage overhead
  – Detects 9-14% more redundancy, compared to 13% with “expansion”
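A rough sketch of why overlap matters with fixed-size chunks: instead of expanding a match (which requires caching whole packets), slide a fixed-size window over the packet and count bytes covered by cached chunks at any byte alignment. The 8-byte window (64 bytes in the talk), brute-force rehashing per offset, and helper names are illustrative; real implementations would use a rolling hash.

```python
import hashlib

SIZE = 8  # fixed chunk size (64 bytes in the talk)

def fingerprint(b: bytes) -> bytes:
    return hashlib.sha1(b).digest()

def matched_bytes(packet: bytes, cache: set) -> int:
    """Bytes of `packet` covered by cached chunks at any offset."""
    covered, i = 0, 0
    while i + SIZE <= len(packet):
        if fingerprint(packet[i:i + SIZE]) in cache:
            covered += SIZE
            i += SIZE          # skip past the matched region
        else:
            i += 1             # overlap: try the next byte alignment
    return covered

cache = {fingerprint(b"ABCDEFGH")}
pkt = b"xxxABCDEFGHyyy"       # the match is not chunk-aligned
print(matched_bytes(pkt, cache))  # 8: found despite misalignment
```

Non-overlapping chunking would miss this match entirely, since the cached content does not start on a chunk boundary.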
LSA Cache Replacement
• Frequency-based cache replacement (not FIFO)
• Exploits non-uniform chunk popularity
• Replace chunks that contributed least to savings
  – Track savings by chunk, not cache hits (overlap)
  – New metric: “total bytes saved” per chunk
  – LFU-like, may cause cache pollution
  – Need an “aging” factor: purge the entire cache!
• Least Savings with Aging (LSA) improves detected redundancy by up to 12%
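A hedged sketch of the LSA idea: track “total bytes saved” per cached chunk, evict the least-saving chunk when the cache is full, and periodically age the counters. The aging step here (halving counters) is a softer illustrative variant; the slide's actual aging purges the cache entirely. Class and method names are made up for the demo.

```python
class LSACache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.savings = {}            # fingerprint -> total bytes saved

    def access(self, fp, chunk_len):
        """Record a lookup; return True on a cache hit."""
        if fp in self.savings:
            self.savings[fp] += chunk_len   # credit the bytes saved
            return True
        if len(self.savings) >= self.capacity:
            # evict the chunk that has contributed least to savings
            victim = min(self.savings, key=self.savings.get)
            del self.savings[victim]
        self.savings[fp] = 0                # newly cached, no savings yet
        return False

    def age(self):
        """Aging pass so once-popular chunks cannot pollute the cache."""
        self.savings = {fp: s // 2 for fp, s in self.savings.items()}

cache = LSACache(capacity=2)
cache.access("A", 64); cache.access("A", 64)   # A has saved 64 bytes
cache.access("B", 64)                          # B cached, 0 savings so far
hit = cache.access("C", 64)                    # full: evict B (least savings)
print(hit, sorted(cache.savings))  # False ['A', 'C']
```

Unlike LFU's hit counts, the eviction metric here is bytes saved, which is what the overlap-based matching actually optimizes.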
RTE in Wireless Traffic
• Using traces of campus WLAN traffic
• RTE applied to aggregate wireless traffic
  – Savings comparable to inbound aggregate campus traffic, but higher for the outbound direction by about 30%
  – Why? The inbound traffic mix is similar for campus and WLAN traffic, but differs for outbound (more P2P)
• RTE applied to individual WLAN users’ traffic
  – 65% of users have up to 10% redundancy in their traffic
  – 30% of users have 10-50% redundant traffic
  – 5% of users have 50% or more redundancy
Main Sources of Redundancy

Type    Value  Description                 Example
Nulls   57.1%  Consecutive null bytes      0x00000000
Text    16.7%  Plain text (English)        Gnutella
HTTP     7.3%  HTTP directives             Content-Type:
Mixed    6.2%  Plain text and other chars  14pt font
Binary   5.8%  Random characters           0x27c46128
HTML     3.7%  HTML code fragments         <HTML> <p>
Char+1   3.2%  Repeated text chars         AAAAAAAz
Content-Aware RTE
• Improvement techniques are nice, but chunk selection is still random
• Tackle the fundamental problem of RTE: selecting the most redundant data chunks
• Content-based vs. random selection
• Exploit non-uniform content in data traffic
• RTE savings contribution by different data chunks:
  – Null strings: 57%
  – Text-based: 31% → select more text-based chunks
  – Binary, Mixed: 12% → bypass binary data
Example with 6-byte chunks (1 of 2)
Normal (random) selection:

HTTP/1.1 200 OK<CRLF>
Server: Apache/2.2.11 (Unix)<CRLF>
Last-Modified: Mon, 25 Jan 2010 16:19:01 GMT<CRLF>
ETag: "a7046c-a6e-47dff86f24740"<CRLF>
Accept-Ranges: bytes<CRLF>
Content-Length: 2670<CRLF>
Cache-Control: maxage=3600<CRLF>
X-UA-Compatible: IE=EmulateIE7<CRLF>
Content-Type: image/png<CRLF>
Date: Fri, 29 Jan 2010 19:30:05 GMT<CRLF><CRLF>
[binary PNG image data, from which randomly selected 6-byte chunks are also drawn]
Example with 6-byte chunks (2 of 2)
Content-aware selection:

[Same HTTP response as on the previous slide: with content-aware selection, the 6-byte chunks are drawn from the text-based HTTP headers, and the binary PNG payload is bypassed.]
Entropy-based bypass
• Lower entropy means higher redundancy
• Select from chunks with entropy of 5.3 or less
• Problem: CPU time required
[Figure: distribution of chunk entropy (percentage of chunks vs. entropy, 3.0-6.0) for HTML, PDF, and MP3 content.]
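A sketch of the entropy-based bypass: compute the byte-level Shannon entropy of each chunk and only consider low-entropy (likely redundant) chunks for RTE. The 5.3-bit threshold comes from the slide; the example chunks are illustrative. The per-byte `Counter` and `log2` work is also where the slide's CPU-time concern comes from.

```python
import math
from collections import Counter

THRESHOLD = 5.3  # bits per byte; chunks above this are bypassed

def entropy(chunk: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte."""
    counts = Counter(chunk)
    n = len(chunk)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def select(chunks):
    """Keep only chunks whose low entropy suggests likely redundancy."""
    return [c for c in chunks if entropy(c) <= THRESHOLD]

text  = b"Content-Type: text/html; charset=utf-8 " * 2
noise = bytes(range(256))           # maximally mixed byte values
kept = select([text, noise])
print(len(kept))  # 1: the text chunk passes, the noise chunk is bypassed
```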
Textiness-based bypass
• “Textiness”: proportion of plain text characters in a chunk
• Computationally simple
• Select from chunks with textiness of at least 0.9
• Modest CPU demands
• Similar RTE savings
[Figure: distribution of chunk textiness (percentage of chunks vs. textiness value, 0.0-1.0) for HTML and MP3 content.]
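A sketch of the textiness-based bypass: score each chunk by the fraction of plain-text bytes and only consider chunks scoring at least 0.9 (the slide's threshold). One pass and no logarithms, hence the modest CPU demands. Exactly which bytes count as “text” is an assumption here (printable ASCII plus tab/CR/LF), as are the example chunks.

```python
def textiness(chunk: bytes) -> float:
    """Fraction of bytes that are printable ASCII (incl. tab, CR, LF)."""
    texty = sum(1 for b in chunk if 32 <= b < 127 or b in (9, 10, 13))
    return texty / len(chunk)

def select(chunks, threshold=0.9):
    """Keep only chunks that look like plain text."""
    return [c for c in chunks if textiness(c) >= threshold]

header = b"Content-Length: 2670\r\n"
binary = bytes([0x89, 0x50, 0x4E, 0x47, 0x00, 0x01, 0xFF, 0xFE])
print(textiness(header), textiness(binary))  # 1.0 0.375
print(len(select([header, binary])))         # 1
```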
RTE Summary
• Improves traditional RTE savings by up to 50%
• Techniques can be used individually or together
• RTE is very beneficial for wireless traffic
  – 30% of users have 10-50% redundant traffic
• Proposed a novel content-aware RTE
  – Improves RTE savings by up to 38%
• Challenges of content-aware RTE
  – Needs refinement to work on real traces, or to exploit an appropriate traffic classification scheme
  – Needs improvement in execution time
Example 2: The TCP Incast Problem
Motivation
• Emerging IT paradigms
  – Data centers, grid computing, HPC, multi-core
  – Cluster-based storage systems, SAN, NAS
  – Large-scale data management “in the cloud”
  – Data manipulation via “services-oriented computing”
• Cost and efficiency advantages from IT trends, economy of scale, and a specialization marketplace
• Performance advantages from parallelism
  – Partition/aggregation, Hadoop, multi-core, etc.
  – Think RAID at Internet scale! (1000x)
Problem Formulation
• High-speed, low-latency network (RTT ≤ 0.1 ms)
• Highly-multiplexed link (e.g., 1000 flows)
• Highly-synchronized flows on the bottleneck link
• Limited switch buffer size (e.g., 100 packets)

How to provide high goodput for data center applications?

Synchronized flows overflow the small switch buffer, causing TCP retransmission timeouts and TCP throughput degradation.
Some References on TCP Incast
• A. Phanishayee et al., “Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems”, USENIX FAST 2008
• Y. Chen et al., “Understanding TCP Incast Throughput Collapse in Datacenter Networks”, WREN 2009
• V. Vasudevan et al., “Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication”, ACM SIGCOMM 2009
• M. Alizadeh et al., “Data Center TCP (DCTCP)”, ACM SIGCOMM 2010
• A. Shpiner et al., “A Switch-based Approach to Throughput Collapse and Starvation in Data Centers”, IEEE IWQoS 2010
• M. Podlesny et al., “An Application-Level Solution to the TCP-incast Problem in Data Center Networks”, IEEE IWQoS 2011
Effect of Timer Granularity
[Figure: measured goodput for different TCP retransmission timer granularities.]

Finer timer granularity definitely helps a lot!
Application-layer Scheduling
Start time of the response from the i-th server:
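The equation itself did not survive extraction, so the sketch below is a hypothetical staggering scheme in the same spirit, not the paper's exact formula: delay server i's response so that at most `window` responses share the bottleneck at once. All names and parameters are illustrative.

```python
def start_time(i, response_size, capacity, window=1):
    """Start time (s) of the i-th server's response (i = 0, 1, ...).

    response_size is in bits, capacity in bits/s; `window` servers
    are allowed to respond in each time slot.
    """
    slot = response_size / capacity      # time to drain one response
    return (i // window) * slot          # batch of `window` servers per slot

# 4 servers, 100 KB responses, 1 Gbps bottleneck, one server at a time
times = [round(start_time(i, 100e3 * 8, 1e9), 6) for i in range(4)]
print(times)  # [0.0, 0.0008, 0.0016, 0.0024] seconds
```

Because responses no longer arrive simultaneously, the switch buffer is never asked to absorb the full fan-in at once, avoiding the losses that trigger retransmission timeouts.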
Solution Analytical Model
Effect of Number of Servers
• Note non-monotonic behaviour! (ceiling functions)
Summary: TCP Incast Problem

• Data centers have specific network characteristics
• The TCP-incast throughput collapse problem emerges
• Solutions:
  – Tweak TCP parameters for this environment
  – Redesign TCP for this environment
  – Rewrite applications for this environment
  – Smart edge coordination for uploads/downloads
Example 3: Speed Scaling Systems
Motivation

• Computer systems performance evaluation research has traditionally considered throughput, response time, and delay as performance metrics
• In modern computer and communication systems, energy consumption, dollar cost, and sustainability are becoming more important
• Dynamic Voltage and Frequency Scaling (DVFS) is well-supported in modern processors, but not used particularly effectively
• Growing research interest in “CPU speed scaling”
Some References on Speed Scaling

• M. Weiser et al., “Scheduling for Reduced CPU Energy”, USENIX OSDI 1994
• F. Yao et al., “A Scheduling Model for Reduced CPU Energy”, IEEE FOCS 1995
• N. Bansal et al., “Speed Scaling to Manage Energy and Temperature”, JACM, Vol. 54, 2007
• N. Bansal et al., “Speed Scaling with an Arbitrary Power Function”, ACM-SIAM SODA 2009
• D. Snowdon et al., “Koala: A Platform for OS-level Power Management”, ACM EuroSys 2009
• S. Albers, “Energy-Efficient Algorithms”, CACM, May 2010
• L. Andrew et al., “Optimality, Fairness, and Robustness in Speed Scaling Designs”, ACM SIGMETRICS 2010
• A. Gandhi et al., “Optimality Analysis of Energy-Performance Trade-off for Server Farm Management”, IFIP Performance 2010
A Toy Example
• Consider 5 jobs (with no specific deadlines)
• Scheduling policies: FCFS, PS, SRPT
• Simple simulator of a single-CPU system
• Plot the number of active jobs in the system vs. time
• Plot the number of active bytes in the system vs. time

Job  Arrival  Size
0    1.0      5
1    2.2      2
2    2.8      3
3    3.5      1
4    4.7      4
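The simple simulator described above can be sketched in a few lines: a unit-speed CPU serves the five jobs, time advances in small steps, and the number of active jobs is recorded at each step. SRPT (serve the job with the least remaining work) is shown; FCFS and PS differ only in which job(s) get the CPU each step. The step size and helper names are illustrative.

```python
DT = 0.001  # simulation time step
JOBS = [(1.0, 5), (2.2, 2), (2.8, 3), (3.5, 1), (4.7, 4)]  # (arrival, size)

def simulate_srpt(jobs, horizon=20.0):
    remaining = {}          # job id -> remaining work
    trace = []              # (time, number of active jobs)
    t = 0.0
    while t < horizon:
        for i, (a, s) in enumerate(jobs):
            if abs(a - t) < DT / 2:
                remaining[i] = s            # job arrives
        if remaining:
            # SRPT: give the CPU to the job with the least remaining work
            j = min(remaining, key=remaining.get)
            remaining[j] -= DT
            if remaining[j] <= 1e-9:
                del remaining[j]            # job completes
        trace.append((round(t, 3), len(remaining)))
        t += DT
    return trace

trace = simulate_srpt(JOBS)
peak = max(n for _, n in trace)
print(peak)  # 4: under SRPT, all five jobs are never in the system at once
```

Swapping the `min(...)` line for FIFO order (FCFS) or an equal share of `DT` across all active jobs (PS) reproduces the other two policies from the slide.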
[Figure: number of active jobs in the system vs. time for the five-job workload (job table i, Ai, Si repeated alongside).]
[Figure: number of active bytes in the system vs. time for the five-job workload (job table i, Ai, Si repeated alongside).]
No Speed Scaling

[Figure: simulation results under FCFS, PS, and SRPT with a fixed-speed CPU.]
Dynamic Speed Scaling

[Figure: simulation results under FCFS, PS, and SRPT with dynamic speed scaling.]
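For dynamic speed scaling, a common policy in this literature (e.g. the SIGMETRICS 2010 work cited above) runs the CPU at speed s(n) = n^(1/α) when n jobs are active, under the conventional power model P(s) = s^α. The sketch below uses α = 2 as an illustrative choice; it is one plausible policy, not necessarily the one behind these slides' plots.

```python
ALPHA = 2.0  # power-law exponent in the power model P(s) = s**ALPHA

def speed(n_active_jobs: int) -> float:
    """Processor speed as a function of the number of active jobs."""
    return n_active_jobs ** (1.0 / ALPHA)

def power(s: float) -> float:
    """Conventional power model: dynamic power grows as s**ALPHA."""
    return s ** ALPHA

for n in (0, 1, 4):
    print(n, speed(n), power(speed(n)))
# 0 jobs -> idle at speed 0; 4 jobs -> speed 2.0, power 4.0
```

The appeal of s(n) = n^(1/α) is that power consumed equals the number of active jobs, so energy spent tracks the backlog the scheduler is trying to clear.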
Speed Scaling Summary
• A widely-applicable problem
  – Small-scale: desktops, multi-core, wireless devices
  – Large-scale: enterprise networks, data centers
• Mechanisms are available, but policies are unclear
• Interesting tradeoffs between fairness, efficiency, cost, optimality, and robustness
• Much more work remains to be done
  – Universal fairness metrics
  – Worst-case bounds vs. average-case performance
  – Dynamic energy pricing…
Concluding Remarks

• We need “smart edge” devices to enhance the performance, functionality, security, and efficiency of the Internet (now more than ever!)

[Figure: two end-system protocol stacks (Application, Transport, Network, Data Link, Physical) communicating across a simple core network.]
Future Outlook and Opportunities
• Traffic classification
• QoS management
• Load balancing
• Security and privacy
• Cloud computing
• Virtualization everywhere
• Cognitive radio networks
• Smart Applications on Virtual Infrastructure (SAVI)
For More Information

• C. Williamson, “The Edge of Smartness”, Workshop on Data Center Performance (DCPerf 2011), Minneapolis, MN, June 2011
• Web site: http://www.cpsc.ucalgary.ca/~carey
• Email: [email protected]
• Questions and Discussion?