Empirical Quantification of Opportunities for Content Adaptation in Web Servers
Michael Gopshtein and Dror Feitelson
School of Engineering and Computer Science
The Hebrew University of Jerusalem
Supported by a grant from the Israel Internet Association
Capacity Planning
Daily cycle of activity
[Figure: capacity vs. time over the daily cycle, showing utilized capacity under the load curve and wasted capacity between the load curve and installed capacity]
Capacity Planning
Flash crowd
[Figure: capacity vs. time during a flash crowd, with demand spiking above installed capacity]
Capacity Planning
• The problem:
  – Required capacity for flash crowds cannot be anticipated in advance
  – Even capacity sized for daily fluctuations is highly wasteful
• Academic solution: use admission control
• Business practice: unacceptable to reject any clients
  – Especially during a surge in traffic
Content Adaptation
• Trade off quality for throughput
  – Installed capacity matches normal load
  – Handle abnormal load by reducing quality
  – But still manage to provide meaningful service to all clients
• Assumes normal optimizations have been made already
  – Compress or combine images, promote caching, …
  – Empirically this is usually not the case
Content Adaptation
[Figure: low load, each client served the full-size smiley image]
Content Adaptation
[Figure: high load, more clients served progressively smaller smiley images, but all clients are served]
Content Adaptation
• Maintain the invariant:
  rate of requests × cost per request ≤ capacity
• Need to change quality (and cost!) of content
  – Prepare multiple versions in advance
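To make the invariant concrete, here is a minimal sketch (in Python; all names, costs, and thresholds are assumptions, not the paper's implementation) of the decision a server might apply: pick the richest content version whose per-request cost still keeps the invariant satisfied at the measured request rate.

```python
# Minimal sketch of the adaptation decision. CAPACITY, the version
# list, and all numbers are illustrative assumptions.
CAPACITY = 1000.0          # work units the server can do per second

VERSIONS = [               # (quality label, cost per request in work units)
    ("full",    10.0),
    ("reduced",  4.0),
    ("minimal",  1.0),
]

def pick_version(request_rate: float) -> str:
    """Return the richest version with rate * cost <= capacity."""
    for label, cost in VERSIONS:
        if request_rate * cost <= CAPACITY:
            return label
    return VERSIONS[-1][0]   # overloaded even at minimal quality

print(pick_version(50))      # "full":    50 * 10 <= 1000
print(pick_version(200))     # "reduced": 200 * 4 <= 1000
print(pick_version(5000))    # "minimal": best we can do
```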
The Questions
• What are the main costs in web service?
  – Is the bottleneck CPU / network / disk?
  – What do we gain by eliminating HTTP requests?
  – What do we gain by reducing file sizes?
• What can realistically be done?
  – What is the structure of a “random” site?
  – How much can we reduce quality?
Assumption: static web pages only
Costs of Serving Web Pages
Measuring Random Web Sites
• http://en.wikipedia.org/wiki/Special:Random
• Use title of page as input to Google search
• Extract domain of first link to get home page
• Retrieve it using IE
• Collect statistical data by intercepting system calls to send and receive
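The first steps of this sampling procedure can be sketched in a few lines; this is a rough illustration (assuming the Python requests library), not the authors' harness. The search and home-page-extraction steps are only stubbed out, since scraping a search engine depends on its current page format.

```python
# Rough sketch of the random-site sampling procedure.
from urllib.parse import unquote
import requests

def random_article_title() -> str:
    # Special:Random redirects to a random article; requests follows
    # the redirect, so the final URL carries the article title.
    resp = requests.get("http://en.wikipedia.org/wiki/Special:Random")
    return unquote(resp.url.rsplit("/", 1)[-1]).replace("_", " ")

print(random_article_title())
# Remaining steps (not implemented here): feed the title to a web
# search, take the domain of the first result, retrieve that home
# page, and record the components fetched while rendering it.
```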
Retrieved Component Sizes
[Figure: distribution of retrieved component sizes; annotations: “this is only 0.02% of the components” and “¼ of total data from components larger than 200 KB”]
Download Times
Download times (and bandwidth requirements) are roughly proportional to image size
Network Bandwidth
• Typical Ethernet packets are 1526 bytes
  – Ethernet and TCP/IP headers require 54 bytes
  – HTTP response headers require 280–325 bytes
• Most components fit into a few packets
  – 43% fit into a single packet
  – 24% more fit into 2 packets
Save bandwidth by reducing the number of small components or the size of large components
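As a quick check on these numbers, here is a back-of-the-envelope sketch (Python, using the header sizes from this slide; the 300-byte HTTP header is an assumed midpoint of the 280–325 range) estimating how many packets a component of a given size occupies.

```python
# Back-of-the-envelope packet-count estimate with the slide's numbers.
FRAME_BYTES = 1526          # full Ethernet packet on the wire
PROTO_OVERHEAD = 54         # Ethernet and TCP/IP headers
HTTP_HEADER = 300           # HTTP response headers (assumed midpoint)
PAYLOAD = FRAME_BYTES - PROTO_OVERHEAD   # usable bytes per packet

def packets_needed(component_bytes: int) -> int:
    total = component_bytes + HTTP_HEADER  # headers ride in the first packet
    return -(-total // PAYLOAD)            # ceiling division

for size in (500, 2000, 200 * 1024):
    print(size, "bytes ->", packets_needed(size), "packets")
```

For a small component, the fixed per-packet overhead dominates, which is why eliminating small components saves bandwidth at all; for large components, only shrinking the payload helps.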
Locality and Caching
• Flash crowds typically involve a very small number of pages (possibly the home page)
• Servers allocate GB of memory for cache
• This is enough for thousands of files
Disk is not expected to be a bottleneck
CPU Overhead
• CPU usage reflects several activities
  – Opening TCP connection
  – Processing request
  – Sending data
• Measure using combinatorial microbenchmarks (see the sketch below)
  – Open connection only
  – One extremely large file
  – Many small files
  – Many requests for a non-existent file
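A stripped-down version of the “open connection only” vs. full-request comparison might look like the following Python sketch; the host, port, and file path are placeholders for a local test server, not the paper's setup.

```python
# Minimal microbenchmark sketch: time N bare TCP connects vs. N full
# HTTP requests, to separate connection cost from request processing.
import socket
import time

HOST, PORT, N = "localhost", 80, 1000   # assumed local test server

def bench_connect_only() -> float:
    start = time.perf_counter()
    for _ in range(N):
        s = socket.create_connection((HOST, PORT))
        s.close()
    return time.perf_counter() - start

def bench_full_request(path: str = "/small.html") -> float:  # hypothetical file
    start = time.perf_counter()
    for _ in range(N):
        s = socket.create_connection((HOST, PORT))
        s.sendall(f"GET {path} HTTP/1.0\r\nHost: {HOST}\r\n\r\n".encode())
        while s.recv(4096):     # drain the response; HTTP/1.0 closes after it
            pass
        s.close()
    return time.perf_counter() - start

print("connect only:", bench_connect_only())
print("full request:", bench_full_request())
```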
CPU Overhead
Example: single 10 KB file
  – Establishing connection: 25%
  – Processing request: 72%
  – Data transfer: 3%
• Processing and data transfer become equal only at 240 KB
  – Only 0.3% of files are that big
If CPU is the bottleneck, need to reduce the number of requests
Optimizations
Guidelines
• Either the CPU or the network is the bottleneck
• Network bandwidth saved by reducing large components
• CPU saved by eliminating small components
• Maintaining “acceptable” quality is subjective
Eliminating Images
• Images have many functions
  – Story (main illustrative item)
  – Preview (for another page)
  – Commercial
  – Logo
  – Decoration (bullets, background)
  – Navigation (buttons, menus)
  – Text (special formatting)
• Some can be eliminated or replaced
Distribution of Types
• Manually classified 959 images from 30 random sites
• 50% decoration
• 18% preview
• 11% commercial
• 6% logo
• 6% text
Automatic Identification
• Decorations are candidates for elimination
• Identified by a combination of attributes (see the sketch below):
  – Use GIF format
  – Appear in HTML tags other than <IMG>
  – Appear multiple times in the same page
  – Small original size
  – Displayed size much bigger than original
  – Large change in aspect ratio when displayed
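An illustrative classifier combining these attributes might look as follows; this is not the paper's implementation, and all field names and numeric thresholds are assumptions.

```python
# Sketch: flag an image as "decoration" when several of the slide's
# attributes agree. Attribute names and thresholds are hypothetical.
from dataclasses import dataclass

@dataclass
class ImageInfo:
    fmt: str              # file format, e.g. "gif"
    html_tag: str         # tag the image appears in
    occurrences: int      # times it appears in the page
    original_bytes: int   # file size
    original_w: int
    original_h: int
    displayed_w: int
    displayed_h: int

def aspect_change(img: ImageInfo) -> float:
    orig = img.original_w / img.original_h
    disp = img.displayed_w / img.displayed_h
    return max(orig / disp, disp / orig)

def looks_like_decoration(img: ImageInfo) -> bool:
    signals = [
        img.fmt == "gif",                          # GIF format
        img.html_tag.upper() != "IMG",             # e.g. a background
        img.occurrences > 1,                       # repeated in the page
        img.original_bytes < 2048,                 # small original (assumed)
        img.displayed_w * img.displayed_h
            > 4 * img.original_w * img.original_h, # stretched when displayed
        aspect_change(img) > 2.0,                  # large aspect-ratio change
    ]
    return sum(signals) >= 3    # require several signals to agree

# Example: a tiny repeated GIF bullet stretched as a separator.
bullet = ImageInfo("gif", "TD", 12, 300, 10, 10, 40, 8)
print(looks_like_decoration(bullet))   # True
```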
Image Size Distribution
[Figure: size distributions of decoration, preview, and commercial images]
Auxiliary Files
• JavaScript
  – May be crucial for page function
  – Impossible to understand automatically
• CSS (style sheets)
  – May be crucial for page structure
  – May be possible to identify those parts that are used
Auxiliary Files
• Cannot be eliminated
• Common wisdom: use separate files
  – Allow caching at client
  – Save retransmission with each page
• Alternative: embed in HTML (see the sketch below)
  – Reduce number of requests
  – May be better for flash crowds that do not request multiple pages
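A minimal sketch of the embedding alternative, assuming pages and their assets live on local disk; the regular expressions are simplistic and only illustrate the idea, not a production HTML rewriter.

```python
# Sketch: inline a page's external CSS and JS so the adapted page is
# served in a single HTTP request. Paths are resolved against `root`.
import re
from pathlib import Path

def inline_assets(html: str, root: Path) -> str:
    def inline_css(m: re.Match) -> str:
        css = (root / m.group(1).lstrip("/")).read_text()
        return f"<style>{css}</style>"

    def inline_js(m: re.Match) -> str:
        js = (root / m.group(1).lstrip("/")).read_text()
        return f"<script>{js}</script>"

    html = re.sub(r'<link[^>]*href="([^"]+\.css)"[^>]*>', inline_css, html)
    html = re.sub(r'<script[^>]*src="([^"]+\.js)"[^>]*></script>',
                  inline_js, html)
    return html

# Usage: inline_assets(Path("index.html").read_text(), Path("."))
```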
Text and HTML
• Some areas may be eliminated under extreme conditions
  – Commercials
  – Some previews and navigation options
• Often encapsulated in <DIV> tags
• Sometimes identified by ID or class names, e.g. “sidebanner” (see the sketch below)
  – Especially when using modular design
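For illustration only, a sketch using BeautifulSoup (an assumed dependency) that drops <DIV> blocks whose id or class matches a list of removable names; “sidebanner” comes from the slide, the other names are invented.

```python
# Sketch: hide unnecessary <DIV> blocks by id/class name.
from bs4 import BeautifulSoup

REMOVABLE = ("sidebanner", "ad", "commercial", "preview")  # assumed names

def hide_blocks(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for div in soup.find_all("div"):
        names = [div.get("id", "")] + div.get("class", [])
        if any(n.lower() in REMOVABLE for n in names if n):
            div.decompose()   # drop the whole block from the page
    return str(soup)

print(hide_blocks('<div class="sidebanner">ads</div><div>story</div>'))
# -> '<div>story</div>'
```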
Summary
Content Adaptation
• Degraded content usually better than exclusion
• Only way to handle flash crowds that overwhelm installed capacity
• Empirical results identify the main options
  – Identify and eliminate decorations
  – Compress large images (story, commercial)
  – Embed JavaScript and CSS
  – Hide unnecessary blocks
Next Paper Preview
• Implementation in Apache
• Monitor CPU utilization and idle threads to switch between modes
• Use mod_rewrite to redirect URLs to adapted content (see the sketch after this list)
• Achieve up to a 10× increase in throughput for extreme adaptation
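One hypothetical flavor of such a rule set, as an Apache config sketch; the ADAPT environment flag and the /adapted/ path are invented stand-ins for whatever switching mechanism the follow-up paper actually uses.

```apache
# Hypothetical mod_rewrite sketch: when adaptation mode is on, serve
# pre-prepared lower-quality images instead of the originals.
RewriteEngine On
RewriteCond %{ENV:ADAPT} =on
RewriteCond %{DOCUMENT_ROOT}/adapted%{REQUEST_URI} -f
RewriteRule ^/images/(.+)\.(jpg|gif|png)$ /adapted/images/$1.$2 [L]
```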