8/3/2019 Internet Web Basics
http://slidepdf.com/reader/full/internet-web-basics 1/44
Copyright © Ellis Horowitz 1999-2012 1
Lecture
The Internet and Web Basics
The Internet and the WWW are Different
•The Internet is a global digital infrastructure
that connects hundreds of millions of computers and
people
•The World Wide Web is a mechanism that unifies the
retrieval and display of a subset of data on the
Internet
•An intranet is a local/global information structure that
connects an organization internally.
Intranets today often make use of Web technologies
•An extranet is a private network that uses the
public telecommunication system to securely share
part of a business's information or operations with
suppliers, vendors, partners, customers, or other
businesses.
How Big is the Internet - https://www.isc.org/solutions/survey
Date   | Host Count
-------+------------
Jul 11 | 849,869,781
Jan 11 | 818,374,269
Jul 10 | 768,913,036
Jan 10 | 732,740,444
Jul 09 | 681,064,561
Jan 09 | 625,226,456
Jul 08 | 570,937,778
Jan 08 | 541,677,360
Jul 07 | 489,774,269
Jan 07 | 433,193,199
Jul 06 | 439,286,364
Jan 06 | 394,991,609
Jul 05 | 353,284,187
Jan 05 | 317,646,084
Jul 04 | 285,139,107
Jan 04 | 233,101,481
Jan 03 | 171,638,297
Jul 02 | 162,128,493
Jan 02 | 147,344,723
Jul 01 | 125,888,197
Jan 01 | 109,574,429
Jul 00 |  93,047,785
Jan 00 |  72,398,092
Jul 99 |  56,218,000
Jan 99 |  43,230,000
Jul 98 |  36,739,000
Jan 98 |  29,670,000
Jul 97 |  19,540,000
Jan 97 |  16,146,000
Jul 96 |  12,881,000
Jan 96 |   9,472,000
Jul 95 |   6,642,000
Jan 95 |   4,852,000
Jul 94 |   3,212,000
Jan 94 |   2,217,000
Jul 93 |   1,776,000
hosts are/were doubling every 18 months
See the survey background at: http://www.isc.org/solutions/survey
It counts the number of IP addresses that have been assigned a name. The survey
queries the domain name system for the name assigned to every possible IP address.
But rather than sending a query to every one of the 4.3 billion possible IP addresses,
the survey starts with a list of all network numbers that have been delegated within the
IN-ADDR.ARPA domain. See http://www.isc.org/solutions/survey/background for
details
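The "doubling every 18 months" remark can be checked against the host counts in the table. The sketch below assumes steady exponential growth between two survey dates; the helper name doubling_time_months is made up for this illustration, and the four counts are copied from the table.

```python
import math

# Host counts copied from the ISC survey table above.
HOSTS = {
    "Jan 94": 2_217_000,
    "Jan 99": 43_230_000,
    "Jan 00": 72_398_092,
    "Jul 11": 849_869_781,
}

def doubling_time_months(start_count, end_count, years_elapsed):
    """Months needed to double, assuming steady exponential growth."""
    doublings = math.log2(end_count / start_count)
    return 12 * years_elapsed / doublings

months_1990s = doubling_time_months(HOSTS["Jan 94"], HOSTS["Jan 99"], 5.0)
months_2000s = doubling_time_months(HOSTS["Jan 00"], HOSTS["Jul 11"], 11.5)

print(f"1994-1999: doubling roughly every {months_1990s:.0f} months")
print(f"2000-2011: doubling roughly every {months_2000s:.0f} months")
```

On these figures the doubling time was a little over a year in the late 1990s and a bit more than three years between 2000 and 2011, which is consistent with the "are/were" in the remark.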
Key Internet Technologies
• Packet switching - permits multiple pairs of computers to
communicate over a shared network
– Messages/files are broken into segments of varying size,
called packets.
– Each packet is labeled with source and destination
addresses
– The receiver must re-assemble the packets in the proper
order
– The inventor(s) of packet switching is in dispute, see
http://query.nytimes.com/gst/fullpage.html?res=9F0CE3DA1139F93BA35752C1A9679C8B63
• IP Addresses
– An IPv4 address is a 32-bit number, from 0 to about 4.3
billion
– These numbers are written as four sets of eight bits
each, network.subnetwork.subnetwork.computer
• TCP/IP protocol (see ahead)
• Domain Name System (see ahead)
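To make the "32-bit number written as four sets of eight bits" point concrete, here is a minimal sketch that converts between dotted notation and the underlying integer. The function names are made up for this illustration, and the sample address is arbitrary; the standard ipaddress module is used only to cross-check the result.

```python
import ipaddress

def dotted_to_int(dotted):
    """Pack four 8-bit fields into one 32-bit integer."""
    a, b, c, d = (int(part) for part in dotted.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

def int_to_dotted(value):
    """Unpack a 32-bit integer into four 8-bit fields."""
    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))

addr = "128.125.7.29"          # sample address for the illustration
number = dotted_to_int(addr)

print(number)                   # the same address as a single integer
print(int_to_dotted(number))    # and back to dotted notation

# Cross-check against the standard library.
assert number == int(ipaddress.IPv4Address(addr))
```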
IPv4 Address
• A 32-bit number divided into four sets of 8-bit numbers, e.g. a.b.c.d
• There are five classes of IP addresses; classes A, B and C are assigned to networks of hosts
– class A - 16 million hosts on 127 networks
– class B - 65,000 hosts on 16,000 networks
– class C - 254 hosts on 2 million networks
– class D - reserved for multicast
– class E - reserved for IETF for its research purposes
• USC has a class B license, 128.125.x.y
• We are running out of IP addresses
– CIDR* addressing makes more IP addresses available in common practice
– see http://www.webopedia.com/TERM/C/CIDR.html
Class  Leftmost bits  Network   Local Address (host)
A      0              7 bits    24 bits
B      10             14 bits   16 bits
C      110            21 bits   8 bits
D      1110           28 bits (Multicast address)
*Classless Inter-Domain Routing
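The leftmost-bits table translates directly into a few bit masks on the first octet. This is a sketch of the historical classful rule only (CIDR has replaced it in practice); the function name address_class is made up for this illustration.

```python
def address_class(dotted):
    """Return the classful category (A-E) from the leading bits of the first octet."""
    first_octet = int(dotted.split(".")[0])
    if first_octet & 0b10000000 == 0b00000000:   # leading bit  0
        return "A"
    if first_octet & 0b11000000 == 0b10000000:   # leading bits 10
        return "B"
    if first_octet & 0b11100000 == 0b11000000:   # leading bits 110
        return "C"
    if first_octet & 0b11110000 == 0b11100000:   # leading bits 1110
        return "D"
    return "E"                                    # leading bits 1111

for sample in ("10.0.0.1", "128.125.7.29", "192.168.1.1", "224.0.0.1", "240.0.0.1"):
    print(sample, address_class(sample))
```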
IP Address (IPv4)
IPv4 Address Ranges
Class  Leftmost bits  Start address  End address
A      0xxx           0.0.0.0        127.255.255.255
B      10xx           128.0.0.0      191.255.255.255
C      110x           192.0.0.0      223.255.255.255
D      1110           224.0.0.0      239.255.255.255
E      1111           240.0.0.0      255.255.255.255
Private Address Ranges
Class  Private start address  Private end address
A      10.0.0.0               10.255.255.255
B      172.16.0.0             172.31.255.255
C      192.168.0.0            192.168.255.255
Classless Inter-Domain Routing
• An IP addressing scheme that replaces the older system
based on classes A, B, and C.
– A single IP address can be used to designate many unique
IP addresses.
– A CIDR IP address looks like a normal IP address except
that it ends with a slash followed by a number, called
the IP network prefix. For example: 172.200.0.0/16
– The IP network prefix specifies how many addresses are
covered by the CIDR address, with lower numbers covering
more addresses. CIDR currently uses prefixes anywhere
from 13 to 27 bits
– For example, in the CIDR address 206.13.01.48/25, the
"/25" indicates the first 25 bits are used to identify
the unique network, leaving the remaining bits to
identify the specific host.
CIDR Block Prefix   Equivalent Class C    # of Host Addresses
/27                 1/8th of a Class C    32 hosts
/26                 1/4th of a Class C    64 hosts
/25                 1/2 of a Class C      128 hosts
/24                 1 Class C             256 hosts
/23                 2 Class C             512 hosts
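The host counts in the table follow from the prefix length: a /n block leaves 32 - n bits for hosts, so it covers 2^(32 - n) addresses. A short sketch with Python's standard ipaddress module shows this; block_size is a helper named for this illustration, and the slide's example address is written as 206.13.1.48 because recent Python versions reject octets with leading zeros.

```python
import ipaddress

def block_size(prefix_len):
    """Number of addresses covered by an IPv4 block with this prefix length."""
    return 2 ** (32 - prefix_len)

for prefix_len in (27, 26, 25, 24, 23):
    print(f"/{prefix_len}: {block_size(prefix_len)} addresses")

# strict=False accepts a host address and masks off the host bits,
# giving the network that the address falls in.
block = ipaddress.ip_network("206.13.1.48/25", strict=False)
print(block)                  # the enclosing /25 network
print(block.num_addresses)    # same value as block_size(25)
```

Note that these are address counts; in practice the first and last address of a block are usually reserved for the network and broadcast addresses.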
IPv6
• IPv4 uses a 32-bit address space
• IPv6 uses a 128-bit address space
• IPv6 supports:
– a total of more than 3 x 10^38 addresses – OR –
– a total of 6 x 10^23 addresses for every square meter of the Earth’s surface
– Currently Internet routers must support both IPv4 and IPv6
– The conversion to IPv6 was slowed by the use of NAT routers (Network Address Translation)
– A NAT router listens to outbound data packets from local devices and
reroutes these packets to the global Internet, while rewriting the IP address
and port number in these packets.
– Each outgoing packet stream is assigned a unique port number. Incoming
packets are scanned for the port numbers and if these port numbers match
an existing communication stream or a port number in a fixed routing
table, the destination IP address and port number are rewritten and the
packet is forwarded to the internal device.
• References (IPv4)
– http://compnetworking.about.com/library/weekly/aa042400b.htm (IP Tutorial)
• References (IPv6):
– http://www.ipv6.org/
– http://www.pcsupportadvisor.com/nasample/c0655.pdf (Understanding IPv6 by
David Morton)
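The two address-space figures on this slide can be reproduced with a few lines of arithmetic. The Earth radius below is an assumed round value for the mean radius, and the planet is treated as a perfect sphere, so the per-square-meter result is only an order-of-magnitude check.

```python
import math

IPV4_ADDRESSES = 2 ** 32
IPV6_ADDRESSES = 2 ** 128

# Assumption: mean Earth radius of about 6,371 km, perfect sphere.
EARTH_RADIUS_M = 6.371e6
earth_surface_m2 = 4 * math.pi * EARTH_RADIUS_M ** 2

per_square_meter = IPV6_ADDRESSES / earth_surface_m2

print(f"IPv4 addresses: {IPV4_ADDRESSES:.2e}")
print(f"IPv6 addresses: {IPV6_ADDRESSES:.2e}")
print(f"IPv6 addresses per square meter: {per_square_meter:.2e}")
```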
IPv4 vs. IPv6
• IPv4 is 32 bits divided into four 8-bit segments, separated by dots. Each segment is a
number between 0 and 255
• IPv6 is 128 bits divided into eight 16-bit segments, separated by colons. Each segment is
a number between 0 and 2^16-1, written as 4 hexadecimal digits
• IPv6 uses 16 octets (128 bit addresses) instead of 4 (32-bit addresses)
• Here is an example
– 2001:0011:abcd:0000:0000:0000:0023:4567
– 2001 is the address type.
– The 0011:abcd defines the subnet. A /48 subnet is typical. That means that the first
48 bits of the IPv6 addresses you get are fixed, and you are free to assign values to
the other 80 bits for each of the devices in your network. In some cases a /64 subnet
is assigned to you where the first 64 bits are fixed.
– The ISP will route all traffic for destinations where the address begins with
2001:0011:abcd or 2001:0011:abcd:0000 to a single internet connection. The
connection must route these addresses further to internal devices.
– The difference between IPv4 and IPv6 is that instead of one IP address which can
point to only one device, we get a truckload full of IP addresses which we have to
route ourselves.
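The example address and the /48 arithmetic can be explored with Python's standard ipaddress module. This is only an illustration of the notation; the variable names are chosen for the sketch.

```python
import ipaddress

addr = ipaddress.IPv6Address("2001:0011:abcd:0000:0000:0000:0023:4567")

# The same 128-bit value in its full and shortened written forms.
print(addr.exploded)      # eight groups of four hex digits
print(addr.compressed)    # leading zeros dropped, one run of zero groups as "::"

# A /48 fixes the first 48 bits and leaves 80 bits to assign locally.
site = ipaddress.ip_network("2001:11:abcd::/48")
print(addr in site)
print(site.num_addresses == 2 ** 80)
```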
TCP/IP
• TCP/IP is a two-layer program.
– The higher layer, Transmission Control Protocol, manages
the assembling of a message or file into smaller packets
that are transmitted over the Internet and received by a
TCP layer that reassembles the packets into the original
message.
– The lower layer, Internet Protocol, handles the address
part of each packet so that it gets to the right
destination.
• Each gateway computer (router) on the network checks this
address to see where to forward the message. Even though
some packets from the same message are routed differently
than others, they'll be reassembled at the destination.
• TCP/IP solves several problems of network reliability
– if a router is overrun with packets, it discards them
– if a packet is lost, it re-requests it
• the receiver acknowledges receipt to the source
• the sender starts a timer and if no acknowledgement
is received it automatically resends the packet
• the sender’s timer uses a different time depending
upon the distance to the destination and current
internet traffic
– it reorders the packets into proper sequence
– it eliminates duplicate packets
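The reordering and duplicate-elimination steps can be sketched in a few lines. This is a toy model, not how TCP is implemented: real TCP numbers bytes rather than whole packets and also handles acknowledgements, timers and retransmission, none of which are shown. The names packetize and reassemble are made up for this illustration.

```python
import random

def packetize(message, size):
    """Split a message into (sequence_number, chunk) packets."""
    return [(seq, message[start:start + size])
            for seq, start in enumerate(range(0, len(message), size))]

def reassemble(packets):
    """Drop duplicates, restore sequence order, and join the chunks."""
    unique = {}
    for seq, chunk in packets:
        unique.setdefault(seq, chunk)    # a repeated sequence number is a duplicate
    return b"".join(unique[seq] for seq in sorted(unique))

message = b"The Internet and the WWW are different"
packets = packetize(message, 8)

in_flight = packets + [packets[2]]       # simulate a retransmitted duplicate
random.shuffle(in_flight)                # simulate packets taking different routes

print(reassemble(in_flight) == message)
```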
TCP Stack
TCP/IP is a Suite of Protocols
• Routing Protocols include
– IP (Internet Protocol) actual transmission of data
– ICMP (Internet Control Message Protocol) handles messages for IP
– RIP (Routing Information Protocol) determines best routing
– OSPF (Open Shortest Path First) an alternate delivery method
• Network Address Protocols include
– ARP (Address Resolution Protocol) determines the unique numeric
addresses of machines on the network
– DNS (Domain Name System) determines numeric addresses from
machine names
– RARP (Reverse Address Resolution Protocol) determines the
addresses of machines on the network, but in a reverse order
from ARP
• User based services
– BootP (Boot Protocol) boots a network by reading info from a
server
– FTP (File Transfer Protocol) allows transfer of files across the
network
– Telnet, used to remotely log in to another machine
• Gateway based services
– EGP (Exterior Gateway Protocol) governs the transfer of routing
information for external networks
– GGP (Gateway-to-Gateway Protocol) handles routing of information
between gateways
– IGP (Interior Gateway Protocol) handles routing of info for
internal networks
Layering of TCP/IP Protocols
application layer: HTTP FTP TELNET NFS/RPC DNS SNMP
transport layer: TCP UDP
network layer: IP
data link layer
Open Systems Interconnect (OSI) Reference Model includes 7 layers: application,
presentation, session, transport, network, data link and physical.
(Note: use Wireshark, a network protocol analyzer, to show packets at each
layer. See http://www.wireshark.org/)
HTTP/HTTPS Protocol Stacks
From HTTP: The Definitive Guide, by David Gourley, Brian Totty
Internet Domain Names
• The Domain Name System is a mapping between IP
addresses and domain names
– defined in RFC 1034, 1035, see e.g.
– http://www.faqs.org/rfcs/rfc1035.html
– Invented in 1983 by Paul Mockapetris, see
http://en.wikipedia.org/wiki/Domain_name_system
• There are 13 top level root name servers, see
– www.dns.net/dnsrd/tld.html
• ICANN is the organization in charge of maintaining
the DNS system, see
– www.icann.com
Top Level Domain Names
• Top level domains were originally divided into the
following logical categories
– com commercial and industrial organizations
– edu educational institutions
– gov non-military, government affiliated
organizations
– mil military organizations
– net network operations
– org other organizations and user groups
• new top level domains have been added
– .biz, .info, .name, .museum, .coop, .aero, .pro, .xxx
• www.internic.net/faqs/new-tlds.html
• In Oct. 2009 ICANN agreed to accept internationalized
domain names, encoded as Unicode:
– see http://www.icann.org/en/topics/idn/fast-track/
• In 2011 ICANN agreed to offer Generic Top Level domains:
– see http://www.icann.org/en/tlds/select.htm or the movie
at http://newgtlds.icann.org/announcements-and-media/video/overview-en
Domain Names Outside the US
• Countries append their 2 letter country code.
Two letter codes are maintained as an ISO 3166
standard. Here is a sample
AFGHANISTAN AF
ALBANIA AL
ALGERIA DZ
AMERICAN SAMOA AS
ANDORRA AD
ANGOLA AO
ANGUILLA AI
ANTARCTICA AQ
ANTIGUA & BARBUDA AG
ARGENTINA AR
ARMENIA AM
ARUBA AW
AUSTRALIA AU
AUSTRIA AT
AZERBAIJAN AZ
BAHAMAS BS
BAHRAIN BH
BANGLADESH BD
BARBADOS BB
BELARUS BY
BELGIUM BE
BELIZE BZ
BENIN BJ
BERMUDA BM
BHUTAN BT
BOLIVIA BO
BOSNIA AND HERZ. BA
BOTSWANA BW
BOUVET ISLAND BV
BRAZIL BR
DNS Resolution
• The DNS protocol is an important part of the web's infrastructure
• Every time you visit a website, your computer performs a DNS lookup
• Complex pages often require multiple DNS lookups before they start loading,
so your computer may be performing hundreds of lookups a day
• DNS latency is mainly due to
– The round-trip time to make the request and get the response, due to
network congestion, overloaded servers, denial-of-service attacks
– Cache misses which cause recursive querying of other name servers
• Google has introduced a Public DNS
– Configure your network to use 8.8.8.8 and 8.8.4.4
– Google handles more than 70 billion requests a day!
– Google also has IPv6 addresses
• 2001:4860:4860::8888 and 2001:4860:4860::8844
– http://code.google.com/speed/public-dns/docs/intro.html
• Another alternative is opendns.com
– They have a global network of DNS resolvers to speed resolution
– The base service is free, but upgrades cost
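The effect of cache hits and misses on lookup cost can be sketched with a small caching wrapper around a lookup function. This is a simplified illustration: the class name CachingResolver and the stand-in fake_lookup are made up here, and real resolvers also honor a time-to-live on each record, which is not modeled. By default the wrapper would call the standard socket.gethostbyname; the stand-in lets the example run without network access.

```python
import socket

class CachingResolver:
    """Remember answers so that repeated lookups avoid another round trip."""

    def __init__(self, lookup=socket.gethostbyname):
        self._lookup = lookup      # function that maps a host name to an address
        self._cache = {}
        self.misses = 0

    def resolve(self, hostname):
        if hostname not in self._cache:
            self.misses += 1       # cache miss: pay for a real lookup
            self._cache[hostname] = self._lookup(hostname)
        return self._cache[hostname]

# Hypothetical stand-in for a real DNS query, used only for this example.
def fake_lookup(hostname):
    return {"pollux.usc.edu": "128.125.7.29"}[hostname]

resolver = CachingResolver(lookup=fake_lookup)
first = resolver.resolve("pollux.usc.edu")    # miss: calls fake_lookup
second = resolver.resolve("pollux.usc.edu")   # hit: answered from the cache

print(first, second, resolver.misses)
```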
DNS Resolution
• The chart shows the times spent
loading a page where black
represents DNS resolution, gray
represents connection waiting,
yellow represents connection, red is
JavaScript parsing, and blue is
JavaScript execution.
• There are 13 calls to the DNS
resolver and 5 of them are serial
lookups accounting for several
seconds of the total 11 seconds spent
loading the page
http://code.google.com/speed/public-dns/docs/performance.html
Internet Statistics
Conclusion: the .net and .com
categories are the largest
followed by Japan, Italy and
Brazil
Distribution of Top-Level Domain Names
by Host Count, July 2011,
at http://ftp.isc.org/www/survey/reports/2011/01/bynum.txt
Above shows 99 million .com sites out of
a total 135 million or roughly 73% of the total
See http://www.domaintools.com/internet-statistics
Who Controls Internet Domain Names
• Granting of domain names is done by a registrar
• Registrars must be approved by ICANN,
www.icann.org, the Internet Corporation for
Assigned Names and Numbers
• Currently there are more than 100 registrars
assigning domain names for *.com, *.org, and
*.net
• All domain name registrars share their
information with the domain name registry,
which for com/net/org is Network Solutions,
see:
– http://www.networksolutions.com/
Internet Domain Name Registrars
• There are three key software systems a registrar must
implement
– A Whois service checks if a domain name is already
registered and returns registration information
• This is a client/server application that
interfaces with NSI’s database; a read-only
operation
– A Shared Registry System reserves a name for the
registrar
• This is a client/server application that
interfaces with NSI’s database; a write
operation
– A local database maintains customer accounts,
domain names, etc.
Internet Traffic
• How efficiently is the Internet working now?
– http://www.internettrafficreport.com/
– http://netflow.internet2.edu/
Internet2 is a project to develop new technologies
for high-performance computer networking. It is
led by a consortium of 206 universities.
While specifically developed to
facilitate research and educational purposes,
the involvement of research, commercial and
government organizations also aims to distribute
these technologies into the wider community.
The tables below show the type and amount of
traffic
Data Transfers are 41%
HTTP is approx 39%
HTTPS is approx 48% of
encrypted traffic
Useful Routines
• traceroute indicates the path of a packet from
source to destination
pollux.usc.edu(19):/usr/sbin/traceroute doc.ic.ac.uk
Try tracert on Windows
• ping sends a packet and waits for a response;
determines if the site is up
pollux.usc.edu(35):/usr/sbin/ping mit.edu
• nslookup will return the IP address given the
domain name, and vice-versa
nslookup pollux.usc.edu returns
Name: pollux.usc.edu
Addresses: 128.125.7.29
How the Internet Functions Today
Wide Area Backbone, e.g. AT&T, SPRINT
Regional Provider, e.g. Los Nettos | Regional Provider | Regional Provider, e.g. Earthlink
Local | Local | Local | Local | Local
Defining the World Wide Web
• A wide-area hypertext, multimedia information
retrieval system that provides access to a
large universe of documents
• A uniform way of accessing and viewing some
information on the Internet
• The WWW
– creates a world in which information has a
reference by which it can be accessed
– subsumes the capabilities of ftp, gopher,
wais and news
Graphical View of the WWW
[Figure: a browser computer, the Internet, an intranet, four Web servers and four data sources]
Major Technology Components
• Client/server architecture
– where client programs interact with web
servers
• Network protocol
– HTTP, Hypertext Transfer Protocol, is the
language understood by browsers and web
servers
– designed to move quickly from document to
document
• Addressing system (Uniform Resource Locators)
– http://domain/directory/file.html
• Markup Language
– every web server understands and every browser
displays
– includes support for HyperText and multimedia
Client/Server Architecture Model
Multiplatform browsers (clients): Terminals, PCs, Macs, X Windows
Mechanisms: Addressing scheme (URL) + Protocols (HTTP, etc.) + Format Negotiation (MIME)
Servers for each protocol: HTTP server, FTP server, Gopher server, NNTP server
The WWW Server
• Web browsers and Web servers communicate according to
a protocol known as HTTP (HyperText Transfer
Protocol)
– The current HTTP protocol is version 1.1
• The Web server is a software system running on a
machine that is itself often called the Web server;
don’t confuse them
• A web server can
– receive and reply to HTTP requests
– retrieve documents from specified directories
– run programs in specified directories
– handle limited forms of security
• A web server does not
– know about the contents of a document, links in a
document, images in a document or whether a
particular file, e.g. a *.gif file, is in the
correct format
Uniform Resource Locator (URL)
• A mechanism whereby an Internet resource can be
specified in a single line of ASCII text
• See RFC 1738: http://www.faqs.org/rfcs/rfc1738.html
URL                                    Refers to:
file://pub/xt.ps                       a PostScript file in directory pub on your local machine
ftp://usc.edu/docs/sweng.txt           file sweng.txt in directory docs on usc.edu, an anonymous ftp site
http://nunki.usc.edu/mydocs/book.doc   a file in directory mydocs on machine nunki.usc.edu, a WWW site
news:comp.compilers                    the newsgroup computers.compilers
mailto:horowitz@usc.edu                an e-mail address
General Description of a URL
1. Scheme followed by a colon
http:,ftp:,gopher:,news:,mailto:,wais:,telnet:
2. Double slash (only for http, ftp, gopher,
wais) //
3. Internet domain name e.g., pollux.usc.edu
4. Port number (this field is optional; e.g.,
pollux.usc.edu:8081)
Standard or default port numbers:
--- ftp is 21 gopher is 70
--- telnet is 23 http is 80
--- smtp is 25 nntp is 119
--- imap is 143 secure nntp is 563
--- pop3 is 110 secure pop3 is 995
5. Path e.g., /pub/docs
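The five parts listed above can be pulled out of a URL with Python's standard urllib.parse module. In this sketch the DEFAULT_PORTS table copies a few entries from the slide and describe is a helper named for the illustration; it fills in the default port when the URL does not give one.

```python
from urllib.parse import urlsplit

# A few of the default ports listed on the slide.
DEFAULT_PORTS = {"ftp": 21, "telnet": 23, "gopher": 70, "http": 80}

def describe(url):
    """Return (scheme, domain name, port, path) for a URL."""
    parts = urlsplit(url)
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(parts.scheme)
    return parts.scheme, parts.hostname, port, parts.path

print(describe("http://pollux.usc.edu:8081/pub/docs"))
print(describe("ftp://usc.edu/docs/sweng.txt"))
```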
URL Character Set
• RFC 1738, Dec. 1994 defines the URL character set as
"...Only alphanumerics [0-9a-zA-Z], the special characters "$-_.+!*'()," [not including the
quotes], and reserved characters used for their reserved purposes may be used
unencoded within a URL."
• However, HTML supports ISO-8859-1 (ISO-Latin) character set
– HTML 4.x extends the character set to all of Unicode
• Therefore, in URLs an escape mechanism is used, % followed by two
hex digits
• Characters that should be encoded include:
%, /, ., .., #, ?, ;, :, $, +, @, &, =
• Here are some encoded values for so-called “unsafe” characters
~      %7E    |  %7C
SPACE  %20    \  %5C
%      %25    ^  %5E
&      %26    [  %5B
=      %3D    ]  %5D
?      %3F    #  %23
{      %7B    >  %3E
}      %7D    <  %3C
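The escape mechanism is available in Python's standard library as urllib.parse.quote and unquote. The wrapper percent_encode below is named for this illustration; passing safe="" asks quote to escape "/" as well. One difference from the table: current Python versions follow the later RFC 3986 and leave "~" unencoded, although "%7E" still decodes to "~".

```python
from urllib.parse import quote, unquote

def percent_encode(text):
    """Escape everything except letters, digits and the characters _ . - ~"""
    return quote(text, safe="")

print(percent_encode("a b&c=d?"))    # space, &, = and ? become %XX escapes
print(percent_encode("100%"))        # the percent sign itself must be escaped
print(unquote("%7Ehome%20page"))     # decoding reverses the escapes
```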
Markup Languages
• HTML - hypertext markup language, specifies
document layout and the specification of
hypertext links to text, graphics and other
types of objects
• Browsers display text and graphics using the
markup as guidance
• However, HTML is not like a word processing
program, e.g. Microsoft Word or WordPerfect,
and not like a page description language, e.g.
PostScript
– as a result, translation into HTML can
produce a result that does not look exactly
like the original
What is HyperText?
•Regular text, with the additional feature of links
to related documents
•As you read documents and follow links, you
traverse a “web” of interconnections
Emancipation
Proclamation
... all persons found as
slaves within any State, ...
Declaration of
Independence
When in the course of
human events it becomes
necessary for one ...
Gettysburg Address
by A. Lincoln
Fourscore and seven years ago, our
fathers brought forth upon this
continent a new nation, conceived in
liberty and dedicated to the
proposition that all men are created
equal. We are now engaged in a
great Civil War, testing whether that
nation or any other nation so
conceived and so dedicated can long
endure.
War Between the
States by Eric
Barnes,
McGraw-Hill
The WWW Data Model
[Figure: example pages (Yahoo, a University of Wisconsin home page, a search for faculty, Professor John Smith’s home page with My Courses and My Research, an index of material on trains, a description of a specific train) connected by hyperlinks, links, and search/submit forms; pages are marked as static or dynamic]
A directed graph where nodes are documents, edges
are links
Graph Structure and the Web
• Nodes = static web pages (~1+ billion)
• Edges = static hyperlinks (~10 billion)
• It’s a sparse graph: ~7 links/page on average
• Some Questions
– is the web connected? can we always traverse from
one page to any other
– can link connectivity improve the results of
search engines?
– if we watch the web graph change over time, what
does that tell us about social processes
• Reference: http://www9.org/w9cdrom/160/160.html
Some Graph Algorithms
• Weakly connected components (WCC)
– a maximal subgraph of a directed graph such that for
every pair of vertices u, v in the subgraph, there is an
undirected path between u and v (a path that ignores
the direction of the edges)
• Strongly connected component (SCC)
– A maximal subgraph of a directed graph such that for
every pair of vertices u, v in the subgraph, there is a
directed path from u to v and a directed path from v to
u.
• Algorithms for the above all exist in linear time
• A Graph's diameter
– The length of the "longest shortest path" between any
two graph vertices, OR the largest number of vertices
which must be traversed in order to travel from one
vertex to another when paths which backtrack, detour, or
loop are excluded from consideration
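The definitions above can be tried out on a toy graph. The sketch below uses Kosaraju's two-pass algorithm for strongly connected components, which is one of several linear-time methods, and a simple breadth-first search from every node for the diameter. The function names and the four-node example graph are made up for this illustration, and the recursive first pass would need to be made iterative for graphs of any real size.

```python
from collections import deque

def strongly_connected_components(graph):
    """Kosaraju's algorithm; graph maps each node to a list of successors."""
    order, seen = [], set()

    def visit(node):                      # pass 1: record finishing order
        seen.add(node)
        for nxt in graph.get(node, []):
            if nxt not in seen:
                visit(nxt)
        order.append(node)

    for node in graph:
        if node not in seen:
            visit(node)

    reverse = {node: [] for node in graph}
    for node, successors in graph.items():
        for nxt in successors:
            reverse.setdefault(nxt, []).append(node)

    components, assigned = [], set()
    for root in reversed(order):          # pass 2: search the reversed graph
        if root in assigned:
            continue
        component, stack = set(), [root]
        assigned.add(root)
        while stack:
            node = stack.pop()
            component.add(node)
            for prev in reverse.get(node, []):
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        components.append(component)
    return components

def diameter(graph):
    """Longest shortest directed path, over pairs that are connected at all."""
    longest = 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nxt in graph.get(node, []):
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        longest = max(longest, max(dist.values()))
    return longest

# Toy graph: a -> b -> c -> a forms a cycle, and c also links to d.
toy = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": []}
components = strongly_connected_components(toy)
print(components)
print(diameter(toy))
```

The diameter function runs one search per node, so its cost grows roughly with the number of nodes times the number of edges.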
Challenges of Scale
• A typical algorithm to compute the diameter of
a graph requires a number of steps
~(nodes * edges), or ~(pages * links)
• For 1 billion pages, 10 billion links, and 0.10
microseconds/step we need ~1 trillion (10^12) seconds or
about 10 million days
• Results of a May 1999 crawl at Alta Vista
– 220 million pages after duplicates are eliminated
– Giant WCC has ~186 million pages
– Giant SCC has ~56 million pages
– Cannot browse your way from any page to any other
– Next biggest SCC has ~150K pages
– Other crawls produce similar results
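The running-time estimate for the diameter computation on this slide can be worked through directly, using the slide's own figures for pages, links and time per step. With these inputs the total comes to about 10^12 seconds, which is on the order of ten million days.

```python
PAGES = 1_000_000_000          # ~1 billion nodes
LINKS = 10_000_000_000         # ~10 billion edges
SECONDS_PER_STEP = 0.10e-6     # 0.10 microseconds per step

steps = PAGES * LINKS          # ~(nodes * edges)
seconds = steps * SECONDS_PER_STEP
days = seconds / 86_400        # 86,400 seconds in a day
years = days / 365

print(f"steps:   {steps:.1e}")
print(f"seconds: {seconds:.1e}")
print(f"days:    {days:,.0f}")
print(f"years:   {years:,.0f}")
```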
Reachability Question
• How many pages are reachable from a random
page?
• Start at page p
– get its neighbors and put them on a list
– follow the neighbors, repeating the
process, watching for loops and marking dead
ends
• Keep track of the number of pages reached from
p, as a function of the distance d
• Experiment: start at 1,000 random pages and for
each build BFS profiles
• Results:
– either dies quickly (~100 pages reached)
– or explodes and reaches ~100 million pages
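The procedure described above is a breadth-first search that counts how many pages are first reached at each distance d. Here is a minimal sketch on a made-up five-page graph; the function name bfs_profile and the graph are assumptions for this illustration, not the code used in the experiment.

```python
from collections import deque

def bfs_profile(graph, start):
    """Return [pages first reached at distance 0, 1, 2, ...] from start."""
    distance = {start: 0}
    profile = [1]                          # the start page itself, at distance 0
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for neighbor in graph.get(page, []):
            if neighbor in distance:       # already reached: a loop or shared link
                continue
            d = distance[page] + 1
            distance[neighbor] = d
            if d == len(profile):          # first page found at this distance
                profile.append(0)
            profile[d] += 1
            queue.append(neighbor)
    return profile

# Toy graph: each page maps to the pages it links to; "s" is a dead end.
toy_web = {"p": ["q", "r"], "q": ["s"], "r": ["s", "p"], "s": [], "t": ["p"]}

for page in ("p", "s", "t"):
    profile = bfs_profile(toy_web, page)
    print(page, profile, "total reached:", sum(profile))
```

Starting from the dead-end page the search stops immediately, while starting from "t" it reaches every page, a small-scale version of the "dies quickly or explodes" result.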
Web Anatomy
one can pass from any node of IN through SCC to any node of OUT.
Hanging off IN and OUT are TENDRILS containing nodes that are
reachable from portions of IN, or that can reach portions of OUT,
without passage through SCC. It is possible for a TENDRIL hanging
off from IN to be hooked into a TENDRIL leading into OUT, forming
a TUBE -- a passage from a portion of IN to a portion of OUT without
touching SCC
Early History of the WWW
• 1989-1990 Tim Berners-Lee conceives the WWW at CERN in
Geneva
• 11/90 Berners-Lee releases WWW prototype on NeXT
computer
• 01/92 Release of source code for line mode browser,
lynx and HTTP
• 03/93 Mosaic browser from NCSA is released
• 09/93 WWW internet traffic now measures 1% of NSF
backbone
• 12/94 Netscape Navigator 1.0 is released
World Wide Web Consortium formed
• 08/95 Microsoft Windows 95 and Internet
Explorer 1.0 released
• 12/95 Java is released
• 11/04 Firefox 1.0 is released
• 09/08 Google Chrome is released (beta; version 1.0 follows in 12/08)
See http://www.w3.org/History.html and Tim Berners-Lee’s presentation at the 10th
anniversary, http://www.w3.org/2004/Talks/w3c10-HowItAllStarted/?n=1
Recent WWW Developments
• Browsers continue to be enhanced
– Microsoft develops Internet Explorer 6-9, and 10
(beta on Windows 8) including support for ActiveX,
Active Server Pages, and .NET, and special tools
such as Expression Web
– Netscape opens the source for Navigator producing
Netscape 7.x-8.1, followed by Mozilla and Firefox
– Netscape browser “killed” by AOL on 12/28/2007.
– Opera available on Windows, Mac OS X and Java-
based cell phones and PDAs, as Opera Mini
– Apple Safari (WebKit) available on Mac OS X,
Windows, smartphones and tablets (Apple iPhone,
iPod Touch, iPad, Android, Nokia Symbian)
– Google releases Chrome browser, 2008 (WebKit)
• Other interesting technologies
– multimedia streaming, e.g. Adobe Flash, Microsoft
Silverlight, and now HTML5/H.264 (discussed later)
– Application servers, e.g. IBM's WebSphere, BEA
Weblogic
WWW Consortium
• Founded in 10/94, headed by Tim Berners-Lee,
http://www.w3.org
• Goal: “to lead the World Wide Web to its full
potential by developing common protocols that
promote its evolution and ensure its
interoperability.”
• Many of the technologies guided by the WWW
consortium will be discussed this semester:
– HTML, Style Sheets, Document Object Model,
international character sets, HTTP, XML,
etc.