Building Scalable Web Architectures


Building Scalable Web Architectures, by Aaron Bannert. Goal: to build a reliable, scalable, cheap, flexible, extendable internet application.


<ul><li><p>Building Scalable Web Architectures</p><p>Aaron Bannert</p></li><li><p>Goal</p><p>To build a reliable, scalable, cheap, flexible, extendable internet application.</p></li><li><p>The Age of LAMP</p><p>What does a LAMP architecture give us?</p></li><li><p>Scalability</p><p>Grows in small steps; Stays up when it counts; Can grow with your traffic; Room for the future</p></li><li><p>Reliability</p><p>High Quality of Service; Minimal Downtime; Stability; Redundancy; Resilience</p></li><li><p>Low Cost</p><p>Little or no software licensing costs; Minimal hardware requirements; Abundance of talent; Reduced maintenance costs</p></li><li><p>Flexible</p><p>Modular Components; Public APIs; Open Architecture; Vendor Neutral; Many options at all levels</p></li><li><p>Extendable</p><p>Free/Open Source Licensing: Right to Use, Right to Inspect, Right to Improve. Plugins: Some Free, Some Commercial. Can always customize.</p></li><li><p>Free as in Beer?</p><p>Price, Speed, Quality</p><p>Pick any two.</p></li><li><p>LAMP-like Architectures</p></li><li><p>The Big Picture</p></li><li><p>External Caching Tier</p></li><li><p>Web Serving Tier</p></li><li><p>Application Server Tier</p></li><li><p>Internal Cache Tier</p></li><li><p>Database Tier</p></li><li><p>Misc. Services (DNS, Mail, etc)</p></li><li><p>The Glue</p><p>Routers, Switches, Firewalls, Load Balancers</p></li><li><p>Software Choices</p><p>Building LAMP Software</p></li><li><p>External Caching Tier</p></li><li><p>External Caching Tier: What is this?</p><p>Squid; Apache's mod_proxy; Commercial HTTP Accelerator</p></li><li><p>External Caching Tier: What does it do?</p><p>Caches outbound HTTP objects (Images, CSS, XML, HTML, etc.). Flushes Connections (useful for modem users, frees up the web tier). Denial of Service Defense.</p></li><li><p>External Caching Tier: Hardware Requirements</p><p>Lots of Memory; Moderate to little CPU; Fast Network; Moderate Disk Capacity (room for cache, logs, etc.; disks are cheap); One slow disk is OK</p><p>Two Cheapies &gt; One Expensive</p></li><li><p>External Caching Tier: Other Questions</p><p>What to cache? How much to cache? Where to cache (internal vs. 
external)?</p></li><li><p>Web Serving Tier</p></li><li><p>Web Serving Tier: What is this?</p><p>Apache; thttpd; Tux Web Server; IIS; Netscape</p></li><li><p>Web Serving Tier: What does it do?</p><p>HTTP, HTTPS. Serves Static Content from disk. Generates Dynamic Content (CGI/PHP/Python/mod_perl/etc.). Dispatches requests to the App Server Tier (Tomcat, Weblogic, Websphere, JRun, etc.).</p></li><li><p>Web Serving Tier: Hardware Requirements</p><p>Lots and lots of Memory (memory is the main bottleneck in web serving; memory determines the max number of users). Fast Network. CPU depends on usage (dynamic content needs CPU; static file serving requires very little CPU). Cheap slow disk, enough to hold your content.</p></li><li><p>Web Serving Tier: Zero-copy</p><p>Performance Hint: Dedicated static content servers. Modern web servers are very good at serving static content such as HTML, CSS, Images, and Zip/GZ/Tar files.</p></li><li><p>Web Serving Tier: Performance Hint</p><p>Stateless Sessions: each connection is a fresh start; the server remembers nothing. Benefits? Allows Better Caching; Scales Horizontally.</p></li><li><p>Web Serving Tier: Choices</p><p>How much dynamic content? When to offload dynamic processing? When to offload database operations? When to add more web servers?</p></li><li><p>Application Server Tier</p></li><li><p>Application Server Tier: What does it do?</p><p>Dynamic Page Processing (JSP; Servlets; Standalone mod_perl/PHP/Python engines). Internal Services, e.g. 
Search, Shopping Cart, Credit Card Processing</p></li><li><p>Application Server Tier: How does it work?</p><p>Web Tier generates the request using HTTP (aka REST, sort of), RPC/CORBA, Java RMI, XML-RPC/SOAP (or something homebrewed). App Server processes the request and responds.</p></li><li><p>Application Server Tier: Caveats</p><p>Decoupling of services is GOOD: manage complexity using well-defined APIs. Don't decouple for scaling, change your algorithms! Remote calling overhead can be expensive: marshaling of data; sockets, net latency, throughput constraints; XML, SOAP, XML-RPC, yuck (don't scale well). Better to use Java's RMI, good old RPC or even CORBA.</p></li><li><p>Application Server Tier: More Caveats</p><p>Remote calling can introduce new failure scenarios. Classic Distributed Problems: How to detect remote failures? How long to wait until deciding it's failed? How to react to remote failures? What do we do when all app servers have failed?</p></li><li><p>Application Server Tier: Hardware Requirements</p><p>Lots and Lots and Lots of Memory (App Servers are very memory hungry; Java was hungry to begin with; consider going to 64bit for larger memory-space). Disk depends on application, typically minimal needed. FAST CPU required, and lots of them. (This will be an expensive machine.) 
</p></li><li><p>Database Tier</p></li><li><p>Database Tier: Available DB Products</p><p>Free/Open Source DBs: PostgreSQL, MySQL, SQLite, mSQL, Berkeley DB, GNU DBM, Ingres</p><p>Commercial: Oracle, MS SQL, IBM DB2, Sybase, SleepyCat</p></li><li><p>Database Tier: What does it do?</p><p>Data Storage and Retrieval. Data Aggregation and Computation (Sorting, Filtering). ACID properties (Atomic, Consistent, Isolated, Durable).</p></li><li><p>Database Tier: Choices</p><p>How much logic to place inside the DB? Use Connection Pooling? Data Partitioning? (Spreading a dataset across multiple logical database slices in order to achieve better performance.)</p></li><li><p>Database Tier: Hardware Requirements</p><p>Entirely dependent upon application. Likely to be your most expensive machine(s).</p><p>Tons of Memory. Spindles galore. RAID is useful (in software or hardware): Reliability usually trumps Speed; RAID levels 0, 5, 1+0, and 5+0 are useful. CPU also important. Dual power supplies. Dual Network.</p></li><li><p>Internal Cache Tier</p></li><li><p>Internal Cache Tier: What is this?</p><p>Object Cache. What Applications? Memcache; Local Lookup Tables (BDB, GDBM, SQL-based); Application-local Caching (e.g. LRU tables); Homebrew Caching (disk or memory).</p></li><li><p>Internal Cache Tier: What does it do?</p><p>Caches objects closer to the Application or Web Tiers; Tuned for your application; Very Fast Access; Scales Horizontally</p></li><li><p>Internal Cache Tier: Hardware Requirements</p><p>Lots of Memory (note that 32bit processes are typically limited to 2GB of RAM); Little or no disk; Moderate to low CPU; Fast Network</p></li><li><p>Misc. Services (DNS, Mail, etc)</p></li><li><p>Misc. Services (DNS, Mail, etc): Why mention these?</p><p>Every LAMP system has them; Crucial but often overlooked; Source of hidden problems</p></li><li><p>Misc. Services: DNS</p><p>Important Points: Always have an offsite NS slave. Always have an onsite NS slave. Minimize network latency (don't use NAT, load balancers, etc.).</p></li><li><p>Misc. 
Services: Time Synchronization</p><p>Synchronize the clocks on your systems! Hints: Use ntpdate at boot time to set the clock. Use ntpd to stay in sync. Don't ever change the clock on a running system!</p></li><li><p>Misc. Services: Monitoring</p><p>System Health Monitoring (Nagios; Big Brother; Orcalator; Ganglia). Fault Notification.</p></li><li><p>The Glue</p><p>Routers, Switches, Firewalls, Load Balancers</p></li><li><p>Routers and Switches</p><p>Expensive; Complex; Crucial Piece of the System</p><p>Hints: Use GigE if you can; Jumbo Frames are GOOD; VLANs to manage complexity; LACP (802.3ad) for failover/redundancy</p></li><li><p>Load Balancers: Hardware vs. Software</p><p>Software is complex to set up, but cheaper. Hardware is expensive, but dedicated. IMHO: Use SW at first, graduate to HW.</p></li><li><p>Load Balancers: What services to balance?</p><p>HTTP Caches and Servers, App Servers, DB Slaves. What NOT to balance? DNS; LDAP; NIS; Memcache; Spread; Anything with its own built-in balancing.</p></li><li><p>Message Busses: What is out there?</p><p>Spread; JMS; MQSeries; Tibco Rendezvous</p><p>What does it do? Various forms of distributed message delivery (Guaranteed Delivery, Broadcasting, etc.). Useful for heterogeneous distributed systems.</p></li><li><p>What about the OS?</p><p>Operating System Selection</p></li><li><p>Lots of OS choices</p><p>Linux; FreeBSD; NetBSD; OpenBSD; OpenSolaris; Commercial Unix</p></li><li><p>What's Important?</p><p>Maintainability (Upgrade Path; Security Updates; Bug Fixes). Usability (do your engineers like it?). Cost (Hardware Requirements; you don't need a commercial Unix anymore).</p></li><li><p>Features to look for</p><p>Multi-processor Support; 64bit Capable; Mature Thread Support; Vibrant User Community; Support for your devices</p></li><li><p>Hardware Choices</p><p>Building LAMP Hardware</p></li><li><p>Commodity Hardware Discussion</p><p>Consistency vs. 
Specialization</p><p>Consistency reduces maintenance costs: Less Burn-in testing; Fewer drivers to support; Fewer OS variants; Fewer types of security updates, upgrades. In short: Don't throw hardware at the problem. However, specialization may improve ROI: put the money where best needed.</p></li><li><p>Commodity Hardware Discussion</p><p>What I do when planning for growth: Specialize in the beginning (when cost is more important and designs aren't yet mature). Design for horizontal scalability (plan on machine-sized pieces; want to grow by just adding more boxes). Eventually settle on two or three machine types.</p></li><li><p>In-House vs. Colocation</p><p>Almost no reason to stay in-house these days</p><p>Colos keep getting cheaper; Leased lines are still expensive</p></li><li><p>Beige-Box vs. Name Brand</p><p>Determine your Reqs ahead of time (talk to your engineers First; how important is a support plan?; hardware will break, plan on it). Name Brand usually has fewer options (works well if they have exactly what you need). Seek a neutral technical advisor. In the end it should come down to cost.</p></li><li><p>Disk Drive Technologies</p><p>SCSI: Expensive; Big (300GB); Fast; Reliable</p><p>IDE: Cheap; Huge (500GB!); Slow; On-board support, often w/ RAID0/1</p><p>Use SCSI for Performance. Use IDE for cluster nodes. IDE w/ RAID for cheap speed.</p></li><li><p>Disk Drive Technologies</p><p>SATA: Immature drivers (particularly w/ OSS; Linux has poor support); Prices coming down; Unnecessary addons (Hot Swap not often needed, costs more)</p><p>PATA: Tried and Tested; Obsolete</p><p>SATA is not SCSI. The fast SATAs cost as much as SCSI. SATA not quite there for servers.</p></li><li><p>Disk Drive Technology: Spindles</p><p>Number of Spindles: More spindles can give Higher Throughput and Higher Concurrency (concurrency is crucial for Databases). Reliability: Failover drives, mirrors.</p></li><li><p>Memory Technologies</p><p>ECC: Expensive</p><p>Use only for keystone machines</p><p>Non-ECC: Cheap, Fast</p><p>Use for cluster nodes</p></li><li><p>Processor Technologies: SMP</p><p>Multiple Processors. Significantly higher cost: ECC, Dual Power, Expensive Chassis, 
Motherboard. Less Reliable (more parts to break). Requires MP-capable OS (good in Linux 2.6, Solaris, FreeBSD 5.x).</p><p>Dual CPU systems cost more than double. Possible exception: Dual-Core CPUs</p></li><li><p>Processor Technologies: 64bit</p><p>Most 32bit OSes limit each process to 2GB. Some 32bit BIOSes are limited to 3.6GB RAM. 64bit chips are still expensive. 64bit OSes are becoming quite mature: Solaris 10 (AMD64), Linux 2.6 (x86_64). Programs work but are not yet tuned: Java looks good, MySQL not so good.</p></li><li><p>Summary</p></li><li><p>Design for Horizontal Scalability</p><p>Design Stateless Systems; Decouple Internal Services; Write well-defined APIs</p></li><li><p>Use Commodity Parts</p><p>Standardize Hardware; Use Commodity Software (Open Source!); Avoid Fads</p></li><li><p>THE END</p><p>Thank You</p><p>© 2005 Aaron Bannert. Except where otherwise noted, this presentation is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 2.5 License.</p><p>The scalability of large systems is a difficult aspect to quantify. In an ideal world, every component of a complex system would be tested to measure its capacity. In the real world, the picture is vaguer. It's still important to measure your system under load and under artificial conditions designed to mimic real-world usage. In the end, what this really means is confidence that your system will grow as your business grows.</p><p>Reliability is very easy to measure. The common metric in our industry is uptime: the percentage of time the system is available over a given period (typically 12 months).</p><p>Not everyone needs five-nines, but the amount of resources you dedicate to keeping your system up and running will directly affect the experience of your users.</p><p>One of the fundamental aspects of LAMP is that it's built from Open Source Components. 
Open source components are typically provided for free (as in beer), even for commercial uses.</p><p>Another benefit of using open source components is the availability of talented engineers experienced in the various technologies. Since the tools themselves are open and available, the learning curve is often much gentler than with closed source alternatives.</p><p>Using commodity parts such as open source allows builders to rely on the abundance of documentation and shared knowledge, as well as the shared bug reports of the open source community. All of these theoretically combine to allow greatly reduced maintenance costs.</p><p>Using components that share open and well publicized APIs also improves the flexibility of the system. The parts of the system are modular and interchangeable, allowing for many competitive choices (most of which are free).</p><p>Having the source to a commodity piece of software has been a boon for system developers. Bugs can be found and squashed. Improvements can be made and are often contributed back to the system.</p><p>Note that not all open source licenses allow for use, inspection, and improvements, so pay attention to those licenses.</p><p>Alas, everything has a price.</p><p>There are tradeoffs in everything, including software. LAMP, unfortunately, doesn't solve this. But it does make some outstanding improvements in each area of Price, Speed and Quality.</p><p>LAMP stands for Linux, Apache, MySQL and PHP (or Python).</p><p>At each level there are many alternatives, which I will go into more later on. First, let's review the big picture.</p><p>This is what a typical Web Architecture looks like these days.</p><p>This is the External Cache, used as the gateway to the system.</p><p>The Web Servers dish up all the content and dispatch requests to deeper systems.</p><p>The Application Servers process dynamic requests and perform other heavy lifting. This is the brains of the beast.</p><p>The Internal Cache is where Application-specific objects can be stored for short periods of time. 
It's useful for accelerating expensive repetitive operations such as database lookups.</p><p>This is the heart of the beast. The database maintains consistency throughout the entire system, persists data, and maintains order.</p><p>Within any complex Web Architecture there are numerous support systems. These systems, although often forgotten, are essential components. We will look at various subsystems that in one way or another contribute to the overall scalability or reliability of the system.</p><p>And between all of these systems is the communication subsystem, a component in and of itself.</p><p>Now let's talk about each of these layers in detail, look at what options we have, what the hardware requirements are and consider any caveats.</p><p>The main thing here is that all of these products operate only on the HTTP or HTTPS layer. There are SSL accelerators available that would fit here. Each of these systems is capable of caching, but they are confined to the rules defined by the HTTP specification.</p><p>Depending on the type of HTTP response being made, the Caching Tier may decide to store the object for future use. As soon as another request is made for the same object, the Caching Tier may decide to reuse the previous object, saving a round trip to the next layer. On other occasions, the Caching Tier may ignore the local object entirely and force a refresh of the object from the backend. All of these semantics are governed by the HTTP spec.</p><p>Also, since each connection to a web server uses a portion of that server's very finite resources, it's important to complete responses as quickly as possible. The Caching Tier has a nice side effect in that it quickly receives responses from the Web Tier, freeing it up very quickly, and can then slowly drain that response to the user. 
This is particularly useful for modem users, who, despite their slow speed, are actually quite a strain on high-traffic servers.</p><p>Buy lots of these for redundancy, and buy cheap parts (expect them to break).</p><p>The nice thing about HTTP is that the web server, aka the origin server, is the thing that gets to define what is cacheable and what is not. The Caching Tier should be entirely transparent, as far as the Users and the Web Servers are concerned. Therefore, the types...</p></li></ul>
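<p>The notes above say that the origin server decides what is cacheable and that the Caching Tier follows the rules of the HTTP spec. As a rough illustration only, here is a minimal Python sketch of how a shared cache might read the standard Cache-Control response header to decide how long it may reuse an object. The function names (parse_cache_control, shared_cache_ttl) are hypothetical and not taken from the presentation or from any particular cache product. The logic is deliberately simplified: it only considers 200 responses, assumes the header key is spelled exactly "Cache-Control", and ignores Expires, Vary, heuristic freshness, and revalidation.</p>

```python
from typing import Dict, Mapping, Optional


def parse_cache_control(header: str) -> Dict[str, Optional[str]]:
    """Split a Cache-Control header value into a {directive: value} map.

    Directives without a value (e.g. "public") map to None.
    """
    directives: Dict[str, Optional[str]] = {}
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        name, sep, value = part.partition("=")
        directives[name.strip().lower()] = value.strip().strip('"') if sep else None
    return directives


def shared_cache_ttl(status: int, headers: Mapping[str, str]) -> int:
    """Return the number of seconds a shared cache may reuse a response.

    Returns 0 when the response must not be served from the cache without
    going back to the origin (simplified; hypothetical helper).
    """
    if status != 200:
        # Simplification: only plain successful responses are considered.
        return 0
    cc = parse_cache_control(headers.get("Cache-Control", ""))
    # The origin server has opted out of shared caching or unvalidated reuse.
    if "no-store" in cc or "private" in cc or "no-cache" in cc:
        return 0
    # For shared caches, s-maxage takes precedence over max-age.
    for name in ("s-maxage", "max-age"):
        value = cc.get(name)
        if value is not None and value.isdigit():
            return int(value)
    # No explicit lifetime: this sketch conservatively declines to cache.
    return 0
```

<p>Under these assumptions, a stylesheet sent with "Cache-Control: public, max-age=3600" could be served from the Caching Tier for an hour, while a per-user page marked "private" would go to the Web Tier every time. The cache never has to guess: the decision stays with the origin server, as the notes describe.</p>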