Disqus' presentation at DjangoCon 2010. Covers their basic hardware setup, some of their concerns with database usage, and how they manage with a small team of engineers.
Jason Yan (@jasonyan)
David Cramer (@zeeg)
Scaling the World’s Largest Django App
1
What is DISQUS?
2
What is DISQUS?
We are a comment system with an emphasis on connecting communities
http://disqus.com/about/
dis·cuss • dĭ-skŭs'
3
What is Scale?
17,000 requests/second peak
450,000 websites
15 million profiles
75 million comments
250 million visitors (August 2010)
[Chart: Our traffic at a glance. Axis: Number of Visitors, 50M to 300M]
4
Our Challenges
• We can’t predict when things will happen
• Random celebrity gossip
• Natural disasters
• Discussions never expire
• We can’t keep those millions of articles from 2008 in the cache
• You don’t know in advance (generally) where the traffic will be
• Especially with dynamic paging, realtime, sorting, personal prefs, etc.
5
Our Challenges (cont’d)
• High availability
• Not a destination site
• Difficult to schedule maintenance
6
Server Architecture
7
Server Architecture - Load Balancing
• Load Balancing
• Software, HAProxy
• High performance, intelligent server availability checking
• Bonus: Nice statistics reporting
• High Availability
• heartbeat
Image Source: http://haproxy.1wt.eu/
8
Server Architecture
• ~100 Servers
• 30% Web Servers (Apache + mod_wsgi)
• 10% Databases (PostgreSQL)
• 25% Cache Servers (memcached)
• 20% Load Balancing / High Availability (HAProxy + heartbeat)
• 15% Utility Servers (Python scripts)
9
Server Architecture - Web Servers
• Apache 2.2
• mod_wsgi
• Using `maximum-requests` to plug memory leaks.
• Performance Monitoring
• Custom middleware (PerformanceLogMiddleware)
• Ships performance statistics (DB queries, external calls, template rendering, etc) through syslog
• Collected and graphed through Ganglia
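The slides do not show PerformanceLogMiddleware itself. As a rough sketch of the idea only, the following WSGI wrapper times each request and writes one log line per request. The class name `TimingMiddleware`, the logger name, and the log format are assumptions, and it records only total time, not the per-component statistics (DB queries, external calls, template rendering) listed above.

```python
import logging
import time

log = logging.getLogger("perf")  # hypothetical logger name


class TimingMiddleware(object):
    """WSGI wrapper that logs how long the wrapped app took per request."""

    def __init__(self, app, logger=log):
        self.app = app
        self.logger = logger

    def __call__(self, environ, start_response):
        started = time.time()
        try:
            return self.app(environ, start_response)
        finally:
            # Measures time until the app returns its response iterable.
            elapsed_ms = (time.time() - started) * 1000.0
            self.logger.info("path=%s total_ms=%.1f",
                             environ.get("PATH_INFO", ""), elapsed_ms)
```

Attaching a `logging.handlers.SysLogHandler` to the logger would route these lines through syslog, where a collector could aggregate them for graphing.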
10
Server Architecture - Database
• PostgreSQL
• Slony-I for Replication
• Trigger-based
• Read slaves for extra read capacity
• Failover master database for high availability
11
Server Architecture - Database
• Make sure indexes fit in memory and measure I/O
• High I/O generally means slow queries due to missing indexes or indexes not in buffer cache
• Log Slow Queries
• syslog-ng + pgFouine + cron to automate slow query logging
12
Server Architecture - Database
• Use connection pooling
• Django doesn’t do this for you
• We use pgbouncer
• Limits the maximum number of connections your database needs to handle
• Save on costly opening and tearing down of new database connections
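From Django's side, using pgbouncer usually just means pointing the database settings at the pooler instead of at PostgreSQL. A minimal settings fragment, assuming pgbouncer runs on the same host with its default listen port; the database name and host are hypothetical:

```python
# settings.py fragment (hypothetical values throughout)
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql_psycopg2",
        "NAME": "disqus",
        "HOST": "127.0.0.1",  # pgbouncer listens here, not PostgreSQL itself
        "PORT": "6432",       # pgbouncer's default listen port
    }
}
```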
13
Our Data Model
14
Partitioning
• Fairly easy to implement, quick wins
• Done at the application level
• Data is replayed by Slony
• Two methods of data separation
15
Vertical Partitioning
Vertical partitioning involves creating tables with fewer columns and using additional tables to store the remaining columns.
http://en.wikipedia.org/wiki/Partition_(database)
[Diagram: separate databases for Posts, Users, Forums, Sentry]
16
Pythonic Joins
posts = Post.objects.all()[0:25]

# store users in a dictionary based on primary key
users = dict(
    (u.pk, u) for u in
    User.objects.filter(pk__in=set(p.user_id for p in posts))
)

# map users to their posts
for p in posts:
    p._user_cache = users.get(p.user_id)
Allows us to separate datasets
17
Pythonic Joins (cont’d)
• Slower than at database level
• But not enough that you should care
• Trading performance for scale
• Allows us to separate data
• Easy vertical partitioning
• More efficient caching
• get_many, object-per-row cache
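The join pattern generalizes to any foreign key. The sketch below is not the `attach_foreignkey` helper from the references; `attach_related` and its parameters are hypothetical names, and plain Python objects stand in for models so it runs without Django. The `fetch_many` callable is where a cache `get_many` with a database fallback could be plugged in.

```python
def attach_related(objects, fetch_many, fk_attr, cache_attr):
    """Fetch related rows in one batch and attach them to each object."""
    ids = set(getattr(o, fk_attr) for o in objects)
    ids.discard(None)
    related = fetch_many(ids) if ids else {}
    for o in objects:
        setattr(o, cache_attr, related.get(getattr(o, fk_attr)))
    return objects


class Post(object):
    def __init__(self, pk, user_id):
        self.pk, self.user_id = pk, user_id


USERS = {1: "alice", 2: "bob"}  # stand-in for a separate users database


def fetch_users(ids):
    # One batched lookup for all ids, returning {pk: user}.
    return dict((pk, USERS[pk]) for pk in ids if pk in USERS)


posts = attach_related([Post(10, 1), Post(11, 2), Post(12, 1)],
                       fetch_users, "user_id", "_user_cache")
```

Because the lookup is a single batched call keyed by primary key, the users can live in a different database or be served entirely from an object-per-row cache.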
18
Designating Masters
• Alleviates some of the write load on your primary application master
• Masters exist under specific conditions:
• application use case
• partitioned data
• Database routers make this (fairly) easy
19
Routing by Application
class ApplicationRouter(object):
    def db_for_read(self, model, **hints):
        instance = hints.get('instance')
        if not instance:
            return None
        app_label = instance._meta.app_label
        return get_application_alias(app_label)
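The slides do not define `get_application_alias`. One plausible shape, assuming a static mapping from app label to database alias (the mapping contents are hypothetical), is a dictionary lookup that returns None so other routers can decide:

```python
# Hypothetical mapping from Django app label to database alias.
APPLICATION_DATABASES = {
    "sentry": "sentry",
    "forums": "forums",
}


def get_application_alias(app_label, default=None):
    """Return the alias for an app, or None to defer to other routers."""
    return APPLICATION_DATABASES.get(app_label, default)
```

A router is enabled by listing its dotted path in Django's `DATABASE_ROUTERS` setting. The same lookup pattern would fit a partition helper keyed on a forum id.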
20
Horizontal Partitioning
Horizontal partitioning (also known as sharding) involves splitting one set of data into different tables.
http://en.wikipedia.org/wiki/Partition_(database)
[Diagram: separate partitions for Your Blog, CNN, Disqus, Telegraph]
21
Horizontal Partitions
• Some forums have very large datasets
• Partners need high availability
• Helps scale the write load on the master
• We rely more on vertical partitions
22
Routing by Partition
class ForumPartitionRouter(object):
    def db_for_read(self, model, **hints):
        instance = hints.get('instance')
        if not instance:
            return None
        forum_id = getattr(instance, 'forum_id', None)
        if not forum_id:
            return None
        return get_forum_alias(forum_id)

# Now, making sure hints are available
forum.post_set.all()

# What we used to do
Post.objects.filter(forum=forum)
23
Optimizing QuerySets
• We really dislike raw SQL
• It creates more work when dealing with partitions
• Built-in cache allows sub-slicing
• But isn’t always needed
• We removed this cache
24
Removing the Cache
• Django internally caches the results of your QuerySet
• This adds additional memory overhead
• Many times you only need to view a result set once
• So we built SkinnyQuerySet
# 1 query
qs = Model.objects.all()[0:100]

# 0 queries (we don't need this behavior)
qs = qs[0:10]

# 1 query
qs = qs.filter(foo=bar)
25
Removing the Cache (cont’d)
class SkinnyQuerySet(QuerySet):
    def __iter__(self):
        if self._result_cache is not None:
            # __len__ must have been run
            return iter(self._result_cache)

        has_run = getattr(self, 'has_run', False)
        if has_run:
            raise QuerySetDoubleIteration("...")
        self.has_run = True
        # We wanted .iterator() as the default
        return self.iterator()
Optimizing memory usage by removing the cache
http://gist.github.com/550438
26
Atomic Updates
• Keeps your data consistent
• save() isn't thread-safe
• Use update() instead
• Great for things like counters
• But should be considered for all write operations
27
Atomic Updates (cont’d)
Thread safety is impossible with .save()

Request 1
post = Post(pk=1)
# a moderator approves
post.approved = True
post.save()

Request 2
post = Post(pk=1)
# the author adjusts their message
post.message = 'Hello!'
post.save()
28
Atomic Updates (cont’d)
So we need atomic updates

Request 1
post = Post(pk=1)
# a moderator approves
Post.objects.filter(pk=post.pk)\
    .update(approved=True)

Request 2
post = Post(pk=1)
# the author adjusts their message
Post.objects.filter(pk=post.pk)\
    .update(message='Hello!')
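The race between the two requests can be reproduced without a database. In this sketch a dictionary stands in for the posts table, and `load`, `save`, and `update` are hypothetical stand-ins: `save` writes every column the way `Model.save()` does here, while `update` writes only the named columns the way `QuerySet.update()` does.

```python
import copy

# One shared "row"; stands in for a row in the posts table.
TABLE = {1: {"approved": False, "message": "Hi"}}


def load(pk):
    """Like fetching a model instance: returns a private copy of the row."""
    return copy.deepcopy(TABLE[pk])


def save(pk, obj):
    """Like Model.save(): writes every column, including stale ones."""
    TABLE[pk] = copy.deepcopy(obj)


def update(pk, **fields):
    """Like QuerySet.update(): writes only the named columns."""
    TABLE[pk].update(fields)


# Both requests load the row before either one writes.
moderator_copy = load(1)
author_copy = load(1)

moderator_copy["approved"] = True
save(1, moderator_copy)
author_copy["message"] = "Hello!"
save(1, author_copy)          # overwrites approved with the stale False
lost = dict(TABLE[1])

# The same interleaving with column-level updates keeps both changes.
TABLE[1] = {"approved": False, "message": "Hi"}
update(1, approved=True)
update(1, message="Hello!")
kept = dict(TABLE[1])
```

With full-row saves the moderator's approval is silently lost; with column-level updates both changes survive regardless of ordering.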
29
Atomic Updates (cont’d)
def update(obj, using=None, **kwargs):
    """
    Updates specified attributes on the current instance.
    """
    assert obj, "Instance has not yet been created."
    obj.__class__._base_manager.using(using)\
        .filter(pk=obj.pk)\
        .update(**kwargs)
    # Mirror the written values onto the in-memory instance
    for k, v in kwargs.items():
        if isinstance(v, ExpressionNode):
            # NotImplemented: expressions are evaluated by the database
            continue
        setattr(obj, k, v)
A better way to approach updates
http://github.com/andymccurdy/django-tips-and-tricks/blob/master/model_update.py
30
Delayed Signals
• Queueing low priority tasks
• even if they’re fast
• Asynchronous (Delayed) signals
• very friendly to the developer
• ..but not as friendly as real signals
31
Delayed Signals (cont’d)
from disqus.common.signals import delayed_save

def my_func(data, sender, created, **kwargs):
    print(data['id'])

delayed_save.connect(my_func, sender=Post)

We send a specific serialized version of the model for delayed signals
This is all handled through our Queue
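The slides do not show how `delayed_save` works internally. A minimal sketch of the general idea, assuming a JSON-serialized payload and an in-process deque standing in for the real queue (all names here are hypothetical):

```python
import json
from collections import deque

QUEUE = deque()      # stand-in for a real message queue
RECEIVERS = {}       # sender name -> list of callables


def connect(func, sender):
    RECEIVERS.setdefault(sender, []).append(func)


def send_delayed(sender, data, created):
    """Called during the request: serialize and enqueue, run nothing."""
    QUEUE.append(json.dumps({"sender": sender, "data": data,
                             "created": created}))


def run_worker():
    """Called in a worker process: deliver queued signals to receivers."""
    while QUEUE:
        msg = json.loads(QUEUE.popleft())
        for func in RECEIVERS.get(msg["sender"], []):
            func(data=msg["data"], sender=msg["sender"],
                 created=msg["created"])


seen = []


def my_func(data, sender, created, **kwargs):
    seen.append(data["id"])


connect(my_func, sender="Post")
send_delayed("Post", {"id": 42}, created=True)
run_worker()
```

The receiver gets a plain dictionary rather than a model instance, which is the sense in which delayed signals are less friendly than real ones: anything not in the serialized payload has to be fetched again.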
32
Caching
• Memcached
• Use pylibmc (newer libMemcached-based)
• Ticket #11675 (add pylibmc support)
• Third party applications:
• django-newcache, django-pylibmc
33
Caching (cont’d)
• libMemcached / pylibmc is configurable with “behaviors”.
• Memcached “single point of failure”
• Distributed system, but we must take precautions.
• Connection timeout to memcached can stall requests.
• Use `_auto_eject_hosts` and `_retry_timeout` behaviors to prevent reconnecting to dead caches.
34
Caching (cont’d)
• Default (naive) hashing behavior
• The hashed cache key, modulo the number of servers, gives the index into the server list.
• Removal of a server causes the majority of cache keys to be remapped to new servers.
CACHE_SERVERS = ['10.0.0.1', '10.0.0.2']
key = 'my_cache_key'
cache_server = CACHE_SERVERS[hash(key) % len(CACHE_SERVERS)]
35
Caching (cont’d)
• Better approach: consistent hashing
• libMemcached (pylibmc) uses libketama (http://tinyurl.com/lastfm-libketama)
• Addition / removal of a cache server remaps only about K/n cache keys (where K = number of keys and n = number of servers)
Image Source: http://sourceforge.net/apps/mediawiki/kai/index.php?title=Introduction
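The difference between modulo placement and consistent hashing can be measured directly. The ring below is a minimal illustration, not libketama's actual algorithm: the use of MD5, 100 virtual points per server, and the server addresses are all assumptions made for the sketch.

```python
import bisect
import hashlib


def _hash(value):
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def modulo_server(key, servers):
    """Naive placement: index into the server list by hash modulo."""
    return servers[_hash(key) % len(servers)]


class HashRing(object):
    """Minimal consistent-hash ring with virtual points per server."""

    def __init__(self, servers, points_per_server=100):
        self._ring = sorted(
            (_hash("%s#%d" % (server, i)), server)
            for server in servers
            for i in range(points_per_server)
        )
        self._hashes = [h for h, _ in self._ring]

    def server_for(self, key):
        # First point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect(self._hashes, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


def moved(keys, before, after):
    """Count keys whose server differs between two placement functions."""
    return sum(1 for k in keys if before(k) != after(k))


servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
fewer = servers[:-1]          # simulate losing the last server
keys = ["key%d" % i for i in range(2000)]

old_ring, new_ring = HashRing(servers), HashRing(fewer)
moved_ring = moved(keys, old_ring.server_for, new_ring.server_for)
moved_modulo = moved(keys, lambda k: modulo_server(k, servers),
                     lambda k: modulo_server(k, fewer))
```

On the ring, the only keys that move are those that lived on the removed server; with modulo placement, most keys change servers and the cache is effectively flushed.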
36
Caching (cont’d)
• Thundering herd (stampede) problem
• Invalidating a heavily accessed cache key causes many clients to refill cache.
• But everyone refetching to fill the cache from the data store or reprocessing data can cause things to get even slower.
• Most times, it’s ideal to return the previously invalidated cache value and let a single client refill the cache.
• django-newcache or MintCache (http://djangosnippets.org/snippets/793/) will do this for you.
• Filling the cache on invalidation, instead of deleting from it, also helps prevent the thundering herd problem.
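The "return the stale value and let one client refill" idea can be sketched with a soft expiry stored next to each value. This is an illustration in the spirit of MintCache, not its actual code; the class name, the `grace` parameter, and the dict-backed store are assumptions.

```python
import time


class StaleWhileRefillCache(object):
    """Dict-backed cache whose entries carry a soft expiry time."""

    def __init__(self, store=None, grace=30, clock=time.time):
        self.store = {} if store is None else store
        self.grace = grace      # seconds one client gets to refill
        self.clock = clock

    def set(self, key, value, timeout):
        self.store[key] = (value, self.clock() + timeout)

    def get(self, key):
        """Return (value, must_refill)."""
        entry = self.store.get(key)
        if entry is None:
            return None, True
        value, soft_expiry = entry
        if self.clock() >= soft_expiry:
            # Push the soft expiry forward so other clients keep getting
            # the stale value while this caller recomputes it.
            self.store[key] = (value, self.clock() + self.grace)
            return value, True
        return value, False
```

Against a real memcached the read-then-write step is not atomic, so a handful of clients may still refill at once, but that is far fewer than every client missing simultaneously.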
37
Transactions
• TransactionMiddleware got us started, but down the road became a burden
• For postgresql_psycopg2, there's a database option, OPTIONS['autocommit']
• Each query is in its own transaction. This means each request won’t start in a transaction.
• But sometimes we want transactions (e.g., saving multiple objects and rolling back on error)
38
Transactions (cont’d)
• Tips:
• Use autocommit for read slave databases.
• Isolate slow functions (e.g., external calls, template rendering) from transactions.
• Selective autocommit
• Most read-only views don’t need to be in transactions.
• Start in autocommit and switch to a transaction on write.
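The "start in autocommit, switch to a transaction on write" pattern can be shown with the standard library's sqlite3 module, used here purely as a stand-in for PostgreSQL so the sketch runs anywhere. The table and function names are hypothetical.

```python
import sqlite3

# isolation_level=None puts the sqlite3 module in autocommit mode:
# each statement commits on its own unless we issue BEGIN ourselves.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE post (id INTEGER PRIMARY KEY, message TEXT NOT NULL)")


def count_posts():
    """Read-only work: no explicit transaction needed."""
    return conn.execute("SELECT COUNT(*) FROM post").fetchone()[0]


def save_all(messages):
    """Multi-row write: open a transaction, roll back on any error."""
    conn.execute("BEGIN")
    try:
        for message in messages:
            conn.execute("INSERT INTO post (message) VALUES (?)",
                         (message,))
    except Exception:
        conn.execute("ROLLBACK")
        raise
    conn.execute("COMMIT")


save_all(["first", "second"])
try:
    # None violates NOT NULL, so the whole batch rolls back.
    save_all(["third", None])
except sqlite3.IntegrityError:
    pass
```

Reads never hold a transaction open, and the failed batch leaves no partial rows behind.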
39
Scaling the Team
• Small team of engineers
• Monthly users / developers = 40m
• Which means writing tests..
• ..and having a dead simple workflow
40
Keeping it Simple
• A developer can be up and running in a few minutes
• assuming postgres and other server applications are already installed
• pip, virtualenv
• settings.py
41
Setting Up Local
1. createdb -E UTF-8 disqus
2. git clone git://repo
3. mkvirtualenv disqus
4. pip install -U -r requirements.txt
5. ./manage.py syncdb && ./manage.py migrate
42
Sane Defaults
settings.py
from disqus.conf.settings.default import *

try:
    from local_settings import *
except ImportError:
    import sys, traceback
    sys.stderr.write("Can't find 'local_settings.py'\n")
    sys.stderr.write("\nThe exception was:\n\n")
    traceback.print_exc()

local_settings.py
from disqus.conf.settings.dev import *
43
Continuous Integration
• Daily deploys with Fabric
• several times an hour on some days
• Hudson keeps our builds going
• combined with Selenium
• Post-commit hooks for quick testing
• like Pyflakes
• Reverting to a previous version is a matter of seconds
44
Continuous Integration (cont’d)
Hudson makes integration easy
45
Testing
• It’s not fun breaking things when you’re the new guy
• Our testing process is fairly heavy
• 70k (Python) LOC, 73% coverage, 20 min suite
• Custom Test Runner (unittest)
• We needed XML, Selenium, Query Counts
• Database proxies (for read-slave testing)
• Integration with our Queue
46
Testing (cont’d)
Query Counts
# failures yield a dump of queries
def test_read_slave(self):
    Model.objects.using('read_slave').count()
    self.assertQueryCount(1, 'read_slave')

Selenium
def test_button(self):
    self.selenium.click('//a[@class="dsq-button"]')

Queue Integration
class WorkerTest(DisqusTest):
    workers = ['fire_signal']

    def test_delayed_signal(self):
        ...
47
Bug Tracking
• Switched from Trac to Redmine
• We wanted Subtasks
• Emailing exceptions is a bad idea
• Even if it's localhost
• Previously using django-db-log to aggregate errors to a single point
• We’ve overhauled db log and are releasing Sentry
48
django-sentry
Groups messages intelligently
http://github.com/dcramer/django-sentry
49
django-sentry (cont’d)
Similar feel to Django’s debugger
http://github.com/dcramer/django-sentry
50
Feature Switches
• We needed a safety in case a feature wasn’t performing well at peak
• it had to respond without delay, globally, and without writing to disk
• Allows us to work out of trunk (mostly)
• Easy to release new features to a portion of your audience
• Also nice for “Labs” type projects
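The slides do not show the switch implementation. A minimal in-memory sketch of the concept, with percentage rollout, might look like the following; the class and method names are hypothetical, and a real deployment would need a shared store for a change to take effect globally rather than per process.

```python
import zlib


class Switches(object):
    """In-memory feature switches with optional percentage rollout."""

    def __init__(self):
        self._percent = {}   # switch name -> 0..100

    def set(self, name, percent):
        self._percent[name] = max(0, min(100, percent))

    def is_active(self, name, user_id=None):
        percent = self._percent.get(name, 0)   # unknown switches are off
        if percent >= 100:
            return True
        if percent <= 0 or user_id is None:
            return False
        # Stable bucket per (switch, user) so each user sees a
        # consistent result across requests.
        token = ("%s:%s" % (name, user_id)).encode("utf-8")
        return zlib.crc32(token) % 100 < percent
```

Setting a switch to 0 acts as the safety valve at peak load, and intermediate values release a feature to a portion of the audience.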
51
Feature Switches (cont’d)
52
Final Thoughts
• The language (usually) isn’t your problem
• We like Django
• But we maintain local patches
• Some tickets don’t have enough of a following
• Patches, like #17, completely change Django..
• ..arguably in a good way
• Others don’t have champions
Ticket #17 describes making the ORM an identity mapper
53
Housekeeping
Want to learn from others about performance and scaling problems?
Birds of a Feather
We’re Hiring!
DISQUS is looking for amazing engineers
Or play some StarCraft 2?
54
Questions
55
References
django-sentry: http://github.com/dcramer/django-sentry
Our Feature Switches: http://cl.ly/2FYt
Andy McCurdy's update(): http://github.com/andymccurdy/django-tips-and-tricks
Our PyFlakes Fork: http://github.com/dcramer/pyflakes
SkinnyQuerySet: http://gist.github.com/550438
django-newcache: http://github.com/ericflo/django-newcache
attach_foreignkey (Pythonic Joins): http://gist.github.com/567356
56