DjangoCon 2010 Scaling Disqus

Jason Yan (@jasonyan)

David Cramer (@zeeg)

Scaling the World’s Largest Django App

1

What is DISQUS?


We are a comment system with an emphasis on connecting communities

http://disqus.com/about/

dis·cuss • dĭ-skŭs'

3



What is Scale?

17,000 requests/second peak

450,000 websites

15 million profiles

75 million comments

250 million visitors (August 2010)

[Figure: Our traffic at a glance. Number of visitors, axis from 50M to 300M.]

4

Our Challenges

• We can’t predict when things will happen

• Random celebrity gossip

• Natural disasters

• Discussions never expire

• We can’t keep those millions of articles from 2008 in the cache

• You don’t know in advance (generally) where the traffic will be

• Especially with dynamic paging, realtime, sorting, personal prefs, etc.

5

Our Challenges (cont’d)

• High availability

• Not a destination site

• Difficult to schedule maintenance

6

Server Architecture

7

Server Architecture - Load Balancing

• Load Balancing

• Software, HAProxy

• High performance, intelligent server availability checking

• Bonus: Nice statistics reporting

• High Availability

• heartbeat

Image Source: http://haproxy.1wt.eu/

8


Server Architecture

• ~100 Servers

• 30% Web Servers (Apache + mod_wsgi)

• 10% Databases (PostgreSQL)

• 25% Cache Servers (memcached)

• 20% Load Balancing / High Availability (HAProxy + heartbeat)

• 15% Utility Servers (Python scripts)

9

Server Architecture - Web Servers

• Apache 2.2

• mod_wsgi

• Using `maximum-requests` to plug memory leaks.

• Performance Monitoring

• Custom middleware (PerformanceLogMiddleware)

• Ships performance statistics (DB queries, external calls, template rendering, etc) through syslog

• Collected and graphed through Ganglia
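The slides name PerformanceLogMiddleware but do not show it. A minimal sketch of the idea, assuming the process_request / process_response middleware shape Django used at the time; the class name, logger name and log format are illustrative, not the Disqus implementation. Attaching a `logging.handlers.SysLogHandler` to the logger would send these lines to syslog.

```python
import logging
import time

logger = logging.getLogger("perf")  # hypothetical logger name


class PerformanceLogSketch(object):
    """Times each request and logs one line per response."""

    def process_request(self, request):
        # Remember when the request started.
        request._perf_start = time.time()

    def process_response(self, request, response):
        start = getattr(request, "_perf_start", None)
        if start is not None:
            elapsed_ms = (time.time() - start) * 1000.0
            # One key=value line per request is easy to parse downstream.
            logger.info("path=%s total_ms=%.1f", request.path, elapsed_ms)
        return response
```

Per-category timings (DB queries, external calls, template rendering) would need extra hooks at those call sites; this sketch only measures total request time.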

10

Server Architecture - Database

• PostgreSQL

• Slony-I for Replication

• Trigger-based

• Read slaves for extra read capacity

• Failover master database for high availability

11


• Make sure indexes fit in memory and measure I/O

• High I/O generally means slow queries due to missing indexes or indexes not in buffer cache

• Log Slow Queries

• syslog-ng + pgFouine + cron to automate slow query logging

12


• Use connection pooling

• Django doesn’t do this for you

• We use pgbouncer

• Limits the maximum number of connections your database needs to handle

• Save on costly opening and tearing down of new database connections
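With pgbouncer in front of PostgreSQL, the application simply connects to the pooler instead of the database. A sketch of the settings fragment, assuming pgbouncer runs on the same host on its conventional port 6432; the database name and host are placeholders.

```python
# settings.py fragment (placeholder values, not the Disqus configuration)
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql_psycopg2",
        "NAME": "disqus",
        "HOST": "127.0.0.1",  # pgbouncer, not PostgreSQL itself
        "PORT": "6432",       # pgbouncer's conventional listen port
    }
}
```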

13

Our Data Model

14

Partitioning

• Fairly easy to implement, quick wins

• Done at the application level

• Data is replayed by Slony

• Two methods of data separation

15

Vertical Partitioning

Vertical partitioning involves creating tables with fewer columns and using additional tables to store the remaining columns.

http://en.wikipedia.org/wiki/Partition_(database)

[Diagram: Posts, Users, Forums, Sentry]

16



Pythonic Joins

posts = Post.objects.all()[0:25]

# store users in a dictionary based on primary key
users = dict(
    (u.pk, u) for u in
    User.objects.filter(pk__in=set(p.user_id for p in posts))
)

# map users to their posts
for p in posts:
    p._user_cache = users.get(p.user_id)

Allows us to separate datasets

17

Pythonic Joins (cont’d)

• Slower than at database level

• But not enough that you should care

• Trading performance for scale

• Allows us to separate data

• Easy vertical partitioning

• More efficient caching

• get_many, object-per-row cache
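The "get_many, object-per-row cache" point can be sketched as follows: look up every needed row in the cache with one multi-get, then load only the misses from the database in a single query. The `DictCache` stand-in, the `user:<id>` key format and the `load_from_db` callback are assumptions for illustration; with Django, the cache object would be `django.core.cache.cache`, which exposes `get_many` and `set_many`.

```python
class DictCache(object):
    """In-memory stand-in exposing the get_many/set_many calls used below."""

    def __init__(self):
        self._data = {}

    def get_many(self, keys):
        return dict((k, self._data[k]) for k in keys if k in self._data)

    def set_many(self, mapping):
        self._data.update(mapping)


def fetch_users(user_ids, cache, load_from_db):
    """Return {user_id: user}, hitting the database only for cache misses."""
    keys = dict(("user:%s" % uid, uid) for uid in set(user_ids))
    cached = cache.get_many(list(keys))
    users = dict((keys[k], v) for k, v in cached.items())
    missing = [uid for uid in keys.values() if uid not in users]
    if missing:
        # One query for all misses, e.g. User.objects.filter(pk__in=missing)
        loaded = load_from_db(missing)
        cache.set_many(dict(("user:%s" % uid, u) for uid, u in loaded.items()))
        users.update(loaded)
    return users
```

Because each row is cached under its own key, a change to one user invalidates one entry rather than every cached join result that included that user.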

18

Designating Masters

• Alleviates some of the write load on your primary application master

• Masters exist under specific conditions:

• application use case

• partitioned data

• Database routers make this (fairly) easy

19

Routing by Application

class ApplicationRouter(object):
    def db_for_read(self, model, **hints):
        instance = hints.get('instance')
        if not instance:
            return None
        app_label = instance._meta.app_label

        return get_application_alias(app_label)
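The slide calls `get_application_alias` without showing it. One plausible shape, assuming a static mapping from app label to database alias; the mapping contents and the default alias are hypothetical.

```python
# Hypothetical mapping from Django app label to database alias.
APPLICATION_ALIASES = {
    "sentry": "sentry",
    "forums": "forums",
}


def get_application_alias(app_label, default="default"):
    """Return the database alias that holds the tables for app_label."""
    return APPLICATION_ALIASES.get(app_label, default)
```

A router like the one above takes effect once its dotted path is listed in Django's `DATABASE_ROUTERS` setting.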

20

Horizontal Partitioning

Horizontal partitioning (also known as sharding) involves splitting one set of data into different tables.

[Diagram: Your Blog, CNN, Disqus, Telegraph]

21



Horizontal Partitions

• Some forums have very large datasets

• Partners need high availability

• Helps scale the write load on the master

• We rely more on vertical partitions

22

Routing by Partition

class ForumPartitionRouter(object):
    def db_for_read(self, model, **hints):
        instance = hints.get('instance')
        if not instance:
            return None
        forum_id = getattr(instance, 'forum_id', None)
        if not forum_id:
            return None

        return get_forum_alias(forum_id)

# Now, making sure hints are available
forum.post_set.all()

# What we used to do
Post.objects.filter(forum=forum)
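`get_forum_alias` is likewise not shown in the slides. A sketch under the assumption that only some forums live on their own partition and everything else stays on the default database; the table contents and alias names are made up.

```python
# Hypothetical: forums that have been moved onto their own partition.
FORUM_PARTITIONS = {
    42: "forum_partition_1",
}


def get_forum_alias(forum_id):
    """Return a dedicated alias for partitioned forums, else None.

    Returning None from a router method tells Django to ask the next
    router, and finally to fall back to the default database.
    """
    return FORUM_PARTITIONS.get(forum_id)
```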

23

Optimizing QuerySets

• We really dislike raw SQL

• It creates more work when dealing with partitions

• Built-in cache allows sub-slicing

• But isn’t always needed

• We removed this cache

24

Removing the Cache

• Django internally caches the results of your QuerySet

• This adds additional memory overhead

• Many times you only need to view a result set once

• So we built SkinnyQuerySet

# 1 query
qs = Model.objects.all()[0:100]

# 0 queries (we don’t need this behavior)
qs = qs[0:10]

# 1 query
qs = qs.filter(foo=bar)

25

Removing the Cache (cont’d)

class SkinnyQuerySet(QuerySet):
    def __iter__(self):
        if self._result_cache is not None:
            # __len__ must have been run
            return iter(self._result_cache)

        has_run = getattr(self, 'has_run', False)
        if has_run:
            raise QuerySetDoubleIteration("...")
        self.has_run = True
        # We wanted .iterator() as the default
        return self.iterator()

Optimizing memory usage by removing the cache

http://gist.github.com/550438

26



Atomic Updates

• Keeps your data consistent

• save() isn’t thread-safe

• use update() instead

• Great for things like counters

• But should be considered for all write operations
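The lost-update problem behind these bullets can be reproduced without Django. The sketch below uses the standard-library sqlite3 module as a stand-in for PostgreSQL: `save` writes every column back, as a model's save() does by default, while `update` touches only the named columns, as QuerySet.update() does. The table and helper names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE post (id INTEGER PRIMARY KEY, message TEXT, approved INTEGER)")
conn.execute("INSERT INTO post VALUES (1, 'Hi', 0)")


def load(pk):
    row = conn.execute(
        "SELECT id, message, approved FROM post WHERE id = ?", (pk,)).fetchone()
    return {"id": row[0], "message": row[1], "approved": row[2]}


def save(obj):
    # Full-row write: every column is overwritten with the in-memory copy.
    conn.execute("UPDATE post SET message = ?, approved = ? WHERE id = ?",
                 (obj["message"], obj["approved"], obj["id"]))


def update(pk, **fields):
    # Targeted write: only the named columns change.
    # Column names are interpolated, so they must come from trusted code.
    assignments = ", ".join("%s = ?" % name for name in fields)
    conn.execute("UPDATE post SET %s WHERE id = ?" % assignments,
                 list(fields.values()) + [pk])


# Both requests load the row before either one writes.
moderator_copy = load(1)
author_copy = load(1)

moderator_copy["approved"] = 1
save(moderator_copy)
author_copy["message"] = "Hello!"
save(author_copy)  # writes back the stale approved value
lost = load(1)

update(1, approved=1)
update(1, message="Hello again!")
kept = load(1)
```

After the two save() calls the approval is gone, because the author's copy still carried the old flag. After the two update() calls both changes survive.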

27

Atomic Updates (cont’d)

Thread safety is impossible with .save()

Request 1

post = Post(pk=1)
# a moderator approves
post.approved = True
post.save()

Request 2

post = Post(pk=1)
# the author adjusts their message
post.message = 'Hello!'
post.save()

28


So we need atomic updates

Request 1

post = Post(pk=1)
# a moderator approves
Post.objects.filter(pk=post.pk)\
    .update(approved=True)

Request 2

post = Post(pk=1)
# the author adjusts their message
Post.objects.filter(pk=post.pk)\
    .update(message='Hello!')

29


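# ExpressionNode is the base class of F() expressions in Django of this era
from django.db.models.expressions import ExpressionNode

def update(obj, using=None, **kwargs):
    """
    Updates specified attributes on the current instance.
    """
    assert obj, "Instance has not yet been created."
    obj.__class__._base_manager.using(using)\
        .filter(pk=obj.pk)\
        .update(**kwargs)
    for k, v in kwargs.items():
        if isinstance(v, ExpressionNode):
            # NotImplemented: the computed value is only known to the database
            continue
        setattr(obj, k, v)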

A better way to approach updates

http://github.com/andymccurdy/django-tips-and-tricks/blob/master/model_update.py

30



Delayed Signals

• Queueing low priority tasks

• even if they’re fast

• Asynchronous (Delayed) signals

• very friendly to the developer

• ..but not as friendly as real signals

31

Delayed Signals (cont’d)

from disqus.common.signals import delayed_save

def my_func(data, sender, created, **kwargs):
    print(data['id'])

delayed_save.connect(my_func, sender=Post)

We send a specific serialized version of the model for delayed signals

This is all handled through our Queue
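The internals of `delayed_save` are not shown in the slides. A rough sketch of the mechanism, assuming an in-process `queue.Queue` in place of the real queue and JSON as the serialization format; the `DelayedSignal` class and its methods are hypothetical. The sender is passed as a string here because a model class cannot be serialized as-is.

```python
import json
import queue


class DelayedSignal(object):
    """Queue a serialized payload on send(); run receivers later in a worker."""

    def __init__(self, backlog):
        self.backlog = backlog
        self.receivers = []

    def connect(self, receiver, sender):
        self.receivers.append((receiver, sender))

    def send(self, sender, data, **kwargs):
        # Only plain, JSON-serializable data crosses the queue.
        self.backlog.put(
            json.dumps({"sender": sender, "data": data, "kwargs": kwargs}))

    def process_one(self):
        # Called by a worker, outside the request/response cycle.
        message = json.loads(self.backlog.get())
        for receiver, sender in self.receivers:
            if sender == message["sender"]:
                receiver(message["data"], sender=message["sender"],
                         **message["kwargs"])


delayed_save = DelayedSignal(queue.Queue())
seen = []


def my_func(data, sender, created, **kwargs):
    seen.append((data["id"], created))


delayed_save.connect(my_func, sender="Post")
delayed_save.send("Post", {"id": 1}, created=True)  # returns immediately
delayed_save.process_one()                           # receiver runs here
```

This also shows why delayed signals are less friendly than real ones: the receiver gets a dictionary, not a live model instance.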

32

Caching

• Memcached

• Use pylibmc (newer libMemcached-based)

• Ticket #11675 (add pylibmc support)

• Third party applications:

• django-newcache, django-pylibmc

33

Caching (cont’d)

• libMemcached / pylibmc is configurable with “behaviors”.

• Memcached “single point of failure”

• Distributed system, but we must take precautions.

• Connection timeout to memcached can stall requests.

• Use `_auto_eject_hosts` and `_retry_timeout` behaviors to prevent reconnecting to dead caches.

34

Caching (cont’d)

• Default (naive) hashing behavior

• The hashed cache key, modulo the number of servers, gives the index into the server list.

• Removal of a server causes the majority of cache keys to be remapped to new servers.

CACHE_SERVERS = ['10.0.0.1', '10.0.0.2']
key = 'my_cache_key'
cache_server = CACHE_SERVERS[hash(key) % len(CACHE_SERVERS)]

35

Caching (cont’d)

• Better approach: consistent hashing

• libMemcached (pylibmc) uses libketama (http://tinyurl.com/lastfm-libketama)

• Addition / removal of a cache server remaps (K/n) cache keys (where K=number of keys and n=number of servers)
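To make the K/n claim concrete, the sketch below compares modulo hashing with a simple hash ring when one of four servers is removed. It is a simplified illustration of the ketama idea, not libketama itself; the server addresses, the md5-based hash and the 100 points per server are assumptions.

```python
import bisect
import hashlib


def stable_hash(value):
    # md5 gives the same result on every run, unlike the built-in hash().
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def modulo_server(key, servers):
    return servers[stable_hash(key) % len(servers)]


class HashRing(object):
    def __init__(self, servers, points_per_server=100):
        # Each server owns many points on the ring to even out its share.
        self._ring = sorted(
            (stable_hash("%s-%d" % (server, i)), server)
            for server in servers
            for i in range(points_per_server)
        )
        self._points = [point for point, _ in self._ring]

    def server_for(self, key):
        # First point clockwise from the key's hash, wrapping at the end.
        index = bisect.bisect(self._points, stable_hash(key)) % len(self._ring)
        return self._ring[index][1]


def fraction_remapped(lookup_before, lookup_after, keys):
    moved = sum(1 for k in keys if lookup_before(k) != lookup_after(k))
    return moved / float(len(keys))


servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
survivors = servers[:-1]
keys = ["key-%d" % i for i in range(10000)]

modulo_moved = fraction_remapped(
    lambda k: modulo_server(k, servers),
    lambda k: modulo_server(k, survivors),
    keys)

ring_before, ring_after = HashRing(servers), HashRing(survivors)
ring_moved = fraction_remapped(
    ring_before.server_for, ring_after.server_for, keys)
```

With four servers, modulo hashing moves roughly three quarters of the keys, while the ring moves only the keys that lived on the removed server, about a quarter.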

Image Source: http://sourceforge.net/apps/mediawiki/kai/index.php?title=Introduction

36


Caching (cont’d)

• Thundering herd (stampede) problem

• Invalidating a heavily accessed cache key causes many clients to refill cache.

• But everyone refetching to fill the cache from the data store or reprocessing data can cause things to get even slower.

• Most times, it’s ideal to return the previously invalidated cache value and let a single client refill the cache.

• django-newcache or MintCache (http://djangosnippets.org/snippets/793/) will do this for you.

• Filling the cache on invalidation, instead of deleting from it, also helps prevent the thundering herd problem.
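The "return the stale value and let a single client refill" approach can be sketched as below, in the spirit of MintCache but not copied from it. A plain dict stands in for memcached, entries are stored as (value, soft_expiry) pairs, and the function name and the `ttl` / `grace` parameters are illustrative.

```python
import time


def get_or_refresh(cache, key, compute, ttl=60, grace=30, now=time.time):
    """Serve stale values during a refresh so only one caller recomputes."""
    entry = cache.get(key)
    if entry is not None:
        value, soft_expiry = entry
        if now() < soft_expiry:
            return value  # fresh hit
        # Stale: push the soft expiry forward so other callers keep getting
        # the old value while this caller recomputes.
        cache[key] = (value, now() + grace)
    value = compute()
    cache[key] = (value, now() + ttl)
    return value
```

Against a real memcached the read and the re-set are two separate calls, so a few clients may still recompute at the same moment; the window is small compared with every client recomputing.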

37





Transactions

• TransactionMiddleware got us started, but down the road became a burden

• For postgresql_psycopg2, there’s a database option, OPTIONS['autocommit']

• Each query is in its own transaction. This means each request won’t start in a transaction.

• But sometimes we want transactions (e.g., saving multiple objects and rolling back on error)

38

Transactions (cont’d)

• Tips:

• Use autocommit for read slave databases.

• Isolate slow functions (e.g., external calls, template rendering) from transactions.

• Selective autocommit

• Most read-only views don’t need to be in transactions.

• Start in autocommit and switch to a transaction on write.
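"Start in autocommit and switch to a transaction on write" can be illustrated at the driver level. The sketch uses the standard-library sqlite3 module rather than PostgreSQL and Django, so it shows the pattern only; the table and function names are made up.

```python
import sqlite3

# isolation_level=None puts the sqlite3 driver in autocommit mode.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE comment (id INTEGER PRIMARY KEY, body TEXT NOT NULL)")

# Reads and single writes run without an enclosing transaction.
conn.execute("INSERT INTO comment (body) VALUES ('first')")


def save_all(bodies):
    """Write several rows atomically: all of them or none."""
    conn.execute("BEGIN")
    try:
        for body in bodies:
            conn.execute("INSERT INTO comment (body) VALUES (?)", (body,))
    except sqlite3.Error:
        conn.execute("ROLLBACK")
        raise
    conn.execute("COMMIT")


def count():
    return conn.execute("SELECT COUNT(*) FROM comment").fetchone()[0]
```

Read-only code paths never pay for a transaction, and the multi-row write still rolls back cleanly if any row fails.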

39

Scaling the Team

• Small team of engineers

• Monthly users / developers = 40m

• Which means writing tests..

• ..and having a dead simple workflow

40

Keeping it Simple

• A developer can be up and running in a few minutes

• assuming postgres and other server applications are already installed

• pip, virtualenv

• settings.py

41

Setting Up Local

1. createdb -E UTF-8 disqus

2. git clone git://repo

3. mkvirtualenv disqus

4. pip install -U -r requirements.txt

5. ./manage.py syncdb && ./manage.py migrate

42

Sane Defaults

from disqus.conf.settings.default import *

try:
    from local_settings import *
except ImportError:
    import sys, traceback
    sys.stderr.write("Can't find 'local_settings.py'\n")
    sys.stderr.write("\nThe exception was:\n\n")
    traceback.print_exc()

settings.py

from disqus.conf.settings.dev import *

local_settings.py

43

Continuous Integration

• Daily deploys with Fabric

• several times an hour on some days

• Hudson keeps our builds going

• combined with Selenium

• Post-commit hooks for quick testing

• like Pyflakes

• Reverting to a previous version is a matter of seconds

44

Continuous Integration (cont’d)

Hudson makes integration easy

45

Testing

• It’s not fun breaking things when you’re the new guy

• Our testing process is fairly heavy

• 70k (Python) LOC, 73% coverage, 20 min suite

• Custom Test Runner (unittest)

• We needed XML, Selenium, Query Counts

• Database proxies (for read-slave testing)

• Integration with our Queue

46

Testing (cont’d)

Query Counts

# failures yield a dump of queries
def test_read_slave(self):
    Model.objects.using('read_slave').count()
    self.assertQueryCount(1, 'read_slave')

Selenium

def test_button(self):
    self.selenium.click('//a[@class="dsq-button"]')

Queue Integration

class WorkerTest(DisqusTest):
    workers = ['fire_signal']

    def test_delayed_signal(self):
        ...

47

Bug Tracking

• Switched from Trac to Redmine

• We wanted Subtasks

• Emailing exceptions is a bad idea

• Even if it’s localhost

• Previously using django-db-log to aggregate errors to a single point

• We’ve overhauled db log and are releasing Sentry

48

django-sentry

Groups messages intelligently

http://github.com/dcramer/django-sentry

49



django-sentry (cont’d)

Similar feel to Django’s debugger


50



Feature Switches

• We needed a safety in case a feature wasn’t performing well at peak

• it had to respond without delay, globally, and without writing to disk

• Allows us to work out of trunk (mostly)

• Easy to release new features to a portion of your audience

• Also nice for “Labs” type projects
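Releasing a feature to a portion of the audience can be sketched with a percentage switch. This is not the Disqus implementation (see the feature switches link in the references); the in-memory `SWITCHES` table, the switch names and the md5 bucketing are assumptions.

```python
import hashlib

# Hypothetical in-memory switch table: name -> percentage of users enabled.
SWITCHES = {"realtime": 25, "new_editor": 0}


def is_active(name, user_id, switches=SWITCHES):
    """Return True if the switch is on for this user.

    Hashing (name, user_id) gives each user a stable bucket from 0 to 99,
    so the same user always gets the same answer for a given percentage.
    """
    percent = switches.get(name, 0)
    digest = hashlib.md5(
        ("%s:%s" % (name, user_id)).encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent
```

Setting a percentage to 0 turns the feature off for everyone at once, which is the safety valve the first bullet asks for. Raising it only adds users, since each user's bucket never changes.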

51

Feature Switches (cont’d)

52

Final Thoughts

• The language (usually) isn’t your problem

• We like Django

• But we maintain local patches

• Some tickets don’t have enough of a following

• Patches, like #17, completely change Django..

• ..arguably in a good way

• Others don’t have champions

Ticket #17 describes making the ORM an identity mapper

53

Housekeeping

Want to learn from others about performance and scaling problems?

Birds of a Feather

We’re Hiring!

DISQUS is looking for amazing engineers

Or play some StarCraft 2?

54

Questions

55

References

django-sentry: http://github.com/dcramer/django-sentry

Our Feature Switches: http://cl.ly/2FYt and http://blog.disqus.com/post/789540337/partial-deployment-with-feature-switches

Andy McCurdy’s update(): http://github.com/andymccurdy/django-tips-and-tricks

Our PyFlakes Fork: http://github.com/dcramer/pyflakes

SkinnyQuerySet: http://gist.github.com/550438

django-newcache: http://github.com/ericflo/django-newcache

attach_foreignkey (Pythonic Joins): http://gist.github.com/567356

56


