
DjangoCon 2010 Scaling Disqus


DESCRIPTION

Disqus' presentation at DjangoCon 2010. Covers their basic hardware setup, some of their concerns with database usage, and how they manage with a small team of engineers.


Page 1: DjangoCon 2010 Scaling Disqus

Jason Yan (@jasonyan)

David Cramer (@zeeg)

Scaling the World’s Largest Django App

1

Page 2: DjangoCon 2010 Scaling Disqus

What is DISQUS?

2

Page 3: DjangoCon 2010 Scaling Disqus

What is DISQUS?

We are a comment system with an emphasis on connecting communities

http://disqus.com/about/

dis·cuss • dĭ-skŭs'

3

Page 4: DjangoCon 2010 Scaling Disqus

What is Scale?

17,000 requests/second peak

450,000 websites

15 million profiles

75 million comments

250 million visitors (August 2010)

[Chart: "Our traffic at a glance". Y-axis: Number of Visitors, 50M to 300M]

4

Page 5: DjangoCon 2010 Scaling Disqus

Our Challenges

• We can’t predict when things will happen

• Random celebrity gossip

• Natural disasters

• Discussions never expire

• We can’t keep those millions of articles from 2008 in the cache

• You don’t know in advance (generally) where the traffic will be

• Especially with dynamic paging, realtime, sorting, personal prefs, etc.

5

Page 6: DjangoCon 2010 Scaling Disqus

Our Challenges (cont’d)

• High availability

• Not a destination site

• Difficult to schedule maintenance

6

Page 7: DjangoCon 2010 Scaling Disqus

Server Architecture

7

Page 8: DjangoCon 2010 Scaling Disqus

Server Architecture - Load Balancing

• Load Balancing

• Software, HAProxy

• High performance, intelligent server availability checking

• Bonus: Nice statistics reporting

• High Availability

• heartbeat

Image Source: http://haproxy.1wt.eu/

8

Page 9: DjangoCon 2010 Scaling Disqus

Server Architecture

• ~100 Servers

• 30% Web Servers (Apache + mod_wsgi)

• 10% Databases (PostgreSQL)

• 25% Cache Servers (memcached)

• 20% Load Balancing / High Availability (HAProxy + heartbeat)

• 15% Utility Servers (Python scripts)

9

Page 10: DjangoCon 2010 Scaling Disqus

Server Architecture - Web Servers

• Apache 2.2

• mod_wsgi

• Using `maximum-requests` to plug memory leaks.

• Performance Monitoring

• Custom middleware (PerformanceLogMiddleware)

• Ships performance statistics (DB queries, external calls, template rendering, etc) through syslog

• Collected and graphed through Ganglia
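The slides name PerformanceLogMiddleware but do not show its code. The following is a minimal sketch of what such a middleware might look like, using the old-style `process_request` / `process_response` hooks. The `_perf_start` attribute, the logger name, and the log format are all assumptions. It writes through the standard `logging` module; a `logging.handlers.SysLogHandler` could be attached to the logger to ship the lines to syslog.

```python
import logging
import time

logger = logging.getLogger("perf")  # attach a SysLogHandler in production


class PerformanceLogMiddleware(object):
    """Hypothetical sketch: time each request and log one line per response."""

    def process_request(self, request):
        # Remember when the request started; returning None lets it continue.
        request._perf_start = time.time()
        return None

    def process_response(self, request, response):
        start = getattr(request, "_perf_start", None)
        if start is not None:
            elapsed_ms = (time.time() - start) * 1000.0
            logger.info("path=%s status=%s total_ms=%.1f",
                        request.path, response.status_code, elapsed_ms)
        return response
```

A fuller version would presumably also record query counts and template rendering time, as the slide describes; this sketch only covers total request time.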

10

Page 11: DjangoCon 2010 Scaling Disqus

Server Architecture - Database

• PostgreSQL

• Slony-I for Replication

• Trigger-based

• Read slaves for extra read capacity

• Failover master database for high availability

11

Page 12: DjangoCon 2010 Scaling Disqus

Server Architecture - Database

• Make sure indexes fit in memory and measure I/O

• High I/O generally means slow queries due to missing indexes or indexes not in buffer cache

• Log Slow Queries

• syslog-ng + pgFouine + cron to automate slow query logging

12

Page 13: DjangoCon 2010 Scaling Disqus

Server Architecture - Database

• Use connection pooling

• Django doesn’t do this for you

• We use pgbouncer

• Limits the maximum number of connections your database needs to handle

• Save on costly opening and tearing down of new database connections
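As a sketch of how an application might be pointed at pgbouncer rather than directly at PostgreSQL: the only change on the Django side is usually the host and port in the database settings. The host, database name, and user below are placeholders, and the port assumes pgbouncer is listening on its default of 6432.

```python
# Hypothetical settings fragment: Django connects to pgbouncer,
# and pgbouncer holds the real connections to PostgreSQL.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql_psycopg2",
        "NAME": "disqus",        # placeholder database name
        "USER": "disqus",        # placeholder user
        "HOST": "127.0.0.1",     # where pgbouncer runs (assumed local)
        "PORT": "6432",          # pgbouncer's default listen port
    }
}
```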

13

Page 14: DjangoCon 2010 Scaling Disqus

Our Data Model

14

Page 15: DjangoCon 2010 Scaling Disqus

Partitioning

• Fairly easy to implement, quick wins

• Done at the application level

• Data is replayed by Slony

• Two methods of data separation

15

Page 16: DjangoCon 2010 Scaling Disqus

Vertical Partitioning

Vertical partitioning involves creating tables with fewer columns and using additional tables to store the remaining columns.

http://en.wikipedia.org/wiki/Partition_(database)

[Diagram labels: Posts, Users, Forums, Sentry]

16

Page 17: DjangoCon 2010 Scaling Disqus

Pythonic Joins

    posts = Post.objects.all()[0:25]

    # store users in a dictionary based on primary key
    users = dict(
        (u.pk, u) for u in
        User.objects.filter(pk__in=set(p.user_id for p in posts))
    )

    # map users to their posts
    for p in posts:
        p._user_cache = users.get(p.user_id)

Allows us to separate datasets

17

Page 18: DjangoCon 2010 Scaling Disqus

Pythonic Joins (cont’d)

• Slower than at database level

• But not enough that you should care

• Trading performance for scale

• Allows us to separate data

• Easy vertical partitioning

• More efficient caching

• get_many, object-per-row cache
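To make the trade-off concrete, here is a framework-free sketch of the same join. The `USERS` and `POSTS` data and the `fetch_users` and `attach_users` helpers are invented for illustration; `fetch_users` stands in for a single batched `pk__in` query (or a cache `get_many`) against whichever database holds users.

```python
from collections import namedtuple

User = namedtuple("User", "pk name")


class Post(object):
    def __init__(self, pk, user_id):
        self.pk = pk
        self.user_id = user_id
        self.user = None  # filled in by attach_users()


# Two separate "tables"; in production these could live in different databases.
USERS = {1: User(1, "alice"), 2: User(2, "bob")}
POSTS = [Post(10, 1), Post(11, 2), Post(12, 1)]


def fetch_users(pks):
    """Stand-in for User.objects.filter(pk__in=pks): one batched lookup."""
    return [USERS[pk] for pk in pks if pk in USERS]


def attach_users(posts):
    """Join posts to users in Python using a single batched user fetch."""
    wanted = set(p.user_id for p in posts)
    users = dict((u.pk, u) for u in fetch_users(wanted))
    for p in posts:
        p.user = users.get(p.user_id)
    return posts


attach_users(POSTS)
```

Because the join never crosses tables in SQL, the posts and users can be moved to different servers without changing this code.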

18

Page 19: DjangoCon 2010 Scaling Disqus

Designating Masters

• Alleviates some of the write load on your primary application master

• Masters exist under specific conditions:

• application use case

• partitioned data

• Database routers make this (fairly) easy

19

Page 20: DjangoCon 2010 Scaling Disqus

Routing by Application

    class ApplicationRouter(object):
        def db_for_read(self, model, **hints):
            instance = hints.get('instance')
            if not instance:
                return None
            app_label = instance._meta.app_label

            return get_application_alias(app_label)
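The slides do not show `get_application_alias`. One plausible shape for it, sketched here with an invented mapping, is a simple lookup from app label to database alias with a fallback to the default database.

```python
# Hypothetical mapping from Django app label to database alias.
APPLICATION_ALIASES = {
    "sentry": "sentry",
    "forums": "forums",
}
DEFAULT_ALIAS = "default"


def get_application_alias(app_label):
    """Return the database alias for an app, falling back to the default."""
    return APPLICATION_ALIASES.get(app_label, DEFAULT_ALIAS)
```

A router like this is activated by listing its dotted path in Django's `DATABASE_ROUTERS` setting; a write-side counterpart would implement `db_for_write` the same way.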

20

Page 21: DjangoCon 2010 Scaling Disqus

Horizontal Partitioning

Horizontal partitioning (also known as sharding) involves splitting one set of data into different tables.

http://en.wikipedia.org/wiki/Partition_(database)

[Diagram labels: Your Blog, CNN, Disqus, Telegraph]

21

Page 22: DjangoCon 2010 Scaling Disqus

Horizontal Partitions

• Some forums have very large datasets

• Partners need high availability

• Helps scale the write load on the master

• We rely more on vertical partitions

22

Page 23: DjangoCon 2010 Scaling Disqus

Routing by Partition

    class ForumPartitionRouter(object):
        def db_for_read(self, model, **hints):
            instance = hints.get('instance')
            if not instance:
                return None
            forum_id = getattr(instance, 'forum_id', None)
            if not forum_id:
                return None

            return get_forum_alias(forum_id)

    # Now, making sure hints are available
    forum.post_set.all()

    # What we used to do
    Post.objects.filter(forum=forum)
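`get_forum_alias` is likewise not shown. Since the slides say only some forums are partitioned, a lookup table of pinned forums is one way it could work. The forum IDs and alias names below are made up.

```python
# Hypothetical: a few large forums are pinned to their own partitions.
FORUM_PARTITIONS = {
    1001: "forum_partition_1",
    1002: "forum_partition_2",
}


def get_forum_alias(forum_id):
    """Return the alias of the partition holding this forum, or None.

    Returning None from a router method tells Django to try the next
    router, and finally to fall back to the 'default' database.
    """
    return FORUM_PARTITIONS.get(forum_id)
```

This also shows why the hint matters: `forum.post_set.all()` passes the forum instance to the router, whereas `Post.objects.filter(forum=forum)` gives the router nothing to route on.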

23

Page 24: DjangoCon 2010 Scaling Disqus

Optimizing QuerySets

• We really dislike raw SQL

• It creates more work when dealing with partitions

• Built-in cache allows sub-slicing

• But isn’t always needed

• We removed this cache

24

Page 25: DjangoCon 2010 Scaling Disqus

Removing the Cache

• Django internally caches the results of your QuerySet

• This adds additional memory overhead

• Many times you only need to view a result set once

• So we built SkinnyQuerySet

    # 1 query
    qs = Model.objects.all()[0:100]

    # 0 queries (we don't need this behavior)
    qs = qs[0:10]

    # 1 query
    qs = qs.filter(foo=bar)

25

Page 26: DjangoCon 2010 Scaling Disqus

Removing the Cache (cont’d)

    class SkinnyQuerySet(QuerySet):
        def __iter__(self):
            if self._result_cache is not None:
                # __len__ must have been run
                return iter(self._result_cache)

            has_run = getattr(self, 'has_run', False)
            if has_run:
                raise QuerySetDoubleIteration("...")
            self.has_run = True
            # We wanted .iterator() as the default
            return self.iterator()

Optimizing memory usage by removing the cache

http://gist.github.com/550438
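The idea can be illustrated without Django. The sketch below is not the SkinnyQuerySet implementation; `SinglePassResults`, `DoubleIteration`, and `fetch_rows` are invented names. It streams rows once, keeps nothing in memory, and fails loudly on a second pass instead of silently running the query again.

```python
class DoubleIteration(Exception):
    pass


class SinglePassResults(object):
    """Streams rows from `fetch` once and never keeps them in memory."""

    def __init__(self, fetch):
        self._fetch = fetch      # callable returning an iterator of rows
        self._has_run = False

    def __iter__(self):
        if self._has_run:
            raise DoubleIteration("results were already consumed")
        self._has_run = True
        return self._fetch()


queries = []


def fetch_rows():
    queries.append("SELECT ...")   # record that the "database" was hit
    return iter([1, 2, 3])


results = SinglePassResults(fetch_rows)
total = sum(results)               # first pass: one "query"
```

Raising on the second pass turns an accidental double query into an error that shows up in tests.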

26

Page 27: DjangoCon 2010 Scaling Disqus

Atomic Updates

• Keeps your data consistent

• save() isn't thread-safe

• use update() instead

• Great for things like counters

• But should be considered for all write operations

27

Page 28: DjangoCon 2010 Scaling Disqus

Atomic Updates (cont’d)

Thread safety is impossible with .save()

Request 1

    post = Post(pk=1)
    # a moderator approves
    post.approved = True
    post.save()

Request 2

    post = Post(pk=1)
    # the author adjusts their message
    post.message = 'Hello!'
    post.save()

28

Page 29: DjangoCon 2010 Scaling Disqus

Atomic Updates (cont’d)

So we need atomic updates

Request 1

    post = Post(pk=1)
    # a moderator approves
    Post.objects.filter(pk=post.pk)\
        .update(approved=True)

Request 2

    post = Post(pk=1)
    # the author adjusts their message
    Post.objects.filter(pk=post.pk)\
        .update(message='Hello!')
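The difference between the two approaches can be simulated with a dictionary standing in for the table. The `load`, `save`, and `update` helpers are illustrative stand-ins, not Django APIs: `save` writes every column back from a possibly stale copy, while `update` writes only the named columns.

```python
# The "database": one row of the posts table, keyed by primary key.
TABLE = {1: {"approved": False, "message": "Hi"}}


def load(pk):
    """Read a full copy of the row, like fetching a model instance."""
    return dict(TABLE[pk])


def save(pk, row):
    """Write every column back, like Model.save()."""
    TABLE[pk] = dict(row)


def update(pk, **columns):
    """Write only the named columns, like QuerySet.update()."""
    TABLE[pk].update(columns)


# Both requests read the row before either one writes.
moderator_copy = load(1)
author_copy = load(1)

moderator_copy["approved"] = True
save(1, moderator_copy)
author_copy["message"] = "Hello!"
save(1, author_copy)          # overwrites approved with the stale False
after_save = load(1)

# Same interleaving, but each request writes only its own column.
TABLE[1] = {"approved": False, "message": "Hi"}
update(1, approved=True)
update(1, message="Hello!")
after_update = load(1)
```

With full-row saves the moderator's approval is lost; with column-level updates both changes survive.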

29

Page 30: DjangoCon 2010 Scaling Disqus

Atomic Updates (cont’d)

    def update(obj, using=None, **kwargs):
        """
        Updates specified attributes on the current instance.
        """
        assert obj, "Instance has not yet been created."
        # Issue a single UPDATE for just the given columns
        obj.__class__._base_manager.using(using)\
            .filter(pk=obj.pk)\
            .update(**kwargs)
        # Mirror the new values onto the in-memory instance
        for k, v in kwargs.items():
            if isinstance(v, ExpressionNode):
                # NotImplemented
                continue
            setattr(obj, k, v)

A better way to approach updates

http://github.com/andymccurdy/django-tips-and-tricks/blob/master/model_update.py

30

Page 31: DjangoCon 2010 Scaling Disqus

Delayed Signals

• Queueing low priority tasks

• even if they’re fast

• Asynchronous (Delayed) signals

• very friendly to the developer

• ..but not as friendly as real signals

31

Page 32: DjangoCon 2010 Scaling Disqus

Delayed Signals (cont’d)

    from disqus.common.signals import delayed_save

    def my_func(data, sender, created, **kwargs):
        print(data['id'])

    delayed_save.connect(my_func, sender=Post)

We send a specific serialized version of the model for delayed signals

This is all handled through our Queue
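The queue and the `delayed_save` implementation are internal to Disqus and not shown, so the following is only a sketch of the general pattern under assumed names (`QUEUE`, `connect`, `send_delayed`, `run_worker`): at save time the model is serialized and enqueued, and a worker later delivers the serialized data to the connected receivers.

```python
import json
from collections import deque

QUEUE = deque()      # stand-in for the real queue
RECEIVERS = []       # functions connected to the delayed signal


def connect(func):
    RECEIVERS.append(func)


def send_delayed(instance_data, created):
    """Called at save time: serialize and enqueue, do no other work."""
    QUEUE.append(json.dumps({"data": instance_data, "created": created}))


def run_worker():
    """Called later by a worker process: deliver each queued message."""
    while QUEUE:
        message = json.loads(QUEUE.popleft())
        for func in RECEIVERS:
            func(data=message["data"], created=message["created"])


seen = []
connect(lambda data, created: seen.append(data["id"]))
send_delayed({"id": 42, "message": "Hello!"}, created=True)
```

The receiver gets a plain dictionary rather than a live model instance, which is why delayed signals are slightly less convenient than real ones.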

32

Page 33: DjangoCon 2010 Scaling Disqus

Caching

• Memcached

• Use pylibmc (newer libMemcached-based)

• Ticket #11675 (add pylibmc support)

• Third party applications:

• django-newcache, django-pylibmc

33

Page 34: DjangoCon 2010 Scaling Disqus

Caching (cont’d)

• libMemcached / pylibmc is configurable with “behaviors”.

• Memcached “single point of failure”

• Distributed system, but we must take precautions.

• Connection timeout to memcached can stall requests.

• Use `_auto_eject_hosts` and `_retry_timeout` behaviors to prevent reconnecting to dead caches.

34

Page 35: DjangoCon 2010 Scaling Disqus

Caching (cont’d)

• Default (naive) hashing behavior

• The hashed cache key, modulo the number of servers, gives the index into the server list.

• Removal of a server causes majority of cache keys to be remapped to new servers.

    CACHE_SERVERS = ['10.0.0.1', '10.0.0.2']
    key = 'my_cache_key'
    cache_server = CACHE_SERVERS[hash(key) % len(CACHE_SERVERS)]
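To see how costly this is, the sketch below measures how many keys change servers when one of four servers is removed. The server addresses and key names are made up, and MD5 is used only to get a hash that is stable between runs. With a uniform hash, a key stays put only when its hash gives the same remainder modulo 4 and modulo 3, so about three quarters of the keys are expected to move.

```python
import hashlib


def stable_hash(key):
    """Deterministic hash (Python's built-in hash() can vary between runs)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def pick_server(key, servers):
    return servers[stable_hash(key) % len(servers)]


def fraction_remapped(keys, before, after):
    moved = sum(1 for k in keys
                if pick_server(k, before) != pick_server(k, after))
    return moved / float(len(keys))


keys = ["key:%d" % i for i in range(10000)]
four = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
three = four[:3]  # the last server is removed
moved = fraction_remapped(keys, four, three)
```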

35

Page 36: DjangoCon 2010 Scaling Disqus

Caching (cont’d)

• Better approach: consistent hashing

• libMemcached (pylibmc) uses libketama (http://tinyurl.com/lastfm-libketama)

• Addition / removal of a cache server remaps only about K/n cache keys (where K=number of keys and n=number of servers)

Image Source: http://sourceforge.net/apps/mediawiki/kai/index.php?title=Introduction
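A simplified hash ring shows the property. This is a sketch of the general technique, not the actual libketama algorithm: the `HashRing` class, the choice of 100 points per server, and the MD5-based hash are assumptions. Each server is placed at many points on a ring, and a key belongs to the first server point at or after the key's hash.

```python
import bisect
import hashlib


def stable_hash(key):
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing(object):
    """Simplified consistent-hash ring; not the actual libketama algorithm."""

    def __init__(self, servers, points_per_server=100):
        self._ring = sorted(
            (stable_hash("%s#%d" % (server, i)), server)
            for server in servers
            for i in range(points_per_server)
        )
        self._hashes = [h for h, _ in self._ring]

    def pick_server(self, key):
        # First point clockwise from the key's hash, wrapping at the end.
        index = bisect.bisect(self._hashes, stable_hash(key)) % len(self._ring)
        return self._ring[index][1]


keys = ["key:%d" % i for i in range(10000)]
four = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
before = HashRing(four)
after = HashRing(four[:3])  # the last server is removed
moved = [k for k in keys if before.pick_server(k) != after.pick_server(k)]
```

Only keys that lived on the removed server move; every key on a surviving server keeps its place, because that server's points on the ring are unchanged.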

36

Page 37: DjangoCon 2010 Scaling Disqus

Caching (cont’d)

• Thundering herd (stampede) problem

• Invalidating a heavily accessed cache key causes many clients to refill cache.

• But everyone refetching to fill the cache from the data store or reprocessing data can cause things to get even slower.

• Most times, it’s ideal to return the previously invalidated cache value and let a single client refill the cache.

• django-newcache or MintCache (http://djangosnippets.org/snippets/793/) will do this for you.

• Filling the cache on invalidation, instead of deleting from it, also helps prevent the thundering herd problem.
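The stale-while-refresh idea can be sketched over a plain dictionary. This is an illustration of the approach, not the code of django-newcache or the MintCache snippet; the class below, its `grace` parameter, and the injectable `clock` are assumptions. Each value is stored with a soft expiry. The first caller to see a stale entry extends it briefly and is told to refill; everyone else keeps receiving the old value in the meantime.

```python
import time


class MintCache(object):
    """Sketch of stale-while-refresh caching over a plain dict."""

    def __init__(self, grace=30, clock=time.time):
        self._store = {}      # key -> (value, soft_expiry)
        self._grace = grace   # seconds a refiller gets to rebuild the value
        self._clock = clock

    def set(self, key, value, timeout):
        self._store[key] = (value, self._clock() + timeout)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, soft_expiry = entry
        if self._clock() < soft_expiry:
            return value
        # Stale: push the expiry forward so other callers keep getting the
        # old value, and report a miss to this caller so it refills.
        self._store[key] = (value, self._clock() + self._grace)
        return None
```

Against a real memcached the check and the re-set are not atomic, so a few clients may still refill at once, but far fewer than without the soft expiry.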

37

Page 38: DjangoCon 2010 Scaling Disqus

Transactions

• TransactionMiddleware got us started, but down the road became a burden

• For postgresql_psycopg2, there's a database option, OPTIONS['autocommit']

• Each query is in its own transaction. This means each request won’t start in a transaction.

• But sometimes we want transactions (e.g., saving multiple objects and rolling back on error)

38

Page 39: DjangoCon 2010 Scaling Disqus

Transactions (cont’d)

• Tips:

• Use autocommit for read slave databases.

• Isolate slow functions (e.g., external calls, template rendering) from transactions.

• Selective autocommit

• Most read-only views don’t need to be in transactions.

• Start in autocommit and switch to a transaction on write.
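The pattern of starting in autocommit and opening a transaction only for writes can be demonstrated with the standard library. Note the swap: this sketch uses SQLite through `sqlite3` rather than PostgreSQL through Django, so it shows the idea, not the Disqus setup. The `posts` table and the helper functions are invented for the example.

```python
import sqlite3

# isolation_level=None puts the sqlite3 module in autocommit mode.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, message TEXT NOT NULL)")


def count_posts():
    """Read-only query: runs in autocommit, no transaction held open."""
    return conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]


def save_all(messages):
    """Multi-row write: open an explicit transaction, roll back on error."""
    conn.execute("BEGIN")
    try:
        for message in messages:
            conn.execute("INSERT INTO posts (message) VALUES (?)", (message,))
    except sqlite3.Error:
        conn.execute("ROLLBACK")
        raise
    conn.execute("COMMIT")


save_all(["first", "second"])
try:
    save_all(["third", None])   # None violates NOT NULL, so nothing is kept
except sqlite3.IntegrityError:
    pass
```

The failed batch leaves no partial rows behind, while the read path never pays for a transaction it does not need.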

39

Page 40: DjangoCon 2010 Scaling Disqus

Scaling the Team

• Small team of engineers

• Monthly users / developers = 40m

• Which means writing tests..

• ..and having a dead simple workflow

40

Page 41: DjangoCon 2010 Scaling Disqus

Keeping it Simple

• A developer can be up and running in a few minutes

• assuming postgres and other server applications are already installed

• pip, virtualenv

• settings.py

41

Page 42: DjangoCon 2010 Scaling Disqus

Setting Up Local

1. createdb -E UTF-8 disqus

2. git clone git://repo

3. mkvirtualenv disqus

4. pip install -U -r requirements.txt

5. ./manage.py syncdb && ./manage.py migrate

42

Page 43: DjangoCon 2010 Scaling Disqus

Sane Defaults

settings.py

    from disqus.conf.settings.default import *

    try:
        from local_settings import *
    except ImportError:
        import sys, traceback
        sys.stderr.write("Can't find 'local_settings.py'\n")
        sys.stderr.write("\nThe exception was:\n\n")
        traceback.print_exc()

local_settings.py

    from disqus.conf.settings.dev import *

43

Page 44: DjangoCon 2010 Scaling Disqus

Continuous Integration

• Daily deploys with Fabric

• several times an hour on some days

• Hudson keeps our builds going

• combined with Selenium

• Post-commit hooks for quick testing

• like Pyflakes

• Reverting to a previous version is a matter of seconds

44

Page 45: DjangoCon 2010 Scaling Disqus

Continuous Integration (cont’d)

Hudson makes integration easy

45

Page 46: DjangoCon 2010 Scaling Disqus

Testing

• It’s not fun breaking things when you’re the new guy

• Our testing process is fairly heavy

• 70k (Python) LOC, 73% coverage, 20 min suite

• Custom Test Runner (unittest)

• We needed XML, Selenium, Query Counts

• Database proxies (for read-slave testing)

• Integration with our Queue

46

Page 47: DjangoCon 2010 Scaling Disqus

Testing (cont’d)

Query Counts

    # failures yield a dump of queries
    def test_read_slave(self):
        Model.objects.using('read_slave').count()
        self.assertQueryCount(1, 'read_slave')

Selenium

    def test_button(self):
        self.selenium.click('//a[@class="dsq-button"]')

Queue Integration

    class WorkerTest(DisqusTest):
        workers = ['fire_signal']

        def test_delayed_signal(self):
            ...

47

Page 48: DjangoCon 2010 Scaling Disqus

Bug Tracking

• Switched from Trac to Redmine

• We wanted Subtasks

• Emailing exceptions is a bad idea

• Even if it's localhost

• Previously using django-db-log to aggregate errors to a single point

• We’ve overhauled db log and are releasing Sentry

48

Page 49: DjangoCon 2010 Scaling Disqus

django-sentry

Groups messages intelligently

http://github.com/dcramer/django-sentry

49

Page 50: DjangoCon 2010 Scaling Disqus

django-sentry (cont’d)

Similar feel to Django’s debugger

http://github.com/dcramer/django-sentry

50

Page 51: DjangoCon 2010 Scaling Disqus

Feature Switches

• We needed a safety in case a feature wasn’t performing well at peak

• it had to respond without delay, globally, and without writing to disk

• Allows us to work out of trunk (mostly)

• Easy to release new features to a portion of your audience

• Also nice for “Labs” type projects
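The slides do not show how the switches are implemented (the original slide 52 is a screenshot). As a rough sketch of a percentage rollout under assumed names, the switch table below lives in memory and each user is hashed into a stable bucket, so the same user always gets the same answer for a given switch. The switch names and percentages are invented.

```python
import hashlib

# Hypothetical in-memory switch table: name -> percentage of users enabled.
SWITCHES = {"realtime": 100, "new_sorting": 25, "labs_widget": 0}


def bucket(name, user_id):
    """Map (switch, user) to a stable bucket in the range 0..99."""
    raw = ("%s:%s" % (name, user_id)).encode("utf-8")
    return int(hashlib.md5(raw).hexdigest(), 16) % 100


def is_active(name, user_id):
    """True if the switch is on for this user; unknown switches are off."""
    return bucket(name, user_id) < SWITCHES.get(name, 0)
```

Setting a switch to 0 acts as the kill switch the slide describes; to take effect globally without writing to disk, the table would need to live in some shared in-memory store rather than a module-level dictionary.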

51

Page 52: DjangoCon 2010 Scaling Disqus

Feature Switches (cont’d)

52

Page 53: DjangoCon 2010 Scaling Disqus

Final Thoughts

• The language (usually) isn’t your problem

• We like Django

• But we maintain local patches

• Some tickets don’t have enough of a following

• Patches, like #17, completely change Django..

• ..arguably in a good way

• Others don’t have champions

Ticket #17 describes making the ORM an identity mapper

53

Page 54: DjangoCon 2010 Scaling Disqus

Housekeeping

Want to learn from others about performance and scaling problems?

Birds of a Feather

We’re Hiring!

DISQUS is looking for amazing engineers

Or play some StarCraft 2?

54

Page 55: DjangoCon 2010 Scaling Disqus

Questions

55

Page 56: DjangoCon 2010 Scaling Disqus

References

django-sentry: http://github.com/dcramer/django-sentry

Our Feature Switches: http://cl.ly/2FYt

Andy McCurdy's update(): http://github.com/andymccurdy/django-tips-and-tricks

Our PyFlakes Fork: http://github.com/dcramer/pyflakes

SkinnyQuerySet: http://gist.github.com/550438

django-newcache: http://github.com/ericflo/django-newcache

attach_foreignkey (Pythonic Joins): http://gist.github.com/567356

56