Follow a Firefox crash from its genesis in a collapsing browser process through the dizzying array of collection, storage, and reporting systems that make up Socorro, our open-source crash collector. Enjoy war stories of weird, interlocking failures, and see how we nevertheless continue to fulfill our mandate: “Never lose a crash.” Observe some patterns that emerged from this system which can be useful in yours.
What Happens When Firefox Crashes?
or: It’s Not My Fault Tolerance
by Erik Rose
Welcome! [Erik Rose, if not introduced.] I write server-side code at Mozilla, and I’m here to tell you about the Big Data systems behind Firefox crash reporting.
• A browser is a complex piece of software, and challenging to test.
• It interacts with a lot of other software: JS add-ons, compiled plugins, OSes, different hardware. Even the unique timings of your setup can trigger bugs.
• Also, there are 50 billion to 1 trillion web pages, and they do unpredictable, creative things.
• ***Any of which could make Firefox explode.***
• That’s why, in addition to an extensive test suite and manual testing, we invest a lot in crash reporting.
So today I want to show you what happens when Firefox crashes, and what the systems look like that receive and process the crash reports.
• If you’ve crashed Firefox, you’ve seen this dialog.
• If you choose to send us a crash report, we use it to find new bugs and to decide where to concentrate our time.
Socorro
• The thing that receives Firefox crash reports is called Socorro.
• ***Open source.*** You can use it if you want. Very flexible.
• Used by Valve and Yandex.
• Socorro gets its name from the Very Large Array in Socorro, NM, because…
https://github.com/mozilla/socorro
Very Large Array, Socorro, New Mexico
…like that array, it receives signals from out in the universe and tries to filter patterns out of the noise.
• The VLA is 27 dish antennas, which can move to follow objects across the sky.
• Socorro is a very large array of slightly less expensive systems, which tracks crashes across the userbase.
The Big Picture
Let’s take a peek behind the curtain. You’ll recognize some things you’re doing yourself, and some other things might surprise you.
So let’s embark on our tour of Socorro!
• On its front end, it looks like this.
Public. We don’t hide our failures. Unusual.
You can drill into this to see, e.g., the top crashers:
• ***% of all crashes***
• signature (stack trace)
• breakdown by platform
• ticket correlations
• Another example: explosive crashes.
• Music charts have “bullets”: a song which rises quickly up the charts to suddenly become extremely popular.
• A crash we expect to see as 5% of all crashes, but then you wake up one morning and it’s 85% of all crashes.
• Generally what this means is that one of the major sites shipped a new piece of JS which crashes us.
• The most recent example of this was during the last Olympics, when Google released a new Doodle every day.
• I think it was this one that crashed us.
• On the one hand, we knew the problem was going away tomorrow. So that’s nice.
• On the other hand, a lot of people have Google set as their startup page. So that’s bad. ;-)
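That “bullet” detection can be sketched in a few lines. This is an illustrative formula, not Socorro’s actual explosiveness calculation: flag any signature whose share of today’s crashes dwarfs its recent daily average.

```python
def explosive(history, today_share, factor=10.0, floor=0.05):
    """Flag a signature whose share of all crashes today dwarfs its
    recent daily average. Thresholds are illustrative only."""
    baseline = sum(history) / len(history)
    # Require both an absolute floor and a big jump over the baseline.
    return today_share >= floor and today_share >= factor * baseline

# A signature that usually accounts for ~0.5% of crashes suddenly hits 85%:
flag = explosive([0.005, 0.004, 0.006], today_share=0.85)
```

A steady 5%-to-6% drift, by contrast, would not trip the detector.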
• You can also find…
• Most common crashes for a version, platform, etc.
• New crashes
• Correlations: ferret out interactions between plugins, for example.
• Pretty straightforward, right?
Backend is less straightforward…
[Architecture diagram: Breakpad and the Crash Reporter on the client; a Zeus load balancer in front of the Collectors, which write to local FS; Crash Movers feeding HBase and RabbitMQ; Processors, with debug symbols on NFS; PostgreSQL behind pgbouncer; elasticsearch; the Middleware and Web Front-end behind another Zeus pair, with memcached and LDAP; and cronjobs: Duplicate Finder, Bugzilla Associator, Automatic Emailer (talking to Bugzilla), Materialized View Builders (Active Daily Users, Signatures, Versions, Explosiveness), ADU Count Loader (from Vertica), and Version Scraper (from FTP).]
• Over 120 boxes, all physical.
• Why physical? Organizational momentum, and HBase doesn’t do so well virtualized: it’s very talky between nodes, so low latency is important.
• How much data? “The smallest big-data project.” It used to be considered big. Not anymore.
• Numbers:
• ***500M Firefox users***
• ***150M ADUs. Probably more.***
• ***3000 crashes/minute.*** 3M/day.
• ***A Firefox crash*** is 150KB–20MB (a hard ceiling: anything over 20MB is just an out-of-memory crash anyway, and full of corrupt garbage).
• ***800GB*** in PG
• ***110TB*** in HDFS. That’s replicated; 40TB of actual data.
• Dictum: “Never lose a crash.” We have all Firefox crashes from the very beginning.
• One reason for this is so a developer can go into the UI and request that a crash be processed, and it will be.
500M Firefox users
150M daily users
3000 crashes per minute
150KB-20MB per crash
800GB in PostgreSQL
40TB in HDFS, 110TB replicated
It all starts ***down here***, with Firefox. But even that’s made up of multiple moving parts.
These ***first 3*** pieces are all on the client side. The ***first 2*** are in the Firefox process.
• Breakpad
• Used by Firefox, Chrome, Google Earth, Camino, Picasa.
• Takes a stack dump of all threads: opaque; it doesn’t even know the frame boundaries.
• Plus a little other processor state.
• Throws it to another process: the ***Crash Reporter***.
Why another process? Remember, Firefox has crashed. Its state is unknown.
“The Crash Reporter, which is responsible for ***this little dialog***,” sends the binary crash dump plus JSON metadata in a POST to the collectors…
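That hand-off is ordinary multipart assembly, sketched below. The field names and the `upload_file_minidump` part name are assumptions for illustration, not the exact Firefox annotation set:

```python
import io

def build_crash_report(metadata: dict, minidump: bytes,
                       boundary: str = "socorro-boundary") -> bytes:
    """Assemble a multipart/form-data body of the shape a crash
    reporter POSTs to the collectors: one text part per metadata
    field, plus the opaque minidump as a binary file part."""
    buf = io.BytesIO()
    for name, value in metadata.items():
        part = (f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
                f"{value}\r\n")
        buf.write(part.encode())
    head = (f"--{boundary}\r\n"
            'Content-Disposition: form-data; name="upload_file_minidump"; '
            'filename="crash.dmp"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n")
    buf.write(head.encode())
    buf.write(minidump)  # the raw Breakpad dump, passed through untouched
    buf.write(f"\r\n--{boundary}--\r\n".encode())
    return buf.getvalue()

report = build_crash_report(
    {"ProductName": "Firefox", "Version": "25.0"},
    minidump=b"MDMP" + b"\x00" * 16,  # minidumps begin with the MDMP magic
)
```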
…which is where it really enters Socorro***…***
Collectors: super simple. They write crashes to ***local disk…***
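A minimal collector sketch, under assumptions: the real collector also applies throttle rules and encodes the submission date into the crash ID, neither of which is shown here.

```python
import json
import os
import tempfile
import uuid

def collect(metadata: dict, minidump: bytes, spool_dir: str) -> str:
    """Accept one crash, assign it an ID, and persist both halves to
    local disk for the Crash Movers to pick up. (Illustrative: the
    real collector's ID scheme and throttling are omitted.)"""
    crash_id = uuid.uuid4().hex
    os.makedirs(spool_dir, exist_ok=True)
    with open(os.path.join(spool_dir, crash_id + ".dump"), "wb") as f:
        f.write(minidump)            # the opaque binary dump
    with open(os.path.join(spool_dir, crash_id + ".json"), "w") as f:
        json.dump(metadata, f)       # the JSON metadata alongside it
    return crash_id

spool = tempfile.mkdtemp()
cid = collect({"ProductName": "Firefox"}, b"MDMP\x00", spool)
```

Writing to local disk first is the key design choice: the collector stays up even when everything downstream is on fire.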
Then, another process on the same box…
…the Crash Movers, pick crashes up off the local disk and send them to 2 places.
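The fan-out can be sketched with in-memory stand-ins for the two real sinks (HBase and RabbitMQ):

```python
import queue

class CrashMover:
    """Mover fan-out sketch: the full crash goes to the primary store
    (HBase in Socorro), while only the crash ID goes onto the queue
    (RabbitMQ) for the processors. Both sinks here are in-memory
    stand-ins for the real services."""

    def __init__(self, primary_store: dict, id_queue: "queue.Queue"):
        self.primary_store = primary_store
        self.id_queue = id_queue

    def move(self, crash_id: str, crash: bytes) -> None:
        self.primary_store[crash_id] = crash   # 1st: the durable store
        self.id_queue.put(crash_id)            # 2nd: wake a processor

store, ids = {}, queue.Queue()
CrashMover(store, ids).move("abc123", b"MDMP\x00")
```

Sending only the ID through the queue keeps the 20MB dumps out of the message broker.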
1st: → HBase. HBase is the primary store for crashes. 70 nodes.
At the same time…
IDs → Rabbit
• Soft realtime: priority and normal queues.
• Priority: process within 60 secs.
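The two-queue discipline, sketched with plain deques standing in for the RabbitMQ queues:

```python
from collections import deque

priority, normal = deque(), deque()

def submit(crash_id: str, urgent: bool = False) -> None:
    """Route a crash ID to the priority or the normal queue."""
    (priority if urgent else normal).append(crash_id)

def next_crash():
    """Drain the priority queue before the normal one; this is how a
    'process this within 60 seconds' request beats the firehose of
    routine crashes to the front of the line."""
    for q in (priority, normal):
        if q:
            return q.popleft()
    return None

submit("routine-1")
submit("urgent-1", urgent=True)
```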
• Processors
• Where the real action happens.
• To process a crash means to do what’s necessary to make it visible in the web UI.
• Take an ID from Rabbit.
• Binary → debug symbols.
• Signature generation.
• Then it puts the crash into buckets and adds it to PG and ES.
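Those steps can be sketched end to end. The `symbolicate` callable stands in for the real Breakpad stackwalker, and the signature rule (top frame) is a simplification of Socorro’s actual algorithm:

```python
def process(crash_id, store, pg, es, symbolicate):
    """Processor sketch: pull the raw crash from the primary store,
    turn binary addresses into symbols, derive a signature, and land
    the result in both the relational and the search stores."""
    raw = store[crash_id]
    frames = symbolicate(raw)        # e.g. ["nsAppShell::Run()", "main"]
    signature = frames[0] if frames else "EMPTY: no frames"
    pg[signature] = pg.get(signature, 0) + 1     # bucketed counts for reports
    es[crash_id] = {"signature": signature,      # full document for faceting
                    "frames": frames}
    return signature

store = {"abc": b"MDMP\x00"}
pg, es = {}, {}
sig = process("abc", store, pg, es,
              symbolicate=lambda raw: ["nsAppShell::Run()", "main"])
```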
First, PG.
• Postgres
• Our main interactive datastore: it’s what the web app and most batch jobs talk to.
• Stores unique crash signatures, numbers of crashes bucketed by signature, and other aggregations of crash counts on various facets, to make reporting fast.
• It’s in there for a couple of reasons: prompt, reliable answers to queries; referential integrity; the relationships between signatures and versions, tickets, and so on; and PHP & Django find it easy to query.
Now, let’s turn around and talk about ES, which operates in parallel.
• Elasticsearch
• 90-day rolling window.
• Faceting.
• The new kid on the block. Extremely flexible text analysis: though geared toward natural language, we may be able to persuade it to take apart C++ call signatures and let us mine those in meaningful ways.
• May someday eat some of HBase’s or Postgres’s lunch. It scales out like HBase and can even execute arbitrary scripts near the data, collating and returning data through a master node.
• Maybe not the flexibility of full map-reduce, but it has filter caching, and it supports indices itself.
• Web services (“middleware”)
• At the end of this story is the web application, but between it and the data sits a REST middleware layer.
• Why? The front end was in PHP, and we didn’t want to reimplement model logic in 2 languages. Also, we change datastores, and we move data around.
• Web App
• Django.
• Each box runs memcached.
And that concludes our big-picture tour of Socorro!
Now, as the years have gone by and the system has grown in scope and size, interesting patterns have emerged.
Big Patterns
Tooling was clearly missing. Standard practices weren’t good enough.
I’m going to call out some of these emergent needs and show you our solutions. Maybe you’ll even find some of our tools useful.
The first…
Big Storage
Every Big Data system must put everything somewhere. The solutions are well established, and the amount of data you can deal with in a commoditized fashion rises every year: sharding, replication.
But it’s expensive.
We realized that, by application of statistics, we could ***shrink the amount of data***.
***Sampling***: per product. We keep all FirefoxOS crashes, for instance.
We don’t want to lose interesting rare events to sampling, so: ***targeting***. Take anything with a comment. Our statisticians have told us all kinds of useful things about the shape of our data; for instance, the rules that select interesting events don’t throw off our OS or version statistics.
***Rarification***: throw away uninteresting parts of stack frames. Skiplist rules get the uninteresting parts of the stack out of the data, to reduce noise. 2 kinds: sentinel frames to jump TO, and frames that should be ignored. An important part of making our hash buckets wider, reducing the number of unique crash signatures.
With these 3 techniques, we cut down the amount of data we need to handle in the later stages of our pipeline. Sure, we still have to keep everything in HBase, but we don’t run live queries against that, so it just means buying more HDs. But the processors, Rabbit, PG, ES, memcache, crons: all have a lighter load.
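Sampling and targeting together make a throttle. This sketch uses illustrative rules, not Socorro’s production config, and hashes the crash ID so the keep-or-drop decision is deterministic on retry:

```python
import zlib

def accept(crash: dict, sample_pct: int = 10) -> bool:
    """Throttle sketch: targeting rules keep every 'interesting'
    crash; the remainder is sampled at a flat rate. The specific
    rules here are hypothetical."""
    # Targeting: never lose rare-but-interesting events to sampling.
    if crash.get("Comments"):
        return True
    if crash.get("ProductName") == "FirefoxOS":
        return True
    # Sampling: hash the crash ID so a resubmitted crash gets the
    # same answer, rather than rolling fresh dice each time.
    bucket = zlib.crc32(crash["crash_id"].encode()) % 100
    return bucket < sample_pct

kept = accept({"crash_id": "x1", "Comments": "It died loading gmail"})
```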
Big Systems
• Big Data systems tend to be complicated systems.
• Diverse parts: not just one big 500-node HBase cluster and done.
• Example: 6 data stores: FS, PG, ES, HBase, memcache, RabbitMQ.
• This is typical of architectures now. Gone are the days of 1 datastore, 1 representation.
• 18 months ago, I was hearing jokes about the data mullet: relational in the front, NoSQL in the back.
• Now it’s data dreadlocks. It’s all over the place.
The kinds of problems you can have in these systems are really tough to track down.
Hadoops! A tale of Big Failure.
We had a crash every 50 hours.
The cause: ***Hadoop’s cleverness*** with TCP connections, plus TCP stack bugs in Linux, plus lying NICs.
The OS buffers would fill up with unclosed connections, and the system would crash.
• So we’re very, very cautious about ***the equipment*** we use. Remember that hardware is a nontrivial part of your system.
• When you have a problem like this, it can be hard to work out exactly what’s gone wrong. It can take time to get everybody together.
And all the while, we must keep receiving crashes. ***Boxes & springs***
Complex interactions.
Hardware matters.
Design for failure.
[Architecture diagram: Crash Reporter (Breakpad) → Zeus load balancer → Collectors → Local FS → Crash Movers → HBase; RabbitMQ feeds the Processors, which read debug symbols on NFS and write to PostgreSQL (via pgbouncer) and elasticsearch; the Middleware, behind Zeus, serves the Web Front-end (memcached, LDAP); cronjobs include the Duplicate Finder, Bugzilla Associator, Automatic Emailer, Materialized View Builders (Active Daily Users, Signatures, Versions, Explosiveness), ADU Count Loader, and Version Scraper; external systems: Bugzilla, FTP, Vertica]
The most important: ***this Local FS***
Everything else can fail; the local FS gives us 3 days of runway. It has saved us several times.
Yours may not look like this, but you could imagine a system being able to serve just out of its cache if the datastore went away, or operate in read-only mode if writes became unavailable. (SUMO does the latter.)
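The store-and-forward idea behind that local FS buffer can be sketched like this. Function names and file layout are illustrative, not Socorro's actual code:

```python
# Store-and-forward sketch: the collector writes each crash to local disk
# first (the only step that must never fail); a separate mover ships it
# to durable storage and deletes the local copy only after success.
import json
import os
import uuid

def collect(spool_dir, crash):
    """Accept a crash report by spooling it to the local filesystem."""
    path = os.path.join(spool_dir, uuid.uuid4().hex + ".json")
    with open(path + ".tmp", "w") as f:
        json.dump(crash, f)
    os.rename(path + ".tmp", path)  # atomic: movers never see partial files
    return path

def move_all(spool_dir, store):
    """Drain the spool into a durable store (e.g. HBase)."""
    for name in sorted(os.listdir(spool_dir)):
        if not name.endswith(".json"):
            continue
        path = os.path.join(spool_dir, name)
        with open(path) as f:
            store(name, json.load(f))  # may raise; the file stays for retry
        os.remove(path)                # delete only after the store succeeded
```

If the durable store is down, `store()` raises and the crash simply waits on disk: that is the "3 days of runway."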
One thing from this diagram we haven’t talked about much yet is ***cron jobs***.
Big Batching
Mozilla is a large project with a long legacy, and Socorro interfaces with a lot of other systems. ***A lot of this occurs via batch jobs.***
Materialized views; the version scraper, once a day; Bugzilla. We send advice back to users, like in the case where we see they have malware. ADUs (active daily users) are the denominator for every metric, and that loader fails a lot: the metrics systems are unreliable, and everything that depends on them fails too.
In fact, you can look at a lot of our periodic tasks as a dependency tree.
One thing upstream fails***…***
…and downstream everything else fails.
We replaced cron with crontabber. Instead of blindly running jobs whose prerequisites aren’t fulfilled, it runs the ***parent*** until it succeeds, then runs the ***children***.
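Crontabber's core idea fits in a few lines. This is a toy model of the behavior described above, not the real crontabber API:

```python
# Toy dependency-aware job runner: a job runs only once all of its
# parents have succeeded; otherwise it is skipped until the next tick.
def run_jobs(jobs, deps, run):
    """jobs: names in desired order; deps: {job: [parent names]};
    run(job) -> True on success. Returns the set of jobs that succeeded."""
    succeeded = set()
    for job in jobs:
        if all(parent in succeeded for parent in deps.get(job, [])):
            if run(job):
                succeeded.add(job)
        # else: skip for now; retry on a later tick rather than blindly
        # running a job whose prerequisites aren't met
    return succeeded
```

So when an upstream job like the ADU loader fails, its whole downstream subtree is skipped instead of producing garbage.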
We wanted diagrams to visualize the state of the system, but drawing them by hand was too error-prone. ***Then*** we thought: why not have crontabber draw them for us?
SVGs are really neat: elements can wiggle if their state is unclear.
And then break down specifics into a ***table…***
It runs one job at a time at the moment, because we worry about matview performance, but a great contribution would be some kind of shared locks or thresholds to allow running several at once.
But you know, right now, it’s ***good enough…***
Big Deal
And it’s surprising how often that happens. Oftentimes, your makeshift solutions end up being good enough to do the job.
***A slapdash, hacky queue (PG)***: one job polls HBase and writes into PG, and the processors poll PG. The ***local FS buffer*** was a temporary fix from when we had reliability problems with HBase.
***I could tell*** you “don’t be afraid of temporary hacks.” But I think that’s a healthy fear to have. Or perhaps my message should be: do a good job on your temporary solutions, because they’ll probably be around awhile.
One definition of Big Data: more than you can hook up to one computer, or fit on one desk. And that changes every year.
The fact that some of you are wearing nearly 100GB would be unimaginable to the operator of a punch-card duplicator from only 50 years ago.
But the patterns that come out of large systems remain. Why duplicate cards? To facet the data 2 ways in parallel.
While you may need to generalize a bit, I have no doubt the techniques you learn today and tomorrow will serve you well into the future.
Big Thanks
twitter: ErikRose