Damn You Facebook GREG WARDEN HEAD OF ENGINEERING SERVICES ATLASSIAN @USMILE1

AtlasCamp 2015: Damn you Facebook - Raising the bar in SaaS

Thanks to Facebook, Twitter, Netflix, LinkedIn, Etsy and others…

The standard for availability,

performance and dev speed is now

VERY HIGH

For us old farts who learned to develop

for servers, the rules have changed

Damn

There is already A LOT of material

out there about this stuff

By people WAY smarter than me

So this talk is NOT going to be about…

the CAP theorem

or the fallacies of distributed

computing

or Raft vs. Paxos

or CRDTs

It’s not about 12-factor apps

or micro-services

or the reactive manifesto

It’s not about circuit-breakers or

back pressure

or no-sql, CQRS, immutability,

idempotency, or functional

programming

Sorry

This talk is about thinking differently about some of the

basics

it’s about un-learning some stuff

practical learnings

Let’s start with availability

Un-learning #1: the customer admin

is sitting there on app startup

You can’t just dump some friendly error

message to a log

You can’t just bork the whole system

and expect the admin will sort it

out

@40000000553d9deb364aa0d4 2015-04-27 04:24:33,909 main FATAL [atlassian.jira.startup.JiraStartupLogger]
@40000000553d9deb364aa8a4
@40000000553d9deb364aa8a4 *************************************************************************************************************************************
@40000000553d9deb364aac8c The following plugins are required by JIRA, but have not been started: Gadget Spec Publisher Plugin (com.atlassian.gadgets.publisher)
@40000000553d9deb364aac8c *************************************************************************************************************************************
@40000000553d9deb36589af4
@40000000553d9deb3686947c 2015-04-27 04:24:33,914 main ERROR [atlassian.jira.upgrade.UpgradeLauncher] Skipping, JIRA is locked.
@40000000553d9deb368ada3c 2015-04-27 04:24:33,915 main INFO [atlassian.jira.scheduler.JiraSchedulerLauncher] JIRA Scheduler not start

Now what?

Graceful degradation
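One way to read this slide: a plugin that fails to start should cost you a feature, not the whole instance. Here is a minimal sketch of that decision, assuming a hypothetical `StartupCheck` class and a made-up split of plugins into "critical" and "optional" (neither comes from JIRA's code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: decide at startup how to run when some plugins did not start.
// The class name and the critical/optional split are illustrative assumptions.
final class StartupCheck {
    enum Mode { FULL, DEGRADED, UNAVAILABLE }

    /** Outcome of the check: the mode to run in plus the plugins that did not start. */
    static final class Result {
        final Mode mode;
        final List<String> missing;

        Result(Mode mode, List<String> missing) {
            this.mode = mode;
            this.missing = missing;
        }
    }

    static Result check(Set<String> critical, Set<String> optional, Set<String> started) {
        List<String> missing = new ArrayList<>();
        boolean criticalMissing = false;
        for (String key : critical) {
            if (!started.contains(key)) {
                missing.add(key);
                criticalMissing = true;
            }
        }
        for (String key : optional) {
            if (!started.contains(key)) {
                missing.add(key);
            }
        }
        if (criticalMissing) {
            return new Result(Mode.UNAVAILABLE, missing);
        }
        // Only optional plugins are missing: keep serving, with those features switched off.
        return new Result(missing.isEmpty() ? Mode.FULL : Mode.DEGRADED, missing);
    }
}
```

The caller would then disable the features behind `missing` and raise an alert for operations, rather than locking the instance and waiting for a customer admin who is not there.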

Un-learning #2: sys properties are a good

source of user information

Tied to a server?

It would really suck if you had to restart

the system to change the timezone

public TimeZone getDefaultTimezone() {
    String systemDefaultTimeZoneId = applicationProperties.getString(APKeys.JIRA_DEFAULT_TIMEZONE);
    if (StringUtils.isNotEmpty(systemDefaultTimeZoneId)) {
        return TimeZone.getTimeZone(systemDefaultTimeZoneId);
    }
    return TimeZone.getDefault();
}

The old way

public void acceptTenant(final String tenantId, final Map<String, String> tenantProperties)
        throws LandlordRequestException {
    final Tenant tenant = new JiraTenantImpl(tenantId);
    jiraTenantAccessor.addTenant(tenant);

    updateTimezoneIfProvided(tenantProperties);
    tenantPluginBridge.trigger();
    for (final TenantInitialDataLoader loader : tenantInitialDataLoaders) {
        loader.start(tenant);
    }
    eventPublisher.publish(new TenantArrivedEvent(tenant));
    log.debug("Tenanted - AcceptTenantCalled");
}

Might this be better?
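The slides do not show what `updateTimezoneIfProvided` does, so the following is only a sketch of the idea: keep the time zone per tenant, set it from the tenant's properties, and never fall back to the server's own default. `TenantTimeZones` and the `"timezone"` property key are hypothetical names made up for illustration.

```java
import java.util.Map;
import java.util.TimeZone;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: time zone held per tenant and changeable at runtime.
final class TenantTimeZones {
    // Assumed property key; the real key is not shown in the slides.
    static final String TIMEZONE_KEY = "timezone";

    private final Map<String, String> zoneIdByTenant = new ConcurrentHashMap<>();

    /** Called when a tenant arrives or its settings change. No restart involved. */
    void updateIfProvided(String tenantId, Map<String, String> tenantProperties) {
        String zoneId = tenantProperties.get(TIMEZONE_KEY);
        if (zoneId != null && !zoneId.isEmpty()) {
            zoneIdByTenant.put(tenantId, zoneId);
        }
    }

    /** Falls back to UTC instead of whatever zone the server process happens to run in. */
    TimeZone timeZoneFor(String tenantId) {
        String zoneId = zoneIdByTenant.get(tenantId);
        return zoneId == null ? TimeZone.getTimeZone("UTC") : TimeZone.getTimeZone(zoneId);
    }
}
```

The design point is that nothing here reads a system property or `TimeZone.getDefault()`, so two tenants on the same node can have different zones and either can change its zone while the node keeps running.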

Un-learning #3: You can take the system

down to upgrade it

stick with me a sec

Cost-effective high availability tends to imply multi-tenancy

Sharding be-damned, you have global customers on

shared infrastructure

Somebody is always awake

So you need 0-downtime upgrades

Safe 0-downtime upgrades on shared infrastructure

tends to imply

You can’t upgrade your code and schema at the same time
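The slides do not spell out the steps, so treat this as one common way to keep code and schema changes apart, not as the pattern shown in the talk: ship code that works against both the old and the new schema, and change the schema in a separate step. The table layout and the `UserRowReader` name below are hypothetical.

```java
import java.util.Map;

// Hypothetical sketch: application code that tolerates both schema versions,
// so the schema change and the code change can be rolled out independently.
final class UserRowReader {
    /**
     * Assumed new schema: a display_name column.
     * Assumed old schema: only first_name and last_name.
     */
    static String displayName(Map<String, String> row) {
        String displayName = row.get("display_name");
        if (displayName != null) {
            return displayName; // new column exists and is populated
        }
        // Column not added or not backfilled yet: derive the value from the old columns.
        return row.get("first_name") + " " + row.get("last_name");
    }
}
```

With a reader like this deployed first, the column can be added and backfilled while old and new nodes are both serving traffic, and the old columns can be dropped only once no running code reads them.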

Fast five pattern

Un-learning #4

you can rely on an in-memory cache

for global state

The Server

The Cluster

The Cloud

X

you can’t cache all the JIRA users for every customer on

every node
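If every node cannot hold everything, the node-local cache has to become a small, bounded window onto a shared store. A minimal sketch of that, assuming a hypothetical `BoundedCache` class and a loader function that stands in for the shared store:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: a bounded, per-node cache in front of a shared store.
// The shared store stays the source of truth; the node keeps only recently used entries.
final class BoundedCache<K, V> {
    private final Map<K, V> entries;
    private final Function<K, V> loader;

    BoundedCache(final int maxEntries, Function<K, V> loader) {
        this.loader = loader;
        // accessOrder=true keeps the least recently used entry first in iteration order.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    synchronized V get(K key) {
        V value = entries.get(key);
        if (value == null) {
            value = loader.apply(key); // miss: fetch from the shared store
            entries.put(key, value);   // may evict the least recently used entry
        }
        return value;
    }

    synchronized int size() {
        return entries.size();
    }
}
```

This only bounds memory. It does nothing about invalidation when another node changes the data, which is the harder half of the problem and the reason global state cannot live in node memory.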

Un-learning #5

you have a local filesystem

Decoupled remote file service

And you replicate that storage globally

And then:

backup / restore

import / export

search indexing

performance

Now let's talk about performance

Un-learning #6:

Varying the resource returned based on

cookies is a good idea

Ever seen a URL like this in JIRA:

<script type="text/javascript" src="//d2p4ir3ro0j0cb.cloudfront.net/gregdownunder.atlassian.net/s/d7f5a59ca238699619bdd31870a15fca-CDN/en_USdre0s0/65000/8/67/_/download/superbatch/js/batch.js?atlassian.aui.raphael.disabled=true&amp;locale=en-US" ></script>

You see this bit?

<script type="text/javascript" src="//d2p4ir3ro0j0cb.cloudfront.net/gregdownunder.atlassian.net/s/d7f5a59ca238699619bdd31870a15fca-CDN/en_USdre0s0/65000/8/67/_/download/superbatch/js/batch.js?atlassian.aui.raphael.disabled=true&amp;locale=en-US" ></script>

It used to be just:

<script type="text/javascript" src="//gregdownunder.atlassian.net/s/d7f5a59ca238699619bdd31870a15fca-CDN/en_USdre0s0/65000/8/67/_/download/superbatch/js/batch.js?atlassian.aui.raphael.disabled=true&amp" ></script>

Remember this?

Cookies and CDNs don’t play nice
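Comparing the two tags, the newer URL goes through a CloudFront host and carries the locale as a query parameter. The general idea is that anything that changes the response belongs in the URL, so the URL alone identifies the cached resource. A minimal sketch, assuming a hypothetical `CdnUrls` helper and a CDN configured to include the query string in its cache key:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical sketch: put everything that varies the response into the URL,
// so a CDN can cache one copy per URL without ever looking at cookies.
final class CdnUrls {
    static String batchUrl(String cdnHost, String tenantHost, String resourcePath, String locale) {
        return "//" + cdnHost + "/" + tenantHost + resourcePath + "?locale=" + encode(locale);
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java runtime is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }
}
```

Two users with different locales now request two different URLs, and two users with the same locale share one cached copy regardless of what cookies their browsers send.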

Pro tip

figure out how to monitor the CDN

before you roll it out

Un-learning #7:

Local session state is a convenient

solution for multi-step workflows

two choices:

browser state

external cache

That’s right, external cache

Pro-tip:

Write your queries right and the DB is a

good external cache
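A rough sketch of the "external" option: the multi-step workflow keeps its state in a shared store keyed by a token, so any node can serve the next step. All names here (`WorkflowStateStore`, `InMemoryStateStore`, `WizardNode`) are hypothetical, and the in-memory store only stands in for a database table or external cache so the sketch runs.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: workflow state lives outside the node that served step one.
interface WorkflowStateStore {
    void save(String token, Map<String, String> state);
    Map<String, String> load(String token);
}

// Stand-in for a database table or external cache shared by all nodes.
final class InMemoryStateStore implements WorkflowStateStore {
    private final Map<String, Map<String, String>> rows = new ConcurrentHashMap<>();

    public void save(String token, Map<String, String> state) {
        rows.put(token, state);
    }

    public Map<String, String> load(String token) {
        return rows.get(token);
    }
}

// A node holds no workflow state of its own, only a reference to the shared store.
final class WizardNode {
    private final WorkflowStateStore store;

    WizardNode(WorkflowStateStore store) {
        this.store = store;
    }

    /** Step one: remember the input and hand the browser a token. */
    String startWizard(String projectName) {
        String token = UUID.randomUUID().toString();
        Map<String, String> state = new HashMap<>();
        state.put("projectName", projectName);
        store.save(token, state);
        return token;
    }

    /** Step two: may run on a different node than step one. */
    String finishWizard(String token, String projectKey) {
        Map<String, String> state = store.load(token);
        if (state == null) {
            throw new IllegalStateException("unknown or expired token");
        }
        return state.get("projectName") + " (" + projectKey + ")";
    }
}
```

Because the token is the only thing the browser carries, a load balancer is free to send each step to whichever node is available.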

Un-learning #8:

admins are the only ones experiencing

start-up time

Customer provisioning

Key Idea

decouple app and customer-specific data

Un-learning #9:

snowflakes are pretty

allowing variation is great for custom

environments but it screws your dev

speed in SaaS

dev speed centers around making lots

of little changes

cost of each change is something like:

abstract cost * # variations to consider

Options?

minimise the cost of consideration

Options?

make sure variation is observable

programmatically

Options?

capture configurations in a queryable metadata

service

Pro-tip

upgrades should assert what the world is like

before acting
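As a sketch of what "assert before acting" could look like: the upgrade states the value it expects to find, checks it, and refuses to touch anything that does not match. `UpgradeGuard` and the configuration map are hypothetical stand-ins for a real upgrade task and its target.

```java
import java.util.Map;

// Hypothetical sketch: an upgrade step that checks its assumption before changing anything.
final class UpgradeGuard {
    /**
     * Moves config[key] from expected to target.
     * Returns true if the change was applied, false if the world did not look as expected.
     */
    static boolean upgradeIfExpected(Map<String, String> config, String key,
                                     String expected, String target) {
        String actual = config.get(key);
        if (!expected.equals(actual)) {
            // A snowflake: leave it untouched so someone can look at it.
            return false;
        }
        config.put(key, target);
        return true;
    }
}
```

A `false` result is the useful signal here: it makes an unexpected variation observable programmatically instead of letting the upgrade quietly build on a wrong assumption.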

Immutable infrastructure

Un-learning #10:

logs are pretty

we were then told to log in a super structured way

Un-learning 10-a:

"logs are pretty" is itself an un-learning

Invest in a proper logging

infrastructure
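To make "super structured" concrete, here is a small sketch that emits one JSON object per log event with named fields instead of free text. `StructuredLog` and the field names are hypothetical, and the escaping is deliberately minimal; a real setup would more likely rely on a JSON library or the logging framework's own encoder.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: one machine-readable line per event.
// Assumes non-null keys and values.
final class StructuredLog {
    static String line(String level, String event, Map<String, String> fields) {
        Map<String, String> all = new LinkedHashMap<>();
        all.put("level", level);
        all.put("event", event);
        all.putAll(fields);

        StringBuilder sb = new StringBuilder("{");
        for (Map.Entry<String, String> e : all.entrySet()) {
            if (sb.length() > 1) {
                sb.append(',');
            }
            sb.append('"').append(escape(e.getKey())).append("\":\"")
              .append(escape(e.getValue())).append('"');
        }
        return sb.append('}').toString();
    }

    // Minimal escaping only: backslash, double quote and newline.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }
}
```

With fields like a tenant identifier on every line, a log pipeline can answer "which customers hit this error" with a query instead of a regular expression.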

Dev Speed Learning Fast

Log Everything

In Conclusion

All that crazy hacker-news stuff

is cool. But you gotta start with the

basics

no admins

external state

fast five upgrades

start up time matters

respect the CDN

log everything

thank you!

Damn you Facebook - Raising the bar in SaaS

Submit your feedback: go.atlassian.com/acfacebook