Improving email reliability CTO Antti Siiskonen Alma Mediapartners Oy

Improving email reliability


Page 1: Improving email reliability

Improving email reliability

CTO Antti Siiskonen

Alma Mediapartners Oy

Page 2: Improving email reliability

Agenda

• Introduction

• How we failed with email

• How we recovered and improved

• What remains to be done

• Takeaways

Page 3: Improving email reliability

Who's this guy?

• Antti Siiskonen, Tampere University of Technology 1996-2006, MSc Software Engineering, Networks and Protocols

• Plenware Oy 1999-2002

• Atostek Oy & Staselog Oy 2002-2008

• Alma Media Interactive Oy & Alma Mediapartners Oy 2008-

• I live in Tampere, 40, married, three children

• CTO, infrastructure design, problem solving, helping people

• AWS stuff since late 2011, using EC2, VPC, S3, CloudFront, CloudFormation, Elastic Beanstalk, RDS, SES, etc

Page 4: Improving email reliability

Why is he here?

• to learn and to share what we have learned about email and SES

• to promote Alma Media as a company and as an employer

• if you are interested in joining us please send me a message!

[email protected]

• I'm also present on pretty much every social media platform (except Tinder)

Page 5: Improving email reliability

What is Alma Mediapartners Oy?

• a part of Alma Media and Alma Markets, only digital services

• etuovi.com, etuovi sisustus, autotalli.com, vuokraovi.com, gofinland.fi

• kivi, nettikoti, websales, autosofta, urakkamaailma, autojerry ..

• 2016 turnover 16.4M€, profit 3.86M€

• some 100 employees, maybe 40 of which are in IT

• software development is done mostly in-house with assistance from subcontractors

Page 6: Improving email reliability

Why is he talking about emails?

• what a boring subject, I know, not much bling bling in this one :-P

• who's using it anyway? isn't everyone using facebook, instagram, snapchat, mobile apps, push notifications or what have you for their messaging?

• it's still a relevant channel for many people

• users' attention is divided between a bazillion messaging channels but for now we just can't afford to ignore emails, we have to be present there as well

• where are the customers?

Page 7: Improving email reliability

How many emails are we talking about?

• etuovi 40k-140k emails every day, marketing & personals excluded

Page 8: Improving email reliability

Examples of email we send: watchdog emails

• our marketplace services send watchdog emails to inform consumers of matching new products for sale

• integrations push the same announcement data to competitors as well, so whoever gets their message sent (and read) first wins, i.e. gets the most clicks, traffic, market share and business

• most of our emails are watchdog messages sent to consumers

• high volume, low importance

Page 9: Improving email reliability

Examples of email we send: lead emails

• lead emails are contact requests sent by consumers to corporate customers

• banks, insurance companies, real estate agents etc

• lost emails cause direct loss of customers, business and money

• low volume, high importance

Page 10: Improving email reliability

Yet more examples of different messages

• marketing messages & newsletters

• high volume, low importance

• personal messages from @etuovi.com addresses

• low volume, high importance

• technical messages – password resets etc

• low volume, high importance

Page 11: Improving email reliability

Why do lost emails matter so much?

• lost emails give an unreliable impression of our service

• SMTP is unreliable by design, but the blame still lands on us

• email can be considered lost if the receiving end refuses to receive our messages or if it is interpreted as junk, clutter or spam – in general if the intended receiver never sees the sent email for any reason

• sometimes customers contact our customer service and complain that some message is missing and we have to figure out why

Page 12: Improving email reliability

How we failed with email

• disaster struck last fall

• after years and years of an essentially established and unchanged way of operating we were suddenly caught with our pants down

• here's what happened ..

Page 13: Improving email reliability

Initial outbound email infrastructure

[Diagram: Autotalli, Etuovi and Vuokraovi → Relays → Internet → O365, gmail, etc]

Page 14: Improving email reliability

The mayhem begins

• one of our websites was abused and ended up sending tens of thousands of spammy emails per day without us even knowing

• an open html form for sending email is bad (as if we didn't know)

• first indication of trouble was the blacklisting of one of our relay IP addresses

• here's a tip for you: mxtoolbox.com

• some services throttled down the number of emails they're willing to accept from our relay, some completely rejected our emails

• think O365, gmail, hotmail .. potentially huge impact, status unknown
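A blacklist check like the one mxtoolbox.com offers boils down to a DNS lookup against a DNSBL zone: the octets of the IP address are reversed and prepended to the zone name, and an answer means "listed". The Python sketch below illustrates that convention only. The zone `dnsbl.example.org` is a hypothetical placeholder, not a real blacklist, and the function names are made up for this example.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to look up for an IPv4 address in a DNSBL zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the DNSBL zone answers for the address."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        # No such name: the address is not listed (or the lookup itself failed).
        return False


if __name__ == "__main__":
    # 192.0.2.10 is a documentation address, the zone is a placeholder.
    print(dnsbl_query_name("192.0.2.10", "dnsbl.example.org"))
```

Running a check like this on a schedule for every relay address would have flagged the listing before customers did.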

Page 15: Improving email reliability

Further down the spiral

• we tried our best to dodge the issue and keep on going

• we changed our blacklisted relay ip address to another

• turns out there is a warm-up period!

• receiving systems throttle email from "cold" ip addresses

• as trust builds, they let more and more email pass, but it may take weeks

• our "delayed" email queue blew up and made problems worse

• we redirected all outbound email to non-blacklisted relay and ip

• unsurprisingly it was blacklisted in record time as well

• both "delayed" queues blew up and made things even worse

Page 16: Improving email reliability

WELL THAT ESCALATED QUICKLY!

Page 17: Improving email reliability

Taking back control

• we stopped the abuse by removing borked features from websites

• we do know how this stuff should be done .. right? RIGHT?

• there is no easy way out from the blacklists once you fall in

• some lists allow us to request removal after we have fixed our systems

• others update their records automatically and eventually we go green again

• repeat offenders may face exponential back-off on removals and must wait even longer for the negative effect to pass: are we sure that it's fixed?

• meanwhile our customers are pissed, our customer service is taking flak and our business people are nervous of how things are going to turn out .. but no pressure!

Page 18: Improving email reliability

So what's wrong with this setup? Everything.

• no monitoring of relays, email counts, bounces, nothing

• no risk management – if one service triggers blacklisting of relay's public ip for any reason all services using the relay will suffer

• no bounce handling or maintenance of email address registry

• no DKIM signatures or DMARC records (SPF was ok though!)

• no feedback channel from email deliveries, only postfix logs

• no GA or any other analytics or tracking for sent emails

• we did pretty much nothing to build or maintain our email sending reputation!

Page 19: Improving email reliability

Planning a better world

• we decided to adopt AWS Simple Email Service (SES)

• SMTP relay as a service

• separation of test & qa from production services

• separation of Mediapartners' services from each other

• risk management – limit blast radius to a single service

• considered separation of high volume & low importance emails from low volume & high importance email

• decided not to go this far yet, let's see how things turn out without

Page 20: Improving email reliability

Outbound email infrastructure today

[Diagram: Our service, AWS SES, Internet, Receivers; alongside AWS SNS, AWS Lambda and AWS DynamoDB]

Page 21: Improving email reliability

Shifting from the old to the new

• during the summer of 2017 we implemented a load balancer in software for outbound email that balances between our old relays and SES

• this enabled a gradual shift from our old relays to SES

• SES has its own warm-up feature which balances between shared SES ip addresses and dedicated SES ip addresses

• if warm up is incomplete excess email is sent via shared ip addresses

• shared ips may be blacklisted .. or not, you never know

• when completed all email is sent using dedicated addresses
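The gradual shift can be pictured as a weighted coin flip per message. The Python sketch below is an illustration of the idea, not the actual implementation: `choose_route` and `ses_share` are hypothetical names, and the assumption is that the share of traffic going to SES is a single number raised step by step from 0 to 1.

```python
import random


def choose_route(ses_share: float, rng: random.Random) -> str:
    """Pick "ses" with probability ses_share, otherwise "relay"."""
    if not 0.0 <= ses_share <= 1.0:
        raise ValueError("ses_share must be between 0 and 1")
    # rng.random() is in [0, 1), so 0.0 never picks SES and 1.0 always does.
    return "ses" if rng.random() < ses_share else "relay"


if __name__ == "__main__":
    rng = random.Random()
    routes = [choose_route(0.25, rng) for _ in range(10_000)]
    print("share sent via SES:", routes.count("ses") / len(routes))
```

Keeping the share in configuration rather than code makes it possible to back off quickly if the new path starts bouncing.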

Page 22: Improving email reliability

Some improvements in more detail

• there are multiple standard methods that can be deployed to increase the trustworthiness of an email source and an individual email

• we are using three methods that all involve creating and maintaining appropriate DNS records

• SPF, DKIM and DMARC

• AWS SES supports all of them and they are quite easy to set up

• it gets a bit technical from here on, but please bear with me ..

Page 23: Improving email reliability

Sender Policy Framework (SPF)

• RFC 4408, RFC 6652, RFC 7208 and RFC 7372 (2006-2015)

• for expressing permitted email senders for a domain in a DNS TXT record

• for etuovi.com we updated its TXT record to hold this:

• "v=spf1 include:spf.crometenterprise.com a:mailcannon.hard.ware.fi include:_spf.emaileri.fi include:spf.protection.outlook.com include:amazonses.com include:_spf.salesforce.com ~all"

• email forwarders mask the original sender – SPF will not match

• final "~all" indicates SOFTFAIL, which is also not perfect
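To see what a receiver reads out of such a record, it helps to split it into its mechanisms. The Python sketch below is a simplified parser written for this purpose: it handles the qualifier prefixes (`+`, `-`, `~`, `?`) and `name:value` mechanisms as used in the record above, and it deliberately ignores modifiers such as `redirect=`. The function name `parse_spf` is made up for the example.

```python
QUALIFIERS = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}


def parse_spf(record: str) -> list:
    """Split an SPF record into (qualifier, mechanism, value) tuples."""
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        raise ValueError("not an SPF version 1 record")
    mechanisms = []
    for term in terms[1:]:
        qualifier = "pass"  # "+" is implied when no qualifier is written
        if term[0] in QUALIFIERS:
            qualifier = QUALIFIERS[term[0]]
            term = term[1:]
        name, _, value = term.partition(":")
        mechanisms.append((qualifier, name.lower(), value))
    return mechanisms


if __name__ == "__main__":
    record = (
        "v=spf1 include:spf.crometenterprise.com a:mailcannon.hard.ware.fi "
        "include:_spf.emaileri.fi include:spf.protection.outlook.com "
        "include:amazonses.com include:_spf.salesforce.com ~all"
    )
    for qualifier, name, value in parse_spf(record):
        print(qualifier, name, value)
```

The last tuple shows the softfail on `all`: anything not matched by the earlier mechanisms is marked suspicious rather than rejected outright.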

Page 24: Improving email reliability

DomainKeys Identified Mail (DKIM)

• RFC 4871, RFC 5672, RFC 6376 (2007-2011)

• for signing outbound email

• encrypted signature hash and the recipe to verify it are added to message headers

• domain specific public decryption key is added to DNS records

• receiver fetches the public key, decrypts the hash value from email headers and recalculates the hash to verify the signature

• can be done on infrastructure level without changes to applications

• what should be done when DKIM is missing or verification fails?
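The "recalculates the hash" step can be illustrated with the body hash that DKIM carries in the `bh=` tag. The Python sketch below covers only that part, assuming "simple" body canonicalization (RFC 6376, section 3.4.3), SHA-256 and a body that already uses CRLF line endings. Verifying the signature over the headers needs a public-key library and is not shown. The function names are made up for the example.

```python
import base64
import hashlib


def canonicalize_body_simple(body: bytes) -> bytes:
    """Apply DKIM "simple" body canonicalization."""
    # Reduce any run of empty lines at the end of the body to one CRLF.
    while body.endswith(b"\r\n\r\n"):
        body = body[:-2]
    # An empty body, or one without a trailing CRLF, gets a CRLF appended.
    if not body.endswith(b"\r\n"):
        body += b"\r\n"
    return body


def dkim_body_hash(body: bytes) -> str:
    """Return the base64 SHA-256 hash of the canonicalized body."""
    digest = hashlib.sha256(canonicalize_body_simple(body)).digest()
    return base64.b64encode(digest).decode("ascii")


if __name__ == "__main__":
    print(dkim_body_hash(b"Hello from the watchdog\r\n\r\n"))
```

A receiver compares its own result with the `bh=` value in the signature header. If a relay or mailing list rewrites the body in transit, the two no longer match and DKIM fails.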

Page 25: Improving email reliability

DomainKeys Identified Mail (DKIM)

Page 26: Improving email reliability

How does a spam filter look at SPF or DKIM?

Page 27: Improving email reliability

Domain-based Message Authentication, Reporting and Conformance (DMARC)

• defined in RFC 7489 (2015)

• built on top of SPF and DKIM

• again directives are published via DNS TXT records

• allows domain owner to specify how the receiver should deal with SPF and DKIM failures (actually DMARC failures)

• adds "alignment requirement" for domains

• _dmarc.etuovi.com TXT record holds: "v=DMARC1; p=quarantine; rua=mailto:[email protected]; fo=s; adkim=r; aspf=r; pct=100; rf=afrf; ri=86400"

Page 28: Improving email reliability

Domain-based Message Authentication, Reporting and Conformance (DMARC)

• p=quarantine

• instruct receivers to quarantine DMARC failures

• ri=86400 & rua=mailto:[email protected]

• send aggregate reports to this address every 86400 seconds

• fo=s

• report if SPF fails

• provides a neat feedback channel that we didn't have before!
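A DMARC record is a semicolon-separated list of `tag=value` pairs, so it is easy to read mechanically. The Python sketch below parses a record into a dictionary. The function name `parse_dmarc` is made up for the example, and the report address `dmarc-reports@example.com` is a placeholder assumption, not the address from the real record.

```python
def parse_dmarc(record: str) -> dict:
    """Parse a DMARC TXT record into a dict of tag -> value."""
    tags = {}
    for part in record.split(";"):
        part = part.strip()
        if not part:
            continue  # tolerate a trailing semicolon
        name, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"malformed tag: {part!r}")
        tags[name.strip().lower()] = value.strip()
    if tags.get("v") != "DMARC1":
        raise ValueError("not a DMARC1 record")
    return tags


if __name__ == "__main__":
    # Placeholder report address; the other tags mirror the record on the slide.
    record = (
        "v=DMARC1; p=quarantine; rua=mailto:dmarc-reports@example.com; "
        "fo=s; adkim=r; aspf=r; pct=100; rf=afrf; ri=86400"
    )
    tags = parse_dmarc(record)
    print("policy:", tags["p"])
    print("report interval (hours):", int(tags["ri"]) // 3600)
```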

Page 29: Improving email reliability

Current SES setup

Page 30: Improving email reliability

Current reputation status

Page 31: Improving email reliability

What remains to be done

• automatic bounce & complaint handling on application level

• we now have something equivalent to a logging system, and email sending is still disabled manually

• we need an automated system that keeps an eye on the DynamoDB and does the work for us

• we need analytics for the sent messages

• has the email been opened? has it been interacted with?

• should we cease sending email for passive users after some time? what is the "half life" of a passive email address?
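The automated bounce and complaint handling could start from something like the Python sketch below. It is a sketch under assumptions, not the system described in this talk: it assumes SES bounce and complaint notifications arrive through SNS in the Lambda event shape (`Records[].Sns.Message` holding a JSON string with `notificationType`, `bounce.bouncedRecipients` and `complaint.complainedRecipients`), and it only returns the addresses to stop mailing. Writing them to DynamoDB and wiring the result into the application are left out. The function name is hypothetical.

```python
import json


def addresses_to_suppress(sns_event: dict) -> set:
    """Collect addresses with permanent bounces or complaints."""
    suppressed = set()
    for record in sns_event.get("Records", []):
        message = json.loads(record["Sns"]["Message"])
        kind = message.get("notificationType")
        if kind == "Bounce":
            bounce = message["bounce"]
            # Transient bounces (full mailbox etc.) may recover, so skip them.
            if bounce.get("bounceType") == "Permanent":
                for recipient in bounce.get("bouncedRecipients", []):
                    suppressed.add(recipient["emailAddress"].lower())
        elif kind == "Complaint":
            complaint = message["complaint"]
            for recipient in complaint.get("complainedRecipients", []):
                suppressed.add(recipient["emailAddress"].lower())
    return suppressed


if __name__ == "__main__":
    sample = {"Records": [{"Sns": {"Message": json.dumps({
        "notificationType": "Bounce",
        "bounce": {
            "bounceType": "Permanent",
            "bouncedRecipients": [{"emailAddress": "Gone@example.com"}],
        },
    })}}]}
    print(addresses_to_suppress(sample))
```

Treating complaints the same way as hard bounces is a conservative choice: a recipient who marks a message as spam is unlikely to want the next one.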

Page 32: Improving email reliability

Some pitfalls

• not knowing who or what is allowed to send email for us

• not all email is sent through AWS SES

• personal emails egress via O365, marketing emails via partners' relays ..

• does our SPF record really cover everything there is ..?

• some emails will not have DKIM signatures .. and it should be ok

• people, in general, have no idea of how email actually works

• relays, sender reputation, blacklists, etc .. not common knowledge

Page 33: Improving email reliability

Takeaways

• SMTP might be simple, but automated bulk email sending is not

• do your best to be a "good citizen" in the email world

• build, monitor and actively maintain your sender reputation

• don't send "unsolicited email" which IS SPAM by definition

• figure out and keep up with current best practices and implement them

• keep in mind that email is an unreliable channel by design

• use other channels where appropriate

Page 34: Improving email reliability

Thank you!

We are looking for new talent so contact me if you're interested!

@AnttiSiiskonen

[email protected]

https://www.linkedin.com/in/anttisiiskonen