indecorous

On Sabbaticals

Wed, 27 Jan 2016 00:00:00 -0600

Sabbaticals - a long(er) term paid break from work - used to be the preserve of academia. But, more and more companies are choosing to offer sabbaticals to their longer-serving employees. It’s a tremendous benefit for employees (and, as I’ll argue below, a great benefit for employers too).

I’m fortunate to have worked for Etsy for over six years now, and when they updated their policy to allow a six week sabbatical after five years (it had been seven), I started planning mine.

I took my sabbatical in July and August of 2015. It was fun, relaxing, and not what I expected. After I came back, I had a few conversations with people about it, and I thought it was worth writing down the things I’ve thought about sabbaticals - why they exist, how to get the most out of one, what to do on your return, etc. (This post is an edited and more general version of the email I sent to others considering their sabbatical at Etsy.)

Everyone’s situations - at home and at work - are different, and our sabbaticals will be equally unique. As such, the following is just a list of things I feel are important, having taken mine.

Why do we have sabbaticals?

Thinking about how to take a good sabbatical, I fell first to thinking about why we have sabbaticals in the first place. Are they for us to learn new skills? See new sights? Have adventures? Sit in our underwear in the dark binging on Netflix and popcorn? Yes, any and all of the above. But, fundamentally, I would argue that the main reason we have a sabbatical is so that we can be extremely not at work.

When you leave work, you start ramping down your “at work” mind and before you get to work you start ramping it back up again. During the work week we never really fully ramp down before we ramp up. At the weekend, we might get close. If we take a week off, we’ll probably be fully away from work for a few days in the middle. I’d argue that two weeks is the minimum time to take off to actually get a proper mental break from work. (Note: the dire state of vacation time allotted to employees at many US companies may well preclude this. I’ll admit my privilege in having a good amount of vacation time available to me in general.)

Even so, we’re often left with anxieties about work: Did I leave everything in a good state? What if I left a bug lying around? These mentally drag us back to work. The common habit of checking work mail during evenings, weekends, and vacations compounds this problem⁰.

A sabbatical is different. A sabbatical forces you to properly relinquish your grip on your job. This is super important for your employer. (It’s also super important for you, but it’s important to recognise that providing sabbaticals is not pure business altruism.) After a number of years at a company, it is super easy to become The Person Who Knows About The Thing, whatever The Thing is. You become a single point of failure because any time The Thing comes up, you’re the one who deals with it. Maybe you can get away with that over a week’s vacation (maybe even two) but a sabbatical forces you to make sure that other people know about The Thing too. This is important because while we take sabbaticals in a very planned, deliberate way, fate can force you to take an unscheduled one through the medium of a critical illness, or an unexpected bus while crossing the road.

First principle: If you think you can’t take a sabbatical, you need to take your sabbatical.

Corollary: You probably should be taking more, longer vacations too.

Tactics vs strategy

When I say that a sabbatical is a break from work, I’m not saying you shouldn’t think about work. I think you should think differently about work.

Part of never ramping down on the day-to-day work is that you often don’t have space to think about the long term - both for yourself and for your job, your team, and the company at large. We tend to think tactically about how to solve the immediate problems at hand, and trust that the longer-term strategic thinking is done by other people. It often is, but that’s underselling yourself. Your company hires very smart, very thoughtful people (yes, you) who care about the company’s mission and the people they serve. After a few years, you’ve seen things (good and bad) and thought things and got a sense of how the arc of the company’s story is curving. Those thoughts are often jumbling around and wanting to settle into some coherent insight, but can’t because you’re so busy doing the day job. Sabbaticals are a chance to clear away the day job and let those thoughts clarify.

I did say thinking about the long term for yourself, too, though. You’ve spent a goodly chunk of time at this company. Are you happy? Are you being stretched? Doing what you want? Is there some itch you want to scratch? Something worrying you that you’re missing? Paying attention to yourself and what you need is really, really important too, and having space to think about it (rather than just getting to the point where you throw up your hands and quit in order to make a change) is super important.

A lot of this thinking can be subconscious. You don’t have to go and sit at the top of a mountain and think about your job for weeks (I mean, you totally can if you want) but making space for strategic thinking in amongst sabbatical activities, and doing some deliberate thinking about it towards the end is useful.

Second principle: Day-to-day work pushes us to think tactically; sabbaticals push us to think strategically. Both are valuable.

Safe disconnection

To achieve this state of unworkness - to realise the benefits to yourself and to your company - you need to disconnect. This can feel really weird. (It can also feel really nice.)

Primarily, I mean disconnecting from email (and IRC, GChat, Slack, etc.). This is the constant drip-feed of work (or worklike stuff) that pulls you back in. If you see something that you can help with, the compulsion to actually go and help is hard to resist, and then you’re screwed.

What you need is trust that everything will be OK in your absence. There are two parts to this: doing your best to hand off everything that’s normally on your plate, and doing your best to make sure that the unexpected will be handled cleanly.

How you deal with those two is largely down to the sort of work you have, but for the latter I would recommend the following out-of-office reply¹:

I’m currently on sabbatical and won’t be checking my email until [return date]. Jane Manager - jmanager@example.com - should be able to handle anything I’d normally deal with. If you specifically need me urgently for some reason, Jane knows how to get in touch with me.

You clearly set expectations - “I’m not going to be reading your email” - and a mechanism for resolving problems. And you provide an escape hatch: if the world is ending and only you can help, your manager has your personal email/phone/batsignal/whatever and can call on you. And they won’t do it. You’ve already done the work to make sure you’re not a single point of failure before you leave. You work with good people. But knowing that escape hatch is there lets you relax knowing that if the worst does come to the worst you can be summoned. (Note that you don’t put your personal email on your out-of-office reply - this only works if you make sure you have a gatekeeper.)

Third principle: Do the work to gain confidence that you can be away, and then trust that the mechanisms you put in place are good.

Corollary: If you don’t trust your manager, your company has larger problems they need to solve.

On being surplus to requirements

One of the things I kind of struggled with before I went on sabbatical was the idea that I wasn’t actually… necessary. And if my team and the company could do without me for six weeks, then maybe they didn’t need me at all.

This plays into a workplace anti-pattern I tend to think of as “heroism”. You’re the one who swoops in and saves the day. You clean up the messes. You do the reviews that spot the problems that would have been catastrophic. You have the Good Ideas. All of this is absolutely terrible for the company (and for you, long term). Again, the single point of failure. Other people on your team need to step up, and need space to step up. That’s how the company grows and evolves.

Just because you’re not irreplaceable doesn’t mean you’re not valuable. Your perspective, experience, skills, and love of a good pun make you valuable, and those aren’t predicated on you saving the day all the time. In fact saving the day all the time is probably stopping you from applying all those skills to more important problems.

Extended time away lets you come back to work having shed an old, constricting skin. Your team grew more capable in your absence, and you have new freedoms to now make it even better.

Fourth principle: Just because your team can function without you, doesn’t mean they want to.

When to go

This depends on your company’s policy, but in most cases there is no requirement to take your sabbatical the moment you qualify for it. Dates will be constrained by various logistical requirements. Want to spend your sabbatical with your kids? You’re probably limited to school vacation times. Want to go on a grand tour of New England? You might want to do it in the autumn to get the best leaves. Want to do a particular class that’s only offered once a year? You get the idea. Similarly, it is unlikely to be ideal for a product manager to go waltzing off in the middle of a critical project, for a support person to head off into the sunset during your busiest time of the year, etc.

Balancing the work and personal constraints may be tricky, so start talking to your manager early. I know of some people who are planning their sabbaticals at least a year in advance due to time constraints. The earlier you talk, the easier it is to factor your disappearance into project plans.

Fifth principle: There is rarely a good time to leave work for an extended period of time - don’t let this stop you.

What to do

I think most people who are given the opportunity to go on sabbatical struggle to work out what we want to do with the time. It’s an unusual, valuable gift, and we feel a responsibility to use it well.

The good news is that, based on what I’ve described above, it doesn’t actually matter what you do. Anything you choose to do is just icing on the cake.

There is also an element of uncertainty in planning many sabbaticals - if you’re planning far enough ahead, the unexpected may pop up and change everything. I ended up selling my house and moving a week before my sabbatical started, so suddenly had “sort out the new house” as a project I could not have anticipated. I had also planned to do some projects that ended up being impossible due to unforeseen events. Be ready and willing to adapt your plans².

Sixth principle: Allow room for serendipity, the unexpected, and exploration.

Before you go

Sabbaticals should be largely free from work-worry, so do the work to avoid the worry up front. Perhaps more importantly, though, plan your return with your manager. A lot of worry during sabbaticals is about “what will it be like when I come back?” so having set a plan before you leave can alleviate a lot of this concern. What will you be working on? Who will you be working with? What deadlines will you have? Make sure that you have a good sense of all this, and that your manager is aware of (and shares) your expectations. It may not be possible to set up a specific project for your return, but making sure your manager will explicitly have something ready for you is important.

Set up a debrief meeting for when you get back so that you can get up to speed quickly.

This forward planning will make for less worry, and a much better sabbatical (and return). Then set your out-of-office reply and leave with a clear conscience!

Seventh principle: Plan for your return before you leave, to avoid worry.

When you return

Most work environments are fast-paced, and things can change a lot even during a few weeks away. Moreover, you may well have changed in six weeks.

Everything might be just fine and dandy on your return. That’s great!

Equally, it may feel a bit weird when you come back. This is also OK and perfectly normal. I found that peer support was invaluable in managing my return³ - talking to people I trust (including, but not solely, my manager) about what I was feeling, what changes I wanted to make, etc. Allow time for that, and embrace it as a positive aspect of going on sabbatical.

Eighth principle: The sabbatical process doesn’t end after six weeks.

A sabbatical doesn’t mean quitting

Sometimes people quit after going on sabbatical. There, I said it.

People also don’t quit (I came back!). You don’t have to be unhappy or dissatisfied with work to go on sabbatical, nor should going be viewed as a precursor to leaving completely. Some people may realise they want to make changes in their work while they’re away, and for some of them that might mean doing something completely different, but that’s a very individual choice. There may be some degree of correlation between quitting and sabbaticals, but I’d argue that there is no causation there.

I certainly came back having missed everyone I worked with, and while there were things I wanted to change, my time away reaffirmed to me just how much I love Etsy, its people, what it stands for, who it serves, and the impact it can have in the world.

Ninth principle: You don’t have to be unhappy or frustrated in your work to go on sabbatical.

That’s it

That’s a lot of stuff. I’m amazed you’ve got this far. Hopefully these principles are useful for those of you thinking about sabbaticals. They may even be useful for managers with staff with impending sabbaticals, or for companies thinking about implementing or updating a sabbatical program.

One thing I’d emphasise is that many of them apply to taking “regular” vacations too. The ability to disconnect from work is, in my opinion, one of the most valuable skills you can acquire.

Tenth principle: If you’re going to enumerate principles, try to come up with a round number.

Footnotes

0: To be clear, I’m not criticising or trying to shame people who check their work email outside of work hours. There are reasons why that happens (I often do it myself). But, we should be clear about the trade-offs we’re making and the implications of them, and the reasons why we’re doing things.

1: I’d recommend this for vacations too, for what it’s worth.

2: My parents were visiting us during the start of my sabbatical, and as a result one of my favourite sabbatical memories proved to be spending a week with my dad fixing up the dock at my new house. I could not have planned this, but will treasure it.

3: In truth, I had a bit of a Mid-Career Freakout….

Failure Is An Option

Thu, 18 Jun 2015 00:00:00 -0500

This is a transcript of a talk I gave at the Velocity Conference in Santa Clara on 28 May 2015. It’s transcribed from the video recording, but edited slightly for clarity.

I’m Ian, I’m a software engineer at Etsy. For those of you who don’t know, Etsy is an online marketplace for handmade and vintage goods, and I’ve been working there for about five and a half years now. Over the last five years we’ve been on a journey from sort of traditional working practices to a DevOps-style approach. It’s all been fascinating to be part of, but the thing that I’ve found most interesting—the thing that I always find I come back to—is how we’ve started to evolve the way in which we approach failure: how we understand failure, what we care about, how we deal with it. And it’s important. It’s how we learn: we learn from our failures.

What I’m going to talk about today is not a blueprint for organisations, it’s not some magic bullet where failure magically becomes your friend. I’m just going to talk about how Etsy approaches failure, how our philosophies and our ideas around it translate into the day-to-day work, into our tooling, and into our approaches for building stuff.

But before I get into that, I’m going to present three truths. At least I will claim them as truths. You are free to argue. The first of those is: you will create bugs. We’re not perfect. Well-trained, intelligent, motivated engineers still make bugs. We’re building complex systems, and with any system—any code that’s sufficiently complex to be useful—it’s impossible to be certain of being bug-free. And it takes a long time to even get close. We don’t have time.

You will build the wrong thing. Our understanding of our markets, our understanding of our requirements, is imperfect. And our markets and our requirements may well change while we’re developing things. It’s not just that people are moving the goalposts, the whole pitch can be shifting. And that extra time that we took to be sure of being bug-free? Well, everything changed again. And we still have bugs.

And the third truth. Kind of by definition: you will not foresee the unexpected. You just won’t, it’s by definition. Weird stuff happens that you just can’t anticipate. We’re not acting in a bubble. Outside factors—internal factors for that matter—will always tend to complicate things. And when they do, or when you build the wrong thing, or when you have bugs, there is a cost. When we fail, there are costs associated with that.

It costs money. The easiest one to spot. The site goes down, you can’t do business. People can’t pay our sellers money. People can’t sign up to sell, people can’t sign up to buy. But, we could have capacity planning failures. If you have some sort of capacity planning failure, all of a sudden your AWS bill goes through the roof, or you’re having to buy a ton of servers and ship someone down to the datacenter to rack them in short order.

It costs time. It costs time to go back and fix those bugs. It costs time to go back and build the thing that you built wrong. You have to rebuild it. It costs staff time that you can’t afford.

It costs data. Data loss, the thing that tends to keep us up at night.

It costs customers. “This site sucks, I’m never going back to this, it’s broken all the time.”

It costs us credibility. It’s hard to even attract customers if you’re failing all the time. And from an engineering point of view, it’s also hard to attract good people to work with you. If you get a reputation for having your site on fire all the time, it’s hard to attract the good engineers that are going to make your company a better place and push you forwards.

But I said that failure is inevitable. Bugs happen. Building the wrong thing happens. Failure to anticipate happens. We are… doomed. We should go home. We’re done.

But while failure is inevitable, expensive failure is not. How do we make small failures rather than big failures? How do we fail in private rather than in public? How do we fail tactically rather than strategically? How do we reduce the cost of failures?

So the typical response to this inevitability of failure is to erect barriers. We’re going to double- and triple-check everything. We’re going to implement processes, we’re going to implement procedures, we’re going to limit the ability of any individual to do harm. All this gives you is deniability. “I followed the rules.” The site still caught fire.

Instead, what we want is speed. This is an aptly-named conference, Velocity. It may seem a little counter-intuitive, but speed and the resulting flexibility makes you safer in an uncertain world. It sounds a little dangerous, and it is dangerous in the same way that power tools are dangerous. Used well, used skillfully, a power tool will let you get a lot of stuff done very efficiently. But you can also cut off your thumb.

So our general philosophy is that if we spend a lot of time and effort hiring good people (and we do) we should get out of the way and let them get stuff done. And that requires trust. We trust people to do the right thing. If we trust people to act responsibly and in the best interests of our customers and the best interests of the company and the best interests of our community—if we trust them to do that, they do. If we trust developers to take responsibility for their code in production and not just throw it over the wall at Ops, they do. If we trust Ops to talk openly and frankly about the impact of development work on the site, and their resulting on-call schedules and capacity planning, they do. And it feels good to be trusted.

And this gives us flexibility. Speed and flexibility. We get rid of these rigid timelines, we get rid of the straitjacket procedures and the hoops that you have to jump through, and that gives us space to come up with solutions to problems. We can respond quickly, we can respond effectively.

So, we just say “go fast, we trust you!” and we’re done! No… unfortunately not. But, when you accept failure, accept that it will happen, and when you accept that speed and flexibility are the way to respond to this, then you have to start crafting your tooling and your approaches to deal with that.

So let’s look at how we look at bugs. Bugs are sort of the thing that tend to exercise us most commonly as engineers, and the core of what we want to do is deploy. We want to deploy often. With continuous deployment, the way we do it at Etsy, we can deploy master at any time with minimal fuss, and minimal ceremony. That’s the core of what gives us speed and flexibility.

We have Deployinator. It’s a web-based tool, it’s a very open process. It started off as a bunch of scripts for Ops to try to wrestle the code onto the site but slowly over time it evolved into this web-based process. And it’s open to everyone. There are no gatekeepers. It’s not limited to just engineers, it’s not limited to Ops or some priesthood of release engineers, it’s open to anyone. We have designers, we have product managers who will deploy their own code. They don’t need to wait for an engineer to act as an intermediary, they can just get on and do their jobs.

If you look, you’ll see it’s as you’d expect: we log who did what and when and what happened. But you’ll also notice up at the top, there’re things like the #push topic—we orchestrate all of this through IRC in the #push channel. I took this screenshot in the evening, and the IRC topic says “Off Hours”. So although we can deploy any time, day or night (and we do), we tend not to deploy in the evenings, when there aren’t so many people around. We can, if we need to, but we prefer not to. It makes it a little safer if there are more people in case things go wrong. But also, it’s nice to encourage people to go home and do other things, rather than be pushing all night.

So if we have this easy process for deploying—we make it fast, we make it easy—the result of that is that we do it often. And because we do it often, we’re pushing small chunks of code. We build something small and we deploy it. We build the next something small and we deploy it. And that has a big impact on code reviews. So code reviews are a core part of what keeps us safe. We’re reviewing each other’s code. And that takes time and effort, but it’s time and effort that pays off, because you’re learning through code reviews. You’re learning about the people’s code that you’re reviewing, you’re learning from their approaches to solving problems. And they’re learning from you, they’re learning from your expertise. And these small chunks are way easier to review. If I give you a 50 line diff, you’ll probably find ten bugs and a handful of architectural problems. If I give you 500 lines you’re like “Yeah, it’s fine, ship it. We’re good, go ahead.”

So if you look at a typical deploy—this is maybe on the slightly smaller side, but it’s not atypical—it’s 19 files, six people, about 240 lines of code changed, so about 40 lines of code each. It’s not much code per person. That’s easier to review, easier to reason about, and if there’s a failure, then it’s limited to a much smaller surface area. And typically, instead of having to revert, we actually roll forwards. It’s as quick to find the problem, fix it and push it out again as it is to push a revert. So we don’t lose our momentum. We keep moving forward. We solve our problems and we can do it quickly because we have continuous deployment. So we’re reducing the time to resolve the problem, we’re reducing the time it takes to get our code working. It’s reducing the cost of failure.

But, I’m getting ahead of myself, because we really don’t want the bugs to even get to production in the first place. That’s unfortunate when that happens. The core of avoiding bugs is testing. We’re going to make sure our code works, as best we can. There is very little manual testing at Etsy as a distinct job function. We have QA engineers and they’re amazing, but we tend to use them on areas of high risk, so things like our apps which are harder to do continuous deployment with, or things like checkout where money is on the line, we’ll put dedicated QA resources there to really give it a good kicking and dig in with their specialist skills. But beyond that, QA becomes the responsibility of all engineers, and QA as a job function becomes part of a partnership, a collaboration. We work with QA engineers to understand their skills, to learn from them, and become better testers ourselves. So we have QA as a partner, as a collaborator, rather than QA as a gatekeeper. Or, in the event of failure, QA as a blame sponge. “It’s QA’s fault. They passed it.” No, it was your bug.

Similarly Security. Like QA, they’re collaborators, they’re partners, not gatekeepers. QA and Security help to keep us safe through their specialist knowledge, but they’re sharing that with us. We’d rather get their insight early and fix the problems rather than get it right before we deploy and suddenly have a panic because there’s some major security flaw. We want to partner with Security, we want to partner with QA, rather than have this sort of gatekeeping, adversarial relationship, and that’s going to keep us safer in the long run.

And of course we have automated tests, that’s not a surprise. But even here there are some interesting trade-offs. Because, the more automated tests you have, and the more complex they are, the longer they take to run. And that gives an upper bound on the number of deploys you can do. If your tests take an hour to run, you’re going to have a maximum of maybe seven, eight, nine deploys a day. And if the site catches fire and you want to push, you have to make a choice. Either you’ve got to let it burn for an hour (not desirable) or you’ve got to skip the tests (also not desirable). What we want is for the tests to be fast.
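To put rough numbers on that ceiling, here is a minimal sketch. It assumes deploys run strictly one after another and each waits for the full test run; the function name and the eight-hour working day are illustrative assumptions, not Etsy figures.

```python
def max_deploys_per_day(test_minutes: float, working_hours: float = 8.0) -> int:
    """Ceiling on serial deploys when each one waits for the whole test suite.

    Assumes (hypothetically) that nothing else takes time: no review, no build,
    no manual checks. Real numbers will be lower.
    """
    if test_minutes <= 0:
        raise ValueError("test_minutes must be positive")
    return int(working_hours * 60 // test_minutes)


if __name__ == "__main__":
    for minutes in (60, 10):
        print(f"{minutes}-minute test run: at most {max_deploys_per_day(minutes)} deploys")
```

With an hour-long suite this lands at eight deploys in an eight-hour day, in line with the "seven, eight, nine" above; cutting the run to ten minutes raises the ceiling to 48.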

And to do that, we have two approaches. One is that we throw hardware at it. We break the test suite up and we split it across a bunch of very fast test machines. That works very well for speeding things up; we can get the test run times down quite well. But when we started instrumenting a lot of this, we took a look at it, and we looked at the distribution of times for each individual test, and we found that a very small number of tests contributed to the vast majority of the run time. And we looked at them, and we deleted them. Sacrilege. But, in truth, when we looked at them, they’re not actually making us much safer. They’re these arcane tests for some weird regression that’s probably not going to happen, but we’re having to pay this time every time we want to deploy, every time we want to run the tests. Which remember, if we’re deploying fifty times a day, adds up quickly. The amount of safety it gives us is minimal compared to the cost of slowing us down.
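The "small number of tests, most of the run time" observation is easy to check once per-test timings are recorded. The sketch below is illustrative only: the test names and durations are made up, and `slowest_share` is a hypothetical helper rather than part of any Etsy tooling.

```python
from typing import Dict


def slowest_share(durations: Dict[str, float], top_n: int) -> float:
    """Fraction of total run time spent in the top_n slowest tests."""
    total = sum(durations.values())
    if total == 0:
        return 0.0
    slowest = sorted(durations.values(), reverse=True)[:top_n]
    return sum(slowest) / total


# Hypothetical timings in seconds: 98 quick tests and two very slow ones.
timings = {f"test_fast_{i}": 0.5 for i in range(98)}
timings["test_legacy_regression"] = 120.0
timings["test_full_checkout_replay"] = 80.0

if __name__ == "__main__":
    share = slowest_share(timings, top_n=2)
    print(f"2 of {len(timings)} tests account for {share:.0%} of the run time")
```

A report like this only tells you where the time goes. Whether a slow test is worth keeping is still the cost-versus-safety judgment described above.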

Similarly we have no tolerance for flaky tests—a test that just fails at random. Most of you who have dealt with automated tests will have encountered these things. But if you have a test that flakes 1% of the time, if you’re pushing 50 times a day that’s going to happen every two days, on average. And when it happens, it completely derails the push. You’re ticking along nicely, all of a sudden tests are red, something’s gone wrong. Panic. Go and look. OK, what failed? Is this a flaky one? Is it OK? Are we sure it’s actually flaking this time and it’s not actually a real failure? You’ve got doubt, you’ve got uncertainty, you’ve got to dig into things, you’ve got to run the tests again, you’ve got to double check stuff, and your deploy goes to hell. And all the people lined up to deploy after you, they’re also stuck. So it completely ruins the flow of things. And you can guarantee that this happens right when the site’s on fire and you want to deploy right now. So if a test flakes, it either has to be refactored to make it not flake, or it has to be deleted. The cost of the test, because it’s flaky, typically outweighs its benefit. This is a pattern. Does the cost of something outweigh the benefit, the extra safety that it’s giving us?
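The "every two days" figure is simple expected-value arithmetic, sketched below. It assumes each test run flakes independently with the same probability, which is a simplification; the function names are illustrative.

```python
def expected_flakes_per_day(flake_rate: float, deploys_per_day: int) -> float:
    """Average number of flaky failures per day for a single flaky test."""
    return flake_rate * deploys_per_day


def prob_at_least_one_flake(flake_rate: float, deploys: int) -> float:
    """Chance that the test flakes at least once across `deploys` independent runs."""
    return 1.0 - (1.0 - flake_rate) ** deploys


if __name__ == "__main__":
    per_day = expected_flakes_per_day(0.01, 50)
    print(f"expected flakes per day: {per_day}")
    print(f"days between flakes, on average: {1 / per_day}")
    print(f"chance of at least one flake today: {prob_at_least_one_flake(0.01, 50):.1%}")
```

One test flaking 1% of the time across 50 deploys a day gives 0.5 expected failures per day, or one every two days. Under the same independence assumption, there is a little under a 40% chance of hitting it on any given day, and every additional flaky test pushes that higher.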

But really, we don’t want tests to fail at all when we’re deploying, because it derails the push, so we have to try. Try is our way of being able to run the tests fast for developers. We don’t want to run them on our VMs, they’re a little bit underpowered to be honest and the tests take ages to run. Instead we run the try command, which takes your diff, sends it to the try servers, they apply it to master as if you were deploying and they run the tests, essentially simulating what it will be like when you come to deploy. And when everything is green, you can have some measure of confidence that when you deploy the tests are going to pass. Test failures in the deploy queue are comparatively rare because of that. And that not only minimises the risk of bugs, it’s minimising the risk of us not being able to deploy at will when we want to.

So, we’ve run try, everything’s green, we commit our code, we hit the button, the automated tests start running, and we have a Princess. Princess is the name we gave to a kind of pre-production environment. It’s basically all your new code but with production backends. It’s exactly what our members will see when we actually get to production. We are able to poke around while all the automated tests are running. We can poke around and just verify that things do actually work in production. We’ll see exactly what our members are going to see when it goes out. It’s a final manual check—remember, QA aren’t gatekeepers. It’s your responsibility—your code, your responsibility—so we can poke around on Princess, make sure that everything’s working, everything’s doing what we expect it to do, and then when the automated tests are done we can push to production.

We do some manual testing in production, but we also have our graphs. We measure everything. We have a ton of metrics. We can watch those and see what happens. If things go wrong, graphs will usually move. We summarise those on dashboards.

Here’s a small subset of one of our deploy dashboards, around end-user errors. You can see some interesting features. Three-Armed Sweaters, which is the name given to a particular image on one of our error pages—if that spikes, something’s going wrong. Screwed Users is an evocatively-named metric, but that’s actually an aggregate metric. It’s the sum of a bunch of different metrics, each representing a member having a bad experience, a bad time, some sort of error. We add them up and if it moves, we know that something is going wrong. We can dig into other graphs on other dashboards to work out exactly what, but these are our canaries in the coal mine. You also might see a faint grey line there—that’s a historical average. It not only shows if things are moving, it shows what we typically expect. Is it about in line with the norm or is it deviating? And you’ll see the vertical lines, those correspond to deploys. If you’ve got a graph that’s ticking along nicely, you hit a vertical line, a deploy, and it suddenly spikes or it suddenly drops, you know something’s going wrong, you can be pretty sure it’s related to the deploy. Correlation is not causation, etc., etc., but you can kind of assume it.
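As a rough illustration of the two ideas here, summing several error series into one aggregate and comparing the readings on either side of a deploy, consider the sketch below. The metric names, the sample data, the window size, and the 2x threshold are all assumptions for the example; this is not how the actual dashboards are built.

```python
from statistics import mean
from typing import Dict, List


def aggregate(metrics: Dict[str, List[float]]) -> List[float]:
    """Sum several equally sampled error series into one aggregate series."""
    return [sum(values) for values in zip(*metrics.values())]


def deviates_after_deploy(series: List[float], deploy_index: int,
                          window: int = 3, threshold: float = 2.0,
                          floor: float = 1.0) -> bool:
    """True if the mean after the deploy is `threshold` times above or below the mean before it."""
    before = series[max(0, deploy_index - window):deploy_index]
    after = series[deploy_index:deploy_index + window]
    if not before or not after:
        return False
    # The floor keeps near-zero series from being flagged on tiny fluctuations.
    baseline = max(mean(before), floor)
    recent = max(mean(after), floor)
    return recent > baseline * threshold or recent < baseline / threshold


# Hypothetical per-minute counts; a deploy lands at index 4.
metrics = {
    "server_errors": [2, 3, 2, 2, 9, 11, 10],
    "failed_logins": [1, 1, 2, 1, 4, 5, 6],
}

if __name__ == "__main__":
    combined = aggregate(metrics)
    print(combined)
    print(deviates_after_deploy(combined, deploy_index=4))
```

A check like this only says that the aggregate moved around the time of a deploy. As noted above, working out which underlying metric moved, and why, still means digging into the other graphs.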

So we have the deploy dashboard which is a specifically curated set of graphs, the sort of top-line metrics like user-facing errors, number of checkouts, number of logins, basically a good barometer of the health of the site. But we also make it easy to build and deploy custom dashboards, so individual teams or individual engineers can build their own dashboards to monitor the things they care about. So when they’re pushing out features they can monitor their own dashboards as well as the deploy dashboard and they can see if something’s going wrong.

And we also have logs. We have a web-based UI called Supergrep which streams live logs, so if something goes wrong, all of a sudden Supergrep is full of errors. We know something’s gone wrong, but those error logs also contain things like stack traces. We can link them to GitHub so we can see very quickly what’s going wrong. So graphs and logs not only improve our mean time to detection (we can see when something’s gone wrong because the logs are filling up or the graphs are moving), they also allow us to respond more quickly. We can reduce the cost of failure because we can dig in, find where in that small changeset the problem came from, fix it, and roll forward.

Now, this all sounds well and good, small changes pushed out regularly, that’s lovely. But it’s not actually how we build features, right? We don’t actually build them little bit by little bit. And that’s true. We have this idea of dark code: code that isn’t executed. So rather than pushing code out and it being immediately live, it goes out and it’s not executed. We’re shipping the code incrementally but we’re not necessarily shipping the features incrementally. So we’ll push it out behind a feature flag. A feature flag is typically an if-statement. It checks a config file: if it’s enabled, execute this block of code. If it isn’t, don’t touch it. That code is dark. It’s not going to affect the end-user experience at all. So we can isolate our new code and have it happily living in production before it’s ever switched on. Individual engineers can override those flags so you can go in and test in production before it gets to anyone else. You can just go in and override the flag.
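A feature flag of this kind can be sketched in a few lines. This is a minimal Python illustration with invented names (`CONFIG`, `is_enabled`, `render_checkout`, the `new_checkout` flag); it is not Etsy's actual flag system, just the shape of the if-statement and the per-engineer override.

```python
# Hypothetical config file contents: flag name -> enabled for everyone?
CONFIG = {"new_checkout": False}


def is_enabled(flag: str, overrides: frozenset[str] = frozenset()) -> bool:
    """A flag is on if the config says so, or if this request overrides it."""
    return flag in overrides or CONFIG.get(flag, False)


def render_checkout(overrides: frozenset[str] = frozenset()) -> str:
    if is_enabled("new_checkout", overrides):
        # Dark code: deployed to production, but not executed for members
        # until the flag is switched on.
        return "new checkout"
    return "old checkout"


print(render_checkout())                             # what members see
print(render_checkout(frozenset({"new_checkout"})))  # an engineer overriding the flag
```

An unknown flag defaults to off here, so a typo in a flag name fails safe rather than exposing unfinished code.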

When it comes to switching it on, it’s not an all-or-nothing thing, it’s not a boolean. We can just send it to 1%, a little sliver of traffic. It’s great for testing performance. If your databases start catching fire at 1%, imagine what would have happened if it had been 100. It’s much better to expose 1% of your traffic to a bug than 100%: you’re reducing that potential cost of failure. There’s no rush, you can go to 2%, maybe you’ll skip on to 10%. Pushing a config is very quick, you’re just deploying one file and running a small set of tests to make sure the config file is safe, so it’s a really quick process. We keep going, we get to 50%, and if something does go wrong, if something goes bang, there’s a bug, I didn’t see it, there’s a problem: it’s OK. Take it back down to 0%. Switch it off. It’s been off in production for ages already, you can be confident that it’s off when you tell it it’s off. You can be confident that you’ve cut the bug back off again, you’ve reduced the cost of failure. And eventually, of course, you get to 100% (hopefully) and that’s the nearest thing we have to a launch. But we’re not deploying a brand new version of Etsy, it’s a new feature or a revision of an existing feature. It’s often a bit of an anticlimax because it’s been in production already for ages. OK, on with the next thing.
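One common way to implement a percentage ramp-up is to hash each member into a stable bucket and compare it with the configured percentage. The sketch below is in Python and assumes that approach; the talk doesn't say how Etsy assigns traffic, and the flag name and member ids are invented.

```python
import hashlib


def bucket(flag: str, user_id: str) -> float:
    """Map (flag, user) to a stable number in [0, 100)."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) % 10000 / 100


def is_enabled(flag: str, user_id: str, percent: float) -> bool:
    """True if this user falls inside the first `percent` of buckets."""
    return bucket(flag, user_id) < percent


users = [str(n) for n in range(1000)]
for percent in (0, 1, 10, 50, 100):
    exposed = sum(is_enabled("new_search", u, percent) for u in users)
    print(f"{percent:>3}% configured, {exposed} of {len(users)} users exposed")
```

Because the bucket depends only on the flag name and the member, anyone who saw the feature at 1% still sees it at 10%, and setting the percentage back to 0 switches it off for everybody.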

But what if we built it wrong? We don’t have bugs, it’s just the wrong thing. We’re not really a “Minimum Viable Product” type organisation, but we do tend to build small things and iterate on them, and that’s reducing the risk of building the wrong thing. Are we completely off? Are we actually solving the problem we’re meant to be solving? Continuous deployment makes this easy because you can iterate and test and tweak and you know that you’re not hooked into some three- or six-month release cycle where you have to get it right otherwise you’re stuck for another three months, six months.

But not only can we switch feature flags on for a slice of traffic, we can switch it on for particular groups. So we can expose our colleagues to the new code. They can be our guinea pigs, and they’re really good guinea pigs because as you can imagine our colleagues are using the site all the time. We all have a nasty habit of spending too much money on Etsy, and many of us are also sellers, so we have a good idea of what works and what doesn’t, we have a good idea of what’s meant to happen. We get not just feedback like “oh this is buggy” but “this doesn’t actually solve the problem” or “this completely messes up my workflow”. We can get this feedback and we can get it in private before we even get out to our sellers or our buyers.

And even when we want to get out to the public, it’s not an all-or-nothing thing. We can set up groups of people that our buyers and sellers can opt into. “We’re testing out a new feature. Join this group and you’ll get access to it. Tell us what you think.” We get that feedback. If it’s awful, if everything’s broken, they can leave the group and they’re back to the original experience. So it minimises the cost of failure for them. They get to come and give us feedback and we learn a great deal from them, but if something’s bad, if the experience is bad or buggy, they can back out of it at no cost.
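Switching a flag on for colleagues or for an opt-in group can be sketched like this. It is a minimal Python illustration; the `Flag` and `Member` types and the group name are hypothetical, not taken from Etsy's code.

```python
from dataclasses import dataclass, field


@dataclass
class Flag:
    staff: bool = False                            # on for colleagues?
    groups: set[str] = field(default_factory=set)  # opt-in groups that get the feature


@dataclass
class Member:
    name: str
    is_staff: bool = False
    groups: set[str] = field(default_factory=set)


def is_enabled(flag: Flag, member: Member) -> bool:
    """On for staff if the flag says so, or for anyone in an enabled group."""
    if flag.staff and member.is_staff:
        return True
    return bool(flag.groups & member.groups)


new_listing_form = Flag(staff=True, groups={"listing-form-beta"})
seller = Member("a seller", groups={"listing-form-beta"})

print(is_enabled(new_listing_form, seller))  # opted in
seller.groups.discard("listing-form-beta")   # leaves the group
print(is_enabled(new_listing_form, seller))  # back to the original experience
```

Leaving the group is the member's own off switch: nothing needs to be deployed for them to get the old experience back.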

When I was talking about ramping up and got to 50%, some of you may have thought “that sounds remarkably like A/B testing” and you’d be right. A/B testing is a very common technique for working out “are we building the right thing?”. We take a hypothesis, we build something to test it, and we show it to a subset of our visitors and we compare the two populations. We can measure things, be it conversion rate, bounce rate, whatever metrics you care about you can measure those and you can look and see how the two populations differ. Does it make a difference? Does it make the right sort of difference or does it make things worse? Try to understand why, rinse, repeat, get better.
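Comparing the two populations usually comes down to a statistical test. The sketch below uses a standard two-proportion z-test on conversion counts, written in Python; the choice of test and the numbers are assumptions for illustration, since the talk doesn't say which analysis Etsy uses.

```python
from math import erf, sqrt


def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in conversion rate
    between control (a) and variant (b)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 1 - erf(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Hypothetical experiment: 10,000 visitors in each population.
z, p = two_proportion_z(conv_a=200, n_a=10_000, conv_b=260, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value says the difference is unlikely to be noise. It doesn't say why the difference exists, which is the "try to understand why" part.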

But we still have the problem of the unexpected. Although we can’t foresee the unexpected, we can expect it. We know that something weird is going to happen, this is inevitable. Speed and flexibility are your friends. Fast, easy deploys allow you to react quickly when the unexpected happens. They allow you to recover.

So, for example: I used to work on Etsy’s Risk Engineering team. We have an on-site messaging system called “Convos” and it’s a known thing that sometimes we get some spam through it, but it’s usually pretty low-level and our existing spam technologies were handling it. Until one day we started to get a new attack, much more sophisticated than any of the ones we’d seen before. Much higher volume and it was getting past all of our defences. This was a problem. We don’t want our members exposed to spam, that’s a bad experience.

But what we could do was look at it and say “what’s the simplest thing we can do to stop this attack now?”. The answer was that the messages all had a particular string in them, clearly part of the spam. So we could push out a thing that says “if it contains this string, it’s spam, ban the account, don’t send the message”. Alright, very simple change, very easy to reason about, very easy to have confidence that it’s going to do what we expect and not have any unintended side effects. We wrote it, pushed it out. It took maybe twenty, thirty minutes from us realising that we had to do something to actually getting this simple thing pushed out, and it stopped the attack. It didn’t stop it for long. The spammer noticed, of course, reconfigured the attack and started up again. But it took about an hour, maybe, to notice, during which time we’d had a bunch of conversations about “what’s the next simple thing we can do?”. We already had that ready, we pushed that out, and it stopped the attack again. We did this for probably a day or two, off and on, them reconfiguring the attack, and us getting gradually more and more sophisticated, having more and more time to make good decisions about what we wanted to do.

And we ended up with a very robust set of spam-fighting measures and we ended up with the attack stopping. Without continuous deployment, without the ability to do this sort of cat-and-mouse back-and-forth, we would have either had to switch off Convos (not good, it’s going to impact on how our sellers and our buyers use the site) or leave Convos on and just deal with the spam. Neither of those alternatives is particularly appealing, and we didn’t have to deal with them because we had continuous deployment.
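The very first rule in that story, "if it contains this string, it's spam", is about as small as a change gets. Here is a minimal Python sketch of it; the marker string, the `Account` type and the function name are invented for illustration, as the real ones aren't given.

```python
from dataclasses import dataclass

# Hypothetical marker; each later round of the cat-and-mouse would add to
# or replace this list.
SPAM_MARKERS = ["cheap-watches.example"]


@dataclass
class Account:
    name: str
    banned: bool = False


def handle_message(sender: Account, body: str, markers: list[str] = SPAM_MARKERS) -> bool:
    """Return True if the message should be delivered.

    If the body contains a known spam marker, ban the sender and drop it.
    """
    if any(marker in body for marker in markers):
        sender.banned = True
        return False
    return True


spammer = Account("spammer")
print(handle_message(spammer, "Visit cheap-watches.example today!"), spammer.banned)
```

A rule this simple is easy to reason about, which is what makes it safe to push out in twenty or thirty minutes.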

We also have the ability to have partial failures. So if, for example, the Conversations database blows up, we can actually switch off Conversations using our feature flags. We keep feature flags in production so we can switch features off and on as need be. So we can switch off Convos and yes it’s not good because people can’t send their messages, but they can still buy, they can still give our sellers money. Our sellers can still manage their shops, they can continue to do business. It’s not ideal, it’s a degraded experience, but it’s better than a total failure. Similarly, if we rely on third parties and they tell us they’re going to have some scheduled maintenance (or they don’t tell us they’re going to have some unscheduled maintenance) we can wire that particular bit off, we can reduce that surface area. The site doesn’t work as well as it should do, it doesn’t work the way we want it to, but it’s not broken.

But despite all of this we still fail. Despite our best efforts. All of the stuff that we’re doing reduces the risk, it reduces the cost, but it doesn’t eliminate it. But if you accept that that’s going to happen, then you also need to accept that you need to learn from it. The skateboarder Rodney Mullen said “the best skaters are the best fallers”. We want to be really good fallers. We want to be able to learn from our mistakes so that we don’t do that again. The way in which we do that is through blamelessness. The culture of blamelessness at Etsy is core to how we’re learning from failure, and it’s an entire topic all by itself far outside the scope of this talk, but at the core if something goes wrong we’ll have a “blameless postmortem”. We’ll review what went on, we’ll ask what happened. What did people do? Why did they make the decisions that they did? We assume that they’re acting in good faith. We assume that they’re not expecting to take the site down, they just did. Why was there that mismatch between what they expected to happen and what actually happened?

From that we learn. We get these remediation items that come out of our reviews. What do we need to do to make ourselves safer in the future? What can we learn? These are very important. One of the things that we try to avoid though is the knee-jerk response “let’s implement a new process”. Remember that thing with automated tests. Does the extra friction, does the extra time it takes to do a particular process actually contribute safety, or does it (by slowing us down) increase our risk?

We share what we’ve learned. We share it widely. We get these amazing emails where someone’s built an experiment and they’re like “We expected that it would increase conversion by 10%. It didn’t touch conversion but sign-ups dropped by 25%.” Oh. It’s a big failure. We’ve completely failed to understand some part of the site, some interaction, and these are emailed out to the whole company and people are discussing why it failed. We’re not just learning from it, and learning from it publicly (at least publicly within the company), we’re normalising the idea that failure happens and that learning is important. Together, we can understand our systems better for the future.

To the extent that we actually celebrate failure. We have an annual award: The Three-Armed Sweater Award. I mentioned the three-armed sweater that’s on our error page. We actually had an Etsy seller knit us a literal three-armed sweater and it’s awarded each year to the individual or team who breaks the site most spectacularly while attempting to do something important. As an example, my team won the inaugural Three-Armed Sweater when a system that we’d built went kind of a little bit rogue and started banning all of the admin accounts on the site, all of my colleagues were like “oh, we’ve been banned!” Sorry…. It was a bad day. It was not a good time. But that system still exists at Etsy. We fixed the problem that caused it to do that, and now it’s—day-in, day-out—part of our risk mitigation strategies, and it’s keeping our members safe. We were doing something important, we were building something important. We made a mistake and it was unfortunate, but we learned from that mistake.

Similarly the t-shirt that I’m wearing today—the four stars and a horse—we actually printed up t-shirts to celebrate a bug because it was sufficiently awesome. Somehow the half-star glyph started rendering on some platforms as a horse. (One of the platforms was guessing emoji and apparently “half-star” is close to “horse”.) Obviously we fixed the problem, but we also had this amazing email thread with hundreds of messages with horse-based puns and then we screen-printed t-shirts. We celebrate failure, especially the most spectacular ones.

It’s really about celebrating pushing the envelope. If you never push the envelope, any progress you make will be slow. And you’ll still have failures. If you do push the envelope, you are going to have failures and some of them might be spectacular, but if we celebrate these spectacular failures we make sure that everyone understands that we shouldn’t be timid, we shouldn’t be fearful. Our tooling provides us with the ability to recover from these failures (even the spectacular ones).

Mark Zuckerberg said “move fast and break things”, which is fine, but I think my colleague Dannel Jurado said it better:

Move Fast and Break Things But Then Fix Them Because You Broke Them and You Can Move Fast Come On You’re Better Than This™
— Dannel Jurado (@DeMarko) August 16, 2013

This is the Etsy Way.

And with that, I leave it to you. I’ve basically described this idea, how we take this idea, this fundamental idea “speed makes us safe”, and I’ve worked through some of the ramifications of this: how it impacts day-to-day, what decisions we make as a result. This is where we are today, it’s not where we were last year, it’s certainly not where we were five years ago. It’s a process, we’re always iterating, we’re always improving, we’re finding new things to tweak. If you’ve sat here and you’re looking at things and you’re disagreeing with me, that’s great because it means that you’re thinking about why you disagree with me. You’re articulating why you disagree. It’s helping you define your own goals, your own philosophies, and helping you make clear the trade-offs that matter to you, the trade-offs that you think are going to make your organisation better and that are going to reduce your risk of failure. And for that, I wish you the best of luck.

Thank you.

Imperfect allies

Wed, 25 Mar 2015 00:00:00 -0500

Acting as an ally is hard, and mistakes are costly. This shouldn’t deter allies, but merely change how they approach their work.

As you learn about discrimination against a group you’re not part of, there comes a point at which (I hope) you want to help: you want to become an ally. An ally is a member of a privileged group who works to enable opportunity, access, and equality for members of a less-privileged class. Allies use their privilege, their advantages, to bring about change.

This is not without risk.

Potential allies often have concerns like:

What if I say the wrong thing?
What if I do harm instead of good?
What if people judge me poorly because I’ve made mistakes while trying to be an ally?

These are good, valid concerns to have, and asking yourself questions like these is a really important step in thinking about what you should do and say.

As a member of a privileged group, one of your privileges is the ability to do nothing. You can’t say the wrong thing if you remain silent. For the most part you would not be judged poorly by society. There is no expectation that you would work as an ally: doing nothing is the norm, and doing something can be scary.

We have seen examples of ostensible allies being called out and criticised for their words and actions. It’s enough to give any potential ally pause.

Allyship requires moving into a position of unfamiliarity and care, and it requires taking some risk. But we must not let fear of this risk—the fear of imperfection—cause us to shy away from acting. The work is too important to shirk.

Melissa McEwan’s On The Fixed State Ally Model vs. Process Model Ally Work describes a good way to approach the problem. (Go and read it all, summarising it won’t do it justice.)

By viewing “being an ally” as an ongoing process, rather than as a point of identity, we give ourselves room to make mistakes. We are a work in progress. Criticism still stings, but it is no longer an assault on our identity. It is through our mistakes that we learn and grow and become better allies. Moreover, if someone takes the time to give you constructive criticism, it is in many ways a compliment: they believe you can do better, and have taken the risk of potentially angering or upsetting you to help you get there.

I’ve messed up (quite badly at times) in the past. Each time my self-identification as an ally took a major bashing. However, once I adopted McEwan’s “process model” mindset, these provided learning opportunities. I was more open to criticism, and less defensive. I still felt embarrassed and ashamed, but some good was coming of it. I’m better than I was; I can be better still.

So, a few general guidelines to help avoid the worst risks of doing harm. (Note that many of the examples here refer to gender-related discrimination, but most of it could apply to other disenfranchised groups too.)

Act with sound intentions

Think carefully. Are you doing this to make yourself look good, or to make the world a better place? Is it about you, or about the people you’re trying to help? Are you at risk of (even inadvertently) undermining? Are you being a concern troll? Are you acting because you think the person you’re helping is unable to help themselves, or because you want to help take some of the load?

Sound intentions alone aren’t enough, of course, and even well-meaning people may still blunder, but some self-reflection and thought goes a long way to avoiding many of the pitfalls that lie in wait for would-be allies.

Always be learning (and paying attention)

There are lots of resources for would-be allies. Educate yourself. Yes, there’s a lot to read and thinking about this stuff can be really hard, but not putting the effort in is exercising your privilege.

Moreover, this is an area of continued debate, and ideas are continually evolving. Keep actively paying attention to the discussions around you and learn from them too.

Don’t feel compelled to actively participate in the discussion—your voice as a member of the majority group is not always necessary. Listen and learn, rather than taking up space in the discussion. (If you do join in, especially on social media, do some research before asking questions—others have probably asked and received answers before—and pay attention to how you interact.)

Act small

The most notable examples of people being publicly lambasted for claiming allyship are speaking on a grand scale: CEOs at major conferences, authors in Op Ed pieces in major publications, etc. This is not the place where you start with allyship.

Start with personal interactions—supporting your colleagues, for example. Most of the boxes on the Tech Diversity Bingo card are about things you can do within the workplace, often within your team or department, but they can still have a massive positive impact.

Personal interactions are also a way to help build trust with the people you’re trying to help. This, alongside the process of sensitising other potential allies to the issues of diversity, helps you build a collective will and ability to tackle diversity issues on a larger scale.

Local, small-scale failures can still be harmful and horrible, but typically you have more of a support structure to help you fix the damage, and the scope of the harm will hopefully be limited.

Measure twice, cut once

This is not a “move fast and break things” endeavour. It’s OK to take a moment to think about what you’re about to do or say, rather than leaping into action. Send a considered email about the use of homophobic language in your office to senior managers, rather than starting a shouting match in the hallway. Maybe you don’t stand up and shout “I OBJECT!” when a man interrupts a woman during a meeting. The woman may be about to correct him herself (in which case make your approval and support clear). Or you can talk to her afterwards to offer support in resolving the problem.

Once you get the hang of it, and you’re more used to the nuances of interaction that are happening around you, it’s probably easier to act in a more timely manner, but it’s OK to start slow—you can still be helpful.

Be a follower, not a leader

Sigel Phoenix’s On Being An Ally is another “read it all” post (and also touches on some of the Process-vs-Fixed-State issues from McEwan’s post) but one of the quotes that really stuck with me is

Being an ally involves something more radical than simply saying, I will work against my own privilege (and yes, that’s radical in itself). It also involves saying, The first step in combating my privilege will be stepping out of the position of power.

Make sure you’re there to lend support when needed. Following the guidance and lead of someone from the non-privileged group means that you’re more likely to stay on a constructive, helpful path.

Listen and believe

Listening and believing people from a non-privileged group is a vital skill for any ally, but it goes double for when what they’re saying is criticism of you. It is ridiculously difficult to avoid being defensive, but if someone says that you have harmed them in some way, you have harmed them. If someone says you’ve offended them, you’ve offended them. You have a duty of care to work out how you can make amends, and to learn from those mistakes, and defensiveness will not help with either of those.

Acting as an ally isn’t easy, but it seems to me to be unconscionable to choose to do anything else. Where there is discrimination and unfairness (which, in almost all cases, I benefit from) I need to work to level the playing field. I’m just going to start off using a shovel, rather than a bulldozer.

I have been fortunate to have had a number of people review this post, for which I am extremely thankful. Any mistakes that remain are my own….

The fireworks of karma

Mon, 01 Sep 2014 00:00:00 -0500

We’ve had trouble lately with people nearby setting off fireworks in the evening. I like fireworks—don’t get me wrong, one of the reasons I studied chemistry was because of them—but the attraction begins to pall after so many nights of dealing with kids who can’t sleep and a dog who thinks the world is ending but that it might stop if he can only fit all 100lb of him into your lap.

This evening, as the crack and whistle of more rockets starts to echo through the neighbourhood, my wife decides to see who’s actually doing it. She walks down the street and eventually finds a man standing on his back deck, launching rockets out of his hand and over his neighbours’ roofs.

Deciding that she might as well attempt to reason with him, she calls out and explains that it’s a school night, people are trying to get their kids to sleep, and would he mind stopping. He is rather an arse about it, but after she points out that she’s politely asking him to stop, he says that he will.

She walks back towards the house, and as she does so a police car drives slowly by. The officer pulls up next to her, winds down his window, and asks if she’s seen anyone lighting fireworks. (Clearly we’re not the only ones growing annoyed.) She says “yes” and that he said he’d stop. As she turns to point out the man’s house, she sees that he has—despite his agreement not to—come back out onto his deck with a lit firework in his hand.

At this point he looks over, and sees my wife talking to the police officer. He panics. He runs into his house. Still holding the firework.

There is a muffled “boom”. The delightful sound of karma meeting stupidity.

Putting the Ops in DevOps

Mon, 01 Sep 2014 00:00:00 -0500

I was recently made aware of The Road To DevOps, a piece on the ActiveState blog, which reminded me that I’ve been meaning to write this post for a while, and finally got me to write it.

The author, Phil Whelan, posits a commonly-held false dichotomy: that you can come to DevOps via a developer-driven path, or an ops-driven path (and argues that the former is more satisfactory). There is (at least) one more way: a collaborative effort from both developers and operations engineers. What highlights the error is that he uses Etsy as an example of this developer-driven path to DevOps:

There are two roads to DevOps. The first is from the birth of hot new startups like Netflix and Etsy. This is generally a group of developers who have no desire to hire an Operations person. There are enough good open-source tools available and infrastructure services like Amazon Web Services are easy to consume. They have the know-how to build their infrastructure themselves, but they do not want the burden of operational maintenance, so they automate everything they can. They also build efficient test-driven pipelines that run all the way from development to production.

Etsy didn’t start out as a DevOps organisation. When I joined in 2009 we certainly had an operations person (we had a number of them, and they were great). We were (and still are) using our own hardware, not AWS.

Development was a sort of waterfallish agiley type situation. And it wasn’t working out all that well. Hiring John Allspaw (and later Kellan Elliott-McCrea) brought in a lot of hard-won knowledge about development and web operations from Flickr. Operations engineers and developers both embraced the idea that we could do better together, and we began, slowly, to transform how we did things. But at no point did we do away with Ops. (In fact, the post by Adrian Cockcroft that Whelan links to in his post is notably rebutted by John.)

Patrick Debois challenged me at DevOps Days Minneapolis to answer whether we can say “you’re doing it wrong” about DevOps, whether there is a True DevOps Path from which the benighted have fallen. I was, in the end, obliged to say no. I think there are core tenets of the DevOps philosophy (such as empathy, as described by Katherine Daniels) but multiple ways to solve the problems. But the nearest I can come to “you’re doing it wrong” is when people announce the death of Ops as a job function.

It’s true that in very small teams, a good operations engineer may be missing, and the developers may well do just fine with the operational aspects of the work. But when you come to scale your work, operations engineers are worth their weight in gold.

Heinlein may have said that specialization is for insects but in truth it’s very, very hard to excel in both the development and operational aspects of building and running a large computer system. These people do exist, but trying to find them and hire them is hard (and expensive). Moreover, beyond a certain (small) scale it’s actually woefully inefficient to have everyone doing everything. (I know the argument goes that AWS et al. are making it unnecessary to do traditional Ops but even without racking your own servers you’re still managing and tuning machines and traditional sysadmin skills are necessary.)

Professional developers and professional operations engineers complement each other, and in a DevOps environment as Etsy does it they’re working together side-by-side to achieve shared goals. I’ve tended to prefer the term “Dev/Ops cooperation” rather than “DevOps” (something Etsy’s then-CTO, now-CEO Chad Dickerson wrote about back in 2010 as we started on our own path to DevOps).

Devs do do some of the work that Ops would traditionally do, and Ops do ship code. Typically engineers exist on a spectrum between pure dev and pure ops—there are some who overlap both worlds, but they’re not the type we aim to hire exclusively. (And they’re also not their own special privileged “DevOps” team.)

But, fundamentally, “doing away with Ops” seems to me like madness. We gain so much from the professionalism, skills, and dedication of our Ops engineers. Working with them is one of the great privileges of working at Etsy.

How I write talks

Mon, 25 Aug 2014 00:00:00 -0500

I recently delivered a talk called Fallible Humans at DevOps Days Minneapolis, which was well received. Although I’ve spoken in front of audiences before—internally at Etsy, at local meetups, student groups, etc.—this was my first “proper” conference delivering a prepared talk.

In one of the afternoon Open Spaces, we discussed the topic of “Public Speaking For Beginners”, and I discussed my approach to writing, preparing, and delivering my talk and I thought it might be useful to document it here too.

In the past I’ve always been a “slides first” writer: I’ve had an idea of the story to tell, and my work begins by opening up Keynote and adding slides around which I wrap words. However, last year I ran an internal Ignite night at Etsy and found that this approach failed horribly for the Ignite format. Because you’re so constrained by the fixed duration of each slide, if you write your slides first you are compelled to talk (or have awkward silences) to suit each slide and if you make the wrong decisions it pulls the pacing of the talk awry.

As a result, I’ve been trying to be a “prose first” writer, and this worked very well for my DevOps Days talk. Essentially, I began with an essay. (For Ignites, I suggested that speakers begin with an anecdote such as you might tell at a party.) What this allowed me to do was to concentrate on telling a story and expressing my ideas without the constraints of slides and transitions. What was the message I actually wanted to get across? What was the best way to tell the story? Were there any particular turns of phrase that were pleasing or memorable?

As I look back, I realise that I had subconsciously chosen to use the structure of the three-act play: exposition, development, resolution. I wouldn’t necessarily advise forcing yourself to use that structure, but talks and presentations do have performance and dramatic elements to them, and the parallels to plays and storytelling are useful things to consider. In my case, it ended up boiling down to “What is blame?”, “Why do we blame?”, and “How are we going to make things better?”.

The talk didn’t start to gel in my head until I had a “hook”—a particular, catchy idea to get things started. I had casually used the word “scapegoat” in my talk’s subtitle when I proposed it, but one morning I woke up thinking “talk about the history of the word ‘scapegoat’ first”, and then I had something to get the ball rolling. Reading up on the history of the word then made me think “it sucked to be a goat” and there, really, was the hook: we can empathise with the goat, draw an immediate parallel to our own situation, and it can get a laugh. With the hook, a little bit of historical research, the background reading I’d already done, and the stage-setting I’d done in my proposal, the actual words for the first act flowed fairly freely.

I then hit a wall, not having consciously expressed my intention to do a “what, why, how” approach. I knew I’d run out of things to talk about but didn’t know where to go next. (The next talk I do will probably involve more conscious structuring….)

I did know that I wanted my talk to be practical. Particularly with the genre of “technical culture talks” it’s very easy to talk about philosophy and generalities without giving concrete advice on what to do about it, leaving the audience feeling a little adrift. It made intuitive sense that this should go towards the end: the end of your talk should leave people wanting to stand up, head to work, and get cracking. I decided that if I had the beginning sorted, I could work on the end, and see what (if anything) was missing in the middle.

When dealing with technical incident reviews, we refer to SMART criteria—Specific, Measurable, Achievable, Relevant, Time-bound—when coming up with remediation items. These can be a useful tool for reviewing advice you may give in a talk too (although “time-bound” depends rather on how the audience might take up your advice). I tried to think of just a few basic principles, since I only had half an hour in total for the talk, and there’s only so much people can take in at once before they start to tune out everything you say. Each principle then got some explanation, ideally with a concrete example or reference to back it up.

Each time I had a reasonable chunk of text written but had hit a wall, I’d do a run through with a stopwatch to see how I was doing for time. This both gave me an idea of how much room I had left and helped me get a feel for how what I’d written actually flowed in practice and what needed adjusting. Usually the next thing I needed to say would come to me as I reached the gap I’d left.

In hindsight, obsessing over time was a mistake. I think it’s generally better to complete your draft before timing it, so that you’re not tempted to edit before you’re done if you go over time. You’ll have a better sense of what can be trimmed when you have everything in front of you. I was lucky that I hit 28 minutes of good talking on the first draft and didn’t need to cut anything.

Once I was happy with the talk in essay form, I opened Keynote and began to work through my slides. I generally stick to the idea that you should have as few words as possible on each slide. A transition to a new slide is a moment for the audience to read, orient themselves for the next segment, and then return to listening to what I’m saying. (As an added advantage, in the age of live-tweeting talks, short phrases can also help provide ready-made tweets that fit nicely within the 140-character limit. If you want to go all-in on the Twitter thing, you can follow the example of Lara Swanson and auto-tweet from your presentation.) I’ll sometimes add sub-headings or highlight pithy sentences when writing the initial essay if they seem to suggest good slide content; beyond that it’s just a case of seeing where the natural breaks, pauses, and subject shifts occur.

I don’t read off the slide, although the words there may well be echoed in what I’m saying. The main time I violate that rule is for quotes. Good quotes and citations can anchor sections of the talk, and the words are usually important. Having them on the slide, and reading them as part of the talk, reinforces their importance (and allows you to stress particular parts if you wish). The trick is to only transition to the slide at the point where you’re ready to read it, so that you’re reading with the audience. If you transition too early, they’ll be reading (and not listening) while you’re talking about something else, then bored when you read it to them.

I find reading prepared text verbatim, either from note cards in hand or from notes on a slide, results in my delivery becoming stilted and awkward. For each slide, I took the relevant text from the essay and broke it into a few bullet points. These went into the speaker notes for the slide to act as an aide-mémoire. I didn’t try to memorise the essay word-for-word (since that doesn’t leave much room for flexibility) but rather to know the material well enough that I could talk fluently and be sure to cover all the necessary information. With the speaker notes, I could relax knowing I’d still have pointers to remind me of the salient points I wanted to cover to which I could refer if I suddenly got lost mid-slide.

Slides and notes done, I did a complete run through with the speaker notes visible and my essay next to them so I could refer to it if (well, when) I forgot what I intended to say. With that experience, I could adjust the notes and slides and do a run through without the essay to guide me. Further runs were a matter of satisfying myself that any time I went astray, I could recover. (As with engineering, it’s all about minimising the cost of failure….)

Once I felt fairly fluent, I recorded a demo version. This allowed me to export a video of it, which I could then watch and review. (For me, it takes some effort to listen to myself talk, but it’s very valuable for improving my delivery.) I could also upload it to Dropbox and share with people I hoped would review it. (One of the many advantages of working at Etsy is having a large number of colleagues who are experienced presenters, have a deep knowledge of your subject, or quite often both.) I got some useful feedback, updated the slides, and called it good.

My original plan, having about four or five weeks between having my talk accepted and the actual conference, was to write early and have plenty of time to rehearse and gain confidence. It turns out that words will only come out of my brain and take written form when squeezed out by the pressure of a looming deadline, so I ended up finishing with about two days left. This at least meant that the material was fresh in my mind.

On the first day of the conference, I was fortunate to have lunch (and a very engaging discussion) with Patrick Debois. His advice, when we discussed my talk for the next day, was to chat to people at the conference and be willing and able to tweak my content if interesting ideas emerged. As it happened, someone proposed an Open Space that afternoon on blameless postmortems, and I was able to get some useful ideas from there which I could incorporate.

The talk itself was something of a blur. Some tips, though:

Turn off notifications. There’s nothing quite like an IM notification popping up on screen in the middle of your talk.
Do without wifi. Even better than turning off notifications is blocking out the outside world entirely. Given the famously patchy nature of conference wifi networks, and the vagaries of tethered mobile services, where possible it’s best to have your material on your computer, and be able to perform without subjecting yourself to these extra sources of failure.
Clean up your desktop. I recall one speaker at a different conference accidentally revealing some secret clients because their names appeared in file names that were visible once his presentation was closed.
Use a remote. I use a Targus remote which has served me well. It fits nicely in my hand and having the buttons at my fingertips allows me to let the slide transitions flow smoothly without large, overt physical gestures (like leaning forward to press the spacebar). Since I tend to pace as I talk, this frees me up considerably. It is possible to use a smartphone as a remote, but the one time I tried that the wifi network failed and I was stuck; I’ve used a dedicated device since. I also find my iPhone doesn’t feel as comfortable and forgettable in my hand as my remote.
Try to avoid pacing. I made the AV crew’s job much harder as they had to do a lot of panning on the close shot for the live stream.
Have a bottle of water. Not only will you probably find your mouth gets dry as you talk, it provides a useful cover for pauses when you suddenly realise you can’t recall what you were going to say next. And if worst comes to worst, you can choke on the water and get yourself carried off stage.
At the end, try to remain sufficiently alert to remember to remove your mic. The event’s AV technicians may well collar you and do this for you, but I did manage to forget and be talking to someone after my talk while my mic was switched on. Not quite as bad as going to the toilet while broadcasting, but best avoided.
Feel free to hide afterwards. Most of the people I’ve spoken to who have done talks say that all they want to do after being on stage is to go somewhere quiet and not see or speak to anyone. There’s plenty of time for people to talk to you once you’ve had a chance to recover.

Talking at DevOps Days Minneapolis was a great experience (as was the entire conference) and I’m definitely looking forward to applying what I’ve learned so far to the next talk I get to give.

Fallible Humans

Sun, 20 Jul 2014 00:00:00 -0500

This is a transcript of a talk I gave at DevOps Days Minneapolis on 18 July 2014. It’s transcribed from memory, but should be a close approximation to what I said at the time.

I’ve been a software engineer at Etsy for almost five years now. What that has meant is that I’ve been lucky to have been involved in our transformation into a DevOps culture from the start. One of the aspects of this that has particularly engaged me has been our attitude to, and philosophy about, blame.

But first, some history.

In ancient Israel, each year the priests would take two goats. They would ritually place the sins of the Israelites onto these goats. One goat would be taken into the temple and sacrificed. The other—the scapegoat—would be driven into the wilderness to die.

Whether this was an effective process is a question for theologians, but we can say with certainty that it sucked to be a goat.

The ancient Greeks, never ones to do things by halves, would respond to calamities such as plague, famine, or invasion by selecting a person, usually a cripple, or a beggar, or a criminal. This person—the pharmakos—would be beaten, stoned, and driven from the city. Modern science suggests this didn’t work very well.

But this idea that we can achieve cleansing, that we can make our situation better, by the expulsion of some individual is remarkably persistent, and the word “scapegoat” is deeply entrenched in our language, and our thinking.

What we can observe in those historical examples is that in both cases the victim is what we might term an “other”, an outsider: a cripple, a criminal, even an animal. We still tend to seek to blame those different from ourselves and our groups today.

In the usual software engineering environment, others can be found in our functional silos: dev here, ops there, marketing over there, legal last seen two weeks ago talking to some people in suits. Ops didn’t keep to their SLA. Dev made unreasonable demands two days before launch. Legal made us change everything halfway through building the product. More insidiously, we find others in our unconscious biases—things like gender, or educational background.

With blame comes fear: fear of loss of status amongst our peers, fear of never getting promoted or even losing our job, fear of never getting to work on the interesting projects, the risky projects.

This isn’t healthy. It’s also not effective. We don’t learn from our mistakes this way. The Greeks kept succumbing to plagues because it turns out that beating up a beggar is less effective at preventing disease than coming up with a functional sanitation system.

In a DevOps environment, we’re actively working to eliminate these functional silos, and as we increase collaboration throughout the product development lifecycle, we remove these convenient “others”. So while all of this is applicable to any organisation, I’d argue that it’s absolutely vital for a DevOps organisation.

So. Why do we blame?

With the unknown, one is confronted with danger, discomfort, and care; the first instinct is to abolish these painful states. First principle: any explanation is better than none.

— Friedrich Nietzsche, The Twilight Of The Idols

The first explanation in most cases is “human error”. Look at news articles or investigation reports about accidents and the words come up time and again. In software engineering, our dominant narrative is that computers are understandable, deterministic, and safe. Humans interfere. Humans do the wrong thing at the wrong time. Humans are fallible.

The psychologist B. F. Skinner was the inventor of the Operant Conditioning Chamber, or Skinner Box. This device allowed experimenters to give a rat a treat if it did something they wanted, or an electric shock if it did something they didn’t. The learnings from these experiments have been widely adopted, from Facebook games to engineering management [kidding about the managers, I promise].

A person who has been punished is not thereby simply less inclined to behave in a given way; at best, he learns how to avoid punishment.

— B. F. Skinner, Beyond Freedom and Dignity

What Skinner observed, though, was that punishment merely taught the rats to avoid punishment. Blame doesn’t teach us to avoid failure; blame merely teaches us to avoid being blamed.

Because failure happens anyway: it’s an emergent property of complex systems. In a complex system, we simply can’t know all of the interactions that may arise, or all of the ways in which it might fail. And we are working with complex systems: from the complex social systems of our companies, to the software we write and all the ways the parts of that software interact. From the servers we run on, to the networks our data travels across. From the client devices and browsers a web app is served to, to the complex, unpredictable humans who use it. It’s complex systems all the way down, and these fail.

So if failure happens, all we’re teaching people to do is to sweep failure under the carpet, to hide mistakes and direct blame elsewhere. What we’re not doing is learning.

And we should be learning! Failures are windows into complex systems. They allow us to understand interactions that were hidden to us before. They’re opportunities to make our systems safer. (Not safe, alas, but at least we can try to get some of the way there.)

Here’s the thing: we don’t come to work to do a bad job. You don’t arrive at your desk on a Monday morning and say “today, I shall bring the site down”. If you do, your organisation has bigger problems than any incident you may cause—problems with hiring, retention, and culture, and these should be addressed. But assuming we really do come to work to do a good job, why do we still make mistakes? Why are humans still fallible?

In reality, what we are labouring under are multiple, conflicting constraints. Get the feature out on time, without bugs, and without using too many resources. Monitor everything, alert in good time, but don’t cause alert fatigue. It turns out that what humans are really, really good at is balancing these conflicting constraints and in doing so they allow systems to operate more safely.

How we do this—day in, day out—is by choosing between efficiency and thoroughness: what Erik Hollnagel terms the Efficiency-Thoroughness Trade-Off (or ETTO) Principle. We can be efficient—we can get lots of stuff done—but we can’t be thorough and dive deep into every issue. We can be thorough and really dig into the task at hand and understand it well but this takes time: it is inefficient. More commonly, rather than choosing one extreme or another, we’re deciding, through experience and learned heuristics, where on this spectrum of efficiency and thoroughness we need to operate for any given system at any given time, based on our understanding of the data available and the constraints we’re under.

You’d think, from a systems safety point of view, we’d really want to operate on the thoroughness side of things, to really understand what’s going on. And yet this brings its own problems.

If I’d observed all the rules, I’d never have got anywhere.

— Marilyn Monroe

One of the most effective forms of industrial action isn’t a strike, but a “work to rule”. Follow every rule, every regulation, to the letter, every time. The organisation rapidly grinds to a halt because it actually relies on people knowing what to skip, where to trade thoroughness for efficiency. And an inability to react to changing circumstances rapidly can lead to failure too.

When you make a change to some code, do you have everyone in the company check your diffs? When you deploy, do you check every graph of every metric you collect? Or do you commit, skip the tests, push to production, shout “YOLO!” and go for lunch?

In reality, what you do depends strongly on what you’re working on. If you’re doing some complex brain surgery on your master database you will likely tend towards thoroughness: lots of consultation and discussion, game days, contingency planning, and when the time comes to do the work, everyone is at their desk and knows what to do. Whereas if you’re pushing out a small CSS change, you’ll eyeball it, run the tests, deploy, eyeball it in production, and call it a day.

But, because these are complex systems, it is likely that there is some unknown interaction lurking in the database work you’ve planned, some hidden source of failure. Maybe you’ll get lucky, maybe you won’t. And I can tell you that a colleague of mine once took Etsy down entirely using only a CSS change.

As such, the fallibility of humans is not that we willfully do the wrong thing. Rather, we do what makes sense to us at the time. We do the thing that we believe will give us the result we want, based on our understanding of the system when we make our decision. And sometimes we’re wrong.

Given that formulation, blame becomes ludicrous. Instead we find ourselves asking questions like “Why?”. Why did what you did make sense at the time? Why did you think it would yield a good result? What was your understanding of the system? What constraints were you trying to satisfy?

You won’t get good answers to these questions if people are scared. Instead, you move to a place of trust. You hire well, you train well, and you trust that people are doing the right thing. And if things fail, you understand that this happens and you learn from it.

Blamelessness is not just about being nice (although that’s a pleasant side effect). It’s enabling you to get to an understanding of a problem that is impossible without it.

But, how do you go about it?

The first and most important part is to stop blaming yourself. This may sound all very “self help guru”, but it’s important. When we experience failure, when we get that “ice water down the spine” feeling as we see things going horribly wrong, we experience that Nietzschean anxiety and the first thing we come to is “I’m a failure, I don’t know what I’m doing, I suck”. That way lies Impostor Syndrome and misery, and we recoil from it and shut down that introspection, and we shut down our ability to learn.

Overcoming this is hard, but it’s very valuable. The great thing is that it works even if you’re in an organisation that hasn’t embraced blamelessness. If you learn one thing from this post, it should be to ask yourself “why did what I did make sense to me at the time?” and to ask that often.

But it’s easier to be blameless in a group. A supportive group of peers can help you avoid and overcome that tendency to self-blame, and provide both extra perspective and deeper lines of questioning to help you understand and learn from the failure. Another benefit of group discussions is that it reinforces the notion that blamelessness is a Thing That We Do. It’s not just something we heard about at a conference that sounded good, but actually part and parcel of our daily work.

With groups, it is beneficial to cast the net widely. At Etsy, reviews are open to all. Anyone from any part of the company can attend and participate, broadening the perspectives available to us. For example, we’ve even had our Chief Financial Officer come to reviews, particularly those involving money. What she provides is not only her intelligence and insight, but also the perspective of the Finance department. What are Finance’s goals and constraints? How do they align with Engineering’s constraints? How do they differ and conflict? We get a better, more holistic picture of the incident as it applies to the entire organisation, rather than just within Engineering’s narrow view. We learn better.

Make a habit of doing blameless reviews. You don’t need to wait for failure to strike. You can review successes: what went right? Or near misses: what did we do to save the day? You can review things other than engineering incidents. How did your hiring process work? Did your PR campaign do what you expected? This habit of review both embeds the philosophy of blamelessness and hones your skills for looking at incidents and learning from them.

While I said that overcoming self-blame is hard, overcoming hindsight is even harder. Hindsight is a fundamental part of how humans look at the past and understand what happened. The problem is that, as the saying goes, hindsight is 20/20. We look back and see that the catastrophe that befell us was inevitable. The steps that led us there are clear as day and those who didn’t see it at the time are apparently idiots.

We find ourselves with counterfactuals: statements and reasoning about things that didn’t happen. More simply, the “coulda, woulda, shouldas”.

“He could have talked to Jennifer in Ops and she would have told him that the change would kill the database.”

“A good engineer would have tested this more thoroughly.”

Statements like these aren’t helpful. Why did he not talk to Jennifer? Did he not know to? Did he talk to someone else who said it would be fine? Had he made these changes a hundred times before without incident? Would it have been fine had it not been for some other operation that was also loading the databases at the time?

We see phrases like “a good” engineer all the time too, and they are incredibly blameful. “A good engineer would have tested this more thoroughly. You didn’t test this more thoroughly. You are a bad engineer.” But in reality, what we’re doing is making a judgement about an efficiency-thoroughness trade-off that this engineer made. With hindsight we see that it was wrong, but why was that not clear at the time? And bear in mind that a different decision at the time might have led to us sitting in a review asking “why did the feature not go out on time and why did you use so many QA resources?”.

Typically, in a review we find ourselves seeking a root cause. We are conditioned to think of failure as “Y happened because X” and we want to know what X was. But in a complex system failure, this rarely happens. Instead of a simple cause-and-effect process, we have multiple failures, each necessary to the broader calamity but not individually sufficient to cause it. Picking one of these and labeling it the root cause is false and misleading.

Likewise, we look back with hindsight and construct a nice, straight, linear sequence of events that we believe describes the failure: X then Y then Z then bang. But in reality, we’re dealing with that wibbly-wobbly ball of timey-wimey stuff that is complex system failure. Multiple failures all mixed together in a big jumble, not neat and tidy and certainly not linear as observed and experienced by those involved at the time.

As such, root causes are constructed, not found. They are merely the place where you stop digging and say “this’ll do”. Remember that incident reviews are also subject to efficiency-thoroughness trade-offs.

Out of this, the idea is to learn. How you learn depends very much on your organisation, those involved, and the nature of the failure at hand, but typically you end up with a bunch of remediation items—things you intend to do to make sure that this never happens again. But try very hard to seek good remediation items. Review every one critically and ask what hazards it may cause, and what costs. It could be that the cost of remediation (in time, complexity, increased risk elsewhere in the system, etc.) exceeds the cost of the failure you’re trying to avoid. It could be that a failure is just the cost of doing business, and should it crop up again you’ll just deal with it.

Similarly, a common response is “we’ll implement a process”. A process such that no-one can ever stray from the correct path again. This, too, is problematic.

Process is an embedded reaction to prior stupidity.

— Clay Shirky, Wikis, Grafitti, and Process

Shirky’s adage is pithy, yes, but also rather blameful: “stupidity” isn’t something we want to see in reviews. But, his point is well taken. Processes are the scar tissue that builds up from your failures. Left unchecked, the growth of this tissue slows you down, and you find yourself less and less able to move and respond to problems. Every process you implement is one more thing for those fallible humans to make efficiency-thoroughness trade-off decisions on: follow the process and be thorough, or skip it and get stuff done (in which case you have learned precisely nothing and gained no increase in safety at all).

Once you’ve worked out what you’ve learned, it is necessary to communicate it. At Etsy, as a blameless organisation, what we observe is that those closest to the incident, who would normally be the most likely to be blamed, are actually leading the charge in review, and taking ownership of remediation afterwards. We get emails to the company saying “this is what I did, this is why it was bad, this is how the site went down, this is what we’ve learned and what we’re going to do in the future”: the very opposite of what you’d expect in a blameful environment. In fact, this behaviour is now expected: the norm, not the exception.

Now, this is harder in a culture where blamelessness isn’t 100% adopted, but the best way to promote the philosophy is to demonstrate its results. Managers have a huge role to play here, both in fostering blamelessness within their teams, and in advocating for it amongst their peers and their own managers.

As with all parts of DevOps culture, there is no “royal road” to blamelessness. I can’t give you a Docker image for blamelessness. You can’t Chef blamelessness out to your colleagues however much you might wish to do so. Despite all the advice here I don’t have a simple list of steps to take that will result in a blameless company. Instead, it’s an evolution: an evolution of your organisation and of the people within it. For those of you who want to make a start, I’ve compiled a reading list to help you.

Embracing blamelessness will make your organisation a happier, healthier, safer place, and as you set to work on it, I wish you the very best of luck.

An accidental Argentinian

Sat, 28 Jun 2014 00:00:00 -0500

So, sometimes I’m mistaken for an Argentinian government agency.

I know what you’re thinking: “Ian, it could happen to anyone”. Perhaps you’re right.

For years now I’ve been “indec” in most online forums (to the extent that I sometimes find myself signing up for new services purely to reserve the nick—it’s a painful moment when I can’t get my identity of choice later on). I bought indecorous.com at the point where dictionary word dot-com names were getting harder to come by, and was happy with it. But the first IRC server I logged on to (irc.perl.org, if memory serves) had an eight character nick limit, and so I truncated to indec. So it goes.

When I joined Twitter, of course I chose @indec. I’d get the usual spam about buying followers and suchlike, but then occasionally I’d find someone would “RT” something I hadn’t actually tweeted, usually in Spanish. It was a little odd and perplexing but I gave it no particular mind because, well, Twitter.

Eventually, though, my curiosity got the better of me and I replied back to one of these RTers, asking what was going on. It turns out that in Argentina there exists the Instituto Nacional de Estadística y Censos, or INDEC. This is the agency responsible for collecting and processing statistical data for Argentina—inflation rates and so forth. What makes it interesting is that it is commonly assumed that all of the statistics it produces are wrong and overly optimistic to suit the whims of the current political powers.

RT @indec: #ARG 7 - #NIG 1
— Jorge Lanata PPT (@Lanataenel13) June 25, 2014

It’s a fascinating glimpse into a different political culture, but it doesn’t half mess up my notifications….

A/B testing complexity

Mon, 16 Jun 2014 00:00:00 -0500

We tend to think of A/B testing as a tool for testing the validity of product decisions, or for empirically determining them. Should this button be blue, or red? Does moving more pictures above the fold increase conversion? These are good and useful questions that A/B testing can help answer.

However, A/B tests are typically being carried out within the bounds of a complex system. For web sites, that may include the programming language and its underlying libraries, templating systems, web servers, databases, networks, client devices, web browsers, operating systems, and those lovely agents of chaos: human users. The thing about complex systems is that they give rise to completely unexpected interactions (and, unfortunately, failure modes). The good news is that your A/B test is here to help.

The example that got me thinking about this is fairly simple. At Etsy, we (unsurprisingly) want the mobile web experience to be as smooth as possible. One area of concern is the difficulty people have typing in passwords on (typically virtual) mobile keyboards. Usually as you type, you’re shown the last letter you typed in, which then turns to an asterisk or other such symbol as you move on. Anyone who has tried to sign in on a mobile device has probably experienced the frustration of mis-typing that arises from the process. The aim of hiding the password is to prevent someone from looking over your shoulder at your screen and memorising it, but typically a mobile device (being smaller and held closer to the user) is harder to snoop. Our hypothesis was that members signing in on mobile might prefer to have their password shown in clear text so that they could get it right first time.

The experiment had three groups: the control (hidden as usual), one where the password was shown on the screen in clear text with an option to hide it if the member wished, and a third where it was hidden with the option to reveal it. We could measure the rate of login failure, and also the frequency with which members in the two experimental groups chose to toggle the feature on or off. A solid hypothesis, a fairly classic experimental setup. So far so good.
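As a rough illustration of that setup (not Etsy's actual experiment framework), here is a minimal Python sketch of assigning members to three groups and computing a login failure rate for each. The variant names, the experiment key, and the hashing scheme are all assumptions made for the example.

```python
import hashlib
from collections import Counter

# Hypothetical names for the three groups described above.
VARIANTS = ("control_hidden", "shown_with_hide_option", "hidden_with_show_option")


def assign_variant(member_id: str, experiment: str = "mobile_password_visibility") -> str:
    """Deterministically map a member to one of the three groups.

    Hashing the experiment name together with the member id means the same
    member always lands in the same group for this experiment.
    """
    digest = hashlib.sha256(f"{experiment}:{member_id}".encode("utf-8")).hexdigest()
    return VARIANTS[int(digest, 16) % len(VARIANTS)]


def failure_rates(attempts):
    """Turn (variant, succeeded) pairs into a {variant: failure rate} mapping."""
    totals, failures = Counter(), Counter()
    for variant, succeeded in attempts:
        totals[variant] += 1
        if not succeeded:
            failures[variant] += 1
    return {variant: failures[variant] / totals[variant] for variant in totals}
```

The toggle frequency mentioned above could be tallied the same way, by recording a (variant, toggled) pair for each sign-in.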

The results were awful.

Login failures went up dramatically in the group with visible passwords, completely contrary to our expectations. We weren’t sure if the change would make things better, but we were unprepared for it making things worse. Why was our product sense, on the face of it entirely reasonable, so wrong?

The answer, after much head-scratching, was that it wasn’t. Instead, it was a confounding factor from a completely different part of the complex system: clear text inputs on most mobile OSes have autocorrect.
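This post doesn't cover how the problem was eventually addressed, but one common mitigation is to switch off the keyboard's text assistance whenever a password is shown in a plain text field. The sketch below is an assumption-laden illustration, not Etsy's code: `password_input` is a hypothetical helper, `autocapitalize` and `spellcheck` are standard HTML attributes, and `autocorrect` was for a long time a Safari-specific extension, so support will vary between browsers.

```python
from html import escape


def password_input(name: str, visible: bool) -> str:
    """Render an <input> tag for a password field (hypothetical helper)."""
    attrs = {
        "type": "text" if visible else "password",
        "name": name,
        "autocomplete": "current-password",
    }
    if visible:
        # A plain text field picks up the keyboard's usual "helpfulness",
        # so ask the browser to turn it off while the password is shown.
        attrs.update({
            "autocorrect": "off",
            "autocapitalize": "off",
            "spellcheck": "false",
        })
    rendered = " ".join(
        f'{key}="{escape(value, quote=True)}"' for key, value in attrs.items()
    )
    return f"<input {rendered}>"
```

Even with attributes like these, it would be worth re-running the experiment rather than assuming every mobile keyboard honours them.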

Had we not taken the time to do the experiment, we would have had no way to know that this failure mode was occurring. Chances are it would have made only a slight change in the overall rate of login failures, probably not something we would have noticed in our day-to-day scanning of our dashboards. The problem would probably have persisted until eventually some frustrated member complained about it on the forums, by which time the damage would have been all the greater.
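To see why an experiment surfaces this sort of thing when a dashboard might not, it helps to compare the groups directly. The following is a minimal sketch of a two-proportion z-test using only Python's standard library. It is a textbook test rather than a description of Etsy's analysis tooling, and any counts fed to it here would be invented for illustration.

```python
import math


def two_proportion_z_test(failures_a: int, total_a: int,
                          failures_b: int, total_b: int):
    """Compare two failure rates; return (z, two-sided p-value).

    Group A is the control and group B the variant, so a positive z means
    the variant failed more often than the control.
    """
    rate_a = failures_a / total_a
    rate_b = failures_b / total_b
    pooled = (failures_a + failures_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        raise ValueError("no variation in outcomes; the test is undefined")
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

Calling `two_proportion_z_test(500, 10_000, 650, 10_000)`, for instance, compares a made-up 5% failure rate in the control against 6.5% in a variant. The same extra failures, spread across all sign-ins rather than isolated in one group, would be much harder to pick out.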

So, when planning your A/B tests, always be aware that they may end up telling you about something quite unexpected, and entirely unrelated to the question you intended to ask.

Random acts of leadership

Fri, 13 Jun 2014 00:00:00 -0500

One of the things I like best about working for Etsy is how easy it is to enthuse about the people I work with. There are various anecdotes I can tell about “business as unusual”, but one of my favourites came from when I was only a few months into my new job.

I work remotely, which means I travel to New York every so often to see everyone and get some face-to-face time with my colleagues. My second trip came just before Christmas of 2009 (always try to get to the Etsy Christmas party), and my return flight happened to coincide with a large snowstorm. Flights out of all New York airports were hit, but my airport—Newark—was shut down completely, and the airline was telling me the next flight I could get would be two days later.

All else being equal, there are worse places to get an extra two days than New York, but my wife was home alone, coping with an 8-month-old baby who was teething and doing her best to avoid sleep at all costs. The only thing keeping her going was the knowledge that I would be home soon. There was a note of hysteria in her voice when I related the news.

As I scrambled to work out how to get home to rescue my wife, Chad Dickerson (then Etsy’s CTO, now CEO) encountered me looking fraught and asked what was wrong. When I explained, he just asked, “How can I help?” I said I’d found a flight out that evening on a different airline. “Book it. Here’s my card.”

I think those who know Chad better than I did then wouldn’t be terribly surprised by this, but right then it was entirely unexpected. One quick, no-questions-asked “book it” took me from frazzled panic to calm relief in an instant (at least once the booking had gone through). Moreover, it demonstrated to me and my wife that our welfare mattered to Chad and to Etsy. Understanding that an employee’s family’s happiness is also important is significant. After that, future trips always came with the knowledge that Etsy had our back.

It was a small but powerful demonstration of a leadership philosophy that would naturally lead to Etsy becoming a certified B Corp. People matter. Their happiness and welfare matter. Good leaders know this and act on it.