Screaming in the Cloud
Building a Partnership with Your Cloud Provider with Micheal Benedict
Episode Summary
Micheal Benedict, who leads Engineering Productivity at Pinterest, is here to keep things varied by not just talking to folks who work at a cloud provider. Sometimes it’s good to see the cloud from another angle, and given Pinterest’s mind-boggling commitment to AWS, Micheal is here to give us just that. Corey and Micheal talk about how Pinterest has been on the cloud since the get-go, and their current, and developing, partnership with AWS. With an emphasis on the “partner” aspect of their working relationship with AWS, Micheal is here to tell us how these two massive entities are building a strong connection. They talk about what Micheal’s team is up to, how Pinterest is talking about multi-cloud, and more!
Episode Show Notes and Transcript
About Micheal 
Micheal Benedict leads Engineering Productivity at Pinterest. He and his team focus on developer experience, building tools and platforms for over a thousand engineers to effectively code, build, deploy and operate workloads on the cloud. Mr. Benedict has also built Infrastructure and Cloud Governance programs at Pinterest and previously, at Twitter -- focussed on managing cloud vendor relationships, infrastructure budget management, cloud migration, capacity forecasting and planning and cloud cost attribution (chargeback). 



Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.


Corey: You know how Git works, right?


Announcer: Sorta, kinda, not really. Please ask someone else!


Corey: That’s all of us. Git is how we build things, and Netlify is one of the best ways I’ve found to build those things quickly for the web. Netlify’s Git-based workflows mean you don’t have to play slap and tickle with integrating arcane nonsense and webhooks, which are themselves about as well understood as Git. Give them a try and see what folks ranging from my fake Twitter for Pets startup to Global Fortune 2000 companies are raving about. If you end up talking to them (because you don’t have to; they get why self-service is important), be sure to tell them that I sent you and watch all of the blood drain from their faces instantly. You can find them in the AWS Marketplace or at www.netlify.com. N-E-T-L-I-F-Y.com.


Corey: This episode is sponsored in part by our friends at Vultr. Spelled V-U-L-T-R because they’re all about helping save money, including on things like, you know, vowels. So, what they do is they are a cloud provider that provides surprisingly high-performance cloud compute at a price that they claim is better than AWS pricing, and when they say that, they mean it is less money. Sure, I don’t dispute that, but what I find interesting is that it’s predictable. They tell you in advance on a monthly basis what it’s going to cost. They have a bunch of advanced networking features. They have nineteen global locations and scale things elastically. Not to be confused with openly, because apparently elastic and open can mean the same thing sometimes. They have had over a million users. Deployments take less than sixty seconds across twelve pre-selected operating systems. Or, if you’re one of those nutters like me, you can bring your own ISO and install basically any operating system you want. Starting with pricing as low as $2.50 a month for Vultr cloud compute, they have plans for developers and businesses of all sizes, except maybe Amazon, who stubbornly insists on having something to scale all on their own. Try Vultr today for free by visiting vultr.com/screaming, and you’ll receive $100 in credit. That’s v-u-l-t-r.com slash screaming.


Corey: Welcome to Screaming in the Cloud. I’m Corey Quinn. Every once in a while, I like to talk to people who work at very large companies that are not in fact themselves a cloud provider. I know it sounds ridiculous. How can you possibly be a big company and not make money by selling managed NAT gateways to an unsuspecting public? But I’m told it can be done. Here to answer that question, and hopefully at least one other, is Pinterest’s head of engineering productivity, Micheal Benedict. Micheal, thank you for taking the time to join me today.


Micheal: Hi, Corey, thank you for inviting me today. I’m really excited to talk to you.


Corey: So, exciting times at Pinterest in a bunch of different ways. It was recently reported—which of course, went right to the top of my inbox as 500,000 people on Twitter all said, “Hey, this sounds like a ‘Corey would be interested in it’ thing.” It was announced that you folks had signed a $3.2 billion commitment with AWS stretching until 2028. Now, if this is like any other large-scale AWS contract commitment deal that has been made public, you were probably immediately inundated with a whole bunch of people who are very good at arithmetic and not very good at business context saying, “$3.2 billion? You could build massive data centers for that. Why would anyone do this?” And it’s tiresome, and that’s the world in which we live. But I’m guessing you heard at least a little bit of that from the peanut gallery.


Micheal: I did, and I always find it interesting when direct comparisons are made with the total amount that’s been committed. And like you said, there’s so many nuances that go into how to perceive that amount, and put it in context of, obviously, what Pinterest does. So, I at least want to take this opportunity to share with everyone that Pinterest has been on the cloud since day one. When Ben initially started the company, that product was launched—it was a simple Django app—it was launched on AWS from day one, and since then, it has grown to support 450-plus million MAUs over the course of the decade.


And our infrastructure has grown pretty complex. We started with a bunch of EC2 machines and persisting data in S3, and since then we have explored an array of different products, in fact, sometimes working very closely with AWS, as well and helping them put together a product roadmap for some of the items they’re working on as well. So, we have an amazing partnership with them, and part of the commitment and how we want to see these numbers is how does it unlock value for Pinterest as a business over time in terms of making us much more agile, without thinking about the nuances of the infrastructure itself. And that’s, I think, one of the best ways to really put this into context, that it’s not a single number we pay at the end [laugh] of the month, but rather, we are on track to spending a certain amount over a period of time, so this just keeps accruing or adding to that number. And we basically come out with an amazing partnership in AWS, where we have that commitment and we’re able to leverage their products and full suite of items without any hiccups.


Corey: The most interesting part of what you said is the word partner. And I think that’s the piece that gets lost an awful lot when we talk about large-scale cloud negotiations. It’s not like buying a car, where you can basically beat the crap out of the salesperson, you can act as if $400 price difference on a car is the difference between storm out of the dealership and sign the contract. Great, you don’t really have to deal with that person ever again.


In the context of a cloud provider, they run your production infrastructure, and if they have a bad day, I promise you’re going to have a bad day, too. You want to handle those negotiations in a way that is respectful of that because they are your partner, whether you want them to be or not. Now, I’m not suggesting that any cloud provider is going to hold an awkward negotiation against the customer, but at the same time, there are going to be scenarios in which you’re going to want to have strong relationships, where you’re going to need to cash in political capital to some extent, and personally, I’ve never seen stupendous value in trying to beat the crap out of a company in order to get another tenth of a percent discount on a service you barely use, just because someone decided that well, we didn’t do well in the last negotiation so we’re going to get them back this time.


That's great. What are you actually planning to do as a company? Where are you going? And the fact that you just alluded to, that you’re not just a pile of S3 and EC2 instances speaks, in many ways, to that. By moving into the differentiated service world, suddenly you’re able to do things that don’t look quite as much like building a better database and start looking a lot more like servicing your users more effectively and well.


Micheal: And I think, like you said, I feel like there’s a general skepticism, a view that the cloud providers are usually out there to rip you apart. But in reality, that’s not true. To your point, as part of the partnership, especially with AWS and Pinterest, we’ve got an amazing relationship going on, and behind the scenes, there’s a dedicated team at Pinterest, called the Infrastructure Governance Team, a cross-functional team with folks from finance, legal, engineering, product, all sitting together and working with our AWS partners (even the AWS account managers at times are part of that) to help us make both Pinterest successful, and in turn, AWS gets that amazing customer to work with in helping build some of their newer products as well. And that’s one of the most important things we have learned over time: there’s two parts to it; when you want to help improve your business agility, you want to focus not just on the bottom line numbers as they are. It’s okay to pay a premium because it offsets the people capital you would have to invest in getting there.


And that’s a very tricky way to look at math, but that’s what these teams do; they sit down and work through those specifics. And for what it’s worth, in our conversations, the AWS teams always come back with giving us very insightful data on how we’re using their systems to help us better think about how we should be pricing or looking at things ahead. And I’m not the expert on this; like I said, there’s a dedicated team sitting behind this and looking through and working through these deals, but that’s one of the important takeaways I hope the listeners of this podcast take away: you want to treat your cloud provider as your partner as much as possible. They’re not always there to screw you. That’s not their goal. And I apologize for using that term. It is important that you set that expectation that it’s in their best interest to actually make you successful because that’s how they make money as well.


Corey: It’s a long-term play. I mean, they could gouge you this quarter, and then you’re trying to evacuate as fast as possible. Well, they had a great quarter, but what’s their long-term prospect? There are two competing philosophies in the world of business; you can either make a lot of money quickly, or you can make a little bit of money and build it over time in a sustained way. And it’s clear the cloud providers are playing the long game on this because they basically have to.


Micheal: I mean, it’s inevitable at this point. I mean, look at Pinterest. It is one of those success stories. Starting as a Django app on a bunch of EC2 machines to wherever we are right now with having a three-plus billion dollar commitment over a span of a couple of years, and we do spend a pretty significant chunk of that on a yearly basis. So, in this case, I’m sure it was a great successful partnership.


And I’m hoping some of the newer companies who are building on the cloud from the get-go are thinking about it from that perspective. And one of the things I do want to call out, Corey, is that we did initially start with using the primitive services in AWS, but it became clear over time (and I’m sure you’ve heard of the term multi-cloud and all of that), you know, when companies start evaluating how to make the most out of the deals they’re negotiating or signing, it is important to acknowledge that the cost of any of those evaluations or even thinking about migrations never tends to get factored in. And we always tend to treat that as being extremely simple, but those are engineering resources you want to be spending more on building the product rather than these crazy costly migrations. So, it’s in your best interest probably to start using the most from your cloud provider, and also look for opportunities to use other cloud providers, if they provide more value in certain product offerings, rather than thinking about a complete lift-and-shift with, “I’m going to make DR the primary case for why I want to be moving to multi-cloud.”


Corey: Yeah. There’s a question, too, of the numbers on paper look radically different than the reality of this. You mentioned, Pinterest has been on AWS since the beginning, which means that even if an edict had been passed at the beginning, that, “Thou shalt never build on anything except EC2 and S3. The end. Full stop.”


And let’s say you went down that rabbit hole of, “Oh, we don’t trust their load balancers. We’re going to build our own at home. We have load balancers at home. We’ll use those.” It’s terrible, but even had you done that and restricted yourselves just to those baseline building blocks, and then decide to do a cloud migration, you’re still looking back at over a decade of experience where the app has been built unconsciously reflecting the various failure modes that AWS has, the way that it responds to API calls, the latency in how long it takes to request something versus it being available, et cetera, et cetera.


So, even moving that baseline thing to another cloud provider is not a trivial undertaking by any stretch of the imagination. But that said—because the topic does always come up, and I don’t shy away from it; I think it’s something people should go into with an open mind—how has the multi-cloud conversation progressed at Pinterest? Because there’s always a multi-cloud conversation.


Micheal: We have always approached it with some form of… openness. It’s not like we don’t want to be open to the ideas, but you really want to be thinking hard on the business case and the business value something provides on why you want to be doing x. In this case, when we think about multi-cloud, and again, Pinterest did start with EC2 and S3, and we did keep it that way for a long time. We built a lot of primitives around it, used it; for example, my team actually runs our bread-and-butter deployment system on EC2. We help facilitate deployments across 100,000-plus machines today.


And like you said, we have built that system keeping in mind how AWS works, and understanding the nuances of region and AZ failovers and all of that, and help facilitate deployments across 1000-plus microservices in the company. So, thinking about leveraging, say, a Google Cloud instance and how that works, in theory, we can always make a case for engineering to build our deployment system and expand there, but there’s really no value. And one of the biggest cases when multi-cloud comes in is usually either negotiation for price or actually a DR strategy. Like, what if AWS goes down in us-east-1? Well, let’s be honest, they’re powering half the internet [laugh] from that one single…


Corey: Right.


Micheal: Yeah. So, if you think your business is okay running when AWS goes down and half the internet is not going to be working, how do you want to be thinking about that? So, DR is probably not the best reason for you to be even exploring multi-cloud. Rather, you should be thinking about what the cloud providers are offering as a very nuanced offering which your current cloud provider is not offering, and really think about just using those specific items.


Corey: So, I agree that multi-cloud for DR purposes is generally not necessarily the best approach with the idea of being able to fail over seamlessly, but I like the idea for backups. I mean, Pinterest is a publicly-traded company, which means that among other things, you have to file risk disclosures and be responsive to auditors in a variety of different ways. There are some regulations that start applying to you. And the idea of, well, AWS builds things out in a super effective way, region separation, et cetera; whenever I talk to Amazonians, they are always surprised that anyone wouldn’t accept that, “Oh, if you want backups use a different region. Problem solved.”


Right, but it is often easier for me to have a “rehydrate the business” level of backup that would take weeks to redeploy living on another cloud provider than it is for me to explain to all of those auditors and regulators and financial analysts, et cetera, why I didn’t go down that path. So, there’s always some story for okay, what if AWS decides that they hate us and want to kick us off the platform? Well, that’s why legal is involved in those high-level discussions around things like risk, and indemnity, and termination for convenience and for cause clauses, et cetera, et cetera. The idea of making an all-in commitment to a cloud provider goes well beyond things that engineering thinks about. And it’s easy for those of us with engineering backgrounds to be incredibly dismissive of that of, “Oh, indemnity? Like, when does AWS ever lose data?” “Yeah, but let’s say one day they do. What is your story going to be when asked some very uncomfortable questions by people who wanted you to pay attention to this during the negotiation process?” It’s about dotting the i’s and crossing the t’s, especially with that many commas in the contractual commitments.


Micheal: No, it is true. And we did evaluate that as an option, but one of the interesting things about compliance, and especially auditing as well, we generally work with the best-in-class consultants to help us work through the controls and how we audit, how we look at these controls, how to make sure there’s enough accountability going through. The interesting part was in this case, as well, we were able to work with AWS in crafting a lot of those controls and setting up the right expectations as and when we were putting proposals together as well. Now, again, I’m not an expert on this and I know we have a dedicated team from our technical program management organization focused on this, but early on we realized that, to your point, the cost of any form of backups and then being able to audit what’s going in, look at all those pipelines, how quickly we can get the data in and out, it was proving pretty costly for us. So, we were able to work out some of that within the constructs of what we have with our cloud provider today, and still meet our compliance goals.


Corey: That’s, on some level, the higher point, too, where everything comes down to context; everything comes down to what the business demands, what the business requires, what the business will accept. And I’m not suggesting that in any case, they’re wrong. I’m known for beating the ‘Multi-cloud is a bad default decision’ drum, and then people get surprised when they’ll have one-on-one conversations, and they say, “Well, we’re multi-cloud. Do you think we’re foolish?” “No. You’re probably doing the right thing, just because you have context that is specific to your business that I, speaking in a general sense, certainly don’t have.”


People don’t generally wake up in the morning and decide they’re going to do a terrible job or no job at all at work today, unless they’re Facebook’s VP of Integrity. So, it’s not the sort of thing that lends itself to casual tweet-size, pithy analysis very often. There’s a strong dive into what is the level of risk a business can accept? And my general belief is that most companies are doing this stuff right. The universal constant in all of my consulting clients that I have spoken to about the in-depth management piece of things is, they’ve always asked the same question of, “So, this is what we’ve done, but can you introduce us to the people who are doing it really right, who have absolutely nailed this and gotten it all down?” It’s, yeah, absolutely no one believes that that is them, even the folks who are, from my perspective, pretty close to having achieved it.


But I want to talk a bit more about what you do beyond just the headline-grabbing large dollar figure commitment to a cloud provider story. What does engineering productivity mean at Pinterest? Where do you start? Where do you stop?


Micheal: I want to just quickly touch upon that last point about multi-cloud, and like you said, every company works within the context of what they are given and the constraints of their business. It’s probably a good time to give a plug to my previous employer, Twitter, who are doing multi-cloud in a reasonably effective way. They are on the data centers, they do have presence on Google Cloud, and AWS, and I know probably things have changed since a couple of years now, but they have embraced that environment pretty effectively to cater to their acquisitions who were on the public cloud, help obviously, with their initial set of investments in the data center, and still continue to scale that out, and explore, in this case, Google Cloud for a variety of other use cases, which sounds like it’s been extremely beneficial as well.


So, to your point, there is probably no right way to do this. There’s always that context, and what you’re working with comes into play as part of making these decisions. And it’s important to take a lot of these with a grain of salt because you can never understand the decisions, why they were made the way they were made. And for what it’s worth, it sort of works out in the end. [laugh]. I’ve rarely heard a story where it’s never worked out, and people are just upset with the deals they’ve signed. So, hopefully, that helps close that whole conversation about multi-cloud.


Corey: I hope so. It’s one of those areas where everyone has an opinion and a lot of them do not necessarily apply universally, but it’s always fun to take—in that case, great, I’ll take the lesser trod path of everyone’s saying multi-cloud is great, invariably because they’re trying to sell you something. Yeah, I have nothing particularly to sell, folks. My argument has always been, in the absence of a compelling reason not to, pick a provider and go all in. I don’t care which provider you pick—which people are sometimes surprised to hear.


It’s like, “Well, what if they pick a cloud provider that you don’t do consulting work for?” Yeah, it turns out, I don’t actually need to win every AWS customer over to have a successful working business. Do what makes sense for you, folks. From my perspective, I want this industry to be better. I don’t want to sit here and just drum up business for myself and make self-serving comments to empower that. Which apparently is a rare tactic.


Micheal: No, that’s totally true, Corey. One of the things you do is help people with their bills, so this has come up so many times, and I realize we’re sort of going off track a bit from that engineering productivity discussion—


Corey: Oh, which is fine. That’s this entire show’s theme, if it has one.


Micheal: [laugh]. So, I want to briefly just talk about the whole billing and how cost management works because I know you spend a lot of time on that and you help a lot of these companies be effective in how they manage their bills. These questions have come up multiple times, even at Pinterest. We actually in the past, when I was leading the infrastructure governance organization, we were working with other companies of our similar size to better understand how they are looking into getting visibility into their cost, setting sort of the right controls and expectations within the engineering organization to plan, and capacity plan, and effectively meet those plans in a certain criteria, and then obviously, if there is any risk to that, actively manage risk. That was like the biggest thing those teams used to do.


And we used to talk a lot, trade notes, and get a better sense of how a lot of these companies are trying to do this; for example, Netflix, or Lyft, or Stripe. I recall Netflix, content was their biggest spender, so cloud spending was like way down in the list of things for them. [laugh]. But regardless, they had an active team looking at this on a day-to-day basis. So, one of the things we learned early on at Pinterest is to start investing in those visibility tools early on.


No one can parse the cloud bills. Let’s be honest. You’re probably the only person who can reverse… [laugh] engineer an architecture diagram from a cloud bill, and I think that’s like—definitely you should take a patent for that or something. But in reality, no one has the time to do that. You want to make sure your business leaders, from your finance teams to engineering teams to head of the executives all have a better understanding of how to parse it.


So, investing engineering resources, take that data, how do you munch it down to the cost, the utilization across the different vectors of offerings, and have a very insightful discussion. Like, what are certain action items we want to be taking? It’s very easy to see, “Oh, we overspent on EC2,” and we want to go from there. But in reality, that’s not just that thing; you will start finding out that EC2 is being used by your Hadoop infrastructure, which runs hundreds of thousands of jobs. Okay, now who’s actually responsible for that cost? You might find that one job which is accruing, sort of, a lot of instance hours over a period of time in a shared multi-tenant environment; how do you attribute that cost to that particular cost center?
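One common way to approach the attribution question raised here is to split a shared cluster's bill in proportion to the instance-hours each job consumed. The sketch below is only an illustration under that assumption; it is not Pinterest's method, and the function name, job names, cost centers, and dollar figures are all hypothetical.

```python
from collections import defaultdict


def attribute_cluster_cost(total_cost, jobs):
    """Split a shared cluster's cost across cost centers by instance-hours.

    jobs: iterable of (job_name, cost_center, instance_hours) tuples.
    Returns a dict mapping cost_center -> attributed cost.
    """
    hours_by_center = defaultdict(float)
    for _job_name, cost_center, instance_hours in jobs:
        hours_by_center[cost_center] += instance_hours

    total_hours = sum(hours_by_center.values())
    if total_hours == 0:
        return {}  # nothing ran, so there is nothing to attribute

    # Each cost center pays its proportional share of the total bill.
    return {
        center: total_cost * hours / total_hours
        for center, hours in hours_by_center.items()
    }


# Hypothetical jobs on one multi-tenant cluster (made-up numbers).
jobs = [
    ("daily_etl", "ads", 600.0),
    ("search_index", "search", 300.0),
    ("orphaned_report", "unowned", 100.0),
]
shares = attribute_cluster_cost(10_000.0, jobs)
for center, cost in sorted(shares.items()):
    print(f"{center}: ${cost:,.2f}")
```

Keeping an explicit "unowned" bucket, rather than spreading orphaned jobs across everyone, makes the ownership gap visible, which is the conversation the speakers are pointing at.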


Corey: And then someone left the company a while back, and that job just kept running in perpetuity. No one’s checked the output for four years, I guess it can’t be that necessarily important. And digging into it requires context. It turns out, there’s no SaaS tool to do this, which is unfortunate for those of us who set out originally to build such a thing. But we discovered pretty early on the context on this stuff is incredibly important.


I love the thing you’re talking about here, where you’re discussing with your peer companies about these things because the advice that I would give to companies with the level of spend that you folks do is worlds apart from what I would advise someone who’s building something new and spending maybe 500 bucks a month on their cloud bill. Those folks do not need to hire a dedicated team of people to solve for these problems. At your scale, yeah, you probably should have had some people in [laugh] here looking at this for a while now. And at some point, the guidance changes based upon scale. And if there’s one thing that we discover from the horrible pages of Hacker News, it’s that people love applying bits of wisdom that they hear in wildly inappropriate situations.


How do you think about these things at that scale? Because, a simple example: right now I spend about 1000 bucks a month at The Duckbill Group, on our AWS bill. I know. We have one, too. Imagine that. And if I wind up just committing admin credentials to GitHub, for example, and someone compromises that and starts spinning things up to mine all the Bitcoin, yeah, I’m going to notice that by the impact it has on the bill, which will be noticeable from orbit.


At the level of spend that you folks are at, a company would be hard-pressed to spin up enough Bitcoin miners to materially move the billing needle on a month-to-month basis, just because of the sheer scope and scale. At small bill volumes, yeah, it’s pretty easy to discover the thing that’s spiking your bill to three times normal. It’s usually a managed NAT gateway. At your scale, tripling the bill begins to look suspiciously like the GDP of a small country, so what actually happened here? Invariably, at that scale, with that level of massive multiplier, it’s usually the simplest solution, an error somewhere in the AWS billing system. Yes, they exist. Imagine that.


Micheal: They do exist, and we’ve encountered that.


Corey: Kind of heartstopping, isn’t it?


Micheal: [laugh]. I don’t know if you remember when we had the big Spectre and the Meltdown, right, and those were interesting scenarios for us because we had identified a lot of those issues early on, given the scale we operate, and we were able to, sort of, obviously it did have an impact on the builds and everything, but that’s it; that’s why you have these dedicated teams to fix that. But I think one of the points you made, these are large bills and you’re never going to have a 3x jump the next day. We’re not going to be seeing that. And if that happens, you know, God save us. [laugh].


But to your point, one of the things we do still want to be doing is look at trends, literally on a week-over-week basis because even a one percentage move is a pretty significant amount, if you think about it, which could be funding some other aspects of the business, which we would prefer to be investing on. So, we do want to have enough rigor and controls in place in our technical stack to identify and alert when something is off track. And it becomes challenging when you start using those higher-order services from your public cloud provider because there’s no clear insights on how do you, kind of, parse that information. One of the biggest challenges we had at Pinterest was tying ownership to all these things.
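The week-over-week check described above can be sketched in a few lines. This is a minimal illustration, assuming weekly spend totals are already available as a list; the function name, the one percent threshold default, and the sample figures are assumptions made for the example, not details from the conversation.

```python
def week_over_week_alerts(weekly_spend, threshold=0.01):
    """Flag weeks whose spend moved more than `threshold` versus the prior week.

    weekly_spend: list of weekly totals, oldest first.
    Returns a list of (week_index, fractional_change) tuples.
    """
    alerts = []
    for i in range(1, len(weekly_spend)):
        previous, current = weekly_spend[i - 1], weekly_spend[i]
        if previous == 0:
            continue  # no baseline to compare against
        change = (current - previous) / previous
        if abs(change) > threshold:
            alerts.append((i, change))
    return alerts


# Made-up weekly totals in dollars.
weekly_spend = [1_000_000, 1_005_000, 1_030_000, 1_028_000]
alerts = week_over_week_alerts(weekly_spend)
for week, change in alerts:
    print(f"week {week}: {change:+.2%} versus the previous week")
```

At large spend, even a small fractional threshold corresponds to real money, which is why the check compares relative change rather than waiting for a dramatic absolute jump.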


No, using tags is not going to cut it. It was so difficult for us to get to a point where we could put some sense of ownership in all the things and the resources people are using, and then subsequently have the right conversation with our ads infrastructure teams, or our product teams to help drive the cost improvements we want to be seeing. And I wouldn’t be surprised if that’s a challenge already, even for the smaller companies who have bills to the tune of tens of thousands, right?


Corey: It is. It’s predicting the spend and trying to categorize it appropriately; that’s the root of all AWS bill panic on the corporate level. It’s not that the bill is 20% higher, so we’re going to go broke. Most companies spend far more on payroll than they do on infrastructure—as you mentioned with Netflix, content is a significantly larger [laugh] expense than any of those things; real estate, it’s usually right up there too—but instead it’s, when you’re trying to do business forecasting of, okay, if we’re going to have an additional 1000 monthly active users, what will the cost for us be to service those users and, okay, if we’re seeing a sudden 20% variance, if that’s the new normal, then well, that does change our cost projections for a number of years, what happens? When you’re public, there starts to become the question of okay, do we have to restate earnings or what’s the deal here?


And of course, all this sidesteps past the unfortunate reality that, for many companies, the AWS bill is not a function of how many customers you have; it’s how many engineers you hired. And that is always the way it winds up playing out for some reason. It’s, “Why did we see a 10% increase in the bill?” “Yeah, we hired another data science team. Oops.” It always seems to be the data science folks; I know I beat up on those folks a fair bit, and my apologies. And one day, if they analyze enough of the data, they might figure out why.


Micheal: So, this is where I want to give a shout out to our data science team, especially some of the engineers working in the Infrastructure Governance Team putting these charts together, helping us derive insights. So, definitely props to them.


I think there’s a great segue into the point you made. As you add more engineers, what is the impact on the bottom line? And this is one of the things actually as part of engineering productivity, we think about as well on a long-term basis. Pinterest does have 1000-plus engineers today, and to a large degree, many of them actually have their own EC2 instances today. And I wouldn’t say it’s a significant amount of cost, but it is a large enough number, where shutting down a c5.9xl can actually fund a bunch of conference tickets or something else.


And then you can imagine that sort of the scale you start working with at one point. The nuance here is though, you want to make sure there’s enough flexibility for these engineers to do their local development in a sustainable way, but when moving to, say production, we really want to tighten the flexibility a bit so they don’t end up doing what you just said, spin up a bunch of machines talking to the API directly which no one will be aware of.


I want to share a small anecdote because when back in the day, this was probably four years ago, when we were doing some analysis on our bills, we realized that there was a huge jump every—I believe Wednesday—in our EC2 instances by almost a factor of, like, 500 to 600 instances. And we’re like, “Why is this happening? What is going on?” And we found out there was an obscure job written by someone who had left the company, calling an EC2 API to spin up a search cluster of 500 machines on-demand, as part of pulling that ETL data together, and then shutting that cluster down. Which at times didn’t work as expected because, you know, obviously, your Hadoop jobs are very predictable, right?


So, those are the things we were dealing with back in the day, and you want to make sure—since then—this is where engineering productivity as a team starts coming in: our job is to enable every engineer to be doing their best work across coding, building, and deploying the services. And we have done this.
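

The weekly jump described above is the kind of pattern a simple check over daily instance counts can surface. What follows is a minimal sketch, not the analysis Pinterest actually ran: the function name, the input shape (a mapping of calendar date to running-instance count), and the threshold of 300 are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")


def weekday_spikes(daily_counts, threshold=300):
    """Return weekdays whose average instance count exceeds the average
    of all other days by at least `threshold` (a hypothetical cutoff).

    daily_counts maps datetime.date -> number of running instances.
    """
    by_weekday = defaultdict(list)
    for day, count in daily_counts.items():
        by_weekday[WEEKDAYS[day.weekday()]].append(count)

    spikes = {}
    for name, counts in by_weekday.items():
        # Compare this weekday against every observation from other weekdays.
        others = [c for other, cs in by_weekday.items()
                  if other != name for c in cs]
        if not others:
            continue
        excess = mean(counts) - mean(others)
        if excess >= threshold:
            spikes[name] = excess
    return spikes
```

Fed four weeks of counts where every Wednesday runs a few hundred instances above the baseline, this would flag Wednesday and nothing else. Finding the job responsible is still a manual hunt.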


Corey: Right. You and I can sit here and have an in-depth conversation about the intricacies of AWS billing in a bunch of different ways because in different ways we both specialize in it, in many respects. But let’s say that Pinterest theoretically was foolish enough to hire me before I got into this space as an engineer, for terrifying reasons. And great. I start day one as a typical software developer if such a thing could be said to exist. How do you effectively build guardrails in so that I don’t inadvertently wind up spinning up all the EC2 instances available to me within an account, which it turns out are more than one might expect sometimes, but still leave me free to do my job without effectively spending a nine-month safari figuring out how AWS bills work?


Micheal: And this is why teams like ours exist, to help provide those tools to help you get started. So today, we actually don’t let anyone directly use AWS APIs, or even use the UI for that matter. And I think you’ll soon realize, the moment you hit, like, probably 30 or 40 people in your organization, you definitely want to lock it down. You don’t want that access to be given to anyone or everyone. And then subsequently start building some higher-order tools or abstraction so people can start using that to control effectively.


In this case, if you’re a new engineer, Corey, which it seems like you were, at some point—


Corey: I still write code like I am, don’t worry.


Micheal: [laugh]. So yes, you would get access to our internal tool to actually help spin up what we call a dev app, where you get a chance to, obviously, choose the instance size, not the instance type itself, and we have actually constrained the instance types we have approved within Pinterest as well. We don’t give you the entire list to choose from and deploy to. We have actually constrained, based on the workload types, which instance types we want to support because in the future, if we ever want to move from c3 to c5—and I’ve been there, trust me—it is not an easy thing to do, so you want to make sure that you’re not letting people just use random instances, and constrain that by building some of these tools. As a new engineer, you would go in, you’d use the tool, and actually have a dev app provisioned for you with our Pinterest image to get you started.


And then subsequently, we’ll obviously shut it down if we see you not using it over a certain amount of time, but those are sort of the guardrails we’ve put in over there so you never get a chance to directly ever use the EC2 APIs, or any of those AWS APIs to do certain things. The same thing applies for S3 or any of the higher-order tools which AWS will provide, too.
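

The guardrails described above (engineers pick a size, the tool picks the approved instance family, and idle dev apps get shut down) can be sketched in a few lines. This is an illustrative sketch only, not Pinterest's internal tool: the workload names, the approved families and sizes, the 14-day idle limit, and every function name here are hypothetical, and no real AWS API is called.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical allowlist: each workload type maps to one approved family,
# so a future family migration only touches this table.
APPROVED_FAMILY = {"dev": "c5", "batch": "r5"}
# Hypothetical sizes an engineer is allowed to choose from.
APPROVED_SIZES = ("large", "xlarge", "2xlarge")
# Assumed idle threshold before a dev app is flagged for shutdown.
IDLE_LIMIT = timedelta(days=14)


@dataclass
class DevApp:
    owner: str
    instance_type: str
    last_used: datetime


def request_dev_app(owner, workload, size, now):
    """Validate a request against the allowlist and return the dev app."""
    if workload not in APPROVED_FAMILY:
        raise ValueError(f"unknown workload type: {workload}")
    if size not in APPROVED_SIZES:
        raise ValueError(f"size not approved: {size}")
    # The engineer chose only the size; the family comes from the table.
    return DevApp(owner, f"{APPROVED_FAMILY[workload]}.{size}", now)


def idle_apps(apps, now):
    """Return the dev apps that have gone unused longer than IDLE_LIMIT."""
    return [app for app in apps if now - app.last_used > IDLE_LIMIT]
```

Keeping the family out of the engineer's hands is the design choice that matters here: a move like c3 to c5 becomes a change to one table rather than a conversation with every team.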


Corey: This episode is sponsored by our friends at Oracle Cloud. Counting the pennies, but still dreaming of deploying apps instead of "Hello, World" demos? Allow me to introduce you to Oracle's Always Free tier. It provides over 20 free services and infrastructure, networking, databases, observability, management, and security.


And - let me be clear here - it's actually free. There's no surprise billing until you intentionally and proactively upgrade your account. This means you can provision a virtual machine instance or spin up an autonomous database that manages itself, all while gaining the networking, load balancing, and storage resources that somehow never quite make it into most free tiers, needed to support the application that you want to build.


With Always Free you can do things like run small scale applications, or do proof of concept testing without spending a dime. You know that I always like to put asterisks next to the word free. This is actually free. No asterisk. Start now. Visit https://snark.cloud/oci-free that's https://snark.cloud/oci-free.


Corey: How does that interplay with AWS launches yet another way to run containers, for example, and that becomes a valuable potential avenue to get some business value for a developer, but the platform you built doesn’t necessarily embrace that capability? Or they release a feature to an existing tool that you use that could potentially be just a feature capability story, much more so than a cost savings one. How do you keep track of all of that and empower people to use those things so they’re not effectively trying to reimplement DynamoDB on top of EC2?


Micheal: That’s been a challenge, actually, in the past for us because we’ve always been very flexible where engineers have had an opportunity to write their own solutions many a times rather than leveraging the AWS services, and of late, that’s one of the reasons why we have an infrastructure organization—an extremely lean organization for what it’s worth—but then still able to achieve outsized outputs. Where we evaluate a lot of these use cases, as they come in and open up different aspects of what we want to provide say directly from AWS, or build certain abstractions on top of it. Every time we talk about containers, obviously, we always associate that with something like Kubernetes and offerings from there on; we realized that our engineers directly never ask for those capabilities. They don’t come in and say, “I need a new container orchestration system. Give that to me, and I’m going to be extremely productive.”


What people actually realize is that if you can provide them effective tools and that can help them get their job done, they would be happy with it. For example, like I said, there’s our deployment system, which is actually an open-source system called Teletraan. That is the bread and butter at Pinterest, which my team runs. We operate 100,000-plus machines. We have actually looked into container orchestration where we do have a dedicated Kubernetes team looking at it and helping certain use cases move there, but we realized that the cost of entire migrations needs to be evaluated against certain use cases which can benefit from being on Kubernetes from day one. You don’t want to force anyone to move there, but give them the right incentives to move there. Case in point, let’s upgrade your OS. Because if you’re managing machines, obviously everyone loves to upgrade their OSes.


Corey: Well, it’s one of the reasons I love savings plans versus RIs; you talk about the c3 to c5 migration and everyone has a story about one of those, but the most foolish or frustrating reason that I ever saw not to do the upgrade was, “We bought a bunch of Reserved Instances on the C3s and those have a year-and-a-half left to run.” And it’s foolish not on the part of customers—it’s economically sound—but on the part of AWS where great, you’re now forcing me to take a contractual commitment to something that serves me less effectively, rather than getting out of the way and letting me do my job. That’s why it’s so important to me at least, that savings plans cover Fargate and Lambda. I wish they covered SageMaker instead of SageMaker having its own thing because once again, you’re now architecturally constrained based upon some ridiculous economic model that they have imposed on us. But that’s a separate rant for another time.


Micheal: No, we actually went through that process because we do have a healthy balance of how we do Reserved Instances and how we look at on-demand. We’ve never been big users of spot in the past because just the spot market itself, we realized that putting that pressure on our customers to figure out how to manage that is way more. When I say customers, in this case, engineers within the organization.


Corey: Oh, yes. “I want to post some pictures on Pinterest, so now I have to understand the spot market. What?” Yeah.


Micheal: [laugh]. So, in this case, even when we were moving from C3 to C5—and this is where the partnership really plays out effectively, right, because it’s also in the best interest of AWS to deprecate their aging hardware to support some of these new ones where they could also be making good enough premium margins for what it’s worth and give the benefit back to the user. So, in this case, we were able to work out an extremely flexible way of moving to a C5 as soon as possible, get help from them, actually, in helping us do that, too, allocating capacity and working with them on capacity management. I believe at one point, we were actually one of the largest companies with a C3 footprint and it took quite a while for us to move to C5. But rest assured, once we moved, the savings were just immense. We were able to offset any of those RIs and we were able to work behind the scenes to get that out. But obviously, not a lot of that is considered in a small-scale company just because of, like you said, those constraints which have been placed in a contractual obligation.


Corey: Well, this is an area in which I will give the same guidance to companies of your scale as well as small-scale companies. And by small-scale, I mean, people on the free tier account, give or take, so I do mean the smallest of the small. Whenever you wind up in a scenario where you find yourself architecturally constrained by an economic barrier like this, reach out to your account manager. I promise you have one. Every account, even the tiny free tier accounts, have an account manager.


I have an account manager, who I have to say has probably one of the most surreal jobs at AWS, just based upon the conversations I throw past him. But reaching out to your provider rather than trying to solve a lot of this stuff yourself by constraining how you’re building things internally is always the right first move because the worst case is you don’t get anywhere in those conversations. Okay, but at least you explored that, as opposed to what often happens, which is, “Oh, yeah. I have a switch over here I can flip and solve your entire problem. Does that help anything?”


Micheal: Yeah.


Corey: You feel foolish finding that out only after nine months of dedicated work, it turns out.


Micheal: Which makes me wonder, Corey. I mean, do you see a lot of that happening where folks don’t tend to reach out to their account managers, or rather treat them as partners in this case, right? Because it sounds like there is this unhealthy tension, I would say, as to what is the best help you could be getting from your account managers in this case.


Corey: Constantly. And the challenge comes from a few things, in my experience. The first is that the quality of account managers and the technical account managers—the folks who are embedded in many cases with your engineering teams in different ways—does vary. AWS is scaling wildly and bursting at the seams, and people are hard to scale.


So, some are fantastic, some are decidedly less so, and most folks fall somewhere in the middle of that bell curve. And it doesn’t take too many poor experiences for the default to be, “Oh, those people are useless. They never do anything we want, so why bother asking them?” And that leads to an unhealthy dynamic where a lot of companies will wind up treating their AWS account manager types as a ticket triage system, or the last resort of places that they’ll turn when they should be involved in earlier conversations.


I mean, take Pinterest as an example of this. I’m not sure how many technical account managers you have assigned to your account, but I’m going to go out on a limb and guess that the ratio of technical account managers to engineers working on the environment is incredibly lopsided. It’s got to be a high ratio just because of the nature of how these things work. So, there are a lot of people who are actively working on things that would almost certainly benefit from a more holistic conversation with your AWS account team, but it doesn’t occur to them to do it just because of either perceived biases around levels of competence, or poor experiences in the past, or simply not knowing the capabilities that are there. If I could tell one story around the AWS account management story, it would be talk to folks sooner about these things.


And to be clear, Pinterest has this less than other folks, but AWS does themselves no favors by having a product strategy of, “Yes,” because very often in service of those conversations with a number of companies, there is the very real concern of are they doing research so that they can launch a service that competes with us? Amazon as a whole launching a social network is admittedly one of the most hilarious ideas I [laugh] can come up with and I hope they take a whack at it just to watch them learn all these lessons themselves, but that is again, neither here nor there.


Micheal: That story is very interesting, and I think you mentioned one thing; it’s just that lack of trust, or even knowing what the account managers can actually do for you. There seems to be just a lack of education on that. And we also found it out the hard way, right? I wouldn’t say that Pinterest figured this out on day one. We evolved sort of a relationship over time. Yes, our time… engagements are, sort of, lopsided, but we were able to negotiate that as part of deals as we learned a bit more on what we can and we cannot do, and how these individuals are beneficial for Pinterest as well. And—


Corey: Well, here’s a question for you, without naming names—and this might illustrate part of the challenge customers have—how long has your account manager—not the technical account managers, but your account manager—been assigned to your account?


Micheal: I’ve been at Pinterest for five years and I’ve been working with the same person. And he’s amazing.


Corey: Which is incredibly atypical. At a lot of smaller companies, it feels like, “Oh, I’m your account manager being introduced to you.” And, “Are you the third one this year? Great.” What happens is that if the account manager excels, very often they get promoted and work with a smaller number of accounts at larger spend, and whereas if they don’t find that AWS is a great place for them for a variety of reasons, they go somewhere else and need to be backfilled.


So, at the smaller account, it’s, “Great. I’ve had more account managers in a year than you’ve had in five.” And that is often the experience when you start seeing significant levels of rotation, especially on the customer engineering side where you wind up with you have this big kickoff, and everyone’s aware of all the capabilities and you look at it three years later, and not a single person who was in that kickoff is still involved with the account on either side, and it’s just sort of been evolving evolutionarily from there. One thing that we’ve done in some of our larger accounts as part of our negotiation process is when we see that the bridges have been so thoroughly burned, we will effectively request a full account team cycle, just because it’s time to get new faces in where the customer, in many cases unreasonably, is not going to say, “Yeah but a year-and-a-half ago you did this terrible thing and we’re still salty about it.” Fine, whatever. I get it. People relationships are hard. Let’s go ahead and swap some folks out so that there are new faces with new perspectives because that helps.


Micheal: Well, first off, if you had so many switches in account manager, I think that says something about [laugh] how you’ve been working, too. I’m just kidding. There are a bu—


Corey: Entirely possible. In seriousness, yes. But if you talk to—like, this is not just me because in my case, yeah, I feel like my account manager is whoever drew the short straw that week because frankly, yeah, that does seem like a great punishment to wind up passing out to someone who is underperforming. But for a lot of folks who are in the mid-tier, like, spending $50,000 to $100,000 a month, this is a very common story.


Micheal: Yeah. Actually, we’ve heard a bit about this, too. And like you said, I think maintaining context is the most important thing. You really want your account manager to vouch for you, really be your champion in those meetings because AWS, like you said, is so large, getting that exec time, and reviews, and there are so many things that happen, your account manager is the champion for you, right there. And it’s important and in fact in your best interest to have a great relationship with them as well, not treat them as, oh, yet another vendor.


And I think that’s where things start to get a bit messy because when you start treating them as yet another vendor, there is no incentive for them to do the best for you, too. You know, people relationships are hard. But that said though, I think given the number of customers these cloud companies are accruing, I wouldn’t be surprised; every account manager seems to be extremely burdened. Even in our case, although I’ve had a chance to work with this one person for a long time, we’ve actually expanded. We now have multiple account managers helping us out as we’ve started scaling to use certain aspects of AWS which we’ve never explored before.


We were a bit constrained and reserved about what service we want to use because there have been instances where we have tried using something and we have hit the wall pretty immediately. API rate limits, or it’s not ready for primetime, and we’re like, “Oh, my God. Now, what do we do?” So, we have been a bit more cautious. But that said, over time, having an account manager who understands how you work, what scale you have, they’re able to advocate with the internal engineering teams within the cloud provider to make the best of supporting you as a customer and tell that success story all the way out.


So yeah, I can totally understand how this may be hard, especially for those small companies. For what it’s worth, I think the best way to really think about it is not treat them as your vendor, but really go out on a limb there. Even though you signed a deal with them, you want to make sure that you have the continuing relationship with them to have—represent your voice better within the company. Which is probably hard. [laugh].


Corey: That’s always the hard part. Honestly, if this were the sort of thing that were easy to automate, or you could wind up building out something that winds up helping companies figure out how to solve these things programmatically, talk about interesting business problems that are only going to get larger in the fullness of time. This is not going away, even if AWS stopped signing up new customers entirely right now, they would still have years of growth ahead of them just from organic growth. And take a company with the scale of Pinterest and just think of how many years it would take to do a full-on exodus, even if it became priority number one. It’s not realistic in many cases, which is why I’ve never been a big fan of multi-cloud as an approach for negotiation. Yeah, AWS has more data on those points than any of us do; they’re not worried about it. It just makes you sound like an unsophisticated negotiator. Pick your poison and lean in.


Micheal: That is the truth you just mentioned, and I probably want to give a call out to our head of infrastructure, [Coburn 00:42:13]. He’s also my boss, and he had brought this perspective as well. As part of any negotiation discussions, like you just said, AWS has way more data points on this than what we think we can do in terms of talking about, “Oh, we are exploring this other cloud provider.” And it’s—they would be like, “Yeah. Do tell me more [laugh] how that’s going.”


And it’s probably in the best interest to never use that as a negotiation tactic because they clearly know the investments that’s going to build on what you’ve done, so you might as well be talking more—again, this is where that relationship really plays together because you want both of them to be successful. And it’s in their best interest to still keep you happy because the good thing about at least companies of our size is that we’re probably, like, one phone call away from some of their executive team, where we could always talk about what didn’t work for us. And I know not everyone has that opportunity, but I’m really hoping and I know at least with some of the interactions we’ve had with the AWS teams, they’re actively working and building that relationship more and more, giving access to those customer advisory boards, and all of them to have those direct calls with the executives. I don’t know whether you’ve seen that in your experience in helping some of these companies?


Corey: I have a different approach to it. It turns out when you’re super loud and public and noisy about AWS and spend too much time in Seattle, you start to spend time with those people on a social basis. Because, again, I’m obnoxious and annoying to a lot of AWS folks, but I also have an obnoxious habit of being right in most of the things I’m pointing out. And that becomes harder and harder to ignore. I mean, part of the value that I found in being able to do this as a consultant is that I begin to compare and contrast different customer environments on a consistent ongoing basis.


I mean, the reason that negotiation works well from my perspective is that AWS does a bunch of these every week, and customers do these every few years with AWS. And well, we do an awful lot of them, too, and it’s okay, we’ve seen different ways things can get structured and it doesn’t take too long and too many engagements before you start to see the points of commonality in how these things flow together. So, when we wind up seeing things that a customer is planning on architecturally and looking to do in the future, and, “Well, wait a minute. Have you talked to the folks negotiating the contract about this? Because that does potentially have bearing and it provides better data than what AWS is gathering just through looking at overall spend trends. So yeah, bring that up. That is absolutely going to impact the type of offer you get.”


It just comes down to understanding the motivators that drive folks and it comes down to, I think, understanding the incentives. I will say that across the board, I have never yet seen a deal from AWS come through where it was, “Okay, at this point you’re just trying to hoodwink the customer and get them to sign on something that doesn’t help them.” I’ve seen mistakes that can definitely lead to that impression, and I’ve seen areas where their data is incomplete and they’re making assumptions that are not borne out in reality. But it’s not one of those bad faith type—


Micheal: Yeah.


Corey: —of negotiations. If it were, I would be framing a lot of this very differently. It sounds weird to say, “Yeah, your vendor is not trying to screw you over in this sense,” because look at the entire IT industry. How often has that been true about almost any other vendor in the fullness of time? This is something a bit different, and I still think we’re trying to grapple with the repercussions of that, from a negotiation standpoint and from a long-term business continuity standpoint, when your fate is linked—in a shared fate context—with your vendor.


Micheal: It’s in their best interest as well because they’re trying to build a diversified portfolio. Like, if they help 100 companies, even if one of them becomes the next Pinterest, that’s great, right? And that continued relationship is what they’re aiming for. So, assuming any bad faith over there probably is not going to be the best outcome, like you said. And two, it’s not a zero-sum game.


I always get a sense that when you’re doing these negotiations, it’s an all-or-nothing deal. It’s not. You have to think they’re also running a business and it’s important that you, as a business, ask how okay you are with some of those premiums. You cannot get a discount on everything, you cannot get the deal or the numbers you probably want on almost everything. And to your point, architecturally, if you’re moving in a certain direction where you think in the next three years, this is what your usage is going to be or it will come down to that, obviously, you should be investing more and negotiating that out front rather than managed NAT [laugh] gateways, I guess. So, I think that’s also an important mindset to take in as part of any of these negotiations. Which I’m assuming—I don’t know how you folks have been working in the past, but at least that’s one of the key items we have taken in as part of any of these discussions.


Corey: I would agree wholeheartedly. I think that it just comes down to understanding where you’re going, what’s important, and again in some cases knowing around what things AWS will never bend contractually. I’ve seen companies spend six weeks or more trying to get to negotiate custom SLAs around services. Let me save everyone a bunch of time and money; they will not grant them to you.


Micheal: Yeah.


Corey: I promise. So, stop asking for them; you’re not going to get them. There are other things they will negotiate on that they’re going to be highly case-dependent. I’m hesitant to mention any of them just because, “Well, wait a minute, we did that once. Why are you talking about that in public?” I don’t want to hear it and confidentiality matters. But yeah, not everything is negotiable, but most things are, so figuring out what levers and knobs and dials you have is important.


Micheal: We also found it that way. AWS does cater to their—they are a platform and they are pretty clear in how much engagement—even if we are one of their top customers, there’s been many times where I know their product managers have heavily pushed back on some of the requests we have put in. And that makes me wonder, they probably have the same engagement even with the smallest of customers, there’s always an implicit assumption that the big fish is trying to get the most out of your public cloud providers. To your point, I don’t think that’s true. We’re rarely able to negotiate anything exclusive in terms of their product offerings just for us, if that makes sense.


Case in point, tell us your capacity [laugh] for x instances or type of instances, so we as a company would know how to plan out our scale-ups or scale-downs. That’s not going to happen exclusively for you. But those kind of things are just, like, examples we have had a chance to work with their product managers and see if, can we get some flexibility on that? For what it’s worth, though, they are willing to find a middle ground with you to make sure that you get your answers and, obviously, you’re being successful in your plans to use certain technologies they offer or [unintelligible 00:48:31] how you use their services.


Corey: So, I know we’ve gone significantly over time and we are definitely going to do another episode talking about a lot of the other things that you’re involved in because I’m going to assume that your full-time job is not worrying about the AWS bill. In fact, you do a fair number of things beyond that; I just get stuck on that one, given that it is what I eat, sleep, breathe, and dream about.


Micheal: Absolutely. I would love to talk more, especially about how we’re enabling our engineers to be extremely productive in this new world, and how we want to cater to this whole cloud-native environment which is being created, and make sure people are doing their best work. But regardless, Corey, I mean, this has been an amazing, insightful chat, even for me. And I really appreciate you having me on the show.


Corey: No, thank you for joining me. If people want to learn more about what you’re up to, and how you think about things, where can they find you? Because I’m also going to go out on a limb and assume you’re also probably hiring, given that everyone seems to be these days.


Micheal: Well, that is true. And I wasn’t planning to make a hiring pitch but I’m glad that you leaned into that one. Yes, we are hiring and you can find me on Twitter at twitter dot com slash M-I-C-H-E-A-L. I am spelled a bit differently, so make sure you can hit me up, and my DMs are open. And obviously, we have all our open roles listed on pinterestcareers.com as well.


Corey: And we will, of course, put links to that in the show notes. Thank you so much for taking the time to speak with me today. I really appreciate it.


Micheal: Thank you, Corey. It has really been great being on your show.


Corey: And I’m sure we’ll do it again in the near future. Micheal Benedict, Head of Engineering Productivity at Pinterest. I am Cloud Economist Corey Quinn and this is Screaming in the Cloud. If you’ve enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you’ve hated this podcast, please leave a five-star review on your podcast platform of choice along with a long rambling comment about exactly how many data centers Pinterest could build instead.


Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.


Announcer: This has been a HumblePod production. Stay humble.