The Realities of Working in Data with Emily Gorcenski

Episode Summary

Emily Gorcenski, Data & AI Service Line Lead at Thoughtworks, joins Corey on Screaming in the Cloud to discuss how big data is changing our lives - both for the better, and the challenges that come with it. Emily explains how data is only important if you know what to do with it and have a plan to work with it, and why it’s crucial to understand the use-by date on your data. Corey and Emily also discuss how big data problems aren’t universal problems for the rest of the data community, how to address the ethics around AI, and the barriers to entry when pursuing a career in data.

Episode Show Notes & Transcript

About Emily

Emily Gorcenski is a principal data scientist and the Data & AI Service Line Lead of ThoughtWorks Germany. Her background in computational mathematics and control systems engineering has given her the opportunity to work on data analysis and signal processing problems from a variety of complex and data intensive industries. In addition, she is a renowned data activist and has contributed to award-winning journalism through her use of data to combat extremist violence and terrorism. The opinions expressed are solely her own.

Links Referenced:

ThoughtWorks: https://www.thoughtworks.com/
Personal website: https://emilygorcenski.com
Twitter: https://twitter.com/EmilyGorcenski
Mastodon: https://mastodon.green/@[email protected]

Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: Welcome to Screaming in the Cloud. I’m Corey Quinn. My guest today is Emily Gorcenski, who is the Data and AI Service Line Lead over at ThoughtWorks. Emily, thank you so much for joining me today. I appreciate it.

Emily: Thank you for having me. I’m happy to be here.

Corey: What is it you do, exactly? Take it away.

Emily: Yeah, so I run the data side of our business at ThoughtWorks, Germany. That means data engineering work, data platform work, data science work. I’m a data scientist by training. And you know, we’re a consulting company, so I’m working with clients and trying to help them through the, sort of, morphing landscape that data is these days. You know, should we be migrating to the cloud with our data? What can we migrate to the cloud with our data? What should we be doing with our data scientists and how do we make our data analysts’ lives easier? So, it’s a lot of questions like that and trying to figure out the strategy and all of those things.

Corey: You might be one of the most perfectly positioned people to ask this question to because one of the challenges that I’ve run into consistently and persistently—because I watch a lot of AWS keynotes—is that they always come up with the same talking point, that data is effectively the modern gold. And data is what unlocks value to your busin—“Every business agrees,” because someone who’s dressed in what they think is a nice suit on stage is saying that it’s, “Okay, you’re trying to sell me something. What’s the deal here?” Then I check my email and I discover that Amazon has sent me the same email about the same problem for every region I’ve deployed things to in AWS. And, “Oh, you deploy this to one of the Japanese regions. We’re going to send that to you in Japanese as a result.”

And it’s like, okay, for a company that says data is important, they have no idea who any of their customers are at this point, is the takeaway here. How real is, “Data is important,” versus, “We charge by the gigabyte so you should save all of your data and then run expensive things on top of it.”

Emily: I think data is very important, if you know what you’re going to do with it and if you have a plan for how to work with it. I think if you look at the history of computing, of technology, if you go back 20 years to maybe the early days of the big data era, right? Everyone’s like, “Oh, we’ve got big data. Data is going to be big.” And for some reason, we never questioned why, like, we were thinking that the ‘big’ in ‘big data’ meant ‘big’ as in volume and not ‘big’ as in ‘big pharma.’

This sort of revolution never really happened for most companies. Sure, some companies got a lot of value from the, sort of, data mining and just gather everything and collect everything and if you hit it with a big computational hammer, insights will come out and somehow those insights will make you money through magic. The reality is much more prosaic. If you want to make money with data, you have to have a plan for what you’re going to do with data. You have to know what you’re looking for and you have to know exactly what you’re going to get when you look at your data and when you try to answer questions with it.

And so, when we see somebody like Amazon not being able to correlate the fact that you’re the account owner for all of these different accounts and that the language should be English and all of these things, that’s part of the operational problem because it’s annoying to try to do joins across multiple tables in multiple regions and all of those things, but it’s also part—you know, nobody has figured out how this adds value for them to do that, right? There’s a part of it where it’s like, this is just professionalism, but there’s a part of it, where it’s also like… whatever. You’ve got Google Translate. Figure it out yourself. We’re just going to get through it.

I think that… as time has evolved from the initial waves of the big data era into the data science era, and now we’re in, you know, all sorts of different architectures and principles and all of these things, most companies still haven’t figured out what to do with data, right? They’re still investing a ton of money to answer the same analytics questions that they were answering 20 years ago. And for me, I think that’s a disappointment in some regards because we do have better tools now. We can do so many more interesting things if you give people the opportunity.

Corey: One of the things that always seemed a little odd was, back when I wielded root credentials in anger—‘anger,’ of course, being my name for the production environment, as opposed to, “Theory,” which is what I call staging because it works in theory, but not in production. I digress—it always felt like I was getting constant pushback from folks of, “You can’t delete that data. It’s incredibly important because one day, we’re going to find a way to unlock the magic of it.” And it’s, “These are web server logs that are 15 years old, and 98% of them by volume are load balancer health checks because it turns out that back in those days, baby seals got more hits than our website did, so that’s not really a thing that we wind up—that’s going to add much value to it.” And then from my perspective, at least, given that I tend to live, eat, sleep, breathe cloud these days, AWS did something that was refreshingly customer-obsessed when they came out with Glacier Deep Archive.

Because the economics of that are if you want to store a petabyte of data, with a 12-hour latency on request for things like archival logs and whatnot, it’s $1,000 a month per petabyte, which is okay, you have now hit a price point where it is no longer worth my time to argue with you. We’re just not going to delete anything ever again. Problem solved. Then came GDPR, which is neither here nor there and we actually want to get rid of those things for a variety of excellent legal reasons. And the dance continues.
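The rough arithmetic behind that figure can be sketched as follows. The per-gigabyte price below is an assumption for illustration (roughly the published Deep Archive storage rate in some regions at the time of writing), not a number stated in the conversation, and the hypothetical `monthly_storage_cost` helper ignores retrieval, request, and minimum-duration charges.

```python
# Assumed storage price in USD per GB-month. This is an illustrative
# assumption; check current pricing for the region you actually use.
PRICE_PER_GB_MONTH = 0.00099

# Decimal petabyte (10^6 GB), assumed here for simplicity.
GB_PER_PB = 1_000_000


def monthly_storage_cost(petabytes: float,
                         price_per_gb_month: float = PRICE_PER_GB_MONTH) -> float:
    """Return the storage-only monthly cost in USD for the given volume."""
    return petabytes * GB_PER_PB * price_per_gb_month


if __name__ == "__main__":
    for pb in (0.1, 1, 10):
        print(f"{pb} PB: ${monthly_storage_cost(pb):,.2f} per month")
```

Under that assumed rate, one petabyte lands close to the "$1,000 a month" quoted above, which is the point at which arguing over deletion costs more than the storage does.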

But my argument against getting rid of data because it’s super expensive no longer holds water in the way that it once did for anything remotely resembling a reasonable amount of data. Then again, that’s getting reinvented all the time. I used to be very, I guess we’ll call it, I guess, a data minimalist. I don’t want to store a bunch of data, mostly because I’m not a data person. I am very bad at thinking in that way.

I consider SQL to be the chess of the programming world and I’m not particularly great at it. And I’m also unlucky and have an aura, so if I destroy a bunch of stateless web servers, okay, we can all laugh about that, but let’s keep me the hell away from the data warehouse if we still want a company tomorrow morning. And that was sort of my experience. And I understand my bias in that direction. But I’m starting to see magic get unlocked.

Emily: Yeah, I think, you know, you said earlier, there’s, like, this mindset, like, data is the new gold or data is the new oil or whatever. And I think it’s actually more true that data is the new milk, right? It goes bad if you don’t use it, you know, before a certain point in time. And at a certain point in time, it’s not going to be very offensive if you just leave it locked in the jug, but as soon as you try to open it, you’re going to have a lot of problems. Data is very, very cheap to store these days. It’s very easy to hold data; it’s very expensive to process data.

And I think that’s where the shift has gone, right? There’s sort of this, like, Oracle DBA legacy of, like, “Don’t let the software developers touch the prod database.” And they’ve kind of kept their, like, arcane witchcraft to themselves, and that mindset has persisted. But now it’s sort of shifted into all of these other architectural patterns that are just abstractions on top of this, don’t let the software engineers touch the data store, right? So, we have these, like, streaming-first architectures, which are great. They’re great for software devs. And they’re great for data engineers who like to play with big powerful technology.

They’re terrible if you want to answer a question, like, “How many customers did I have yesterday?” And these are the things that I think are some of the central challenges, right? A Kappa architecture—you know, streaming-first architecture—is amazing if you want to improve your application developer throughput. And it’s amazing if you want to build real-time analytics or streaming analytics into your platform. But it’s terrible if you want your data lake to be navigable. It’s terrible if you want to find the right data that makes sense to do the more complex things. And it becomes very expensive to try to process it.

Corey: One of the problems I think I have is that if I take a look at the data volumes that I work with in my day-to-day job, I’m dealing with AWS billing data as spit out by the AWS billing system. And there isn’t really a big data problem here. If you take a look at some of the larger clients, okay, maybe I’m trying to consume a CSV that’s ten gigabytes. Yes, Excel is going to violently scream itself to death if I try to wind up loading it there, and then my computer smells like burning metal all afternoon. But if it fits in RAM, it doesn’t really feel like it’s a big data problem, on some level.
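As a minimal sketch of why a file that size does not need heavyweight tooling, the standard library alone can total costs per service while reading one row at a time. The `service` and `cost` column names and the `total_cost_by_service` helper are hypothetical placeholders for illustration, not the actual AWS billing export schema.

```python
import csv
import io
from collections import defaultdict


def total_cost_by_service(lines, service_col="service", cost_col="cost"):
    """Sum the cost column per service, streaming one row at a time.

    `lines` is any iterable of CSV text lines, such as an open file handle.
    Only the running totals are held in memory, never the whole file.
    """
    totals = defaultdict(float)
    for row in csv.DictReader(lines):
        totals[row[service_col]] += float(row[cost_col])
    return dict(totals)


if __name__ == "__main__":
    # Tiny in-memory stand-in for a much larger file on disk.
    sample = io.StringIO("service,cost\nS3,4.00\nLambda,1.50\nS3,0.25\n")
    print(total_cost_by_service(sample))
```

For a real file, passing `open(path, newline="")` in place of the in-memory sample would presumably work the same way, since the function only needs an iterable of lines.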

And it just feels that when I look at the landscape of all the different tools you can use for things like this, they just feel like it’s more or less, hmm, “I have a loose thread on my shirt. Could you pass me that chainsaw for a second?” It just seems like stupendous overkill for anything that I’m working with. Counterpoint: the clients I’m working with have massive data farms, and my default response when I meet someone who’s very good at an area that I don’t do a lot of work in—counterintuitively to what a lot of people apparently do on Twitter—is not the default assumption of, “Oh, I don’t know anything about that space. It must be worthless and they must be dumb.”

No. That is not the default approach to take to anything, from my perspective. So, it’s clear there’s something very much there that I just don’t see slash understand. That is a very roundabout way of saying what could be uncharitably distilled down to, “So, is your entire career bullshit?” But no, it is clearly not.

There is value being extracted from this and it’s powerful. I just think that there’s been an industry-wide, relatively poor job done of explaining that value in ways that don’t come across as contrived or profoundly disturbing.

Emily: Yeah, I think there’s a ton of value in doing things right. It gets very complicated to try to explain the nuances of when and how data can actually be useful, right? Oftentimes, your historical data, you know, it really only tells you about what happened in the past. And you can throw some great mathematics at it and try to use it to predict the future in some sense, but it’s not necessarily great at what happens when you hit really hard changes, right?

For example, when the Coronavirus pandemic hit and purchaser and consumer behavior changed overnight. There was no data in the data set that explained that consumer behavior. And so, what you saw is a lot of these things like supply chain issues, which are very heavily data-driven under normal circumstances, there was nothing in that data that allowed those algorithms to optimize for the reality that we were seeing at that scale, right? Even if you look at advanced logistics companies, they know what to do when there’s a hurricane coming or when there’s been an earthquake or things like that. They have disaster scenarios.

But nobody has ever done anything like this at the global scale, right? And so, what we saw was this hard reset that we’re still feeling the repercussions of today. Yes, there were people who couldn’t work and we had lockdowns and all that stuff, but we also have an effect from the impact of the way that we built the systems to work with the data that we need to shuffle around. And so, I think that there is value in being able to process these really, really large datasets, but I think that actually, there’s also a lot of value in being able to solve smaller, simpler problems, right? Not everything is a big data problem, not everything requires a ton of data to solve.

It’s more about the mindset that you use to look at the data, to explore the data, and what you’re doing with it. And I think the challenge here is that, you know, everyone wants to believe that they have a big data problem because it feels like you have to have a big data problem if you—

Corey: All the cool kids are having this kind of problem.

Emily: You have to have big data to sit at the grownups’ table. And so, what’s happened is we’ve optimized a lot of tools around solving big data problems and oftentimes, these tools are really poor at solving normal data problems. And there’s a lot of money being spent on a lot of overkill engineering in the data space.

Corey: On some level, it feels like there has been a dramatic misrepresentation of this. I had an article that went out last year where I called machine-learning selling pickaxes into a digital gold rush. And someone I know at AWS responded to that in probably the best way possible—she works over on their machine-learning group—she sent me a foam Minecraft pickaxe that now is hanging on my office wall. And that gets more commentary than anything, including the customized oil painting I have of Billy the Platypus fighting an AWS Billing Dragon. No, people want to talk about the Minecraft pickaxe.

It’s amazing. It’s, first, where is this creativity in any of the marketing that this department is putting out? But two, it’s clearly not accurate. And what it took for me to see that was a couple of things that I built myself. I built a Twitter thread client that would create Twitter threads, back when Twitter was a place that wasn’t overrun by some of the worst people in the world and turned into BirdChan.

But that was great. It would automatically do OCR on images that I uploaded, it would describe the image to you using Azure’s Cognitive Vision API. And that was magic. And now I see things like ChatGPT, and that’s magic. But you take a look at the way that the cloud companies have been describing the power of machine learning and AI, they wind up getting someone with a doctorate whose first language is math getting on stage for 45 minutes and just yelling at you in Star Trek technobabble to the point where you have no idea what the hell they’re saying.

And occasionally other data scientists say, “Yeah, I think he’s just shining everyone on at this point. But yeah, okay.” It still becomes unclear. It takes seeing the value of it for it to finally click. People make fun of it, but the Hot Dog, Not A Hot Dog app is the kind of valuable breakthrough that suddenly makes this intangible thing very real for people.

Emily: I think there’s a lot of impressive stuff and ChatGPT is fantastically impressive. I actually used ChatGPT to write a letter to some German government agency to deal with some bureaucracy. It was amazing. It did it, was grammatically correct, it got me what I needed, and it saved me a ton of time. I think that these tools are really, really powerful.

Now, the thing is, not every company needs to build its own ChatGPT. Maybe they need to integrate it, maybe there’s an application for it somewhere in their landscape of product, in their landscape of services, in the landscape of their internal tooling. And I would be thrilled actually to see some of that be brought into reality in the next couple of years. But you also have to remember that ChatGPT is not something that came because we had, like, a really great breakthrough in AI last year or something like that. It’s stacked upon 40 years of research.

We’ve gone through three new waves of neural networking in that time to get to this point, and it solves one class of problem, which is honestly a fairly narrow class of problem. And so, what I see is a lot of companies that have much more mundane problems, but where data can actually still really help them. Like how do you process Cambodian driver’s licenses with OCR, right? These are the types of things that if you had a training data set that was every Cambodian person’s driver’s license for the last ten years, you’re still not going to get the data volumes that even a day worth of Amazon’s marketplace generates, right? And so, you need to be able to solve these problems still with data without resorting to the cudgel that is a big data solution, right?

So, there’s still a niche, a valuable niche, for solving problems with data without having to necessarily resort to, we have to load the entire internet into our stream and throw GPUs at it all day long and spend hundreds of—tens of millions of dollars in training. I don’t know, maybe hundreds of millions; however much ChatGPT just raised. There’s an in-between that I think is vastly underserved by what people are talking about these days.

Corey: There is so much attention being given to this and it feels almost like there has been a concerted and defined effort to almost talk in circles and remove people from the humanity and the human consequences of what it is that they’re doing. When I was younger, in my more reckless years, I was never much of a fan of the idea of government regulation. But now it has become abundantly clear that our industry, regardless of how you want to define industry—or describe a society—cannot self-regulate when it comes to data that has the potential to ruin people’s lives. I mean, I spent a fair bit of my time in my career working in financial services in a bunch of different ways. And at least in those jobs, it was only money.

The scariest thing I ever dealt with, from a data perspective, is when I did a brief stint at Grindr because that was the sort of problem where if that data gets out, people will die. And I have not had to think about things that have that level of import before or since, for which I’m eternally grateful. “It’s only money,” which is a weird thing for a guy who fixes cloud bills for a living to say. And if I say that in a client call, it’s not going to go very well. But it’s the truth. Money is one of those things that can be fixed. It can be addressed in due course. There are always opportunities there. If someone has just been outed to their friends and family and they feel their life is now in shambles around them, you can’t unring that particular bell.

Emily: Yeah. And in some countries, it can lead to imprisonment, or—

Corey: It can lead to death sentences, yes. It’s absolutely not acceptable.

Emily: There’s a lot to say about the ethics of where we are. And I think that as a lot of these high profile, you know, AI tools have come out over the last year or so, so you know, Stable Diffusion and ChatGPT and all of this stuff, there’s been a lot of conversation that is sort of trying to put some counterbalance on what we’re seeing. And I don’t know that it’s going to be successful. I think that, you know, I’ve been speaking about ethics and technology for a long time and I think that we need to mature and get to the next level of actually addressing the ethical problems in technology. Because it’s so far beyond things like, “Oh, you know, if there’s a biased training data set and therefore the algorithm is biased,” right?

Everyone knows that by now, right? And the people who don’t know that, don’t care. We need to get much beyond where, you know, these conversations about ethics and technology are going because it’s a manifold problem. We have issues where the people labeling this data are paid, you know, pennies per hour to deal with some of the most horrific content you’ve ever seen. I mean, I’m somebody who has immersed myself in a lot of horrific content for some of the work that I have done, and this is, you know, so far beyond what I’ve had to deal with in my life that I can’t even imagine it. You couldn’t pay me enough money to do it and we’re paying people in developing nations, you know, a buck-thirty-five an hour to do this. I think—

Corey: But you must understand, Emily, that given the standard of living where they are, that that is perfectly normal and we wouldn’t want to distort local market dynamics. So, if they make a buck-fifty a day, we are going to be generous gods and pay them a whopping dollar-seventy a day, and now we feel good about ourselves. And no, it’s not about exploitation. It’s about raising up an emerging market. And other happy horseshit lies that people tell themselves.

Emily: Yes, it is. Yes, it is. And we’ve built—you know, the industry has built its back on that. It’s raised itself up on this type of labor. It’s raised itself up on taking texts and images without permission of the creators. And, you know, there’s—I’m not a lawyer and I’m not going to play one, but I do know that derivative use is something that at least under American law, is something that can be safely done. It would be a bad world if derivative use was not something that we had freely available, I think, on balance.

But our laws, the thing is, our laws don’t account for the scale. Our laws about things like fair use, derivative use, are for if you see a picture and you want to take your own interpretation, or if you see an image and you want to make a parody, right? It’s a one-to-one thing. You can’t make 5 million parody images based on somebody’s art, yourself. These laws were never built for this scale.

And so, I think that where AI is exploiting society is it’s exploiting a set of ethics, a set of laws, and a set of morals that are built around a set of behavior that is designed around normal human interaction scales, you know, one person standing in front of a lecture hall or friends talking with each other or things like that. The world was not meant for a single person to be able to speak to hundreds of thousands of people or to manipulate hundreds of thousands of images per day. It’s actually—I find it terrifying. Like, the fact that me, a normal person, has a Twitter following that, you know, if I wanted to, I can have 50 million impressions in a month. This is not a normal thing for a normal human being to have.

And so, I think that as we build this technology, we have to also say, we’re changing the landscape of human ethics by our ability to act at scale. And yes, you’re right. Regulation is possibly one way that can help this, but I think that we also need to embed cultural values in how we’re using the technology and how we’re shaping our businesses to use the technology. It can be used responsibly. I mean, like I said, ChatGPT helped me with a visa issue, sending an email to the immigration office in Berlin. That’s a fantastic thing. That’s a net positive for me; hopefully, for humanity. I wasn’t about to pay a lawyer to do it. But where’s the balance, right? And it’s a complex topic.

Corey: It is. It absolutely is. There is one last topic that I would like to talk to you about that’s a little less heavy. And I’ve got to be direct with you that I’m not trying to be unkind, but you’ve disappointed me. Because you mentioned to me at one point, when I asked how things were going in your AWS universe, you said, “Well, aside from the bank heist, reasonably well.”

And I thought that you were blessed with something I always look for, which is the gift of glorious metaphor. Unfortunately, as I said, you’ve disappointed me. It was not a metaphor; it was the literal truth. What the hell kind of bank heist could possibly affect an AWS account? This sounds like something out of a movie. Hit me with it.

Emily: Yeah, you know, I think in the SRE world, we tell people to focus on the high probability, low impact things because that’s where it’s going to really hurt your business, and let the experts deal with the black swan events because they’re pretty unlikely. You know, a normal business doesn’t have to worry about terrorists breaking into the Google data center or a gang of thieves breaking into a bank vault. Apparently, that is something that I have to worry about because I have some data in my personal life that I needed to protect, like all other people. And I decided, like a reasonable and secure and smart human being who has a little bit of extra spending cash that I would do the safer thing and take my backup hard drive and my old phones and put them in a safety deposit box at an old private bank that has, you know, a vault that’s behind the meter-and-a-half thick steel door and has two guards all the time, cameras everywhere. And I said, “What is the safest possible thing that you can do to store your backups?” Obviously, you put it in a secure storage location, right? And then, you know, I don’t use my AWS account, my personal AWS account so much anymore. I have work accounts. I have test accounts—

Corey: Oh, yeah. It’s honestly the best way to have an AWS account is just having someone else having a payment instrument attached to it because otherwise oh God, you’re on the hook for that yourself and nobody wants that.

Emily: Absolutely. And you know, creating new email addresses for new trial accounts is really just a pain in the ass. So, you know, I have my phone, you know, from five years ago, sitting in this bank vault and I figured that was pretty secure. Until I got an email [laugh] from the Berlin Polizei saying, “There has been a break-in.” And I went and I looked at the news and apparently, a gang of thieves has pulled off the most epic heist in recent European history.

This is barely in the news. Like, unless you speak German, you’re probably not going to find any news about this. But a gang of thieves broke into this bank vault and broke open the safety deposit boxes. And it turns out that this vault was also the location where a luxury watch consigner had been storing his watches. So, they made off with some, like, tens of millions of dollars of luxury watches. And then also the phone that had my 2FA for my Amazon account. So, the total value, you know, potential theft of this was probably somewhere in the $500 million range if they set up a SageMaker instance on my account, perhaps.

Corey: This episode is sponsored in part by Honeycomb. I’m not going to dance around the problem. Your. Engineers. Are. Burned. Out. They’re tired from pagers waking them up at 2 am for something that could have waited until after their morning coffee. Ring Ring, Who’s There? It’s Nagios, the original call of duty! They’re fed up with relying on two or three different “monitoring tools” that still require them to manually trudge through logs to decipher what might be wrong. Simply put, there’s a better way. Observability tools like Honeycomb (and very little else because they do admittedly set the bar) show you the patterns and outliers of how users experience your code in complex and unpredictable environments so you can spend less time firefighting and more time innovating. It’s great for your business, great for your engineers, and, most importantly, great for your customers. Try FREE today at honeycomb.io/screaminginthecloud. That’s honeycomb.io/screaminginthecloud.

Corey: The really annoying part that you are going to kick yourself on about this—and I’m not kidding—is, I’ve looked up the news articles on this event and it happened, something like two or three days after AWS put out the best release of last year’s, or any other re:Invent—past, present, future—which is finally allowing multiple MFA devices on root accounts. So finally, we can stop having safes with these things or you can have two devices or you can have multiple people in Covid times out of remote sides of different parts of the world and still get into the thing. But until then, nope. It’s either no MFA or you have to store it somewhere ridiculous like that and access becomes a freaking problem in the event that the device is lost, or in this case stolen.

Emily: [laugh]. I will just beg the thieves, if you’re out there, if you’re secretly actually a bunch of cloud engineers who needed to break into a luxury watch consignment storage vault so that you can pay your cloud bills, please have mercy on my poor AWS account. But also I’ll tell you that the credit card attached to it is expired so you won’t have any luck.

Corey: Yeah. Really sad part. Despite having an expired credit card, it just means that the charge won’t go through. They’re still going to hold you responsible for it. It’s the worst advice I see people—

Emily: [laugh].

Corey: Well-intentioned—giving each other on places like Reddit where the other children hang out. And it’s, “Oh, just use a prepaid gift card so it can only charge you so much.” It’s yeah, and then you get exploited like someone recently was and start accruing $60,000 a day in Lambda charges on an otherwise idle account and Amazon will come after you with a straight face after a week. And, like, “Yes, we’d like our $360,000, please.”

Emily: Yes.

Corey: “We tried to charge the credit card and wouldn’t you know, it expired. Could you get on that please? We’d like our money faster if you wouldn’t mind.” And then you wind up in absolute hell. Now, credit where due, they in every case I am aware of that is not looking like fraud’s close cousin, they have made it right, on some level. But it takes three weeks of back and forth and interminable waiting.

And you’re sitting there freaking out, especially if you’re someone who does not have a spare half-million dollars sitting around. Imagine who—“You sound poor. Have you tried not being that?” And I’m firmly convinced that it’s a matter of time until someone does something truly tragic because they don’t understand that it takes forever, but it will go away. And from my perspective, there’s no bigger problem that AWS needs to fix than surprise lifelong earnings bills to some poor freaking student who is just trying to stand up a website as part of a class.

Emily: All of the clouds have these missing stairs in them. And it’s really easy because they make it—one of the things that a lot of the cloud providers do is they make it really easy for you to spin up things to test them. And they make it really, really hard to find where it is to shut it all down. Data science is awful at this. As a data scientist, I work with a lot of data science tools, and every cloud has, like, the spin up your magical data science computing environment so that your data scientist can, like, bang on the data with you know, high-performance compute for a while.

And you know, it’s one click of a button and you type in a couple of things, name your service or whatever, name your resource. You click a couple buttons and you spin it up, but behind the scenes, it’s setting up a Kubernetes cluster and it’s setting up some storage bucket and it’s setting up some data pipelines and it’s setting up some monitoring stuff and it’s setting up a VM in order to run all of this stuff. And the next thing that you know, you’re burning 100, 200 euro a day, just to, like, figure out if you can load a CSV into pandas using a Jupyter Notebook. And when you try to shut it all down, you can’t. You have to figure out, oh, there is a networking thing set up. Well, nobody told me there’s a networking thing set up. You know? How do I delete that?

Corey: You didn’t say please, so here you go. For me, it’s not even the giant bill going from $4 a month in S3 charges to half a million bucks because that is pretty obvious from the outside just what the hell’s been happening. It’s the little stuff. I am still, since last summer, waiting for a refund on $260 of ‘because we said so’ SageMaker credits because of a change of their billing system, for a 45-minute experiment I had done eight months before that.

Emily: Yep.

Corey: Wild stuff. Wild stuff. And I have no tolerance for people saying, “Oh, you should just read the pricing page and understand it better.” Yeah, listen, jackhole. I do this for a living. If I can fall victim to it, anyone can. I promise. It is not that I don’t know how the billing system works and what to do to avoid unexpected charges.

And I’m just lucky, because if I hadn’t caught it with my systems three days into the month, it would have been a $2,000 surprise. And yeah, I run a company. I can live with that. I wouldn’t be happy, but whatever. It is immaterial compared to, you know, payroll.

Emily: I think it’s kind of a rite of passage, you know, to have the $150 surprise Redshift bill at the end of the month from your personal test account. And it’s sad, you know? I think that there’s so much better that they can do and that they should do. Sort of as a tangent, one of the challenges that I see in the data space is that it’s so hard to break into data because the tooling is so complex and it requires so much extra knowledge, right? If you want to become a software developer, you can develop a microservice on your machine, you can build a web app on your machine, you can set up Ruby on Rails, or Flask, or you know, .NET, or whatever you want. And you can do all of that locally.

And you can learn everything you need to know about React, or Terraform, or whatever, running locally. You can’t do that with data stuff. You can’t do that with BigQuery. You can’t do that with Redshift. The only way that you can learn this stuff is if you have an account with that set up and you’re paying the money to execute on it. And that makes it a really high barrier to entry for anyone to get into this space. It makes it really hard to learn. Because if you want to learn anything by doing, like many of us in the industry have done, it’s going to cost you a ton of money just to [BLEEP] around and find out.

Corey: Yes. And no one likes the find out part of those stories.

Emily: Nobody likes to find out when it comes to your bill.

Corey: And to tie it back to the data story of it, it is clearly some form of batch processing because it tries to be an eight-hour consistency model. Yeah, I assume for everything, it’s 72. But what that means is that you are significantly far removed from doing a thing and finding out what that thing costs. And that’s the direct charges. There’s always the oh, I’m going to set things up and it isn’t going to screw you over on the bill. You’re just planting a beautiful landmine you’re going to stumble blindly into in three months when you do something else and didn’t realize what that means.

And the worst part is it feels victim-blamey. I guess this is one of the reasons I’m so down on data, even now. It’s because I contextualize it in a sense of the AWS bill. No one’s happy dealing with that. You ever met a happy accountant? You have not.

Emily: Nope. Nope [laugh]. Especially when it comes to cloud stuff.

Corey: Oh yeah.

Emily: Especially these days, when we’re all looking to save energy, save money in the cloud.

Corey: Ideally, save the planet. Sustainability and saving money align on the axis of ‘turn that shit off.’ It’s great. We can hope for a brighter tomorrow.

Emily: Yep.

Corey: I really want to thank you for being so generous with your time. If people want to learn more, where can they find you? Apparently filing police reports after bank heists, which you know, it’s a great place to meet people.

Emily: Yeah. You know, the largest criminal act in Berlin is certainly a place you want to go to get your cloud advice. You can find me, I have a website. It’s my name, emilygorcenski.com.

You can find me on Twitter, but I don’t really post there anymore. And I’m on Mastodon at some place because Mastodon is weird and kind of a mess. But if you search me, I’m really not that hard to find. My name is harder to spell, but you’ll see it in the podcast description.

Corey: And we will, of course, put links to all of this in the show notes. Thank you so much for your time. I really appreciate it.

Emily: Thank you for having me.

Corey: Emily Gorcenski, Data and AI Service Line Lead at ThoughtWorks. I’m Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you’ve enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you’ve hated this podcast, please leave a five-star review on your podcast platform of choice along with an angry, insipid, insulting comment, talking about why data doesn’t actually matter at all. And then the comment will disappear into the ether because your podcast platform of choice feels the same way about your crappy comment.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.

Announcer: This has been a HumblePod production. Stay humble.
