Keeping Life on the Internet Friction Free with Jason Frazier

Episode Summary

This is a bit unusual for this episode! Our friends at Redis have asked us to interview Jason Frazier, who does not, nor has he ever, worked at Redis. Rather, Jason is the Software Engineering Manager at Ekata, which is striving to be the global leader in online identity verification. Because, as it turns out, we live in an age where anybody can put anything anywhere all over the web! Jason goes into detail on Ekata, a Redis customer, and its efforts to reduce fraudulent activity online. He discusses Ekata’s efforts to make things frictionless for the customer and the balance it needs to strike. He also lays out in detail why Ekata has chosen to use Redis, and notes the importance of striking a balance between consumers’ desire for privacy and their appetite for convenience and patience.

Episode Show Notes & Transcript

About Jason
Jason Frazier is a Software Engineering Manager at Ekata, a Mastercard Company. Jason’s team is responsible for developing and maintaining Ekata’s product APIs. Previously, as a developer, Jason led the investigation and migration of Ekata’s Identity Graph from AWS ElastiCache to Redis on Flash on Redis Enterprise, which brought an average savings of $300,000/yr.


Links:

Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.


Corey: This episode is sponsored in part by our friends at Sysdig. Sysdig is the solution for securing DevOps. They have a blog post that went up recently about how an insecure AWS Lambda function could be used as a pivot point to get access into your environment. They’ve also gone deep in-depth with a bunch of other approaches to how DevOps and security are inextricably linked. To learn more, visit sysdig.com and tell them I sent you. That’s S-Y-S-D-I-G dot com. My thanks to them for their continued support of this ridiculous nonsense.


Corey: Today’s episode is brought to you in part by our friends at MinIO the high-performance Kubernetes native object store that’s built for the multi-cloud, creating a consistent data storage layer for your public cloud instances, your private cloud instances, and even your edge instances, depending upon what the heck you’re defining those as, which depends probably on where you work. It’s getting that unified is one of the greatest challenges facing developers and architects today. It requires S3 compatibility, enterprise-grade security and resiliency, the speed to run any workload, and the footprint to run anywhere, and that’s exactly what MinIO offers. With superb read speeds in excess of 360 gigs and 100 megabyte binary that doesn’t eat all the data you’ve gotten on the system, it’s exactly what you’ve been looking for. Check it out today at min.io/download, and see for yourself. That’s min.io/download, and be sure to tell them that I sent you.


Corey: Welcome to Screaming in the Cloud. I’m Corey Quinn. This one is a bit fun because it’s a promoted episode sponsored by our friends at Redis, but my guest does not work at Redis, nor has he ever. Jason Frazier is a Software Engineering Manager at Ekata, a Mastercard company, which I feel, like, that should have some sort of, like, music backstopping into it just because, you know, large companies always have that magic sheen on it. Jason, thank you for taking the time to speak with me today.


Jason: Yeah. Thanks for inviting me. Happy to be here.


Corey: So, other than the obvious assumption, based upon the fact that Redis is kind enough to be sponsoring this episode, I’m going to assume that you’re a Redis customer at this point. But I’m sure we’ll get there. Before we do, what is Ekata? What do you folks do?


Jason: So, the whole idea behind Ekata is—I mean, if you go to our website, our mission statement is, “We want to be the global leader in online identity verification.” What that really means is, in an increasingly digital world, when anyone can put anything they want into any text field they want, especially when purchasing anything online—


Corey: You really think people do that? Just go on the internet and tell lies?


Jason: I know. It’s shocking to think that someone could lie about who they are online. But that’s sort of what we’re trying to solve, specifically in the payment space. Like, I want to buy a new pair of shoes online, and I enter in some information. Am I really the person that I say I am when I’m trying to buy those shoes? To prevent fraudulent transactions. That’s really one of the bases that our company operates on: trying to reduce fraud globally.


Corey: That’s fascinating just from the perspective of: you take a look at cloud vendors in the space that I tend to hang out with, and a lot of their identity verification of, is this person who they claim to be, in fact, is put back onto the payment providers. Take Oracle Cloud, which I periodically beat up but also really enjoy aspects of their platform, where to get their always free tier, you have to provide a credit card. Now, they’ll never charge you anything until you affirmatively upgrade the account, but—“So, what do you need my card for?” “Ah, identity and fraud verification.” So, it feels like the way that everyone else handles this is, “Ah, we’ll make it the payment networks’ problem.” Well, you’re now owned by Mastercard, so I sort of assume you are what the payment networks, in turn, use to solve that problem.


Jason: Yeah, so basically, one of our flagship products and things that we return is sort of like a score, from 0 to 400, on how confident we are that this person is who they say they are. And it’s really about helping merchants determine whether they should approve, or deny, or forward a transaction on to, like, a manual review agent. As well, there’s also another use case that’s even more popular, which is just, like, account creation. As you can imagine, there’s lots of bots on everyone’s [laugh] favorite app or website and things like that, or a customer offers a promotion, like, “Sign up and get $10.”


Well, I could probably get $10,000 if I make a thousand random accounts, and then I’ll sign up with them. But, like, make sure that those accounts are legitimate accounts, that’ll prevent, like, that sort of promo abuse and things like that. So, it’s also not just transactions. It’s also, like, account openings and stuff, make sure that you actually have real people on your platform.
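
A minimal sketch of the kind of score-based routing Jason describes a merchant doing with a 0–400 confidence score. The thresholds, function name, and field names here are illustrative assumptions, not Ekata’s actual API:

```python
# Hypothetical routing on an identity-confidence score (0-400).
# The thresholds and the "approve/manual_review/deny" split are
# illustrative assumptions, not Ekata's documented behavior.

def route_transaction(confidence_score: int) -> str:
    """Map a 0-400 confidence score to a merchant decision."""
    if confidence_score >= 300:   # high confidence: let it through
        return "approve"
    if confidence_score >= 150:   # uncertain: send to a human reviewer
        return "manual_review"
    return "deny"                 # low confidence: block it


if __name__ == "__main__":
    for score in (380, 220, 90):
        print(score, "->", route_transaction(score))
```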


Corey: The thing that always annoyed me was the way that companies decide, oh, we’re going to go ahead and solve that problem with a CAPTCHA on it. It’s, “No, no, I don’t want to solve machine learning puzzles for Google for free in order to sign up for something. I am the customer here; you’re getting it wrong somewhere.” So, I assume—given the fact that I buy an awful lot of stuff online, but I don’t recall ever seeing anything branded with Ekata—that you do this behind the scenes; it is not something that requires human interaction, by which I mean, friction.


Jason: Yeah, for sure. Yeah, yeah. It’s behind the scenes. That’s exactly what I was about to segue to is friction, is trying to provide a frictionless experience for users. In the US, it’s not as common, but when you go into Europe or anything like that, it’s fairly common to get confirmations on transactions and things like that.


You may have to, I don’t know, get a code texted to you and enter that online to basically say, like, “Yes, I actually received this.” And the reason companies do that is for that, like, extra bit of security and assurance that that’s actually legitimate. And obviously, companies would prefer not to have to do that because, I don’t know, if I’m trying to buy something and this website makes me do something extra but that site doesn’t make me do anything extra, I’m probably going to go with that one because it’s just more convenient for me; there’s less friction there.


Corey: You’re obviously limited in how much you can say about this, just because here’s a list of all the things we care about means that, great, you’ve given me a roadmap, too, of things to wind up looking at. But do you have an example or two of the sort of data that you wind up analyzing to figure out the likelihood that I’m a human versus a robot?


Jason: Yeah, for sure. I mean, it’s fairly common across most payment forms. So, things like you enter in your first name, your last name, your address, your phone number, your email address. Those are all identity elements that we look at. We have two data stores: We have our Identity Graph and our Identity Network.


The Identity Graph is what you would probably think of if you imagine a web of a person and their identity: like, you have a name that’s linked to a telephone, and that name is also linked to an address. But that address used to have previous people living there, so on and so forth. So, those various things—what we just call identity elements—are what we look at. It’s fairly common on any payment form, I’m sure, like, if you buy something on Amazon versus eBay or whatever, you’re probably going to be asked, what’s your name? What’s your address? What’s your email address? What’s your telephone?


Corey: It’s one of the most obnoxious parts of buying things online from websites I haven’t been to before. It’s one of the genius ideas behind Apple Pay and the other centralized payment systems. Oh, yeah. They already know who you are. Just click the button, it’s done.


Jason: Yeah, even something as small as that. I mean, it gets a little bit easier with, like, form autocompletes and stuff like, oh, just type J and it’ll just autocomplete everything for me. That’s not the worst of the world, but it is still some amount of annoyance and friction. [laugh].


Corey: So, as I look through all this, it seems like one of the key things you’re trying to do, since it’s in line with someone waiting while something is spinning in their browser, is that this needs to be quick. It also strikes me that this is likely not something where you’re going to hit the same people trying to identify all the time—if so, that is its own sign of fraud—so it doesn’t really seem like something that can be heavily cached. Yet you’re using Redis, which tells me that your conception of how you’re using it might be different than the mental space that I put Redis into when I’m thinking about where, in this ridiculous architecture diagram, the Redis part is going to go.


Jason: Yeah, I mean, like, whenever anyone says Redis, thinks of Redis—I mean, even before we went down this path—you always think of, oh, I need a cache, I’ll just stuff it in Redis. Just use Redis as a cache here and there. I don’t know, some small—a few tens, hundreds of gigabytes, maybe—cache, spin that up, and you’re good. But we actually use Redis as our primary data store for our Identity Graph, specifically for the speed that we can get. Because if you’re trying to look for a person—like, let’s say you’re buying something for your brother, how do we know if that’s true or not? Because you have this name, you’re trying to send it to a different address, like, how does that make sense? But how do we get from Corey to an address? Like, oh, maybe you used to live with your brother?


Corey: It’s funny, you pick that as your example; my brother just moved to Dublin, so it’s the whole problem of how do I get this from me to someone, different country, different names, et cetera? And yeah, how do you wind up mapping that to figure out the likelihood that it is either credit card fraud, or somebody actually trying to be, you know, a decent brother for once in my life?


Jason: [laugh]. So, I mean, how it works is how you imagine you start at some entry point, which would probably be your name, start there and say, “Can we match this to this person’s address that you believe you’re sending to?” And we can say, “Oh, you have a person-person relationship, like he’s your brother.” So, it maps to him, which we can then get his address and say, “Oh, here’s that address. That matches what you’re trying to send it to. Hey, this makes sense because you have a legitimate reason to be sending something there. You’re not just sending it to some random address out in the middle of nowhere, for no reason.”


Corey: Or the drop-shipping scams, or brushing scams, or one of—that’s the thing is every time you think you’ve seen it all, all you have to do is look at fraud. That’s where the real innovation seems to be happening, [laugh] no matter how you slice it.


Jason: Yeah, it’s quite an interesting space. I always like to say it’s one of those things where if you had the human element in it, it’s not super easy, but it’s like, generally easy to tell, like, okay, that makes sense, or, oh, no, that’s just complete garbage. But trying to do it at scale very fast in, like, a general case becomes an actual substantially harder problem. [laugh]. It’s one of those things that people can probably do fairly well—I mean, that’s why we still have manual reviews and things like that—but trying to do it automatically or just with computers is much more difficult. [laugh].


Corey: Yeah, “Hee hee, I scammed a company out of 20 bucks” is not the problem you’re trying to avoid. It’s the, “Okay, I just did that ten million times and now we have a different problem.”


Jason: Yeah, exactly. I mean, one of the biggest losses for a lot of companies is, like, fraudulent transactions and chargebacks. Usually, in the case of, like, e-commerce companies—or even especially nowadays where, as you can imagine, more people are moving to a more online world and doing shopping online and things like that—as more people move to online shopping, some companies are always going to get some amount of chargebacks on fraudulent transactions. But when it happens at scale, that’s when you start seeing many losses because not only are you issuing a chargeback, you probably sent out some product, so you’re now out some physical product as well. So, it’s almost kind of like a double-whammy. [laugh].


Corey: So, as I look through all this, I tended to always view Redis in terms of, more or less, a key-value store. Is that still accurate? Is that how you wind up working with it? Or has it evolved significantly past that, to the point where you can now do relational queries against it?


Jason: Yeah, so we do use Redis as a key-value store because, like, Redis is just a traditional key-value store, very fast lookups. When we first started building out Identity Graph, as you can imagine, you’re trying to model people to telephones to addresses; your first thought is, “Hey, this sounds a whole lot like a graph.” That’s sort of what we did quite a few years ago is, let’s just put it in some graph database. But as time went on and as it became much more important to have lower and lower latency, we really started thinking about, like, we don’t really need all the nice and shiny things that, like, a graph database or some sort of graph technology really offers you. All we really need to do is I need to get from point A to point B, and that’s it.


Corey: Yeah, [unintelligible 00:10:35] graph database, what’s the first thing I need to do? Well, spend six weeks in school trying to figure out exactly what the hell a graph database is because they’re challenging to wrap your head around at the best of times. Then it just always seemed overpowered for a lot of—I don’t want to say simple use cases; what you’re doing is not simple, but it doesn’t seem to be leveraging the higher-order advantages that a graph database tends to offer.


Jason: Yeah, it added a lot of complexity in the system, and [laugh] me and one of our senior principal engineers who’s been here for a long time, we always have a joke: If you search our GitHub repository for… we’ll say kindly-worded commit messages, you can see a very large correlation of those types of commit messages to all the commits to try and use a graph database from multiple years ago. It was not fun to work with, just added too much complexity, and we just didn’t need all that shiny stuff. So, that’s how we really just took a step back. Like, we really need to do it this way. We ended up effectively flattening the entire graph into an adjacency list.


So, a key is basically some UUID for an entity. So, Corey, you’d have some UUID associated with you, and the value would be whatever your information is, as well as other UUIDs linking to the other entities. So, from that first retrieval, I can now unpack it, and, “Oh, now I have a whole bunch of other UUIDs I can then query on to get that information, which will then have more UUIDs associated with it.” That’s more or less how we do our graph traversal and run our graph queries.
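
A minimal sketch of the flattened adjacency-list layout Jason describes: each entity lives under a UUID key whose value is a small JSON blob holding its attributes plus the UUIDs it links to, and traversal is just repeated GETs. The field names, helper functions, and local Redis endpoint here are illustrative assumptions, not Ekata’s actual schema:

```python
# Sketch of a flattened identity graph stored as an adjacency list in Redis.
# Field names ("attrs", "links") are illustrative assumptions only.
import json
import uuid

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)


def put_entity(attrs: dict, links: list[str]) -> str:
    """Store one entity (person, address, phone, ...) under a UUID key."""
    key = str(uuid.uuid4())
    r.set(key, json.dumps({"attrs": attrs, "links": links}))
    return key


def traverse(start_key: str, depth: int = 2):
    """Breadth-first walk: unpack each value and follow its linked UUIDs."""
    seen, frontier = set(), [start_key]
    for _ in range(depth):
        next_frontier = []
        for key in frontier:
            if key in seen:
                continue
            seen.add(key)
            raw = r.get(key)
            if raw is None:
                continue
            entity = json.loads(raw)
            yield key, entity["attrs"]
            next_frontier.extend(entity["links"])
        frontier = next_frontier


# Example: person -> brother -> brother's address
addr = put_entity({"type": "address", "city": "Dublin"}, [])
brother = put_entity({"type": "person", "name": "Brother"}, [addr])
corey = put_entity({"type": "person", "name": "Corey"}, [brother])
for key, attrs in traverse(corey):
    print(key, attrs)
```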


Corey: One of the fun things about doing this sort of interview dance on the podcast as long as I have is you start to pick up what people are saying by virtue of what they don’t say. Earlier, you wound up mentioning that we often use Redis for things like tens, or hundreds, of gigabytes, which sort of leaves in my mind the strong implication that you’re talking about something significantly larger than that. Can you disclose the scale of data we’re talking about here?


Jason: Yeah. So, we use Redis as our primary data store for our Identity Graph, and also—soon to be—for our Identity Network, which is our other database. But specifically for our Identity Graph, the scale we’re talking about: we do have some compression added on there, but uncompressed, it’s about 12 terabytes of data, which, compressed and with replication, comes to about four.


Corey: That’s a relatively decent compression factor, given that I imagine we’re not talking about huge datasets.


Jason: Yeah, so this is actually basically driven directly by cost: If you need to store less data, then you need less memory, therefore, you need to pay for less.


Corey: So, our users once again have shored up my longtime argument that when it comes to cloud, cost and architecture are in fact the same thing. Please, continue by all means.


Jason: I would be lying if I said that we didn’t do weekly slash monthly reviews of costs. Where are we spending costs in AWS? How can we improve costs? How can we cut down on costs? How can you store less—


Corey: You are singing my song.


Jason: It is a [laugh] it is a constant discussion. But yeah, so we use Zstandard compression, which was developed at Facebook, and it’s a dictionary-based compression. And the reason we went for this is—I mean, like, if I say I want to compress, like, a Word document down, like, you can get a very, very, very high level of compression. It exists. It’s not that interesting, everyone does it all the time.


But with this we’re talking about—so in that, basically, four or so terabytes of compressed data that we have, it’s something around four to four-and-a-half billion keys and values, and so we’re talking about each key-value only really having anywhere between 50 and 100 bytes. So, we’re not compressing very large pieces of information. We’re compressing very small 50 to 100 byte JSON values—we have UUID keys and JSON strings stored as values. So, we’re compressing these 50 to 100 byte JSON strings with around 70, 80% compression. I mean, that’s using Zstandard with a custom dictionary, which probably gave us the biggest cost savings of all; if you can [unintelligible 00:14:32] your dataset size by 60, 70%, that’s huge. [laugh].
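
A rough sketch of the technique Jason describes: training a shared Zstandard dictionary on a sample of the tiny JSON values so that 50–100 byte records compress well. This uses the Python `zstandard` bindings; the sample payloads, dictionary size, and compression results are assumptions for illustration:

```python
# Dictionary-trained Zstandard compression for many tiny JSON values.
# Sample data and the 16 KB dictionary size are made-up assumptions.
import json

import zstandard as zstd

# Small values compress poorly on their own; a dictionary trained on a
# reasonably diverse sample captures the repeated structure (field names,
# UUID formatting) so each value can be compressed against it.
samples = [
    json.dumps({"attrs": {"type": "person", "name": f"user{i}"},
                "links": [f"id-{i + 1}", f"id-{i + 2}"]}).encode()
    for i in range(10_000)
]
dictionary = zstd.train_dictionary(16_384, samples)

compressor = zstd.ZstdCompressor(dict_data=dictionary)
decompressor = zstd.ZstdDecompressor(dict_data=dictionary)

value = samples[0]
packed = compressor.compress(value)
assert decompressor.decompress(packed) == value
print(f"original: {len(value)} bytes, compressed: {len(packed)} bytes")
```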


Corey: Did you start off doing this on top of Redis, or was this an evolution that eventually got you there?


Jason: It was an evolution over time. We were formerly Whitepages. I mean, Whitepages started back in the late-90s. It really just started off as a—we just—


Corey: You were a very early adopter of Redis [laugh]. Yeah, at that point, like, “We got a time machine and started using it before it existed.” Always a fun story. Recruiters seem to want that all the time.


Jason: Yeah. So, when we first started, I mean, we didn’t have that much data. It was basically just one provider that gave us some amount of data, so it was kind of just a—we just need to start something quick, get something going. And so, I mean, we just did what most people do just do the simplest thing: Just stuff it all in a Postgres database and call it good. Yeah, it was slow, but hey, it was back a long time ago, people were kind of okay with a little bit—


Corey: The world moved a bit slower back then.


Jason: Everything was a bit slower, no one really minded too much, the scale wasn’t that large. But business requirements always change over time and they evolve, and so to meet those ever-evolving business requirements, we move from Postgres, and where a lot of the fun commit messages that I mentioned earlier can be found is when we started working with Cassandra and Titan. That was before my time before I had started, but from what I understand, that was a very fun time. But then from there, that’s when we really kind of just took a step back and just said, like, “There’s so much stuff that we just don’t need here. Let’s really think about this, and let’s try to optimize a bit more.”


Like, we know our use case, why not optimize for our use case? And that’s how we ended up with the flattened graph storage stuffing into Redis. Because everyone thought of Redis as a cache, but everyone also knows that—why is it a cache? Because it’s fast. [laugh]. We need something that’s very fast.


Corey: I still conceptualize it as an in-memory data store, just because when I turned on the disk persistence model back in 2011, give or take, it suddenly started slamming the entire data store to a halt for about three seconds every time it did it. It was, “What’s this piece of crap here?” And it was, “Oh, yeah. Turns out there was a regression in Xen, which is what AWS used as a hypervisor back then.” And, “Oh, yeah.”


So, fork became an expensive call; it took forever to wind up running. So, oh, the obvious lesson we take from this is, oh, yeah, Redis is not designed to be used with disk persistence. Wrong lesson to take from the behavior, but it did cement, in my mind at least, the idea that this is something that we tend to use only as an in-memory store. It’s clear that the technology has evolved, and in fact, I’m super glad that Redis threw you my direction to talk to you about this stuff because until talking to you, I was still—I’ve got to admit—sort of in the position of thinking of it still as an in-memory data store, because the fact that Redis says otherwise, because they’re envisioning it being something else—well okay, marketers going to market. You’re a customer; it’s a lot harder for me to talk smack about your approach to this thing when I see you doing it for, let’s be serious here, what is a very important use case. If identity verification starts failing open and everyone claims to be who they say they are, that’s something that is visible from orbit when it comes to the macroeconomic effect.


Jason: Yeah, exactly. It’s actually funny because before we moved to primarily just using Redis, before going to fully Redis, we did still use Redis. But we used ElastiCache—we had it loaded into ElastiCache—but we also had it loaded into DynamoDB as sort of a, I don’t want this to fail, because we weren’t comfortable with actually using Redis as a primary database. So, we used to use ElastiCache with a fallback to DynamoDB, just in that off chance, which, you know, sometimes it happened, sometimes it didn’t. But that’s when we basically just went searching for new technologies, and that’s actually how we landed on Redis on Flash, which kind of breaks the whole idea of Redis as an in-memory database, where it’s Redis, but it’s not just an in-memory database; you also have flash-backed storage.
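
A rough sketch of the read path Jason describes from before the move to Redis Enterprise: try ElastiCache (Redis) first, fall back to DynamoDB on a miss or error. The endpoint, table name, and key attribute are hypothetical, not Ekata’s actual setup:

```python
# Read-through with a DynamoDB safety net, as described in the episode.
# Endpoint, table name, and key attribute are hypothetical placeholders.
import boto3
import redis

cache = redis.Redis(host="my-elasticache-endpoint", port=6379,
                    decode_responses=True, socket_timeout=0.05)
table = boto3.resource("dynamodb", region_name="us-west-2").Table("identity-graph")


def get_entity(entity_id: str) -> str | None:
    """Read from Redis first; fall back to DynamoDB if the cache can't answer."""
    try:
        value = cache.get(entity_id)
        if value is not None:
            return value
    except redis.RedisError:
        pass  # cache unavailable; fall through to DynamoDB
    item = table.get_item(Key={"id": entity_id}).get("Item")
    return item["value"] if item else None
```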


Corey: So, you’ll forgive me if I combine my day job with this side project of mine, where I fix the horrifying AWS bills for large companies. My bias, as a result, is to look at infrastructure environments primarily through the lens of the AWS bill. And oh, great, go ahead and use an enterprise offering that someone else runs because, sure, it might cost more money, but it’s not showing up on the AWS bill, therefore, my job is done. Yeah, it turns out that doesn’t actually work, or the answer to every AWS billing problem is to migrate to Azure or GCP. Turns out that doesn’t actually solve the problem that you would expect.


But you’re obviously an enterprise customer of Redis. Does that data live in your AWS account? Is it something you’re using as their managed service and throwing over the wall so it shows up as data transfer on your side? How is that implemented? I know they’ve got a few different models.


Jason: There’s a couple of aspects to how we’re actually billed. I mean, so, like, when you have ElastiCache, you’re just billed for, I don’t know, whatever nodes you’re using, cache dot, like, r5 or whatever they are… [unintelligible 00:19:12]


Corey: I wish most people were using things that modern. But please, continue.


Jason: But yeah, so you basically just get billed for whatever ElastiCache nodes you have, you have your hourly rate, I don’t know, maybe you might reserve them. But with Redis Enterprise, the way that we’re billed is there’s two aspects. One is, well, the contract that we signed that basically allows us to use their technology [unintelligible 00:19:31] with a managed service, a managed solution. So, there’s some amount that we pay them directly within some contract, as well as the actual nodes themselves that exist in the cluster. And so basically the way that this is set up is we effectively have a sub-account within our AWS account that Redis Labs—or not Redis Labs; Redis Enterprise—has access to, which they deploy directly into, and effectively using VPC peering is how we allow our applications to talk directly to it.


So, we’re billed directly—or, so, the actual nodes of the cluster, which are i3.8x, I believe; they basically just run as EC2 instances. All of those instances exist on our bill. Like, we get billed for them; we pay for them. It’s just basically some sub-account that they have access to that they can deploy into. So, we get billed for the instances of the cluster as well as whatever we pay for our enterprise contract. So, there’s sort of two aspects to the actual billing of it.


Corey: This episode is sponsored in part by our friends at Vultr. Spelled V-U-L-T-R because they’re all about helping save money, including on things like, you know, vowels. So, what they do is they are a cloud provider that provides surprisingly high performance cloud compute at a price that—while sure they claim it’s better than AWS pricing—and when they say that they mean it is less money. Sure, I don’t dispute that, but what I find interesting is that it’s predictable. They tell you in advance on a monthly basis what it’s going to cost. They have a bunch of advanced networking features. They have nineteen global locations and scale things elastically. Not to be confused with openly, because apparently elastic and open can mean the same thing sometimes. They have had over a million users. Deployments take less than sixty seconds across twelve pre-selected operating systems. Or, if you’re one of those nutters like me, you can bring your own ISO and install basically any operating system you want. Starting with pricing as low as $2.50 a month for Vultr cloud compute, they have plans for developers and businesses of all sizes, except maybe Amazon, who stubbornly insists on having something to scale all on their own. Try Vultr today for free by visiting: vultr.com/screaming, and you’ll receive $100 in credit. That’s V-U-L-T-R.com slash screaming.


Corey: So, it’s easy to sit here as an engineer—and believe me, having been one for most of my career, I fall subject to this bias all the time—where it’s, “Oh, you’re going to charge me a management fee to run this thing? Oh, that’s ridiculous. I can do it myself instead,” because, at least when I was learning in my dorm room, it was always a “Well, my time is free, but money is hard to come by.” And shaking off that perspective as my career continued to evolve was always a bit of a challenge for me. Do you ever find yourself or your team drifting toward the direction of, “Well, what are we paying for Redis Enterprise for? We could just run it ourselves with the open-source version and save whatever it is that they’re charging on top of that?”


Jason: Before we landed on Redis on Flash, we had that same thought, like, “Why don’t we just run our own Redis?” And the decision to that is, well, managing such a large cluster that’s so important to the function of our business, like, you effectively would have needed to hire someone full time to just sit there and stare at the cluster the whole time just to operate it, maintain it, make sure things are running smoothly. And it’s something that we made a decision that, no, we’re going to go with a managed solution. It’s not easy to manage and maintain clusters of that size, especially when they’re so important to business continuity. [laugh]. From our eyes, it was just not worth the investment for us to try and manage it ourselves and go with the fully managed solution.


Corey: But even when we talk about it, it’s one of those well—it’s—everyone talks about, like, the wrong side of it first, the oh, it’s easier if things are down if we wind up being able to say, “Oh, we have a ticket open,” rather than, “I’m on the support forum and waiting for people to get back to me.” Like, there’s a defensibility perspective. We all just sort of, like sidestep past the real truth of it of, yeah, the people who are best in the world running and building these things are right now working on the problem when there is one.


Jason: Yeah, they’re the best in the world at trying to solve what’s going on. [laugh].


Corey: Yeah, because that is what we’re paying them to do. Oh, right. People don’t always volunteer for for-profit entities. I keep forgetting that part of it.


Jason: Yeah, I mean, we’ve had some very, very fun production outages that just randomly happened because to our knowledge, we would just like—I would, like… “I have no idea what’s going on.” And, you know, working with their support team, their DevOps team, honestly, it was a good, like, one-week troubleshooting. When we were validating the technology, we accidentally halted the database for seemingly no reason, and we couldn’t possibly figure out what’s going on. We kept talking to—we were talking to their DevOps team. They’re saying, “Oh, we see all these writes going on for some reason.” We’re like, “We’re not sending any writes. Why is there writes?”


And that was the whole back and forth for almost a week, trying to figure out what the heck was going on, and it happened to be, like, a very subtle case in terms of, like, how the keys and values are actually stored between RAM and flash and how it might swap in and out of flash. And like, all the way down to that level where I want to say we probably talked to their DevOps team at least two to three times, like, “Could you just explain this to me?” Like, “Sure,” like, “Why does this happen? I didn’t know this was a thing.” So on and so forth. Like, there’s definitely some things that are fairly difficult to try and debug, which definitely helps having that enterprise-level solution.


Corey: Well, that’s the most valuable thing in any sort of operational experience where, okay, I can read the documentation and all the other things, and it tells me how it works. Great. The real value of whether I trust something in production is whether or not I know how it breaks where it’s—


Jason: Yeah.


Corey: —okay—because the one thing you want to hear when you’re calling someone up is, “Oh, yeah. We’ve seen this before. This is what you do to fix it.” The worst thing in the world is, “Oh, that’s interesting. We’ve never seen that before.” Because then oh, dear Lord, we’re off in the mists of trying to figure out what’s going on here, while production is down.


Jason: Yeah, kind of like, “What does this database do, like, in terms of what do we do?” Like, I mean, this is what we store our Identity Graph in. This has the graph of people’s information. If we’re trying to do identity verification for transactions or anything, for any of our products, I mean, we need to be able to query this database. It needs to be up.


We have a certain requirement in terms of uptime, where we want at least, like, four nines of uptime. So, we also want a solution that, hey, even if it wants to break, doesn’t break that bad. [laugh]. There’s a difference between, “Oh, a node failed and okay, like, we’re good in 10, 20 seconds,” versus, “Oh, a node failed. You lost data. You need to start reloading your dataset, or you can’t query this anymore.” [laugh]. There’s a very large difference between those two.


Corey: A little bit, yeah. That’s also a great story to drive things across. Like, “Really? What is this going to cost us if we pay for the enterprise version? Great. Is it going to be more than some extortionately large number? Because if we’re down for three hours in the course of a year, that’s what we owe our customers back for not being able to deliver, so it seems to me this is kind of a no-brainer for things like that.”


Jason: Yeah, exactly. And, like, that’s part of the reason—I mean, a lot of the things we do at Ekata, we usually go with enterprise-level for a lot of things we do. And it’s really for that support factor in helping reduce any potential downtime for what we have because, well, if we don’t consider ourselves comfortable or expert-level in that subject, I mean, then yeah, if it goes down, that’s terrible for our customers. I mean, it’s needed for literally every single query that comes through us.


Corey: I did want to ask you: you keep talking about “the database” and “the cluster.” It seems like you have a single database or a single cluster that winds up being responsible for all of this. That feels like the blast radius of that thing going down must be enormous. Have you done any research into breaking that out into smaller databases? What is it that’s driven you toward this architectural pattern?


Jason: Yeah, so for right now, we actually have three regions we’re deployed into. We have a copy of it in us-west in AWS, we have one in eu-central-1, and we also have one in ap-southeast-1. So, we have a complete copy of this database in three separate regions, as well as we’re spread across all the available availability zones for that region. So, we try and be as multi-AZ as we can within a specific region. So, we have thought about breaking it down, but having high availability, having multiple replication factors, having it also, you know, stored in multiple data centers, provides us at least a good level of comfortability.


Specifically, in our US cluster, we actually have two. We literally also—with a lot of the cost savings that we got, we actually have two. We have one that literally sits idle 24/7 that we just call our backup and our standby where it’s ready to go at a moment’s notice. Thankfully, we haven’t had to use it since I want to say its creation about a year-and-a-half ago, but it sits there in that doomsday scenario: “Oh, my gosh, this cluster literally cannot function anymore. Something crazy catastrophic happened,” and we can basically hot swap back into another production-ready cluster as needed, if needed.


Because the really important thing is that if we broke it up into two separate databases, if one of them goes down, that could still fail your entire query. Because what if that’s the database that held your address? We can still query you, but we’re going to try and get your address and, well, there, your traversal just died because you can no longer get that. So, even trying to break it up doesn’t really help us too much. We can still fail the entire traversal query.


Corey: Yeah, which makes an awful lot of sense. Again, to be clear, you’ve obviously put thought into this; it goes way beyond me hearing something in passing and saying, “Hey, have you considered this thing?” Let’s be very clear here. That is the sign of a terrible junior consultant. “Well, it sounds like what you built sucked. Did you consider building something that didn’t suck?” “Oh, thanks, Professor. Really appreciate your pointing that out.” It’s one of those useful things.


Jason: It’s like, “Oh, wow, we’ve been doing this for, I don’t know, many, many years.” It’s like, “Oh, wow, yeah. I haven’t thought about that one yet.” [laugh].


Corey: So, it sounds like you’re relatively happy with how Redis has worked out for you as the primary data store. If you were doing it all again from scratch, would you make the same technology selection there or would you go in a different direction?


Jason: Yeah, I think I’d make the same decision. I mean, we’ve been using Redis on Flash for, at this point, three, maybe coming up on four years. There’s a reason we keep renewing our contract and just keep continuing with them: to us, it just fits our use case so well, and we very much choose to continue going with this direction and this technology.


Corey: What would you have them change as far as feature enhancements and new options being enabled there? Because remember, asking them right now in front of an audience like this puts them in a situation where they cannot possibly refuse. Please, how would you improve Redis from where it is now?


Jason: I like how you think. That’s [laugh] a [fair way to 00:28:42] to describe it. There’s a couple of things for optimizations that can always be done. And, like, specifically with, like, Redis on Flash, there’s some issue we had with storing as binary keys that to my knowledge hasn’t necessarily been completed yet, that basically prevents us from storing as binary, which has some amount of benefit because, well, binary keys require less memory to store. When you’re talking about 4 billion keys, even if you’re just saving 20 bytes per key, you’re talking about potentially hundreds of gigabytes of savings once you—


Corey: It adds up with the [crosstalk 00:29:13].

Jason: Yeah, it adds up pretty quick. [laugh]. So, that’s probably one of the big things that we’ve been in contact with them about fixing that hasn’t gotten there yet. The other thing is, like, there’s a couple of, like, random… gotchas that we had to learn along the way. It does add a little bit of complexity in our loading process.


Effectively, when you first write a value into the database, it’ll write to RAM, but then once it gets flushed to flash, the database effectively asks itself, “Does this value already exist in flash?” Because once it’s first written, it’s just written to RAM; it isn’t written to backing flash. And if it says, “No, it’s not,” the database then does a write to write it into flash and then evict it out of RAM. That sounds pretty innocent, but if it already exists in flash when you read it, it says, “Hey, I need to evict this; does it already exist in flash?” “Yep.” “Okay, just chuck it away. It already exists, we’re good.”


It sounds pretty nice, but this is where we accidentally halted our database: once we started putting a huge amount of load on the cluster—our general throughput on a peak day is somewhere in the order of 160,000 to 200,000 Redis operations per second—you start to think, hey, you might be evicting 100,000 values per second into flash, so you’re talking about an added 100,000 write operations per second into your cluster, and that accidentally halted our database. So, the way we actually get around this is, once we write our data store, we basically read the whole thing once, because if you read every single key, you pretty much guarantee to cycle everything into flash, so it doesn’t have to do any of those writes. For right now, there is no option to basically say—for our use case, we do very few writes except for upfront, so it’d be super nice for our use case if we could say, “Hey, for our write operations, I want you to actually do a full write-through to flash.” Because, you know, that would effectively cut our entire database prep in half. We would no longer have to do that read to cycle everything through. Those are probably the two big things, and one of the biggest gotchas that we ran into [laugh] that maybe isn’t so well known.
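
A minimal sketch of the “read everything once” pre-warm pass Jason describes doing after the initial load, so values get cycled into flash before production traffic arrives. The endpoint, batch size, and pipelining choices are assumptions, not Ekata’s actual tooling:

```python
# Pre-warm pass: touch every key once so values cycle into flash up front,
# avoiding an eviction-driven write storm under production load.
# Scan count and pipelining are assumptions for illustration.
import redis

r = redis.Redis(host="localhost", port=6379)


def prewarm(batch_size: int = 1_000) -> int:
    """Issue a pipelined GET for every key; return how many keys were read."""
    touched = 0
    pipe = r.pipeline(transaction=False)
    for key in r.scan_iter(count=batch_size):
        pipe.get(key)
        touched += 1
        if touched % batch_size == 0:
            pipe.execute()  # flush a batch of GETs
            pipe = r.pipeline(transaction=False)
    pipe.execute()          # flush the final partial batch
    return touched


print("pre-warmed", prewarm(), "keys")
```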


Corey: I really want to thank you for taking the time to speak with me today. If people want to learn more, where can they find you? And I will also theorize wildly that if you’re like basically every other company out there right now, you’re probably hiring for your team, too.


Jason: Yeah, I very much am hiring; I’m actually hiring quite a lot right now. [laugh]. So, they can reach me; my email is simply [email protected]. I, unfortunately, don’t have a Twitter handle. Or you can find me on LinkedIn. I’m pretty sure most people have LinkedIn nowadays.


But yeah, and also feel free to reach out if you’re also interested in learning more or in opportunities; like I said, I’m hiring quite extensively. Mine is specifically the team that builds the actual product APIs that we offer to customers, so a lot of the sort of latency optimizations that we do usually come through my team, in coordination with all the other teams, since we need to build a new API with this requirement. How do we get that requirement? [laugh]. Like, let’s go start exploring.


Corey: Excellent. I will, of course, throw a link to that in the [show notes 00:32:10] as well. I want to thank you for spending the time to speak with me today. I really do appreciate it.


Jason: Yeah. I appreciate you having me on. It’s been a good chat.


Corey: Likewise. I’m sure we will cross paths in the future, especially as we stumble through the wide world of, you know, data stores in AWS, and this ecosystem keeps getting bigger, but somehow feels smaller all the time.


Jason: Yeah, exactly. You know, we’ll still be where we are, hopefully, approving all of your transactions as they go through, making sure that you don’t run into any friction.


Corey: Thank you once again, for speaking to me, I really appreciate it.


Jason: No problem. Thanks again for having me.


Corey: Jason Frazier, Software Engineering Manager at Ekata. This has been a promoted episode brought to us by our friends at Redis. I’m Cloud Economist Corey Quinn and this is Screaming in the Cloud. If you’ve enjoyed this podcast, please leave a five-star review on your podcast platform of choice whereas if you’ve hated this podcast, please leave a five-star review on your podcast platform of choice along with an angry, insulting comment telling me that Enterprise Redis is ridiculous because you could build it yourself on a Raspberry Pi in only eight short months.


Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.


Announcer: This has been a HumblePod production. Stay humble.


Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: This episode is sponsored in part by our friends at Sysdig. Sysdig is the solution for securing DevOps. They have a blog post that went up recently about how an insecure AWS Lambda function could be used as a pivot point to get access into your environment. They’ve also gone deep in-depth with a bunch of other approaches to how DevOps and security are inextricably linked. To learn more, visit sysdig.com and tell them I sent you. That’s S-Y-S-D-I-G dot com. My thanks to them for their continued support of this ridiculous nonsense.

Corey: Today’s episode is brought to you in part by our friends at MinIO the high-performance Kubernetes native object store that’s built for the multi-cloud, creating a consistent data storage layer for your public cloud instances, your private cloud instances, and even your edge instances, depending upon what the heck you’re defining those as, which depends probably on where you work. It’s getting that unified is one of the greatest challenges facing developers and architects today. It requires S3 compatibility, enterprise-grade security and resiliency, the speed to run any workload, and the footprint to run anywhere, and that’s exactly what MinIO offers. With superb read speeds in excess of 360 gigs and 100 megabyte binary that doesn’t eat all the data you’ve gotten on the system, it’s exactly what you’ve been looking for. Check it out today at min.io/download, and see for yourself. That’s min.io/download, and be sure to tell them that I sent you.

Corey: Welcome to Screaming in the Cloud. I’m Corey Quinn. This one is a bit fun because it’s a promoted episode sponsored by our friends at Redis, but my guest does not work at Redis, nor has he ever. Jason Frazier is a Software Engineering Manager at Ekata, a Mastercard company, which I feel, like, that should have some sort of, like, music backstopping into it just because, you know, large companies always have that magic sheen on it. Jason, thank you for taking the time to speak with me today.

Jason: Yeah. Thanks for inviting me. Happy to be here.

Corey: So, other than the obvious assumption, based upon the fact that Redis is kind enough to be sponsoring this episode, I’m going to assume that you’re a Redis customer at this point. But I’m sure we’ll get there. Before we do, what is Ekata? What do you folks do?

Jason: So, the whole idea behind Ekata is—I mean, if you go to our website, our mission statement is, “We want to be the global leader in online identity verification.” What that really means is, in more increasingly digital world, when anyone can put anything they want into any text field they want, especially when purchasing anything online—

Corey: You really think people do that? Just go on the internet and tell lies?

Jason: I know. It’s shocking to think that someone could lie about who they are online. But that’s sort of what we’re trying to solve specifically in the payment space. Like, I want to buy a new pair of shoes online, and I enter in some information. Am I really the person that I say I am when I’m trying to buy those shoes? To prevent fraudulent transactions. That’s really one of the basis that our company goes on is trying to reduce fraud globally.

Corey: That’s fascinating just from the perspective of you take a look at cloud vendors at the space that I tend to hang out with, and a lot of their identity verification of, is this person who they claim to be, in fact, is put back onto the payment providers. Take Oracle Cloud, which I periodically beat up but also really enjoy aspects of their platform on, where you get to their always free tier, you have to provide a credit card. Now, they’ll never charge you anything until you affirmatively upgrade the account, but—“So, what do you do need my card for?” “Ah, identity and fraud verification.” So, it feels like the way that everyone else handles this is, “Ah, we’ll make it the payment networks’ problem.” Well, you’re now owned by Mastercard, so I sort of assume you are what the payment networks, in turn, use to solve that problem.

Jason: Yeah, so basically, one of our flagship products and things that we return is sort of like a score, from 0 to 400, on how confident we are that this person is who they are. And it’s really about helping merchants help determine whether they should either approve, or deny, or forward on a transaction to, like, a manual review agent. As well as there’s also another use case that’s even more popular, which is just, like, account creation. As you can imagine, there’s lots of bots on everyone’s [laugh] favorite app or website and things like that, or customers offer a promotion, like, “Sign up and get $10.”

Well, I could probably get $10,000 if I make a thousand random accounts, and then I’ll sign up with them. But, like, make sure that those accounts are legitimate accounts, that’ll prevent, like, that sort of promo abuse and things like that. So, it’s also not just transactions. It’s also, like, account openings and stuff, make sure that you actually have real people on your platform.

Corey: The thing that always annoyed me was the way that companies decide, oh, we’re going to go ahead and solve that problem with a CAPTCHA on it. It’s, “No, no, I don’t want to solve machine learning puzzles for Google for free in order to sign up for something. I am the customer here; you’re getting it wrong somewhere.” So, I assume, given the fact that I buy an awful lot of stuff online, but I don’t recall ever seeing anything branded with Ekata that you do this behind the scenes; it is not something that requires human interaction, by which I mean, friction.

Jason: Yeah, for sure. Yeah, yeah. It’s behind the scenes. That’s exactly what I was about to segue to is friction, is trying to provide a frictionless experience for users. In the US, it’s not as common, but when you go into Europe or anything like that, it’s fairly common to get confirmations on transactions and things like that.

You may have to, I don’t know text—or get a code text or enter that online to basically say, like, “Yes, I actually received this.” But, like, helping—and the reason companies do that is for that, like, extra bit of security and assurance that that’s actually legitimate. And obviously, companies would like to prefer not to have to do that because, I don’t know, if I’m trying to buy something, this website makes me do something extra, the site doesn’t make me do anything extra, I’m probably going to go with that one because it’s just more convenient for me because there’s less friction there.

Corey: You’re obviously limited in how much you can say about this, just because it’s here’s a list of all the things we care about means that great, you’ve given me a roadmap, too, of things to wind up looking at. But you have an example or two of the sort of the data that you wind up analyzing to figure out the likelihood that I’m a human versus a robot.

Jason: Yeah, for sure. I mean, it’s fairly common across most payment forms. So, things like you enter in your first name, your last name, your address, your phone number, your email address. Those are all identity elements that we look at. We have two data stores: We have our Identity Graph and our Identity Network.

The Identity Graph is what you would probably think of it, if you think of a web of a person and their identity, like, you have a name that’s linked to a telephone, and that name is also linked to an address. But that address used to have previous people living there, so on and so forth. So, the various what we just call identity elements are the various things we look at. It’s fairly common on any payment form, I’m sure, like, if you buy something on Amazon versus eBay or whatever, you’re probably going to be asked, what’s your name? What’s your address? What’s your email address? What’s your telephone?

Corey: It’s one of the most obnoxious parts of buying things online from websites I haven’t been to before. It’s one of the genius ideas behind Apple Pay and the other centralized payment systems. Oh, yeah. They already know who you are. Just click the button, it’s done.

Jason: Yeah, even something as small as that. I mean, it gets a little bit easier with, like, form autocompletes and stuff like, oh, just type J and it’ll just autocomplete everything for me. That’s not the worst of the world, but it is still some amount of annoyance and friction. [laugh].

Corey: So, as I look through all this, it seems like one of the key things you’re trying to do since it’s in line with someone waiting while something is spinning in their browser, that this needs to be quick. It also strikes me that this is likely not something that you’re going to hit the same people trying to identify all the time—if so, that is its own sign of fraud—so it doesn’t really seem like something can be heavily cached. Yet you’re using Redis, which tells me that your conception of how you’re using it might be different than the mental space that I put Redis into what I’m thinking about where this ridiculous architecture diagram is the Redis part going to go?

Jason: Yeah, I mean, like, whenever anyone says Redis, thinks of Redis, I mean, even before we went down this path, you always think of, oh, I need a cache, I’ll just stuff in Redis. Just use Redis as a cache here and there. I don’t know, some small—I don’t know, a few tens, hundreds gigabytes, maybe—cache, spin that up, and you’re good. But we actually use Redis as our primary data store for our Identity Graph, specifically for the speed that we can get. Because if you’re trying to look for a person, like, let’s say you’re buying something for your brother, how do we know if that’s true or not? Because you have this name, you’re trying to send it to a different address, like, how does that make sense? But how do we get from Corey to an address? Like, oh, maybe used to live with your brother?

Corey: It’s funny, you pick that as your example; my brother just moved to Dublin, so it’s the whole problem of how do I get this from me to someone, different country, different names, et cetera? And yeah, how do you wind up mapping that to figure out the likelihood that it is either credit card fraud, or somebody actually trying to be, you know, a decent brother for once in my life?

Jason: [laugh]. So, I mean, how it works is how you imagine you start at some entry point, which would probably be your name, start there and say, “Can we match this to this person’s address that you believe you’re sending to?” And we can say, “Oh, you have a person-person relationship, like he’s your brother.” So, it maps to him, which we can then get his address and say, “Oh, here’s that address. That matches what you’re trying to send it to. Hey, this makes sense because you have a legitimate reason to be sending something there. You’re not just sending it to some random address out in the middle of nowhere, for no reason.”

Corey: Or the drop-shipping scams, or brushing scams, or one of—that’s the thing is every time you think you’ve seen it all, all you have to do is look at fraud. That’s where the real innovation seems to be happening, [laugh] no matter how you slice it.

Jason: Yeah, it’s quite an interesting space. I always like to say it’s one of those things where if you had the human element in it, it’s not super easy, but it’s like, generally easy to tell, like, okay, that makes sense, or, oh, no, that’s just complete garbage. But trying to do it at scale very fast in, like, a general case becomes an actual substantially harder problem. [laugh]. It’s one of those things that people can probably do fairly well—I mean, that’s why we still have manual reviews and things like that—but trying to do it automatically or just with computers is much more difficult. [laugh].

Corey: Yeah, “Hee hee, I scammed a company out of 20 bucks is not the problem you’re trying to avoid for.” It’s the, “Okay, I just did that ten million times and now we have a different problem.”

Jason: Yeah, exactly. I mean, one of the biggest losses for a lot of companies is, like, fraudulent transactions and chargebacks. Usually, in the case on, like, e-commerce companies—or even especially like nowadays where, as you can imagine, more people are moving to a more online world and doing shopping online and things like that, so as more people move to online shopping, some companies are always going to get some amount of chargebacks on fraudulent transactions. But when it happens at scale, that’s when you start seeing many losses because not only are you issuing a chargeback, you probably sent out some products, that you’re now out some physical product as well. So, it’s almost kind of like a double-whammy. [laugh].

Corey: So, as I look through all this, I tended to always view Redis in terms of, more or less, a key-value store. Is that still accurate? Is that how you wind up working with it? Or has it evolved significantly past that, to the point where you can now do relational queries against it?

Jason: Yeah, so we do use Redis as a key-value store because Redis is just a traditional key-value store with very fast lookups. When we first started building out the Identity Graph, as you can imagine, you’re trying to model people to telephones to addresses; your first thought is, “Hey, this sounds a whole lot like a graph.” That’s what we did quite a few years ago: let’s just put it in some graph database. But as time went on and it became much more important to have lower and lower latency, we really started thinking, we don’t really need all the nice and shiny things that a graph database or some sort of graph technology really offers you. All we really need to do is get from point A to point B, and that’s it.

Corey: Yeah, [unintelligible 00:10:35] graph database, what’s the first thing I need to do? Well, spend six weeks in school trying to figure out exactly what the hell a graph database is, because they’re challenging to wrap your head around at the best of times. Then it just always seemed overpowered for a lot of—I don’t want to say simple use cases; what you’re doing is not simple, but it doesn’t seem to be leveraging the higher-order advantages that a graph database tends to offer.

Jason: Yeah, it added a lot of complexity in the system, and [laugh] me and one of our senior principal engineers who’s been here for a long time, we always have a joke: If you search our GitHub repository for, we’ll say, kindly-worded commit messages, you can see a very large correlation between those types of commit messages and all the commits trying to use a graph database from multiple years ago. It was not fun to work with, it just added too much complexity, and we just didn’t need all that shiny stuff. So, that’s when we really just took a step back and said, we really need to rethink how we do this. We ended up effectively flattening the entire graph into an adjacency list.

So, a key is basically some UUID for an entity. So, Corey, you’d have some UUID associated with you, and the value would be whatever your information is, as well as other UUIDs linking to the other entities. So, from that first retrieval, I can now unpack it, and, “Oh, now I have a whole bunch of other UUIDs I can then query to get that information, which will then have more IDs associated with it.” That’s more or less how we do our graph traversal and our graph queries.
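To make that flattened-graph idea concrete, here is a minimal sketch of what an adjacency list over a plain key-value store can look like, written in Python with the redis-py client. The key layout, field names, and traversal helper are assumptions made up for illustration, not Ekata’s actual schema or code.

```python
# Illustrative only: a flattened "identity graph" stored as UUID keys whose JSON
# values carry both the entity's attributes and the UUIDs of linked entities.
import json
from collections import deque

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Hypothetical entities: a person, their brother, and the brother's address.
r.set("person:1111", json.dumps({"name": "Corey", "links": ["person:2222"]}))
r.set("person:2222", json.dumps({"name": "Brother", "links": ["address:3333"]}))
r.set("address:3333", json.dumps({"street": "1 Example Rd, Dublin", "links": []}))

def traverse(start_key, max_depth=3):
    """Breadth-first traversal: every hop is just another GET on a UUID key."""
    seen = {start_key}
    queue = deque([(start_key, 0)])
    results = []
    while queue:
        key, depth = queue.popleft()
        raw = r.get(key)
        if raw is None:
            continue
        entity = json.loads(raw)
        results.append(entity)
        if depth == max_depth:
            continue
        for linked in entity.get("links", []):
            if linked not in seen:
                seen.add(linked)
                queue.append((linked, depth + 1))
    return results

# Walk from the person entity out to the address they are linked to.
print(traverse("person:1111"))
```

In practice you would likely batch each level of lookups with MGET instead of issuing one GET per key, but the shape of the traversal is the same: unpack a value, collect the linked UUIDs, fetch those, repeat.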

Corey: One of the fun things about doing this sort of interview dance on the podcast as long as I have is you start to pick up what people are saying by virtue of what they don’t say. Earlier, you wound up mentioning that we often use Redis for things like tens or hundreds of gigabytes, which sort of leaves in my mind the strong implication that you’re talking about something significantly larger than that. Can you disclose the scale of data we’re talking about here?

Jason: Yeah. So, we use Redis as our primary data store for our Identity Graph, and soon also for our Identity Network, which is our other database. But specifically for our Identity Graph, the scale we’re talking about: we do have some compression added on there, but uncompressed it’s about 12 terabytes of data, which compresses, with replication, down to about four.

Corey: That’s a relatively decent compression factor, given that I imagine we’re not talking about huge datasets.

Jason: Yeah, so this is actually basically driven directly by cost: If you need to store less data, then you need less memory, therefore, you need to pay for less.

Corey: So, you have once again shored up my longtime argument that when it comes to cloud, cost and architecture are in fact the same thing. Please, continue by all means.

Jason: I would be lying if I said that we didn’t do weekly slash monthly reviews of costs. Where are we spending costs in AWS? How can we improve costs? How can we cut down on costs? How can you store less—

Corey: You are singing my song.

Jason: It is a [laugh] it is a constant discussion. But yeah, so we use Zstandard compression, which was developed at Facebook, and it’s a dictionary-based compression. And the reason we went for this is—I mean like if I say I want to compress, like, a Word document down, like, you can get very, very, very high level of compression. It exists. It’s not that interesting, everyone does it all the time.

But with this, we’re talking about—in that basically four or so terabytes of compressed data that we have, it’s somewhere around four to four-and-a-half billion keys and values, so each key-value only really has anywhere between 50 and 100 bytes. So, we’re not compressing very large pieces of information. We’re compressing very small, 50-to-100-byte JSON values; we have UUID keys and JSON strings stored as values. So, we’re compressing these 50-to-100-byte JSON strings with around 70 to 80% compression. And that’s using Zstandard with a custom dictionary, which probably gave us the biggest cost savings of all; if you can [unintelligible 00:14:32] your dataset size by 60 to 70%, that’s huge. [laugh].
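For a sense of how a trained dictionary helps at those value sizes, here is a small sketch using the python-zstandard bindings. The sample values, dictionary size, and compression level are all made up for illustration; they are not Ekata’s actual data or settings.

```python
# Illustrative only: train a Zstandard dictionary on many small, similarly
# shaped JSON values, then compare plain vs. dictionary-based compression.
import json
import random

import zstandard as zstd

# Thousands of tiny JSON values, roughly 50-100 bytes each, sharing structure.
samples = [
    json.dumps({
        "id": f"{random.getrandbits(64):016x}",
        "type": random.choice(["person", "address", "phone"]),
        "links": [f"{random.getrandbits(64):016x}"],
    }).encode()
    for _ in range(5000)
]

# Train a dictionary that captures the repeated structure across values.
dictionary = zstd.train_dictionary(16 * 1024, samples)

plain = zstd.ZstdCompressor(level=3)
with_dict = zstd.ZstdCompressor(level=3, dict_data=dictionary)

value = samples[0]
print("raw bytes:       ", len(value))
print("plain zstd:      ", len(plain.compress(value)))      # often barely shrinks
print("dictionary zstd: ", len(with_dict.compress(value)))  # typically much smaller

# Decompression must use the same dictionary the value was compressed with.
decompressor = zstd.ZstdDecompressor(dict_data=dictionary)
assert decompressor.decompress(with_dict.compress(value)) == value
```

The point of the dictionary is that values this small carry almost no internal redundancy on their own, so general-purpose compression has little to work with; the trained dictionary supplies the shared structure up front.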

Corey: Did you start off doing this on top of Redis, or was this an evolution that eventually got you there?

Jason: It was an evolution over time. We were formerly Whitepages. I mean, Whitepages started back in the late-90s. It really just started off as a—we just—

Corey: You were a very early adopter of Redis [laugh]. Yeah, at that point, like, “We got a time machine and started using it before it existed.” Always a fun story. Recruiters seem to want that all the time.

Jason: Yeah. So, when we first started, I mean, we didn’t have that much data. It was basically just one provider that gave us some amount of data, so it was kind of just a—we just need to start something quick, get something going. And so, I mean, we just did what most people do just do the simplest thing: Just stuff it all in a Postgres database and call it good. Yeah, it was slow, but hey, it was back a long time ago, people were kind of okay with a little bit—

Corey: The world moved a bit slower back then.

Jason: Everything was a bit slower, no one really minded too much, the scale wasn’t that large. But business requirements always change over time and they evolve, and so to meet those ever-evolving business requirements, we move from Postgres, and where a lot of the fun commit messages that I mentioned earlier can be found is when we started working with Cassandra and Titan. That was before my time before I had started, but from what I understand, that was a very fun time. But then from there, that’s when we really kind of just took a step back and just said, like, “There’s so much stuff that we just don’t need here. Let’s really think about this, and let’s try to optimize a bit more.”

Like, we know our use case, why not optimize for our use case? And that’s how we ended up with the flattened graph storage stuffing into Redis. Because everyone thought of Redis as a cache, but everyone also knows that—why is it a cache? Because it’s fast. [laugh]. We need something that’s very fast.

Corey: I still conceptualize it as an in-memory data store, just because when I turned on the disk persistence model back in 2011, give or take, it suddenly started slamming the entire data store to a halt for about three seconds every time it did it. It was, “What’s this piece of crap here?” And it was, “Oh, yeah. Turns out there was a regression in Xen, which is what AWS used as a hypervisor back then.” And, “Oh, yeah.”

So, fork became an expensive call; it took forever to wind up running. So, the obvious lesson we took from this was, oh yeah, Redis is not designed to be used with disk persistence. Wrong lesson to take from the behavior, but it did cement, in my mind at least, the idea that this is something that we tend to use only as an in-memory store. It’s clear that the technology has evolved, and in fact, I’m super glad that Redis threw you my direction to talk to you about this stuff because until talking to you, I was still—I’ve got to admit—sort of in the position of thinking of it as an in-memory data store. Because if Redis says otherwise because they’re envisioning it being something else, well okay, marketers gonna market. You’re a customer; it’s a lot harder for me to talk smack about your approach to this thing when I see you doing it for, let’s be serious here, what is a very important use case. If identity verification starts failing open and everyone claims to be who they say they are, that’s something that’s visible from orbit when it comes to the macroeconomic effect.

Jason: Yeah, exactly. It’s actually funny because before we moved to primarily just using Redis, before going fully Redis, we did still use Redis, but we used ElastiCache. We had it loaded into ElastiCache, but we also had it loaded into DynamoDB as sort of a, I don’t want this to fail, because we weren’t comfortable with actually using Redis as a primary database. So, we used to use ElastiCache with a fallback to DynamoDB, just for that off chance, which, you know, sometimes happened, sometimes didn’t. But that’s when we basically went searching for new technologies, and that’s actually how we landed on Redis on Flash, which kind of breaks the whole idea of Redis as an in-memory database: it’s Redis, but it’s not just an in-memory database, you also have flash-backed storage.

Corey: So, you’ll forgive me if I combine my day job with this side project of mine, where I fix the horrifying AWS bills for large companies. My bias, as a result, is to look at infrastructure environments primarily through the lens of the AWS bill. And oh, great, go ahead and use an enterprise offering that someone else runs because, sure, it might cost more money, but it’s not showing up on the AWS bill, therefore my job is done. Yeah, it turns out that doesn’t actually work, or the answer to every AWS billing problem would be to migrate to Azure or GCP. Turns out that doesn’t actually solve the problem that you would expect.

But you’re obviously an enterprise customer of Redis. Does that data live in your AWS account? Is it something they run as a managed service and throw over the wall, so it shows up as data transfer on your side? How is that implemented? I know they’ve got a few different models.

Jason: There’s a couple of aspects to how we’re actually billed. I mean, so like, when you have ElastiCache, you’re just billed for whatever nodes you’re using, cache.r5 or whatever they are… [unintelligible 00:19:12]

Corey: I wish most people were using things that modern. But please, continue.

Jason: But yeah, so you’re basically just billed for whatever ElastiCache nodes you have; you have your hourly rate, and maybe you might reserve them. But with Redis Enterprise, the way that we’re billed is there’s two aspects. One is the contract that we signed that basically allows us to use their technology [unintelligible 00:19:31] with a managed service, a managed solution. So, there’s some amount that we pay them directly within some contract, as well as the actual nodes themselves that exist in the cluster. And basically the way that this is set up is we effectively have a sub-account within our AWS account that Redis Labs—or not Redis Labs; Redis Enterprise—has access to, which they deploy directly into, and effectively using VPC peering; that’s how we allow our applications to talk directly to it.

So, we’re billed directly; the actual nodes of the cluster, which are i3.8x, I believe, basically just run as EC2 instances. All of those instances exist on our bill: we get billed for them; we pay for them. It’s just basically some sub-account that they have access to and can deploy into. So, we get billed for the instances of the cluster as well as whatever we pay for our enterprise contract. So, there’s sort of two aspects to the actual billing of it.

Corey: This episode is sponsored in part by our friends at Vultr. Spelled V-U-L-T-R because they’re all about helping save money, including on things like, you know, vowels. So, what they do is they are a cloud provider that provides surprisingly high performance cloud compute at a price that—while sure they claim it’s better than AWS pricing—and when they say that they mean it is less money. Sure, I don’t dispute that, but what I find interesting is that it’s predictable. They tell you in advance on a monthly basis what it’s going to cost. They have a bunch of advanced networking features. They have nineteen global locations and scale things elastically. Not to be confused with openly, because apparently elastic and open can mean the same thing sometimes. They have had over a million users. Deployments take less than sixty seconds across twelve pre-selected operating systems. Or, if you’re one of those nutters like me, you can bring your own ISO and install basically any operating system you want. Starting with pricing as low as $2.50 a month for Vultr cloud compute, they have plans for developers and businesses of all sizes, except maybe Amazon, who stubbornly insists on having something to scale all on their own. Try Vultr today for free by visiting: vultr.com/screaming, and you’ll receive $100 in credit. That’s V-U-L-T-R.com slash screaming.

Corey: So, it’s easy to sit here as an engineer—and believe me, having been one for most of my career, I fall subject to this bias all the time—where it’s, “Oh, you’re going to charge me a management fee to run this thing? That’s ridiculous. I can do it myself instead,” because, at least when I was learning in my dorm room, it was always a “Well, my time is free, but money is hard to come by.” And shaking off that perspective as my career continued to evolve was always a bit of a challenge for me. Do you ever find yourself or your team drifting toward the direction of, “Well, what are we paying Redis Enterprise for? We could just run it ourselves with the open-source version and save whatever it is that they’re charging on top of that”?

Jason: Before we landed on Redis on Flash, we had that same thought: “Why don’t we just run our own Redis?” And the answer to that is, well, managing such a large cluster that’s so important to the function of our business, you effectively would have needed to hire someone full time to just sit there and stare at the cluster the whole time, to operate it, maintain it, and make sure things are running smoothly. So, we made a decision that, no, we’re going to go with a managed solution. It’s not easy to manage and maintain clusters of that size, especially when they’re so important to business continuity. [laugh]. In our eyes, it was just not worth the investment to try and manage it ourselves, so we went with the fully managed solution.

Corey: But even when we talk about it, everyone talks about, like, the wrong side of it first: the, oh, it’s easier when things are down if we’re able to say, “Oh, we have a ticket open,” rather than, “I’m on the support forum and waiting for people to get back to me.” Like, there’s a defensibility perspective. We all just sort of sidestep past the real truth of it: yeah, the people who are the best in the world at running and building these things are working on the problem right now, when there is one.

Jason: Yeah, they’re the best in the world at trying to solve what’s going on. [laugh].

Corey: Yeah, because that is what we’re paying them to do. Oh, right. People don’t always volunteer for for-profit entities. I keep forgetting that part of it.

Jason: Yeah, I mean, we’ve had some very, very fun production outages that just randomly happened where, to our knowledge, it was just, like… “I have no idea what’s going on.” And, you know, working with their support team, their DevOps team, honestly, it was a good, like, one-week troubleshooting exercise. When we were validating the technology, we accidentally halted the database for seemingly no reason, and we couldn’t possibly figure out what was going on. We kept talking to their DevOps team. They’re saying, “Oh, we see all these writes going on for some reason.” We’re like, “We’re not sending any writes. Why are there writes?”

And that was the whole back-and-forth for almost a week, trying to figure out what the heck was going on, and it happened to be, like, a very subtle case in terms of how the keys and values are actually stored between RAM and flash and how they might swap in and out of flash. And, like, all the way down to that level, where I want to say we probably talked to their DevOps team at least two to three times, like, “Could you just explain this to me?” Like, “Sure.” Like, “Why does this happen? I didn’t know this was a thing.” So on and so forth. Like, there are definitely some things that are fairly difficult to try and debug, which is where having that enterprise-level solution definitely helps.

Corey: Well, that’s the most valuable thing in any sort of operational experience where, okay, I can read the documentation and all the other things, and it tells me how it works. Great. The real value of whether I trust something in production is whether or not I know how it breaks where it’s—

Jason: Yeah.

Corey: —okay—because the one thing you want to hear when you’re calling someone up is, “Oh, yeah. We’ve seen this before. This is what you do to fix it.” The worst thing in the world is, “Oh, that’s interesting. We’ve never seen that before.” Because then oh, dear Lord, we’re off in the mists of trying to figure out what’s going on here, while production is down.

Jason: Yeah, kind of like, “What does this database do, like, in terms of what do we do?” Like, I mean, this is what we store our Identity Graph in. This has the graph of people’s information. If we’re trying to do identity verification for transactions or anything, for any of our products, I mean, we need to be able to query this database. It needs to be up.

We have a certain requirement in terms of uptime, where we want it at least, like, four nines of uptime. So, we also want a solution that, hey, even if it wants to break, don’t break that bad. [laugh]. There’s a difference between, “Oh, a node failed and okay, like, we’re good in 10, 20 seconds,” versus, “Oh, node failed. You lost data. You need to start reloading your dataset, or you can’t query this anymore.” [laugh]. There’s a very large difference between those two.

Corey: A little bit, yeah. That’s also a great story to drive things across. Like, “Really? What is this going to cost us if we pay for the enterprise version? Great. Is it going to be more than some extortionately large number? Because if we’re down for three hours in the course of a year, that’s what we owe our customers back for not being able to deliver, so it seems to me this is kind of a no-brainer for things like that.”

Jason: Yeah, exactly. And, like, that’s part of the reason—I mean, a lot of the things we do at Ekata, we usually go with enterprise-level for a lot of things we do. And it’s really for that support factor in helping reduce any potential downtime for what we have because, well, if we don’t consider ourselves comfortable or expert-level in that subject, I mean, then yeah, if it goes down, that’s terrible for our customers. I mean, it’s needed for literally every single query that comes through us.

Corey: I did want to ask you: you keep talking about “the database” and “the cluster.” That seems like you have a single database or a single cluster that winds up being responsible for all of this. That feels like the blast radius of that thing going down must be enormous. Have you done any research into breaking that out into smaller databases? What is it that’s driven you toward this architectural pattern?

Jason: Yeah, so right now we actually have three regions we’re deployed into. We have a copy of it in us-west in AWS, we have one in eu-central-1, and we also have one in ap-southeast-1. So, we have a complete copy of this database in three separate regions, and we’re spread across all the availability zones available for that region. So, we try to be as multi-AZ as we can within a specific region. So, we have thought about breaking it down, but having high availability, having multiple replication factors, and having it stored in multiple data centers provides us at least a good level of comfort.

Specifically, in the US, we actually have two clusters. With a lot of the cost savings that we got, we have one that literally sits idle 24/7 that we just call our backup and our standby, where it’s ready to go at a moment’s notice. Thankfully, we haven’t had to use it since its creation, I want to say about a year-and-a-half ago, but it sits there for that doomsday scenario: “Oh, my gosh, this cluster literally cannot function anymore. Something crazy catastrophic happened,” and we can basically hot-swap over to another production-ready cluster if needed.

Because the really important thing is that if we broke it up into two separate databases and one of them goes down, that could still fail your entire query. Because what if that’s the database that held your address? We can still query you, but when we try to get your address, well, there, your traversal just died because you can no longer get that. So, even trying to break it up doesn’t really help us too much; we could still fail the entire traversal query.

Corey: Yeah, which makes an awful lot of sense. Again, to be clear, you’ve obviously put thought into this that goes way beyond me hearing something in passing and saying, “Hey, have you considered this thing?” Let’s be very clear here. That is the sign of a terrible junior consultant. “Well, it sounds like what you built sucked. Did you consider building something that didn’t suck?” “Oh, thanks, Professor. Really appreciate your pointing that out.” It’s one of those useful things.

Jason: It’s like, “Oh, wow, we’ve been doing this for, I don’t know, many, many years.” It’s like, “Oh, wow, yeah. I haven’t thought about that one yet.” [laugh].

Corey: So, it sounds like you’re relatively happy with how Redis has worked out for you as the primary data store. If you were doing it all again from scratch, would you make the same technology selection there or would you go in a different direction?

Jason: Yeah, I think I’d make the same decision. I mean, we’ve been using Redis on Flash for three, maybe coming up on four years at this point. There’s a reason we keep renewing our contract and continuing with them: to us, it just fits our use case so well, and we very much choose to keep going in this direction with this technology.

Corey: What would you have them change as far as feature enhancements and new options being enabled there? Because remember, asking them right now in front of an audience like this puts them in a situation where they cannot possibly refuse. Please, how would you improve Redis from where it is now?

Jason: I like how you think. That’s [laugh] a [fair way to 00:28:42] describe it. There’s a couple of optimizations that can always be done. And specifically with Redis on Flash, there’s an issue we had with storing keys as binary that, to my knowledge, hasn’t been completed yet, and that basically prevents us from storing them as binary, which would have some amount of benefit because binary keys require less memory to store. When you’re talking about 4 billion keys, even if you’re just saving 20 bytes per key, you’re talking about potentially hundreds of gigabytes of savings once you—

Corey: It adds up with the [crosstalk 00:29:13].

Jason: Yeah, it adds up pretty quick. [laugh]. So, that’s probably one of the big things that we’ve been in contact with them about fixing that hasn’t gotten there yet. The other thing is, like, there’s a couple of, like, random… gotchas that we had to learn along the way. It does add a little bit of complexity in our loading process.

Effectively, when you first write a value into the database, it’s written to RAM; it isn’t written to the backing flash yet. Then, once it’s time to evict it to flash, the database effectively asks itself, “Does this value already exist in flash?” If the answer is, “No, it doesn’t,” the database does a write to put it into flash and then evicts it out of RAM. That sounds pretty innocent. And if it already exists in flash, say because you read it back into RAM, the eviction goes, “Hey, I need to evict this; does it already exist in flash?” “Yep.” “Okay, just chuck it away. It already exists, we’re good.”

It sounds pretty nice, but this is where we accidentally halted our database. Once we started putting a huge amount of load on the cluster (our general throughput on a peak day is somewhere on the order of 160 to 200,000 Redis operations per second), you might be evicting 100,000 values per second into flash, which means an added 100,000 write operations per second into your cluster, and that accidentally halted our database. So, the way we actually work around this is that once we write our data store, we basically read the whole thing once, because if you read every single key, you pretty much guarantee everything gets cycled into flash, so it doesn’t have to do any of those writes later. Right now, there is no option to say, “Hey, for our write operations, I actually want you to do a full write-through to flash.” For our use case, we do very few writes except for upfront, so that would be super nice for us: it would effectively cut our entire database prep in half, and we’d no longer have to do that read pass to cycle everything through. Those are probably the two big things, and one of the biggest gotchas that we ran into [laugh] that maybe isn’t so well known.
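For reference, the “read everything once after the bulk load” warm-up Jason describes can be as simple as a full key scan. Here is a hedged sketch in Python with redis-py; the connection details, batch size, and the assumption that an MGET is enough to touch each value are illustrative, not Ekata’s actual tooling.

```python
# Illustrative only: after a bulk load, read every key once so values get
# cycled out to flash up front, instead of triggering a flood of flash writes
# under production traffic. Connection details and batch size are assumptions.
import redis

r = redis.Redis(host="localhost", port=6379)

def prewarm(batch_size=500):
    """Scan the whole keyspace and read values in batches with MGET."""
    touched = 0
    batch = []
    for key in r.scan_iter(count=batch_size):
        batch.append(key)
        if len(batch) >= batch_size:
            r.mget(batch)  # read-only; we only need each value to be touched
            touched += len(batch)
            batch = []
    if batch:
        r.mget(batch)
        touched += len(batch)
    return touched

print(f"warm-up pass touched {prewarm()} keys")
```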

Corey: I really want to thank you for taking the time to speak with me today. If people want to learn more, where can they find you? And I will also theorize wildly, that if you’re like basically every other company out there right now, you’re probably hiring on your team, too.

Jason: Yeah, I very much am hiring; I’m actually hiring quite a lot right now. [laugh]. So, they can reach me, my email is simply [email protected]. I unfortunately, don’t have a Twitter handle. Or you can find me on LinkedIn. I’m pretty sure most people have LinkedIn nowadays.

But yeah, also feel free to reach out if you’re interested in learning more or in opportunities; like I said, I’m hiring quite extensively. I’m specifically on the team that builds the actual product APIs that we offer to customers, so a lot of the latency optimizations that we do usually come through my team, in coordination with all the other teams, since we need to build a new API with this requirement. How do we get that requirement? [laugh]. Like, let’s go start exploring.

Corey: Excellent. I will, of course, throw a link to that in the [show notes 00:32:10] as well. I want to thank you for spending the time to speak with me today. I really do appreciate it.

Jason: Yeah. I appreciate you having me on. It’s been a good chat.

Corey: Likewise. I’m sure we will cross paths in the future, especially as we stumble through the wide world of, you know, data stores in AWS, and this ecosystem keeps getting bigger, but somehow feels smaller all the time.

Jason: Yeah, exactly. You know, we’ll still be where we are, hopefully approving all of your transactions as they go through and making sure that you don’t run into any friction.

Corey: Thank you once again, for speaking to me, I really appreciate it.

Jason: No problem. Thanks again for having me.

Corey: Jason Frazier, Software Engineering Manager at Ekata. This has been a promoted episode brought to us by our friends at Redis. I’m Cloud Economist Corey Quinn and this is Screaming in the Cloud. If you’ve enjoyed this podcast, please leave a five-star review on your podcast platform of choice whereas if you’ve hated this podcast, please leave a five-star review on your podcast platform of choice along with an angry, insulting comment telling me that Enterprise Redis is ridiculous because you could build it yourself on a Raspberry Pi in only eight short months.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.

Announcer: This has been a HumblePod production. Stay humble.
