Shinji Kim, CEO and Co-Founder of Select Star, joins Corey to talk about the fast-growing world of data discovery. Shinji presents the question that Select Star answers, “How discoverable is your data?” and explains how Select Star is differentiating itself in a space where new players are appearing all the time. Corey and Shinji talk about the needs of data discovery clients ranging from “I need a database” to “I have too many databases,” and how vital it is to understand what data is actually being used to avoid overpaying for data storage or, worse, deleting data that’s vital to your organization. Listen in to find out why data discovery is becoming more essential and the impact of making better use of your data.
Shinji Kim is the Founder & CEO of Select Star, an automated data discovery platform that helps you to understand & manage your data. Previously, she was the Founder & CEO of Concord Systems, a NYC-based data infrastructure startup acquired by Akamai Technologies in 2016. She led the strategy and execution of Akamai IoT Edge Connect, an IoT data platform for real-time communication and data processing of connected devices. Shinji studied Software Engineering at University of Waterloo and General Management at Stanford GSB.
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.
Corey: This episode is sponsored in part by our friends at AWS AppConfig. Engineers love to solve, and occasionally create, problems. But not when it’s an on-call fire drill at 4 in the morning. Software problems should drive innovation and collaboration, NOT stress, and sleeplessness, and threats of violence. That’s why so many developers are realizing the value of AWS AppConfig Feature Flags. Feature Flags let developers push code to production, but hide that feature from customers so that the developers can release their feature when it’s ready. This practice allows for safe, fast, and convenient software development. You can seamlessly incorporate AppConfig Feature Flags into your AWS or cloud environment and ship your features with excitement, not trepidation and fear. To get started, go to snark.cloud/appconfig. That’s snark.cloud/appconfig.
Corey: I come bearing ill tidings. Developers are responsible for more than ever these days. Not just the code that they write, but also the containers and the cloud infrastructure that their apps run on. Because serverless means it’s still somebody’s problem. And a big part of that responsibility is app security from code to cloud. And that’s where our friend Snyk comes in. Snyk is a frictionless security platform that meets developers where they are, finding and fixing vulnerabilities right from the CLI, IDEs, repos, and pipelines. Snyk integrates seamlessly with AWS offerings like CodePipeline, EKS, ECR, and more! As well as things you’re actually likely to be using. Deploy on AWS, secure with Snyk. Learn more at Snyk.co/scream
Corey: Welcome to Screaming in the Cloud. I’m Corey Quinn. Every once in a while, I encounter a company that resonates with something that I’ve been doing on some level. In this particular case, that is what’s happened here, but the story is slightly different. My guest today is Shinji Kim, who’s the CEO and founder at Select Star.
And the joke that I was making a few months ago was that Select Stars should have been the name of the Oracle ACE program instead. Shinji, thank you for joining me and suffering my ridiculous, basically amateurish, sophomoric database-level jokes, because I am bad at databases. Thanks for taking the time to chat with me.
Shinji: Thanks for having me here, Corey. Good to meet you.
Corey: So, Select Star, despite its name being the only query pattern that I’ve ever effectively been able to execute from memory, is described as an automated data discovery platform. So, I’m going to start at the beginning with that baseline definition. I think most folks can wrap their heads around what the idea of automated means, but the rest of the words feel like they might mean different things to different people. What is data discovery from your point of view?
Shinji: Sure. The way that we define data discovery is finding and understanding data. In other words, think about how discoverable your data is in your company today. How easy is it for you to find the datasets, fields, and KPIs of your organization’s data? And when you are looking at a table, column, dashboard, or report, how easy is it for you to understand the data underneath? All of that encompasses how we define data discovery.
Corey: When you talk about data lurking around the company in various places, that can mean a lot of different things to different folks. For the more structured data folks—which I tend to think of as the organized folks who are nothing like me—that tends to mean things that live inside of, for example, traditional relational databases or things that closely resemble that. I come from a grumpy old sysadmin perspective, so I’m thinking, oh, yeah, we have a Jira server in the closet and that thing’s logging to its own disk, so that’s going to be some information somewhere. Confluence is another source of data in an organization; it’s usually where insight and a knowledge of what’s going on goes to die. It’s one of those write once, read never type of things.
And when I start thinking about what data means, it feels like even that is something of a squishy term. From the perspective of where Select Star starts and stops, is it bounded to data that lives within relational databases? Does it go beyond that? Where does it start? Where does it stop?
Shinji: So, we started the company with the intention of increasing the discoverability of data, and hence providing an automated data discovery capability to organizations. And the part where we see this as most effective is where the data is currently being consumed today. So, this is where the data consumption happens. This can be a data warehouse or data lake, but it’s where your data analysts and data scientists are querying data and building dashboards and reports on top of it, and where your main data mart lives.
So, for us, that is primarily a cloud data warehouse today, which usually has a relational data structure. On top of that, we also do a lot of deep integrations with BI tools. That includes tools like Tableau, Power BI, Looker, and Mode. Wherever queries from business stakeholders, BI engineers, data analysts, and data scientists run, that is the point of reference we use to auto-generate documentation, data models, lineage, and usage information, and give it back to the data team and everyone else so that they can learn more about the dataset they’re about to use.
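To make the idea of using query logs as the point of reference concrete, here is a deliberately toy sketch of lineage extraction. Select Star’s real implementation is not public and would certainly use a full SQL parser rather than regexes; every name below is hypothetical.

```python
import re
from collections import defaultdict

def extract_lineage(query: str):
    """Return (target_table, source_tables) for a single SQL statement.

    Handles only the simplest CREATE TABLE ... AS / INSERT INTO ... SELECT
    shapes; a production lineage engine would use a real SQL parser.
    """
    q = " ".join(query.split())  # collapse whitespace
    target = re.search(r"(?:CREATE TABLE|INSERT INTO)\s+(\w+)", q, re.IGNORECASE)
    sources = re.findall(r"(?:FROM|JOIN)\s+(\w+)", q, re.IGNORECASE)
    return (target.group(1) if target else None, sources)

def build_lineage_graph(query_log):
    """Fold a list of query strings into a table -> upstream-tables map."""
    graph = defaultdict(set)
    for query in query_log:
        target, sources = extract_lineage(query)
        if target:
            graph[target].update(sources)
    return graph
```

For example, `build_lineage_graph(["CREATE TABLE daily_rev AS SELECT * FROM orders JOIN payments ON orders.id = payments.order_id"])` maps `daily_rev` back to `orders` and `payments`, which is the raw material for the lineage and usage information discussed here.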
Corey: So, given that I am seeing an increased number of companies out there talking about data discovery, what is it that Select Star does that differentiates you folks from other folks using similar verbiage to describe what they do?
Shinji: Yeah, great question. There are many players popping up, and traditional data catalogs are definitely starting to offer more features in this area. The main differentiator that we have in the market today is what we call fast time-to-value. Any customer that is starting with Select Star gets their instance set up within 24 hours, and they’ll be able to get all the analytics and data models, including column-level lineage, popularity, ER diagrams, and top users and how other people are utilizing that data, literally in a few hours, 24 hours at most. And I would say that is the main differentiator.
And most of our customers have pointed out that setup and getting started has been super easy, which is primarily backed by a lot of automation that we’ve created underneath the platform. On top of that, we make it super easy and simple to use. It becomes very clear to the users that it’s not just for the technical data engineers and DBAs; this is also designed for business stakeholders, product managers, and ops folks to start using as they are learning more about how to use data.
Corey: Mapping this a little bit toward the use cases that I’m the most familiar with, the big source of data that I tend to stumble over is customer AWS bills. And that’s not exactly a big data problem, given that it can fit in memory if you have a sufficiently exciting computer, but we wind up using Tableau to slice and dice it because at some point, Excel falls down. From my perspective, the problem with Excel is that it doesn’t tend to work on huge datasets very well, and from the position of Salesforce, the problem with Excel is that it doesn’t cost a giant pile of money every month. So, those two things combined, Tableau is the answer for what we do. But that’s sort of the end-all for us; that’s where it stops.
At that point, we have dashboards that we build and queries that we run that spit out the thing we’re looking at, and then that goes back to inform our analysis. We don’t inherently feed that back into anything else that would then inform the rest of what we do. Now, for our use case, that probably makes an awful lot of sense because we’re here to help our customers with their billing challenges, not take advantage of their data to wind up informing some giant model and mispurposing that data for other things. But if we were generating that data ourselves as a part of our operation, I can absolutely see the value of tying that back into something else. You wind up almost forming a reinforcing cycle that improves the quality of data over time and lets you understand what’s going on there. What are some of the outcomes that you find that customers get to by going down this particular path?
Shinji: Yeah, so just to double-click on what you just talked about, the way that we see this is that we analyze the metadata and the activity logs (system logs, user logs) of how that data has been used. So, as part of our auto-generated documentation for each table, each column, each dashboard, you’re going to be able to see the full data lineage: where it came from, how it was transformed in the past, and where it’s going. You will also see what we call a popularity score: how many unique users are utilizing this data inside the organization today, and how often. Utilizing these two core models and the analysis that we create, you can start by first mapping out the data flow, and then determining whether or not a dataset is something that you would want to keep, or keep running the data pipelines for. Because once you start mapping these usage models of tables versus dashboards, you may find that there are recurring jobs that create materialized views and tables feeding dashboards that are not being looked at anymore.
So, starting with data lineage as a concept, a lot of companies use lineage to find dependencies: what is going to break if I make this change to a column or table, as well as to debug any issues currently happening in their pipeline. Especially when you have to debug a SQL query or pipeline that you didn’t build yourself but need to fix, this is a really easy way to instantly find out where the data is coming from. But on top of that, if you start adding this usage information, you can trace through where the main compute is happening, which large root tables are still being queried instead of the more summarized tables that should be used, which tables and datasets are continuing to get created to feed dashboards, and whether those dashboards are actually being used on the business side. With that, we have customers that have saved thousands of dollars every month just by deprecating dashboards and pipelines they were afraid to deprecate in the past, because they weren’t sure whether anyone was actually using them. Adopting Select Star was a great way to do a full spring cleaning of their data warehouse as well as their BI tool. And an additional benefit is just being able to declutter so many old, duplicated, and outdated dashboards and datasets in their data warehouse.
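The deprecation analysis described here can be sketched in a few lines: combine a dashboard-to-table lineage map with last-viewed timestamps, and flag tables whose every downstream dashboard has gone stale. The data shapes below are hypothetical; real BI-tool metadata APIs differ.

```python
from datetime import datetime, timedelta

def deprecation_candidates(lineage, last_viewed, now, stale_after_days=90):
    """Return (stale_dashboards, deprecatable_tables).

    lineage:     dashboard name -> set of upstream tables feeding it
    last_viewed: dashboard name -> datetime of its most recent view
    """
    cutoff = now - timedelta(days=stale_after_days)
    stale = {d for d, seen in last_viewed.items() if seen < cutoff}

    # Invert lineage: which dashboards consume each table?
    consumers = {}
    for dash, tables in lineage.items():
        for table in tables:
            consumers.setdefault(table, set()).add(dash)

    # A table is a candidate only if ALL of its dashboards are stale,
    # which is exactly the "is anyone actually using this?" question.
    tables = {t for t, dashes in consumers.items() if dashes <= stale}
    return stale, tables
```

A dashboard untouched for months drags its private materialized views onto the candidate list, while any table still feeding an actively viewed dashboard stays off it.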
Corey: That is, I guess, a recurring problem that I see in many different pockets of the industry as a whole. You see it in the user visibility space, you see it in the cost control space (I even made a joke about Confluence that alludes to it): this idea that you build a whole bunch of dashboards and use them to inform all kinds of charts and other systems, but then people are busy. It feels like there’s no ‘and then.’ One of the most depressing things in the universe that you can see, after having spent a fair bit of effort to build up those dashboards, is the analytics for who internally has looked at any of those dashboards since the demo you gave showing them off to everyone else. It feels like in many cases, we put all this effort into building these things out, and then they don’t get used.
People don’t want to be informed by data; they want to shoot from the gut. Now, sometimes that’s helpful when we’re talking about observability tools that you use to trace down outages, and, “Well, our site’s really stable. We don’t have to look at that.” Great, awesome use case. But the business-insight level of dashboard feels like something you should really be checking a lot more than you are. How do you see that?
Shinji: Yeah, for sure. I mean, this is why we also update these usage metrics and lineage every 24 hours for all of our customers automatically, so it’s always up to date. And the part that more customers are asking for, and where we are heading: earlier, I mentioned that our main focus has been on analyzing data consumption and understanding consumption behavior to drive better usage of your data, or make data usage much easier. What we are starting to see now is more customers wanting to extend those capabilities to the part of their stack where the data is being generated. So, connecting a similar level of analysis and metadata collection to production databases, Kafka queues, and wherever the data is first being generated is one of our longer-term goals. And then you’ll really have more of that at the source level: whether the data should even be collected, or whether it should even enter the data warehouse phase at all.
Corey: One of the challenges I see across the board in the data space is that so many products tend to have a very specific point of the customer lifecycle, where bringing them in makes sense. Too early and it’s, “Data? What do you mean data? All I have are these logs, and their purpose is basically to inflate my AWS bill because I’m bad at removing them.” And on the other side, it’s, “Great. We pioneered some of these things and have built our own internal enormous system that does exactly what we need to do.” It’s like, “Yes, Google, you’re very smart. Good job.” And most people are somewhere between those two extremes. Where are customers on that lifecycle or timeline when using Select Star makes sense for them?
Shinji: Yeah, I think that’s a great question. The best time for customers to start using Select Star is after they have their cloud data warehouse set up. Either they have finished their migration or they’re starting to utilize it with their BI tools, and they’re starting to notice that it’s not just, like, you know, ten to fifty tables that they’re starting with; most of them have hundreds of tables or more. And they’re feeling that this is starting to go out of control: we have all this data, but we are not a hundred percent sure what exactly is in our database. This usually happens more in larger companies, companies with a thousand-plus employees, and they usually find a lot of value out of Select Star right away because, like, we will start pointing out many different things.
But we also see a lot of, like, forward-thinking, fast-growing startups at the size of a few hundred employees; you know, they now have a five-to-ten-person data team, and they are really creating the right single source of truth of their data knowledge through Select Star. So, I think you can start anywhere from when your data team size is, like, beyond five and you’re continuing to grow, because every time you onboard a data analyst or data scientist, you have to go through basically the same type of training on your data model, and it might actually look different, because the data models and the new features and new apps you’re integrating change so quickly. So, I would say it’s important to have that base early on and then continue to grow. But we do also see a lot of companies coming to us after having thousands or tens of thousands of datasets, when it’s really very hard to operate and to onboard anyone. And this is a place where we really shine to help their needs as well.
Corey: Sort of the, “I need a database,” to the, “Help, I have too many databases,” pipeline, where [laugh] at some point people start wanting to bring organization to the chaos. One thing I like about your model is that you don’t seem to be making the play that every other vendor in the data space tends to, which is, “Oh, we want you to move your data onto our systems. The end.” You operate on data in place, which makes an awful lot of sense for the kinds of things we’re talking about. Customers are flat-out not going to move their data warehouse over to your environment, just because the data gravity is ludicrous. Just the sheer amount of money it would take to egress that data from a cloud provider, for example, is monstrous.
Shinji: Exactly. [laugh]. And security concerns. We don’t want to be liable for any of the data (and this is a very specific decision we made very early on in the company): to not access data, to not egress any of the real data, and to provide as much value as possible utilizing just the metadata and logs. And depending on the type of data warehouse, it can also be really efficient, because the query history and metadata system tables are indexed separately; it’s usually a much lighter load on the compute side. And that has definitely worked to our advantage, especially being a SaaS tool.
Corey: This episode is sponsored in part by our friends at Sysdig. Sysdig secures your cloud from source to run. They believe, as do I, that DevOps and security are inextricably linked. If you wanna learn more about how they view this, check out their blog; it’s definitely worth the read. To learn more about how they are absolutely getting it right from where I sit, visit sysdig.com and tell them that I sent you. That’s S-Y-S-D-I-G dot com. And my thanks to them for their continued support of this ridiculous nonsense.
Corey: What I like is just how straightforward the integrations are. It’s clear you’re extraordinarily agnostic as far as where the data itself lives. You integrate with Google’s BigQuery, with Amazon Redshift, with Snowflake, and then on the other side of the world with Looker, and Tableau, and other things as well. And one of the example use cases you give is finding the upstream table in BigQuery that a Looker dashboard depends on. That’s one of those areas where I see something like that, and, oh, I can absolutely see the value of that.
I have two or three DynamoDB tables that drive my newsletter publication system that I built (because I have deep-seated emotional problems and I take them out on everyone else via code), but that’s a small, contained system that I can still fit in my head. Mostly. And I still forget which table is which in some cases. Down the road, especially at scale, “Okay, where is the actual data source that’s informing this? Because it doesn’t necessarily match what I’m expecting,” is one of those incredibly valuable bits of insight. It seems like that is something that often gets lost; the provenance of data doesn’t seem to survive.
And ideally, you know, you’re staffing a company with reasonably intelligent people who are going to look at the results of something and say, “That does not align with my expectations. I’m going to dig.” As opposed to the, “Oh, yeah, that seems plausible. I’ll just go with whatever the computer says.” There’s an ocean of nuance between those two, but it’s nice to be able to establish the validity of the path that you’ve gone down in order to set some of these things up.
Shinji: Yeah, and this is also super helpful if you’re tasked to debug a dashboard or pipeline that you did not build yourself. Maybe the person has left the company, or maybe they’re out of office, but this dashboard has been broken and you’re, quote-unquote, “on call” for data. What are you going to do? Without a tool that can show you the full lineage, you will have to start digging through somebody else’s SQL code and try to map out where the data is coming from and whether it’s calculating correctly. It usually takes, you know, a few hours just to get to the bottom of the issue. And this is one of the main use cases our customers bring up every single time: this is now the go-to place every time there are any data questions or data issues.
Corey: The first and golden rule of cloud economics is step one, turn that shit off.
Corey: When people are using something, you can optimize the hell out of it however you want, but nothing’s going to beat turning it off. One challenge is when we’re looking at various accounts and we see a Redshift cluster, and it’s, “Okay. That thing’s costing a few million bucks a year and no one seems to know anything about it.” They keep pointing to other teams, and it turns into this giant, like, finger-pointing exercise where no one seems to have responsibility for it. And very often, our clients will choose not to turn that thing off because on the one hand, if you don’t turn it off, you’re going to spend a few million bucks a year that you otherwise would not have had to.
On the other, if you delete the data warehouse, and it turns out, oh yeah, that was actually kind of important, now we don’t have a company anymore. It’s a question of which side you want to be wrong on. And on some level, leaving something as it is and doing something else is always the more defensible answer, just because the first time your cost-saving exercises take out production, you’re generally not allowed to save money anymore. This feels like it helps get to that source of truth a heck of a lot more effectively than tracing individual calls and turning into, basically, data center archaeologists.
Shinji: [laugh]. Yeah, for sure. I mean, this is why, from the get-go, we try to give you all your tables, all of your databases, ordered by popularity. So, you can see overall, across all the tables, whether that’s thousands or tens of thousands, the most used, with the most dependencies, at the top, and you can also filter for all the database tables that haven’t been touched in the last 90 days. Just having this high-level view gives the data platform team a lot of ideas about how they can optimize usage of their data warehouse.
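The ranking described here can be sketched from nothing more than query-log rows of (table, user, timestamp). This is a simplification: Select Star’s actual popularity formula isn’t public, so a distinct-user count stands in for it, and the 90-day filter is just a set difference against recent activity.

```python
from datetime import datetime, timedelta

def rank_by_popularity(query_events):
    """Order tables by distinct querying users, most popular first.

    query_events: iterable of (table, user, timestamp) tuples.
    """
    users = {}
    for table, user, _when in query_events:
        users.setdefault(table, set()).add(user)
    return sorted(users, key=lambda t: len(users[t]), reverse=True)

def untouched_since(query_events, all_tables, now, days=90):
    """Tables with no query activity in the last `days` days."""
    cutoff = now - timedelta(days=days)
    touched = {table for table, _user, when in query_events if when >= cutoff}
    return sorted(set(all_tables) - touched)
```

Everything `untouched_since` returns is the shortlist a data platform team would review before turning anything off, in the spirit of the Redshift-cluster archaeology Corey describes.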
Corey: From where I tend to sit, an awful lot of customers are still relatively early in their data journey. An awful lot of the marketing that I receive from various AWS mailing lists, which I found myself on because I’ve had the temerity to open accounts, has been along the lines of: oh, data discovery is super important. But first, they presuppose that I’ve already bought into this idea that every company must be a completely data-driven company. The end. Full stop.
And yeah, we’re a small bespoke services consultancy. I don’t necessarily know that that’s the right answer here. But then it takes it one step further and starts to define the idea of data discovery as: ah, you will use it to find PII or otherwise sensitive or restricted data inside of your datasets so you know exactly where it lives. And sure, okay, that’s valuable, but it also feels like a very narrow definition compared to how you view these things.
Shinji: Yeah. Basically, the way that we see data discovery is that it’s starting to become an essential capability for monitoring and understanding how your data is actually being used internally. It gives you insights around, sure, which datasets are duplicated, which datasets have descriptions or not, which may contain sensitive data, so on and so forth, but that’s still about the characteristics of the physical datasets. Whereas I think the part that’s really important about data discovery, and not talked about as much, is how the data can actually be used better. So, treating it as more of a forward-thinking mechanism to encourage more people to utilize data, or to use the data correctly, instead of trying to contain it within just one team, is really where I feel data discovery can help.
And related to this, the other big part of data discovery is really opening up and having that transparency, even just within the data team. Within the data team, they always feel like they have access to the SQL queries; you can just go to GitHub and look at the database itself. But it’s so easy to get lost in a sea of metadata that is laid out as just a list; there isn’t much context around the data itself. That context, along with the analytics on the metadata, is what we’re really trying to provide automatically. So eventually, this can also be seen as a way to monitor your datasets, like how you currently monitor your applications through Datadog or your website with Google Analytics. It’s something that can be used as more of a go-to source of truth around what the state of your data is, how it’s defined, and how it maps to different business processes, so that there isn’t much confusion around data. Two things can be called the same name but underneath actually mean very different things. Does that make sense?
Corey: No, it absolutely does. I think that this is part of the challenge in trying to articulate value that is, I guess, specific to this niche across an entire industry. The context that drives data is going to be incredibly important, and it feels like so much of the marketing in the space is aimed at one or two pre-imagined customer profiles. And that has the side effect of making customers for whom that model doesn’t align feel like they’re either doing something wrong, or making the vendor who’s pitching this look somewhat out of touch. I know that I work in a relatively bounded problem space, but I still learn new things about AWS billing on virtually every engagement that I go on, just because you always get to learn more about how customers view things, and how they view not just their industry, but also the specificities of their own business and their own niche.
I think that is one of the challenges historically, with the idea of letting software do everything. Do you find the problems that you’re solving tend to be global in nature or are you discovering strange depths of nuance on a customer-by-customer basis at this point?
Shinji: Overall, a lot of the problems we solve and the customers we work with are very industry-agnostic. As long as you have many different datasets that you need to manage, common problems arise regardless of the industry you’re in. We do observe some industry-specific issues, because your data is unstructured, or your data is primarily events, or, you know, depending on what the data looks like. But primarily, because most BI solutions and data warehouses operate as relational databases, this is a part where we really try to build a lot of best practices and common analytics that we can apply to every customer using Select Star.
Corey: I really want to thank you for taking so much time to go through the ins and outs of what it is you’re doing these days. If people want to learn more, where’s the best place to find you?
Shinji: Yeah, I mean, it’s been fun [laugh] talking here. So, we are at selectstar.com. That’s our website. You can sign up for a free trial. It’s completely self-service, so you don’t need to get on a demo, but, like, we’ll also help you onboard, and we’re happy to give a free demo to whoever is interested.
We are also on LinkedIn under selectstarhq. Yeah, I mean, we’re happy to help any companies that have these issues around wanting to increase the discoverability of their data, and that want to help their data team and the rest of the company utilize data better.
Corey: And we will, of course, put links to all of that in the show notes. Thank you so much for your time today. I really appreciate it.
Shinji: Great. Thanks for having me, Corey.
Corey: Shinji Kim, CEO and founder at Select Star. I’m Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you’ve enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you’ve hated this podcast, please leave a five-star review on your podcast platform of choice along with an angry comment that I won’t be able to discover because there are far too many podcast platforms out there, and I have no means of discovering where you’ve said that thing unless you send it to me.
Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.
Announcer: This has been a HumblePod production. Stay humble.