Thomas Hazel is Founder, CTO, and Chief Scientist of CHAOSSEARCH. He is a serial entrepreneur at the forefront of communication, virtualization, and database technology and the inventor of CHAOSSEARCH's patent pending IP. Thomas has also patented several other technologies in the areas of distributed algorithms, virtualization and database science. He holds a Bachelor of Science in Computer Science from University of New Hampshire, and founded both student & professional chapters of the Association for Computing Machinery (ACM).
- Company site: http://chaossearch.io
- Twitter: @ThomasHazel
- LinkedIn: https://www.linkedin.com/in/thomashazel/
Announcer: Hello and welcome to Screaming in the Cloud with your host, cloud economist, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.
Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. I'm joined this week by Thomas Hazel, founder and CTO of everyone's favorite company, CHAOSSEARCH. Thomas, welcome to the show.
Thomas: Thank you for having me.
Corey: So, let's start at the very beginning. Most people don't spring into existence having founded a company. There's usually an origin story. What were you doing before CHAOSSEARCH?
Thomas: So I had a background in big distributed systems, made a career out of building God boxes back in the day in telecom. And then for the last 15 years, I've been working on new computer science with respect to databases and data analytics, really trying to be inventive at the forefront of computer science.
Corey: And for those who have been living under a rock and not been paying attention to any episode of anything I've ever done until this one, you folks have sponsored a number of different things that I've been involved in, including this episode. But for those who have not been paying attention, what does CHAOSSEARCH do?
Thomas: So at a high level, we created some innovative technology that allows customers to do analytics, both text search and relational, directly on their object storage, particularly Amazon S3. And we do it at a scale and cost and price point that is quite unique, quite disruptive in the market. And so, for the last four years we've been building out a new indexing technology as well as an associated architecture as a service. Customers log on to our service and, after a five-minute registration, they're up and running doing terabytes of analysis, let's say for log analytics, within minutes, using their favorite API: the Elastic API or, coming out in 2020, a SQL API.
Corey: Before you folks had ever sponsored anything that I was involved with, I was aware of you because you employed Pete Cheslock, everyone's favorite DevOps thought leader, as the VP of product. When he started talking about what you folks did, I said, "It seems implausible that it's as amazing as you say, but all right, I'll suspend disbelief. Tell me about it."
And it turned out that everything that I was told was actually in fact correct. I recommended you folks to clients of mine, not because there was any business relationship, and that is still something people can't buy, as it turns out. Credibility, it matters. But because for a certain subset of use cases, it is an incredibly cost effective approach to handling things. So, I guess this is probably an idiotic question, but I'm going to roll with it anyway. What is it that makes this such a unique thing? There's nothing else in the market like this, but there should be.
If we take a look at cloud, all the different providers took a vote and the storage technology that won by orders of magnitude in terms of price, has always been object store. Why is there nothing that provides a simple searchable API for a rapid response for data that lives in S3?
Thomas: Great, great question. So my background, as I mentioned, is in distributed computing and computer science, and there's existing technology out there. Lucene, it's an inverted index, really the driving force behind log analytics. There's column storage for warehousing, B-trees, LSM-trees, all these classic computer science data structures and algorithms that have been used for the last 30 years. And the real issue is that they are at the breaking point at the scale that we're seeing today. Moore's Law says one thing, but machine generated data is outpacing it.
And about five to seven years ago, I wanted to crack that code of creating a new data structure and algorithm that could really provide the next level of scale, hundred thousand magnitude scale, without having to use a massive amount of compute and a team of engineers trying to erect a system at terabyte and petabyte scale. So that was what I set out to do.
And so I reached out to people like Pete to say, "Hey, I have this idea. What do you think if I solve this problem, will people care?" He kind of laughed a little bit. He says, "Everybody wants this problem solved and people want unicorns and a pot of gold. Call me when you figure it out." And a couple of years ago, I reached out to him. I said, "Hey Pete, I think I figured it out." And we went to market with that solution.
So really the essence of it is, I created a new representation, really a compression of them, that's a database index that can uniquely support both text search, like the classic hunting and log analytics that a Lucene index would support, and the relational analytics that you'd associate with warehousing technology, with the same representation. You don't have to have siloed databases where you'd say, "I'm going to store data as necessary for archival, but then move it out, ETL it into, say, a relational system or, say, an Elasticsearch cluster."
You can imagine that is of great complexity, and at scale, these systems start to break. And so I said, "What if you created a service where you leave the data in your storage?" Really, the idea of storage and analytics convergence. And with the power of object storage, Amazon S3 was a great place to start, where it's infinitely scalable, wonderfully inexpensive, secure and durable. But the problem is that technology that was built 30 years ago cannot access it in a high performance way.
So people move it out of S3 and then into one of these classic databases. So I thought, with this innovative technology that we have patent-pending papers on, I could take our index and create a new architecture that leverages distributed compute to, in essence, unlock that storage without having to move the data out of the system.
And that is what we did. CHAOSSEARCH, as you can imagine, search in the chaos, is why we came up with the name, because there's so much data being stored in S3. It's like all data roads lead to S3 and object storage, and now every cloud provider has one. And so that's what we did over the last four years: built out a service that takes the customer's S3 account, they provide read only access, we provide the compute, we index the data and write these indices back into their account, and then they can do a text search for, say, log analytics as well as relational for, say, business intelligence analytics.
And so for the last several years we've been building that platform and we're super excited. So Pete said, "Hey, I got to get on board in this thing because it seems like you guys cracked the code and this is what I see everyday as a problem in the market."
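For readers unfamiliar with the inverted index that Thomas names as the structure behind Lucene, the idea can be sketched in a few lines of Python. This is a minimal illustration of the classic technique only; it is not CHAOSSEARCH's index and not Lucene's implementation, and the log lines and function names are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every token in the query (AND)."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical log lines, keyed by an arbitrary id.
logs = {
    1: "GET /index.html 200",
    2: "GET /login 500",
    3: "POST /login 200",
}
index = build_inverted_index(logs)
```

A lookup such as `search(index, "GET 200")` intersects the posting sets for each token, which is why this structure suits text search and why, on its own, it says nothing about the relational queries discussed above.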
Corey: It's a great approach, if for no other reason than that it finally does what I think everything should be doing, which is separating out the compute and the storage layers. With anything that's legacy, you're playing within this space and you're running clusters of these things. Oh, you need to store more data in it? Add more nodes. You're adding compute that you may or may not need, and your storage is going in lockstep with that. Conversely, you need better performance? Well, add more nodes, which adds to the storage burden.
What I like about your entire approach is that there is no management overhead. The data lives in S3 and you don't have to touch it again once that index is created, it's just there available for querying.
Thomas: Absolutely, and the separation was obviously extremely important: if you go from one terabyte a day of data to a hundred terabytes, we just spin up additional compute. Trying to create shards of attached storage and compute is the nightmare that people deal with today, particularly, say, with Elasticsearch and its clustering technology. And so you have the ability to have infinite storage with dynamic compute, so you can have one node represent the entire dataset, or a hundred nodes.
The ability for us to allocate compute to make indexing faster or query search faster is all on the fly and all dynamic. But this technology and this index does it at a price point that is very, very unique and very disruptive. Our cost to do indexing at scale is quite low, and the ability to do on the fly compute allocation for your text search, for your aggregations across terabytes of data, is extremely cost effective, and you can see that in our pricing today.
Corey: One of the things that I find so interesting about this entire approach is that people generally already have something that's using Elasticsearch out there. I ran the numbers myself, I didn't need you to tell them to me. It's one of these things of working from an economic point of view, nothing personal, I generally try not to take too much of what the vendors tell me on faith. I like to test these things out myself, and it was right. It was knocking a majority of the spend off in virtually every case, and that was incredible. Especially when you apply it to something like log analytics.
You take a look at any of the log analytics companies and people can think through a wide variety of log analytics companies, it doesn't really matter which one, and at some point at scale you have to begin kidnapping princesses for ransom in order to pay for the bill. So it's absolutely one of those challenging problems. And then when the bill gets too high, you talk to that vendor and their response is, "Oh, log less stuff," which sort of cuts against the entire premise of a log analytics company because you can start noticing relationships and data if you have the data. But if you don't, that door is shut for you. It really has seemed like a disjointed, fractured industry for an awfully long time. I'm still trying and failing to find examples of other approaches that solve this problem the way that you have. There's nothing else like it.
Thomas: You know, you're exactly right. And what's interesting to me is that every single time we hear a customer wanting to increase their retention, they double their cost. So every time you double it, your cost doubles and your complexity doubles. And so the idea that we've created, that the storage is infinitely scalable, S3 has been quite proven on that, and our ability to elastically scale up the compute and deliver the compute where the storage is for those queries, it seems so obvious, it's so natural. The thing is that the separating of storage and compute is not the rocket science. It's the technology, this index, that has allowed us to do it at this price point, all purely on S3 or object stores in general.
Corey: We take a look at re:Invent last year, and they announced enhancements to the Amazon Elasticsearch Service that they run, and I thought, "Okay, this is it. They're finally going to do the same thing," and they launched their badly named UltraWarm tier that had an accelerated performance approach, or I guess a lower cost of being able to stage old data out. Even they went in the exact wrong direction for this problem, as I understand it. It seems bizarre.
Thomas: So the funny thing is, they solved the problem like everyone else was solving it. They solved the Lucene problem via band-aids, if you will, to be so bold. The system was never meant to do this type of scale, right? Elasticsearch was surprised that the log analytics community adopted this technology, and UltraWarm is just another way to provide a caching layer to make it a little bit better, a little bit more cost effective. And so what we wanted to do is, in essence, make S3 ultra hot. And the idea is that with this technology, we don't have to play those band-aid caching games. It's pure access on S3 and it feels like it's a hot cluster, but it's on S3 with our compute.
And that's where our technology, our index technology, has cracked that code: how to make S3 high-performance, make S3 the data lake that Amazon's pushing, but not make it swampy. We actually provide data discovery and a catalog of what's in there, but ultimately we index the data to provide log analytics via the Elastic API and Kibana, or SQL, say through a Looker visualization, for BI workloads or Athena workloads. And again, the key thing is to make it performant and extremely scalable, without having to worry about sharding, with high performance at a very low cost.
I know that those are a lot of what ifs that we started this company with, but we cracked that code, as I mentioned, and we're super excited about what we've built, because we're seeing with our customers that we take their bill and literally cut it in half, if not to a third.
Corey: So one of the, I guess, constraints I have is that when you first learn about something, it's difficult to go back and relearn it as something else. I was introduced to CHAOSSEARCH as, effectively, whenever you're using Elasticsearch as a part of something, maybe an ELK Stack, maybe something else, the CHAOSSEARCH API is equivalent to a drop-in replacement. And that's how I've always conceptualized it. Anytime I see a big Elasticsearch bill, this is one of the things that I tend to think about.
The challenge, of course, is that it sounds like you're going beyond being effectively just that. What do I misunderstand in that oversimplified description of what you folks do?
Thomas: I hate to say it, but we are building the next generation database that really rethinks how storage and analytics converge, fuse together. We created a distributed fabric and we have this ability to expose compatible APIs. We don't run any Elasticsearch underneath the hood, or a Lucene index, but we support an open standard Elastic API that people know and love, with the Kibana integration for all that great visualization that people do in log analytics. And the idea is that you can use the Elastic API with, say, an index pattern, and the same index is a table in, say, a Presto dialect SQL interface with Looker, without the cost and complexity of standing up a relational system like a warehouse or standing up an Elastic cluster. It's almost hard to believe, because it hasn't been done before.
There are some companies on the fringes of trying to solve this. Snowflake has done some separation of storage and compute, though they still have to cache out on a physical disk. We are 100% pure object storage, and the difference is it's your storage. You own the data. We are just the distributed compute that manages the index and the query execution.
So it's another way of delivering the idea that you dump data, and then what our service does is support what we call the refinery within our service. So you index it once, you index everything. And then with our refinery, you can create virtual transformations or views that look like index patterns in Elasticsearch or tables in, say, SQL, all in that same representation, on the fly.
So imagine if you had a hundred terabytes of data and had to physically ETL it. Typically folks use EMR to take it out of S3, transform it, and put it into, say, a warehousing solution or Elasticsearch. What if you just brought up the wizard, created a view, created your transform, and it's available immediately without having to do anything physical? This is the power of this index technology we created. It is distributed, and it supports text search and relational. It's uniquely compressed to save on costs, but allows for these virtual, late materialized transformations that allow for all the variation in how each department in a company wants to read and analyze that data.
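The "virtual view" idea, where a transform is defined once and applied only when the data is read, can be sketched as a lazy filter and projection over a shared source. This is an illustration of late materialization in general, under the assumption that a view is just a predicate plus a field list; the `View` class, field names, and records are hypothetical and do not describe the CHAOSSEARCH refinery's actual implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, Iterator, Tuple

Record = Dict[str, Any]


@dataclass(frozen=True)
class View:
    """A named, lazily evaluated filter + projection over a shared source."""
    name: str
    predicate: Callable[[Record], bool]
    fields: Tuple[str, ...]

    def rows(self, source: Iterable[Record]) -> Iterator[Record]:
        # Nothing is copied or rewritten up front; rows are built at read time.
        for record in source:
            if self.predicate(record):
                yield {f: record.get(f) for f in self.fields}


# Hypothetical records standing in for data that stays in object storage.
indexed = [
    {"ts": 1, "status": 200, "path": "/", "bytes": 512},
    {"ts": 2, "status": 500, "path": "/login", "bytes": 0},
    {"ts": 3, "status": 503, "path": "/login", "bytes": 0},
]

# Two departments define different views over the same underlying data.
errors = View("errors", lambda r: r["status"] >= 500, ("ts", "path"))
traffic = View("traffic", lambda r: True, ("path", "bytes"))
```

Calling `list(errors.rows(indexed))` produces only the error rows with only the requested fields, while `indexed` itself is never modified, which is the contrast with a physical ETL job that writes a second copy.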
Corey: So putting the shoe on the other foot a bit. In what use case is someone going to be using Elasticsearch for which CHAOSSEARCH is not going to be a fit?
Thomas: Yeah, so a classic case is what I call the ELK use case where, let's say, you have logging for a denial of service attack and you want to know what's going on. So often people stand up CDN logs in, say, Elasticsearch or ELK, and this can get pretty big. One terabyte up to 10 terabytes a day, maybe even more, of log data. And they typically see that there's a denial of service attack and they want to figure out what's going on. The problem with Elasticsearch is that as more denial of service attacks come in, more logs come in, and now you're querying more often. This is classically what makes Elasticsearch sad, where the cluster comes down because you've only provisioned so much capacity.
With our system, you dump your data into S3 and we index it, and we allow for dynamic scale to do that exact same security ops type use case: denial of service attacks, maybe Cloudflare logs, app logs. Constantly people come to us and they have really messy data for their app logs. They're dumping it into S3, and they want to know what's going on.
Again, CHAOSSEARCH is a great way to do app log analytics. Hey, what is my application doing? Is it running slow, were there problems? So app logs, security ops, DevOps, those types of use cases are really a sweet spot, because with almost everyone we talked to when we built out this idea, we'd say, "Do you use S3?" And they almost would say, "Duh, of course we do." And do you use Elasticsearch? "Of course we do." And they typically store it in S3 first and then store it in the Elastic cluster, and we just say, "Keep it in S3, we'll index it, and you can have that exact same ELK functionality that you've had without the cost and complexity."
Now, what we're not good for: there was a time where we were staying away from real time functionality, where you wanted sub-second type performance for a short window of data. And we were going after the big, big data, customers that had 10 terabytes a day up to 100 terabytes a day of analytics that was just breaking the bank at that scale. Actually, in Q1 of this year, 2020, we're coming out with real time. So just as you would put data into Elasticsearch, for instance, you can put data to us and it's available in real time. We'll write this data into your S3 account and, as we come out with our SQL interface, we'll support creates, updates, and upserts as well. So it's really turning into a full-fledged data platform that can handle the real time and obviously that real scale.
So one of our limitations was in real time, because we were not focused on that. But as we go into more of a BI real time use case, we're adding that feature. Another limitation, I would say, is parity with the Elastic API. We're really focused on log analytics and, coming out, BI analytics. The Elastic API is quite broad and we don't support all the classic low level text search type capabilities, like fuzzy type searching. Not that we won't, but that's not really where our wheelhouse is. But we do support all the classic log analytics, text search, wildcarding, etc.
The limitations, I don't know. We have some big ideas and some pretty powerful technology and architecture. So if there are limitations today they won't be limitations tomorrow.
Corey: One of my personal guilty pleasures is pointing out the terrible business practices of others, and I've been vocal about this for a little while, where there's a giant slap fight between Elastic and pretty much anyone who is selling anything that looks like Elastic. It seems like their approach to open source by and large is, "Use our open source software. No wait, not like that." And so now there's a trademark lawsuit, among other things. There's a slap fight that's going on between AWS and Elastic, where AWS also launched their Open Distro for Elasticsearch. Elastic was doing their whole thing where only some of the code in the same repository, and some of the same commits, is free and open source, and the rest comes with a commercial license. CHAOSSEARCH bypasses all of that, correct? Effectively you sit on the sidelines and watch with popcorn. There's no Elasticsearch under the hood here.
Thomas: Yeah, yeah, yeah. No, we're Switzerland in those battles. I mean, the open source community is so important to all of us, and Elastic has done some great work and Amazon has done some great work. I know Amazon gets some bad raps about taking open source and making it a service and making a business, and open source companies, once they have to start making money, may make some software proprietary. We were using the open source Kibana out of Elastic and, to be frank, when Elastic started closed sourcing, for lack of a better term, or making the license different than Apache 2 for some of their more advanced features out of Kibana, this past summer we adopted Amazon's Open Distro for the alerting, the timeline, the role-based access control, because it was open and it was in keeping with that philosophy where a whole community was built based on the idea that this would be free and open.
Now I understand business, and I understand that we have a business, we're creating a service to make money. But the idea of having open APIs, it's really hard to fight that. And I do believe that open APIs are the way to go. I understand why Elastic is doing what they're doing. I understand what Amazon is doing as a business. We're in that service of solving customer problems, and you get paid for it as a service. So we're kind of in that mode where let's keep the APIs open, let's make a business on offering a service or support like open source always used to do.
Corey: Well I wish you well, but I kind of think you're going to struggle until you learn to do what real companies do and threaten to sue your customers.
Thomas: Yeah, yeah. Like I said, I'll play Switzerland on that one. But I mean, listen, it's amazing what the open source community has done, and we leverage open source. The idea that you open source something and then, in essence, close it in the future, that's a tough scenario for people who bet on this API that now everyone uses today. So how it all plays out, I'm not sure. There's a lot of money involved with that. And our viewpoint is, if there's tooling and APIs that we can leverage from the community, we'll use them and bring those to our customers.
Corey: I love making fun of companies doing different things. That is my part and parcel and I've got to say, I'm sorry, you're not immune either. I have to ask the burning question that's been on my mind since I first heard of you folks and was corrected on this. Why is CHAOSSEARCH all in caps?
Thomas: It's not as well thought out as you might think. I had the idea of naming a company chaos-something. And originally, before we were CHAOSSEARCH, we actually were called Chaos Sumo, wrestling the chaos. And as we were going after log analytics and we knew Sumo Logic was out there, we didn't want to be confused with them. So I thought, "CHAOSSEARCH, because we're searching the chaos, is really where the value is," the searching analytics. So I said, "Ah, let's rename the company CHAOSSEARCH." Do we call it Chaos Space Search? Do we call it CHAOSSEARCH? Do we do all lowercase? Do we do all caps? Really, we did all these variants really quickly on a piece of paper, and we looked at all three or four variants and we were like, "Oh, the capitals look pretty good. Let's go with that." And it was nothing more than that.
And so what we've been doing to get around the all caps concept is making the first part, chaos, bold. It seems to be working for us. But yeah, we get teased about that all the time, and it was more that it just looked good in the font we used, and that's why we chose it.
Corey: And sometimes that's all it takes is going down that path. But it, of course, opens you up to all kinds of criticism from the peanut gallery, by which, of course, I mean me. I mean, you have search in the name. Search a little harder, you can find the caps lock key to turn it off. And the counter response, of course, is that, "Oh wow, there's a caps lock key, that makes it way easier to type the name of the company. Great. Just great. It's cruise control for cool."
I know that asking for what's coming next is always perilous because the best laid plans, et cetera, et cetera. What's next for you folks? What are you focusing on in this year of our Lord 2020?
Thomas: Our big vision is to build out a new type of data platform and, as I mentioned, we went to market last year going after the big scale of log analytics, supporting the Elastic API and Kibana interface and solving those types of problems. But the vision is to have a true multi-model, real time database, the idea being that we deliver on that data lake philosophy and you have database tooling that is natural, that you're used to. So once we come out with this multi-model capability, we're going to go after the Athena use case. We hear a lot of complaints about the costs and the scale, and the idea is that you have one data source or multiple data sources via the Elastic API or SQL API. 2020 is going to bring out really the first true multi-model functionality in our database, which we're really excited about.
We had customers asking us all the time, "Can you support the Athena use case or the SQL Presto dialect?" And that's what we're going to do. We're going to first offer it to our log customers and then start growing the business into BI and ad hoc analytics, all on S3.
We do have a plan for later this year to go multi-cloud. We've been pulled both from Microsoft and Google and which one we do next, I'll keep that a secret, but we will be coming out with a multi-cloud thing later in 2020.
Corey: One problem I see in the world of cloud billing historically has been that when you switch to a consumption based model, people don't know what something's going to cost. And sure, they like to whine and cry and complain about it. But with a lot of systems with significant storage volume, you could run a query that costs tens of thousands of dollars without knowing it in advance. Driving down the overall cost of querying these things is, of course, incredibly valuable and helpful. But what are you seeing in the space as far as addressing that from a larger perspective? When you have an internal application where you run a query and it costs $20,000 when someone hits submit, first, even attributing that back to that one query is a super hard problem. But the gold standard people are going for is a pop-up of, "Hey, if you run that query, it's going to cost you a giant pile of money. Continue, yes or no?"
So there is that problem of doing the cost attribution of querying interestingly large datasets. Is that something that's on your roadmap at all? Is that not something you're seeing in your customers?
Thomas: So I was holding that one back. So yes, we actually have something that we're coming out with to make our system offer both upfront storage type pricing as well as consumption based pricing. And that's very novel and unique in the logging space; it's not so much in more of the BI space.
And as part of that offering, as we start supporting a consumption based model, we actually know the cost of a query. And so we're going to have tooling in our user interface where you can say, "Oh, this is going to cost me X amount of dollars over these hundreds of terabytes. Do I want to do it?" Or maybe Susie user can do that type of query, but maybe Bobby can only do short time queries for this amount of cost, and have a whole billing construct within our user experience to keep those costs down.
Now, to your point earlier, we've cracked the code on reducing that cost dramatically. But imagine if your cost of indexing was virtually free and it's all based on queries, but your queries are cost effective as well as intelligent. And I'm coming out and saying we are coming out with a feature that will be consumption based, and the user will know and control how much data is queried and what it's going to cost.
Corey: To be clear, when you say that it predicts the cost and tells you what it runs in advance, is that the cost for CHAOSSEARCH? Is that the infrastructure cost underlying what's going on? Or is it both?
Thomas: So clearly we'll have a margin within-
Corey: Well, of course, I'm not suggesting you should.
Thomas: Yeah, it'll be the cost of the query. So we want to not only be disruptive in the classic log pricing model, where everything is from $100 per gigabyte and up and we're currently $10 to $15 per gigabyte, which is very disruptive in this market. Instead of $5 per query per terabyte, or $1, I'm not going to say what we're going to be, we're going to be dramatically cheaper than that. And the idea that, "Oh, these 10 terabytes in one query are going to cost me X," you can control that. You can set policies that make sure that you only use what you want to pay for.
And not necessarily credit based, because that can get complicated. I know other vendors do a credit based model where you put up a cost upfront and then you eat into that. That's hard to deal with. This is going to be a lot more driven by your controls and what you want to do. So it's going to be significant, it's going to be disruptive, and hopefully the customers are going to like it. And it's kind of a la carte: maybe you want to choose full upfront, by storage, or maybe usage based is the way you want to go.
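The "know the cost before you run it, and set a policy per user" idea can be sketched as a simple pre-flight check. This is an assumption-laden illustration, not CHAOSSEARCH's billing logic: the per-terabyte rate reuses the $5 comparison figure mentioned above purely as a placeholder, and the user names, budgets, and function names are hypothetical.

```python
from decimal import Decimal

# Placeholder rate: the "$5 per terabyte" figure is the comparison point
# mentioned in the conversation, not an actual CHAOSSEARCH price.
PRICE_PER_TB = Decimal("5.00")

# Hypothetical per-user spending limits, in dollars per query.
QUERY_BUDGETS = {"analyst": Decimal("100.00"), "intern": Decimal("10.00")}


def estimate_cost(terabytes_scanned):
    """Estimated dollar cost of a query that scans the given volume."""
    return Decimal(str(terabytes_scanned)) * PRICE_PER_TB


def check_query(user, terabytes_scanned):
    """Return (allowed, estimated_cost) under the user's per-query budget."""
    cost = estimate_cost(terabytes_scanned)
    # Users without a configured budget are denied by default.
    budget = QUERY_BUDGETS.get(user, Decimal("0"))
    return cost <= budget, cost
```

With these placeholder values, a 10 terabyte query would be allowed for the hypothetical `analyst` and refused for the `intern`, which is the kind of "continue, yes or no?" gate Corey describes. `Decimal` is used so that currency arithmetic does not pick up floating point rounding.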
Corey: Having a variety of different options is always a good direction to go in as far as meeting customer requirements. Everyone has a different use case and everyone wants to express that in different ways. There are few things more frustrating than when a vendor's pricing model doesn't align with how you are intending to use the service.
Thomas: Yeah, I mean, here's a good example. We have people that come to us and say, "I have a hundred terabytes. I only have to query maybe once a week. It doesn't make any sense for me to stand up a huge system to do that." Consumption based makes a lot of sense there, right? Or when there's a denial of service attack, then you want to really hunt and figure out what's going on. But the rest of the time, the system's pretty idle.
Now if you're doing something more real time, where you're doing dashboarding or it's built into your application, okay, maybe consumption based is not the right pricing. But when you're doing a lot of ad hoc work or an investigation, or you need it when you need it but you don't need it when you don't, there's really no good solution out there in the market, particularly in log analytics.
Corey: So if people want to learn more about what you folks are up to, continue to follow your exploits, get annoyed at your unnecessary capitalization, where can they do that?
Thomas: Come look at CHAOSSEARCH.io. We have a whole bunch of material that talks about the platform. We're actually updating our content later this month with a whole bunch of detailed use cases and an ebook coming out. So come to our website, ask for a free trial. It's fully automated, you're up and running within five minutes on your S3. You can also set up a larger POC where we allow you to test out 250 gigabytes of data, which is actually pretty big for a free trial. And if you're a really big account, we have what we call a big POC, where you can call us up and, if you're looking to test out terabytes of data per day, we can work on that with you.
We have a lot of good blogs and a lot of good documentation, but sometimes just kicking the tires is the best way to learn and our free trial is probably the quickest way to learn what we do.
When you first log in, it's quite unique. We're the first company that starts with your storage, and not just the idea that you dump data into them and then you start playing with the product. The product starts when you first log into your storage. We have a lens into your S3, and we have a refinery to create different viewpoints. And then we have Kibana, your favorite visualization tool in ELK, to do your analytics. And we've automated the process from raw data to insights, really best of breed.
Corey: Well, thank you so much for taking the time to speak with me today. I appreciate it.
Thomas: Thank you.
Corey: Thomas Hazel, founder and CTO of CHAOSSEARCH. To learn more, visit CHAOSSEARCH.io. I am Corey Quinn, this is Screaming in the Cloud.
If you've enjoyed this podcast, please leave it a five star review in Apple Podcasts. If you've hated this podcast, please leave it a five star review in Apple Podcasts and tell me what my problem is.
Announcer: This has been this week's episode of Screaming in the Cloud. You can also find more Corey at screaminginthecloud.com or wherever fine snark is sold.
This has been a HumblePod production. Stay humble.