The Podcast
10.23.2019
The Power of Time Series Databases with Paul Dix
About Paul Dix
Paul Dix is the creator of InfluxDB. He has helped build software for startups, large companies and organizations like Microsoft, Google, McAfee, Thomson Reuters, and Air Force Space Command. He is the series editor for Addison Wesley’s Data & Analytics book and video series. In 2010 Paul wrote the book Service-Oriented Design with Ruby and Rails for Addison Wesley. In 2009 he started the NYC Machine Learning Meetup, which now has over 7,000 members. Paul holds a degree in computer science from Columbia University.


Transcript
Announcer:  Hello, and welcome to Screaming in the Cloud with your host, cloud economist Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.


Corey:  Welcome to Screaming in the Cloud. I'm Corey Quinn. This week's episode of Screaming in the Cloud is sponsored by InfluxData, makers of InfluxDB. As a part of that sponsorship, they have generously provided one of their co-founders to have this conversation. Paul Dix, welcome to the show.


Paul:  Thanks, Corey. Glad to be here.


Corey:  Thank you for taking the time to entertain my ridiculous nonsense. It's always appreciated. I guess where I want to start on this is for... Let's begin at the very start of all of this. You are makers of the premier offering in the world of time series databases. For those of us whose platonic ideal of a database is Route 53, what is a time series database, and why might I need one?


Paul:  Yeah. So, a time series database is basically just a database that's optimized for a specific kind of workload, which is time series data. Now, the thing that makes time series data different from, say, reference data that you keep in a relational database is that it's largely an append-only workload. Right? You have new data arriving all the time. You're not updating previous records, and when you query the data, you're frequently querying large ranges of data to compute summaries, like what was the min value in these five-minute increments for the last four hours, or something like that.
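
The kind of range query Paul describes (the minimum value in five-minute increments over the last four hours) can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how InfluxDB implements it; the function name `windowed_min` and the sample points are hypothetical.

```python
def windowed_min(points, window_s, start, end):
    """Return {window_start: min value} for points with start <= t < end.

    points is an iterable of (unix_seconds, value) pairs.
    """
    buckets = {}
    for t, value in points:
        if start <= t < end:
            # Align the timestamp to the start of its window.
            w = start + ((t - start) // window_s) * window_s
            if w not in buckets or value < buckets[w]:
                buckets[w] = value
    return dict(sorted(buckets.items()))


# Hypothetical samples: (unix_seconds, value)
points = [(0, 5.0), (100, 3.0), (299, 4.0), (300, 7.0), (650, 1.0), (20000, 0.5)]

FIVE_MINUTES = 5 * 60
FOUR_HOURS = 4 * 60 * 60

result = windowed_min(points, FIVE_MINUTES, start=0, end=FOUR_HOURS)
print(result)  # {0: 3.0, 300: 7.0, 600: 1.0}
```

The last sample falls outside the four-hour range, so it does not contribute to any window.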


Paul:  Um, so you can certainly use other types of databases to store time series data. You can use relational databases or other NoSQL databases, but for... for time series data specifically, there are optimizations that you can make to deal with the very high rate of ingest, the query workloads, which are very, very different, and one other... A few other things.


Paul:  Uh one, which is it's very common in time series to have your high-precision data that you keep around for a limited window of time like, say, I'm going to keep all my raw data for seven days, and then you want to uh summarize it or downsample it and, say, keep those summarizations around for longer periods of time like three months or a year, whatever.


Paul:  Uh and a good time series solution database will basically uh handle that data management life cycle for you automatically, evicting the high-precision data, downsampling the other data. Um so the eviction I think is actually really interesting from a database perspective uh when you think about relational databases. Right?
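
As a rough sketch of the life cycle described here, the snippet below keeps raw points for seven days and separately computes hourly means that could be kept for longer. The function names, the choice of the mean as the summary, and the sample data are all assumptions made for illustration; they are not taken from InfluxDB.

```python
DAY = 24 * 60 * 60
HOUR = 60 * 60


def downsample_mean(raw, window_s):
    """Summarize (t, value) points as the mean value per fixed window."""
    sums = {}
    for t, v in raw:
        w = (t // window_s) * window_s
        total, count = sums.get(w, (0.0, 0))
        sums[w] = (total + v, count + 1)
    return {w: total / count for w, (total, count) in sorted(sums.items())}


def apply_retention(raw, now, retention_s):
    """Keep only the points that are newer than the retention cutoff."""
    cutoff = now - retention_s
    return [(t, v) for t, v in raw if t >= cutoff]


# Hypothetical raw samples: (unix_seconds, value)
raw = [(0, 10.0), (1800, 20.0), (3600, 30.0), (8 * DAY, 40.0)]

hourly = downsample_mean(raw, HOUR)            # long-lived summaries
recent_raw = apply_retention(raw, now=8 * DAY, retention_s=7 * DAY)

print(hourly)      # {0: 15.0, 3600: 30.0, 691200: 40.0}
print(recent_raw)  # [(691200, 40.0)]
```

The summaries survive after the raw points they were computed from have been evicted, which is the trade Paul describes: less precision for older data in exchange for less storage.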


Paul:  So in the naive case, if you're going to evict your time series data and, say, you want to keep it around for just a day, the naive way of doing this is that every time I do a write, I have to delete the oldest data point, right? Because if I'm ingesting at a fixed rate, then I know that for every write that goes in, there's a delete happening.


Paul:  And regular databases aren't designed for this workload. They actually assume that you want to keep most of your data around for all time, so deletes actually are expensive. So time series databases do things like that. They optimize the ingest, the eviction of high-precision data, the downsampling, and the summarizations that you might want to do in real time.
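
Paul does not say here how eviction is made cheap. One common approach, shown below purely as an assumption and not as a description of InfluxDB's internals, is to group points into time-based partitions and drop a whole partition once every point in it has expired, instead of deleting one point per write. The class name `PartitionedStore` and its parameters are hypothetical.

```python
class PartitionedStore:
    """Toy store that groups points by time partition for cheap eviction."""

    def __init__(self, partition_s, retention_s):
        self.partition_s = partition_s
        self.retention_s = retention_s
        self.partitions = {}  # partition start time -> list of (t, value)

    def write(self, t, value):
        # Append-only: a write never triggers a per-point delete.
        start = (t // self.partition_s) * self.partition_s
        self.partitions.setdefault(start, []).append((t, value))

    def evict(self, now):
        """Drop every partition whose newest possible point is past retention."""
        cutoff = now - self.retention_s
        expired = [s for s in self.partitions if s + self.partition_s <= cutoff]
        for s in expired:
            del self.partitions[s]  # one cheap drop instead of many deletes
        return len(expired)


store = PartitionedStore(partition_s=3600, retention_s=86400)
for t in (0, 10, 3599, 3600, 90000):
    store.write(t, 1.0)

dropped = store.evict(now=90000)
print(dropped)                   # 1
print(sorted(store.partitions))  # [3600, 90000]
```

The cost of this design is that data is retained slightly longer than strictly required, since a partition only goes away once its newest point has expired.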


Corey:  So I played a little bit with things like this once upon a time in my first life as a network admin. We ran Cacti to manage a lot of these things. For those who've never used Cacti, the primary purpose of that software is to sit on your network, be written in PHP, be exploited, and be used as an attack platform for the rest of your network.


Corey:   It displayed graphs using RRDtool or RRD-based tools along the lines of MRTG and a bunch of other similar products, and it's exactly what you described. As you look back further in time, the data gets less and less granular under the baseline assumption that you won't need to have that level of insight and visibility into things that happened a month ago as you might yesterday. Is that aligned with the same principles?


Paul:  Yeah, that's similar for the most part. There is an important distinction with RRD that I like to make, which is that when I think of time series data, I think of two types of data. There's what's called a regular time series, which is samples taken at fixed intervals of time like once every 10 seconds, or once a minute, or once an hour. And then, there are irregular time series, which are basically event streams. That could be individual requests to an API and their response times. That could be a container spinning up or shutting down, any sort of exception in an application, any sort of event.


Paul:  Now, RRD, and I guess its kind of spiritual successor, which would probably be Graphite, are based around storing regular time series data. So regular time series data is basically a summarization of some underlying distribution, or raw event stream, or whatever. So for example, if I store the response time for every request made into my API, that's an event stream. That's an irregular time series, but I can query that time series and say, "Give me the 95th percentile in 10-minute intervals for the last eight hours of time." And what you've done there is you've created a regular time series from the underlying irregular event stream.
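
To make the regular-versus-irregular distinction concrete, here is a small sketch that turns an irregular stream of response times into a regular series of 95th percentiles per 10-minute window. The nearest-rank percentile definition, the function names, and the sample events are assumptions chosen for illustration; a real database may compute percentiles differently.

```python
import math


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def regularize(events, window_s, pct):
    """Turn irregular (t, value) events into one percentile per fixed window."""
    windows = {}
    for t, v in events:
        windows.setdefault((t // window_s) * window_s, []).append(v)
    return {w: percentile_nearest_rank(vs, pct) for w, vs in sorted(windows.items())}


# Hypothetical API requests: (unix_seconds, response_time_ms), arriving irregularly
events = [(3, 120), (41, 95), (250, 400), (599, 110), (601, 80), (900, 90)]

TEN_MINUTES = 10 * 60
p95 = regularize(events, window_s=TEN_MINUTES, pct=95)
print(p95)  # {0: 400, 600: 90}
```

The input has one entry per request at arbitrary times; the output has exactly one value per 10-minute window, which is the regular series that tools like RRD store directly.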


Paul:  So essentially, when you think about putting data into RRD, you're summarizing your data before it ever goes into the database, um and with Influx, what we wanted to do from the very early stage was we wanted to be able to store the raw event stream as well as the summarizations.


Corey:  If we go back to when you said you made your first commit to what became InfluxDB back in 2013: I didn't notice it at the time, and if I had, I would have made blistering fun of it. It's, "Oh, you're building your own database engine. I know you, you're that guy from Hacker News come to life." And it sounds like something that only someone who's deranged would do, except for the fact that it worked. You've built a successful company. You are a name brand in the time series space, and you were clearly right about this. I mean, now we have other entrants to the space, which we'll get to in a little bit. But at the time, did it feel like you were potentially making a catastrophic mistake?


Paul:  It really didn't, and that's based on my experience from a few other times. So in 2010, I was working for a FinTech startup, and we had to essentially build a "time series database" or "time series solution" for tracking real-time pricing for like 200,000 different financial instruments. And the solution I built was basically Scala web services using Cassandra as the underlying long-term data store and Redis as a real-time indexing layer.


Paul:  Um and then uh, later on, when I built uh the first product that we built actually, which was a SaaS platform for real-time metrics and monitoring in the server monitoring space, uh I had to build a time series solution again so... and this was basically two completely different problem domains, but the solution was the exact same. Like in that case, for the first version of this, I did Scala web services on top of Cassandra with Redis, and then I built a next version of that API, uh which I used Go for, and this was in... I guess late 2012 is when I started development on that. So I used Go, and I used the LevelDB, which is an open-source library that Google had at the time, and I built this whole thing. So I essentially built a database for that API.


Paul:  And the thing that I realized through the process of doing this was that for solving this, this time series use case problem, you could use a general purpose database. You could use Cassandra or you could use MySQL or Postgres, whatever. But the thing is I had to write this like mountain of application code of web services code in Scala to make the whole thing work. And my feeling in 2013 was there's nobody focused on this exclusively.


Paul:  Graphite at that time was largely an abandoned project. Uh everybody you know who was using it was complaining about the fact that it wouldn't scale, uh so I thought, "Okay. Nobody is focused on this, but here I am." I've had to solve this problem multiple times in the past few years. I saw people at large companies trying to solve the problem themselves, and I saw all the monitoring companies doing the same thing, so I basically thought, "Here's you know here's something... Here is a need that isn't being served, uh so I might as well go do it."


Paul:  And I think the other important thing is like to think of the timing like... So in 2013, obviously, NoSQL was a big thing. And you know you have the different players in the space, and it wasn't obvious how that was going to shake out. Although, MongoDB was obviously already very popular, um but it wasn't like... I feel like over the last like two or three years, there's literally like a new time series database or a new database of some kind every other week that's on the front page of Hacker News. So I, I feel like there was a little bit less like new database fatigue in 2013 than there is now.


Corey:  Yeah. It's, it's one of those areas where there are so many different database options that... uh to rip off the ancient JWZ quote, it's, "Oh, I have a problem. I'll use regular expressions. Now, that way, I have two problems." It's, "Oh, I'll just write my own database." And invariably, in 2019, it feels like that's exactly the wrong direction to go in, but counterpoint, you folks have recently released Influx 2, so what's the story behind that?


Paul:  I mean, I guess... So first off, I should probably say that I'm a firm believer in what I call polyglot persistence, which is you use the right tool for the job and not every single persistence need is the same as others. Right? For some, you absolutely will need a relational-transactional database, and for other things, you won't need that. And all of this would be kind of a moot point if relational-transactional databases were infinitely scalable and infinitely performant, right? Then we would just use those. But that's not the case, right? You make trade-offs and optimizations based on your needs and the specific use case.


Paul:  So initially, you know with InfluxDB 1.X, that's what that was about. But we started with the database, and then we saw that there were other needs that people had in this time series use case. Right? And for, for time series, like what I realized is it's an abstraction that works well for solving problems in multiple domains. Right? Server monitoring, I mentioned. Financial market data is one, uh real time analytics, but also, sensor data of all kinds, be it industrial, oil and gas, wearables, consumer tech. All that kind of stuff.


Paul:  So people had other needs to solve these problems and to build applications on top of like this time series abstraction. They had to collect it. They had to store it and query it. They had to process it for either doing ETL for enrichment or for doing monitoring and alerting. Uh and then, finally, they had to summarize the data for human consumption either through visualization, or reporting, or other kinds of things.


Paul:  So as I built the company, I raised capital and hired developers to build these other pieces, and we learned a lot over that period of time over the last six years. And the thing I realized is that what I wanted is a platform that was easy for developers to use that kind of encapsulated all of that. Right now, in 1.X, we have four separate products. With 2.0, what we tried to do is combine them into one cohesive whole where there's a single API that is consistent, that's easy to use. There's a Swagger definition for it, and there's a user interface on top of it.


Paul:  And then, the last bit with uh 2.0 is... You know with InfluxDB 1, we had a query language that looks very much like SQL, uh and that's because I thought it would be easier for people to pick up. And it certainly was, but what we found is there were more complex like analytics and processing tasks that people wanted to accomplish that they wanted to push down into the database.


Paul:  And because they couldn't do that, a common pattern emerged where people would write code in you know whatever language they choose, right, like Python, or Ruby, or apparently your favorite, PHP. Uh then, uh they would query data out of the database, and do some post-processing, and then write data back into the database so that they could get it back into the tool chain for monitoring it, for visualizing it, and all these other things.


Paul:  So when we created 2.0, we decided, "Let's create a new language called Flux, which is not just a query language, but it's also a query planner, a query optimizer, and it's a scripting language so you can push down this kind of complex processing into the data platform." And as a language, we want it to be Turing complete and generally useful, but we also want it to be able to pull in data from sources outside of InfluxDB.


Paul:  As I mentioned, I'm you know I'm a firm believer in polyglot persistence, and what that means is... You know Influx is great for time series data, but it's not good for reference data, so we want to be able to pull in you know data from Postgres or MySQL, or from any sort of third-party API that you want to pull data in from. Right? You could hit GitHub for data that you could mix and match with your time series data. So basically 2.0 is the realization of you know collapsing those four components into one cohesive whole and creating a language that allows people to define really complex analytics and processing that the platform will just do for them.
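
The "mix and match" idea, time series data enriched with reference data from another store, can be illustrated without any particular database. In the sketch below a plain dictionary stands in for the reference data that would live in Postgres or MySQL. The host names, fields, and the `enrich` function are hypothetical, and this is not Flux code.

```python
# Stand-in for reference data that would live in a relational database.
reference = {
    "host-a": {"team": "payments", "region": "us-east"},
    "host-b": {"team": "search", "region": "eu-west"},
}

# Hypothetical time series points: (host, unix_seconds, cpu_usage)
points = [("host-a", 0, 0.42), ("host-b", 0, 0.91), ("host-c", 0, 0.10)]


def enrich(points, reference):
    """Attach reference attributes to each point, keyed by host."""
    enriched = []
    for host, t, value in points:
        meta = reference.get(host, {})  # unknown hosts keep only their own data
        enriched.append({"host": host, "time": t, "value": value, **meta})
    return enriched


enriched = enrich(points, reference)
for row in enriched:
    print(row)
```

The point of pushing this into the data platform, as Paul describes it, is that the join happens next to the data instead of in a separate script that reads points out and writes them back.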


Corey:  It seems almost like you're going through Hacker News to some extent and picking all the terrible ideas at once. So you just mentioned, for example, that you built a new query language called Flux. Um two issues.


Paul:  Mmhmm.


Corey:  One, writing your own language is always one of those things that's fraught with peril, but based upon what you demonstrated, I will absolutely extend credence to that that you're probably doing the right thing.


Corey:   But I will say that from my experience, working in tech for entirely too long, which is where I guess this bitter cynicism all has root, I find that whenever I have to learn a specific language to use a particular tool, it means tears before bedtime, and I want to wind up calling out a bunch of different companies that have done this, but it's unfair because it seems that every time I've dealt with this specific DSL, you have these problems.


Corey:   I'm still going to maintain that Kubernetes wrote their own custom DSL called YAML, which is so historically incorrect that I don't even know where to begin, but that's why I like saying those controversial things. What, I guess what made you decide to do this uh I guess in the face of historical terrible experiences with these?


Paul:  Yeah. By the way, I think Kubernetes was... That's not original to create a YAML DSL. I think they're just copying uh Spring and Struts who created a DSL and uh that...


Corey:  Once upon a time, we wound up adding Jinja to YAML, and that effectively turned into SaltStack's configuration language. Again, we're all code terrorists in our own way.


Paul:  Yeah, yeah. No. I mean, it's absolutely fair. I agree. Generally speaking, why would you create your own language? There are countless other languages out there. So basically, one option is we just go with SQL. Right? Well, one, creating an actual SQL-compliant SQL is really, really hard. It's a lot of work. Two, SQL isn't Turing complete. It's actually not a programming language. It's a declarative scripting language.


Paul:  Now, Microsoft's version of SQL has extensions that make it Turing complete. Oracle's version of SQL has the same. But then, again, you're not using standards-compliant SQL, and really, when you get down to it, every single major database has differences between what their versions of SQL are. So there's a standard, which is the lowest common denominator, and then when you get into more powerful query functionality, you end up getting into the specific database implementation.


Paul:  And as you mentioned, like there's so many tools that have their own languages. Basically, like I think any analytics tool in existence, whether it's log analytics, user analytics, business intelligence, marketing analytics, they all have their own custom query languages, uh and that hasn't, certainly hasn't stopped them from becoming popular, but let me speak to uh one thing about our specific journey to Flux that I think is relevant, which is you know in 2013, I created InfluxDB with this language, this query language that looked like SQL, but it wasn't actually SQL. It was different in ways that are actually frustrating if you're a SQL expert and you try to use it. Um but at the time, like tons of people picked it up because you know most people actually aren't writing SQL day-to-day. They are using their ORM.


Paul:  Now, I personally had a viewpoint probably in the fall of 2014 that the SQL style of writing queries was maybe not the best way to work with time series data, which I basically viewed as just like ordered streams of data coming through. And I thought a functional style language would actually be the better, the better way to represent the query style.


Paul:  Now, I wouldn't make... I was too afraid to make that change at the time, but when we introduced our processing agent Kapacitor, which is there for background ETL and monitoring, alerting, and real-time processing, when we introduced that in September of 2015, it had a language that was more functional in nature, so we again made the foolish mistake of creating a language, and we actually made not just that mistake, but the other mistake, which is we created a platform that now had two separate languages that were custom. One for interactive querying and one for background processing.


Paul:  And the language itself, called TICKscript, actually looks like nothing else that you've probably ever seen. It's very, very strange. But over the last... Was it three and a half, four years? A surprising number of people have actually adopted it, and a surprising number of people have written very complex TICKscripts despite very serious gaps in the functionality it should provide as a language and some gaps in what I call developer ergonomics, which is the experience of actually writing code in it, and developing and testing things.


Paul:  Um but it has this like fan base that uses it, and they get a lot of value out of it, so I thought, "Well, there must be something there because if those people are willing to suffer the pain of using this thing that I can see all sorts of like horrible warts on, there's something worth you know putting more effort into." And when we went to create Flux, it's... You know we wanted something that could be used for background processing as well as interactive querying, and the, the choices then were... You know we knew we couldn't use SQL for the reasons I mentioned, so at this point, it's either, "Do we use an embedded language like Lua, or do we create our own?"


Paul:  Now, Lua obviously like there are very mature implementations of it. It would have been way easier to just use that. My problem is like I don't think Lua has enough popularity and that people are familiar enough with it. I think the learning curve is too high for people to adopt Lua like it's, it's just not that easy for regular developers to use.


Paul:  So the other thing we wanted was we wanted to be able to control the tooling around the language. We want to, ultimately, we want to create an experience that has a UI in it that allows you to create these Flux scripts without actually writing Flux code. Right? So point-and-click interfaces that describe like data flows of different time series data that you're collecting that output you know monitoring/alerting rules or all sorts of other things. Um so we want to be able to control the language.


Paul:  And then, the next thing I did was I thought, "Okay. How, you know if we're going to do a new language, it has to be easy to use. It has to be easy to pick up," so we intentionally made it look like JavaScript, which... You know plenty of people hate on JavaScript, but the fact is it's probably the most widely used programming language in the world. Even people who don't write JavaScript day-to-day are usually pretty familiar with it.


Paul:  You can look at the code and kind of understand what's going on, um and we said like... The truth is like the learning curve in this thing is going to be the API, and the API learning curve would be there regardless if we had written... you know if we had used Ruby as the starting point, Lua. All of those other things. Like the API is the biggest surface area. The surface area of the actual syntax of the language, that can be covered uh in 15 minutes by reading a getting-started guide that shows you the basic pieces of it.


Paul:  So that's you know that's the bet we're making. Obviously, you know we just launched 2.0 as a cloud product. The open-source product is still... The open-source build is in alpha right now. We've just released a new alpha release, so it's really too early to tell what's going to happen, but the thing I... The joke I like to make is I'll either be spectacularly wrong or spectacularly right, but there probably won't be a middle ground.


Corey:   No, it's fair. And I think we're going to see one way or another. It's an interesting space, and we're seeing a lot of emergence coming out of it, which I guess gets me to one issue that has been a recurring theme on this show: the idea that multi-cloud is generally not a terrific direction to go in. Pick a vendor and go all in.


Corey:  The challenge is that you're already going to be locked in to whatever it is you choose, unless you're spending an awful lot of time working around that to no real benefit, but understand where that lock-in comes from. From that perspective, if something gets built on top of Influx, is that fundamentally locked in from a data model perspective? Is there lock-in being driven from a "once you start paying, you never ever, ever get to leave" Oracle-esque model? How does that story play out as far as adoption and implementation go?


Paul:  Yeah, so we have the open-source InfluxDB, which is basically a single server. You can use that obviously. It's MIT-licensed with no restrictions, so you can do whatever you want with that. If you want to make your own new version of Influx and fork it, go for it. That's up to you. Our cloud product is basically the exact same API, the exact same user interface. We don't yet have bulk export and bulk import of data, but our goal is to have seamless data transitions between open-source nodes at the edge and our cloud product running in whichever provider you choose. Right?


Paul:  Right now, we're just in AWS, but soon, we'll be in GCP and Azure. And ultimately, we don't want to hold your data hostage. The data model of InfluxDB is simple enough that you can represent it pretty easily. Like trivially, you can represent it in any relational database, and you can also represent it in Cassandra, or HBase, or whatever. But basically, that open-source component is ideally the thing that gives you the feeling that you don't have lock-in, but I agree with you in the sense that once you've invested a certain amount of time and effort into a piece of infrastructure and particularly into a provider that's hosting your data for you, there's lock-in just by virtue of the fact that you don't want to spend the time and money to move off of it.
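
As a rough illustration of the claim that the data model maps onto a relational database, the sketch below stores points made of a measurement, a set of tags, named fields, and a timestamp in a single SQLite table. The table layout, the canonical tag string, and the restriction to numeric field values are simplifying assumptions for this example, not an official schema or export format.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE points (
        measurement TEXT    NOT NULL,
        tags        TEXT    NOT NULL,  -- canonical "key=value,key=value" string
        field       TEXT    NOT NULL,
        time_ns     INTEGER NOT NULL,
        value       REAL,
        PRIMARY KEY (measurement, tags, field, time_ns)
    )
    """
)


def tag_string(tags):
    """Serialize tags with sorted keys so equal tag sets compare equal."""
    return ",".join(f"{k}={v}" for k, v in sorted(tags.items()))


def write_point(measurement, tags, fields, time_ns):
    """Store one row per field of the point."""
    rows = [
        (measurement, tag_string(tags), name, time_ns, value)
        for name, value in fields.items()
    ]
    conn.executemany("INSERT INTO points VALUES (?, ?, ?, ?, ?)", rows)


write_point("cpu", {"host": "a", "region": "us"}, {"usage": 0.5, "idle": 0.5}, 1)
write_point("cpu", {"region": "us", "host": "a"}, {"usage": 0.7}, 2)

rows = conn.execute(
    "SELECT time_ns, value FROM points "
    "WHERE measurement = ? AND tags = ? AND field = ? ORDER BY time_ns",
    ("cpu", "host=a,region=us", "usage"),
).fetchall()
print(rows)  # [(1, 0.5), (2, 0.7)]
```

Being able to express the model this way says nothing about performance; as discussed earlier in the conversation, the ingest and eviction patterns are where a general-purpose database struggles.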


Paul:  And the other thing about data is that it has gravity. It's not free to move from one place to another. And particularly in our use case, we're talking about large amounts of data, so that becomes a thing that you actually have to pay attention to. So ultimately, people want to feel like there's no lock-in, but if it comes time to, say, switch cloud providers, are you really going to halt all feature development for six months while you do this lift and shift over to another cloud provider that provides zero customer value? Right? The main thing you need is the threat of moving to another cloud provider to give you a pricing package.


Corey:  Yeah, and there are mixed reports as to how well that actually works, but it does raise an interesting question. Um one of the easiest jobs in the world has got to be running product strategy at AWS because you're just a post-it note that says, "Yes," on it there. There's really no thing that I would put past them building at this point in time.


Corey:  And to that end, they have announced their own time series offering called Amazon Timestream, which... It sounds almost like it can manipulate time itself, which it probably should because it was announced at re:Invent last year, and we're about to hit re:Invent this year, and it still hasn't been released. So it's like Influx without those pesky customers. So I don't know what the story there is, but more interesting to this conversation and germane to what you're doing and what you're building, what is it like when Amazon enters your market, when they come to crush you for lack of a better term?


Paul:  Yeah, so when that got announced last year, it wasn't actually entirely unexpected. I knew it was just a matter of time. It's obviously concerning. That's always the question: "What if so-and-so comes to build your product?" I mean, the things I take comfort in are basically that Amazon isn't guaranteed to win in every single market they enter, and it's not necessarily the case that it's a winner-take-all situation for every single product. Right?


Paul:  So for example, Amazon entered Elasticsearch. Right? They have Elasticsearch hosting, and by all accounts, they make far more money uh at it than Elastic does, but last I checked, Elastic's market cap was pretty big and they're doing pretty well despite the fact that Amazon has come for them and is trying to crush them, right, and has forked their distribution, even though they don't call it a fork.


Paul:  So I realized that there are things we can still do to try and deliver customer value that's outside of what Amazon does. Right? Like we're not going to be able to buy server time, and memory, and network bandwidth cheaper than they can, but hopefully, we can provide a developer experience that's better. We can provide a user experience that's better.


Paul:  So when I think about Timestream, which is their soon-to-come time series database in the cloud, that's just one component of what Influx 2.0 does. Query and storage is just one piece. If you were going to try to cobble together what Influx 2 does by yourself, you'd have to take Kinesis paired with Lambda, paired with Timestream, paired with S3, paired with some sort of visualization engine. I forget what Amazon's is, but they seem like...


Corey:  Uh QuickSight, and that also has no customers because it's like Tableau, but crappy, and it turns out that's not the most compelling market to butcher.


Paul:  Yeah, so that's the other thing that I've heard you say and I've heard plenty of other people say, which is that Amazon competes very fiercely on basically infrastructure, and cost, and scale, and stuff like that, but they, for one reason or another, just don't see fit to build user experiences that are compelling and UIs that are compelling. So that's one thing that we continue to invest heavily in: the UI, and the API, and how those things work together.


Corey:  Yeah, and it seems that a number of the higher level differentiated services, the user experience lacks a certain polish. I think that's something that only comes with time for starters, but it also seems that it's not a high priority. And when you're dealing with a tool like this where you're going to be spending not inconsiderable amounts of time gazing into it, that experience should be reasonable and polished. And the idea that someone should be able to go from, "I've never heard of this thing before," to using it effectively should be measured in hours, not weeks, and I think that that's a lesson that sometimes gets lost.


Paul:  Yeah. I mean, our goal is to measure it in minutes and hopefully seconds.


Corey:   Exactly. Pictures are worth a thousand words, as they say, and graphs, on the other hand, let you figure out exactly how many words each picture is worth if you wind up getting your axes and calibration done appropriately.


Paul:  Mm-hmm (affirmative).


Corey:  So what's coming next if you have anything to share as far as what Influx is doing? What's interesting that people should keep an eye out for? What does the future hold?


Paul:  So we recently launched monitoring and alerting features inside our cloud 2.0 product, and that basically turns Influx 2 into a full monitoring/alerting platform in addition to a time series database and an agent that can collect data. Within Flux the language, what I'm most excited about is basically packages. Right? The ability for users of the system to create their own packages and share them with other people, and those packages could be bits of Flux source code, so they're shared like you would on NPM, or RubyGems, or Crates.


Paul:  And then, the other piece is packages that allow people to share essentially entire application experiences, which could be dashboards that you see. It could be drill-downs that you can do within your time series data, and all of that is kind of scoped to the structure and schema of what data looks like inside of InfluxDB, so that packaging thing is something I'm excited about.


Paul:  And then, the last bit is... Obviously, with InfluxDB, all the core components are open-source, and we really need to drive towards getting open-source InfluxDB 2.0 into beta. And for that, what we need is basically the GA of 1.0 of Flux, the language, and we need the compatibility layers so that users of InfluxDB 1.X can point to InfluxDB 2 and work with it as though it's a 1.X server, and we need the migration tooling. And then, after we're in the beta, it's all about performance testing, robustness, and getting to the point where we can get open-source InfluxDB 2.0 to general release.


Corey:  Got it. Well, it sounds like there's going to be some interesting stuff coming up, and I'm very curious to see how that winds up manifesting in the marketplace and seeing it in increasing numbers of environments. I want to thank you for taking the time to speak with me today. If people want to learn more about Influx, about you, your sage thoughts on things that people should and absolutely should not do, where can they find you?


Paul:  Uh so Influx, you can find at InfluxData or on Twitter as @InfluxDB, and I can be found on Twitter as @PaulDix.


Corey:  Thanks so much for taking the time to speak with us today. I appreciate it. Paul Dix, founder of InfluxData, makers of InfluxDB. I'm Corey Quinn. This is Screaming in the Cloud.


Announcer:  This has been this week's episode of Screaming in the Cloud. You can also find more Corey at screaminginthecloud.com or wherever fine snark is sold.


Announcer:  This has been a HumblePod production. Stay humble.