Let’s talk time series
Here’s a riddle: what do self-driving Teslas, autonomous Wall Street trading algorithms, smart homes, transportation networks that fulfill lightning-fast same-day deliveries, and an open-data-publishing NYPD have in common?
For one, they are signs that our world is changing at warp speed, thanks to our ability to capture and analyze ever more data, ever faster.
But if you look closely, you’ll notice that each of these applications requires a special kind of data:
Self-driving cars continuously collect data about how their local environment is changing around them.
Autonomous trading algorithms continuously collect data on how the markets are changing.
Our smart homes monitor what’s going on inside of them to regulate temperature, identify intruders, and respond to our beck and call (“Alexa, play some relaxing music”).
Our retail industry monitors how their assets are moving with such precision and efficiency that cheap same-day delivery is a luxury that many of us take for granted.
The NYPD tracks its vehicles to allow us to hold them more accountable (e.g., for analyzing 911 response times).
These applications rely on a form of data that measures how things change over time. Where time isn’t just a metric, but a primary axis.
This is time-series data. And it’s starting to play a larger role in our world.
Software developer usage patterns already reflect this. In the past 24 months, time-series databases (TSDBs) have emerged as the fastest-growing category of databases:
As the developers and heavy users of an open-source time-series database, we are often asked about this trend. Typically, we get these three questions:
What is time-series data?
When would I need a time-series database?
Why should I use (or not use) TimescaleDB?
What is time-series data?
Some think of “time-series data” as a sequence of data points, measuring the same thing over time, stored in time order. That’s true, but it just scratches the surface.
Others may think of a series of numeric values, each paired with a timestamp, defined by a name and a set of labeled dimensions (or “tags”). This is perhaps one way to model time-series data, but not a definition of the data itself.
Let’s go a little deeper.
Here’s a basic illustration. Imagine sensors collecting data from three settings: a city, farm, and factory. In this example, each of these sources periodically sends new readings, creating a series of measurements collected over time.
Here’s another example, with real data from the City of New York, showing every taxicab ride for the first few seconds of 2016. As you can see, each row is a “measurement” collected at a specific time:
There are many other kinds of time-series data. To name a few: DevOps monitoring data, mobile/web application event streams, industrial machine data, scientific measurements.
These datasets primarily have 3 things in common:
The data that arrives is almost always recorded as a new entry
The data typically arrives in time order
Time is a primary axis (time-intervals can be either regular or irregular)
In other words, time-series data workloads are generally “append-only.” While they may need to correct erroneous data after the fact, or handle delayed or out-of-order data, these are exceptions, not the norm.
You may ask: How is this different from just having a time field in a dataset?
Well, it depends: how does your dataset track changes? By updating the current entry, or by inserting a new one?
When you collect a new reading for sensor_x, do you overwrite your previous reading, or do you create a brand new reading in a separate row? While both methods will provide you the current state of the system, only by writing the new reading in a separate row will you be able to track all states of the system over time.
Simply put: time-series datasets track changes to the overall system as INSERTs, not UPDATEs.
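To make the contrast concrete, here’s a minimal sketch of the two approaches, using Python’s built-in sqlite3 module and a hypothetical sensor_x reading (table and column names are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Approach 1: UPDATE in place -- only the latest state survives.
cur.execute("CREATE TABLE sensor_state (sensor TEXT PRIMARY KEY, reading REAL)")
cur.execute("INSERT INTO sensor_state VALUES ('sensor_x', 21.5)")
cur.execute("UPDATE sensor_state SET reading = 22.1 WHERE sensor = 'sensor_x'")
# The 21.5 reading is now gone forever.

# Approach 2: INSERT a new row per reading -- every state is preserved.
cur.execute("CREATE TABLE sensor_readings (sensor TEXT, ts TEXT, reading REAL)")
cur.execute("INSERT INTO sensor_readings VALUES ('sensor_x', '2016-01-01 00:00:00', 21.5)")
cur.execute("INSERT INTO sensor_readings VALUES ('sensor_x', '2016-01-01 00:00:10', 22.1)")

# Both approaches give you the current state of the system...
current = cur.execute(
    "SELECT reading FROM sensor_readings ORDER BY ts DESC LIMIT 1").fetchone()[0]
# ...but only the append-only table lets you ask how the reading changed.
history = cur.execute(
    "SELECT reading FROM sensor_readings ORDER BY ts").fetchall()
print(current)  # 22.1
print(history)  # [(21.5,), (22.1,)]
```

The first table can only ever answer “what is the reading now?”; the second can also answer “how did the reading change?”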
This practice of recording each and every change to the system as a new, different row is what makes time-series data so powerful. It allows us to measure change: analyze how something changed in the past, monitor how something is changing in the present, predict how it may change in the future.
And so, here’s how we define time-series data: data that collectively represents how a system/process/behavior changes over time.
This is more than just an academic distinction: by centering our definition around “change”, we can start to identify time-series datasets that we aren’t collecting today, but that we should be collecting.
In fact, what we’ve found is that often people have time-series data but don’t realize it.
Imagine you maintain a web application. Every time a user logs in, you may just update a “last_login” timestamp for that user in a single row in your “users” table. But what if you treated each login as a separate event, and collected them over time? Then you could track historical login activity, see how usage is increasing or decreasing over time, bucket users by how often they access the app, and more.
This example illustrates a key point: by preserving the inherent time-series nature of our data, we are able to preserve valuable information on how that data changes over time.
(In fact, this example illustrates another point: event data is also time-series data.)
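As a sketch of that idea (the event data here is made up for illustration), compare the single “last_login” value with an append-only log of login events — the former is just a derived view over the latter, while the latter also answers new questions:

```python
from collections import Counter
from datetime import datetime

# Instead of overwriting users.last_login, record each login as an event.
login_events = [
    ("alice", datetime(2017, 9, 1, 9, 0)),
    ("alice", datetime(2017, 9, 2, 9, 5)),
    ("alice", datetime(2017, 9, 3, 8, 55)),
    ("bob",   datetime(2017, 9, 1, 14, 0)),
]

# "last_login" becomes a derived view over the event log...
last_login = {}
for user, ts in login_events:
    last_login[user] = max(ts, last_login.get(user, ts))

# ...and new questions become answerable, e.g. bucketing users by frequency.
logins_per_user = Counter(user for user, _ in login_events)
buckets = {user: ("daily" if n >= 3 else "occasional")
           for user, n in logins_per_user.items()}
print(last_login["alice"])  # 2017-09-03 08:55:00
print(buckets)              # {'alice': 'daily', 'bob': 'occasional'}
```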
Of course, storing data at this resolution comes with an obvious catch: time-series data piles up very quickly. And all that data creates problems, both for recording it and for querying it in a performant way.
Which is why people are now turning to time-series databases.
Why do I need a time-series database?
You might ask: Why can’t I just use a “normal” (i.e., non-time-series) database?
The truth is that you can, and some people do:
Yet why do the majority of respondents to that survey use a time-series database instead of a normal one? And why are TSDBs the fastest-growing category of databases today?
Two reasons: (1) scale and (2) usability.
Scale: Time-series data accumulates very quickly. (For example, a single connected car will collect 25GB of data per hour.) And normal databases are not designed to handle that scale: relational databases fare poorly with very large datasets; NoSQL databases fare better at scale, but can still be outperformed by a database fine-tuned for time-series data. In contrast, time-series databases (which can be based on relational or NoSQL databases) handle scale by introducing efficiencies that are only possible when you treat time as a first-class citizen. These efficiencies result in performance improvements, including: higher ingest rates, faster queries at scale (although some support more queries than others), and better data compression.
Usability: TSDBs also typically include functions and operations common to time-series data analysis: data retention policies, continuous queries, flexible time aggregations, etc. Even if scale is not a concern at the moment (e.g., if you are just starting to collect data), these features can still provide a better user experience and make your life easier.
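For a flavor of what “flexible time aggregation” means, here’s a plain-Python sketch of bucketing raw readings into one-minute averages. (The `time_bucket` helper below is our own illustration; a TSDB typically offers this kind of operation as a built-in, so you don’t write it by hand.)

```python
from collections import defaultdict
from datetime import datetime, timedelta

def time_bucket(ts: datetime, width: timedelta) -> datetime:
    """Round a timestamp down to the start of its bucket."""
    return datetime.min + (ts - datetime.min) // width * width

readings = [
    (datetime(2016, 1, 1, 0, 0, 5), 10.0),
    (datetime(2016, 1, 1, 0, 0, 45), 14.0),
    (datetime(2016, 1, 1, 0, 1, 30), 20.0),
]

# Group each reading into its one-minute bucket, then average per bucket.
buckets = defaultdict(list)
for ts, value in readings:
    buckets[time_bucket(ts, timedelta(minutes=1))].append(value)

averages = {b: sum(v) / len(v) for b, v in sorted(buckets.items())}
print(averages)  # one-minute buckets: 00:00 -> 12.0, 00:01 -> 20.0
```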
This is why developers are increasingly adopting time-series databases and using them for a variety of use cases:
Monitoring software systems: Virtual machines, containers, services, applications
Monitoring physical systems: Equipment, machinery, connected devices, the environment, our homes, our bodies
Asset tracking applications: Vehicles, trucks, physical containers, pallets
Financial trading systems: Classic securities, newer cryptocurrencies
Eventing applications: Tracking user/customer interaction data
Business intelligence tools: Tracking key metrics and the overall health of the business
But even then, you’ll need to pick a time-series database that best fits your data model and write/read patterns.
Why should I use (or not use) TimescaleDB?
If you do need a time-series database, there are already quite a lot of options out there. You may be happy with one of them. But we weren’t.
Why weren’t we satisfied with the state-of-the-art? Because we wanted the power of full SQL at scale, which none of the existing options provided.
Specifically, we found that existing time-series databases:
Underperformed on many of our queries (read: high latencies)
Would not even support many other queries (varies by database)
Required us to learn a new query language (read: not SQL)
Would not work with most of our existing tools (read: poor compatibility)
Required us to silo our data into two databases: a “normal” relational one, and a second time-series one (read: operational and development headaches)
So we built TimescaleDB because we needed it. And then other people wanted to use it, so earlier this year we open-sourced it under an Apache 2 license.
When should you consider TimescaleDB? If you want:
A normal SQL interface to time-series data, even at scale
Operational simplicity: one database for your relational and time-series data
JOINs across relational and time-series data at query time
PostgreSQL! (Timescale looks and acts just like PostgreSQL)
Query performance, especially for a broad, varied set of queries (via robust secondary index support)
Native support for geospatial data (via PostGIS compatibility)
Third-party tools: Timescale supports anything that speaks SQL, including BI tools like Tableau
Then again, if any of the following is true, you might not want to use TimescaleDB:
If you have only simple query patterns (e.g., key-value lookups, one-dimensional rollups). Other database options are more optimized for these types of queries.
If you have sparse and/or completely unstructured data. While TimescaleDB supports structured and semi-structured data (including JSON), if your data is typically sparse or completely unstructured, other databases are a better fit.
A parting thought: Is all data time-series data?
Let’s return to the topic of time-series data for one parting thought.
For the past decade or so, we have lived in the era of “Big Data”, collecting massive amounts of information about our world and applying computational resources to make sense of it. Even though this era started with modest computing technology, our ability to capture, store, and analyze data has improved at an exponential pace, thanks to major macro-trends: Moore’s law, Kryder’s law, cloud computing, an entire industry of “big data” technologies.
Now we need more. No longer content to just observe the state of the world, we now want to measure how our world changes over time, down to sub-second intervals. Our “big data” datasets are now being dwarfed by another type of data, one that relies heavily on time to preserve information about the change that is happening.
But does all data start off as time-series data? Recall our earlier web application example: we had time-series data but didn’t realize it.
Or think of any “normal” dataset. Say, the current accounts and balances at a major retail bank. Or the source code for a software project. Or the text for this blog post.
Typically we choose to store the latest state of the system. But what if we instead stored every change, and computed the latest state at query time? Isn’t a “normal” dataset just a view on top of an inherently time-series dataset (cached for performance reasons)? Don’t banks have transaction ledgers? (And aren’t blockchains just distributed, immutable time-series logs?) Wouldn’t a software project have version control (e.g., git commits)? Doesn’t this blog post have revision history? (Undo. Redo.)
Put differently: Don’t all databases have logs?
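As a sketch of that inversion (with made-up account data), here’s a “current state” computed on demand from an append-only ledger, rather than stored directly:

```python
# An append-only ledger of changes -- the time-series form of a bank balance.
ledger = [
    ("2017-09-01", "acct_1", +500.00),  # deposit
    ("2017-09-03", "acct_1", -120.00),  # withdrawal
    ("2017-09-05", "acct_1", +75.00),   # deposit
]

def balance(ledger, as_of: str) -> float:
    """The 'normal' dataset -- the balance -- is just a view over the log."""
    return sum(amount for ts, _, amount in ledger if ts <= as_of)

print(balance(ledger, "2017-09-04"))  # 380.0 (state at an earlier point in time)
print(balance(ledger, "2017-09-30"))  # 455.0 (latest state)
```

Because every change is preserved, the balance at any point in time is recoverable, not just the latest one.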
We recognize that many applications may never require time-series data (and would be better served by a “current-state view”). But as we continue along the exponential curve of technological progress, it would seem that these “current-state views” become less necessary. And that by storing more and more data in its time-series form, we may be able to understand it better.
Is all data time-series data? We’ve yet to find a good counter example. If you’ve got one, we’re open to hearing it.
Regardless, one thing is clear: time-series data already surrounds us. It’s time we put it to use.