Jul 16, 2024

Best practices for timestamps and time zones in databases

Confused about how to handle dates, times, and DateTimes in your database? We’ve got you covered, now() and always.
Dan Chaffelson
Sales Engineer

The old joke goes that there are only two hard things in Computer Science: naming things, cache invalidation, and off-by-one errors. But I contend that nothing makes experienced developers want to delegate more than reworking the DateTime handlers in the analytics codebase.

Of course, the real problem is that it’s both annoying to solve correctly and adds no net value to the service you’re building - you just need it to work so you can report accurately.

Well, fear not! For this is a blog post on best practices for working with Date, DateTime, and time zone types in your databases, data processing, and analytics systems. It covers general best practices, with a companion guide in Tinybird’s docs (“Working with Time”) that provides specific functions for different tasks, pre-written queries, and test data for common time-based analytics use cases.

These are the Ten Commandments for timestamps and time zones in databases. Use them to master time management when storing Dates and DateTimes in your database.

1. Thou shalt default to UTC

"Simplicity is the ultimate form of sophistication." - Leonardo da Vinci

Regardless of the time zone of the server where an event is created, you should first convert the timestamp into UTC when you store it. You may also keep A) the relevant time zone offset in seconds, B) the name of the time zone, and C) the local timestamp as a string. 

But always, always, make your primary storage of a historical event timestamp in UTC.

Why? UTC is an absolute point in time. It is based on a highly accurate and stable standard recognized worldwide. It is not affected by changes in time zones or other adjustments at the whim of the latest local government.

UTC is always constant - every other modern time is relative.
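To make this concrete, here is a minimal sketch of such a table in ClickHouse SQL (the table and column names are hypothetical):

```sql
-- Canonical event storage: the timestamp lives in UTC, with optional
-- context columns for the offset, zone name, and original local string.
CREATE TABLE events
(
    event_id          UInt64,
    event_time_utc    DateTime('UTC'),        -- primary, canonical timestamp
    tz_offset_seconds Int32,                  -- A) offset at event time, e.g. 7200
    tz_name           LowCardinality(String), -- B) e.g. 'Europe/Madrid'
    local_time_str    String                  -- C) e.g. '2024-07-16 15:30:00+02:00'
)
ENGINE = MergeTree
ORDER BY event_time_utc;
```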

Note that we specifically call out historical timestamps here, because you are storing information about something that has already occurred. When describing something that might happen in the future, like scheduling an event, you should consider the appropriate context relative to the information. UTC is great for coordinating something to happen at exactly the same moment across many time zones, but if you want everyone to do something at 0900 in their own local time then the local time zone is the better context to schedule it.

2. Thou shalt only do that which is necessary

"Too much analysis leads to paralysis." - Anonymous

It can be tempting to process timestamp data in a way that seems comprehensive or exhaustive, but this can lead to wasted effort and unnecessary complexity. This is why you should only do what is needed to serve the use case in front of you.

It is important to clearly define the use case and determine the specific precision, calculations, or aggregations necessary to support it. If you're unsure, use the recommended defaults and then stray only as needed.

This is also part of why we at Tinybird store data in UTC with metadata about where it came from: it is already in a common standard and usually only one aggregation and/or conversion away from how our users want to see it represented. It is also why we default to storing DateTime with second (rather than millisecond) accuracy, and monetary values in cents as Integers rather than Floats: these have fewer edge cases, and more standard functions work with them out of the box, reducing your complexity and test footprint.

This is not to say that you shouldn’t use millisecond and cent-fraction accuracy if you really need it (hello, folks in AdTech), but being fancy for its own sake just makes more work. It also means that if a great many use cases require the data in a particular representation (for example, if all of your business analysts work out of a specific time zone and always view data relative to it), then precalculating the data in that time zone for efficiency is perfectly reasonable and a good idea. But you should still canonically store it in UTC, as in the sketch below.
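Here is one way those defaults might look together, again with hypothetical names: second-precision DateTime, cents as an Integer, and a precalculated local calendar day for a team that always queries in one time zone:

```sql
CREATE TABLE purchases
(
    purchase_time_utc DateTime('UTC'), -- second precision; reach for DateTime64 only if needed
    amount_cents      Int64,           -- 1999 means $19.99; Integers avoid Float edge cases
    -- precalculated local day for analysts who always work in New York time
    purchase_date_ny  Date MATERIALIZED toDate(purchase_time_utc, 'America/New_York')
)
ENGINE = MergeTree
ORDER BY purchase_time_utc;
```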

3. Thou shalt be unambiguous

"Ambiguity is the enemy of clarity." - Anonymous

You should always display timestamps in standard formats, specifically ISO 8601 formats by default. Only use regional formats (I’m looking at you, North Americans, with your mm-dd-YYYY situation) if absolutely necessary.

Why? Because a user should never have to guess what a timestamp represents. Any plain timestamp presented in a UI will be assumed to be local time, and any plain timestamp stored in a database is assumed to be UTC - if it is anything different, you must provide the time zone information with it, or explain it in metadata.

There are good technical reasons for this, too, such as efficient sorting and indexing by date in these standard formats. Also, most people assume UTC by default (see The First Commandment), which makes it the least confusing way to present data to users. Surely that’s enough of a reason.
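For example, in ClickHouse you might format a UTC timestamp for display like this (the output comments are illustrative):

```sql
SELECT
    formatDateTime(now('UTC'), '%Y-%m-%dT%H:%i:%SZ') AS iso_8601, -- 2024-07-16T09:30:00Z
    formatDateTime(now('UTC'), '%d-%m-%Y')           AS regional  -- is 04-07 April or July?
```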

More generally, avoiding ambiguity means using the right data Type for the right purpose, and thus leads us to the fourth commandment...

4. Thou shalt not cross thy Types

"Don't. Cross. The Streams." - Egon Spengler

Use the Type that matches your data and don't compare across dissimilar Types. Use a Date when you have a Date, and a DateTime when it's a Date and a Time, etc. This might seem glaringly obvious, but people still make this mistake with immense regularity. Don’t feel bad. Just don’t do it in future.

Why? If you store a Date as a DateTime with some arbitrary time of day selected (most systems would set it to midnight or midday by default), and it is then converted to a time zone with a sufficiently large offset (like New Zealand at UTC+13), it could be a completely different calendar day and thus not match the Date you were expecting.

Remember: a Date refers to the period of a calendar day. A DateTime refers to a specific point in time which might be on a particular day in one time zone, and another day elsewhere.
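A quick illustration in ClickHouse, using New Zealand’s summer offset of UTC+13:

```sql
-- The same instant falls on different calendar days in different zones:
-- 23:00 UTC on the 15th is already the 16th in Auckland.
SELECT
    toDate(toDateTime('2024-01-15 23:00:00', 'UTC'))                     AS day_utc, -- 2024-01-15
    toDate(toDateTime('2024-01-15 23:00:00', 'UTC'), 'Pacific/Auckland') AS day_nz   -- 2024-01-16
```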

5. Thou shalt know whence you came

"You can't really know where you are going until you know where you have been." - Maya Angelou

To compare things across different time zones, you must know which time zone you started with. For smaller datasets, you would generally store this information in an additional column (see The First Commandment). For larger datasets, however, you typically keep a Fact table with all your time-stamped data and a Dimension table containing the relevant location and time zone information, as this avoids bloating your Fact table with redundant data.
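A minimal sketch of that pattern, with hypothetical sales and stores tables (the time zone argument in the query is a constant, looked up from the dimension beforehand):

```sql
CREATE TABLE sales  -- Fact: one lean row per event, timestamps in UTC only
(
    store_id      UInt32,
    sale_time_utc DateTime('UTC'),
    amount_cents  Int64
)
ENGINE = MergeTree
ORDER BY (store_id, sale_time_utc);

CREATE TABLE stores -- Dimension: one row per store, holding the time zone
(
    store_id UInt32,
    city     String,
    tz_name  String -- e.g. 'Europe/Madrid'
)
ENGINE = MergeTree
ORDER BY store_id;

-- Daily totals for store 42, in that store's own time zone:
SELECT
    toDate(sale_time_utc, 'Europe/Madrid') AS local_day,
    sum(amount_cents)                      AS total_cents
FROM sales
WHERE store_id = 42
GROUP BY local_day
ORDER BY local_day;
```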

Most of the time we are processing data from sources that don't move around much, like physical stores, or digital services that typically operate in UTC for simplicity, like online shopping. Truly mobile data sources like ships and airplanes have highly sophisticated time and positioning systems and will also usually give you data in UTC.

See why putting data in UTC is so handy? That's why it's The First Commandment.

Ultimately, many analytics queries require you to know where the data came from (event time zone), how it is stored right now (UTC, of course), and where it is going (query time zone) - which then naturally leads us to...

6. Thou shalt transform carefully

"A child learns that a day is 24 hours long; a developer needs to know that is only true in UTC." - Me, probably

When naively converting between time zones, it’s easy to get tripped up by time zone or Daylight Saving Time (DST) changes. The result is data that is only correct for ~363 days per year, or that is always mistakenly offset by 30 minutes for one region. Part of the problem with working with timestamp data is that wrong and right can look remarkably similar, and it’s easy to make mistakes.

This is particularly true when aggregating over longer time periods - when aggregating by day you must first consider: “Which relative day do I care about?” Each local site? Some global service center? Maybe you want to normalize all the data relative to the business day across locations, so you can compare intersite event patterns on an hourly basis.

In some cases, the tradeoff of simplicity in data processing against strict accuracy is acceptable for the use case (remember The Second Commandment), but if you’re going for accuracy, then you should try to do all your conversions either at the very beginning or very end of your processing based on your required output, or in such a way that it doesn't cause problems.

Currently, all active DST changes and time zone offsets are multiples of 15 minutes, so that is the largest period you can pre-aggregate by without needing complex adjustments at query time. This is very handy and leveraged by many tools, and we’ll show you how to do it yourself in the companion guide: “Working with Time”.
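As a sketch, reusing the hypothetical sales table from earlier and assuming the first query’s output is stored in a fifteen_minute_rollup table:

```sql
-- Step 1: pre-aggregate raw events into 15-minute UTC buckets.
SELECT
    toStartOfFifteenMinutes(sale_time_utc) AS bucket_utc,
    count() AS events
FROM sales
GROUP BY bucket_utc;

-- Step 2: roll the buckets up into local calendar days at query time,
-- without touching the raw data again.
SELECT
    toDate(bucket_utc, 'Europe/Madrid') AS local_day,
    sum(events)                         AS daily_events
FROM fifteen_minute_rollup
GROUP BY local_day
ORDER BY local_day;
```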

7. Thou shalt understand time zone relationships

"It's no use going back to yesterday, because I was a different person then." - Lewis Carroll

It is important to understand how time zones and offsets relate to each other. Time zones are labels, like Asia/Tokyo, used by governments and changed infrequently. Sometimes a time zone is retired, or changed, or a new one is introduced.

Any given time zone will have one or more offsets. An offset is usually expressed in seconds in programming and represents the number of seconds that the time zone is ahead of or behind UTC for a given period of a given year. Offsets can still be changed in modern times, and have historically been very weird indeed: the Netherlands was UTC+00:19:32.13 (yes, nineteen and a bit minutes) for nearly 28 years.

Generally, it is good to think of time zones and offsets as a collection of 1:N relationships which slowly morph over time, and that when you fetch the offset for a timestamp it will be given for that particular point in time and may not apply to all other timestamps for that time zone in that dataset.
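You can see this directly with ClickHouse’s timeZoneOffset() function: one zone, two different offsets depending on the instant you ask about:

```sql
-- Amsterdam is UTC+1 in winter and UTC+2 in summer:
SELECT
    timeZoneOffset(toDateTime('2024-01-15 12:00:00', 'Europe/Amsterdam')) AS winter_secs, -- 3600
    timeZoneOffset(toDateTime('2024-07-15 12:00:00', 'Europe/Amsterdam')) AS summer_secs  -- 7200
```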

Most systems, including the one we use in Tinybird, are based on the IANA Time Zone Database which is scrupulously kept up to date. This means the native conversion functions within the system are probably better at it than your custom solution, which brings us to...

8. Thou shalt trust the computer

"Don't reinvent the wheel, just realign it." - Anthony J. D'Angelo

You should default to using system-provided conversion functions, if possible. One of the oft-lauded positives of Open Source Software is that you get the benefit of an often very large number of people using the same functionality and therefore detecting and fixing bugs long before you encounter them. 

One of the downsides of OSS is that the system sometimes has very strong opinions about how you should do things, so we must additionally have carefully-considered workarounds to achieve the outcome we need.

For example, ClickHouse (and therefore Tinybird) has a large collection of functions for common tasks: interpreting different kinds of Strings as DateTimes (parseDateTimeBestEffort()), converting with toDateTime() (including changing the time zone), handling different levels of precision (now() vs. now64()), and writing a timestamp out to many specific kinds of String formats (formatDateTime()).
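A quick tour of those built-ins in a single query (the input string is just an example; ClickHouse lets later expressions reference earlier aliases):

```sql
SELECT
    parseDateTimeBestEffort('2024-07-16 14:30:00+02:00') AS parsed,     -- String -> DateTime
    toDateTime(parsed, 'UTC')                            AS as_utc,     -- re-zone to UTC
    now()                                                AS secs,       -- DateTime, second precision
    now64(3)                                             AS millis,     -- DateTime64, milliseconds
    formatDateTime(as_utc, '%Y-%m-%dT%H:%i:%SZ')         AS iso_string  -- DateTime -> String
```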

It also associates a time zone with DateTime fields at the column level (the server default, if you don’t specify one), preventing mixed time zone data from being stored together in this native format. In our companion guide, we show you ways to achieve typical analytics use cases while working with this constraint, and the impact it has on other queries you may run.

So the computer will probably do what you want, but it might have constraints or opinions that you need to work around, so...

9. Thou shalt also verify the computer

"In God we trust; all others must bring data." - W. Edwards Deming

Even if you use the system-provided Date and DateTime manipulation functions, and especially if you write your own, you should have golden test data and tests for your specific use cases as a part of your analytics platform.

An upgrade could change a default you unknowingly relied upon; an administrator might accidentally change the server from UTC to BST for a total of 13 minutes on some idle Tuesday morning; a new version of a tool could introduce a new optional positional parameter to a function you use, and it now silently gives a different answer in 3% of your data.
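As a sketch, such a test could be as small as one query, using Pacific/Chatham and its unusual +13:45 summer offset (the expected value is your golden datum):

```sql
SELECT
    toTimeZone(toDateTime('2024-01-15 00:00:00', 'UTC'), 'Pacific/Chatham') AS converted,
    toDateTime('2024-01-15 13:45:00', 'Pacific/Chatham')                    AS expected,
    converted = expected                                                    AS test_passes -- 1 = ok
```

A zone with a 45-minute offset catches any conversion logic that silently assumes whole or half hours.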

In our companion guide, we provide an overview of the different functions you can use along with examples of test data using the extremely unusual Pacific/Chatham time zone, and how you can run various kinds of checks using it.

10. Thou shalt put this into practice

If you've read this far, then cement your understanding by running through the practical examples in the companion guide, Working with Time. They're implemented in Tinybird, which uses standard SQL running on ClickHouse. It's free to sign up, and the example data and queries are already there for you to review. 

Have fun!
