
Problematic datetime Values and What to Do With Them

While importing data records containing timestamps from an existing legacy system into a newly created database using Python, my code caught an exception that I thought was worth writing about, along with the code I used to solve the problem.

The legacy system provided an API that produced a JSON object for each data record. Records contained timestamps in several columns which, naturally, I wanted to convert to Python datetime objects and then store in a PostgreSQL database. The values spanned many years in the past and future and were not limited to a known period of time.

JSON, being a text encoding for data representation, does not prevent the sender from putting anything at all in the timestamp columns, so separate standards have been devised to offer a universally agreed-upon format for expressing precise timestamp values as text, one that is not prone to interpretation and inaccuracies. See RFC 3339.

But the people who had developed that system were entirely oblivious to these standards and the complexities associated with time handling. The timestamps were represented in YYYY/mm/dd HH:MM:SS format, entirely ignoring the most basic piece of information: the timezone to which the timestamp belongs. And if you think a particular timezone was implicitly assumed by the developer, as I did, you have a thing or two to learn about time complexities too. In fact, the Python datetime docs call these kinds of timestamps naive, which is also an appropriate name for the people who use them.
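To make the difference concrete, here is a minimal sketch (the sample values are mine, not from the legacy system) contrasting the legacy format with an RFC 3339 timestamp:

    from datetime import datetime

    # A made-up value in the legacy format; parsing it yields a naive
    # datetime: tzinfo is None, so it does not pin down a single instant.
    legacy_value = "2019/03/10 02:30:00"
    naive = datetime.strptime(legacy_value, "%Y/%m/%d %H:%M:%S")
    print(naive.tzinfo)   # None -> naive

    # An RFC 3339 timestamp carries its UTC offset and is unambiguous.
    # (fromisoformat handles this particular form on Python 3.7+.)
    rfc3339_value = "2019-03-10T02:30:00-05:00"
    aware = datetime.fromisoformat(rfc3339_value)
    print(aware.tzinfo)   # UTC-05:00 -> aware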

It’s like assuming that because it’s summer where I live, it must be summer all over the earth (and beyond). If one day we decide to represent seasons in a data structure to be used all over the world, we can’t just use two bits to encode the four values of [Spring, Summer, Fall, Winter]. We would have to answer questions like: which hemisphere are we talking about? People in the northern and southern hemispheres experience opposite seasons. If we want to map the seasons onto a year, we can’t just divide the year’s length by four: how long seasons last depends on exactly where on earth we are talking about (Canadian summers are famously short, and does Mars even have summers?). When does a season start? Even for the same place, the durations of the seasons are not always the same, because depending on astronomical phenomena for time-keeping is infamously inaccurate (the rotation of the earth around its own axis and its revolution around the sun are both slowing down at, you guessed it, inexact rates).

You might know about leap days (also known as “leap years”), but leap seconds are also a thing, because not every day can be assumed to be exactly 24 * 60 * 60 seconds long. To make things worse, the standards (yes, plural!) on when to add leap seconds do not distribute them uniformly or monotonically. This means that to be more accurate, I would have to add a location, a date field, and the name of a standard to my Season abstraction to make it representative enough of the real world, and therefore give it a chance of being useful in real programs. Or we could keep the two-bit representation of Season and just append a pointer to a database of locations and dates, leaving the exact processing of seasons to libraries which take the related entry in that database into account.

This is exactly what has happened with timestamps and the timezone database. The database of timezone rules (the standard) is called the Olson database. On Linux, it is installed via the tzdata package, and the related Python module is pytz. This database and its associated packages are updated quarterly (or so) to keep up with the latest published changes that need to be accounted for.
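As a quick illustration of how those rules are consumed from Python (the zone name here is just an arbitrary example), pytz looks up the Olson database entry and applies whatever offset and DST rules were in effect for the given wallclock time:

    from datetime import datetime
    import pytz

    # Look up the Olson/IANA rules for an example zone.
    toronto = pytz.timezone("America/Toronto")

    # localize() attaches the zone, applying the DST rules in effect
    # for that wallclock time.
    aware = toronto.localize(datetime(2019, 7, 1, 12, 0, 0))
    print(aware)              # 2019-07-01 12:00:00-04:00 (EDT, DST in effect)
    print(aware.utcoffset())  # -1 day, 20:00:00, i.e. UTC-4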

But back to my project: since it is written in Django, I wrote the following code to process these columns into nice timezone-aware (as opposed to naive) datetime objects and store them in an appropriate timestamp with time zone PostgreSQL column.

    from datetime import datetime
    from django.utils import timezone

    # json_col holds the raw "YYYY/mm/dd HH:MM:SS" text from the legacy record
    ts = datetime.strptime(json_col, "%Y/%m/%d %H:%M:%S")
    ts = timezone.make_aware(ts)

The last line is where I get these two exceptions, after processing enough records:

    pytz.exceptions.NonExistentTimeError
    pytz.exceptions.AmbiguousTimeError

This is because I like to have accurate timestamps stored in my database (I’m weird like that!), so I pass the timestamp value through Django’s timezone.make_aware, which uses the pytz library and the implicitly active Django timezone (set in the Django settings) to create an aware timestamp. But it turns out that if we take the timezone database’s DST rules into account, some of those naive timestamps cannot exist, which results in pytz throwing the first exception. Similarly, some naive timestamps occur twice(!), which results in the second exception. This is the effect of Daylight Saving Time rules, yet another complexity that needs to be dealt with.
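As a quick demonstration of both failure modes (the zone and dates are arbitrary examples for a North American DST-observing zone, not values from the legacy system):

    from datetime import datetime
    import pytz

    tz = pytz.timezone("America/Toronto")

    # Spring forward: on 2019-03-10 the clocks jumped from 02:00 to 03:00,
    # so 02:30 never appeared on any wall clock in this zone.
    try:
        tz.localize(datetime(2019, 3, 10, 2, 30), is_dst=None)
    except pytz.exceptions.NonExistentTimeError:
        print("02:30 was skipped when the clocks jumped forward")

    # Fall back: on 2019-11-03 the clocks went from 02:00 back to 01:00,
    # so 01:30 occurred twice.
    try:
        tz.localize(datetime(2019, 11, 3, 1, 30), is_dst=None)
    except pytz.exceptions.AmbiguousTimeError:
        print("01:30 occurred twice when the clocks were set back")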

Since talking to the developers of the legacy system to find the correct values for these timestamps was hopeless (they actually stored the timestamps in text fields in their database, not even naive timestamp fields, so this wasn’t a limitation of their API!), I opted for a best-effort method to fix these problematic values before persisting them into my database. Here is what I came up with:

    from datetime import timedelta

    import pytz

    tz_active = timezone.get_current_timezone()
    try:
        result = timezone.make_aware(value)
    except pytz.exceptions.AmbiguousTimeError:
        # At the end of a DST transition period, a particular wallclock time will
        # occur twice (once before the clocks are set back, once after). Both
        # possibilities may be correct, unless further information is supplied.
        # One correction strategy is to always assume a fixed value
        # of True or False for is_dst.
        # Another strategy is to assume that the datetime value has
        # been generated in a place with the same timezone as currently
        # active timezone, and we are hopefully close enough to the
        # moment that this datetime value has been generated that
        # current DST status applies to it too.
        now = timezone.make_naive(timezone.localtime())
        is_dst = tz_active.dst(now) != timedelta(0)
        result = timezone.make_aware(value, is_dst=is_dst)
    except pytz.exceptions.NonExistentTimeError:
        # At the start of a DST transition period, the wallclock time jumps forward.
        # The instants jumped over never occur.
        # Here we assume that transition to DST has resulted in the
        # wall clock to jump forward, but for whatever reason the
        # system wall clock has failed to follow. So we let pytz
        # add the amount of DST offset to the given value, hoping this
        # "corrected" value does actually exist.
        result = timezone.make_aware(value, is_dst=True)

The Zen of Python is enlightening as usual:

> In the face of ambiguity, refuse the temptation to guess.

But in this case, there is no way other than guessing. Maybe the faulty datetime was created on a system with a different timezone, an outdated timezone database, or on Windoz. In the absence of a timezone field, we will never know and have to guess. So, in this case, I pass all naive datetime objects through a best-effort function similar to the above. The Zen strikes again with these words of wisdom:

> Errors should never pass silently.
> Unless explicitly silenced.

And that is exactly what I did in this project, with a couple of warning log messages thrown in for record-keeping. A more complete and readily usable version of this code is available as part of my django_commons Django app in django_commons.utils.date_time.in_active_timezone().
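If you just want the shape of such a helper, a rough sketch of how the snippet above could be wrapped with warning logs might look like this (an illustration only, not the actual django_commons code):

    import logging
    from datetime import timedelta

    import pytz
    from django.utils import timezone

    logger = logging.getLogger(__name__)

    def make_aware_best_effort(value):
        """Best-effort conversion of a naive datetime to the active timezone."""
        tz_active = timezone.get_current_timezone()
        try:
            return timezone.make_aware(value)
        except pytz.exceptions.AmbiguousTimeError:
            logger.warning("Ambiguous naive timestamp %s; assuming current DST status", value)
            now = timezone.make_naive(timezone.localtime())
            is_dst = tz_active.dst(now) != timedelta(0)
            return timezone.make_aware(value, is_dst=is_dst)
        except pytz.exceptions.NonExistentTimeError:
            logger.warning("Non-existent naive timestamp %s; shifting by the DST offset", value)
            return timezone.make_aware(value, is_dst=True)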

As you may have noticed, there is mention of a current timezone in the code above. Sometimes we are certain that our code will only be used by people in an area governed by a single timezone, but usually this is not the case. Computer programs can run all over the world (surprise!), especially if they are served over the web to run in web browsers. In that case, what should we use as the current timezone? The best bet is to settle on a standard timezone for processing and storing datetime objects internally, and let users set their timezone manually or detect it automatically. Then we can convert the objects at input/output time. For the standard timezone, it is advised to use UTC, because it has no DST changes and makes calculations simpler.

This is similar to the general advice about handling string character sets and encodings: always use Unicode (with the handiest encoding) internally and convert to the target character set/encoding at the edges. So for datetime objects, always use UTC internally, and convert to/from the local timezone only when generating output for or taking input from humans.
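A minimal sketch of that convention (the zone and values are arbitrary examples):

    from datetime import datetime
    import pytz

    # Internally: store and compute everything in UTC.
    stored = datetime(2019, 7, 1, 16, 0, 0, tzinfo=pytz.utc)

    # At the output edge: convert to the user's zone only for display.
    user_zone = pytz.timezone("Europe/Berlin")
    print(stored.astimezone(user_zone))    # 2019-07-01 18:00:00+02:00

    # At the input edge: localize what the human typed, then go back to UTC.
    typed = user_zone.localize(datetime(2019, 7, 1, 18, 0, 0))
    print(typed.astimezone(pytz.utc))      # 2019-07-01 16:00:00+00:00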

Keeping accurate enough timestamps (defined as an unambiguous representation of some instant in time) is an essential part of computer engineering, so I hope this post clarifies some of its complexities and helps you decide on a suitable data structure early in a project, to avoid headaches like this later in its lifetime.
