While I was importing timestamped data records from an existing
legacy system into a newly created database using Python,
my code caught an exception that I thought was worth writing about,
along with the code I used to solve the problem.
The legacy system provided an API that produced a JSON object for each data
record. Records contained timestamps in several columns which, naturally,
I had to convert to Python datetime
objects and then store in a PostgreSQL
database. The values spanned many years in the past and future and were not
limited to a known period of time.
JSON, being a text encoding for data representation, does not
prohibit the sender from putting anything in the timestamp columns, so
separate standards have been devised to offer a universally agreed-upon
format for expressing precise timestamp values in text, one that is not prone
to interpretation and inaccuracies. See
RFC 3339.
But the people who had developed that system were entirely oblivious to these
standards and to the complexities associated with time handling.
The timestamps were represented in YYYY/mm/dd HH:MM:SS
format, entirely
ignoring the most basic piece of information: the timezone to which the
timestamp belongs. And if you think a particular timezone had been implicitly
assumed by the developer, as I did, you have a thing or two to learn about time
complexities too. In fact, the Python datetime
docs call these
types of timestamps naive
, which is also an appropriate name for the
people who use them.
It’s like assuming that because it’s summer where I live, it must be summer all
over the earth (and beyond). If one day we decided to represent seasons in
a data structure to be used all over the world, we couldn’t just use
two bits to encode the four values of [Spring, Summer, Fall, Winter]
.
We would have to answer questions like: which hemisphere are
we talking about? People in the northern and southern hemispheres
experience opposite seasons. If we want to map the seasons over
a year, we can’t just divide the year’s length by four. How long
seasons last depends on exactly where we are talking about (for example,
Canadian summers are famously short, and does Mars even have summers?).
When does a season start? Even for the same place, the durations
of the seasons are not always the same, because depending on
astronomical phenomena for time-keeping is infamously inaccurate
(the earth’s rotation about its axis and its revolution around the sun are
both slowing down at, you guessed it, inexact rates). You might
know about leap days (also known as “leap years”), but
leap seconds
are also a thing, because not all days can be assumed to be exactly
24 * 60 * 60 seconds long. To make things worse, the standards (yes, plural!)
on when to add leap seconds do not distribute them uniformly or monotonically.
This means that to be more accurate, I would have to add a location, a date
field, and the name of a standard to my Season
abstraction to make it
representative enough of the real world, and therefore give it a chance
of being useful in real programs. Or we can keep the two-bit representation
of Season
, just append a pointer to a database of locations and dates,
and leave the exact processing of seasons to libraries that take the
related entry in that database into account.
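To make the analogy concrete, here is a rough sketch of what such an enriched Season abstraction could look like; the names and fields below are hypothetical, just enough to show the context the two-bit encoding is missing:

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Season(Enum):
    # The naive two-bit encoding: four values, no context.
    SPRING = 0
    SUMMER = 1
    FALL = 2
    WINTER = 3


@dataclass(frozen=True)
class LocalSeason:
    """A season plus the context needed to actually interpret it."""
    season: Season
    latitude: float    # which hemisphere? seasons flip across the equator
    observed_on: date  # lets a library apply the rules valid on that date
    standard: str      # which set of season-boundary rules to consult


# January is summer in the southern hemisphere:
summer_in_sydney = LocalSeason(Season.SUMMER, latitude=-33.87,
                               observed_on=date(2021, 1, 10),
                               standard="astronomical")
```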
This is exactly what has happened with timestamps and the timezone database.
The timezone rules database (the standard) is called the Olson
database. On Linux, it is installed via the tzdata
package, and the related
Python module is pytz
. This database and the associated packages are
updated quarterly (or so) to keep up with the latest published changes that
need to be accounted for.
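As a quick illustration (assuming pytz is installed and the Europe/Berlin rules have not changed), this database is what lets pytz report a different UTC offset for the very same zone depending on the date:

```python
from datetime import datetime, timedelta

import pytz

berlin = pytz.timezone("Europe/Berlin")

# The Olson database knows the offset of this zone depends on the date:
winter = berlin.localize(datetime(2021, 1, 15, 12, 0))
summer = berlin.localize(datetime(2021, 7, 15, 12, 0))

print(winter.utcoffset())  # 1:00:00 (CET)
print(summer.utcoffset())  # 2:00:00 (CEST)
```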
But back to my project: since it is written in Django, I wrote the
following code to process these columns into nice timezone-aware
(as opposed to naive) datetime objects and store them in an
appropriate timestamp with time zone
PostgreSQL column type.
from datetime import datetime
from django.utils import timezone
ts = datetime.strptime(json_col, "%Y/%m/%d %H:%M:%S")
ts = timezone.make_aware(ts)
The last line is where I got these two exceptions, after processing
enough records:
pytz.exceptions.NonExistentTimeError
pytz.exceptions.AmbiguousTimeError
This is because I like to have accurate timestamps stored in my database
(I’m weird like that!), so I pass the timestamp value through Django’s
timezone.make_aware
, which uses the pytz
library and the implicit Django
active timezone (set in the Django settings) to create an aware
timestamp. But it turns out that if we take the timezone database’s
DST rules into account,
some of those naive timestamps cannot exist, which results in pytz
throwing the first exception. Conversely, some naive timestamps occur
twice(!), which results in the second exception. This is
the effect of Daylight Saving Time rules, which is yet another
complexity that needs to be dealt with.
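Both exceptions are easy to reproduce with pytz alone (assuming the Europe/Berlin rules here; Django’s make_aware passes is_dst=None through to pytz by default, which is what tells pytz to raise instead of guessing):

```python
from datetime import datetime

import pytz

berlin = pytz.timezone("Europe/Berlin")

# In spring 2021, Berlin clocks jumped from 02:00 straight to 03:00 on
# March 28, so 02:30 that morning never existed on any wall clock there.
try:
    berlin.localize(datetime(2021, 3, 28, 2, 30), is_dst=None)
except pytz.exceptions.NonExistentTimeError:
    print("2021-03-28 02:30 does not exist in Europe/Berlin")

# In autumn 2021, clocks were set back from 03:00 to 02:00 on October 31,
# so 02:30 that morning occurred twice.
try:
    berlin.localize(datetime(2021, 10, 31, 2, 30), is_dst=None)
except pytz.exceptions.AmbiguousTimeError:
    print("2021-10-31 02:30 is ambiguous in Europe/Berlin")
```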
Since asking the developers of the legacy system for the correct
values of these timestamps was hopeless (they actually stored the
timestamps in database text fields, not even naive timestamp fields,
so this wasn’t a limitation of their API!), I opted for a best-effort
method to fix these problematic values before persisting them into my database.
Here is what I came up with:
import pytz
from datetime import datetime, timedelta

from django.utils import timezone

tz_active = timezone.get_current_timezone()
try:
    result = timezone.make_aware(value)
except pytz.exceptions.AmbiguousTimeError:
    # At the end of a DST transition period, a particular wallclock time will
    # occur twice (once before the clocks are set back, once after). Both
    # possibilities may be correct, unless further information is supplied.
    # One correction strategy is to always assume a fixed value
    # of True or False for is_dst.
    # Another strategy is to assume that the datetime value has
    # been generated in a place with the same timezone as the currently
    # active timezone, and that we are hopefully close enough to the
    # moment this datetime value was generated that the
    # current DST status applies to it too.
    now = timezone.make_naive(timezone.localtime())
    is_dst = tz_active.dst(now) != timedelta(0)
    result = timezone.make_aware(value, is_dst=is_dst)
except pytz.exceptions.NonExistentTimeError:
    # At the start of a DST transition period, the wallclock time jumps
    # forward, so the instants jumped over never occur.
    # Here we assume that the transition to DST made the
    # wall clock jump forward, but for whatever reason the
    # system wall clock failed to follow. So we let pytz
    # apply the DST offset to the given value, hoping this
    # "corrected" value does actually exist.
    result = timezone.make_aware(value, is_dst=True)
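The fixed-value strategy mentioned in the first comment is easiest to see with pytz alone (again assuming the Europe/Berlin rules): the two legal interpretations of an ambiguous wall-clock time are exactly one hour apart.

```python
from datetime import datetime, timedelta

import pytz

berlin = pytz.timezone("Europe/Berlin")
ambiguous = datetime(2021, 10, 31, 2, 30)  # occurred twice in Berlin

# Interpret it as still on summer time (before the clocks were set back)...
first = berlin.localize(ambiguous, is_dst=True)
# ...or as already on standard time (after the clocks were set back).
second = berlin.localize(ambiguous, is_dst=False)

print(first.utcoffset(), second.utcoffset())  # 2:00:00 1:00:00
print((second - first).total_seconds())       # 3600.0 -- one hour apart
```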
The Zen of Python is enlightening as usual:
> In the face of ambiguity, refuse the temptation to guess.
But in this case, there is no way other than guessing. Maybe the faulty
datetime was created on a system with a different timezone, with an outdated
timezone database, or on Windoz.
In the absence of a timezone field, we will never know, and have to guess.
So, in this case, I pass all naive datetime objects through a best-effort
function similar to the above.
The Zen strikes again with these words of wisdom:
> Errors should never pass silently.
> Unless explicitly silenced.
And that is exactly what I did in this project, with a couple of warning
log messages thrown in for record-keeping.
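For illustration, the whole strategy, warnings included, can be folded into one function. This is a standalone sketch that uses pytz directly instead of Django’s timezone helpers; the function name is my own, not an existing API:

```python
import logging
from datetime import datetime, timedelta

import pytz

log = logging.getLogger(__name__)


def localize_best_effort(tz, value):
    """Best-effort conversion of a naive datetime to an aware one in tz.

    A sketch of the strategy described above: is_dst=None makes pytz
    raise on problematic values, and we then guess, with a warning.
    """
    try:
        return tz.localize(value, is_dst=None)
    except pytz.exceptions.AmbiguousTimeError:
        # Guess: the value was generated recently enough that the
        # current DST status of tz applies to it too.
        is_dst = tz.dst(datetime.now()) != timedelta(0)
        log.warning("Ambiguous timestamp %s; assuming is_dst=%s",
                    value, is_dst)
        return tz.localize(value, is_dst=is_dst)
    except pytz.exceptions.NonExistentTimeError:
        # Guess: the wall clock failed to spring forward, so interpret
        # the value as if it were already on summer time.
        log.warning("Non-existent timestamp %s; assuming is_dst=True", value)
        return tz.localize(value, is_dst=True)
```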
A more complete and readily usable version of this code is available as
part of my django_commons
Django app in django_commons.utils.date_time.in_active_timezone()
.
As you may have noticed, there is mention of a current timezone in the
code above, and sometimes we are certain that our code will only be used
by people in an area governed by a single timezone. But usually this is
not the case. Computer programs can run all over the world (surprise!),
especially if they are served over the web to run in web browsers. In
this case, what should we use as the current timezone? The best bet
is to settle on a standard timezone for processing and storing the
datetime objects internally, and let the user set their timezone manually
or detect it automatically. Then we can convert the objects at
input/output time. For the standard timezone, it is advised to use
UTC, because it does not have DST changes, which makes calculations simpler.
This is similar to the general advice about handling string character sets
and encodings, which says: always use Unicode (with the most handy encoding)
internally and convert to the target character set/encoding at the edges.
So for datetime objects, always use UTC internally. Convert to/from the
local timezone only when generating output for, or taking input from, humans.
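This rule needs nothing beyond the standard library to sketch (zoneinfo has been part of it since Python 3.9, and it reads the same Olson database):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Internally, keep and store everything in UTC.
stored = datetime(2021, 6, 1, 15, 30, tzinfo=timezone.utc)

# Convert to the user's timezone only when rendering output...
shown = stored.astimezone(ZoneInfo("America/Toronto"))
print(shown.isoformat())  # 2021-06-01T11:30:00-04:00

# ...and back to UTC as soon as input crosses the boundary inward.
received = shown.astimezone(timezone.utc)
assert received == stored  # the instant in time never changed
```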
Keeping accurate enough timestamps (defined as an unambiguous
representation of some instant in time) is an essential part of computer
engineering, so I hope this post clarifies some of the complexities involved
and helps you decide on a suitable data structure early on in a project,
to avoid headaches like this later in its lifetime.