Blog |

Daily summary email bugfix

Daily summary email bugfix
Table of Contents

Last week we fixed a bug in our daily summary emails that a customer reported on May 1st. The bug
caused some of the emails to miss information on errors that occurred during certain parts of the
day. This was a pretty serious bug. It affected almost all of our customers and had been around for
a very long time.

The issue ended up being a miscalculation in our timezone handling code. Most developers who have
worked with timezones know how difficult it can be to get it right. I'll explain how we did it and
how Rollbar helped us quickly narrow in on the problem.

Debugging

The symptoms:

  1. The occurrence counts in some of the daily summary emails were too low
  2. Some new errors were not even in the daily summary emails

The first thing I did was to recreate the database query by hand. After reading the code and pulling
out the relevant queries, I was able to recreate the daily summaries for production but in a
development environment. At this point, it was clear that we were missing data but the query was
correct.

Some possible culprits:

  1. MySQL slave delay
  2. Long-running MySQL transactions
  3. Nebulous time-zone related bug in our code

I was able to verify that MySQL slave delay was not the cause by looking at our Scout graphs for the
database host. Slave delay was nowhere near where it needed to be to cause this.

Next, I looked for long-running MySQL transactions. We have quite a few long-running connections so
it was impossible to determine if we had long-running transactions by looking at just the output
from MySQL's SHOW FULL PROCESSLIST.

I added some code to the script that sends the daily summary emails to print out the maximum id on a
table in our database during each run of the script. If the maximum id never changed, it would mean
the transaction in which the query was executing was never being closed, (since we use
REPEATABLE READ by default on all connections.)

The ids were changing. This wasn't the problem...

At this point it was clear that the problem was in our code. I had pored over it for a couple of
hours, testing everything and still could not find the problem. Then, I decided to add some Rollbar
debugging to get more information about the code that was running in production.

After looking at the first occurrence it was clear what the problem was. The min_ts and max_ts
were incorrect. This code path was pretty straight forward and it was easy to find the culprit -
time.mktime().

The fix

The docs for time.mktime() say that it assumes the
datetime object passed in uses the local timezone. Our code was passing in a datetime object for
UTC which was getting ignored when we generated the timestamps for our query.

The fix was a one-liner. Instead of using time.mktime() we now us calendar.timegm() since it
expects UTC.

Happy coding and watch out for timezone bugs!