Saturday, January 8, 2011

Feedback in software engineering

Sometimes you encounter clear examples of how the timing of feedback can lead to problems. Yesterday I ran into an interesting bug in our software platform, caused by a strangely implemented PHP function. Due to an incorrect choice by a core PHP developer, a bug on our side stayed hidden for a few years and only became apparent on January 1st, 2011.

Some background:
The "mktime" function in PHP (a modified copy of the C mktime function) can be used to do various smart calculations on dates and times. One of its parameters is the year. This parameter has two ranges of valid values: 00 to 99, and the full four-digit year (e.g. 2011). The C version has a different definition: it takes the number of years since 1900. Effectively this means that up until the year 2000 these two variants had overlapping valid input. To make things worse, a standard usage of this function in C is to feed the result of the "localtime()" function into mktime. In C, localtime() uses the same years-since-1900 convention as mktime. PHP also provides localtime(), and it behaves identically to C: years since 1900. The standard way of using this in C therefore worked correctly in PHP up until the year 2000. In many ways a standard millennium problem: code that worked correctly would break on January 1st, 2000.
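To make the overlap concrete, here is a small Python model of the two year conventions. It is only a sketch, not the actual PHP or C code: the function names are made up for illustration, and the pivot of 70 for two-digit years is an assumption based on PHP's documented mapping (0-69 become 2000-2069, 70-99 become 1970-1999).

```python
def c_year(tm_year):
    """C convention: tm_year counts years since 1900."""
    return 1900 + tm_year


def php_year(year):
    """Model of the PHP year parameter as described in the text (hypothetical helper).

    Two-digit values are expanded around an assumed pivot of 70;
    anything else is taken literally as a full year.
    """
    if 0 <= year <= 69:
        return 2000 + year
    if 70 <= year <= 99:
        return 1900 + year
    return year


def agree(tm_year):
    """True when a C-style year, fed unchanged into the PHP model, gives the intended year."""
    return php_year(tm_year) == c_year(tm_year)
```

Under these assumptions agree(99) holds, because both sides arrive at 1999, while agree(100) does not: the C-style value for the year 2000 is read as the literal year 100.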

Most PHP code in the world will have been fixed in that period, for example by adding 1900 to the year returned by localtime() before feeding it into mktime.

But then, in 2005, for reasons I can't really imagine, a core PHP developer decided to change the implementation of mktime to accept three-digit years (years since 1900). So far, so good. This would not have been a big problem, but he did this only by allowing values up to 110 instead of the more logical 999! Effectively he reintroduced the same millennium bug for the year 2011.
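The effect of that cutoff can be modeled in a few lines of Python. Again this is a sketch of the behavior as I described it above, not a copy of the PHP source: the function names are hypothetical, and the handling of values below 70 is an assumption based on PHP's documented two-digit mapping.

```python
def php_year_after_change(year):
    """Model of the changed year handling (hypothetical helper, not PHP source)."""
    if 0 <= year <= 69:
        return 2000 + year      # assumed two-digit mapping for small values
    if 70 <= year <= 110:
        return 1900 + year      # treated as years since 1900, but only up to 110
    return year                 # everything else is taken literally


def first_failing_year():
    """Find the first calendar year for which unfixed C-style input goes wrong."""
    tm_year = 70
    while php_year_after_change(tm_year) == 1900 + tm_year:
        tm_year += 1
    return 1900 + tm_year
```

In this model first_failing_year() returns 2011: a years-since-1900 value of 110 still maps to 2010, but 111 is read as the literal year 111. Code that adds 1900 before the call is unaffected, since php_year_after_change(1900 + 111) is simply 2011.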

Back to the concept of feedback:
What this change did was to hide incorrect implementations for a few years. Instead of calling out, spectacularly crashing, or giving some other obvious feedback, it gave developers no indication that they had done something wrong. No, at some inconvenient moment (by definition in the middle of the night) it just stopped producing valid output, creating a potentially dangerous situation. (Murphy's law 2.0: All software ever developed will be used in some life critical situation)
Through the lack of correctly timed feedback this created an unstable situation. In particular, the moment at which the problem becomes known was moved from development and testing to production.
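For contrast, here is what early feedback could look like. The Python sketch below is a hypothetical wrapper (the name strict_date and the accepted range are my own assumptions, not an existing API) that accepts only full four-digit years and fails loudly on anything else, so the mistake shows up the first time the code runs in development.

```python
import datetime


def strict_date(year, month, day):
    """Hypothetical fail-fast wrapper: accept only full four-digit years."""
    if not 1000 <= year <= 9999:
        # Complain immediately instead of guessing what the caller meant.
        raise ValueError(
            f"expected a four-digit year, got {year!r}; "
            "did you pass years since 1900?"
        )
    return datetime.date(year, month, day)
```

With a check like this, passing 111 for the year 2011 raises an error on the very first call, rather than quietly working until some cutoff is reached in production.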

1 comment:

Anonymous said...

Really good post!