Software Update for Network Devices

Networked devices like "Internet appliance" boxes need to have their software updated automatically. Some updates are optional, and can be installed at the user's discretion, while others are mandatory because they close security holes or track changes to network services. Sometimes updates need to be recalled, or only rolled out to those who need them.

The variety of situations that can arise took many of us at WebTV by surprise. What follows is a brief description of the system I came up with to manage them. I'm writing this some three years after leaving, so it's unlikely that I'm describing anything that is secret or not immediately obvious to the people who managed to hack into "Big Willie's Flash ROM Download Center" (the naming of which I must take the blame for). I imagine some of this is described in the various software update patents that people at WebTV wrote up (US05940074, US06023268, US06230319, US06259442, US06473099), but to the best of my knowledge the classification and versioning scheme described below is not actually part of the patent claims.

There are four distinct pieces of information that combine to determine not only whether a box should be updated, but also which update should be sent to it.

First, every box has a hardware class. This is a short string that uniquely identifies every hardware configuration that needs its own update package. For example, you can't install 4MB worth of ROM updates on a machine that only has 2MB of flash ROM chips. Some combination of product name and hardware configuration will suffice. (Some people started parsing the string to determine device characteristics. Resist the temptation to do so, as it can result in multiple strings that actually want the same update, which is a pain in the bookkeeping.)

Second, every software update has a version number (a single integer, usually an "official build" number). This increases slowly but steadily over the lifetime of a product. The version number can start over at zero for each new hardware class, since there's no risk of pushing the update for product A out to product B. Some people like to start each product in its own thousands range (1000-1999 for the first product, 2000-2999 for the next), which can make hallway conversations about "version 1505" less ambiguous. Resist the temptation to test for the existence of features by checking the version number. Use explicit feature flags instead.

Third, every user has an update category. At a minimum you need separate categories for internal users on the production network, for alpha testers, and for gradual rollouts of new software. Try not to use the update category for other things, such as determining which set of servers a box uses. You may want one set of users to alpha-test your client software while a different set alpha-tests your network services.

Fourth, every combination of hardware class and update category has a set of four version numbers that determine whether or not a device needs to be updated. Adjusting these four values allows great flexibility. This data can be viewed as a two-dimensional table, with four numbers in each entry. (At WebTV this was just a flat file; there's no need to use fancy database mechanisms.)

The four versions in the table are:

Minimum allowed version
Acceptable version
Current version
Maximum allowed version
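
To make the shape of the table concrete, here is a minimal sketch of loading it from a flat file. The column layout, the names VersionEntry and parse_version_map, and the second sample row are all my own invention for illustration, not the format WebTV used. The ordering check (minimum <= acceptable <= current <= maximum) is likewise an assumption that happens to hold for the example values used below.

```python
from typing import Dict, NamedTuple, Tuple


class VersionEntry(NamedTuple):
    minimum: int      # minimum allowed version
    acceptable: int   # acceptable version
    current: int      # current version (the one updates install)
    maximum: int      # maximum allowed version


# Hypothetical flat-file layout: one line per (hardware class, category) pair.
# The first row uses the example values from the text; the second is made up.
SAMPLE_MAP = """\
# hardware-class  category    min  acceptable  current  max
box2-fpu-4mb      beta-test   75   100         105      120
box2-fpu-4mb      production  75   100         100      100
"""


def parse_version_map(text: str) -> Dict[Tuple[str, str], VersionEntry]:
    """Build the two-dimensional table keyed by (hardware class, category)."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        hw_class, category, *numbers = line.split()
        if len(numbers) != 4:
            raise ValueError(f"expected four versions: {line!r}")
        entry = VersionEntry(*map(int, numbers))
        # Assumed sanity check; the text does not state this requirement.
        if not entry.minimum <= entry.acceptable <= entry.current <= entry.maximum:
            raise ValueError(f"versions out of order: {line!r}")
        table[(hw_class, category)] = entry
    return table


table = parse_version_map(SAMPLE_MAP)
print(table[("box2-fpu-4mb", "beta-test")])
```

Keying on the whole hardware class string, rather than picking it apart, keeps the lookup honest about the advice above not to parse it.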

The use of these versions is most clearly illustrated with an example. Suppose we have the following values for "beta-test" users with "box2-fpu-4mb" devices:

Minimum     75
Acceptable  100
Current     105
Maximum     120

The decision of whether or not to update is resolved as follows:

Box has version 42. This is too old to work with any of the network services (except the software update mechanism), so the software is updated to v105 without giving the user a chance to accept or decline. Forcing the update is also useful if a box has dangerous bugs, e.g. a vulnerability to remote tampering.
Box has version 82. An update with useful features is available, but there's no need to force the user to update. An update to v105 is offered, and installed at the user's request.
Box has version 102. There is a later version, but the update is either cosmetic or only required for some boxes (e.g. some were shipped with a defective part for which a workaround exists). No update is offered to the user, because there's no need to waste resources or the user's time with an upgrade. If they have one of the defective boxes, they can be given simple instructions that will cause their box to get the latest version (v105).
Box has version 112. This probably happened because a previously-offered update was withdrawn once some problems were noticed. This can also happen during "timed rollouts", where limited-scope updates are performed by enabling the new version for a limited time only. No update is offered.
Box has version 121. The v121 software was defective, and now we need to roll everybody back to an earlier version. The user will be "updated" to v105 without having an opportunity to accept or decline. This must be used carefully, because it's possible that the recalled update created files or data structures that the older version will be unable to recognize.
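
The five cases above boil down to three comparisons. Here is a minimal sketch, with names (Action, VersionEntry, decide) that are mine rather than anything from the original system. The text doesn't say what happens exactly at the boundaries, so I'm assuming that "minimum allowed" and "maximum allowed" are inclusive, and that a box already at the acceptable version is not offered anything.

```python
from enum import Enum
from typing import NamedTuple


class Action(Enum):
    FORCE = "force"   # install the current version without asking
    OFFER = "offer"   # offer the current version, install at user request
    NONE = "none"     # leave the box alone


class VersionEntry(NamedTuple):
    minimum: int
    acceptable: int
    current: int
    maximum: int


def decide(box_version: int, entry: VersionEntry) -> Action:
    """Return what to do with a box; the target is always entry.current."""
    # Too old to be usable, or running a recalled build: no choice given.
    if box_version < entry.minimum or box_version > entry.maximum:
        return Action.FORCE
    # Usable, but old enough that the update is worth offering.
    if box_version < entry.acceptable:
        return Action.OFFER
    # Anything from acceptable through maximum is left as it is.
    return Action.NONE


entry = VersionEntry(minimum=75, acceptable=100, current=105, maximum=120)
for version in (42, 82, 102, 112, 121):
    print(version, decide(version, entry).value)
# 42 force
# 82 offer
# 102 none
# 112 none
# 121 force
```

Note that the same FORCE result covers both the forced upgrade (v42) and the forced rollback (v121). Any caution about data left behind by a recalled build has to live in the update package itself, not in this decision.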

The upgrade decision can be made on the box or on the service. For a dial-up device, the choice can be made on the service when the box connects. For a box connected to a broadcast medium (such as a satellite dish), all possible updates are sent to all boxes, because there's no way to send specific streams to specific subscribers. The version map is sent to the boxes as well, and the device decides what to do after the download completes.

There is an additional level of complexity that must be taken into account as well. While the above is sufficient for delivering "release" builds to customers and internal users, it does not provide for "debug" builds or private developer builds. One way of handling this is to refuse to update them automatically, identifying them with a special version number (such as 0 or MAXINT). Developers can manually update their boxes. This is less desirable for QA "daily debug" builds, which should have the same version numbering information as the "release" counterparts, and should usually be upgraded automatically to track the latest developments. The easiest way to deal with this is to have "debug" builds report a modified hardware class that includes "-debug", so that boxes with debug builds always get debug builds, and boxes with release builds always get release builds.
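
As a rough illustration of both approaches, the sketch below derives the hardware class a build reports and checks for the "never update automatically" sentinel versions. The function names are hypothetical, and I'm assuming MAXINT means the largest 32-bit signed integer. Use whatever your platform actually defines.

```python
DEBUG_SUFFIX = "-debug"

# Assumed sentinel versions for private developer builds: 0 or MAXINT,
# with MAXINT taken here to be the largest 32-bit signed integer.
MAXINT = 2**31 - 1
UNMANAGED_VERSIONS = frozenset({0, MAXINT})


def reported_hardware_class(base_class: str, is_debug_build: bool) -> str:
    """Hardware class a build reports to the update service.

    Debug builds report a modified class, so they get their own entries
    in the version table and are only ever sent debug builds.
    """
    return base_class + DEBUG_SUFFIX if is_debug_build else base_class


def is_auto_updatable(box_version: int) -> bool:
    """Private developer builds are skipped; developers update by hand."""
    return box_version not in UNMANAGED_VERSIONS


print(reported_hardware_class("box2-fpu-4mb", is_debug_build=True))
# box2-fpu-4mb-debug
print(is_auto_updatable(0), is_auto_updatable(105))
# False True
```

The suffix is appended by the build, not parsed by the service. The service still treats "box2-fpu-4mb-debug" as one more opaque hardware class string.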

The customer database should record, for every user, the hardware class of their device and the version number of the software running on it. This allows you to monitor the rollout of software updates, and to determine whether the errors in the logs coming up from a box are due to an old version of the software or have carried over into an update where they were thought to be fixed.
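
For example, with those two fields on record, rollout monitoring is a matter of counting. This is only a sketch: BoxRecord, the user IDs, and the sample rows are invented for illustration, and a real customer database would answer the same questions with a query.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class BoxRecord(NamedTuple):
    user_id: str          # hypothetical key into the customer database
    hardware_class: str
    version: int


def version_histogram(records: Iterable[BoxRecord]) -> Counter:
    """Count boxes per (hardware class, version) pair."""
    return Counter((r.hardware_class, r.version) for r in records)


def rollout_fraction(records: List[BoxRecord], hardware_class: str,
                     target_version: int) -> float:
    """Fraction of boxes in a hardware class already at target_version."""
    versions = [r.version for r in records if r.hardware_class == hardware_class]
    if not versions:
        return 0.0
    return sum(v == target_version for v in versions) / len(versions)


# Made-up records, purely for illustration.
records = [
    BoxRecord("u1", "box2-fpu-4mb", 105),
    BoxRecord("u2", "box2-fpu-4mb", 105),
    BoxRecord("u3", "box2-fpu-4mb", 82),
    BoxRecord("u4", "box2-fpu-4mb", 102),
]
print(rollout_fraction(records, "box2-fpu-4mb", 105))
# 0.5
```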

Remember that there are exceptions to every rule. Provide a way to override the settings for specific users or devices. This will greatly simplify matters when requests for special handling come in. Examples include test boxes that must never be updated, demo boxes that need to have a special version, and one-off user upgrades when evaluating fixes for rare problems. You could accomplish the same thing by defining new user categories, but in general a simple override (box ID combined with the four versions described above) is easier to set up and maintain.
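
A sketch of such an override, assuming it is consulted before the regular table. The box IDs, the version 98, and the names used here are all hypothetical. The two sample overrides rely on the decision rules described earlier: a box is forced to the current version only when it is below the minimum or above the maximum, and is offered an update only when it is below the acceptable version.

```python
from typing import Dict, Optional, Tuple

# (minimum, acceptable, current, maximum)
Versions = Tuple[int, int, int, int]

MAXINT = 2**31 - 1  # assumed to be the largest version a box can report

# Hypothetical per-box overrides, keyed by box ID.
OVERRIDES: Dict[str, Versions] = {
    # Test box that must never be updated: every version is allowed and
    # acceptable, so nothing is ever forced or offered.
    "box-0001": (0, 0, 0, MAXINT),
    # Demo box pinned to a special build: anything other than 98 falls
    # outside the allowed range and is forced to 98.
    "box-0002": (98, 98, 98, 98),
}

# Regular table, keyed by (hardware class, update category).
TABLE: Dict[Tuple[str, str], Versions] = {
    ("box2-fpu-4mb", "beta-test"): (75, 100, 105, 120),
}


def versions_for(box_id: str, hw_class: str, category: str) -> Optional[Versions]:
    """Return the four versions for a box, preferring a per-box override."""
    if box_id in OVERRIDES:
        return OVERRIDES[box_id]
    return TABLE.get((hw_class, category))


print(versions_for("box-0002", "box2-fpu-4mb", "beta-test"))
# (98, 98, 98, 98)
print(versions_for("box-9999", "box2-fpu-4mb", "beta-test"))
# (75, 100, 105, 120)
```

Because an override is just another set of the same four numbers, the decision logic doesn't need to know whether it is looking at a table entry or a special case.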

I can say from experience that this is one area where you really want to put some thought and effort in up front. Trying to make all this stuff work after multiple versions of a product are already in the field is a tricky business.