Engineering for Reliability: Learning from a chair

The other day, I posted an article about Engeneering for Usability. I hadn’t intended to make a series of this, but it may very well turn out that way as today’s post is about engeneering for reliability. Who knows what future posts may hold.

I was sitting in a doctors office the other day, waiting very patiently to be called in. I fidgetted as I always did. Leaning forward, leaning sideways, flopping around like an idiot just trying to reconcile myself with the fact that I didn’t have a laptop or something in front of me to do. But then it sort of hit me. I was abusing that chair to death and it was taking it.

Lets think about the first sentance of this blog post. I said that I was sitting in the doctors office. I’ll bet not one person reading this said “Oh man, what kind of chair was it?” or “Oh my goodness, what if it breaks?” or “Holy cow, that guy is really rolling the dice there isn’t he?”. All of you most likely pictured someone just plopped down on a chair. No one was really fixed on the idea that the chair could fall apart or that it wasn’t an appropriate height. The reason behind this is that, in general, we tend to think that a chair will work when we use it. When was the last time you tested a chair to see if it would hold your weight? I’m sure under the right circumstances, you just might — like if the chair looked visibly weak or damaged, but once again, the general concensus is that its going to work out for you.

When was the last time you could rest this easy with the software we use on our desktops? Do we ever hear someone talk about a piece of software crashing and say, “wow, that’s unusual”. No, in fact, many computer vendors design their software to handle crashing more gracefully. Microsoft themselves have implemented the ability to report errors to them that occur (with a prompt to the user, of course). We have event logs, log files, dump logs, and the like. As developers we spend more time preparing for our software to crash than I think we should sometimes spend making sure our software doesn’t crash to begin with. But while we should spend a great deal of time testing and thinking through our applications — making sure they can work, that isn’t even half of what is required to develop a reliable software system. There are things that are out of our control such as network connectivity, power failures, hard drive crashes, and other hardware failures.

For these very reasons, desktop software most likely never will be as reliable as a chair — its too dependant on too many things in the environment. We have to depend on outside influences such as universal power supplies, raid controllers and backup network connections. These are not typically all seen in your typical home system and even if they are, you can’t make those things prerequisite to installing your software because your available user base would sink considerably. The chair, on the other hand, has everything it needs on hand to appease its user base. Despite the fact that there are a million chair varieties, they all come with their own support system and don’t depend on anything but gravity to make it work right (there’s that gravity constant again).

When it comes to desktop software and even small business applications we are forced to plan for and handle as many failures as we can forsee. We have to accept the fact that outside influences such as memory corruption and IO errors may cause our application to misbehave. Obviously with enterprise software, the guys with the big bucks pay us to implement systems with hardware failovers, double, tripple or quadruple redundancy, and the like. We can make most of those applications truely reliable and nearly as reliable as a chair. However, till we see Dell shipping every system out the door with a UPS, backup and restore operations, RAID and free connectivity to multiple providers, prepare your applications well to handle these errors.

So what can you catch? What can you do to make the user trust your application in spite of these obvious physical problems?

  1. Call it as you see it when a failure occurs. Make sure before you crash that you point the finger at the culprit: “An I/O error has occurred, shutting down to prevent data loss.” or “This application was shut down unexpectedly, would you like to restore?” are common dialogs that you see in highly reliable desktop applications such as Microsoft Word. Which brings us to our second point.
  2. Provide data recovery after failure. This typically means that you have to save state in your application often. One of the most famous uses of this was already noted above. Microsoft word can recover from an error by presenting you with a recovered document as of the last autosave. Obviously, you dont’ want to automatically save to the file that was opened. Microsoft Word creates a temporary file with a similar name to the opened file (placing a “~” at the beginning of the name and setting the file attributes to hidden). That way, when the application crashes, and someone attempts to open the document again, Word can recognize that changes were made to the file. Office knows then to alert you that you can recover the changes to the document you are trying to open.
  3. Provide failover storage durring periods of communications disconnection. When you cannot communicate with a remote database, consider writing changes to a local storage repository and allow those changes to be uploaded when communications are restored. This allows the user to trust that they can continue working on the application and their changes wont be lost or have to be retyped should communications go down. This store and forward style of communication is even seen at lower levels of the OSI layer in routers, switches, and firewalls. Packets are received, stored, and then routing is attempted. If the routing fails, the attempt can be made again because the packet was stored in the device.
  4. Restore network communications automatically. Take note of applications like ICQ, MSN, Yahoo, and the like. They can detect when an internet connection is available and automatically reconnect to their respective services when communications are restored. Don’t expect the user to reconnect every time. This can get annoying if a network connection is having particular difficulty.
  5. Provide logging and feedback capabilities so you can determine what errors occur and how user experience is impacted. Not every user will use it, in fact most will opt out. But for the users that are willing to take the time to fill you in on what’s happening, you should take it. Consider it free QA. Make sure you aren’t just collecting this information. Be sure to respond appropriately with hot fixes and service packs that address the issues you find. Let the user’s know that you cared enough about their input to make sure it didn’t happen again.

These certainly isn’t a comprehensive list, but it should be enough to get you started. You’ll never be able to meet the same reliability standard of a chair with software, but you should be able to instill confidence in your application’s users. Let them know that they can “sit back and relax” knowing that you, your application and their chair will be there for them when they expect it to be.

Be Sociable, Share!

    Leave a Reply

    Your email address will not be published. Required fields are marked *

    Post Navigation