engineering | Tobin Titus

Engineering for Reliability: Learning from a chair

The other day, I posted an article about Engeneering for Usability. I hadn’t intended to make a series of this, but it may very well turn out that way as today’s post is about engeneering for reliability. Who knows what future posts may hold.

I was sitting in a doctors office the other day, waiting very patiently to be called in. I fidgetted as I always did. Leaning forward, leaning sideways, flopping around like an idiot just trying to reconcile myself with the fact that I didn’t have a laptop or something in front of me to do. But then it sort of hit me. I was abusing that chair to death and it was taking it.

Lets think about the first sentance of this blog post. I said that I was sitting in the doctors office. I’ll bet not one person reading this said “Oh man, what kind of chair was it?” or “Oh my goodness, what if it breaks?” or “Holy cow, that guy is really rolling the dice there isn’t he?”. All of you most likely pictured someone just plopped down on a chair. No one was really fixed on the idea that the chair could fall apart or that it wasn’t an appropriate height. The reason behind this is that, in general, we tend to think that a chair will work when we use it. When was the last time you tested a chair to see if it would hold your weight? I’m sure under the right circumstances, you just might — like if the chair looked visibly weak or damaged, but once again, the general concensus is that its going to work out for you.

When was the last time you could rest this easy with the software we use on our desktops? Do we ever hear someone talk about a piece of software crashing and say, “wow, that’s unusual”. No, in fact, many computer vendors design their software to handle crashing more gracefully. Microsoft themselves have implemented the ability to report errors to them that occur (with a prompt to the user, of course). We have event logs, log files, dump logs, and the like. As developers we spend more time preparing for our software to crash than I think we should sometimes spend making sure our software doesn’t crash to begin with. But while we should spend a great deal of time testing and thinking through our applications — making sure they can work, that isn’t even half of what is required to develop a reliable software system. There are things that are out of our control such as network connectivity, power failures, hard drive crashes, and other hardware failures.

For these very reasons, desktop software most likely never will be as reliable as a chair — its too dependant on too many things in the environment. We have to depend on outside influences such as universal power supplies, raid controllers and backup network connections. These are not typically all seen in your typical home system and even if they are, you can’t make those things prerequisite to installing your software because your available user base would sink considerably. The chair, on the other hand, has everything it needs on hand to appease its user base. Despite the fact that there are a million chair varieties, they all come with their own support system and don’t depend on anything but gravity to make it work right (there’s that gravity constant again).

When it comes to desktop software and even small business applications we are forced to plan for and handle as many failures as we can forsee. We have to accept the fact that outside influences such as memory corruption and IO errors may cause our application to misbehave. Obviously with enterprise software, the guys with the big bucks pay us to implement systems with hardware failovers, double, tripple or quadruple redundancy, and the like. We can make most of those applications truely reliable and nearly as reliable as a chair. However, till we see Dell shipping every system out the door with a UPS, backup and restore operations, RAID and free connectivity to multiple providers, prepare your applications well to handle these errors.

So what can you catch? What can you do to make the user trust your application in spite of these obvious physical problems?

Call it as you see it when a failure occurs. Make sure before you crash that you point the finger at the culprit: “An I/O error has occurred, shutting down to prevent data loss.” or “This application was shut down unexpectedly, would you like to restore?” are common dialogs that you see in highly reliable desktop applications such as Microsoft Word. Which brings us to our second point.
Provide data recovery after failure. This typically means that you have to save state in your application often. One of the most famous uses of this was already noted above. Microsoft word can recover from an error by presenting you with a recovered document as of the last autosave. Obviously, you dont’ want to automatically save to the file that was opened. Microsoft Word creates a temporary file with a similar name to the opened file (placing a “~” at the beginning of the name and setting the file attributes to hidden). That way, when the application crashes, and someone attempts to open the document again, Word can recognize that changes were made to the file. Office knows then to alert you that you can recover the changes to the document you are trying to open.
Provide failover storage durring periods of communications disconnection. When you cannot communicate with a remote database, consider writing changes to a local storage repository and allow those changes to be uploaded when communications are restored. This allows the user to trust that they can continue working on the application and their changes wont be lost or have to be retyped should communications go down. This store and forward style of communication is even seen at lower levels of the OSI layer in routers, switches, and firewalls. Packets are received, stored, and then routing is attempted. If the routing fails, the attempt can be made again because the packet was stored in the device.
Restore network communications automatically. Take note of applications like ICQ, MSN, Yahoo, and the like. They can detect when an internet connection is available and automatically reconnect to their respective services when communications are restored. Don’t expect the user to reconnect every time. This can get annoying if a network connection is having particular difficulty.
Provide logging and feedback capabilities so you can determine what errors occur and how user experience is impacted. Not every user will use it, in fact most will opt out. But for the users that are willing to take the time to fill you in on what’s happening, you should take it. Consider it free QA. Make sure you aren’t just collecting this information. Be sure to respond appropriately with hot fixes and service packs that address the issues you find. Let the user’s know that you cared enough about their input to make sure it didn’t happen again.

These certainly isn’t a comprehensive list, but it should be enough to get you started. You’ll never be able to meet the same reliability standard of a chair with software, but you should be able to instill confidence in your application’s users. Let them know that they can “sit back and relax” knowing that you, your application and their chair will be there for them when they expect it to be.

Engineering for usability: Vending machines

Today, I used a vending machine. Anyone that has met me will surmise that this was obviously not my first attempt to purchase an unhealthy snack from such a contraption. In fact, not only have I excessively used of vending machines, I’ve written code for kiosks that use the same currency and coin accepting hardware as vending machines. Suffice it to say this long paragraph was meant to proclaim my self-appointed title of “Certified Vending Machine Professional” (CVMP).

However, with all of my years of experience, today’s experience was different. Today, I approached the vending machine, tired from lack of sleep the past month, distraught over having to get my cat a permanent tracheotomy, cranky from a back problem, and distracted by tons of work-related issues running through my head. I carefully stood in front of the machine contemplating what I would get this time — as though the product selection would have changed between now and the last time I looked. Well, I am supposed to be on a diet, but due to the massive amount of life-issues I’m having at the moment, I decide a “healthier” vending machine snack of pretzels will suffice this time. “D10, 65 cents, ok.” I think to myself as I begin to press the numbers out on the keypad “D, One, Zero — CRAP!”. As it turns out D-10 expected me to press D and then press a key they had specifically set to 10 — not 1 then 0 as I was thinking. I knew this. I’ve used enough vending machines in my life to know how they work. But today, I failed the test. I immediately began to blame the vending machine and started re-engineering a smart vending machine in my head.

Here were the problems I saw with the current design.

The keypad was placed at the bottom of the machine, forcing me to bend my already soar back over to press the numbers. I’m not exactly tall but these numbers were way the @#$* down there!

The vending machine made me bend down even further to retrieve my calorie-clad selection from a bin.

The numbering / item selection routes were flawed with that D10 vs D1 conundrum.

The selection sucked

They only accepted cash or coin which I often don’t carry.

The “lets design something cool” personality in me immediately started thinking about solutions to this problem.

Put a camera on the vending machine that recognized the height of the customer. The vending machine should immediately raise or lower the keypad based on the user’s height.

Place the items on a selection belt where the item is dispensed into a bin and procured to the user at a height that’s reasonable — again based on the height of the user requesting the product.

At the minimum, remove single-numbered selections and replace with all double digit numbers — force the user to press the number sequence out. At best, prompt the user to press “submit” when they are done selecting their item number. Don’t automatically dispense the product based on the first match of a selection.

Provide a digital feedback center where you could request the vending machine carry your favorite flavor of twinkie.

Place a credit card slot on the machine

But then the “practical architect” personality interjected with complaints to these solutions. First off, lets think about this. Who is the customer of a vending machine. Was it me? Surprisingly, no. The vending machine “customer” was the company who bought it with the intent to make a profit from it. Sure, I may use the vending machine, but ultimately, its the vendor that wants the mechanism that accepts payment and dispenses product. We unwittingly provide our services as testers of the product and give feedback on it (based on our use or non-use of the product). That whole argument is reserved for another post though. Blindly accepting my argument that the vendor is the customer of a vending machine, your next question is to ask, what are the vendor’s requirements for the machine.

Using a lose set of some of those spiffy PASS MADE criteria from the MSF exam (70-300) for your MCSD, lets take a look at those requirements.

Provide a fast transaction from which a user can request a product and get out. Quick response to user request for product.(Performance)

Provide a usable interface from which to conduct unmanned transactions. (Accessibility)

Provide a safe way to keep money received from the transactions. (Security)

Provide the ability to easily add more product to the machine — for those of you that don’t know this, they actually can add additional slots next to the vending machine without adding additional currency acceptors. (Scalability)

Provide a mechanism that works nearly every time with little intervention required by the vendor. (Maintainabilty)

Provide 24/7 access to any would-be customers. (Availability)

Allow these items to be shipped easily without fear of breaking during transit or movement from one location to another (Deployability)

Allow alternate methods of payment, exchange of hardware, and various types of product to be dispensed. (Extensibility)

So lets apply these characteristics to my “solutions” from above. We’ll find out these were not solutions at all, but fixes to perceived problems from a single-minded personality. My Architect (as I’ll refer to that voice in my head), says this.

The product isn’t on a belt because the time spent between asking for the product and obtaining it could potentially take a lot longer (Performance). The keypads were not arbitrarily placed at waiste level to tick me off, they were blatantly placed there to provide easier accessibility to those who are height-challenged or otherwise requiring accessibility. The credit card reader isn’t placed on the machine because it would be too easy to intercept. These vending machines can be placed anywhere, and sometimes in really poor choices (rest areas, for instance). While having a vending machine broken into would be bad, having credit card data compromised from one of these locations would be worse. (Security) Placing the items on a belt would make it harder to provide additional items in the physical range of motion of the device. It would be harder to add more product to the machine this way — and much more expensive (Scalability). Those devices would also have a large amount of moving parts that could break. Putting the keypad on a moving device as well as the product would reduce the useful life of a vending machine while raising the cost of it — bad for any investment. It would also be susceptible to significant breaks and massive amounts of upkeep (Maintainability). Putting these items in a bad area would be an even worse move. They would be much more expensive to replace and fix if someone vandalized them — much more, that is, in relation to a simple spring delivered product dispenser that relies on gravity to get the product to its location. The parts would definitely have a much higher dead-on-delivery rate than normal vending machines — causing massive replacement shipping/delivery costs (Deployability). This is a stretch, but by placing additional overhead on these machines, you actually reduce the available room to provide additional hardware to the machine. You could be restricted to a smaller area and some additional hardware may interfere with other pieces of hardware causing more issues than it solved (Extensibility).

Availability, therefore, seems to be the only area that I wouldn’t have negatively impacted by my decisions for a vending machine. As software engineers, we need to make sure we aren’t over-engineering our products. Many times the “cool” solution isn’t the right solution. For instance, the delivery mechanism, while annoying for anyone that is tall or has a back problem, is fairly consistent. I mean, who has a better system than gravity? Last I saw it was a constant! Sure, occasionally product gets stuck in the machine before gravity can do its part, but I guarantee the mechanical delivery system would be much more prone to mistakes than the simple “twist and drop” method of most vending machines. While I may have potentially solved “my” problems, I’ve caused many more for my real customer — the vendor. In actuality the dependability of my vending machine would have cost more money to purchase and maintain, caused dependability problems due to mechanical failure, and most likely annoyed everyone more than “New Coke” ever did.

Yes, its great to solve your problems in cool ways, but make sure the problems you are solving are for the right person and actually advance your product design before adding more features than are necessary.

Tag Archives: Engineering

Engineering for Reliability: Learning from a chair

Engineering for usability: Vending machines

Building a Better Web App

Categories

Archives

Tag Archives: Engineering

Engineering for Reliability: Learning from a chair

Engineering for usability: Vending machines

Building a Better Web App

Categories

Archives

Tags