Computing Without Boundaries: Guerrilla Troubleshooting Tactics

I created a training video over ten years ago called Guerrilla Troubleshooting Tactics, which simplified and codified successful troubleshooting patterns. I recently realized it has broad applicability in other areas of our business here at Inductive Automation, so I thought I would share its key points with you.

Troubleshooting procedure is quite distinct from technical knowledge of the thing being fixed. Anyone can follow this procedure up until the last step, even if they're not technical, and troubleshoot successfully. Sometimes it's just a matter of asking the right questions of someone who does have the technical knowledge.

The following simple questions are ordered so as to pick the low-hanging fruit first. A frequent mistake is to dive into too much detail with too little information too early. Here are the steps:

Diagnostic Sequence

1) What is the problem?
2) How is it supposed to work?
3) Did it ever work?
4) What changed?
5) What else is affected?
6) Has it ever happened before?
7) Why do the affected outputs act this way?

A little explanation about these steps is in order. Remember, the order of these steps is one of trying the easy things first. The problem could be, and many times is, resolved on step one. The procedure shouldn't take very long - sometimes only minutes.
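
For readers who think in code, the sequence can be sketched as an ordered checklist that stops as soon as the problem resolves. This is only an illustration in Python, not something from the original training video; the names DIAGNOSTIC_SEQUENCE, walk_sequence and resolved_after are made up for the example.

```python
from typing import Callable, Optional

# The seven questions, in the order given above (easy things first).
DIAGNOSTIC_SEQUENCE = [
    "What is the problem?",
    "How is it supposed to work?",
    "Did it ever work?",
    "What changed?",
    "What else is affected?",
    "Has it ever happened before?",
    "Why do the affected outputs act this way?",
]


def walk_sequence(resolved_after: Callable[[int, str], bool]) -> Optional[int]:
    """Ask each question in order and stop at the first one that resolves
    the problem. Returns that step number, or None if nothing resolved it.

    resolved_after is a hypothetical callback standing in for the human
    troubleshooter: it receives the step number and the question and
    returns True once the problem is solved.
    """
    for step, question in enumerate(DIAGNOSTIC_SEQUENCE, start=1):
        if resolved_after(step, question):
            return step
    return None


# Example: a problem that clears up once someone remembers what changed.
print(walk_sequence(lambda step, question: step == 4))  # 4
```

The only point of the sketch is the early return: you never skip ahead, and you stop the moment a step gives you the answer.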

First Step
The first step is to get a clear description of the problem - obvious, right? Yet I have personally made the mistake of getting only a partial description of a problem and then taking off on a tangent (half-cocked, as they say), only to have to back up and restart later. Take your time and get the whole description. Another point is this: if they say there is a problem, there is a problem. Sometimes troubleshooters "blow off" the user as "using it wrong" or being "nuts" or something. But the troubleshooter should always assume full responsibility for the system and the person using it. It's an end-to-end approach. If the problem is how the user is using it, then educating the user becomes the target of the successful troubleshooter. Sometimes when doing this you discover something really is wrong with the system and not the user, which is a bit embarrassing, though productive.

Second Step
You'd better know how something is supposed to work before you attempt to "fix it." It's really hard to fix something that's operating normally! Sometimes the design is so poor that you assume it could never be meant to work that way, but that would be a bad assumption on your part. How do you get this information? Ask the user and verify it with a manual. You can Google a manual or other information on practically anything these days. You don't sit there and read the whole manual, but get good at scanning quickly and picking out the relevant info. Sometimes asking the user is sufficient because they are the most familiar with the thing (they are the user, aren't they?). If that leads you astray or into confusion, then resort to the manual or use your own common sense. This step is vital since troubleshooting is done by comparing how something is to how it should be.

Third Step
When working with "one-off" systems the step "did it ever work?" is vital, or you'll waste massive amounts of time. Your approach will be totally different if it has never worked versus the case where it did work and then something changed and it stopped working. The first case means the development-debugging process was never completed. That's an engineering concern, and there could be one reason it doesn't work or many, or even a fundamentally flawed design. In most cases that's where the troubleshooting ends and it goes back to the designers, or if not, at least you know what you're up against. But if you can ascertain that it did work at some point then you've got something you can work with. Now you've got certainty and a starting point.

Fourth Step
When you ask "what changed?" the response is often "Nothing." But here is your stable datum - something changed - or else it would still be working. So you might have to ask "when was it working fine?", and then ask "when did it stop working?" Getting them to place it in time will often get them to realize what it was. It will be something like "oh yeah, we changed such and such." This is usually the point where the light bulb goes on, and it will lead you right to the solution. There is one caveat though. Before you arrived on the scene someone else might have been troubleshooting it, and if they used a shotgun approach they could have caused new problems in addition to the original one. They might have replaced parts with whatever was on hand, and those might be the wrong ones. They might have tweaked and adjusted things out of desperation and thrown things out of alignment. So you have to ask exactly what steps were taken before you arrived, and in many cases you will need to put everything back to its original state before starting your troubleshooting.
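
When the system's settings happen to be recorded somewhere, "what changed?" can sometimes be answered mechanically rather than from memory. Here is a minimal sketch in Python, assuming you have two snapshots of the settings as plain dictionaries: one from when it worked and one from now. The function name what_changed and the sample settings are invented for illustration.

```python
def what_changed(before: dict, after: dict) -> dict:
    """Return {setting: (old_value, new_value)} for every setting that differs.

    A setting missing from one snapshot shows up as None on that side,
    so a setting explicitly stored as None looks the same as a missing one.
    """
    changes = {}
    for key in sorted(set(before) | set(after)):
        old, new = before.get(key), after.get(key)
        if old != new:
            changes[key] = (old, new)
    return changes


# Hypothetical snapshots: one from when it worked, one taken today.
working = {"firmware": "2.1", "baud_rate": 9600, "timeout_s": 5}
broken = {"firmware": "2.1", "baud_rate": 19200, "timeout_s": 5, "proxy": "on"}

print(what_changed(working, broken))
# {'baud_rate': (9600, 19200), 'proxy': (None, 'on')}
```

A diff like this also exposes whatever a previous shotgun troubleshooter altered, which tells you what to put back before you start.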

Fifth Step
Usually the problem will resolve on step four. But with particularly stubborn problems it's time to step back and look at the big picture. Look for other things that depart from what they should be. In step two you determined how it's supposed to work, so now try to find anything else that's affected adversely. Things like, "oh yeah, memory use is also high", or "this component is also unusually hot." Once you determine the other things that are affected, try to determine what the common denominator is. That might lead you to the problem. But be aware if there are multiple problems you could get really confused doing this step. If that happens, factor into your thinking the multiple problem possibility and see if that helps. If not, then go to the next step.
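
Finding the common denominator is, at bottom, a set intersection: list what each affected thing depends on and see what they all share. The Python sketch below assumes you can write those dependencies down by hand; the function name common_denominator and the component names are hypothetical.

```python
def common_denominator(depends_on: dict, affected: list) -> set:
    """Return the components that every affected item depends on."""
    if not affected:
        return set()
    shared = set(depends_on[affected[0]])
    for item in affected[1:]:
        shared &= set(depends_on[item])
    return shared


# Hypothetical system: what each misbehaving item relies on.
depends_on = {
    "report_totals": {"database", "network_switch"},
    "operator_screen": {"network_switch", "display_driver"},
    "alarm_log": {"database", "network_switch"},
}

print(common_denominator(depends_on, ["report_totals", "operator_screen"]))
# {'network_switch'}
```

If the result comes back empty, the affected items share nothing, which is one hint that you may be dealing with multiple problems.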

Sixth Step
Has it ever happened before? Some people think this should be the first step and sometimes that works. But I have found that more often than not it can lead you astray if you haven't done the other steps first. These steps are designed to lead toward greater understanding, whereas if you do step six first it's just rote procedure. The point of the diagnostic sequence is to proceed methodically and with certainty toward problem resolution. Doing step six first not only misses the root cause but also can lead to shotgun troubleshooting (a bad thing). But if you've done the preceding steps already, you can now safely ask for previous similar problems and their resolution or look into the maintenance logs. You can also call the manufacturer for help. If they are any good they are going to ask you steps one to five anyway. Steps one to five put step six into the proper context.

Seventh Step
Why do the affected outputs act this way? Malfunctions will usually show up as misbehaving outputs. Reports will have bad numbers. Motors won't start. Screens won't display. Up until now any lay person with little or no technical knowledge could perform the diagnostic sequence and win in troubleshooting. But on this step you need the technical background. Fortunately, most problems resolve before this step. But even without a technical background the lay person can direct the technician on this step to success.
The technician on this step traces logic, wiring, hydraulics, pneumatics, or whatever the system consists of, back from a defective output to the exact thing that is causing the malfunction. When there are multiple problem causes this step is usually the only one that works. In this case it is an iterative process (trace one problem, trace the next, etc.). As mentioned before, multiple problems can exist due to prior shotgun troubleshooting being done, but they always exist during the development-debug phase of any product or project. During this development cycle there are often hundreds or thousands of problems to rectify. That's why you ask if it ever worked as in step three. Sometimes projects are left unfinished or a few bugs slip by.
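
The trace-back itself can be pictured as walking a wiring or logic diagram upstream from the bad output until you reach something that tests bad even though everything feeding it tests good. The Python sketch below is a simplified model under stated assumptions: inputs_of is a hand-written map of what feeds each component, is_ok stands in for the technician's actual measurement, and all names are invented for the example.

```python
from typing import Callable, Dict, List


def trace_fault(output: str,
                inputs_of: Dict[str, List[str]],
                is_ok: Callable[[str], bool]) -> List[str]:
    """Walk upstream from a defective output and return the suspects:
    components reached through bad signals whose own inputs all test good.
    Assumes the starting output is already known to be defective."""
    suspects, seen, stack = [], set(), [output]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        bad_inputs = [n for n in inputs_of.get(node, []) if not is_ok(n)]
        if bad_inputs:
            stack.extend(bad_inputs)  # the fault lies further upstream
        else:
            suspects.append(node)     # nothing bad feeds it, so it is the cause
    return suspects


# Hypothetical motor circuit and the technician's test results.
inputs_of = {
    "motor": ["contactor"],
    "contactor": ["control_relay", "fuse_3"],
    "control_relay": ["plc_output"],
}
tests_good = {"motor": False, "contactor": False, "control_relay": True,
              "plc_output": True, "fuse_3": False}

print(trace_fault("motor", inputs_of, lambda name: tests_good[name]))
# ['fuse_3']
```

With multiple problem causes the function returns more than one suspect, which mirrors the iterative tracing described above: fix one, retest, trace the next.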

There are a few other factors to keep in mind. If the system ever worked then odds are there is only a single problem to find. This procedure will help you find it quickly. When things start getting complicated you should suspect that multiple problem causes are present. Start asking about prior troubleshooting attempts. Another approach is to back up a step or two. Maybe you missed something.

Another thing to know is that users will tell you all the things they have tried. I recommend you verify everything for yourself and not trust their narratives. I've been led astray before, e.g. by "I tested all the fuses and they're all good," only to discover half an hour later that one of them was blown (they didn't know how to properly test fuses). Only by moving forward methodically and gaining your own certainty of things can you conquer the problems. Sometimes you need to be tactful and say "Please don't be offended if I recheck a few things. It's part of the procedure I use." People are never offended; they're just glad you're there to help.

Realize that when you come onto the scene, people are really confused, or else they would have solved it already. Don't let their confusions become yours - rely on your procedure. They will tell you all types of things. They will even tell you they have tried everything and try to tell you why it can't be fixed. They are just trying to justify why they couldn't solve it - they are usually embarrassed - so be kind. Take their data (you need it) but don't take their recommendations - trust only your procedure and win!
