Narrowing the Problem Domain
One of the ongoing tasks in industrial automation is troubleshooting. It’s not glamorous, and it’s a quadrant one activity, but it’s necessary. Like all quadrant one activities, the goals is to get it done as fast as possible so you can get back to quadrant two.
Troubleshooting is a process of narrowing the problem domain. The problem domain is all the possible things that could be causing the problem. Let’s say you have a problem getting your computer on the network. The problem can be any one of these things:
- Physical network cable
- Networks switch(es)
- Network card
- Software driver
In order to troubleshoot as quickly as possible, you want to eliminate possibilities fast (or at least determine which ones are more likely and which are unlikely). If you don’t have much experience, your best bet is to figure where the middle point is, then isolate the two halves and determine which half seems to be working right and which isn’t. This is guaranteed to reduce the problem domain by 50% (assuming there’s only one failure…). So, in the network problem, the physical cable is kind of in the middle. If you unplug it from the back of the computer and plug it into your laptop, can the laptop get on the internet? If yes, the problem’s in your computer, otherwise, it’s upstream. Rinse and repeat.
As you start to gain experience, you start to get faster because you can start to assign relative probabilities of failure to each component. Maybe you’ve had a rash of bad network cards recently, so you might start by checking that.
In industrial automation, I’ve seen a pattern that pops up again and again that helps me narrow the problem domain, so I thought I’d share. Consider this scenario: someone comes to you with a problem: “the machine works fine for a long time, and then it starts throwing fault XYZ (motion timeout), and then after ten minutes of clearing faults, it’s working again.” These annoying intermittent problems can be a real pain, because it’s sometimes hard to reproduce the problem, and it’s hard to know if you’ve fixed it.
However, if you ask yourself one more question, you can easily narrow it down. “Is the sensor that detects the motion complete condition a discrete or analog sensor?” If it’s a discrete sensor, the chance that the problem is in the logic is almost nil. I know our first temptation is always to break out the laptop, and a lot of people have this unrealistic expectation that we can fix stuff like this with a few timers here or there, but that’s not going to help. If you have discrete logic that runs perfectly for a long time and then suddenly has problems, it’s unlikely there’s a problem in the logic. There’s a 99% certainty that it’s a physical problem. Start looking for physical abnormalities. Does the sensor sense material or a part? If yes, is the sensor position sensitive to normal fluctuations in the material specifications? Is the sensor affected by ambient light? Is the sensor mount loose? Is the air pressure marginal? Is the axis slowing down due to wear?
The old adage, “when all you have is a hammer, every problem is a nail”, is just as true when the only tool you have is a laptop. Don’t break out the laptop when all you need is a wrench.