There’s a recurring theme on this blog when I talk about programming. It’s a skill you don’t learn in school, but which separates recent graduates from employees with 5 years of experience: how to manage complexity.
The reason you never learned how to manage complexity in nearly 20 years of school is that you never actually encountered anything of sufficient complexity that it needed managing. That’s because at the beginning of each term/semester/year you start with a blank slate. If you only ever spend up to 500 hours on a school project, you’ll never understand what it feels like to look at a system with 10,000 engineering hours logged to it.
That’s when the rules of the game change. You’ve only ever had to deal with projects where you could hold all the parts and variables in your head at one time and make sense of it. Once a project grows beyond the capability of one person to simultaneously understand every part of it, everything starts to unravel. When you change A you’ll forget that you had to change B too. Suddenly your productivity and quality start to take a nosedive. You’ve reached a tipping point, and there’s no going back.
The funny thing is that you learn about all the tools for managing complexity in school. When I was in high school I was taught about structured programming and introduced to modular programming. In university my first programming class used C++ and went through all the features like classes, inheritance and polymorphism. However, since we’d never seen a system big enough to need these tools, most of it fell on deaf ears. Now that I think about it, I wonder if our professors had ever worked on a system big enough to need those tools.
The rules of managing complexity are:
- Remove unnecessary dependencies between parts of your system. Do this by designing each module against an abstract interface that only includes the functionality it needs, nothing else.
- Make necessary dependencies explicit. If one module depends on another, state it explicitly, preferably in a way that can be verified for correctness by an automated checker (compiler, etc.). In object-oriented programming this is typically done with constructor injection.
- Be consistent. Humans are pattern-recognition machines. Guide your reader’s expectations by doing similar things the same way everywhere. (Five-rung logic, anyone?)
- Automate! Don’t make a person remember any more than they have to. Are there 3 steps to make some change? Can we make it 2, or better yet, 1?
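The first two rules can be sketched in a few lines of code. Here's a minimal, hypothetical example in Python (the names `Motor`, `VfdMotor`, and `Conveyor` are invented for illustration): the module depends only on the narrow abstract interface it needs, and that dependency is stated explicitly in the constructor, where a type checker can verify it.

```python
from abc import ABC, abstractmethod

class Motor(ABC):
    """Abstract interface: only the functionality Conveyor needs, nothing else."""
    @abstractmethod
    def set_speed(self, rpm: float) -> None: ...

class VfdMotor(Motor):
    """One concrete implementation; Conveyor never depends on it directly."""
    def __init__(self) -> None:
        self.rpm = 0.0
    def set_speed(self, rpm: float) -> None:
        self.rpm = rpm

class Conveyor:
    def __init__(self, motor: Motor) -> None:
        # Constructor injection: the dependency is explicit, checkable
        # by automated tools, and easy to swap for a fake in tests.
        self._motor = motor
    def run(self) -> None:
        self._motor.set_speed(1200.0)

drive = VfdMotor()
Conveyor(drive).run()
print(drive.rpm)  # → 1200.0
```

Note that `Conveyor` would work just as well with a simulated motor, which is exactly the kind of decoupling that keeps a 10,000-hour system testable.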
It would be interesting if, in university, your second year project built on your first year project, and your third year project built on your second year project, etc. Maybe by your fourth year you’d have a better appreciation for managing complexity.
It’s easy to overlook the power of human motivation in automation systems.
I’m going to assume you don’t work in a lights-out factory. That’s pretty rare. Almost all automation systems interact with people on a regular basis, and even though we have high fidelity control over our automation processes, people are notoriously difficult to predict, let alone control.
For instance, consider a process with a reject station. Finished parts are measured and parts that don’t meet specification are diverted down a chute. Good parts continue down the line.
Question: how big do you make the reject bin? Naively you might want to make it as big as possible so the operator doesn’t have to waste time emptying it. Unfortunately that means it’s just as easy to make bad parts as it is to make good parts. You’d be better off making the reject chute only hold 3 parts, and putting a full sensor on the chute that throws a fault and stops the machine when it’s full. Then it’ll be a pain in the ass to make bad parts, and the operator will have a lot more motivation to do something about it. While you’re at it, put the reject chute on the other side of the machine so they have to walk around the machine to empty it.
Consider a machine with an e-stop button. It’s a big red button with a mushroom head that’s supposed to be easy to hit. However, I’ve seen a lot of machines where the consequence of hitting that button was major downtime because the part tracking got screwed up or the machine just didn’t recover gracefully. I once watched a pallet get bumped out of the track so that it was riding along the rail of the conveyor. I hit the e-stop just before it was about to damage some equipment. I was scolded for my efforts: “never hit the e-stop,” I was told. That’s the wrong motivation. You want operators to press the button when they see something wrong, so make it easy to recover.
Consider an inventory tracking system. You want people to record stuff they’ve consumed, and what cost account they’ve consumed it against. What motivation does a person standing there with a bolt in their hand have to look up that bolt in your inventory system and mark it consumed? Very little. What if you lock the door to the store room and make them request an item before you unlock the door? That’ll help, but chances are some people who don’t know exactly what they want will click the first item on the list, and just go in and browse. What if you make the inventory storage system so convoluted that the only way to find an item is to look up the storage location in the computer? Well, that might work (until your inventory system breaks).
Like water, people tend to take the easiest path downhill. You’re better off digging a channel where you want it to go than expecting it to get there under its own power. Use gravity to your advantage. Make it harder to do things the wrong way and easier to do things the right way.
If you’re going to interview for a control systems job in a plant, they’ll ask you a lot of questions, but you should also have some questions for them. To me, these are the minimum questions you need to ask to determine if a future employer is worth pursuing:
- Do you have up-to-date electrical drawings in every electrical panel? – When the line is down, you don’t have time to go digging.
- Do you have a wireless network throughout the plant? – It should go without saying: good, reliable wireless connectivity all over your facility really helps when you’re troubleshooting issues. Got a problem with a sensor? Just set up your laptop next to the sensor, go online, look at the logic, and flag the sensor. You don’t have time to walk all over.
- Does every PC (including on-machine PCs) have virus protection that updates automatically? – We’re living in a post-Stuxnet world. Enough said.
- Have you separated the office network from the industrial network? – Protection and security are applied in layers. There’s no need for Jack and Jill in accounting to be pinging your PLCs.
- What is your backup (and restore) policy? – Any production-critical machine must always have up-to-date code stored in a known location (on a server or in a source control system), it must be backed up regularly, and you have to test your backups by doing regular restores.
- Are employees compensated for working extra hours? – Nothing raises a red flag about a company’s competency more than expecting 60+ hour weeks but not paying you overtime. It means they’re reactive, not proactive. It means they don’t value experience (experienced employees have families and can’t spend as much time at the office). It probably means they scored poorly in the previous questions.
You don’t have to find a company that gets a perfect score on this test, but if they miss more than one or two, that’s a warning sign. If they do well, they’re a proactive company, and proactive companies are sane places to work.
In case you’ve never read my blog before, let me bring you up to speed:
- Write readable PLC logic.
Now, I’m a fan of ladder logic, because when you write it well, it’s readable by someone who isn’t a programmer, and (in North America, anyway) maintenance people frequently have to troubleshoot automation programs and most of them are not programmers.
That doesn’t mean I’m not a fan of other automation languages. I think structured text should be used when you’re parsing strings, and I like to use sequential function chart to describe my auto-mode logic. I’m also a fan of function block diagram (FBD), particularly when working with signal processing logic, like PID loops, etc.
What I’m not a fan of is hard-to-understand logic. Here’s FBD used wisely:
Here’s an example of FBD abuse:
I’m still reading Clean Code: A Handbook of Agile Software Craftsmanship by Robert C. Martin. He’s talking about traditional PC programming, but one of the “rules” he likes to use is that functions shouldn’t have many inputs. Ideally 0 inputs, maybe 1 or 2, possibly 3, but never more than 3. He says if you go over 3, you’re just being lazy: break the function up into multiple functions.
I think that applies equally well to FBD. The reader can easily reason about the first image, above, but the second one is just a black box with far too many inputs and outputs. If it doesn’t work the way you expect (and it’s doubtful it does), you have to keep going inside it to figure it out. Unfortunately, once you’re inside, all the variable names change, etc.
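Translated out of FBD into a textual sketch, the same rule looks like this. The example below is hypothetical (Python for illustration; the names `AlarmLimits` and `check_signal` are invented): instead of one block trailing six loose inputs, group the related inputs into a named type and keep the block’s job small.

```python
from dataclasses import dataclass

@dataclass
class AlarmLimits:
    """Related inputs grouped into one named thing the reader can track."""
    low: float
    high: float

# Before: one "block" with too many loose inputs (sketch only).
def check_signal_v1(value, low, high, hysteresis, enable, latched):
    ...

# After: two inputs, and the function's whole job fits in one line.
def check_signal(value: float, limits: AlarmLimits) -> bool:
    """Return True if the value is inside its alarm limits."""
    return limits.low <= value <= limits.high

assert check_signal(5.0, AlarmLimits(low=0.0, high=10.0))
assert not check_signal(12.0, AlarmLimits(low=0.0, high=10.0))
```

The hysteresis and latching from the “before” version would become their own small functions, each readable on its own.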
I understand the necessity of code re-use, but not code abuse. If you find yourself in the situation of example #2, ask yourself how you can refactor it into something more readable. After all, the most likely person who has to read this later is you.
We’re all probably familiar with the idea that it takes half the time to get to 90% done and the other half to finish the last 10%. This is a staple of project management.
I think there’s actually a narrower scope of really dangerous solutions that you only become familiar with after you experience them. There’s a whole set of problems where the obvious solution gets you 95 to 98% of the way to your performance spec really quickly, but is almost impossible to push to 100% by incremental improvements. The reason I say they’re dangerous is because the feeling of being “almost there” prevents you from going back to the drawing board and coming up with a completely different solution.
I can remember a machine vision job from years ago where the spec was “100% read rate”. I only got it to about 94%, and someone else gave it a try. He got it up over 96%, but 100% was out of reach given the technology we had.
Experiences like that make you conservative. Now I unconsciously filter possible solutions by their apparent “flakiness”. I’m much more likely to add an extra prox to a solution to verify a position than to rely on timers or other kinds of internal state, because the latter are more prone to failure during system starts and stops. I press for mechanical changes when I used to bend under the pressure to “fix it in software”.
Still, you have to be careful. It’s easy to discount alternatives just because they bear some passing resemblance to a bad experience you had before. You have to keep re-checking your assumptions. Unfortunately, rapid prototyping usually fails to uncover the “almost there” situation I’m talking about. If you prototype something up fast, and it works in 97% of your lab tests, you’ll probably think you have a “proof of concept”, and go forward with it.
The best way to test new solutions is to put them into production on a low risk system. If you’re an integrator, this means having a really good relationship with your customer (chances are you need their equipment to run your tests). If you work for a manufacturer, you can usually find some out-of-the-way machine to test on before you go all-in.
Start by watching this video about the Aeryon Scout robot (kudos Kareem):
I think what sets this aerial robot apart, as Kareem says, is the intuitive user interface. When I look at the state of automation today, I can see that good user interfaces are typically an afterthought. Custom solutions are sometimes so cobbled together that there isn’t enough bandwidth between one black box and the HMI, or the HMI is just a simple two-line text display that ends up saying FAULT 53 (the manual with the list of faults, of course, is stuck inside the door of the panel, and it’s the only thing in the area that isn’t covered in grease because nobody bothers to look at it).
People frequently blame engineers for this mess, which I find a bit silly. Certainly user interfaces are a critical component of any system, but why do you hire an electrical designer to do the electrical design, hire a programmer to write the software, but expect one of these people to magically become a usability expert, which is a field unto itself?
I think there used to be an idea that there was no payback on usability. Certainly if you’re selling something like a VCR, you can only print features on the box (you can’t accurately represent the experience of using it) and people only buy one. However, as items become more social (think iPhone), we’re starting to see great user interfaces create viral marketing for products. I think I first saw this with the TiVo – once you saw what it could do, and how easy it was, you were hooked. Apple’s technology seems to be the same way, and I can see how the Aeryon Scout probably has the same “shock and awe” effect when you demo it.
Where does that leave us with industrial automation interfaces? Automation is always purchased based on a cost-benefit analysis because of the high capital cost. The operators typically don’t participate in the purchasing decision at all. I don’t think effort put into a better user interface is wasted; in fact I’m certain there’s a long term payback. But it’s not a selling feature and it takes more time to do right.
Still, when I programmed a machine recently, it was nice to overhear someone say, “it’s pretty intuitive, isn’t it?” So I guess I’ll keep trying, even if it’s not in my own best interest. Engineers are weird that way.
If you’re interested in making better user interfaces, first I recommend reading The Design of Everyday Things by Donald Norman. I also recommend this video called the least you can do about usability by Steve Krug, author of Don’t Make Me Think: A Common Sense Approach to Web Usability, 2nd Edition.
As we all know, there are 10 kinds of people in the world.
For those of you who haven’t read Zen and the Art of Motorcycle Maintenance by Pirsig, he spends at least one chapter at the beginning talking about how we naturally tend to divide things into smaller pieces in an effort to understand them. The novice looks at a motorcycle and sees the visible things, like a seat, handlebars, and wheels, but the expert sees a fuel system, a cooling system, and the suspension. The same thing or system (motorcycle) can be subdivided different ways depending on what we want to do with it.
My tongue-in-cheek title of this post is an acknowledgement of the many ways we can categorize something like Automation Software, but for my purposes today, I’m making two categories: hammers and levels.
A carpenter carries both a hammer and a level, but the two have fundamentally different failure modes. If a hammer stops working, you’ll know it as soon as you try to use it. As long as it hammers in a nail, it doesn’t matter if the hammer is rusty, dirty, scratched or dented, it’s a working hammer. The level, on the other hand, is a measuring instrument. As novices, we assume that it comes from the factory pre-calibrated, and we happily hang our shelf or picture without testing it, but a professional carpenter knows that they have to check their levels for accuracy, or else the level is useless. You could use a level for years, but if one day it stopped being accurate, you probably wouldn’t know. This is a very different situation than the hammer.
Software in general, and automation software in particular, offers similar examples. You never need to “calibrate” the Axis 1 Advanced proximity switch on a machine because if it doesn’t work, the machine won’t make parts (and you’ll know about it instantly, usually via a 2 am phone call). On the other hand, testing data collection logic is surprisingly difficult because the only way to test it is to compare it with a known-good equivalent. Assuming you created this data collection logic to automate away a manual process, the only measuring stick we can check it against is the manual process we’re replacing. Once the system is bought off and we get rid of the paper system, how do you prove that subsequent changes don’t break the data collection system?
It’s tempting to brush off the problem by saying that anyone who makes a subsequent change has to do a full regression test of the system, including the data collection system, but anyone who has worked in a real factory environment knows that this is unlikely to work in practice. Full regression tests are expensive.
In the greater software world, they use automated unit tests. They take the logic being tested and they run it through a series of automated checks to make sure nothing changes. This works well in an environment like PC programming, but is very difficult in practice for PLC programming because (a) you usually need a physical PLC to execute the logic (unless you have some kind of emulator) and (b) the people maintaining the system are likely not familiar with concepts like unit tests, and are likely to undervalue their importance.
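To make the unit-test idea concrete, here’s a hedged sketch of what it might look like if you could run a small piece of scan logic outside the PLC. This is Python standing in for an emulator, and the `debounce` function is invented for illustration: one call represents one scan of a debounce filter, so a test can drive it scan by scan.

```python
def debounce(raw: bool, count: int, threshold: int = 3) -> tuple[bool, int]:
    """One 'scan' of a debounce filter: the output only turns on after
    the raw input has been true for `threshold` consecutive scans."""
    count = count + 1 if raw else 0
    return count >= threshold, count

def test_debounce_ignores_glitches() -> None:
    count = 0
    # A one-scan glitch should not turn the output on.
    out, count = debounce(True, count)
    assert not out
    out, count = debounce(False, count)
    assert not out

def test_debounce_passes_steady_signal() -> None:
    count = 0
    # Three consecutive true scans should turn the output on.
    for _ in range(3):
        out, count = debounce(True, count)
    assert out

test_debounce_ignores_glitches()
test_debounce_passes_steady_signal()
print("all tests passed")
```

The hard part on a real project isn’t the test itself, it’s structuring the logic so it can be exercised one scan at a time like this.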
This screams for a system-level solution. Take accounting for instance. Double-entry accounting (the use of debits and credits to force every action to be made twice) is deliberately created to help catch manual entry errors. If your debits and credits don’t balance, you know you’ve made a mistake somewhere, and you go back and check your arithmetic.
In the automation world, the solution is to measure every input to the data collection system two ways, analyze and aggregate both separately, and compare the end results. Create a system warning or fault if the results don’t match. For instance, measure the amount of material going into the machine, and measure the amount of material exiting the machine, both as finished product, and scrap. If the input doesn’t match the sum of the outputs over the same time period, you know you have a problem. The system becomes self-checking (a hammer rather than a level).
If you follow this route, you need to take care to avoid some common traps:
- Don’t re-use logic between the two sides (in fact, try to make them work differently)
- Try to use different sensors or sensing methods (can we measure the input by speed and duration, and the output by parts and scrap weight?)
- Record both, so if there is a discrepancy, you can check them against manual measurements
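A minimal sketch of the self-checking idea (Python for illustration; the tolerance value and function names are assumptions, not from any real system): tally the material in one way, the material out another way, and flag a warning when the two independent measurements disagree.

```python
TOLERANCE_KG = 0.5  # assumed acceptable disagreement between the two measurements

def check_material_balance(infeed_kg: float,
                           good_parts_kg: float,
                           scrap_kg: float) -> bool:
    """Compare material entering the machine against material leaving it
    (finished product plus scrap) over the same period. Returns True if
    the two independently-measured totals agree within tolerance."""
    return abs(infeed_kg - (good_parts_kg + scrap_kg)) <= TOLERANCE_KG

# Measurements agree: the data collection system checks out.
assert check_material_balance(100.0, 95.2, 4.9)
# A 2 kg discrepancy: a sensor, the logic, or the process is wrong,
# so raise a system warning and go investigate.
assert not check_material_balance(100.0, 95.0, 3.0)
print("balance checks OK")
```

Like the debits and credits in double-entry accounting, a mismatch doesn’t tell you which side is wrong, only that something is, and that’s enough to turn a silent failure into a 2 am phone call.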
It sounds like more work, but making the system self-checking actually reduces the amount of testing you have to do, so it’s not as bad as you think. Besides, writing code is a lot more fun than testing it. We automate everyone else’s job, why not the boring parts of ours?
As an engineer I notice I’m in a minority of people who are obsessed with discovering the “right” way to do something. At the same time, I know that there’s more than one way to do it. Not only that, but the “right” way isn’t the right way until someone discovers it, and the right way becomes the wrong way once someone discovers a better way (and then does the hard work of convincing everyone else that it’s better).
When we say there’s a “right” way, we’re implying that there’s also one or more “wrong” ways, but we’re also implying some more subtle nuances:
- The “right” way is rarely the one that’s obvious to the novice
- The “right” way takes more time, effort, and resources up front
- It’s the “right” way because the additional investment pays off in the long term
I think these are interesting and insightful observations. All of them are strictly tied to experience. Nobody starts their first task on their first day of work and says, “wow, I can do this two different ways… I have no other way to weigh the value of these strategies, so I’ll choose the one that takes more effort, time, and money.” By default, we choose the easiest, fastest, and cheapest route available to us. We change our strategy (if ever) only after seeing the outcome of the first trial. We only change if we see that the extra investment now will pay off for us down the road.
That’s why a homebuilder only builds your house “to code”. Have you ever seen how they build their own home? They put extra reinforcing material in the foundation, they use better materials, use longer-lasting shingles, and they take care to get the best people to work on it. That’s because to them, the “right” way is different if they’re building your house vs. their home. Your house is just short-term profit, but they want their home to pay them back over the long haul. This is normal, logical, and selfish behaviour.
Yet I think Engineers, Architects, and Designers have some malfunction in their DNA that makes them obsessed with doing what’s in their client’s best long term interest even if there’s no benefit for them personally. There’s some kind of obsessive-compulsive aversion to a sub-optimal design. I would argue that it’s a prerequisite for those professions.
This often leads to frustrating conversations with clients, because the Engineer (not usually that good with social interactions to begin with) is trying to convince the client that they’re going to have to spend more money than they’d planned, it’s going to take longer than they thought, and it’s going to be more difficult than they’d imagined. That is, in fact, what the client is paying for: experience. The Engineer (who is a crazy deviant, always in search of some mythical “right” way of doing things) doesn’t understand why the client is upset, since they’ll be saving a boatload of anguish in the long term.
The most frequent complaint about Engineers has to be that they make things “too complicated”, and to be certain, you can take it too far, but what we’re really doing is inhabiting the losing end of every argument. As Engineers, we’re asking people to endure pain now for the promise of a better life later. If that were easy, the dessert industry would have perished long ago.