
Get to the Root of the Cause

EP Editorial Staff | October 12, 2015


If you’re good at root-cause analysis, you learn from your mistakes. If you’re great at it, you prevent mistakes from happening.

By David J. Mierau, P.E., CMRP, Life Cycle Engineering (LCE)

The concept of analyzing failures and tragedies to prevent them from recurring isn’t new. It is, in fact, the foundation of many of our legal systems and regulatory entities. The United States Pure Food and Drug Act of 1906 is a good example. Enactment of this legislation, which effectively created the Food and Drug Administration (FDA), was in response to the negative impact of dangerous and/or misbranded food and drugs on public health. Among other things, people wanted scientifically tested medicines with controlled ingredients and dosages—not snake oil.

Formal failure-analysis documentation began to appear in the defense, automotive, aerospace, and nuclear-power industries in the mid-twentieth century. What started with simple techniques, such as the Five Whys, evolved into more sophisticated approaches, such as Events & Causal Factor Analysis, and Management Oversight and Risk Tree (MORT). Today, there are hundreds, if not thousands, of methods for performing root-cause analyses across nearly every industry. How good are your skills in this area?

Leverage application knowledge

Engineers are classically trained in mathematics and physical sciences. We are taught to evaluate, analyze, and calculate. Most current engineering curricula contain courses that include problem-solving elements, something that’s often intertwined with root-cause analysis (RCA).

To be as effective as possible at RCA, however, engineers need to be able to combine process and theory with strong application knowledge. While they must understand RCA fault-tree methodology inside and out, without a working knowledge of the equipment or situation being analyzed—or access to personnel with that experience—results of a root-cause analysis won’t be comprehensive or effective.

In RCA, investigators often focus on the “who” and “what.” These areas can be categorized into “human” and “physical” causes of failures.

Human causes. We look at the ways people make mistakes. Maybe an operator forgets to adjust the speed of a conveyor belt when the line changes from one product to another. Maybe a mechanic uses a standard socket wrench instead of a torque wrench to tighten the bolts on a machine. Overcoming these human causes of failure requires disciplinary or training actions, right? Not necessarily.

Physical causes. We look primarily at mechanical factors. Perhaps your washing machine stops working. You remove the front access panel and discover the drive belt is broken between the electric motor and wash-drum pulleys. Overcoming this physical cause of failure requires nothing more than a belt replacement and restart, right? No and no.

Most human and physical causes are actually causal factors—or “proximate” causes. They are the events that occurred immediately before the main event or undesired outcome.

Granted, if you knew the drive belt on your clothes washer was going to break shortly before it did (through vibration monitoring, perhaps), you might have replaced it and prevented a failure. But what if your machine were only five months old, or someone had recently tried to wash 15 heavy bath towels in it? The idea of replacing the belt every four months might not seem very reasonable, after all. That type of solution would only address an obvious physical issue.

Dig deep

Drilling down through several layers of causes, below the human and physical proximate causes, will eventually bring an investigator to the root cause of a problem. The root cause is what created the proximate cause and the subsequent undesired outcome.

Typically, a problem has only one root cause or, at the most, very few. Categorized as latent causes, these are underlying issues leading to a failure. If the latent causes of a given problem are eliminated, situations created by human and physical proximate causes, associated with the problem, won’t surface.

In the example of the broken drive belt, 15 wet towels created excess weight in the washing machine that the belt couldn’t handle. Thus, the belt snapped. The question is, why would someone load so many towels into the machine in the first place? Answer: Lack of operator knowledge.

When a big-box store or other vendor delivers and installs an appliance, it doesn’t necessarily provide a turnover checklist item to train the new owners/operators. This issue could be categorized as lack of procedure/checklist. If an administrative control had been in place for delivery personnel to review the washing-machine manual—including the section on loading limitations—with you and/or others in your household, fewer towels would have been jammed into the unit, an overload situation would have been avoided, and the belt would not have failed. But that’s just one solution to the problem. A better one might be for the manufacturer to redesign the washing machine with a load sensor that alerts the user, at the start of the wash cycle, to reduce the load. Some of today’s smart washers incorporate such sensors.
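The washing-machine chain above can be written down as a simple linked record of causes. The following is only a minimal sketch, not a formal RCA tool: the `Cause` class, the `walk` helper, and the category labels are hypothetical names chosen for illustration, and the chain is deliberately linear, whereas a real fault tree usually branches.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class Cause:
    """One link in a cause chain (hypothetical structure for illustration)."""
    description: str
    category: str                   # "physical", "human", or "latent"
    why: Optional["Cause"] = None   # the deeper cause behind this one, if known


def walk(cause: Optional[Cause]) -> Iterator[Cause]:
    """Yield each cause in order, from the proximate cause down to the deepest one."""
    while cause is not None:
        yield cause
        cause = cause.why


# The washing-machine example, recorded from the deepest cause upward.
no_checklist = Cause("No delivery checklist item to review loading limits", "latent")
no_knowledge = Cause("Operator did not know the loading limit", "latent", why=no_checklist)
overload = Cause("15 heavy towels loaded at once", "human", why=no_knowledge)
broken_belt = Cause("Drive belt snapped", "physical", why=overload)

chain = list(walk(broken_belt))
root = chain[-1]  # the deepest recorded cause is treated as the root cause

# Print the chain, indenting one level for each "why" asked.
for depth, cause in enumerate(chain):
    print(f"{'  ' * depth}{cause.category}: {cause.description}")
```

Replacing the belt acts only on the first entry in `chain`. The checklist review or the load-sensor redesign acts on the last entries, which is why those fixes also keep the proximate causes above them from recurring.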

The goal

Determining and eliminating the root causes of non-working washing machines (serious though they may be in our personal lives) pales in comparison to doing the same for industrial problems that could put people’s lives and untold millions of dollars at stake. If your plant personnel are trying to be as safe and efficient as possible and working to drive out unplanned downtime, understanding and mitigating latent causes of equipment and process problems is a big part of getting there.

Ultimately, systems and processes need to be proactively evaluated and designed in ways that eliminate or mitigate common latent failure causes. Great engineers and technicians can take their knowledge of machine design, combine it with their operational experience, and apply this package of strengths to predict human and physical proximate causes. Fully understanding the potential latent causes of a problem’s proximate causes will lead to improved root-cause analyses and, ultimately, safer, more reliable failure-prevention measures and solutions.

David J. Mierau, P.E., CMRP, is director of Reliability Solutions for Life Cycle Engineering (LCE.com), based in Charleston, SC. A member of the International Society for Pharmaceutical Engineering (ispe.org) and Society for Maintenance and Reliability Professionals (smrp.org), he has a broad range of industry experience.


To learn more, consult:

United States Department of Energy Guideline DOE-NE-STD-1004-92: Root Cause Analysis Guidance Document

DPST-87-209: “User’s Guide for Reactor Incident Root Cause Coding Tree,” E.I. du Pont de Nemours & Co.

Maintenance Engineering Handbook, 8th Ed., K. Mobley
