Sunday, 13 January 2013

Explaining Run-to-failure

I often read posts and articles railing against run-to-failure (RTF) as a valid maintenance strategy, or worse, RCM practitioners trying to apologise for run-to-failure decisions and explain other ways of doing things.

RTF is hard to accept initially because engineers and tradesmen (Journeymen) have an affinity with machinery, as Moubray would say. We like to get machines running, tune them to the max, and keep them running smoothly.

It's in our DNA fortunately... But this doesn't help us to create the minimum safe levels of maintenance for our physical assets. 


The road to RTF


Risk is built into the RCM process, from the very first function statement. 

It is built in when we define all of the functions associated with an asset or asset system; built in when we list all of the functional failures to ensure we capture all reasonably likely failure modes; and built in when we list the failure modes themselves.

Failure modes need to be both reasonably likely (the frequency / probability side of most criticality diagrams) and at the right level of causality.

So if we have done this right we now have all the failure modes that are reasonably likely to occur, and we are dealing with them at a level of causality where we can do something proactive about them.

Next are the Failure Effects. These are detailed focusing on the typical worst case scenario. So this also takes care, in an interactive sense, of the consequence / severity side of most criticality matrices. 


Categorising consequences


Consequence categorisation within RCM is tightly controlled and is a hierarchy starting from safety and proceeding through to non-operational consequences (or direct cost implications only).

At each turn there are also self-managing triggers. Within safety and environment, for example, the trigger is intolerable risk: "Is there an intolerable risk of..."

Determining tolerable risk is something we cover in the RCM Analyst course, and is becoming more widely acknowledged with the publication of recent standards in this area.

So if you say no to this then we know that the typical worst case scenario effects, of the reasonably likely failure mode, at the right level of causality will not have an intolerable risk of safety or environmental impacts. 

If we had said that there was an intolerable risk, and we have decided this based on a rigorous and detailed approach to determining tolerable levels of risk, then RTF is NOT AN OPTION. Ever, at any point in time. If the risk is intolerable then you must do something about it.

So this leaves only the Economic end of the decision algorithm: Operational and Non-Operational consequences.

So let's say we have determined that the typical worst case scenario of the reasonably likely failure mode at the right level of causality is Operational. (Sorry for repeating; it stops here.)

Then we work down the algorithm to determine which task is both Applicable (can be applied and will get a result) and Effective (will achieve a result that is worth having).

Without going through the entire decision path for Operational Consequences we decide that for whatever reason we cannot apply Predictive Maintenance, Preventive Restoration or Preventive Replacement. (Either the technology / task didn't suit, or the benefits were not worth having)

Then we find ourselves at the choice between Run to Failure and redesign.

It needs to be said that RCM is a maintenance-first approach, not a maintenance-only approach. And the recent publication of the RCM standard supports this statement. (More on that later)

So we now have a failure mode with no intolerable risk of safety or environmental consequences, where no routine task can be applied for reasons of either Applicability or Effectiveness. 
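For readers who think better in code, the path described so far can be sketched roughly as below. This is only an illustration of the logic in this post, not the full RCM decision diagram. The names `Outcome` and `decide`, and the four yes/no inputs, are hypothetical labels for the answers a review group would give, and `ACTION_REQUIRED` simply stands for "you must do something about it".

```python
from enum import Enum


class Outcome(Enum):
    ROUTINE_TASK = "routine task"
    ACTION_REQUIRED = "action required, RTF not allowed"
    REDESIGN = "redesign"
    RUN_TO_FAILURE = "run to failure"


def decide(intolerable_risk: bool,
           task_applicable: bool,
           task_effective: bool,
           redesign_cheaper_than_rtf: bool) -> Outcome:
    """Simplified sketch of the decision path; all inputs are assumptions."""
    # A routine task is only selected if it passes both tests.
    if task_applicable and task_effective:
        return Outcome.ROUTINE_TASK
    # No suitable task and an intolerable safety or environmental risk:
    # RTF is not an option, so something must be done.
    if intolerable_risk:
        return Outcome.ACTION_REQUIRED
    # Only economic consequences remain, so the choice is made on cost.
    if redesign_cheaper_than_rtf:
        return Outcome.REDESIGN
    return Outcome.RUN_TO_FAILURE


# Operational consequences, a task that can be applied but is not worth doing,
# and a redesign that costs more than letting the item fail.
result = decide(intolerable_risk=False,
                task_applicable=True,
                task_effective=False,
                redesign_cheaper_than_rtf=False)
print(result.value)  # run to failure
```

Note that RTF is the last return in the function: it is only reached once every earlier question has been answered.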


In Summary


So our choice is either to RTF or to redesign. 

Redesign is the natural choice for many engineers; in fact, fighting the urge to redesign is a very important skill of competent Reliability Engineers.

But in this case redesign can only be justified if it is cheaper over a reasonable period of time (more cost effective overall) than allowing the item to Run to Fail.
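As a rough worked example of that comparison, the sketch below totals the cost of each option over a chosen period. The figures are invented purely for illustration, the function names are hypothetical, and the model is deliberately crude (no discounting, a constant failure rate, one cost per failure); a real analysis would use whatever costing method your organisation accepts.

```python
def rtf_cost(failures_per_year: float, cost_per_failure: float,
             years: float) -> float:
    """Total cost of letting the item fail over the period (assumed model)."""
    return failures_per_year * cost_per_failure * years


def redesign_cost(one_off_cost: float, residual_failures_per_year: float,
                  cost_per_failure: float, years: float) -> float:
    """One-off redesign cost plus the cost of any failures that remain."""
    return one_off_cost + residual_failures_per_year * cost_per_failure * years


def redesign_is_justified(one_off_cost: float,
                          failures_per_year: float,
                          residual_failures_per_year: float,
                          cost_per_failure: float,
                          years: float) -> bool:
    """Redesign only wins if it is cheaper overall than run-to-failure."""
    return (redesign_cost(one_off_cost, residual_failures_per_year,
                          cost_per_failure, years)
            < rtf_cost(failures_per_year, cost_per_failure, years))


# Invented figures: one failure every two years at 4,000 each, over 10 years,
# against a 30,000 redesign that removes the failure mode entirely.
print(rtf_cost(0.5, 4000, 10))                        # 20000.0
print(redesign_cost(30000, 0.0, 4000, 10))            # 30000.0
print(redesign_is_justified(30000, 0.5, 0.0, 4000, 10))  # False
```

With these numbers the redesign costs more over ten years than the failures it prevents, so RTF is the defensible choice.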

And that's it. RTF is actually not an easy or an automatic decision; it is only applicable when the issues of cost and risk have been comprehensively dealt with.

Good luck!