
Wednesday, 9 March 2011

An Avoidable Disaster - The Buncefield Final Report

In February of this year we saw the publication of the final report into the root causes of the Buncefield explosion.

For those unfamiliar with the issue, Buncefield is the location of a storage facility owned by Total UK Limited (60%) and Texaco (40%). News reports described the incident as the biggest of its kind in peacetime Europe.

Combined with the five-day fire it started, this was Britain's costliest industrial incident at around US$1.6 billion. Fortunately, there were no fatalities at all.

The report is as thorough as you would expect, and quickly gets around to blaming corporate complacency and failures in the process safety management systems that were in place. All valid and expected criticisms.

However, when reviewed with an eye to the principles of Reliability-centered Maintenance, we see the all-too-familiar signs of a failure to apply rigorous maintenance management processes.

Thursday, 2 September 2010

The primary responsibility of the maintenance manager

Here's a hint. It has nothing to do with efficiency, ERP systems or reliability information.

The role of the maintenance manager has been hijacked by software vendors intent on redefining it in whatever terms best serve their products.

And that is fair, I suppose. Everyone has the right to take care of their own commercial risks, of course.

But there is a need to think a little deeper than the standard statements revolving around "get the job done", "collect asset data" and the rest of the corporate-speak terms you hear. (Don't you hate the word "strategic"? Seems to mean nothing...)

The primary role of the maintenance manager, in my humble view, is the management of failure.

Not the response to it, although that is important, and not the faithful recording of it; but the management of failure...

This means understanding the likely potential failure modes and putting in place strategies for their management.

Often this pushes people back to rambling on and on about data again, and there is some merit to that argument. But there is a need here to recall what has been termed the Resnikoff conundrum...


This basically states that for historical analysis of failure to be accurate you need a lot of failure data, and to get a lot of failure data you need - failures! 


And failures cost money, they kill or harm people, and they impact the environmental integrity of the assets being managed.

Not a very ethical way to do it, is it?

As engineers we would all love to have oodles of failure data to make decisions with. Yet the reality is that the vast majority of decisions need to be made in the absence of data.
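To make the conundrum concrete, here is a minimal sketch of my own (not from any standard or from the post itself; it assumes an exponential failure model and uses scipy) showing just how wide the uncertainty on MTBF stays until you have accumulated a painful number of failures.

```python
# A minimal sketch of why the Resnikoff conundrum bites: assuming an
# exponential failure model, the confidence interval on MTBF stays very
# wide until you have accumulated many failures -- failures you do not
# ethically want to have.
from scipy.stats import chi2

def mtbf_interval(total_time_hours, n_failures, confidence=0.90):
    """Two-sided confidence interval on MTBF for a failure-truncated test,
    assuming exponentially distributed times between failures."""
    alpha = 1.0 - confidence
    point = total_time_hours / n_failures
    lower = 2.0 * total_time_hours / chi2.ppf(1.0 - alpha / 2.0, 2 * n_failures)
    upper = 2.0 * total_time_hours / chi2.ppf(alpha / 2.0, 2 * n_failures)
    return point, lower, upper

# Same observed failure rate (one failure per 5,000 hours), different sample sizes.
for failures in (3, 10, 30):
    point, lo, hi = mtbf_interval(total_time_hours=5000 * failures, n_failures=failures)
    print(f"{failures:>2} failures: MTBF ~{point:,.0f} h, 90% CI [{lo:,.0f}, {hi:,.0f}] h")
```

With three failures the interval spans nearly an order of magnitude; only after dozens of failures does it tighten enough to lean on, which is exactly the point.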

This, then, is the primary role of the maintenance manager: building the failure management policies which can then be translated into maintenance processes, efficiency improvements, and data capture via proactive action.

Not ERP, not planning and scheduling, and not data management techniques...

Saturday, 4 July 2009

When is criticality analysis useful?


When used correctly, criticality analysis can provide companies with a very powerful tool for ranking their assets, prioritizing their workloads, and managing their capital spending.

Unfortunately, in their drive to achieve these sorts of results, many practitioners have regularly misapplied criticality analysis. In fact, one could say there is a cult of criticality out there, trying (incorrectly) to use some form of matrix approach to solve every part of their maintenance problems.

In some cases the results are relatively harmless, and the only negative impact is a tremendous waste of time. On other occasions, however, misapplication of criticality analysis can produce results that are counterproductive, dangerous, and that give asset owners a false sense of security.

I can't tackle all of the reasons why criticality analysis can lead to these sorts of problems here; that would take a full chapter of a book. But there are some clear guides that may help avoid them in the future.

1. Always and only at the level of the failure mode. 

It is not uncommon to see "practitioners" applying criticality analysis at the level of the equipment, the assembly, or even at the level of the "principal functions". (Whatever they are.)

This practice is not only uninformed, it is extremely dangerous. 

You cannot know the relative importance of an asset unless you know what happens when it fails. 

This means understanding all of the functions, all of the functional failures and failure modes, and all of their consequences. 

Any criticality analysis that is done without going to this level is destined to produce results that are lightweight, inaccurate, and potentially misleading.
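A hypothetical sketch of what this looks like in practice (the pump, its failure modes and the scores are all invented for illustration): score each failure mode on its own, and notice how a single asset-level number hides the one mode you actually care about.

```python
# Illustrative only: one pump can carry both a trivial failure mode and a
# severe one, so a single asset-level criticality score hides exactly the
# information you need.
from dataclasses import dataclass

@dataclass
class FailureMode:
    description: str
    consequence: str      # "safety", "environmental", "operational", "hidden"
    likelihood: int       # 1 (rare) .. 5 (frequent)
    severity: int         # 1 (minor) .. 5 (catastrophic)

    @property
    def criticality(self) -> int:
        return self.likelihood * self.severity

pump_failure_modes = [
    FailureMode("Mechanical seal leaks to atmosphere", "safety", 2, 5),
    FailureMode("Impeller wears, flow drops 10%", "operational", 4, 2),
    FailureMode("Coupling guard bolt loosens", "operational", 3, 1),
]

# Ranking at the failure-mode level keeps the safety-critical mode visible...
for fm in sorted(pump_failure_modes, key=lambda fm: fm.criticality, reverse=True):
    print(f"{fm.criticality:>2}  {fm.consequence:<13} {fm.description}")

# ...whereas a single "asset criticality" (here an average) blurs it away.
asset_score = sum(fm.criticality for fm in pump_failure_modes) / len(pump_failure_modes)
print(f"Asset-level average score: {asset_score:.1f}  (the seal leak is no longer visible)")
```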

Some great examples of criticality analysis at work:

1. Prioritization of corrective work orders (Works arising from..)

2. The criticality matrix in RBI (risk-based inspection), which is always at the failure mode level.

3. Criticality analyses prior to performing a Safety Instrumented Systems project. This is relatively easy to do: most safety instrumented systems have only one function, so the failure modes are relatively straightforward.

2. Never sum the answers

Comparing operational risks to safety risks is always where this sort of thinking comes unstuck. There seems to be a belief that we always go for the next highest criticality action or activity, when this is actually not true. 

It is also impossible to produce anything (and I have seen a heck of a lot of these now) that truly gives you the capability to compare operational / economic and safety / environmental risks. 

The tactic that is often used (erroneously) is to quantify the scores in every area of criticality, then sum up all the criticality scores so that we can choose the highest, the next highest and so on. Sounds logical, right? In fact, it has always been a very intoxicating argument.

But it is wrong... The result is often that low safety risks get treated before high safety risks because they also carry high operational costs, which catapults them to the front of the line.

The result? High safety risks being left unmanaged.

The alternative...

a) Score only the highest-order consequence, the first one you come to.

As with an RCM analysis, if you decide that the failure mode carries an intolerable level of risk of a safety event, then that is how it needs to be managed. Its other consequences, in environment or operations, do not matter. Safety wins, every time.

b) Treat each failure mode according to its consequences.

So what do I do? I have an intolerable risk of a safety incident, and a failure mode with $10,000,000 attached to its failure. Which do I manage first?

Always the intolerable safety items. Then the intolerable environmental integrity elements. No need to debate, compare or work through a cost/benefit calculation.

Safety wins, get it to the tolerable levels. Then environment, get it to a tolerable level also. Then deal with the economic issues. Do not over complicate things. 

Even the HSE out of Great Britain has come out against this practice. 
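Here is a rough illustration of the difference (the failure modes and scores are invented for the example): summing across consequence categories lets an operational nuisance leapfrog an intolerable safety risk, while a category-first ordering keeps safety at the top.

```python
# A hypothetical illustration of why summing criticality scores across
# consequence categories goes wrong, versus dealing with each category in
# turn: safety first, then environment, then economics.
failure_modes = [
    {"name": "Relief valve stuck shut",  "safety": 5, "environment": 1, "operational": 1},
    {"name": "Export pump bearing wear", "safety": 2, "environment": 1, "operational": 5},
]

# Misleading: sum across categories and work the biggest total first.
# The bearing wear (total 8) jumps ahead of the stuck relief valve (total 7).
by_total = sorted(
    failure_modes,
    key=lambda fm: fm["safety"] + fm["environment"] + fm["operational"],
    reverse=True,
)
print("Summed scoring order: ", [fm["name"] for fm in by_total])

# Safer: compare within a consequence category only, taking safety first,
# then environment, then operational cost. The relief valve returns to the top.
by_category = sorted(
    failure_modes,
    key=lambda fm: (fm["safety"], fm["environment"], fm["operational"]),
    reverse=True,
)
print("Category-first order: ", [fm["name"] for fm in by_category])
```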

3. Never as a filter!!! (Ever)

I have seen this applied two or three times now. Once was in the UK infrastructure industry, a second time in the North American electricity industry, and a third was an application of software in the mining industry.

The thinking goes something like this.....

Now that I have all my strategies (from RCM), all my functional tests (from SIS) and all my replacement options (from, say, availability modelling), I now want to reduce all of the activities to only those that are critical and require our further attention.

This is idiot engineering at its best. Don't fall for it.

The methodologies and approaches explained above will, for the assets they are working on, produce a safe minimum level of maintenance interventions. There is no further room for another layer of "optimization".

These types of approaches are usually developed and applied by people with only a scant understanding of what asset management is about, and they are fundamentally dangerous. In fact, they are more likely to cause safety-related incidents than an approach that avoids this foolish application of criticality analysis.

4. Prioritize wherever you can.

I have ranted on this many times. But essentially, it is unwise to use criticality analysis to determine which assets should be analysed or which capital should be spent. Wherever you can, it is far, far better to use prioritization methods such as bad actor analysis and AHP. (Which is fantastic, by the way.)
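For what it's worth, here is a minimal sketch of how an AHP priority vector can be calculated, using the common geometric-mean approximation of the principal eigenvector; the criterion, the judgements and the candidate assets are invented for the example.

```python
# A minimal sketch of AHP (Analytic Hierarchy Process) prioritization using
# the geometric-mean approximation of the principal eigenvector.
import numpy as np

# Pairwise comparison of three candidate assets against a single criterion
# (say, "business impact of recurring failures") on Saaty's 1-9 scale:
# entry [i, j] = how much more important asset i is than asset j.
comparisons = np.array([
    [1.0, 3.0, 5.0],   # crusher vs (crusher, conveyor, pump)
    [1/3, 1.0, 2.0],   # conveyor
    [1/5, 1/2, 1.0],   # pump
])

# Geometric mean of each row, normalized, approximates the priority vector.
row_gm = comparisons.prod(axis=1) ** (1.0 / comparisons.shape[1])
priorities = row_gm / row_gm.sum()

for name, weight in zip(["crusher", "conveyor", "pump"], priorities):
    print(f"{name:<9} priority {weight:.2f}")
```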

Good luck.

Sunday, 21 June 2009

Enterprise Risk Management

John Moubray's article speaking out against streamlined RCM approaches was a watershed moment. For the first time, Moubray spoke about the coming age of accountability for those charged with managing assets, and the need for defensibility in decision-making.

He was slammed by many at the time as a scaremonger; however, the reality has proved to be far more frightening than anything any of us could have imagined.

The Hatfield rail disaster, the BP Texas City refinery explosion, the Buncefield explosion and the passing of Bill C-45 in Canada have proved, beyond a shadow of a doubt, that defensibility in decision-making should be firmly on the mind of any asset manager, or any company charged with managing physical assets.

That's the frightening angle, but there is another one, far more alarming for corporate executives, particularly in light of the recent financial meltdowns in the UK and USA.

For example, in Australia (and the USA and UK environments are similar) the Australian Stock Exchange has released guidance for companies known as Principle 7. The thrust of this guidance note is that well-governed and well-managed companies need to have a functioning process in place for managing risk.

The guidance note specifically states that:
"Recommendation 7.1: ‘Companies should establish policies for the oversight and management of material business risks and disclose a summary of those policies’."

and...
"Recommendation 7.2: ‘The board should require management to design and implement the riskmanagement and internal control system to manage the company’s material business risks and report to it on whether those risks are being managed effectively. The board should disclose that management has reported to it as to the effectiveness of the company’s management of its material business risks’."

The implications of this are dramatic, and drive home two points related to risk management. First, that the often tactical approaches we generally adopt are probably not enough to fulfill these requirements. And second, that there is probably a need for a larger registration and management process for defining, reporting on and monitoring the material risks of a business.

In fact, their guidance notes point out that compliance with this area of corporate governance must be provable.

If you haven't already done so, I strongly recommend that you check out Australian Standard AS/NZS 4360. This is, as far as I understand it, the only existing globally recognized standard on how to create and implement a corporate process for the management of risk.

As Moubray pointed out, it is wise to be able to make decisions in an environment that can be defended if required. And it will be far easier to defend decisions made according to a recognised global standard than to explain the company's reasons for choosing not to use such a standard.

I am working with a range of organizations today to try to implement corporate risk-management processes and systems. And it is challenging, to say the least.

One of the things that is proving significantly challenging is moving away from the childlike dependence on risk matrices that many in our industry seem to have, and moving organizations toward quantifiable risk profiles.
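As a rough illustration of what a quantifiable risk profile can look like (the events and numbers below are entirely invented), carrying estimated frequency and consequence through to an annualized cost of risk puts very different events into the same units, something a 5x5 matrix can never do.

```python
# A hypothetical sketch of a quantified risk profile: estimated event
# frequency and consequence carried through to an annualized cost of risk.
events = [
    {"event": "Tank overfill with loss of containment", "per_year": 0.002, "cost": 100_000_000},
    {"event": "Unplanned compressor trip",              "per_year": 4.0,   "cost": 80_000},
    {"event": "Minor flange leak",                      "per_year": 12.0,  "cost": 5_000},
]

# Rank by annualized risk (frequency x consequence), in dollars per year.
for e in sorted(events, key=lambda e: e["per_year"] * e["cost"], reverse=True):
    annualized = e["per_year"] * e["cost"]
    print(f"${annualized:>10,.0f} per year  {e['event']}")

# A 5x5 matrix would drop these events into cells that cannot be compared
# directly; the quantified profile puts them in the same units, which is what
# a board reporting on "material business risks" actually needs.
```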

I don't think now is the time for dumbing down the discipline; now is the time for driving understanding deeper and higher into the corporate hierarchy.

Your thoughts are welcomed.