Mean Time Between Failure or MTBF is one of the most widely used metrics in physical asset management. Generally, companies use it as a guide to the performance of their physical assets, helping them to identify assets or processes that are causing lost revenue or cost related issues.
However, although widely applied, MTBF is still the subject of some confusion. Moreover, MTBF is useful for a range of different purposes, giving organizations greater ability to increase the net present value of their physical asset base.
When companies first look at implementing MTBF, they tend to ask three fundamental questions:
- what can MTBF tell us about our assets,
- at what levels can it be applied, and
- how can MTBF be used to add value to our reliability initiatives?
What can MTBF tell us?
The standard use for MTBF in industry is to tell us the performance of the primary function of an asset or system.
For example, a pumping system consists of a duty/standby pump arrangement, a pressure relief valve, piping, and the tank and associated level switches.
The primary function for this system is to pump water to tank B at a rate of between 900 l/minute and 1000 l/minute. In this case, a failure occurs when the pump system is unable to pump water at the required rate for whatever reason.
Here, we can calculate the MTBF as follows:
MTBF = Total Time Required / Number of Failures
So if the total time that we required the system to deliver this function was (say) 5 years, and we had 4 failures in that time, the average time between failures would be 5/4 = 1.25 years.
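If this were the mean time between failures, then the failure rate for one year would be 1/MTBF or 1/1.25, which is 0.8 failures per year. This is commonly read as roughly an 80% likelihood of experiencing a failure of the primary function in one year, although strictly it is an expected number of failures rather than a probability.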
If we then wanted to express this in months, we would first convert the MTBF figure to months (1.25 years = 15 months) and then determine the likelihood of a failure occurring in one month: 1/15, or about 0.067.
This means there is roughly a 6.7% likelihood of experiencing a failure resulting in the loss of the primary function in any given month. We could do the same for a week, a day or any other given period.
The above example shows us that initial uses of MTBF can provide us with the average time between failures for a given time period, and that this can then be manipulated to give us a failure rate for any specified period of time.
Thus, for one measurement of MTBF we are able to calculate the following information:
- MTBF of the Primary Function = 1.25 years
- Likelihood of a failure in one year = 1/1.25 years (80% or 8 x 10^-1)
- Likelihood of a failure in one month = 1/15 months (6.7% or 6.7 x 10^-2)
- Likelihood of a failure in one day = 1/456.25 days (0.22% or 2.2 x 10^-3)
- Likelihood of a failure in one hour = 1/10950 hours (0.009% or 9 x 10^-5)
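The arithmetic above can be sketched in a few lines of Python. This is an illustrative sketch rather than a prescribed method: the function names are hypothetical, a year is taken as 12 months, 365 days or 8,760 hours to match the figures in the list, and the last function adds an assumption the article does not make (a constant failure rate, i.e. an exponential model) to show how a rate can be turned into a true probability.

```python
import math

def mtbf(total_time_required, failures):
    """MTBF in the same unit as total_time_required."""
    if failures <= 0:
        raise ValueError("at least one failure is needed to compute MTBF")
    return total_time_required / failures

def failure_rate(mtbf_value):
    """Expected failures per unit of time: the reciprocal of MTBF."""
    return 1.0 / mtbf_value

def prob_at_least_one_failure(rate, duration=1.0):
    """Probability of one or more failures in `duration`.

    Assumes a constant failure rate (exponential model); this is an
    added assumption, not something stated in the article.
    """
    return 1.0 - math.exp(-rate * duration)

mtbf_years = mtbf(5, 4)  # 5 years of required function, 4 failures -> 1.25

# Periods per year, chosen to match the article (1.25 years = 456.25 days).
PERIODS_PER_YEAR = {"year": 1, "month": 12, "day": 365, "hour": 365 * 24}

for unit, periods in PERIODS_PER_YEAR.items():
    mtbf_in_unit = mtbf_years * periods
    rate = failure_rate(mtbf_in_unit)
    prob = prob_at_least_one_failure(rate)
    print(f"per {unit}: MTBF = {mtbf_in_unit:g}, "
          f"rate = {rate:.2e}, P(at least one failure) = {prob:.2e}")
```

For short periods the rate and the probability are almost identical. For the one-year period they differ noticeably (a rate of 0.8 against a probability of about 0.55 under the exponential assumption), which is worth keeping in mind when quoting the annual figure as a likelihood.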
At all times the formula takes into account the total time of the function, not of the asset itself. This means that regardless of the number or type of assets in the system, the calculation always uses the total time required of the function, or 5 years in this example.
At what level can we apply MTBF?
Like many other metrics in physical asset management, MTBF is applicable at any level throughout the asset base.
However, for performance measurement there are two rules for its application:
- it is always used to measure the function of the asset where it is being applied, and
- it always uses the total time required of the function of the level where it is applied.
For instance, in the example given above we determined that the MTBF for the pumping system was 1.25 years, and we were then able to derive failure rates for various other periods.
In addition, we can apply this to the assets in the system, as demonstrated in Table 1.
Table 1 contains some information that should immediately provoke some questions. For example, we have counted four failures in our system level MTBF, yet the table contains 13 failures (not counting the system failures).
To understand this we need to review the functions for each of the components mentioned.
For example, the function of the High-High Level Switch is to trip the pumping system when water reaches the high-high level. If there is a failure preventing this asset from performing its function, it will not prevent the system from pumping water. We have had one failure on the switch that we know about in this period.
Another obvious issue is that we have had seven failures of the Duty Pump. During the same period we have had only two known failures of the Standby Pump, which performs a dormant function.
As this system has redundancy built into it, we can only experience a loss of the primary function if we have a failure of the Duty pump and the Standby pump at the same time.
The four failures causing the loss of function at the system level were:
- One multiple failure of the duty and standby pump
- One failure of the High Level switch, meaning the level reached the High-High level once during the 5-year period
- One failure of the Low Level switch, resulting in the Low-Low level tripping the downstream process
- One failure of the piping causing downtime
Figure 2 - MTBF at Different Levels
All the other failures mentioned were either hidden from the operations team until revealed by inspection, or their function was protected by other assets (as in the case of the failures on the Duty Pump).
As shown in Figure 2, MTBF is useful at any level throughout an asset base. However, it must be applied to the functions of the assets, using the total time required of each function, at each level of performance measurement.
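The two rules can be illustrated with a short Python sketch that applies the same 5-year required time at both component level and system level. Table 1 is not reproduced here, so the failure counts below are taken or inferred from the text (seven duty pump failures, two standby pump failures, and one each for the remaining items, which adds up to the 13 mentioned). The dictionary keys and the `mtbf` helper are hypothetical names used only for illustration.

```python
TOTAL_TIME_YEARS = 5  # time the function was required, at every level

# Failure counts quoted in or inferred from the text (assumed to match Table 1).
component_failures = {
    "Duty pump": 7,
    "Standby pump": 2,
    "High-high level switch": 1,
    "High level switch": 1,
    "Low level switch": 1,
    "Piping": 1,
}

# Failures that actually caused loss of the primary function.
SYSTEM_FAILURES = 4

def mtbf(total_time, failures):
    """MTBF in the unit of total_time; infinite if no failures were recorded."""
    return total_time / failures if failures else float("inf")

component_mtbf = {
    name: mtbf(TOTAL_TIME_YEARS, count)
    for name, count in component_failures.items()
}
system_mtbf = mtbf(TOTAL_TIME_YEARS, SYSTEM_FAILURES)

for name, value in component_mtbf.items():
    print(f"{name}: MTBF = {value:.2f} years")
print(f"Pumping system (primary function): MTBF = {system_mtbf:.2f} years")
```

The component counts sum to 13 while the system count is 4, because redundancy and protective devices prevent most component failures from becoming a loss of the primary function.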
How can MTBF add value to Reliability Initiatives?
In the hands of a skilled RCM facilitator the measurement and manipulation of MTBF can be used to set the performance expectations of the physical asset base, as well as providing a base for evaluation of strategies, and to indicate the overall performance of assets; not just the performance of their functions.
This helps organizations in the change process because they begin to think about what the assets do, rather than what they are. That is, an appreciation of functional performance as opposed to asset performance.
For example, in the system described in Figure 1 we can break the system down into its functions, and begin to assign performance expectations to each of these.
Function 1 - To pump water to tank B at a rate of between 900 l/minute and 1000 l/minute
Functional Failure 1.A – Does not pump water at all
The water pump in this example provides, say, the cooling water for a petrochemical plant. If the system is unable to pump water, there will be a loss of production. The tank contains enough water to keep the plant running for a minimum period of 2 hours, and a maximum period of 6 hours.
A multiple failure of both pumps would nominally result in a loss of production equal to, say, USD $2,000,000. In this case the asset owner would like to keep the likelihood of this occurring to a reasonably low level, and after some discussion he decides on a level of 1:10,000 years, or an annual rate of 10^-4.
This means managing every failure mode that causes this consequence, an adverse impact on operational capability, to the same level of likelihood.
Function 2 – To trip the pumping system when water reaches the high-high level
Functional Failure 2.A – Does not trip when the water reaches the high-high level.
In the case of the water system, an overflow of the tank would result in water in the surrounding area. While this is a slip hazard for employees sent to correct the issue, the asset owner does not regard it as a serious hazard, nor will it result in any damage to additional equipment.
The failure mode is dormant, meaning it will only have consequences when there is a failure of the high-level switch and the high-high level switch. In this particular case, the asset owner is at ease accepting a higher level of risk of occurrence, say, one in every 100 years, or a likelihood of 10^-2 in any one year.
Function 3 – To alarm when the tank level is at the low-low level
Functional Failure 3.A – Does not alarm when the tank is at the low-low level.
As with the High-High protection, this alarm is only required once there has already been a failure of some sort, in this case a failure of the Low-Level Switch.
If this were to occur, and the tank consequently ran dry, the results would be catastrophic in financial terms. The downstream equipment would run dry, and the plant would be without cooling water, forcing a loss of production estimated at around 3 days or USD $6,000,000 in this case. There would also be damages conservatively estimated at USD $1,500,000 for producing assets.
The asset owner sees this as the worst possible outcome of a failure of this system. As a result, he would like to keep the likelihood of failure at 1:100,000 years, or 10^-5 per year. The resulting performance expectations of failure modes are in Table 2 below.
Table 2 - Functional MTBF
Here we can see the desired failure rates set out in Table 2 for each function, and translated into a performance requirement for each failure mode.
We can also record actual MTBF measures against this to see how effective we have been in managing the failures of this asset to the desired levels of performance. However, this would only be a guide, because the measured MTBF only reflects failures recorded since measurement began. The best use of this approach is to provide valuable input for RCM analysts, as well as for other applications within the reliability field. It would also give asset owners a predetermined risk envelope that they require their assets to work within, increasing their control over asset performance, and hence over corporate profitability.
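As a rough illustration, the target annual likelihoods chosen by the asset owner can be converted into required MTBF values and compared with a measured figure. Table 2 is not reproduced here, so this sketch uses only the three targets quoted in the text; the function names and the measured value in the example are hypothetical.

```python
# Target annual likelihoods quoted in the text for each functional failure.
target_annual_rate = {
    "1.A Does not pump water at all": 1e-4,
    "2.A Does not trip at the high-high level": 1e-2,
    "3.A Does not alarm at the low-low level": 1e-5,
}

def required_mtbf_years(annual_rate):
    """MTBF (in years) implied by a target annual failure rate."""
    if annual_rate <= 0:
        raise ValueError("annual_rate must be positive")
    return 1.0 / annual_rate

def meets_target(measured_mtbf_years, annual_rate):
    """True if the measured MTBF is at least the MTBF the target implies."""
    return measured_mtbf_years >= required_mtbf_years(annual_rate)

for failure, rate in target_annual_rate.items():
    print(f"{failure}: target {rate:.0e} per year, "
          f"required MTBF = {required_mtbf_years(rate):,.0f} years")

# Hypothetical measured value, for illustration only.
measured = 150.0  # years
print(meets_target(measured, target_annual_rate["2.A Does not trip at the high-high level"]))
```

Because targets such as 1:10,000 years are far longer than any realistic measurement period, a comparison like this is only indicative, which is consistent with treating measured MTBF as a guide.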
Summary
MTBF is an exceptionally useful metric in the field of physical asset management and it is possible to apply it at any level throughout the physical asset base.
The principal benefit of wide ranging use of MTBF is that it begins the process of focusing a company on how the assets work to fulfill a function, rather than what those assets actually are. This is one of the fundamental concepts of Reliability-centered Maintenance.
As such, at whatever level it is applied, MTBF measures the function performed by that asset, asset system, or entire process. It is also useful for proactively establishing the performance expectations of the asset base, particularly in the areas of the Efficiency function.