Improving device reliability and redundancy

In a previous article we looked at how to mathematically calculate the reliability of electronic equipment, that is, the probability of it working correctly for a given period of time. Reliability is usually characterised by the Mean Time Between Failures (MTBF) or by its inverse, the failure rate, commonly expressed in Failures In Time (FIT, the number of failures per billion hours of operation). MTBF values for electronic equipment are often in the order of hundreds of thousands of hours.

Source: Ricardo Saiz

The probability of a failure occurring within a time t follows an exponential function, which is very close to a straight line for intervals that are short compared to the MTBF.
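As a minimal sketch of this behaviour, assuming a constant failure rate (the usual exponential model), the following compares the exact probability with its straight-line approximation t/MTBF. The function names and the 200,000-hour MTBF are illustrative choices, not values taken from a specific device.

```python
import math


def failure_probability(t_hours: float, mtbf_hours: float) -> float:
    """Probability of a failure within t_hours, assuming a constant failure rate."""
    return 1.0 - math.exp(-t_hours / mtbf_hours)


def linear_approximation(t_hours: float, mtbf_hours: float) -> float:
    """Straight-line approximation, reasonable only when t_hours << mtbf_hours."""
    return t_hours / mtbf_hours


MTBF_HOURS = 200_000.0  # illustrative value

for t in (24, 720, 8_760, 100_000):
    exact = failure_probability(t, MTBF_HOURS)
    approx = linear_approximation(t, MTBF_HOURS)
    print(f"t = {t:>7} h  exact = {exact:.5f}  linear = {approx:.5f}")
```

For intervals of days or months the two columns are nearly identical; the gap only becomes noticeable once t is a sizeable fraction of the MTBF.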

How do you calculate the MTBF of a device?

Device reliability depends on that of its component parts (solderable electronic components, modules, wiring, etc.). The inverse of the total MTBF is the sum of the inverses of the MTBF of each part, just as with resistors in parallel. In the same way that admittances add up in a parallel circuit, the FIT of a device made up of many components (parallel paths leading to a failure) is the sum of the FIT of each component. This is why it is easier to work with FIT than with MTBF.
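The calculation can be sketched as follows, using the standard definition of FIT as failures per 10^9 hours. The part names and MTBF figures are hypothetical and serve only to illustrate the arithmetic.

```python
HOURS_PER_BILLION = 1e9  # FIT is expressed in failures per 10^9 hours


def mtbf_to_fit(mtbf_hours: float) -> float:
    """Convert an MTBF in hours to a failure rate in FIT."""
    return HOURS_PER_BILLION / mtbf_hours


def fit_to_mtbf(fit: float) -> float:
    """Convert a failure rate in FIT back to an MTBF in hours."""
    return HOURS_PER_BILLION / fit


def system_mtbf(part_mtbfs_hours) -> float:
    """MTBF of a device that fails as soon as any one of its parts fails."""
    total_fit = sum(mtbf_to_fit(m) for m in part_mtbfs_hours)
    return fit_to_mtbf(total_fit)


# Hypothetical parts, for illustration only.
parts = {"power supply": 200_000, "main board": 1_000_000, "fan": 500_000}

for name, mtbf in parts.items():
    print(f"{name}: {mtbf_to_fit(mtbf):.0f} FIT")
print(f"device MTBF: {system_mtbf(parts.values()):.0f} hours")
```

Note that the resulting device MTBF is lower than that of its weakest part, exactly as a parallel resistance is lower than its smallest resistor.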

How to make devices more reliable?

In turn, the FIT of a component is not an immutable value but depends on the environment and (more specifically) on the temperature. Heat is directly related to the failure rate, and indeed to the speed of many physical processes and chemical reactions. Swedish scientist Svante Arrhenius (1859 – 1927) was the first to model this relationship, in 1889, with the equation that bears his name:

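$$k = A\,e^{-E_a/(k_B T)}$$

Here k is the rate of the process, A is a constant specific to that process, E_a is its activation energy, k_B is the Boltzmann constant and T is the absolute temperature.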

According to this formula, when the absolute temperature is close to zero, reactions stop. However, they accelerate significantly with increasing temperature.
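To get a feel for the numbers, here is a rough sketch of the acceleration factor that the Arrhenius equation predicts between two temperatures. The activation energy of 0.7 eV is an assumption chosen purely for illustration; real values depend on the failure mechanism of each component.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def celsius_to_kelvin(temp_c: float) -> float:
    return temp_c + 273.15


def acceleration_factor(ref_temp_c: float, use_temp_c: float,
                        activation_energy_ev: float = 0.7) -> float:
    """Ratio of the Arrhenius rate at use_temp_c to the rate at ref_temp_c.

    The default activation energy is an illustrative assumption.
    """
    t_ref = celsius_to_kelvin(ref_temp_c)
    t_use = celsius_to_kelvin(use_temp_c)
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_ref - 1.0 / t_use)
    return math.exp(exponent)


for temp_c in (25, 40, 60, 80):
    factor = acceleration_factor(40, temp_c)
    print(f"{temp_c} °C: failure rate x{factor:.2f} relative to 40 °C")
```

Under this assumption, a rise of 20 °C multiplies the failure rate several times over, which is why ventilation and placement matter so much.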

High service availability

Our device will become less reliable as temperatures rise, but how can we make it more reliable? We can’t fight the laws of physics, but we can use them to make the best engineering decisions. In addition to heeding the advice found in manuals (“do not cover the ventilation slots” or “install the device far from heat sources”), we can design the system so that the service keeps running even when one of its parts fails. This is known as service availability, and it is ultimately what matters.

Redundancy of devices

In a router or switch, we can duplicate the power supply (one of the elements with the highest failure rate). The probability of a single power supply unit (PSU) failing within an interval t is:

$$F(t) = 1 - e^{-t/\mathrm{MTBF}}$$

This formula equals 0 when t = 0, but its derivative at that point does not:

$$F'(0) = \frac{1}{\mathrm{MTBF}}$$

The device will stop working only if both PSUs fail. Assuming the two units fail independently, the probability of this happening is the previous formula squared:

$$F_2(t) = \left(1 - e^{-t/\mathrm{MTBF}}\right)^2$$

As in the previous case, this formula equals 0 at t = 0. However, its derivative is also 0 at t = 0.
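As a brief check of that last statement, under the same constant-failure-rate assumption and writing F_2(t) for the probability that both units have failed by time t (a symbol introduced here for illustration), the chain rule gives:

$$F_2(t) = \left(1 - e^{-t/\mathrm{MTBF}}\right)^2 \quad\Rightarrow\quad F_2'(t) = \frac{2}{\mathrm{MTBF}}\, e^{-t/\mathrm{MTBF}} \left(1 - e^{-t/\mathrm{MTBF}}\right), \qquad F_2'(0) = 0$$

For intervals that are short compared to the MTBF, this means the failure probability grows quadratically rather than linearly:

$$F_2(t) \approx \left(\frac{t}{\mathrm{MTBF}}\right)^2 \quad \text{for } t \ll \mathrm{MTBF}$$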

Source: Ricardo Saiz

With two units working simultaneously (only one of which is essential), the failure rate draws a very different curve. This is especially true for shorter periods (when compared to the MTBF). Let’s look at a simple example.

We have a power supply with an MTBF of 200,000 hours. What are the chances of it failing within a year? A year has 8,760 hours, so:

$$1 - e^{-8760/200000} \approx 0.043$$

200,000 hours may seem like a long time, but there is a 4.3% chance of it breaking down in the first year of use. If we have a pool of 23 devices, we will suffer an average of one breakdown per year (with the ensuing service outage).

If we set up two PSUs working redundantly, the probability of a critical failure over a year is:

$$\left(1 - e^{-8760/200000}\right)^2 \approx 0.0018$$

It only amounts to 0.18%.
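These figures can be reproduced with a few lines of code. The sketch below assumes a constant failure rate, independent failures of the two units and no repair during the year; the function names are illustrative.

```python
import math

HOURS_PER_YEAR = 8_760
PSU_MTBF_HOURS = 200_000
POOL_SIZE = 23


def failure_probability(t_hours: float, mtbf_hours: float) -> float:
    """Probability that a single unit fails within t_hours."""
    return 1.0 - math.exp(-t_hours / mtbf_hours)


def redundant_failure_probability(t_hours: float, mtbf_hours: float,
                                  units: int = 2) -> float:
    """Probability that all redundant units fail within t_hours (no repair)."""
    return failure_probability(t_hours, mtbf_hours) ** units


single = failure_probability(HOURS_PER_YEAR, PSU_MTBF_HOURS)
pair = redundant_failure_probability(HOURS_PER_YEAR, PSU_MTBF_HOURS)
pool_failures = POOL_SIZE * HOURS_PER_YEAR / PSU_MTBF_HOURS

print(f"single PSU, one year:    {single:.2%}")
print(f"redundant pair, one year: {pair:.2%}")
print(f"expected failures per year in a pool of {POOL_SIZE}: {pool_failures:.2f}")
```

Duplicating the PSU cuts the yearly probability of a critical failure by a factor of more than twenty.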

If we also connect each PSU to a separate electrical circuit (e.g., an uninterruptible power supply or UPS), we gain another advantage: a power cut is much less likely to leave us temporarily without service.

If our equipment sends an alert to the network administrator when it detects a failure, the faulty unit can be replaced within a short period of time, ideally before a second, critical failure occurs.

Combining redundancy with diligent fault detection and remediation makes service availability extremely high. This is because, after a failure occurs, suffering another breakdown during the time it takes to repair the device (presumably hours or a few days) is unlikely. We can understand this graphically, since we are moving in the grey line’s flat area (i.e., where the derivative is almost zero).
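A rough estimate illustrates the effect. The sketch below assumes a constant failure rate, independent failures and a hypothetical repair time of 48 hours. It approximates the yearly outage probability as the expected number of first failures among the two PSUs multiplied by the probability that the surviving unit fails before the repair is completed. This is a simplification that only holds while the resulting probabilities are small.

```python
import math

HOURS_PER_YEAR = 8_760
PSU_MTBF_HOURS = 200_000
REPAIR_HOURS = 48  # hypothetical time needed to replace a failed PSU


def failure_probability(t_hours: float, mtbf_hours: float) -> float:
    """Probability that a single unit fails within t_hours."""
    return 1.0 - math.exp(-t_hours / mtbf_hours)


def outage_probability_with_repair(mtbf_hours: float, repair_hours: float,
                                   mission_hours: float) -> float:
    """Approximate probability of losing both PSUs when failed units are replaced."""
    expected_first_failures = 2.0 * mission_hours / mtbf_hours
    second_failure_during_repair = failure_probability(repair_hours, mtbf_hours)
    return expected_first_failures * second_failure_during_repair


without_repair = failure_probability(HOURS_PER_YEAR, PSU_MTBF_HOURS) ** 2
with_repair = outage_probability_with_repair(PSU_MTBF_HOURS, REPAIR_HOURS, HOURS_PER_YEAR)

print(f"redundant, no repair:    {without_repair:.4%}")
print(f"redundant, 48 h repair:  {with_repair:.4%}")
```

Under these assumptions, prompt replacement lowers the yearly outage probability by roughly two further orders of magnitude compared with redundancy alone.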

Source: Ricardo Saiz

MTBF findings and more

Teldat devices (such as the new generation of switches, some of which are equipped with redundant power supplies to meet the most demanding requirements) offer MTBF figures ranging between 500,000 and one million hours. We also carry out a rigorous Reliability, Availability, Maintainability and Safety (RAMS) analysis for equipment intended for special scenarios, such as railways. Using Fault Tree Analysis (FTA), we can identify potential failures and design alternative operating modes in the event of simple failures. As a result, our service availability figures are close to 100%.

Tags: Router technology Telecommunication technology


March 05, 2024

Ricardo Saiz

Ricardo Saiz, telecommunications engineer, is part of Teldat’s R&D department. He specializes in hardware, and is responsible for electronic design and equipment certification.
