Key takeaways
- FMEA is a proactive method: list how a design, process or machine could fail, then rank each failure mode so you fix the riskiest first.
- Each failure mode gets three ratings from 1 to 10 (Severity, Occurrence and Detection), and their product is the Risk Priority Number (RPN = S × O × D).
- RPN is a relative ranking, not an absolute score. Use it to sort work, but always act on high-severity items even when their RPN looks modest.
- Most of the easy wins come from improving Detection and Occurrence; Severity rarely moves without a design change.
- An FMEA is only useful as a living document that drives actions and gets re-scored, not a launch-day form filed and forgotten.
Failure Mode and Effects Analysis (FMEA) is a structured way to ask, before anything breaks: how could this fail, how bad would it be, and would we even catch it? You list every way a product, process or machine can fail, rate the risk of each, and use that ranking to spend your limited maintenance and engineering effort where it actually reduces loss. It started in aerospace and defence, spread through automotive as a supplier requirement, and is now a standard tool wherever unplanned failures are expensive. This guide covers the three ratings, the RPN formula with a worked example, how to run an FMEA on the plant floor, and the ways FMEA quietly turns into paperwork.
What FMEA actually is
An FMEA is a table. Each row is one failure mode: a specific way a function can go wrong. For a fill valve on a beverage line, the function is "dose the correct volume", and the failure modes include "fails to open", "fails to seal" and "doses the wrong amount". For each failure mode you capture its effect on the process and the customer, its likely causes, and the controls already in place to prevent or detect it. Then you rate the risk and decide what to do about the ones that matter.
The two common flavours are the design FMEA (DFMEA), which looks at the product design, and the process FMEA (PFMEA), which looks at the manufacturing process that makes it. Plant and maintenance teams mostly live in PFMEA territory, and often use a machine or equipment FMEA that focuses on one critical asset and the failure modes that stop it.
The three ratings and the RPN formula
Each failure mode is scored on three scales, each from 1 to 10:
- Severity (S): how serious the effect is if the failure reaches the customer or the next process. 1 is trivial; 10 is a safety or regulatory failure.
- Occurrence (O): how often the cause is likely to happen. 1 is remote; 10 is almost inevitable.
- Detection (D): how likely your current controls are to catch the failure before it escapes. Here the scale runs the other way: 1 means detection is almost certain, 10 means you would not catch it at all.
The Risk Priority Number multiplies the three:
RPN = Severity × Occurrence × Detection
RPN ranges from 1 to 1000 and is used to sort the failure modes so the team tackles the biggest risks first. The critical caveat: RPN is a relative ranking within one analysis, not a universal score. A value of 150 on one line tells you nothing about a 150 on another. A rare but catastrophic failure can also carry a low Occurrence rating, and therefore a modest RPN, and still be the most important thing on the sheet. For that reason most modern practice (including the AIAG-VDA handbook) treats high Severity as an action trigger in its own right, and increasingly replaces a fixed RPN cutoff with an action-priority table.
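As a rough illustration of the formula and its range, here is a minimal Python sketch. The function name `rpn` and the integer-only check are assumptions made for this example, not part of any FMEA standard.

```python
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Return the Risk Priority Number for one failure mode."""
    ratings = (
        ("severity", severity),
        ("occurrence", occurrence),
        ("detection", detection),
    )
    for name, value in ratings:
        # Each rating must be a whole number on the 1-10 scale.
        if not isinstance(value, int) or not 1 <= value <= 10:
            raise ValueError(f"{name} must be an integer from 1 to 10, got {value!r}")
    return severity * occurrence * detection


lowest = rpn(1, 1, 1)      # best rating on every scale
highest = rpn(10, 10, 10)  # worst rating on every scale
print(lowest, highest)     # prints: 1 1000
```

Rejecting out-of-range ratings up front keeps a typo such as 0 or 11 from silently distorting the ranking.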
A worked example
Take the fill valve above, on a food and beverage line. The failure mode is "valve fails to seal", which causes out-of-spec fills (overfill or underfill).
- Severity = 8: out-of-spec fills are rejected and can trigger a customer complaint, though this is not a safety issue.
- Occurrence = 6: the valve drifts a few times a month as the seat wears.
- Detection = 4: the downstream checkweigher catches most bad fills, but some drift slips through between checks.
That gives an initial RPN = 8 × 6 × 4 = 192. The team adds two controls: a scheduled seat-replacement PM to cut how often it drifts, and an in-line fill-volume sensor that flags the problem on the valve itself rather than downstream. They re-score: Occurrence falls to 3 and Detection falls to 2, while Severity stays at 8 because the consequence of a bad fill has not changed. The new RPN = 8 × 3 × 2 = 48, a 75% reduction, achieved without touching the one rating (Severity) that a design change would be needed to move. That pattern, chipping away at Occurrence and Detection, is where most FMEA value comes from.
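The arithmetic in this example can be checked with a few lines of Python. This is only a sketch using the scores above; the variable names are made up for illustration.

```python
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """RPN = Severity x Occurrence x Detection."""
    return severity * occurrence * detection


# Scores from the fill-valve example.
before = rpn(severity=8, occurrence=6, detection=4)
after = rpn(severity=8, occurrence=3, detection=2)

# Relative reduction in RPN after the two new controls.
reduction_pct = (before - after) / before * 100

print(before, after, reduction_pct)  # prints: 192 48 75.0
```

Severity is 8 in both calls, so the whole reduction comes from Occurrence and Detection.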
How to run one without drowning in rows
A workable PFMEA follows a simple sequence, done with the people who run and fix the equipment, not alone at a desk:
1. Scope it tightly: one line, one machine or one process step. A boil-the-ocean FMEA never gets finished.
2. List functions, then failure modes: for each function, ask how it can fail. Keep failure modes specific and observable.
3. Add effects and causes: what happens downstream, and what mechanism drives the failure.
4. Score S, O and D using agreed rating tables so scores are consistent between people and sessions.
5. Rank and act: sort by RPN, pull out high-severity items, and assign owners and dates to real countermeasures.
6. Re-score after actions close so the sheet reflects the current, improved risk and shows whether the action worked.
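Steps 5 and 6 could be sketched in Python as follows. Everything here is illustrative: the `FailureMode` class, the `HIGH_SEVERITY` cutoff of 9, and every row except the seal failure (which reuses the worked-example scores) are assumptions, not values from a real rating table.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FailureMode:
    """One row of the FMEA table (field names are illustrative)."""
    name: str
    severity: int
    occurrence: int
    detection: int

    @property
    def rpn(self) -> int:
        return self.severity * self.occurrence * self.detection


# Assumed cutoff for "high severity"; take the real one from your rating table.
HIGH_SEVERITY = 9


def rank(rows):
    """Step 5: sort with the highest RPN first."""
    return sorted(rows, key=lambda row: row.rpn, reverse=True)


def high_severity(rows):
    """Step 5: rows that need action whatever their RPN."""
    return [row for row in rows if row.severity >= HIGH_SEVERITY]


rows = [
    FailureMode("valve fails to seal", 8, 6, 4),                 # worked-example scores
    FailureMode("valve fails to open", 5, 3, 2),                 # made-up scores
    FailureMode("hypothetical safety-critical mode", 10, 1, 5),  # made-up row
]

ranked = rank(rows)
must_act = high_severity(rows)

# Step 6: re-score the seal failure once its actions have closed.
rescored = replace(rows[0], occurrence=3, detection=2)
reranked = rank([rescored] + rows[1:])

for row in reranked:
    print(f"{row.rpn:>4}  S={row.severity:<2}  {row.name}")
```

In this made-up data the safety-critical row has an RPN of only 50 and sits well below the seal failure at 192, yet the severity filter still picks it out. After re-scoring, the seal failure drops to 48 and the order changes, which is the point of step 6.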
Where FMEA goes wrong
The classic failure is the FMEA that exists only to satisfy an auditor: filled in once at launch, never re-scored, and never linked to a single completed action. Other common ones are scores argued up or down to hit or dodge a magic threshold, a scope so wide that the team burns out before reaching the important rows, and Detection ratings that flatter the current controls. That last one matters most: a hand-written downtime log or an operator glancing at a screen is a far weaker detection control than teams give it credit for, so the real risk sits higher than the sheet says.
This is where honest data changes the numbers. Fabrico reads stops straight from the PLC and shows the true cause of a micro-stop on video, which supports a more defensible Detection score than a manual log and gives you an Occurrence rate measured from the machine rather than from memory. For food and beverage, automotive and packaging lines, where short, frequent stops dominate the losses, that closed loop from a detected failure to an auto-routed work order is exactly the control an FMEA is trying to credit. Fabrico is a partner we recommend; the tools here are free regardless.
Turn failure rates into reliability and cost with the free calculators.
FAQ
What is a good RPN threshold to act on?
There is no universal threshold. RPN is a relative ranking, not an absolute score, so a value of 150 on one line is not comparable to 150 on another. Act on your highest RPNs first, and treat any high-severity failure mode as an action item regardless of its RPN, because a rare, hard-to-detect catastrophe can carry a modest RPN while still being the most important thing on the sheet. Newer AIAG-VDA practice drops the fixed threshold entirely in favour of an action-priority table.
What is the difference between FMEA and root cause analysis?
FMEA is proactive: you work forward from a component or process step to the failure modes that could happen and rank them before they cause loss. Root cause analysis is reactive: you work backward from a failure that already happened to the underlying cause. They feed each other. A failure that RCA uncovers should be added to the FMEA, and a high-risk item on the FMEA is a good candidate for a deeper RCA before it bites.
What is the difference between DFMEA and PFMEA?
A design FMEA (DFMEA) analyses failure modes in the product design itself, before anything is built. A process FMEA (PFMEA) analyses failure modes in the manufacturing process that makes the product: the machines, tooling, fixtures and steps. Plant and maintenance teams usually work with PFMEA, and often a machine or equipment FMEA that focuses on the failure modes of a specific asset.
How often should an FMEA be updated?
Treat it as a living document, not a one-off deliverable. Update it whenever the design or process changes, when a new failure mode appears in the field, when a corrective action is completed so the ratings can be re-scored, and on a periodic review cadence for critical assets. An FMEA that has not been touched since launch is usually out of date and quietly ignored.