Fault Management
Fault Management, the F in FCAPS, is the process of locating, diagnosing, isolating, and correcting problems in a network (including proving the fixes have succeeded).
A fault can be defined as a failure of a system, device, or component to operate as expected, which requires action to resolve. Failures can be indicated by excessive errors, such as alignment errors in Ethernet, but some errors are considered normal, such as collisions in a shared (half-duplex) Ethernet environment.
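The distinction between "excessive" and "normal" errors can be made concrete by comparing an error rate against a per-counter tolerance. The following is only a minimal sketch: the counter names, the `TOLERANCE` values, and the helper functions are hypothetical and chosen for illustration, not taken from any standard or product.

```python
# Illustrative tolerances (fraction of frames); these values are assumptions,
# not vendor or standards guidance.
TOLERANCE = {
    "alignment_errors": 0.0001,  # should be close to zero on a healthy link
    "collisions": 0.05,          # some collisions are expected on shared Ethernet
}


def error_rate(error_delta: int, frame_delta: int) -> float:
    """Errors per frame over one polling interval."""
    if frame_delta <= 0:
        return 0.0
    return error_delta / frame_delta


def is_fault(counter: str, error_delta: int, frame_delta: int) -> bool:
    """True when the error rate for `counter` exceeds its tolerance."""
    return error_rate(error_delta, frame_delta) > TOLERANCE[counter]


# Same raw numbers, different verdicts: 200 events in 10,000 frames (2%).
print(is_fault("collisions", 200, 10_000))        # within tolerance
print(is_fault("alignment_errors", 200, 10_000))  # excessive
```

The point of the sketch is that the same count can be harmless for one counter and a fault indicator for another, so tolerances need to be set per error type rather than globally.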
Fault management is the detection, isolation, and correction of either persistent or transient faults that cause the network to perform below expectations. Monitoring is the most basic function of a fault management system, and it includes collecting information about device hardware and software. It can also include collecting data about device status, health, and performance. Finally, it can include post-collection processing of the gathered data.
There are two schools of thought when it comes to fault management: reactive and pro-active. What do I mean by these terms? Reactive management is the process of waiting for a fault to occur and then resolving it. An example would be a router reloading. But what if you could predict the fault before it actually occurs? That, in essence, is pro-active management. Take the same example of a router reloading: suppose the fault is due to excessive traffic on a specific interface causing the CPU to spike, which in turn causes the router to panic and reload. If we are monitoring for this kind of thing and a predefined threshold is met, an alert can be generated and acted upon before we have a service-affecting issue.
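The threshold idea in the router example can be sketched in a few lines. This is a hedged illustration rather than how any particular product does it: the `Alert` class, the `check_cpu` function, the 80% threshold, the three-sample rule, and the sample readings are all assumptions made up for the example. Requiring several consecutive samples above the threshold is one simple way to avoid alerting on a single momentary spike.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Alert:
    device: str
    metric: str
    value: float
    threshold: float


def check_cpu(device: str, samples: Iterable[float],
              threshold: float = 80.0, consecutive: int = 3) -> Optional[Alert]:
    """Return an Alert once `consecutive` samples in a row exceed `threshold`."""
    run = 0
    for value in samples:
        # Extend the run on a breach, reset it on any sample at or below the threshold.
        run = run + 1 if value > threshold else 0
        if run >= consecutive:
            return Alert(device, "cpu_percent", value, threshold)
    return None


# Hypothetical CPU readings (percent) polled from a router as traffic ramps up.
readings = [35.0, 62.0, 85.0, 91.0, 97.0, 99.0]
alert = check_cpu("router-1", readings)
if alert:
    print(f"ALERT {alert.device}: {alert.metric}={alert.value} > {alert.threshold}")
```

With these readings the alert fires on the third consecutive breach, before the CPU reaches the point where the router would reload, which is exactly the window pro-active management is trying to buy.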
The majority of organisations still seem to rely on the reactive mechanism, which in my eyes is crazy; I would much rather know of potential issues before they occur. A number of enterprise applications now provide pro-active monitoring out of the box. One important aid to pro-active management is good Performance Management.
A number of fault management applications and tools exist in the market (this is by no means a definitive list):
- HP OpenView NNM
- CiscoWorks DFM
- CA Spectrum
- IBM Netcool (formerly Micromuse)
- Cisco EEM
- Cisco IPSLA (usually associated with Performance Management)
- OpenNMS
- Nagios
NMS Monkey