Fault Management System Configuration
The Fault Management System (FMS) detects events, correlates them, and raises the relevant alarms. Events are OPER_LOGs relayed from the VLOGd module. Alarms result from applying the correlation rules and provide a persistent indication of faults. FMS maintains the alarms in a database, and they can be displayed with show commands.
Note:
• FMS relies on the loopback interface (interface lo0) to communicate with VLOGd. The loopback interface must therefore be operationally up for both the FMS and VLOGd modules to function normally.
• In OcNOS, FMS is disabled by default.
• After starting FMS, avoid editing the alarm_def_config.yaml file, as changes will only take effect after restarting FMS.
• Set the device's logging level to at least 4 (NOTIFY) to ensure timely delivery of notification events to FMS for appropriate actions. Adjusting the logging level below NOTIFY may cause FMS to miss clear events and fail to resolve active alarms.
• If FMS restarts, either because the device reboots (upgrade, downgrade, or reboot) or because FMS is manually disabled and then enabled again, the device closes all active alarms. Use the show alarm closed CLI command to view the alarms that were closed.
FMS applies the correlation procedures in Table 33-3 based on the configurations specified.
Table 33-3: FMS correlation procedures
Correlation type | Description |
---|---|
Generalization | • Groups two or more events into a single alarm. • The generalized alarm is then processed with one of the other correlation types (none, time-bound, counting, or compression). |
Time-bound | • Starts a timer when the event is first received. • Suppresses subsequent events of the same type while the timer is running. • When the timer expires, raises one alarm for the event that states how many times the event was received during that interval. |
Counting | Treats a specified number of similar events as one. The alarm is raised after the event has occurred the configured number of times (CORRELATION_COUNTER). |
Compression | Checks multiple occurrences of the same event for duplicate or redundant information, removes the redundancies, and reports them as a single alarm. |
Severity | Correlates events based on the severity of the events. |
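The counting and time-bound rows above can be sketched in JavaScript as follows. This is a minimal illustration, not the FMS source: the class names, the raise callback, and the injectable schedule function are all hypothetical, and restarting the count after an alarm is raised is an assumption that the table does not state.

```javascript
// Sketch of two correlation types from Table 33-3.
// All names here are hypothetical; they are not part of FMS.

// Counting: raise one alarm after the event has been seen `count` times.
class CountingCorrelator {
  constructor(count, raise) {
    this.count = count;
    this.raise = raise;
    this.seen = new Map(); // event name -> occurrences so far
  }

  onEvent(name) {
    const n = (this.seen.get(name) || 0) + 1;
    if (n >= this.count) {
      this.seen.delete(name); // assumption: counting restarts after an alarm
      this.raise({ event: name, occurrences: n });
    } else {
      this.seen.set(name, n);
    }
  }
}

// Time-bound: the first event starts a timer. Later events of the same
// type are only counted. When the timer expires, one alarm reports the total.
class TimeBoundCorrelator {
  constructor(durationMs, raise, schedule = (fn, ms) => setTimeout(fn, ms)) {
    this.durationMs = durationMs;
    this.raise = raise;
    this.schedule = schedule; // injectable so the timer can be faked
    this.pending = new Map(); // event name -> { occurrences }
  }

  onEvent(name) {
    const running = this.pending.get(name);
    if (running) {
      running.occurrences += 1; // suppressed while the timer runs
      return;
    }
    const entry = { occurrences: 1 };
    this.pending.set(name, entry);
    this.schedule(() => {
      this.pending.delete(name);
      this.raise({ event: name, occurrences: entry.occurrences });
    }, this.durationMs);
  }
}
```

For example, new CountingCorrelator(3, console.log) prints one alarm object on the third onEvent("IFMGR_IF_DOWN") call and nothing on the first two.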
Implementation
FMS is implemented in NodeJS. Its scripts are JavaScript files (*.js) and its configuration files are YAML files (*.yaml). Table 33-4 lists their locations in OcNOS.
Table 33-4: FMS script and configuration files
Path | Contents |
---|---|
/usr/local/bin/js | JavaScript files (*.js files) |
/usr/local/etc | Configuration files (*.yaml files) |
Enabling and Disabling the Fault Management System
Use the following commands to enable or disable FMS:
Enabling FMS
# configure terminal
Enter configuration commands, one per line. End with CNTL/Z.
(config)#
(config)#fault-management enable
(config)#
Disabling FMS
# configure terminal
Enter configuration commands, one per line. End with CNTL/Z.
(config)#
(config)#fault-management disable
(config)#
Alarm Configuration File
The alarm configuration file contains the rules that FMS refers to when it generates alarms from received events. The file is in human-readable *.yaml format and is located in /usr/local/etc.
This file can be edited before starting FMS to include correlation rules for specific events.
Alarm Configuration File Template
#-------Template-------
#- Event_Group:
# - ALARM_ID: # Integer number identifying alarm
# ALARM_TYPE_ID: # Alarm Type-id(AIS, EQPT, LOS, OTS, OPWR, UNKNOWN)
# EVENT: # Event name(oper_log)
# GENERALIZED_EVENT_NAME: # Event name for the Generalization Event Group
# ALARM_DESC: # Alarm string which will be generated
# CORRELATION_TYPE: # Correlation logic type(0:No-Correlation, 1:Generalization, 2:Timebound, 3:Counting, 4:Compression, 5:Drop-Event, 6:Severity)
# GENERALIZED_CORRELATION_TYPE: # Correlation type, in which generalized event will be sent
# CORRELATION_COUNTER: # Counter value that will be considered during counting logic to raise alarm
# CORRELATION_TIMER_DURATION: # Timer duration to be considered for time bound logic
# CORRELATION_SEVERITY: # Alarm Severity(0:Critical, 1:Major, 2:Warning, 3:Minor, 4:Unknown)
# QUALIFIER_STRING_POSITION: # List of positions where qualifier values are present
# QUALIFIER_POSITION_1_EVENT_1: # First position of the qualifier value in the first event
# RESOURCE_STRING_POSITION: # List of positions where resource values are present
# RESOURCE_POSITION_1_EVENT_1: # First position of the resource value in the first event
# SNMP_TRAP: # SNMP TRAP (true(1) or false(0))
# SNMP_OID: # OID for SNMP TRAP
# NETCONF_NOTIFICATION: # Netconf Notification (true(1) or false(0))
# CLEAR_ALARM: # Clear Alarm (oper_log enum, Status for Alarm will be made In-active if this event is received)
# CLEAR_EVENT_PATTERN_VALUES: # Pattern values which will be searched in event's description to identify clear event and to clear active alarm (required if both active and clear event types are same)
# SNMP_TRAP_CLEAR: # true(1) or false(0, if CLEAR_ALARM is null then SNMP_TRAP_CLEAR will be null)
# SNMP_CLEAR_OID: # OID for SNMP TRAP CLEAR
# NETCONF_CLEAR_NOTIFICATION: # Clear Netconf Notification information
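To make the numeric codes in the template easier to read, the sketch below maps them to their names and runs a few sanity checks on one entry. It assumes an entry has already been loaded into a plain JavaScript object. The sampleEntry values, the checkEntry and describeEntry helpers, and the specific checks are hypothetical illustrations; FMS may validate entries differently.

```javascript
// Code-to-name tables copied from the template comments above.
const CORRELATION_TYPES = {
  0: "No-Correlation", 1: "Generalization", 2: "Timebound",
  3: "Counting", 4: "Compression", 5: "Drop-Event", 6: "Severity",
};
const SEVERITIES = { 0: "Critical", 1: "Major", 2: "Warning", 3: "Minor", 4: "Unknown" };

// Hypothetical entry: every value here is illustrative only.
const sampleEntry = {
  ALARM_ID: 1000,
  ALARM_TYPE_ID: "UNKNOWN",
  EVENT: "IFMGR_IF_DOWN",
  ALARM_DESC: "Interface state down",
  CORRELATION_TYPE: 3,
  CORRELATION_COUNTER: 3,
  CORRELATION_SEVERITY: 1,
  CLEAR_ALARM: "IFMGR_IF_UP",
};

// Returns a list of problems found in an entry. An empty list means the
// checks in this sketch passed, not that FMS would accept the entry.
function checkEntry(entry) {
  const problems = [];
  if (!Number.isInteger(entry.ALARM_ID)) problems.push("ALARM_ID must be an integer");
  if (!entry.EVENT) problems.push("EVENT is missing");
  if (!(entry.CORRELATION_TYPE in CORRELATION_TYPES)) problems.push("unknown CORRELATION_TYPE");
  if (entry.CORRELATION_TYPE === 3 && !(entry.CORRELATION_COUNTER > 0)) {
    problems.push("Counting needs a positive CORRELATION_COUNTER");
  }
  if (entry.CORRELATION_TYPE === 2 && !(entry.CORRELATION_TIMER_DURATION > 0)) {
    problems.push("Timebound needs a positive CORRELATION_TIMER_DURATION");
  }
  return problems;
}

// One-line, human-readable summary of an entry.
function describeEntry(entry) {
  const type = CORRELATION_TYPES[entry.CORRELATION_TYPE];
  const severity = SEVERITIES[entry.CORRELATION_SEVERITY];
  return `${entry.EVENT}: ${type}, severity ${severity}`;
}
```

Calling checkEntry(sampleEntry) returns an empty list, and describeEntry(sampleEntry) summarizes the entry using the names from the template comments.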
Auto Generating the Alarm Configuration File
The auto_yaml_generator.js file is a NodeJS script that generates the alarm configuration file (alarm_def_config.yaml) for the oper logs listed in the oper_logs_list.yaml file. Each generated entry uses the default values shown below.
# Integer number identifying alarm
ALARM_ID: 1000
# Event name (oper_log)
EVENT: oper_log string
# Event name for the Generalization Event Group
GENERALIZED_EVENT_NAME: null
# Alarm string which will be generated
ALARM_DESC: oper_log string
# Correlation logic type (0: No-Correlation, 1: Generalization, 2: Time Bound, 3: Counting, 4: Compression, 5: Drop-Event)
CORRELATION_TYPE: 0
# Correlation type, in which generalized event will be sent
GENERALISED_CORRELATION_TYPE: null
# Counter value that will be considered during counting logic to raise alarm
CORRELATION_COUNTER: 3
# Timer duration to be considered for time bound logic
CORRELATION_TIMER_DURATION: 20000
# Alarm Severity(1:Emergency, 2:Alert, 3:Critical, 4:Error, 5:Warning, 6:Notification, 7:Informational, 8:Debugging, 9:Cli)
CORRELATION_SEVERITY: null
# QUALIFIER_STRING_POSITION
QUALIFIER_POSITION_1_EVENT_1: null
# RESOURCE_STRING_POSITION
RESOURCE_POSITION_1_EVENT_1: null
SNMP_TRAP: 0
# OID for SNMP TRAP
SNMP_OID: null
# Netconf Notification (true (1) or false (0))
NETCONF_NOTIFICATION: 1
# Clear Alarm (oper_log enum, Status for Alarm will be made In-active if this event is received)
CLEAR_ALARM: null
# Clear Event's pattern values which will be searched in event's description to identify clear event
CLEAR_EVENT_PATTERN_VALUES: null
# True (1) or False (0, if CLEAR_ALARM is null then SNMP_TRAP_CLEAR will be null)
SNMP_TRAP_CLEAR: 0
# OID for SNMP TRAP CLEAR
SNMP_CLEAR_OID: null
# Clear Netconf Notification information
NETCONF_CLEAR_NOTIFICATION: 0
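As an illustration of how such defaults could be produced, the sketch below builds one entry per oper log using the values listed above. It is not the auto_yaml_generator.js source: the function names are hypothetical, and both the flat key layout and the assumption that ALARM_ID increases by one from 1000 are guesses made for the example.

```javascript
// Build one default entry for an oper log, using the default values
// listed above. Key spellings are copied from that listing.
function defaultAlarmEntry(operLog, alarmId) {
  return {
    ALARM_ID: alarmId,
    EVENT: operLog,
    GENERALIZED_EVENT_NAME: null,
    ALARM_DESC: operLog,
    CORRELATION_TYPE: 0,
    GENERALISED_CORRELATION_TYPE: null,
    CORRELATION_COUNTER: 3,
    CORRELATION_TIMER_DURATION: 20000,
    CORRELATION_SEVERITY: null,
    QUALIFIER_POSITION_1_EVENT_1: null,
    RESOURCE_POSITION_1_EVENT_1: null,
    SNMP_TRAP: 0,
    SNMP_OID: null,
    NETCONF_NOTIFICATION: 1,
    CLEAR_ALARM: null,
    CLEAR_EVENT_PATTERN_VALUES: null,
    SNMP_TRAP_CLEAR: 0,
    SNMP_CLEAR_OID: null,
    NETCONF_CLEAR_NOTIFICATION: 0,
  };
}

// Assumption: IDs are assigned sequentially starting at firstId.
function buildDefaults(operLogs, firstId = 1000) {
  return operLogs.map((name, i) => defaultAlarmEntry(name, firstId + i));
}

// Render an entry as flat "KEY: value" lines (null prints as "null").
function toYamlLines(entry) {
  return Object.entries(entry).map(([key, value]) => `${key}: ${value}`);
}
```

For example, toYamlLines(buildDefaults(["IFMGR_IF_DOWN"])[0]).join("\n") produces a block in the same key order as the listing above.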
Alarm Configuration File Generation Steps
1. List all the oper_log enums in the oper_logs_list.yaml file and keep the file in the same directory as auto_yaml_generator.js.
2. Copy the auto_yaml_generator.js and oper_logs_list.yaml files into /usr/local/bin/js.
3. Run the auto_yaml_generator.js script with the following command:
# node auto_yaml_generator.js
4. After the command completes, the alarm_def_config.yaml file appears in the same directory.
Sample oper_logs_list.yaml File
EVENT_GROUP:
IFMGR_IF_DOWN,
IFMGR_IF_UP,
STP_SET_PORT_STATE,
STP_IPC_COMMUNICATION_FAIL,
STP_ROOTGUARD_PORT_BLOCK,
:
:
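The list format shown above can be read with a few lines of JavaScript. The sketch below is only an illustration of the file layout: parseOperLogsList is a hypothetical helper, and the real generator may read the file differently (for example, with a YAML parser).

```javascript
// Extract the oper_log names from text laid out like the sample above:
// an "EVENT_GROUP:" header followed by one comma-terminated name per line.
function parseOperLogsList(text) {
  const names = [];
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim().replace(/,$/, "");
    // Skip blank lines, the "EVENT_GROUP:" header, and ":" filler lines.
    if (line === "" || line.endsWith(":")) continue;
    names.push(line);
  }
  return names;
}
```

Passing the sample file contents to parseOperLogsList returns the five event names in the order they are listed.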
Alarm Descriptions
Table 33-5 describes the supported alarms.
Table 33-5: FMS alarms
Alarm | Description |
---|---|
CMM_DDM_MONITOR_CURRENT | Transceiver Bias Current crossed the threshold limit |
CMM_DDM_MONITOR_FREQ | Transceiver Frequency crossed the threshold limit |
CMM_DDM_MONITOR_RxPOWER | Transceiver Rx Power crossed the threshold limit |
CMM_DDM_MONITOR_TEC | Transceiver Thermoelectric Cooler fault |
CMM_DDM_MONITOR_TEMP | Transceiver Temperature crossed the threshold limit |
CMM_DDM_MONITOR_TxPOWER | Transceiver Tx Power crossed the threshold limit |
CMM_DDM_MONITOR_VOLT | Transceiver Voltage crossed the threshold limit |
CMM_DDM_MONITOR_WAVE | Transceiver Wavelength crossed the threshold limit |
CMM_FAN_CTRL | Fan insertion, removal, speed, or fault condition alarm |
CMM_MONITOR_CPU | CPU load average crossed the threshold limit |
CMM_MONITOR_CPU_CORE | CPU core usage crossed the threshold limit |
CMM_MONITOR_DISK_READ_ACTIVITY | Disk read activity crossed the threshold limit |
CMM_MONITOR_DISK_REMAIN_LIFE | Disk remaining life crossed the threshold limit |
CMM_MONITOR_DISK_WRITE_ACTIVITY | Disk write activity crossed the threshold limit |
CMM_MONITOR_PSU_POWER | Power supply unit insertion, removal, or fault condition |
CMM_MONITOR_PSU_IIN | Power supply unit input current crossed the threshold limit |
CMM_MONITOR_PSU_IOUT | Power supply unit output current crossed the threshold limit |
CMM_MONITOR_PSU_PIN | Power supply unit input power crossed the threshold limit |
CMM_MONITOR_PSU_POUT | Power supply unit output power crossed the threshold limit |
CMM_MONITOR_PSU_PRESENCE | Power supply unit is present |
CMM_MONITOR_PSU_TEMP1 | Power supply unit temperature 1 crossed the threshold limit |
CMM_MONITOR_PSU_TEMP2 | Power supply unit temperature 2 crossed the threshold limit |
CMM_MONITOR_PSU_VIN | Power supply unit input voltage crossed the threshold limit |
CMM_MONITOR_PSU_VOUT | Power supply unit output voltage crossed the threshold limit |
CMM_MONITOR_RAM | RAM usage crossed the threshold limit |
CMM_MONITOR_SDCARD | Hard-disk usage crossed the threshold limit or fault condition |
CMM_MONITOR_TEMP | Temperature sensor crossed the threshold limit |
CMM_TAI_CD_ALARM | RX lane current chromatic dispersion crossed the threshold limit |
CMM_TAI_CD_CLEAR | RX lane current chromatic dispersion recovered from the threshold |
CMM_TAI_Q_MARGIN_ALARM | Network RX Q-margin over PM interval value crossed the threshold limit |
CMM_TAI_Q_MARGIN_CLEAR | Network RX Q-margin over PM interval value recovered from the threshold |
CMM_TAI_RX_LOS | RX Loss-of-Signal alarm detected |
CMM_TAI_RX_LOS_CLEAR | RX Loss-of-Signal alarm cleared |
CMM_TAI_SNR_ALARM | Network lane0 current Signal-to-Noise Ratio crossed the threshold limit |
CMM_TAI_SNR_CLEAR | Network lane0 current Signal-to-Noise Ratio recovered from the threshold |
CMM_TRANSCEIVER | Transceiver on fault condition |
IFMGR_IF_DOWN | Interface state down |
IFMGR_IF_UP | Interface state up |
CMM_MONITOR_FAN | Fan monitoring value crossed the threshold limit |
CMM_MONITOR_CURRENT | Current crossed the threshold limit |
CMM_MONITOR_VOLTAGE | Voltage crossed the threshold limit |
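Several alarms in Table 33-5 have a matching clear event, and the CLEAR_ALARM field in the alarm configuration file makes an alarm inactive when its clear event is received. The sketch below illustrates that raise and clear behavior. The AlarmTable class is hypothetical, and the pairs in CLEAR_FOR are assumptions inferred from the alarm names rather than shipped configuration. Real alarms are presumably also tracked per resource (for example, per interface), which this sketch ignores.

```javascript
// Assumed raise -> clear pairs, inferred from the names in Table 33-5.
const CLEAR_FOR = new Map([
  ["IFMGR_IF_DOWN", "IFMGR_IF_UP"],
  ["CMM_TAI_RX_LOS", "CMM_TAI_RX_LOS_CLEAR"],
  ["CMM_TAI_CD_ALARM", "CMM_TAI_CD_CLEAR"],
]);

class AlarmTable {
  constructor(clearFor) {
    this.clearFor = clearFor; // raise event -> clear event
    this.status = new Map();  // raise event -> "active" | "inactive"
  }

  onEvent(name) {
    // A raise event marks its alarm active.
    if (this.clearFor.has(name)) {
      this.status.set(name, "active");
      return;
    }
    // A clear event makes the matching active alarm inactive.
    for (const [raiseEvent, clearEvent] of this.clearFor) {
      if (clearEvent === name && this.status.get(raiseEvent) === "active") {
        this.status.set(raiseEvent, "inactive");
      }
    }
  }

  // Names of alarms that are currently active, in the order first raised.
  activeAlarms() {
    return [...this.status]
      .filter(([, state]) => state === "active")
      .map(([eventName]) => eventName);
  }
}
```

With this table, an IFMGR_IF_DOWN event followed by IFMGR_IF_UP leaves no active alarm, while an unrelated event such as CMM_MONITOR_CPU changes nothing.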