8. Facility Alarms

8.1. In This Chapter

This chapter provides information about Facility Alarms.

8.2. Facility Alarms Overview

Facility Alarms provide a useful tool for operators to easily track and display the basic status of their equipment facilities. Facility Alarm support is intended to cover a focused subset of router states that are likely to indicate service impacts (or imminent service impacts) related to the overall state of hardware assemblies (cards, fans, links, and so on).

In the CLI, for brevity, the keyword or command alarm is used for commands related to Facility Alarms. This chapter may occasionally use the term “alarm” as a short form for “facility alarm”.

The show command output lets the system operator identify current facility alarm conditions and recently cleared facility alarms without searching event logs or running multiple card and port show commands to determine the health of basic equipment in the system, such as cards and ports.

The SR OS alarm model is based on RFC 3877, Alarm Management Information Base (MIB), which evolved from the IETF DISMAN drafts.

8.3. Facility Alarms vs. Log Events

Facility Alarms are different from log events. Facility Alarms have a state (at least two states: active and clear) and a duration, and can be modeled with state transition events (raised, cleared). A log event occurs when the state of some object in the system changes. Log events notify the operator of a state change (for example, a port going down, an IGP peering session coming up, and so on). Facility alarms show the list of hardware objects that are currently in a bad state. Facility alarms can be examined at any time by an operator, whereas log events are sent by a router asynchronously when they occur (for example, as an SNMP notification or trap, or a syslog event).

While log events provide notifications about a large number of different types of state changes in SR OS, facility alarms are intended to cover a focused subset of router states that are likely to indicate service impacts (or imminent service impacts) related to the overall state of hardware assemblies (cards, fans, links, and so on).

The facility alarm module processes log events in order to generate the raised and cleared state for the facility alarms. If a raising log event is suppressed under event-control, then the associated facility alarm will not be raised. If a clearing log event is suppressed under event-control, then it is still processed for the purpose of clearing the associated facility alarm. If a log event is a raising event for a Facility Alarm, and the associated Facility Alarm is raised, then changing the log event to suppress will clear the associated Facility Alarm.

Log event filtering, throttling and discarding of log events during overload do not affect facility alarm processing. In all cases, non-suppressed log events are processed by the facility alarm module before they are discarded.
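The raise and clear rules above can be sketched in a few lines of Python. This is an illustrative model only, not the SR OS implementation: the class name, its methods, and the idea of keying an alarm by a (raising event, object) pair are assumptions made for the example. The event pairs are taken from Table 89.

```python
from dataclasses import dataclass, field

# Raising log event -> clearing log event, for a few rows of Table 89.
CLEARED_BY = {
    "tmnxEqCardRemoved": "tmnxEqCardInserted",
    "tmnxEqPowerSupplyRemoved": "tmnxEqPowerSupplyInserted",
    "linkDown": "linkUp",
}


@dataclass
class FacilityAlarmModule:
    """Hypothetical model of the facility alarm module (names are assumptions)."""
    suppressed: set = field(default_factory=set)  # events suppressed under event-control
    active: set = field(default_factory=set)      # (raising event, object) pairs

    def process(self, event, obj):
        """Handle one log event, before any filtering, throttling or discard."""
        if event in CLEARED_BY:
            # A suppressed raising event does not raise the alarm.
            if event not in self.suppressed:
                self.active.add((event, obj))
            return
        # A clearing event is processed even when it is suppressed.
        for raising, clearing in CLEARED_BY.items():
            if clearing == event:
                self.active.discard((raising, obj))

    def suppress(self, event):
        """Suppress an event; alarms already raised by it are cleared."""
        self.suppressed.add(event)
        self.active = {(e, o) for (e, o) in self.active if e != event}
```

In this sketch, suppressing linkUp has no effect on clearing: a later linkUp for the same object still removes the linkDown alarm from the active set.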

Figure 21 illustrates the relationship of log events, facility alarms and the LEDs.

Figure 21:  Log Events, Facility Alarms and LEDs 

Facility Alarms are separate from, and independent of, other uses of the term alarm in SR OS, such as:

  1. Log events that use the term alarm (tmnxEqPortSonetAlarm)
  2. configure card fp hi-bw-mcast-src [alarm]
  3. configure mcast-management multicast-info-policy bundle channel source-override video analyzer alarms
  4. configure port ethernet report-alarm
  5. configure system thresholds no memory-use-alarm
  6. configure system thresholds rmon no alarm
  7. configure system security cpu-protection policy alarm

8.4. Facility Alarm Severities and Alarm LED Behavior

The Alarm LEDs on the CPM/CCM reflect the current status of the Facility Alarms:

  1. The Critical Alarm LED is lit if there are one or more active Critical Facility Alarms
  2. The Major and Minor Alarm LEDs behave in the same way for active Major and Minor Facility Alarms
  3. The OT Alarm LED is not controlled by the Facility Alarm module

The supported alarm severities are as follows:

  1. Critical (with an associated LED on the CPM/CCM)
  2. Major (with an associated LED on the CPM/CCM)
  3. Minor (with an associated LED on the CPM/CCM)
  4. Warning (no LED)

Facility alarms inherit their severity from the raising log event.

If the raising log event for a facility alarm is configured with a severity of indeterminate or cleared, the facility alarm is not raised. A clearing log event, however, is always processed in order to clear facility alarms, regardless of the severity of the clearing log event.

Changing the severity of a raising log event only affects subsequent occurrences of that log event and facility alarms. Facility alarms that are already raised when their raising log event severity is changed maintain their original severity.
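As a rough illustration of the severity and LED rules in this section, the following Python sketch keeps a dictionary of active alarms and derives the LED state from it. The function names and the use of a plain dictionary keyed by alarm identifier are assumptions for the example; they do not describe SR OS internals.

```python
LED_SEVERITIES = ("critical", "major", "minor")  # warning has no LED
RAISABLE = LED_SEVERITIES + ("warning",)


def raise_alarm(active, alarm_id, event_severity):
    """Raise an alarm with the severity inherited from its raising log event.

    Returns False when the event severity (for example indeterminate or
    cleared) means that no facility alarm is raised.
    """
    if event_severity not in RAISABLE:
        return False
    # An alarm that is already raised keeps its original severity.
    active.setdefault(alarm_id, event_severity)
    return True


def led_state(active):
    """An LED is lit when at least one active alarm has that severity."""
    severities = set(active.values())
    return {sev: sev in severities for sev in LED_SEVERITIES}
```

Calling raise_alarm a second time for an alarm that is still active leaves its stored severity untouched, which mirrors the rule that a severity change only affects later occurrences.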

8.5. Facility Alarm Hierarchy

Facility Alarms for child objects are not raised when a parent object fails. For example, when an MDA or XMA fails (or is shut down), no port facility alarms are raised.

When a parent facility alarm is cleared, child facility alarms whose conditions still exist on the node appear in the active facility alarms list. For example, when a port fails there is a port facility alarm, but if the MDA or XMA is later shut down the port alarm is cleared (and a card alarm becomes active for the MDA or XMA). If the MDA or XMA comes back into service and the port is still down, then the port alarm becomes active once again.

The supported facility alarm hierarchy is as follows (parent objects that are down cause alarms in all children to be masked):

  1. CPM -> Compact Flash
  2. CCM -> Compact Flash
  3. IOM/IMM -> MDA -> Port -> Channel
  4. XCM -> XMA -> Port
  5. MCM -> MDA -> Port -> Channel

Note: A masked facility alarm is not the same as a cleared facility alarm. The cleared facility alarm queue does not display entries for previously raised facility alarms that are currently masked. If the masking event goes away, then the previously raised facility alarms will once again be visible in the active facility alarm queue.
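The masking behavior can be illustrated with a short Python sketch. It assumes, purely for the example, that each object is named by a tuple giving its position in the hierarchy (for instance an IOM, then an MDA, then a port); the function name and data layout are hypothetical.

```python
def visible_alarms(raised, down):
    """Return the raised alarms that are not masked by a parent that is down.

    raised: set of object paths that have a raised facility alarm
    down:   set of object paths that are currently down (failed or shut down)
    """
    def masked(path):
        # Check every proper ancestor: ("IOM 1",), ("IOM 1", "MDA 1"), ...
        return any(path[:i] in down for i in range(1, len(path)))

    return {path for path in raised if not masked(path)}


# Example paths following the IOM/IMM -> MDA -> Port chain.
mda = ("IOM 1", "MDA 1")
port = ("IOM 1", "MDA 1", "Port 1")
```

With only the port down, the port alarm is visible. If the MDA then goes down as well, only the MDA alarm is visible, and the port alarm reappears once the MDA is no longer down, without ever passing through the cleared queue.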

8.6. Facility Alarm List

Table 89 and Table 90 show the supported Facility Alarms.

Table 89:  Facility Alarm, Facility Alarm Name/Raising Log Event, Sample Details String and Clearing Log Event  

| Facility Alarm *1 | Facility Alarm Name/Raising Log Event | Sample Details String | Clearing Log Event |
| --- | --- | --- | --- |
| 7-2001-1 | tmnxEqCardFailure | Class MDA Module: failed, reason: Mda 1 failed startup tests | tmnxChassisNotificationClear |
| 7-2003-1 | tmnxEqCardRemoved | Class CPM Module: removed | tmnxEqCardInserted |
| 7-2004-1 | tmnxEqWrongCard | Class IOM Module: wrong type inserted | tmnxChassisNotificationClear |
| 7-2005-1 | tmnxEnvTempTooHigh | Chassis 1: temperature too high | tmnxChassisNotificationClear |
| 7-2006-1 | tmnxEqFanFailure | Fan 2 failed | tmnxChassisNotificationClear |
| 7-2007-1 | tmnxEqPowerSupplyFailureOvt | Power supply 2 over temperature | tmnxChassisNotificationClear |
| 7-2008-1 | tmnxEqPowerSupplyFailureAc | Power supply 1 AC failure | tmnxChassisNotificationClear |
| 7-2009-1 | tmnxEqPowerSupplyFailureDc | Power supply 2 DC failure | tmnxChassisNotificationClear |
| 7-2011-1 | tmnxEqPowerSupplyRemoved | Power supply 1, power lost | tmnxEqPowerSupplyInserted |
| 7-2017-1 | tmnxEqSyncIfTimingHoldover | Synchronous Timing interface in holdover state | tmnxEqSyncIfTimingHoldoverClear |
| 7-2019-1 | tmnxEqSyncIfTimingRef1Alarm with attribute tmnxSyncIfTimingNotifyAlarm == 'los(1)' | Synchronous Timing interface, alarm los on reference 1 | tmnxEqSyncIfTimingRef1AlarmClear |
| 7-2019-2 | tmnxEqSyncIfTimingRef1Alarm with attribute tmnxSyncIfTimingNotifyAlarm == 'oof(2)' | Synchronous Timing interface, alarm oof on reference 1 | same as 7-2019-1 |
| 7-2019-3 | tmnxEqSyncIfTimingRef1Alarm with attribute tmnxSyncIfTimingNotifyAlarm == 'oopir(3)' | Synchronous Timing interface, alarm oopir on reference 1 | same as 7-2019-1 |
| 7-2021-x | same as 7-2019-x but for ref2 | same as 7-2019-x but for ref2 | same as 7-2019-x but for ref2 |
| 7-2030-x | same as 7-2019-x but for the BITS input | same as 7-2019-x but for the BITS input | same as 7-2019-x but for the BITS input |
| 7-2033-1 | tmnxChassisUpgradeInProgress | Class CPM Module: software upgrade in progress | tmnxChassisUpgradeComplete |
| 7-2050-1 | tmnxEqPowerSupplyFailureInput | Power supply 1 input failure | tmnxChassisNotificationClear |
| 7-2051-1 | tmnxEqPowerSupplyFailureOutput | Power supply 1 output failure | tmnxChassisNotificationClear |
| 7-2073-x | same as 7-2019-x but for the BITS2 input | same as 7-2019-x but for the BITS2 input | same as 7-2019-x but for the BITS2 input |
| 7-2092-1 | tmnxEqPowerCapacityExceeded | The system has reached maximum power capacity <x> watts | tmnxEqPowerCapacityExceededClear |
| 7-2094-1 | tmnxEqPowerLostCapacity | The system can no longer support configured devices. Power capacity dropped to <x> watts | tmnxEqPowerLostCapacityClear |
| 7-2096-1 | tmnxEqPowerOverloadState | The system has reached critical power capacity. Increase available power now | tmnxEqPowerOverloadStateClear |
| 7-4001-1 | tmnxInterChassisCommsDown | Control communications disrupted between the Active CPM and the chassis | tmnxInterChassisCommsUp |
| 7-4003-1 | tmnxCpmIcPortDown | CPM Interconnect Port is not operational. Error code = invalid-connection | tmnxCpmIcPortUp |
| 7-4007-1 | tmnxCpmANoLocalIcPort | CPM A can not reach the chassis using its local CPM interconnect ports | tmnxCpmALocalIcPortAvail |
| 7-4008-1 | tmnxCpmBNoLocalIcPort | CPM B can not reach the chassis using its local CPM interconnect ports | tmnxCpmBLocalIcPortAvail |
| 7-4017-1 | tmnxSfmIcPortDown | SFM interconnect Port is not operational. Error code = invalid-connection to Fabric 10 IcPort 2 | tmnxSfmIcPortUp |
| 7-5001-1 | tmnxOesCtlCommsDown | Control communications disrupted between the Active CPM and the OES Master chassis, reason: oes-unreachable | tmnxOesCtlCommsUp |
| 7-5101-1 | tmnxOesCtlCardPortDown | OES control card port is not operational | tmnxOesCtlCardPortUp |
| 7-5105-1 | tmnxOesFanRemoved | OES fan Removed | tmnxOesFanInserted |
| 7-5109-1 | tmnxOesFanFailure | OES Fan Failure: Card Communication Failure | tmnxOesFanFailureClear |
| 7-5111-1 | tmnxOesPwrSupplyRemoved | OES Power Supply Removed | tmnxOesPwrFilterInserted |
| 7-5113-1 | tmnxOesPwrSupplyFailure | OES Power Supply Failure: High Input Voltage Defect | tmnxOesPwrSupplyFailureClear |
| 7-5128-1 | tmnxOesTempLow | Fan oes-1/37: temperature too low | tmnxOesTempLowClear |
| 59-2004-1 | linkDown | Interface intf-towards-node-B22 is not operational | linkUp |
| 64-2091-1 | tmnxSysLicenseInvalid | Error - <reason> record. <hw> will reboot the chassis <timeRemaining> | None |
| 64-2092-1 | tmnxSysLicenseExpiresSoon | The license installed on <hw> expires <timeRemaining> | None |
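The numbering of the synchronous timing alarms in Table 89 follows a regular pattern: the first part of the identifier selects the timing reference, and the final digit matches the tmnxSyncIfTimingNotifyAlarm value. A minimal Python sketch of that pattern follows; the short keys ref1, ref2, bits and bits2 and the function name are labels invented for this example.

```python
# Identifier prefixes per timing reference, as listed in Table 89.
BASE_ID = {
    "ref1": "7-2019",
    "ref2": "7-2021",
    "bits": "7-2030",
    "bits2": "7-2073",
}

# tmnxSyncIfTimingNotifyAlarm values: los(1), oof(2), oopir(3).
SUB_INDEX = {"los": 1, "oof": 2, "oopir": 3}


def timing_alarm_id(reference, notify_alarm):
    """Build the facility alarm identifier for a timing reference alarm."""
    return f"{BASE_ID[reference]}-{SUB_INDEX[notify_alarm]}"
```

For example, a los condition on the first timing reference maps to 7-2019-1, the first synchronous timing reference row in the table.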

Table 90:  Facility Alarm Name/Raising Log Event, Cause, Effect and Recovery  

| Facility Alarm *1 | Facility Alarm Name/Raising Log Event | Cause | Effect | Recovery |
| --- | --- | --- | --- | --- |
| 7-2001-1 | tmnxEqCardFailure | Generated when one of the cards in a chassis has failed. The card type may be IOM (or XCM), MDA (or XMA), SFM, CCM, CPM, Compact Flash, etc. The reason is indicated in the details of the log event or alarm, and is also available in the tmnxChassisNotifyCardFailureReason attribute included in the SNMP notification. | The effect is dependent on the card that has failed. IOM (or XCM) or MDA (or XMA) failure will cause a loss of service for all services running on that card. A fabric failure can impact traffic to/from all cards. | Before taking any recovery steps collect a tech-support file, then try resetting (clear) the card. If that doesn't work then try removing and then re-inserting the card. If that doesn't work then replace the card. |
| 7-2003-1 | tmnxEqCardRemoved | Generated when a card is removed from the chassis. The card type may be IOM (or XCM), MDA (or XMA), SFM, CCM, CPM, Compact Flash, etc. | The effect is dependent on the card that has been removed. IOM (or XCM) or MDA (or XMA) removal will cause a loss of service for all services running on that card. A fabric removal can impact traffic to/from all cards. | Before taking any recovery steps collect a tech-support file, then try re-inserting the card. If that doesn't work then replace the card. |
| 7-2004-1 | tmnxEqWrongCard | Generated when the wrong type of card is inserted into a slot of the chassis. Even though a card may be physically supported by the slot, it may have been administratively configured to allow only certain card types in a particular slot location. The card type may be IOM (or XCM), MDA (or XMA), SFM, CCM, CPM, Compact Flash, etc. | The effect is dependent on the card that has been incorrectly inserted. Incorrect IOM (or XCM) or MDA (or XMA) insertion will cause a loss of service for all services running on that card. | Insert the correct card into the correct slot, and ensure the slot is configured for the correct type of card. |
| 7-2005-1 | tmnxEnvTempTooHigh | Generated when the temperature sensor reading on an equipment object is greater than its configured threshold. | This could be causing intermittent errors and could also cause permanent damage to components. | Remove or power down the affected cards, or improve the cooling to the node. More powerful fan trays may also be required. |
| 7-2006-1 | tmnxEqFanFailure | Generated when one of the fans in a fan tray has failed. | This could cause the temperature to rise, resulting in intermittent errors, and could also cause permanent damage to components. | Replace the fan tray immediately, improve the cooling to the node, or reduce the heat being generated in the node by removing cards or powering down the node. |
| 7-2007-1 | tmnxEqPowerSupplyFailureOvt | Generated when the temperature sensor reading on a power supply module is greater than its configured threshold. | This could be causing intermittent errors and could also cause permanent damage to components. | Remove or power down the affected power supply module or improve the cooling to the node. More powerful fan trays may also be required. The power supply itself may be faulty so replacement may be necessary. |
| 7-2008-1 | tmnxEqPowerSupplyFailureAc | Generated when an AC failure is detected on a power supply. | Reduced power can cause intermittent errors and could also cause permanent damage to components. | First try re-inserting the power supply. If that doesn't work, then replace the power supply. |
| 7-2009-1 | tmnxEqPowerSupplyFailureDc | Generated when a DC failure is detected on a power supply. | Reduced power can cause intermittent errors and could also cause permanent damage to components. | First try re-inserting the power supply. If that doesn't work, then replace the power supply. |
| 7-2011-1 | tmnxEqPowerSupplyRemoved | Generated when one of the chassis's power supplies is removed. | Reduced power can cause intermittent errors and could also cause permanent damage to components. | Re-insert the power supply. |
| 7-2017-1 | tmnxEqSyncIfTimingHoldover | Generated when the synchronous equipment timing subsystem transitions into a holdover state. | Any node-timed ports will have very slow frequency drift limited by the central clock oscillator stability. The oscillator meets the holdover requirements of a Stratum 3 and G.813 Option 1 clock. | Address issues with the central clock input references. |
| 7-2019-1 | tmnxEqSyncIfTimingRef1Alarm with attribute tmnxSyncIfTimingNotifyAlarm == 'los(1)' | Generated when an alarm condition on the first timing reference is detected. The type of alarm (los, oof, etc) is indicated in the details of the log event or alarm, and is also available in the tmnxSyncIfTimingNotifyAlarm attribute included in the SNMP notification. The SNMP notification will have the same indices as those of the tmnxCpmCardTable. | Timing reference 1 cannot be used as a source of timing into the central clock. | Address issues with the signal associated with timing reference 1. |
| 7-2019-2 | tmnxEqSyncIfTimingRef1Alarm with attribute tmnxSyncIfTimingNotifyAlarm == 'oof(2)' | same as 7-2019-1 | same as 7-2019-1 | same as 7-2019-1 |
| 7-2019-3 | tmnxEqSyncIfTimingRef1Alarm with attribute tmnxSyncIfTimingNotifyAlarm == 'oopir(3)' | same as 7-2019-1 | same as 7-2019-1 | same as 7-2019-1 |
| 7-2021-x | same as 7-2019-x but for ref2 | same as 7-2019-x but for the second timing reference | same as 7-2019-x but for the second timing reference | same as 7-2019-x but for the second timing reference |
| 7-2030-x | same as 7-2019-x but for the BITS input | same as 7-2019-x but for the BITS timing reference | same as 7-2019-x but for the BITS timing reference | same as 7-2019-x but for the BITS timing reference |
| 7-2033-1 | tmnxChassisUpgradeInProgress | The tmnxChassisUpgradeInProgress notification is generated only after a CPM switchover occurs and the new active CPM is running new software, while the IOMs or XCMs are still running old software. This is the start of the upgrade process. The tmnxChassisUpgradeInProgress notification will continue to be generated every 30 minutes while at least one IOM or XCM is still running older software. | A software mismatch between the CPM and IOM or XCM is generally fine for a short duration (during an upgrade) but may not allow for correct long term operation. | Complete the upgrade of all IOMs or XCMs. |
| 7-2050-1 | tmnxEqPowerSupplyFailureInput | Generated when an input failure is detected on a power supply. | Reduced power can cause intermittent errors and could also cause permanent damage to components. | First try re-inserting the power supply. If that doesn't work, then replace the power supply. |
| 7-2051-1 | tmnxEqPowerSupplyFailureOutput | Generated when an output failure is detected on a power supply. | Reduced power can cause intermittent errors and could also cause permanent damage to components. | First try re-inserting the power supply. If that doesn't work, then replace the power supply. |
| 7-2073-x | same as 7-2019-x but for the BITS2 input | same as 7-2019-x but for the BITS 2 timing reference | same as 7-2019-x but for the BITS 2 timing reference | same as 7-2019-x but for the BITS 2 timing reference |
| 7-2092-1 | tmnxEqPowerCapacityExceeded | Generated when a device needs power to boot, but there is not enough power capacity to support the device. | A non-powered device will not boot until the power capacity is increased to support the device. | Add a new power supply to the system, or replace the faulty power supply with a working one. |
| 7-2094-1 | tmnxEqPowerLostCapacity | Generated when a power supply fails or is removed, which puts the system in an overloaded situation. | Devices are powered off in order of lowest power priority until the available power capacity can support the powered devices. | Add a new power supply to the system, or replace the faulty power supply with a working one. |
| 7-2096-1 | tmnxEqPowerOverloadState | Generated when the overloaded power capacity cannot support the power requirements and there are no further devices that can be powered off. | The system runs a risk of experiencing brownouts while the available power capacity does not meet the required power consumption. | Add power capacity or manually shut down devices until the power capacity meets the power needs. |
| 7-4001-1 | tmnxInterChassisCommsDown | The tmnxInterChassisCommsDown alarm is generated when the active CPM cannot reach the far-end chassis. | The resources on the far-end chassis are not available. This event for the far-end chassis means that the CPM, SFM, and XCM cards in the far-end chassis will reboot and remain operationally down until communications are re-established. | Ensure that all CPM interconnect ports in the system are properly cabled together with working cables. |
| 7-4003-1 | tmnxCpmIcPortDown | The tmnxCpmIcPortDown alarm is generated when the CPM interconnect port is not operational. The reason may be a cable connected incorrectly, a disconnected cable, a faulty cable, or a misbehaving CPM interconnect port or card. | At least one of the control plane paths used for inter-chassis CPM communication is not operational. Other paths may be available. | A manual verification and testing of each CPM interconnect port is required to ensure fully functional operation. Physical replacement of cabling may be required. |
| 7-4007-1 | tmnxCpmANoLocalIcPort | The tmnxCpmANoLocalIcPort alarm is generated when the CPM cannot reach the other chassis using its local CPM interconnect ports. | Another control communications path may still be available between the CPM and the other chassis via the mate CPM in the same chassis. If that alternative path is not available then complete disruption of control communications to the other chassis will occur and the tmnxInterChassisCommsDown alarm is raised. A tmnxCpmANoLocalIcPort alarm on the active CPM indicates that a further failure of the local CPM interconnect ports on the standby CPM will cause complete disruption of control communications to the other chassis and the tmnxInterChassisCommsDown alarm is raised. A tmnxCpmANoLocalIcPort alarm on the standby CPM indicates that a CPM switchover may cause temporary disruption of control communications to the other chassis while the rebooting CPM comes back into service. | Ensure that all CPM interconnect ports in the system are properly cabled together with working cables. |
| 7-4008-1 | tmnxCpmBNoLocalIcPort | Same as 7-4007-1. | Same as 7-4007-1. | Same as 7-4007-1. |
| 7-4009-1 | tmnxCpmALocalIcPortAvail | The tmnxCpmALocalIcPortAvail notification is generated when the CPM re-establishes communication with the other chassis using its local CPM interconnect ports. | A new control communications path is now available between the CPM_A and the other chassis. | |
| 7-4010-1 | tmnxCpmBLocalIcPortAvail | Same as 7-4009-1. | Same as 7-4009-1. | Same as 7-4009-1. |
| 7-4017-1 | tmnxSfmIcPortDown | The tmnxSfmIcPortDown alarm is generated when the SFM interconnect port is not operational. The reason may be a cable connected incorrectly, a disconnected cable, a faulty cable, or a misbehaving SFM interconnect port or SFM card. | This port can no longer be used as part of the user plane fabric between chassis. Other fabric paths may be available resulting in no loss of capacity. | A manual verification and testing of each SFM interconnect port is required to ensure fully functional operation. Physical replacement of cabling may be required. |
| 7-5001-1 | tmnxOesCtlCommsDown | The tmnxOesCtlCommsDown notification is generated when the active CPM can't reach the OES master chassis. | The OES cannot be managed by the router. | Ensure that all control communication ports between the router and the OES master chassis are correctly connected and the cables have been tested. |
| 7-5101-1 | tmnxOesCtlCardPortDown | The tmnxOesCtlCardPortDown notification is generated when a port (e.g. ES 1 or AUX) on an OES control card (e.g. EC card) is not operational. The reason may be a misconnected, disconnected or faulty cable, or a faulty port or control card. | If an ES port is down then one of the control plane communication paths between EC cards in different OES chassis is not available. The control communications with one or more OES chassis may be affected rendering the chassis unmanageable. Other control paths may be available. If an AUX port is down then one of the control plane communication paths between the router and the OES Master Chassis is not available and control communications with the OES may be affected rendering the OES unmanageable. Other control paths may be available. | Check if one end of the cable is connected to the wrong port or disconnected. Test the cable. |
| 7-5105-1 | tmnxOesFanRemoved | The tmnxOesFanRemoved notification is generated when the OES fan unit is removed from its slot in the OES chassis. | The function of the OES fan unit is not available. | Insert the OES fan unit into its slot. |
| 7-5109-1 | tmnxOesFanFailure | The tmnxOesFanFailure notification is generated when the fan unit in an OES chassis has failed. tmnxOesNotifyFailureReason contains the reason for the fan failure. | The fan unit in the OES chassis is out of service. | If the condition causing fan failure cannot be removed, replace the faulty fan unit. |
| 7-5111-1 | tmnxOesPwrSupplyRemoved | The tmnxOesPwrSupplyRemoved notification is generated when an OES power supply unit is removed from its slot in the OES chassis. | The power supply unit is not present in the indicated OES chassis slot. | Insert the OES power supply unit into its slot. |
| 7-5113-1 | tmnxOesPwrSupplyFailure | The tmnxOesPwrSupplyFailure notification is generated when the power supply unit in the indicated slot of the OES chassis has failed. tmnxOesNotifyFailureReason contains the reason for the failure. | The indicated OES power supply unit is out of service. | If the condition causing the power supply failure cannot be removed, replace the faulty power supply unit. |
| 7-5128-1 | tmnxOesTempLow | The card has detected that its temperature is below operational limits. | The card is operating below the accepted temperature. | Ensure that no environmental issues are present where the network elements reside. Resolve any existing issues. |
| 59-2004-1 | linkDown | A linkDown trap signifies that the SNMP entity, acting in an agent role, has detected that the ifOperStatus object for one of its communication links is about to enter the down state from some other state (but not from the notPresent state). | The indicated interface is taken down. | If the ifAdminStatus is down then the interface state is deliberate and there is no recovery. If the ifAdminStatus is up then try to determine the cause of the interface going down: cable cut, distal end went down, etc. |
| 64-2091-1 | tmnxSysLicenseInvalid | Generated when the license becomes invalid for the reason specified in the log event/alarm. | The system will reboot at the end of the time remaining. | Configure a valid license file location and file name. |
| 64-2092-1 | tmnxSysLicenseExpiresSoon | Generated when the license is due to expire soon. | The system will reboot at the end of the time remaining. | Configure a valid license file location and file name. |

The linkDown Facility Alarm is supported for the objects listed in Table 91 (note that not all objects are supported on all platforms):

Table 91:  linkDown Facility Alarm Support  

| Object | Supported? |
| --- | --- |
| Ethernet Ports | Yes |
| Sonet Section, Line and Path (POS) | Yes |
| TDM Ports (E1, T1, DS3) including CES MDAs/CMAs | Yes |
| TDM Channels (DS3 channel configured in an STM-1 port) | Yes |
| ATM Ports | Yes |
| Ethernet LAGs | No |
| APS groups | No |
| Bundles (MLPPP, IMA, etc) | No |
| ATM channels, Ethernet VLANs, Frame Relay DLCIs | No |