Cisco Nexus Switch Process Separation and Restartability

Cisco NX-OS isolates software components from one another so that a failure within one process does not disrupt others. In the Cisco NX-OS software, independent processes, known as services, perform a function or set of functions for a subsystem or feature set. Each service and service instance runs as an independent, protected process. This approach provides a highly fault-tolerant software infrastructure and fault isolation between services. A failure in a service instance (such as Border Gateway Protocol (BGP)) does not affect other services running at that time, such as the Link Aggregation Control Protocol (LACP). Because each instance of a service can run as an independent process, two instances of a routing protocol (for example, two instances of the OSPF protocol) can run as separate processes.

The Cisco NX-OS service restart features allow you to restart a faulty service without restarting the supervisor, which prevents process-level failures from causing system-level failures. How a service is restarted depends on the current errors, the failure circumstances, and the high-availability policy for the service. A service can undergo either a stateful or a stateless restart. Cisco NX-OS allows services to store runtime state information and messages for a stateful restart. In a stateful restart, the service retrieves this stored state information and resumes operations from the last checkpointed service state. In a stateless restart, the service initializes and runs as if it had just been started, with no prior state.
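The difference between the two restart types can be illustrated with a small Python model. This is only a conceptual sketch, not NX-OS code: `CheckpointStore`, `CounterService`, and the `events_handled` field are hypothetical names invented for the example, and the store is simply assumed to outlive the service that writes to it.

```python
from dataclasses import dataclass, field


@dataclass
class CheckpointStore:
    """Toy stand-in for a state store that survives a service restart."""
    _saved: dict = field(default_factory=dict)

    def save(self, service: str, state: dict) -> None:
        # Store a copy so later changes to the live state are not persisted.
        self._saved[service] = dict(state)

    def load(self, service: str) -> dict:
        # Returns an empty dict if the service never made a checkpoint.
        return dict(self._saved.get(service, {}))


class CounterService:
    """Hypothetical service whose runtime state is a count of handled events."""

    def __init__(self, name: str, store: CheckpointStore) -> None:
        self.name = name
        self.store = store
        self.state = {"events_handled": 0}

    def handle_event(self) -> None:
        self.state["events_handled"] += 1

    def checkpoint(self) -> None:
        self.store.save(self.name, self.state)

    def restart(self, stateful: bool) -> None:
        fresh = {"events_handled": 0}
        if stateful:
            # Stateful: resume from the last checkpoint, if one exists.
            self.state = self.store.load(self.name) or fresh
        else:
            # Stateless: start as if the service had never run.
            self.state = fresh


store = CheckpointStore()
svc = CounterService("demo", store)
svc.handle_event()
svc.handle_event()
svc.checkpoint()      # state with two events is saved
svc.handle_event()    # third event happens after the checkpoint
svc.restart(stateful=True)
after_stateful = svc.state["events_handled"]   # third event is lost
svc.restart(stateful=False)
after_stateless = svc.state["events_handled"]  # all state is lost
```

In this model, work done after the last checkpoint is lost even in a stateful restart, which is why a service would checkpoint whenever its state changes in a way it needs to recover.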

Not all services are designed for a stateful restart. For example, Cisco NX-OS does not store runtime state information for Layer 3 routing protocols such as Open Shortest Path First (OSPF) and Routing Information Protocol (RIP). Their configuration settings are preserved across a restart, but these protocols are designed to rebuild their operational state using information obtained from neighbor routers.

Backend management and orchestration of processes and services supporting stateful restarts are handled by a set of high-level system-control services:

System Manager: The system manager directs overall system function, service management, and system health monitoring and enforces high-availability policies. The system manager is responsible for launching, stopping, monitoring, and restarting services as well as initiating and managing the synchronization of service states and supervisor states for a stateful switchover.
Persistent storage service: Cisco NX-OS services use the persistent storage service (PSS) to store and manage operational runtime information. The PSS component works with system services to recover states in the event of a service restart. PSS functions as a database of state and runtime information that allows services to make a checkpoint of their state information whenever needed. A restarting service can recover the last-known operating state that preceded a failure, which allows for a stateful restart. Each service that uses PSS can define its stored information as private (it can be read only by that service) or shared (the information can be read by other services). If the information is shared, the service can specify that it is local (the information can be read only by services on the same supervisor) or global (it can be read by services on either supervisor or on modules). For example, if the PSS information of a service is defined as shared and global, services on other modules can synchronize with the PSS information of the service that runs on the active supervisor.
Message and transaction service: The message and transaction service (MTS) is a high-performance interprocess communications (IPC) message broker. MTS handles message routing and queuing between services on and across modules and between supervisors. MTS facilitates the exchange of messages such as event notification, synchronization, and message persistency between system services and system components. MTS can maintain persistent messages and logged messages in queues for access even after a service restart.
High Availability (HA) policies: Cisco NX-OS allows each service to have an associated set of internal HA policies that define how a failed service is restarted. Each service can have four defined policies: a primary policy and a secondary policy when two supervisors are present, and a primary policy and a secondary policy when only one supervisor is present. If no HA policy is defined for a service, the default HA policy to be performed upon a service failure is a switchover (if two supervisors are present) or a supervisor reset (if only one supervisor is present).
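Two of the rules described above, the PSS visibility scopes and the selection of an HA policy, can be sketched as a toy Python model. Everything here is hypothetical and for illustration only: the `Action` names other than the two defaults, the `HaPolicies` and `PssEntry` types, and both functions are invented for the example. The sketch also makes two assumptions that the text does not state: that the secondary policy is the one applied when the primary policy has not recovered the service, and that an undefined individual policy falls back to the default.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    # Hypothetical action names. The text names only the two defaults
    # (switchover and supervisor reset) explicitly.
    STATEFUL_RESTART = "stateful-restart"
    STATELESS_RESTART = "stateless-restart"
    SWITCHOVER = "switchover"
    SUPERVISOR_RESET = "supervisor-reset"


@dataclass(frozen=True)
class HaPolicies:
    """The four per-service policies; None means 'not defined'."""
    dual_primary: Optional[Action] = None
    dual_secondary: Optional[Action] = None
    single_primary: Optional[Action] = None
    single_secondary: Optional[Action] = None


def recovery_action(policies: Optional[HaPolicies],
                    dual_supervisor: bool,
                    primary_failed: bool = False) -> Action:
    """Pick the action for a failed service.

    Assumption: the secondary policy is used once the primary has failed.
    """
    default = Action.SWITCHOVER if dual_supervisor else Action.SUPERVISOR_RESET
    if policies is None:
        return default
    if dual_supervisor:
        primary, secondary = policies.dual_primary, policies.dual_secondary
    else:
        primary, secondary = policies.single_primary, policies.single_secondary
    chosen = secondary if primary_failed else primary
    # Assumption: an undefined individual policy also falls back to the default.
    return chosen if chosen is not None else default


@dataclass(frozen=True)
class PssEntry:
    """Stored state with the visibility flags described for PSS."""
    owner: str
    shared: bool = False        # False = private to the owning service
    global_scope: bool = False  # only meaningful when shared is True


def can_read(entry: PssEntry, reader: str, same_supervisor: bool) -> bool:
    """Return True if `reader` may read the entry under the scopes above."""
    if reader == entry.owner:
        return True   # simplification: the owner can always read its own data
    if not entry.shared:
        return False  # private: no other service may read it
    # Shared and local: same supervisor only. Shared and global: anywhere.
    return entry.global_scope or same_supervisor
```

For example, `recovery_action(None, dual_supervisor=True)` returns `Action.SWITCHOVER`, matching the default for a service with no HA policy on a dual-supervisor system, and `can_read(PssEntry("ospf", shared=True), "rib", same_supervisor=False)` returns `False` because the entry is shared but local.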
