Cisco Nexus vPC - Part 2 - Common Failure Scenarios
In the last part we looked at the core concepts of vPC, then set up a couple of switches in a vPC domain and connected a downstream switch.
Now we are going to look at some common vPC failure scenarios and how they appear on the switches and to downstream devices.
This is still our topology:
The configuration on the switches has not changed.
First, let's check the vPC status of each switch:
VPC-1# show vpc brief
Legend:
(*) - local vPC is down, forwarding via vPC peer-link
vPC domain id : 100
Peer status : peer adjacency formed ok
vPC keep-alive status : peer is alive
Configuration consistency status : success
Per-vlan consistency status : success
Type-2 consistency status : success
vPC role : primary
Number of vPCs configured : 1
Peer Gateway : Disabled
Dual-active excluded VLANs : -
Graceful Consistency Check : Enabled
Auto-recovery status : Disabled
Delay-restore status : Timer is off.(timeout = 30s)
Delay-restore SVI status : Timer is off.(timeout = 10s)
Delay-restore Orphan-port status : Timer is off.(timeout = 0s)
Operational Layer3 Peer-router : Disabled
Virtual-peerlink mode : Disabled
vPC Peer-link status
---------------------------------------------------------------------
id Port Status Active vlans
-- ---- ------ -------------------------------------------------
1 Po1 up 1
vPC status
----------------------------------------------------------------------------
Id Port Status Consistency Reason Active vlans
-- ------------ ------ ----------- ------ ---------------
2 Po2 up success success 1
VPC-2# show vpc brief
Legend:
(*) - local vPC is down, forwarding via vPC peer-link
vPC domain id : 100
Peer status : peer adjacency formed ok
vPC keep-alive status : peer is alive
Configuration consistency status : success
Per-vlan consistency status : success
Type-2 consistency status : success
vPC role : secondary
Number of vPCs configured : 1
Peer Gateway : Disabled
Dual-active excluded VLANs : -
Graceful Consistency Check : Enabled
Auto-recovery status : Disabled
Delay-restore status : Timer is off.(timeout = 30s)
Delay-restore SVI status : Timer is off.(timeout = 10s)
Delay-restore Orphan-port status : Timer is off.(timeout = 0s)
Operational Layer3 Peer-router : Disabled
Virtual-peerlink mode : Disabled
vPC Peer-link status
---------------------------------------------------------------------
id Port Status Active vlans
-- ---- ------ -------------------------------------------------
1 Po1 up 1
vPC status
----------------------------------------------------------------------------
Id Port Status Consistency Reason Active vlans
-- ------------ ------ ----------- ------ ---------------
2 Po2 up success success 1
Both switches look good at the moment. Take note of the vPC role on each.
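If you would rather compare these fields from a script than by eye, the "name : value" lines of the output are easy to collect into a dictionary. This is a minimal sketch only: the function name parse_vpc_brief is made up for illustration, and it assumes you have already captured the text of show vpc brief by whatever means you normally use.

```python
def parse_vpc_brief(output: str) -> dict:
    """Collect the 'name : value' lines of 'show vpc brief' into a dict.

    Hypothetical helper for illustration; it is not part of any Cisco tool.
    """
    fields = {}
    for line in output.splitlines():
        # Split on the first colon only, so values that contain '=' or
        # further punctuation (for example the timer lines) stay intact.
        name, sep, value = line.partition(":")
        # Skip lines with no colon (table rows, rulers) and lines such as
        # 'Legend:' that have nothing after the colon.
        if sep and value.strip():
            fields[name.strip()] = value.strip()
    return fields


sample = """\
Legend:
vPC domain id : 100
Peer status : peer adjacency formed ok
vPC keep-alive status : peer is alive
vPC role : primary
"""

status = parse_vpc_brief(sample)
print(status["vPC role"])
print(status["Peer status"])
```

The table sections at the bottom of the output (peer-link and vPC status) have no colons, so this sketch simply ignores them.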
A vPC peer is rebooted for any reason
Let's reboot VPC-1, as if we were doing some planned maintenance or an upgrade.
We should see the peer link go down and the keepalives fail.
This is what we see on VPC-2:
VPC-2# show vpc brief
Legend:
(*) - local vPC is down, forwarding via vPC peer-link
vPC domain id : 100
Peer status : peer link is down
vPC keep-alive status : Suspended (Destination IP not reachable)
Configuration consistency status : success
Per-vlan consistency status : success
Type-2 consistency status : success
vPC role : secondary, operational primary
Number of vPCs configured : 1
Peer Gateway : Disabled
Dual-active excluded VLANs : -
Graceful Consistency Check : Enabled
Auto-recovery status : Disabled
Delay-restore status : Timer is off.(timeout = 30s)
Delay-restore SVI status : Timer is off.(timeout = 10s)
Delay-restore Orphan-port status : Timer is off.(timeout = 0s)
Operational Layer3 Peer-router : Disabled
Virtual-peerlink mode : Disabled
vPC Peer-link status
---------------------------------------------------------------------
id Port Status Active vlans
-- ---- ------ -------------------------------------------------
1 Po1 down -
vPC status
----------------------------------------------------------------------------
Id Port Status Consistency Reason Active vlans
-- ------------ ------ ----------- ------ ---------------
2 Po2 up success success 1
The status shows that the peer link is down and that the keepalive has been suspended. We can also see that VPC-2 has moved to operational primary, as it's the only switch left. It's worth noting that nothing has changed connectivity-wise: the downstream switch still has a connection to VPC-2 and the EtherChannel is still up:
Switch-1#show etherchannel summary | begin Group
Group Port-channel Protocol Ports
------+-------------+-----------+-----------------------------------------------
1 Po1(SU) LACP Gi0/0(s) Gi0/1(P)
The interesting caveat here is that the roles will not change back once the other switch comes back up, because there is no preemption in vPC.
Now that VPC-1 is back up, let's check the status:
VPC-2# show vpc brief
Legend:
(*) - local vPC is down, forwarding via vPC peer-link
vPC domain id : 100
Peer status : peer adjacency formed ok
vPC keep-alive status : peer is alive
Configuration consistency status : success
Per-vlan consistency status : success
Type-2 consistency status : success
vPC role : secondary, operational primary
Number of vPCs configured : 1
Peer Gateway : Disabled
Dual-active excluded VLANs : -
Graceful Consistency Check : Enabled
Auto-recovery status : Disabled
Delay-restore status : Timer is off.(timeout = 30s)
Delay-restore SVI status : Timer is off.(timeout = 10s)
Delay-restore Orphan-port status : Timer is off.(timeout = 0s)
Operational Layer3 Peer-router : Disabled
Virtual-peerlink mode : Disabled
vPC Peer-link status
---------------------------------------------------------------------
id Port Status Active vlans
-- ---- ------ -------------------------------------------------
1 Po1 up 1
vPC status
----------------------------------------------------------------------------
Id Port Status Consistency Reason Active vlans
-- ------------ ------ ----------- ------ ---------------
2 Po2 up success success 1
VPC-2 is still operational primary. And what does VPC-1 think it is?
VPC-1# show vpc brief | inc role
vPC role : primary, operational secondary
It's operational secondary. So the switches have kept their configured roles, but when the primary switch reboots, the peer switch takes over as primary from an operational standpoint and keeps that role after the original primary returns.
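To make the lack of preemption concrete, here is a toy model of the role behaviour seen in this scenario. It is a sketch of what the outputs above show, not Cisco's actual election logic, and the class and function names (VpcPeer, peer_lost, peer_returned) are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class VpcPeer:
    name: str
    configured_role: str          # "primary" or "secondary", fixed by config
    operational_role: str = ""    # what the switch is currently acting as

    def __post_init__(self):
        # Until something fails, a switch acts in its configured role.
        if not self.operational_role:
            self.operational_role = self.configured_role

    def role_field(self) -> str:
        """Format the role the way the 'vPC role' line displays it."""
        if self.operational_role == self.configured_role:
            return self.configured_role
        return f"{self.configured_role}, operational {self.operational_role}"


def peer_lost(survivor: VpcPeer) -> None:
    # The only switch left acts as primary.
    survivor.operational_role = "primary"


def peer_returned(survivor: VpcPeer, returning: VpcPeer) -> None:
    # No preemption: the returning switch takes whichever role is free
    # and the survivor keeps the role it already holds.
    returning.operational_role = (
        "secondary" if survivor.operational_role == "primary" else "primary"
    )


vpc1 = VpcPeer("VPC-1", "primary")
vpc2 = VpcPeer("VPC-2", "secondary")

peer_lost(vpc2)             # VPC-1 reboots
peer_returned(vpc2, vpc1)   # VPC-1 comes back up

print(vpc2.role_field())    # secondary, operational primary
print(vpc1.role_field())    # primary, operational secondary
```

The two printed strings match the vPC role lines shown by VPC-2 and VPC-1 after the reboot.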
Peer link drops
This scenario covers the unlikely event that both links in the vPC peer-link port-channel drop, whether through the links being shut down accidentally, physical cable damage or even hardware failure. Because we have just rebooted VPC-1 in the earlier scenario, this scenario is shown with the roles reset to primary and secondary, as they were at the start.
VPC-1(config)# show vpc brief
...
vPC domain id : 100
Peer status : peer link is down
vPC keep-alive status : peer is alive
...
vPC role : primary
...
Virtual-peerlink mode : Disabled
vPC Peer-link status
---------------------------------------------------------------------
id Port Status Active vlans
-- ---- ------ -------------------------------------------------
1 Po1 down -
vPC status
----------------------------------------------------------------------------
Id Port Status Consistency Reason Active vlans
-- ------------ ------ ----------- ------ ---------------
2 Po2 up success success 1
And on VPC-2:
VPC-2# show vpc brief
Legend:
(*) - local vPC is down, forwarding via vPC peer-link
vPC domain id : 100
Peer status : peer link is down
vPC keep-alive status : peer is alive
Configuration consistency status : success
Per-vlan consistency status : success
Type-2 consistency status : success
vPC role : secondary
Number of vPCs configured : 1
Peer Gateway : Disabled
Dual-active excluded VLANs : -
Graceful Consistency Check : Enabled
Auto-recovery status : Disabled
Delay-restore status : Timer is off.(timeout = 30s)
Delay-restore SVI status : Timer is off.(timeout = 10s)
Delay-restore Orphan-port status : Timer is off.(timeout = 0s)
Operational Layer3 Peer-router : Disabled
Virtual-peerlink mode : Disabled
vPC Peer-link status
---------------------------------------------------------------------
id Port Status Active vlans
-- ---- ------ -------------------------------------------------
1 Po1 down -
vPC status
----------------------------------------------------------------------------
Id Port Status Consistency Reason Active vlans
-- ------------ ------ ----------- ------ ---------------
2 Po2 down failed Peer-link is down -
On the secondary switch we also see that the peer link is down. However, the downstream vPC (Po2) is down as well, and the reason given is that the peer link is down. Essentially, if the vPC peer link goes down while the peer is still reachable via the keepalive, the secondary (or operational secondary) switch will force its vPCs into a down state:
VPC-2# show int po2
port-channel2 is down (suspended by vpc)
...
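The secondary's behaviour in this scenario and in the earlier reboot scenario can be summarised as a small decision function. This is only a sketch of what the outputs in this post show, under the assumption that the secondary decides purely on peer-link and keepalive state; the function name and the returned strings are made up for illustration and are not NX-OS terminology.

```python
def secondary_vpc_action(peer_link_up: bool, keepalive_alive: bool) -> str:
    """Describe what the secondary switch does with its vPCs."""
    if peer_link_up:
        # Normal operation: both peers forward on their vPC member ports.
        return "keep vPCs up"
    if keepalive_alive:
        # Peer link down but the primary is still alive: the secondary
        # suspends its vPCs (seen here as 'suspended by vpc').
        return "suspend vPCs"
    # Peer link down and no keepalive: the secondary treats the peer as
    # gone, as in the reboot scenario, and takes over.
    return "keep vPCs up, become operational primary"


print(secondary_vpc_action(peer_link_up=False, keepalive_alive=True))
```

The middle branch is the one shown in this section: the keepalive still reports "peer is alive", so VPC-2 suspends Po2 rather than risk both switches forwarding independently.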
Let's take this one step further...
Peer link and keepalive drops
When this happens there is a lot going on, as multiple links have failed at this point.
This is a bad scenario to get into, because neither switch has any way of knowing whether its peer is up or down:
VPC-1# show vpc brief
Legend:
(*) - local vPC is down, forwarding via vPC peer-link
vPC domain id : 100
Peer status : peer link is down
vPC keep-alive status : Suspended (Destination IP not reachable)
Configuration consistency status : success
Per-vlan consistency status : success
Type-2 consistency status : success
vPC role : primary
Number of vPCs configured : 1
Peer Gateway : Disabled
Dual-active excluded VLANs : -
Graceful Consistency Check : Enabled
Auto-recovery status : Disabled
Delay-restore status : Timer is off.(timeout = 30s)
Delay-restore SVI status : Timer is off.(timeout = 10s)
Delay-restore Orphan-port status : Timer is off.(timeout = 0s)
Operational Layer3 Peer-router : Disabled
Virtual-peerlink mode : Disabled
vPC Peer-link status
---------------------------------------------------------------------
id Port Status Active vlans
-- ---- ------ -------------------------------------------------
1 Po1 down -
vPC status
----------------------------------------------------------------------------
Id Port Status Consistency Reason Active vlans
-- ------------ ------ ----------- ------ ---------------
2 Po2 up success success 1
VPC-2# show vpc brief
Legend:
(*) - local vPC is down, forwarding via vPC peer-link
vPC domain id : 100
Peer status : peer link is down
vPC keep-alive status : Suspended (Destination IP not reachable)
Configuration consistency status : success
Per-vlan consistency status : success
Type-2 consistency status : success
vPC role : secondary, operational primary
Number of vPCs configured : 1
Peer Gateway : Disabled
Dual-active excluded VLANs : -
Graceful Consistency Check : Enabled
Auto-recovery status : Disabled
Delay-restore status : Timer is off.(timeout = 30s)
Delay-restore SVI status : Timer is off.(timeout = 10s)
Delay-restore Orphan-port status : Timer is off.(timeout = 0s)
Operational Layer3 Peer-router : Disabled
Virtual-peerlink mode : Disabled
vPC Peer-link status
---------------------------------------------------------------------
id Port Status Active vlans
-- ---- ------ -------------------------------------------------
1 Po1 down -
vPC status
----------------------------------------------------------------------------
Id Port Status Consistency Reason Active vlans
-- ------------ ------ ----------- ------ ---------------
2 Po2 up success success 1
At this point, we have a split-brain scenario. Both switches keep their vPCs up because they can't see each other. There are all sorts of problems here: devices connected to each switch will likely have connectivity issues, depending on how the port-channel hashes traffic across the two switches, not to mention STP issues. This is why we protect these links with redundancy.
If you find yourself in this scenario, your best course of action is to take one of the switches, or at least its links, out of service. The downstream and upstream devices will then drop their interfaces to that switch and leave you with a single vPC member, which removes the possibility of a loop.
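One simple way to spot this condition from monitoring is to compare the vPC role line from both switches: in the outputs above, VPC-1 reports "primary" and VPC-2 reports "secondary, operational primary", so both are acting as primary at the same time. The sketch below encodes that check. The function names are hypothetical, and it assumes you have already collected the role strings from each switch.

```python
def operational_role(role_field: str) -> str:
    """Return the role a switch is acting in, given its 'vPC role' value.

    'secondary, operational primary' -> 'primary'
    'primary'                        -> 'primary'
    """
    marker = "operational "
    if marker in role_field:
        return role_field.split(marker, 1)[1].strip()
    return role_field.strip()


def looks_like_split_brain(role_a: str, role_b: str) -> bool:
    """True when both peers report that they are acting as primary."""
    return operational_role(role_a) == "primary" and operational_role(role_b) == "primary"


# Role lines from VPC-1 and VPC-2 in this scenario.
print(looks_like_split_brain("primary", "secondary, operational primary"))  # True
# Role lines after the reboot scenario, where the peers are in sync.
print(looks_like_split_brain("primary, operational secondary",
                             "secondary, operational primary"))  # False
```

This only tells you that both switches believe they are primary; it says nothing about why, so treat it as a prompt to go and check the peer link and keepalive path.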
Switch Failure on cold boot
This is the last failure scenario we will go through in this part. It's a rather niche scenario, but it has happened to me in production, so I thought I would include it.
Let's say you have a complete power failure or spike, in which power to both vPC peer switches is lost. These switches would of course need to be powered back on. However, what if VPC-2 suffered hardware damage during the power cut and will not boot?
You would be forgiven for assuming that VPC-1 would boot up and take the primary role. However, that's not the case, at least by default.
This is the status of VPC-1:
VPC-1# show vpc brief
Legend:
(*) - local vPC is down, forwarding via vPC peer-link
vPC domain id : 100
Peer status : peer link is down
vPC keep-alive status : Suspended (Destination IP not reachable)
Configuration consistency status : failed
Per-vlan consistency status : success
Configuration inconsistency reason: Consistency Check Not Performed
Type-2 inconsistency reason : Consistency Check Not Performed
vPC role : none established
Number of vPCs configured : 1
Peer Gateway : Disabled
Dual-active excluded VLANs : -
Graceful Consistency Check : Disabled (due to peer configuration)
Auto-recovery status : Disabled
Delay-restore status : Timer is off.(timeout = 30s)
Delay-restore SVI status : Timer is off.(timeout = 10s)
Delay-restore Orphan-port status : Timer is off.(timeout = 0s)
Operational Layer3 Peer-router : Disabled
Virtual-peerlink mode : Disabled
vPC Peer-link status
---------------------------------------------------------------------
id Port Status Active vlans
-- ---- ------ -------------------------------------------------
1 Po1 down -
vPC status
----------------------------------------------------------------------------
Id Port Status Consistency Reason Active vlans
-- ------------ ------ ----------- ------ ---------------
2 Po2 down failed Peer-link is down -
As we can see, this is a bit of a mess. The vPC peer link is down because the other switch is inoperable, and the keepalive is down for the same reason. By default, the switch protects itself by suspending all its vPCs.
What does this mean for our downstream switch? Well, its port-channel interface is down:
Switch-1#show etherchannel summary | begin Group
Group Port-channel Protocol Ports
------+-------------+-----------+-----------------------------------------------
1 Po1(SD) LACP Gi0/0(s) Gi0/1(s)
Switch-1#show int desc | inc Po1
Po1 down down
So at this point we have a working switch, but it refuses to become operational in the vPC environment. In the above output from VPC-1 we can see that Auto-recovery status is Disabled. Auto-recovery is the mechanism that allows the switch to begin participating when this scenario unfolds. We need to enable it under the vPC domain:
vpc domain 100
auto-recovery
When we do this, we see it becomes enabled, and the timer is off by default:
VPC-1# show vpc brief | inc Auto-recovery
Auto-recovery status : Enabled, timer is off.(timeout = 240s)
If you have to enable this ad hoc, you must reload the switch for it to take effect.
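Since auto-recovery being disabled is easy to miss until the day you need it, it can be worth checking the Auto-recovery status line as part of an audit. The sketch below pulls the enabled flag and the timeout out of that value. The function name is invented for illustration, and the value format is assumed to match the two forms shown in this post ("Disabled" and "Enabled, timer is off.(timeout = 240s)").

```python
import re


def parse_auto_recovery(value: str):
    """Return (enabled, timeout_seconds) from an 'Auto-recovery status' value.

    timeout_seconds is None when the value carries no timeout.
    """
    enabled = value.strip().lower().startswith("enabled")
    match = re.search(r"timeout\s*=\s*(\d+)s", value)
    timeout = int(match.group(1)) if match else None
    return enabled, timeout


print(parse_auto_recovery("Disabled"))
print(parse_auto_recovery("Enabled, timer is off.(timeout = 240s)"))
```

A check like this could flag any vPC switch where the first element comes back False.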
Once auto-recovery has taken effect, this is what we are left with:
VPC-1# show vpc brief
Legend:
(*) - local vPC is down, forwarding via vPC peer-link
vPC domain id : 100
Peer status : peer link is down
vPC keep-alive status : Suspended (Destination IP not reachable)
Configuration consistency status : failed
Per-vlan consistency status : success
Configuration inconsistency reason: Consistency Check Not Performed
Type-2 inconsistency reason : Consistency Check Not Performed
vPC role : primary
Number of vPCs configured : 1
Peer Gateway : Disabled
Dual-active excluded VLANs : -
Graceful Consistency Check : Disabled (due to peer configuration)
Auto-recovery status : Enabled, timer is off.(timeout = 60s)
Delay-restore status : Timer is off.(timeout = 30s)
Delay-restore SVI status : Timer is off.(timeout = 10s)
Delay-restore Orphan-port status : Timer is off.(timeout = 0s)
Operational Layer3 Peer-router : Disabled
Virtual-peerlink mode : Disabled
vPC Peer-link status
---------------------------------------------------------------------
id Port Status Active vlans
-- ---- ------ -------------------------------------------------
1 Po1 down -
vPC status
----------------------------------------------------------------------------
Id Port Status Consistency Reason Active vlans
-- ------------ ------ ----------- ------ ---------------
2 Po2 up success Type checks were 1
bypassed for the vPC
Opinions are divided on leaving auto-recovery enabled, as it can cause a split-brain scenario in some cases, although with a scenario like this one that is rare. I would read up on its pros and cons before deciding to implement it permanently on your switches.
That concludes the common (and one not-so-common) vPC failure scenarios. Some of these are worse than others, but they all show the need to make these connections resilient and robust.