Cisco Nexus vPC - Part 2 - Common Failure Scenarios

In the last part we looked at the core concepts of vPC, then set up a couple of switches in a vPC domain and connected a downstream switch.

Now, we are going to look at some common failure scenarios of vPC and how they look on the switches and for downstream devices.

This is still our topology:

Image

The configuration on the switches has not changed.

Initially, let's see what the vPC status is on each switch:

VPC-1# show vpc brief 
Legend:
                (*) - local vPC is down, forwarding via vPC peer-link

vPC domain id                     : 100 
Peer status                       : peer adjacency formed ok      
vPC keep-alive status             : peer is alive                 
Configuration consistency status  : success 
Per-vlan consistency status       : success                       
Type-2 consistency status         : success 
vPC role                          : primary                       
Number of vPCs configured         : 1   
Peer Gateway                      : Disabled
Dual-active excluded VLANs        : -
Graceful Consistency Check        : Enabled
Auto-recovery status              : Disabled
Delay-restore status              : Timer is off.(timeout = 30s)
Delay-restore SVI status          : Timer is off.(timeout = 10s)
Delay-restore Orphan-port status  : Timer is off.(timeout = 0s)
Operational Layer3 Peer-router    : Disabled
Virtual-peerlink mode             : Disabled

vPC Peer-link status
---------------------------------------------------------------------
id    Port   Status Active vlans    
--    ----   ------ -------------------------------------------------
1     Po1    up     1                                                           

vPC status
----------------------------------------------------------------------------
Id    Port          Status Consistency Reason                Active vlans
--    ------------  ------ ----------- ------                ---------------
2     Po2           up     success     success               1

And on VPC-2:

VPC-2# show vpc brief
Legend:
                (*) - local vPC is down, forwarding via vPC peer-link

vPC domain id                     : 100 
Peer status                       : peer adjacency formed ok      
vPC keep-alive status             : peer is alive                 
Configuration consistency status  : success 
Per-vlan consistency status       : success                       
Type-2 consistency status         : success 
vPC role                          : secondary                     
Number of vPCs configured         : 1   
Peer Gateway                      : Disabled
Dual-active excluded VLANs        : -
Graceful Consistency Check        : Enabled
Auto-recovery status              : Disabled
Delay-restore status              : Timer is off.(timeout = 30s)
Delay-restore SVI status          : Timer is off.(timeout = 10s)
Delay-restore Orphan-port status  : Timer is off.(timeout = 0s)
Operational Layer3 Peer-router    : Disabled
Virtual-peerlink mode             : Disabled

vPC Peer-link status
---------------------------------------------------------------------
id    Port   Status Active vlans    
--    ----   ------ -------------------------------------------------
1     Po1    up     1                                                           

vPC status
----------------------------------------------------------------------------
Id    Port          Status Consistency Reason                Active vlans
--    ------------  ------ ----------- ------                ---------------
2     Po2           up     success     success               1                  

Both switches look good at the moment. Take note of the vPC role on each.
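If you want to compare these fields across switches without reading the whole output, a small script can pull them out. The following is a minimal Python sketch, not a Cisco tool: parse_vpc_brief is a hypothetical helper name, and it assumes the output has been saved as plain text in the format shown above.

```python
def parse_vpc_brief(text):
    """Return the 'name : value' header fields of `show vpc brief` as a dict.

    Hypothetical helper: assumes the plain-text layout shown in this post.
    """
    fields = {}
    for line in text.splitlines():
        # Header lines look like "vPC role   : primary".
        # Table rows, separators and the prompt contain no colon.
        name, sep, value = line.partition(":")
        name, value = name.strip(), value.strip()
        # Skip lines such as "Legend:" that have nothing after the colon.
        if sep and name and value:
            fields[name] = value
    return fields


sample = """\
VPC-1# show vpc brief
Legend:
vPC domain id                     : 100
Peer status                       : peer adjacency formed ok
vPC keep-alive status             : peer is alive
vPC role                          : primary
"""

fields = parse_vpc_brief(sample)
print(fields["vPC role"])  # primary
```

The keys are the field names exactly as the switch prints them, so "Peer status" and "vPC keep-alive status" can be looked up the same way.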

A vPC peer is rebooted for any reason

Let's reboot VPC-1, as if we were doing some planned maintenance or an upgrade.

We should see the peer link go down and the keepalives fail.

This is what we see on VPC-2:

VPC-2# show vpc brief
Legend:
                (*) - local vPC is down, forwarding via vPC peer-link

vPC domain id                     : 100 
Peer status                       : peer link is down             
vPC keep-alive status             : Suspended (Destination IP not reachable)
Configuration consistency status  : success 
Per-vlan consistency status       : success                       
Type-2 consistency status         : success 
vPC role                          : secondary, operational primary
Number of vPCs configured         : 1   
Peer Gateway                      : Disabled
Dual-active excluded VLANs        : -
Graceful Consistency Check        : Enabled
Auto-recovery status              : Disabled
Delay-restore status              : Timer is off.(timeout = 30s)
Delay-restore SVI status          : Timer is off.(timeout = 10s)
Delay-restore Orphan-port status  : Timer is off.(timeout = 0s)
Operational Layer3 Peer-router    : Disabled
Virtual-peerlink mode             : Disabled

vPC Peer-link status
---------------------------------------------------------------------
id    Port   Status Active vlans    
--    ----   ------ -------------------------------------------------
1     Po1    down   -                                                           

vPC status
----------------------------------------------------------------------------
Id    Port          Status Consistency Reason                Active vlans
--    ------------  ------ ----------- ------                ---------------
2     Po2           up     success     success               1                  

The status shows us that the peer link is down and that the keepalive has been suspended. We can also see that the role of VPC-2 has moved to operational primary, as it's the only switch left. It's worth noting that nothing has changed connectivity-wise: the downstream switch still has a connection to VPC-2 and the EtherChannel is still up:

Switch-1#show etherchannel summary | begin Group
Group  Port-channel  Protocol    Ports
------+-------------+-----------+-----------------------------------------------
1      Po1(SU)         LACP      Gi0/0(s)    Gi0/1(P)    

The interesting caveat here is that this will not change once the other switch comes back up. By default, there is no preemption in vPC.

Now that VPC-1 is back up, let's check the status:

VPC-2# show vpc brief
Legend:
                (*) - local vPC is down, forwarding via vPC peer-link

vPC domain id                     : 100 
Peer status                       : peer adjacency formed ok      
vPC keep-alive status             : peer is alive                 
Configuration consistency status  : success 
Per-vlan consistency status       : success                       
Type-2 consistency status         : success 
vPC role                          : secondary, operational primary
Number of vPCs configured         : 1   
Peer Gateway                      : Disabled
Dual-active excluded VLANs        : -
Graceful Consistency Check        : Enabled
Auto-recovery status              : Disabled
Delay-restore status              : Timer is off.(timeout = 30s)
Delay-restore SVI status          : Timer is off.(timeout = 10s)
Delay-restore Orphan-port status  : Timer is off.(timeout = 0s)
Operational Layer3 Peer-router    : Disabled
Virtual-peerlink mode             : Disabled

vPC Peer-link status
---------------------------------------------------------------------
id    Port   Status Active vlans    
--    ----   ------ -------------------------------------------------
1     Po1    up     1                                                           

vPC status
----------------------------------------------------------------------------
Id    Port          Status Consistency Reason                Active vlans
--    ------------  ------ ----------- ------                ---------------
2     Po2           up     success     success               1                  

VPC-2 is still operational primary. Here is what VPC-1 thinks it is:

VPC-1# show vpc brief | inc role
vPC role                          : primary, operational secondary

It's operational secondary. So the switches have kept their configured roles, but when the primary switch reboots, the peer switch takes over as primary from an operational standpoint.
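The "vPC role" field therefore carries two pieces of information: the configured role and, when it differs, the operational role. As a rough illustration, here is a small Python sketch that separates the two. The function name split_role is made up for this example, and it assumes the field always uses the "role, operational role" wording seen in the outputs above.

```python
def split_role(role_field):
    """Split a 'vPC role' value into (configured, operational) roles.

    Hypothetical helper based only on the strings shown in this post.
    """
    configured, _, rest = role_field.partition(",")
    configured = configured.strip()
    rest = rest.strip()
    prefix = "operational "
    if rest.startswith(prefix):
        operational = rest[len(prefix):]
    else:
        # No operational override printed: both roles are the same.
        operational = configured
    return configured, operational


print(split_role("primary"))                         # ('primary', 'primary')
print(split_role("secondary, operational primary"))  # ('secondary', 'primary')
print(split_role("primary, operational secondary"))  # ('primary', 'secondary')
```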

Peer link drops

In this scenario, both links in the vPC peer-link port-channel drop, whether through accidentally shutting the links down, physical cable damage or hardware failure. It is unlikely, but it can happen. Given that we have just rebooted VPC-1 in the earlier scenario, this scenario is shown with the roles reset to primary and secondary, as they were at the start. This is what we see on VPC-1:

VPC-1(config)# show vpc brief 
...
vPC domain id                     : 100 
Peer status                       : peer link is down             
vPC keep-alive status             : peer is alive                 
...
vPC role                          : primary                       
...
Virtual-peerlink mode             : Disabled

vPC Peer-link status
---------------------------------------------------------------------
id    Port   Status Active vlans    
--    ----   ------ -------------------------------------------------
1     Po1    down   -                                                           

vPC status
----------------------------------------------------------------------------
Id    Port          Status Consistency Reason                Active vlans
--    ------------  ------ ----------- ------                ---------------
2     Po2           up     success     success               1                  

And on VPC-2:

VPC-2# show vpc brief 
Legend:
                (*) - local vPC is down, forwarding via vPC peer-link

vPC domain id                     : 100 
Peer status                       : peer link is down             
vPC keep-alive status             : peer is alive                 
Configuration consistency status  : success 
Per-vlan consistency status       : success                       
Type-2 consistency status         : success 
vPC role                          : secondary                     
Number of vPCs configured         : 1   
Peer Gateway                      : Disabled
Dual-active excluded VLANs        : -
Graceful Consistency Check        : Enabled
Auto-recovery status              : Disabled
Delay-restore status              : Timer is off.(timeout = 30s)
Delay-restore SVI status          : Timer is off.(timeout = 10s)
Delay-restore Orphan-port status  : Timer is off.(timeout = 0s)
Operational Layer3 Peer-router    : Disabled
Virtual-peerlink mode             : Disabled

vPC Peer-link status
---------------------------------------------------------------------
id    Port   Status Active vlans    
--    ----   ------ -------------------------------------------------
1     Po1    down   -                                                           

vPC status
----------------------------------------------------------------------------
Id    Port          Status Consistency Reason                Active vlans
--    ------------  ------ ----------- ------                ---------------
2     Po2           down   failed      Peer-link is down     -      

On the secondary switch, we also see that the peer link is down. However, the downstream vPC to the switch is down as well, and this is due to the peer link being down. Essentially, if the vPC peer link goes down and the peer is still reachable via the keepalive, the secondary (or operational secondary) switch will force its vPCs into a down state:

VPC-2# show int po2
port-channel2 is down (suspended by vpc)
...
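To keep the two behaviours we have seen so far apart, it can help to write the secondary switch's reaction down as a tiny decision function. This is only a simplified Python model of what the outputs in this post show, not how NX-OS is implemented. The name secondary_reaction is invented for the example, and the "peer link up" branch is an assumption, since that case is not demonstrated here.

```python
def secondary_reaction(peer_link_up, keepalive_alive):
    """Simplified model of how the secondary vPC peer reacts to failures."""
    if peer_link_up:
        # Assumption: with the peer link up, the vPCs carry on as normal.
        return "no change"
    if keepalive_alive:
        # Peer link down but peer still alive: the secondary backs off.
        return "suspend vPC member ports"
    # Peer link down and no keepalive: the peer is presumed dead.
    return "become operational primary, keep vPCs up"


# Reboot of the primary: peer link and keepalive both lost.
print(secondary_reaction(False, False))  # become operational primary, keep vPCs up
# Peer link failure only: keepalive still answers.
print(secondary_reaction(False, True))   # suspend vPC member ports
```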

Let's take this one step further...

Peer link and keepalive drops

When this happens, there is a lot going on, as multiple links have failed at this point.

This is a bad scenario to get into, as neither switch has any way of knowing whether its peer is up or down:

VPC-1# show vpc brief 
Legend:
                (*) - local vPC is down, forwarding via vPC peer-link

vPC domain id                     : 100 
Peer status                       : peer link is down             
vPC keep-alive status             : Suspended (Destination IP not reachable)
Configuration consistency status  : success 
Per-vlan consistency status       : success                       
Type-2 consistency status         : success 
vPC role                          : primary                       
Number of vPCs configured         : 1   
Peer Gateway                      : Disabled
Dual-active excluded VLANs        : -
Graceful Consistency Check        : Enabled
Auto-recovery status              : Disabled
Delay-restore status              : Timer is off.(timeout = 30s)
Delay-restore SVI status          : Timer is off.(timeout = 10s)
Delay-restore Orphan-port status  : Timer is off.(timeout = 0s)
Operational Layer3 Peer-router    : Disabled
Virtual-peerlink mode             : Disabled

vPC Peer-link status
---------------------------------------------------------------------
id    Port   Status Active vlans    
--    ----   ------ -------------------------------------------------
1     Po1    down   -                                                           

vPC status
----------------------------------------------------------------------------
Id    Port          Status Consistency Reason                Active vlans
--    ------------  ------ ----------- ------                ---------------
2     Po2           up     success     success               1

And on VPC-2:

VPC-2# show vpc brief
Legend:
                (*) - local vPC is down, forwarding via vPC peer-link

vPC domain id                     : 100 
Peer status                       : peer link is down             
vPC keep-alive status             : Suspended (Destination IP not reachable)
Configuration consistency status  : success 
Per-vlan consistency status       : success                       
Type-2 consistency status         : success 
vPC role                          : secondary, operational primary
Number of vPCs configured         : 1   
Peer Gateway                      : Disabled
Dual-active excluded VLANs        : -
Graceful Consistency Check        : Enabled
Auto-recovery status              : Disabled
Delay-restore status              : Timer is off.(timeout = 30s)
Delay-restore SVI status          : Timer is off.(timeout = 10s)
Delay-restore Orphan-port status  : Timer is off.(timeout = 0s)
Operational Layer3 Peer-router    : Disabled
Virtual-peerlink mode             : Disabled

vPC Peer-link status
---------------------------------------------------------------------
id    Port   Status Active vlans    
--    ----   ------ -------------------------------------------------
1     Po1    down   -                                                           

vPC status
----------------------------------------------------------------------------
Id    Port          Status Consistency Reason                Active vlans
--    ------------  ------ ----------- ------                ---------------
2     Po2           up     success     success               1                  

At this point, we have a split-brain scenario. Both switches' vPCs are up because they can't see each other. There are all sorts of problems here: there will likely be connectivity issues for devices connected to each switch, depending on how the port-channel hashes traffic going through the switches, not to mention STP issues. This is why we protect the links with redundancy.

If you find yourself in this scenario, your best course of action is to take one of the switches out of service, or at least shut its links. That way the downstream and upstream devices will drop their interfaces to that switch and leave you with a single vPC member, removing the possibility of a loop.
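If you collect the status fields from both peers out of band, the split-brain condition shown above can be spotted programmatically. The Python sketch below is one hedged way to do it. The function names are invented for this example, and the check relies only on the "Peer status" and "vPC role" strings as they appear in the outputs in this post.

```python
def operational_role(role_field):
    """Return the operational role from a 'vPC role' value."""
    # "secondary, operational primary" -> "primary"; "primary" -> "primary"
    return role_field.split("operational")[-1].strip(" ,")


def looks_like_split_brain(switch_a, switch_b):
    """True if both peers act as primary while the peer link is down."""
    peers = (switch_a, switch_b)
    both_primary = all(
        operational_role(p["vPC role"]) == "primary" for p in peers
    )
    peer_link_down = all(
        p["Peer status"] == "peer link is down" for p in peers
    )
    return both_primary and peer_link_down


vpc1 = {"Peer status": "peer link is down", "vPC role": "primary"}
vpc2 = {"Peer status": "peer link is down",
        "vPC role": "secondary, operational primary"}

print(looks_like_split_brain(vpc1, vpc2))  # True
```

With the values from the peer-link-only failure, where VPC-2 stays "secondary", the same check returns False.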

Switch Failure on cold boot

This is the last failure scenario we will go through in this part. It's a rather niche scenario, but it has happened to me in production, so I thought I would include it.

Let's say you have a complete power failure or spike, in which power to both vPC peer switches is lost. These switches would of course need to be powered back on. However, what if VPC-2 suffered hardware damage during the power cut and will not boot?

You might assume that VPC-1 would boot up and take the primary role. However, that's not the case, at least by default.

This is the status of VPC-1:

VPC-1# show vpc brief 
Legend:
                (*) - local vPC is down, forwarding via vPC peer-link

vPC domain id                     : 100 
Peer status                       : peer link is down             
vPC keep-alive status             : Suspended (Destination IP not reachable)
Configuration consistency status  : failed  
Per-vlan consistency status       : success                       
Configuration inconsistency reason: Consistency Check Not Performed
Type-2 inconsistency reason       : Consistency Check Not Performed
vPC role                          : none established              
Number of vPCs configured         : 1   
Peer Gateway                      : Disabled
Dual-active excluded VLANs        : -
Graceful Consistency Check        : Disabled (due to peer configuration)
Auto-recovery status              : Disabled
Delay-restore status              : Timer is off.(timeout = 30s)
Delay-restore SVI status          : Timer is off.(timeout = 10s)
Delay-restore Orphan-port status  : Timer is off.(timeout = 0s)
Operational Layer3 Peer-router    : Disabled
Virtual-peerlink mode             : Disabled

vPC Peer-link status
---------------------------------------------------------------------
id    Port   Status Active vlans    
--    ----   ------ -------------------------------------------------
1     Po1    down   -                                                           

vPC status
----------------------------------------------------------------------------
Id    Port          Status Consistency Reason                Active vlans
--    ------------  ------ ----------- ------                ---------------
2     Po2           down   failed      Peer-link is down     -                  

As we can see, this is a bit of a mess. The vPC peer link is down because the other switch is inoperable. The keepalive is also down for the same reason. So by default the switch protects itself by suspending all its vPCs.

What does this mean for our downstream switch? Well, its port-channel interface is down:

Switch-1#show etherchannel summary | begin Group
Group  Port-channel  Protocol    Ports
------+-------------+-----------+-----------------------------------------------
1      Po1(SD)         LACP      Gi0/0(s)    Gi0/1(s)    

Switch-1#show int desc | inc Po1                
Po1                            down           down

So at this point, we have a working switch, but it refuses to become operational in the vPC environment. In the above output from VPC-1 we can see that Auto-recovery status is Disabled. This is the mechanism that allows the switch to begin to participate when this scenario unfolds. We need to enable it under the vPC domain:

vpc domain 100
  auto-recovery

When we do this, we see it becomes enabled, and the timer is off by default:

VPC-1# show vpc brief | inc Auto-recovery
Auto-recovery status              : Enabled, timer is off.(timeout = 240s)

If you have to enable this ad hoc, you must reload the switch for it to take effect.

This is what we are left with:

VPC-1# show vpc brief
Legend:
                (*) - local vPC is down, forwarding via vPC peer-link

vPC domain id                     : 100 
Peer status                       : peer link is down             
vPC keep-alive status             : Suspended (Destination IP not reachable)
Configuration consistency status  : failed  
Per-vlan consistency status       : success                       
Configuration inconsistency reason: Consistency Check Not Performed
Type-2 inconsistency reason       : Consistency Check Not Performed
vPC role                          : primary                       
Number of vPCs configured         : 1   
Peer Gateway                      : Disabled
Dual-active excluded VLANs        : -
Graceful Consistency Check        : Disabled (due to peer configuration)
Auto-recovery status              : Enabled, timer is off.(timeout = 60s)
Delay-restore status              : Timer is off.(timeout = 30s)
Delay-restore SVI status          : Timer is off.(timeout = 10s)
Delay-restore Orphan-port status  : Timer is off.(timeout = 0s)
Operational Layer3 Peer-router    : Disabled
Virtual-peerlink mode             : Disabled

vPC Peer-link status
---------------------------------------------------------------------
id    Port   Status Active vlans    
--    ----   ------ -------------------------------------------------
1     Po1    down   -                                                                    

vPC status
----------------------------------------------------------------------------
Id    Port          Status Consistency Reason                Active vlans
--    ------------  ------ ----------- ------                ---------------
2     Po2           up     success     Type checks were      1                           
                                       bypassed for the vPC                              
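The cold-boot behaviour with and without auto-recovery can be summarised as a small model. The Python sketch below is a deliberate simplification for illustration, not the NX-OS logic itself. The function name cold_boot_state and its parameters are hypothetical, and the 240-second default is simply taken from the timeout shown when auto-recovery was first enabled above.

```python
def cold_boot_state(peer_seen, auto_recovery, seconds_up, reload_delay=240):
    """Simplified model of a vPC switch after a cold boot.

    peer_seen:     True if the peer link or keepalive has reached the peer.
    auto_recovery: True if auto-recovery is configured in the vPC domain.
    seconds_up:    Seconds since the switch finished booting.
    reload_delay:  Assumed auto-recovery timeout in seconds.
    """
    if peer_seen:
        return "normal role election"
    if auto_recovery and seconds_up >= reload_delay:
        # Timer expired with no peer: act alone.
        return "assume primary, bring vPCs up"
    return "no role established, vPCs stay down"


# Default behaviour: the switch waits for its peer indefinitely.
print(cold_boot_state(False, False, 600))  # no role established, vPCs stay down
# Auto-recovery enabled and the timer has expired.
print(cold_boot_state(False, True, 300))   # assume primary, bring vPCs up
```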

There are divided opinions about leaving auto-recovery enabled, as it can cause a split-brain scenario in some cases, although with a scenario like this one that is rare. I would read up on the pros and cons before deciding to implement it permanently on your switches.

That concludes the common (and one not-so-common) vPC failure scenarios. Some of these are worse than others, but they all show the need to make these connections resilient and robust.

