Jul 27, 2010

Debugging interface status on failover pairs

Should static routes interfere with the operation of ASA/PIX failover pairs? I wouldn't expect so, but it can happen. Here is a scenario where it does.



I configured the firewalls to perform Active/Standby failover:

fw1/pri/act(config)# sh run fail
failover
failover lan unit primary
failover lan interface FAILOVER Ethernet3
failover lan enable
failover key *****
failover mac address Ethernet0 00aa.00ee.2400 00aa.0076.6300
failover mac address Ethernet1 00aa.00ee.2401 00aa.0076.6301
failover link FAILOVER Ethernet3
failover interface ip FAILOVER 172.16.0.1 255.255.255.252 standby 172.16.0.2
 
There is a static route to 192.168.100.0/25, added for reasons that don't matter here. It overlaps the directly connected inside network 192.168.100.0/24, and the routing table looks like this:

fw1/pri/act# sh route

C 172.16.0.0 255.255.255.252 is directly connected, FAILOVER
C 192.168.0.0 255.255.255.0 is directly connected, outside
S 192.168.100.0 255.255.255.128 [1/0] via 192.168.100.126, inside
C 192.168.100.0 255.255.255.0 is directly connected, inside
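
To make the overlap concrete, here is a minimal Python sketch of a generic longest-prefix-match lookup over the four entries above. It is only an illustration of how a more specific static route can win over the connected one; it does not claim to reproduce the ASA/PIX forwarding logic, and the `lookup` helper is a hypothetical name.

```python
import ipaddress

# Routes copied from the `sh route` output above: (prefix, kind, interface).
ROUTES = [
    (ipaddress.ip_network("172.16.0.0/30"), "connected", "FAILOVER"),
    (ipaddress.ip_network("192.168.0.0/24"), "connected", "outside"),
    (ipaddress.ip_network("192.168.100.0/25"), "static via 192.168.100.126", "inside"),
    (ipaddress.ip_network("192.168.100.0/24"), "connected", "inside"),
]


def lookup(dst):
    """Return the most specific route containing dst, or None if nothing matches."""
    addr = ipaddress.ip_address(dst)
    matches = [route for route in ROUTES if addr in route[0]]
    return max(matches, key=lambda route: route[0].prefixlen, default=None)


for dst in ("192.168.100.2", "192.168.100.200", "192.168.0.2"):
    net, kind, ifc = lookup(dst)
    print(f"{dst} -> {net} ({kind}, {ifc})")
```

Under this model the Standby's inside address 192.168.100.2 falls under the /25 static route rather than the connected /24, while 192.168.100.200 still matches only the connected entry.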
 
The cluster works fine and the configuration sync has been done between the boxes:

fw1/pri/act(config)# show failover state
State                Last Failure Reason     Date/Time
This host - Primary
Standby Ready        None

Other host - Secondary
Active               None

====Configuration State===
Sync Done
====Communication State===
Mac set


However, the monitoring status of the inside interfaces is Normal (Waiting):

fw1/pri/act# sh fail
[TRUNCATED OUTPUT]
      This host: Primary - Active
            Active time: 1200 (sec)
               Interface outside (192.168.0.1): Normal
               Interface inside (192.168.100.1): Normal(Waiting)
      Other host: Secondary - Standby Ready
            Active time: 0 (sec)
               Interface outside (192.168.0.2): Normal
               Interface inside (192.168.100.2): Normal (Waiting)
[TRUNCATED OUTPUT]

 
I got the same output for the Standby device:
 
fw1/sec/stby# sh fail
[TRUNCATED OUTPUT]
      This host: Secondary - Standby Ready
            Active time: 0 (sec)
               Interface outside (192.168.0.2): Normal
               Interface inside (192.168.100.2): Normal (Waiting)
      Other host: Primary - Active
            Active time: 1200 (sec)
               Interface outside (192.168.0.1): Normal
               Interface inside (192.168.100.1): Normal(Waiting)
[TRUNCATED OUTPUT]


Note that sometimes only one of the boxes reports that status.
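
Since the status may show up on one unit only, it helps to check the output of both boxes. The following Python sketch scans captured `sh fail` text for monitored interfaces whose status mentions Waiting. The `waiting_interfaces` function and the regular expression are hypothetical, written only against the line layout shown above (including the inconsistent spacing before the parenthesis); other software versions may format the output differently.

```python
import re

# Matches lines such as "Interface inside (192.168.100.1): Normal(Waiting)".
IFC_RE = re.compile(r"Interface\s+(\S+)\s+\(([\d.]+)\):\s*(.+?)\s*$")


def waiting_interfaces(output):
    """Return (name, ip) pairs whose monitored status contains 'Waiting'."""
    found = []
    for line in output.splitlines():
        match = IFC_RE.search(line)
        if match and "waiting" in match.group(3).lower():
            found.append((match.group(1), match.group(2)))
    return found


# Sample text copied from the Active unit's output above.
SAMPLE = """\
      This host: Primary - Active
            Active time: 1200 (sec)
               Interface outside (192.168.0.1): Normal
               Interface inside (192.168.100.1): Normal(Waiting)
      Other host: Secondary - Standby Ready
            Active time: 0 (sec)
               Interface outside (192.168.0.2): Normal
               Interface inside (192.168.100.2): Normal (Waiting)
"""

print(waiting_interfaces(SAMPLE))
```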

The log messages report connectivity loss between the inside interfaces:

%PIX-1-105005: (Primary) Lost Failover communications with mate on interface inside
%PIX-1-105008: (Primary) Testing Interface inside
%PIX-1-105009: (Primary) Testing on interface inside Passed


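Debugging the interface monitoring traffic between the firewalls, we can see that HELLO messages sent through the inside interface of one unit never reach its mate. In the output below, the Active sends FHELLO on both the outside and inside interfaces, but the Standby receives only the ones from the outside network. The same happens in the other direction.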

fw1/pri/act# debug fover txip
fover event trace on
fover_health_monitoring_thread: send_msg_ifc(): 192.168.0.1->192.168.0.2 ifc 1 cmd FHELLO
fover_health_monitoring_thread: send_msg_ifc(): 192.168.100.1->192.168.100.2 ifc 2 cmd FHELLO
fover_health_monitoring_thread: send_msg_ifc(): 192.168.0.1->192.168.0.2 ifc 1 cmd FHELLO
fover_health_monitoring_thread: send_msg_ifc(): 192.168.100.1->192.168.100.2 ifc 2 cmd FHELLO

 
fw1/sec/stby# debug fover rxip
fover_ip: fover_ip(): ifc 2 192.168.0.1 -> 192.168.0.2
fover_ip: fover_ip(): ifc 2 got FHELLO
fover_ip: fover_ip(): ifc 2 192.168.0.1 -> 192.168.0.2
fover_ip: fover_ip(): ifc 2 got FHELLO

 
Checking the ARP table, I could not find any layer-2 issue:
 
fw1/pri/act# sh arp

outside 192.168.0.2 00aa.0076.6300 0
inside 192.168.100.2 00aa.0076.6301 0
FAILOVER 172.16.0.2 00aa.0076.6303 419

From the Active box, I enabled debug icmp trace and pinged the inside interface of the Standby. The echo requests were being sent to the directed broadcast address instead of the specified destination address:
 
fw1/pri/act# ping 192.168.100.2
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 192.168.100.2, timeout is 2 seconds:
?????
Success rate is 0 percent (0/5)
ICMP echo request from 192.168.100.1 to 192.168.100.255 ID=4388 seq=0 len=32
ICMP echo request from 192.168.100.1 to 192.168.100.255 ID=4388 seq=1 len=32
ICMP echo request from 192.168.100.1 to 192.168.100.255 ID=4388 seq=2 len=32
ICMP echo request from 192.168.0.1 to 192.168.0.255 ID=4388 seq=0 len=32
ICMP echo request from 192.168.0.1 to 192.168.0.255 ID=4388 seq=1 len=32
ICMP echo request from 192.168.0.1 to 192.168.0.255 ID=4388 seq=2 len=32

Thus, I concluded that interface monitoring was failing because of a layer-3 issue. I then removed the static route to 192.168.100.0/25 and the monitoring status changed to Normal:
 
fw1/pri/act(config)# sh route

C 172.16.0.0 255.255.255.252 is directly connected, FAILOVER
C 192.168.0.0 255.255.255.0 is directly connected, outside
C 192.168.100.0 255.255.255.0 is directly connected, inside
 
fw1/pri/act(config)# sh fail
[TRUNCATED OUTPUT]
      This host: Primary - Active
            Active time: 3885 (sec)
               Interface outside (192.168.0.1): Normal
               Interface inside (192.168.100.1): Normal
      Other host: Secondary - Standby Ready
            Active time: 0 (sec)
               Interface outside (192.168.0.2): Normal
               Interface inside (192.168.100.2): Normal
[TRUNCATED OUTPUT]

 
I ran that ping again and it worked fine:

fw1/pri/act(config)# ping 192.168.100.2

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 192.168.100.2, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 20/38/50 ms

The static route described previously can indeed interrupt communication between the boxes and the whole inside network (I've seen it happen!). With those settings, only unidirectional traffic (e.g., syslog) from inside to other networks would work. I could have titled this post "ASA/PIX MUST be the default gateway for any directly connected network", but then I couldn't have explained the troubleshooting steps for failover-related issues, which might be useful in another case. Besides, I would be throwing away all the work that led to this finding.

Taking the worst case: if I add static routes with longer prefixes [192.168.100.0/xx | 31 > xx > 24] to every subnet of 192.168.100.0/24 except the one that contains the firewalls' own addresses, the communication works properly:

fw1/pri/act(config)# sh route

C 172.16.0.0 255.255.255.252 is directly connected, FAILOVER
C 192.168.0.0 255.255.255.0 is directly connected, outside
S 192.168.100.8 255.255.255.248 [1/0] via 192.168.100.14, inside
S 192.168.100.4 255.255.255.252 [1/0] via 192.168.100.6, inside
C 192.168.100.0 255.255.255.0 is directly connected, inside
S 192.168.100.16 255.255.255.240 [1/0] via 192.168.100.30, inside
S 192.168.100.32 255.255.255.224 [1/0] via 192.168.100.62, inside
S 192.168.100.64 255.255.255.192 [1/0] via 192.168.100.126, inside
S 192.168.100.128 255.255.255.128 [1/0] via 192.168.100.254, inside


fw1/pri/act(config)# ping 192.168.100.2
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 192.168.100.2, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 40/40/40 ms


The only way to keep a static route to a subnet [192.168.100.0/xx | 31 > xx > 24] from breaking communications is to define the IP address of the Active or the Standby box as the next hop. However, this would be totally unnecessary, since there is already a connected route.
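
The difference between the broken table and the working one can be checked mechanically. The Python sketch below flags static prefixes that sit inside the connected inside network and also contain one of the units' inside addresses. This rule is an assumption drawn from the two routing tables in this post, not from Cisco documentation; `shadowing_routes` is a hypothetical helper, and it deliberately ignores the next-hop exception mentioned above.

```python
import ipaddress

# Values taken from the outputs in this post.
CONNECTED = ipaddress.ip_network("192.168.100.0/24")
UNIT_IPS = [
    ipaddress.ip_address("192.168.100.1"),  # Active, inside
    ipaddress.ip_address("192.168.100.2"),  # Standby, inside
]


def shadowing_routes(statics):
    """Return static prefixes that are longer than the connected prefix
    and contain at least one unit address."""
    flagged = []
    for prefix in statics:
        net = ipaddress.ip_network(prefix)
        more_specific = net.subnet_of(CONNECTED) and net.prefixlen > CONNECTED.prefixlen
        if more_specific and any(ip in net for ip in UNIT_IPS):
            flagged.append(prefix)
    return flagged


broken = ["192.168.100.0/25"]
working = [
    "192.168.100.4/30", "192.168.100.8/29", "192.168.100.16/28",
    "192.168.100.32/27", "192.168.100.64/26", "192.168.100.128/25",
]

print(shadowing_routes(broken))
print(shadowing_routes(working))
```

Only the original /25 route is flagged; none of the six routes in the working table contains 192.168.100.1 or 192.168.100.2, which matches the ping results.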

To compare the results, I implemented the same scenario using IOS routers instead of firewalls. In that case, the overlapping routes didn't cause this issue. ASA/PIX indeed handles VLSM differently than routers do.

It makes no sense for the firewall not to be the gateway of a directly connected network. Still, to avoid connectivity issues, the firewall shouldn't allow the administrator to add static routes like that.
 
 
asa(config)# end
asa# wr mem
