I frequently reference a conversation I had with a client a while back. It went something like this:
Client: I thought we had redundant firewalls! Why are we down?!?
Me: You do have redundant firewalls, and the redundancy worked great for the past nine months while one of the units was failed. Now, however, the second unit has failed as well.
A lot of this pain can be avoided with good monitoring. Some systems are easy to monitor; with others, you need to be a bit more creative to figure out the best approach.
Easy to monitor
If something flat-out dies, and you’re monitoring that system, the failure should be fairly easy to detect. In the scenario above, if the client had been monitoring the inside IP address of each firewall unit, they would have known the moment the first unit died.
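As a minimal sketch of that kind of up/down check, the snippet below attempts a TCP connection to a device's inside interface. The IP address and port are hypothetical placeholders (ICMP ping is also common, but requires raw-socket privileges; a TCP connect to a port the device answers on avoids that):

```python
import socket

def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: poll the firewall's inside interface.
# The address and port here are hypothetical, not any real device.
# if not is_reachable("192.0.2.1", 443):
#     notify_on_call("Firewall inside interface is not responding")
```

In practice you would run a check like this from a monitoring platform on a short interval, rather than ad hoc, so a dead unit is flagged within minutes instead of months.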
Another key thing to monitor is the lights. This is hard for some organizations, as servers aren’t necessarily hosted where the IT staff work. But simply going into the server room, turning off the room lights, and looking for red, yellow, or orange lights on equipment can give you a quick warning about anything that wasn’t properly configured for monitoring. Also look for a lack of lights where there should be some, which could indicate a complete failure.
A number of Windows servers and applications will report a failure, a suspension of replication, or a loss of redundancy in the event log. Appliances often send SNMP traps to a collector when a peer unit fails or changes state.
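To make the trap side concrete, here is a bare-bones sketch of a UDP listener on the SNMP trap port. This is an assumption-laden illustration, not a real trap receiver: it only shows where trap datagrams arrive; decoding the ASN.1 payload is the job of a proper SNMP library or monitoring platform. (The standard trap port, 162, is privileged; a non-standard port is used here for illustration.)

```python
import socket

def listen_for_traps(bind_addr: str = "0.0.0.0", port: int = 16200, count: int = 1):
    """Receive raw SNMP trap datagrams over UDP and return (sender, payload)
    pairs. Parsing the payload is left to a real SNMP library; this only
    demonstrates where traps from an appliance would land."""
    received = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((bind_addr, port))
        for _ in range(count):
            data, peer = sock.recvfrom(4096)
            received.append((peer[0], data))
    return received
```

On the appliance side, you would point its trap destination at the host running the collector and then stage a peer failure to confirm the trap actually arrives.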
You’ll need to identify each system deemed important enough to monitor, then validate that the monitoring actually works by staging a failure and confirming the alerts fire as intended.
Harder to monitor
What about systems where the underlying status may be a bit more masked? For example, web services behind a load balancer. In that case, if the load balancer provides monitoring options (syslog, SNMP, or other methods), you might be able to extract information that way. If not, then perhaps you need to poll each of the individual servers independently to verify the web services are responding properly (not just the server being ‘up’ [layer 3], but that it is actually able to deliver a service and valid response [layer 7]).
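A layer-7 check of that sort can be sketched in a few lines. The hostnames and health-check path below are hypothetical; the point is that the check requires a valid HTTP response and expected content, not merely a host that answers pings:

```python
from urllib.request import urlopen
from urllib.error import URLError

def service_healthy(url: str, expect: bytes = b"", timeout: float = 5.0) -> bool:
    """Layer-7 check: the server must return HTTP 200 and (optionally)
    include an expected marker in the body - not merely be reachable."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            body = resp.read()
            return resp.status == 200 and expect in body
    except (URLError, OSError):
        return False

# Poll each web server directly, bypassing the load balancer.
# Hostnames and the /healthz path are hypothetical examples.
# for host in ("web1.internal", "web2.internal"):
#     if not service_healthy(f"http://{host}/healthz", b"OK"):
#         notify_on_call(f"{host} failed the layer-7 check")
```

Polling each backend individually is what catches the case where the load balancer quietly drops one server from the pool while the site as a whole stays up.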
Another item we bump into that is a bit more interesting is redundant ISP connections. Frequently, the production side has a bigger pipe than the redundant side, so if you failed over, your users would be complaining about the slowness of the Internet. Assuming you don’t have link balancers in place (where you’d monitor the link balancers themselves for status), it pays to know about a failure proactively. It could occur over the weekend, or intermittently for brief periods — the kind of minor problem that grows into a full outage.

So you need to make sure that you’re monitoring particular IPs down particular paths. For this, I tend to watch the ISP’s off-site IP address. The most common issues are failed hardware or someone cutting a cable in the ground, and monitoring of that nature will detect both. It doesn’t help when an ISP is having a bad day and its routing table goes wonky (there are ways to help detect that too), but it covers the common failure conditions people experience. Just make sure the routes on the firewall are defined so those probes always go down the intended path.
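The per-path idea can be sketched as one polling pass over a table of test targets, each pinned (by the firewall's routing policy) to a specific ISP link. The addresses here are hypothetical documentation-range placeholders, and a TCP connect stands in for ICMP ping so the sketch runs without raw-socket privileges:

```python
import socket

# Hypothetical off-site test addresses, one routed down each ISP path.
# Your firewall's static routes must pin each destination to its link.
PATH_TARGETS = {
    "primary": ("198.51.100.1", 443),
    "backup":  ("203.0.113.1", 443),
}

def check_path(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection down this path succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def poll_paths(targets=PATH_TARGETS):
    """One polling pass over every path. Run this on a schedule (cron, a
    systemd timer, or a monitoring platform) so a backup-link failure over
    the weekend is still noticed, not discovered during a failover."""
    return {name: check_path(host, port) for name, (host, port) in targets.items()}
```

The key design point is that the backup path gets the same polling cadence as the primary, even though no production traffic normally flows over it.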
In both of those more complex scenarios, a little bit of planning will help determine the right method to monitor the resources for possible failure.
Having high availability for key resources is important. Knowing when you lost that redundancy is also important. If you need help setting up monitoring or looking for a solution that helps monitor or manage your environment, email email@example.com. We are happy to help.