Redundant Network Monitoring


Introduction

This section describes a few scenarios for implementing redundant monitoring hosts on various types of network layouts. With redundant hosts, you can maintain the ability to monitor your network when the primary host that runs NetSaint fails or when portions of your network become unreachable.

Note: If you are just learning how to use NetSaint, I would suggest not trying to implement redundancy until you have become familiar with the prerequisites I've laid out. Redundancy is a relatively complicated issue to understand, and even more difficult to implement properly.

Index

Prerequisites
Considerations
Sample scripts
Scenario 1 - Implementing redundancy on the same network segment
Scenario 2 - A simple way to implement redundancy across network segments
Scenario 3 - A smarter way to implement redundancy across network segments
Scenario 4 - Implementing multiple redundancy methods

Prerequisites

Before you can even think about implementing redundancy with NetSaint, you need to be familiar with the following...

Considerations

There are a few things you need to understand before you jump into implementing redundancy...

Version 0.0.5 was the first release of NetSaint where redundancy could actually be implemented in any kind of reasonable manner. It just so happened that all the pieces fell into place for accommodating this (event handlers, program modes, and external commands). Additional support for implementing redundancy will be incorporated into future versions of NetSaint, but I need your feedback!

Sample Scripts

All of the sample scripts that I use in this documentation can be found in the eventhandlers/ subdirectory of the NetSaint distribution. You'll probably need to modify them to work on your system...

Scenario 1 - Implementing Redundancy On The Same Network Segment

Introduction

This is the easiest method of implementing redundant monitoring hosts on your network. However, this method will only protect against a limited number of failures. More complex setups are necessary in order to provide better redundancy across different network segments.

Goals

The goal of this type of redundancy implementation is for a "slave" host running NetSaint to take over the job of monitoring the entire network if:

  1. The "master" host that runs NetSaint is down or...
  2. The NetSaint process on the "master" host stops running for some reason

Network Layout Diagram

The diagram below shows a very simple network setup. For this scenario I will be assuming that hosts A and E are both running NetSaint and are monitoring all the hosts shown. Host A will be considered the "master" host and host E will be considered the "slave" host.

Initial Program Modes

First off, we need to define what program mode the master and slave hosts will be in when they start monitoring. This is done by using the program_mode option in the main configuration file. The master host (host A) should have its initial program mode set to active, while the slave host (host E) should have its initial program mode set to standby. That was easy enough...
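
Before starting NetSaint on either host, it can help to confirm which initial mode each main configuration file actually requests. The following is a minimal sketch, not part of the NetSaint distribution: the function name get_program_mode is hypothetical, the path to the main configuration file is supplied by you, and it assumes the option appears on a line of the form program_mode=<value>. It prints whatever value is configured without interpreting it.

```shell
#!/bin/sh
# Print the value assigned to program_mode in a key=value style config file.
# ASSUMPTION: the option is written as "program_mode=<value>" on its own line.
# get_program_mode is a hypothetical helper, not a NetSaint tool.
# Usage: get_program_mode <path-to-main-config-file>
get_program_mode() {
	# Strip everything up to and including "=", print only matching lines,
	# and keep the last one in case the option is listed more than once.
	sed -n 's/^[[:space:]]*program_mode[[:space:]]*=[[:space:]]*//p' "$1" | tail -n 1
}

# When run as a script with a path argument, report the configured mode.
if [ -n "${1:-}" ]; then
	get_program_mode "$1"
fi
```

Running it against the main configuration file on host A and on the slave host gives a quick check that only the master is set to start in active mode.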

Initial Configuration

Next we need to consider the differences between the host configuration files on the master and slave hosts...

I will assume that you have the master host (host A) set up to monitor services on all hosts shown in the diagram above. The slave host (host E) should be set up to monitor the same services and hosts, with the following additions in the configuration file...

It is important to note that host A (the master host) has no knowledge of host E (the slave host). In this scenario it simply doesn't need to. Of course you may be monitoring services on host E from host A, but that has nothing to do with the implementation of redundancy...

Event Handler Command Definitions

We need to stop for a minute and describe what the command definitions for the event handlers on the slave host look like. Here is an example...

command[handle-master-host-event]=/usr/local/netsaint/libexec/eventhandlers/handle-master-host-event $HOSTSTATE$ $STATETYPE$
command[handle-master-proc-event]=/usr/local/netsaint/libexec/eventhandlers/handle-master-proc-event $SERVICESTATE$ $STATETYPE$

This assumes that you have placed the event handler scripts in the /usr/local/netsaint/libexec/eventhandlers directory. You may place them anywhere you wish, but you'll need to modify the examples I've given here.

Event Handler Scripts

Okay, now let's take a look at what the event handler scripts look like...

Host Event Handler (handle-master-host-event)

#!/bin/sh

# Only take action on hard host states...
case "$2" in
HARD)
	case "$1" in
	DOWN)
		# The master host has gone down!
		# We should now become the master host and take
		# over the responsibilities of monitoring the 
		# network, so enter active mode...
		/usr/local/netsaint/libexec/eventhandlers/enter_active_mode
		;;
	UP)
		# The master host has recovered!
		# We should go back to being the slave host and
		# let the master host do the monitoring, so 
		# enter standby mode...
		/usr/local/netsaint/libexec/eventhandlers/enter_standby_mode
		;;
	esac
	;;
esac
exit 0

Service Event Handler (handle-master-proc-event)

#!/bin/sh

# Only take action on hard service states...
case "$2" in
HARD)
	case "$1" in
	CRITICAL)
		# The master NetSaint process is not running!
		# We should now become the master host and
		# take over the responsibility of monitoring
		# the network, so enter active mode...
		/usr/local/netsaint/libexec/eventhandlers/enter_active_mode
		;;
	WARNING|UNKNOWN)
		# The master NetSaint process may or may not
		# be running.. We won't do anything here, but
		# to be on the safe side you may decide you 
		# want the slave host to become the master in
		# these situations...
		;;
	OK)
		# The master NetSaint process is running again!
		# We should go back to being the slave host, 
		# so enter standby mode...
		/usr/local/netsaint/libexec/eventhandlers/enter_standby_mode
		;;
	esac
	;;
esac
exit 0
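
One way to sanity-check the case logic of the two handlers before wiring them into NetSaint is to mirror it in a function that only prints the action it would take. This is a sketch under assumptions: decide_action is a hypothetical name, and the function prints the name of the mode-switch script instead of running it, so it never changes the program mode.

```shell
#!/bin/sh
# Dry-run mirror of the host and service handlers shown above.
# decide_action is a hypothetical helper: it prints the action rather than
# running enter_active_mode or enter_standby_mode.
# Usage: decide_action <state> <state_type>
decide_action() {
	case "$2" in
	HARD)
		case "$1" in
		DOWN|CRITICAL) echo "enter_active_mode" ;;
		UP|OK)         echo "enter_standby_mode" ;;
		*)             echo "no_action" ;;
		esac
		;;
	*)
		# Soft states are ignored by both handlers.
		echo "no_action"
		;;
	esac
}

# Walk through a few state combinations and show the resulting action.
for args in "DOWN HARD" "DOWN SOFT" "UP HARD" "CRITICAL HARD" "WARNING HARD" "OK HARD"; do
	# Intentional word splitting: $args becomes <state> <state_type>.
	set -- $args
	printf '%-9s %-5s -> %s\n' "$1" "$2" "$(decide_action "$1" "$2")"
done
```

The output makes it easy to see that only hard DOWN or CRITICAL states trigger a takeover, and that soft states and the WARNING and UNKNOWN service states leave the current mode untouched.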

What This Does For Us

When things first start out, host A (the master host) is in active mode. This means that it monitors all services and sends out notifications if there are problems or recoveries. Host E (the slave host) is in standby mode, which means that it will monitor all services but will not send out any notifications.

The NetSaint process on host E becomes the master host when...

When the NetSaint process on host E has entered active mode, it will be able to send out notifications about any service or host problems or recoveries. At this point host E has effectively taken over the responsibility of monitoring the network!

The NetSaint process on host E returns to being the slave host when...

When the NetSaint process on host E has entered standby mode, it will not send out notifications about any service or host problems or recoveries. At this point host E has handed over the responsibilities of monitoring the network back to host A. Everything is now as it was when we first started!

Time Lags

Redundancy in NetSaint is by no means perfect. One of the more obvious problems is the lag time between the master host failing and the slave host taking over. This is affected by the following...

You can minimize this lag by...

When NetSaint recovers on host A, there is also some lag time before host E returns to standby mode. This is affected by the following...

The exact lag times between the transfer of monitoring responsibilities will vary depending on how many services you have defined, the interval at which services are checked, and a lot of pure chance. At any rate, it's definitely better than nothing...
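
As a rough illustration of how the check settings feed into the lag, the sketch below estimates the worst-case time before the slave sees a hard problem state for the master. It rests on assumptions that are not spelled out in this document: that a problem only becomes hard after a configured number of consecutive failed checks, that retries are spaced by a retry interval, and that scheduling latency and check run time can be ignored. The function name and parameter order are hypothetical.

```shell
#!/bin/sh
# Rough worst-case estimate, in minutes, of the time until the slave sees a
# HARD problem state for the master.
# ASSUMPTIONS: the master fails just after a successful check; the problem
# turns HARD after <max_attempts> consecutive failed checks; retries are
# <retry_interval> apart; scheduling latency and check run time are ignored.
# Usage: worst_case_lag <check_interval> <retry_interval> <max_attempts>
worst_case_lag() {
	echo $(( $1 + ($3 - 1) * $2 ))
}

# Example: checked every 5 minutes, retried every minute, 3 attempts.
worst_case_lag 5 1 3    # prints 7
```

Under these assumptions, shortening the check interval for the checks that watch the master is the most direct way to reduce the estimate.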

Special Cases

Here is one thing you should be aware of... If host A goes down, host E will switch to active mode and take over the responsibilities of monitoring. When host A recovers, host E will switch to standby mode. If - when host A recovers - the NetSaint process on host A does not start up properly, there will be a period of time when neither host is monitoring the network! Fortunately, the service check logic in NetSaint accounts for this. The next time the NetSaint process on host E checks the status of the NetSaint process on host A, it will find that it is not running. Host E will then switch back to active mode and take over all responsibilities of monitoring.

The exact amount of time that neither host is monitoring the network is hard to determine. Obviously, this period can be minimized by increasing the frequency of service checks (on host E) of the NetSaint process on host A. The rest is up to pure chance, but the total "blackout" time shouldn't be too bad...

Scenario 2 - A Simple Way To Implement Redundancy Across Network Segments

Introduction

If you're monitoring hosts that reside on different network segments, you're going to need a more substantial redundancy model than that described in scenario 1. The following example is more complex than that in the first scenario, but the logic behind it should become clear if you study it closely enough.

Goals

The goal of this type of redundancy implementation is for a "slave" host running NetSaint to take over the job of monitoring the entire network if:

  1. The "master" host that runs NetSaint is down or unreachable or...
  2. The NetSaint process on the "master" host stops running for some reason

Network Layout Diagram

The diagram below shows a relatively simple network setup with hosts on two network segments. For this scenario I will be assuming that hosts A and F are both running NetSaint and are monitoring all the hosts shown. Host A will be considered the "master" host and host F will be considered the "slave" host. Nodes H and I are routers that lie between the two network segments.

Initial Program Modes

For this example, the master host (host A) should have its initial program mode set to active, while the slave host (host F) should have its initial program mode set to standby.

Initial Configuration

Next we need to consider the differences between the host configuration files on the master and slave hosts...

I will assume that you have the master host (host A) set up to monitor services on all hosts shown in the diagram above. The slave host (host F) should be set up to monitor the same services and hosts, with the following additions in the configuration file...

It is important to note that host A (the master host) has no knowledge of host F (the slave host). In this scenario it simply doesn't need to. Of course you may be monitoring services on host F from host A, but that has nothing to do with the implementation of redundancy...

Event Handler Command Definitions

We need to stop for a minute and describe what the command definitions for the event handlers on the slave host look like. Here is an example...

command[handle-master-host-event]=/usr/local/netsaint/libexec/eventhandlers/handle-master-host-event $HOSTSTATE$ $STATETYPE$
command[handle-master-proc-event]=/usr/local/netsaint/libexec/eventhandlers/handle-master-proc-event $SERVICESTATE$ $STATETYPE$
command[handle-router-event]=/usr/local/netsaint/libexec/eventhandlers/handle-router-event $HOSTSTATE$ $STATETYPE$

This assumes that you have placed the event handler scripts in the /usr/local/netsaint/libexec/eventhandlers directory. You may place them anywhere you wish, but you'll need to modify the examples I've given here.

Event Handler Scripts

Okay, now let's take a look at what the event handler scripts look like...

Host Event Handler (handle-master-host-event)

#!/bin/sh

# Only take action on hard host states...
case "$2" in
HARD)
	case "$1" in
	DOWN)
		# The master host has gone down!
		# We should now become the master host and take
		# over the responsibilities of monitoring the 
		# network, so enter active mode...
		/usr/local/netsaint/libexec/eventhandlers/enter_active_mode
		;;
	UP)
		# The master host has recovered!
		# We should go back to being the slave host and
		# let the master host do the monitoring, so 
		# enter standby mode...
		/usr/local/netsaint/libexec/eventhandlers/enter_standby_mode
		;;
	esac
	;;
esac
exit 0

Service Event Handler (handle-master-proc-event)

#!/bin/sh

# Only take action on hard service states...
case "$2" in
HARD)

	case "$1" in

	CRITICAL)
		# The master NetSaint process is not running!
		# We should now become the master host and
		# take over the responsibility of monitoring
		# the network, so enter active mode...
		/usr/local/netsaint/libexec/eventhandlers/enter_active_mode
		;;

	WARNING|UNKNOWN)
		# The master NetSaint process may or may not
		# be running.. We won't do anything here, but
		# to be on the safe side you may decide you 
		# want the slave host to become the master in
		# these situations...
		;;

	OK)
		# The master NetSaint process is running again!
		# We should go back to being the slave host, 
		# so enter standby mode...
		/usr/local/netsaint/libexec/eventhandlers/enter_standby_mode
		;;
	esac
	;;
esac
exit 0

Host Event Handler (handle-router-event)

#!/bin/sh

# Only take action on hard host states...
case "$2" in
HARD)
	case "$1" in
	DOWN)
		# The router has gone down!
		# We should now become the master host and take
		# over the responsibilities of monitoring the 
		# network, so enter active mode...
		/usr/local/netsaint/libexec/eventhandlers/enter_active_mode
		;;
	UP)
		# The router has recovered!
		# We should go back to being the slave host and
		# let the master host do the monitoring, so 
		# enter standby mode...
		/usr/local/netsaint/libexec/eventhandlers/enter_standby_mode
		;;
	esac
	;;
esac
exit 0

What This Does For Us

When things first start out, host A (the master host) is in active mode. This means that it monitors all services and sends out notifications if there are problems or recoveries. Host F (the slave host) is in standby mode, which means that it will monitor all services but will not send out any notifications.

The NetSaint process on host F becomes the master host when...

When the NetSaint process on host F has entered active mode, it will be able to send out notifications about any service or host problems or recoveries. At this point host F has effectively taken over the responsibility of monitoring the network!

The NetSaint process on host F returns to being the slave host when...

When the NetSaint process on host F has entered standby mode, it will not send out notifications about any service or host problems or recoveries. At this point host F has handed over the responsibilities of monitoring the network back to host A. Everything is now as it was when we first started!

Shortcomings

This simple example has some shortcomings that you should be aware of. Note that when one of the routers goes down, the NetSaint process on host F acts as if the NetSaint process on host A is no longer running. This may or may not be the case. If the process on host A is running, you'll get potentially bogus notifications being sent out from both NetSaint processes...

As an example, let's say that router H goes down and severs the connection between the two network segments, but everything else is okay. From the view of the NetSaint process on host F, all hosts beyond router H (hosts A, B, C, D, E, and I) are unreachable. At the same time, the NetSaint process on host A (which is on the other side of router H) thinks that all hosts beyond router H (hosts F and G) are unreachable. Both NetSaint processes see that router H is down, but that's the only thing they agree on. This might lead to an enormous number of bogus notifications being sent out to you. You could potentially get two notifications about router H being down (one from each process) and one notification about every other host on the network being unreachable!

Scenario 3 - A Smarter Way To Implement Redundancy Across Network Segments

Introduction

This is basically just an improvement in the redundancy logic described above in scenario 2. What we will do is make both monitoring hosts aware of each other. In scenario 2, the slave host (host F) knew about the master host (host A), but the master was unaware of the slave. In this scenario both the slave and master hosts will be aware of each other, and will use that information to make better decisions on how to take over or adjust monitoring responsibilities.

Goals

We have several goals with this redundancy scenario...

The "slave" host running NetSaint should take over the job of monitoring the entire network if:

  1. The NetSaint process on the "master" host stops running for some reason
  2. The "master" host that runs NetSaint is down
  3. The "master" host becomes unreachable due to one or both of the routers going down and the "master" host was last known to be either down or unreachable

The "slave" host running NetSaint should take over the job of monitoring only its local network segment if:

  1. The "master" host becomes unreachable due to one or both of the routers going down and the "master" host was last known to be up

The "master" host running NetSaint should stop monitoring the entire network and change to monitoring only its local network segment if:

  1. The "slave" host becomes unreachable due to one or both of the routers going down and the "slave" host was last known to be up
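
The slave-side goals above can be sketched as a small decision function. This is only an illustration of the stated goals, not the implementation that this scenario would use: the function name slave_scope, its arguments, and the result labels are all hypothetical, and how a host would actually restrict monitoring to its local segment is not covered here.

```shell
#!/bin/sh
# Hypothetical decision helper for the slave host, following the goals above.
# Usage: slave_scope <master_state_now> <master_state_last_known> <master_proc_state>
#   master states: UP, DOWN or UNREACHABLE
#   master process state: OK or CRITICAL
# Prints one of: entire_network, local_segment, standby
slave_scope() {
	case "$1" in
	DOWN)
		# The master host is down: take over everything.
		echo "entire_network"
		;;
	UNREACHABLE)
		case "$2" in
		UP)
			# The master was fine when last seen and is probably still
			# monitoring its own side, so only cover the local segment.
			echo "local_segment"
			;;
		*)
			# The master was already down or unreachable: take over everything.
			echo "entire_network"
			;;
		esac
		;;
	UP)
		case "$3" in
		CRITICAL) echo "entire_network" ;;  # NetSaint is not running on the master
		*)        echo "standby" ;;
		esac
		;;
	*)
		echo "unrecognized master state: $1" >&2
		return 1
		;;
	esac
}

slave_scope UNREACHABLE UP OK      # prints local_segment
slave_scope UNREACHABLE DOWN OK    # prints entire_network
```

The master-side goal would follow the same pattern with the roles reversed, switching to its local segment when the slave becomes unreachable after last being seen up.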

Network Layout Diagram

See network diagram for scenario 2 - it's the same...

Initial Program Modes

The master host (host A) should have its initial program mode set to active, while the slave host (host F) should have its initial program mode set to standby. This is the same setup as described in scenario 2.

Initial Configuration

...

The rest of this documentation is incomplete. Since it has been missing since 0.0.5 was released and no one asked about it, I assume no one needs it. :-)

...

Scenario 4 - Implementing Multiple Redundancy Methods

If you've got a large, complex network and are paranoid about ensuring that NetSaint monitors everything, you'll probably want to look into implementing multiple redundancy methods. This basically involves combining the redundancy methods described in scenarios 1 and 3 to create a pool of monitoring hosts that are all aware of each other's state and can take over all or part of the network monitoring responsibilities if necessary. If you found the concepts presented in scenario 3 difficult to understand, you should be aware that the complexity of configuration files and event handler scripts will grow exponentially as you add additional monitoring hosts to a multiple redundancy setup.

Since there are endless possibilities for implementing multiple redundancy methods, I won't try to discuss them here. If you decide to implement mixed redundancy methods on your network, be prepared to spend a lot of time analyzing your network structure, its critical failure points (i.e. routers, firewalls, etc.), the location of monitoring hosts, and what should happen at each monitoring host in the event of a problem. When implementing multiple redundancy methods you cannot simply create event handler scripts based on the state of routers, etc. - you must also take into account the state of other monitoring hosts on the local network segment and (possibly) on other segments.