Monday, 23 January 2012

Monitoring a connection via IP SLA, Tracking & Syslog

It has taken nearly a week, but I've finally managed to learn something new at work.  It encompasses a few bits though!  Here goes:

We have two connections at branch sites.  One heads back to the core network and the other is a connection to our organisation's version of the Internet.  This Internet connection is managed by a third party, and we have to log a call to prompt any action if we suspect it is at fault - we have no access to the router.

Recently it has been at fault; the hard part is proving it!

We statically route a long list of addresses out of the Internet connection on Cisco 1841 routers, rather than routing them internally.  It's more efficient to keep the connection back to the hub site for internal data, and the Internet connection is funded centrally, so why not use it for Internet traffic?

The problem seems to be that every 'now and then' user computers will 'hang' for a few minutes, or substantially longer, when they are accessing services through this connection.  Our desktop engineers have ruled out the PCs - and I was called in.  The symptoms seem familiar: like a home user accessing a web page with some inbuilt Java or scripting when their connection dies, the PC sits there with an egg-timer until the router/connection is restored, and then off they go again.

The router and switches that we can manage look fine.  No error messages of any sort and the interface counters seem relatively solid. 

As we use static routing defined like so...:

ip route
ip route

... the defined route only checks the interface and line status of the link to the device known as - which is the 3rd party router connected via a crossover cable next to it.  If the crossover breaks, or both interfaces go down, the static route is withdrawn and traffic falls back to the default route listed above.  It's clear that this isn't happening: the interface counters and syslog don't show the link to have dropped, and Internet traffic hasn't been routed to our core firewalls, so the issue must be on the wider 3rd party connection.  But how to prove that...

We have IOS 15.1(3)T - so I can't vouch for older IOS revisions in terms of the commands used here, just for info.

Find an I.P
First I found a 'pingable' I.P address of one of the systems accessed over the connection.  It's surprising how many of the services we use wouldn't respond, but luckily the most important system allowed ICMP echoes.  You may question how pinging one I.P address will confirm the connection is down - and I'd agree with you that it doesn't.  But... we do access this system from 100+ other sites and it has an uptime of 99.9% so, in my book, it's as solid an address as I can find.

Configure IP SLA
Configuring the IP SLA manually lets you set your own thresholds, frequency and timeout values.  But I did actually stumble across the SolarWinds IP SLA monitor.  It's free, and therefore it's not a polished product that can be tweaked to the nth degree, but it works OK for the task.  Ensure SNMP write access is configured on the device and that you know the community string, otherwise you'll get zero results!
After configuring the SLA through the application, I did have to amend the configuration in the CLI to tell the SLA to source its pings from the interface facing the Internet connection.
It ended up looking like this:

ip sla 1
 icmp-echo source-interface FastEthernet0/1
 owner SW_IpSla_FreeTool_test
 frequency 30
ip sla schedule 1 life forever start-time now

I like how the application has given itself an owner tag.  In any case, the little desktop application shows a green light - so the SLA is a success at the minute.  Running show ip sla statistics confirms that.  The frequency field (how often the ping occurs, in seconds) is set by the application (boo), and if you amend it the application can no longer monitor the SLA.
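To get a feel for what a 30-second frequency means for catching an outage, here is a rough Python sketch.  It is only a model: it assumes probes fire at exact multiples of the frequency and that a probe fails only if it is sent while the link is down.  The function name is made up for illustration.

```python
import math


def probes_during_outage(outage_start, outage_length, frequency=30):
    """Count the probes sent while the link is down.

    Assumption: probes fire at t = 0, frequency, 2 * frequency, ... seconds,
    and a probe fails only if it is sent inside the outage window
    [outage_start, outage_start + outage_length).
    """
    first = math.ceil(outage_start / frequency)
    last = math.ceil((outage_start + outage_length) / frequency) - 1
    return max(0, last - first + 1)


short_blip = probes_during_outage(5, 20)    # down from t=5s to t=25s
long_hang = probes_during_outage(5, 120)    # down from t=5s to t=125s
print(short_blip, long_hang)
```

Under those assumptions a 20-second blip that falls between two probes is never seen, while a two-minute hang like the ones our users report spans four probes, so 30 seconds should be fine for this job.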

Configure a Track
I then set up a track on this object so that we could assess its reachability state and base some actions on it.  The track has to be the easiest series of commands:

track 1 ip sla 1

show track gives the state, the number of changes, the time of the last change and so on.

Configuring Event Manager
Rather than looking through the router logs after logging in, we'd like events to be referenced in Syslog so we can view them through our network management software.

Event manager is the way to do this.  You can set actions (such as log to syslog with a message) based on the state changes of our Tracked object.  In turn, we can build a log of how often the internet connection appears to drop.  We can compare this log with another site router (performing all the same tricks as this one) to approach the 3rd party with something comparable - doing their job for them ideally.
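As a sketch of the comparison between two sites, the Python below takes a list of outages per site and returns the periods when both were down at once.  Outages that line up across sites would point away from either local router, though that is my reading rather than proof.  The function name and the example times are made up for illustration; they are not real outages.

```python
from datetime import datetime


def overlapping_outages(site_a, site_b):
    """Return the periods during which both sites were down at once.

    Each argument is a list of (start, end) datetime tuples.
    """
    overlaps = []
    for a_start, a_end in site_a:
        for b_start, b_end in site_b:
            start = max(a_start, b_start)
            end = min(a_end, b_end)
            if start < end:
                overlaps.append((start, end))
    return overlaps


# Made-up example data for two hypothetical sites.
site_a = [(datetime(2012, 1, 23, 9, 0), datetime(2012, 1, 23, 9, 4))]
site_b = [
    (datetime(2012, 1, 23, 9, 1), datetime(2012, 1, 23, 9, 6)),
    (datetime(2012, 1, 23, 14, 0), datetime(2012, 1, 23, 14, 2)),
]
common = overlapping_outages(site_a, site_b)
print(common)
```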

Here are the commands to configure event manager:

event manager applet internet-down
 event tag pingdown track 1 state down
 trigger
  correlate event pingdown
  attribute tag pingdown occurs 1
 action 1 syslog msg "*********Internet DOWN**********"

event manager applet internet-up
 event tag pingup track 1 state up
 trigger
  correlate event pingup
 action 1 syslog msg "*********Internet UP**********"

We have two actions based on the 'state' of the tracked object.  One logs the internet as down and the other as being up.  It's important to log when it comes back up, so we know how long the outage lasted.
You may also notice a couple of redundant commands - such as the trigger and correlate event commands.  Originally during testing I created two event tags for each applet.  This allowed two IP addresses to be pinged (I had two IP SLAs and two Tracks) and the correlate would take both into account - basing it on more than one I.P address is just a slightly more accurate way to say the connection was down/up.  I subsequently removed IP SLA 2 and Track 2, but left the trigger and correlate there for you to see what is possible with event manager.
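Once the DOWN and UP messages are in syslog, working out how long each outage lasted is a matter of pairing them up.  Here is a minimal Python sketch, assuming your syslog server can hand you each message with a timestamp (parsing the raw log lines is left out, as the format depends on your setup).  The function name and the example timestamps are made up.

```python
from datetime import datetime, timedelta


def outage_durations(events):
    """Pair each 'Internet DOWN' message with the next 'Internet UP'.

    events: list of (timestamp, message) tuples in time order.
    Returns a list of (down_time, duration) tuples.
    """
    outages = []
    down_at = None
    for when, message in events:
        if "Internet DOWN" in message and down_at is None:
            down_at = when
        elif "Internet UP" in message and down_at is not None:
            outages.append((down_at, when - down_at))
            down_at = None
    return outages


# Made-up example events using the message text from the applets above.
events = [
    (datetime(2012, 1, 23, 8, 0, 0), "*********Internet UP**********"),
    (datetime(2012, 1, 23, 9, 0, 30), "*********Internet DOWN**********"),
    (datetime(2012, 1, 23, 9, 4, 0), "*********Internet UP**********"),
]
result = outage_durations(events)
print(result)
```

An UP message with no DOWN before it (such as the first one after the track is configured) is simply ignored, and a DOWN with no UP yet is left out until the connection comes back.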

And there we have it.  A free desktop application keeps an eye on our IP SLAs, and the router logs each state change to syslog.  If we have to go a step further in the future, it would be to tweak the SLA to suit our parameters and then use PBR (policy based routing) to base the routing decisions on the reachability of some service I.P addresses (or other SLA probes).

But for the time being, management just want a log of how often the problem occurs, so in a month or so, I hope we'll have some results - or not... the poor users will be the ones affected if I do have something to show!
