Tuesday, 31 January 2012

IPv6 Abbreviations - Abbreviating rules

Just as I thought I was mastering IPv6 understanding, the result is FAIL.
I got the removal of 0's the wrong way round.  Take note, unlike I did.

"When abbreviating an IPv6 address
you can remove the leading 0's from within a single octet. 
You cannot remove the trailing 0's.
:: can be used only once in the address
to collapse octets of 0's."

2233:0000:2222:0011:0000:0000:0000:0001
can become:
2233:0:2222:11::1

Friday, 27 January 2012

IP GRE Tunnel + PBR (Policy Based Routing)


A little project that has taken up a few days at work recently was: How to route traffic on a remote network out of an internet connection on a local network?  The closest idea that could be invented within the department was to turn off the remote networks internet connection (it is 1/10th the capacity of the local one in question) and let the routing tables take over.
I thought there was a better way and suggested PBR as the answer.  This lead to another network adminstrator and myself drawing up how it would work.  Cue whiteboard and a bit of head scratching.  We only wanted certain traffic to NOT use the remote internet connection - even though it was slower than the local one, it was still perfectly servicable.

I might use local and remote in the wrong context here, but the LOCAL internet connection is the one I'm sat next to and the REMOTE network is at the other side of the city - just for reference!

PBR
The idea to use PBR is to detect traffic on the remote network (matching them by destination and source) and policy routing them to the local internet router.  Whilst this sounds a solid enough idea, PBR relies on the next hop being locally connected (which it definately isn't in our case - the traffic must traverse a core network, approximately 4 hops or so).  Also the core network uses HP Layer 3 switches and this model is not capable of PBR (or a great deal else... but that is another story), so its not like we can simply keep PBR'ing all the way to the local network!  PBR would play its part in matching the traffic and assigning a next hop value, we just needed to create some form of connection that appeared local between the two sites to then assign PBR to it.

IP Tunnel
What we needed to do was create an IP tunnel from the remote network to the local one.  To accomplish this, we could have used a VPN, but encryption on the internal network was overkill, so a simple GRE IP Tunnel was investigated.  It stands for 'general routing encapsulation' and does the job we want of logically creating a point-to-point tunnel between two remote devices.  I must say that it is super easy to setup.  We setup the remote site router to run keepalives, as to not use the connection if the tunnel was down - a cheap form of disaster recovery you could say!  This forced the router to pass traffic through the normal routing table if the other end of the tunnel was offline.

Stuff I forgot about
Whilst setting up the PBR to state a next hop for the 'matched' traffic was easy and the tunnel was also easy to create, It still didn't work.
On the local router the only configuration commands required were that of the B-end of the tunnel.  Nothing more.  It was this oversight that proved to be the problem.
Of course to traverse the router to access the internet we use NAT.  You guessed it already, I didn't add the tunnel interface to be an inside NAT interface - lesson 1.
Lesson 2 was also NAT related.  We use ACL's to determine which traffic can access the internet - the remote site's I.P range was not specified and that concludes our lessons for today.

Learnt today?
All of it really.  I did have an idea about PBR already but didn't know it was only local gateways that could be used as next hops.  I even thought the recursive command would be useful but it turns out that the router performs a recursive lookup for the remote gateway and routes the packet TOWARDS it, rather than encapsulating it etc - this obviously was still an issue with the HP switches.
I also learnt the GRE tunnel isn't intelligent.  The remote ends make no difference to routing decisions if it is down and thus creating a routing black hole.  I though about using an IP Sla and tracking to check the remote interface and base the PBR on that, but keepalives did the job on the remote router where the traffic originates - this keepalive actually affects whether PBR is used or not - news to me.
Also, did you know that when using PBR you are actually process switching?  unless you state (at interface level) - ip route-cache policy.
PBR also was a bit confusing when using a loopback interface to test the tunnel - it didn't work correctly and thus this is because router generated traffic is not influenced by PBR unless you turn it on globally using - ip local policy route-map *map-tag*.

The most important thing learnt today was that working with others has its benefits when drafting up solutions like this.  Brain-storming and bouncing ideas of each other is a form of learning, even if the outcome isn't as expected! 

Monday, 23 January 2012

Monitoring a connection via IP Sla, Tracking & Syslog

It's been nearly a week before I managed to learn something new at work.  It encompasses a few bits though!  Here goes:

We have 2 connections at branch sites.  One connection heads back to the core network and the other is a connection to our organisations version of the Internet.  This Internet connection is managed by a third party and we have to log calls to prompt any action if it is suspected it is at fault - we have no access to the router.

Recently it has been at fault, its just proving it!

We statically route a long list of addresses out of the Internet connection on Cisco 1841 routers, rather than them being routed internally.  Its more efficient to use the connection back to the hub site for internal data and the internet connection is funded centrally, so why not use it for itnernet traffic?

The problem seems to be that every 'now and then' user computers will 'hang' for a period of a few minutes, or substantially longer when they are accessing services through this connection.  Our desktop engineers have ruled out the PC's - and I was called in.  The syptoms seem familiar to a home user accessing a web-page with some inbuilt java or scripting, then their connection dies and thus the PC sits there with an egg-timer until the router/connection is restored and off they go again. 

The router and switches that we can manage look fine.  No error messages of any sort and the interface counters seem relatively solid. 

As we use static routing defined like so...:

ip route 0.0.0.0 0.0.0.0 192.168.242.22
ip route 20.130.152.0 255.255.255.0 10.128.99.32

... The defined route will only check its interface and line status to the device known as 10.128.99.32 - which is the 3rd party router connected via a crossover cable next to it.  If the crossover breaks, or both interfaces go down, then the route will be changed to the default route listed above.  Its clear that this isn't happening, the interface counters and syslog don't show it to have dropped and internet traffic hasn't been routed to our core firewalls, so the issue must be on the wider 3rd party connection.  but how to prove that...

We have IOS 15.1(3)T - so I can't vouch for older IOS revisions in terms of the commands used here, just for info.

Find an I.P
First I found a 'pingable' I.P address of one of the systems accessed over the connection.  Its suprising how many of the services we use wouldn't respond, but luckily the most important system allowed ICMP echo's.  You may question how pinging one I.P address will confirm the connection is down - and I'd agree with you that it doesn't.  But... we do access this system from 100+ other sites and it has an up time of 99.9% so, in my book, its as solid an address as I can find.

Configure IP SLA
Configuring the IP SLA manually will allow you to set definable thresholds, frequency and timeout values.  But I did actually stumble across Solar Winds IP SLA monitor.  Its free and therefore its not a polished product that can be tweaked to the endth degree, but it works ok for the task.  Ensure you have write snmp details for the device and that you know the string, otherwise you'll get zero results!
After configuring the SLA through the application, I did have to amend the configuration in the CLI, just to tell the SLA that it should use the interface connected to the internet connection, to do the pinging!
It ended up looking like this:

ip sla 1
 icmp-echo 20.130.152.151 source-interface FastEthernet0/1
 owner SW_IpSla_FreeTool_test
 frequency 30
ip sla schedule 1 life forever start-time now

I like how the application has given itself an owner tag.  In any case, the little desktop application shows a green light - so the SLA is a success at the minute.  Also running show ip sla statistics, proves that to be the case.  The frequency field is defined by the application (boo) and amending it doesn't allow the SLA to be monitored through the application (frequency is how often the ping occurs, in seconds).

Configure a Track
I then setup a track on this object so that we could assess its reachability state and base some actions on it.  The track has to be the easiest series of commands:

track 1 ip sla 1

Show track, gives the state, changes and time of last change etc.

Configuring Event Manager
Rather than looking through the router logs after logging in, we'd like events to be referenced in Syslog so we can view them through our network management software.

Event manager is the way to do this.  You can set actions (such as log to syslog with a message) based on the state changes of our Tracked object.  In turn, we can build a log of how often the internet connection appears to drop.  We can compare this log with another site router (performing all the same tricks as this one) to approach the 3rd party with something comparable - doing their job for them ideally.

Here are the commands to configure event manager:

event manager applet internet-down
 event tag pingdown track 1 state down
 trigger
  correlate event pingdown
  attribute tag pingdown occurs 1
 action 1 syslog msg "*********Internet DOWN**********"


event manager applet internet-up
 event tag pingup track 1 state up
 trigger
  correlate event pingup
 action 1 syslog msg "*********Internet UP**********"

We have two actions based on the 'state' of the tracked object.  One logs the internet as down and the other as being up.  Its important to log when it comes back up, so we know how long the outage was for.
You may also notice a couple of irrelevant commands - such as the trigger and correlate events commands.  Originally during testing I created two event tags for each applet.  This allowed two IP addresses to be pinged (I had two IP SLA's and two Tracks) and the correlate would take both into account - its just a slightly more accurate way to say the connection was down/up basing it on more than one I.P address.  I subsequently removed the IP SLA 2 and Track 2, but left the trigger and correlate there for you to see what is possible with event manager.

And there we have it.  A free desktop application keeps an eye on our IP Sla's as well as logged data recorded from the router.  If we have to go a step further in the future, it would be to tweak the SLA to suit our parameters and then base the routing decisions via PBR (policy based routing) on the reachability of some service I.P addresses (or other SLA probes). 

But for the time being, management just want a log of how often the problem occurs, so in a month or so, I hope we'll have some results - or not... the poor users will be the ones affected if I do have something to show!

Tuesday, 17 January 2012

SEPM CPU 90%? - its normal say Symantec

A trusty network administrator has just been on the phone with Symantec regarding a couple of SEPM's (Symantec Endpoint Protection Manager) servers running at 90% CPU load for no apparent reason.

We're using version 12.1 by the way.

Symantec say this is normal if replication is used (which it is).  You probably wont see this load with a stand-alone SEPM installation.
Symantec say that although resource is consumed, its will be released, should the server need any CPU load for other services or processing.  High CPU load witnessed is the SEPM's way of checking its database for consistency.  Seems like a lot of checking to me, but when the server was asked to perform another duty, it did so without exploding - so maybe Symantec are right, it's their product after all...

It would be the case for using the embedded database - which we are using, but Symantec also say that the SQL server (if one was used) would also be running high CPU as it follows the same principles.

CPU queries seem to be flavour of the month at our place!

Cisco IP Interface switching

Whilst studying some text on NAT, I kept noticing segements of text referring to Cisco switching mechanisms.

Here's a line of the text:
"On a NAT tranlsation table, an asterix means that the translation is occurring in the fast-switched path.  By default the first packet in a NAT translation will always be process-switched" - so whats all this fast/process switched business?

Cisco layer 3 devices have three switching modes, Process Switching, Fast Switching, and Cisco Express Forwarding switching.

If you see the configuration line "'no ip route-cache" on an interface the packets entering the interface will be process switched, which means the CPU will do the switching - a potential burden on a router.
Fast switching overcame the issue by route caching the first packet so that others in the same flow didn't have to hit the CPU - these routes were housed in hardware lookup tables that are independent of the CPU.
CEF (Cisco Express Forwarding) builds on fast switching by using FIB (forwarding information bases) and adjacency tables to quickly link packets/flows with routing table entries and adjacent devices.

This is one way of determining what each interface is doing, have a look for the commands:

*Nothing specified in the interface?*
CEF usually enabled by default - you may see ip cef listed earlier in your config
No ip route-cache
Process switched
ip route-cache
Fast switched

That seems to make a bit more sense.

Monday, 16 January 2012

Only one IP access-group allowed

A network manager needs to allow some inbound traffic from our version of the internet.  To complete this, we need to review our ACL on a border router.  The device uses CBAC or IP Inspect as it's known in the CLI to allow only internal traffic back through the router (traffic must be generated internally first to be allowed back through the stateful inspection list).  Usually these routers deny everything coming in on the internet facing interface - because there are some nasty folk out there.

Whilst it seems normal to create an ACL for the required ranges for the network manager, after I'd done the hard work in configuring the wildcard masks (they are odd, aren't they?) I then 'forgot' that you can only have one IP access-group either inbound or outbound on an interface.  I say forgot, but I actually forgot there already was one (which said deny to anything inbound) applied to the interface.  When I came to type ip access-group... I thought I'd check to see if there was one already. 

And to summarise, yes there was one.  And yes, the ACL I created was a waste.  Instead I had to append the IP ranges and wildcards to the already configured extended access-list - taking care to ensure the deny statement was at the end of list. 

So, only one IP access-group is allowed on an interface (one for each direction) - this in turn may affect how you setup your ACL's.

Authentication with EIGRP

I found that I learned nothing at the weekend!  Although we did celebrate 10 years of 'courting' (as my Nan would say) by going for a meal - Pigeon is not a very nice meat, but I guess I already knew that, so its not really anything to blog about...

Back to the text books today and thus I have learnt that EIGRP only uses MD5 authentication - only after getting the question wrong on a text exam :(

Friday, 13 January 2012

High CPU on Cisco 1841 Router - ARP or SNMP?

And so the learning continues today...

Another network manager has reported an issue with some legacy routers.  The CPU is over 80% and the SNMP engine seems to be partly responsible.  The manager has turned off the SNMP engine to prevent the router from melting - but unsure why the problem has occurred.

Another network administrator has already found some information that may help.  He discovered that the ARP tables on the 'busy' routers were large.  In fact they were extremely large for a remote site/branch router.  50,000 entries on one of them, is some serious amount of Arp requests.

Delving further into the problem it appears that the router is proxy-arp'ing everything.  Running a show arp summary (on slightly new IOS's that some of ours have) give some indication as to which interfaces the Arp'ing was being conducted.  In this case it was a Vlan (the router was using an etherswitch card) that connected the remote site to its main/hub site.  It was also the route for traffic that didn't have a specific route in the routing table - namely a default route (ip route 0.0.0.0 0.0.0.0 vlan128).  The arp table showed requests to sites on the internet as well as internal ranges.

The Cisco documentation regarding troubleshooting high CPU utilization does cover this aspect as highlighted here:

************************************

ARP Input

High CPU utilization in the Address Resolution Protocol (ARP) Input process occurs if the router has to originate an excessive number of ARP requests. The router uses ARP for all hosts, not just those on the local subnet, and ARP requests are sent out as broadcasts, which causes more CPU utilization on every host in the network. ARP requests for the same IP address are rate-limited to one request every two seconds, so an excessive number of ARP requests would have to originate for different IP addresses. This can happen if an IP route has been configured pointing to a broadcast interface. A most obvious example is a default route such as:
ip route 0.0.0.0 0.0.0.0 Fastethernet0/0
In this case, the router generates an ARP request for each IP address that is not reachable through more specific routes, which practically means that the router generates an ARP request for almost every address on the Internet. For more information about configuring next hop address for static routing, see Specifying a Next Hop IP Address for Static Routes.
Alternatively, an excessive amount of ARP requests can be caused by a malicious traffic stream which scans through locally attached subnets. An indication of such a stream would be the presence of a very high number of incomplete ARP entries in the ARP table. Since incoming IP packets that would trigger ARP requests would have to be processed, troubleshooting this problem would essentially be the same as troubleshooting high CPU utilization in the IP Input process.

************************************

So it appears that specifying the next hop as an interface may not be good practice.  Luckily we already have some knowledge of this and have implemented IP addresses of remote router interfaces at most sites (usually wise to base next hop on a reachable IP in-case you have more than one route to that I.P rather basing it on an interface state), but obviously some legacy configs are still around.

It must be that the SNMP engine was then processing all this ARP data and thus adding to the CPU load for an already busy router processing arp requests!

By changing the next hop interface to an I.P address, the router calmed and normal service was resumed - although there is still some monitoring to be done to see if this is a long term cure.  One suspects that turning proxy-arp off, is not only a good security lesson, but may help the router out in future.

SEP 12.1 RU1 - Pull or Push?

So... the first bit of information I learned today was at work. 

Another network administrator has informed me that the default client/server communication method in the RU1 update for Symantec Endpoint Protection version 12.1 (yes it is a mouthful - and a bitter tasting one, I'll explain one day) uses Pull mode rather than Push as default.
Apparently push mode means that client/server comm's is constant, so that any policy changes etc can be instantly 'pushed' to clients.  Not ideal in large networks, so they've changed that to pull, where changes are sent after the hearbeat cycle.

Now, I didn't know that, until now.