Linkd auto-generate path outage dependencies.

Description

[9:14] <biffhero> 190 emails from nagios in the space of 10 minutes. I understand how nagios forces you to do dependencies by hand, when you create the checks. But it would be amazing if ONMS could do it in a discovery thread. "host <foo> has MAC addy <bar>, which is showing up on switch <baz>, so there's a dependency there." "CDP tells me that switches go from <here> to <there>, so there's a dependency there."

Acceptance / Success Criteria

None

Activity

Ross McKerchar February 29, 2012 at 7:02 AM

Yep, I agree the pragmatic approach of automatically handling the simplest case (a single-homed node with an SNMP-enabled, OpenNMS-managed gateway) would be a great start, provided the limitations were clearly documented. It would also be trivial to code (I'm wondering if it might even be possible in SQL alone?).

On most modern flattened networks (if you ignore your switching layer) it would also solve most of the problem - modern network topology graphs are very top heavy: they have few branches relative to leaf nodes (your endpoints).

Ignoring layer2 maybe isn't a massive problem either: I would guess switch failure is a rarer event than losing a subnet due to a WAN link failure, for example. It also generates fewer notifications: you likely have <50 nodes behind a switch but hundreds behind a router.
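
A rough sketch of what the "SQL alone" idea for the single-homed case might look like, run here against an in-memory SQLite database from Python. The table and column names below are simplified stand-ins modelled on ipinterface and iprouteinterface, and are assumptions rather than the real OpenNMS schema. The routetype values follow ipRouteType from RFC 1213 (3 = direct, 4 = indirect).

```python
import sqlite3

# Simplified, hypothetical stand-ins for the OpenNMS tables (assumed columns).
SCHEMA = """
CREATE TABLE ipinterface (nodeid INTEGER, ipaddr TEXT);
CREATE TABLE iprouteinterface (
    nodeid INTEGER, routedest TEXT, routemask TEXT,
    routenexthop TEXT, routetype INTEGER
);
"""

# For every node with exactly one indirect (routetype = 4) route, propose that
# route's next hop as the critical path IP, but only when the next hop is an
# IP address registered on some other managed node.
QUERY = """
SELECT r.nodeid, r.routenexthop AS criticalpathip, gw.nodeid AS gatewaynodeid
FROM iprouteinterface r
JOIN ipinterface gw ON gw.ipaddr = r.routenexthop
WHERE r.routetype = 4
  AND gw.nodeid <> r.nodeid
  AND (SELECT COUNT(*) FROM iprouteinterface x
       WHERE x.nodeid = r.nodeid AND x.routetype = 4) = 1
ORDER BY r.nodeid
"""


def propose_path_outages(conn):
    """Return (nodeid, criticalpathip, gatewaynodeid) rows for single-homed nodes."""
    return conn.execute(QUERY).fetchall()
```

Nodes with several indirect routes, or whose next hop is not a managed IP, simply produce no row, which matches the "document the limitations" approach.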

> Rob: b) iii) Yes. Can ONMS use traceroute?

I don't think OpenNMS captures topology information via traceroute, no (but I may be wrong).

Alexander Hoogerhuis February 28, 2012 at 12:23 PM

A quick comment from the gallery: for the first case, if you boil it down to specifically handle this scenario:

IFF the node has a single route that isn't an interface/link route AND there is a newly created outage on one or more IPs of the node owning the next-hop IP from the routing table, then suppressing the end node's outage would be right in pretty much any sane case.

This would take care of most endpoint equipment, which in a typical network would eliminate all kinds of notifications.
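
The rule above could be sketched roughly as follows. The `Route` class, the `ip_to_node` mapping, and the `nodes_with_new_outage` set are hypothetical names invented for this illustration; they are not OpenNMS APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    dest: str        # route destination, e.g. "0.0.0.0"
    next_hop: str    # next-hop IP address
    direct: bool     # True for interface/link (directly connected) routes


def should_suppress(routes, ip_to_node, nodes_with_new_outage):
    """Decide whether an end node's outage should be suppressed.

    routes: the end node's routing table as Route objects.
    ip_to_node: maps every managed IP address to the node that owns it.
    nodes_with_new_outage: nodes that currently have a newly created outage.
    """
    indirect = [r for r in routes if not r.direct]
    if len(indirect) != 1:
        # Not single-homed (or no gateway at all): the rule does not apply.
        return False
    gateway_node = ip_to_node.get(indirect[0].next_hop)
    # Suppress only if the next hop belongs to a managed node that is down.
    return gateway_node is not None and gateway_node in nodes_with_new_outage
```

For a host with one connected route and one default route via 10.0.0.1, the function returns True only while the node owning 10.0.0.1 has a new outage.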

-A

Not Rob Walker February 28, 2012 at 12:07 PM

Ross, you are right, and you have looked further into it than I have.

a) I think all IPs are registered under the single node. Would it be bad to code for this assumption, as long as the assumption was called out loud and clear?

b) i) My point about default route to the upstream node was certainly only for single-homed hosts. But I think it's a part of the puzzle, and one which would get us started down the path.

b) ii) Switches were what got me started on this path, as I am also implementing racktables, and they have the concept of keeping track of the individual cables between the ports of each device in the data center. They pick up the ARP tables from the switches, and also get the MAC address from each physical interface on the servers. They don't auto-link them, but that shouldn't be that difficult for ONMS.

b) iii) Yes. Can ONMS use traceroute?
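
One possible sketch of the auto-linking step from b) ii), assuming we already have, per switch, a table of which MAC address was seen on which port, and the MAC addresses of each server's interfaces. The data shapes and function names are hypothetical, and the "fewest MACs on the port" tie-breaker is an assumption of this sketch (uplink ports tend to see many MACs, edge ports few), not something racktables or OpenNMS is known to do.

```python
from collections import Counter


def normalize_mac(mac):
    """Lowercase and drop separators so differently formatted MACs compare equal."""
    return "".join(c for c in mac.lower() if c in "0123456789abcdef")


def link_hosts_to_ports(switch_tables, host_macs):
    """Guess the (switch, port) each host is plugged into.

    switch_tables: {switch_name: {mac: port}} as learned by each switch.
    host_macs: {host_name: [mac, ...]} collected from the hosts themselves.
    Returns {host_name: (switch_name, port)} for hosts that could be matched.
    """
    macs_per_port = Counter()
    seen_at = {}  # normalized MAC -> list of (switch, port) where it was seen
    for switch, table in switch_tables.items():
        for mac, port in table.items():
            key = (switch, port)
            macs_per_port[key] += 1
            seen_at.setdefault(normalize_mac(mac), []).append(key)

    links = {}
    for host, macs in host_macs.items():
        candidates = [k for m in macs for k in seen_at.get(normalize_mac(m), [])]
        if candidates:
            # Prefer the port that sees the fewest MACs (most likely an edge port).
            links[host] = min(candidates, key=lambda k: (macs_per_port[k], k))
    return links
```

Each resulting link is a candidate dependency of the form "host <foo> depends on switch <baz>".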

Thanks,
Rob

Ross McKerchar February 28, 2012 at 6:44 AM

Hi Rob,

I agree that from inspecting the iprouteinterface table, populated by linkd, it feels like the pathoutage stuff could be substantially automated but it's not super simple. I investigated doing some clever stuff myself but stumbled across these problems:

a) Routenexthop != "pathoutage ip". Routenexthop is from the node's perspective - the default gateway in most cases. OpenNMS sees the other side of the device, which has a different IP. If OpenNMS has SNMP access to the device, I think it can handle this, as all the IPs are registered under a single node, but if this isn't the case then the pathoutage won't work.

b) To build a truly comprehensive picture you have to:

i) Remember that simply looking at the default route to find the upstream node is only accurate for single-homed nodes, not routers. You really want to be looking at the routedest & routemask to figure out the upstream node from OpenNMS's perspective.

ii) Consider switches: Ideally we want OpenNMS to include layer2 so that you nicely handle switch failure. Of course, layer3 routing info isn't going to give you switch topology (but OpenNMS does already "understand" layer2 and stores it in the datalinkinterface table, as far as I can see).

iii) Fail gracefully. You might not have SNMP access to everything - say if you outsource your WAN. A good strategy will consider these dark spots and do the right thing based on ICMP info only. Luckily, path outages are always going to be a best-effort thing: it's not a disaster when they fail - 90% correctness can reduce 90% of notification spam. I'd be happy with that!
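
To illustrate point b) i), here is a minimal sketch of picking the route that covers a target IP by longest-prefix match over (routedest, routemask, routenexthop) triples. The function name and the tuple layout are assumptions made for this example; it also assumes IPv4 and contiguous netmasks. Returning None when nothing matches is one simple way to "fail gracefully" for dark spots.

```python
import ipaddress


def best_next_hop(routes, target_ip):
    """Return the next hop of the most specific route covering target_ip.

    routes: iterable of (routedest, routemask, routenexthop) strings.
    Returns None when no route matches (e.g. no SNMP data for this device).
    """
    target = ipaddress.ip_address(target_ip)
    best_len, best_hop = -1, None
    for dest, mask, next_hop in routes:
        # Convert a dotted netmask to a prefix length (assumes a contiguous mask).
        prefix_len = bin(int(ipaddress.ip_address(mask))).count("1")
        network = ipaddress.ip_network(f"{dest}/{prefix_len}", strict=False)
        if target in network and prefix_len > best_len:
            best_len, best_hop = prefix_len, next_hop
    return best_hop
```

Applied to the routing table of the router nearest to OpenNMS, this gives the next device on the path towards the target, which can then be resolved to a node through its registered IPs.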

Having a fully discovered topology, stored as a graph, would also really help with doing some clever stuff with maps (whilst I agree maps aren't all that, no matter how hard I try to explain this, half my team still continues to request them).

BTW: I would be interested in contributing to a PoC but I would need help deciphering AbstractDefaultRouteProcessingFactoryInstantiator() and all the other java-isms (yes, I did make that name up).

-ross

Not Rob Walker February 24, 2012 at 11:21 PM

I just realized that linkd can already look at routing tables to build a picture of the network. This code could therefore grab each host's default gateway from that data: if the router is down and the host isn't reachable, the host might not actually be down, it just can't be reached.

If the user doesn't have linkd running, iso.3.6.1.2.1.4.21.1.7.0.0.0.0 could show us the default gateway.

snmpwalk -Os -c public -v 1 localhost iso.3.6.1.2.1.4.21.1.7.0.0.0.0|awk '{print $4}' FTW.
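
For anyone wanting to do the same from a script, here is a small sketch that mirrors the awk step: it takes the text output of the snmpwalk command above and returns the fourth whitespace-separated field if it is a valid IP address. The function name is made up for this example, and the sample line format in the docstring is an assumption about what snmpwalk prints here.

```python
import ipaddress


def default_gateway_from_snmpwalk(output):
    """Extract the default gateway from snmpwalk text output.

    Expects lines shaped like (assumed format):
        iso.3.6.1.2.1.4.21.1.7.0.0.0.0 = IpAddress: 10.0.0.1
    Returns the IP address as a string, or None if no usable line is found.
    """
    for line in output.splitlines():
        fields = line.split()
        # Same field awk's $4 would pick, with a sanity check on the "=" sign.
        if len(fields) >= 4 and fields[1] == "=":
            try:
                return str(ipaddress.ip_address(fields[3]))
            except ValueError:
                continue  # not an IP address, keep looking
    return None
```

Error output such as a timeout message yields None rather than a bogus gateway.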

Details

Created November 30, 2010 at 12:17 PM
Updated September 21, 2021 at 9:15 PM