snmpd discovered but not collected against

Description

OpenNMS 1.6.0 from source on FreeBSD 7.0 with net-snmp-5.4.2.1 installed from packages.

On nodes running the OpenBSD 4.4 native SNMP daemon, SNMP is discovered as a service but not collected against. This bug is a summary of the mailing list discussion archived here:
http://thread.gmane.org/gmane.network.opennms.general/27570/

I chose the Discovery/Capsd component because the root problem seems to be that the sysOID is not recorded in the node database table.

SNMP on this node (10.2.0.18) is polled as a service by OpenNMS, but no SNMP attributes are shown on the node's page and no collection is performed against it. I have 80+ other nodes working fine with various SNMP implementations, so it appears to be specific to this one.

From the OpenNMS server, I can:

Retrieve the node's sysOID (.1.3.6.1.4.1.30155.23.1)
Walk the node's system OID tree
Walk the node's ifTable OID tree
Walk the node's ipAddrTable OID tree
Match the two IP addresses from ipAddrTable to interfaces in ifTable

I see nothing out of the ordinary in a tcpdump of the SNMP conversation, and multiple SNMP based checks (mostly org.opennms.netmgt.poller.monitors.HostResourceSwRunMonitor) are running correctly against this node. There is little useful information in my logs. The only relevant entry I've found is the following from collectd.log:

–
2008-12-15 15:17:50,968 DEBUG [CollectdScheduler-50 Pool-fiber0]
Collectd: scheduleInterface: Unable to schedule
[OnmsIpInterface@2beffd ipaddr = '10.2.0.18', ifindex = 4, iphostname
= 'buny1', ismanaged = 'M', ipstatus = 1, iplastcapsdpoll = 2008-12-15
15:11:20.95, issnmpprimary = P]/SNMP, reason: System Object ID for
interface 10.2.0.18 does not exist in the database.
java.lang.RuntimeException: System Object ID for interface 10.2.0.18
does not exist in the database.
at org.opennms.netmgt.collectd.DefaultCollectionAgent.validateSysObjId(DefaultCollectionAgent.java:173)
at org.opennms.netmgt.collectd.DefaultCollectionAgent.validateAgent(DefaultCollectionAgent.java:222)
at org.opennms.netmgt.collectd.SnmpCollector.initialize(SnmpCollector.java:336)
at org.opennms.netmgt.collectd.CollectionSpecification.initialize(CollectionSpecification.java:205)
at org.opennms.netmgt.collectd.CollectableService.<init>(CollectableService.java:142)
at org.opennms.netmgt.collectd.Collectd.scheduleInterface(Collectd.java:475)
at org.opennms.netmgt.collectd.Collectd.scheduleInterfacesWithService(Collectd.java:355)
at org.opennms.netmgt.collectd.Collectd.access$300(Collectd.java:102)
at org.opennms.netmgt.collectd.Collectd$2.doInTransaction(Collectd.java:336)
at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:127)
at org.opennms.netmgt.collectd.Collectd.scheduleExistingInterfaces(Collectd.java:330)
at org.opennms.netmgt.collectd.Collectd.access$000(Collectd.java:102)
at org.opennms.netmgt.collectd.Collectd$1.run(Collectd.java:259)
at org.opennms.netmgt.scheduler.LegacyScheduler$1.run(LegacyScheduler.java:292)
at org.opennms.core.concurrent.RunnableConsumerThreadPool$FiberThreadImpl.run(RunnableConsumerThreadPool.java:422)
at java.lang.Thread.run(Thread.java:619)
–

I can confirm that the sysoid is not entered into the database, but I've not been able to track down why. Note that the nodeid reported here differs from the mailing list conversation because I deleted and rediscovered the node while testing.

–
opennms=# select nodesysoid from node where nodeid = 113;
nodesysoid
------------
(1 row)
opennms=#
–

Please let me know if I can provide any additional information.

-HKS

Environment

Operating System: All Platform: All

Acceptance / Success Criteria

None

Attachments

Lucidchart Diagrams

Activity

Show:

hks.private January 8, 2009 at 5:22 PM

Thanks for all that work. I've submitted a bug to the OpenBSD team, and we'll see where it goes.

-HKS

Jeff Gehlbach January 2, 2009 at 4:03 PM

I Installed my own OpenBSD 4.4/i386 in a VM and did some testing, verified that this is in fact a bug in the OpenBSD snmpd. The immediate problem is that this agent does not properly handle GETNEXT requests (whether SNMPv1 or SNMPv2C) that target a scalar object. For instance, if I make the following request for sysDescr to the OpenBSD agent (using snmpget from the Net-SNMP tools):

$ snmpgetnext -On -v1 -c public 172.16.56.137 .1.3.6.1.2.1.1.1

I get this output:

.1.3.6.1.2.1.1.2.0 = OID: .1.3.6.1.4.1.30155.23.1

Which is wrong, because the agent gave me back sysObjectID.0. It should have instead given me the scalar instance sysDescr.0 (.1.3.6.1.2.1.1.1.0), which is what lexically follows sysDescr (.1.3.6.1.2.1.1.1).

You can see this problem in effect in the capsd.log that you provided as attachment 776:

2008-12-30 11:54:19,440 INFO [Capsd Rescan Pool-fiber2] Snmp4JWalker: Walking system/ifTable/ifXTable/ipAddrTable for /10.2.0.18 using version SNMPv1 with config: AgentConfig[Address: /10.2.0.18, Port: 161, Community: tempro, Timeout: 3000, Retries: 3, MaxVarsPerPdu: 10, MaxRepititions: 2, Max request size: 65535, Version: 1, ProxyForAddress: null]
2008-12-30 11:54:19,440 DEBUG [Capsd Rescan Pool-fiber2] SingleInstanceTracker: Requesting oid following: .1.3.6.1.2.1.1.1
2008-12-30 11:54:19,440 DEBUG [Capsd Rescan Pool-fiber2] SingleInstanceTracker: Requesting oid following: .1.3.6.1.2.1.1.2
2008-12-30 11:54:19,440 DEBUG [Capsd Rescan Pool-fiber2] SingleInstanceTracker: Requesting oid following: .1.3.6.1.2.1.1.3
2008-12-30 11:54:19,440 DEBUG [Capsd Rescan Pool-fiber2] SingleInstanceTracker: Requesting oid following: .1.3.6.1.2.1.1.4
2008-12-30 11:54:19,440 DEBUG [Capsd Rescan Pool-fiber2] SingleInstanceTracker: Requesting oid following: .1.3.6.1.2.1.1.5
2008-12-30 11:54:19,440 DEBUG [Capsd Rescan Pool-fiber2] SingleInstanceTracker: Requesting oid following: .1.3.6.1.2.1.1.6
2008-12-30 11:54:19,442 DEBUG [Capsd Rescan Pool-fiber2] Snmp4JWalker: Sending tracker pdu of size 10
2008-12-30 11:54:19,452 DEBUG [DefaultUDPTransportMapping_10.123.0.19/0] Snmp4JWalker: Received a tracker pdu of type RESPONSE from /10.2.0.18 of size 10 errorStatus = Success errorIndex = 0
2008-12-30 11:54:19,452 DEBUG [DefaultUDPTransportMapping_10.123.0.19/0] SingleInstanceTracker: Processing varBind: .1.3.6.1.2.1.1.2.0 = .1.3.6.1.4.1.30155.23.1
2008-12-30 11:54:19,452 DEBUG [DefaultUDPTransportMapping_10.123.0.19/0] SingleInstanceTracker: Processing varBind: .1.3.6.1.2.1.1.3.0 = 70035
2008-12-30 11:54:19,453 DEBUG [DefaultUDPTransportMapping_10.123.0.19/0] SingleInstanceTracker: Processing varBind: .1.3.6.1.2.1.1.4.0 = root@munged.com
2008-12-30 11:54:19,453 DEBUG [DefaultUDPTransportMapping_10.123.0.19/0] SingleInstanceTracker: Processing varBind: .1.3.6.1.2.1.1.5.0 = buny1.munged.local
2008-12-30 11:54:19,453 DEBUG [DefaultUDPTransportMapping_10.123.0.19/0] SingleInstanceTracker: Processing varBind: .1.3.6.1.2.1.1.6.0 = NY
2008-12-30 11:54:19,453 DEBUG [DefaultUDPTransportMapping_10.123.0.19/0] SingleInstanceTracker: Processing varBind: .1.3.6.1.2.1.1.7.0 = 74

In short, the object identifiers in the response varbinds are shifted lexically forward by one. The Capsd code that processes that response assumes that the response varbinds are in the same order as the request varbinds, meaning that the first varbind should be sysDescr.0, the second sysObjectID.0, the third sysUpTime.0, and so on. In actual fact, the first response varbind from this agent is sysObjectID.0, the second sysUpTime.0, and so on. The value in the second slot (where sysObjectID.0 is expected to be) cannot be successfully parsed as an OID, so Capsd stores a null value for the system's sysObjectID (node.nodesysoid in the database). The SNMP collector depends on that value in order to know what to collect, so when it comes time to collect data from this node via SNMP, the collector throws up its hands and moves along.

Further investigation shows that this problem is not strictly confined to the agent's implementation of the GETNEXT protocol operation. Doing a GET against a scalar OID (e.g. sysObjectID) with no instance identifier specified incorrectly results in the agent seemingly "interpolating" the scalar (zero) instance identifier:

$ snmpget -On -v1 -c public 172.16.56.137 .1.3.6.1.2.1.1.2
.1.3.6.1.2.1.1.2.0 = OID: .1.3.6.1.4.1.30155.23.1

The agent should properly reply to this request with errorStatus set to noSuchName and errorIndex set to the index of the offending request varbind.

Given that information, I dug into the code and traced the issue as far as function mps_getnextreq in usr.sbin/snmpd/mps.c. I started to chase it deeper into mps_getint in the same file and thence into ber_add_integer in usr.sbin/snmpd/ber.c, but I don't have time to continue the hunt. I suspect that there's a logic flaw somewhere in the code that traverses the ASN.1 tree that's manifesting as this bug.

I'm moving this bug to the "SNMP / 3rd Party Agents" component and changing Hardware and OS to "All" since the latter two fields refer to the platform where openNMS is running. Also resolving the openNMS bug as WONTFIX since it's totally impractical to attempt a workaround for this problem. The collector code in particular relies heavily on GETNEXT for performance in v1 mode, and in v2c and v3 mode uses GETBULK (which is implemented internally to most agents by means of locally-targeted GETNEXTs).

Reyk or any of the other OpenBSD committers are welcome to contact me about this issue.

Jeff Gehlbach January 2, 2009 at 11:53 AM

Are you on IRC, either habitually or currently?

Thanks for the info on the OpenBSD SNMP daemon, that's quite intriguing – wonder if it will get ported the way OpenSSH has been... it sure would be nice to have a viable alternative to Net-SNMP.

Looking through the log and PCAP now.

hks.private December 30, 2008 at 1:34 PM

I have attached the requested information. Further comments inline.

(In reply to comment #2)
> Also please attach snmp4j-internal.log for the time period covering a rescan of
> this node. If you can attach a binary PCAP of the rescan, that's even better.
> On the openNMS server:
>
> sudo tcpdump -i xl0 -s 0 -w /tmp/obsd-rescan.pcap -v 'host 10.2.0.18 and port
> 161'
>
> This is likely a bug in the SNMP agent (I don't think OpenBSD has a "native"
> SNMP agent, but Net-SNMP 5.4.1.1 is available in the main distribution of
> OpenBSD 4.4) rather than in openNMS. What architecture are the OpenBSD nodes?
> Net-SNMP is notoriously buggy on 64-bit platforms, particularly little-endian
> ones (e.g. amd64) and even more particularly on operating systems other than
> Linux.

The OpenBSD SNMP daemon[1] has been included since OpenBSD 4.3. This server is running 4.4 on i386 architecture. Net-SNMP is running on the OpenNMS server, but not the node in question.

-HKS

[1]: http://www.openbsd.org/cgi-bin/man.cgi?query=snmpd&apropos=0&sektion=0&manpath=OpenBSD+44&arch=i386&format=html

hks.private December 30, 2008 at 1:26 PM

Created an attachment (id=776)
capsd.log during rescan

The portion of capsd.log that was written during the rescan. I have munged the more sensitive bits of information, but nothing that is relevant to this issue.

Won't Fix

Details
Assignee
OpenNMS Bug Mailing List
Reporter
hks.private
Components
Fix versions
1.6.2
Affects versions
1.6.0
Priority
Major

PagerDuty

Created December 29, 2008 at 1:19 PM

Updated February 3, 2011 at 2:45 PM

Resolved January 8, 2009 at 5:22 PM

snmpd discovered but not collected against

Description

Environment

Acceptance / Success Criteria

Attachments

Lucidchart Diagrams

Activity

hks.private January 8, 2009 at 5:22 PM

Jeff Gehlbach January 2, 2009 at 4:03 PM

Jeff Gehlbach January 2, 2009 at 11:53 AM

hks.private December 30, 2008 at 1:34 PM

hks.private December 30, 2008 at 1:26 PM

DetailsAssigneeOpenNMS Bug Mailing ListOpenNMS Bug Mailing ListReporterhks.privatehks.privateComponentsFix versions1.6.2Affects versions1.6.0PriorityMajor

Details

Assignee

Reporter

Components

Fix versions

Affects versions

Priority

PagerDutyPagerDuty Incident

PagerDuty

Details
Assignee
OpenNMS Bug Mailing List
Reporter
hks.private
Components
Fix versions
1.6.2
Affects versions
1.6.0
Priority
Major

PagerDuty