traps are not processed after a malformed trap is received

Description

I'm using OpenNMS 1.5.90, installed on Debian from the packages at debian.opennms.org.

Recently I've noticed strange behavior: after some time, the system stops processing SNMP traps.
After some debugging I found that the thread polling the UDP socket crashes after receiving a malformed trap.

I've run strace on that thread. Below is the output:

00:09:44 poll([{fd=97, events=POLLIN|POLLERR}], 1, 1000) = 0
00:09:45 gettimeofday({1210284585, 193114}, NULL) = 0
00:09:45 poll([{fd=97, events=POLLIN|POLLERR}], 1, 1000) = 0
00:09:46 gettimeofday({1210284586, 193136}, NULL) = 0
00:09:46 poll([{fd=97, events=POLLIN|POLLERR, revents=POLLIN}], 1, 1000) = 1
00:09:46 recvfrom(97, "hello\0", 65535, 0, {sa_family=AF_INET6, sin6_port=htons(51369), inet_pton(AF_INET6, "::ffff:172.16.10.15", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 6
00:09:46 write(2, "Exception in thread \"DefaultUDPT"..., 61) = 61
00:09:46 write(2, "java.lang.RuntimeException: java"..., 100) = 100
00:09:46 write(2, "\n", 1) = 1
00:09:46 write(2, "\tat org.snmp4j.MessageDispatcher"..., 67) = 67
00:09:46 write(2, "\n", 1) = 1
00:09:46 write(2, "\tat org.snmp4j.MessageDispatcher"..., 67) = 67
00:09:46 write(2, "\n", 1) = 1
00:09:46 write(2, "\tat org.snmp4j.transport.Abstrac"..., 84) = 84
00:09:46 write(2, "\n", 1) = 1
00:09:46 write(2, "\tat org.snmp4j.transport.Default"..., 84) = 84
00:09:46 write(2, "\n", 1) = 1
00:09:46 write(2, "Caused by: java.io.IOException: "..., 83) = 83
00:09:46 write(2, "\n", 1) = 1
00:09:46 write(2, "\tat org.snmp4j.asn1.BER.decodeIn"..., 53) = 53
00:09:46 write(2, "\n", 1) = 1
00:09:46 write(2, "\tat org.snmp4j.smi.Integer32.dec"..., 54) = 54
00:09:46 write(2, "\n", 1) = 1
00:09:46 write(2, "\t... 4 more", 11) = 11
00:09:46 write(2, "\n", 1) = 1
00:09:46 mmap2(0x9807f000, 12288, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x9807f000
00:09:46 rt_sigprocmask(SIG_SETMASK, [QUIT], NULL, 8) = 0
00:09:46 sched_getaffinity(2532, 32, { 3 }) = 4
00:09:46 sched_getaffinity(2532, 32, { 3 }) = 4
00:09:46 _exit(0) = ?

Full message taken from output.log:
Exception in thread "DefaultUDPTransportMapping_0.0.0.0/162" java.lang.RuntimeException: java.io.IOException: Wrong ASN.1 type. Not an integer: 108 at position 3
    at org.snmp4j.MessageDispatcherImpl.processMessage(Unknown Source)
    at org.snmp4j.MessageDispatcherImpl.processMessage(Unknown Source)
    at org.snmp4j.transport.AbstractTransportMapping.fireProcessMessage(Unknown Source)
    at org.snmp4j.transport.DefaultUdpTransportMapping$ListenThread.run(Unknown Source)
Caused by: java.io.IOException: Wrong ASN.1 type. Not an integer: 108 at position 3
    at org.snmp4j.asn1.BER.decodeInteger(Unknown Source)
    at org.snmp4j.smi.Integer32.decodeBER(Unknown Source)
    ... 4 more
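
One plausible reading of the error text, given the "hello\0" payload seen in the recvfrom() call: an SNMP message is a BER SEQUENCE whose first element is an INTEGER (the version), so a decoder reads a tag byte, a length byte, and then expects the INTEGER tag 0x02. In "hello\0" the third byte is 'l', which is 108. The Python sketch below only illustrates that arithmetic. The function name is hypothetical and this is not SNMP4J's actual decoding code.

```python
BER_INTEGER = 0x02  # universal BER tag for INTEGER


def explain_first_bytes(datagram: bytes) -> str:
    """Walk the first three bytes the way a simple BER reader might.

    Assumption: the outer length fits in one byte (short form), which
    holds here because 'e' (0x65) is below 0x80.
    """
    if len(datagram) < 3:
        return "too short to hold a tag, a length and a version tag"
    # Bytes 1 and 2 would be the outer tag and length; byte 3 starts the version.
    version_tag = datagram[2]
    if version_tag != BER_INTEGER:
        return f"Not an integer: {version_tag} at position 3"
    return "version field starts with an INTEGER tag"


print(explain_first_bytes(b"hello\x00"))
# prints: Not an integer: 108 at position 3
```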

After that, the status command still reports Trapd as running:
flap:/var/tmp/trace# /usr/share/opennms/bin/opennms -v status
OpenNMS.Eventd : running
OpenNMS.Trapd : running
(...)
opennms is running

but it no longer processes SNMP traps, and the receive queue on UDP port 162 keeps growing:
Proto Recv-Q Send-Q Local Address Foreign Address State
udp6 97440 0 :::162 :::*

and no more trap-based events appear in the system until it is fully restarted.
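
Since the status command keeps reporting Trapd as running, the growing Recv-Q value is the more reliable symptom. As a rough sketch of how that could be checked, the Python snippet below pulls the Recv-Q figure for a UDP port out of netstat-style text like the lines above. The function name and the column layout it assumes (Proto, Recv-Q, Send-Q, Local Address, Foreign Address) are assumptions based on the output shown here, not part of OpenNMS.

```python
from typing import Optional


def recv_q_for_port(netstat_output: str, port: int) -> Optional[int]:
    """Return the Recv-Q value of the first UDP socket bound to `port`.

    Assumes whitespace-separated columns in the order shown in the report:
    Proto, Recv-Q, Send-Q, Local Address, Foreign Address.
    """
    for line in netstat_output.splitlines():
        fields = line.split()
        if (
            len(fields) >= 5
            and fields[0].startswith("udp")
            and fields[3].endswith(f":{port}")
        ):
            return int(fields[1])
    return None  # no matching socket found


sample = """Proto Recv-Q Send-Q Local Address Foreign Address State
udp6 97440 0 :::162 :::*"""

print(recv_q_for_port(sample, 162))
# prints: 97440
```

A value that keeps rising between two samples would suggest that nothing is reading from the socket any more.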

The host that sent this particular message (172.16.10.15) is a Huawei router. However, it was added to the system only a few days ago, and I had been observing the issue before that, so nodes from other vendors must be sending such messages as well.

Can this be fixed? I believe OpenNMS should not rely on the traps it receives being well-formed, since that makes it easy to crash the system.

Best regards,
Pawel

Environment

Operating System: Linux
Platform: Other

Acceptance / Success Criteria

None

Activity

Benjamin Reed June 15, 2008 at 10:40 PM

this was merged along with the other stuff in

Jeff Gehlbach May 23, 2008 at 8:21 PM

Fixed in 1.6-testing in r9198. Upstream changes in SNMP4J 1.9.1f fix this problem.

Jeff Gehlbach May 19, 2008 at 3:35 PM

This problem exists upstream in the org.snmp4j.transport.DefaultUdpTransportMapping class, specifically in the run() method of the ListenThread inner class. If an IOException is caught in this method, the listener is stopped. Trapd gets no notification when this happens.

I think the only clean way to solve this problem is to do so upstream in SNMP4J. I've mailed the SNMP4J list about this problem but have not yet received a reply.
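
The general pattern that avoids this failure mode is to handle decode errors per datagram, so that one bad packet is dropped and the loop keeps reading. The sketch below shows that pattern in Python rather than in SNMP4J's Java. It is not the upstream fix, and `listen_loop` and `decode_version_tag` are hypothetical names used only for illustration.

```python
import logging
from typing import Callable, Iterable, Tuple

log = logging.getLogger("trap-listener")


def decode_version_tag(datagram: bytes) -> int:
    """Hypothetical stand-in for a real SNMP decoder.

    Rejects any datagram whose third byte is not the BER INTEGER tag.
    """
    if len(datagram) < 3 or datagram[2] != 0x02:
        raise ValueError("malformed trap")
    return datagram[2]


def listen_loop(
    datagrams: Iterable[bytes], handler: Callable[[bytes], object]
) -> Tuple[int, int]:
    """Feed each datagram to `handler`; return (processed, dropped).

    The try/except sits inside the loop, so a failure on one datagram
    is logged and counted instead of ending the loop.
    """
    processed = dropped = 0
    for data in datagrams:
        try:
            handler(data)
            processed += 1
        except Exception:
            dropped += 1
            log.warning("dropping malformed datagram of %d bytes", len(data))
    return processed, dropped


good = bytes([0x30, 0x03, 0x02, 0x01, 0x00])
print(listen_loop([good, b"hello\x00", good], decode_version_tag))
# prints: (2, 1)
```

In a real listener the iterable would be replaced by repeated reads from the UDP socket. The point of the sketch is only that the datagram after "hello\0" still gets processed.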

Fixed

Details

Created May 9, 2008 at 5:22 PM
Updated February 3, 2011 at 2:44 PM
Resolved June 15, 2008 at 10:40 PM