traps are not processed after a malformed trap is received
Description
I'm using OpenNMS 1.5.90 installed on Debian from packages downloaded from debian.opennms.org
Recently I've noticed strange behavior of the system. After some time the system stops processing SNMP traps. I did some debugging and found that the thread polling the UDP socket crashes after receiving a malformed trap.
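I ran strace on that thread. Below is the output:
00:09:44 poll([{fd=97, events=POLLIN|POLLERR}], 1, 1000) = 0
00:09:45 gettimeofday({1210284585, 193114}, NULL) = 0
00:09:45 poll([{fd=97, events=POLLIN|POLLERR}], 1, 1000) = 0
00:09:46 gettimeofday({1210284586, 193136}, NULL) = 0
00:09:46 poll([{fd=97, events=POLLIN|POLLERR, revents=POLLIN}], 1, 1000) = 1
00:09:46 recvfrom(97, "hello\0", 65535, 0, {sa_family=AF_INET6, sin6_port=htons(51369), inet_pton(AF_INET6, "::ffff:172.16.10.15", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 6
00:09:46 write(2, "Exception in thread \"DefaultUDPT"..., 61) = 61
00:09:46 write(2, "java.lang.RuntimeException: java"..., 100) = 100
00:09:46 write(2, "\n", 1) = 1
00:09:46 write(2, "\tat org.snmp4j.MessageDispatcher"..., 67) = 67
00:09:46 write(2, "\n", 1) = 1
00:09:46 write(2, "\tat org.snmp4j.MessageDispatcher"..., 67) = 67
00:09:46 write(2, "\n", 1) = 1
00:09:46 write(2, "\tat org.snmp4j.transport.Abstrac"..., 84) = 84
00:09:46 write(2, "\n", 1) = 1
00:09:46 write(2, "\tat org.snmp4j.transport.Default"..., 84) = 84
00:09:46 write(2, "\n", 1) = 1
00:09:46 write(2, "Caused by: java.io.IOException: "..., 83) = 83
00:09:46 write(2, "\n", 1) = 1
00:09:46 write(2, "\tat org.snmp4j.asn1.BER.decodeIn"..., 53) = 53
00:09:46 write(2, "\n", 1) = 1
00:09:46 write(2, "\tat org.snmp4j.smi.Integer32.dec"..., 54) = 54
00:09:46 write(2, "\n", 1) = 1
00:09:46 write(2, "\t... 4 more", 11) = 11
00:09:46 write(2, "\n", 1) = 1
00:09:46 mmap2(0x9807f000, 12288, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x9807f000
00:09:46 rt_sigprocmask(SIG_SETMASK, [QUIT], NULL, 8) = 0
00:09:46 sched_getaffinity(2532, 32, { 3 }) = 4
00:09:46 sched_getaffinity(2532, 32, { 3 }) = 4
00:09:46 _exit(0) = ?
Full message taken from output.log:
Exception in thread "DefaultUDPTransportMapping_0.0.0.0/162" java.lang.RuntimeException: java.io.IOException: Wrong ASN.1 type. Not an integer: 108 at position 3
    at org.snmp4j.MessageDispatcherImpl.processMessage(Unknown Source)
    at org.snmp4j.MessageDispatcherImpl.processMessage(Unknown Source)
    at org.snmp4j.transport.AbstractTransportMapping.fireProcessMessage(Unknown Source)
    at org.snmp4j.transport.DefaultUdpTransportMapping$ListenThread.run(Unknown Source)
Caused by: java.io.IOException: Wrong ASN.1 type. Not an integer: 108 at position 3
    at org.snmp4j.asn1.BER.decodeInteger(Unknown Source)
    at org.snmp4j.smi.Integer32.decodeBER(Unknown Source)
    ... 4 more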
After that the system still reports trapd as running:
flap:/var/tmp/trace# /usr/share/opennms/bin/opennms -v status
OpenNMS.Eventd : running
OpenNMS.Trapd : running
(...)
opennms is running
but it is no longer processing SNMP traps, and the receive queue keeps growing:
Proto Recv-Q Send-Q Local Address           Foreign Address         State
udp6   97440      0 :::162                  :::*
and I can't see any more trap based events in the system until it's fully restarted.
The particular host that sent this message (172.16.10.15) is a Huawei router. However, it was added to the system only a few days ago, and I had been observing the issue before that, so nodes from other vendors must be sending such messages as well.
Can it be fixed somehow? I believe OpenNMS should not rely on the traps it receives being correctly formatted, as this is an easy way to crash the system.
Best regards, Pawel
Environment
Operating System: Linux
Platform: Other
Acceptance / Success Criteria
None
Activity
Benjamin Reed June 15, 2008 at 10:40 PM
this was merged along with the other stuff in
Jeff Gehlbach May 23, 2008 at 8:21 PM
Fixed in 1.6-testing in r9198. Upstream changes in SNMP4J 1.9.1f fix this problem.
Jeff Gehlbach May 19, 2008 at 3:35 PM
This problem exists upstream in the org.snmp4j.transport.DefaultUdpTransportMapping class, specifically in the run() method of the ListenThread inner class. If an IOException is caught in this method, the listener is stopped. Trapd gets no notification when this happens.
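To illustrate the "no notification" part, here is a minimal plain-Java sketch of how a thread that dies from an uncaught exception can at least report its own death. It does not use SNMP4J at all: the class name, the thread name and the thrown exception are stand-ins chosen to mirror the trace in the description, and whether Trapd can attach such a handler to the SNMP4J listener thread is an assumption that has not been checked here.

```java
import java.io.IOException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical demo class; nothing here is SNMP4J or OpenNMS code.
class ListenerDeathDemo {

    /** Starts a thread that dies like the listener in the trace and returns what the handler saw. */
    static String runAndCapture() {
        AtomicReference<String> report = new AtomicReference<>();
        CountDownLatch died = new CountDownLatch(1);

        Thread listener = new Thread(() -> {
            // Stand-in for the decode failure: an IOException wrapped in a RuntimeException.
            throw new RuntimeException(
                new IOException("Wrong ASN.1 type. Not an integer: 108 at position 3"));
        }, "hypothetical-trap-listener");

        // Without a handler the JVM only prints the trace to stderr, which is
        // what ended up in output.log. With one, the owner can react.
        listener.setUncaughtExceptionHandler((thread, error) -> {
            report.set(thread.getName() + " died: " + error.getCause().getMessage());
            died.countDown();
        });

        listener.start();
        try {
            died.await(5, TimeUnit.SECONDS);
            listener.join(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return report.get() + " (alive=" + listener.isAlive() + ")";
    }

    public static void main(String[] args) {
        System.out.println(runAndCapture());
    }
}
```

A handler like this only makes the failure visible (for example so a daemon could report itself as not running); it does not keep the listener alive.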
I think the only clean way to solve this problem is to do so upstream in SNMP4J. I've mailed the SNMP4J list about this problem but have not yet received a reply.
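For reference, the general shape of a listener loop that survives bad input might look like the sketch below. This is not SNMP4J's code and not the change that went into any SNMP4J release; `ResilientTrapListener`, `decode` and `processSafely` are hypothetical names, and the decoder is a deliberately trivial stand-in that only checks for the BER SEQUENCE tag (0x30) that an SNMP message starts with. The point is only that decode errors are caught per datagram, so one malformed packet is dropped and the loop keeps receiving.

```java
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical listener; a sketch of the pattern, not SNMP4J's implementation.
class ResilientTrapListener implements Runnable {
    private static final int BER_SEQUENCE_TAG = 0x30;

    private final DatagramSocket socket;
    private volatile boolean running = true;
    final AtomicInteger accepted = new AtomicInteger();
    final AtomicInteger rejected = new AtomicInteger();

    ResilientTrapListener(DatagramSocket socket) {
        this.socket = socket;
    }

    /** Stand-in decoder: accepts anything that starts with a BER SEQUENCE tag. */
    static void decode(byte[] data, int length) throws IOException {
        if (length == 0 || (data[0] & 0xFF) != BER_SEQUENCE_TAG) {
            throw new IOException("Not a BER SEQUENCE at position 0");
        }
    }

    /** Handles one datagram; bad input is counted and dropped, never rethrown. */
    boolean processSafely(byte[] data, int length) {
        try {
            decode(data, length);
            accepted.incrementAndGet();
            return true;
        } catch (IOException | RuntimeException e) {
            rejected.incrementAndGet();
            System.err.println("Dropping malformed datagram: " + e.getMessage());
            return false;
        }
    }

    @Override
    public void run() {
        byte[] buffer = new byte[65535];
        while (running) {
            DatagramPacket packet = new DatagramPacket(buffer, buffer.length);
            try {
                socket.receive(packet);
            } catch (SocketTimeoutException e) {
                continue; // timeout only; loop again and re-check the running flag
            } catch (IOException e) {
                break;    // socket closed or broken: a real transport failure
            }
            processSafely(packet.getData(), packet.getLength());
        }
    }

    void stop() {
        running = false;
        socket.close();
    }

    public static void main(String[] args) throws Exception {
        ResilientTrapListener listener = new ResilientTrapListener(new DatagramSocket(0));
        byte[] junk = "hello\0".getBytes(StandardCharsets.US_ASCII); // payload from the strace output
        byte[] sequence = {0x30, 0x00};                              // empty BER SEQUENCE
        System.out.println(listener.processSafely(junk, junk.length));
        System.out.println(listener.processSafely(sequence, sequence.length));
        listener.stop();
    }
}
```

Only failures of the socket itself end the loop here; anything thrown while decoding a single datagram is treated as a problem with that datagram alone.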