grinder has asked for the wisdom of the Perl Monks concerning the following question:

I tracked down a bug today in the SNMP support library used by MRTG. I have a fix for it, but I'm not sure whether it's an ugly hack, the Right Thing or if there's a different angle for a better way to do it.

I've been using MRTG for a while now, to watch Cisco routers (even if they do have really buggy snmp support), HP switches and sundry servers from a number of manufacturers. I'm currently setting up a VPN using Netscreen routers and ADSL lines, and I wanted to monitor them. As soon as I plugged them in, I started getting really weird non-reproducible errors. Sometimes the router would respond, sometimes not, and no way to understand why.

After futzing around with snmpwalk and some other support tools in the NetSaint kit, I felt pretty sure that the Netscreen routers were working correctly, it was the MRTG code that was wrong.

Once I started looking there, it was pretty quick to zoom into the offending code. In a nutshell, with SNMP when you want to ask a piece of equipment for some information, you have to tell it what community you belong to (read: authentication), what information you want, and a request id, which I surmise is kind of like a cookie.

By instrumenting the code with print statements, it became apparent that the following conditional was at fault. It would be true or false at random:

if ($response_community ne $this->{community} || ($response_id) != $this->{request_id})) { # ... }

... and that determines success or failure in receiving a response. So I printed out the values of these variables and hit the jackpot:

print "$response_community $this->{community} $response_id $this->{req +uest_id}\n"; # produces snmpfoo snmpfoo 2325544852 -1969422444

Looks like 32-bit signedness rearing its ugly head. If the MSB is zero, the test passed, if one, it failed. I had to get a job done, so I just packed the values and string compared them:

if ($response_community ne $this->{community} || pack('l',$response_id) ne pack('l',$this->{request_id})) { # ... }

So my questions are:

  1. I can't believe that this problem hasn't occurred before, somewhere in the world. Google provided a few similar leads, but nothing exactly the same (which is why I'm taking the time to be as explicit as possible... maybe someone else will put in the right keywords and pull this page out of the infinite depths). Wouldn't a problem of signedness have been found sooner?
  2. If this is a problem, is this the best way to fix it? Or is there a simpler way of coercing signed and unsigned integers to get along? I'm not overly fussed with speed: we are talking about transactions that take several dozen milliseconds if you're lucky, more likely several hundred. A cycle here or there doesn't matter.
  3. Shouldn't this problem be corrected upstream in the SNMP library, so that request ids are always of the same sign?

My personal feeling for that last one is a yes, but I'm not well versed in SNMP to know whether that's reasonable or not. I have traced the problem down to a rather horrendous piece of code in the BER.pm module called "decode_by_template", which is passed a sprintf-like string which is decoded manually. Trying to trace through it makes me want to gouge my eyes out: it's a textbook case of creature feep.

I think I'll just mail my hack to the maintainer, and see what he wants to do about it, but any recommendations or ideas will be much appreciated, before I send it off.


print@_{sort keys %_},$/if%_=split//,'= & *a?b:e\f/h^h!j+n,o@o;r$s-t%t#u'

Replies are listed 'Best First'.
(tye)Re: Bug in SNMP_Session when polling Netscreen routers with MRTG
by tye (Sage) on Jun 27, 2002 at 02:53 UTC

    A quick way to convert a number to an unsigned integer is to do a bitwise logical operation on it. So you could do:     (0|$response_id) == (0|$this->{request_id}) I've run into similar problems with signedness before. Old versions of the XS "stuff" didn't include complete support for unsigned arguments and return values. I think that is corrected now, so the library could probably be fixed by just changing a few cases of "IV" to "UV".

            - tye (but my friends call me "Tye")