dantheman1210 has asked for the wisdom of the Perl Monks concerning the following question:

OK, long story short: I have a UDP listener using IO::Socket::INET, and I am trying to grab the first 4 bytes of each datagram with a pattern match that fails on roughly 1 out of 100 packets. First, here is my simplified code:

    #!/usr/bin/perl -w
    $|++;
    use strict;
    use IO::Socket;
    use IO::Select;

    my $response = IO::Socket::INET->new(Proto=>"udp",LocalPort=>5000)
        or die "Can't make UDP server: $@";
    my $sel = new IO::Select( $response );
    my ($datagram,$flags);
    binmode($response);

    while(my @ready = $sel->can_read) #Make sure socket has something to give us
    {
        $datagram = '';
        $flags = '';
        $response->recv($datagram,1500,$flags);
        my $length = length($datagram);
        print "Length : $length\n";
        if($datagram =~ /^(\C\C\C\C)(.*)$/s)
        {
            #Do some stuff
        }
        else
        {
            #Didn't match for some reason
            print "Did not match!\n";
        }
    }

Here is the root of my question: I print out the length of the packet just prior to attempting the pattern match, which always has a length of "1205", but when I attempt the match with $datagram =~ /^(\C\C\C\C)(.*)$/s it sometimes fails. How can this match fail (which I believe is just looking for the first 4 bytes in the datagram) when the data is 1205 bytes long?

Oh, and here is a snippet of the output:

    Length : 1205
    Length : 1205
    Length : 1205
    Length : 1205
    Length : 1205
    Did not match!
    Length : 1205
    Length : 1205

Any guidance would be greatly appreciated! :)

Replies are listed 'Best First'.
Re: Pattern match not working sometimes
by GrandFather (Saint) on Mar 18, 2012 at 20:14 UTC

    On the face of it everything looks fine. Have you tried printing the failing strings to see what the nature of the beast is that doesn't match? You'll probably want to translate unprintable and white space octets as hex to make them visible.

    We could help more if you gave us a sample failing string - it need only be 10 or so characters long.

    True laziness is hard work

      OK, so I added:

      my $stuff = unpack('B64', $datagram);
      print "This is data: $stuff\n";

      within the final else. Here is the output

      Length : 1205
      Length : 1205
      Did not match!
      This is data: 0000101001011100010101100100100101001111001111110010010001110110
      Length : 1205
      Length : 1205

      The only thing I noticed is that all the data that doesn't match seems to start with four 0 bits, but I would think that it would still match. Anyway, let me know if you see anything.
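GrandFather's suggestion of showing unprintable octets as hex can be sketched roughly as follows. This is only an illustration, not the OP's code: the sample string is an assumption, rebuilt from the 64 bits printed in the output above, and `hex_dump` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Assumption: sample bytes rebuilt from the 64-bit dump in the post above.
my $bits   = '0000101001011100010101100100100101001111001111110010010001110110';
my $sample = pack('B64', $bits);

# Hypothetical helper: render the first $n octets as space-separated hex pairs.
sub hex_dump {
    my ($data, $n) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, substr($data, 0, $n);
}

my $dump = hex_dump($sample, 8);
print "First 8 octets: $dump\n";
```

For this sample the dump reads `0a 5c 56 49 4f 3f 24 76`, so the first octet is 0x0a, a line feed, which is much easier to spot in hex than in a run of bits.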

        The issue is the ^ anchor and the character causing grief is a new line at the start of the line! If you change your match to /(\C\C\C\C)(.*)/s the problem is fixed. Note that there is no need for anchors in any case - you want the first four octets followed by anything so that will always match at the start of the string (unless there are fewer than 4 octets).

        True laziness is hard work

        It looks like your first byte is ASCII 10, which is a line feed. Is that breaking the regex because it's hitting an EOL? What if you change the "s" modifier to an "m" modifier so it will match over EOLs within the string?

        Does it also fail when you get that pattern "00001010" anywhere else in the first four bytes?

        If that's the problem it should fail if you get

        0000 1011 (vertical tab)
        0000 1100 (form feed)
        0000 1101 (carriage return)

        If you're interpreting as Unicode you might also see failures if you get a few other combinations that don't start with 0000 that get interpreted as premature EOL and cause it to fail.

        (Edit: looks like GrandFather posted a better solution than changing the match mode. Dropping the anchors is simpler.)
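The guess above about vertical tab, form feed and carriage return can be checked with a short experiment. The sketch below is an assumption-laden stand-in: it tests an anchored `.` without /s rather than the OP's `\C`, and the `abc` filler is made up. It only shows which leading octets stop the dot from matching.

```perl
use strict;
use warnings;

# Octets named in the post above: LF, VT, FF, CR.
my %name = (0x0a => 'line feed', 0x0b => 'vertical tab',
            0x0c => 'form feed', 0x0d => 'carriage return');

my %matched;
for my $code (sort { $a <=> $b } keys %name) {
    # Assumption: three filler octets 'abc' stand in for the rest of a datagram.
    my $data = chr($code) . 'abc';
    # Anchored, no /s: '.' must consume the very first octet for this to succeed.
    $matched{$code} = ($data =~ /^(.)/) ? 1 : 0;
    printf "0x%02x %-15s %s\n", $code, $name{$code},
        $matched{$code} ? 'matches' : 'does not match';
}
```

Per perlre, without /s the dot excludes only "\n", so only the line feed row reports a failed match; the other three control characters are consumed like any other octet. How `\C` treats a leading line feed may differ between Perl versions, so this is a hint rather than a diagnosis of the OP's failure.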

Re: Pattern match not working sometimes
by nemesdani (Friar) on Mar 18, 2012 at 20:26 UTC
    A minor thing: why do you look for (.*)$? If you don't want to do anything with the junk after the 4 characters, use only /^(\C\C\C\C)/.

      At a guess, the stuff which is done in the place of the #Do some stuff comment uses $2.

      perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'
        Very true, I am using both within that block.
Re: Pattern match not working sometimes
by bulk88 (Priest) on Mar 18, 2012 at 21:15 UTC
    I think substr would be better for your case, since you just want to read 4 bytes. A less ideal alternative is unpack. Also, shouldn't binmode be a method call rather than a sub call?
      "I think substr would be better for your case"

      Actually, no. The OP stressed that he is dealing with octets, not characters. unpack could be a reasonable option, but probably not as clear as the regex unless the maintenance programmer is familiar with pack/unpack. The unpack code would (using the OP's sample data) look like:

      my $stuff = '0000101001011100010101100100100101001111001111110010010001110110';
      my $bytes = pack('B64', $stuff);
      my ($prefix, $tail) = unpack('a4a*', $bytes);
      print ">$prefix<\n>$tail<\n";
      True laziness is hard work
        The OP is not introducing Unicode or mentioning his locale anywhere in his code. The scalars coming from the OP's socket will have byte semantics. Why would any scalars be upgraded to Unicode in his code? The OP says that length() returns the number of bytes in $datagram, so $datagram isn't flagged as UTF-8. He didn't say he is using -C.
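The byte-semantics argument above can be tried out without a socket. The following is a minimal sketch under stated assumptions: `$datagram` here is a made-up 1205-octet stand-in beginning with a line feed, not real traffic. It avoids `\C` altogether, which also matters on Perl 5.24 and later, where `\C` is no longer accepted in a regex.

```perl
use strict;
use warnings;

# Assumption: a stand-in for a received datagram, starting with a line feed.
my $datagram = "\x0a\x5c\x56\x49" . ("\x00" x 1201);   # 1205 octets total

# A string without the UTF-8 flag has byte semantics: length() counts octets.
my $is_bytes = utf8::is_utf8($datagram) ? 0 : 1;

# Split by position; no regex metacharacter rules are involved.
my $prefix = substr($datagram, 0, 4);
my $tail   = substr($datagram, 4);

# Equivalent split with unpack: 4 raw octets, then everything else.
my ($prefix2, $tail2) = unpack('a4 a*', $datagram);

printf "bytes=%d prefix=%s tail_len=%d\n",
    $is_bytes, unpack('H*', $prefix), length $tail;
```

If `utf8::is_utf8` ever returned true for a received datagram, substr would count characters rather than octets and this shortcut would no longer be safe. Whichever split is used, checking that length($prefix) + length($tail) equals length($datagram) is a cheap guard against silently skipping a leading octet.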