Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Good day Monks,

I'm having some difficulty searching binary data with regular expressions. The problem appears to be due to a numerical 10 within my data that is being interpreted as a line feed.

Is there any way to tell the regex to ignore line feeds (similar to using $/ = undef for files), or are there any other suggestions on how to get the desired results?


The following code demonstrates the problem. Whenever I am looking for "any characters" using .{x} and the span includes a 10, it will fail to find it.

my $data = "\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11";
my @find = (
    # These work
    "\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11",
    "\x04\x05.{4}\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11",
    "\x04\x05\x06\x07\x08.{1}\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11",
    "\x04\x05\x06\x07\x08\x09\x0a.{1}\x0c\x0d\x0e\x0f\x10\x11",
    # These don't work
    "\x04\x05.{5}\x0b\x0c\x0d\x0e\x0f\x10\x11",
    "\x04\x05\x06\x07\x08\x09.{1}\x0b\x0c\x0d\x0e\x0f\x10\x11"
);
foreach (@find) {
    if ($data =~ /$_/) { print "found\n" } else { print "not found\n" }
}

Running the code will output:
found
found
found
found
not found
not found


Also, does anyone know a good online regexp reference? I had an excellent reference at http://japhy.perlmonk.org/book/ bookmarked, but it looks like it has been removed.

Thanks

Replies are listed 'Best First'.
Re: Using regexp with binary data
by fizbin (Chaplain) on Aug 28, 2005 at 15:34 UTC
    People attacking binary data with regular expressions generally have two problems: by default, /./ won't match newlines, and /$/ will, on a string which ends with character 10, match just before the final character.

    The solution to the first of these problems is to use the s modifier on your regexp, as has already been mentioned. Note that you can encode this modifier into the regexp when you define it:

    my $data = "\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11";
    my @find = (
        # These work
        qr"\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11"s,
        qr"\x04\x05.{4}\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11"s,
        qr"\x04\x05\x06\x07\x08.{1}\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11"s,
        qr"\x04\x05\x06\x07\x08\x09\x0a.{1}\x0c\x0d\x0e\x0f\x10\x11"s,
        # These didn't work before, but do now
        qr"\x04\x05.{5}\x0b\x0c\x0d\x0e\x0f\x10\x11"s,
        qr"\x04\x05\x06\x07\x08\x09.{1}\x0b\x0c\x0d\x0e\x0f\x10\x11"s,
    );
    foreach (@find) {
        if ($data =~ /$_/) { print "found\n" } else { print "not found\n" }
    }

    The solution to the second problem is to use \z instead of $ when you absolutely need the very end of the string, even if it ends in a newline. (Note the lowercase z: \Z behaves like $ and will also match just before a final newline.)
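
A minimal sketch of how the end-of-string anchors behave on data whose last byte is 10. The sample bytes and variable names are made up for illustration and are not from the original post:

```perl
use strict;
use warnings;

# Hypothetical sample: binary data whose final byte is 10 (line feed).
my $data = "\x01\x02\x0a";

# $ and \Z may both match just before a trailing newline,
# so \x02 is accepted as "the end" even though \x0a follows it.
my $dollar = $data =~ /\x02$/  ? 1 : 0;
my $big_z  = $data =~ /\x02\Z/ ? 1 : 0;

# \z matches only at the true end of the string.
my $small_z_before_lf = $data =~ /\x02\z/ ? 1 : 0;
my $small_z_at_end    = $data =~ /\x0a\z/ ? 1 : 0;

printf "%-12s %d\n", 'x02 then $',  $dollar;
printf "%-12s %d\n", 'x02 then \Z', $big_z;
printf "%-12s %d\n", 'x02 then \z', $small_z_before_lf;
printf "%-12s %d\n", 'x0a then \z', $small_z_at_end;
```

The first two checks succeed and the third fails, which is why \z is the safer anchor for binary data.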

    As for online regexp references, running perldoc perlre as suggested before is good. (If you are working on Windows and have ActiveState's Perl installed, search C:\Perl\html\index.html for "perlre".) If you prefer to look at an online page, googling for perlre will get you several online copies of the same document.

    --
    @/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/
      Thanks for your help, everyone. It is greatly appreciated.


      It's things like qr"\x04\x05.{4}\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11"s that make me really love Perl. I mean, that is just poetry in programming right there.
Re: Using regexp with binary data
by davidrw (Prior) on Aug 28, 2005 at 15:13 UTC
    Use the /s modifier so that "dot" matches any character, including newlines (see perldoc perlre), and all 6 will match:
    $data =~ /$_/s
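
As a small sketch of the effect, here is the same dot tested with and without /s. It also shows the inline form (?s), which may be handy when the patterns are kept as plain strings; the three-byte sample and the variable names are assumptions made for illustration:

```perl
use strict;
use warnings;

# Hypothetical sample: byte 10 sits between two other bytes.
my $data = "\x09\x0a\x0b";

# Without /s, "." refuses to match the line feed.
my $without_s = $data =~ /\x09.\x0b/  ? 1 : 0;

# With /s, "." matches any character, line feed included.
my $with_s    = $data =~ /\x09.\x0b/s ? 1 : 0;

# The modifier can also be embedded in a pattern stored as a plain string.
my $pattern  = "(?s)\x09.\x0b";
my $inline_s = $data =~ /$pattern/ ? 1 : 0;

print "without /s: $without_s\n";
print "with /s:    $with_s\n";
print "inline (?s): $inline_s\n";
```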
Re: Using regexp with binary data
by Joost (Canon) on Aug 28, 2005 at 15:26 UTC
    The problem appears to be due to a numerical 10 within my data that is being interpreted as a line feed.

    Take a look at the ASCII character table; a byte/character with an ordinal value of 10 is a linefeed (in most popular character encodings, at least). As mentioned above, using the /s modifier solves this problem.

    If you're using this as a general technique, you should also be very careful that your @find array doesn't contain any other special characters like "(", or "[". Just because you write characters as escape sequences doesn't give them any special status: they're just characters.
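
One common way to guard against this, assuming a given search string is meant purely as literal bytes, is to quote it with \Q...\E (or quotemeta) before it reaches the regex engine. The sketch below uses made-up data and names. It does not suit patterns that deliberately contain wildcards such as .{4}, since those would be escaped too:

```perl
use strict;
use warnings;

# Hypothetical data containing bytes that are regex metacharacters:
# \x28 is "(" and \x5b is "[".
my $data   = "\x01\x28\x5b\x02";
my $needle = "\x28\x5b";

# Interpolated as-is, the needle is an invalid pattern and compiling it
# dies at run time, so the attempt is wrapped in eval.
my $raw_ok = eval { my $m = $data =~ /$needle/; 1 } ? 1 : 0;

# \Q...\E escapes every metacharacter, so the bytes are matched literally.
my $quoted_matches = $data =~ /\Q$needle\E/ ? 1 : 0;

print "unquoted pattern compiled: $raw_ok\n";
print "quoted pattern matched:    $quoted_matches\n";
```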

      If you're using this as a general technique, you should also be very careful that your @find array doesn't contain any other special characters like "(", or "[". Just because you write characters as escape sequences doesn't give them any special status: they're just characters.
      Ah, but if he takes my suggestion and writes each expression as qr"\x...blah blah..."s, then those escape sequences do in fact make the characters unspecial, and he can use this technique with impunity. For example, to his original code add the expressions qr"\x06\x2E"s and "\x06\x2E" - the one inside qr will not match, since in it the \x2E is seen as a literal, whereas without qr you'll get the problem you're talking about. (For those of you without the ASCII table memorized, \x2E is .)

      For full details, see perlop, especially the section "Gory details of parsing quoted constructs".
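
The difference can be sketched as follows, using a shortened version of the original data; the variable names are only illustrative:

```perl
use strict;
use warnings;

# Shortened sample data; it contains no literal "." (byte \x2E).
my $data = "\x04\x05\x06\x07\x08";

# In a double-quoted string, \x2E is turned into a real "." before the
# regex engine sees it, so it acts as the "any character" metacharacter.
my $as_string = "\x06\x2E";

# Inside qr, the escape is handed to the regex engine unchanged,
# and \x2E there means a literal dot.
my $as_qr = qr"\x06\x2E"s;

my $string_matches = $data =~ /$as_string/ ? 1 : 0;   # \x06 followed by anything
my $qr_matches     = $data =~ $as_qr       ? 1 : 0;   # \x06 followed by a literal "."

print "plain string pattern: $string_matches\n";
print "qr pattern:           $qr_matches\n";
```

The plain-string pattern matches because the dot swallows \x07, while the qr pattern does not, since the data has no literal dot.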

      --
      @/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/
Re: Using regexp with binary data
by htj (Initiate) on Aug 29, 2005 at 06:57 UTC
    Use $data =~ /$_/s instead of $data =~ /$_/ and then it's OK!