PerlMonks  

Searching binary data

by Shendal (Hermit)
on Mar 14, 2002 at 17:03 UTC ( [id://151736]=perlquestion )

Shendal has asked for the wisdom of the Perl Monks concerning the following question:

Given a binary file, I want to search for a section of content, and return the next few bytes in the file. Currently, I am using something like this code:
    my $file = '/path/to/binaryfile';

    # string to match, in hex
    my $matchstr = '53005f00560045005200530049004f004e005f0049004e0046004f0000000000bd04effe00000100';

    # read in file and convert to a textual (hex) representation
    open(FILE,"$file") or die "Unable to open file: $!\n";
    binmode(FILE);
    my $binhex = '';
    while (eof(FILE) != 1) {
        my $buf;
        my $num_byte_read = read(FILE,$buf,16);
        foreach ($buf =~ m/./gs) {
            $binhex .= sprintf("%02x",ord($_));
        }
    }
    close(FILE);

    # search it
    if ($binhex =~ /$matchstr(\S\S\S\S)(\S\S\S\S)(\S\S\S\S)(\S\S\S\S)/) {
        print "Found: " . join(',',($1,$2,$3,$4));
    }
This is unbearably slow for large files. I'd like to search without having to convert to/from hex. Any ideas on how to accomplish this, or another way to speed up the parsing?

BTW, the searching section of the code works very well. It's the conversion to hex that takes all the time.

Cheers,
Shendal

Edit by tye

Replies are listed 'Best First'.
Re: Searching binary data
by dws (Chancellor) on Mar 14, 2002 at 18:18 UTC
    This is unbearably slow for large files. I'd like to search without having to convert to/from hex. ... It's the conversion to hex that takes all the time.

    Two thoughts:

    1. You're making far too many read() calls. Reading 16 bytes at a time is underkill, since a disk block is going to be at least 512 bytes. Try reading 512 bytes at a time.
    2. Instead of using sprintf to make hex strings, roll your own table lookup. It'll be a lot faster.
    while ( read(FILE, $buf, 512) ) {
        foreach ( $buf =~ m/./gs ) {
            $binhex .= $hex[ord($_)];
        }
    }
    Setting up @hex is left as an exercise.
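One way the lookup table could be set up, as a minimal sketch: the table is built once with sprintf, so the per-byte work in the loop is only an array index. The helper name to_hex is hypothetical and not from the post; it uses unpack 'C*' to get the byte values instead of the m/./gs split shown above.

```perl
use strict;
use warnings;

# Precompute the two-digit lowercase hex string for every possible byte value.
my @hex = map { sprintf '%02x', $_ } 0 .. 255;

# Hypothetical helper: hex-encode a buffer by table lookup.
sub to_hex {
    my ($buf) = @_;
    # unpack 'C*' yields each byte as an unsigned integer 0..255.
    return join '', map { $hex[$_] } unpack 'C*', $buf;
}

print to_hex("S\0_\0"), "\n";    # prints 53005f00
```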

    Oh, and if you're on Win32, don't forget to binmode(FILE);

      Wouldn't it be faster to use unpack?
      while ( read(FILE, $buf, 512) ) {
          $binhex = unpack "H*",$buf;
          # do something with $binhex
      }
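A hedged sketch of how the unpack idea might be used end to end, assuming the data is already in memory as one string: a single unpack "H*" converts the whole buffer, and index() then searches the hex text. The helper name hex_after is an assumption. One detail worth noting is that a hit at an odd offset in the hex string straddles two bytes, so the sketch skips those.

```perl
use strict;
use warnings;

# Hypothetical helper: return the hex digits for the $count bytes that
# follow the first byte-aligned occurrence of $match_hex, or undef.
sub hex_after {
    my ($data, $match_hex, $count) = @_;
    my $binhex = unpack 'H*', $data;    # one call converts the whole buffer
    my $pos = 0;
    while (($pos = index($binhex, $match_hex, $pos)) >= 0) {
        # An odd offset means the hit starts mid-byte: a false positive.
        if ($pos % 2 == 0) {
            my $found = substr $binhex, $pos + length($match_hex), 2 * $count;
            return length($found) == 2 * $count ? $found : undef;
        }
        $pos++;
    }
    return undef;
}

my $data = "\x01\x23\x45\x12\x34\xab\xcd";
print hex_after($data, '1234', 2), "\n";    # prints abcd
```

In the sample data the digits "1234" first appear at hex offset 1 (inside 01 23 45), which is not a real byte sequence 12 34; the aligned match at offset 6 is the one that counts.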
Re: Searching binary data
by talexb (Chancellor) on Mar 14, 2002 at 18:25 UTC
    Back when I was programming in C, I came across this kind of problem all the time. Here's the approach that I used:
    • Grab an 8K chunk of the file
    • Search for the thing you're looking for in the first half of the chunk less the length of the search string.
    • Once you've looked at most of the first chunk, scroll everything down by 4K, read in another 4K chunk where the empty space was created, and repeat from the previous step.
    Instead of using a regex, I'd probably use index(), but if that's too slow I would just write a C program to do the job.
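The chunked approach above might look roughly like this in Perl with index(). This is a sketch under assumptions, not the C routine described: the helper name bytes_after and the 8192-byte default are made up for illustration, and instead of scrolling a fixed 8K buffer down by 4K it keeps only the tail of the buffer that could still begin a match.

```perl
use strict;
use warnings;

# Hypothetical helper: scan $fh in fixed-size chunks for $needle and return
# the $want bytes that follow its first occurrence, or undef if not found.
sub bytes_after {
    my ($fh, $needle, $want, $chunk_size) = @_;
    $chunk_size ||= 8192;
    my $buf = '';
    while (read($fh, my $chunk, $chunk_size)) {
        $buf .= $chunk;
        my $pos = index $buf, $needle;
        if ($pos < 0) {
            # No hit yet: keep only the tail that could still begin a match.
            my $tail = length($needle) - 1;
            substr($buf, 0, length($buf) - $tail, '') if length($buf) > $tail;
            next;
        }
        my $start = $pos + length $needle;
        return substr($buf, $start, $want) if length($buf) - $start >= $want;
        # Hit found, but the bytes after it are not all read yet.
        $buf = substr $buf, $pos;
    }
    return undef;
}

# Demo on an in-memory file, with a tiny chunk size to force the match
# to straddle chunk boundaries.
my $data = "junk" . "\xbd\x04\xef\xfe" . "\x00\x01\x02\x03";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
binmode $fh;
print unpack('H*', bytes_after($fh, "\xbd\x04\xef\xfe", 2, 3)), "\n";    # prints 0001
close $fh;
```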

    --t. alex

    "Here's the chocolates, and here's the flowers. Now how 'bout it, widder hen, will ya marry me?" --Foghorn Leghorn

Re: Searching binary data
by Juerd (Abbot) on Mar 14, 2002 at 17:20 UTC

    my $match = pack 'H*', '53005f00560045005200530049004f004e005f004' .
                           '9004e0046004f0000000000bd04effe00000100';
    open (FILE, 'foo') or die $!;
    while (<FILE>) {
        # s/// needs a replacement part; an empty one removes each match
        # so the statement-modifier while can find the next one
        print "Found: $1, $2, $3, $4\n"
            while s/$match(\S{4})(\S{4})(\S{4})(\S{4})//;
    }
    close FILE;
    (Warning: untested code.)

    U28geW91IGNhbiBhbGwgcm90MTMgY
    W5kIHBhY2soKS4gQnV0IGRvIHlvdS
    ByZWNvZ25pc2UgQmFzZTY0IHdoZW4
    geW91IHNlZSBpdD8gIC0tIEp1ZXJk
    

      I think you want ".", rather than "\S" in your regex. Also, I think your matching code will miss cases where there is more than one match per line.

      However, the basic idea of packing is exactly what you want. You also probably want to use ord and printf %x to print your results in hex. (or unpack).
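To illustrate both points, here is a small sketch on made-up sample data (the shortened pattern and the trailing bytes are assumptions for the demo, not values from the original file). It matches with "." under /s, so bytes such as 0x20 and 0x0a are captured, and prints the captures in hex with unpack; sprintf '%02x', ord(...) per byte would work as well.

```perl
use strict;
use warnings;

my $match = pack 'H*', 'bd04effe00000100';    # shortened pattern for the demo

# Hypothetical sample data: the pattern followed by bytes that include a
# space (0x20) and a newline (0x0a), which \S would refuse to match.
my $data = "junk" . $match . "\x20\x00\x0a\x00\x04\x00\x01\x00" . "more";

my @words;
# \Q...\E quotes any regex metacharacters among the packed bytes;
# /s lets "." match every byte, newline included.
if ($data =~ /\Q$match\E(.{2})(.{2})(.{2})(.{2})/s) {
    @words = map { unpack 'H*', $_ } ($1, $2, $3, $4);
    print "Found: ", join(',', @words), "\n";    # prints Found: 2000,0a00,0400,0100
}
```

Because the whole buffer is matched as one string here, a newline byte inside the data does not split the pattern the way line-by-line reading with <FILE> can.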


      We are using here a powerful strategy of synthesis: wishful thinking. -- The Wizard Book

        I think you want ".", rather than "\S" in your regex

        I copied that from the original post. I have no idea what sort of file is being used.

        Also, I think your matching code will miss cases where there is more than one match per line.

        The inner while (the statement modifier) takes care of that.


      After some coercion, your response led me down the proper path. Here's my updated code snippet:

      my $match = pack 'H*', '53005f00560045005200530049004f004e005f004' .
                             '9004e0046004f0000000000bd04effe00000100';
      open(FILE,$file) or die $!;
      while (<FILE>) {
          if (/$match(\S{2})(\S{2})(\S{2})(\S{2})/) {
              print "Found match!";
              print join('.',map {unpack 'H*',$_; } ($1,$2,$3,$4));
          }
      }
      close(FILE);

      I think my confusion rested in how perl would handle a binary file. Seems that the DWIM held true... I was making it harder than it was.

      Thanks for your help!

      Cheers,
      Shendal
UNIX Strings Command? - Re: Searching binary data
by metadoktor (Hermit) on Mar 14, 2002 at 17:23 UTC
    What sort of content are you searching for??? If you're searching for normal text then you can sometimes do something like this under UNIX:

    cat binaryfile | strings | more

    Something similar will work under Windows assuming that you either use Cygwin or craft your own program that reads binary files and extracts text.

    metadoktor

    "The doktor is in."

      Strings only works if I am looking for strings in a binary file. What I'm looking for is the value in hex of a specific location in the file. That is, after a series of '0000001000001000', I want to know that the next four words are '1111000001011010'. Unfortunately, the suggestion given by Juerd above doesn't work.
      Cheers,
      Shendal
