in reply to Re: is perl the best tool for this job?(emphatically Yes!)
in thread is perl the best tool for this job ?

Thanks for your very insightful answer! PerlMonks is such a great place because people like you take the time to research questions and give great replies like this :-)

A few points:

My machine is Win2k on a P4 at 1.8 GHz with 256 MB of memory.
The file loading takes 22 seconds. The first search method (the loop with substr) takes 2 minutes; the second takes 1 second. What concerns me, though, is that neither reported that the sequence "wasn't found" in a completely random binary file (the probability of 0xff x 16 appearing at any given offset is 2**-128, quite unlikely).
I wonder about the speed differences... what makes my program run slower on a far stronger PC? A bad Perl implementation (ActivePerl 5.6)?

Also, could you please elaborate on the use of the following line:

$/ = \(100*1024*1024)

I have only used $/ as in "undef $/" to read a file whole, or to set a record separator. What does your line do?

----------- Update ------------

I think I figured out the performance problem. My PC usually runs at about 190 MB of memory usage, which means the 100 MB slurp pushed it into virtual memory, and the resulting swapping naturally degrades performance.
Now I read as follows:

until ( eof(FH) ) {
    ...
    read( FH, $val, 128 );
    ...
}
And then I just compare against the wanted value. The whole process (which now combines reading and testing) takes 28 seconds.

By the way, I was wrong about another thing. Your first method (with substr) works correctly - it doesn't find the string. The second method (with index) does seem to find a string, or at least it thinks so, which is probably wrong (look at your test output; it's obvious there too).

Re: Re: Re: is perl the best tool for this job?(emphatically Yes!)
by BrowserUk (Patriarch) on Oct 20, 2003 at 12:39 UTC

    Sorry, I should have explained that.

    When slurping a complete file, I've found that it pays huge dividends to tell perl how big the file is by setting $/ = \(filesize);. This allows perl to pre-allocate the required memory in a single request to the OS, and then read the file in a single call to the OS.
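    For example, here's a minimal sketch of the fast slurp ($filename is assumed to hold the path; binmode added since we're on Win32 reading binary data):

        open FH, '<', $filename or die "open: $!";
        binmode FH;                      # raw bytes, no CRLF translation
        local $/ = \( -s $filename );    # ref to the filesize: one allocation, one read
        my $data = <FH>;
        close FH;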

    If you just set $/ = undef;, perl doesn't know how big the file is, so it allocates a pre-determined lump of memory (it appears to be about 16k on my system) and reads enough to fill it. It then checks whether there is more, extends the buffer by 16k, and reads the next 16k chunk. With a 100MB file, that requires 6400 allocations/reallocations (with copying) and 6400 reads.
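    By contrast, the usual no-hint slurp looks like this (a sketch of the slow path, same FH as above):

        local $/;          # undef: no size hint, perl grows the buffer as it reads
        my $data = <FH>;   # thousands of small reallocations for a 100MB file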

    Needless to say, this is extremely slow with very large files. I'd like to give a comparative figure here, but I've never had the patience to wait long enough for it to complete on my system. I set it going about 5 minutes before I started typing this reply and it still hasn't finished. A crude measure, going by the task manager memory for the process, suggests that after 8 minutes it has read around 25%. However, as the amount of memory required at each reallocation grows by 16k each time, the amount of memory copied each time also grows (slowly). The effect is that the next 25% will take considerably longer than the first, the next considerably longer again, and the last 25% longer still.

    This is one of those situations where giving perl a helping hand by supplying a little extra information brings a huge performance bonus.


    With respect to using index: remember that it is possible for it to find a pattern that doesn't actually exist in any of your 16-byte chunks. If the last N bytes of one chunk combine with the first (16-N) bytes of the next chunk to produce the pattern you are looking for, then index will find that combination and report success.
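    A tiny illustration of the straddle (hypothetical 4-byte frames for brevity):

        my $data = "ABCD" . "EFGH";          # two 4-byte "frames"
        print index( $data, "CDEF" ), "\n";  # prints 2: the match straddles the frame boundary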

    It therefore becomes necessary to verify that a 16-byte chunk boundary isn't being straddled. This is as easy as

    my $p = -1;
    do {
        # search from just past the previous hit (first pass starts at offset 0)
        $p = index( $data, $bit_pattern, $p + 1 );
    } until ( $p == -1 or ($p % 16) == 0 );   # stop on a miss, or on an aligned hit
    print "Chunk with bit_pattern: $bit_pattern found at: $p" unless $p == -1;

    That could be done better, but it shows what I mean.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    Hooray!

Re: Re: Re: is perl the best tool for this job?(emphatically Yes!)
by BUU (Prior) on Oct 20, 2003 at 07:32 UTC
    $/ = \$number; tells perl to read records of that many bytes instead of lines. So 100*1024*1024 = 100 megs.
Re: Re: Re: is perl the best tool for this job?(emphatically Yes!)
by edan (Curate) on Oct 20, 2003 at 12:51 UTC

    read(FH, $val, 128);

    Please keep in mind that read FILEHANDLE,SCALAR,LENGTH

    'Attempts to read LENGTH bytes of data ...' (emphasis mine - from perldoc -f read)

    You mentioned that you're working with 128-bit 'frames', so you'll want to do 16-byte reads in order to get a 'frame'.
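    Something like this sketch reads one frame at a time (it assumes FH is already open; binmode added for raw bytes):

        binmode FH;
        until ( eof(FH) ) {
            read( FH, my $frame, 16 ) == 16 or last;  # 16 bytes == 128 bits
            # ... compare $frame against the wanted value here ...
        }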

    --
    3dan