in reply to Re^2: Slow find/replace hex in Perl win32
in thread Slow find/replace hex in Perl win32

the private memory footprint just keeps increasing.

In the example I gave, the process's memory never got above its 3.2MB start-up footprint.

It sounds like the file has no (Windows-recognisable) newlines, so -p is trying to load the entire file into memory as a single line?

If so, you may have to resort to processing the file in blocks. Try using:

perl -e"BEGIN{ $/ = \65536 }" -pe "s/\x00\x42\x00\x11/\x00\x42\x00\xf0 +/sgx"

And see what difference, if any, that makes?
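
If block-wise reading does fix it, the same mechanism reads more clearly as a standalone filter. The following is only a sketch, assuming the data is piped through STDIN/STDOUT (perl script.pl < infile > outfile); it shares the chunk-boundary caveat discussed in the replies below:

#!/usr/bin/perl
use strict;
use warnings;

# Assigning a reference to an integer to $/ makes readline() return
# fixed-size records (here 64KB) instead of scanning for line endings.
local $/ = \65536;

binmode STDIN;     # raw bytes: no CRLF translation on Windows
binmode STDOUT;

while ( my $block = <STDIN> ) {
    $block =~ s/\x00\x42\x00\x11/\x00\x42\x00\xf0/g;
    print $block;
}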


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
RIP an inspiration; A true Folk's Guy

Re^4: Slow find/replace hex in Perl win32
by rickyboone (Novice) on Sep 30, 2010 at 00:13 UTC

    Sorry about the delay... meetings.

    The file processed quickly, but didn't seem to get through the whole file. I'm assuming the change only let the process work through the first 65KB of the file?

    I'm trying to have the file processed as one continuous binary stream. I don't need Perl or Windows to perform any EOL conversions or to work on a line-by-line basis, for example. The intent is for the script to just find the hex string, replace it with another, and leave the rest of the file intact.

      I'm assuming the change only let the process work through the first 65KB of the file?

      No. It did process the whole file, but in 64k chunks.

      The reason it ran more quickly is that, if the file doesn't contain newlines, -p would otherwise load the entire file as one huge line.

      As pointed out above, the problem with processing the file in chunks is that if the search term straddles a chunk boundary (say, two bytes at the end of one chunk and two at the beginning of the next), it won't match and the substitution won't be made.

      The really simple solution to that is to process the file twice, with different buffer sizes chosen to be relatively prime. You might use 1MB for the first pass and 1MB - 3 bytes for the second. This ensures that any overlaps missed by the first pass will not fall on a chunk boundary in the second pass. For files up to 1024GB, anyway.

      So,

      perl -e"BEGIN{$/=\(1024**2) }" -pe "s/\x00\x42\x00\x11/\x00\x42\x00\xf0/sg" infile >outfile1 perl -e"BEGIN{$/=\(1024**2-3)}" -pe "s/\x00\x42\x00\x11/\x00\x42\x00\xf0/sg" outfile1 >outfile2

      Making two passes is obviously slower than one, but much faster than loading the whole damn file into RAM on a constrained machine.

      This last point is what I assume to be the cause of the performance differential between your Linux and Windows set-ups: if the former has enough free RAM to load the whole file in one pass, and the latter does not and starts swapping, the difference is explained.

      Another alternative would be to use a sliding buffer, but that's too complicated for a one-liner, and it often doesn't yield enough of a performance gain to beat the two-pass approach.
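
      For the record, the sliding-buffer idea looks something like this as a small script. Only a sketch, under assumptions of my own (piped input, a fixed 4-byte search term, an arbitrary 1MB buffer):

      #!/usr/bin/perl
      use strict;
      use warnings;

      # Carry the last length($search)-1 bytes of each block over into the
      # next, so a match can never be split across a block boundary.
      my $search  = "\x00\x42\x00\x11";
      my $replace = "\x00\x42\x00\xf0";
      my $overlap = length( $search ) - 1;

      binmode STDIN;
      binmode STDOUT;

      my ( $block, $tail ) = ( '', '' );
      while ( read( STDIN, $block, 1024**2 ) ) {
          $block = $tail . $block;
          $block =~ s/\Q$search\E/$replace/g;
          if ( length( $block ) > $overlap ) {
              # Hold back the last few bytes; they may begin a match that
              # only completes in the next block.
              $tail = substr( $block, -$overlap, $overlap, '' );
              print $block;
          }
          else {
              $tail = $block;    # tiny final block: hold it for the flush
          }
      }
      print $tail;               # flush the carry-over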


Re^4: Slow find/replace hex in Perl win32
by jwkrahn (Abbot) on Sep 29, 2010 at 21:59 UTC
    perl -e"BEGIN{ $/ = \65536 }" -pe "s/\x00\x42\x00\x11/\x00\x42\x00\xf0 +/sgx"

      And what if the four-byte pattern crosses a 65536-byte boundary? Oops. :-)

      It was only a test to see if the file lacked appropriate line terminators. Not a solution.

Re^4: Slow find/replace hex in Perl win32
by Anonymous Monk on Sep 29, 2010 at 21:48 UTC
    Maybe the file contains \n but not \r\n, so the line reading works on Unix but not on Windows?

      As I understand it, unless he explicitly removed the :crlf layer from *ARGV, the IO layer will deal with the carriage returns and readline will never see them. I.e., "\n" has the same effective meaning on both platforms.

      But if the file contains EBCDIC, it might not contain anything that looks like a "normal" line ending at all. It's been too long since I worked with EBCDIC; I cannot remember what the equivalent byte code was, nor even whether it had one.

      It still doesn't explain why the code ran so fast on Linux. But there are known cases where the default Windows memory allocator can, under some thankfully rare circumstances, display pathological behaviour.