in reply to Re: Slow find/replace hex in Perl win32
in thread Slow find/replace hex in Perl win32

Just to clarify, there are no line-endings in this file (at least not in ASCII).

I do think I found the problem, though. I didn't realize that Perl was scanning for an end-of-line to delimit each record. Searching on that, I found "slurp mode", -0777 (an undefined record separator, so the whole file is read as one record). Following a few other recommendations, I also reduced the s///sgx options to just s///g, since my pattern contains neither "." nor whitespace and so needs neither /s nor /x. The file is now processed in a matter of seconds, and the result compares properly against other files processed "manually" with hex editors.

perl -0777 -pe "s/\x00\x42\x00\x11/\x00\x42\x00\xf0/g" input > output
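
For anyone who prefers a script to a one-liner, the same slurp-mode substitution could be sketched roughly as below. This is only an illustration: the sub name patch_file and the two file-name arguments are made up for the example, and it opens both files with the :raw layer so that Windows newline translation cannot alter any bytes.

```perl
use strict;
use warnings;

# Replace every 00 42 00 11 with 00 42 00 F0, reading the whole file at once.
# Returns the number of replacements made. (patch_file is a hypothetical name.)
sub patch_file {
    my ($in_name, $out_name) = @_;

    open my $in, '<:raw', $in_name or die "Cannot read $in_name: $!";
    my $data = do { local $/; <$in> };    # undefined $/ = slurp mode, like -0777
    close $in;
    $data = '' unless defined $data;      # an empty file yields undef

    # s///g returns the number of substitutions, or false if there were none.
    my $count = ($data =~ s/\x00\x42\x00\x11/\x00\x42\x00\xf0/g) || 0;

    open my $out, '>:raw', $out_name or die "Cannot write $out_name: $!";
    print {$out} $data;
    close $out or die "Cannot close $out_name: $!";
    return $count;
}

# Usage: perl patch.pl input output
patch_file(@ARGV) if @ARGV == 2;
```

Like the one-liner, this holds the entire file in memory.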

I'm waiting on the availability of another file to test another hex string against, but it won't be available until Oct 1. I think the issue is resolved, but I'd like to wait until then to be sure, unless anyone else has any recommendations or considerations I should be aware of.

Re^3: Slow find/replace hex in Perl win32
by rickyboone (Novice) on Oct 07, 2010 at 14:41 UTC
    Okay, well, I think the code is doing what I want it to do; however, I've run into a new problem: "Out of memory!" errors. The file is greater than 2GB, which is more than the address space available to applications in 32-bit Windows. I'm going to try booting the server with /3GB or /PAE to work around the issue.
      I'm going to try booting the server with /3GB or /PAE to work around the issue.

      If that works, it will only be a matter of time before the file grows bigger than memory again.

      Did you try the two-pass solution? A tad slower, but it'll never run out of memory. It can handle files up to 1024GB as posted, using a 1MB buffer.

      And if 1 Terabyte becomes limiting, increasing the buffer size to 2MB means it can handle 4 TB. A 4MB buffer takes you to 16TB; and so on.

      You can even avoid the need to make two (disk) passes. Simply pipe the output of the first pass to the input of the second:

      perl -e"BEGIN{$/=\(1024**2) }" -pe "s/\x00\x42\x00\x11/\x00\x42\x00\xf0/sg" infile | perl -e"BEGIN{$/=\(1024**2-3)}" -pe "s/\x00\x42\x00\x11/\x00\x42\x00\xf0/sg" >outfile2

      It still makes two passes of the data, but only reads and writes the disk once for each block.
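
If a single pass is preferred, one alternative (a different technique from the piped pair above, offered only as a sketch) is to read fixed-size blocks and hold back the last three bytes of each block, so that a match split across a block boundary is seen whole on the next read. The sub name patch_stream and its arguments are invented for this example, and it assumes the replacement is the same length as the search string, as it is here.

```perl
use strict;
use warnings;

# Stream $in to $out in blocks of $block_size bytes, replacing
# 00 42 00 11 with 00 42 00 F0. (patch_stream is a hypothetical name.)
sub patch_stream {
    my ($in, $out, $block_size) = @_;
    my $find    = "\x00\x42\x00\x11";
    my $replace = "\x00\x42\x00\xf0";
    my $keep    = length($find) - 1;   # most bytes of a match a block can end with
    my $buf     = '';

    while (read($in, my $block, $block_size)) {
        $buf .= $block;                       # held-back tail + new block
        $buf =~ s/\Q$find\E/$replace/g;
        next if length($buf) <= $keep;
        # Write everything except the last $keep bytes; substr with a fourth
        # argument removes the written part from $buf and returns it.
        print {$out} substr($buf, 0, length($buf) - $keep, '');
    }
    print {$out} $buf;                        # flush the final tail
}

# Usage: perl stream.pl infile outfile
if (@ARGV == 2) {
    open my $in,  '<:raw', $ARGV[0] or die "Cannot read $ARGV[0]: $!";
    open my $out, '>:raw', $ARGV[1] or die "Cannot write $ARGV[1]: $!";
    patch_stream($in, $out, 1024**2);
    close $out or die "Cannot close $ARGV[1]: $!";
}
```

Memory use stays at roughly one block regardless of file size, and because the tail is carried forward there is no second pass to line up.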

      To demonstrate the piped approach, take the input file fred:

      c:\test>type fred
      1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890

      Using one pass, with a search term that straddles the buffer boundaries, no changes are made:

      c:\test>perl -e"BEGIN{$/=\10}" -pe" s[8901][abcd]" fred > joe
      c:\test>type joe
      1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890

      But after two piped passes:

      c:\test>perl -e"BEGIN{$/=\10}" -pe" s[8901][abcd]g" fred | perl -e"BEGIN{$/=\7}" -pe"s[8901][abcd]g" >joe

      The changes are made:

      c:\test>type joe
      1234567abcd234567abcd2345678901234567abcd2345678901234567abcd2345678901234567abcd234567abcd2345678901234567abcd2345678901234567890
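
One thing worth checking in that output: five of the twelve occurrences of 8901 (those starting at offsets 27, 47, 67, 97 and 117, counting from 0) are still unchanged. Each of them crosses a 10-byte boundary and a 7-byte boundary at the same time, so neither pass sees it whole. The sketch below reproduces that list and looks for the first place where the two sets of boundaries come within two bytes of each other, which is where a 4-byte match can be missed by both passes. The sub names are made up for the example, and the model (fixed-size records, same-length replacement) is an assumption about how the two passes behave.

```perl
use strict;
use warnings;

# True if a match of $len bytes starting at byte offset $start (0-based)
# begins in one fixed-size record of $n bytes and ends in the next.
sub straddles {
    my ($start, $len, $n) = @_;
    return int($start / $n) != int(($start + $len - 1) / $n);
}

# First pass-1 boundary (a multiple of $n1) that has a pass-2 boundary
# (a multiple of $n2) within $len - 2 bytes, so that a single $len-byte
# match could straddle both. Returns nothing if none is found in $max records.
sub first_close_boundary {
    my ($n1, $n2, $len, $max) = @_;
    my $r = 0;
    for my $k (1 .. $max) {
        $r = ($r + $n1) % $n2;                   # ($k * $n1) % $n2, kept small
        my $gap = $r < $n2 - $r ? $r : $n2 - $r; # distance to nearest multiple of $n2
        return $k * $n1 if $gap <= $len - 2;
    }
    return;
}

# "8901" occurs in fred at offsets 7, 17, ..., 117.
my @missed = grep { straddles($_, 4, 10) && straddles($_, 4, 7) }
             map  { 7 + 10 * $_ } 0 .. 11;
print "missed by both passes: @missed\n";

printf "10 and 7 byte records: first risky boundary at %.0f\n",
    first_close_boundary(10, 7, 4, 100);
printf "1MB and 1MB-3 records: first risky boundary at %.0f\n",
    first_close_boundary(1024**2, 1024**2 - 3, 4, 1_000_000);
```

Under this model the 1MB and 1MB-3 record sizes first come that close at record 349524, roughly 341GB into the file, so a match at just the wrong offset could be skipped well before the 1024GB mark. Whether a miss actually occurs depends on where the pattern happens to fall in the data.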
