rickyboone has asked for the wisdom of the Perl Monks concerning the following question:

My apologies if this comes across as a newbie question. I'm not a Perl developer, but am trying to use it within an automation process, and I've hit a snag.

The following command runs quickly (a few seconds) on my Linux system (Ubuntu 9.10 x64, Perl 5.10), but is extremely slow on a Windows system (Windows 2003 x86, Strawberry Perl 5.12.1.0).

perl -pe 's/\x00\x42\x00\x11/\x00\x42\x00\xf0/sgx' inputfile > outputfile

The pattern find/replace is intended to fix EBCDIC carriage control characters in a file that is between 500MB and 2GB in size. I'm not sure this is the most efficient way to do it, but it seems to do the trick... if only it would run quickly on the Windows system it needs to run on.
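
For reference, the one-liner expands to roughly the following script. The :raw layers are an addition here (an assumption that the data must be handled as raw bytes; on Windows the default text mode can translate line endings in binary data), and the /s and /x modifiers are dropped because they have no effect on this pattern:

#!/usr/bin/perl
use strict;
use warnings;

# Roughly equivalent to the one-liner above, with binary-safe I/O made
# explicit. The :raw layers are an assumption, not part of the original.
open my $in,  '<:raw', 'inputfile'  or die "Can't read inputfile: $!";
open my $out, '>:raw', 'outputfile' or die "Can't write outputfile: $!";

while (my $chunk = <$in>) {          # reads up to each "\n", as -p does
    $chunk =~ s/\x00\x42\x00\x11/\x00\x42\x00\xf0/g;
    print {$out} $chunk;
}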

Any thoughts?

Re: Slow find/replace hex in Perl win32
by BrowserUk (Patriarch) on Sep 29, 2010 at 19:22 UTC

    On a 2GB/5.7 million line file, it consistently runs in 30-35 seconds on my AS1007 perl:

    [20:14:52.61] C:\test>dir 834245.masks
     Volume in drive C has no label.
     Volume Serial Number is 8C78-4B42

     Directory of C:\test

    18/04/2010  01:02     2,412,431,484 834245.masks
                   1 File(s)  2,412,431,484 bytes
                   0 Dir(s)  296,257,802,240 bytes free

    [20:16:19.67] C:\test>perl -ne "s/\x00\x42\x00\x11/\x00\x42\x00\xf0/sg" 834245.masks >junk.dat
    [20:16:55.62] C:\test>

    Which is only a few seconds longer than wc -l takes just to count the lines.

    How long does your SP 5.12 take?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      You're using perl -ne without a print in the executed chunk. Does junk.dat have any size?

      Not that that is likely to affect reading the source file ...

      As Occam said: Entia non sunt multiplicanda praeter necessitatem ("entities must not be multiplied beyond necessity").

        You're right. That's a typo. It takes five minutes when writing the data back to the disk.

        [22:01:55.40] C:\test>perl -pe "s/\x00\x42\x00\x11/\x00\x42\x00\xf0/sg" 834245.masks >junk.dat
        [22:06:46.76] C:\test>perl -pe "s/\x00\x42\x00\x11/\x00\x42\x00\xf0/sg" 834245.masks >junk.dat
        [22:09:31.99] C:\test>

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
      I haven't let it finish; I've let it run for over 30 minutes before having to kill it. Watching performance stats for the perl.exe process, CPU usage hovers around 50%, there is very little I/O (60KB/s or less), and the private memory footprint just keeps increasing.
        the private memory footprint just keeps increasing.

        On the example I gave, the process memory didn't get above the 3.2MB start-up footprint.

        It sounds like the file has no (Windows-recognisable) newlines, so -pe is trying to load the entire file into memory as a single line?

        If so, you may have to resort to processing the file in blocks. Try using:

        perl -e"BEGIN{ $/ = \65536 }" -pe "s/\x00\x42\x00\x11/\x00\x42\x00\xf0 +/sgx"

        And see what difference, if any, that makes.
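
        One caveat (an addition here, not something raised in the thread): with fixed-size records, a 4-byte match can straddle a block boundary and be missed. A sketch that holds back the last three bytes of each block so boundary-spanning matches are still found:

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Process the file in 64KB blocks; hold back the last 3 bytes of
        # each block so a 4-byte match spanning two blocks is not missed.
        open my $in,  '<:raw', $ARGV[0] or die "Can't read $ARGV[0]: $!";
        open my $out, '>:raw', $ARGV[1] or die "Can't write $ARGV[1]: $!";

        my $carry = '';
        while (read $in, my $block, 65536) {
            $block = $carry . $block;
            $block =~ s/\x00\x42\x00\x11/\x00\x42\x00\xf0/g;
            if (length($block) > 3) {
                $carry = substr $block, -3, 3, '';    # keep tail for next read
            }
            else {
                ($carry, $block) = ($block, '');
            }
            print {$out} $block;
        }
        print {$out} $carry;                          # flush the held-back tail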


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        Try disabling your anti-virus's real-time protection temporarily.

        You could also be running out of memory if you have long sequences without any 0x0A.

Re: Slow find/replace hex in Perl win32
by TomDLux (Vicar) on Sep 29, 2010 at 20:54 UTC

    To figure out what is happening, I would start by adding some print statements: one just before and one just after opening the file, another just after reading a line, and so on.

    That should help narrow down where your processing is hanging up.

    Once you've got it running through the loop and it seems to be working, comment out the prints and time the program processing a 1-line file to completion, then 10-, 100-, 1,000-, 10,000-, and 100,000-line data files. What's the trend? What's the expected processing time for 42 million lines?
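
    A minimal harness for that kind of measurement might look like this (the t*.dat file names are hypothetical; Time::HiRes ships with core Perl):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Time::HiRes qw(gettimeofday tv_interval);

    # Time the substitution over progressively larger test files to see
    # whether the cost grows linearly or worse.
    for my $file (qw(t1.dat t10.dat t100.dat t1000.dat)) {
        my $t0 = [gettimeofday];
        system(qq{perl -pe "s/\\x00\\x42\\x00\\x11/\\x00\\x42\\x00\\xf0/g" $file > $file.out}) == 0
            or die "perl failed on $file: $?";
        printf "%-12s %.2f seconds\n", $file, tv_interval($t0);
    }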

    As Occam said: Entia non sunt multiplicanda praeter necessitatem ("entities must not be multiplied beyond necessity").

      Just to clarify, there are no line-endings in this file (at least not in ASCII).

      I do think I found the problem, though. I didn't realize that perl was trying to find the end of each line. Searching on that, I found "slurp mode": -0777 (undefined record separator). Following a few other recommendations, I also reduced the s///sgx options to just s///g, since my pattern doesn't need /s or /x. The file is now processed in a matter of seconds, and the output compares properly to files processed "manually" with hex editors.

      perl -0777 -pe "s/\x00\x42\x00\x11/\x00\x42\x00\xf0/g" input > output

      I'm waiting on the availability of another file to test another hex string against, but it won't be available until Oct 1. I think the issue is resolved, but I'd like to wait until then to be sure, unless anyone else has any recommendations or considerations I should be aware of.
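
      One caveat worth noting (an addition; the clean hex-editor comparison above suggests it didn't bite here): on Windows, the default text-mode I/O can translate \x0D\x0A sequences even in slurp mode, so forcing binary mode may be safer for data like this (input/output stand in for the real file names):

      perl -0777 -pe "BEGIN{ binmode STDIN; binmode STDOUT } s/\x00\x42\x00\x11/\x00\x42\x00\xf0/g" < input > output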

        Okay, well I think the code is doing what I want it to do; however, I've run into a new problem: "Out of memory!" errors. The file is greater than 2GB, which is more than the 2GB of address space a 32-bit Windows process gets by default. I'm going to try booting the server with /3GB or /PAE to work around the issue.
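
        If the boot switches don't pan out, a streaming variant keeps memory flat regardless of file size; a sketch combining the fixed-record-size idea suggested above with raw-mode handles (the block-boundary caveat and the overlap fix sketched earlier still apply; input/output are placeholders):

        perl -pe "BEGIN{ $/ = \65536; binmode STDIN; binmode STDOUT } s/\x00\x42\x00\x11/\x00\x42\x00\xf0/g" < input > output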