in reply to Thrashing on very large lines

If you have enough memory on your computer (> 250MB), try pre-allocating enough room in $_ to stop perl from reallocating the buffer and copying its contents back and forth as it reads from the file:
perl -ne 'BEGIN { $_ = '-' x (260 * 1024 * 1024 )} print and next if length($_)==1001; print STDERR $_' suspect.txt > goodrecs.txt 2> badrecs.txt
update: see my other comment below in this thread
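
A quick way to convince yourself that the string-repetition trick really does reserve the buffer up front is Devel::Peek (a core module), which dumps a scalar's internals; the LEN field is the number of bytes allocated for the string. This is only an illustrative sketch at a tiny size (the real one-liner uses ~260MB), not part of the filter itself:

    # Show the scalar's internals before and after the 'x' pre-allocation.
    # LEN is the allocated buffer size; a later read into the same scalar
    # can reuse that buffer instead of growing it piecemeal.
    use Devel::Peek;

    my $s;
    Dump($s);          # undef: no string buffer yet
    $s = '-' x 50;     # same trick as the one-liner, scaled way down
    Dump($s);          # CUR = 50, LEN >= 51: the buffer is already there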

Re^2: Thrashing on very large lines
by Anonymous Monk on Apr 20, 2006 at 18:09 UTC
    That worked really well, with one caveat: it doesn't work in a BEGIN block; I get the error
    Undefined subroutine &main::x called at -e line 1. BEGIN failed--compilation aborted at -e line 1.
    But if I save it to a script, it works nicely:
    $_ = '-' x (1024*1024*300); while( <> ) { print and next if length($_)==1001; print STDERR $_; }
    perl maxwellsfilter.pl suspect.txt > goodrecs.txt 2> badrecs.txt
    The odd thing (to me) is that it converts the line endings from Unix to Windows... I didn't think it would do that without using "\n" in the print.

    Also, I now realize that with $|++ it did write complete records, but because it converted the line endings the record length changed from 1001 to 1002.
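
    For what it's worth, the Unix-to-Windows conversion comes from perl's default :crlf layer on Windows, which turns each "\n" into "\r\n" on output; since every record already ends in "\n", each one picks up an extra byte (1001 -> 1002). Assuming that is indeed what is happening, a minimal tweak to the script above is to put the output handles into binary mode before printing:

        # Same filter, with STDOUT/STDERR switched to binary mode so the
        # "\n" already present in each record is written out unchanged.
        binmode STDOUT;
        binmode STDERR;

        $_ = '-' x (1024 * 1024 * 300);    # pre-allocate the read buffer
        while (<>) {
            print and next if length($_) == 1001;
            print STDERR $_;
        }

    The invocation stays the same: perl maxwellsfilter.pl suspect.txt > goodrecs.txt 2> badrecs.txt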

      That worked really well with one caveat. It doesn't work in a BEGIN...

      Oops, this is a shell quoting problem; try using double quotes around the - instead of single ones:

      perl -ne 'BEGIN { $_ = "-" x (260 * 1024 * 1024 )} print and next if length($_)==1001; print STDERR $_' suspect.txt > goodrecs.txt 2> badrecs.txt
        I should have noticed the quotes myself the first time. :(

        Since the command-line solution has the side-effect of translating the line-endings, I went with the scripted version from another reply. But I eventually returned to this command-line version to see how it works.

        Good news: your $_ pre-allocation trick works, mostly.

        Bad news: I had to guess at the right amount. Too low, and it still crawls at some point when it reads the huge record; too high, and it crawls at the beginning trying to pre-allocate $_. It worked without crawling only for values between 270*1024*1024 and 300*1024*1024, which is pretty limiting for a general solution. The binmode-script (in this thread) is the best general solution.

        Still, if line-ending translation is OK, and I'm NOT running into single records/lines that are hundreds of MB in size, this is a reasonably fast and easy way to sort the records. On the 900MB source file, the command-line version took 2m15s compared to the binmode-script at 1m45s.
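
        The binmode-script itself isn't reproduced in this node, but the general idea it relies on -- read the file in raw (binmode) fixed-size chunks, split records out of a small buffer yourself, and stream over-long records straight to STDERR instead of holding them whole -- looks roughly like the sketch below. This is a reconstruction under those assumptions, not the actual script from the thread, and the 1MB chunk size is arbitrary.

            use strict;
            use warnings;

            my $file = shift or die "usage: $0 suspect.txt > goodrecs.txt 2> badrecs.txt\n";
            open my $in, '<', $file or die "open $file: $!";
            binmode $in;        # raw bytes in...
            binmode STDOUT;     # ...and raw bytes out: no line-ending translation
            binmode STDERR;

            my $buf = '';
            my $bad_tail = 0;   # true while streaming the rest of an over-long record

            while (read($in, my $chunk, 1024 * 1024)) {
                $buf .= $chunk;
                while (1) {
                    my $nl = index($buf, "\n");
                    if ($bad_tail) {                      # finish off a long bad record
                        if ($nl < 0) { print STDERR $buf; $buf = ''; last }
                        print STDERR substr($buf, 0, $nl + 1, '');
                        $bad_tail = 0;
                    }
                    elsif ($nl >= 0) {                    # a complete record is buffered
                        my $rec = substr($buf, 0, $nl + 1, '');
                        print { length($rec) == 1001 ? *STDOUT : *STDERR } $rec;
                    }
                    elsif (length($buf) > 1001) {         # too long to be good: stream it
                        print STDERR $buf;
                        $buf = '';
                        $bad_tail = 1;
                        last;
                    }
                    else { last }                         # need more data to decide
                }
            }
            print STDERR $buf if length $buf;             # trailing bytes with no newline
            close $in;

        Because the buffer is flushed as soon as it can no longer hold a good record, it never grows past a couple of MB, which is why no pre-allocation guess is needed.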

        Thanks very much for your help and insight.