in reply to Thrashing on very large lines

If you have enough memory on your computer (> 250MB), try pre-allocating enough room in $_ to stop perl from reallocating the buffer and copying its contents back and forth as it reads from the file:
perl -ne 'BEGIN { $_ = '-' x (260 * 1024 * 1024 )} print and next if length($_)==1001; print STDERR $_' suspect.txt > goodrecs.txt 2> badrecs.txt
update: see my other comment below in this thread
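
A quick way to convince yourself that the string-repetition trick really does reserve the buffer up front is Devel::Peek (a core module), which dumps a scalar's internals; the LEN field is the number of bytes allocated for the string. This is only an illustrative sketch at a tiny size (the real one-liner uses ~260MB), not part of the filter itself:

    # Show the scalar's internals before and after the 'x' pre-allocation.
    # LEN is the allocated buffer size; a later read into the same scalar
    # can reuse that buffer instead of growing it piecemeal.
    use Devel::Peek;

    my $s;
    Dump($s);          # undef: no string buffer yet
    $s = '-' x 50;     # same trick as the one-liner, scaled way down
    Dump($s);          # CUR = 50, LEN >= 51: the buffer is already there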

Re^2: Thrashing on very large lines
by Anonymous Monk on Apr 20, 2006 at 18:09 UTC
    That worked really well, with one caveat: it doesn't work in a BEGIN block; I get the error
    Undefined subroutine &main::x called at -e line 1. BEGIN failed--compilation aborted at -e line 1.
    But if I save it to a script, it works nicely:
    $_ = '-' x (1024*1024*300); while( <> ) { print and next if length($_)==1001; print STDERR $_; }
    perl maxwellsfilter.pl suspect.txt > goodrecs.txt 2> badrecs.txt
    The odd thing (to me) is that it converts the line endings from Unix to Windows... I didn't think it would do that without using "\n" in the print.

    Also, I now realize that with $|++ it did write complete records, but because it converted the line endings the record length changed from 1001 to 1002.
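
    For what it's worth, the Unix-to-Windows conversion comes from perl's default :crlf layer on Windows, which turns each "\n" into "\r\n" on output; since every record already ends in "\n", each one picks up an extra byte (1001 -> 1002). Assuming that is indeed what is happening, a minimal tweak to the script above is to put the output handles into binary mode before printing:

        # Same filter, with STDOUT/STDERR switched to binary mode so the
        # "\n" already present in each record is written out unchanged.
        binmode STDOUT;
        binmode STDERR;

        $_ = '-' x (1024 * 1024 * 300);    # pre-allocate the read buffer
        while (<>) {
            print and next if length($_) == 1001;
            print STDERR $_;
        }

    The invocation stays the same: perl maxwellsfilter.pl suspect.txt > goodrecs.txt 2> badrecs.txt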

      That worked really well with one caveat. It doesn't work in a BEGIN...

      Oops, this is a shell quoting problem; try using double quotes around the - instead of single ones:

      perl -ne 'BEGIN { $_ = "-" x (260 * 1024 * 1024 )} print and next if length($_)==1001; print STDERR $_' suspect.txt > goodrecs.txt 2> badrecs.txt
        I should have noticed the quotes myself the first time. :(

        Since the command-line solution has the side-effect of translating the line-endings, I went with the scripted version from another reply. But I eventually returned to this command-line version to see how it works.

        Good news: your $_ pre-allocation trick works, mostly.

        Bad news: I had to guess at the right amount. Too low, and it still crawls at some point when it reads the huge record; too high, and it crawls at the beginning trying to pre-allocate $_. It worked without crawling only for values between 270*1024*1024 and 300*1024*1024, which is pretty limiting for a general solution. The binmode-script (in this thread) is the best general solution.

        Still, if line-ending translation is OK, and I'm NOT running into single records/lines that are hundreds of MB in size, this is a reasonably fast and easy way to sort the records. On the 900MB source file, the command-line version took 2m15s compared to the binmode-script at 1m45s.
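
        The binmode-script itself isn't reproduced in this node, but the general idea it relies on -- read the file in raw (binmode) fixed-size chunks, split records out of a small buffer yourself, and stream over-long records straight to STDERR instead of holding them whole -- looks roughly like the sketch below. This is a reconstruction under those assumptions, not the actual script from the thread, and the 1MB chunk size is arbitrary.

            use strict;
            use warnings;

            my $file = shift or die "usage: $0 suspect.txt > goodrecs.txt 2> badrecs.txt\n";
            open my $in, '<', $file or die "open $file: $!";
            binmode $in;        # raw bytes in...
            binmode STDOUT;     # ...and raw bytes out: no line-ending translation
            binmode STDERR;

            my $buf = '';
            my $bad_tail = 0;   # true while streaming the rest of an over-long record

            while (read($in, my $chunk, 1024 * 1024)) {
                $buf .= $chunk;
                while (1) {
                    my $nl = index($buf, "\n");
                    if ($bad_tail) {                      # finish off a long bad record
                        if ($nl < 0) { print STDERR $buf; $buf = ''; last }
                        print STDERR substr($buf, 0, $nl + 1, '');
                        $bad_tail = 0;
                    }
                    elsif ($nl >= 0) {                    # a complete record is buffered
                        my $rec = substr($buf, 0, $nl + 1, '');
                        print { length($rec) == 1001 ? *STDOUT : *STDERR } $rec;
                    }
                    elsif (length($buf) > 1001) {         # too long to be good: stream it
                        print STDERR $buf;
                        $buf = '';
                        $bad_tail = 1;
                        last;
                    }
                    else { last }                         # need more data to decide
                }
            }
            print STDERR $buf if length $buf;             # trailing bytes with no newline
            close $in;

        Because the buffer is flushed as soon as it can no longer hold a good record, it never grows past a couple of MB, which is why no pre-allocation guess is needed.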

        Thanks very much for your help and insight.