chr1so has asked for the wisdom of the Perl Monks concerning the following question:

I have a simple problem and a solution that is too simple...

I have a simple fixed-length file of about 1GB in size, with records 1001 bytes long. However, a few lines are the wrong size. I want to separate them into two files, good.txt and bad.txt.

I started with the following, which works on a small test file perfectly:

perl -ne 'print and next if length($_)==1001; print STDERR $_' suspect.txt > goodrecs.txt 2> badrecs.txt

The problem seems to be that there is a bad record in the real-world file that is around 250MB long and near the end. So it looks like Perl is thrashing for hours trying to read that line into $_. (In Windows XP, the memory usage bounces up and down by 100MB every few seconds, and it's reading about 8K per second.)

I tried adding BEGIN {$|++} so at least it would flush the buffer to goodrecs.txt before thrashing, but it still stops writing mid-record. (But it does write a little more.)

What can I do? And what is a general solution? :)

Any guidance is appreciated, oh Perl monks.

Re: Thrashing on very large lines
by GrandFather (Saint) on Apr 20, 2006 at 05:08 UTC

    Open the file in binary mode. Read it a modest-sized block at a time (say 8K) and concatenate each new block onto the end of a buffer. Use index to find the line ends and pull the lines out of the buffer with substr. Top up the buffer when there are no more line ends.
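
    For illustration, a minimal sketch along those lines (untested on a real 1GB file; the file names and the 1001-byte record length, newline included, are taken from the question):

        use strict;
        use warnings;

        my $rec_len  = 1001;    # record length including the trailing "\n", as in the one-liner
        my $blk_size = 8192;

        open my $in,   '<', 'suspect.txt'  or die "open suspect.txt: $!";
        open my $good, '>', 'goodrecs.txt' or die "open goodrecs.txt: $!";
        open my $bad,  '>', 'badrecs.txt'  or die "open badrecs.txt: $!";
        binmode $_ for $in, $good, $bad;

        my $buf    = '';
        my $in_bad = 0;    # true while we are midway through an overlong record

        while (1) {
            # The fourth argument appends the new block to the end of $buf.
            my $read = read $in, $buf, $blk_size, length $buf;
            die "read suspect.txt: $!" unless defined $read;

            while (length $buf) {
                my $pos = index $buf, "\n";
                if ($pos >= 0) {
                    # Four-argument substr extracts the line and removes it from $buf.
                    my $line = substr $buf, 0, $pos + 1, '';
                    print { $in_bad || length($line) != $rec_len ? $bad : $good } $line;
                    $in_bad = 0;
                }
                elsif (length($buf) >= $rec_len) {
                    # No line end yet, but already too long to be a good record:
                    # flush now so the buffer never exceeds $blk_size + $rec_len bytes.
                    print $bad $buf;
                    $buf    = '';
                    $in_bad = 1;
                    last;
                }
                else {
                    last;    # need more data before we can decide
                }
            }

            if ($read == 0) {                      # EOF
                print $bad $buf if length $buf;    # unterminated trailing record
                last;
            }
        }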


    DWIM is Perl's answer to Gödel
Re: Thrashing on very large lines
by ikegami (Patriarch) on Apr 20, 2006 at 06:50 UTC
    open(my $fh_in, '<', ...) or die("Unable to open input file: $!\n");
    open(my $fh_good, '>', ...) or die("Unable to open 'good' output file: $!\n");
    open(my $fh_bad, '>', ...) or die("Unable to open 'bad' output file: $!\n");

    binmode($fh_in);
    binmode($fh_good);
    binmode($fh_bad);

    my $buf = '';
    my $skip = 0;

    local $/ = "\r\n"; # Because we're using binmode.
    my $rec_len = 1000 + length($/);

    # Buffer size can be up to $blk_size + $rec_len bytes.
    my $blk_size = 8192;

    for (;;) {
        my $read = read($fh_in, $buf, $blk_size, length($ofs));
        defined $read
            or die("Unable to read input file: $!\n");

        while (length($buf) >= $rec_len) {
            my $pos = index($buf, $/);
            if ($pos < 0) {
                print $fh_bad $buf;
                $buf = '';
                $skip = 1;
            }
            else {
                $pos += $ofs + length($/);
                my $bad = $skip || $pos != $rec_len;
                print { $bad ? $fh_bad : $fh_good } (substr($buf, 0, $pos));
                substr($buf, 0, $pos, '');
                $skip = 0;
            }
        }

        if ($read == 0) {
            print $fh_bad $buf if length $buf;
            last;
        }
    }

    Untested.

    Update: Cleanup.

      Thanks! This is a little quicker, but more importantly it doesn't convert the line endings like the one-liner.

      There was one bug where $buf was overwritten by the read() call. I've cleaned it up until it sorts the same number of records as the one-liner.

      scalar @ARGV == 3
          or die "usage: this.pl <suspect.txt> <good.txt> <bad.txt>\n";

      open(my $fh_in, '<', shift) or die("Unable to open input file: $!\n");
      open(my $fh_good, '>', shift) or die("Unable to open 'good' output file: $!\n");
      open(my $fh_bad, '>', shift) or die("Unable to open 'bad' output file: $!\n");

      binmode($fh_in);
      binmode($fh_good);
      binmode($fh_bad);

      my $buf = '';          # stores the working buffer
      my $newbuf = '';       # used for reading next chunk
      my $is_continued = 0;  # remembers that the current record is longer than it appears

      local $/ = "\n";  # Because we're using binmode.
      my $rec_len = 1000 + length($/);

      # Working buffer size can be up to $blk_size + $rec_len bytes.
      my $blk_size = 8192;

      for (;;) {
          my $read = read($fh_in, $newbuf, $blk_size);
          defined $read or die("Unable to read input file: $!\n");
          $buf .= $newbuf;

          # if it didn't read anything new, flush the buffer and end
          if ($read == 0) {
              print $fh_bad $buf if length $buf;
              last;
          }

          while (length($buf) >= $rec_len) {
              my $pos = index($buf, $/);
              if ($pos < 0) {  # no line ending, but long enough
                  print $fh_bad $buf;
                  $buf = '';          # flush, ends while
                  $is_continued = 1;
              }
              else {  # line ending found
                  $pos += length($/);
                  my $is_bad = $is_continued || $pos != $rec_len;
                  print {$is_bad ? $fh_bad : $fh_good} (substr($buf, 0, $pos));
                  substr($buf, 0, $pos, '');  # clip written section
                  $is_continued = 0;          # reset whenever line ending found
              }
          }
      }

        Moving the check for $read == 0 was quite appropriate. Admittedly, $skip was not optimally named. (I'd prefer $in_bad_record. It was called $skip because I initially discarded the bad stuff. I didn't notice you wanted to keep it.) However, your fix of the bug in read wasn't optimal.

        Me:

        my $read = read($fh_in, $buf, $blk_size, length($ofs));

        You:

        my $read = read($fh_in, $newbuf, $blk_size);
        ...
        $buf .= $newbuf;

        What it should be:

        my $read = read($fh_in, $buf, $blk_size, length($buf));

        The four-argument form tells read to place the new block at the given offset in $buf, appending in place, so the intermediate copy through $newbuf disappears.
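
        For comparison, a tiny self-contained demo of the four-argument read (the sample data and in-memory filehandle are made up for illustration):

            use strict;
            use warnings;

            my $data = 'abcdefghij';             # ten bytes of sample data
            open my $in, '<', \$data or die $!;  # in-memory filehandle

            my $buf = '';
            # The offset argument makes each read land at the end of $buf
            # instead of overwriting what is already there.
            1 while read($in, $buf, 3, length $buf);
            print "$buf\n";                      # prints "abcdefghij" - nothing lost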
Re: Thrashing on very large lines
by salva (Canon) on Apr 20, 2006 at 09:21 UTC
    If you have enough memory on your computer (> 250MB), try pre-allocating enough room in $_ to stop perl from reallocating the buffer and copying its contents back and forth as it reads from the file:
    perl -ne 'BEGIN { $_ = '-' x (260 * 1024 * 1024 )} print and next if length($_)==1001; print STDERR $_' suspect.txt > goodrecs.txt 2> badrecs.txt
    update: see my other comment below in this thread
      That worked really well with one caveat. It doesn't work in a BEGIN; I get the error
      Undefined subroutine &main::x called at -e line 1.
      BEGIN failed--compilation aborted at -e line 1.
      But if I save it to a script it works nicely
      $_ = '-' x (1024*1024*300);
      while( <> ) {
          print and next if length($_)==1001;
          print STDERR $_;
      }
      perl maxwellsfilter.pl suspect.txt > goodrecs.txt 2> badrecs.txt
      The odd thing (to me) is that it converts the line endings from Unix to Windows... I didn't think it would do that without using "\n" in the print.

      Also, I now realize that with $|++ it did write complete records, but because it converted the line endings, the record length changed from 1001 to 1002.
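
      One way to keep the records at exactly 1001 bytes (a sketch combining the pre-allocation trick with ikegami's binmode idea; the explicit open replaces the <> loop so the input handle can be binmoded too):

          # maxwellsfilter.pl, with binmode so Windows does not turn
          # "\n" into "\r\n" on the way out
          $_ = '-' x (1024*1024*300);   # pre-allocate, as suggested above
          binmode(STDOUT);
          binmode(STDERR);
          open(my $in, '<', shift) or die "Unable to open input file: $!\n";
          binmode($in);
          while (<$in>) {
              print and next if length($_) == 1001;
              print STDERR $_;
          }

      It would be invoked exactly as before: perl maxwellsfilter.pl suspect.txt > goodrecs.txt 2> badrecs.txt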

        That worked really well with one caveat. It doesn't work in a BEGIN...

        Oops, this is a shell quoting problem; try using double quotes around the - instead of single ones:

        perl -ne 'BEGIN { $_ = "-" x (260 * 1024 * 1024 )} print and next if l +ength($_)==1001; print STDERR $_' suspect.txt > goodrecs.txt 2> badre +cs.txt
Re: Thrashing on very large lines
by aufflick (Deacon) on Apr 20, 2006 at 06:05 UTC
    Do you care about the bad lines? If you are happy to discard them then your job will be easier.