chr1so has asked for the wisdom of the Perl Monks concerning the following question:

I have a simple problem and a solution that is too simple...

I have a simple fixed-length file of about 1GB in size, with records 1001 bytes long. However, a few lines are the wrong size. I want to separate them into two files, good.txt and bad.txt.

I started with the following, which works on a small test file perfectly:

perl -ne 'print and next if length($_)==1001; print STDERR $_' suspect.txt > goodrecs.txt 2> badrecs.txt

The problem seems to be that there is a bad record in the real-world file that is around 250MB long and near the end. So it looks like Perl is thrashing for hours trying to read that line into $_. (In Windows XP, the memory usage bounces up and down by 100MB every few seconds, and it's reading about 8K per second.)

I tried adding BEGIN {$|++} so at least it would flush the buffer to goodrecs.txt before thrashing, but it still stops writing mid-record. (But it does write a little more.)

What can I do? And what is a general solution? :)

Any guidance is appreciated, oh Perl monks.

Re: Thrashing on very large lines
by GrandFather (Saint) on Apr 20, 2006 at 05:08 UTC

    Open the file in binary mode. Read it a modest-sized block at a time (say 8K) and concatenate each new block onto the end of a buffer. Use index to find the line ends and pull the lines out of the buffer with substr. Top up the buffer when there are no more line ends.
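
    For illustration, a minimal sketch along those lines (untested on a real 1GB file; the file names and the 1001-byte record length, newline included, are taken from the question):

        use strict;
        use warnings;

        my $rec_len  = 1001;    # record length including the trailing "\n", as in the one-liner
        my $blk_size = 8192;

        open my $in,   '<', 'suspect.txt'  or die "open suspect.txt: $!";
        open my $good, '>', 'goodrecs.txt' or die "open goodrecs.txt: $!";
        open my $bad,  '>', 'badrecs.txt'  or die "open badrecs.txt: $!";
        binmode $_ for $in, $good, $bad;

        my $buf    = '';
        my $in_bad = 0;    # true while we are midway through an overlong record

        while (1) {
            # The fourth argument appends the new block to the end of $buf.
            my $read = read $in, $buf, $blk_size, length $buf;
            die "read suspect.txt: $!" unless defined $read;

            while (length $buf) {
                my $pos = index $buf, "\n";
                if ($pos >= 0) {
                    # Four-argument substr extracts the line and removes it from $buf.
                    my $line = substr $buf, 0, $pos + 1, '';
                    print { $in_bad || length($line) != $rec_len ? $bad : $good } $line;
                    $in_bad = 0;
                }
                elsif (length($buf) >= $rec_len) {
                    # No line end yet, but already too long to be a good record:
                    # flush now so the buffer never exceeds $blk_size + $rec_len bytes.
                    print $bad $buf;
                    $buf    = '';
                    $in_bad = 1;
                    last;
                }
                else {
                    last;    # need more data before we can decide
                }
            }

            if ($read == 0) {                      # EOF
                print $bad $buf if length $buf;    # unterminated trailing record
                last;
            }
        }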


    DWIM is Perl's answer to Gödel
Re: Thrashing on very large lines
by ikegami (Patriarch) on Apr 20, 2006 at 06:50 UTC
    open(my $fh_in, '<', ...) or die("Unable to open input file: $!\n");
    open(my $fh_good, '>', ...) or die("Unable to open 'good' output file: $!\n");
    open(my $fh_bad, '>', ...) or die("Unable to open 'bad' output file: $!\n");

    binmode($fh_in);
    binmode($fh_good);
    binmode($fh_bad);

    my $buf = '';
    my $skip = 0;

    local $/ = "\r\n"; # Because we're using binmode.
    my $rec_len = 1000 + length($/);

    # Buffer size can be up to $blk_size + $rec_len bytes.
    my $blk_size = 8192;

    for (;;) {
        my $read = read($fh_in, $buf, $blk_size, length($ofs));
        defined $read
            or die("Unable to read input file: $!\n");

        while (length($buf) >= $rec_len) {
            my $pos = index($buf, $/);
            if ($pos < 0) {
                print $fh_bad $buf;
                $buf = '';
                $skip = 1;
            }
            else {
                $pos += $ofs + length($/);
                my $bad = $skip || $pos != $rec_len;
                print { $bad ? $fh_bad : $fh_good } (substr($buf, 0, $pos));
                substr($buf, 0, $pos, '');
                $skip = 0;
            }
        }

        if ($read == 0) {
            print $fh_bad $buf if length $buf;
            last;
        }
    }

    Untested.

    Update: Cleanup.

      Thanks! This is a little quicker, but more importantly it doesn't convert the line endings like the one-liner.

      There was one bug where $buf was overwritten by the read() call. I've cleaned it up until it sorts the same number of records as the one-liner.

      scalar @ARGV == 3
          or die "usage: this.pl <suspect.txt> <good.txt> <bad.txt>\n";

      open(my $fh_in, '<', shift) or die("Unable to open input file: $!\n");
      open(my $fh_good, '>', shift) or die("Unable to open 'good' output file: $!\n");
      open(my $fh_bad, '>', shift) or die("Unable to open 'bad' output file: $!\n");

      binmode($fh_in);
      binmode($fh_good);
      binmode($fh_bad);

      my $buf = '';          # stores the working buffer
      my $newbuf = '';       # used for reading next chunk
      my $is_continued = 0;  # remembers that the current record is longer than it appears

      local $/ = "\n";  # Because we're using binmode.
      my $rec_len = 1000 + length($/);

      # Working buffer size can be up to $blk_size + $rec_len bytes.
      my $blk_size = 8192;

      for (;;) {
          my $read = read($fh_in, $newbuf, $blk_size);
          defined $read or die("Unable to read input file: $!\n");
          $buf .= $newbuf;

          # if it didn't read anything new, flush the buffer and end
          if ($read == 0) {
              print $fh_bad $buf if length $buf;
              last;
          }

          while (length($buf) >= $rec_len) {
              my $pos = index($buf, $/);
              if ($pos < 0) {  # no line ending, but long enough
                  print $fh_bad $buf;
                  $buf = '';          # flush, ends while
                  $is_continued = 1;
              }
              else {  # line ending found
                  $pos += length($/);
                  my $is_bad = $is_continued || $pos != $rec_len;
                  print {$is_bad ? $fh_bad : $fh_good} (substr($buf, 0, $pos));
                  substr($buf, 0, $pos, '');  # clip written section
                  $is_continued = 0;          # reset whenever line ending found
              }
          }
      }

        Moving the check for $read == 0 was quite appropriate. Admittedly, $skip was not optimally named. (I'd prefer $in_bad_record. It was called $skip because I initially discarded the bad stuff. I didn't notice you wanted to keep it.) However, your fix of the bug in read wasn't optimal.

        Me:

        my $read = read($fh_in, $buf, $blk_size, length($ofs));

        You:

        my $read = read($fh_in, $newbuf, $blk_size);
        ...
        $buf .= $newbuf;

        What it should be:

        my $read = read($fh_in, $buf, $blk_size, length($buf));

        The four-argument form tells read to place the new block at the given offset in $buf, appending in place, so the intermediate copy through $newbuf disappears.
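
        For comparison, a tiny self-contained demo of the four-argument read (the sample data and in-memory filehandle are made up for illustration):

            use strict;
            use warnings;

            my $data = 'abcdefghij';             # ten bytes of sample data
            open my $in, '<', \$data or die $!;  # in-memory filehandle

            my $buf = '';
            # The offset argument makes each read land at the end of $buf
            # instead of overwriting what is already there.
            1 while read($in, $buf, 3, length $buf);
            print "$buf\n";                      # prints "abcdefghij" - nothing lost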
Re: Thrashing on very large lines
by salva (Canon) on Apr 20, 2006 at 09:21 UTC
    If you have enough memory on your computer (> 250MB), try pre-allocating enough room in $_ to stop perl from reallocating the buffer and copying its contents back and forth as it reads from the file:
    perl -ne 'BEGIN { $_ = '-' x (260 * 1024 * 1024 )} print and next if length($_)==1001; print STDERR $_' suspect.txt > goodrecs.txt 2> badrecs.txt
    update: see my other comment below in this thread
      That worked really well with one caveat. It doesn't work in a BEGIN; I get the error
      Undefined subroutine &main::x called at -e line 1.
      BEGIN failed--compilation aborted at -e line 1.
      But if I save it to a script it works nicely
      $_ = '-' x (1024*1024*300);
      while( <> ) {
          print and next if length($_)==1001;
          print STDERR $_;
      }
      perl maxwellsfilter.pl suspect.txt > goodrecs.txt 2> badrecs.txt
      The odd thing (to me) is that it converts the line endings from Unix to Windows... I didn't think it would do that without using "\n" in the print.

      Also, I now realize that with $|++ it did write complete records, but because it converted the line endings, the record length changed from 1001 to 1002.
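
      One way to keep the records at exactly 1001 bytes (a sketch combining the pre-allocation trick with ikegami's binmode idea; the explicit open replaces the <> loop so the input handle can be binmoded too):

          # maxwellsfilter.pl, with binmode so Windows does not turn
          # "\n" into "\r\n" on the way out
          $_ = '-' x (1024*1024*300);   # pre-allocate, as suggested above
          binmode(STDOUT);
          binmode(STDERR);
          open(my $in, '<', shift) or die "Unable to open input file: $!\n";
          binmode($in);
          while (<$in>) {
              print and next if length($_) == 1001;
              print STDERR $_;
          }

      It would be invoked exactly as before: perl maxwellsfilter.pl suspect.txt > goodrecs.txt 2> badrecs.txt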

        That worked really well with one caveat. It doesn't work in a BEGIN...

        Oops, this is a shell quoting problem; try using double quotes around the - instead of single ones:

        perl -ne 'BEGIN { $_ = "-" x (260 * 1024 * 1024 )} print and next if l +ength($_)==1001; print STDERR $_' suspect.txt > goodrecs.txt 2> badre +cs.txt
Re: Thrashing on very large lines
by aufflick (Deacon) on Apr 20, 2006 at 06:05 UTC
    Do you care about the bad lines? If you are happy to discard them then your job will be easier.