Melly has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monkees

I have a large input file. Every time a line starts with a "1", I want to handle the preceding lines up to the preceding "1" (a "1" indicates the start of a different subject's data).

Because of the file-sizes involved, as well as other unavoidable memory requirements, I'm writing out the subject's data to a temporary file (rather than trying to hold all the lines in memory).

All okay - except my handling of new records is, IMHO, clumsy, and leads to an extra call to the subroutine, "output_data" at the end of the process (to handle the final record).

Here is the simplified code:

$counter = 1;
open(DATA, "export.dat")||die "Cannot open export.dat for read:$!\n";
while(<DATA>){
  $temp_line = $_;
  if($temp_line =~ /^1(\d*)/){
    if($counter > 1 ){ #No temp file yet if this is the first record
      close TEMP||die "Cannot close temp.dat:$!\n";
      &output_data();
    }
    open(TEMP, ">$temp.dat")||die "Can't open temp.dat:$!\n";
    print TEMP $temp_line;
    $counter ++;
  }
  elsif(/\S+/){
    print TEMP $temp_line;
  }
}
close DATA||die "Cannot close $in_dir/export.dat (weird):$!\n";
close TEMP||die "Cannot close $in_dir/temp.dat:$!\n";
&output_data();

sub output_data{
  #do stuff with temp.dat
}

Can anyone suggest a more elegant way to handle this kind of thing?

Oh, and BTW, I've just noticed that I call my filehandle "DATA" - does this create any potential problems? Hmm, could be confusing to whoever maintains the code if they know about __DATA__ - better change it I guess....
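
On the DATA question, a minimal sketch of one way to sidestep the clash: DATA is the handle Perl itself sets up for text after a __DATA__ (or __END__) marker, so a lexical filehandle avoids reusing the name. The helper count_record_starts and the in-memory sample are assumptions made up for illustration; passing a filename instead of a scalar reference would read a real file.

```perl
use strict;
use warnings;

# Hypothetical helper: count the lines that start a new subject ("1...").
# $source is a filename, or a reference to a string for in-memory data.
sub count_record_starts {
    my ($source) = @_;
    open(my $in, '<', $source)
        or die "Cannot open input for read: $!\n";
    my $count = 0;
    while (my $line = <$in>) {
        $count++ if $line =~ /^1/;
    }
    close($in) or die "Cannot close input: $!\n";
    return $count;
}

# In-memory sample standing in for export.dat (assumption for the demo).
my $sample = "100000001\n200546Mary\n100000003\n200546Kathy\n";
print count_record_starts(\$sample), "\n";
```

With a lexical handle the name DATA stays free for an embedded __DATA__ section, and the handle is closed automatically if $in goes out of scope.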

map{$a=1-$_/10;map{$d=$a;$e=$b=$_/20-2;map{($d,$e)=(2*$d*$e+$a,$e**2 -$d**2+$b);$c=$d**2+$e**2>4?$d=8:_}1..50;print$c}0..59;print$/}0..20
Tom Melly, pm@tomandlu.co.uk

Replies are listed 'Best First'.
Re: Ugly variable-length record handling
by roboticus (Chancellor) on Dec 28, 2006 at 14:28 UTC
    Melly:

    Actually, that doesn't look too bad. I'd change a few minor things to make it look roughly like this: (UNTESTED!)

    #!/usr/bin/perl -w
    use strict;
    use warnings;

    my $counter=0;
    open(DATA, "export.dat")||die "Cannot open export.dat for read:$!\n";
    while(<DATA>){
      if (/^1(\d*)/) {
        if ($counter > 0) { #No temp file yet if this is the first record
          close TEMP||die "Cannot close temp.dat:$!\n";
          &output_data();
        }
        open(TEMP, ">$temp.dat")||die "Can't open temp.dat:$!\n";
        $counter = 0;
      }
      if (/\S+/) {
        print TEMP $_;
        ++$counter;
      }
    }
    close DATA||die "Cannot close $in_dir/export.dat (weird):$!\n";
    if ($counter > 0) {
      close TEMP||die "Cannot close $in_dir/temp.dat:$!\n";
    }
    &output_data();

    sub output_data{
      #do stuff with temp.dat
    }

    The main change is that I consolidated the write to the temp file into a single place to clarify things. That keeps the special-case handler smaller and easier to read, and puts every write in one location, which is handy if the write ever needs to change.

    --roboticus

      Thanks roboticus - the change you suggest is minor but welcome. The way I handle the new records still bugs me, but it looks like that may be as good as it gets...

Re: Ugly variable-length record handling
by jettero (Monsignor) on Dec 28, 2006 at 14:08 UTC
    Is there a way to detect possible start locations? You could maybe memorize them with tell and go back (using seek) in the source file when you reach a certain point?

    -Paul
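
    A rough sketch of the tell/seek idea, under the assumption that one pass to index the record starts is acceptable: remember the byte offset of every line that begins with "1", then seek back to an offset to re-read just that record. The helper names record_offsets and read_record are hypothetical, and the sample is held in memory only so the sketch runs on its own.

```perl
use strict;
use warnings;

# Hypothetical helper: scan the whole file once and return the byte offset
# of every line that starts with "1" (the start of a subject's record).
sub record_offsets {
    my ($fh) = @_;
    my @offsets;
    seek($fh, 0, 0) or die "Cannot rewind: $!\n";
    while (1) {
        my $pos  = tell($fh);        # offset of the line about to be read
        my $line = <$fh>;
        last unless defined $line;
        push @offsets, $pos if $line =~ /^1/;
    }
    return @offsets;
}

# Hypothetical helper: seek to a recorded offset and return that record's
# non-blank lines, stopping at the next "1" line or at end of file.
sub read_record {
    my ($fh, $offset) = @_;
    seek($fh, $offset, 0) or die "Cannot seek to $offset: $!\n";
    my @lines;
    while (my $line = <$fh>) {
        last if @lines && $line =~ /^1/;   # the next record begins here
        push @lines, $line if $line =~ /\S/;
    }
    return @lines;
}

# In-memory stand-in for export.dat (assumption for the demo).
my $sample = join '', map { "$_\n" }
    '100000001', '200546Mary', '200549#0002897', '20055100001',
    '100000003', '200546Kathy', '200547#0002530';

open(my $in, '<', \$sample) or die "Cannot open sample: $!\n";
for my $offset (record_offsets($in)) {
    my @record = read_record($in, $offset);
    print scalar(@record), " lines in the record at byte $offset\n";
}
close($in);
```

    This reads the source twice (once to index, once to process), which is the trade-off against a single pass with a temp file.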

      I'm not sure that using tell and seek leads to a more elegant solution (or at least I'm not sure how to use them to craft a more elegant solution).

      I can only detect the start locations when I get to them (and the start-line has to be included in temp.dat).

      Just to clarify - here is a simplified (and very short) version of export.dat:

      100000001
      200546Mary
      200549#0002897
      20055100001
      100000003
      200546Kathy
      200547#0002530
      200549#0002897
      200552123 Elm Street

        Kinda reminds me of an IRS "magnetic tape" file...

        You're probably better off going with a process-as-you-go model. I just usually go out of my way to avoid temp files if I can. And I really like to go through a source file only once when possible.

        When I suggested tell/seek, I was thinking you needed to see the end to find the start, but that probably isn't the case here, so process-as-you-go is probably what you want.

        -Paul

Re: Ugly variable-length record handling
by alpha (Scribe) on Dec 28, 2006 at 15:07 UTC
    Maybe a little off-topic, but in open(TEMP, ">$temp.dat") the "$" looks useless, or is it just me?

      Well spotted - it was a typo from my simplifying the code - originally the line was open(TEMP, ">$out_dir/temp.dat")...

Re: Ugly variable-length record handling
by Limbic~Region (Chancellor) on Dec 28, 2006 at 18:20 UTC
    Melly,
    Salutations from vacation land. This could be way off base, so forgive me. Assuming that the individual records are not that large (just the file), a user-controlled buffer should work fine.
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $file = $ARGV[0] || 'export.dat';
    open(my $fh, '<', $file) or die "Unable to open '$file' for reading: $!";

    my @buffer;
    while (<$fh>) {
        if (/^1(\d+.*)/) {                      # Start of new record
            output_data(@buffer) if @buffer;    # process last record if not first
            @buffer = $1;                       # put new line in buffer
        }
        else {
            push @buffer, $_;                   # Add line to buffer
        }
    }
    output_data(@buffer) if @buffer;            # Process last record if present

    sub output_data {}

    I am sure you will have to tweak it. Update: I just realized writing to a temporary file was important. Oh well.

    Cheers - L~R

Re: Ugly variable-length record handling
by johngg (Canon) on Dec 30, 2006 at 23:50 UTC
    I thought there might be a way to detect reaching the end of input file within the loop to avoid the last call outside of the loop. Unfortunately, it has taken me a while to get my head around the concept so this post is probably too late for you. Also, I wouldn't say it is any more elegant than your original, probably less. However, now that I've got it working it is probably worth sharing it. Here's the script

    the data file

    and the output

    I hope this is of some interest despite the late response.

    Cheers,

    JohnGG
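
    A minimal sketch of the general idea described above, not JohnGG's script (which is not reproduced here): test eof inside the loop so the final record is flushed without a trailing call after it. The helper name each_record and the callback style are assumptions for illustration, and the sketch buffers each record in memory; the push could equally be a print to a temp file.

```perl
use strict;
use warnings;

# Hypothetical helper: call $handler once per record, where a record is a
# line starting with "1" plus the non-blank lines that follow it.
sub each_record {
    my ($fh, $handler) = @_;
    my @record;
    while (my $line = <$fh>) {
        if ($line =~ /\S/) {
            if ($line =~ /^1/ && @record) {
                $handler->(@record);   # a new record starts: flush the old one
                @record = ();
            }
            push @record, $line;
        }
        # eof() turns true once the last line has been read, so the final
        # record is flushed here rather than after the loop.
        if (eof($fh) && @record) {
            $handler->(@record);
            @record = ();
        }
    }
    return;
}

# In-memory stand-in for export.dat (assumption for the demo).
my $sample = "100000001\n200546Mary\n\n100000003\n200546Kathy\n200547#0002530\n";
open(my $in, '<', \$sample) or die "Cannot open sample: $!\n";
each_record($in, sub { print "record with ", scalar(@_), " lines\n" });
close($in);
```

    The eof test sits outside the blank-line check so that trailing blank lines at the end of the file still trigger the final flush.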

Re: Ugly variable-length record handling
by graff (Chancellor) on Jan 01, 2007 at 21:28 UTC
    I've had to put up with that sort of loop lots of times -- yes, it has a klugy feel to it, but it still seems easier, simpler, clearer, etc, than alternatives, so I just live with it.

    Other things can be done in general to streamline your code a little bit -- fewer variables, fewer lines of code, less open/close overhead on the temp file:

    open( DAT, '<', 'export.dat' )
        or die "Open for read failed on export.dat: $!";

    my $tmp;  # file handle
    while ( <DAT> ) {
        next unless ( /\S/ );
        if ( /^1/ ) {
            if ( defined( $tmp )) {
                output_data( $tmp );
            }
            open( $tmp, '+>', 'export.tmp' )  # open read/write, truncate first
                or die "Open failed for export.tmp: $!";
        }
        print $tmp $_;
    }
    output_data( $tmp );

    sub output_data {
        my $fh = shift;
        seek( $fh, 0, 0 );  # rewind to start of file
        # do stuff with export.tmp contents...
        close $fh;
    }

      Thanks all - some nice variations... but I'll probably stick with the original (IIABDFI) - that said, here's a 'nicer' version I came up with (which uses a two-element array, plus a push and a shift), but I suspect that the niceness comes at the expense of readability...

      use strict;
      open(DAT, "export.dat")||die "Cannot open export.dat for read:$!\n";
      my @temp_lines;
      push @temp_lines, scalar <DAT>;
      my $closed = 1;
      while($temp_lines[0] !~ /^<EOF>$/){
        open(TEMP, ">temp.dat") if $closed;
        $closed = 0;
        push @temp_lines, (eof DAT) ? '<EOF>' : (scalar <DAT>) ;
        print TEMP shift @temp_lines;
        if($temp_lines[0] =~ /^(1|<EOF>)/){
          close TEMP;
          &output_data();
          $closed = 1;
        }
      }
      close DAT;

      sub output_data{
        print "in output_data\n";
        open(TEST, 'temp.dat')||die "cannot open temp in sub:$!\n";
        while(<TEST>){
          chomp;
          print "$_\n" if $_ =~ /\S/;
        }
        print "end of output_data\n";
      }