Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am looking for the best possible way to parse a file that contains additional "batches" within it. Basically, a while loop with additional while loops nested inside the original.

Example

FH
BH
1234123
1234123
BH
1234963
1234963
1234963
BH
1234999
1234999
1234999
1234999
1234999
FT

* Note: FH: File Header, BH: Batch Header, FT: File Trailer

How can I develop a script to parse the entire file and at the same time parse each individual batch? I need to make this the most efficient way possible.

I appreciate any and all help.

Thanks in advance!

v.scan

Replies are listed 'Best First'.
Re: Parse File With Sub While Loops
by Bird (Pilgrim) on Sep 16, 2002 at 22:00 UTC
    You may be able to do this very easily with the range operator.
    while (<MYFILE>) {
        if (/^FH$/ .. /^FT$/) {
            # $_ is a line within FH and FT lines
            # including the delimiting lines
        }
        if (/^BH$/ ... /^BH$|^FT$/) {
            # $_ is a line between BH lines, or between
            # a BH and FT line
        }
    }
    Basically, the first if is true only if we've already found a line containing only FH, but haven't yet found a line containing FT. The second if is true when we've found a BH line, but haven't yet found either another BH line or an FT line.

    Hope this helps,
    -- Bird

    Oh, the reason the second if uses three dots (...) is because the two dot version can become false in the same check that it became true. Essentially, if you use the two dot version to match a block which uses the same start and end delimiter, you may only end up processing the first line of the block (which would be the delimiter, in this case).
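
    To see the difference on the sample data, here is a small sketch. It assumes the records are already in an array rather than read from a filehandle, and the names two_dot and three_dot are made up for illustration.

```perl
use strict;
use warnings;

# Sample records from the question (shortened), one per element.
my @lines = qw(FH BH 1234123 1234123 BH 1234963 1234963 1234963
               BH 1234999 1234999 FT);

# Collect the lines for which the two-dot flip-flop is true.
sub two_dot {
    my @hits;
    for (@_) {
        push @hits, $_ if /^BH$/ .. /^BH$|^FT$/;
    }
    return @hits;
}

# Same test, but with the three-dot form.
sub three_dot {
    my @hits;
    for (@_) {
        push @hits, $_ if /^BH$/ ... /^BH$|^FT$/;
    }
    return @hits;
}

print "two dots:   @{[ two_dot(@lines) ]}\n";
print "three dots: @{[ three_dot(@lines) ]}\n";
# two dots:   BH BH BH
# three dots: BH 1234123 1234123 BH BH 1234999 1234999 FT
```

    One thing to watch for: in the three-dot run, the BH that closes the first batch is not reconsidered as the opener of the next one, so the 1234963 records never fall inside the range. If every batch has to be seen, the header line needs separate handling.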

      Oh, my God!

      Bird, your node showed me the light on the range operator, which I didn't know in its full power!

      I think the information you linked is worth reading right away, so I paste it here:

      In scalar context, ".." returns a boolean value. The operator is bistable, like a flip-flop, and emulates the line-range (comma) operator of sed, awk, and various editors. Each ".." operator maintains its own boolean state. It is false as long as its left operand is false. Once the left operand is true, the range operator stays true until the right operand is true, AFTER which the range operator becomes false again. It doesn't become false till the next time the range operator is evaluated. It can test the right operand and become false on the same evaluation it became true (as in awk), but it still returns true once. If you don't want it to test the right operand till the next evaluation, as in sed, just use three dots ("...") instead of two. In all other regards, "..." behaves just like ".." does.

      The right operand is not evaluated while the operator is in the "false" state, and the left operand is not evaluated while the operator is in the "true" state. The precedence is a little lower than || and &&. The value returned is either the empty string for false, or a sequence number (beginning with 1) for true. The sequence number is reset for each range encountered. The final sequence number in a range has the string "E0" appended to it, which doesn't affect its numeric value, but gives you something to search for if you want to exclude the endpoint. You can exclude the beginning point by waiting for the sequence number to be greater than 1. If either operand of scalar ".." is a constant expression, that operand is implicitly compared to the $. variable, the current line number.
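
      A minimal sketch of the sequence number and the "E0" marker described above, assuming the records are in an array; body_lines is a hypothetical name, not something from the docs.

```perl
use strict;
use warnings;

# Return the lines strictly between FH and FT, dropping both delimiters.
sub body_lines {
    my @body;
    for (@_) {
        my $seq = /^FH$/ .. /^FT$/;   # "" outside the range, else 1, 2, ... "nE0"
        next unless $seq;             # not inside the range
        next if $seq == 1;            # first line of the range: FH itself
        next if $seq =~ /E0$/;        # last line of the range: FT itself
        push @body, $_;
    }
    return @body;
}

print join(" ", body_lines(qw(FH BH 1234123 1234123 FT))), "\n";
# BH 1234123 1234123
```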

      Ciao!
      --bronto

      # Another Perl edition of a song:
      # The End, by The Beatles
      END {
        $you->take($love) eq $you->made($love) ;
      }

Re: Parse File With Sub While Loops
by fsn (Friar) on Sep 16, 2002 at 21:51 UTC
    You give too little information for me to make a more elaborate suggestion, but I'll try to give you a generic one.

    When processing a file like this, I try to make just one while loop, and then have some kind of statemachine-ish construction to do the actual work for me. The important thing is to avoid reading from the file in more than one place, like in the main loop and then a sub loop that exhausts some data, since that always seems to give me problems where I must reinsert data back into the buffer in some way, or handle special cases.

    So, this is my general design principle (in perl-ish pseudocode):

    my $state = "";
    open(SESAME, "infile") or die "Can't open infile: $!\n";
    while (<SESAME>) {
        # Setting the "states" of the "state machine"
        if ( $_ =~ /FH/ ) { $state = "FH" }
        if ( $_ =~ /BH/ ) { $state = "BH" }
        # ... more states ...

        # Do different things with the data depending on the
        # settings of the "state machine"
        if ( $state eq "FH" ) {
            # do this
        }
        if ( $state eq "BH" ) {
            # do that
        }
        # ... more handlers ...
    }
    close SESAME;
    No idea if this helps you.
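
    For the sample file in the question, the single-loop idea could be sketched roughly like this. It assumes the markers stand alone on their lines, and parse_batches is a made-up helper that takes the lines as a list instead of reading a filehandle.

```perl
use strict;
use warnings;

# One pass over the lines; returns a list of array refs, one per batch.
sub parse_batches {
    my @batches;
    my $in_file = 0;                     # true between FH and FT
    for my $line (@_) {
        chomp(my $rec = $line);
        if    ($rec eq 'FH') { $in_file = 1 }
        elsif ($rec eq 'FT') { $in_file = 0 }
        elsif (!$in_file)    { next }                   # ignore anything outside FH..FT
        elsif ($rec eq 'BH') { push @batches, [] }      # start a new batch
        elsif (@batches)     { push @{ $batches[-1] }, $rec }
    }
    return @batches;
}

my @batches = parse_batches(qw(FH BH 1234123 1234123
                               BH 1234963 1234963 1234963
                               BH 1234999 1234999 1234999 1234999 1234999 FT));
printf "batch %d: %d records\n", $_ + 1, scalar @{ $batches[$_] }
    for 0 .. $#batches;
# batch 1: 2 records
# batch 2: 3 records
# batch 3: 5 records
```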
Re: Parse File With Sub While Loops
by dug (Chaplain) on Sep 16, 2002 at 23:03 UTC
    In the TIMTOWTDI spirit, here is one that uses some sugar cooked up by thedamian.

    Be forewarned, it makes some assumptions about your file format that may not be true.
    #!/usr/bin/perl -w
    use strict;
    $|++;

    ##
    # NOTE: This code Assumes (and we all know what that means) that the file
    # being fed to it has no more than one "boundary" (/FH|BH/) per line, and
    # that the file is delimited by newlines.
    #

    use Switch 'Perl6';            # Import thedamian's sugar, it's better than C&H.
    use English '-no_match_vars';  # since we're using some Perl 6 syntax here, may
                                   # as well get rid of $0 in the usage statement

    my $file = shift or die "USAGE $PROGRAM_NAME filename\n";
    open( FH, $file ) or die "Couldn't open $file: $!\n";

    my $batchnum = 0;    # global batch tracker

    while (<FH>) {
        chomp();
        next if m/^$/;
        given ($_) {
            when /^FH$/ { print "File Header\n"; last; }
            when /^BH$/ { print "Batch Header\n"; $batchnum++; last; }
            when /^FT$/ { print "File Trailer\n"; last; }
            when /.*/   { handle_batch_content($_); }
        }
    }

    sub handle_batch_content {
        my $batch_content = shift;
        print "Got $batch_content in $batchnum\n";    # or whatever else you want to do
    }
    Given the example file you provided, assuming that Example is newline delimited, this script produces:
    File Header
    Batch Header
    Got 1234123 in 1
    Got 1234123 in 1
    Batch Header
    Got 1234963 in 2
    Got 1234963 in 2
    Got 1234963 in 2
    Batch Header
    Got 1234999 in 3
    Got 1234999 in 3
    Got 1234999 in 3
    Got 1234999 in 3
    Got 1234999 in 3
    File Trailer
    
    HTH,
      dug
Re: Parse File With Sub While Loops
by anithri (Beadle) on Sep 16, 2002 at 23:51 UTC
    More TMTOWTDI... Assuming your BH is predictable and has a static component...
    FileStart
    Batch: 123
    234
    1235613246
    1434312
    12521
    124215
    Batch: 133
    614
    1641
    32463
    142351
    123
    Batch: 358
    214
    125
    612
    FileEnd
    
    Then you could set your input record separator to $/ = "Batch" and get each batch as a chunk, then process each chunk individually.
    open(IN, "somefile.txt") or die "Can't open somefile.txt: $!\n";
    $fileheadinfo = <IN>;                       # first line: FileStart
    $/ = "Batch";
    while ($batch = <IN>) {
        next if $batch eq "Batch";              # first chunk is only the separator
        @lines = split /\n/, $batch;
        $batchinfo = "Batch" . shift @lines;    # get batch info
        pop @lines;                             # get rid of bare "Batch" at end
                                                # (or FileEnd in the last chunk)
        foreach $line (@lines) {
            process($line);
        }
    }
Re: Parse File With Sub While Loops
by Aristotle (Chancellor) on Sep 17, 2002 at 10:30 UTC
    Maybe you are looking for Inline::Files?

    Makeshifts last the longest.