babysFirstPerl has asked for the wisdom of the Perl Monks concerning the following question:

I am parsing a text file that can have a varying number of headers and footers that I want to ignore. The script that I wrote works, but it opens and reads the input file twice. I feel there must be a better way, where I only have to open and read the file once. Any help on this would be great!

My code:

    $numHeaders= $ARGV[0];
    $numFooters= $ARGV[1];

    #count number of lines in file
    open (INPUT, $l_infile);
    for (<INPUT>){};
    $numLines= $.;
    close(INPUT);

    open (INPUT, $l_infile);
    while(<INPUT>)
    {
        my @fields = split "," , $_;
        if ($. != $numHeaders && $. <= $numLines-$numFooters)
        {
            # do parsing work and print
        }
    }
    close (INPUT);

Replies are listed 'Best First'.
Re: More efficient way to exclude footers
by Athanasius (Archbishop) on Aug 19, 2015 at 14:57 UTC

    Hello babysFirstPerl, and welcome to the Monastery!

    Am I correct in thinking that the file begins with a single header of $ARGV[0] lines and ends with a single footer of $ARGV[1] lines? If so, the following approach should do what you want. It reads the file exactly once, and processes it on-the-fly so that the number of lines held in memory never exceeds one plus the number of lines in the footer:

        #! perl
        use strict;
        use warnings;

        my $header_lines = $ARGV[0] // 0;
        my $footer_lines = $ARGV[1] // 0;

        <DATA> for 1 .. $header_lines;    # Throw away the header

        my @lines;

        while (<DATA>)
        {
            push @lines, parse_line($_);
            print shift @lines if @lines > $footer_lines;
        }

        sub parse_line
        {
            my ($line) = @_;
            # ...Parse $line...
            return $line;
        }

        __DATA__
        Header 1
        Header 2
        Text 1
        Text 2
        Text 3
        Text 4
        Text 5
        Footer 1
        Footer 2
        Footer 3

    Output:

        0:50 >perl 1348_SoPW.pl 2 3
        Text 1
        Text 2
        Text 3
        Text 4
        Text 5

        0:55 >

    Update: A couple of additional points:

    • This test: if ($. != $numHeaders && ... should be if ($. > $numHeaders && ....
    • If you have to read a file more than once, you don’t have to close and re-open it: just use seek.
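
    A minimal sketch of the seek approach, assuming a seekable filehandle. An in-memory string stands in for the real file, and the sub name data_lines and the sample text are hypothetical, not taken from the original post. Note that seek does not reset $. by itself, so the sketch resets it by hand:

```perl
use strict;
use warnings;

# Hypothetical sample standing in for the input file.
my $text = "Header 1\nText 1\nText 2\nFooter 1\n";

sub data_lines {
    my ($fh, $num_headers, $num_footers) = @_;
    local $_;    # the counting loop below assigns to $_

    # First pass: read to the end so that $. holds the line count.
    1 while <$fh>;
    my $num_lines = $.;

    # Rewind instead of closing and re-opening the file.
    seek $fh, 0, 0 or die "seek failed: $!";
    $. = 0;      # seek leaves the line counter untouched

    # Second pass: keep only the lines between header and footer.
    my @data;
    while (my $line = <$fh>) {
        push @data, $line
            if $. > $num_headers && $. <= $num_lines - $num_footers;
    }
    return @data;
}

open my $fh, '<', \$text or die "open failed: $!";
print data_lines($fh, 1, 1);
close $fh;
```

    This still reads the data twice; it only saves the second open.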

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      A couple of small points, so small, in fact, that I hesitate to mention them... Ah, what the heck...

      • The command-line parameter capture statements of the form
            my $header_lines = $ARGV[0] // 0;
        could be written
            my $header_lines = $ARGV[0] || 0;
        (logical-or  || instead of  // defined-or) to make the statements Perl version-agnostic (defined-or not introduced until version 5.10). All the rest of the code seems to require nothing more than version 5.0.0. (Tested under 5.8.9.)
      • The while-loop line processing code
            push @lines, parse_line($_);
            print shift @lines if @lines > $footer_lines;
        could be written
            push @lines, $_;
            print parse_line(shift @lines) if @lines > $footer_lines;
        to avoid parsing footer lines (although they still would be read). I have to admit that with only a dozen footer lines to deal with, it's hard to imagine this would make any detectable difference, but if line parsing is extremely expensive... Who knows? (This change also tested.)
      Anyway, my two cents, maybe I'll squeeze some XP outta it.


      Give a man a fish:  <%-{-{-{-<

      Hello, babysFirstPerl.
      There is another way: using regular expressions. If I am correct, this program goes over the input only once, and it shouldn't be slow:
          use strict;
          use warnings;

          my $header_lines = $ARGV[0] // 0;
          my $footer_lines = $ARGV[1] // 0;

          my $whole_input;

          # slurp whole file into one scalar variable
          {local $/ ; $whole_input = <DATA>};
          # (this can exceed memory if data is too much)

          # define what line is in regular expression language:
          # not newline x (zero or more times) + one newline after
          my $line_regex = qr/[^\n]*\n/;

          # treat whole input as string and substitute lines with empty strings:
          $whole_input =~ s/\A (?:$line_regex){$header_lines} //x; # delete some lines from the beginning
          $whole_input =~ s/ (?:$line_regex){$footer_lines} \z//x; # delete some lines from the ending

          print $whole_input; # now it is not whole, and you can parse

          __DATA__
          Header 1
          Header 2
          Text 1
          Text 2
          Text 3
          Text 4
          Text 5
          Footer 1
          Footer 2
          Footer 3
      But if the last line of the file does not end with a newline, the second regex does not match and doesn't delete anything.
        But if the last line of the file does not end with a newline, the second regex does not match and doesn't delete anything.

        That can easily be fixed by changing the regex object definition
            my $line_regex = qr/[^\n]*\n/;
        to
            my $line_regex = qr/[^\n]*\n?/;
        (note final  \n has  ? quantifier added). (Tested.)

        But you need to go one step further in the example: show extraction of each remaining line for further processing.

        Update: And see also File::Slurp.
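
        One way to take that further step, as a sketch only: the helper name body_lines and the sample string are hypothetical, and the helper assumes the whole input is already in a scalar. It applies the two substitutions (with the optional final \n) and then splits what is left into lines, keeping each newline attached:

```perl
use strict;
use warnings;

# Hypothetical helper: strip header and footer lines from a slurped
# string and return the remaining lines, one per list element.
sub body_lines {
    my ($whole_input, $header_lines, $footer_lines) = @_;

    my $line_regex = qr/[^\n]*\n?/;    # final newline is optional

    $whole_input =~ s/\A (?:$line_regex){$header_lines} //x;
    $whole_input =~ s/ (?:$line_regex){$footer_lines} \z//x;

    # Split after every newline; the lookbehind keeps the newline
    # at the end of each returned line.
    return split /(?<=\n)/, $whole_input;
}

# Hypothetical sample input.
my @lines = body_lines("Header 1\nText 1\nText 2\nFooter 1\n", 1, 1);

for my $line (@lines) {
    # ...parse $line here...
    print "data: $line";
}
```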


        Give a man a fish:  <%-{-{-{-<

Re: More efficient way to exclude footers
by roboticus (Chancellor) on Aug 19, 2015 at 18:49 UTC

    babysFirstPerl:

    Since your file is small, you might just want to read the file into memory, chop off the header and footer using splice, and then process the rest:

        $numHeaders= $ARGV[0];
        $numFooters= $ARGV[1];

        # Read the file into memory
        open (INPUT, $l_infile);
        my @file = <INPUT>;
        close INPUT;

        # Treating the array as a scalar value gives you the number of lines
        # (not that you really need to worry about this right now)
        my $numLines = @file;

        # Split the headers and footers into their own arrays
        my @headers = splice @file, 0, $numHeaders;
        # The footer starts $numFooters elements before the end of what is left
        # (using $#file here would be off by one and keep the last line)
        my @footers = splice @file, @file-$numFooters, $numFooters;

        for my $line (@headers) {
            # do whatever you want with the headers
        }

        for my $line (@file) {
            # process your data
        }

    The splice function (perldoc -f splice) returns whatever you chop out of your array, so if you don't want the headers or footers, just don't save them into new variables.

        # Discard the headers and footers
        splice @file, 0, $numHeaders;
        splice @file, @file-$numFooters, $numFooters;
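
    A small self-contained sketch of how splice behaves here, using hypothetical sample lines in place of a real file. The footer offset is computed from the element count, which also behaves sensibly when $numFooters is 0:

```perl
use strict;
use warnings;

# Hypothetical sample lines standing in for the file contents.
my @file = ("h1\n", "h2\n", "data1\n", "data2\n", "f1\n", "f2\n", "f3\n");
my ($numHeaders, $numFooters) = (2, 3);

# Remove the first $numHeaders elements; splice returns what it removed.
my @headers = splice @file, 0, $numHeaders;

# Remove the last $numFooters elements of what remains.
# @file in numeric context is the number of elements left.
my @footers = splice @file, @file - $numFooters, $numFooters;

print "headers: ", scalar @headers, "\n";
print "footers: ", scalar @footers, "\n";
print "data:\n", @file;
```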

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: More efficient way to exclude footers
by poj (Abbot) on Aug 19, 2015 at 14:34 UTC

    If there is a pattern in the text that identified a line as a header or footer then you should be able to ignore them as you parse the file.
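
    As a rough sketch of that idea: the pattern below is purely hypothetical (it assumes header and footer lines start with the word "Header" or "Footer"), and the sub name data_lines is made up for illustration. A real file will need its own pattern.

```perl
use strict;
use warnings;

# Hypothetical pattern: header and footer lines are assumed to start
# with the word "Header" or "Footer".
my $skip_regex = qr/^(?:Header|Footer)\b/;

sub data_lines {
    my ($fh) = @_;
    my @data;
    while (my $line = <$fh>) {
        next if $line =~ $skip_regex;    # ignore header/footer lines
        push @data, $line;
    }
    return @data;
}

# An in-memory string stands in for the input file.
my $text = "Header 1\nText 1\nText 2\nFooter 1\n";
open my $fh, '<', \$text or die "open failed: $!";
print data_lines($fh);
close $fh;
```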

    poj
Re: More efficient way to exclude footers
by stevieb (Canon) on Aug 19, 2015 at 14:36 UTC

    This is a perfect job for Tie::File, along with an array slice. Note that the @file array holds the file open, so if you change the array in any way, you'll also modify the file live-time. If you aren't editing the file, best to take a copy of the tied array, then untie @file before doing any processing.

        use warnings;
        use strict;

        use Tie::File;

        my $file = 'a.txt';

        my $num_headers = $ARGV[0];
        my $num_footers = ++$ARGV[1];

        tie my @file, 'Tie::File', $file or die $!;

        my $stop = scalar @file - $num_footers;
        my @section = @file[$num_headers..$stop];

        untie @file;

        print "$_\n" for @section;

    Input file:

        h1
        h2
        data
        more data
        even more data
        blah
        f1
        f2
        f3

    Result:

        $ ./header.pl 2 3
        data
        more data
        even more data
        blah

    -stevieb

      This is a perfect job for Tie::File

      How so? It needlessly reads the entire file (except for the ~5 lines of headers and footers) twice!

      If that's no problem because the file is small, why didn't you just read the whole thing into memory instead of adding the monstrous overhead of Tie::File to the equation?

      If that's a problem because the file is large, use a rolling buffer. I think you'll find that saying it'll make it 10 times faster is an understatement.

        Thank you ikegami, I always appreciate being shown new (to me) and better/more efficient ways to do things.

        That's why I'm here... to learn, and to pass on.

Re: More efficient way to exclude footers
by ateague (Monk) on Aug 19, 2015 at 21:11 UTC

    EDIT:

    Well I feel silly. As AnomalousMonk below mentioned, I had (unintentionally) provided the same solution as Athanasius did earlier. I could have sworn Athanasius' solution involved reading in the file all into an array before processing. That is what I get for not reading all the replies carefully I suppose.

    Original reply spoilr'd to avoid polluting the thread with redundant bits (Unless there is a "Delete Post" gubbin I somehow missed?)

      Again, isn't this essentially what Athanasius already suggested here?


      Give a man a fish:  <%-{-{-{-<

        Ach!

        You are absolutely right. I have updated the original post

Re: More efficient way to exclude footers
by Anonymous Monk on Aug 19, 2015 at 14:24 UTC
      $numHeaders and $numFooters probably won't ever exceed 10. The file itself is typically around 6,000 lines. I've thought of reading it in reverse- but then I still have to exclude the headers, so I have the same problem just in the opposite direction.
Re: More efficient way to exclude footers
by Intermediate Dave (Novice) on Aug 19, 2015 at 17:51 UTC
    This is probably overkill, but my first thought was, can't you do a system call to the Unix command wc (which can give you the number of lines in your input file?)

    And CPAN also has two modules that I think will give you the same functionality.

    Otherwise, for the headers, I'm thinking you could just add a variable which keeps track of how many lines you've read in. Then you'd be able to calculate whether you're still dealing with header rows.
        while(<INPUT>) {
            $linecount++ ;
            next if $linecount <= $numHeaders;
            # ...
        }
    The footer problem is a bit trickier, but after the header you could first just push every row into an array as a separate string element. Then you'd know exactly how many lines you're dealing with, and you can parse only the elements that lead up to @array[ scalar @array - $numFooters ]

    (Updated to remove extraneous parentheses and use scalar on @array)
      @array[ length(@array) - $numFooters) ]

      But  length(@array) returns the length of the string representing the number of elements in the array:

          c:\@Work\Perl\monks\babysFirstPerl>perl -wMstrict -le "my @array = (0 .. 10_000); print length(@array); "
          5
      So what you really want is  @array[ @array - $numFooters ] (untested). (Also: There's an extra right-paren after  $numFooters in the original reply.)
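
      A short sketch of the corrected indexing, with hypothetical sample rows. An array in numeric context yields its element count, so @array - $numFooters is the index of the first footer row, and a range slice picks out everything before it:

```perl
use strict;
use warnings;

# Hypothetical rows, as if read in after the header was skipped.
my @array      = map { "line $_" } 1 .. 8;
my $numFooters = 3;

# Element count minus footer count = index of the first footer row.
my $first_footer = @array - $numFooters;

# Everything before the footer.
my @data = @array[ 0 .. $first_footer - 1 ];

print "$_\n" for @data;
```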

      ... for the headers, I'm thinking you could just add a variable which keeps track of how many lines you've read in. ... after the header you could first just push every row into an array ... parse only the elements that lead up to @array[ length(@array) - $numFooters) ]

      With necessary semantic corrections, isn't this pretty much exactly what Athanasius suggested above?


      Give a man a fish:  <%-{-{-{-<

        while(<INPUT>) {
            $linecount++ ;
            next if $linecount <= $numHeaders;
            # ...
        }
      Why not, but if the file is very large you're doing a $linecount incrementation and a test for every line in the file.
      <INPUT> for 1..$numHeaders;
      just reads the first $numHeaders lines of the file and throws them away, so you don't add any overhead for the rest of your file. And, BTW, the $. special variable contains the current line number of the last-read filehandle, so you don't need to increment $linecount yourself.
Re: More efficient way to exclude footers
by kcott (Archbishop) on Aug 26, 2015 at 13:19 UTC

    G'day babysFirstPerl,

    Welcome to the Monastery.

    I'd use the following steps:

    1. Open the file once.
    2. Read through all the headers and capture the file position (tell).
    3. Read the remaining lines and calculate the last data line (based on total lines in file and known number of footer lines).
    4. Reposition the file pointer to the start of the data (seek) and reset the line counter ($.).
    5. Read just the data lines and process as required.
    6. Close the file once.

    Here's my test code (pm_1139175_skip_head_and_foot.pl):

        #!/usr/bin/env perl

        use strict;
        use warnings;
        use autodie;

        my $file = 'pm_1139175_skip_head_and_foot.txt';
        my ($headers, $footers) = (2, 3);

        open my $fh, '<', $file;

        <$fh> for 1 .. $headers;
        my $last_head_pos = tell $fh;

        1 while <$fh>;
        my $last_data_line = $. - $footers;

        seek $fh, $last_head_pos, 0;
        $. = $headers;

        while (<$fh>) {
            last if $. > $last_data_line;
            print;
        }

        close $fh;

    Given this input:

        $ cat pm_1139175_skip_head_and_foot.txt
        head1
        head2
        data1
        data2
        data3
        data4
        foot1
        foot2
        foot3

    That script produces:

        $ pm_1139175_skip_head_and_foot.pl
        data1
        data2
        data3
        data4

    — Ken
