More efficient way to exclude footers

by AnomalousMonk (Archbishop) on Aug 19, 2015 at 17:25 UTC

A couple of small points, so small, in fact, that I hesitate to mention them... Ah, what the heck...

The command-line parameter capture statements of the form
my $header_lines = $ARGV[0] // 0;
could be written
my $header_lines = $ARGV[0] || 0;
(logical-or || instead of // defined-or) to make the statements Perl version-agnostic (defined-or not introduced until version 5.10). All the rest of the code seems to require nothing more than version 5.0.0. (Tested under 5.8.9.)
The while-loop line processing code
push @lines, parse_line($_);
print shift @lines if @lines > $footer_lines;
could be written
push @lines, $_;
print parse_line(shift @lines) if @lines > $footer_lines;
to avoid parsing footer lines (although they still would be read). I have to admit that with only a dozen footer lines to deal with, it's hard to imagine this would make any detectable difference, but if line parsing is extremely expensive... Who knows? (This change also tested.)

Give a man a fish: <%-{-{-{-<

Re^3: More efficient way to exclude footers

by rsFalse (Chaplain) on Aug 19, 2015 at 17:31 UTC

use strict;
use warnings;

my $header_lines = $ARGV[0] // 0;
my $footer_lines = $ARGV[1] // 0;

my $whole_input;
# slurp whole file into one scalar variable
{local $/ ; $whole_input = <DATA>};
# (this can exceed memory if data is too much)

# define what line is in regular expression language:
# not newline x (zero or more times) + one newline after
my $line_regex = qr/[^\n]*\n/;

# treat whole input as string and substitute lines with empty strings:
$whole_input =~ s/\A (?:$line_regex){$header_lines}   //x; 
                  # delete some lines from the beginning
$whole_input =~ s/   (?:$line_regex){$footer_lines} \z//x;
                  # delete some lines from the ending

print $whole_input; # now it is not whole, and you can parse

__DATA__
Header 1
Header 2
Text 1
Text 2
Text 3
Text 4
Text 5
Footer 1
Footer 2
Footer 3

by AnomalousMonk (Archbishop) on Aug 19, 2015 at 17:51 UTC

But if the last line of the file does not end with a newline, the second regex does not match and doesn't delete anything.

That can easily be fixed by changing the regex object definition
my $line_regex = qr/[^\n]*\n/;
to
my $line_regex = qr/[^\n]*\n?/;
(note final \n has ? quantifier added). (Tested.)

But you need to go one step further in the example: show extraction of each remaining line for further processing.
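A minimal sketch of that extra step, assuming the slurp-and-trim approach from the parent post with the optional-newline fix applied. The sub name strip_head_and_foot and the sample string are made up for illustration; the split /^/m idiom breaks the trimmed string back into lines while keeping their newlines.

```perl
use strict;
use warnings;

# Hypothetical helper: remove $head leading and $foot trailing lines from
# a slurped string and return the remaining lines as a list.
sub strip_head_and_foot {
    my ($text, $head, $foot) = @_;

    # A line is a run of non-newlines plus an optional newline
    # (the optional newline copes with an unterminated last line).
    my $line_regex = qr/[^\n]*\n?/;

    $text =~ s/\A (?:$line_regex){$head}   //x;   # drop the headers
    $text =~ s/   (?:$line_regex){$foot} \z//x;   # drop the footers

    # Split at the start of each line; every element keeps its newline.
    return split /^/m, $text;
}

# Sample input whose last line has no trailing newline.
my @lines = strip_head_and_foot("H1\nH2\nT1\nT2\nT3\nF1\nF2", 2, 2);
print "data: $_" for @lines;
```

Each element of @lines can then be handed to whatever per-line parsing the real program needs.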

Update: And see also File::Slurp.

Give a man a fish: <%-{-{-{-<

Re^4: More efficient way to exclude footers

by rsFalse (Chaplain) on Aug 19, 2015 at 18:09 UTC

Re: More efficient way to exclude footers
by roboticus (Chancellor) on Aug 19, 2015 at 18:49 UTC

babysFirstPerl:

Since your file is small, you might just want to read the file into memory, chop off the header and footer using splice, and then process the rest:

$numHeaders= $ARGV[0];
$numFooters= $ARGV[1];

# Read the file into memory
open (INPUT, '<', $l_infile) or die "Can't open $l_infile: $!";
my @file = <INPUT>;
close INPUT;

# Treating the array as a scalar value gives you the number of lines
# (not that you really need to worry about this right now)
my $numLines = @file;

# Split the headers and footers into their own arrays.
# After the headers are gone, the footers start at index @file - $numFooters.
my @headers = splice @file, 0, $numHeaders;
my @footers = splice @file, @file - $numFooters, $numFooters;

for my $line (@headers) {
   # do whatever you want with the headers
}

for my $line (@file) {
   # process your data
}

The splice function (perldoc -f splice) returns whatever you chop out of your array, so if you don't want the headers or footers, just don't save them into new variables.

# Discard the headers and footers
splice @file, 0, $numHeaders;
splice @file, @file - $numFooters, $numFooters;

...roboticus

When your only tool is a hammer, all problems look like your thumb.

Re: More efficient way to exclude footers
by poj (Abbot) on Aug 19, 2015 at 14:34 UTC

If there is a pattern in the text that identified a line as a header or footer then you should be able to ignore them as you parse the file.
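A small sketch of that idea, under the assumption that header and footer lines really are recognisable by their text. The pattern below (lines starting with "Header" or "Footer", as in the sample data elsewhere in this thread) and the sub name data_lines are hypothetical; the real file would need its own pattern.

```perl
use strict;
use warnings;

# Assumption: header and footer lines can be recognised by their text.
# This pattern is hypothetical; adapt it to the real file format.
my $skip_regex = qr/^(?:Header|Footer)\b/;

# Return only the lines that do not look like a header or footer.
sub data_lines {
    my ($fh) = @_;
    my @data;
    while (my $line = <$fh>) {
        next if $line =~ $skip_regex;
        push @data, $line;
    }
    return @data;
}

# Demonstration on an in-memory file.
my $sample = "Header 1\nText 1\nText 2\nFooter 1\n";
open my $fh, '<', \$sample or die "Can't open in-memory file: $!";
print data_lines($fh);
close $fh;
```

This needs no line counts at all, but it only works if no data line can ever match the pattern.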

Re: More efficient way to exclude footers
by stevieb (Canon) on Aug 19, 2015 at 14:36 UTC

This is a perfect job for Tie::File, along with an array slice. Note that the @file array holds the file open, so if you change the array in any way, you'll also modify the file on disk immediately. If you aren't editing the file, best to take a copy of the tied array, then untie @file before doing any processing.

use warnings;
use strict;

use Tie::File;

my $file = 'a.txt';

my $num_headers = $ARGV[0];
my $num_footers = $ARGV[1];

tie my @file, 'Tie::File', $file or die $!;

# Index of the last data line: the last index minus the footer count
my $stop = $#file - $num_footers;

my @section = @file[$num_headers..$stop];

untie @file;

print "$_\n" for @section;

Input file:

h1
h2
data
more data
even more data
blah
f1
f2
f3

Result:

$ ./header.pl 2 3

data
more data
even more data
blah

-stevieb

by ikegami (Patriarch) on Aug 19, 2015 at 16:46 UTC

This is a perfect job for Tie::File

How so? It needlessly reads the entire file (except for the ~5 lines of headers and footers) twice!

If that's no problem because the file is small, why didn't you just read the whole thing into memory instead of adding the monstrous overhead of Tie::File to the equation?

If that's a problem because the file is large, use a rolling buffer. I think you'll find that saying it'll make it 10 times faster is an understatement.
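For reference, a rolling buffer could look roughly like the sketch below. The sub name process_body, its callback argument, and the sample data are invented for illustration and are not taken from any post in this thread. The file is read once, and at most $foot + 1 lines are held in memory at any time.

```perl
use strict;
use warnings;

# Hypothetical helper: call $process on every line of $fh except the first
# $head and the last $foot lines, holding at most $foot + 1 lines in memory.
sub process_body {
    my ($fh, $head, $foot, $process) = @_;

    # Discard the header lines up front.
    for (1 .. $head) {
        defined(my $line = <$fh>) or return;
    }

    my @buffer;
    while (my $line = <$fh>) {
        push @buffer, $line;
        # Once the buffer holds more than $foot lines, the oldest one
        # cannot be a footer, so it is safe to hand it on.
        $process->(shift @buffer) if @buffer > $foot;
    }
    # Whatever is left in @buffer is the footer and is dropped.
    return;
}

# Demonstration on an in-memory file with 2 headers and 3 footers.
my $sample = join '', map { "$_\n" } qw(h1 h2 data1 data2 data3 f1 f2 f3);
open my $fh, '<', \$sample or die "Can't open in-memory file: $!";
process_body($fh, 2, 3, sub { print $_[0] });
close $fh;
```

Because lines are only handed to the callback when they leave the buffer, footer lines are read but never processed.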

Re^3: More efficient way to exclude footers

by stevieb (Canon) on Aug 21, 2015 at 04:01 UTC

Thank you ikegami, I always appreciate being shown new (to me) and better/more efficient ways to do things.

That's why I'm here... to learn, and to pass on.

Re: More efficient way to exclude footers
by ateague (Monk) on Aug 19, 2015 at 21:11 UTC

EDIT:

Well I feel silly. As AnomalousMonk below mentioned, I had (unintentionally) provided the same solution as Athanasius did earlier. I could have sworn Athanasius' solution involved reading in the file all into an array before processing. That is what I get for not reading all the replies carefully I suppose.

Original reply spoilr'd to avoid polluting the thread with redundant bits (Unless there is a "Delete Post" gubbin I somehow missed?)

Re^3: More efficient way to exclude footers

by AnomalousMonk (Archbishop) on Aug 19, 2015 at 21:48 UTC

Again, isn't this essentially what Athanasius already suggested here?

Give a man a fish: <%-{-{-{-<


by ateague (Monk) on Aug 19, 2015 at 22:09 UTC

Ach!

You are absolutely right. I have updated the original post.

Re: More efficient way to exclude footers
by Anonymous Monk on Aug 19, 2015 at 14:24 UTC

What are the typical and maximum values of $numFooters? If it's not too big, maybe you could buffer that many lines?

Also, this may provide some inspiration: How do I read a file line by line in reverse order (from EOF to start of file)


by babysFirstPerl (Initiate) on Aug 19, 2015 at 14:31 UTC

$numHeaders and $numFooters probably won't ever exceed 10. The file itself is typically around 6,000 lines. I've thought of reading it in reverse, but then I still have to exclude the headers, so I have the same problem just in the opposite direction.

Re: More efficient way to exclude footers
by Intermediate Dave (Novice) on Aug 19, 2015 at 17:51 UTC

two

modules

while(<INPUT>) {
  $linecount++ ;
  next if $linecount <= $numHeaders;
  # ...
}

@array[ scalar @array - $numFooters ]

by AnomalousMonk (Archbishop) on Aug 19, 2015 at 18:18 UTC

@array[ length(@array) - $numFooters) ]

But length(@array) returns the length of the string representing the number of elements in the array:

c:\@Work\Perl\monks\babysFirstPerl>perl -wMstrict -le
"my @array = (0 .. 10_000);
 print length(@array);
"
5
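To make the difference concrete, here is a short sketch comparing the two and taking a slice that leaves off the trailing elements. The value of $numFooters is illustrative only, and length(scalar @array) is written out explicitly because that is what length(@array) evaluates anyway.

```perl
use strict;
use warnings;

my @array      = (0 .. 10_000);   # 10,001 elements
my $numFooters = 3;               # illustrative value

# length() stringifies the element count, so this is length("10001").
my $digits = length(scalar @array);

# An array in scalar context gives the element count itself.
my $count = @array;

# Slice of everything except the last $numFooters elements.
my @body = @array[ 0 .. $#array - $numFooters ];

print "digits=$digits count=$count last_kept=$body[-1]\n";
# prints: digits=5 count=10001 last_kept=9997
```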

@array[ @array - $numFooters ]

untested

$numFooters

... for the headers, I'm thinking you could just add a variable which keeps track of how many lines you've read in. ... after the header you could first just push every row into an array ... parse only the elements that lead up to @array[ length(@array) - $numFooters) ]

With necessary semantic corrections, isn't this pretty much exactly what Athanasius suggested above?

Give a man a fish: <%-{-{-{-<

by Laurent_R (Canon) on Aug 19, 2015 at 18:59 UTC

while(<INPUT>) {
  $linecount++ ; 
  next if $linecount <= $numHeaders;
  # ...
}

$linecount

<INPUT> for 1..$numHeaders;

$numHeaders

$.

$linecount

Re: More efficient way to exclude footers
by kcott (Archbishop) on Aug 26, 2015 at 13:19 UTC

G'day babysFirstPerl,

Welcome to the Monastery.

I'd use the following steps:

Open the file once.
Read through all the headers and capture the file position (tell).
Read the remaining lines and calculate the last data line (based on total lines in file and known number of footer lines).
Reposition the file pointer to the start of the data (seek) and reset the line counter ($.).
Read just the data lines and process as required.
Close the file once.

Here's my test code (pm_1139175_skip_head_and_foot.pl):

#!/usr/bin/env perl

use strict;
use warnings;
use autodie;

my $file = 'pm_1139175_skip_head_and_foot.txt';
my ($headers, $footers) = (2, 3);

open my $fh, '<', $file;

<$fh> for 1 .. $headers;
my $last_head_pos = tell $fh;

1 while <$fh>;
my $last_data_line = $. - $footers;

seek $fh, $last_head_pos, 0;
$. = $headers;

while (<$fh>) {
    last if $. > $last_data_line;
    print;
}

close $fh;

Given this input:

$ cat pm_1139175_skip_head_and_foot.txt
head1
head2
data1
data2
data3
data4
foot1
foot2
foot3

That script produces:

$ pm_1139175_skip_head_and_foot.pl
data1
data2
data3
data4

— Ken