in reply to grepping a large file and stopping on first match to a list

Does this stop searching on the first match?
Well, it does, you can test it yourself:
$ perl -le ' use List::Util "first"; first { $_->() } sub { print 0; 0 }, sub { print 1; 1 }, sub { print 2; 1 }; '
A more interesting question is: does the expression $map =~ m"^(.*)$"gm put 861_600 lines on perl's argument stack before even calling first? It does, which IMO rather defeats the purpose of "stopping on the first match".
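The eager list-context behaviour is easy to demonstrate on a small scale (the five-line $map below is illustrative, not the poster's data):

```perl
use strict;
use warnings;

my $map = join '', map { "line$_\n" } 1 .. 5;

# List context: m//gm builds ALL matches up front, before anything
# (such as first) gets to look at them.
my @all = $map =~ m/^(.*)$/gm;        # 5 elements, built eagerly

# Scalar context: //g yields one match per call, so a loop really
# can stop early.
my @seen;
while ($map =~ m/^(.*)$/gm) {         # one line at a time
    push @seen, $1;
    last if $1 eq 'line2';            # stop after the second match
}

print scalar(@all),  " matches built eagerly\n";
print scalar(@seen), " matches examined before stopping\n";
```

This prints 5 and 2: the list-context match visits every line regardless, while the scalar-context loop genuinely stops after two.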

The approach suggested by kcott (compiling the combined regex; not the tie stuff) should be fairly efficient. I use it often. I also recommend avoiding smartmatch; it's too smart for most programmers ("27-way recursive runtime dispatch by operand type" is something I personally don't even want to understand). If by "exactly matches" you mean a literal match (as with eq), compile the regex like this:

my $regex = join '|', map quotemeta, @strings;
$regex = qr/^($regex)$/;
and use it like this:
$string = $1 if $map =~ $regex;
I don't see much point in tie'ing the file, nor in memory-mapping it, for that matter. But maybe you need the map for something else. If not, just read the file in the usual way, with a while loop.
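A minimal sketch of that usual way, stopping at the first matching line (the strings and the DATA lines here are illustrative; in real code you would open the actual file instead of reading from DATA):

```perl
use strict;
use warnings;

# Illustrative list of literal strings to match whole lines against.
my @strings = qw(bar baz);
my $regex   = join '|', map quotemeta, @strings;
$regex      = qr/^($regex)$/;

my $found;
while (my $line = <DATA>) {   # in real code: open my $fh, '<', $file
    chomp $line;
    if ($line =~ $regex) {
        $found = $line;
        last;                 # first match found: stop reading
    }
}
print "found: $found\n" if defined $found;

__DATA__
foo1
baz
bar
```

Only the lines up to and including the first match are ever examined; here it prints "found: baz" and never reads the "bar" line.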

Re^2: grepping a large file and stopping on first match to a list
by msh210 (Monk) on Feb 23, 2016 at 07:15 UTC
    a more interesting question is: does the expression $map =~ m"^(.*)$"gm put 861_600 lines on perl's argument stack before even calling first? It does, which IMO kind of defeats the purpose of "stopping on the first match".

    Ah, so it's the loading of the entire file that's slowing me down. And you say that your and kcott's solutions don't avoid that. (Thank you both for them!) Any idea what can, if anything?

    If it helps (and to address some of the questions in kcott's reply), the file is text, with a maximum line length of (I'm estimating now, since I don't have it at hand) a few hundred bytes, and is static (meaning unchanging, so that I can, if necessary, include it in my Perl script instead of opening it as a file).

    $_="msh210";$"=$\;@_=@{[split//,uc]}[2,0];$_="@_$\1";$\=$/;++$_[0]for$...1;print lc substr crypt($_,"@_"),1,6

      Hello msh210,

      Ah, so it's the loading of the entire file that's slowing me down. And you say that your and kcott's solutions don't avoid that.

      No, Anonymous Monk did not say that! Quite the contrary: he and kcott are recommending that you read the file line-by-line, stopping when the first match is found. (kcott’s quotation from the Tie::File documentation specifically rules out any need to read in the whole file.)

      Of course, this assumes that if a match occurs at all, it will occur within a single line. If your matches may span two or more lines, you will need to adapt the approach by employing a sliding-window technique: examine n lines at a time, moving the window forward each time by discarding the first line and adding the next ((n+1)th) line to the window. The key task is then to determine the optimum size of n, which must be large enough to ensure that all possible matches are accommodated. Finding the most efficient size for n usually requires a certain amount of trial and error.
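      A minimal sketch of such a sliding window, assuming an illustrative window size of n = 2 and a made-up pattern that crosses one newline (the DATA lines stand in for the real file):

```perl
use strict;
use warnings;

# Sliding window of $n lines, for patterns that may span line
# boundaries. Both $n and $regex are illustrative.
my $n     = 2;
my $regex = qr/end\nstart/;    # a match that crosses one newline

my @window;
my $found;
while (my $line = <DATA>) {
    push @window, $line;
    shift @window if @window > $n;         # discard the oldest line
    if (join('', @window) =~ $regex) {
        $found = $.;                       # input line number at match
        last;                              # stop at the first match
    }
}
print "matched in window ending at line $found\n" if defined $found;

__DATA__
aaa
end
start
bbb
```

Only $n lines are held in memory at any moment, so the file never needs to be slurped; here the window ending at line 3 contains "end\nstart\n" and matches.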

      Hope that helps,

      Athanasius <°(((>< contra mundum

        Ah, I hadn't understood. Thank you.

        $_="msh210";$"=$\;@_=@{[split//,uc]}[2,0];$_="@_$\1";$\=$/;++$_[0]for$...1;print lc substr crypt($_,"@_"),1,6
      Ah, so it's the loading of the entire file that's slowing me down.
      That, and probably smartmatch too (depending on what it does...).
      And you say that your and kcott's solutions don't avoid that
      No. Just don't use smartmatch and first:
      use strict;
      use warnings;

      my @strings = qw(
          foobarbaz111222333444
          barbazfoo111555666999
          bazfoobar999888777666
      );

      my $text = <<'END';
      bdfjadslfjhasdjklfhjklashdflkjadshfjkladhfjkldhfljkafjadfdji
      adlfjhdlsjkfuiowerfhwehrbeblbflasdfbhjkaldhqpqheuihfbdfkhyyy
      qjdpdbnfdbjdfklbasjfbajksdbfjaksdbfjaksdfbjkasdfhydsfadfjyyy
      END

      $text x= 1_000_000;
      $text .= "bazfoobar999888777666\n";

      printf "text size: %d bytes (%.2f mb)\n",
          length($text), length($text) / ( 1024 * 1024 );

      my $regex = join '|', map quotemeta, @strings;
      $regex = qr/^($regex)$/m;

      printf "found %s at %d\n", $1, $-[0] if $text =~ $regex;
      That simulates your memory-mapped file.

      [I haven't logged in for ten days or so. Apologies for the late response.]

      In an earlier post, Anonymous Monk wrote:

      "The approach suggested by kcott (compiling the combined regex; not the tie stuff) should be fairly efficient."

      I'd just like to say that I concur with the "not the tie stuff" part.

      "... to address some of the questions in kcott's reply ..."

      Thanks for that. I think you shouldn't use any modules at all (except the pragmata). Consider code like this:

      #!/usr/bin/env perl
      #
      # pm_1155868_first_grep_large_file_2.pl [no modules]
      #
      use strict;
      use warnings;
      use autodie;

      open my $lines_fh, '<', input_filename();

      while (<$lines_fh>) {
          next unless is_a_match($_);
          process_match($_) and last;
      }

      sub input_filename { 'pm_1155868_input.txt' }
      sub get_strings    { [qw{rst uvw xyz}] }

      BEGIN {
          my $re = '(?:' . join('|', @{get_strings()}) . ')';
          sub is_a_match { $_[0] =~ /$re/ }
      }

      sub process_match { print "Match: $_[0]" }

      It produces the same output:

      $ pm_1155868_first_grep_large_file_2.pl
      Match: dddddddxyzdddd
      • You'll need to modify &input_filename, &get_strings and &process_match to suit.
      • $re and &is_a_match should work fine as is; modify if necessary.
      • Note how the while loop ignores all records that don't match (next); when the first match is found, it uses it and then exits the loop (last). This seems to match your requirements.
      • Use Benchmark, as previously discussed, to compare potential solutions.

      — Ken