msh210 has asked for the wisdom of the Perl Monks concerning the following question:

Hi, Monks,

I have a file with 861_600 lines of text. I wish to find the earliest line that exactly matches any of a small set of strings. Here's part of my current code:

use List::Util 'first';
use File::Map 'map_handle';

open my $fh, '<', 'huge_file';
map_handle my $map, $fh;
my $string = first { $_ ~~ @strings } ($map =~ m"^(.*)$"gm);

Does this stop searching on the first match? If not, would the following stop searching on the first match?

use List::MoreUtils 'firstval';
use File::Map 'map_handle';

open my $fh, '<', 'huge_file';
map_handle my $map, $fh;
my $string = firstval { $_ ~~ @strings } ($map =~ m"^(.*)$"gm);

If not, would anything stop searching on the first match? Any other advice for how to speed up this code would be much appreciated.

$_="msh210";$"=$\;@_=@{[split//,uc]}[2,0];$_="@_$\1";$\=$/;++$_[0]for$...1;print lc substr crypt($_,"@_"),1,6

Replies are listed 'Best First'.
Re: grepping a large file and stopping on first match to a list
by kcott (Archbishop) on Feb 23, 2016 at 00:49 UTC

    G'day msh210,

    "Does this stop searching on the first match?"

    Well, the obvious answer is to run it and find out for yourself.

    You haven't told us anything about 'huge_file' except how many lines it has. What's the file size? What's the average (or typical) record length? Is there anything special about the file or is it just plain text?

    You haven't told us anything about '@strings' except that it's "a small set". What's the average (or typical) string length? Is there anything special about the strings?

    List::Util::first() stops searching on the first match. Why do you think this may not be the case with your code?

    If you're concerned about memory usage, you might want to consider the built-in module, Tie::File. Its documentation says:

    "The file is not loaded into memory, so this will work even for gigantic files."

    If you're concerned about speed, try several methods and Benchmark them.
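    A minimal sketch with the core Benchmark module (the sub names here are placeholders for whichever candidate implementations you end up with):

    use Benchmark qw{cmpthese};

    # approach_one() and approach_two() are stand-ins for the code being compared.
    # A negative count means "run each for at least that many CPU seconds".
    cmpthese(-3, {
        first_plus_smartmatch => sub { approach_one() },
        combined_regex        => sub { approach_two() },
    });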

    I'll also point out that the Smartmatch Operator (~~) is experimental, subject to change, and not suitable for production code.

    Bearing in mind all the things I don't know about your situation, here's how I might have tackled this task:

    #!/usr/bin/env perl -l

    use strict;
    use warnings;
    use autodie;
    use List::Util qw{first};
    use Tie::File;

    my @strings = qw{rst uvw xyz};
    my $re      = '(?:' . join('|', @strings) . ')';

    tie my @lines, 'Tie::File', 'pm_1155868_input.txt';
    my $match = first { /$re/ } @lines;
    untie @lines;

    print "Match: '$match'";

    With this data:

    $ cat pm_1155868_input.txt
    aaaaaaaaaaaaaa
    bbbbbbbbbbbbbb
    cccccccccccccc
    dddddddxyzdddd
    eeeeeeeuvweeee
    fffffffrstffff
    gggggggggggggg
    hhhhhhhhhhhhhh
    iiiiiiiiiiiiii

    I get this result:

    Match: 'dddddddxyzdddd'

    That may be suitable for your needs. If so, great! If not, you'll need to provide more information: answering the questions I've already posed would be a good start; see also "How do I post a question effectively?".

    — Ken

Re: grepping a large file and stopping on first match to a list
by Anonymous Monk on Feb 23, 2016 at 04:14 UTC
    Does this stop searching on the first match?
    Well, it does, you can test it yourself:
    $ perl -le '
        use List::Util "first";
        first { $_->() }
            sub { print 0; 0 },
            sub { print 1; 1 },
            sub { print 2; 1 };
    '
    a more interesting question is: does the expression $map =~ m"^(.*)$"gm put 861_600 lines on perl's argument stack before even calling first? It does, which IMO kind of defeats the purpose of "stopping on the first match".
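    If you want to keep the memory map, one way to avoid building that list is to pull lines out one at a time with a scalar-context /g match and stop at the first hit. A rough sketch (assuming $map and @strings as in the original post):

    my %want = map { $_ => 1 } @strings;    # exact-match lookup table

    my $string;
    while ($map =~ /^(.*)$/gm) {            # one line per iteration
        if ($want{$1}) {                    # first exact match wins
            $string = $1;
            last;
        }
    }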

    The approach suggested by kcott (compiling the combined regex; not the tie stuff) should be fairly efficient. I use it often. I also recommend avoiding smartmatch; it's too smart for most programmers ("27-way recursive runtime dispatch by operand type" is something I personally don't even want to understand). If by "exactly matches" you mean literally matches (as with eq), compile the regex like this:

    my $regex = join '|', map quotemeta, @strings;
    $regex = qr/^($regex)$/m;
    and use it like this:
    $string = $1 if $map =~ $regex;
    I don't see much point in tying the file, nor in memory mapping it, for that matter. But maybe you need the map for something else. If not, just read the file in the usual way, using a while loop.
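    For the plain read-the-file case, something like this rough sketch (the file name and @strings are assumed from the original post):

    my $regex = join '|', map quotemeta, @strings;
    $regex = qr/^($regex)$/;

    open my $fh, '<', 'huge_file' or die "open failed: $!";
    my $string;
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ $regex) {    # stop at the first matching line
            $string = $1;
            last;
        }
    }
    close $fh;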
      a more interesting question is: does the expression $map =~ m"^(.*)$"gm put 861_600 lines on perl's argument stack before even calling first? It does, which IMO kind of defeats the purpose of "stopping on the first match".

      Ah, so it's the loading of the entire file that's slowing me down. And you say that your and kcott's solutions don't avoid that. (Thank you both for them!) Any idea what can, if anything?

      If it helps (and to address some of the questions in kcott's reply), the file is text, with a maximum line length of (I'm estimating now, since I don't have it at hand) a few hundred bytes, and is static (meaning unchanging, so that I can, if necessary, include it in my Perl script instead of opening it as a file).

      $_="msh210";$"=$\;@_=@{[split//,uc]}[2,0];$_="@_$\1";$\=$/;++$_[0]for$...1;print lc substr crypt($_,"@_"),1,6

        Hello msh210,

        Ah, so it's the loading of the entire file that's slowing me down. And you say that your and kcott's solutions don't avoid that.

        No, Anonymous Monk did not say that! Quite the contrary: he and kcott are recommending that you read the file line-by-line, stopping when the first match is found. (kcott’s quotation from the Tie::File documentation specifically rules out any need to read in the whole file.)

        Of course, this assumes that if a match occurs at all, it will occur within a single line. If a match may span two or more lines, you will need to adapt the approach with a sliding window: examine n lines at a time, then move the window forward by discarding the first line and adding the next, (n+1)th, line. The key task is then to choose n, which must be large enough to accommodate the longest possible match; finding the most efficient size usually takes some trial and error.
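        A rough sketch of that sliding window (the window size $n, the file name and the pre-compiled $regex are assumptions you would adjust to your own data):

        my $n = 3;                  # window size: at least the longest possible match, in lines
        my @window;

        open my $fh, '<', 'huge_file' or die "open failed: $!";
        while (my $line = <$fh>) {
            push @window, $line;
            shift @window if @window > $n;      # slide forward: drop the oldest line

            my $chunk = join '', @window;
            if ($chunk =~ $regex) {             # first (possibly multi-line) match wins
                print "Matched within:\n$chunk";
                last;
            }
        }
        close $fh;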

        Hope that helps,

        Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

        Ah, so it's the loading of the entire file that's slowing me down.
        That, and probably smartmatch too (depending on what it does...).
        And you say that your and kcott's solutions don't avoid that
        No. Just don't use smartmatch and first:
        use strict;
        use warnings;

        my @strings = qw(
            foobarbaz111222333444
            barbazfoo111555666999
            bazfoobar999888777666
        );

        my $text = <<'END';
        bdfjadslfjhasdjklfhjklashdflkjadshfjkladhfjkldhfljkafjadfdji
        adlfjhdlsjkfuiowerfhwehrbeblbflasdfbhjkaldhqpqheuihfbdfkhyyy
        qjdpdbnfdbjdfklbasjfbajksdbfjaksdbfjaksdfbjkasdfhydsfadfjyyy
        END

        $text x= 1_000_000;
        $text .= "bazfoobar999888777666\n";

        printf "text size: %d bytes (%.2f mb)\n",
            length($text), length($text) / ( 1024 * 1024 );

        my $regex = join '|', map quotemeta, @strings;
        $regex = qr/^($regex)$/m;

        printf "found %s at %d\n", $1, $-[0] if $text =~ $regex;
        That simulates your memory mapped file.

        [I haven't logged in for ten days or so. Apologies for the late response.]

        In an earlier post, Anonymous Monk wrote:

        "The approach suggested by kcott (compiling the combined regex; not the tie stuff) should be fairly efficient."

        I'd just like to say that I concur with the "not the tie stuff" part.

        "... to address some of the questions in kcott's reply ..."

        Thanks for that. I think you shouldn't use any modules at all (except the pragmata). Consider code like this:

        #!/usr/bin/env perl
        #
        # pm_1155868_first_grep_large_file_2.pl [no modules]
        #

        use strict;
        use warnings;
        use autodie;

        open my $lines_fh, '<', input_filename();

        while (<$lines_fh>) {
            next unless is_a_match($_);
            process_match($_) and last;
        }

        sub input_filename { 'pm_1155868_input.txt' }
        sub get_strings    { [qw{rst uvw xyz}] }

        BEGIN {
            my $re = '(?:' . join('|', @{get_strings()}) . ')';
            sub is_a_match { $_[0] =~ /$re/ }
        }

        sub process_match { print "Match: $_[0]" }

        It produces the same output:

        $ pm_1155868_first_grep_large_file_2.pl
        Match: dddddddxyzdddd
        • You'll need to modify &input_filename, &get_strings and &process_match to suit.
        • $re and &is_a_match should work fine as is; modify if necessary.
        • Note how the while loop ignores all records that don't match (next); when the first match is found, it uses it and then exits the loop (last). This seems to match your requirements.
        • Use Benchmark, as previously discussed, to compare potential solutions.

        — Ken

Re: grepping a large file and stopping on first match to a list
by marioroy (Prior) on Feb 23, 2016 at 19:44 UTC

    Greetings msh210,

    The MCE module includes a small demonstration for scanning file(s) quickly with the ability to stop upon reaching the first match.

    /path/to/egrep.pl --help
    /path/to/egrep.pl --max-count=1 -n "str1|str2|str3" /path/to/huge_file

    It scans the file in chunks with parallelism enabled. It's quite fast.

    Regards, Mario

      The example supports many options.

      egrep.pl --help

      Options for Many-Core Engine:
        --max-workers=NUM         override max workers (default auto)
                                  e.g. auto, auto-2, 4
        --chunk-size=NUM[KM]      override chunk size (default 4M)
                                  minimum: 200K; maximum: 20M

      Usage: egrep.pl [OPTION]... PATTERN [FILE] ...
      Search for PATTERN in each FILE or standard input.
      Example: egrep.pl -i 'hello world' menu.h main.c

      Regexp selection and interpretation:
        -e, --regexp=PATTERN      use PATTERN as a regular expression
        -i, --ignore-case         ignore case distinctions

      Miscellaneous:
        -s, --no-messages         suppress error messages
        -v, --invert-match        select non-matching lines
            --help                display this help and exit

      Output control:
        -m, --max-count=NUM       stop after NUM matches
        -n, --line-number         print line number with output lines
        -H, --with-filename       print the filename for each match
        -h, --no-filename         suppress the prefixing filename on output
        -q, --quiet, --silent     suppress all normal output
        -R, -r, --recursive       equivalent to --directories=recurse
            --include=PATTERN     files that match PATTERN will be examined
            --exclude=PATTERN     files that match PATTERN will be skipped
            --exclude-from=FILE   files that match PATTERN in FILE will be skipped
            --exclude-dir=PATTERN directories that match PATTERN will be skipped
                                  requires a recent egrep binary for --exclude-dir
        -L, --files-without-match only print FILE names containing no match
        -l, --files-with-matches  only print FILE names containing matches
        -c, --count               only print a count of matching lines per FILE

      With no FILE, or when FILE is -, read standard input. If less than
      two FILEs given, assume -h. Exit status is 0 if match, 1 if no match,
      and 2 if trouble.

      Thank you.

      $_="msh210";$"=$\;@_=@{[split//,uc]}[2,0];$_="@_$\1";$\=$/;++$_[0]for$...1;print lc substr crypt($_,"@_"),1,6
Re: grepping a large file and stopping on first match to a list
by perlfan (Parson) on Feb 23, 2016 at 16:56 UTC
    In practical terms, it would behoove you to pre-process the nearly 1 million lines of text ahead of time so you're searching over a much smaller set of possible matches.

    If this is a relatively stable corpus of some kind, then there are any number of things you could do: use the command-line grep, throw it into SQLite/MySQL, or, even better, Sphinx.
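    For example, with GNU grep (a sketch; it assumes a strings.txt holding one target string per line):

    grep -xFn -m 1 -f strings.txt huge_file

    Here -F treats the patterns as fixed strings, -x requires a whole-line match, -m 1 stops after the first match, and -n prints the matching line number.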

    Once you do this, any suggestion you use in this discussion for efficiency in the Perl script itself will work even better for you.

      Thanks! I'll look into it.

      $_="msh210";$"=$\;@_=@{[split//,uc]}[2,0];$_="@_$\1";$\=$/;++$_[0]for$...1;print lc substr crypt($_,"@_"),1,6