msh210 has asked for the wisdom of the Perl Monks concerning the following question:

Hi, Monks,

I have a file with 861_600 lines of text. I wish to find the earliest line that exactly matches any of a small set of strings. Here's part of my current code:

use List::Util 'first';
use File::Map 'map_handle';

open my $fh, '<', 'huge_file';
map_handle my $map, $fh;
my $string = first { $_ ~~ @strings } ($map =~ m"^(.*)$"gm);

Does this stop searching on the first match? If not, would the following stop searching on the first match?

use List::MoreUtils 'firstval';
use File::Map 'map_handle';

open my $fh, '<', 'huge_file';
map_handle my $map, $fh;
my $string = firstval { $_ ~~ @strings } ($map =~ m"^(.*)$"gm);

If not, would anything stop searching on the first match? Any other advice for how to speed up this code would be much appreciated.

$_="msh210";$"=$\;@_=@{[split//,uc]}[2,0];$_="@_$\1";$\=$/;++$_[0]for$...1;print lc substr crypt($_,"@_"),1,6

Replies are listed 'Best First'.
Re: grepping a large file and stopping on first match to a list
by kcott (Archbishop) on Feb 23, 2016 at 00:49 UTC

    G'day msh210,

    "Does this stop searching on the first match?"

    Well, the obvious answer is to run it and find out for yourself.

    You haven't told us anything about 'huge_file' except how many lines it has. What's the file size? What's the average (or typical) record length? Is there anything special about the file or is it just plain text?

    You haven't told us anything about '@strings' except that it's "a small set". What's the average (or typical) string length? Is there anything special about the strings?

    List::Util::first() stops searching on the first match. Why do you think this may not be the case with your code?

    If you're concerned about memory usage, you might want to consider the built-in module, Tie::File. Its documentation says:

    "The file is not loaded into memory, so this will work even for gigantic files."

    If you're concerned about speed, try several methods and Benchmark them.
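    A minimal sketch with the core Benchmark module (the sub names here are placeholders for whichever candidate implementations you end up with):

    use Benchmark qw{cmpthese};

    # approach_one() and approach_two() are stand-ins for the code being compared.
    # A negative count means "run each for at least that many CPU seconds".
    cmpthese(-3, {
        first_plus_smartmatch => sub { approach_one() },
        combined_regex        => sub { approach_two() },
    });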

    I'll also point out that the Smartmatch Operator (~~) is experimental, subject to change, and not suitable for production code.

    Bearing in mind all the things I don't know about your situation, here's how I might have tackled this task:

    #!/usr/bin/env perl -l

    use strict;
    use warnings;
    use autodie;
    use List::Util qw{first};
    use Tie::File;

    my @strings = qw{rst uvw xyz};
    my $re      = '(?:' . join('|', @strings) . ')';

    tie my @lines, 'Tie::File', 'pm_1155868_input.txt';
    my $match = first { /$re/ } @lines;
    untie @lines;

    print "Match: '$match'";

    With this data:

    $ cat pm_1155868_input.txt
    aaaaaaaaaaaaaa
    bbbbbbbbbbbbbb
    cccccccccccccc
    dddddddxyzdddd
    eeeeeeeuvweeee
    fffffffrstffff
    gggggggggggggg
    hhhhhhhhhhhhhh
    iiiiiiiiiiiiii

    I get this result:

    Match: 'dddddddxyzdddd'

    That may be suitable for your needs. If so, great! If not, you'll need to provide more information: answering the questions I've already posed would be a good start; see also "How do I post a question effectively?".

    — Ken

Re: grepping a large file and stopping on first match to a list
by Anonymous Monk on Feb 23, 2016 at 04:14 UTC
    Does this stop searching on the first match?
    Well, it does, you can test it yourself:
    $ perl -le '
        use List::Util "first";
        first { $_->() }
            sub { print 0; 0 },
            sub { print 1; 1 },
            sub { print 2; 1 };
    '
    a more interesting question is: does the expression $map =~ m"^(.*)$"gm put 861_600 lines on perl's argument stack before even calling first? It does, which IMO kind of defeats the purpose of "stopping on the first match".
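    If you want to keep the memory map, one way to avoid building that list is to pull lines out one at a time with a scalar-context /g match and stop at the first hit. A rough sketch (assuming $map and @strings as in the original post):

    my %want = map { $_ => 1 } @strings;    # exact-match lookup table

    my $string;
    while ($map =~ /^(.*)$/gm) {            # one line per iteration
        if ($want{$1}) {                    # first exact match wins
            $string = $1;
            last;
        }
    }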

    The approach suggested by kcott (compiling the combined regex; not the tie stuff) should be fairly efficient. I use it often. I also recommend avoiding smartmatch; it's too smart for most programmers ("27-way recursive runtime dispatch by operand type" is something I personally don't even want to understand). If by "exactly matches" you mean literally matches (as with eq), compile the regex like this:

    my $regex = join '|', map quotemeta, @strings;
    $regex = qr/^($regex)$/m;
    and use it like this:
    $string = $1 if $map =~ $regex;
    I don't see much point in tying the file, nor in memory mapping it, for that matter. But maybe you need the map for something else. If not, just read the file in the usual way, using a while loop.
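    For the plain read-the-file case, something like this rough sketch (the file name and @strings are assumed from the original post):

    my $regex = join '|', map quotemeta, @strings;
    $regex = qr/^($regex)$/;

    open my $fh, '<', 'huge_file' or die "open failed: $!";
    my $string;
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ $regex) {    # stop at the first matching line
            $string = $1;
            last;
        }
    }
    close $fh;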
      a more interesting question is: does the expression $map =~ m"^(.*)$"gm put 861_600 lines on perl's argument stack before even calling first? It does, which IMO kind of defeats the purpose of "stopping on the first match".

      Ah, so it's the loading of the entire file that's slowing me down. And you say that your and kcott's solutions don't avoid that. (Thank you both for them!) Any idea what can, if anything?

      If it helps (and to address some of the questions in kcott's reply), the file is text, with a maximum line length of (I'm estimating now, since I don't have it at hand) a few hundred bytes, and is static (meaning unchanging, so that I can, if necessary, include it in my Perl script instead of opening it as a file).

      $_="msh210";$"=$\;@_=@{[split//,uc]}[2,0];$_="@_$\1";$\=$/;++$_[0]for$...1;print lc substr crypt($_,"@_"),1,6

        Hello msh210,

        Ah, so it's the loading of the entire file that's slowing me down. And you say that your and kcott's solutions don't avoid that.

        No, Anonymous Monk did not say that! Quite the contrary: he and kcott are recommending that you read the file line-by-line, stopping when the first match is found. (kcott’s quotation from the Tie::File documentation specifically rules out any need to read in the whole file.)

        Of course, this assumes that if a match occurs at all, it will occur within a single line. If a match may span two or more lines, you will need to adapt the approach with a sliding window: examine n lines at a time, then move the window forward by discarding the first line and adding the next, (n+1)th, line. The key task is then to choose n, which must be large enough to accommodate the longest possible match; finding the most efficient size usually takes some trial and error.
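        A rough sketch of that sliding window (the window size $n, the file name and the pre-compiled $regex are assumptions you would adjust to your own data):

        my $n = 3;                  # window size: at least the longest possible match, in lines
        my @window;

        open my $fh, '<', 'huge_file' or die "open failed: $!";
        while (my $line = <$fh>) {
            push @window, $line;
            shift @window if @window > $n;      # slide forward: drop the oldest line

            my $chunk = join '', @window;
            if ($chunk =~ $regex) {             # first (possibly multi-line) match wins
                print "Matched within:\n$chunk";
                last;
            }
        }
        close $fh;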

        Hope that helps,

        Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

        Ah, so it's the loading of the entire file that's slowing me down.
        That, and probably smartmatch too (depending on what it does...).
        And you say that your and kcott's solutions don't avoid that
        No. Just don't use smartmatch and first:
        use strict;
        use warnings;

        my @strings = qw(
            foobarbaz111222333444
            barbazfoo111555666999
            bazfoobar999888777666
        );

        my $text = <<'END';
        bdfjadslfjhasdjklfhjklashdflkjadshfjkladhfjkldhfljkafjadfdji
        adlfjhdlsjkfuiowerfhwehrbeblbflasdfbhjkaldhqpqheuihfbdfkhyyy
        qjdpdbnfdbjdfklbasjfbajksdbfjaksdbfjaksdfbjkasdfhydsfadfjyyy
        END

        $text x= 1_000_000;
        $text .= "bazfoobar999888777666\n";

        printf "text size: %d bytes (%.2f mb)\n",
            length($text), length($text) / ( 1024 * 1024 );

        my $regex = join '|', map quotemeta, @strings;
        $regex = qr/^($regex)$/m;

        printf "found %s at %d\n", $1, $-[0] if $text =~ $regex;
        That simulates your memory mapped file.

        [I haven't logged in for ten days or so. Apologies for the late response.]

        In an earlier post, Anonymous Monk wrote:

        "The approach suggested by kcott (compiling the combined regex; not the tie stuff) should be fairly efficient."

        I'd just like to say that I concur with the "not the tie stuff" part.

        "... to address some of the questions in kcott's reply ..."

        Thanks for that. I think you shouldn't use any modules at all (except the pragmata). Consider code like this:

        #!/usr/bin/env perl
        #
        # pm_1155868_first_grep_large_file_2.pl [no modules]
        #

        use strict;
        use warnings;
        use autodie;

        open my $lines_fh, '<', input_filename();

        while (<$lines_fh>) {
            next unless is_a_match($_);
            process_match($_) and last;
        }

        sub input_filename { 'pm_1155868_input.txt' }
        sub get_strings    { [qw{rst uvw xyz}] }

        BEGIN {
            my $re = '(?:' . join('|', @{get_strings()}) . ')';
            sub is_a_match { $_[0] =~ /$re/ }
        }

        sub process_match { print "Match: $_[0]" }

        It produces the same output:

        $ pm_1155868_first_grep_large_file_2.pl
        Match: dddddddxyzdddd
        • You'll need to modify &input_filename, &get_strings and &process_match to suit.
        • $re and &is_a_match should work fine as is; modify if necessary.
        • Note how the while loop ignores all records that don't match (next); when the first match is found, it uses it and then exits the loop (last). This seems to match your requirements.
        • Use Benchmark, as previously discussed, to compare potential solutions.

        — Ken

Re: grepping a large file and stopping on first match to a list
by marioroy (Prior) on Feb 23, 2016 at 19:44 UTC

    Greetings msh210,

    The MCE module includes a small demonstration for scanning file(s) quickly with the ability to stop upon reaching the first match.

    /path/to/egrep.pl --help
    /path/to/egrep.pl --max-count=1 -n "str1|str2|str3" /path/to/huge_file

    It scans the file in chunks with parallelism enabled. It's quite fast.

    Regards, Mario

      The example supports many options.

      egrep.pl --help

      Options for Many-Core Engine:
        --max-workers=NUM         override max workers (default auto)
                                  e.g. auto, auto-2, 4
        --chunk-size=NUM[KM]      override chunk size (default 4M)
                                  minimum: 200K; maximum: 20M

      Usage: egrep.pl [OPTION]... PATTERN [FILE] ...
      Search for PATTERN in each FILE or standard input.
      Example: egrep.pl -i 'hello world' menu.h main.c

      Regexp selection and interpretation:
        -e, --regexp=PATTERN      use PATTERN as a regular expression
        -i, --ignore-case         ignore case distinctions

      Miscellaneous:
        -s, --no-messages         suppress error messages
        -v, --invert-match        select non-matching lines
            --help                display this help and exit

      Output control:
        -m, --max-count=NUM       stop after NUM matches
        -n, --line-number         print line number with output lines
        -H, --with-filename       print the filename for each match
        -h, --no-filename         suppress the prefixing filename on output
        -q, --quiet, --silent     suppress all normal output
        -R, -r, --recursive       equivalent to --directories=recurse
            --include=PATTERN     files that match PATTERN will be examined
            --exclude=PATTERN     files that match PATTERN will be skipped
            --exclude-from=FILE   files that match PATTERN in FILE will be skipped
            --exclude-dir=PATTERN directories that match PATTERN will be skipped
                                  requires a recent egrep binary for --exclude-dir
        -L, --files-without-match only print FILE names containing no match
        -l, --files-with-matches  only print FILE names containing matches
        -c, --count               only print a count of matching lines per FILE

      With no FILE, or when FILE is -, read standard input. If less than
      two FILEs given, assume -h. Exit status is 0 if match, 1 if no match,
      and 2 if trouble.

      Thank you.

      $_="msh210";$"=$\;@_=@{[split//,uc]}[2,0];$_="@_$\1";$\=$/;++$_[0]for$...1;print lc substr crypt($_,"@_"),1,6
Re: grepping a large file and stopping on first match to a list
by perlfan (Parson) on Feb 23, 2016 at 16:56 UTC
    In practical terms, it would behoove you to pre-process the nearly 1 million lines of text ahead of time so you're searching over a much smaller set of possible matches.

    If this is a relatively stable corpus of some kind, then there are any number of things you could do: use the command-line grep, throw it into SQLite/MySQL, or, even better, Sphinx.
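    For example, with GNU grep (a sketch; it assumes a strings.txt holding one target string per line):

    grep -xFn -m 1 -f strings.txt huge_file

    Here -F treats the patterns as fixed strings, -x requires a whole-line match, -m 1 stops after the first match, and -n prints the matching line number.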

    Once you do this, any suggestion you use in this discussion for efficiency in the Perl script itself will work even better for you.

      Thanks! I'll look into it.

      $_="msh210";$"=$\;@_=@{[split//,uc]}[2,0];$_="@_$\1";$\=$/;++$_[0]for$...1;print lc substr crypt($_,"@_"),1,6