
I'm a bit late to the party, but I noticed this thread is still active. Here's my contribution, which is based on and very similar to choroba's idea, but uses %+ instead of %- and adds siberia-man's suggestion for skipping lines of the file before the main loop. The idea of the following code is that since we're constructing the regex ourselves, we know that only one named capture group (?<mN>...) will match at a time, so keys %+ should only ever return one value, from which we extract the digits N. As for why I sort the strings by length, see the tutorial Building Regex Alternations Dynamically. Another thing to note is that if multiple patterns could match on a single line, only the first one is counted; I'm not sure if that's acceptable in your case? It would also be possible to modify the code to find all matches on a single line with the /g modifier.

use warnings;
use strict;

my @pat_array = sort { length $b <=> length $a }
    qw/ foo ba baz quzz /;
my $InStartLineNumber = 2; # nr. of lines to skip

my $i=0;
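# Build one alternation with a named capture group per pattern,
# (?<m0>...)|(?<m1>...)|..., so the group name gives the index into @pat_array.
my ($regex) = map {qr/$_/} join '|',
    map { '(?<m'.$i++.'>'.quotemeta.')' }
    @pat_array; # pre-sorted above
my @match_count = (0) x @pat_array;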

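<DATA> for 1..$InStartLineNumber; # read and discard the lines to skip
while (<DATA>) {
    if ($_=~$regex) {
        # only one named group can have matched; drop the leading "m" to get its index
        $match_count[ substr( (keys %+)[0], 1 ) ]++;
    }
}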

for my $i (0..$#pat_array) {
    print $pat_array[$i],": ",$match_count[$i],"\n";
}

__DATA__
Skip me foo
Skip me bar
Hello foo
World
bar
foo bar
baz
foo
quz

Output:

quzz: 0
foo: 3
baz: 1
ba: 1
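
On the point about sorting by length: Perl tries the alternatives left to right and takes the first one that matches at a given position, so a shorter string listed before a longer string that starts with it will always win. Here is a minimal sketch of that effect; first_match is a made-up helper name for illustration and is not part of the code above.

```perl
use warnings;
use strict;

# Hypothetical helper: build an alternation from @alts in the order given
# and return whatever part of $str it matched (undef if nothing matched).
sub first_match {
    my ($str, @alts) = @_;
    my $alt = join '|', map { quotemeta } @alts;
    return $str =~ /($alt)/ ? $1 : undef;
}

# Shorter alternative first: "ba" matches before "baz" is ever tried.
print first_match('baz', qw/ ba baz /), "\n";   # prints "ba"

# Longer alternative first: "baz" gets its chance and wins.
print first_match('baz', qw/ baz ba /), "\n";   # prints "baz"
```

Without the sort, a line containing only "baz" would be counted under "ba", which is why @pat_array is ordered longest first.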

I haven't yet benchmarked this against a big file, but give it a try. The above code assumes that you need your output in @match_count as you showed. If other data structures are acceptable, note the code can be simplified even more by using a single capture group and a hash, as follows. The set-up code and __DATA__ section are the same as above.

my ($regex) = map {qr/($_)/} join '|', map {quotemeta} @pat_array;
my %match_count;

<DATA> for 1..$InStartLineNumber;
while (<DATA>) {
    if ($_=~$regex) {
        $match_count{$1}++;
    }
}

for my $k (sort keys %match_count) {
    print $k,": ",$match_count{$k},"\n";
}

Output:

ba: 1
baz: 1
foo: 3
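
Coming back to the /g idea mentioned earlier: with the hash version it is a small change to count every occurrence on a line instead of only the first. The following is only a sketch under my own assumptions; count_all is a hypothetical helper name, and the input is passed in as an array of lines rather than read from DATA so that the example is self-contained.

```perl
use warnings;
use strict;

# Hypothetical helper: count every occurrence of every pattern in @$lines.
sub count_all {
    my ($patterns, $lines) = @_;
    # Longest first, for the same reason as in the code above.
    my $alt = join '|', map { quotemeta }
        sort { length $b <=> length $a } @$patterns;
    my $regex = qr/($alt)/;
    my %count;
    for my $line (@$lines) {
        # In scalar context, /g resumes after the previous match each time
        # around the loop, so $1 walks over all matches in the line.
        $count{$1}++ while $line =~ /$regex/g;
    }
    return \%count;
}

my $count = count_all(
    [ qw/ foo ba baz quzz / ],
    [ "Hello foo\n", "World\n", "bar\n", "foo bar\n", "baz\n", "foo\n", "quz\n" ],
);
print "$_: $count->{$_}\n" for sort keys %$count;
```

With this input the line "foo bar" now contributes to both foo and ba, so ba ends up with a count of 2 instead of 1.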

One more thought: You haven't said why you need to skip lines in the file, but if the number of lines you're skipping is large, then of course that will take some time. If the amount of data you want to skip is somehow predictable, you could seek ahead in the file, which would be much faster. For example, say you have already processed a set of lines from the beginning of the file, and now you want to process the rest of the file. In that case I would suggest that the code which processes the first part of the file should record where it stopped (tell), so you can then seek to that position.
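
A rough sketch of the tell/seek idea, assuming the first pass and the second pass can share the recorded offset somehow (here it is simply kept in a variable). The helper name read_chunk and the temporary file are made up for the example.

```perl
use warnings;
use strict;
use Fcntl qw(SEEK_SET);
use File::Temp qw(tempfile);

# Hypothetical helper: read up to $n lines from $file, starting at byte
# offset $pos; return the lines and the offset at which reading stopped.
sub read_chunk {
    my ($file, $pos, $n) = @_;
    open my $fh, '<', $file or die "$file: $!";
    seek $fh, $pos, SEEK_SET or die "seek: $!";
    my @lines;
    while ( @lines < $n and defined( my $line = <$fh> ) ) {
        push @lines, $line;
    }
    my $stopped = tell $fh;   # where the next pass should resume
    close $fh;
    return (\@lines, $stopped);
}

# Stand-in for the big file.
my ($tmp_fh, $tmp_name) = tempfile( UNLINK => 1 );
print $tmp_fh "Skip me foo\nSkip me bar\nHello foo\nWorld\n";
close $tmp_fh;

# First pass handles two lines and remembers where it stopped ...
my ($first, $pos) = read_chunk($tmp_name, 0, 2);
# ... a later pass jumps straight there instead of re-reading those lines.
my ($rest) = read_chunk($tmp_name, $pos, 100);
print "resumed: $_" for @$rest;
```

The seek itself does not depend on how many lines are skipped, whereas reading and discarding lines takes longer the more lines there are.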

In reply to Re: Multiple patterns match in a big file and track the counts of each pattern matched by haukex
in thread Multiple patterns match in a big file and track the counts of each pattern matched by ansh007
