I'm a bit late to the party but I noticed this thread is still active. Here's my contribution, which is based on and very similar to choroba's idea, but uses %+ instead of %- and adds siberia-man's suggestion for skipping lines of the file before the main loop.

The idea of the following code is that since we're constructing the regex ourselves, we know that only one named capture group (?<mN>...) will match at a time, so keys %+ should only ever return one value, from which we extract the digits N. As for why I sort the strings by length, see the tutorial Building Regex Alternations Dynamically.

Another thing to note is that if multiple patterns could match on a single line, only the first one is counted; I'm not sure if that's acceptable in your case? It would also be possible to modify the code to find all matches on a single line with the /g modifier.
use warnings;
use strict;

my @pat_array = sort { length $b <=> length $a } qw/ foo ba baz quzz /;
my $InStartLineNumber = 2; # nr. of lines to skip

my $i=0;
my ($regex) = map {qr/$_/} join '|',
    map { '(?<m'.$i++.'>'.quotemeta.')' }
    @pat_array; # pre-sorted above
my @match_count = (0) x @pat_array;

<DATA> for 1..$InStartLineNumber;
while (<DATA>) {
    if ($_=~$regex) {
        $match_count[ substr( (keys %+)[0], 1 ) ]++;
    }
}

for my $i (0..$#pat_array) {
    print $pat_array[$i],": ",$match_count[$i],"\n";
}

__DATA__
Skip me foo
Skip me bar
Hello foo
World bar
foo bar
baz
foo
quz
Output:
quzz: 0
foo: 3
baz: 1
ba: 1
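To illustrate why the strings are sorted by length before being joined, here is a minimal sketch. The helper name build_alternation is made up for this example and is not part of the code above. Perl tries alternatives from left to right at each position, so a shorter string that is a prefix of a longer one will win if it comes first.

```perl
use warnings;
use strict;

# Hypothetical helper (not from the code above): build one capturing
# alternation from a list of literal strings, in the order given.
sub build_alternation {
    my @strings = @_;
    my $alt = join '|', map { quotemeta($_) } @strings;
    return qr/($alt)/;
}

my @pats     = qw/ ba baz /;
my $unsorted = build_alternation(@pats);
my $sorted   = build_alternation( sort { length $b <=> length $a } @pats );

# With "ba" first, the regex never gets as far as trying "baz".
for my $re ($unsorted, $sorted) {
    print "matched '$1'\n" if 'baz' =~ $re;
}
```

With the unsorted list the string "baz" is reported as a match for "ba"; with the longest strings first it is reported as "baz".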
I haven't yet benchmarked this against a big file, but give it a try.

The above code assumes that you need your output in @match_count as you showed. If other data structures are acceptable, note the code can be simplified even more by using a single capture group and a hash, as follows. The set-up code and the __DATA__ section are the same as above.
my ($regex) = map {qr/($_)/} join '|', map {quotemeta} @pat_array;
my %match_count;

<DATA> for 1..$InStartLineNumber;
while (<DATA>) {
    if ($_=~$regex) {
        $match_count{$1}++;
    }
}

for my $k (sort keys %match_count) {
    print $k,": ",$match_count{$k},"\n";
}
Output:
ba: 1
baz: 1
foo: 3
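As for the /g modifier mentioned earlier, the hash-based version could be adapted roughly like this to count every match on a line instead of only the first. This is only a sketch: the sub name count_all_matches and the sample lines are made up for illustration, and matches are counted without overlapping.

```perl
use warnings;
use strict;

# Hypothetical helper: count every (non-overlapping) occurrence of each
# literal string across the given lines.
sub count_all_matches {
    my ($patterns, $lines) = @_;
    my $alt = join '|', map { quotemeta($_) }
        sort { length $b <=> length $a } @$patterns;
    my $regex = qr/($alt)/;
    my %count;
    for my $line (@$lines) {
        # In scalar context, /g resumes after the previous match each time.
        $count{$1}++ while $line =~ /$regex/g;
    }
    return \%count;
}

my $count = count_all_matches(
    [ qw/ foo ba baz quzz / ],
    [ "Hello foo", "World bar", "foo bar baz", "foo quz" ],
);
print "$_: $count->{$_}\n" for sort keys %$count;
```

On the sample lines this prints ba: 2, baz: 1 and foo: 3, because the line "foo bar baz" now contributes three matches instead of one.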
One more thought: You haven't said why you need to skip lines in the file, but if the number of lines you're skipping is large, then of course that will take some time. If the amount of data you want to skip is somehow predictable, you could seek ahead in the file, which would be much faster. For example, say you have already processed a set of lines from the beginning of the file, and now you want to process the rest of the file. In that case I would suggest that the code which processes the first part of the file record where it stopped (tell), so that you can later seek to that position.
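A minimal sketch of the tell/seek idea, assuming the file does not change between the two passes. The temporary file and its contents are just stand-ins for your real data.

```perl
use warnings;
use strict;
use Fcntl qw/SEEK_SET/;
use File::Temp qw/tempfile/;

# Stand-in for the real big file (contents invented for this example).
my ($tmp_fh, $filename) = tempfile( UNLINK => 1 );
print $tmp_fh "first part 1\nfirst part 2\nrest foo\nrest bar\n";
close $tmp_fh or die "close: $!";

# First pass: read the first two lines, then remember where we stopped.
open my $fh, '<', $filename or die "$filename: $!";
<$fh> for 1..2;
my $pos = tell $fh;
close $fh;

# Later pass: jump straight to the saved byte offset instead of
# reading and discarding the lines again.
open $fh, '<', $filename or die "$filename: $!";
seek $fh, $pos, SEEK_SET or die "seek: $!";
my @rest = <$fh>;
close $fh;

print @rest;
```

The saved offset could just as well be written to a small state file between runs; it is only valid as long as the earlier part of the file stays the same.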
In reply to Re: Multiple patterns match in a big file and track the counts of each pattern matched
by haukex
in thread Multiple patterns match in a big file and track the counts of each pattern matched
by ansh007