Re: Multiple patterns match in a big file and track the counts of each pattern matched

I think you can significantly improve performance by separating the logic that skips lines from the logic that counts matches. The next step is to rework the counting code using Perl's built-in features. Then you can try this approach in your script with your data; it would be very interesting to know the results. Try the following code. Please note that it is not complete: it only shows the concept described above. I have commented it enough to make each step clear. Some mandatory parts are omitted to emphasize the key points of the approach, so you need to add them before starting your tests.

# initialize the array of patterns
# the same code as you use in your script, just complete the line
my @pat_array = ...;

# this is a new hash variable used for counting matches
# it entirely replaces your counting approach
my %match_count;

# skip the first lines
# simply read them and do nothing with them
<LOG_READ> for ( 1..$InStartLineNumber );

# normal work
# read line by line the rest of the file and do something
while ( <LOG_READ> ) {

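    # strip the trailing newline and store the line in a variable explicitly,
    # because $_ is reused for the current pattern in the loop below
    chomp;
    my $line = $_;

    # walk through the list of patterns,
    # test the line against each pattern
    # and count every successful match in the hash
    # (a plain for loop instead of map in void context)
    for ( @pat_array ) {
        $match_count{$_}++ if $line =~ m/\Q$_\E/;
    }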

}

# The rest of the code that handles @pat_array and %match_count
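
For anyone who wants to run the idea as is, here is a minimal self-contained sketch of the same skip-then-count loop. The patterns, the value of $InStartLineNumber and the log text are all made-up stand-ins for the omitted parts, and the "log" is an in-memory string opened as a file handle; a real script would open the actual log file instead.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the parts omitted in the sketch above.
my @pat_array         = ( 'foo', 'bar.baz' );
my $InStartLineNumber = 2;
my $log_text          = join( "\n",
    'header foo',
    'header bar.baz',
    'foo and bar.baz',
    'barXbaz only',
    'foo again',
) . "\n";

# Open the in-memory string as a file handle (assumption: the real script
# opens the log file here).
open my $log_read, '<', \$log_text or die "Cannot open log: $!";

my %match_count;

# Skip the first $InStartLineNumber lines; stop early if the file is shorter.
for ( 1 .. $InStartLineNumber ) {
    last unless defined( my $skipped = <$log_read> );
}

# Count, for every pattern, the number of remaining lines that contain it.
while ( my $line = <$log_read> ) {
    chomp $line;
    for my $pat (@pat_array) {
        # \Q...\E makes the pattern a literal string, so '.' is not a wildcard
        $match_count{$pat}++ if $line =~ /\Q$pat\E/;
    }
}
close $log_read;

# Report the counts in the order the patterns were given.
printf "%s: %d\n", $_, $match_count{$_} // 0 for @pat_array;
```

On this sample data the line 'barXbaz only' is not counted for 'bar.baz', because the dot is quoted and has to match literally.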

Re^2: Multiple patterns match in a big file and track the counts of each pattern matched by ansh007 (Novice) on Dec 04, 2017 at 11:11 UTC
Thank you so much for such a detailed explanation and the piece of code. It works as expected, but it takes about the same time as my code: mine takes 1 min 35 secs and this takes 1 min 32 secs. Can you please help me optimize it down to 40 secs at least? Waiting for your response :)
Re^3: Multiple patterns match in a big file and track the counts of each pattern matched by siberia-man (Friar) on Dec 04, 2017 at 19:59 UTC
Definitely, a 1GB file is quite huge! Do you really think it is possible to improve the performance in this case? Anyway, there are two other hints given by other monks: 1) use `index`, or 2) combine several small regexps into one bigger one. You can also move the creation of the regexps out of the loop: create the regexps before looping and use the "compiled" regexps within the loop.
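
A rough sketch of the three hints, so they can be timed against each other. The sub names, the patterns and the sample lines are hypothetical and only illustrate the idea. Note that the combined-regexp variant is not a drop-in replacement: it counts every non-overlapping occurrence, while the other two count a pattern at most once per line, as the original loop does.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the real patterns and log lines.
my @pat_array = ( 'ERROR', 'WARN', 'timeout (5s)' );
my @lines     = (
    'ERROR: timeout (5s) while connecting',
    'WARN: retrying',
    'ERROR: ERROR repeated on one line',
    'INFO: all good',
);

# Hint 1: index() is a plain substring search, no regexp engine involved.
# Each pattern is counted at most once per line.
sub count_with_index {
    my ( $pats, $lines ) = @_;
    my %count;
    for my $line (@$lines) {
        for my $pat (@$pats) {
            $count{$pat}++ if index( $line, $pat ) >= 0;
        }
    }
    return %count;
}

# "Compiled" regexps: build the qr// objects once, before the loop.
# Each pattern is counted at most once per line.
sub count_with_compiled {
    my ( $pats, $lines ) = @_;
    my @compiled = map { [ $_, qr/\Q$_\E/ ] } @$pats;
    my %count;
    for my $line (@$lines) {
        for my $pair (@compiled) {
            my ( $pat, $re ) = @$pair;
            $count{$pat}++ if $line =~ $re;
        }
    }
    return %count;
}

# Hint 2: one big alternation, longest pattern first.
# Different semantics: counts every non-overlapping occurrence, so a pattern
# can be counted several times per line and overlapping patterns can be missed.
sub count_with_alternation {
    my ( $pats, $lines ) = @_;
    my $alt = join '|',
        map { quotemeta($_) }
        sort { length($b) <=> length($a) } @$pats;
    my $re = qr/($alt)/;
    my %count;
    for my $line (@$lines) {
        $count{$1}++ while $line =~ /$re/g;
    }
    return %count;
}

my %by_index = count_with_index( \@pat_array, \@lines );
my %by_qr    = count_with_compiled( \@pat_array, \@lines );
my %by_alt   = count_with_alternation( \@pat_array, \@lines );

printf "%-14s index=%d qr=%d alternation=%d\n",
    $_, $by_index{$_} // 0, $by_qr{$_} // 0, $by_alt{$_} // 0
    for @pat_array;
```

On the sample lines the first two variants give ERROR a count of 2 (two matching lines), while the alternation gives 3, because the third line contains ERROR twice. Which variant is faster on a real file depends on the number of patterns and the data, so it is worth measuring each one, for example with the core Benchmark module.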