PerlMonks  

Re: Multiple patterns match in a big file and track the counts of each pattern matched

by siberia-man (Friar)
on Nov 28, 2017 at 12:33 UTC ( [id://1204422]=note )


in reply to Multiple patterns match in a big file and track the counts of each pattern matched

I think you can significantly improve performance by separating the logic responsible for skipping lines from the logic responsible for counting. The next step is to rework the counting code using builtin features of Perl. Finally, you can try this approach in your script with your own data; it would be very interesting to know the results. Try the following code (please note that it is not complete, it only shows the concept described above). I have commented it enough to explain what happens at each step. Some mandatory parts are omitted to emphasize the key points of the approach. You need to add them in the final version before starting your tests.

    # initialize the array of patterns
    # the same code as you use in your script, just complete the line
    my @pat_array = ...;

    # this new hash variable is used for counting matches
    # it entirely replaces your counting approach
    my %match_count;

    # skip the first lines
    # simply read them and do nothing with them
    <LOG_READ> for ( 1..$InStartLineNumber );

    # normal work
    # read the rest of the file line by line and do something
    while ( <LOG_READ> ) {
        # read the line and store it in a variable explicitly
        chomp;
        my $line = $_;

        # walk through the list of patterns
        # test the line against each pattern
        # and count every successful match in the hash
        map { $line =~ m/\Q$_\E/ and $match_count{$_}++; } @pat_array;
    }

    # The rest of the code dealing with @pat_array and %match_count

Replies are listed 'Best First'.
Re^2: Multiple patterns match in a big file and track the counts of each pattern matched
by ansh007 (Novice) on Dec 04, 2017 at 11:11 UTC

    Thank you so much for such a detailed explanation and the piece of code. It works as expected, but takes about the same time as my code: mine takes 1 min 35 s and this takes 1 min 32 s. Could you please help me optimize it down to 40 seconds or less? Waiting for your response :)

      Definitely, a 1 GB file is quite huge! Do you really think it is possible to improve the performance in this case? Anyway, there are two other hints given by other monks: 1) use index, or 2) combine several small regexps into a bigger one. You can also move the part that creates the regexps out of the loop: create the regexps before looping and use the "compiled" regexps within the loop.
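The hints above could be combined roughly like this. It is only a sketch under some assumptions: the patterns are plain literal strings (as the \Q...\E in the earlier snippet suggests), the pattern list and sample lines are made up for illustration, and count_line is a hypothetical helper name. The alternation is compiled once before the loop and used as a cheap filter; index then does the per-pattern counting, so each pattern is still counted at most once per line, as before. Whether this actually gets closer to 40 seconds depends on your data, so please measure it.

```perl
use strict;
use warnings;

# Hypothetical literal patterns; replace with your own list.
my @pat_array = ( 'ERROR', 'timeout', 'disk full' );

# Build one alternation once, before the loop.
# quotemeta keeps every pattern literal, like \Q...\E does.
my $any_re = do {
    my $alt = join '|', map { quotemeta } @pat_array;
    qr/$alt/;
};

my %match_count;

# Count each pattern at most once per line.
sub count_line {
    my ($line) = @_;

    # Lines matching none of the patterns are rejected with one regexp test.
    return unless $line =~ $any_re;

    # Plain substring search instead of a regexp per pattern.
    for my $pat (@pat_array) {
        $match_count{$pat}++ if index( $line, $pat ) >= 0;
    }
    return;
}

# Sample data standing in for the lines read from LOG_READ.
count_line($_) for (
    'ERROR: disk full on /dev/sda1',
    'INFO: all good',
    'WARN: timeout while reading',
    'ERROR: timeout',
);

printf "%-10s %d\n", $_, $match_count{$_} // 0 for @pat_array;
```

In the real script the body of count_line would go inside the while ( <LOG_READ> ) loop. If most lines match at least one pattern, the prefilter buys little and can be dropped in favour of the index loop alone.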
