mwb613 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

Thanks again in advance for looking. Sorry for the long post.

I spent the weekend needlessly speeding up some log file manipulations. I have a set of log files in CSV format that close every 5 minutes or so. I pull the files down, look for certain values, and do some transforms, shipping, etc. if they match. Up to this point I was doing it in a very non-performant way (I'll write it here for anyone in the future who is dumb enough to do it the way I did):

while (<$file_handle>) {
    my @line_array = split(/,/, $_);
    push(@filtered_result, $_)
        if $line_array[70] =~ /192\.168\.200\.|10\.10\.200/;
    # THIS ISN'T EXACTLY WHAT I WAS DOING SINCE I USUALLY HAD MULTIPLE VALUES TO CHECK
}

Of course, as you guys probably already know, just grepping the raw line is significantly faster (15x in my tests) than splitting each line and checking an individual array position. So, since it didn't really matter to me where the IP address I was filtering was located (just that it was there), I changed it to the following:

my $grep_filters = [
    {
        'sub' => sub {
            my ($line) = shift @_;
            return $line if $line =~ /,SEVERE,/;
            return undef;
        },
    },
    {
        'sub' => sub {
            my ($line) = shift @_;
            return $line if $line =~ /192\.168\.200\.|10\.10\.200/;
            return undef;
        },
    },
];

while (<$FILE>) {
    foreach my $my_filter_fn (@$grep_filters) {
        my $return = $my_filter_fn->{'sub'}->($_);
        push(@return_array, $return) if defined $return;
    }
}

This sped things up greatly: for a 400k-line file with 15k matches, timethese over 10 iterations takes about 17 seconds, versus almost 200 seconds using split. Then, after some googling, I found the grep function and learned that you can run it right on a file handle. I still wanted the option to run multiple greps, which made things a little harder, but I ended up with this:

my @filter_string_array = ('192\.168\.200\.|10\.10\.200', ',SEVERE,');
...
my @local_filter_string_array = @filter_string_array;
my $first_filter_string = shift @local_filter_string_array;
my @output = grep {/$first_filter_string/} <$FILE>;
foreach my $filter_string (@local_filter_string_array) {
    @output = grep {/$filter_string/} @output;
}

This saved about 5 seconds over 10 iterations compared to the method above, where we loop through the file and apply each filter sub to the line. I was hoping to find a solution, however, where the grep function could be expanded (nested?) an arbitrary number of times based on the different match strings we get. We can do it by building a string and running eval EXPR, like this:

my $grep_source = '<$FILE>';
my @filter_string_array = ('192\.168\.200\.|10\.10\.200', ',SEVERE,');
my $grep_string_expansion;
foreach my $filter_string (@filter_string_array) {
    $grep_string_expansion = 'grep {/' . $filter_string . '/} (' . $grep_source . ')';
    $grep_source = $grep_string_expansion;
}
# string should look like this:
#   grep {/,SEVERE,/} (grep {/192\.168\.200\.|10\.10\.200/} (<$FILE>));
...
my @output = eval $grep_string_expansion;

This actually shaves another second off the timethese result (told you it was needless optimization), but eval EXPR doesn't usually strike me as the best way to do things. For one thing, we have to hard-code the name of the file handle (I guess we could use a placeholder and a replace), but in general I'm just wondering / hoping that there's something I've never heard of that can turn these filters ('192\.168\.200\.|10\.10\.200', ',SEVERE,') into this form without eval EXPR:

grep {/,SEVERE,/} (grep {/192\.168\.200\.|10\.10\.200/} (<$FILE>));

I'm pretty fuzzy on programming terminology, but I think we're trying to "curry" the grep function with multiple arguments against the array (which is initially <$FILE>). Anyway, the iterative example given above might be the cleanest, but I wanted to see if anyone had any input. Thanks!
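
Update: the closest I have gotten to an eval-free version is folding the patterns into a single grep test with List::Util's all (an untested sketch):

use List::Util qw(all);

my @filter_string_array = ('192\.168\.200\.|10\.10\.200', ',SEVERE,');
my @regexes = map { qr/$_/ } @filter_string_array;

# same AND semantics as the nested greps, but only one pass over the input
my @output = grep {
    my $line = $_;
    all { $line =~ $_ } @regexes;
} <$FILE>;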

Re: Auto-Expansion of Grep Function
by haukex (Archbishop) on Nov 13, 2017 at 20:25 UTC

    Optimization is of course a science: locate hotspots with a profiler (Devel::NYTProf) and attack the places where the code is spending the most time, <update2> using a module like Benchmark to test and compare alternatives (a sketch follows the list below) </update2>. But here are some rough rules of thumb / unverified gut feelings based on experience (every mention of "slow" below is therefore subjective):

    • grep will loop over the entire array, so running it multiple times will loop over the list multiple times, and will therefore be slow
    • you are right that eval on a string is relatively slow
    • (any_list_context) = <$FILE>, as grep and foreach require, will load the entire file into memory (slow and memory-hungry)
    • therefore, foreach (@lines) will be slower than a while (<>) loop, which reads and processes only one line at a time
    • instead of your while (<>) { foreach my $pattern (@patterns) { ... } } (nested loops always slow things down), I would suggest a single regex, or regexen combined with logic (if (/regex1/ && /regex2/))
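
    For illustration, a timethese comparison along those lines might look like this (a sketch; $logfile is a made-up stand-in for the path to one of the 400k-line files):

    use Benchmark qw(timethese);

    timethese(10, {
        # the original split-then-test-one-column approach
        split_column => sub {
            open my $fh, '<', $logfile or die "open: $!";
            my @hits;
            while (<$fh>) {
                my @f = split /,/;
                push @hits, $_ if $f[70] =~ /192\.168\.200\.|10\.10\.200/;
            }
        },
        # matching against the raw line, no split
        raw_match => sub {
            open my $fh, '<', $logfile or die "open: $!";
            my @hits;
            while (<$fh>) {
                push @hits, $_ if /192\.168\.200\.|10\.10\.200/;
            }
        },
    });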

    So personally my starting point would be something like this, based on Building Regex Alternations Dynamically:

    my @strings_to_match = ('192.168.200.', '10.10.200');
    my ($regex) = map { qr/$_/ } join '|',
        map { quotemeta }
        sort { length $b <=> length $a } @strings_to_match;
    my @filtered_result;
    while (<>) {
        push @filtered_result, $_ if m{ ,SEVERE, .* $regex }x;
    }

    Note that I am only recommending this solution because you said "it didn't really matter to me where the IP address I was filtering was located (just that it was there)". Otherwise, I would recommend Text::CSV_XS for loading CSV files.
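
    For reference, a minimal Text::CSV_XS version of the column-based filter might look something like this (an untested sketch; field 70 and the $FILE handle are taken from the original post):

    use Text::CSV_XS;

    my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });
    my @filtered_result;
    while (my $row = $csv->getline($FILE)) {
        # $row is an arrayref of parsed fields, so commas inside quoted
        # fields are handled correctly, unlike a plain split on ','
        push @filtered_result, $row
            if $row->[70] =~ /192\.168\.200\.|10\.10\.200/;
    }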

    I was hoping to find a solution, however, where the grep function could be expanded (nested?) an arbitrary number of times based on the different match strings that we get.

    No, sorry, it doesn't work that way*; however, you can make the matching logic for a single loop (while/foreach/grep) as complex as you need, as I showed above. I would not recommend building a string of Perl code and evaling it either, because it's much too easy to get burned (and in some cases expose security holes).

    Minor edits for clarity.

    * Update: The solution would be "lazy lists", which is something that can be implemented with iterators. Although this is probably much too advanced for now (and probably wouldn't give you a performance gain here either), Dominus's book Higher-Order Perl is a wonderful read on that topic.
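
    A rough sketch of that iterator idea (the helper names are made up, and it's untested):

    # each lazy_grep wraps another iterator and pulls lines only on
    # demand, so the whole file is never held in memory at once
    sub file_iterator {
        my ($fh) = @_;
        return sub { scalar <$fh> };    # returns undef at EOF
    }

    sub lazy_grep {
        my ($pattern, $it) = @_;
        return sub {
            while (defined(my $line = $it->())) {
                return $line if $line =~ /$pattern/;
            }
            return undef;
        };
    }

    # chain an arbitrary number of filters, then drain the chain
    my $it = file_iterator($FILE);
    $it = lazy_grep($_, $it) for ',SEVERE,', '192\.168\.200\.|10\.10\.200';
    my @filtered_result;
    while (defined(my $line = $it->())) {
        push @filtered_result, $line;
    }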

Re: Auto-Expansion of Grep Function
by AnomalousMonk (Archbishop) on Nov 13, 2017 at 20:15 UTC
    ... that can turn these filters ('192\.168\.200\.|10\.10\.200',',SEVERE,') into this form ...

    grep {/,SEVERE,/} (grep {/192\.168\.200\.|10\.10\.200/} (<$FILE>));

    Have you tried it?

    c:\@Work\Perl\monks>perl -wMstrict -MData::Dump -le
    "my @stuff = qw( A C B:Mild C:Severe B A:Severe Etc B:Severe And So On );
    ;;
    my @out = grep /Severe/, grep /A|B/, @stuff;
    dd \@out;
    "
    ["A:Severe", "B:Severe"]


    Give a man a fish:  <%-{-{-{-<

Re: Auto-Expansion of Grep Function
by holli (Abbot) on Nov 13, 2017 at 22:39 UTC
    Have a look at Regexp::Assemble. Using it, you can create a single regex from several other regexes that matches everything the individual regexes would match.
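
    For example (a minimal sketch, untested):

    use Regexp::Assemble;

    my $ra = Regexp::Assemble->new;
    $ra->add('192\.168\.200\.', '10\.10\.200');   # the patterns from the original post
    my $re = $ra->re;    # one combined regex that matches either pattern

    while (<$FILE>) {
        push @filtered_result, $_ if /,SEVERE,/ && /$re/;
    }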


    holli

    You can lead your users to water, but alas, you cannot drown them.