Hi Monks,
Thanks again in advance for looking. Sorry for the long post.
I spent the weekend needlessly speeding up some log file manipulations. I have a set of log files in CSV format that close every 5 minutes or so. I pull the files down, look for certain values, and if they match do some transforms, shipping, etc. Up to this point I was doing it in a very non-performant way (I'll write it here for anyone in the future who is dumb enough to do it the way I did):
    while (<$file_handle>) {
        my @line_array = split(/,/, $_);
        # this isn't exactly what I was doing, since I usually had multiple values to check
        push(@filtered_result, $_)
            if $line_array[70] =~ /192\.168\.200\.|10\.10\.200/;
    }
Of course, as you guys probably already know, just grepping the raw line is significantly faster (15x in my tests) than splitting each line and checking an individual array position. So, since it didn't really matter to me where in the line the IP address I was filtering on appeared (just that it was there), I changed it to the following:
    my $grep_filters = [
        {
            'sub' => sub {
                my ($line) = @_;
                return $line if $line =~ /,SEVERE,/;
                return undef;
            },
        },
        {
            'sub' => sub {
                my ($line) = @_;
                return $line if $line =~ /192\.168\.200\.|10\.10\.200/;
                return undef;
            },
        },
    ];

    my @return_array;
    while (<$FILE>) {
        foreach my $my_filter_fn (@$grep_filters) {
            my $return = $my_filter_fn->{'sub'}->($_);
            push(@return_array, $return) if defined $return;
        }
    }
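For what it's worth, those filter subs don't have to be written out by hand; here's a minimal sketch that builds the same structure from a list of pattern strings (the qr// precompile is my own addition, but it avoids recompiling the pattern on every line):

    my @patterns = (',SEVERE,', '192\.168\.200\.|10\.10\.200');
    my $grep_filters = [
        map {
            my $re = qr/$_/;    # compile each pattern string once
            +{ 'sub' => sub {
                my ($line) = @_;
                return $line =~ $re ? $line : undef;
            } };
        } @patterns
    ];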
Grepping the raw line this way sped things up greatly: for a 400k-line file with 15k matches, timethese over 10 iterations takes about 17 seconds versus almost 200 seconds using split.
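Roughly, the timing harness I mean looks like this (a sketch only; the file name and column index are placeholders, and I slurp the lines up front so disk I/O doesn't skew the numbers):

    use strict;
    use warnings;
    use Benchmark qw(timethese);

    open my $fh, '<', 'sample.log' or die "can't open: $!";   # placeholder name
    my @lines = <$fh>;
    close $fh;

    timethese(10, {
        'split_check' => sub {
            my @hits = grep {
                ((split /,/, $_)[70] // '') =~ /192\.168\.200\.|10\.10\.200/
            } @lines;
        },
        'raw_match' => sub {
            my @hits = grep { /192\.168\.200\.|10\.10\.200/ } @lines;
        },
    });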
Then, after some googling, I found the grep function and that you can run it right on a file handle. I still wanted the option to run multiple greps, which made things a little harder, but I ended up with this:

    my @filter_string_array = ('192\.168\.200\.|10\.10\.200', ',SEVERE,');
    ...
    my @local_filter_string_array = @filter_string_array;
    my $first_filter_string = shift @local_filter_string_array;
    my @output = grep { /$first_filter_string/ } <$FILE>;
    foreach my $filter_string (@local_filter_string_array) {
        @output = grep { /$filter_string/ } @output;
    }
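As an aside, the shift-off-the-first-filter dance isn't strictly needed; the same chain can start from the full line list, and precompiling with qr// (my addition) keeps the loop tidy:

    my @filter_string_array = ('192\.168\.200\.|10\.10\.200', ',SEVERE,');
    my @output = <$FILE>;
    for my $re (map { qr/$_/ } @filter_string_array) {
        @output = grep { $_ =~ $re } @output;
    }

Functionally that's the same nested filtering, just built up one pass at a time.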
The chained grep saved about 5 seconds over 10 iterations compared to the method above where we loop through the file and run each filter sub per line. I was hoping to find a solution, though, where the grep could be expanded (nested?) an arbitrary number of times based on whatever match strings come in. We can do it by building a string and running eval EXPR, like this:
    my $grep_source = '<$FILE>';
    my @filter_string_array = ('192\.168\.200\.|10\.10\.200', ',SEVERE,');
    my $grep_string_expansion;
    foreach my $filter_string (@filter_string_array) {
        $grep_string_expansion = 'grep {/' . $filter_string . '/} (' . $grep_source . ')';
        $grep_source = $grep_string_expansion;
    }
    # the string should look like this:
    # grep {/,SEVERE,/} (grep {/192\.168\.200\.|10\.10\.200/} (<$FILE>));
    ...
    my @output = eval $grep_string_expansion;
This actually shaves another second off of the timethese result (told you it was needless optimization), but eval EXPR doesn't usually strike me as the best way to do things. For one thing, we have to hard-code the name of the file handle (I guess we could use a placeholder and a replace), but in general I'm just wondering / hoping there's something I've never heard of that can turn these filters ('192\.168\.200\.|10\.10\.200', ',SEVERE,') into this form without eval EXPR:
grep {/,SEVERE,/} (grep {/192\.168\.200\.|10\.10\.200/} (<$FILE>));
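One eval-free way I can think of to get exactly that all-patterns-must-match behavior in a single pass is grep plus all from List::Util (all needs List::Util 1.33 or newer, so treat this as a sketch):

    use List::Util qw(all);

    my @res = map { qr/$_/ } ('192\.168\.200\.|10\.10\.200', ',SEVERE,');
    my @output = grep {
        my $line = $_;
        all { $line =~ $_ } @res;    # keep the line only if every pattern hits
    } <$FILE>;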
I'm pretty fuzzy on programming terminology, but I think we're trying to "curry" the grep function with multiple arguments against the list (which is initially <$FILE>); one stab at that idea is sketched below. Anyway, the iterative example given above might be the cleanest, but I wanted to see if anyone had any input. Thanks!
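For reference, here's that closure sketch: the nesting gets built at runtime by wrapping one grep layer per pattern, with no eval EXPR and no hard-coded filehandle name (the pass-through innermost layer is my own framing):

    my @filter_string_array = ('192\.168\.200\.|10\.10\.200', ',SEVERE,');

    my $composed = sub { @_ };                  # innermost layer: pass-through
    for my $pattern (@filter_string_array) {
        my ($inner, $re) = ($composed, qr/$pattern/);
        $composed = sub { grep { /$re/ } $inner->(@_) };
    }

    my @output = $composed->(<$FILE>);          # same shape as the eval-built string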