in reply to Re^3: How to improve speed of reading big files
in thread How to improve speed of reading big files

I think we are on the same page on the main idea. I was surprised by this sub, as I wouldn't even have thought of that idea!
sub filterLog() {
    s/ {2,}/ /g;
    s/^((?:\S+ ){3}).+?\[?I\]?:/$1/;
    s/ ?: / /g;
    s/ ACK (\w) / $1 ACK /;
    ....
}
I think the following is the idiomatic Perl way to make $_ stand for some value. It is possible to assign directly: $_ = $x;. It is possible to use $_ implicitly, as in your sub above. But I think a single-item foreach is the idiomatic way to refer to a $variable as $_.
sub filterLog {
    my $input = shift;
    foreach ($input) {
        s/ {2,}/ /g;
        s/^((?:\S+ ){3}).+?\[?I\]?:/$1/;
        s/ ?: / /g;
        s/ ACK (\w) / $1 ACK /;
    }
    # $input has been modified...
    #.......
}
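
To make the aliasing concrete, here is a small sketch. The sub names squeeze_copy and squeeze_in_place and the sample string are made up for illustration; only two of the substitutions above are used. A single-item foreach aliases $_ to the variable, so the substitutions change that variable. With my $input = shift the variable is already a copy, so the caller's string stays untouched; aliasing $_[0] instead changes the caller's string.

```perl
use strict;
use warnings;

# Hypothetical helper: works on a private copy and returns the result.
sub squeeze_copy {
    my $input = shift;          # copy of the caller's string
    for ($input) {              # $_ is now an alias for $input
        s/ {2,}/ /g;            # collapse runs of spaces
        s/ ?: / /g;             # drop " : " separators
    }
    return $input;
}

# Hypothetical helper: aliases the caller's own argument, no copy is made.
sub squeeze_in_place {
    for ($_[0]) {               # $_[0] is an alias for the caller's variable
        s/ {2,}/ /g;
        s/ ?: / /g;
    }
    return;
}

my $line = "a   b : c";
my $copy = squeeze_copy($line);     # $line is unchanged by this call
squeeze_in_place($line);            # $line itself is modified here
print "$copy\n$line\n";
```

Which form is preferable probably depends on whether the caller still needs the original line afterwards.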

Replies are listed 'Best First'.
Re^5: How to improve speed of reading big files
by BrowserUk (Patriarch) on Sep 18, 2009 at 10:24 UTC

    The point was that he was already implicitly using $_ in a couple of places, so why not go the whole hog and avoid making copies of every line.

    The foreach scalar is a neat trick, but ultimately a do-once loop is more confusing to beginners than manipulating a global variable, which they tend to do naturally anyway before they learn better. There are many ways of improving the code in the CS sense, but the OP asked how to speed things up.

    Avoiding unnecessary copying is one way. Unwrapping the sub back into the main loop is another. Avoiding the allocations involved in generating multiple long lists, and lists of anonymous arrays, is probably the most effective first step.
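
    As a rough sketch of the first two points, assuming the lines can be handled one at a time: the substitutions run directly on $_ inside the read loop, and each surviving line is written out at once instead of being pushed onto a list. The sample data, the in-memory filehandles and the ACK filter are stand-ins, not the OP's code.

```perl
use strict;
use warnings;

# Stand-in for a log file; in-memory handles keep the sketch self-contained.
my $log = "10:00  host  ACK x  done\n10:01 host : idle\n";
open my $in, '<', \$log or die "Cannot open in-memory log: $!";

my $out = '';
open my $fh_out, '>', \$out or die "Cannot open in-memory output: $!";

while (<$in>) {                 # each line is read straight into $_
    s/ {2,}/ /g;                # substitutions act on $_ directly, no copy
    s/ ?: / /g;
    next unless / ACK /;        # hypothetical filter, for illustration only
    print {$fh_out} $_;         # emit immediately rather than building a list
}
close $fh_out;
close $in;

print $out;
```

    Whether this helps in the real program is something only a measurement on the real logs can show.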

    If you need to optimise, then the only way is to profile properly and re-run the test after each change. Since we're only party to two subs, there's no way to know if this might be better done using a simple shell pipe chain. We also don't know whether each set of files is processed once by a single set of query parameters; once each by several sets; frequently by many sets; etc.
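
    For comparing two candidate versions of a small piece of code, the core Benchmark module is one option. The sketch below uses made-up sample lines and two hypothetical variants (a sub call per line versus the same substitutions inlined); the timings it prints only mean something when run against representative data.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up sample data; real measurements need real log lines.
my @lines = map { "2009-09-18  host$_  proc : ACK x  done" } 1 .. 1000;

# Variant 1: one sub call per line, working on a copy of the argument.
sub clean_copy {
    my $s = shift;
    $s =~ s/ {2,}/ /g;
    $s =~ s/ ?: / /g;
    return $s;
}

sub run_copy {
    my $total = 0;
    $total += length(clean_copy($_)) for @lines;
    return $total;
}

# Variant 2: the same substitutions unwrapped into the loop body.
sub run_inline {
    my $total = 0;
    for my $line (@lines) {
        my $s = $line;          # one copy, so @lines survives repeated runs
        $s =~ s/ {2,}/ /g;
        $s =~ s/ ?: / /g;
        $total += length $s;
    }
    return $total;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, { copy => \&run_copy, inline => \&run_inline });
```

    A benchmark like this only compares fragments; finding where a whole program spends its time is a job for a profiler.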

    Under some scenarios, loading the logs directly into a DB and querying that makes most sense. Under others, the overhead of loading a DB would outweigh the gains. Without the full picture you can only attempt to answer the question as asked.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.