There are a couple things you can do to speed up your code a bit. I don't know if they'll give you the kind of improvements you are looking for though. You didn't mention what hardware you got those runtimes on, by the way. For heavy log processing, it often makes sense to use a dedicated host or hosts and this might be a problem worth throwing some hardware at as you need it in the future. Analyzing 6 hours worth of data in 4 hours still allows you to finish well ahead of starting the next batch so why isn't it quick enough? What are your requirements?

Following are some optimizations.

The first thing I see is that you use s/// to count characters. You should use tr/// for that instead.

my $count = $string =~ tr/@//;

I don't know how much it will help in your case but using alternation when a character class will do is generally frowned upon. Also, you might as well use precompiled expressions when you can. I'd change

$CHARS = '(<|>|\[|\])';
to
$CHARS = qr/([][<>])/;

In your first loop, why do you perform a substitution on $id after matching it with a regular expression in the first place? I would replace

$id = $1 if (/msgid=([^:]+):/); $id =~ s/(^<|>$)//g;
with
($id) = /msgid=<([^>]+)/;
It isn't exactly the same but I think it would work based on your sample data. If the msgid isn't always bracketed by '<' and '>' then it will break. It will also break if a '>' is permitted in a msgid. One way or the other though, there is bound to be a way to express what you need in one regex.

This may be more of a style issue, but in that first loop, you also have

if (!$id || $id =~ /^(\s+|)$/) { $no_id++; next; }
Why not just say $no_id++ and next unless $id =~ /\S/;? For that matter, you start it with next if (!/MsgTrace/); which would be better expressed with next unless /MsgTrace/; because it avoid the double negative.

In your next loop, you make a copy of your data each time. That's a waste especially since you go to lengths not to modify your copy and you throw both the copy and the original out when you are done. (In fact, you probably shouldn't go to the trouble throwing the original out, by the way. You aren't doing what you think anyway.) Just work with the original. That's bound to save you some time. Also, consider passing a reference to process_it() rather than the array itself. Also, in that function, each of those greps is a loop. Try to combine them. Instead of

(@data) = grep(!/received from internet:/, @in); (@data) = grep(!/Error-Handler/, @data) if ( grep(/Error-Handler/, @data) && $data[1] );
try something like (untested):
if ($data[1]) { # Do you really mean that, by the way? my (@t1,@t2); @t1 = map { if (/Error-Handler/) { $_ } else { !/received from internet:/ and push @t2, $_; (); } } @data; @data = @t2 if @t1; }
There's probably a better way to do that but that's one way you might reduce those three loops (greps) to one (map).

Another micro-optimisation: Don't capture things you don't use in regexen:

if ($line[0] =~ /( |-)([a-zA-Z]+)$/) {
would be better written as
if ($line[0] =~ /[ -](a-zA-Z]+)$/) {
instead.

I imagine you can clean it up in other ways too. The most dramatic changes to performance usually come from algorithmic changes and I don't know enough about your problem to really think about a better way to do it. Eliminating loops within other loops is one rule of thumb that should help you a great deal as long as you remember that greps and maps are loops too.

-sauoq
"My two cents aren't worth a dime.";

In reply to Re: The need for speed by sauoq
in thread The need for speed by l2kashe

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.