I'm working on a script that splits lines of a log file into different output files when a line matches certain keywords or regexes. First the script reads in the user-defined keywords/regexes and the names of the output files. Here is an example rule file:

## This is a comment!
OPTIONS --> OPTIONS.txt
GET --> GET.txt
REGEX: ^124\.40\.\d{1,3}\.\d{1,3} --> REGEX.txt
::default:: --> default.txt

## is for comments, lines starting with REGEX: are regexes, and anything that doesn't match a keyword or regex gets placed in a default file. I tested this on 500 MB log files and it takes about 1 min 20 sec to execute. I'm looking to use this on much bigger logs and was wondering whether it could be optimized further. I'm aware it's better to seek the file on disk, but I'd like to use STDIN since this script will most likely be used at the end of a long pipe:

#! /usr/bin/perl
use strict;
use warnings;

my $default;
my %rules;

# Read the rule file (first argument, or split_rules.txt if none is given).
open INFILE, shift || "split_rules.txt" or die $!;
while(<INFILE>) {
    unless(m/^##/) {
        if(m/::default:: --> (\S+)/) {
            $default = $1;
            # "or die $!" so that a failed open is actually reported
            open DEFAULT, ">$default" or die $!;
        }
        elsif(m/REGEX: (\S+) --> (\S+)/) {
            $rules{qr/$1/} = $2;
        }
        elsif(m/(\S+) --> (\S+)/) {
            my $string = quotemeta($1);
            $rules{qr/$string/} = $2;
        }
        else {
            die "$0: Syntax Error!\n";
        }
    }
}
close INFILE;

# Replace each output file name with an open filehandle.
foreach my $rule (keys %rules) {
    open(my $fh, ">", $rules{$rule}) || die $!;
    $rules{$rule} = $fh;
}

# Test every input line against every rule; unmatched lines go to the default file.
while(my $line = <STDIN>) {
    study $line;
    my $match = 0;
    foreach my $rule (keys %rules) {
        if($line =~ /$rule/) {
            $match=1;
            print {$rules{$rule}} $line;
        }
    }
    if(defined($default) && $match!=1) {
        print DEFAULT $line;
    }
}

foreach my $rule (keys %rules) {
    close $rules{$rule};
}
if(defined($default)) {
    close DEFAULT;
}

Thanks!

UPDATE: crashtest pointed out that using a compiled regex as a hash key converts it back to a string, so every match in the main loop had to recompile it. After fixing that, my execution time went down to 22 seconds. Thanks!
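
For anyone reading later, here is a minimal sketch of the idea behind that fix, not necessarily the exact change I made. It assumes the rules are kept as [pattern, destination] pairs in an array instead of as hash keys; the helper name destinations_for and the sample rules are made up for illustration.

```perl
use strict;
use warnings;

# Hash keys are always plain strings, so a qr// object used as a key
# is stringified and is no longer a compiled regex when read back.
my %by_key;
$by_key{ qr/\QGET\E/ } = 'GET.txt';
my ($key) = keys %by_key;
my $key_is_regex = ref($key) eq 'Regexp' ? 1 : 0;

# Storing [pattern, destination] pairs in an array keeps the qr// object
# intact, and also keeps the rules in a fixed order.
my @rules = (
    [ qr/\QGET\E/,                    'GET.txt'   ],
    [ qr/^124\.40\.\d{1,3}\.\d{1,3}/, 'REGEX.txt' ],
);
my $pair_is_regex = ref($rules[0][0]) eq 'Regexp' ? 1 : 0;

# Hypothetical helper: return every destination whose pattern matches
# the line, or the default destination if none match.
sub destinations_for {
    my ($line, $default, @pairs) = @_;
    my @hits = map { $_->[1] } grep { $line =~ $_->[0] } @pairs;
    return @hits ? @hits : ($default);
}
```

In the real script the second element of each pair would be the open filehandle rather than the file name, and the main loop would iterate over @rules instead of keys %rules.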

In reply to Splitting Apache Log Files by cmm7825
