From: "cadetg@googlemail.com" Message-ID: <1175171070.630366.108750@n59g2000hsh.googlegroups.com> Date: 29 Mar 2007 05:24:30 -0700 Dear Perl Monks, I am developing at the moment a script which has to parse 20GB files. The files I have to parse are some logfiles. My problem is that it takes ages to parse the files. I am doing something like this: my %lookingFor; # keys => different name of one subset # values => array of one subset my $fh = new FileHandle "< largeLogFile.log"; while (<$fh>) { foreach my $subset (keys %lookingFor) { foreach my $item (@{$subset}) { if (<$fh> =~ m/$item/) { my $writeFh = new FileHandle ">> myout.log"; print $writeFh < $fh>; } } } I've already tried to speed it up by using the regExp flag=>o by doing something like this: $isSubSet=buildRegexp(@allSubSets); while (<$fh>) { foreach my $subset (keys %lookingFor) { if (&$isSubSet(<$fh>)) { my $writeFh = new FileHandle ">> myout.log"; print $writeFh <$fh>; } } } sub buildRegexp { my @R = @_; my $expr = join '||', map { "\$_[0] =~ m/\(To\|is\)\\:\\S \+\\@\$R[$_ +]/io" } ( 0..$#R ); my $matchsub = eval "sub { $expr }"; if ($@) { $logger->error("Failed in building regex @R: $@"); return ERROR; } $matchsub; } I don't know how to optimize this more. Maybe it would be possible to do something with "map"? I think the "o" flag didn't speed it up at all. Also I've tried to split the one big file into a few small ones and use some forks childs to parse each of the small ones. Also this didn't help. Thanks a lot for your help! Cheers -Marco