From: "Mumia W." <paduille.4060.mumia.w+nospam@earthlink.net>
Message-ID: <hrROh.16204$PL.12911@newsread4.news.pas.earthlink.net>
Date: Thu, 29 Mar 2007 16:03:25 GMT

On 03/29/2007 07:24 AM, cadetg@googlemail.com wrote:
> Dear Perl Monks, I am developing at the moment a script which has to
> parse 20GB files. The files I have to parse are some logfiles. My
> problem is that it takes ages to parse the files. I am doing something
> like this:
> 
> my %lookingFor;
> # keys => different name of one subset
> # values => array of one subset
> 
> my $fh = new FileHandle "< largeLogFile.log";
> [1:] while (<$fh>) {
>   foreach my $subset (keys %lookingFor) {
>     foreach my $item (@{$subset}) {
> [2:]      if (<$fh> =~ m/$item/) {

You are aware that line 2 reads a new chunk from $fh, and that the chunk
read on line 1 (now in $_) is never examined, aren't you?
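
To make the difference concrete, here is a minimal sketch. The sample data, the in-memory filehandle, and the sub names count_buggy and count_fixed are made up for the demonstration; they are not from the original script. It counts how many lines actually get tested against the pattern in each form.

```perl
use strict;
use warnings;

# Hypothetical sample data, read through an in-memory filehandle.
my $data = "line one\nline two\nline three\nline four\n";

# Buggy form: the while condition reads one line into $_, then the
# match reads a second line, so only every other line is tested.
sub count_buggy {
    open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
    my $tested = 0;
    while (<$fh>) {
        $tested++ if <$fh> =~ /line/;
    }
    close $fh;
    return $tested;
}

# Fixed form: keep the line in a variable and match against that.
sub count_fixed {
    open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
    my $tested = 0;
    while (my $line = <$fh>) {
        $tested++ if $line =~ /line/;
    }
    close $fh;
    return $tested;
}

print "buggy: ", count_buggy(), ", fixed: ", count_fixed(), "\n";
```

With these four sample lines the buggy loop tests only two of them, while the fixed loop tests all four.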


>         my $writeFh = new FileHandle ">> myout.log"; print $writeFh <
> $fh>;

You can open the write filehandle once, before the loop, and keep it open
until you are done.


>       }
>    }
> }
> 
> I've already tried to speed it up by using the regExp flag=>o by doing
> something like this:
> 
> $isSubSet=buildRegexp(@allSubSets);
> while (<$fh>) {
>   foreach my $subset (keys %lookingFor) {
>     if (&$isSubSet(<$fh>)) {
>       my $writeFh = new FileHandle ">> myout.log";
>       print $writeFh <$fh>;
>     }
>   }
>  }
> sub buildRegexp {
>   my @R = @_; my $expr = join '||', map { "\$_[0] =~ m/\(To\|is\)\\:\\S
> \+\\@\$R[$_ +]/io" } ( 0..$#R );
>   my $matchsub = eval "sub { $expr }";
>   if ($@) { $logger->error("Failed in building regex @R: $@"); return
> ERROR; }
>   $matchsub;
> }
> 
> I don't know how to optimize this more. Maybe it would be possible to
> do something with "map"? I think the "o" flag didn't speed it up at
> all. Also I've tried to split the one big file into a few small ones
> and use some forks childs to parse each of the small ones. Also this
> didn't help.
> 
> Thanks a lot for your help!
> 
> Cheers
> -Marco
> 

It might not be possible to get much faster with such large files, but 
try this out:

#!/usr/bin/perl
use strict;
use warnings;
use FileHandle;
use Data::Dumper;
use Alias;

my %lookingFor = (
     houseware => [qw(wallpaper hangers doorknobs)],
     );

my %lookingForRx = lookingForRx(%lookingFor);


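# FileHandle->new returns undef on failure, so check both opens.
my $fh = FileHandle->new('< largeLogFile.log')
     or die "Cannot read largeLogFile.log: $!";
my $writeFh = FileHandle->new('> myout.log')
     or die "Cannot write myout.log: $!";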

while (my $line = <$fh>) {
     foreach my $subset (keys %lookingForRx) {
         if ($line =~ /$lookingForRx{$subset}/) {
             print $writeFh $line;
         }
     }
}


$writeFh->close;
$fh->close;

#####################################

sub lookingForRx {
     our (%oldHash, @oldArray);
     local %oldHash = @_;
     local @oldArray;

     my %hash;
     foreach my $subset (keys %oldHash) {
         alias oldArray => $oldHash{$subset};
         my $rx = do { local $" = '|'; "(@oldArray)"  };
         $hash{$subset} = qr/$rx/;
     }
     %hash;
}


__END__

I haven't really tested this other than to make sure it compiles.
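
If the Alias module is not installed, the same per-subset regexes can probably be built with core Perl only. The following is a rough sketch rather than a drop-in replacement: the name looking_for_rx is hypothetical, and the use of quotemeta is my assumption that the items are literal strings and not patterns (drop it if the items are meant to be regexes).

```perl
use strict;
use warnings;

# Build one compiled regex per subset from a hash of array references.
sub looking_for_rx {
    my %subsets = @_;
    my %rx;
    foreach my $name (keys %subsets) {
        # Assumption: items are literal strings, so escape metacharacters.
        my $alternation = join '|',
            map { quotemeta($_) } @{ $subsets{$name} };
        $rx{$name} = qr/(?:$alternation)/;
    }
    return %rx;
}

my %rx = looking_for_rx(
    houseware => [qw(wallpaper hangers doorknobs)],
);

# Sample lines made up for the demonstration.
foreach my $line ("new doorknobs arrived\n", "nothing relevant\n") {
    print $line if $line =~ $rx{houseware};
}
```

Each subset is compiled once with qr//, so the main loop does one match per subset per line instead of one match per item.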