From: "Mumia W."
Message-ID:
Date: Thu, 29 Mar 2007 16:03:25 GMT

On 03/29/2007 07:24 AM, cadetg@googlemail.com wrote:
> Dear Perl Monks, I am developing at the moment a script which has to
> parse 20GB files. The files I have to parse are some logfiles. My
> problem is that it takes ages to parse the files. I am doing something
> like this:
>
> my %lookingFor;
> # keys => different name of one subset
> # values => array of one subset
>
> my $fh = new FileHandle "< largeLogFile.log";
> [1:] while (<$fh>) {
>        foreach my $subset (keys %lookingFor) {
>          foreach my $item (@{$subset}) {
> [2:]       if (<$fh> =~ m/$item/) {

You are aware that line 2 reads in a new chunk from $fh, and the old chunk read on line 1 is forgotten, aren't you?

>              my $writeFh = new FileHandle ">> myout.log";
>              print $writeFh <$fh>;

You can open the write filehandle once and keep it open until you are done.

>          }
>        }
>      }
>
> I've already tried to speed it up by using the regExp flag=>o by doing
> something like this:
>
> $isSubSet=buildRegexp(@allSubSets);
> while (<$fh>) {
>   foreach my $subset (keys %lookingFor) {
>     if (&$isSubSet(<$fh>)) {
>       my $writeFh = new FileHandle ">> myout.log";
>       print $writeFh <$fh>;
>     }
>   }
> }
> sub buildRegexp {
>   my @R = @_; my $expr = join '||', map { "\$_[0] =~ m/\(To\|is\)\\:\\S\+\\@\$R[$_]/io" } ( 0..$#R );
>   my $matchsub = eval "sub { $expr }";
>   if ($@) { $logger->error("Failed in building regex @R: $@"); return ERROR; }
>   $matchsub;
> }
>
> I don't know how to optimize this more. Maybe it would be possible to
> do something with "map"? I think the "o" flag didn't speed it up at
> all. Also I've tried to split the one big file into a few small ones
> and use some forks childs to parse each of the small ones. Also this
> didn't help.
>
> Thanks a lot for your help!
>
> Cheers
> -Marco
>

It might not be possible to get much faster with such large files, but try this out:

#!/usr/bin/perl
use strict;
use warnings;
use FileHandle;
use Data::Dumper;
use Alias;

my %lookingFor = (
    houseware => [qw(wallpaper hangers doorknobs)],
);

my %lookingForRx = lookingForRx(%lookingFor);
my $fh = new FileHandle '< largeLogFile.log';
my $writeFh = new FileHandle '> myout.log';

while (my $line = <$fh>) {
    foreach my $subset (keys %lookingForRx) {
        if ($line =~ /$lookingForRx{$subset}/) {
            print $writeFh $line;
        }
    }
}

$writeFh->close;
$fh->close;

#####################################

sub lookingForRx {
    our (%oldHash, @oldArray);
    local %oldHash = @_;
    local @oldArray;
    my %hash;

    foreach my $subset (keys %oldHash) {
        alias oldArray => $oldHash{$subset};
        my $rx = do { local $" = '|'; "(@oldArray)" };
        $hash{$subset} = qr/$rx/;
    }
    %hash;
}

__END__

I haven't really tested this other than to make sure it compiles.
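
For anyone who does not have the Alias module installed, here is a minimal sketch of the same idea using only core Perl: read each line exactly once, open the output once, and test the line against one precompiled alternation per subset. The helper names build_subset_rx and filter_lines, the extra "garden" subset, and the sample log text are made up for illustration. Two assumptions differ from the code above: the items are treated as literal strings (hence quotemeta), and a line is written once even if several subsets match it. The demo uses in-memory filehandles so it runs without a real log file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical subsets, same shape as %lookingFor above.
my %lookingFor = (
    houseware => [qw(wallpaper hangers doorknobs)],
    garden    => [qw(rake hose)],
);

# Build one precompiled alternation per subset.
# quotemeta makes each item match literally (an assumption);
# drop it if the items are meant to be regex fragments.
sub build_subset_rx {
    my (%subsets) = @_;
    my %rx;
    for my $name (keys %subsets) {
        my $alternation = join '|', map { quotemeta($_) } @{ $subsets{$name} };
        $rx{$name} = qr/(?:$alternation)/;
    }
    return %rx;
}

# Copy every line of $in that matches at least one subset to $out.
# Each line is read once and written at most once.
# Returns the number of lines written.
sub filter_lines {
    my ($in, $out, %rx) = @_;
    my $written = 0;
    LINE: while (my $line = <$in>) {
        for my $name (keys %rx) {
            if ($line =~ $rx{$name}) {
                print {$out} $line;
                $written++;
                next LINE;    # skip the remaining subsets for this line
            }
        }
    }
    return $written;
}

# Demo on in-memory filehandles instead of a 20GB log.
my $log    = "new wallpaper arrived\nnothing here\ngarden hose sold\n";
my $result = '';
open my $in,  '<', \$log    or die "cannot open input: $!";
open my $out, '>', \$result or die "cannot open output: $!";

my %rx    = build_subset_rx(%lookingFor);
my $count = filter_lines($in, $out, %rx);

close $out;
close $in;
print $result;
```

For real use, replace the two in-memory opens with opens on largeLogFile.log and myout.log. With files this large the run time is likely to be dominated by disk reads, so the main gain over the original loop is correctness (no lines skipped by extra <$fh> reads) plus not reopening the output file for every match.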