domcyrus has asked for the wisdom of the Perl Monks concerning the following question:

Dear Perl Monks, I am currently developing a script that has to parse 20GB files. The files I have to parse are logfiles. My problem is that it takes ages to parse them. I am doing something like this:
my %lookingFor;   # keys   => different name of one subset
                  # values => array of one subset
my $fh = new FileHandle "< largeLogFile.log";
while (<$fh>) {
    foreach my $subset (keys %lookingFor) {
        foreach my $item (@{$subset}) {
            if (<$fh> =~ m/$item/) {
                my $writeFh = new FileHandle ">> myout.log";
                print $writeFh <$fh>;
            }
        }
    }
}
I've already tried to speed it up by using the regexp /o flag, doing something like this:
$isSubSet = buildRegexp(@allSubSets);
while (<$fh>) {
    foreach my $subset (keys %lookingFor) {
        if (&$isSubSet(<$fh>)) {
            my $writeFh = new FileHandle ">> myout.log";
            print $writeFh <$fh>;
        }
    }
}

sub buildRegexp {
    my @R = @_;
    my $expr = join '||',
        map { "\$_[0] =~ m/\(To\|is\)\\:\\S\+\\@\$R[$_]/io" } ( 0..$#R );
    my $matchsub = eval "sub { $expr }";
    if ($@) {
        $logger->error("Failed in building regex @R: $@");
        return ERROR;
    }
    $matchsub;
}
I don't know how to optimize this further. Maybe it would be possible to do something with "map"? I don't think the /o flag sped it up at all. I've also tried splitting the one big file into a few smaller ones and using forked child processes to parse each of them, but that didn't help either. Thanks a lot for your help! -Marco

Replies are listed 'Best First'.
Re: how to parse large files
by Anno (Deacon) on Mar 29, 2007 at 12:30 UTC
    Speed is not your only problem.
    while (<$fh>) {
        foreach my $subset (keys %lookingFor) {
            foreach my $item (@{$subset}) {
                if (<$fh> =~ m/$item/) {
                    my $writeFh = new FileHandle ">> myout.log";
                    print $writeFh <$fh>;
                }
            }
        }
    }
    You are reading a new line from the log file each time you try a match in the "if" statement. Make that
    if ( m/$item/ ) {
    so you run all the matches on the current line. That won't make it faster, but it will make it correct.

    Further, you are re-opening the output file each time there is something to write. Open it once before the loop (simple write mode is okay, no need for append), and just use it in the loop. That will make it faster though there's no way of saying by how much.
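    Put together, a minimal corrected skeleton might look something like this (the file names come from the original post; I'm assuming %lookingFor maps subset names to array refs of patterns):

    use strict;
    use warnings;
    use FileHandle;

    my %lookingFor;    # assumed: subset name => array ref of patterns
    my $fh      = FileHandle->new("< largeLogFile.log") or die "open: $!";
    my $writeFh = FileHandle->new("> myout.log")        or die "open: $!";

    LINE: while (<$fh>) {
        foreach my $subset ( keys %lookingFor ) {
            foreach my $item ( @{ $lookingFor{$subset} } ) {
                if (m/$item/) {         # match against the current line in $_
                    print $writeFh $_;
                    next LINE;          # one copy of the line is enough
                }
            }
        }
    }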

    Try that. If it's still too slow, go back to optimizing the regex work.

    Anno

Re: how to parse large files
by rhesa (Vicar) on Mar 29, 2007 at 12:20 UTC
    Your bottleneck is likely still IO. Two tips:
    • Don't open a new Filehandle for each match; instead, open one at the start, and just write a match to it
    • Try to eliminate the remaining foreach loop

    Splitting the file up and processing it with multiple simultaneous processes isn't going to help much if you're reading off a single disk or a shared bus.

Re: how to parse large files
by Moron (Curate) on Mar 29, 2007 at 12:26 UTC
    Looking at the second code example, it looks as though the regexps are being rebuilt from scratch on a per-use basis, in spite of there being only one call to build them all at the beginning.

    The first example therefore seems a better basis to optimise from. I would be inclined to blame its performance problems on the fact that you are creating a new output filehandle per matching line, which causes huge object proliferation. You only need to construct the output filehandle once, e.g.:

    my %lookingFor;   # keys   => different name of one subset
                      # values => array of one subset
    my $fh      = new FileHandle "< largeLogFile.log";
    my $writeFh = new FileHandle ">> myout.log";
    while (<$fh>) {
        foreach my $subset (keys %lookingFor) {
            foreach my $item (@{$subset}) {
                if (<$fh> =~ m/$item/) {
                    print $writeFh <$fh>;
                }
            }
        }
    }
    close $fh      or die $!;
    close $writeFh or die $!;
    Update: your code will also fail under "use strict", which should be placed at the beginning. To solve that, construct the hash so that its values are array references rather than array names, and change the inner loop to...
    foreach my $item ( @{ $lookingFor{$subset} } ) {
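    For illustration, the hash might be built along these lines (the subset names and items here are made up, just to show the shape of the structure):

    use strict;
    use warnings;

    my %lookingFor = (
        sales   => [ 'sales\.example\.com', 'orders\.example\.com' ],
        support => [ 'support\.example\.com' ],
    );

    for my $subset ( keys %lookingFor ) {
        for my $item ( @{ $lookingFor{$subset} } ) {   # dereference the array ref
            print "$subset: $item\n";
        }
    }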

    -M

    Free your mind

Re: how to parse large files
by dk (Chaplain) on Mar 29, 2007 at 13:04 UTC
    You're almost there with buildRegexp, but you can speed it up further by removing the sub call:

    sub build_regexp {
        my $expr = join '|', map { "(?:To|is)\:\S+\@$_" } @_;
        eval { $expr = qr/$expr/i };
        die "aaa!! $@ !!!" if $@;
        $expr;
    }
    (Note that I'd suggest changing the () grouping into (?:) grouping if you're not using $1.) Then all you need is
    my $writeFh   = new FileHandle ">> myout.log";
    my $is_subset = build_regexp(@allSubSets);
    while (<$fh>) {
        next unless /$is_subset/;
        print $writeFh <$fh>;
    }
      oh. print $writeFh $_; of course.
Re: how to parse large files
by johngg (Canon) on Mar 29, 2007 at 13:07 UTC
    If at all possible I would try to go through the combinations of items once at the beginning of the script, taking the process out of the while ( <$fh> ) { ... } loop, building a single regex to match any wanted line. Something like

    my @items = ();
    foreach my $subset ( keys %lookingFor ) {
        foreach my $item ( @{ $lookingFor{$subset} } ) {
            push @items, $item;
        }
    }

    my $rxMatchItems;
    {
        local $" = q{|};
        $rxMatchItems = qr{(?:@items)};
    }

    As others have pointed out, open the output file once at the beginning of the script then the remainder of the code could be reduced to something like

    while ( <$fh> ) {
        next unless m{$rxMatchItems};
        print $writeFh $_;
    }

    This approach all depends on whether your items can easily be combined into one regex, but I think it might be more efficient if feasible.

    I hope this is of use.

    Cheers,

    JohnGG

Re: how to parse large files
by roboticus (Chancellor) on Mar 29, 2007 at 18:06 UTC
    domcyrus:

    In addition to all the other notes you have, you might want to change the structure of your loop a bit. Rather than search over all the values in a hash, make an inverted hash and then just do a single lookup for each value. That way, you needn't execute nested loops.

    Specifically:

    #!/usr/bin/perl -w
    use strict;
    use warnings;

    my %lookingFor = (
        'lazy'  => ['lazy', 'tired'],
        'entry' => ['entry', 'opening', 'ingress'],
        'file'  => ['file', 'files', 'filehandle'],
        'such'  => ['such']
    );

    # Build an inverted hash with pointers from the individual items
    # to the matching key in lookingFor
    my %revLUP;
    for my $k (keys %lookingFor) {
        for my $v (@{$lookingFor{$k}}) {
            $revLUP{$v} = $k;
        }
    }

    while (my $buf = <DATA>) {
        # Print line if it has a 'magic word' in it
        print $buf if grep { defined $revLUP{$_} } split /\s+/, $buf;
    }

    __DATA__
    Now is the time for all good men
    to come to the aid of their party.
    The quick red fox jumped
    over the lazy brown dog.
    [tye]: yes, on Window or on Unix,
    the old file is still open so it is just its directory entry that gets clobbered
    [bart]: On Linux, you can unlink a
    file and the processes that have the file open, will still see the contents. I
    suspect the same happens here.
    [tye]: "busy" only seems to apply to executable files, talexb. no problem
    deleting files that are open (though
    Win32 C RTL /defaults/ to locking
    the file such that this is prevented)
    [blokhead]: in short, the filehandle
    is tied to an inode, not a filename
    [bart]: Meaning, the directory points to the new contents, and the old
    contents is unlinked (but visible). Is that correct?
    which yields:
    $ ./bigfile.pl
    over the lazy brown dog.
    the old file is still open so it is just its directory entry that gets clobbered
    file and the processes that have the file open, will still see the contents. I
    deleting files that are open (though
    the file such that this is prevented)
    [blokhead]: in short, the filehandle
    $
    I don't know if this method will save you any time or not, as I haven't done any benchmarking. In any case, it may have a lot to do with the number of items in %lookingFor, the performance of grep, etc. But if this helps at all, you can then look for further speedups.

    --roboticus

Re: how to parse large files
by planetscape (Chancellor) on Mar 29, 2007 at 14:26 UTC
Re: how to parse large files
by Krambambuli (Curate) on Mar 30, 2007 at 08:32 UTC
    I'm a bit surprised that none of the answers so far has mentioned or asked about Devel::DProf. So: have you tried running your script under the profiler? In my experience, it can sometimes be a surprise to see where most of the time is spent.

      The bottleneck's largely IO. If it were up to me, I would start by running a simple test with the time utility to see how much of the execution time went to executing the program versus waiting on IO and then decide whether it's worth profiling the code.
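      For example (the script name here is made up):

      $ time ./parse_logs.pl            # rough split of wall-clock vs. CPU time
      $ perl -d:DProf ./parse_logs.pl   # writes profiling data to tmon.out
      $ dprofpp                         # summarizes where the time was spent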

Re: how to parse large files
by leocharre (Priest) on Mar 29, 2007 at 13:16 UTC
    I've read something in the past that really comes to mind here: "perl iterators". It's a way of coding that handles a large task one piece at a time. I suggest looking into this - first hit on google
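    As a rough sketch of the idea (not taken from that article), an iterator can be as simple as a closure that hands back one line at a time:

    use strict;
    use warnings;

    # Minimal line iterator: a closure over an open filehandle.
    sub make_line_iterator {
        my ($file) = @_;
        open my $fh, '<', $file or die "Can't open $file: $!";
        return sub { return scalar <$fh> };   # returns undef at end of file
    }

    my $next_line = make_line_iterator('largeLogFile.log');
    while ( defined( my $line = $next_line->() ) ) {
        # ... match $line against the patterns here ...
    }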
      Thanks a lot for all of your answers. First of all, the code I've posted was just some pseudo code; I didn't want you to read too much. So in my script I am not opening a filehandle all the time. Also, the IO wait is very low, so it is really just the CPU that has a lot of work. It seems that I

      I will have a look now at the iterators thing, which looks a little bit confusing to me as I am not that familiar with Perl at all.

      Cheers
      -Marco
        Thanks a lot for all of your answers.

        You're welcome! But then, please see below.

        First of all, the code I've posted was just some pseudo code; I didn't want you to read too much.

        That is very kind of you, but going the pseudo code route is generally not the best thing to do: one is always advised to prepare minimal but complete examples instead.

        So in my script I am not opening a filehandle all the time

        So what? However minimal or "pseudo" you want to stay, putting the

        my $writeFh = new FileHandle ">> myout.log";

        line inside or outside of your loop takes exactly the same amount of space, and of keystrokes. So why not put it where it belongs in the first place?

        Also, the IO wait is very low, so it is really just the CPU that has a lot of work.

        Granted, your code has several bigger problems, e.g. the if (<$fh> =~ m/$item/) thingy, but are you sure? How can you tell?


        I will have a look now at the iterators thing, which looks a little bit confusing to me as I am not that familiar with Perl at all.

        As I wrote above, you're welcome. But why didn't you mention that you already asked the very same question (link @ GG) in clpmisc? Why didn't you mention there that you also asked here? What did they tell you? Why didn't you report back?

        For completeness, I'm pasting all the messages of the thread hereafter:

Re: how to parse large files
by newroz (Monk) on Mar 30, 2007 at 10:35 UTC
    Hi,
    The Tie::File module can be used to access the gigantic file on disk as if it were an array, instead of having to split the file into chunks and slurp them into memory.
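    A minimal sketch of how that might look (the file name comes from the original post; whether this is fast enough for a 20GB file would need testing):

    use strict;
    use warnings;
    use Tie::File;

    # Present the log file as an array without slurping it into memory.
    tie my @lines, 'Tie::File', 'largeLogFile.log'
        or die "Can't tie largeLogFile.log: $!";

    for my $line (@lines) {
        # ... match $line against the patterns here ...
    }

    untie @lines;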
Re: how to parse large files
by kingkongrevenge (Scribe) on Mar 31, 2007 at 15:16 UTC

    Use compiled regular expressions as the hash keys. This stops you from rebuilding the regex on every iteration. Alternatively, memoize buildRegexp. Or Tie::RegexpHash might be useful.

    You can eliminate a for loop and the long ||'ed regexp by changing the data structure. Don't group the subset names in an array to be used as the key. Put each subset name in the hash as a single key but have the value of the hash as a reference to the array. Equivalent subset names point to the same array.

    my %lookingFor;
    my $ref = [1, 2, 3];
    $lookingFor{qr/a.*blah/o}   = $ref;
    $lookingFor{qr/the.*blee/o} = $ref;

    # Tie::RegexpHash would let you do $lookingFor{$bigfileline}
    # without looping over keys() at all. No idea if it's faster.
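    A sketch of how the read loop might then use that structure ($fh and $writeFh are assumed to be opened as in the other replies; a hash key made from qr// stringifies back to a usable pattern):

    while ( my $line = <$fh> ) {
        for my $pattern ( keys %lookingFor ) {
            if ( $line =~ /$pattern/ ) {
                # $lookingFor{$pattern} is the shared array ref for this subset
                print $writeFh $line;
                last;    # stop after the first matching pattern
            }
        }
    }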

    The bottleneck is probably IO, but if it's not, a more sophisticated data structure might help. You're not really using the hash as a hash, but as an array of arrays, i.e. an associative list. Is %lookingFor really big? If it is, you might work out a hierarchy and turn it into a tree so you don't have to search linearly.