domcyrus has asked for the wisdom of the Perl Monks concerning the following question:

Dear Perl Monks, I am currently developing a script that has to parse 20GB files. The files I have to parse are logfiles. My problem is that it takes ages to parse them. I am doing something like this:
my %lookingFor;   # keys   => different name of one subset
                  # values => array of one subset
my $fh = new FileHandle "< largeLogFile.log";
while (<$fh>) {
    foreach my $subset (keys %lookingFor) {
        foreach my $item (@{$subset}) {
            if (<$fh> =~ m/$item/) {
                my $writeFh = new FileHandle ">> myout.log";
                print $writeFh <$fh>;
            }
        }
    }
}
I've already tried to speed it up by using the regexp /o flag, doing something like this:
$isSubSet = buildRegexp(@allSubSets);
while (<$fh>) {
    foreach my $subset (keys %lookingFor) {
        if (&$isSubSet(<$fh>)) {
            my $writeFh = new FileHandle ">> myout.log";
            print $writeFh <$fh>;
        }
    }
}

sub buildRegexp {
    my @R = @_;
    my $expr = join '||',
        map { "\$_[0] =~ m/\(To\|is\)\\:\\S\+\\@\$R[$_]/io" } ( 0..$#R );
    my $matchsub = eval "sub { $expr }";
    if ($@) {
        $logger->error("Failed in building regex @R: $@");
        return ERROR;
    }
    $matchsub;
}
I don't know how to optimize this further. Maybe it would be possible to do something with "map"? I don't think the /o flag sped it up at all. I've also tried splitting the one big file into a few smaller ones and using forked child processes to parse each of them, but that didn't help either. Thanks a lot for your help! -Marco

Replies are listed 'Best First'.
Re: how to parse large files
by Anno (Deacon) on Mar 29, 2007 at 12:30 UTC
    Speed is not your only problem.
    while (<$fh>) {
        foreach my $subset (keys %lookingFor) {
            foreach my $item (@{$subset}) {
                if (<$fh> =~ m/$item/) {
                    my $writeFh = new FileHandle ">> myout.log";
                    print $writeFh <$fh>;
                }
            }
        }
    }
    You are reading a new line from the log file each time you try a match in the "if" statement. Make that
    if ( m/$item/ ) {
    so you run all the matches on the current line. That won't make it faster, but it will make it correct.

    Further, you are re-opening the output file each time there is something to write. Open it once before the loop (simple write mode is okay, no need for append), and just use it in the loop. That will make it faster though there's no way of saying by how much.
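    Put together, a minimal corrected skeleton might look something like this (the file names come from the original post; I'm assuming %lookingFor maps subset names to array refs of patterns):

    use strict;
    use warnings;
    use FileHandle;

    my %lookingFor;    # assumed: subset name => array ref of patterns
    my $fh      = FileHandle->new("< largeLogFile.log") or die "open: $!";
    my $writeFh = FileHandle->new("> myout.log")        or die "open: $!";

    LINE: while (<$fh>) {
        foreach my $subset ( keys %lookingFor ) {
            foreach my $item ( @{ $lookingFor{$subset} } ) {
                if (m/$item/) {         # match against the current line in $_
                    print $writeFh $_;
                    next LINE;          # one copy of the line is enough
                }
            }
        }
    }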

    Try that. If it's still too slow, go back to optimizing the regex work.

    Anno

Re: how to parse large files
by rhesa (Vicar) on Mar 29, 2007 at 12:20 UTC
    Your bottleneck is likely still IO. Two tips:
    • Don't open a new Filehandle for each match; instead, open one at the start, and just write a match to it
    • Try to eliminate the remaining foreach loop

    Splitting the file up and processing it with multiple simultaneous processes isn't going to help much if you're reading off a single disk or a shared bus.

Re: how to parse large files
by Moron (Curate) on Mar 29, 2007 at 12:26 UTC
    Looking at the second code example, it looks as though the regexps are being rebuilt from scratch on a per-use basis, in spite of there being only one call to build them all at the beginning.

    The first example therefore seems a better basis to optimise from. I would be inclined to blame its performance problems on the fact that you are creating a new output filehandle per matching line, which causes huge object proliferation. You only need to construct the output filehandle once, e.g.:

    my %lookingFor;   # keys   => different name of one subset
                      # values => array of one subset
    my $fh      = new FileHandle "< largeLogFile.log";
    my $writeFh = new FileHandle ">> myout.log";
    while (<$fh>) {
        foreach my $subset (keys %lookingFor) {
            foreach my $item (@{$subset}) {
                if (<$fh> =~ m/$item/) {
                    print $writeFh <$fh>;
                }
            }
        }
    }
    close $fh      or die $!;
    close $writeFh or die $!;
    Update: your code will also fail under "use strict", which should be placed at the beginning. To solve that, construct the hash so that its values are array references rather than array names, and change the inner loop to...
    foreach my $item ( @{ $lookingFor{$subset} } ) {
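    For illustration, the hash might be built along these lines (the subset names and items here are made up, just to show the shape of the structure):

    use strict;
    use warnings;

    my %lookingFor = (
        sales   => [ 'sales\.example\.com', 'orders\.example\.com' ],
        support => [ 'support\.example\.com' ],
    );

    for my $subset ( keys %lookingFor ) {
        for my $item ( @{ $lookingFor{$subset} } ) {   # dereference the array ref
            print "$subset: $item\n";
        }
    }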

    -M

    Free your mind

Re: how to parse large files
by dk (Chaplain) on Mar 29, 2007 at 13:04 UTC
    You're almost there with buildRegexp, but you can speed it up further by removing the sub call:

    sub build_regexp {
        my $expr = join '|', map { "(?:To|is)\:\S+\@$_" } @_;
        eval { $expr = qr/$expr/i };
        die "aaa!! $@ !!!" if $@;
        $expr;
    }
    (Note that I'd suggest changing the () grouping into (?:) grouping if you're not using $1.) Then all you need is
    my $writeFh   = new FileHandle ">> myout.log";
    my $is_subset = build_regexp(@allSubSets);
    while (<$fh>) {
        next unless /$is_subset/;
        print $writeFh <$fh>;
    }
      oh. print $writeFh $_; of course.
Re: how to parse large files
by johngg (Canon) on Mar 29, 2007 at 13:07 UTC
    If at all possible I would try to go through the combinations of items once at the beginning of the script, taking the process out of the while ( <$fh> ) { ... } loop, building a single regex to match any wanted line. Something like

    my @items = ();
    foreach my $subset ( keys %lookingFor ) {
        foreach my $item ( @{ $lookingFor{$subset} } ) {
            push @items, $item;
        }
    }

    my $rxMatchItems;
    {
        local $" = q{|};
        $rxMatchItems = qr{(?:@items)};
    }

    As others have pointed out, open the output file once at the beginning of the script then the remainder of the code could be reduced to something like

    while ( <$fh> ) {
        next unless m{$rxMatchItems};
        print $writeFh $_;
    }

    This approach all depends on whether your items can easily be combined into one regex, but I think it might be more efficient if feasible.

    I hope this is of use.

    Cheers,

    JohnGG

Re: how to parse large files
by roboticus (Chancellor) on Mar 29, 2007 at 18:06 UTC
    domcyrus:

    In addition to all the other notes you have, you might want to change the structure of your loop a bit. Rather than search over all the values in a hash, make an inverted hash and then just do a single lookup for each value. That way, you needn't execute nested loops.

    Specifically:

    #!/usr/bin/perl -w
    use strict;
    use warnings;

    my %lookingFor = (
        'lazy'  => ['lazy', 'tired'],
        'entry' => ['entry', 'opening', 'ingress'],
        'file'  => ['file', 'files', 'filehandle'],
        'such'  => ['such']
    );

    # Build an inverted hash with pointers from the individual items
    # to the matching key in lookingFor
    my %revLUP;
    for my $k (keys %lookingFor) {
        for my $v (@{$lookingFor{$k}}) {
            $revLUP{$v} = $k;
        }
    }

    while (my $buf = <DATA>) {
        # Print line if it has a 'magic word' in it
        print $buf if grep { defined $revLUP{$_} } split /\s+/, $buf;
    }

    __DATA__
    Now is the time for all good men
    to come to the aid of their party.
    The quick red fox jumped
    over the lazy brown dog.
    [tye]: yes, on Window or on Unix,
    the old file is still open so it is just its directory entry that gets clobbered
    [bart]: On Linux, you can unlink a
    file and the processes that have the file open, will still see the contents. I
    suspect the same happens here.
    [tye]: "busy" only seems to apply to executable files, talexb. no problem
    deleting files that are open (though
    Win32 C RTL /defaults/ to locking
    the file such that this is prevented)
    [blokhead]: in short, the filehandle
    is tied to an inode, not a filename
    [bart]: Meaning, the directory points to the new contents, and the old
    contents is unlinked (but visible). Is that correct?
    which yields:
    $ ./bigfile.pl
    over the lazy brown dog.
    the old file is still open so it is just its directory entry that gets clobbered
    file and the processes that have the file open, will still see the contents. I
    deleting files that are open (though
    the file such that this is prevented)
    [blokhead]: in short, the filehandle
    $
    I don't know if this method will save you any time or not, as I haven't done any benchmarking. In any case, it may have a lot to do with the number of items in %lookingFor, the performance of grep, etc. But if this helps at all, you can then look for further speedups.

    --roboticus

Re: how to parse large files
by planetscape (Chancellor) on Mar 29, 2007 at 14:26 UTC
Re: how to parse large files
by Krambambuli (Curate) on Mar 30, 2007 at 08:32 UTC
    I'm a bit surprised that none of the answers so far has mentioned or asked about Devel::DProf. So: have you tried running your script under the profiler? In my experience, it can sometimes be a surprise to see where most of the time is spent.

      The bottleneck's largely IO. If it were up to me, I would start by running a simple test with the time utility to see how much of the execution time went to executing the program versus waiting on IO and then decide whether it's worth profiling the code.
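      For example (the script name here is made up):

      $ time ./parse_logs.pl            # rough split of wall-clock vs. CPU time
      $ perl -d:DProf ./parse_logs.pl   # writes profiling data to tmon.out
      $ dprofpp                         # summarizes where the time was spent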

Re: how to parse large files
by leocharre (Priest) on Mar 29, 2007 at 13:16 UTC
    I've read something in the past that really comes to mind here: "perl iterators". It's a way of coding that handles a large task one piece at a time. I suggest looking into this - first hit on google
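    As a rough sketch of the idea (not taken from that article), an iterator can be as simple as a closure that hands back one line at a time:

    use strict;
    use warnings;

    # Minimal line iterator: a closure over an open filehandle.
    sub make_line_iterator {
        my ($file) = @_;
        open my $fh, '<', $file or die "Can't open $file: $!";
        return sub { return scalar <$fh> };   # returns undef at end of file
    }

    my $next_line = make_line_iterator('largeLogFile.log');
    while ( defined( my $line = $next_line->() ) ) {
        # ... match $line against the patterns here ...
    }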
      Thanks a lot for all of your answers. First of all, the code I've posted was just some pseudo code; I didn't want you to read too much. So in my script I am not opening a filehandle all the time. Also, the IO wait is very low, so it is really just the CPU that has a lot of work. It seems that I

      I will have a look now at the iterators thing, which looks a little bit confusing to me as I am not that familiar with Perl at all.

      Cheers
      -Marco
        Thanks a lot for all of your answers.

        You're welcome! But then, please see below.

        First of all, the code I've posted was just some pseudo code; I didn't want you to read too much.

        That is very kind of you, but going the pseudo code route is generally not the best thing to do: one is always advised to prepare minimal but complete examples instead.

        So in my script I am not opening a filehandle all the time

        So what? However minimal or "pseudo" you want to stay, putting the

        my $writeFh = new FileHandle ">> myout.log";

        line inside or outside of your loop takes exactly the same amount of space, and of keystrokes. So why not put it where it belongs in the first place?

        Also, the IO wait is very low, so it is really just the CPU that has a lot of work.

        Granted, your code has several bigger problems, e.g. the if (<$fh> =~ m/$item/) thingy, but are you sure? How can you tell?


        I will have a look now at the iterators thing, which looks a little bit confusing to me as I am not that familiar with Perl at all.

        As I wrote above, you're welcome. But why didn't you mention that you already asked the very same question (link @ GG) in clpmisc? Why didn't you mention there that you also asked here? What did they tell you? Why didn't you report back?

        For completeness, I'm pasting all the messages of the thread hereafter:

Re: how to parse large files
by newroz (Monk) on Mar 30, 2007 at 10:35 UTC
    Hi,
    The Tie::File module can be used to access the gigantic file on disk as if it were an array, instead of having to split the file into chunks and slurp them into memory.
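    A minimal sketch of how that might look (the file name comes from the original post; whether this is fast enough for a 20GB file would need testing):

    use strict;
    use warnings;
    use Tie::File;

    # Present the log file as an array without slurping it into memory.
    tie my @lines, 'Tie::File', 'largeLogFile.log'
        or die "Can't tie largeLogFile.log: $!";

    for my $line (@lines) {
        # ... match $line against the patterns here ...
    }

    untie @lines;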
Re: how to parse large files
by kingkongrevenge (Scribe) on Mar 31, 2007 at 15:16 UTC

    Use compiled regular expressions as the hash keys. This stops you from rebuilding the regex on every iteration. Alternatively, memoize buildRegexp. Or Tie::RegexpHash might be useful.

    You can eliminate a for loop and the long ||'ed regexp by changing the data structure. Don't group the subset names in an array to be used as the key. Put each subset name in the hash as a single key but have the value of the hash as a reference to the array. Equivalent subset names point to the same array.

    my %lookingFor;
    my $ref = [1, 2, 3];
    $lookingFor{qr/a.*blah/o}   = $ref;
    $lookingFor{qr/the.*blee/o} = $ref;

    # Tie::RegexpHash would let you do $lookingFor{$bigfileline}
    # without looping over keys() at all. No idea if it's faster.
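    A sketch of how the read loop might then use that structure ($fh and $writeFh are assumed to be opened as in the other replies; a hash key made from qr// stringifies back to a usable pattern):

    while ( my $line = <$fh> ) {
        for my $pattern ( keys %lookingFor ) {
            if ( $line =~ /$pattern/ ) {
                # $lookingFor{$pattern} is the shared array ref for this subset
                print $writeFh $line;
                last;    # stop after the first matching pattern
            }
        }
    }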

    The bottleneck is probably IO, but if it's not, a more sophisticated data structure might help. You're not really using the hash as a hash, but as an array of arrays, i.e. an associative list. Is %lookingFor really big? If it is, you might work out a hierarchy and turn it into a tree so you don't have to search linearly.