coldy has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, I have a list of terms that need to be searched for occurrences in all files within a directory - basically I would like the same behaviour as unix grep. I have written some code, but the section where I grep does not seem to work.
my @ids = <IDLIST>;
my $dir = $opts{d};
chdir($dir) or die "$!";
opendir(DIR, ".") or die "Can't open $dir: $!";
my @files = grep {/fa$/} readdir DIR;
close DIR;
foreach my $id (@ids) {
    foreach my $fasta (@files) {
        local @ARGV = @files;
        print grep (/$id/, <>), "\n";
    }
}
The reason I don't use unix grep is that I need to further post-process the result.

I know each $id in @ids does occur in one of the files, but in my code the result of print is always empty! Also, I would like to do this at the same speed as unix grep if possible. Any suggestions?

Thanks in advance, Chris.

Replies are listed 'Best First'.
Re: faster way to grep
by toolic (Bishop) on Mar 15, 2010 at 22:44 UTC
    I know each $id in @ids does occur in one of the files but in my code the result of print is always empty!
    The quickest thing to try is to chomp your @ids:
    my @ids = <IDLIST>;
    chomp @ids;

    You should not assume that one approach is faster than another; Benchmark it instead. Avoiding unix grep should make your code more portable, but you would have to measure whether it is faster with or without unix grep.
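
    For instance, a rough comparison with the core Benchmark module might look something like this (the sample data and the two candidate subs are placeholders, not the OP's actual code):

    use Benchmark qw(cmpthese);

    # Placeholder data; substitute your real ids and file contents.
    my @lines = ("some sequence data foo\n") x 10_000;
    my @ids   = qw(foo bar);

    cmpthese(-3, {                      # run each candidate for ~3 CPU seconds
        loop_per_id => sub {
            for my $id (@ids) { my @hits = grep { /$id/ } @lines }
        },
        combined_re => sub {
            my $re = join '|', @ids;
            my @hits = grep { /$re/ } @lines;
        },
    });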

    Also, I think you are unnecessarily looping through all your files multiple times. It seems there is no need for your foreach my $fasta (@files) { loop.

    Here is a slightly more conventional approach, in my opinion, which does not use Perl's grep:

    foreach my $id (@ids) {
        foreach my $fasta (@files) {
            open my $fh, '<', $fasta or die "can not open file $fasta: $!";
            while (<$fh>) {
                print if /$id/;
            }
            close $fh;
        }
    }
      Thanks, that worked. Will try some benchmarking now.
Re: faster way to grep
by Illuminatus (Curate) on Mar 16, 2010 at 00:59 UTC
    If your files are not too large, toolic's solution might be faster if you read the entire file into an array at once and loop through the array:
    my @fileData = <FILE>;
    foreach my $fileLine (@fileData) {
        print $fileLine if $fileLine =~ /$id/;    # same check as toolic's loop
    }
    Also, if you don't mind using unix grep, you can still use it and do post-processing, by opening the grep as a pipe:
    open (DATA, "grep <string> <file-list> |");
    while (<DATA>) {
        # post-process each matching line here
    }
Re: faster way to grep
by 7stud (Deacon) on Mar 16, 2010 at 08:36 UTC

    What do you get when you add:

    print "@ids\n"; print "@files\n";

    to your code?

    This doesn't seem like the right approach to me:

    foreach my $id (@ids) {
        foreach my $fasta (@files) {
            open my $fh, '<', $fasta or die "can not open file $fasta: $!";
            while (<$fh>) {
                print if /$id/;
            }
            close $fh;
        }
    }

    If there are 5 ids and 10 files, that code will open and close each file 5 times (50 opens in total). How about something like this:

    my $pattern = join '|', @ids;
    print grep {/$pattern/} <>;
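
    A slightly fuller sketch of the same idea (the chomp and quotemeta are additions on my part, assuming the ids are literal strings; @files comes from the OP's code):

    chomp @ids;                                   # strip newlines so the ids can match
    my $pattern = join '|', map { quotemeta } @ids;
    @ARGV = @files;                               # let <> read each file in turn
    print grep { /$pattern/ } <>;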
      Combining the @ids into a single regex and testing all of them at once is definitely the big win here: the largest performance gain over the OP's posted code comes from reading each file only once instead of re-reading it for every id.

      If you're dealing with a lot of ids and the single combined regex starts slowing down unacceptably, take a look at Regexp::Assemble for a way of building reasonably efficient regexes which check for a large number of target patterns in one shot. I've used it for up to ~500 target words/phrases at a time and I see no reason why it shouldn't perform well with much larger target sets.
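
      For illustration, a minimal sketch of how that might look here (reusing the OP's IDLIST filehandle and @files, and assuming the ids are literal strings, hence the quotemeta):

      use Regexp::Assemble;

      chomp( my @ids = <IDLIST> );
      my $ra = Regexp::Assemble->new;
      $ra->add( quotemeta $_ ) for @ids;
      my $re = $ra->re;                    # one compiled regex matching any id

      @ARGV = @files;
      print grep { /$re/ } <>;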

Re: faster way to grep
by hok_si_la (Curate) on Mar 16, 2010 at 16:07 UTC
    Hey Chris,

    I may be a bit late on this, but you might try modifying and using 'ack' (check the license obviously). I asked a similar question a few months ago and several monks suggested I give it a shot. IMHO it is much better than grep.

    Better than Grep

    Best of luck,
    Hok_si_la