in reply to faster way to grep

What do you get when you add:

    print "@ids\n";
    print "@files\n";

to your code?

This doesn't seem like the right approach to me:

    foreach my $id (@ids) {
        foreach my $fasta (@files) {
            open my $fh, '<', $fasta or die "can not open file $fasta: $!";
            while (<$fh>) {
                print if /$id/;
            }
            close $fh;
        }
    }

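If there are 5 ids and 10 files, that code will open, read, and close each file 5 times, for 50 passes in total. How about something like this, which joins all the ids into one pattern and reads each input file only once (<> reads the files named on the command line, or STDIN if there are none):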

    my $pattern = join '|', @ids;
    print grep {/$pattern/} <>;
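
If you want to keep reading from @files rather than the command line, here is a rough sketch of the same idea. The sub names build_pattern and grep_files are made up for illustration, and I am assuming the ids are literal strings rather than regexes, hence the quotemeta:

```perl
use strict;
use warnings;

# Build one compiled regex that matches any of the ids.
# quotemeta escapes regex metacharacters (such as '.' or '|') inside an id,
# on the assumption that ids are meant to be matched literally.
sub build_pattern {
    my @ids = @_;
    return qr/(?!)/ unless @ids;    # no ids: a pattern that never matches
    my $alternation = join '|', map { quotemeta } @ids;
    return qr/$alternation/;
}

# Return every line that matches the pattern, reading each file exactly once.
sub grep_files {
    my ($pattern, @files) = @_;
    my @matches;
    for my $file (@files) {
        open my $fh, '<', $file or die "cannot open file $file: $!";
        while (my $line = <$fh>) {
            push @matches, $line if $line =~ $pattern;
        }
        close $fh;
    }
    return @matches;
}
```

Usage would be something like: print grep_files(build_pattern(@ids), @files);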

Re^2: faster way to grep
by dsheroh (Monsignor) on Mar 16, 2010 at 11:19 UTC
    Combining the @ids into a single regex and testing all of them at once is definitely the big win here. The absolute biggest performance gain you can get from the OP's posted code is to fix it so that each file is only read once instead of re-reading it for each id.

    If you're dealing with a lot of ids and the single combined regex starts slowing down unacceptably, take a look at Regexp::Assemble for a way of building reasonably efficient regexes which check for a large number of target patterns in one shot. I've used it for up to ~500 target words/phrases at a time and I see no reason why it shouldn't perform well with much larger target sets.
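
    For what it's worth, a minimal sketch of how that might look. The sub name assemble_pattern is invented for this example, the ids are assumed to be literal strings, and the code falls back to a plain alternation when Regexp::Assemble is not installed:

```perl
use strict;
use warnings;

# Build a single regex for a large set of ids.
# Uses Regexp::Assemble (a CPAN module, not core) when it can be loaded;
# otherwise falls back to a plain joined alternation.
sub assemble_pattern {
    my @ids = @_;
    return qr/(?!)/ unless @ids;    # no ids: a pattern that never matches

    if (eval { require Regexp::Assemble; 1 }) {
        my $ra = Regexp::Assemble->new;
        # Escape each id so it is treated as literal text, not as a regex.
        $ra->add(quotemeta $_) for @ids;
        return $ra->re;             # compiled regex covering all the ids
    }

    my $alternation = join '|', map { quotemeta } @ids;
    return qr/$alternation/;
}
```

    The returned regex drops into the same single-pass loop: print if $_ =~ $pattern; while reading each file once.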