there are a few pitfalls to using the code posted above. i'd like to take a minute to explain some of them to everyone, in no particular order.
- you'll have a hard time tracking down bugs unless you die or warn on failed opens and closes. use strict and warnings for the same reason.
- don't use map or grep in a void context. it's returning something, and you're just throwing it away. read a little more about that at the faq. i'd replace
@molecules = <MOL>;
map { chomp; $_ = quotemeta; } @molecules;
with
my @molecules = map { chomp; quotemeta } <MOL>;
anyway, if i wanted to be fancy.
- your $bigRegex can fail depending on the order of the elements. consider (i'm making this up) /ABC|AABC/. ABC will also match inside AABC, which it probably should not. ABC will also match the start of ABCD, and surely that's not right. replace
$bigRegex = join '|', @molecules;
with
my $bigRegex = join '|', map { '\b' . $_ . '\b' } @molecules;
to test for word boundaries. (note the single quotes: in a double-quoted string, \b is a backspace, not a word boundary.) also, i have a feeling
my $bigRegex = join '|', map { '\b' . $_ . '\b' } sort { length $b <=> length $a } @molecules;
will speed up the regex by testing the longest words first, but i may be wrong.
- the original poster asked for lines with fewer than 6 fields to be ignored, so the if condition should check for >= instead of ==.
- i believe your regex is incorrect, although it's hard to judge the original poster's idea of valid data. if it's okay to have empty values for the first two fields, the regex will fail. /^([^\t]+\t)?($bigRegex)/ matches the beginning of the line, followed by an optional group of ( one or more non-tab characters followed by a tab ). if the first field is empty, this fails. use /^([^\t]*\t)?($bigRegex)/ instead (a * instead of a +).
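to make the last few points concrete, here is a minimal sketch of building the alternation and comparing the + and * versions of the pattern. the molecule names, the sample lines and the helper build_regex are all made up for illustration; none of them come from the original poster's data.

```perl
use strict;
use warnings;

# Build one alternation, longest name first, each name wrapped in \b.
# Single quotes keep '\b' as a regex word boundary; inside a
# double-quoted string "\b" would be a literal backspace character.
# build_regex is a hypothetical helper, not part of the original code.
sub build_regex {
    my @names = sort { length $b <=> length $a } @_;
    return join '|', map { '\b' . $_ . '\b' } @names;
}

# Hypothetical molecule names, escaped the same way as the real list.
my @molecules = map { quotemeta } qw( ABC AABC );
my $bigRegex  = build_regex(@molecules);

# '+' requires a non-empty first field, '*' also accepts an empty one.
my $strict  = qr/^(?:[^\t]+\t)?(?:$bigRegex)/;
my $lenient = qr/^(?:[^\t]*\t)?(?:$bigRegex)/;

print "regex: $bigRegex\n";
for my $line ( "ABC\tx", "\tABC\tx", "ABCD\tx", "id1\tAABC\tx" ) {
    ( my $shown = $line ) =~ s/\t/\\t/g;    # make the tabs visible
    printf "%-14s strict=%s lenient=%s\n", $shown,
        ( $line =~ $strict  ? 'yes' : 'no' ),
        ( $line =~ $lenient ? 'yes' : 'no' );
}
```

the line with the empty first field only matches the * version, and ABCD is rejected by both because of the trailing \b.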
all in all, your code will work with a few modifications. i find it a little obfuscated, though. here it is, with the changes i've suggested.
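#!/usr/bin/perl -w
use strict;

# read the molecule names, strip newlines, escape regex metacharacters
open( MOL, "molecules" ) or die "ack! - $!";
my @molecules = map { chomp; quotemeta } <MOL>;
close MOL or warn "ack - $!";

# longest names first; single quotes keep \b a word boundary, not a backspace
my $bigRegex = join '|',
    map  { '\b' . $_ . '\b' }
    sort { length $b <=> length $a } @molecules;

open( LOCUS, "locus" ) or die "ack! - $!";
my @locus = <LOCUS>;
close LOCUS or die "ack! - $!";

for (@locus)
{
    # count the fields via an array; a limit of -1 keeps trailing empty fields
    my @fields = split /\t/, $_, -1;
    print if @fields >= 6
        && /^([^\t]*\t)?($bigRegex)/;
}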
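the field-count test on its own can be sketched like this. count_fields and the sample lines are hypothetical, and keeping trailing empty fields (the -1 limit) is my assumption about what the original poster wants; without a limit, split drops trailing empty fields.

```perl
use strict;
use warnings;

# Count the tab-separated fields in one line.
# The -1 limit keeps trailing empty fields in the count (an assumption
# about the desired behaviour); split would otherwise discard them.
sub count_fields {
    my ($line) = @_;
    chomp $line;
    my @fields = split /\t/, $line, -1;
    return scalar @fields;
}

# Hypothetical sample lines, not taken from the original data.
for my $line ( "a\tb\tc\td\te\tf\n", "a\tb\tc\n", "a\t\tc\t\t\t\n" ) {
    my $n = count_fields($line);
    print $n >= 6 ? "keep ($n fields)\n" : "skip ($n fields)\n";
}
```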
by the way, i like your use of ?() in the regex. i recommend readers investigate this powerful construct by reading about it in perlre.
~Particle *accelerates*