Re^2: Parsing a tab delimited file

there are a few pitfalls to using the code posted above. i'd like to take a minute to explain some of them to everyone, in no particular order.

you'll have a hard time tracking down bugs unless you die or warn on failed opens and closes. use strict and warnings for the same reason.
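a minimal sketch of what checked opens and closes can look like. the temp file is only a stand-in for the "molecules" file so the example runs anywhere, and the lexical filehandle and three-argument open are my own choices here, not something the code under discussion uses.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# hypothetical stand-in for the "molecules" file
my ( $tmp, $path ) = tempfile( UNLINK => 1 );
print {$tmp} "ABC\nAABC\n";
close $tmp or die "can't close $path: $!";

# include the file name and $! so a failure says what went wrong and where
open( my $mol, '<', $path ) or die "can't open $path: $!";
chomp( my @molecules = <$mol> );
close $mol or warn "can't close $path: $!";

print scalar(@molecules), " molecules read\n";

# a failed open returns false and sets $!, so it can be caught right here
# instead of showing up later as a mysteriously empty list
my $ok = open( my $missing, '<', "$path.does-not-exist" );
print "open failed as expected: $!\n" unless $ok;
```

without the `or die`, the second open would fail silently and every later read from that handle would just return nothing.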
don't use map or grep in a void context. it's returning something, and you're just throwing it away. read a little more about that at the faq. i'd replace
```
@molecules = <MOL>;
map { chomp; $_ = quotemeta; } @molecules;
```
with
```
push my @molecules, quotemeta chomp while <MOL>;
```
anyway, if i wanted to be fancy.
your $bigRegex can fail depending on the order of the elements. consider (i'm making this up) /ABC|AABC/. the alternative ABC also matches inside AABC, which it probably should not. it also matches the start of ABCD, and surely that's not right. replace
```
$bigRegex = join '|', @molecules;
```
with
```
my $bigRegex = join '|', map { '\b' . $_ . '\b' } @molecules;
```
to test for word boundaries. also, i have a feeling
```
my $bigRegex = join '|',
    map  { '\b' . $_ . '\b' }
    sort { length $b <=> length $a } @molecules;
```
will speed up the regex by trying the longest words first, but i may be wrong.
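here's a small sketch of the word-boundary idea, using the made-up names ABC and AABC from above. one thing to watch: the `\b` has to be built in single quotes (or written as `\\b`), because inside a double-quoted string `"\b"` is a backspace character, not a regex word boundary. the test fields are invented for illustration.

```perl
use strict;
use warnings;

# made-up molecule names, following the ABC / AABC example
my @molecules = map { quotemeta } qw( ABC AABC );

# single quotes keep \b as a regex word boundary
my $bigRegex = join '|',
    map  { '\b' . $_ . '\b' }
    sort { length $b <=> length $a } @molecules;

print "$bigRegex\n";    # \bAABC\b|\bABC\b

# check each field the way the anchored regex would see it
my %result;
for my $field (qw( ABC AABC ABCD XABC )) {
    $result{$field} = $field =~ /^(?:$bigRegex)/ ? 'matches' : 'no match';
    printf "%-5s %s\n", $field, $result{$field};
}
```

ABC and AABC match, while ABCD and XABC do not, which is the behaviour the plain `join '|'` version can't give you.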
the original poster asked for lines with fewer than 6 fields to be ignored, so the if condition should check for >= instead of ==
i believe your regex is incorrect, although it's hard to judge the original poster's idea of valid data. if it's okay to have empty values for the first two fields, the regex will fail. /^([^\t]+\t)?($bigRegex)/ matches the beginning of the line, followed by an optional group of ( one or more non-tab characters followed by a tab ).... if the first field is empty, this fails. use /^([^\t]*\t)?($bigRegex)/ instead (a * instead of a +.)
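to make the last two points concrete, here's a sketch with an invented six-field line whose first field is empty. the one-name `$bigRegex` is only a stand-in for the real alternation, and the `-1` limit passed to split is my own addition: it keeps trailing empty fields in the count, which may or may not be what the original poster wants.

```perl
use strict;
use warnings;

my $bigRegex = '\bABC\b';              # stand-in for the real alternation
my $line     = "\tABC\tx\ty\tz\tw\n";  # six fields, the first one empty

# + needs at least one character before the first tab, * does not
my $plus = $line =~ /^([^\t]+\t)?($bigRegex)/ ? 1 : 0;
my $star = $line =~ /^([^\t]*\t)?($bigRegex)/ ? 1 : 0;
print "with +: $plus, with *: $star\n";    # with +: 0, with *: 1

# count the fields on a chomped copy; -1 keeps trailing empty fields
chomp( my $copy = $line );
my @fields = split /\t/, $copy, -1;
print scalar(@fields), " fields\n";        # 6 fields
```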

all in all, your code will work with a few modifications. i find it a little obfuscated, though. here it is, with the changes i've suggested.

#!/usr/bin/perl -w
use strict;

open( MOL, "molecules" ) or die "ack! - $!";
push my @molecules, quotemeta chomp while <MOL>;
close MOL or warn "ack - $!";

my $bigRegex = join '|', 
    map  { '\b' . $_ . '\b' } 
    sort { length $b <=> length $a } @molecules;

open( LOCUS, "locus" ) or die "ack! - $!";
my @locus = <LOCUS>;
close LOCUS or die "ack! - $!";

for(@locus) 
{
    print if( scalar( split /\t/ ) >= 6 
        && /^([^\t]*\t)?($bigRegex)/ );
}

by the way, i like your use of ?() in the regex. i recommend readers investigate this powerful construct by reading about it in perlre.

~Particle *accelerates*


Re: Re^2: Parsing a tab delimited file by graff (Chancellor) on May 13, 2002 at 20:59 UTC
Thanks -- those are mostly good points... except there's a problem in the second item, where you'd replace `@molecules = <MOL>; map { chomp; $_ = quotemeta; } @molecules;` with `push my @molecules, quotemeta chomp while <MOL>;`

The problem with this replacement is that chomp simply modifies its arg ($_ in this case), but does not return the modified arg. It returns the number of characters it removed, and that count is what gets passed to quotemeta and pushed onto @molecules. I just recently tried this sort of approach in an attempt to shorten a one-liner, and didn't get what you'd expect: `push @m, quotemeta chomp while <DATA>; print join("\|",@m),$/;` with the three lines AAA, BBB and &*( under `__DATA__`.

Yields: 1\|1\|1

Maybe "map" in a void context isn't sexy, but it does work. (I agree that grep in a void context would be silly.)
Re^4: Parsing a tab delimited file by particle (Vicar) on May 13, 2002 at 22:37 UTC
you're right! i should have tested before posting. i still think map in void context is a bad idea, so i came up with this: `my @molecules = map { chomp; quotemeta } <DATA>;` ...but is it sexy enough ;-)

~Particle accelerates
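for anyone following along, here's a small sketch that puts the two versions side by side. it uses an in-memory list of the three sample lines from the reply above instead of a filehandle, which is just a convenience for the example.

```perl
use strict;
use warnings;

my @lines = ( "AAA\n", "BBB\n", "&*(\n" );

# chomp returns the number of characters removed (1 for each "\n"),
# so quotemeta never sees the actual string
my @copy  = @lines;
my @wrong = map { quotemeta( chomp $_ ) } @copy;
print join( '|', @wrong ), "\n";    # 1|1|1

# chomp first, then let quotemeta work on the chomped $_
my @right = map { chomp; quotemeta } @lines;
print join( '|', @right ), "\n";    # AAA|BBB|\&\*\(
```

note that map aliases `$_` to each element, so the second map also chomps `@lines` itself. with `<DATA>` as the source there's no array to modify, so that side effect doesn't come up.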