in reply to Parsing a tab delimited file

So, you have a "locus" file containing lines with 6 tab-delimited fields (and maybe some lines that don't fit that description), and you have a "molecules" file with a list of target strings, and your goal is to print the 6-field "locus" lines that contain any of the "molecule" strings in the first or second field.

If I got that right, then I think something like the following will do:

open( MOL, "molecules" );
@molecules = <MOL>;
close MOL;
map { chomp; $_ = quotemeta; } @molecules;
$bigRegex = join '|', @molecules;
open( LOCUS, "locus" );
@locus = <LOCUS>;
close LOCUS;
foreach (@locus) {
    print if ( scalar( split /\t/ ) == 6 && /^([^\t]+\t)?($bigRegex)/ );
}
You may want to look at "qr" in the perlop manpage (under "Regexp Quote-Like Operators") if you have a lot of molecule patterns to look for and/or a lot of locus data to go through. The "print if" condition says that the line must have exactly 6 fields, and that one of the molecule strings in $bigRegex must match either at the beginning of the line or right after the first tab character.
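For what it's worth, here is a minimal sketch of the qr idea. It assumes the molecule strings are already in memory; the sample names and locus lines are made up for illustration, and the non-capturing (?: ) groups are my own substitution for the capturing ones above.

```perl
use strict;
use warnings;

# Hypothetical sample data; in the real script these come from the "molecules" file.
my @molecules = ( 'ABC', 'X+Y' );

# quotemeta escapes regex metacharacters so each string matches literally.
my $alternation = join '|', map { quotemeta($_) } @molecules;

# qr// compiles the pattern once; the grep below reuses the compiled form.
my $bigRegex = qr/^(?:[^\t]+\t)?(?:$alternation)/;

# Made-up locus lines, 6 tab-delimited fields each.
my @locus = (
    "ABC\tf2\tf3\tf4\tf5\tf6\n",    # molecule in field 1
    "f1\tX+Y\tf3\tf4\tf5\tf6\n",    # molecule in field 2
    "f1\tf2\tABC\tf4\tf5\tf6\n",    # molecule only in field 3
);

# Keep only the lines whose first or second field starts with a molecule string.
my @hits = grep { $_ =~ $bigRegex } @locus;
print @hits;
```

Only the first two sample lines should be printed, since the third has its molecule in the third field.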

Re^2: Parsing a tab delimited file
by particle (Vicar) on May 10, 2002 at 16:35 UTC
    there are a few pitfalls to using the code posted above. i'd like to take a minute to explain some of them to everyone, in no particular order.

    • you'll have a hard time tracking down bugs unless you die or warn on failed opens and closes. use strict and warnings for the same reason.
    • don't use map or grep in a void context. it's returning something, and you're just throwing it away. read a little more about that at the faq. i'd replace
      @molecules = <MOL>; map { chomp; $_ = quotemeta; } @molecules;
      with
      push my @molecules, quotemeta chomp while <MOL>;
      anyway, if i wanted to be fancy.
    • your $bigRegex can fail depending on the order of the elements. consider (i'm making this up) /ABC|AABC/. ABC will also match AABC, it probably should not. ABC will also match ABCD, and surely that's not right. replace
      $bigRegex = join '|', @molecules;
      with
      my $bigRegex = join '|', map { '\b' . $_ . '\b' } @molecules;
      to test for word boundaries. also, i have a feeling
      my $bigRegex = join '|', map { '\b' . $_ . '\b' } sort { length $b <=> length $a } @molecules;
      will speed up the regex by testing by longest words first, but i may be wrong.
    • the original poster asked for lines with fewer than 6 fields to be ignored, so the if condition should check for >= instead of ==
    • i believe your regex is incorrect, although it's hard to judge the original poster's idea of valid data. if it's okay to have empty values for the first two fields, the regex will fail. /^([^\t]+\t)?($bigRegex)/ matches line begin, optionally followed by a group of ( one or more non-tab characters followed by a tab ), so if the first field is empty, this fails. use /^([^\t]*\t)?($bigRegex)/ instead (a * instead of a +).
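a small sketch of the word-boundary point, using the made-up ABC / AABC names from above. two assumptions are baked in: the molecule names start and end with word characters (otherwise \b won't sit where you want it), and the \b is built with single quotes so it reaches the regex as backslash-b rather than as a backspace character.

```perl
use strict;
use warnings;

# made-up names from the example above
my @molecules = ( 'ABC', 'AABC' );

# plain alternation: nothing stops a match in the middle of a longer name
my $loose = join '|', map { quotemeta($_) } @molecules;

# longest first, each alternative wrapped in \b ... \b
my $strict = join '|',
    map  { '\b' . quotemeta($_) . '\b' }
    sort { length($b) <=> length($a) } @molecules;

# record and print which test strings each pattern accepts
my %result;
for my $field ( 'ABC', 'AABC', 'ABCD', 'XABC' ) {
    $result{$field} = {
        loose  => ( $field =~ /$loose/  ? 1 : 0 ),
        strict => ( $field =~ /$strict/ ? 1 : 0 ),
    };
    print "$field: loose=$result{$field}{loose} strict=$result{$field}{strict}\n";
}
```

the loose pattern accepts all four strings; the bounded one should accept only ABC and AABC.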
    all in all, your code will work with a few modifications. i find it a little obfuscated, though. here it is, with the changes i've suggested.

    #!/usr/bin/perl -w
    use strict;

    open( MOL, "molecules" ) or die "ack! - $!";
    push my @molecules, quotemeta chomp while <MOL>;
    close MOL or warn "ack - $!";

    my $bigRegex = join '|', map { '\b' . $_ . '\b' } sort { length $b <=> length $a } @molecules;

    open( LOCUS, "locus" ) or die "ack! - $!";
    my @locus = <LOCUS>;
    close LOCUS or die "ack! - $!";

    for (@locus) {
        print if ( scalar( split /\t/ ) >= 6 && /^([^\t]*\t)?($bigRegex)/ );
    }
    by the way, i like your use of ()? in the regex to make the first field optional. i recommend readers investigate this powerful construct by reading about it in perlre.

    ~Particle *accelerates*

      Thanks -- those are mostly good points... except there's a problem in the second item:
      I'd replace
      @molecules = <MOL>; map { chomp; $_ = quotemeta; } @molecules;
      with
      push my @molecules, quotemeta chomp while <MOL>;
      The problem with this replacement is that chomp modifies its arg ($_ in this case) in place but does not return the modified arg. It returns the number of characters removed, and that number is what gets passed to quotemeta and pushed onto @molecules.

      I just recently tried this sort of approach in an attempt to shorten a one-liner, and didn't get what you'd expect:

      push @m, quotemeta chomp while <DATA>;
      print join("|",@m),$/;
      __DATA__
      AAA
      BBB
      &*(
      Yields:
      1|1|1

      Maybe "map" in a void context isn't sexy, but it does work. (I agree that grep in a void context would be silly.)

        you're right! i should have tested before posting. i still think map in void context is a bad idea, so i came up with this:
        my @molecules = map { chomp; quotemeta } <DATA>;
        ...but is it sexy enough ;-)

        ~Particle *accelerates*
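To put the two versions side by side, here is a minimal sketch. It assumes the lines are held in an array rather than read from DATA, and it copies each line before chomping so the input list is left alone; the sample strings are the same made-up ones used in the test above.

```perl
use strict;
use warnings;

# stand-in for lines read from a filehandle
my @lines = ( "AAA\n", "BBB\n", "&*(\n" );

# chomp returns the number of characters it removed, so quotemeta sees "1"
my @wrong;
for my $line (@lines) {
    my $copy = $line;
    push @wrong, quotemeta( chomp($copy) );
}

# chomp first, then hand the trimmed string itself to quotemeta
my @right = map { my $s = $_; chomp $s; quotemeta($s) } @lines;

print join( '|', @wrong ), "\n";
print join( '|', @right ), "\n";
```

The first line printed is the 1|1|1 seen earlier; the second is the alternation that was wanted, with the metacharacters in the last string backslash-escaped.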