Re: Parsing a tab delimited file
by davis (Vicar) on May 09, 2002 at 11:16 UTC
|
Hi, If you show us your current code, we'd probably be able to help.
cheers
davis
Is this going out live?
No, Homer, very few cartoons are broadcast live - it's a terrible strain on the animator's wrist
| [reply] |
|
|
my @locus_small = ();
my $line;
foreach $line (@locus) {
my @tokens = split(/\t+/, $line);
unless(scalar @tokens < 6) {
push(@locus_small, "$tokens[0]\t$tokens[1]\t");
}
}
Then I search and print the @found which is based on locus_small but I want the @locus which matched.
So really what I want to do is only search a few columns for a match but print the whole row if there is a match.
foreach my $molecule (@molecules) {
my @found = grep /\Q$molecule\E/i, @locus_small;
if (@found) {
print OUTDATA ($molecule, ": \n\t", join "\t", @found);
}
}
| [reply] [d/l] [select] |
|
|
you're doing a lot of extra work in your code. my example below will read your set of locii into an array, create a set of molecules, and print the full locus record if a molecule is found in the first two tokens. records with less than six fields are skipped.
i skip the interim array (@locus_small in your code,) and use nested fors instead of grep, because i think it makes more sense. the really tricky bit is ( @{[]} = split /\s/ ) < 6, but i think my comments should help everyone understand what i'm doing.
the main loop is effectively six lines of code, which should be all you need. oh, and yes, i split on single space instead of tab in my example -- i'm too lazy to change the settings in my editor to spit out tabs instead of spaces ;-) enjoy!
#!/usr/bin/perl -w
use strict;
$|++;
# create phony filehandle named OUTDATA (use STDOUT for debugging)
*OUTDATA = *STDOUT;
# create phony molecule list
my @phony_molecule_list = qw( abc def abcd abc );
# create hash of molecules, to avoid duplicates
my %molecules;
# populate hash of molecules
@molecules{@phony_molecule_list}++;
# create list of locii, from DATA filehandle
chomp( my @locii = <DATA> );
# for each molecule, sorted by longest word first
for my $molecule (sort { length $b <=> length $a } keys %molecules)
{
LOCUS: for(@locii)
{
# skip if less than xxx tokens
# i need to fake out split to get number of fields. usually
# you can force list context by () = ..., but this doesn't
# work with split. so, i force split to return its output
# to an anonymous array (list context,) then evaluate the
# anonymous array in scalar context to get number of element
+s.
next LOCUS if( ( @{[]} = split /\s/ ) < 6 );
# get first two fields (assumes at least three fields exist)
my($test4match) = ( /(.+?\s.+?)\s/ );
# match a molecule (whole words only), and print the line
if( $test4match =~ /\b\Q$molecule\E\b/ )
{
print OUTDATA $_,$/;
}
}
}
__DATA__
abcd ghi 1 2 3 4
xyx yxy a b c
abc xyx z y x w
efg def 5 6 7 8
abc c o
deg abc 9 0
1 abd abc x x
~Particle *accelerates* | [reply] [d/l] [select] |
Re: Parsing a tab delimited file
by hotshot (Prior) on May 09, 2002 at 11:22 UTC
|
didn't quite understood, but here goes:
while (<INPUT>) {
chomp;
($first, $second) = /^(first column)+\s+(second column)+.*/; # I d
+on't know what should be matched (digits, chars, ..)
&saveColumns($first, $second);
print "whole row: $_\n";
}
Is this what you ment?
Thanks.
Hotshot | [reply] [d/l] |
Re: Parsing a tab delimited file
by choocroot (Friar) on May 09, 2002 at 11:31 UTC
|
use the join() function to recompose the splitted fields ?
while( my $line = <INPUT> ) {
# get your columns
my @fields = split( /\t/, $line );
doWhatYouWantWith( $fields[0], $fields[1] );
# recompose your line from the modified columns
# with a tab separator
print join( "\t", @fields );
}
| [reply] [d/l] |
|
|
| [reply] [d/l] |
Re: Parsing a tab delimited file
by graff (Chancellor) on May 10, 2002 at 06:35 UTC
|
So, you have a "locus" file containing lines with 6 tab-delimited
fields (and maybe some lines that don't fit that description),
and you have a "molecules" file with a list of target strings, and your
goal is to print the 6-field "locus" lines that contain any of the
"molecule" strings in the first or second field.
If I got that right, then I think something like the following
will do:
open( MOL, "molecules" );
@molecules = <MOL>;
close MOL;
map { chomp; $_ = quotemeta; } @molecules;
$bigRegex = join '|', @molecules;
open( LOCUS, "locus" );
@locus = <LOCUS>;
close LOCUS;
foreach (@locus) {
print if ( scalar( split /\t/ ) == 6 &&
/^([^\t]+\t)?($bigRegex)/ );
}
You may want to look at "qr" on the perlop
manpage (under "Regexp Quote-like Operators), if you have a lot of molecule patterns to look for
and/or a lot of locus data to go through. The "print if"
condition says that there must be 6 fields on the line,
and the set of molecule strings in $bigRegex needs to match
either at the beginning of each line, or after the first
tab character.
| [reply] [d/l] |
|
|
there are a few pitfalls to using the code posted above. i'd like to take a minute to explain some of them to everyone, in no particular order.
- you'll have a hard time tracking down bugs unless you die or warn on failed opens and closes. use strict and warnings for the same reason.
- don't use map or grep in a void context. it's returning something, and you're just throwing it away. read a little more about that at the faq. i'd replace
@molecules = <MOL>;
map { chomp; $_ = quotemeta; } @molecules;
with
push my @molecules, quotemeta chomp while <MOL>;
anyway, if i wanted to be fancy.
- your $bigRegex can fail depending on the order of the elements. consider (i'm making this up) /ABC|AABC/. ABC will also match AABC, it probably should not. ABC will also match ABCD, and surely that's not right. replace
$bigRegex = join '|', @molecules;
with
my $bigRegex;
($bigRegex .= join( '|', '\b'. $_ . '\b' ) ) for @molecules;
to test for word boundaries. also, i have a feelingmy $bigRegex;
$bigRegex = join '|', map { "\b$_\b" } sort { length $b <=> length $a
+} @molecules;
will speed up the regex by testing by longest words first, but i may be wrong.
- the original poster asked for fields < 6 to be ignored, so the if condition should check for >= instead of ==
- i believe your regex is incorrect. although it's hard to judge the original posters idea of valid data. if it's okay to have empty values for the first two fields, the regex will fail. /^([^\t]+\t)?($bigRegex)/ matches line begin, followed by a group of ( one or more non-tab characters followed by a tab ).... if the first field is empty, this fails. use /^([^\t]*\t)?($bigRegex)/ instead (a * instead of a +.)
all in all, your code will work with a few modifications. i find it a little obfuscated, though. here it is, with the changes i've suggested.
#!/usr/bin/perl -w
use strict;
open( MOL, "molecules" ) or die "ack! - $!";
push my @molecules, quotemeta chomp while <MOL>;
close MOL or warn "ack - $!";
my $bigRegex;
$bigRegex = join '|',
map { "\b$_\b" }
sort { length $b <=> length $a } @molecules;
open( LOCUS, "locus" ) or die "ack! - $!";
my @locus = <LOCUS>;
close LOCUS or die "ack! - $!";
for(@locus)
{
print if( scalar( split /\t/ ) >= 6
&& /^([^\t]*\t)?($bigRegex)/ );
}
by the way, i like your use of ?() in the regex. i recommend readers investigate this powerful construct by reading about it in perlre.
~Particle *accelerates* | [reply] [d/l] [select] |
|
|
push @m, quotemeta chomp while <DATA>;
print join("|",@m),$/;
__DATA__
AAA
BBB
&*(
Yields: 1|1|1
Maybe "map" in a void context isn't sexy, but it does work.
(I agree that grep in a void context would be silly.)
| [reply] [d/l] [select] |
|
|