Parsing a tab delimited file

snowy has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Parsing a tab delimited file by davis (Vicar) on May 09, 2002 at 11:16 UTC
Hi, If you show us your current code, we'd probably be able to help. cheers davis Is this going out live? No, Homer, very few cartoons are broadcast live - it's a terrible strain on the animator's wrist	[reply]
Re: Re: Parsing a tab delimited file by snowy (Sexton) on May 09, 2002 at 11:33 UTC
Here is the code I have for opening the file and parsing the tab @ molecules and @locus are text files which are loaded into arrays `my @locus_small = (); my $line; foreach $line (@locus) { my @tokens = split(/\t+/, $line); unless(scalar @tokens < 6) { push(@locus_small, "$tokens[0]\t$tokens[1]\t"); } }` [download] Then I search and print the @found which is based on locus_small but I want the @locus which matched. So really what I want to do is only search a few columns for a match but print the whole row if there is a match. `foreach my $molecule (@molecules) { my @found = grep /\Q$molecule\E/i, @locus_small; if (@found) { print OUTDATA ($molecule, ": \n\t", join "\t", @found); } }` [download]	[reply] [d/l] [select]
Re: Re: Re: Parsing a tab delimited file by particle (Vicar) on May 09, 2002 at 13:41 UTC
you're doing a lot of extra work in your code. my example below will read your set of locii into an array, create a set of molecules, and print the full locus record if a molecule is found in the first two tokens. records with less than six fields are skipped. i skip the interim array (@locus_small in your code,) and use nested fors instead of grep, because i think it makes more sense. the really tricky bit is `( @{[]} = split /\s/ ) < 6`, but i think my comments should help everyone understand what i'm doing. the main loop is effectively six lines of code, which should be all you need. oh, and yes, i split on single space instead of tab in my example -- i'm too lazy to change the settings in my editor to spit out tabs instead of spaces ;-) enjoy! #!/usr/bin/perl -w use strict; $\|++; # create phony filehandle named OUTDATA (use STDOUT for debugging) OUTDATA = STDOUT; # create phony molecule list my @phony_molecule_list = qw( abc def abcd abc ); # create hash of molecules, to avoid duplicates my %molecules; # populate hash of molecules @molecules{@phony_molecule_list}++; # create list of locii, from DATA filehandle chomp( my @locii = <DATA> ); # for each molecule, sorted by longest word first for my $molecule (sort { length $b <=> length $a } keys %molecules) { LOCUS: for(@locii) { # skip if less than xxx tokens # i need to fake out split to get number of fields. usually # you can force list context by () = ..., but this doesn't # work with split. so, i force split to return its output # to an anonymous array (list context,) then evaluate the # anonymous array in scalar context to get number of element +s. next LOCUS if( ( @{[]} = split /\s/ ) < 6 ); # get first two fields (assumes at least three fields exist) my($test4match) = ( /(.+?\s.+?)\s/ ); # match a molecule (whole words only), and print the line if( $test4match =~ /\b\Q$molecule\E\b/ ) { print OUTDATA $_,$/; } } } __DATA__ abcd ghi 1 2 3 4 xyx yxy a b c abc xyx z y x w efg def 5 6 7 8 abc c o deg abc 9 0 1 abd abc x x [download] ~Particle accelerates	[reply] [d/l] [select]
Re: Parsing a tab delimited file by hotshot (Prior) on May 09, 2002 at 11:22 UTC
didn't quite understood, but here goes: `while (<INPUT>) { chomp; ($first, $second) = /^(first column)+\s+(second column)+.*/; # I d +on't know what should be matched (digits, chars, ..) &saveColumns($first, $second); print "whole row: $_\n"; }` [download] Is this what you ment? Thanks. Hotshot	[reply] [d/l]
Re: Parsing a tab delimited file by choocroot (Friar) on May 09, 2002 at 11:31 UTC
use the join() function to recompose the splitted fields ? `while( my $line = <INPUT> ) { # get your columns my @fields = split( /\t/, $line ); doWhatYouWantWith( $fields[0], $fields[1] ); # recompose your line from the modified columns # with a tab separator print join( "\t", @fields ); }` [download]	[reply] [d/l]
Re: Re: Parsing a tab delimited file by tachyon (Chancellor) on May 09, 2002 at 11:59 UTC
Or perhaps just `print $line` perhaps? Why regenerate it - it's there already! cheers tachyon s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print	[reply] [d/l]
Re: Parsing a tab delimited file by graff (Chancellor) on May 10, 2002 at 06:35 UTC
So, you have a "locus" file containing lines with 6 tab-delimited fields (and maybe some lines that don't fit that description), and you have a "molecules" file with a list of target strings, and your goal is to print the 6-field "locus" lines that contain any of the "molecule" strings in the first or second field. If I got that right, then I think something like the following will do: `open( MOL, "molecules" ); @molecules = <MOL>; close MOL; map { chomp; $_ = quotemeta; } @molecules; $bigRegex = join '\|', @molecules; open( LOCUS, "locus" ); @locus = <LOCUS>; close LOCUS; foreach (@locus) { print if ( scalar( split /\t/ ) == 6 && /^([^\t]+\t)?($bigRegex)/ ); }` [download] You may want to look at "qr" on the perlop manpage (under "Regexp Quote-like Operators), if you have a lot of molecule patterns to look for and/or a lot of locus data to go through. The "print if" condition says that there must be 6 fields on the line, and the set of molecule strings in $bigRegex needs to match either at the beginning of each line, or after the first tab character.	[reply] [d/l]
Re^2: Parsing a tab delimited file by particle (Vicar) on May 10, 2002 at 16:35 UTC
there are a few pitfalls to using the code posted above. i'd like to take a minute to explain some of them to everyone, in no particular order. you'll have a hard time tracking down bugs unless you die or warn on failed opens and closes. use strict and warnings for the same reason. don't use map or grep in a void context. it's returning something, and you're just throwing it away. read a little more about that at the faq. i'd replace `@molecules = <MOL>; map { chomp; $_ = quotemeta; } @molecules;` [download] with `push my @molecules, quotemeta chomp while <MOL>;` [download] anyway, if i wanted to be fancy. your $bigRegex can fail depending on the order of the elements. consider (i'm making this up) /ABC\|AABC/. ABC will also match AABC, it probably should not. ABC will also match ABCD, and surely that's not right. replace `$bigRegex = join '\|', @molecules;` [download] with `my $bigRegex; ($bigRegex .= join( '\|', '\b'. $_ . '\b' ) ) for @molecules;` [download] to test for word boundaries. also, i have a feeling `my $bigRegex; $bigRegex = join '\|', map { "\b$_\b" } sort { length $b <=> length $a +} @molecules;` [download] will speed up the regex by testing by longest words first, but i may be wrong. the original poster asked for fields `< 6` to be ignored, so the if condition should check for `>=` instead of `==` i believe your regex is incorrect. although it's hard to judge the original posters idea of valid data. if it's okay to have empty values for the first two fields, the regex will fail. `/^([^\t]+\t)?($bigRegex)/` matches line begin, followed by a group of ( one or more non-tab characters followed by a tab ).... if the first field is empty, this fails. use `/^([^\t]\t)?($bigRegex)/` instead (a instead of a +.) all in all, your code will work with a few modifications. i find it a little obfuscated, though. here it is, with the changes i've suggested. `#!/usr/bin/perl -w use strict; open( MOL, "molecules" ) or die "ack! - $!"; push my @molecules, quotemeta chomp while <MOL>; close MOL or warn "ack - $!"; my $bigRegex; $bigRegex = join '\|', map { "\b$_\b" } sort { length $b <=> length $a } @molecules; open( LOCUS, "locus" ) or die "ack! - $!"; my @locus = <LOCUS>; close LOCUS or die "ack! - $!"; for(@locus) { print if( scalar( split /\t/ ) >= 6 && /^([^\t]\t)?($bigRegex)/ ); }` [download] by the way, i like your use of `?()` in the regex. i recommend readers investigate this powerful construct by reading about it in perlre. ~Particle accelerates*	[reply] [d/l] [select]
Re: Re^2: Parsing a tab delimited file by graff (Chancellor) on May 13, 2002 at 20:59 UTC
Thanks -- those are mostly good points... except there's a problem in the second item: I'd replace `@molecules = <MOL>; map { chomp; $_ = quotemeta; } @molecules;` [download] with `push my @molecules, quotemeta chomp while <MOL>;` [download] The problem with this replacement is that chomp simply modifies its arg ($_ in this case), but does not return the modified arg -- something other than the chomp'ed string gets returned to quotemeta, and pushed onto @molecules. I just recently tried this sort of approach in an attempt to shorten a one-liner, and didn't get what you expect: `push @m, quotemeta chomp while <DATA>; print join("\|",@m),$/; __DATA__ AAA BBB &*(` [download] Yields: 1\|1\|1 Maybe "map" in a void context isn't sexy, but it does work. (I agree that grep in a void context would be silly.)	[reply] [d/l] [select]
Re^4: Parsing a tab delimited file by particle (Vicar) on May 13, 2002 at 22:37 UTC