AWallBuilder has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I'm relatively new and cannot get this program to run properly, I think there is something wrong with the logic of my looping. Basically I have a list of 3500 gene ids and I want to look through another file which has one row for each gene_SNP pair. For most genes there are many SNPs. I want to pull out the SNP position information for all of the genes in the first list. At the moment my script just writes output (ie. finds a match in the key and geneid) for the first key that it looks at.

use strict; use warnings; my %scGeneids; # gene ids for marker genes my $key; open(INlist,"scCOGwSNPs_geneids.txt"); while(<INlist>){ chomp; $scGeneids{$_}=1; } close(INlist); open(IN,"snps.position.txt") or die "can't open infile"; open(OUT,">markerGen_allSNPPos.txt"); foreach $key (keys %scGeneids){ print "searching for $key \n"; while (<IN>){ chomp; my @columns=split("\t",$_); my $gene_id=$columns[0]; my $SNP_pos=$columns[2]; if ($key eq $gene_id){ print OUT join("\t",$key,$SNP_pos),"\n"; } } } close(IN);

head INlist

213810.locus_tag:RUM_03050 213810.locus_tag:RUM_03410 213810.locus_tag:RUM_03700 213810.locus_tag:RUM_06140 213810.locus_tag:RUM_06600 213810.locus_tag:RUM_07030 213810.locus_tag:RUM_09690 213810.locus_tag:RUM_09700 213810.locus_tag:RUM_09770 213810.locus_tag:RUM_12130

head IN

245014.locus_tag:CK3_01920 245014.FP929062 204248 T 0|0|0| +1|0|0|2|0|6|5|0|0|0|0|0|0|3|1|6|0|0|1|0|0|5|1|0|0|0|4|0|0|4|4|3|2|0|0 +|0|0|3|0|0|0|0|2|0|0|0|0|0|5|2|0|0|0|0|0|0|2|2|0|0|5|0|0|1|0|0|4|0|0| +2|2|1|9|0|2|2|1|5|0|1|5|0|0|0|0|1|0|0|2|0|0|0|2|1|0|1|0|0|2|0|0|0|0|0 +|0|0|0|1|2|0|0|0|0|1|0|0|1|0|0|2|7|1|0|1|1|0|0|5|1|3|0|3|2|0|10|2|1|1 +|9|2|2|0|0|2|2|0|0|0|1|0|1|0|3|0|0|1|1|0|0|0|1|3|0|1|3|2|1|0|0|4|1|1| +0|0|1|1|0|0|2|2|0|0|0|1|1|0|0|2|4|2|0|0|0|2|0|0|0|1|1|0|1|1|0|0|0|0|1 +|0|3|0|1|0|0|0|0|0|2|0|0|0|1|1|0|0|0|0|0|0|3|1|6|0|0|1|0|0|0|0|0|0|2| +0|4|1|0|0|0|0|0 . 5|C|S[AAA-AAG]|0|0|0|0|0|0|0|0|0|0|0|0 +|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|2|0|0|0|0|0|1|0|0|0|0|0| +0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0 +|0|0|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0| +0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0 +|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0| +0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0 +|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0 245014.locus_tag:CK3_01920 245014.FP929062 204251 C 0|0|0| +1|0|0|2|0|6|5|0|0|0|0|0|0|3|1|6|0|0|1|0|0|5|1|0|0|0|4|0|0|4|4|3|2|0|0 +|0|0|3|0|0|0|0|2|0|0|0|0|0|6|2|0|0|0|1|0|0|2|3|0|0|6|0|0|1|0|0|4|0|0| +2|1|1|9|0|2|2|1|5|0|1|5|0|0|0|0|1|0|0|2|0|0|0|2|1|0|1|0|0|2|0|0|0|0|0 +|0|0|0|1|3|0|0|0|0|1|0|0|1|0|1|2|4|1|0|1|1|0|0|5|1|3|0|3|2|0|10|2|1|1 +|9|1|2|0|0|2|3|0|0|0|1|0|1|0|4|0|0|1|1|1|0|0|1|2|0|0|3|2|0|0|1|4|1|1| +0|0|1|0|0|0|3|1|0|0|0|1|1|0|0|3|4|3|0|0|0|3|0|0|0|1|2|0|1|1|1|0|0|0|1 +|0|3|0|1|0|0|0|0|0|2|0|0|0|1|1|0|0|0|0|0|0|3|1|5|0|1|1|0|0|0|0|0|0|4| +0|4|1|0|0|0|0|0 . 9|T|S[AAG-AAA]|0|0|0|0|0|0|0|0|0|0|0|0 +|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|2|0|0|0|0|0|1|0|0|0|0|1| +0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0 +|0|0|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0| +0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|2|0|0|0 +|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0| +0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0 +|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0

Replies are listed 'Best First'.
Re: help with loop
by Anonymous Monk on Jan 18, 2012 at 11:38 UTC

    Very very very similar question Re: Help with locating bp region in chromosome/Re^5: Help with locating bp region in chromosome.

    You might write as

    #!/usr/bin/perl -- use strict; use warnings; use autodie; use Getopt::Long qw/ GetOptionsFromArray /; use Pod::Usage; use Regex::PreSuf (); Main( @ARGV ); exit( 0 ); sub Main { GetOptionsFromArray( \@_, 'list=s' => \my $inlist, 'pos=s' => \my $inpos, 'out=s' => \my $outpos, help => sub { pod2usage(-verbose => 2); }, ) or return pod2usage(2); return pod2usage(2) unless $inlist and $inpos and $outpos; my $list_re ; { open my($listfh), '<', $inlist; $list_re = Regex::PreSuf::presuf( map { chomp; $_ } readline( +$listfh ) ); close $listfh; } open my($posfh), '<', $inpos; open my($outfh), '>', $outpos; while( <$posfh> ){ my( $geneId, $SNPpos ) = ( split /\t/, $_ )[ 0, 2]; if( $geneId =~ /$list_re/ ){ print $outfh join "\t", $geneId, $SNPpos, "\n"; } } close $posfh; close $outfh; } =head1 NAME HI - hi =head1 SYNOPSIS shaba -l foo.txt -p pos.txt -o out.txt shaba -l=foo.txt -p=pos.txt -o=out.txt shaba --help =head1 ARGUMENS =over 4 =item C<-list> -list =item C<-pos> -pos =item C<-out> -out =back =cut

    See also Tutorials: Variable Scoping in Perl: the basics, Coping with Scoping )

    Also, Use Getopt::Long even if you don't think you need to - Mechanix

Re: help with loop
by Eliya (Vicar) on Jan 18, 2012 at 11:46 UTC

    Another approach would be to use the hash both as a lookup table to filter on relevant entries (as already suggested by others), and to collect the results by $gene_id (which I guess was the idea behind looping through the file multiple times — assuming the data in IN aren't necessarily already sorted by ID).  Something like this:

    while(<INlist>){ chomp; $scGeneids{$_} = ""; } ... while (<IN>){ chomp; my @columns=split /\t/; my $gene_id=$columns[0]; my $SNP_pos=$columns[2]; if ( exists $scGeneids{$gene_id} ){ $scGeneids{$gene_id} .= "$gene_id\t$SNP_pos\n"; # collect by +ID } } for my $key (sort keys %scGeneids) { print OUT $scGeneids{$key}; }
      Thanks for your help. yes better use of the hash. But what does the ".=" mean in $scGeneids{$gene_id} .= "$gene_id\t$SNP_pos\n"; Thanks
        $foo .= $bar;
        means 
        $foo = $foo . $bar;
        means
        $foo = "$foo$bar";
        

        As already pointed out by Anonymous, the .= effectively appends to the string.

        Just to elaborate a little, consider the following (simplified) IN file:

        B ... 204248 ... A ... 204249 ... D ... 204250 ... C ... 204251 ... E ... 204252 ... A ... 204253 ... F ... 204254 ... C ... 204255 ... B ... 204256 ...

        If you go looking for the IDs A, B, C one after the other in several passes (as in your original approach), the hits will automatically be sorted by ID:

        A 204249 # first pas A 204253 B 204248 # second pass B 204256 C 204251 # third pass C 204255

        While if you do it in one pass, checking every entry against "if ( exists $scGeneids{$gene_id} )" to see if it's of interest, you'll get the IDs in the order they were encountered in IN, when you print them out immediately

        B 204248 A 204249 C 204251 A 204253 C 204255 B 204256

        This won't be an issue, of course, in case the entries are already sorted by ID.  Otherwise, it might not be what you want.

        You haven't said anything about the desired ordering of the results, nor whether IN is already sorted... so I thought it's worth pointing out anyway that if you wanted output similar to what your original multi-pass approach would have produced (if you had correctly reset the file in between passes), you could collect the results by ID before writing them out. I.e., every time you encounter a hit, you append the line to the respective hash entry.

        In other words, after having gone through the file, $scGeneids{A} will hold a long string with all A records, $scGeneids{B} all B records, and so on, so you can nicely print them out in sorted order.

        The same result could have been achieved by collecting the records in arrays (i.e. @{ $scGeneids{$gene_id} } ) and then joining them on output, but I thought the string-append variant is a tad easier to understand...

Re: help with loop
by RichardK (Parson) on Jan 18, 2012 at 11:39 UTC

    while(<IN>) {...} will read all the lines in the file, and then stops leaving the file at the end, so you use up all of the lines when checking the first key and then on the next key there are no more lines in <IN> so the while loop does nothing.

    But there's no need to loop through each key in turn, you can just test if an id exists in the hash.

    while(<IN>) { my @cols = split(/\t/,$_); my $id = $col[0]; if ($GeneIDS{$id} ) { say "found $id"; } }

    BTW split usually takes a pattern as the first parameter, so passing a string might not do what you expect ;)

Re: help with loop
by ansh batra (Friar) on Jan 18, 2012 at 10:44 UTC
    please provide some sample data of both the files