lomSpace has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a tab-delimited input file that looks like this:
#chr  loc	rs	   observ 129S1 129S4 129X1 A/J
1     3.013441	rs31192577   A/T   T      T     T    =
1     3.036178	rs32166183   A/C   C      C     C    A
1     3.036265	rs30543887   A/G   G      G	A    A
1     3.039187	rs6365082    G/T   =      =     T    T
1     3.051362	rs30717399   A/G   G      G     G    A
1     3.051854	rs32156135   A/G   G      G     G    A
1     3.062749	rs31606309   A/C   A      =     A    C
1     3.063538	rs30884626   C/G   G      G     G    C
1     3.093816	rs31797356   C/T   T      T     T    C
1     3.093903	rs31986282   A/T   T      T     T    A
1     3.095984	rs30462182   A/C   A      A     A    C
1     3.108194	rs32782895   A/G   A      A     A    G
1     3.11911	rs31819935   A/G   G      G     G    A
1     3.119136	rs31147132   C/T   T      T     T    C
I need to count and print out each occurrence of the rs values. For each rs value I need to check whether an (A|G|C|T|=)
exists in arrays 4 thru 7. If an '=' sign exist, then I need to skip that array and still keep count of the array number.
So, arrays 4-7 would have values equal to 1 thru 4. The output would look like the following:
snp_id    strain_id    rs_value
1             1        rs31192577
.              .           . 
.              .           .
.              .           .
1             4       rs31192577
2             1        rs32166183
.              .           .
.              .           .
.              .           .
2             4       rs32166183

This iteration will continue for every 'rs' value and each of
the arrays 4-7. The actual file has 48 of the arrays
that need to be checked.
This is the code that I have written thus far:

#!/usr/bin/perl
 
use strict;
use warnings;

open( my $in, "20264500_snp_48strains_b37.txt");
open( my $out,">SNP_Strain_join.txt");

my $first_line = <$in>;
chomp $first_line;
my $snp_id_count = 0;

#my $num_of_strains = 48;
my $skip_strain = '=';
while(<$in>){
	chomp;
	my $strain_count = 0;
	my @fields = split /\t/;
	my $rs = $fields2;
	$snp_id_count++;
	$strain_count++;
	print $out "$snp_id_count\t$rs\t$strain_count\n";
}
close($in);
close($out);

Replies are listed 'Best First'.
Re: Looping issue...
by GrandFather (Saint) on Aug 25, 2008 at 20:57 UTC

    So you want something like:

    use strict;
    use warnings;

    my $first_line = <DATA>;
    chomp $first_line;
    my $snp_id_count;

    #my $num_of_strains = 48;
    print "snp_id\tstrain_id\trs_value\n";
    my $skip_strain = '=';

    while (<DATA>) {
        chomp;
        my @fields = split /\t/;
        my $rs = $fields[2];
        my @ids = @fields[4 .. $#fields];
        my $strainId;

        $snp_id_count++;
        for my $id (@ids) {
            ++$strainId;
            next if $id eq '=';
            print "$snp_id_count\t$strainId\t$rs\n";
        }
    }

    __DATA__
    #chr	loc	rs	observ	129S1	129S4	129X1	A/J
    1	3.013441	rs31192577	A/T	T	T	T	=
    1	3.036178	rs32166183	A/C	C	C	C	A
    1	3.036265	rs30543887	A/G	G	G	A	A
    1	3.039187	rs6365082	G/T	=	=	T	T

    Prints:

    snp_id	strain_id	rs_value
    1	1	rs31192577
    1	2	rs31192577
    1	3	rs31192577
    2	1	rs32166183
    2	2	rs32166183
    2	3	rs32166183
    2	4	rs32166183
    3	1	rs30543887
    3	2	rs30543887
    3	3	rs30543887
    3	4	rs30543887
    4	3	rs6365082
    4	4	rs6365082

    If you add further columns for more strain Ids it will 'Just Work'™.


    Perl reduces RSI - it saves typing
Re: Looping issue...
by toolic (Bishop) on Aug 25, 2008 at 20:33 UTC
    I see one problem: $strain_count will always print out as 1 because you declare the variable with my inside your while loop, then set it to 0, then unconditionally increment it by 1.
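
    The point about the counter can be seen in isolation with a minimal sketch. The two helper subs below are hypothetical names introduced only for illustration; the first mirrors the original loop, where the counter is declared, zeroed, and incremented once per row, and the second advances a counter once per strain column, which is presumably what was intended.

```perl
use strict;
use warnings;

# Hypothetical helper: mirrors the original loop, where the counter is
# declared inside the loop body.
sub count_inside_loop {
    my @rows = @_;
    my @seen;
    for my $row (@rows) {
        my $strain_count = 0;    # re-created and reset on every iteration
        $strain_count++;         # so the value is always 1 here
        push @seen, $strain_count;
    }
    return @seen;
}

# Hypothetical helper: one counter per row, advanced once per strain column.
sub count_per_column {
    my @columns = @_;
    my $strain_count = 0;        # declared once, outside the inner loop
    my @seen;
    for my $value (@columns) {
        $strain_count++;
        push @seen, $strain_count;
    }
    return @seen;
}

print join(',', count_inside_loop('row1', 'row2', 'row3')), "\n";   # 1,1,1
print join(',', count_per_column(qw(T T T =))), "\n";               # 1,2,3,4
```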

    Perhaps you could clarify what you mean by "arrays 4-7".

    It would also be helpful if you showed the exact output you are looking for, inside "readmore" tags. Also, change your pre tags to code tags. Your tabs got lost in the translation. Re-read Writeup Formatting Tips.

Re: Looping issue...
by Illuminatus (Curate) on Aug 25, 2008 at 20:31 UTC
    First off, please use the <code> tag when including code.
    Second, we need a little terminology consistency :). You talk about looking for A/C/G/T in 'arrays 4-7'. I assume that you really mean 'columns' here, but need to be sure. The statement "If an '=' sign exist, then I need to skip that array and still keep count of the array number." does not seem to correlate at all to the output example you have provided. You then mention that 48 arrays will be present. Do you really mean 48 rows of data, or 48 blocks of rows of data?
    Third, your code does not appear to do anything like what you ask: you index an array without brackets ($fields2 instead of $fields[2]), and you define $skip_strain but never apply it.
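
    For reference, here is a rough sketch of how the posted loop might look with those two points addressed. It assumes that 'arrays 4-7' means the strain columns (field indexes 4 onward, after chr, loc, rs and observ) and that the desired column order is snp_id, strain_id, rs_value as in the sample output. The sub name write_strain_rows is made up for this example, and in-memory file handles stand in for the real input and output files.

```perl
use strict;
use warnings;

# Sketch: read tab-delimited SNP rows from $in and write one line per
# (SNP, strain) pair to $out, skipping strains whose value is '='.
# Assumption: fields 0-3 are chr, loc, rs, observ; the rest are strains.
sub write_strain_rows {
    my ($in, $out) = @_;
    my $skip_strain  = '=';
    my $snp_id_count = 0;

    my $header = <$in>;                       # discard the '#chr ...' header
    print {$out} "snp_id\tstrain_id\trs_value\n";

    while (my $line = <$in>) {
        chomp $line;
        next unless length $line;             # ignore blank lines
        my @fields = split /\t/, $line;
        my $rs = $fields[2];                  # brackets index into @fields
        $snp_id_count++;

        my $strain_count = 0;                 # reset once per row
        for my $value (@fields[4 .. $#fields]) {
            $strain_count++;                  # count the column even if skipped
            next if $value eq $skip_strain;
            print {$out} "$snp_id_count\t$strain_count\t$rs\n";
        }
    }
    return $snp_id_count;
}

# Demonstration data built with explicit tabs (two rows from the question).
my $input = join "\n",
    join("\t", '#chr', qw(loc rs observ 129S1 129S4 129X1 A/J)),
    join("\t", qw(1 3.013441 rs31192577 A/T T T T =)),
    join("\t", qw(1 3.039187 rs6365082 G/T = = T T)),
    '';

# With real files, use three-argument open and check the result, e.g.
#   open(my $in, '<', $path) or die "Cannot open $path: $!";
open(my $in, '<', \$input) or die "Cannot open in-memory input: $!";
my $result = '';
open(my $out, '>', \$result) or die "Cannot open in-memory output: $!";
write_strain_rows($in, $out);
close $in;
close $out;
print $result;
```

    Because the strain columns are taken as "everything from index 4 onward", the same loop should cover the 48-strain file without a hard-coded count, provided the file really is tab-delimited.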