Finding elements without a counter...

bioinformatics has asked for the wisdom of the Perl Monks concerning the following question:

Hello friends!
I have a small problem. I am parsing a text file, which looks like:

position patient<-not in file, but for your reference!
34562    1048    t    t
30873    1048    t    t
25912    1048    N    N
25779    1048    t    t
26827    1048    t    t
24607    1048    t    t
22301    1048    t    t
24835    1048    t    t
24955    1048    N    N
25916    1048    N    N
22131    1048    t    t
22203    1048    t    t
23759    1048    t    t
23749    1048    t    t
18504    1048    t    t
20839    1048    c    t
34562    1082    c    c
30873    1082    c    c
25912    1082    N    N
25779    1082    c    c
26827    1082    c    c
24607    1082    c    c
22301    1082    c    c
24835    1082    c    c
24955    1082    N    N
25916    1082    N    N
22131    1082    c    c
22203    1082    c    c
23759    1082    c    c
23749    1082    c    c
18504    1082    c    c
20839    1082    c    c
[download]

As you can see, there is a pattern there, specifically in the left two columns. In the far left column, there are a series of numbers that repeat themselves once the number in the second column changes. I am using a loop to pull in and split up each line, with the split line: ($position, $patient, $letter1, $letter2)=split(/\t/, $prettybase[$i]);. Due to the nature of this loop, I can't use value of $i to say which element of @prettybase to retrieve. Hence, is there a way to (without using a counter) have the loop parse @prettybase one element at a time from 1-until eof, all while keeping its place? I'm somwhat familiar with keeping one's place while pattern matching, but unfortunately this doesn't work here, unless there is something I'm unaware of.
Since most people like to see the big picture, here is the code on the whole:

#! C:\Perl\bin\perl 

use warnings;
use strict;

chomp(my $input=<STDIN>);
open (INPUT, "$input") or die "Cannot open file: $!";

chomp(my $number_positions=<STDIN>);

chomp(my $patient_number=<STDIN>);

chomp(my $output=<STDIN>);
open (OUTPUT, ">$output.txt") or die "Cannot open file: $!";

my @prettybase=<INPUT>;
close (INPUT);
my (@a,@c,@t,@g,@n,@a2,@c2,@t2,@g2,@n2);
my ($position, $letter1, $letter2, $patient);

for (my $k=0;$k<$number_positions; $k++) {
   @a=@c=@t=@g=@n=@a2=@c2=@t2=@g2=@n2=();
   $position="0";
   for (my $i=0;$i<=$patient_number; $i++) {
      ($position, $patient, $letter1, $letter2)=split(/\t/, $prettybas
+e[$i]);
      if ($letter1=~ /[aA]/) {push ( @a, $letter1);}
      if ($letter1=~ /[cC]/) {push ( @c, $letter1);}
      if ($letter1=~ /[tT]/) {push ( @t, $letter1);}
      if ($letter1=~ /[gG]/) {push ( @g, $letter1);}
      if ($letter1=~ /[nN]/) {push ( @n, $letter1);}
      if ($letter2=~ /[aA]/) {push ( @a2, $letter2);}
      if ($letter2=~ /[cC]/) {push ( @c2, $letter2);}
      if ($letter2=~ /[tT]/) {push ( @t2, $letter2);}
      if ($letter2=~ /[gG]/) {push ( @g2, $letter2);}
      if ($letter2=~ /[nN]/) {push ( @n2, $letter2);}
      }
   my $number_a=scalar @a;
   #print "$number_a";
   my $number_c=scalar @c;
   #print "$number_c";
   my $number_t=scalar @t;
   my $number_g=scalar @g;
   my $number_n=scalar @n;
   my $number_a2=scalar @a2;
   my $number_c2=scalar @c2;
   my $number_t2=scalar @t2;
   my $number_g2=scalar @g2;
   my $number_n2=scalar @n2;

   my $percent_a=((("$number_a")/("$number_a"+"$number_c"+"$number_t"+
+"$number_g"+"$number_n")) * 100);
   my $percent_c=((("$number_c")/("$number_a"+"$number_c"+"$number_t"+
+"$number_g"+"$number_n")) * 100);
   my $percent_t=((("$number_t")/("$number_a"+"$number_c"+"$number_t"+
+"$number_g"+"$number_n")) * 100);
   my $percent_g=((("$number_g")/("$number_a"+"$number_c"+"$number_t"+
+"$number_g"+"$number_n")) * 100);
   my $percent_n=((("$number_n")/("$number_a"+"$number_c"+"$number_t"+
+"$number_g"+"$number_n")) * 100);
   my $percent_a2=((("$number_a2")/("$number_a2"+"$number_c2"+"$number
+_t2"+"$number_g2"+"$number_n2")) * 100);
   my $percent_c2=((("$number_c2")/("$number_a2"+"$number_c2"+"$number
+_t2"+"$number_g2"+"$number_n2")) * 100);
   my $percent_t2=((("$number_t2")/("$number_a2"+"$number_c2"+"$number
+_t2"+"$number_g2"+"$number_n2")) * 100);
   my $percent_g2=((("$number_g2")/("$number_a2"+"$number_c2"+"$number
+_t2"+"$number_g2"+"$number_n2")) * 100);
   my $percent_n2=((("$number_n2")/("$number_a2"+"$number_c2"+"$number
+_t2"+"$number_g2"+"$number_n2")) * 100);
   
   print OUTPUT "Position\#\: $patient\n";
   print OUTPUT "\tAllele 1  Allele 1 percentage\n";
   print OUTPUT "A:\t$number_a\t$percent_a\%\n\n";
   print OUTPUT "C:\t$number_c\t$percent_c\%\n\n";
   print OUTPUT "T:\t$number_t\t$percent_t\%\n\n";
   print OUTPUT "G:\t$number_g\t$percent_g\%\n\n";
   print OUTPUT "N:\t$number_n\t$percent_n\%\n\n";
   print OUTPUT "\tAllele 2  Allele 2 percentage\n";
   print OUTPUT "A:\t$number_a2\t$percent_a2\%\n\n";
   print OUTPUT "C:\t$number_c2\t$percent_c2\%\n\n";
   print OUTPUT "T:\t$number_t2\t$percent_t2\%\n\n";
   print OUTPUT "G:\t$number_g2\t$percent_g2\%\n\n";
   print OUTPUT "N:\t$number_n2\t$percent_n2\%\n\n";
}
close (INPUT);
close (OUTPUT);
_____ output_____
Position#: 1048
    Allele 1  Allele 1 percentage
A:    0    0%

C:    1    1.5625%

T:    56    87.5%

G:    0    0%

N:    7    10.9375%

    Allele 2  Allele 2 percentage
A:    0    0%

C:    0    0%

T:    57    89.0625%

G:    0    0%

N:    7    10.9375%
etc....
[download]

Thanks for all your help!!

Bioinformatics

Comment on Finding elements without a counter... Select or Download Code

Replies are listed 'Best First'.
OT: program repair shop (was: Finding elements without a counter) by Aristotle (Chancellor) on Oct 13, 2003 at 18:38 UTC
I haven't understood your problem, but there's a lot of things about your style that are better done different, and some that outright need fixing. So I'll just take notes along the way as I'm cleaning it up. First thing I notice: `@a=@c=@t=@g=@n=@a2=@c2=@t2=@g2=@n2=();` [download] If you actually needed that line, you should better write it as `(@a, @c, @t, @g, @n, @a2, @c2, @t2, @g2, @n2) = ();` [download] But you're not using the variables outside the loop; so you can just move the my inside the loop. `for (my $k=0;$k<$number_positions; $k++) { my (@a,@c,@t,@g,@n,@a2,@c2,@t2,@g2,@n2);` [download] Next is the pretty hideous train of test like `if ($letter1=~ /[aA]/) { ... } ...` [download] You're only testing a single character; so use uc to normalize them to upper case (or lc for lower case if you prefer). `if (uc $letter1 eq 'A') { ... } if (uc $letter1 eq 'C') { ... } if (uc $letter1 eq 'T') { ... } if (uc $letter1 eq 'G') { ... } # ...` [download] But since you're doing that a zillion times, you can just do it once and be done with: `$letter1 = uc $letter1; if ($letter1 eq 'A') { ... } if ($letter1 eq 'C') { ... } if ($letter1 eq 'T') { ... } # ...` [download] Still too much repetition. Since all these arrays are collecting related information, you should use hashes instead: `for (my $k=0;$k<$number_positions; $k++) { my (%base1, %base2); # ... $letter1 = uc $letter1; $letter2 = uc $letter2; push @${$base1{$letter1}}, $letter1; push @${$base2{$letter2}}, $letter2;` [download] But it gets better! Because you only want the number of items in the array anyway. So why keep the elements? Just use the hash elements as counters. Note that I got rid of the separate uppercasing step since we're only using the variables once now. `for (my $k=0;$k<$number_positions; $k++) { my (%base1, %base2); # ... $base1{uc $letter1}++; $base2{uc $letter2}++;` [download] Now, the `$number_` variables were already crying out to become elements of a `%number` hash. But we've already taken care of that with the `%base` counter hashes. Next we have stuff like `"$number_a" + "$number_c" + "$number_t" + "$number_g" + "$number_n"` [download] Perl is not shell code. Don't use quotes were needless. You are forcing Perl to stringify the numbers to create the strings for your quotes, then it has to numify the created strings in order to do math on them. Why? Just say `$number_a + $number_c + $number_t + $number_g + $number_n` [download] Or, now that we have the `%base` hashes: `$base1{A} + $base1{C} + $base1{T} + $base1{G} + $base1{N}` [download] Also note how you're doing this calculation over and over and over. Just do it once! `my $total1 = $base1{A} + $base1{C} + $base1{T} + $base1{G} + $base1{N} +;` [download] Also note how you are declaring `$number_a`, `$number_g`, `$number_t`, etc. That's a huge blinking sign that you should declare `%number` and work with `$number{a}`, `$number{g}`, etc, instead. Together with the fact that you can assign a list of values to a hash when you create it, that makes `my %percent1 = ( A => ($base{A} 100) / $total1, C => ($base{C} * 100) / $total1, T => ($base{T} * 100) / $total1, G => ($base{G} * 100) / $total1, N => ($base{N} * 100) / $total1, );` [download] There is still too much repetition: `my %percent1 = map { $_ => ( $base{$_} * 100) / $total1 } 'A', 'C', 'T +', 'G', 'N';` [download] Since we have a list of strings that don't contain whitespace we can make it a little more readable: `my %percent1 = map { $_ => ( $base1{$_} * 100) / $total1 } qw(A C T G +N);` [download] The same change (using a `for` loop) can be applied to your output code. The bottom line is: #!/usr/bin/perl use warnings; use strict; chomp(my $input = <STDIN>); open(INPUT, "$input") or die "Cannot open file: $!"; chomp(my $number_positions = <STDIN>); chomp(my $patient_number = <STDIN>); chomp(my $output = <STDIN>); open(OUTPUT, ">$output.txt") or die "Cannot open file: $!"; my @prettybase = <INPUT>; close(INPUT); for my $k (0 .. $number_positions - 1) { my (%base1, %base2); for my $i (0 .. $patient_number) { my ($position, $patient, $letter1, $letter2) = split (/\t/, $prettybase[$i]); $base1{uc $letter1}++; $base2{uc $letter2}++; } my $total1 = $base1{A} + $base1{C} + $base1{T} + $base1{G} + $base +1{N}; my %percent1 = map { $_ => ( $base1{$_} * 100) / $total1 } qw(A C +T G N); my $total2 = $base2{A} + $base2{C} + $base2{T} + $base2{G} + $base +2{N}; my %percent2 = map { $_ => ( $base2{$_} * 100) / $total2 } qw(A C +T G N); print OUTPUT "Position\#\: $patient\n"; print OUTPUT "\tAllele 1 Allele 1 percentage\n"; for(qw(A C T G N)) { print OUTPUT "$_:\t$base1{$_}\t$percent1{$_}\%\n\n"; } print OUTPUT "\tAllele 2 Allele 2 percentage\n"; for(qw(A C T G N)) { print OUTPUT "$_:\t$base2{$_}\t$percent2{$_}\%\n\n"; } } close(OUTPUT); [download] Further reading: Mark-Jason Dominus' excellent Program Repair Shop and Red Flags article series on Perl.com. Makeshifts last the longest.	[reply] [d/l] [select]
Re: OT: program repair shop (was: Finding elements without a counter) by bioinformatics (Friar) on Oct 13, 2003 at 19:21 UTC
Aristotle, all I can say is wow. Thanks so much for that...I just learned a tremendous amount, and I realize that I need to learn to pare down all my variables to make things cleaner. Bioinformatics	[reply]
Re: Finding elements without a counter... by ptkdb (Monk) on Oct 13, 2003 at 18:51 UTC
I'm not to clear on what you're trying to do, but here are some things to consider... Since the third column, corresponding to the $letter1 variable seems to remain a single character, and you don't seem to be doing anything more with it than counting it, try this technique instead. I'm unclear as to how you want to scan your @prettybase array, but you might want to consider making the processing of an input record into a subroutine like this: ## ## Will process the records between $start and $end ## and count the occurances of the third ## field, returning the total count and a ## has containing the individual counts keyed ## by the letter. ## sub processRec { my($arrayRef, $start, $end) = @_ ; my($i, $total, $letter_counts) ; my($position, $patient, $letter1, $letter2) ; for( $i = $start ; $i <= $end ; $i++ ) { ($position, $patient, $letter1, $letter2)=split(/\t/, $arrayRef->[ +$i]); $letter_counts->{uc $letter1} += 1 ; } $total = 0 ; for( values %$letter_counts ) { $total += $_ ; } # for return($total, $letter_counts) ; } @prettybase = map chomp, <INPUT> ; # # do something to determine start and stop points # ($total, $counts) = processRec(\@prettybase, start, end) ; while( ($letter, $count) = each %$counts ) { print "$letter => ", ($count*100)/$total, "\n" ; } [download] You may want to consider 'pre-splitting' the @prettybase array into an array of array refs just so you don't have to keep resplitting the lines. This would be good if your input files are huge, and you're going back and forth over it many times.	[reply] [d/l]
Re: Finding elements without a counter... by dragonchild (Archbishop) on Oct 13, 2003 at 18:25 UTC
If I understand your question correctly, you want something that will only handle the items in @prettybase whose second column matches the patient number you pass in. Is that what you're looking for? ------ We are the carpenters and bricklayers of the Information Age. The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6 ... strings and arrays will suffice. As they are easily available as native data types in any sane language, ... - blokhead, speaking on evolutionary algorithms Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.	[reply]
Re: Re: Finding elements without a counter... by bioinformatics (Friar) on Oct 13, 2003 at 18:30 UTC
Yes, that is correct. Technically, I could have used pattern matching rather than a counter to go through @prettybase, but normally a counter works well for these type of programs. Unfortunately, not in this case. Bioinformatics	[reply]
Re2: Finding elements without a counter... by dragonchild (Archbishop) on Oct 13, 2003 at 18:42 UTC
`my @prettybase; my $seen = 0; while (<PRETTYBASE>) { chomp; my @info = split /\t/; unless ($info[1] == $patient_number) { # Use this to short-cut out of the file once we've found our p +atient info. last if $seen; next; } $seen = 1; push @prettybase, $_; } # At this point, everything in @prettybase corresponds to the patient +specified. # Modify your algorithms accordingly.` [download] ------ We are the carpenters and bricklayers of the Information Age. The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6 ... strings and arrays will suffice. As they are easily available as native data types in any sane language, ... - blokhead, speaking on evolutionary algorithms Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.	[reply] [d/l]