bioinformatics has asked for the wisdom of the Perl Monks concerning the following question:

Hello friends!
I have a small problem. I am parsing a text file, which looks like:
position patient<-not in file, but for your reference! 34562 1048 t t 30873 1048 t t 25912 1048 N N 25779 1048 t t 26827 1048 t t 24607 1048 t t 22301 1048 t t 24835 1048 t t 24955 1048 N N 25916 1048 N N 22131 1048 t t 22203 1048 t t 23759 1048 t t 23749 1048 t t 18504 1048 t t 20839 1048 c t 34562 1082 c c 30873 1082 c c 25912 1082 N N 25779 1082 c c 26827 1082 c c 24607 1082 c c 22301 1082 c c 24835 1082 c c 24955 1082 N N 25916 1082 N N 22131 1082 c c 22203 1082 c c 23759 1082 c c 23749 1082 c c 18504 1082 c c 20839 1082 c c
As you can see, there is a pattern there, specifically in the left two columns. In the far left column, there are a series of numbers that repeat themselves once the number in the second column changes. I am using a loop to pull in and split up each line, with the split line: ($position, $patient, $letter1, $letter2)=split(/\t/, $prettybase[$i]);. Due to the nature of this loop, I can't use value of $i to say which element of @prettybase to retrieve. Hence, is there a way to (without using a counter) have the loop parse @prettybase one element at a time from 1-until eof, all while keeping its place? I'm somwhat familiar with keeping one's place while pattern matching, but unfortunately this doesn't work here, unless there is something I'm unaware of.
Since most people like to see the big picture, here is the code on the whole:
#! C:\Perl\bin\perl use warnings; use strict; chomp(my $input=<STDIN>); open (INPUT, "$input") or die "Cannot open file: $!"; chomp(my $number_positions=<STDIN>); chomp(my $patient_number=<STDIN>); chomp(my $output=<STDIN>); open (OUTPUT, ">$output.txt") or die "Cannot open file: $!"; my @prettybase=<INPUT>; close (INPUT); my (@a,@c,@t,@g,@n,@a2,@c2,@t2,@g2,@n2); my ($position, $letter1, $letter2, $patient); for (my $k=0;$k<$number_positions; $k++) { @a=@c=@t=@g=@n=@a2=@c2=@t2=@g2=@n2=(); $position="0"; for (my $i=0;$i<=$patient_number; $i++) { ($position, $patient, $letter1, $letter2)=split(/\t/, $prettybas +e[$i]); if ($letter1=~ /[aA]/) {push ( @a, $letter1);} if ($letter1=~ /[cC]/) {push ( @c, $letter1);} if ($letter1=~ /[tT]/) {push ( @t, $letter1);} if ($letter1=~ /[gG]/) {push ( @g, $letter1);} if ($letter1=~ /[nN]/) {push ( @n, $letter1);} if ($letter2=~ /[aA]/) {push ( @a2, $letter2);} if ($letter2=~ /[cC]/) {push ( @c2, $letter2);} if ($letter2=~ /[tT]/) {push ( @t2, $letter2);} if ($letter2=~ /[gG]/) {push ( @g2, $letter2);} if ($letter2=~ /[nN]/) {push ( @n2, $letter2);} } my $number_a=scalar @a; #print "$number_a"; my $number_c=scalar @c; #print "$number_c"; my $number_t=scalar @t; my $number_g=scalar @g; my $number_n=scalar @n; my $number_a2=scalar @a2; my $number_c2=scalar @c2; my $number_t2=scalar @t2; my $number_g2=scalar @g2; my $number_n2=scalar @n2; my $percent_a=((("$number_a")/("$number_a"+"$number_c"+"$number_t"+ +"$number_g"+"$number_n")) * 100); my $percent_c=((("$number_c")/("$number_a"+"$number_c"+"$number_t"+ +"$number_g"+"$number_n")) * 100); my $percent_t=((("$number_t")/("$number_a"+"$number_c"+"$number_t"+ +"$number_g"+"$number_n")) * 100); my $percent_g=((("$number_g")/("$number_a"+"$number_c"+"$number_t"+ +"$number_g"+"$number_n")) * 100); my $percent_n=((("$number_n")/("$number_a"+"$number_c"+"$number_t"+ +"$number_g"+"$number_n")) * 100); my $percent_a2=((("$number_a2")/("$number_a2"+"$number_c2"+"$number +_t2"+"$number_g2"+"$number_n2")) * 100); my $percent_c2=((("$number_c2")/("$number_a2"+"$number_c2"+"$number +_t2"+"$number_g2"+"$number_n2")) * 100); my $percent_t2=((("$number_t2")/("$number_a2"+"$number_c2"+"$number +_t2"+"$number_g2"+"$number_n2")) * 100); my $percent_g2=((("$number_g2")/("$number_a2"+"$number_c2"+"$number +_t2"+"$number_g2"+"$number_n2")) * 100); my $percent_n2=((("$number_n2")/("$number_a2"+"$number_c2"+"$number +_t2"+"$number_g2"+"$number_n2")) * 100); print OUTPUT "Position\#\: $patient\n"; print OUTPUT "\tAllele 1 Allele 1 percentage\n"; print OUTPUT "A:\t$number_a\t$percent_a\%\n\n"; print OUTPUT "C:\t$number_c\t$percent_c\%\n\n"; print OUTPUT "T:\t$number_t\t$percent_t\%\n\n"; print OUTPUT "G:\t$number_g\t$percent_g\%\n\n"; print OUTPUT "N:\t$number_n\t$percent_n\%\n\n"; print OUTPUT "\tAllele 2 Allele 2 percentage\n"; print OUTPUT "A:\t$number_a2\t$percent_a2\%\n\n"; print OUTPUT "C:\t$number_c2\t$percent_c2\%\n\n"; print OUTPUT "T:\t$number_t2\t$percent_t2\%\n\n"; print OUTPUT "G:\t$number_g2\t$percent_g2\%\n\n"; print OUTPUT "N:\t$number_n2\t$percent_n2\%\n\n"; } close (INPUT); close (OUTPUT); _____ output_____ Position#: 1048 Allele 1 Allele 1 percentage A: 0 0% C: 1 1.5625% T: 56 87.5% G: 0 0% N: 7 10.9375% Allele 2 Allele 2 percentage A: 0 0% C: 0 0% T: 57 89.0625% G: 0 0% N: 7 10.9375% etc....
Thanks for all your help!!
Bioinformatics

Replies are listed 'Best First'.
OT: program repair shop (was: Finding elements without a counter)
by Aristotle (Chancellor) on Oct 13, 2003 at 18:38 UTC

    I haven't understood your problem, but there's a lot of things about your style that are better done different, and some that outright need fixing. So I'll just take notes along the way as I'm cleaning it up.

    First thing I notice:

    @a=@c=@t=@g=@n=@a2=@c2=@t2=@g2=@n2=();
    If you actually needed that line, you should better write it as
    (@a, @c, @t, @g, @n, @a2, @c2, @t2, @g2, @n2) = ();
    But you're not using the variables outside the loop; so you can just move the my inside the loop.
    for (my $k=0;$k<$number_positions; $k++) { my (@a,@c,@t,@g,@n,@a2,@c2,@t2,@g2,@n2);
    Next is the pretty hideous train of test like
    if ($letter1=~ /[aA]/) { ... } ...
    You're only testing a single character; so use uc to normalize them to upper case (or lc for lower case if you prefer).
    if (uc $letter1 eq 'A') { ... } if (uc $letter1 eq 'C') { ... } if (uc $letter1 eq 'T') { ... } if (uc $letter1 eq 'G') { ... } # ...
    But since you're doing that a zillion times, you can just do it once and be done with:
    $letter1 = uc $letter1; if ($letter1 eq 'A') { ... } if ($letter1 eq 'C') { ... } if ($letter1 eq 'T') { ... } # ...
    Still too much repetition. Since all these arrays are collecting related information, you should use hashes instead:
    for (my $k=0;$k<$number_positions; $k++) { my (%base1, %base2); # ... $letter1 = uc $letter1; $letter2 = uc $letter2; push @${$base1{$letter1}}, $letter1; push @${$base2{$letter2}}, $letter2;
    But it gets better! Because you only want the number of items in the array anyway. So why keep the elements? Just use the hash elements as counters. Note that I got rid of the separate uppercasing step since we're only using the variables once now.
    for (my $k=0;$k<$number_positions; $k++) { my (%base1, %base2); # ... $base1{uc $letter1}++; $base2{uc $letter2}++;

    Now, the $number_* variables were already crying out to become elements of a %number hash. But we've already taken care of that with the %base counter hashes.

    Next we have stuff like

    "$number_a" + "$number_c" + "$number_t" + "$number_g" + "$number_n"
    Perl is not shell code. Don't use quotes were needless. You are forcing Perl to stringify the numbers to create the strings for your quotes, then it has to numify the created strings in order to do math on them. Why? Just say
    $number_a + $number_c + $number_t + $number_g + $number_n
    Or, now that we have the %base hashes:
    $base1{A} + $base1{C} + $base1{T} + $base1{G} + $base1{N}

    Also note how you're doing this calculation over and over and over. Just do it once!

    my $total1 = $base1{A} + $base1{C} + $base1{T} + $base1{G} + $base1{N} +;
    Also note how you are declaring $number_a, $number_g, $number_t, etc. That's a huge blinking sign that you should declare %number and work with $number{a}, $number{g}, etc, instead. Together with the fact that you can assign a list of values to a hash when you create it, that makes
    my %percent1 = ( A => ($base{A} * 100) / $total1, C => ($base{C} * 100) / $total1, T => ($base{T} * 100) / $total1, G => ($base{G} * 100) / $total1, N => ($base{N} * 100) / $total1, );
    There is still too much repetition:
    my %percent1 = map { $_ => ( $base{$_} * 100) / $total1 } 'A', 'C', 'T +', 'G', 'N';
    Since we have a list of strings that don't contain whitespace we can make it a little more readable:
    my %percent1 = map { $_ => ( $base1{$_} * 100) / $total1 } qw(A C T G +N);
    The same change (using a for loop) can be applied to your output code. The bottom line is:
    #!/usr/bin/perl use warnings; use strict; chomp(my $input = <STDIN>); open(INPUT, "$input") or die "Cannot open file: $!"; chomp(my $number_positions = <STDIN>); chomp(my $patient_number = <STDIN>); chomp(my $output = <STDIN>); open(OUTPUT, ">$output.txt") or die "Cannot open file: $!"; my @prettybase = <INPUT>; close(INPUT); for my $k (0 .. $number_positions - 1) { my (%base1, %base2); for my $i (0 .. $patient_number) { my ($position, $patient, $letter1, $letter2) = split (/\t/, $prettybase[$i]); $base1{uc $letter1}++; $base2{uc $letter2}++; } my $total1 = $base1{A} + $base1{C} + $base1{T} + $base1{G} + $base +1{N}; my %percent1 = map { $_ => ( $base1{$_} * 100) / $total1 } qw(A C +T G N); my $total2 = $base2{A} + $base2{C} + $base2{T} + $base2{G} + $base +2{N}; my %percent2 = map { $_ => ( $base2{$_} * 100) / $total2 } qw(A C +T G N); print OUTPUT "Position\#\: $patient\n"; print OUTPUT "\tAllele 1 Allele 1 percentage\n"; for(qw(A C T G N)) { print OUTPUT "$_:\t$base1{$_}\t$percent1{$_}\%\n\n"; } print OUTPUT "\tAllele 2 Allele 2 percentage\n"; for(qw(A C T G N)) { print OUTPUT "$_:\t$base2{$_}\t$percent2{$_}\%\n\n"; } } close(OUTPUT);
    Further reading: Mark-Jason Dominus' excellent Program Repair Shop and Red Flags article series on Perl.com.

    Makeshifts last the longest.

      Aristotle, all I can say is wow. Thanks so much for that...I just learned a tremendous amount, and I realize that I need to learn to pare down all my variables to make things cleaner.
      Bioinformatics
Re: Finding elements without a counter...
by ptkdb (Monk) on Oct 13, 2003 at 18:51 UTC
    I'm not to clear on what you're trying to do, but here are some things to consider... Since the third column, corresponding to the $letter1 variable seems to remain a single character, and you don't seem to be doing anything more with it than counting it, try this technique instead. I'm unclear as to how you want to scan your @prettybase array, but you might want to consider making the processing of an input record into a subroutine like this:
    ## ## Will process the records between $start and $end ## and count the occurances of the third ## field, returning the total count and a ## has containing the individual counts keyed ## by the letter. ## sub processRec { my($arrayRef, $start, $end) = @_ ; my($i, $total, $letter_counts) ; my($position, $patient, $letter1, $letter2) ; for( $i = $start ; $i <= $end ; $i++ ) { ($position, $patient, $letter1, $letter2)=split(/\t/, $arrayRef->[ +$i]); $letter_counts->{uc $letter1} += 1 ; } $total = 0 ; for( values %$letter_counts ) { $total += $_ ; } # for return($total, $letter_counts) ; } @prettybase = map chomp, <INPUT> ; # # do something to determine start and stop points # ($total, $counts) = processRec(\@prettybase, start, end) ; while( ($letter, $count) = each %$counts ) { print "$letter => ", ($count*100)/$total, "\n" ; }
    You may want to consider 'pre-splitting' the @prettybase array into an array of array refs just so you don't have to keep resplitting the lines. This would be good if your input files are huge, and you're going back and forth over it many times.
Re: Finding elements without a counter...
by dragonchild (Archbishop) on Oct 13, 2003 at 18:25 UTC
    If I understand your question correctly, you want something that will only handle the items in @prettybase whose second column matches the patient number you pass in. Is that what you're looking for?

    ------
    We are the carpenters and bricklayers of the Information Age.

    The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6

    ... strings and arrays will suffice. As they are easily available as native data types in any sane language, ... - blokhead, speaking on evolutionary algorithms

    Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

      Yes, that is correct. Technically, I could have used pattern matching rather than a counter to go through @prettybase, but normally a counter works well for these type of programs. Unfortunately, not in this case.
      Bioinformatics
        my @prettybase; my $seen = 0; while (<PRETTYBASE>) { chomp; my @info = split /\t/; unless ($info[1] == $patient_number) { # Use this to short-cut out of the file once we've found our p +atient info. last if $seen; next; } $seen = 1; push @prettybase, $_; } # At this point, everything in @prettybase corresponds to the patient +specified. # Modify your algorithms accordingly.

        ------
        We are the carpenters and bricklayers of the Information Age.

        The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6

        ... strings and arrays will suffice. As they are easily available as native data types in any sane language, ... - blokhead, speaking on evolutionary algorithms

        Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.