Code optimization help and troubleshooting

newtoperlprog has asked for the wisdom of the Perl Monks concerning the following question:

Dear PerlMonks

I am trying to selectively color a sequence of 100 characters and writing a html file.I have written the code and but its somewhat too many loops and is giving some errors.

I was wondering if I get some suggestions in troubleshooting and reducing the code

Error: In the output, position number 11 (T). I am also trying to color the last 10 characters as blue instead of red.

Any help will be greatly appreciated.
Regards


#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use Data::Dumper;

my $sequence="GGCGCAACGCTGAGCAGCTGGCGCGTCCCGCGCGGCCCCAGTTCTGCGCAGCTTCC
+CGAGGCTCCGCACCAGCCGCGCTTCTGTCCGCCTGCAGGGCATT";

############ Make array of rejected length ###############
my $seqlen = length ($sequence);
my $input = "fragments.txt";
open(K,">","temp_hash.txt");
open( my $infile, "<", $input ) || die "Check the $input $!\n";
my $old = 0;
while ( my $line = <$infile> ){
    my $gap = $line - $old;
    if ($gap > 2){
    my $start = $old+1;
        my $end = $line-1;
        print K "$start\t$end\n"; 
    }
    elsif ($gap == 2){
        my $start = $old+1;
        print K "$start\t$start\n"; 
    }
    $old = $line;
}
close ($infile);

my $line = $seqlen+1;
my $gap = $line - $old;
if ($gap > 2){
    my $start = $old+1;
    my $end = $line-1;
    print K "$start\t$end\n"; 
}
elsif ($gap == 2){
    my $start = $old+1;
    print K "$start\t$start\n"; 
}
close (K); close ($infile);


####################### Make a Hash ########################

my $tempfile = "temp_hash.txt";
open(my $file, "<", $tempfile) or die "Check the file $!";
my $cnt = 1;my %split;
while (my $line = <$file>){
    chomp $line;
    if ($line =~/(\S+)\s+(\S+)/){
        for (my $s=$1;$s<=$2;$s++){
            push @{$split{$cnt}}, $s;
        }
    }
    $cnt++;
}
close ($file); 

############# Write HTML file #############################
#my $header;
open(AA, ">fragments.html") or die $!;
print AA "<!DOCTYPE html PUBLIC \"-\/\/W3C\/\/DTD XHTML 1.0 Strict\/\/
+EN\" \"http:\/\/www.w3.org\/TR\/xhtml1\/DTD\/xhtml1-strict.dtd\">\n";
print AA "<html xmlns=\"http:\/\/www.w3.org\/1999\/xhtml\" lang=\"en\"
+ xml:lang=\"en\">\n";
print AA "<head>\n";
print AA "<title> Colored Gene Walk</title> <meta http-equiv=\"content
+-type\" content=\"text/html;charset=utf-8\"/>\n";
print AA "</head>\n";
print AA "<body>\n";
print AA "<pre class=\"monofont\">\n";
print AA "<a style=\"font-size: 12pt\">\n";
#print AA "$header\n";

for (my $pos=1;$pos<=length($sequence);$pos++){
    my $FLAG=0;my @temp=();
    foreach my $k(sort {$a <=> $b} keys %split){
        my @pos=@{$split{$k}};
        my $p1 =$pos[0];
        my $p2 =$pos[$#pos];
        if($pos == $p1){
            push (@temp, "<span style=\"color:red\">");
            for(my $p=$p1;$p<=$p2;$p++){
                my $s=substr($sequence,$p-1,1);
                push (@temp,$s);
                if($p==int($p/50)*50){
                    push (@temp,"<br />");
                }
                $FLAG=1; $pos++;
            }
            push (@temp, "</span>");
        }
    }
    if($FLAG ==0){
        my $s=substr($sequence,$pos-1,1);
        print AA "$s";
    }
    else{
        printf AA join("",@temp);
    }
    if($pos==int($pos/50)*50){
       printf  AA "<br />";
    }
}

print AA "\n";
print AA "</a>\n";
print AA "</pre>\n";
print AA "</body>\n";
print AA "</html>\n";

close (AA);
[download]

fragments.txt
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
51
52
53
54
55
56
57
58
59
60
86
87
88
89
90
91
92
93
94
95
[download]

Comment on Code optimization help and troubleshooting Select or Download Code

Replies are listed 'Best First'.

Re: Code optimization help and troubleshooting
by Athanasius (Archbishop) on Oct 07, 2014 at 07:51 UTC

Hello newtoperlprog,

... suggestions in ... reducing the code

Your code reads the input file, extracts the data you need, and stores this to a temporary file. It then constructs a hash by reading that file and extracting its data! So the first improvement is obviously to remove this unnecessary middle step, and create the hash as the input file is read. Here is my re-write of the code between “Make array of rejected length” and “Write HTML file”:

my $input = 'fragments.txt';
my %split;

open(my $infile, '<', $input)
    or die "Cannot open file '$input' for reading: $!";

build_hash($_) while <$infile>;

close $infile
    or die "Cannot close file '$input': $!";

build_hash(length($sequence) + 1);

sub build_hash
{
    use feature 'state';

    state $old     =  0;
    state $cnt     =  1;
    my   ($line)   =  @_;
    my    $gap     =  $line - $old;
    my    $start   =  $old  + 1;
    my    $end     = ($gap  > 2) ? $line - 1 : $start;
    $split{$cnt++} = [$start .. $end] if $gap >= 2;
          $old     =  $line;
}
[download]

As you can see, the rewritten code is about half as long as the original. Duplicate code has been moved into a subroutine, and variables have been given reduced scope, which will aid in debugging. (Note also that the original code closes $infile twice.)

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,

[reply]
[d/l]
[select]

Re: Code optimization help and troubleshooting
by AnomalousMonk (Archbishop) on Oct 06, 2014 at 22:33 UTC

... reducing the code ...

Instead of a million
print AA "HTML string \n";
statements, consider a here-doc:

print AA <<"EndOfHTML";
<html>blah blah blah</html>
...
<stuff>yada $scalars and @arrays can be interpolated in some cases yad
+a</stuff>
EndOfHTML
[download]

Quote and Quote-like Operators

Quote-Like Operators

<<EOF

(update: just before the Gory details of parsing quoted constructs section)

[reply]
[d/l]
[select]

Re: Code optimization help and troubleshooting
by AnomalousMonk (Archbishop) on Oct 06, 2014 at 21:20 UTC

My HTML-fu is nil, but as to coloring the last 1-10 characters in a string blue and all preceding characters red, the following might be useful. It seems to work if I re-direct to a x.html file and look at it in IE. I leave to you putting the <br/> tags where you want them and other such trivial details.

c:\@Work\Perl\monks>perl -wMstrict -le
"my $red  = '<span style=\"color:red\">';
 my $blue = '<span style=\"color:blue\">';
 my $post = '</span>';
 ;;
 my $bases = qr{ [ATCG] }xms;
 ;;
 for my $s (qw(
   GGCGCAACGCTGAGGCCCCAGTTCTGCGCAGCTGCAGGGCATT
   ATCGATCGATC
   ATCGATCGAT
   A
   )) {
   my $seq = $s;
   $seq =~ s{ ($bases{1,10} \z) }{$blue$1$post}xms;
   $seq = qq{$red$seq$post};
   print $seq;
   }
"
<span style="color:red">GGCGCAACGCTGAGGCCCCAGTTCTGCGCAGCT<span style="
+color:blue">GCAGGGCATT</span></span>
<span style="color:red">A<span style="color:blue">TCGATCGATC</span></s
+pan>
<span style="color:red"><span style="color:blue">ATCGATCGAT</span></sp
+an>
<span style="color:red"><span style="color:blue">A</span></span>
[download]

[reply]
[d/l]
[select]

Re^2: Code optimization help and troubleshooting

by AnomalousMonk (Archbishop) on Oct 06, 2014 at 22:09 UTC

Ok, here's breaking into groups of 50 (had to make some changes to HTML-ize it for Perlmonks display, but you should get the idea):

c:\@Work\Perl\monks>perl -wMstrict -le
"my $red  = '<font color=\"red\">';
 my $blue = '<font color=\"blue\">';
 my $post = '</font>';
 my $brk  = qq{<br> \n};
 ;;
 my $base = qr{ [ATCG] }xms;
 ;;
 my $sequence = 'GGCGCAACGCTGAGCAGCTGGCGCGTCCCGCGCGGCCCCAGTTCTGCGCAGCT
+TCCCGAGGCTCCGCACCAGCCGCGCTTCTGCCGCCTGCAGGGCATT';
 ;;
 my $colorized_seq =
   join '',
   map qq{$red$_$post$brk},
   map { s{ ($base{1,10} \z) }{$blue$1$post}xms;  $_; }
   $sequence =~ m{ $base{1,50} }xmsg
   ;
 print qq{[[ $colorized_seq]]};
"
[[ <font color="red">GGCGCAACGCTGAGCAGCTGGCGCGTCCCGCGCGGCCCCA<font col
+or="blue">GTTCTGCGCA</font></font><br>
<font color="red">GCTTCCCGAGGCTCCGCACCAGCCGCGCTTCTGCCGCCT<font color="
+blue">GCAGGGCATT</font></font>
<br>
]]
[download]

GGCGCAACGCTGAGCAGCTGGCGCGTCCCGCGCGGCCCCAGTTCTGCGCA

GCTTCCCGAGGCTCCGCACCAGCCGCGCTTCTGCCGCCTGCAGGGCATT

[reply]
[d/l]

Re: Code optimization help and troubleshooting
by roboticus (Chancellor) on Oct 07, 2014 at 18:18 UTC

newtoperlprog:

You're working on the sequence on a character-by-character basis, and trying to solve multiple subproblems at once. You're trying to figure out whether to turn the color to black, blue or red, and simultaneously trying to add line breaks every 50 characters. The logic to do that can be complicated, and there's too much to keep track of to debug it simply.

I suggest you find a way to break the problem down into pieces that are independent, and easier to debug in isolation. There are plenty of ways to do it, but I came up with the following method: First, build a list of operations to you want to perform at varying places in the string, and then write some code to perform each operation. To do so, I built a list called @ops, where each entry contains a column number and the operation I want to perform. Doing it this way, you can look at the list of operations to see if you've got it right. Then, you can generate your final string and see if it's correct.

The general approach I took is this:

BUILD OPERATION LIST:
- Given the length of the string, add 'BREAK' operations every $break characters to the @ops list.
- Reading the 'fragments.txt' file, add 'RED' and 'BLACK' operations at the appropriate locations (similar to how you were building your hash).
- Remove all color operations after 10 chars before the end of the string.
- Add a 'BLUE' operation 10 chars before the end of the string.
- Add a 'DONE' operation at the end of the string.
- Sort the list by column number
BUILD HTML STRING:
- While we have operations on the list:
  - Print the substring from last position to the position for the operation
  - Perform the operation:
    - BREAK: insert a line break
    - BLUE: insert a color=blue span
    - RED: insert a color=red span
    - BLACK: end current span
    - DONE: end the pre formatted string

The code I came up with to do the job is:

#!/usr/bin/perl
use strict;
use warnings;

# Turn these to 0 to turn off debug display, or delete the appropriate
+ lines below
my $dbg_OPS_LIST=1;
my $dbg_OUTPUT=1;

my $sequence="GGCGCAACGCTGAGCAGCTGGCGCGTCCCGCGCGGCCCCAGTTCTGCGCAGCTTCC
+CGAGGCTCCGCACCAGCCGCGCTTCTGTCCGCCTGCAGGGCATT";

# Break output after every Nth character
my $break = 50;

# Array of operations to perform at particular character positions in 
+the sequence
my @ops;

# Add line breaks to our operations list
{
    my $i=1;
    while (length($sequence) > $i*$break+1) {
        push @ops, [ $i*$break+1, 'BREAK' ];
        ++$i;
    }
}

### Build color-switch operations:  Each column position in the input
### file must be RED, all others should be black.
my $seqlen = length ($sequence);
my $input = "pm1103009.data";
open( my $infile, "<", $input ) || die "Check the $input $!\n";
my $old = -1;
while ( my $line = <$infile> + 0 ){

    if ($line > $old+1) {
        # skip sequential numbers (don't need to switch to 'red' on
        # *every* character in reject sequence)

        # add "switch to black" operation (unless we're still at start
+)
        push @ops, [ $old, 'BLACK' ] unless $old < 0;

        push @ops, [ $line, 'RED' ];
    }
    $old = $line;
}
close ($infile);

### remove any color switches after the switch to blue
@ops = grep {
            # Keep all ops before 10 chars before the end
            $_->[0] < length($sequence)-10
           
            # Keep only 'BREAK' (noncolor) entries from then on
            or $_->[1] eq 'BREAK'
        } @ops;

### Add operation to switch to BLUE at 10 chars before the end
push @ops, [ length($sequence)-10, 'BLUE' ];

### Add DONE operation at end
push @ops, [ length($sequence), 'DONE' ];

### Sort the operations by column
@ops = sort { $a->[0] <=> $b->[0] } @ops;

### Display our operation list
if ($dbg_OPS_LIST) {
    print "\nFinal operations list:\n";
    for my $r (@ops) {
        printf "% 3u: <%s>\n", $r->[0], $r->[1];
    }
    print "\n";
}

### Now build our HTML file

open(AA, ">fragments.html") or die $!;
print AA q(
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
                      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.d
+td">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
    <head>
        <title>Colored Gene Walk</title>
        <meta http-equiv="content-type" content="text/html;charset=utf
+-8"/>
    </head>
    <body>
        <pre class=\"monofont\">
        <a style="font-size: 12pt">
);

my $pos = 1;
while ($pos < length($sequence)) {
    # Get the next operation from the list:
    my ($col, $op) = @{shift @ops};

    $dbg_OUTPUT && print "LOOP: POS:$pos, COL:$col, OP:$op\n";

    # Print the bit of the sequence up to the operation (if any)
    if ($pos < $col) {
        my $len = $col-$pos;
        my $subseq = substr($sequence, $pos-1, $len);
        $dbg_OUTPUT && print "\tTEXT($pos-$col): <$subseq>\n";
        print AA $subseq;
    }

    # perform the operation
    if ($op eq 'BLACK') {
        $dbg_OUTPUT && print "\tBLACK($col)\n";
        print AA '</span>';
    }
    elsif ($op eq 'BREAK') {
        $dbg_OUTPUT && print "\tBREAK($col)\n";
        print AA '<br/>';
    }
    elsif ($op eq 'BLUE') {
        $dbg_OUTPUT && print "\tBLUE($col)\n";
        print AA '<span style="color:blue">';
    }
    elsif ($op eq 'RED') {
        $dbg_OUTPUT && print "\tRED($col)\n";
        print AA '<span style="color:red">';
    }
    elsif ($op eq 'DONE') {
        $dbg_OUTPUT && print "END LOOP\n";
        print AA '</a></pre></body></html>', "\n";
    }
    $pos = $col;
}
[download]

I left in some debugging code, just in case it helps. Notice that I tried to make each section stand on its own, so you can look at what's happening in each step. Running it as-is gives me:

$ perl pm1103009.pl 

Final operations list:
 11: <RED>
 25: <BLACK>
 51: <BREAK>
 51: <RED>
 60: <BLACK>
 86: <RED>
 90: <BLUE>
100: <DONE>

LOOP: POS:1, COL:11, OP:RED
    TEXT(1-11): <GGCGCAACGC>
    RED(11)
LOOP: POS:11, COL:25, OP:BLACK
    TEXT(11-25): <TGAGCAGCTGGCGC>
    BLACK(25)
LOOP: POS:25, COL:51, OP:BREAK
    TEXT(25-51): <GTCCCGCGCGGCCCCAGTTCTGCGCA>
    BREAK(51)
LOOP: POS:51, COL:51, OP:RED
    RED(51)
LOOP: POS:51, COL:60, OP:BLACK
    TEXT(51-60): <GCTTCCCGA>
    BLACK(60)
LOOP: POS:60, COL:86, OP:RED
    TEXT(60-86): <GGCTCCGCACCAGCCGCGCTTCTGTC>
    RED(86)
LOOP: POS:86, COL:90, OP:BLUE
    TEXT(86-90): <CGCC>
    BLUE(90)
LOOP: POS:90, COL:100, OP:DONE
    TEXT(90-100): <TGCAGGGCAT>
END LOOP
[download]

The code isn't any shorter than yours, but it should be easy to debug. Additionally, since it's broken into sections, you can use the same operations list to convert it to PDF, a Word document, or anything else, if you like. You can also add new operation types (add bold, hyperlinks or anything else).

I intended to go into more discussion, but I ran out of time (I'm at work right now), so if you're interested, examine the code. If you have any specific questions, follow up with a question, and I'll get back to you when I can.

...roboticus

When your only tool is a hammer, all problems look like your thumb.

[reply]
[d/l]
[select]

Re: Code optimization help and troubleshooting
by newtoperlprog (Sexton) on Oct 07, 2014 at 15:55 UTC

Dear AnomalousMonk and Athanasius

Thank you for your suggestions and comments, I have incorporated suggestions and it really shortened the code.

Could I get some help in the debugging, somehow, in the sequence after 1-10 character the 11th character is getting truncated. I checked the code for substr and couldn't figure out why this is happening.

Again thank you for your help.


#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use Data::Dumper;

my $sequence="GGCGCAACGCTGAGCAGCTGGCGCGTCCCGCGCGGCCCCAGTTCTGCGCAGCTTCC
+CGAGGCTCCGCACCAGCCGCGCTTCTGTCCGCCTGCAGGGCATT";

############ Make hash of rejected length ###############
my $input = "fragments.txt";
my %split;

open (my $infile, "<", $input) or die "Cannot open file '$input' for r
+eading: $!";
build_hash($_) while <$infile>;
close ($infile) or die "Cannot close file '$input': $!";

build_hash(length($sequence) + 1);
sub build_hash{
    use feature 'state';
    state $old = 0;
    state $cnt = 1;
    my ($line) = @_;
    my $gap = $line - $old;
    my $start = $old + 1;
    my $end = ($gap > 2) ? $line - 1 : $start;
    $split{$cnt++} = [$start .. $end] if $gap >= 2;
    $old = $line;
}
############# Write HTML file #############################
#my $header;
open (AA, ">fragments.html") or die "Cannot open file:$!";
print AA <<"EndOfHTML";
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w
+3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"
+>
    <head>
    <title> Colored Gene Walk</title> <meta http-equiv="content-type" 
+content="text/html;charset=utf-8"/>
    </head>
    <body>
    <pre class="monofont">
    <a style="font-size: 12pt">
EndOfHTML
#print AA "$header\n";
my $red  = '<span style="color:red">';
my $span = '</span>';
my $brk  = qq{<br />\n};

    for (my $pos=1;$pos<=length($sequence);$pos++){
    my $FLAG=0;my @temp=();
    foreach my $k(sort {$a <=> $b} keys %split){
        my @pos=@{$split{$k}};
        my $p1 =$pos[0];
        my $p2 =$pos[$#pos];
        if($pos == $p1){
        push (@temp, $red);
        for(my $p=$p1;$p<=$p2;$p++){
            my $s=substr($sequence,$p-1,1);
            push (@temp,$s);
            if($p==int($p/50)*50){
            push (@temp, $brk);
            }
            $FLAG=1; $pos++;
        }
        push (@temp, $span);
        }
    }
    if($FLAG ==0){
        my $s=substr($sequence,$pos-1,1);
        print AA "$s";
    }
    else{
        printf AA join("",@temp);
    }
    if($pos==int($pos/50)*50){
        printf  AA "$brk";
    }
}

print AA <<"EndOfHTML";
</a>
    </pre>
    </body>
    </html>
EndOfHTML
    close (AA);
[download]

[reply]
[d/l]

Re^2: Code optimization help and troubleshooting

by Athanasius (Archbishop) on Oct 08, 2014 at 03:31 UTC

Could I get some help in the debugging, somehow, in the sequence after 1-10 character the 11th character is getting truncated.

The relevant part of the code has the following structure:

for (my $pos = 1; $pos <= length($sequence); $pos++) {
    ...
    foreach my $k (...) {
        ...
        if ($pos == $p1) {
            ...
            for (my $p = ...) {
                ...
                $pos++;
            }
        }
    }
    ...
}
[download]

Incrementing the same variable (viz. $pos) in two places like this is asking for trouble. And, sure enough, within the inner for loop there are one too many increments, producing a typical off-by-one error. You can fix this by decrementing $pos immediately after the inner loop has completed:

if ($pos == $p1)
{
    $FLAG = 1;
    push @temp, $red;

    for my $p ($p1 .. $p2)
    {
        my $s = substr($sequence, $p - 1, 1);
        push @temp, $s;
        push @temp, $brk if $p == int($p / 50) * 50;

        $pos++;
    }

    $pos--;    # <-- ADD THIS

    push @temp, $span;
}
[download]

Note that it is not necessary to set $FLAG each time through the loop.

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,

[reply]
[d/l]
[select]