newtoperlprog has asked for the wisdom of the Perl Monks concerning the following question:

Dear PerlMonks

I am trying to selectively color a sequence of 100 characters and writing a html file.I have written the code and but its somewhat too many loops and is giving some errors.

I was wondering if I get some suggestions in troubleshooting and reducing the code

.

Error: In the output, position number 11 (T). I am also trying to color the last 10 characters as blue instead of red.

Any help will be greatly appreciated.
Regards

#!/usr/bin/perl use strict; use warnings; use LWP::Simple; use Data::Dumper; my $sequence="GGCGCAACGCTGAGCAGCTGGCGCGTCCCGCGCGGCCCCAGTTCTGCGCAGCTTCC +CGAGGCTCCGCACCAGCCGCGCTTCTGTCCGCCTGCAGGGCATT"; ############ Make array of rejected length ############### my $seqlen = length ($sequence); my $input = "fragments.txt"; open(K,">","temp_hash.txt"); open( my $infile, "<", $input ) || die "Check the $input $!\n"; my $old = 0; while ( my $line = <$infile> ){ my $gap = $line - $old; if ($gap > 2){ my $start = $old+1; my $end = $line-1; print K "$start\t$end\n"; } elsif ($gap == 2){ my $start = $old+1; print K "$start\t$start\n"; } $old = $line; } close ($infile); my $line = $seqlen+1; my $gap = $line - $old; if ($gap > 2){ my $start = $old+1; my $end = $line-1; print K "$start\t$end\n"; } elsif ($gap == 2){ my $start = $old+1; print K "$start\t$start\n"; } close (K); close ($infile); ####################### Make a Hash ######################## my $tempfile = "temp_hash.txt"; open(my $file, "<", $tempfile) or die "Check the file $!"; my $cnt = 1;my %split; while (my $line = <$file>){ chomp $line; if ($line =~/(\S+)\s+(\S+)/){ for (my $s=$1;$s<=$2;$s++){ push @{$split{$cnt}}, $s; } } $cnt++; } close ($file); ############# Write HTML file ############################# #my $header; open(AA, ">fragments.html") or die $!; print AA "<!DOCTYPE html PUBLIC \"-\/\/W3C\/\/DTD XHTML 1.0 Strict\/\/ +EN\" \"http:\/\/www.w3.org\/TR\/xhtml1\/DTD\/xhtml1-strict.dtd\">\n"; print AA "<html xmlns=\"http:\/\/www.w3.org\/1999\/xhtml\" lang=\"en\" + xml:lang=\"en\">\n"; print AA "<head>\n"; print AA "<title> Colored Gene Walk</title> <meta http-equiv=\"content +-type\" content=\"text/html;charset=utf-8\"/>\n"; print AA "</head>\n"; print AA "<body>\n"; print AA "<pre class=\"monofont\">\n"; print AA "<a style=\"font-size: 12pt\">\n"; #print AA "$header\n"; for (my $pos=1;$pos<=length($sequence);$pos++){ my $FLAG=0;my @temp=(); foreach my $k(sort {$a <=> $b} keys %split){ my @pos=@{$split{$k}}; my $p1 =$pos[0]; my $p2 =$pos[$#pos]; if($pos == $p1){ push (@temp, "<span style=\"color:red\">"); for(my $p=$p1;$p<=$p2;$p++){ my $s=substr($sequence,$p-1,1); push (@temp,$s); if($p==int($p/50)*50){ push (@temp,"<br />"); } $FLAG=1; $pos++; } push (@temp, "</span>"); } } if($FLAG ==0){ my $s=substr($sequence,$pos-1,1); print AA "$s"; } else{ printf AA join("",@temp); } if($pos==int($pos/50)*50){ printf AA "<br />"; } } print AA "\n"; print AA "</a>\n"; print AA "</pre>\n"; print AA "</body>\n"; print AA "</html>\n"; close (AA);
fragments.txt 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 51 52 53 54 55 56 57 58 59 60 86 87 88 89 90 91 92 93 94 95

Replies are listed 'Best First'.
Re: Code optimization help and troubleshooting
by Athanasius (Archbishop) on Oct 07, 2014 at 07:51 UTC

    Hello newtoperlprog,

    ... suggestions in ... reducing the code

    Your code reads the input file, extracts the data you need, and stores this to a temporary file. It then constructs a hash by reading that file and extracting its data! So the first improvement is obviously to remove this unnecessary middle step, and create the hash as the input file is read. Here is my re-write of the code between “Make array of rejected length” and “Write HTML file”:

    my $input = 'fragments.txt'; my %split; open(my $infile, '<', $input) or die "Cannot open file '$input' for reading: $!"; build_hash($_) while <$infile>; close $infile or die "Cannot close file '$input': $!"; build_hash(length($sequence) + 1); sub build_hash { use feature 'state'; state $old = 0; state $cnt = 1; my ($line) = @_; my $gap = $line - $old; my $start = $old + 1; my $end = ($gap > 2) ? $line - 1 : $start; $split{$cnt++} = [$start .. $end] if $gap >= 2; $old = $line; }

    As you can see, the rewritten code is about half as long as the original. Duplicate code has been moved into a subroutine, and variables have been given reduced scope, which will aid in debugging. (Note also that the original code closes $infile twice.)

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re: Code optimization help and troubleshooting
by AnomalousMonk (Archbishop) on Oct 06, 2014 at 22:33 UTC
Re: Code optimization help and troubleshooting
by AnomalousMonk (Archbishop) on Oct 06, 2014 at 21:20 UTC

    My HTML-fu is nil, but as to coloring the last 1-10 characters in a string blue and all preceding characters red, the following might be useful. It seems to work if I re-direct to a  x.html file and look at it in IE. I leave to you putting the  <br/> tags where you want them and other such trivial details.

    c:\@Work\Perl\monks>perl -wMstrict -le "my $red = '<span style=\"color:red\">'; my $blue = '<span style=\"color:blue\">'; my $post = '</span>'; ;; my $bases = qr{ [ATCG] }xms; ;; for my $s (qw( GGCGCAACGCTGAGGCCCCAGTTCTGCGCAGCTGCAGGGCATT ATCGATCGATC ATCGATCGAT A )) { my $seq = $s; $seq =~ s{ ($bases{1,10} \z) }{$blue$1$post}xms; $seq = qq{$red$seq$post}; print $seq; } " <span style="color:red">GGCGCAACGCTGAGGCCCCAGTTCTGCGCAGCT<span style=" +color:blue">GCAGGGCATT</span></span> <span style="color:red">A<span style="color:blue">TCGATCGATC</span></s +pan> <span style="color:red"><span style="color:blue">ATCGATCGAT</span></sp +an> <span style="color:red"><span style="color:blue">A</span></span>

      Ok, here's breaking into groups of 50 (had to make some changes to HTML-ize it for Perlmonks display, but you should get the idea):

      c:\@Work\Perl\monks>perl -wMstrict -le "my $red = '<font color=\"red\">'; my $blue = '<font color=\"blue\">'; my $post = '</font>'; my $brk = qq{<br> \n}; ;; my $base = qr{ [ATCG] }xms; ;; my $sequence = 'GGCGCAACGCTGAGCAGCTGGCGCGTCCCGCGCGGCCCCAGTTCTGCGCAGCT +TCCCGAGGCTCCGCACCAGCCGCGCTTCTGCCGCCTGCAGGGCATT'; ;; my $colorized_seq = join '', map qq{$red$_$post$brk}, map { s{ ($base{1,10} \z) }{$blue$1$post}xms; $_; } $sequence =~ m{ $base{1,50} }xmsg ; print qq{[[ $colorized_seq]]}; " [[ <font color="red">GGCGCAACGCTGAGCAGCTGGCGCGTCCCGCGCGGCCCCA<font col +or="blue">GTTCTGCGCA</font></font><br> <font color="red">GCTTCCCGAGGCTCCGCACCAGCCGCGCTTCTGCCGCCT<font color=" +blue">GCAGGGCATT</font></font> <br> ]]
      Which looks like:
      GGCGCAACGCTGAGCAGCTGGCGCGTCCCGCGCGGCCCCAGTTCTGCGCA
      GCTTCCCGAGGCTCCGCACCAGCCGCGCTTCTGCCGCCTGCAGGGCATT

Re: Code optimization help and troubleshooting
by roboticus (Chancellor) on Oct 07, 2014 at 18:18 UTC

    newtoperlprog:

    You're working on the sequence on a character-by-character basis, and trying to solve multiple subproblems at once. You're trying to figure out whether to turn the color to black, blue or red, and simultaneously trying to add line breaks every 50 characters. The logic to do that can be complicated, and there's too much to keep track of to debug it simply.

    I suggest you find a way to break the problem down into pieces that are independent, and easier to debug in isolation. There are plenty of ways to do it, but I came up with the following method: First, build a list of operations to you want to perform at varying places in the string, and then write some code to perform each operation. To do so, I built a list called @ops, where each entry contains a column number and the operation I want to perform. Doing it this way, you can look at the list of operations to see if you've got it right. Then, you can generate your final string and see if it's correct.

    The general approach I took is this:

    • BUILD OPERATION LIST:
      • Given the length of the string, add 'BREAK' operations every $break characters to the @ops list.
      • Reading the 'fragments.txt' file, add 'RED' and 'BLACK' operations at the appropriate locations (similar to how you were building your hash).
      • Remove all color operations after 10 chars before the end of the string.
      • Add a 'BLUE' operation 10 chars before the end of the string.
      • Add a 'DONE' operation at the end of the string.
      • Sort the list by column number
    • BUILD HTML STRING:
      • While we have operations on the list:
        • Print the substring from last position to the position for the operation
        • Perform the operation:
          • BREAK: insert a line break
          • BLUE: insert a color=blue span
          • RED: insert a color=red span
          • BLACK: end current span
          • DONE: end the pre formatted string

    The code I came up with to do the job is:

    #!/usr/bin/perl use strict; use warnings; # Turn these to 0 to turn off debug display, or delete the appropriate + lines below my $dbg_OPS_LIST=1; my $dbg_OUTPUT=1; my $sequence="GGCGCAACGCTGAGCAGCTGGCGCGTCCCGCGCGGCCCCAGTTCTGCGCAGCTTCC +CGAGGCTCCGCACCAGCCGCGCTTCTGTCCGCCTGCAGGGCATT"; # Break output after every Nth character my $break = 50; # Array of operations to perform at particular character positions in +the sequence my @ops; # Add line breaks to our operations list { my $i=1; while (length($sequence) > $i*$break+1) { push @ops, [ $i*$break+1, 'BREAK' ]; ++$i; } } ### Build color-switch operations: Each column position in the input ### file must be RED, all others should be black. my $seqlen = length ($sequence); my $input = "pm1103009.data"; open( my $infile, "<", $input ) || die "Check the $input $!\n"; my $old = -1; while ( my $line = <$infile> + 0 ){ if ($line > $old+1) { # skip sequential numbers (don't need to switch to 'red' on # *every* character in reject sequence) # add "switch to black" operation (unless we're still at start +) push @ops, [ $old, 'BLACK' ] unless $old < 0; push @ops, [ $line, 'RED' ]; } $old = $line; } close ($infile); ### remove any color switches after the switch to blue @ops = grep { # Keep all ops before 10 chars before the end $_->[0] < length($sequence)-10 # Keep only 'BREAK' (noncolor) entries from then on or $_->[1] eq 'BREAK' } @ops; ### Add operation to switch to BLUE at 10 chars before the end push @ops, [ length($sequence)-10, 'BLUE' ]; ### Add DONE operation at end push @ops, [ length($sequence), 'DONE' ]; ### Sort the operations by column @ops = sort { $a->[0] <=> $b->[0] } @ops; ### Display our operation list if ($dbg_OPS_LIST) { print "\nFinal operations list:\n"; for my $r (@ops) { printf "% 3u: <%s>\n", $r->[0], $r->[1]; } print "\n"; } ### Now build our HTML file open(AA, ">fragments.html") or die $!; print AA q( <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.d +td"> <html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"> <head> <title>Colored Gene Walk</title> <meta http-equiv="content-type" content="text/html;charset=utf +-8"/> </head> <body> <pre class=\"monofont\"> <a style="font-size: 12pt"> ); my $pos = 1; while ($pos < length($sequence)) { # Get the next operation from the list: my ($col, $op) = @{shift @ops}; $dbg_OUTPUT && print "LOOP: POS:$pos, COL:$col, OP:$op\n"; # Print the bit of the sequence up to the operation (if any) if ($pos < $col) { my $len = $col-$pos; my $subseq = substr($sequence, $pos-1, $len); $dbg_OUTPUT && print "\tTEXT($pos-$col): <$subseq>\n"; print AA $subseq; } # perform the operation if ($op eq 'BLACK') { $dbg_OUTPUT && print "\tBLACK($col)\n"; print AA '</span>'; } elsif ($op eq 'BREAK') { $dbg_OUTPUT && print "\tBREAK($col)\n"; print AA '<br/>'; } elsif ($op eq 'BLUE') { $dbg_OUTPUT && print "\tBLUE($col)\n"; print AA '<span style="color:blue">'; } elsif ($op eq 'RED') { $dbg_OUTPUT && print "\tRED($col)\n"; print AA '<span style="color:red">'; } elsif ($op eq 'DONE') { $dbg_OUTPUT && print "END LOOP\n"; print AA '</a></pre></body></html>', "\n"; } $pos = $col; }

    I left in some debugging code, just in case it helps. Notice that I tried to make each section stand on its own, so you can look at what's happening in each step. Running it as-is gives me:

    $ perl pm1103009.pl Final operations list: 11: <RED> 25: <BLACK> 51: <BREAK> 51: <RED> 60: <BLACK> 86: <RED> 90: <BLUE> 100: <DONE> LOOP: POS:1, COL:11, OP:RED TEXT(1-11): <GGCGCAACGC> RED(11) LOOP: POS:11, COL:25, OP:BLACK TEXT(11-25): <TGAGCAGCTGGCGC> BLACK(25) LOOP: POS:25, COL:51, OP:BREAK TEXT(25-51): <GTCCCGCGCGGCCCCAGTTCTGCGCA> BREAK(51) LOOP: POS:51, COL:51, OP:RED RED(51) LOOP: POS:51, COL:60, OP:BLACK TEXT(51-60): <GCTTCCCGA> BLACK(60) LOOP: POS:60, COL:86, OP:RED TEXT(60-86): <GGCTCCGCACCAGCCGCGCTTCTGTC> RED(86) LOOP: POS:86, COL:90, OP:BLUE TEXT(86-90): <CGCC> BLUE(90) LOOP: POS:90, COL:100, OP:DONE TEXT(90-100): <TGCAGGGCAT> END LOOP

    The code isn't any shorter than yours, but it should be easy to debug. Additionally, since it's broken into sections, you can use the same operations list to convert it to PDF, a Word document, or anything else, if you like. You can also add new operation types (add bold, hyperlinks or anything else).

    I intended to go into more discussion, but I ran out of time (I'm at work right now), so if you're interested, examine the code. If you have any specific questions, follow up with a question, and I'll get back to you when I can.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: Code optimization help and troubleshooting
by newtoperlprog (Sexton) on Oct 07, 2014 at 15:55 UTC

    Dear AnomalousMonk and Athanasius

    Thank you for your suggestions and comments, I have incorporated suggestions and it really shortened the code.

    Could I get some help in the debugging, somehow, in the sequence after 1-10 character the 11th character is getting truncated. I checked the code for substr and couldn't figure out why this is happening.

    and I am trying to color the last 10 (91-100) characters in blue while reading in for loop.

    Again thank you for your help.

    #!/usr/bin/perl use strict; use warnings; use LWP::Simple; use Data::Dumper; my $sequence="GGCGCAACGCTGAGCAGCTGGCGCGTCCCGCGCGGCCCCAGTTCTGCGCAGCTTCC +CGAGGCTCCGCACCAGCCGCGCTTCTGTCCGCCTGCAGGGCATT"; ############ Make hash of rejected length ############### my $input = "fragments.txt"; my %split; open (my $infile, "<", $input) or die "Cannot open file '$input' for r +eading: $!"; build_hash($_) while <$infile>; close ($infile) or die "Cannot close file '$input': $!"; build_hash(length($sequence) + 1); sub build_hash{ use feature 'state'; state $old = 0; state $cnt = 1; my ($line) = @_; my $gap = $line - $old; my $start = $old + 1; my $end = ($gap > 2) ? $line - 1 : $start; $split{$cnt++} = [$start .. $end] if $gap >= 2; $old = $line; } ############# Write HTML file ############################# #my $header; open (AA, ">fragments.html") or die "Cannot open file:$!"; print AA <<"EndOfHTML"; <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w +3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en" +> <head> <title> Colored Gene Walk</title> <meta http-equiv="content-type" +content="text/html;charset=utf-8"/> </head> <body> <pre class="monofont"> <a style="font-size: 12pt"> EndOfHTML #print AA "$header\n"; my $red = '<span style="color:red">'; my $span = '</span>'; my $brk = qq{<br />\n}; for (my $pos=1;$pos<=length($sequence);$pos++){ my $FLAG=0;my @temp=(); foreach my $k(sort {$a <=> $b} keys %split){ my @pos=@{$split{$k}}; my $p1 =$pos[0]; my $p2 =$pos[$#pos]; if($pos == $p1){ push (@temp, $red); for(my $p=$p1;$p<=$p2;$p++){ my $s=substr($sequence,$p-1,1); push (@temp,$s); if($p==int($p/50)*50){ push (@temp, $brk); } $FLAG=1; $pos++; } push (@temp, $span); } } if($FLAG ==0){ my $s=substr($sequence,$pos-1,1); print AA "$s"; } else{ printf AA join("",@temp); } if($pos==int($pos/50)*50){ printf AA "$brk"; } } print AA <<"EndOfHTML"; </a> </pre> </body> </html> EndOfHTML close (AA);
      Could I get some help in the debugging, somehow, in the sequence after 1-10 character the 11th character is getting truncated.

      The relevant part of the code has the following structure:

      for (my $pos = 1; $pos <= length($sequence); $pos++) { ... foreach my $k (...) { ... if ($pos == $p1) { ... for (my $p = ...) { ... $pos++; } } } ... }

      Incrementing the same variable (viz. $pos) in two places like this is asking for trouble. And, sure enough, within the inner for loop there are one too many increments, producing a typical off-by-one error. You can fix this by decrementing $pos immediately after the inner loop has completed:

      if ($pos == $p1) { $FLAG = 1; push @temp, $red; for my $p ($p1 .. $p2) { my $s = substr($sequence, $p - 1, 1); push @temp, $s; push @temp, $brk if $p == int($p / 50) * 50; $pos++; } $pos--; # <-- ADD THIS push @temp, $span; }

      Note that it is not necessary to set $FLAG each time through the loop.

      Hope that helps,

      Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,