
I have access to the following script:

#!/usr/bin/perl
$filename = "sample.txt";
open (TEXT, "<", $filename) || die "Cannot open $filename: $!";
$line = " ";
$count = 0;
for $n (5..50)
   {
   $re = qr /[CAGT]{$n}/;
   $regexes[$n-5] = $re;
   }
NEXTLINE: while ($count < 1000)
   {
   defined($line = <TEXT>) || last;   # stop at end of file
   $count++;
   foreach my $value (@regexes)
      {
      $start = 0;
      while ($line =~ /$value/g)
         {
         $endline = $';
         $match = $&;
         $revmatch = reverse($match);
         $revmatch =~ tr/CAGT/GTCA/;
         if ($endline =~ /^([CAGT]{0,15})($revmatch)/)
            {
            $start = 1;
            $palindrome = $match . "*" . $1 . "*" . $2;
            $palhash{$palindrome}++;
            }
         }
      if ($start == 0)
         {
         goto NEXTLINE;
         }
      }
   }
open(my $out, ">", "/DIR/results.txt") || die "Cannot write /DIR/results.txt: $!";
close TEXT;
while(($key, $value) = each (%palhash))
   {
   print $out "$key => $value\n";
   }
exit;

The script identifies DNA palindromes and writes them to a file. A biological palindrome in DNA is a sequence that reads identically on both strands when each is read in the same direction (5' to 3'); in other words, the sequence equals its own reverse complement:

http://imageshack.com/a/img835/2787/de98.png (as shown by the blue/red regions)
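To make the definition concrete, here is a minimal sketch of the check. The sub names `revcomp` and `is_palindrome` are made up for illustration, and the code assumes upper-case input containing only A, C, G and T.

```perl
use strict;
use warnings;

# Reverse complement of a DNA string (assumes upper-case A, C, G, T only).
sub revcomp {
    my ($seq) = @_;
    my $rc = reverse $seq;      # scalar context: reverses the string
    $rc =~ tr/ACGT/TGCA/;       # swap each base for its complement
    return $rc;
}

# A biological palindrome equals its own reverse complement.
sub is_palindrome {
    my ($seq) = @_;
    return $seq eq revcomp($seq);
}

for my $seq (qw(GAATTC GATTACA)) {
    printf "%s: %s\n", $seq, is_palindrome($seq) ? "palindrome" : "not a palindrome";
}
```

Because every base has to pair with a different base, a sequence of odd length can never pass this check.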

I feel like the above script is rather messy and the output is confusing and unclear. I was wondering if anybody could offer some help, tips, guidance or code itself to accomplish the following:

(1) Identify palindromic DNA sequences

(2) Be able to specify a minimum and maximum length of match

(3) Accept an optional parameter so that only palindromes containing a given sequence, for example 'GATC', are printed to the output file. If the parameter is left blank, the program should print every palindrome it finds

(4) Take DNA sequences of only one strand (not both) as input and output the full palindromic sequence identified for that single strand. For example, in the image above the input would be:

AGAGGTCAGTCTGCATCGTATCGATCGTCGACGATCGATACGATGCAGACTGACGAGAG

The program would then calculate the other strand's sequence, check whether it contains any palindromes and, if so, output:

GTCAGTCTGCATCGTATCGATCGTCGACGATCGATACGATGCAGACTGAC
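One possible way to cover requirements (1) to (4) is sketched below. It does not reuse the regex loop from the script above; it swaps in a centre-expansion search instead. The sub name `find_palindromes`, its argument order and the returned hash keys are all hypothetical. It also assumes perfect palindromes only (no spacer between the two arms), reports only the longest palindrome around each centre, and needs Perl 5.10 or later for the `//` operator.

```perl
use strict;
use warnings;

my %COMP = (A => 'T', C => 'G', G => 'C', T => 'A');

# Return the longest palindrome around each centre of $seq whose length is
# within [$min, $max]. If $motif is given and non-empty, keep only
# palindromes that contain it.
sub find_palindromes {
    my ($seq, $min, $max, $motif) = @_;
    $seq = uc $seq;
    my @hits;
    # A reverse-complement palindrome has even length, so its centre lies
    # between two bases; $c is the index of the base right of the centre.
    for my $c (1 .. length($seq) - 1) {
        my ($l, $r) = ($c - 1, $c);
        # Grow outwards while the left base pairs with the right base.
        while ($l >= 0 && $r < length($seq)
            && ($COMP{ substr($seq, $l, 1) } // '') eq substr($seq, $r, 1)) {
            $l--;
            $r++;
        }
        my $len = $r - $l - 1;
        next if $len < $min || $len > $max;
        my $pal = substr($seq, $l + 1, $len);
        next if defined $motif && length $motif && index($pal, uc $motif) < 0;
        push @hits, { start => $l + 1, length => $len, seq => $pal };
    }
    return @hits;
}

my $input = "AGAGGTCAGTCTGCATCGTATCGATCGTCGACGATCGATACGATGCAGACTGACGAGAG";
for my $hit (find_palindromes($input, 10, 60, 'GATC')) {
    print join("\t", $hit->{start}, $hit->{length}, $hit->{seq}), "\n";
}
```

With a minimum length of 10, the example input should yield the 50-base sequence shown above, starting at 0-based offset 4. Leaving out the fourth argument (or passing an empty string) disables the motif filter. Palindromes longer than `$max` are skipped rather than trimmed, which is a design choice that may not suit every use.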

Many thanks!

In reply to Finding biological palindromes by TJCooper
