Re: look for substrings and getting their location

It seems that a number of the posts changed the original question somewhat and, as a result, do not give full and thorough solutions. For instance, the original question states that the data are in the following format:

YBL027W
GUAUGUUUAACAGU...

Yet, a couple of the solutions begin by setting

$var = 'GUAUGUUUAACAGU...'

How does one get the line name from a solution like that? A solution that leaves the data in the original format and gives the line name, the number of matches, and their zero-based offsets is as follows:

#!/usr/bin/perl
use warnings;
use strict;
 
my $pat = 'GUAUG';
my ($line, $times, @at);
 
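while (<DATA>) {
  if (/^[CGUA]+$/) {
    # Sequence line: count the non-overlapping matches of $pat.
    $times = () = m/$pat/g;
    if ($times) {
      # Build a regex with one capture group per match; @- then holds
      # the start offset of each group ($-[0] is the whole match).
      eval('/^' . ('.*?($pat)' x $times) . '.*?$/; @at = @-;');
      shift @at;  # drop $-[0], keeping only the per-match offsets
    }
  } else {
    # Any other line is taken to be the name of the next sequence.
    ($line) = /^(\w+)$/;
  }

  # Report once both a name and at least one match are in hand.
  if ($line and $times) {
    print "$line: $times match", $times>1 ? 'es' : '  ', " at @at\n";
    $line = $times = 0;
  }
}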
 
__DATA__
YBL027W
GUAUGUUUAACAGUGAUAGUAUGUUUUGAACCUUUCACAAGAUUUAUCUUUAAAUAUGUUAUGA
BBL111C
UAUGUUUAACAGUGAUACUAAAUUUUGAACCUUUCACAAGAGUAUGGUAUGAAUAUGUUAUGAG
ABC456T
AUGUUUAACAGUGAUACUAAAUUUUGAACCUUUCACAAGAUUUAUCUUUAAAUAUGUUAUGAGU
DEF789U
UGUUUAACAGUGAUACUAAAUUUUGAACCUUUCACAAGAUUUAUCUUUAAAUAUGUUAUGAGUA
GHI012V
GUUUAACAGUGAUACUAAAUUUUGAACCUUUCACAAGAUUUAUCUUUAAAUAUGUUAUGUAUGU

Perl was created to manipulate text. A solution to a problem such as this should be compact and easy to understand.
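
In that spirit, the offsets can also be collected without the string eval. The following is only a sketch, not part of the solution above: match_offsets is a hypothetical helper name, and it assumes the same non-overlapping matching as the script. It relies on m//g in scalar context stepping through the string one match at a time, with $-[0] giving the start of the most recent match.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical helper (not in the original script): returns the zero-based
# start offsets of every non-overlapping match of $pat in $seq.
sub match_offsets {
    my ($seq, $pat) = @_;
    my @at;
    # In scalar context (as in a while condition), //g finds one match
    # per iteration; $-[0] is the start offset of the match just found.
    while ($seq =~ /$pat/g) {
        push @at, $-[0];
    }
    return @at;
}

my @at = match_offsets('GUAUGUUUAACAGUGAUAGUAUGUUUUGAACC', 'GUAUG');
print scalar(@at), " matches at @at\n";   # prints "2 matches at 0 18"
```

The match count falls out as scalar(@at), so the separate counting step is no longer needed in this variant.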

I made a few assumptions:
• All sequences comprise only C, G, U, and A. (The U suggests these are RNA; I thought DNA was CGAT. I am not a scientist but I play one on TV.)
• Matches of the search string do NOT overlap.
• The line name has at least one character that is not C, G, U, or A.
• All lines alternate between line name and sequence, with the former before the latter.
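
The non-overlap assumption matters for a pattern like GUAUG, which begins and ends with G: in GUAUGUAUG a second match starts on the last letter of the first. If overlapping matches should be counted too, one possible approach (a sketch only; overlapping_offsets is a hypothetical helper, not something from the original thread) is to wrap the pattern in a lookahead so that no characters are consumed:

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical helper: start offsets of all matches of $pat in $seq,
# including matches that overlap one another.
sub overlapping_offsets {
    my ($seq, $pat) = @_;
    my @at;
    # The lookahead (?=...) matches without consuming any characters, so
    # the next search resumes one position later, not after the match.
    while ($seq =~ /(?=$pat)/g) {
        push @at, $-[0];
    }
    return @at;
}

my @at = overlapping_offsets('GUAUGUAUG', 'GUAUG');
print "@at\n";   # prints "0 4"
```

A plain /$pat/g scan over the same string would report only offset 0.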
