in reply to look for substrings and getting their location
YBL027W
GUAUGUUUAACAGU...
Yet, a couple of the solutions begin by setting
$var = 'GUAUGUUUAACAGU...'

How does one get the line name from the solution above? Perl was created to manipulate text, so a solution to a problem such as this should be compact and easy to understand. A solution which leaves the data in the original format and gives the line name, number of matches, and their zero-based offsets is as follows:

    #!/usr/bin/perl
    use warnings;
    use strict;

    my $pat = 'GUAUG';
    my ($line, $times, @at);

    while (<DATA>) {
        if (/^[CGUA]+$/) {
            # Sequence line: count the non-overlapping matches.
            $times = () = m/$pat/g;
            if ($times) {
                # Build a regex with one capture group per match; @- then
                # holds the start offset of the whole match followed by the
                # start offset of each group.
                eval('/^' . ('.*?($pat)' x $times) . '.*?$/; @at = @-;');
                shift @at;    # drop the offset of the whole match
            }
        } else {
            # Name line: remember it for the sequence that follows.
            ($line) = /^(\w+)$/;
        }
        if ($line and $times) {
            print "$line: $times match", $times>1 ? 'es' : ' ', " at @at\n";
            $line = $times = 0;
        }
    }

    __DATA__
    YBL027W
    GUAUGUUUAACAGUGAUAGUAUGUUUUGAACCUUUCACAAGAUUUAUCUUUAAAUAUGUUAUGA
    BBL111C
    UAUGUUUAACAGUGAUACUAAAUUUUGAACCUUUCACAAGAGUAUGGUAUGAAUAUGUUAUGAG
    ABC456T
    AUGUUUAACAGUGAUACUAAAUUUUGAACCUUUCACAAGAUUUAUCUUUAAAUAUGUUAUGAGU
    DEF789U
    UGUUUAACAGUGAUACUAAAUUUUGAACCUUUCACAAGAUUUAUCUUUAAAUAUGUUAUGAGUA
    GHI012V
    GUUUAACAGUGAUACUAAAUUUUGAACCUUUCACAAGAUUUAUCUUUAAAUAUGUUAUGUAUGU

I made a few assumptions:
• All DNA sequences consist only of the characters C, G, U, and A. (I thought it was CGAT. I am not a scientist but I play one on TV.)
• Matches of the search string do NOT overlap.
• The line name has at least one character that is not C, G, U, or A.
• All lines alternate between line name and DNA sequence with the former before the latter.
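
For anyone who needs to relax the no-overlap assumption, here is a minimal sketch. It is not the code from this post: `match_offsets` and `overlapping_offsets` are hypothetical helper names, and the search string is assumed to be literal text (hence `\Q...\E`) rather than a regex. It collects offsets from `$-[0]` in a `m//g` loop instead of building a regex with `eval`.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical helper (not from the original post): return the zero-based
# offsets of every non-overlapping match of $pat in $seq.
sub match_offsets {
    my ($seq, $pat) = @_;
    my @at;
    # In scalar context m//g resumes after the previous match;
    # $-[0] is the start offset of the match just found.
    push @at, $-[0] while $seq =~ /\Q$pat\E/g;
    return @at;
}

# Hypothetical variant: wrapping the pattern in a lookahead makes each
# match zero-width, so overlapping occurrences are reported as well.
sub overlapping_offsets {
    my ($seq, $pat) = @_;
    my @at;
    push @at, $-[0] while $seq =~ /(?=\Q$pat\E)/g;
    return @at;
}

my @plain   = match_offsets('GUAUGUUUAACAGUGAUAGUAUGUU', 'GUAUG');
my @overlap = overlapping_offsets('AUAUAUA', 'AUA');
print "non-overlapping: @plain\n";
print "overlapping:     @overlap\n";
```

With the sample strings above this prints offsets 0 and 18 for GUAUG, and 0, 2 and 4 for the overlapping AUA case. Without the lookahead, the second search would report only 0 and 4.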