comment on

Recently brian_d_foy contributed a really interesting (and cathartic) meditation titled Stupid mistakes I repeatedly make. Well, this meditation here is in the same vein, except that here I'm talking about BIG mistakes. Specifically: doing a great job at solving the wrong problem. And more specifically still: focusing on the wrong problem because I'm such a Perl junkie. It happens to me all the time: I'm so enamored of the power of Perl that I overlook simpler solutions that those without Perl are forced to find. (Yes, my sin is not being sufficiently lazy.)

To build up some suspense, I'll first tell you the problem that I latched on to and set to solve. Later I'll tell you what the real problem was.

The task is to find all the matches to the motif 'gccrccaugg' in a long sequence of RNA (i.e. 'a's, 'c's, 'g's, and 'u's), subject to the following restrictions:

'r' can match either 'a' or 'g';
there can be up to 3 mismatches.

Well, I couldn't think of a simple regex to allow for 'up to 3 mismatches', but I thought it was a pretty cool problem. (Granted, Perl is the favorite language of bioinformaticians, so I'm sure that there is a Perl module somewhere, if not a dozen of them, to take care of this very problem, but I was more interested in the puzzle than in finding a ready-made solution.)

I figured that I could build a mongo regex consisting of a bazillion alternatives. Basically it was a matter of generating all the subsets of up to length 3 of offsets in the original sequence, and then generating the individual alternatives by replacing the character at each offset with the string '[acgu]'. Then I could join them with a '|', and run the resulting string through qr//. The business with 'r' matching 'a' or 'g' is just a minor wrinkle that can easily be accommodated into this strategy.

If anyone's interested in my solution, click the <readmore> link (but read the rest of the meditation first, otherwise some aspects of the code won't make sense). I should mention that the regexps I used in my solution are more general than they need to be; they don't take advantage of the specifics of the problem; this is intentional.

use strict;

my $seq = lc do { local @ARGV = ( 'seq' ); <> };
$seq =~ tr/t/u/;

my @matches = with_regex( $seq );
print "$_\n" for @matches;

sub with_regex {
  my $ks   = 'gccrccaugg';
  my @ks1  = split //, $ks;
  my @ks2  = map { s/^r$/[ag]/; $_ } @ks1;

  my @pos;
  push @pos, [ $-[ 1 ], $+[ 1 ] ] while $ks =~ /(?=(aug))/g;

  my @ks3 = @ks2;
  splice @ks3, $_->[0], $_->[1]-$_->[0] for reverse @pos;

  my @alts =
    map join( '', @$_ ),
      map {
        my $a_ref = $_;
        splice @$a_ref, $_->[0], 0, qw( a u g ) for @pos;
        $a_ref
      }
        map make_alts( $_, @ks3 ),
          map choose( scalar @ks3, $_ ), 0 .. 3;

  my $re_string = join '|', @alts;
  my $re = qr/(?=($re_string))/i;

  my $seq = shift;
  my @m = ( $seq =~ /$re/g );

  return @m;
}

sub make_alts {
  my $indices = shift;
  my @p = @_;
  $p[ $_ ] = '[acgu]' for @$indices;
  return \@p
}

sub choose {
  my ( $n, $m ) = @_;
  return _choose( $m, 0..$n-1 );
}

sub _choose {
  my ( $m, @n ) = @_;
  return []     if  0 == $m;
  return [ @n ] if @n == $m;

  my @ret;
  for my $i ( 0 .. $#n - $m + 1 ) {
    push @ret, [ $n[ $i ], @$_ ]
      for _choose( $m - 1, @n[ $i+1 .. $#n ] );
  }
  return @ret;
}
[download]

OK, the real problem had one more constraint, one that I mistakenly lumped along with the 'r' constraint as just another inconsequential "nuisance". Well, far from it, it made the problem much easier, downright boring even. This third constraint was that

There can be no mismatches on the 'aug' substring of the target motif

As some sharp colleagues pointed out to me, given this constraint, the simplest thing to do is to find all the substrings in the sequence that have 'aug' in the right position, and then filter out those with more than the allowed number of mismatches. A much simpler and faster solution that does not require a single regular expression! It can all be handled with index. Sigh.

I try to derive some comfort from the fact that maybe my code will be useful in the general case (i.e. of finding all matches with up to n mismatches, where there isn't a fast way to extract a small set of candidates subsequences). And solemnly resolve to work harder at being lazier.

the lowliest monk

In reply to I think I like Perl too much... by tlm

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.