Recently brian_d_foy contributed a really interesting (and cathartic) meditation titled Stupid mistakes I repeatedly make. Well, this meditation here is in the same vein, except that here I'm talking about BIG mistakes. Specifically: doing a great job at solving the wrong problem. And more specifically still: focusing on the wrong problem because I'm such a Perl junkie. It happens to me all the time: I'm so enamored of the power of Perl that I overlook simpler solutions that those without Perl are forced to find. (Yes, my sin is not being sufficiently lazy.)

To build up some suspense, I'll first tell you the problem that I latched on to and set to solve. Later I'll tell you what the real problem was.

The task is to find all the matches to the motif 'gccrccaugg' in a long sequence of RNA (i.e. 'a's, 'c's, 'g's, and 'u's), subject to the following restrictions:

Well, I couldn't think of a simple regex to allow for 'up to 3 mismatches', but I thought it was a pretty cool problem. (Granted, Perl is the favorite language of bioinformaticians, so I'm sure that there is a Perl module somewhere, if not a dozen of them, to take care of this very problem, but I was more interested in the puzzle than in finding a ready-made solution.)

I figured that I could build a mongo regex consisting of a bazillion alternatives. Basically it was a matter of generating all the subsets of up to length 3 of offsets in the original sequence, and then generating the individual alternatives by replacing the character at each offset with the string '[acgu]'. Then I could join them with a '|', and run the resulting string through qr//. The business with 'r' matching 'a' or 'g' is just a minor wrinkle that can easily be accommodated into this strategy.

If anyone's interested in my solution, click the <readmore> link (but read the rest of the meditation first, otherwise some aspects of the code won't make sense). I should mention that the regexps I used in my solution are more general than they need to be; they don't take advantage of the specifics of the problem; this is intentional.

use strict; my $seq = lc do { local @ARGV = ( 'seq' ); <> }; $seq =~ tr/t/u/; my @matches = with_regex( $seq ); print "$_\n" for @matches; sub with_regex { my $ks = 'gccrccaugg'; my @ks1 = split //, $ks; my @ks2 = map { s/^r$/[ag]/; $_ } @ks1; my @pos; push @pos, [ $-[ 1 ], $+[ 1 ] ] while $ks =~ /(?=(aug))/g; my @ks3 = @ks2; splice @ks3, $_->[0], $_->[1]-$_->[0] for reverse @pos; my @alts = map join( '', @$_ ), map { my $a_ref = $_; splice @$a_ref, $_->[0], 0, qw( a u g ) for @pos; $a_ref } map make_alts( $_, @ks3 ), map choose( scalar @ks3, $_ ), 0 .. 3; my $re_string = join '|', @alts; my $re = qr/(?=($re_string))/i; my $seq = shift; my @m = ( $seq =~ /$re/g ); return @m; } sub make_alts { my $indices = shift; my @p = @_; $p[ $_ ] = '[acgu]' for @$indices; return \@p } sub choose { my ( $n, $m ) = @_; return _choose( $m, 0..$n-1 ); } sub _choose { my ( $m, @n ) = @_; return [] if 0 == $m; return [ @n ] if @n == $m; my @ret; for my $i ( 0 .. $#n - $m + 1 ) { push @ret, [ $n[ $i ], @$_ ] for _choose( $m - 1, @n[ $i+1 .. $#n ] ); } return @ret; }

OK, the real problem had one more constraint, one that I mistakenly lumped along with the 'r' constraint as just another inconsequential "nuisance". Well, far from it, it made the problem much easier, downright boring even. This third constraint was that

As some sharp colleagues pointed out to me, given this constraint, the simplest thing to do is to find all the substrings in the sequence that have 'aug' in the right position, and then filter out those with more than the allowed number of mismatches. A much simpler and faster solution that does not require a single regular expression! It can all be handled with index. Sigh.

I try to derive some comfort from the fact that maybe my code will be useful in the general case (i.e. of finding all matches with up to n mismatches, where there isn't a fast way to extract a small set of candidates subsequences). And solemnly resolve to work harder at being lazier.

the lowliest monk

Replies are listed 'Best First'.
Re: I think I like Perl too much...
by Mugatu (Monk) on Mar 31, 2005 at 01:25 UTC

    Maybe the title of your post should be I think I like Regular Expressions too much...

    And maybe the title of my response should be I think sometimes I ramble on about unrelated things too much... :-)

    Back when I had learned just enough Perl to be functional, I used to think in Regular Expressions all the time. They are simple, convenient, and powerful. They make a great hammer. They bend your mind into seeing all problems as nails. And, honestly, most problems can be bent into nails, when the hammer is Perl's regular expression engine.

    After a while, though, I developed another sense of Perl. I discovered objects, and coderefs, and functional techniques, and index, and substr, and pack, and typeglobs, and tie, and a whole world of other things. I discovered that these other things can solve problems in ways that a regular expression anchored worldview is blind to.

    Sure, I knew about of some of these things before I learned how to use regexes, but I didn't really understand them. I didn't realize that regexes were both a blessing and a curse.

    Nowadays, I try to think about the problem first, and the solution second. I find that after clearly thinking about the problem, a solution will often simply drop out of the sky. This solution is then easily translatable into Perl, using one of the myriad of expressive tools available. Regexes are part of the toolset, and a very powerful tool indeed, but there are many other things there.

Re: I think I like Perl too much...
by zentara (Cardinal) on Mar 31, 2005 at 12:57 UTC
    Dear tlm, who are you really? You've appeared 2 weeks ago, with more Perl knowledge than most, yet you claim to be "the lowliest monk". Are you "Abigail's reincarnation?, "curious minds want to know". :-)

    I'm not really a human, but I play one on earth. flash japh

      Thanks that's very kind, but I'm a little embarrassed to be compared to someone like Abigail; I'm not in the same league, or the next one below.

      the lowliest monk

Re: I think I like Perl too much...
by jdporter (Paladin) on Mar 31, 2005 at 16:05 UTC
    I really don't think an index-based solution is going to be easier to code than a regex-based solution; and anyway, the number of patterns is not that huge. It's "N choose R", where R is 3 (in your example) and N is the length of the motif, not including the anchor ("aug"). For your example, that's only 35 patterns. Increase the length of the motif by 1, and it's 56. Then 84, then 120. Not explosive.
    sub N_choose_R { my( $l, $n, ) = @_; $n > $l and die "$n > $l !!!\n"; $n == 1 and return( reverse map { [$_] } 0 .. $l-1 ); $n == $l and return [ reverse( 0 .. $l-1 ) ]; my @lists; for my $s ( reverse( $n-1 .. $l-1 ) ) { push @lists, map { [ $s, @$_ ] } N_choose_R( $s, $n-1 ); } @lists } my $motif = 'gccrccaugg'; my $anchor = 'aug'; my $max_mismatches = 3; my( $pre, $post ) = split $anchor, $motif, 2; my $anchor_ofs = length $pre; my @patterns = map { my $p = $motif; for my $i ( @$_ ) { substr( $p, $i < $anchor_ofs ? $i : $i + length($anchor), 1 ) += '.'; } $p =~ s/r/[ag]/g; $p } N_choose_R( length( $pre.$post ), $max_mismatches ); my $re = join '|', @patterns;

      Agreed, the difference in terms of complexity of the code is not big, but the difference in speed is. At least my implementations of both solutions differed by one order of magnitude. Besides, this is all kids' play compared to the algorithms discussed in earlier threads like this one and this one (thanks to dmerphq for the heads-up).

      the lowliest monk