Recently brian_d_foy contributed a really interesting (and cathartic) meditation titled Stupid mistakes I repeatedly make. Well, this meditation here is in the same vein, except that here I'm talking about BIG mistakes. Specifically: doing a great job at solving the wrong problem. And more specifically still: focusing on the wrong problem because I'm such a Perl junkie. It happens to me all the time: I'm so enamored of the power of Perl that I overlook simpler solutions that those without Perl are forced to find. (Yes, my sin is not being sufficiently lazy.)
To build up some suspense, I'll first tell you the problem that I latched on to and set to solve. Later I'll tell you what the real problem was.
The task is to find all the matches to the motif 'gccrccaugg' in a long sequence of RNA (i.e. 'a's, 'c's, 'g's, and 'u's), subject to the following restrictions:
I figured that I could build a mongo regex consisting of a bazillion alternatives. Basically it was a matter of generating all the subsets of up to length 3 of offsets in the original sequence, and then generating the individual alternatives by replacing the character at each offset with the string '[acgu]'. Then I could join them with a '|', and run the resulting string through qr//. The business with 'r' matching 'a' or 'g' is just a minor wrinkle that can easily be accommodated into this strategy.
If anyone's interested in my solution, click the <readmore> link (but read the rest of the meditation first, otherwise some aspects of the code won't make sense). I should mention that the regexps I used in my solution are more general than they need to be; they don't take advantage of the specifics of the problem; this is intentional.
use strict; my $seq = lc do { local @ARGV = ( 'seq' ); <> }; $seq =~ tr/t/u/; my @matches = with_regex( $seq ); print "$_\n" for @matches; sub with_regex { my $ks = 'gccrccaugg'; my @ks1 = split //, $ks; my @ks2 = map { s/^r$/[ag]/; $_ } @ks1; my @pos; push @pos, [ $-[ 1 ], $+[ 1 ] ] while $ks =~ /(?=(aug))/g; my @ks3 = @ks2; splice @ks3, $_->[0], $_->[1]-$_->[0] for reverse @pos; my @alts = map join( '', @$_ ), map { my $a_ref = $_; splice @$a_ref, $_->[0], 0, qw( a u g ) for @pos; $a_ref } map make_alts( $_, @ks3 ), map choose( scalar @ks3, $_ ), 0 .. 3; my $re_string = join '|', @alts; my $re = qr/(?=($re_string))/i; my $seq = shift; my @m = ( $seq =~ /$re/g ); return @m; } sub make_alts { my $indices = shift; my @p = @_; $p[ $_ ] = '[acgu]' for @$indices; return \@p } sub choose { my ( $n, $m ) = @_; return _choose( $m, 0..$n-1 ); } sub _choose { my ( $m, @n ) = @_; return [] if 0 == $m; return [ @n ] if @n == $m; my @ret; for my $i ( 0 .. $#n - $m + 1 ) { push @ret, [ $n[ $i ], @$_ ] for _choose( $m - 1, @n[ $i+1 .. $#n ] ); } return @ret; }
OK, the real problem had one more constraint, one that I mistakenly lumped along with the 'r' constraint as just another inconsequential "nuisance". Well, far from it, it made the problem much easier, downright boring even. This third constraint was that
I try to derive some comfort from the fact that maybe my code will be useful in the general case (i.e. of finding all matches with up to n mismatches, where there isn't a fast way to extract a small set of candidates subsequences). And solemnly resolve to work harder at being lazier.
the lowliest monk
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re: I think I like Perl too much...
by Mugatu (Monk) on Mar 31, 2005 at 01:25 UTC | |
|
Re: I think I like Perl too much...
by zentara (Cardinal) on Mar 31, 2005 at 12:57 UTC | |
by tlm (Prior) on Mar 31, 2005 at 14:34 UTC | |
|
Re: I think I like Perl too much...
by jdporter (Paladin) on Mar 31, 2005 at 16:05 UTC | |
by tlm (Prior) on Mar 31, 2005 at 16:28 UTC |