lostcause has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
I'm trying to generate all the possible DNA sequences that fall between two defined lengths.
The code below works but is not very extensible. I could keep adding loops for each additional base and as the maximum length that I'm interested in is around 10 bases its quite doable, but its not very efficient programing. It should be possible to do this with dynamic references or something but I don't really know where to start. Can anyone give me some clues?

Thanks

my @Bases = qw(A C G T); my $SequenceMinLenght = 1; my $SequenceMaxLength = 3; my @SequenceLenghts = ($SequenceMinLenght...$SequenceMaxLength); #print "@SequenceLenghts\n"; my $SequenceListLength = $#SequenceLenghts; $SequenceListLength ++; print "Number of Sequence lenghts to permute is $SequenceListLength\n" +; my @allSequence = (); my $sequence; my $currLenght; foreach $currLenght (@SequenceLenghts){ foreach my $base1 (@Bases) { if ($currLenght == 1){ $sequence = $base1; push (@allSequence, $sequence); } else{ foreach my $base2 (@Bases) { if ($currLenght == 2){ $sequence = $base1 . $base2; push (@allSequence, $sequence); } else{ foreach my $base3 (@Bases) { if ($currLenght == 3){ $sequence = $base1 . $base2. $base3; push (@allSequence, $sequence); } } } } } } } #Prints sequences to screen #- Too many entries after lenght of sequence is grater than 4. my $i=0; foreach my $finalsequence (@allSequence){ $i ++; if ($i == 4){ print "$finalsequence\n"; $i= 0; } else{ print "$finalsequence\t"; } }

Replies are listed 'Best First'.
(jeffa) Re: Generate all possibilities
by jeffa (Bishop) on Jun 21, 2001 at 00:22 UTC

      None of those do what was asked for. larryk's fine suggestion of (Golf) Ordered Combinations does what was asked for but requires that all 1,048,576 ten-character strings be stored in memory for you to get the answer. With Perl arrays, this takes several hundred MB (in my quick test) and so could be a real problem. If you just write the combinations to a file, then you end up with about a 10MB file, which isn't a big deal.

      So here is a quick hack to do that:

      sub count { my( $length, $alphabet )= @_; my @alphabet= split //, $alphabet; my @digit= (0) x $length; my $seq= $alphabet[0] x $length; while( 1 ) { print $seq,$/; my $pos= 0; while( @alphabet <= ++$digit[$pos] ) { substr( $seq, $pos, 1 )= $alphabet[ $digit[$pos]= 0 ]; return if $length <= ++$pos; } substr( $seq, $pos, 1 )= $alphabet[ $digit[$pos] ]; } } count( @ARGV ); # count( $_, "ACGT" ) for 8..10;

              - tye (but my friends call me "Tye")

      Update: And here is a context-free iterator solution:

        Tye,

        Thanks for the alternative piece of code, I ran into the very problem you described and was quite amazed to discover that a 10MB hash took up so much RAM.
        The new solution, of course, brings up a new problem This is a little bit unusual. My program generates all the combinations of the bases via the subroutine and saves them to disk. It later calls the file and reads through it one line at a time with a while loop. With larger sets of combination's, 5 or more bases long, this works fine but with a smaller sets, 3 or 4 bases I encounter a strange error. After running the sub routine the program reopens the file but the file is empty so the program continues not doing my stuff with each sequence then quits. Opening the resulting output of the combination sub routine shows that all the combinations are indeed there. The file size is just 320 bytes.
        I have followed this in a debugger and its reproducible. My guess is that the contents of the file are not been flushed to disk until the program ends. This happens on both Linux and Win2k. I am also having problems deleting the file after use. I've never run into this problem before. Ideas anyone?

        Here is a script that illustrates the problem.
        Try varying the commentating on the two unlink lines to see different effects and the changing the length of the sequence that is generated

        #!/usr/local/bin/perl -w $| = 1; #generate promoters my $Bases = "ACGT"; #my @PromoterLenghts = ($PromoterMinLenght...$PromoterMaxLength); my @PromoterLenghts = (4); my $PromotroListLength = $#PromoterLenghts; $PromotroListLength ++; #unlink 'permute.txt'; my ($currLenght, $permutelist_filename); foreach $currLenght (@PromoterLenghts){ $permutelist_filename = &permute_bases ($currLenght, $Bases); } #print "Sequences file $permutelist_filename\n"; open (PIN, "permute.txt")||die "Can't open file: $!\n"; open (OUTPUT, ">output.txt")||die "Can't open file: $!\n"; while (<PIN>){ $promoter = $_; chomp $promoter; #do intrestering stuff with promoter sequence print OUTPUT "$promoter\n"; } close PIN ||die "Could not close sequences file: $!"; close OUTPUT ||die "Could not close output file: $!"; unlink 'permute.txt'; ##################################################################### sub permute_bases { my( $length, $bases )= @_; my @bases= split //, $bases; my @digit= (0) x $length; my $seq= $bases[0] x $length; my $permutelist_filename = "permute.txt"; open (PERMUTEOUT, ">>$permutelist_filename"); while( 1 ) { print PERMUTEOUT $seq,$/; print $seq,$/; my $pos= 0; while( @bases <= ++$digit[$pos] ) { substr( $seq, $pos, 1 )= $bases[ $digit[$pos]= 0 ]; return if $length <= ++$pos; } substr( $seq, $pos, 1 )= $bases[ $digit[$pos] ]; } close PERMUTEOUT||die"Could not close $permutelist_filename: $!"; return ($permutelist_filename); }
Re: Generate all possibilities
by larryk (Friar) on Jun 21, 2001 at 00:37 UTC
Re: Generate all possibilities
by lostcause (Monk) on Jun 21, 2001 at 02:08 UTC

    Jeff and larryk,
    Thanks for your suggestions.
    Jeff, I had already checked out your suggestions but they didn't cut the mustard. I probably wasn't specific enough in my original question but I need all the possibilities not the permutations (I'm not sure if that is the correct usage of possibilities/permutations . . .). Specifically I also need to generate sequences containing all the same base e.g. AAA, TTT etc.

    The Ordered Combinations solution as suggested by larryk does what I need, an awesome little piece of code. Thanks to MeowChow, sachmet and tilly.

    Here is the final piece of code for the record:

    use strict; #generate all posibilites of a DNA sequence my @Bases = qw(A C G T); my $SequenceMinLenght = 1; my $SequenceMaxLength = 3; my @SequenceLenghts = ($SequenceMinLenght...$SequenceMaxLength); #print "@SequenceLenghts\n"; my $SequenceListLength = $#SequenceLenghts; $SequenceListLength ++; print "Number of Sequence lenghts to permute is $SequenceListLength\n" +; my @allSequence = (); my $sequence; my $currLenght; my @Ordered_Combinations_input = (); foreach $currLenght (@SequenceLenghts){ @Ordered_Combinations_input = (); @Ordered_Combinations_input = ($currLenght, @Bases); @allSequence = (@allSequence, (&Ordered_Combinations (@Ordered_Com +binations_input))); } #Prints sequences to screen # - Too many entries after lenght of sequence is grater than 4. my $i=0; foreach my $finalsequence (@allSequence){ $i ++; if ($i == 4){ print "$finalsequence\n"; $i= 0; } else{ print "$finalsequence\t"; } } #Completely blow the GOLF chalange by using a sub name almost as long +as the sub #See http://www.perlmonks.com/index.pl?node_id=75261 for details sub Ordered_Combinations {my$n=shift;--$n?map{my$d=$_;map$d.$_,Ordered +_Combinations($n,@_)}@_:@_};

      Unless you need some of those variables later, we could eliminate some clutter.

      use strict; use warnings; #generate all posibilites of a DNA sequence my @Bases = qw(A C G T); my $SequenceMinLength = 1; my $SequenceMaxLength = 3; print 'Number of Sequence lengths to permute is ', $SequenceMaxLength - $SequenceMinLength + 1, "\n"; my @allSequence; foreach my $currLength ($SequenceMinLength .. $SequenceMaxLength){ push @allSequence, Ordered_Combinations($currLength, @Bases); } #Prints sequences to screen # - Too many entries after length of sequence is greater than 4. my $i; foreach my $finalsequence (@allSequence){ ++$i % @Bases ? print "$finalsequence\t" : print "$finalsequence\n +"; }

      By using @Bases instead of 4, we get @Base columns, aiding readability.

      sub Ordered_Combinations { my $n = shift; --$n ? map { my $d = $_; map "$d$_", Ordered_Combinations($n,@_) } @_ : @_ }
      HTH,
      Charles K. Clarkson
Re: Generate all possibilities
by Madams (Pilgrim) on Jun 23, 2001 at 05:47 UTC
    I had a similar question awhile back about using a dynamic number of nested for loops (Topic thread here,tilly's answer here).

    I wanted to loop through x sets of items and process the combinations:

    i.e.: the sets {1,2,3},{a,b,c},{!@#} yield (inorder) 1a!, 1a@, 1a#, 1b!, 1b@, 1b#...3c#

    tilly's answer had it all AND made me REALLY understand closures and iterators and how powerfull they really are.


    _________________
    madams@scc.net
    (__) (\/) /-------\/ / | 666 || * ||----||