Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks, I have a string of text and I want to access substrings of it using substr. For example, say I have a string of length 100: I want to access substrings of length 10 whose start positions move along by 3 each time (i.e. they overlap). I have the basic theory working but can't seem to get substrings of the correct length. Here's my code:
@dna = 'accatgagctgtacgtagcatctgagcgcgcatgactgtgactgacgtaggcagca';
my $dna = join ('', @dna);
my $movement = 3;
for (my $pos = 10; $pos <= @dna; $pos +=$movement+1) {
    substr ($dna, $pos,0, "\n");
}

my @windows;
push @windows, $dna;

print "@windows\n";
# thanks!

Replies are listed 'Best First'.
Re: substr help
by BrowserUk (Patriarch) on May 12, 2004 at 17:24 UTC

    This method should be a bit more efficient than either substr in a loop or using the regex engine.

    #! perl -slw
    use strict;

    my $dna = 'accatgagctgtacgtagcatctgagcgcgcatgactgtgactgacgtaggcagca';
    my $n = int( ( length( $dna ) - ( 10 - 3 ) ) / 3 );
    print for unpack "(A10 X7)$n", $dna;

    __END__
    C:\Perl\test>test
    accatgagct
    atgagctgta
    agctgtacgt
    tgtacgtagc
    acgtagcatc
    tagcatctga
    catctgagcg
    ctgagcgcgc
    agcgcgcatg
    gcgcatgact
    catgactgtg
    gactgtgact
    tgtgactgac
    gactgacgta
    tgacgtaggc
    cgtaggcagc
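    To read the template: A10 extracts ten characters as a string (with trailing spaces stripped), X7 then backs the read position up by seven, so each group advances a net three characters, and the trailing count repeats the group. A minimal sketch on a shorter, made-up string follows; the names $seq, $size, $step, and $back are illustrative only, and parenthesised groups in unpack need Perl 5.8 or later.

```perl
use strict;
use warnings;

# Hypothetical short input so the windows are easy to check by eye.
my $seq  = 'abcdefghijklmnop';    # 16 characters
my $size = 10;                    # window length
my $step = 3;                     # distance between window starts

# Number of full windows: int((16 - 10) / 3) + 1 = 3.
my $count = int( ( length($seq) - $size ) / $step ) + 1;

# Read $size characters, then back up so the next read starts $step further on.
my $back    = $size - $step;
my @windows = unpack "(A$size X$back)$count", $seq;

print "$_\n" for @windows;
```

    The template interpolates to "(A10 X7)3" here, giving windows that start at offsets 0, 3, and 6.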

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
Re: substr help
by davido (Cardinal) on May 12, 2004 at 15:58 UTC
    my $dna = 'accatgagctgtacgtagcatctgagcgcgcatgactgtgactgacgtaggcagca';
    my $increment = 3;
    my @windows;
    for ( my $loc=0; $loc <= (length($dna)-10); $loc+=$increment ){
        push @windows, substr($dna, $loc, 10);
    }
    print "$_\n" for @windows;

    If you have multiple $dna sequences you'll probably want an outer loop to iterate over an array holding them. Otherwise, this code ought to do what you're looking for.
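    The outer loop mentioned above might look something like this. It is only a sketch: the two short sequences are made up, and the names @sequences, $window, and %windows_for are assumptions rather than anything from the original question.

```perl
use strict;
use warnings;

# Hypothetical input: two short made-up sequences.
my @sequences = ( 'accatgagctgtacg', 'ttgacgtagg' );

my $window    = 10;
my $increment = 3;

my %windows_for;    # sequence => array ref of its windows
for my $dna (@sequences) {
    my @windows;
    for ( my $loc = 0; $loc <= length($dna) - $window; $loc += $increment ) {
        push @windows, substr( $dna, $loc, $window );
    }
    $windows_for{$dna} = \@windows;
}

# Report the windows for each sequence in input order.
for my $dna (@sequences) {
    print "$dna:\n";
    print "  $_\n" for @{ $windows_for{$dna} };
}
```

    Keying the hash on the sequence itself keeps the sketch short; with real data a sequence ID would probably make a better key.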

    It's one of the few instances where I would actually use a C-style 'for' loop.

    You could also do it with a regexp.

    Update: Replaced (length($dna)-$increment) with (length($dna)-10) per duff's comment. Good catch!


    Dave

      Surely that should be

      for ( my $loc = 0; $loc <= length($dna) - 10; $loc += $increment ) {

      Otherwise his last few strings won't be 10 characters long. Even though I have done this exact thing (sliding window with overlaps) in the past using a C-style for loop, I think I'd probably write it like this these days:

      my $end = int((length($dna) - 10)/3);
      for my $i (0..$end) {
          push @windows, substr($dna,$i*3,10);
      }
      or more likely
      my @windows = map { substr($dna,$_*3,10) } 0..int((length($dna)-10)/3);
      You could also do it with a regexp.
      Which would look something like:
      my $dna = 'accatgagctgtacgtagcatctgagcgcgcatgactgtgactgacgtaggcagca';
      my $increment = 3;
      my $substr = 10;
      my @windows = $dna=~/(?=(.{$substr})).{$increment}/gs;
      print "$_\n" for @windows;
Re: substr help
by blokhead (Monsignor) on May 12, 2004 at 16:10 UTC
    As davido says, you can do this with a regex too. Either play with pos a little bit:
    ## capture 10 chars (with advancing), then move back 7:
    while ( $dna =~ /(.{10})/g ) {
        print "current window = $1\n";
        pos $dna -= 7;
    }
    Or use a capture within a lookahead:
    ## capture next 10 chars without advancing, then advance by 3:
    while ( $dna =~ /(?= (.{10}) ) .{3}/gx ) {
        print "current window: $1\n";
    }
    For maintenance/readability reasons, you may be better off using a for loop and substr. I don't know, though; I kinda like the lookahead solution... it's cute.
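    One way to gain confidence in the regex variants is to run them next to a plain substr loop and compare the results. The sketch below does that on a made-up 18-character string; $size, $step, and the @by_* arrays are illustrative names only.

```perl
use strict;
use warnings;

my $dna = 'accatgagctgtacgtag';    # hypothetical 18-character input
my ( $size, $step ) = ( 10, 3 );

# Reference result: plain substr loop.
my @by_substr;
for ( my $pos = 0; $pos <= length($dna) - $size; $pos += $step ) {
    push @by_substr, substr( $dna, $pos, $size );
}

# Lookahead: capture $size characters without consuming, then consume $step.
my @by_lookahead = $dna =~ /(?=(.{$size})).{$step}/g;

# pos adjustment: consume $size characters, then step back $size - $step.
my @by_pos;
while ( $dna =~ /(.{$size})/g ) {
    push @by_pos, $1;
    pos($dna) -= $size - $step;
}

print "substr:    @by_substr\n";
print "lookahead: @by_lookahead\n";
print "pos:       @by_pos\n";
```

    All three lists should come out identical, with windows starting at offsets 0, 3, and 6.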

    blokhead

Re: substr help
by geekgrrl (Pilgrim) on May 12, 2004 at 17:50 UTC
    If you want to use Bioperl, here's an option. I couldn't find a module that would do this automatically, but I bet there is one out there.
    use strict;
    use Bio::Seq;

    my $dna = 'accatgagctgtacgtagcatctgagcgcgcatgactgtgactgacgtaggcagca';
    my $seq = Bio::Seq->new( -seq => $dna );
    my $end = $seq->length - 9;    # last 1-based start of a full 10-base window
    my @windows;
    for ( my $i = 1; $i <= $end; $i += 3 )    # increase by one codon each time
    {
        # subseq takes 1-based, inclusive start and end positions
        push @windows, $seq->subseq( $i, $i + 9 );
    }
    print join "\n", @windows;
Re: substr help
by sgifford (Prior) on May 12, 2004 at 16:02 UTC
    It's not clear exactly where you expect your output to be. You're calling substr inside the loop, but throwing away the results, then putting the original string $dna into the @windows list.

    Also, the third argument to substr is the number of characters you want; using 0 will always return an empty string. And in your loop, you're starting at 10 and stopping when the position is greater than the number of elements in @dna, but there's only one element in that list, so the loop never executes.
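    Both points can be checked directly. The sketch below uses a shortened, made-up string. Note that the four-argument form of substr, which the original code uses, replaces the selected span with its last argument, so a length of 0 inserts text at that offset and returns the (empty) text it replaced.

```perl
use strict;
use warnings;

my @dna = ('accatgagctgtacg');    # one element, as in the question

# An array in numeric context yields its element count, not a string length.
my $elements = @dna;
print "elements: $elements\n";    # 1, so a test like 10 <= @dna fails at once
print "length:   ", length( $dna[0] ), "\n";

# Four-argument substr with length 0 inserts the replacement at the offset
# and returns the empty string it replaced.
my $copy     = $dna[0];
my $replaced = substr( $copy, 10, 0, "\n" );
print "returned: '$replaced'\n";
print "modified: $copy\n";
```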

    I think something closer to what you mean is:

    use constant MOVEMENT => 3;
    use constant WINDOWSIZE => 10;

    my $dna = 'accatgagctgtacgtagcatctgagcgcgcatgactgtgactgacgtaggcagca';
    my @windows=();
    for (my $pos = 0; $pos <= (length($dna) - WINDOWSIZE); $pos += MOVEMENT) {
        push(@windows,substr($dna,$pos,WINDOWSIZE));
    }

    print "@windows\n";
Re: substr help
by Zaxo (Archbishop) on May 12, 2004 at 21:24 UTC

    Here's a little different way to do it.

    With PerlIO, Perl 5.8+,

    my $dna = q(accatgagctgtacgtagcatctgagcgcgcatgactgtgactgacgtaggcagca);
    my $skip = 10;
    {
        local ($/, $\) = (\10, "\n");
        open my $chain, '<', \substr($dna, $skip) or die $!;
        print while <$chain>;
        close $chain or die $!;
    }
    That makes use of the PerlIO trick of opening a scalar as a file by putting a reference to it in open's filespec slot. Setting $/ to a reference to an integer (10 in the code above) makes any filehandle be read in records of at most that many characters. Setting $\, the output record separator, to newline appends one to every print statement executed.
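    The two mechanisms can be seen in isolation in a smaller sketch. The input string and the record size of 3 are made up for illustration, and in-memory filehandles assume a Perl built with PerlIO (5.8 or later). The records returned are consecutive chunks and do not overlap.

```perl
use strict;
use warnings;

my $text = 'abcdefgh';    # hypothetical input
my @records;

{
    # \3 makes each read return a record of at most three characters.
    local $/ = \3;

    # A reference to a scalar in the filename slot opens it as an in-memory file.
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    while ( my $record = <$fh> ) {
        push @records, $record;
    }
    close $fh or die $!;
}

print join( '|', @records ), "\n";
```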

    Update: Oops, I also misread. Not a solution to what was asked.

    After Compline,
    Zaxo

Re: substr help
by Not_a_Number (Prior) on May 12, 2004 at 16:13 UTC

    Just another way to do it:

    my $dna = 'accatgagctgtacgtagcatctgagcgcgcatgactgtgactgacgtaggcagca';
    my @windows = unpack 'A3' x length $dna, $dna;
    print "@windows";

    Update: Sorry, ignore the above, I read the question too quickly.

    dave

      That doesn't actually produce the output the OP was asking for. It produces groups of 3, which is a nice trick.


      ___________
      Eric Hodges