monkfan has asked for the wisdom of the Perl Monks concerning the following question:

This is a reverse problem to my earlier posting. What I need to do is to recover the substrings back to the string with gap (N) for those non-dotted substrings. This is the working code I have:


#!/usr/bin/perl -w use strict; my $s1 = 'GATTACGAGTGGCGCTCGTGTAACGGCA'; my @ar = ('GATTACG','GCGCTCG','AACGGCA'); #21 (0,11,21) my @ar2 = ('GATTACG','TTACGAG','CGTGTAA'); #16 my @ar3 = ('TACGAGT','GTGGCGC','GCTCGTG'); #17 my @ar4 = ('GG','GG'); print append_n($s1,\@ar),"\n"; print append_n($s1,\@ar2),"\n"; print append_n($s1,\@ar3),"\n"; print append_n($s1,\@ar4),"\n"; sub append_n { my ( $str, $array ) = @_; my $nstring = "N" x length($str); foreach my $sbstr ( @$array ) { my $pos = index $str, $sbstr; substr ($nstring, $pos, length ($sbstr)) = $sbstr; } return $nstring; }


The code gives a correct result, except for the case where there are duplicates in the arrays (@ar4). It only returns one instance instead of all. So here is the result:
GATTACGNNNNGCGCTCGNNNAACGGCA #correct GATTACGAGNNNNNNNCGTGTAANNNNN #correct NNNTACGAGTGGCGCTCGTGNNNNNNNN #correct NNNNNNNNNNGGNNNNNNNNNNNNNNNN #wrong
supposedly it should return :NNNNNNNNNNGGNNNNNNNNNNNNGGNN
Is there a way to overcome this bug?
Regards,
Edward

Replies are listed 'Best First'.
Re: Recovering Substrings to String with Gap (use m//g in scalar context)
by demerphq (Chancellor) on Apr 16, 2005 at 12:47 UTC

    UPDATE: whoops a closer inspection reveals this doesnt match your output requirements, but i leave it here as it is a useful technique.

    This is a situation where For some problems like this m//g in scalar context comes in useful:

    sub append_n { my ( $str, $array ) = @_; my $nstring = "N" x length($str); pos($str)=0; # reset the search position on $str foreach my $sbstr ( @$array ) { $str=~/$sbstr/g and substr ($nstring, $-[0], length($sbstr)) = $sbstr } return $nstring; }

    Outputs (when combined with a modified version of your code)

    m//g: GATTACGNNNNGCGCTCGNNNAACGGCA Orig: GATTACGNNNNGCGCTCGNNNAACGGCA m//g: GATTACGNNNNNNNNNCGTGTAANNNNN Orig: GATTACGAGNNNNNNNCGTGTAANNNNN m//g: NNNTACGAGTNNNGCTCGTGNNNNNNNN Orig: NNNTACGAGTGGCGCTCGTGNNNNNNNN m//g: NNNNNNNNNNGGNNNNNNNNNNNNGGNN Orig: NNNNNNNNNNGGNNNNNNNNNNNNNNNN
    ---
    demerphq

Re: Recovering Substrings to String with Gap
by tlm (Prior) on Apr 16, 2005 at 13:42 UTC

    Update: This solution, though correct (I think), is unnecessarily complicated and slow. The idea of using bit vectors, which is ideal for counting the size of embedded substrings, is not particularly well-suited for this new problem. See the one in my follow-up to this node instead.

    This is a straightforward adaptation of the solution proposed by Roy Johnson (and minimally tweaked by YT) to your earlier problem:

    sub append_n { my ($str, $array) = @_; my $vec = ''; for (@$array) { my $ofs = 0; while ( ( my $idx = index $str, $_, $ofs ) > -1 ) { # Set bits at each matched location vec( $vec, $_, 1 ) = 1 for $idx .. $idx + length() - 1; $ofs = $idx + 1; } } # Count set bits vec( $vec, $_, 1 ) or substr( $str, $_, 1 ) = 'N' for 0 .. -1 + length $str; return $str; } __END__ GATTACGNNNNGCGCTCGNNNAACGGCA GATTACGAGNNNNNNNCGTGTAANNNNN NNNTACGAGTGGCGCTCGTGNNNNNNNN NNNNNNNNNNGGNNNNNNNNNNNNGGNN

    Update: Edited the name of the sub to match the one given in the head node.

    the lowliest monk

      ...but come to think of it, it is simpler to just fix your original version:

      sub append_n { my ( $str, $array ) = @_; my $nstring = "N" x length($str); foreach my $sbstr ( @$array ) { my $ofs = 0; while ( ( my $pos = index $str, $sbstr, $ofs ) > -1 ) { substr ($nstring, $pos, length ($sbstr)) = $sbstr; $ofs = $pos + 1; } } return $nstring; } __END__ GATTACGNNNNGCGCTCGNNNAACGGCA GATTACGAGNNNNNNNCGTGTAANNNNN NNNTACGAGTGGCGCTCGTGNNNNNNNN NNNNNNNNNNGGNNNNNNNNNNNNGGNN
      ...and about 3X faster too:
      Rate bitvec substr bitvec 18618/s -- -74% substr 70275/s 277% --
      In hindsight this is not surprising, since my (vec-based) solution does everything that yours does plus the unnecessary bit-vector stuff. Doh!

      the lowliest monk

Re: Recovering Substrings to String with Gap
by salva (Canon) on Apr 16, 2005 at 11:07 UTC
    it's because you always search the substring from the beginning instead of from the point after the last match.
    my $after = 0; foreach my $sbstr ( @$array ) { my $pos = index $str, $sbstr, $after; substr ($nstring, $pos, length ($sbstr)) = $sbstr; $after=$pos+length($sbstr); }
      No it wont' do, cause it will "over append" those overlapping substrings (@arr2 and @arr3) which gives:
      GATTACGNNNNNNNNNCGTGTAANNNNTTACGAG NNNTACGAGTNNNGCTCGTGNNNNNNNGTGGCGC
      Instead of OP (already) correct answers.
      Regards,
      Edward
        oops, you are right, I had supposed that the strings didn't overlap...

        but if the substrings are sorted by their position on the original string, it's easy to solve, just make $after=$pos+1 or:

        my $pos=0; foreach my $sbstr ( @$array ) { $pos = index $str, $sbstr, $pos; substr ($nstring, $pos, length ($sbstr)) = $sbstr; $pos++ }
        and if they are not sorted, well, then you can use some hash to remember the last $pos for a given $sbstr and start searching after that...
        my %after; foreach my $sbstr ( @$array ) { my $pos = index $str, $sbstr, $after{$sbstr}||0; substr ($nstring, $pos, length ($sbstr)) = $sbstr; $after{$sbstr}=$pos+1; }