monkfan has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I have a following problem which try to score the substrings distance.
The score is basically the number of positions of substrings that occur (dotted count)

And these are the three instances:
Instance 1 (score 21)
....... ....... ...... GATTACGAGTGGCGCTCGTGTAACGGCA GATTACG GCGCTCG AACGGCA
Instance 2 (score 16)
......... ....... GATTACGAGTGGCGCTCGTGTAACGGCA GATTACG TTACGAG CGTGTAA
Instance 3 (score 17)
GCTCGTG .................. GATTACGAGTGGCGCTCGTGTAACGGCA TACGAGT GTGGCGC
UPDATE: Instance 4 (score 4)
.. .. GATTACGAGTGGCGCTCGTGTAACGGCA GG GG
I wonder if there is a way to solve this.
Currently, what I have is:
#!/usr/bin/perl -w use strict; use Data::Dumper; my $s1 = 'GATTACGAGTGGCGCTCGTGTAACGGCA'; # Here are the array of the substrings, note that the number of # substrings may differ from array to array # and length of the substring may vary (defined as a parameter) my @ar = ('GATTACG','GCGCTCG','AACGGCA'); #21 (0,11,21) my @ar2 = ('GATTACG','TTACGAG','CGTGTAA'); #16 my @ar3 = ('TACGAGT','GTGGCGC','GCTCGTG'); #17 my @ar4 = ('GG','GG'); #4 score($s1,\@ar2); sub score{ my ($str,$array) = @_; my @pos; my $sub_len = length @$array[1]; foreach (@$array){ my $idx = index $s1, $_; push @pos, $idx; print "$idx\n"; } # I'm not sure where to go from here for (0..$#pos){ my $sum = $pos[$_]+$sub_len; } }
UPDATE 2: Renamed title.
Regards,
Edward

Replies are listed 'Best First'.
Re: Substring Distance Problem
by Roy Johnson (Monsignor) on Apr 08, 2005 at 12:11 UTC
    Using vec leads to a rather elegant solution:
    sub score { my ($str, $array) = @_; my $vec = ''; for (@$array) { my $idx = index $str, $_; # Set bits at each matched location vec($vec, $_, 1)= 1 for $idx..$idx+length($_)-1; } # Count set bits unpack '%32b*', $vec; }

    Caution: Contents may have been coded under pressure.
      Hi Roy,

      Your 'vec' approach is 5 times faster than our original approach!

      Rate bare vec bare 24142/s -- -78% vec 110542/s 358% --
      How can I accomodate the duplicate case (please see update above) to your snippet?

      i.e. Given: my @ar4 = ('GG','GG'); Returns: 4 not 2

      Update: Thanks to everybody for their great insights and helps.
      Regards,
      Edward

        If I may tinker with Roy's code, here is the fix you need. I give the modification for both the original and the xor versions of the algorithm:

        use strict; sub score { my ($str, $array) = @_; my $vec = ''; for (@$array) { my $ofs = 0; while ( ( my $idx = index $str, $_, $ofs ) > -1 ) { # Set bits at each matched location vec($vec, $_, 1)= 1 for $idx..$idx+length($_)-1; $ofs = $idx + 1; } } # Count set bits unpack '%32b*', $vec; } sub score_xor { my ($str, $array) = @_; my $vec = "\0" x length($str); for (@$array) { my $ofs = 0; while ( ( my $idx = index $str, $_, $ofs ) > -1 ) { # Matching substrings are padded into position with nulls $vec |= ("\0" x $idx) . $_; $ofs = $idx + 1; } } # Matching characters become nulls; others non-nulls $vec ^= $str; # Count nulls $vec =~ tr/\0//; }

        I was curious to see how these two versions compared, and was surprised to learn that the original one is faster by well over a factor of 2:

        Rate xor v1 xor 112734/s -- -59% v1 275692/s 145% --

        the lowliest monk

      Good stuff. I think it (or the xor variant, or both) would be a good addition to the Snippets Section.

      the lowliest monk

Re: Substring Distance Problem
by hv (Prior) on Apr 08, 2005 at 11:05 UTC

    Here's a fairly simply approach:

    sub score{ my($str, $array) = @_; # ensure that longer string comes before its prefix my @substring = sort { $b cmp $a } @$array; my $re = join '|', map "(?=($_))", @substring; my($count, $next) = (0, 0); while ($str =~ /$re/g) { # $-[0] is the position at which we matched # @- describes the matched captures, so $#- is the actual capture +matched my($start, $which) = ($-[0], $#- - 1); my $end = $start + length($substring[$which]); $next = $start if $next < $start; next if $end < $next; $count += $end - $next; $next = $end; } return $count; }

    This assumes that you may want substrings of varying lengths within the array, and that you may even have one substring in a set that is an exact prefix of another - if you don't need to allow for one or both of those possibilities, the code could be simplified a bit further.

    It does assume however that the substrings are simple strings to match directly rather than regexps in their own right: it would otherwise need a different approach to discovering the length of each match.

    The idea is to construct a regexp that will match any of the strings at any position by turning each into a lookahead; and to make each substring a capture so that we know which matched, and can therefore work out the length of the match.

    The rest of the code remembers what positions have already been catered for to avoid double-counting.

    Hugo

Re: Substring Distance Problem
by BrowserUk (Patriarch) on Apr 08, 2005 at 11:13 UTC

    This produces the required results for the testcases provided reasonably efficiently. Generalising the implementation is left as an exercise.

    Basically, you replace spaces with nulls, OR the n-1 shorter strings together and then XOR the result with the longest string. You then count the number of nulls in the result.

    The definition of "shorter" and "longest in that description is fuzzy :).

    #! perl -slw use strict; my $s1 = 'GATTACGAGTGGCGCTCGTGTAACGGCA'; my $s2 = 'GATTACG GCGCTCG AACGGCA'; my $masked = $s1 ^ $s2; print scalar $masked =~ tr[\0][0]; my $s3 = 'GATTACGAGTGGCGCTCGTGTAACGGCA'; my $s4 = 'GATTACG '; my $s5 = ' TTACGAG CGTGTAA '; tr[ ][\0] for $s3, $s4, $s5; $masked = $s3 ^ ( $s4 | $s5 ); print scalar $masked =~ tr[\0][0]; my $s6 = ' GCTCGTG '; my $s7 = 'GATTACGAGTGGCGCTCGTGTAACGGCA'; my $s8 = ' TACGAGT '; my $s9 = ' GTGGCGC '; tr[ ][\0] for $s6, $s7, $s8, $s9; $masked = $s7 ^ ( $s6 | $s8 | $s9 ); print scalar $masked =~ tr[\0][0]; __END__ P:\test>junk2 21 16 17

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco.
    Rule 1 has a caveat! -- Who broke the cabal?
      I was going to include the generalized version of your solution with my vec solution, as they're functionally fairly similar, but I couldn't get it to work. Finally found my mistake, so here's a solution to the exercise:
      sub score { my ($str, $array) = @_; my $vec = "\0" x length($str); for (@$array) { my $idx = index $str, $_; # Matching substrings are padded into position with nulls $vec |= ("\0" x $idx) . $_; } # Matching characters become nulls; others non-nulls $vec ^= $str; # Count nulls $vec =~ tr/\0//; }
      Update:
      An interesting (possibly quite useful) thing for the OP to note is that the vec solution effectively builds the list of dots as its vector (ones in matched positions), and your solution gives one string with the actual matched characters in position.

      Caution: Contents may have been coded under pressure.
      In my understanding, ewijaya should automate the substring positioning as well as counting - what I think you have left as an exercise. This part would probably spoil your efficiency claim, because you're solving the problem from quite a convenient starting point :)

      This said, I find the final computation quite elegant and istructive - I wouldn't have thought of XORing letters even in 100 years. There's always to learn, luckly, provided I'll be able to remember it when I'll need :)

      Flavio (perl -e "print(scalar(reverse('ti.xittelop@oivalf')))")

      Don't fool yourself.

        This said, I find the final computation quite elegant and istructive - I wouldn't have thought of XORing letters even in 100 years.

        BrowserUk is The XOR Meister.

        I had the same reaction as yours when I first encountered the "infamous xor trick" (in this case to find the first position at which two strings differ, or equivalently, the length of the longest common prefix):

        $a = "foobar"; $b = "foobAr"; ($a ^ $b) =~ /^(\0*)/ and print length $1; __END__ 4
        Way cool, though it works as written only if the characters are 1 byte long.

        There's always to learn, luckly, provided I'll be able to remember it when I'll need :)

        Yep, that's the rub.

        the lowliest monk

Re: Substring Distance Problem
by monkey_boy (Priest) on Apr 08, 2005 at 10:58 UTC
    On first look, id use a hash to flag up positions that match
    then just add up the total number of matches...
    e.g.

    my $s1 = 'GATTACGAGTGGCGCTCGTGTAACGGCA'; # Here are the array of the substrings, note that the number of # substrings may differ from array to array # and length of the substring may vary (defined as a parameter) my @ar = ('GATTACG','GCGCTCG','AACGGCA'); #21 (0,11,21) my @ar2 = ('GATTACG','TTACGAG','CGTGTAA'); #16 my @ar3 = ('TACGAGT','GTGGCGC','GCTCGTG'); #17 print &score($s1,\@ar) , "\n"; print &score($s1,\@ar2) , "\n"; print &score($s1,\@ar3) , "\n"; sub score { my ($str,$array) = @_; my %position_score; for my $frag (@{$array}) { my $idx = index($s1, $frag) + 1; for my $pos ($idx .. $idx + (length($frag) - 1)) { $position_score{$pos} = 1; }; }; my $total_score = 0; for my $score (values %position_score) { $total_score += $score; }; return $total_score; };



    This is not a Signature...
Re: Substring Distance Problem
by monkey_boy (Priest) on Apr 08, 2005 at 11:04 UTC
    Just a thought ....
    What should happen when one sub-string can match multiple times?
    e.g. GG can match twice
    my above implimentation will only handle match the first time & score 2 insetead of 4.



    This is not a Signature...
      Hi monkey_boy,

      Thanks so much for your answer.
      However, if there is a repeated substring cases as you mentioned, it should be scored as 4.

      Update: With slight modification to your code -by including offset- I managed to accomodate the duplicated case:
Re: Substring Distance Problem
by tlm (Prior) on Apr 08, 2005 at 11:53 UTC

    I haven't tested the code below very much; you should if you think you many want to use it.

    #!/usr/bin/perl -wl use strict; use List::Util qw( min max sum ); sub score { my ($str, @array) = @_; my @pos = sort { $a->[ 0 ] <=> $b->[ 0 ] } map [ $_->[ 0 ], $_->[ 0 ] + length $_->[ 1 ] ], grep $_->[ 0 ] > -1, map [ index( $str, $_ ), $_ ], @array; for my $i ( reverse 0 .. $#pos - 1 ) { my $lefts = [ sort { $a <=> $b } map $_->[ 0 ], @pos[ $i, $i + 1 +] ]; my $rights = [ sort { $b <=> $a } map $_->[ 1 ], @pos[ $i, $i + 1 +] ]; if ( $rights->[ 1 ] - $lefts->[ 1 ] > 0 ) { splice @pos, $i, 2, [ map $_->[ 0 ], $lefts, $rights ] } } return sum map $_->[ 1 ] - $_->[ 0 ], @pos; } my $s1 = 'GATTACGAGTGGCGCTCGTGTAACGGCA'; my @ar = ('GATTACG','GCGCTCG','AACGGCA'); #21 (0,11,21) my @ar2 = ('GATTACG','TTACGAG','CGTGTAA'); #16 my @ar3 = ('TACGAGT','GTGGCGC','GCTCGTG'); #17 print score( $s1, @$_ ) for \@ar, \@ar2, \@ar3; __END__ % perl score.pl 21 16 17

    the lowliest monk

Re: Substring Distance Problem
by NateTut (Deacon) on Apr 08, 2005 at 15:53 UTC