Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
I have two following strings:
my $s1 = "1,0,CTGCCACCGCTGT"; my $s2 = "1,5,ACCGCTGTGTTTCGGCCGGCGA"; #The format show this: sequence_index,string_pos,$string #So the two strings above are in the same sequence 1.
How can I intersect the above two strings so that we obtain:
CTGCCACCGCTGT (s1) ACCGCTGTGTTTCGGCCGGCGA (s2) ******** ACCGCTGT (intersected string) and also to identify the position of the interesected string, in this case 6

Replies are listed 'Best First'.
Re: Intersecting Two Strings
by tinita (Parson) on Aug 22, 2007 at 11:53 UTC

      A quick search of CPAN turns up: String::LCSS, though I've never used it. There were several other hits as well.


      I humbly seek wisdom.
      It seems that OP's problem is an easy special case of the LCS problem. We have one big string S and 2 substrings (seq1, seq2) with known coordinates within S:
      S: CTGCCACCGCTGTGTTTCGGCCGGCGAXXXXXXXXXXXXXXXXXXXXX seq1: CTGCCACCGCTGT (Pos: 0) seq2: ACCGCTGTGTTTCGGCCGGCGA (Pos: 5)
      So it's a trivial O(1) algorithm to calculate the overlap with the substring positions and lengths.
Re: Intersecting Two Strings
by akho (Hermit) on Aug 22, 2007 at 08:44 UTC
    use strict; use warnings; my $start = 'CTGCCACCGCTGT'; my $end = 'ACCGCTGTGTTTCGGCCGGCGA'; my $pos = 0; $pos++ while (!(substr($start, $pos, (length $start) - $pos) eq substr($end, 0, (length $start) - $pos))); print substr($start, $pos, (length $start) - $pos), ' ', $pos+1, "\n";
Re: Intersecting Two Strings
by lima1 (Curate) on Aug 22, 2007 at 11:15 UTC
    You have the coordinates of the subsequences? Then this should work:
    use strict; use warnings; use List::Util qw(min); use Data::Dumper; my $seq1 = { id => 1, start => 0, seq => 'CTGCCACCGCTGT' }; my $seq2 = { id => 1, start => 5, seq => 'ACCGCTGTGTTTCGGCCGGCGA'}; print Dumper [ intersection($seq1, $seq2) ]; sub intersection { # make sure that seq1 is always the first # subsequence my ($seq1, $seq2) = sort {$a->{start} <=> $b->{start} }@_; # determine the start of intersected string ... my $i_start = $seq2->{start} - $seq1->{start}; # ... and the length my $i_length = min(length(substr($seq1->{seq},$i_start)), length($seq2->{seq})); return ($i_start+1, substr($seq2->{seq}, 0, $i_length)); }
Re: Intersecting Two Strings
by BrowserUk (Patriarch) on Aug 23, 2007 at 03:01 UTC

    It seems to me that no searching is required, as all the relevant information already present in the input data.

    A grossly over elaborate coding is:

    #! perl -slw use strict; my $s1 = "1,0,CTGCCACCGCTGT"; my $s2 = "1,5,ACCGCTGTGTTTCGGCCGGCGA"; my @s1 = split ',', $s1; my @s2 = split ',', $s2; if( $s1[0] == $s2[0] ) { if( $s1[1] > $s2[1] ) { my @temp = @s1; @s1 = @s2; @s2 = @temp; } if( $s2[1] < length( $s1 ) ) { ## Updated to correct the error pointed out by Lima1++ below my $l1 = length( $s1[2] ) - $s1[1] - $s2[1]; my $l2 = length( $s2[2] ); $l1 = $l2 if $l2 < $l1; print "The intersection between the strings:\n", $s1[2], "\n", ' ' x ( $s2[1] - $s1[1] ), $s2[2], "\nis\n", ' ' x $s2[1], substr $s1[2], $s2[1] - $s1[1], $l1; ; print "\nand it occurs at a 1-based offset into the sequence o +f ", $s2[1] + 1; } else { print "The strings do not intersect"; } } else { print 'Strings are not from the same sequence'; } __END__ c:\test>junk The intersection between the strings: CTGCCACCGCTGT ACCGCTGTGTTTCGGCCGGCGA is ACCGCTGT and it occurs at a 1-based offset into the sequence of 6

    The bits you are probably interested in are

    1. The bit to extract the overlap:
      substr $s1[2], $s2[ 1 ] - $s1[ 1 ], length( $s1[2] ) - $s1[1] - $s2[1] +;
    2. And the bit to calculate the 1-based offset:
      $s2[1] + 1;

    but they won't work correctly unless you split the input strings into their constituant parts and ensure that the first is the 'lesser' of the two.

    It could undoubtedly be golfed some.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      You forgot one corner case:
      my $s1 = "1,0,CTGCCACCGCTGT"; my $s2 = "1,5,ACCG";
      Apart from that, our solutions are identical-but your code is nicer I must admit ;)
        You forgot one corner case:

        I did, didn't I :(

        I apologise for missing that you'd seen the same anomaly as I, amongst all the LCS posts.

        but your code is nicer I must admit ;)

        Actually, I think yours perfectly fine. Missing a couple of steps and tests, but fine. More importantly it works.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Intersecting Two Strings
by GrandFather (Saint) on Aug 22, 2007 at 08:25 UTC

    Will the intersection always be an overlap as shown, or could one string occur completely inside the other?

    What is the minimum intersection length you are interested in?

    If intersections happen only at the ends (overlaps), can they happen at either end?

    How long are the strings in practice and how long is the longest expected intersection?

    How time efficient does the process need to be?


    DWIM is Perl's answer to Gödel
Re: Intersecting Two Strings
by Anonymous Monk on Aug 23, 2007 at 01:56 UTC
    if you're strings are large, you may want to look at Tree::Suffix, which is a binding to a c library, which will probably be faster.