in reply to Identifying Overlapping Area in a Set of Strings

Perhaps not the best efficient way, but it does what you ask:

use warnings; use Data::Dumper; my $fseq = 'CCCCGCGC'; my @nsub = ('CCCCG', 'CCCGC', 'CGCGC'); my @results; for (1 .. $#nsub) { my $current = $nsub[$_]; my $previous = $nsub[$_ -1]; # the "#" is used as a separator. # put a different character (or sequence of characters) # if you believe that it could also be in your strings if ( "$previous#$current" =~ /(\w+)#\1/ ) { my $found = $1; printf "%d -> %s (%s) %d -> %s \n", $_ -1, $previous, $found, $_, $current; $current =~ s/^$found/"-" x length($found)/e; $previous =~ s/$found$/"-" x length($found)/e; push @results, [ $_ -1, $previous]; push @results, [ $_, $current]; } else { printf "%d -> no overlap\n", $_ } } print Data::Dumper->Dump([ \@results], ['result']);

Result: (adjusted)

0 -> CCCCG (CCCG) 1 -> CCCGC 1 -> CCCGC (CGC) 2 -> CGCGC $result = [ [ 0, 'C----' ], [ 1, '----C' ], [ 1, 'CC---' ], [ 2, '---GC' ] ];

Replies are listed 'Best First'.
Re^2: Identifying Overlapping Area in a Set of Strings
by monkfan (Curate) on Jul 29, 2005 at 14:43 UTC
    Hi rnahi,
    Thanks so much for your answers. Your solutions provide the correct result but also more.
    my $fseq1= 'CCCCGCGC'; my @nsub1= ('CCCCG', 'CCCGC', 'CGCGC'); #produces $result1 = [ [ 0, 'C----' ], [ 1, '----C' ], [ 1, 'CC---' ], # but this is extra [ 2, '---GC' ] ];
    And this

    How can I modify your code such that it simply gives:
    Regards,
    Edward
      How can I modify your code such that it simply gives: ...

      Quite simple:

      my @results; my %seen; for (1 .. $#nsub) { my $current = $nsub[$_]; my $previous = $nsub[$_ -1]; if ( "$previous#$current" =~ /(\w+)#\1/ ) { my $found = $1; printf "%d -> %s (%s) %d -> %s \n", $_ -1, $previous, $found, $_, $current; $current =~ s/^$found/"-" x length($found)/e; $previous =~ s/$found$/"-" x length($found)/e; push @results, [ $_ -1, $previous] unless $seen{$_ -1}++; push @results, [ $_, $current] unless $seen{$_}++; } else { printf "%d -> no overlap\n", $_ } } print Data::Dumper->Dump([ \@results], ['result']);
        Thanks so much again rnahi.
        I hope you don't mind looking at my other instances. I'm really sorry, I didn't mentioned it before because I thought it may appear too complex and too discouraging to read.
        Suppose I have this:

        I would like to produce this:

        Basically 'skipping' the asterisk(*) but yet still keep its position in array in place.

        Update: I've finally succeeded in improving your code such that it can take care those situations. It is not entirely neat and 'super-naive' but it does the job. I think I can't use "grep" function in this case because I still need to keep '*' in its position.

        My sincere thanks, for providing an excellent starting point to me.
        Here is the final code:

        Please kindly advice. Really hope to hear from you again.
        Regards,
        Edward