Re: Identifying Overlapping Area in a Set of Strings

Perhaps not the best efficient way, but it does what you ask:

use warnings;
use Data::Dumper;

my $fseq =  'CCCCGCGC';
my @nsub =  ('CCCCG', 'CCCGC', 'CGCGC');

my @results;

for (1 .. $#nsub)
{
    my $current = $nsub[$_];
    my $previous = $nsub[$_ -1];
    # the "#" is used as a separator.
    # put a different character (or sequence of characters)
    # if you believe that it could also be in your strings
    if ( "$previous#$current" =~ /(\w+)#\1/ )
    {
        my $found = $1;
        printf "%d -> %s (%s) %d -> %s \n",
            $_ -1,
            $previous,
            $found,
            $_,
            $current;
        $current =~ s/^$found/"-" x length($found)/e;
        $previous =~ s/$found$/"-" x length($found)/e;
        push @results, [ $_ -1, $previous];
        push @results, [ $_, $current];

    }
    else
    {
        printf "%d -> no overlap\n", $_
    }
}

print Data::Dumper->Dump([ \@results], ['result']);
[download]

Result: (adjusted)

0 -> CCCCG (CCCG) 1 -> CCCGC
1 -> CCCGC (CGC) 2 -> CGCGC

$result = [
            [ 0, 'C----' ],
            [ 1,  '----C' ],
            [ 1, 'CC---' ],
            [   2, '---GC' ]
          ];
[download]

Comment on Re: Identifying Overlapping Area in a Set of Strings Select or Download Code

Replies are listed 'Best First'.
Re^2: Identifying Overlapping Area in a Set of Strings by monkfan (Curate) on Jul 29, 2005 at 14:43 UTC
Hi rnahi, Thanks so much for your answers. Your solutions provide the correct result but also more. `my $fseq1= 'CCCCGCGC'; my @nsub1= ('CCCCG', 'CCCGC', 'CGCGC'); #produces $result1 = [ [ 0, 'C----' ], [ 1, '----C' ], [ 1, 'CC---' ], # but this is extra [ 2, '---GC' ] ];` [download] And this Read more... (688 Bytes) How can I modify your code such that it simply gives: Read more... (751 Bytes) Regards, Edward	[reply] [d/l] [select]
Re^3: Identifying Overlapping Area in a Set of Strings by rnahi (Curate) on Jul 29, 2005 at 17:35 UTC
How can I modify your code such that it simply gives: ... Quite simple: my @results; my %seen; for (1 .. $#nsub) { my $current = $nsub[$_]; my $previous = $nsub[$_ -1]; if ( "$previous#$current" =~ /(\w+)#\1/ ) { my $found = $1; printf "%d -> %s (%s) %d -> %s \n", $_ -1, $previous, $found, $_, $current; $current =~ s/^$found/"-" x length($found)/e; $previous =~ s/$found$/"-" x length($found)/e; push @results, [ $_ -1, $previous] unless $seen{$_ -1}++; push @results, [ $_, $current] unless $seen{$_}++; } else { printf "%d -> no overlap\n", $_ } } print Data::Dumper->Dump([ \@results], ['result']); [download]	[reply] [d/l]
Re^4: Identifying Overlapping Area in a Set of Strings by monkfan (Curate) on Jul 30, 2005 at 04:01 UTC
Thanks so much again rnahi. ~~I hope you don't mind looking at my other instances. I'm really sorry, I didn't mentioned it before because I thought it may appear too complex and too discouraging to read.~~ Suppose I have this: Read more... (669 Bytes) I would like to produce this: Read more... (769 Bytes) Basically 'skipping' the asterisk() but yet still keep its position in array in place. Update:* I've finally succeeded in improving your code such that it can take care those situations. It is not entirely neat and 'super-naive' but it does the job. I think I can't use "grep" function in this case because I still need to keep '*' in its position. My sincere thanks, for providing an excellent starting point to me. Here is the final code: Read more... (1387 Bytes) ~~Please kindly advice. Really hope to hear from you again.~~ Regards, Edward	[reply] [d/l] [select]