BhariD has asked for the wisdom of the Perl Monks concerning the following question:

I am working with a string consisting of the characters A-Z, in which four characters (A, C, G, and T) are of particular interest. I am trying to extract substrings from the string that satisfy the following conditions:

1] The distance from an A, C, G, or T to any other A, C, G, or T should be 0 to 8 characters.

for example: string: AGRTGAXWXX

substrings: AG, AGRT, AGRTGA, GRT, GRTG, GRTGA, TG, TGA, GA

2] I want the maximum-length substring possible. In the above example,

string: AGRTGAXWXX

I would just want the substring: AGRTGA

as all the other substrings are part of this longest substring and this has the maximum distance between A and A within the distance allowed.

I have this so far: can anyone help please?

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %uniq = ();
    my $string = 'ACRMGAHKMAHGTXX';

    substr( $string, $_, 10 ) =~ m[([AGTC].{0,8}[AGTC])] and ++$uniq{ $1 }
        for 0 .. length( $string ) - 1;

    for my $key ( keys %uniq ) {
        print $key, "\n";
    }

    # above code outputs the following:
    # GAHKMAHG
    # CRMGAHKMA
    # AHGT
    # AHKMAHGT
    # GT
    # GAHKMAHGT
    # ACRMGAHKMA

    # and I only want the following:
    # ACRMGAHKMA
    # GAHKMAHGT

Does anyone have any suggestions? Thanks!


Replies are listed 'Best First'.
Re: substring selection from a string on certain qualifying conditions
by BrowserUk (Patriarch) on Dec 08, 2010 at 18:32 UTC

    A few more examples would be good, but this produces the desired output for the two you have given:

    C:\test>876075
    AGRTGAXWXX : [ AGRTGA ]
    ACRMGAHKMAHGTXX : [ GAHKMAHGT, ACRMGAHKMA ]

    #! perl -slw
    use strict;
    use Data::Dump qw[ pp ];

    sub maxMatches {
        my $s = shift;
        my %uniq;
        LOOP: for my $o ( 0 .. length( $s ) - 10 ) {
            my( $match ) = $s =~ m[.{$o}([ACGT].{0,8}[ACGT])];
            m[$match] and next LOOP for keys %uniq;
            $uniq{ $match }++;
        }
        return keys %uniq;
    }

    while( <DATA> ) {
        chomp;
        printf "$_ : [ %s ]\n", join ', ', maxMatches( $_ );
    }

    __DATA__
    AGRTGAXWXX
    ACRMGAHKMAHGTXX

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: substring selection from a string on certain qualifying conditions
by ikegami (Patriarch) on Dec 08, 2010 at 19:29 UTC
    What should the following return?
    AXXAXXAXXXXXXXXXXXXXXAXXA

    "AXXAXXA" and "AXXA", or just "AXXAXXA"?

      If this is the input string: AXXAXXAXXXXXXXXXXXXXXAXXA

      Output should be:

      AXXAXXA and

      AXXA (AXXA from the end of the string)

        Any more unstated rules? :)

        C:\test>876075
        AGRTGAXWXX : [ AGRTGA ]
        ACRMGAHKMAHGTXX : [ ACRMGAHKMA, GAHKMAHGT ]
        AXXAXXAXXXXXXXXXXXXXXAXXA : [ AXXAXXA, AXXA ]

        #! perl -slw
        use strict;
        use Data::Dump qw[ pp ];

        sub maxMatches {
            my $s = shift;
            my @matches;
            my $vec = '';
            for my $o ( 0 .. length( $s ) - 10 ) {
                my( $match ) = $s =~ m[.{$o}([ACGT].{0,8}[ACGT])] or next;
                my $mask = '';
                vec( $mask , $_, 1 ) = 1 for $-[1] .. $+[1]-1;
                next if ( $vec | $mask ) eq $vec;
                $vec |= $mask;
                push @matches, $match;
            }
            return @matches;
        }

        while( <DATA> ) {
            chomp;
            printf "$_ : [ %s ]\n", join ', ', maxMatches( $_ );
        }

        __DATA__
        AGRTGAXWXX
        ACRMGAHKMAHGTXX
        AXXAXXAXXXXXXXXXXXXXXAXXA

Re: substring selection from a string on certain qualifying conditions
by anonymized user 468275 (Curate) on Dec 08, 2010 at 18:26 UTC
    Somebody already mentioned using chop for high-performance traversal of such strings. In this case, the traversal can just as well run backwards through the big string. I see no need for pattern matching: just store the last positions and the strings accumulated for each letter in a hash of hashes, keep the current winning string in a scalar, and chop back through the string, updating the positions, trial strings and current winning string as appropriate.

    One world, one people
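
The reply above describes a single pass without pattern matching but gives no code. A minimal sketch of that general idea follows. Note that it is not the chop-based backward traversal the reply suggests: it is a forward two-pointer scan over the positions of A, C, G and T, swapped in because it is easier to show briefly. The name maximal_windows and the $max_len parameter are hypothetical, and the limit of 10 characters is an assumption taken from the [ACGT].{0,8}[ACGT] pattern used elsewhere in the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the windows that start
# and end with one of A, C, G, T, are at most $max_len characters long
# (10 = two end characters plus up to 8 in between), and are not covered
# by a window that was kept earlier.
sub maximal_windows {
    my ( $s, $max_len ) = @_;
    $max_len = 10 unless defined $max_len;

    # Positions of the characters of interest, in ascending order.
    my @pos = grep { substr( $s, $_, 1 ) =~ /[ACGT]/ } 0 .. length($s) - 1;

    my @found;
    my $last_end = -1;    # end position of the most recently kept window
    my $j        = 0;     # index into @pos of the farthest usable end
    for my $i ( 0 .. $#pos ) {
        $j = $i if $j < $i;

        # Advance $j while the next interesting position still fits.
        $j++ while $j < $#pos && $pos[ $j + 1 ] - $pos[$i] < $max_len;

        next if $j == $i;                 # no partner within range
        next if $pos[$j] <= $last_end;    # lies inside a window already kept

        push @found, substr( $s, $pos[$i], $pos[$j] - $pos[$i] + 1 );
        $last_end = $pos[$j];
    }
    return @found;
}

print "$_ : [ ", join( ', ', maximal_windows($_) ), " ]\n"
    for qw( AGRTGAXWXX ACRMGAHKMAHGTXX AXXAXXAXXXXXXXXXXXXXXAXXA );
```

Because the start positions only move forward, a candidate window is covered by an earlier one exactly when its end does not pass the last kept end, so no hash of seen substrings is needed. Overlap is judged by position rather than by text, which is why the trailing AXXA in the third string is still reported.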

Re: substring selection from a string on certain qualifying conditions
by suhailck (Friar) on Dec 09, 2010 at 02:21 UTC
    For the first one,

    perl -MData::Dumper -le '$_=q[AGRTGAXWXX]; my %match; 1 while m/([ACGT][A-Z]{0,8}[ACGT])(??{$match{$1}++})(*FAIL)/; print Dumper(\%match);'
    $VAR1 = {
              'AGRTG' => 1,
              'AG' => 1,
              'TG' => 1,
              'TGA' => 1,
              'AGRT' => 1,
              'GA' => 1,
              'GRTG' => 1,
              'GRT' => 1,
              'AGRTGA' => 1,
              'GRTGA' => 1
            };
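
The one-liner above lists every qualifying substring, which covers the first condition only. One possible way to reduce such a list to the longest substrings is sketched below; the helper name longest_only is hypothetical and not from the thread. It assumes that comparing the text of the candidates is good enough, which is not always the case (see the note after the code).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): keep only the strings that are
# not contained in a string that has already been kept. Candidates are
# visited longest first, so a kept string is never shorter than the one
# being tested.
sub longest_only {
    my @kept;
    for my $cand ( sort { length($b) <=> length($a) || $a cmp $b } @_ ) {
        next if grep { index( $_, $cand ) >= 0 } @kept;
        push @kept, $cand;
    }
    return @kept;
}

# The candidate list reported for AGRTGAXWXX in the reply above.
my @all = qw( AG AGRT AGRTG AGRTGA GRT GRTG GRTGA TG TGA GA );
print join( ', ', longest_only(@all) ), "\n";
```

This filter compares text, not positions. For an input such as AXXAXXAXXXXXXXXXXXXXXAXXA it would drop the trailing AXXA because that text also occurs inside AXXAXXA, so a position-based check is needed when repeated substrings must be reported separately.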
Re: substring selection from a string on certain qualifying conditions
by Khen1950fx (Canon) on Dec 09, 2010 at 08:13 UTC
    For AGRTGA, I used String::Substrings.
    #!/usr/bin/perl
    use strict;
    use warnings;
    use String::Substrings;

    my @trip = substrings 'AGRTGAXWXX', 6;
    print "$trip[0]\n";