dr_jgbn has asked for the wisdom of the Perl Monks concerning the following question:

Hello Perlmonks,

I have somewhat of a complicated (as far as I am concerned) pattern match problem that has me stumped.

For example, take two DNA strings, A and B. The letters w,x,y, and z can denote either A,T,G, or C
$A = ATGGAGTCGACGAATTTGAAGAAT
$B = xxxxxxATGGAGyxxxTCGAzxxxxCGAATTTGAAxxwGAAT
As you will notice, $A is contained within $B but is discontinous, separated by x's.
The basic premise is to use the sequence in $A and match this against the sequence of $B. In the process of doing this, each of the 'x' strings (4 in total) need to be extracted and categorized separately i.e. each on its own line (with an identifier # ideally).
The output from such a comparison would be;
xxxxxx
yxxx
zxxxx
xxw

I was trying some split stuff but got nowhere! Any help would be greatly! appreciated. I hope this makes sense.
Thanks,

Dr.J

Replies are listed 'Best First'.
Re: Complicated pattern match
by BrowserUk (Patriarch) on Jan 19, 2003 at 07:40 UTC

    This seems to do what you requested--if I understood you correctly:^).

    #! perl -slw use strict; my $needle = 'ATGGAGTCGACGAATTTGAAGAAT'; my $haystack = 'xxxxxxATGGAGyxxxTCGAzxxxxCGAATTTGAAxxwGAAT'; my @needles; #! While we've still needle to process while ($needle) { #! Try shorter and shorter substring of needle for my $start (0 .. length($needle) - 1) { my $bit = substr $needle, $start; #! move of if no match next unless $haystack =~ m[$bit]; #! Got a match, save it $haystack =~ s[$bit][ push @needles, $bit; $bit ]ge; #! and remove it $needle = substr $needle, 0, $start; #! repeat last; } } #! Sort the needles longest first @needles = sort{ length $b <=> length $a } @needles; #! mark their places in the haystack $haystack =~ s[($_)(?!\})][{$1}]g for @needles; #! remove nested marks #! (where a shorter needle was found inside a longer one) $haystack =~ s[ ({[^{}]*?) #! capture everything after the first { ({) #! until we find a second (also captured) ([^{}]*?) #! Then capture everything before the close } (}) #! and the close (nested) } ([^}]*?}) #! and everything from there, to and including th +e final } ] [$1$3$5]gx; #! Throw away the inner {}. #! Finally print out the none needle parts, and the needle that follow +ed them. print "$1 preceeded $2" while $haystack =~ m[(\G[^{]+){([^}]+)}]g; __END__ C:\test>228122 xxxxxx preceeded ATGGAG yxxx preceeded TCGA zxxxx preceeded CGAATTTGAA xxw preceeded GAAT C:\test>

    Update:Subsequent to posting, I noticed several gross inefficiencies (leftovers from earlier attempts) in the above code.

    This seems to be a rather more efficient implementation. It is also a better testbed for further tuning should anyone care to take that on.


    Examine what is said, not who speaks.

    The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

      Flawless!!! ? !!! Remarkable!!
Re: Complicated pattern match
by theorbtwo (Prior) on Jan 19, 2003 at 05:53 UTC

    In general, it seems like you want to take the sequence of charaters in $A, make sure $B contains those characters, in that sequence, but with other stuff inbetween them. Then you want a list of the other stuff.

    This should do it:

    #!/usr/bin/perl use warnings; use strict; my $A = 'ATGGAGTCGACGAATTTGAAGAAT'; my $B = 'xxxxxxATGGAGyxxxTCGAzxxxxCGAATTTGAAxxwGAAT'; my @A = split //, $A; my $Are = '^(.*)' . join('(.*)', @A) . '(.*)$'; my @introns = ($B =~ m/$Are/); if (!@introns) { die "There was no possible way to match the input."; } foreach (0..$#introns) { print "$_: $introns[$_] $A[$_]\n"; }
    That should give you @introns as a list of everything that /didn't/ match. There will be zero-length strings for each position in which there was no junk, and otherwise the junk. Look at the output, and you'll see what I mean. Also, if there is more then one way of matching $B against $A, this will take each section of @introns to be as long as possible. If you'd prefer it as short as possible, that's easily possible. In either case, it will try it's hardest to get them to match. It will allow both leading and trailing junk, though your example only has trailing junk. These are left as exercises for the reader.

    Update: Major revisions to code, changed caveats, tested.


    Warning: Unless otherwise stated, code is untested. Do not use without understanding. Code is posted in the hopes it is useful, but without warranty. All copyrights are relinquished into the public domain unless otherwise stated. I am not an angel. I am capable of error, and err on a fairly regular basis. If I made a mistake, please let me know (such as by replying to this node).

      You can simplify this:
      my $Are = '^(.*)' . join('(.*)', @A) . '(.*)$';
      Like so:
      my $Are = join '(.*)', '^', @A, '$';

      I'd also use a nongreedy .*? there instead.

      You provided a code base I liked quite a bit and couldn't stop tinkering with though - that temporary @A was annoying. :) Eventually it occured to me I could just capture the matches from $A as well:

      my $Are = join '(.*?)', '^', (map "($_)", $A =~ /(.)/sg), '$';
      Now it all goes in a single array where every odd element is guaranteed to be one character long, while every even element may have any length incl zero. With that in mind, I wrote a cleanup loop and eventually arrived at this:
      #!/usr/bin/perl -w use strict; my $A = 'ATGGAGTCGACGAATTTGAAGAAT'; my $B = 'xxxxxxATGGAGyxxxTCGAzxxxxCGAATTTGAAxxwGAAT'; my $Are = join '(.*?)', '^', (map "($_)", $A =~ /(.)/sg), '$'; (my @introns = $B =~ m/$Are/) or die "There was no possible way to match the input."; my @seq; while(@introns) { my $match = shift @introns; if(@seq and not $match) { $seq[-1] .= shift @introns || ''; next; } push @seq, $match; } print "$A\n@seq\n"; __END__ ATGGAGTCGACGAATTTGAAGAAT xxxxxx ATGGAG yxxx TCGA zxxxx CGAATTTGAA xxw GAAT

      Update: hrm.. it doesn't work so well anymore once you substitute "real" sequences for the dummy characters in $B..

      ATGGAGTCGACGAATTTGAAGAAT
      ATGGAG A T GGAGT CGA T CGAA G T CACCGAA TT T GAA TTT GAAT

      I suspect nearly the same is true for theorbtwo's code due to the nature of pattern matching.

      Makeshifts last the longest.

Re: Complicated pattern match
by runrig (Abbot) on Jan 19, 2003 at 05:14 UTC
    Surround every character in string A with capturing parens (warning: untested code ahead):
    my $re = join("([wxyz]*)", "", split("", $A), ""); if (my @array = $B =~ /^$re$/) { print "$_\n" for grep length, @array; }
    On a related note, I wrote Amino acid sequence builder awhile ago, which doesn't really fit this problem, but you might find it interesting anyway.

    Updated. And the code is tested now. Though due to misunderstanding the problem (clarified in the CB), the "[wxyz]*" should be "[ACTG]*?"

Re: Complicated pattern match
by Aristotle (Chancellor) on Jan 19, 2003 at 13:11 UTC
    After tinkering about for a while trying to make a pattern match produce sensible results, I stepped back and tried to think of the problem at core. What you really want to find is a "longest common subsequence" between the strings. There's a very good module on CPAN which does just this - Algorith::Diff.
    #!/usr/bin/perl -w use strict; use Algorithm::Diff qw(traverse_sequences); # construct arrayrefs containing an array of single chars my ($A, $B) = map [/(.)/sg], qw( ATGGAGTCGACGAATTTGAAGAAT xxxxxxATGGAGyxxxTCGAzxxxxCGAATTTGAAxxwGAAT ); my $prev = ''; my @seq; traverse_sequences( $A, $B, { MATCH => sub { my ($aidx, $bidx) = @_; if('=' ne $prev) { push @seq, ''; $prev = '='; } $seq[-1] .= $A->[$aidx]; }, DISCARD_A => sub { die "Sequence in A is not fully contained i +n B" }, DISCARD_B => sub { my ($aidx, $bidx) = @_; if('!' ne $prev) { push @seq, ''; $prev = '!'; } $seq[-1] .= $B->[$bidx]; }, }, ); print "@seq\n"; __END__ xxxxxx ATGGAG yxxx TCGA zxxxx CGAATTTGAA xxw GAAT
    Even this falls short on "actual" data though: working against GATAGCATGGAGGCCATCGATAACGCGAATTTGAATTTGAAT it produces G AT A G CAT G G AG GCCA TCGA TAA CG CG AATTTGAA TTT GAAT..

    Makeshifts last the longest.

Re: Complicated pattern match
by Abigail-II (Bishop) on Jan 19, 2003 at 16:28 UTC
    Considering that you only have 'A', 'T', 'G', and 'C', it's certainly not impossible that there are various ways how $A can be contained in $B. Is it a given that $A has to be split into four pieces? Can there be more ways? If there are multiple ways of splitting $A which way should be choosen? There are two contraints I would find logical: the one that minizes the number of parts $A splits in, and the one that maximizes the length of the smallest part.

    Abigail

      Abigail-II, From looking at some of the results of others ( and in genereal), you are correct on your two constraints. How to impose that I am not sure, however, you are right.
      Cheers,
      Dr.J
        There's no garantee that both constraints can be satisfied both simultaniously. What's it going to be? ;-)

        Abigail

Re: Complicated pattern match (book recommendations)
by gjb (Vicar) on Jan 19, 2003 at 20:26 UTC
Re: Complicated pattern match (from CB)
by tye (Sage) on Jan 20, 2003 at 00:44 UTC

    Here is what I came up with yesterday in the chatterbox while discussing theorbtwo's solution.

    dr_jgbn confirmed that the shorter string is composed only of valid codons (sequences of three characters each representing a nucleatide base) and that introns will not be inserted into the middle of a codon (despite the example not meeting these criteria).

    theorbtwo's solution would put the introns as soon as possible (which would not make the introns as long as possible) and this would lead to lots of backtracking before the solution was found. Consider 123456 vs. 1235432456 which theorbtwo's solution splits like 12(354)3(2)456 while changing (.*) to (.*?) would split like 123(5)4(324)56. But the longest possible intron is found like this 123(5432)456.

    So I went with non-greedy (.*?) as it doesn't have to backtrack (except trivially) unless the strings don't actually match (I got the impression that the strings were already known to match, but I'll cover that later). And by keeping codons together, the splitting is sensible (it should match how a cell would interpret the DNA/RNA, I'd think).

    #!/usr/bin/perl use strict; while( 1 ) { my $valid= <DATA> or exit( 0 ); chomp( $valid ); my $dirty= <DATA> or die "Missing second sequence"; chomp( $dirty ); $valid =~ s#(...)#(.*?)\Q$1\E#g; my @intron= $dirty =~ /^$valid(.*?)/; for my $intron ( @intron ) { next if ! length($intron); print "$intron\n"; } } __END__ ATGGAGTCGACGAATTTGAAGAAT GCACCGATGGAGTAGGTCGACGATCTCAATTTGTCGAAGAAT
    which produces:
    GCACCG
    TAGG
    ATCTC
    TCG
    
Re: Complicated pattern match
by JamesNC (Chaplain) on Jan 19, 2003 at 23:33 UTC
    I liked your problem: If I understand your problem, the sequences in A must all be matched in order of appearance in B without repeating the sequence in A or in B. You may need to sequence through a possible 24 times, but I leave that up to you as you will see how to accomplish this. Try uncommenting the second data and you will see what I mean: My solution=)
    #!/perl/bin/perl use strict; my @chem; my %seqB; my ($key, $value,$data, $pos, $x, $num, $seq); my $DNA = "xxxxxxATGGAGyxxxTCGAzxxxxCGAATTTGAAxxwGAAT"; #my $DNA = "CGATAAATGGAGTGACTCGAGATCGCGAATTTGCCAAACACGAAT"; while(<DATA>){ chomp; @chem = split //; } my $seq_prev =''; for(1..10){ $num = @chem; $num--; for (0..$num){ $key = join '', @chem[0..$_]; if($DNA =~ /($key)/i){ $seq = $`; $data = $key; $pos=$_+1; } #first sequence in A found } #remove the matched squence from future possible matches splice @chem, 0, $pos; #show the results my $chem= join '', @chem; print "FIRST SEQ MATCH FOUND: $seq_prev|< $data >|$chem\n"; print "\nDNA(B)- BEFORE < $data >: $seq"; $seq_prev = $data.$seq_prev; $seqB{$data} = $seq; #remove the matched DNA B pattern matched data $DNA =~ s/$seq$data//; # exit(0) if $DNA eq ''; #if you wanted to match more than 4 } print "AS A HASH TABLE KEY:VALUE PAIRS\n"; foreach (keys %seqB){ print "$_ = $seqB{$_}\n"; } __DATA__ ATGGAGTCGACGAATTTGAAGAAT
    It gives you what you want for both the data you provided and for any generic data sets. JamesNC :0)
Re: Complicated pattern match
by OM_Zen (Scribe) on Jan 19, 2003 at 05:24 UTC
    Hi ,

    my $A =~ s/[A-Z]/\|/g; my @values = split('\|',$A); map { if ($_ ne "" ) {print "[$_]\n";} } @values;

      No, that don't work. Your first line will substitue any A-Z character in $A with a pipe, which means all characters in this case since $A doesn't contain anything else. Then you split on pipes, which gets you a @values full of empty strings. Then you map in void context..

      If you leave out the substitution and split on nothing, it “works”:

      my @values = split //, $A; map { if ($_ ne "" ) {print "[$_]\n";} } @values;
      Except all it does is split $A into characters and print them on a line each. I have no idea how that is supposed to solve the poster's problem though, since it doesn't even look at $B. The if is superfluous as well, since every element of @values is guaranteed to contain one character from $A.

      Makeshifts last the longest.

        Hi,

        The code returns
        [xxxxxx] [yxxx] [zxxxx] [xxw]


        I guess the split converts only the caps in the $B and then I print the non_exmpty strings. Please let me know where I am totally wrong. I thought the $A was discontinuous, but the order of the characters remained the same and hence thought I could do it this way.
Re: Complicated pattern match
by pizza_milkshake (Monk) on Jan 20, 2003 at 01:11 UTC
    ok, it's more C-ish than perl-ish, but i betcha it's faster:
    #!/usr/bin/perl -l my $A = 'ATGGAGTCGACGAATTTGAAGAAT'; my $B = 'xxxxxxATGGAGyxxxTCGAzxxxxCGAATTTGAAxxwGAAT'; my ($An, $Bn) = (0, 0); # relative indices my ($Ac, $Bc); # characters my ($difpos, $difstr) = (0, ''); # keep track of differences for $Bn (0 .. length($B)-1){ $Ac = substr($A, $An, 1); $Bc = substr($B, $Bn, 1); if ($Ac eq $Bc){ print "$difpos: $difstr" if $difstr; $difstr = ''; $An++; $difpos++; } else { $difstr .= $Bc; } }
    perl -e'$_=q#: 13_2: 12/"{>: 8_4) (_4: 6/2"-2; 3;-2"\2: 5/7\_/\7: 12m m::#;s#:#\n#g;s#(\D)(\d+)#$1x$2#ge;print'
Re: Complicated pattern match
by scain (Curate) on Jan 21, 2003 at 04:03 UTC
    OK, so there have been several responses so far, and it appears that at least some of them work :-)

    In the spirit of TMTOWTDI, I thought I would take a different approach. The problems dr_jgbn presents could be used to do several things, though the most obvious is to look for exons that are included in one transcript and not another. Also, the concern that Abagail-II raises is valid, therefore, I think it is reasonable and desireable to use industry standard analysis tools to arrive at a solution. What follows is a fairly long description of how I solved that problem.

    Scott
    Project coordinator of the Generic Model Organism Database Project