Complicated pattern match

dr_jgbn has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Complicated pattern match by BrowserUk (Patriarch) on Jan 19, 2003 at 07:40 UTC
This seems to do what you requested--if I understood you correctly:^). #! perl -slw use strict; my $needle = 'ATGGAGTCGACGAATTTGAAGAAT'; my $haystack = 'xxxxxxATGGAGyxxxTCGAzxxxxCGAATTTGAAxxwGAAT'; my @needles; #! While we've still needle to process while ($needle) { #! Try shorter and shorter substring of needle for my $start (0 .. length($needle) - 1) { my $bit = substr $needle, $start; #! move of if no match next unless $haystack =~ m[$bit]; #! Got a match, save it $haystack =~ s[$bit][ push @needles, $bit; $bit ]ge; #! and remove it $needle = substr $needle, 0, $start; #! repeat last; } } #! Sort the needles longest first @needles = sort{ length $b <=> length $a } @needles; #! mark their places in the haystack $haystack =~ s[($_)(?!\})][{$1}]g for @needles; #! remove nested marks #! (where a shorter needle was found inside a longer one) $haystack =~ s[ ({[^{}]?) #! capture everything after the first { ({) #! until we find a second (also captured) ([^{}]?) #! Then capture everything before the close } (}) #! and the close (nested) } ([^}]?}) #! and everything from there, to and including th +e final } ] [$1$3$5]gx; #! Throw away the inner {}. #! Finally print out the none needle parts, and the needle that follow +ed them. print "$1 preceeded $2" while $haystack =~ m[(\G[^{]+){([^}]+)}]g; __END__ C:\test>228122 xxxxxx preceeded ATGGAG yxxx preceeded TCGA zxxxx preceeded CGAATTTGAA xxw preceeded GAAT C:\test> [download] Update:Subsequent to posting, I noticed several gross inefficiencies (leftovers from earlier attempts) in the above code. This seems to be a rather more efficient implementation. It is also a better testbed for further tuning should anyone care to take that on. Read more... (2 kB) Examine what is said, not who speaks. The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.*	[reply] [d/l] [select]
Re: Re: Complicated pattern match by dr_jgbn (Beadle) on Jan 19, 2003 at 17:33 UTC
Flawless!!! ? !!! Remarkable!!	[reply]
Re: Complicated pattern match by theorbtwo (Prior) on Jan 19, 2003 at 05:53 UTC
In general, it seems like you want to take the sequence of charaters in $A, make sure $B contains those characters, in that sequence, but with other stuff inbetween them. Then you want a list of the other stuff. This should do it: `#!/usr/bin/perl use warnings; use strict; my $A = 'ATGGAGTCGACGAATTTGAAGAAT'; my $B = 'xxxxxxATGGAGyxxxTCGAzxxxxCGAATTTGAAxxwGAAT'; my @A = split //, $A; my $Are = '^(.)' . join('(.)', @A) . '(.)$'; my @introns = ($B =~ m/$Are/); if (!@introns) { die "There was no possible way to match the input."; } foreach (0..$#introns) { print "$_: $introns[$_] $A[$_]\n"; }` [download] That should give you @introns as a list of everything that /didn't/ match. There will be zero-length strings for each position in which there was no junk, and otherwise the junk. Look at the output, and you'll see what I mean. Also, if there is more then one way of matching $B against $A, this will take each section of @introns to be as long as possible. If you'd prefer it as short as possible, that's easily possible. In either case, it will try it's hardest to get them to match. It will allow both leading and trailing junk, though your example only has trailing junk. These are left as exercises for the reader. Update:* Major revisions to code, changed caveats, tested. Warning: Unless otherwise stated, code is untested. Do not use without understanding. Code is posted in the hopes it is useful, but without warranty. All copyrights are relinquished into the public domain unless otherwise stated. I am not an angel. I am capable of error, and err on a fairly regular basis. If I made a mistake, please let me know (such as by replying to this node).	[reply] [d/l]
Re^2: Complicated pattern match by Aristotle (Chancellor) on Jan 19, 2003 at 12:21 UTC
You can simplify this: `my $Are = '^(.)' . join('(.)', @A) . '(.)$';` [download] Like so: `my $Are = join '(.)', '^', @A, '$';` [download] I'd also use a nongreedy `.?` there instead. You provided a code base I liked quite a bit and couldn't stop tinkering with though - that temporary `@A` was annoying. `:)` Eventually it occured to me I could just capture the matches from `$A` as well: `my $Are = join '(.?)', '^', (map "($_)", $A =~ /(.)/sg), '$';` [download] Now it all goes in a single array where every odd element is guaranteed to be one character long, while every even element may have any length incl zero. With that in mind, I wrote a cleanup loop and eventually arrived at this: #!/usr/bin/perl -w use strict; my $A = 'ATGGAGTCGACGAATTTGAAGAAT'; my $B = 'xxxxxxATGGAGyxxxTCGAzxxxxCGAATTTGAAxxwGAAT'; my $Are = join '(.?)', '^', (map "($_)", $A =~ /(.)/sg), '$'; (my @introns = $B =~ m/$Are/) or die "There was no possible way to match the input."; my @seq; while(@introns) { my $match = shift @introns; if(@seq and not $match) { $seq[-1] .= shift @introns \|\| ''; next; } push @seq, $match; } print "$A\n@seq\n"; __END__ ATGGAGTCGACGAATTTGAAGAAT xxxxxx ATGGAG yxxx TCGA zxxxx CGAATTTGAA xxw GAAT [download] Update:* hrm.. it doesn't work so well anymore once you substitute "real" sequences for the dummy characters in `$B`.. `ATGGAGTCGACGAATTTGAAGAAT ATGGAG A T GGAGT CGA T CGAA G T CACCGAA TT T GAA TTT GAAT` I suspect nearly the same is true for theorbtwo's code due to the nature of pattern matching. Makeshifts last the longest.	[reply] [d/l] [select]
Re: Complicated pattern match by runrig (Abbot) on Jan 19, 2003 at 05:14 UTC
Surround every character in string A with capturing parens (warning: untested code ahead): `my $re = join("([wxyz])", "", split("", $A), ""); if (my @array = $B =~ /^$re$/) { print "$_\n" for grep length, @array; }` [download] On a related note, I wrote Amino acid sequence builder awhile ago, which doesn't really fit this problem, but you might find it interesting anyway. Updated. And the code is tested now. Though due to misunderstanding the problem (clarified in the CB), the "`[wxyz]`" should be "`[ACTG]*?`"	[reply] [d/l] [select]
Re: Complicated pattern match by Aristotle (Chancellor) on Jan 19, 2003 at 13:11 UTC
After tinkering about for a while trying to make a pattern match produce sensible results, I stepped back and tried to think of the problem at core. What you really want to find is a "longest common subsequence" between the strings. There's a very good module on CPAN which does just this - Algorith::Diff. #!/usr/bin/perl -w use strict; use Algorithm::Diff qw(traverse_sequences); # construct arrayrefs containing an array of single chars my ($A, $B) = map [/(.)/sg], qw( ATGGAGTCGACGAATTTGAAGAAT xxxxxxATGGAGyxxxTCGAzxxxxCGAATTTGAAxxwGAAT ); my $prev = ''; my @seq; traverse_sequences( $A, $B, { MATCH => sub { my ($aidx, $bidx) = @_; if('=' ne $prev) { push @seq, ''; $prev = '='; } $seq[-1] .= $A->[$aidx]; }, DISCARD_A => sub { die "Sequence in A is not fully contained i +n B" }, DISCARD_B => sub { my ($aidx, $bidx) = @_; if('!' ne $prev) { push @seq, ''; $prev = '!'; } $seq[-1] .= $B->[$bidx]; }, }, ); print "@seq\n"; __END__ xxxxxx ATGGAG yxxx TCGA zxxxx CGAATTTGAA xxw GAAT [download] Even this falls short on "actual" data though: working against GATAGCATGGAGGCCATCGATAACGCGAATTTGAATTTGAAT it produces G AT A G CAT G G AG GCCA TCGA TAA CG CG AATTTGAA TTT GAAT.. Makeshifts last the longest.	[reply] [d/l]
Re: Complicated pattern match by Abigail-II (Bishop) on Jan 19, 2003 at 16:28 UTC
Considering that you only have 'A', 'T', 'G', and 'C', it's certainly not impossible that there are various ways how `$A` can be contained in `$B`. Is it a given that `$A` has to be split into four pieces? Can there be more ways? If there are multiple ways of splitting `$A` which way should be choosen? There are two contraints I would find logical: the one that minizes the number of parts `$A` splits in, and the one that maximizes the length of the smallest part. Abigail	[reply] [d/l] [select]
Re: Re: Complicated pattern match by dr_jgbn (Beadle) on Jan 19, 2003 at 17:02 UTC
Abigail-II, From looking at some of the results of others ( and in genereal), you are correct on your two constraints. How to impose that I am not sure, however, you are right. Cheers, Dr.J	[reply]
Re: Complicated pattern match by Abigail-II (Bishop) on Jan 19, 2003 at 17:27 UTC
There's no garantee that both constraints can be satisfied both simultaniously. What's it going to be? ;-) Abigail	[reply]
Re: Re: Complicated pattern match by dr_jgbn (Beadle) on Jan 19, 2003 at 17:38 UTC
Re: Complicated pattern match (book recommendations) by gjb (Vicar) on Jan 19, 2003 at 20:26 UTC
You may want to have a look at Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, although I haven't got a copy on my desk right now, I'm pretty sure some nice algorithm for this problem is given. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison might also be worth a try. Nevertheless I had a go at it, the code is shown below. Just my 2 cents, -gjb- Read more... (896 Bytes)	[reply] [d/l]
Re: Complicated pattern match (from CB) by tye (Sage) on Jan 20, 2003 at 00:44 UTC
Here is what I came up with yesterday in the chatterbox while discussing theorbtwo's solution. dr_jgbn confirmed that the shorter string is composed only of valid codons (sequences of three characters each representing a nucleatide base) and that introns will not be inserted into the middle of a codon (despite the example not meeting these criteria). theorbtwo's solution would put the introns as soon as possible (which would not make the introns as long as possible) and this would lead to lots of backtracking before the solution was found. Consider 123456 vs. 1235432456 which theorbtwo's solution splits like 12(354)3(2)456 while changing (.) to (.?) would split like 123(5)4(324)56. But the longest possible intron is found like this 123(5432)456. So I went with non-greedy (.?) as it doesn't have to backtrack (except trivially) unless the strings don't actually match (I got the impression that the strings were already known to match, but I'll cover that later). And by keeping codons together, the splitting is sensible (it should match how a cell would interpret the DNA/RNA, I'd think). `#!/usr/bin/perl use strict; while( 1 ) { my $valid= <DATA> or exit( 0 ); chomp( $valid ); my $dirty= <DATA> or die "Missing second sequence"; chomp( $dirty ); $valid =~ s#(...)#(.?)\Q$1\E#g; my @intron= $dirty =~ /^$valid(.*?)/; for my $intron ( @intron ) { next if ! length($intron); print "$intron\n"; } } __END__ ATGGAGTCGACGAATTTGAAGAAT GCACCGATGGAGTAGGTCGACGATCTCAATTTGTCGAAGAAT` [download] which produces: GCACCG TAGG ATCTC TCG Read more... (5 kB)	[reply] [d/l] [select]
Re: Complicated pattern match by JamesNC (Chaplain) on Jan 19, 2003 at 23:33 UTC
I liked your problem: If I understand your problem, the sequences in A must all be matched in order of appearance in B without repeating the sequence in A or in B. You may need to sequence through a possible 24 times, but I leave that up to you as you will see how to accomplish this. Try uncommenting the second data and you will see what I mean: My solution=) #!/perl/bin/perl use strict; my @chem; my %seqB; my ($key, $value,$data, $pos, $x, $num, $seq); my $DNA = "xxxxxxATGGAGyxxxTCGAzxxxxCGAATTTGAAxxwGAAT"; #my $DNA = "CGATAAATGGAGTGACTCGAGATCGCGAATTTGCCAAACACGAAT"; while(<DATA>){ chomp; @chem = split //; } my $seq_prev =''; for(1..10){ $num = @chem; $num--; for (0..$num){ $key = join '', @chem[0..$_]; if($DNA =~ /($key)/i){ $seq = $`; $data = $key; $pos=$_+1; } #first sequence in A found } #remove the matched squence from future possible matches splice @chem, 0, $pos; #show the results my $chem= join '', @chem; print "FIRST SEQ MATCH FOUND: $seq_prev\|< $data >\|$chem\n"; print "\nDNA(B)- BEFORE < $data >: $seq"; $seq_prev = $data.$seq_prev; $seqB{$data} = $seq; #remove the matched DNA B pattern matched data $DNA =~ s/$seq$data//; # exit(0) if $DNA eq ''; #if you wanted to match more than 4 } print "AS A HASH TABLE KEY:VALUE PAIRS\n"; foreach (keys %seqB){ print "$_ = $seqB{$_}\n"; } __DATA__ ATGGAGTCGACGAATTTGAAGAAT [download] It gives you what you want for both the data you provided and for any generic data sets. JamesNC :0)	[reply] [d/l]
Re: Complicated pattern match by OM_Zen (Scribe) on Jan 19, 2003 at 05:24 UTC
Hi , `my $A =~ s/[A-Z]/\\|/g; my @values = split('\\|',$A); map { if ($_ ne "" ) {print "[$_]\n";} } @values;` [download]	[reply] [d/l]
Re^2: Complicated pattern match by Aristotle (Chancellor) on Jan 19, 2003 at 12:33 UTC
No, that don't work. Your first line will substitue any A-Z character in `$A` with a pipe, which means all characters in this case since `$A` doesn't contain anything else. Then you split on pipes, which gets you a `@values` full of empty strings. Then you map in void context.. If you leave out the substitution and split on nothing, it �works�: `my @values = split //, $A; map { if ($_ ne "" ) {print "[$_]\n";} } @values;` [download] Except all it does is split `$A` into characters and print them on a line each. I have no idea how that is supposed to solve the poster's problem though, since it doesn't even look at `$B`. The `if` is superfluous as well, since every element of `@values` is guaranteed to contain one character from `$A`. Makeshifts last the longest.	[reply] [d/l]
Re: Re^2: Complicated pattern match by OM_Zen (Scribe) on Jan 19, 2003 at 18:10 UTC
Hi, The code returns `[xxxxxx] [yxxx] [zxxxx] [xxw]` [download] I guess the split converts only the caps in the $B and then I print the non_exmpty strings. Please let me know where I am totally wrong. I thought the $A was discontinuous, but the order of the characters remained the same and hence thought I could do it this way.	[reply] [d/l]
Re^4: Complicated pattern match by Aristotle (Chancellor) on Jan 19, 2003 at 19:01 UTC
Re: Complicated pattern match by pizza_milkshake (Monk) on Jan 20, 2003 at 01:11 UTC
ok, it's more C-ish than perl-ish, but i betcha it's faster: `#!/usr/bin/perl -l my $A = 'ATGGAGTCGACGAATTTGAAGAAT'; my $B = 'xxxxxxATGGAGyxxxTCGAzxxxxCGAATTTGAAxxwGAAT'; my ($An, $Bn) = (0, 0); # relative indices my ($Ac, $Bc); # characters my ($difpos, $difstr) = (0, ''); # keep track of differences for $Bn (0 .. length($B)-1){ $Ac = substr($A, $An, 1); $Bc = substr($B, $Bn, 1); if ($Ac eq $Bc){ print "$difpos: $difstr" if $difstr; $difstr = ''; $An++; $difpos++; } else { $difstr .= $Bc; } }` [download] perl -e'$_=q#: 13_2: 12/"{>: 8_4) (_4: 6/2"-2; 3;-2"\2: 5/7\_/\7: 12m m::#;s#:#\n#g;s#(\D)(\d+)#$1x$2#ge;print'	[reply] [d/l]
Re: Complicated pattern match by scain (Curate) on Jan 21, 2003 at 04:03 UTC
OK, so there have been several responses so far, and it appears that at least some of them work :-) In the spirit of TMTOWTDI, I thought I would take a different approach. The problems dr_jgbn presents could be used to do several things, though the most obvious is to look for exons that are included in one transcript and not another. Also, the concern that Abagail-II raises is valid, therefore, I think it is reasonable and desireable to use industry standard analysis tools to arrive at a solution. What follows is a fairly long description of how I solved that problem. Read more... (27 kB) Scott Project coordinator of the Generic Model Organism Database Project	[reply] [d/l]