Re: Complicated pattern match
by BrowserUk (Patriarch) on Jan 19, 2003 at 07:40 UTC
|
#! perl -slw
use strict;
my $needle = 'ATGGAGTCGACGAATTTGAAGAAT';
my $haystack = 'xxxxxxATGGAGyxxxTCGAzxxxxCGAATTTGAAxxwGAAT';
my @needles;
#! While we've still needle to process
while ($needle) {
#! Try shorter and shorter substring of needle
for my $start (0 .. length($needle) - 1) {
my $bit = substr $needle, $start;
#! move of if no match
next unless $haystack =~ m[$bit];
#! Got a match, save it
$haystack =~ s[$bit][ push @needles, $bit; $bit ]ge;
#! and remove it
$needle = substr $needle, 0, $start;
#! repeat
last;
}
}
#! Sort the needles longest first
@needles = sort{ length $b <=> length $a } @needles;
#! mark their places in the haystack
$haystack =~ s[($_)(?!\})][{$1}]g for @needles;
#! remove nested marks
#! (where a shorter needle was found inside a longer one)
$haystack =~ s[
({[^{}]*?) #! capture everything after the first {
({) #! until we find a second (also captured)
([^{}]*?) #! Then capture everything before the close }
(}) #! and the close (nested) }
([^}]*?}) #! and everything from there, to and including th
+e final }
]
[$1$3$5]gx; #! Throw away the inner {}.
#! Finally print out the none needle parts, and the needle that follow
+ed them.
print "$1 preceeded $2" while $haystack =~ m[(\G[^{]+){([^}]+)}]g;
__END__
C:\test>228122
xxxxxx preceeded ATGGAG
yxxx preceeded TCGA
zxxxx preceeded CGAATTTGAA
xxw preceeded GAAT
C:\test>
Update:Subsequent to posting, I noticed several gross inefficiencies (leftovers from earlier attempts) in the above code.
This seems to be a rather more efficient implementation. It is also a better testbed for further tuning should anyone care to take that on.
Examine what is said, not who speaks.
The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead. | [reply] [d/l] [select] |
|
|
Flawless!!! ? !!! Remarkable!!
| [reply] |
Re: Complicated pattern match
by theorbtwo (Prior) on Jan 19, 2003 at 05:53 UTC
|
In general, it seems like you want to take the sequence of charaters in $A, make sure $B contains those characters, in that sequence, but with other stuff inbetween them. Then you want a list of the other stuff.
This should do it:
#!/usr/bin/perl
use warnings;
use strict;
my $A = 'ATGGAGTCGACGAATTTGAAGAAT';
my $B = 'xxxxxxATGGAGyxxxTCGAzxxxxCGAATTTGAAxxwGAAT';
my @A = split //, $A;
my $Are = '^(.*)' . join('(.*)', @A) . '(.*)$';
my @introns = ($B =~ m/$Are/);
if (!@introns) {
die "There was no possible way to match the input.";
}
foreach (0..$#introns) {
print "$_: $introns[$_] $A[$_]\n";
}
That should give you @introns as a list of everything that /didn't/ match. There will be zero-length strings for each position in which there was no junk, and otherwise the junk. Look at the output, and you'll see what I mean. Also, if there is more then one way of matching $B against $A, this will take each section of @introns to be as long as possible. If you'd prefer it as short as possible, that's easily possible. In either case, it will try it's hardest to get them to match. It will allow both leading and trailing junk, though your example only has trailing junk. These are left as exercises for the reader.
Update: Major revisions to code, changed caveats, tested.
Warning: Unless otherwise stated, code is untested. Do not use without understanding. Code is posted in the hopes it is useful, but without warranty. All copyrights are relinquished into the public domain unless otherwise stated. I am not an angel. I am capable of error, and err on a fairly regular basis. If I made a mistake, please let me know (such as by replying to this node).
| [reply] [d/l] |
|
|
my $Are = '^(.*)' . join('(.*)', @A) . '(.*)$';
Like so:
my $Are = join '(.*)', '^', @A, '$';
I'd also use a nongreedy .*? there instead.
You provided a code base I liked quite a bit and couldn't stop tinkering with though - that temporary @A was annoying. :) Eventually it occured to me I could just capture the matches from $A as well:
my $Are = join '(.*?)', '^', (map "($_)", $A =~ /(.)/sg), '$';
Now it all goes in a single array where every odd element is guaranteed to be one character long, while every even element may have any length incl zero. With that in mind, I wrote a cleanup loop and eventually arrived at this:
#!/usr/bin/perl -w
use strict;
my $A = 'ATGGAGTCGACGAATTTGAAGAAT';
my $B = 'xxxxxxATGGAGyxxxTCGAzxxxxCGAATTTGAAxxwGAAT';
my $Are = join '(.*?)', '^', (map "($_)", $A =~ /(.)/sg), '$';
(my @introns = $B =~ m/$Are/)
or die "There was no possible way to match the input.";
my @seq;
while(@introns) {
my $match = shift @introns;
if(@seq and not $match) {
$seq[-1] .= shift @introns || '';
next;
}
push @seq, $match;
}
print "$A\n@seq\n";
__END__
ATGGAGTCGACGAATTTGAAGAAT
xxxxxx ATGGAG yxxx TCGA zxxxx CGAATTTGAA xxw GAAT
Update: hrm.. it doesn't work so well anymore once you substitute "real" sequences for the dummy characters in $B..
ATGGAGTCGACGAATTTGAAGAAT
ATGGAG A T GGAGT CGA T CGAA G T CACCGAA TT T GAA TTT GAAT
I suspect nearly the same is true for theorbtwo's code due to the nature of pattern matching.
Makeshifts last the longest. | [reply] [d/l] [select] |
Re: Complicated pattern match
by runrig (Abbot) on Jan 19, 2003 at 05:14 UTC
|
Surround every character in string A with capturing parens (warning: untested code ahead):
my $re = join("([wxyz]*)", "", split("", $A), "");
if (my @array = $B =~ /^$re$/) {
print "$_\n" for grep length, @array;
}
On a related note, I wrote Amino acid sequence builder awhile ago, which doesn't really fit this problem, but you might find it interesting anyway.
Updated. And the code is tested now. Though due to misunderstanding the problem (clarified in the CB), the "[wxyz]*" should be "[ACTG]*?" | [reply] [d/l] [select] |
Re: Complicated pattern match
by Aristotle (Chancellor) on Jan 19, 2003 at 13:11 UTC
|
After tinkering about for a while trying to make a pattern match produce sensible results, I stepped back and tried to think of the problem at core. What you really want to find is a "longest common subsequence" between the strings. There's a very good module on CPAN which does just this - Algorith::Diff.
#!/usr/bin/perl -w
use strict;
use Algorithm::Diff qw(traverse_sequences);
# construct arrayrefs containing an array of single chars
my ($A, $B) = map [/(.)/sg], qw(
ATGGAGTCGACGAATTTGAAGAAT
xxxxxxATGGAGyxxxTCGAzxxxxCGAATTTGAAxxwGAAT
);
my $prev = '';
my @seq;
traverse_sequences(
$A, $B,
{
MATCH => sub {
my ($aidx, $bidx) = @_;
if('=' ne $prev) {
push @seq, '';
$prev = '=';
}
$seq[-1] .= $A->[$aidx];
},
DISCARD_A => sub { die "Sequence in A is not fully contained i
+n B" },
DISCARD_B => sub {
my ($aidx, $bidx) = @_;
if('!' ne $prev) {
push @seq, '';
$prev = '!';
}
$seq[-1] .= $B->[$bidx];
},
},
);
print "@seq\n";
__END__
xxxxxx ATGGAG yxxx TCGA zxxxx CGAATTTGAA xxw GAAT
Even this falls short on "actual" data though: working against GATAGCATGGAGGCCATCGATAACGCGAATTTGAATTTGAAT it produces G AT A G CAT G G AG GCCA TCGA TAA CG CG AATTTGAA TTT GAAT..
Makeshifts last the longest. | [reply] [d/l] |
Re: Complicated pattern match
by Abigail-II (Bishop) on Jan 19, 2003 at 16:28 UTC
|
Considering that you only have 'A', 'T', 'G', and 'C', it's
certainly not impossible that there are various ways
how $A can be contained in $B.
Is it a given that $A has to be split into
four pieces? Can there be more ways? If there are multiple
ways of splitting $A which way should be choosen?
There are two contraints I would find logical: the one that
minizes the number of parts $A splits in, and
the one that maximizes the length of the smallest part.
Abigail | [reply] [d/l] [select] |
|
|
Abigail-II,
From looking at some of the results of others ( and in genereal), you are correct on your two constraints. How to impose that I am not sure, however, you are right.
Cheers,
Dr.J
| [reply] |
|
|
There's no garantee that both constraints can be satisfied
both simultaniously. What's it going to be? ;-)
Abigail
| [reply] |
|
|
Re: Complicated pattern match (book recommendations)
by gjb (Vicar) on Jan 19, 2003 at 20:26 UTC
|
| [reply] [d/l] |
Re: Complicated pattern match (from CB)
by tye (Sage) on Jan 20, 2003 at 00:44 UTC
|
Here is what I came up with yesterday in the chatterbox while discussing
theorbtwo's solution.
dr_jgbn confirmed that the shorter string is composed only of valid
codons (sequences of three characters each representing a nucleatide
base) and that introns will not be inserted into the middle of a codon
(despite the example not meeting these criteria).
theorbtwo's solution would put the introns as soon as possible (which would
not make the introns as long as possible) and this would lead to lots of
backtracking before the solution was found. Consider 123456 vs. 1235432456
which theorbtwo's solution splits like 12(354)3(2)456 while changing (.*)
to (.*?) would split like 123(5)4(324)56. But the longest possible intron
is found like this 123(5432)456.
So I went with non-greedy (.*?) as it doesn't have to backtrack (except
trivially) unless the strings don't actually match (I got the impression
that the strings were already known to match, but I'll cover that later).
And by keeping codons together, the splitting is sensible (it should match
how a cell would interpret the DNA/RNA, I'd think).
#!/usr/bin/perl
use strict;
while( 1 ) {
my $valid= <DATA> or exit( 0 );
chomp( $valid );
my $dirty= <DATA> or die "Missing second sequence";
chomp( $dirty );
$valid =~ s#(...)#(.*?)\Q$1\E#g;
my @intron= $dirty =~ /^$valid(.*?)/;
for my $intron ( @intron ) {
next if ! length($intron);
print "$intron\n";
}
}
__END__
ATGGAGTCGACGAATTTGAAGAAT
GCACCGATGGAGTAGGTCGACGATCTCAATTTGTCGAAGAAT
which produces:
GCACCG
TAGG
ATCTC
TCG
| [reply] [d/l] [select] |
Re: Complicated pattern match
by JamesNC (Chaplain) on Jan 19, 2003 at 23:33 UTC
|
I liked your problem:
If I understand your problem, the sequences in A must all be matched in order of appearance in B without repeating the sequence in A or in B. You may need to sequence through a possible 24 times, but I leave that up to you as you will see how to accomplish this. Try uncommenting the second data and you will see what I mean: My solution=)
#!/perl/bin/perl
use strict;
my @chem;
my %seqB;
my ($key, $value,$data, $pos, $x, $num, $seq);
my $DNA = "xxxxxxATGGAGyxxxTCGAzxxxxCGAATTTGAAxxwGAAT";
#my $DNA = "CGATAAATGGAGTGACTCGAGATCGCGAATTTGCCAAACACGAAT";
while(<DATA>){
chomp;
@chem = split //;
}
my $seq_prev ='';
for(1..10){
$num = @chem;
$num--;
for (0..$num){
$key = join '', @chem[0..$_];
if($DNA =~ /($key)/i){ $seq = $`; $data = $key; $pos=$_+1; }
#first sequence in A found
}
#remove the matched squence from future possible matches
splice @chem, 0, $pos;
#show the results
my $chem= join '', @chem;
print "FIRST SEQ MATCH FOUND: $seq_prev|< $data >|$chem\n";
print "\nDNA(B)- BEFORE < $data >: $seq";
$seq_prev = $data.$seq_prev;
$seqB{$data} = $seq;
#remove the matched DNA B pattern matched data
$DNA =~ s/$seq$data//;
# exit(0) if $DNA eq ''; #if you wanted to match more than 4
}
print "AS A HASH TABLE KEY:VALUE PAIRS\n";
foreach (keys %seqB){
print "$_ = $seqB{$_}\n";
}
__DATA__
ATGGAGTCGACGAATTTGAAGAAT
It gives you what you want for both the data you provided and for any generic data sets.
JamesNC :0)
| [reply] [d/l] |
Re: Complicated pattern match
by OM_Zen (Scribe) on Jan 19, 2003 at 05:24 UTC
|
my $A =~ s/[A-Z]/\|/g;
my @values = split('\|',$A);
map { if ($_ ne "" ) {print "[$_]\n";} } @values;
| [reply] [d/l] |
|
|
No, that don't work. Your first line will substitue any A-Z character in $A with a pipe, which means all characters in this case since $A doesn't contain anything else. Then you split on pipes, which gets you a @values full of empty strings. Then you map in void context..
If you leave out the substitution and split on nothing, it “works”:
my @values = split //, $A;
map { if ($_ ne "" ) {print "[$_]\n";} } @values;
Except all it does is split $A into characters and print them on a line each. I have no idea how that is supposed to solve the poster's problem though, since it doesn't even look at $B. The if is superfluous as well, since every element of @values is guaranteed to contain one character from $A.
Makeshifts last the longest. | [reply] [d/l] |
|
|
[xxxxxx]
[yxxx]
[zxxxx]
[xxw]
I guess the split converts only the caps in the $B and then I print the non_exmpty strings. Please let me know where I am totally wrong. I thought the $A was discontinuous, but the order of the characters remained the same and hence thought I could do it this way. | [reply] [d/l] |
|
|
Re: Complicated pattern match
by pizza_milkshake (Monk) on Jan 20, 2003 at 01:11 UTC
|
ok, it's more C-ish than perl-ish, but i betcha it's faster:
#!/usr/bin/perl -l
my $A = 'ATGGAGTCGACGAATTTGAAGAAT';
my $B = 'xxxxxxATGGAGyxxxTCGAzxxxxCGAATTTGAAxxwGAAT';
my ($An, $Bn) = (0, 0); # relative indices
my ($Ac, $Bc); # characters
my ($difpos, $difstr) = (0, ''); # keep track of differences
for $Bn (0 .. length($B)-1){
$Ac = substr($A, $An, 1);
$Bc = substr($B, $Bn, 1);
if ($Ac eq $Bc){
print "$difpos: $difstr" if $difstr;
$difstr = '';
$An++;
$difpos++;
} else {
$difstr .= $Bc;
}
}
perl -e'$_=q#: 13_2: 12/"{>: 8_4) (_4: 6/2"-2; 3;-2"\2: 5/7\_/\7: 12m m::#;s#:#\n#g;s#(\D)(\d+)#$1x$2#ge;print'
| [reply] [d/l] |
Re: Complicated pattern match
by scain (Curate) on Jan 21, 2003 at 04:03 UTC
|
OK, so there have been several responses so far, and it appears that at least some of them work :-)
In the spirit of TMTOWTDI, I thought I would take a different approach. The problems dr_jgbn presents could be used to do several things, though the most obvious is to look for exons that are included in one transcript and not another. Also, the concern that Abagail-II raises is valid, therefore, I think it is reasonable and desireable to use industry standard analysis tools to arrive at a solution. What follows is a fairly long description of how I solved that problem.
Scott
Project coordinator of the Generic Model Organism Database Project
| [reply] [d/l] |