Re: DNA Pattern Matching

I'm confused about what you want (update: please remember this is PerlMonks and not BioMonks), but a part of it seems to be a permutation of all the other middle, single-base possibilities given the one that is actually present. If so, maybe something like this:

c:\@Work\Perl\monks>perl -wMstrict -le
"my $ppi_pm_seq = 'ACTGCCT';
 ;;
 my $n = 3;
 ;;
 my $extraction =
 my ($before, $mid, $after) =
 $ppi_pm_seq =~ m{ \A (.{$n}) (.) (.{$n}) \z }xms;
 ;;
 die qq{no extraction from '$ppi_pm_seq'} unless $extraction;
 ;;
 my ($ppi_pm_id, $ppi_mm_id, $mpi_pm_id, $mpi_mm_id) =
   map qq{$before$_$after},
   $mid, grep $_ ne $mid,
   qw(A T C G)
   ;
 ;;
 print qq{'$_'} for $ppi_pm_id, $ppi_mm_id, $mpi_pm_id, $mpi_mm_id;
"
'ACTGCCT'
'ACTACCT'
'ACTTCCT'
'ACTCCCT'
[download]

Note that the $ppi_pm_id permuted sequence is the same as the "input" $ppi_pm_seq sequence (if that's what you want).

Update 1: Note also that the simpler (.{$n}) (.) or (.{n}) (.) expressions suffice if you can guarantee, as an initial step, that your input strings are only A T C G characters. This can be as simple as (tested):

my $s = 'ATCGxTGAC';
my $foreign = $s =~ tr/ATCG//c;
print qq{foreign character in '$s'} if $foreign;
[download]

Update 2: An interesting side note: If you have Perl version 5.10+ available, you can use the (?-PARNO) relative recursive subpattern regex extension to (perhaps) simplify things a bit (tested):
m{ \A (.{$n}) (.) ((?-3)) \z }xms;
in place of
m{ \A (.{$n}) (.) (.{$n}) \z }xms;
(I rather doubt this alteration would speed things up significantly, but I've not done any testing; it's just a thought.)

Give a man a fish: <%-{-{-{-<

Comment on Re: DNA Pattern Matching Select or Download Code

Replies are listed 'Best First'.
Re^2: DNA Pattern Matching by Speed_Freak (Sexton) on Jul 20, 2017 at 13:40 UTC
Apologies, I saw you comment similarly on a thread about K-mers. (BioMonks vs PerlMonks.) Is there a specific site or section for bio based questions that I have missed? Let me see if I can wrap my head around all of this and better explain myself... I'm going to walk through your reply out loud. So the $ppi_pm_seq is defined as an example sequence. (This would be filled in from each sequence in the group I have. -250k+ sequences.) The $n is defined as 3, since we are after the middle base of the 7-mer sequence. (For me, $n would be set to 12 since all of my sequences are 25-mer.) $my_extraction looks to set the pattern of the sequence into three variables that instead of being $1, $2, and $3 by default, are now $before, $mid, and $after (I didn't know that you could define the pattern variables!) - containing the first 3 bases (any single character x 3 because of . $n), the middle base, and the last three bases? The /A designates the first pattern to match only at the beginning of the string, while the \z directs the third pattern to match only at the end of the string, correct? Is the xms required? The next block is confusing to me, and that may be my fault. So in my parent list, I have "Pair" ID's. These pair ID's each correspond to another pair of ID's. So one ID becomes two ID's which become four ID's. Out of those final four ID's, two are targets, and two are mismatches with the central base being different. That is where the ppi_pm, ppi_mm, mpi_pm, and mpi_mm come into play. pm's and mm's in the same pair differ only by the central base, while ppi's and mpi's are different sequences entirely. (Not entirely true, they are complement sequences...but that's not important for this.) For simplicity, i'm guessing it is best to ignore the identified mm sequences in this code? If I'm following along correctly, you are setting each of those variables to be the evaluation of the pattern variables (map{block}) $before and $after with the system default variable $_ serving to hold the center base for each $mid? and then following it with an evaluation where the center base is not equal to the target stored in $mid? for the list of A,T,C, and G? Is this just creating the possible permutations of the target string based on the possible outcomes? The sequences are guaranteed to only contain A,T,C, and G. And during my searches I found a couple of tools in 5.10 plus that looked useful, but alas I am stuck with what I have for the time being. (We attempted running the overarching script group on a newer deployment and couldn't make it functional. (Maybe a project for the future.) Taking a crack at the code... #$ppi_pm_seq and $mpi_pm_seq are defined from database. Can I use them + as is, or would i need to push them all into a single list to run th +rough this? my ($eval_1, $eval_2) = ($ppi_pm_seq, $mpi_pm_seq); #OR foreach $condensed_sequence_list_item my $eval = $condensed_sequence_list_item ;; my $n = 12; ;; my $extraction = my ($before, $mid, $after) = $eval =~ m{ \A (.{$ +n}) (.) (.{$n}) \z }xms; ;; die qq{no extraction from '$ppi_pm_seq'} unless $extraction; ;; my ($eval_A, $eval_T, $eval_C, $eval_G) = map qq{$before$_$after} +,$mid, grep $_ ne $mid, qw(A T C G) #this is where i get lost though. It seems like this is going to force + create the 4 possible answers? Would I then do something like... my $mm1 = "select seq from seq_table where seq = $eval_A"; my $mm2 = "select seq from seq_table where seq = $eval_T"; my $mm3 = "select seq from seq_table where seq = $eval_C"; my $mm4 = "select seq from seq_table where seq = $eval_G"; #even though I am labeling them here as mm, one of them is my pm targe +t. (I'd need to identify that at some point.) Also, these sequences +may not exist. There is one guaranteed mm per pm, and possibly an add +itional two [download]	[reply] [d/l]
Re^3: DNA Pattern Matching by AnomalousMonk (Archbishop) on Jul 20, 2017 at 16:46 UTC
I don't have much time right now, so just a few general responses in no particular order. ... comment similarly on a thread about K-mers. (BioMonks vs PerlMonks.) What I mean to convey by such comments is that while I and all (?) of my fellows in this place are biological (as opposed to silicon- or whatever-else-based) Monks, I and many others are not well versed in biological and molecular biological lore. So when someone starts posting about K-mers, consensus sequences, hairpin sequences, etc., etc., I must set off on a little research program that may quickly run into a TL;DR limit. It's always best to post links to explanations of the terms used. E.g., `[wp://Consensus sequence]` for Consensus sequence. See What shortcuts can I use for linking to other information?. (Although the given link seems too verbose; a more focused definition might be better.) ... I found a couple of tools in 5.10 plus that looked useful, but alas I am stuck with what I have ... What version(s) of Perl do you have available to you? I'm still trying to wrap my head around an understanding of the effect of the database in the operations you describe, and just what you are ultimately trying to achieve. Maybe I'll have time for more later. Give a man a fish: `<%-{-{-{-<`	[reply] [d/l] [select]
Re^3: DNA Pattern Matching by Sinistral (Monsignor) on Jul 20, 2017 at 15:22 UTC
Yes - there is BioPerl.org	[reply]
Re^4: DNA Pattern Matching by Speed_Freak (Sexton) on Jul 20, 2017 at 15:46 UTC
While my data falls into the bio realm, I don't think the actual operation uses anything that BioPerl provides. (We do have BioPerl on the server as it is used in other scripts.) I have sent a registration request to the BioPerl mailing list though. If it helps, think of my data as some sort of date string where I need to search for any other occurrences of the central portion. 072017, where I need to find any other "transaction" that happened with the matching month of 07 and year of 17. Basically I need to split each string into three parts, and find matches in the database that are identical only in the first and last part. (But I have 250k+ iterations to search from, through a list of over 1 million strings each time.)	[reply]
Re^3: DNA Pattern Matching by AnomalousMonk (Archbishop) on Jul 21, 2017 at 18:48 UTC
This reply may have been so long delayed as to render it otiose; for the sake of completeness, however, I'll post it. So the $ppi_pm_seq is defined as an example sequence. ... The $n is defined as 3, since we are after the middle base of the 7-mer sequence. ... $my_extraction looks to set the pattern of the sequence into three variables that instead of being $1, $2, and $3 ... containing the first 3 bases ... the middle base, and the last three bases? The /A designates the first pattern to match only at the beginning of the string, while the \z directs the third pattern to match only at the end of the string, correct? Is the xms required? This seems like a good recap of the regex parsing/extraction section of my example code. Yes, `\A \z` are absolute string-end anchors. The `/x` of the `/xms` regex modifier "tail" is certainly needed because I'm using whitespace in the regex for readability; without an `/x` modifier, the regex engine would have to match the literal regex whitespace against corresponding whitespace in the string — not what you want. The `/m /s` modifiers of the tail are not strictly needed in the given regex, but are there because every `qr// m// s///` regex expression I write uses a standard `/xms` tail. This is in line with the regex-specific recommendations of TheDamian's Perl Best Practices (PBP). Not everyone agrees with all of the PBPs and nor do I, but I warmly embrace and heartily recommend all the regex BPs. The next block is confusing to me ... you are setting each of those variables to be the evaluation of the pattern variables (map{block}) $before and $after with the system default variable $_ serving to hold the center base for each $mid? and then following it with an evaluation where the center base is not equal to the target stored in $mid? for the list of A,T,C, and G? As I have said, I really don't understand your overall intent. However, I latched on to the "... use a series of if/elsif for each of the four possible combinations of $2 (A, C, T, G) to store the seq_id's for each group ..." passage in the OP and thought I could give an example of permuting all possible combinations. Once again, your recap of the actions of my code example seems accurate. Is this just creating the possible permutations of the target string based on the possible outcomes? Yes. Whether this is what you need or not is still beyond me. Update: Made a number of small changes to enhance clarity. Since none of the changes seem to affect context, I will not bother to enumerate them. Give a man a fish: `<%-{-{-{-<`	[reply] [d/l] [select]