Speed_Freak has asked for the wisdom of the Perl Monks concerning the following question:

The basics: I'm new to this, and I've gotten myself so far down a rabbit hole that my mind is going blank. I have searched for the basics on pattern matching, phoned a friend, and tried venturing into the BioPerl world, but can't wrap my head around the solutions involving C, or using Hex. (Also, the BioPerl How To's seem to be broken?)

The layout: I have a database with multiple tables that I am working with. One of the tables contains a list of pairs (pairs of consensus sequences and their mismatch.) These pairs correspond to ID's in another table. And those ID's correspond to sequences in a third table.

So far, (with the extensive help of a colleague) I have a script that looks through the list of pairs and finds both sequences by their id and their corresponding mismatches in the ID table (four sequence id's in total.) The script then matches those ID's with their sequences in the sequence table. I have also gone as far as to identify the central base for each sequence. (Each sequence is 25 bases in length, and I have filtered the results to only find sequences that are exactly 25 bases long.)

The problem: For each sequence, there is only one potential mismatch listed in the database. I need to find out if the other two sequences exist in the database. If my (shortened for simplicity) sequence was ACTGCCT, I would call this my pm_seq. It's identified mismatch (mm_seq) might be ACTACCT. But the sequence table itself may contain up to two more mismatches. (ACTCCCT and ACTTCCT.) I need to search each of my identified pm_id's seqences against the entire sequence list and identify the id's of all three possible mismatches. One attempt at this has been to set a pattern for the sequence like so: my $pattern = ($ppi_pm_seq =~ /(CAGT{12})(CAGT{1})(CAGT{12})/). But I am lost when it comes to searching against my sequence table and identifying sequence ID's that have a sequence that matches $1 AND $3.

Ultimately I am looking to generate an output that contains each sequence, and up to three mismatches that differ at the central base for a large list of target sequences. The code below is later in the script, but it fetches the sequences for the identified pairs. (one pm and one mm per pair.) But I still need to search the entire sequence set for the other two possible mm's. (I think it will be easier to search for all 3 possible mm's based on the pm sequence rather than identify the known mm and search for the other two.)

#perl 5.8.8 #code to get data from database not included my $ppiquery1 = "select sequence from seq_table where seq_id = $ppi_pm +_id AND LENGTH(seq) = 25"; my $ppiquery2 = "select sequence from seq_table where seq_id = $ppi_mm +_id AND LENGTH(seq) = 25"; my $mpiquery1 = "select sequence from seq_table where seq_id = $mpi_pm +_id AND LENGTH(seq) = 25"; my $mpiquery2 = "select sequence from seq_table where seq_id = $mpi_mm +_id AND LENGTH(seq) = 25"; my @row3 = $dbh->selectrow_array($ppiquery1); unless (@row3) { die "No seq for $ppi_pm_id?\n"; } my @row4 = $dbh->selectrow_array($ppiquery2); unless (@row4) { die "No seq for $ppi_mm_id?\n"; } my @row5 = $dbh->selectrow_array($mpiquery1); unless (@row5) { die "No seq for $mpi_pm_id?\n"; } my @row6 = $dbh->selectrow_array($mpiquery2); unless (@row6) { die "No seq for $mpi_mm_id?\n"; } my ($ppi_pm_seq) = @row3; my ($ppi_mm_seq) = @row4; my ($mpi_pm_seq) = @row5; my ($mpi_mm_seq) = @row6;

Beyond here, I've attempted to set the pattern with ($ppi_pm_seq =~ /(CAGT{12})(CAGT{1})(CAGT{12})/) and then use a series of if/elsif for each of the four possible combinations of $2 (A, C, T, G) to store the seq_id's for each group....to no avail. Any guidance would be greatly appreciated. I'd like to stay completely in perl if possible, mostly because I barely grasp this let alone something new!

Replies are listed 'Best First'.
Re: DNA Pattern Matching
by AnomalousMonk (Archbishop) on Jul 19, 2017 at 22:27 UTC

    I'm confused about what you want (update: please remember this is PerlMonks and not BioMonks), but a part of it seems to be a permutation of all the other middle, single-base possibilities given the one that is actually present. If so, maybe something like this:

    c:\@Work\Perl\monks>perl -wMstrict -le "my $ppi_pm_seq = 'ACTGCCT'; ;; my $n = 3; ;; my $extraction = my ($before, $mid, $after) = $ppi_pm_seq =~ m{ \A (.{$n}) (.) (.{$n}) \z }xms; ;; die qq{no extraction from '$ppi_pm_seq'} unless $extraction; ;; my ($ppi_pm_id, $ppi_mm_id, $mpi_pm_id, $mpi_mm_id) = map qq{$before$_$after}, $mid, grep $_ ne $mid, qw(A T C G) ; ;; print qq{'$_'} for $ppi_pm_id, $ppi_mm_id, $mpi_pm_id, $mpi_mm_id; " 'ACTGCCT' 'ACTACCT' 'ACTTCCT' 'ACTCCCT'
    Note that the  $ppi_pm_id permuted sequence is the same as the "input"  $ppi_pm_seq sequence (if that's what you want).

    Update 1: Note also that the simpler  (.{$n}) (.) or  (.{n}) (.) expressions suffice if you can guarantee, as an initial step, that your input strings are only  A T C G characters. This can be as simple as (tested):

    my $s = 'ATCGxTGAC'; my $foreign = $s =~ tr/ATCG//c; print qq{foreign character in '$s'} if $foreign;

    Update 2: An interesting side note: If you have Perl version 5.10+ available, you can use the  (?-PARNO) relative recursive subpattern regex extension to (perhaps) simplify things a bit (tested):
        m{ \A (.{$n}) (.) ((?-3)) \z }xms;
    in place of
        m{ \A (.{$n}) (.) (.{$n}) \z }xms;
    (I rather doubt this alteration would speed things up significantly, but I've not done any testing; it's just a thought.)


    Give a man a fish:  <%-{-{-{-<

      Apologies, I saw you comment similarly on a thread about K-mers. (BioMonks vs PerlMonks.) Is there a specific site or section for bio based questions that I have missed?

      Let me see if I can wrap my head around all of this and better explain myself... I'm going to walk through your reply out loud.

      So the $ppi_pm_seq is defined as an example sequence. (This would be filled in from each sequence in the group I have. -250k+ sequences.) The $n is defined as 3, since we are after the middle base of the 7-mer sequence. (For me, $n would be set to 12 since all of my sequences are 25-mer.) $my_extraction looks to set the pattern of the sequence into three variables that instead of being $1, $2, and $3 by default, are now $before, $mid, and $after (I didn't know that you could define the pattern variables!) - containing the first 3 bases (any single character x 3 because of . $n), the middle base, and the last three bases? The /A designates the first pattern to match only at the beginning of the string, while the \z directs the third pattern to match only at the end of the string, correct? Is the xms required?

      The next block is confusing to me, and that may be my fault. So in my parent list, I have "Pair" ID's. These pair ID's each correspond to another pair of ID's. So one ID becomes two ID's which become four ID's. Out of those final four ID's, two are targets, and two are mismatches with the central base being different. That is where the ppi_pm, ppi_mm, mpi_pm, and mpi_mm come into play. pm's and mm's in the same pair differ only by the central base, while ppi's and mpi's are different sequences entirely. (Not entirely true, they are complement sequences...but that's not important for this.) For simplicity, i'm guessing it is best to ignore the identified mm sequences in this code?

      If I'm following along correctly, you are setting each of those variables to be the evaluation of the pattern variables (map{block}) $before and $after with the system default variable $_ serving to hold the center base for each $mid? and then following it with an evaluation where the center base is not equal to the target stored in $mid? for the list of A,T,C, and G?

      Is this just creating the possible permutations of the target string based on the possible outcomes? The sequences are guaranteed to only contain A,T,C, and G. And during my searches I found a couple of tools in 5.10 plus that looked useful, but alas I am stuck with what I have for the time being. (We attempted running the overarching script group on a newer deployment and couldn't make it functional. (Maybe a project for the future.)

      Taking a crack at the code...

      #$ppi_pm_seq and $mpi_pm_seq are defined from database. Can I use them + as is, or would i need to push them all into a single list to run th +rough this? my ($eval_1, $eval_2) = ($ppi_pm_seq, $mpi_pm_seq); #OR foreach $condensed_sequence_list_item my $eval = $condensed_sequence_list_item ;; my $n = 12; ;; my $extraction = my ($before, $mid, $after) = $eval =~ m{ \A (.{$ +n}) (.) (.{$n}) \z }xms; ;; die qq{no extraction from '$ppi_pm_seq'} unless $extraction; ;; my ($eval_A, $eval_T, $eval_C, $eval_G) = map qq{$before$_$after} +,$mid, grep $_ ne $mid, qw(A T C G) #this is where i get lost though. It seems like this is going to force + create the 4 possible answers? Would I then do something like... my $mm1 = "select seq from seq_table where seq = $eval_A"; my $mm2 = "select seq from seq_table where seq = $eval_T"; my $mm3 = "select seq from seq_table where seq = $eval_C"; my $mm4 = "select seq from seq_table where seq = $eval_G"; #even though I am labeling them here as mm, one of them is my pm targe +t. (I'd need to identify that at some point.) Also, these sequences +may not exist. There is one guaranteed mm per pm, and possibly an add +itional two

        I don't have much time right now, so just a few general responses in no particular order.

        ... comment similarly on a thread about K-mers. (BioMonks vs PerlMonks.)

        What I mean to convey by such comments is that while I and all (?) of my fellows in this place are biological (as opposed to silicon- or whatever-else-based) Monks, I and many others are not well versed in biological and molecular biological lore. So when someone starts posting about K-mers, consensus sequences, hairpin sequences, etc., etc., I must set off on a little research program that may quickly run into a TL;DR limit. It's always best to post links to explanations of the terms used. E.g., [wp://Consensus sequence] for Consensus sequence. See What shortcuts can I use for linking to other information?. (Although the given link seems too verbose; a more focused definition might be better.)

        ... I found a couple of tools in 5.10 plus that looked useful, but alas I am stuck with what I have ...

        What version(s) of Perl do you have available to you?

        I'm still trying to wrap my head around an understanding of the effect of the database in the operations you describe, and just what you are ultimately trying to achieve. Maybe I'll have time for more later.


        Give a man a fish:  <%-{-{-{-<

        This reply may have been so long delayed as to render it otiose; for the sake of completeness, however, I'll post it.

        So the $ppi_pm_seq is defined as an example sequence. ... The $n is defined as 3, since we are after the middle base of the 7-mer sequence. ... $my_extraction looks to set the pattern of the sequence into three variables that instead of being $1, $2, and $3 ... containing the first 3 bases ... the middle base, and the last three bases? The /A designates the first pattern to match only at the beginning of the string, while the \z directs the third pattern to match only at the end of the string, correct? Is the xms required?

        This seems like a good recap of the regex parsing/extraction section of my example code. Yes,  \A \z are absolute string-end anchors. The  /x of the  /xms regex modifier "tail" is certainly needed because I'm using whitespace in the regex for readability; without an  /x modifier, the regex engine would have to match the literal regex whitespace against corresponding whitespace in the string — not what you want. The  /m /s modifiers of the tail are not strictly needed in the given regex, but are there because every  qr// m// s/// regex expression I write uses a standard  /xms tail. This is in line with the regex-specific recommendations of TheDamian's Perl Best Practices (PBP). Not everyone agrees with all of the PBPs and nor do I, but I warmly embrace and heartily recommend all the regex BPs.

        The next block is confusing to me ... you are setting each of those variables to be the evaluation of the pattern variables (map{block}) $before and $after with the system default variable $_ serving to hold the center base for each $mid? and then following it with an evaluation where the center base is not equal to the target stored in $mid? for the list of A,T,C, and G?

        As I have said, I really don't understand your overall intent. However, I latched on to the "... use a series of if/elsif for each of the four possible combinations of $2 (A, C, T, G) to store the seq_id's for each group ..." passage in the OP and thought I could give an example of permuting all possible combinations. Once again, your recap of the actions of my code example seems accurate.

        Is this just creating the possible permutations of the target string based on the possible outcomes?

        Yes. Whether this is what you need or not is still beyond me.

        Update: Made a number of small changes to enhance clarity. Since none of the changes seem to affect context, I will not bother to enumerate them.


        Give a man a fish:  <%-{-{-{-<