Speed_Freak has asked for the wisdom of the Perl Monks concerning the following question:
The basics: I'm new to this, and I've gotten myself so far down a rabbit hole that my mind is going blank. I have searched for the basics on pattern matching, phoned a friend, and tried venturing into the BioPerl world, but can't wrap my head around the solutions involving C, or using Hex. (Also, the BioPerl How To's seem to be broken?)
The layout: I have a database with multiple tables that I am working with. One of the tables contains a list of pairs (each pair is a consensus sequence and its mismatch). These pairs correspond to IDs in another table, and those IDs correspond to sequences in a third table.
So far (with the extensive help of a colleague), I have a script that looks through the list of pairs and finds both sequences and their corresponding mismatches by ID in the ID table (four sequence IDs in total). The script then matches those IDs with their sequences in the sequence table. I have also gone as far as identifying the central base for each sequence. (Each sequence is 25 bases in length, and I have filtered the results to only find sequences that are exactly 25 bases long.)
The problem: For each sequence, there is only one potential mismatch listed in the database. I need to find out if the other two sequences exist in the database. If my (shortened for simplicity) sequence was ACTGCCT, I would call this my pm_seq. Its identified mismatch (mm_seq) might be ACTACCT. But the sequence table itself may contain up to two more mismatches (ACTCCCT and ACTTCCT). I need to search each of my identified pm_id sequences against the entire sequence list and identify the IDs of all three possible mismatches. One attempt at this has been to set a pattern for the sequence like so: my $pattern = ($ppi_pm_seq =~ /(CAGT{12})(CAGT{1})(CAGT{12})/). But I am lost when it comes to searching against my sequence table and identifying sequence IDs that have a sequence that matches $1 AND $3.
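A note on the pattern above: as written, CAGT{12} matches the literal text CAG followed by twelve T's, because a quantifier binds only to the item immediately before it. A character class such as [ACGT]{12} is needed to match any twelve bases. Below is a minimal sketch of splitting a 25-mer into its two flanks and central base and listing the three sequences that differ only at the center. The helper name central_mismatches and the sample sequence are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split a 25-base sequence into 12 + 1 + 12 bases and
# return the three sequences that differ only at the central base.
sub central_mismatches {
    my ($pm_seq) = @_;

    # [ACGT]{12} matches any 12 bases; CAGT{12} would match "CAG" plus 12 T's.
    my ($left, $mid, $right) =
        $pm_seq =~ /\A([ACGT]{12})([ACGT])([ACGT]{12})\z/
        or return;    # not a clean 25-mer

    # Keep the flanks, swap in every base except the original central one.
    return map { $left . $_ . $right } grep { $_ ne $mid } qw(A C G T);
}

my $pm_seq = 'ACGTACGTACGTAACGTACGTACGT';    # made-up 25-mer, central base 'A'
print "$_\n" for central_mismatches($pm_seq);
```

Each returned string could then be compared against the sequence table to see which of the three candidates actually exist.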
Ultimately I am looking to generate an output that contains each sequence, and up to three mismatches that differ at the central base for a large list of target sequences. The code below is later in the script, but it fetches the sequences for the identified pairs. (one pm and one mm per pair.) But I still need to search the entire sequence set for the other two possible mm's. (I think it will be easier to search for all 3 possible mm's based on the pm sequence rather than identify the known mm and search for the other two.)
#perl 5.8.8
#code to get data from database not included

my $ppiquery1 = "select sequence from seq_table where seq_id = $ppi_pm_id AND LENGTH(seq) = 25";
my $ppiquery2 = "select sequence from seq_table where seq_id = $ppi_mm_id AND LENGTH(seq) = 25";
my $mpiquery1 = "select sequence from seq_table where seq_id = $mpi_pm_id AND LENGTH(seq) = 25";
my $mpiquery2 = "select sequence from seq_table where seq_id = $mpi_mm_id AND LENGTH(seq) = 25";

my @row3 = $dbh->selectrow_array($ppiquery1);
unless (@row3) { die "No seq for $ppi_pm_id?\n"; }
my @row4 = $dbh->selectrow_array($ppiquery2);
unless (@row4) { die "No seq for $ppi_mm_id?\n"; }
my @row5 = $dbh->selectrow_array($mpiquery1);
unless (@row5) { die "No seq for $mpi_pm_id?\n"; }
my @row6 = $dbh->selectrow_array($mpiquery2);
unless (@row6) { die "No seq for $mpi_mm_id?\n"; }

my ($ppi_pm_seq) = @row3;
my ($ppi_mm_seq) = @row4;
my ($mpi_pm_seq) = @row5;
my ($mpi_mm_seq) = @row6;
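The queries above look sequences up by ID only. One possible way to search the whole table for central-base variants is SQL's LIKE operator, in which _ matches exactly one character. The sketch below assumes the column holding the bases is called sequence (the queries above mention both sequence and seq, so this would need adjusting to the real schema) and passes values through DBI placeholders instead of interpolating them into the SQL. Both helper names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a 25-mer into a LIKE pattern whose central
# position is the single-character wildcard '_'.
sub central_like_pattern {
    my ($seq) = @_;
    return unless length($seq) == 25;
    return substr($seq, 0, 12) . '_' . substr($seq, 13);
}

# Hypothetical helper: fetch [seq_id, sequence] rows that share both flanks
# with $pm_seq but are not $pm_seq itself.
# ASSUMPTION: table seq_table with columns seq_id and sequence.
sub find_central_variants {
    my ($dbh, $pm_seq) = @_;
    my $pattern = central_like_pattern($pm_seq) or return [];
    my $sql = 'SELECT seq_id, sequence FROM seq_table '
            . 'WHERE sequence LIKE ? AND sequence <> ?';
    # selectall_arrayref(statement, attributes, bind values...)
    return $dbh->selectall_arrayref($sql, undef, $pattern, $pm_seq);
}
```

With an open handle this would be called as my $rows = find_central_variants($dbh, $ppi_pm_seq); and each element of $rows would be a [seq_id, sequence] pair. Because the pattern is 25 characters long with no % wildcard, it should only match 25-character values.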
Beyond here, I've attempted to set the pattern with ($ppi_pm_seq =~ /(CAGT{12})(CAGT{1})(CAGT{12})/) and then use a series of if/elsif for each of the four possible values of $2 (A, C, T, G) to store the seq_ids for each group, to no avail. Any guidance would be greatly appreciated. I'd like to stay completely in Perl if possible, mostly because I barely grasp this, let alone something new!
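The if/elsif chain described above might be avoided by indexing every sequence under its two flanks, and then under its central base, in a nested hash. The following is only a sketch: it assumes the whole sequence table has already been read into a hash of seq_id to sequence, and the IDs, sequences, and the helper name mismatch_ids are invented.

```perl
use strict;
use warnings;

# Invented stand-in for the sequence table: seq_id => 25-base sequence.
my %seq_for = (
    101 => 'ACGTACGTACGTAACGTACGTACGT',
    102 => 'ACGTACGTACGTCACGTACGTACGT',
    103 => 'ACGTACGTACGTTACGTACGTACGT',
    104 => 'TTTTTTTTTTTTGAAAAAAAAAAAA',
);

# Index: "left flank/right flank" => { central base => [seq_ids] }.
my %by_flanks;
for my $id (sort { $a <=> $b } keys %seq_for) {
    my ($left, $mid, $right) =
        $seq_for{$id} =~ /\A([ACGT]{12})([ACGT])([ACGT]{12})\z/
        or next;    # skip anything that is not a clean 25-mer
    push @{ $by_flanks{"$left/$right"}{$mid} }, $id;
}

# Hypothetical helper: ids of sequences sharing both flanks with $pm_seq,
# keyed by their (different) central base.
sub mismatch_ids {
    my ($pm_seq) = @_;
    my ($left, $mid, $right) =
        $pm_seq =~ /\A([ACGT]{12})([ACGT])([ACGT]{12})\z/
        or return {};
    my $group = $by_flanks{"$left/$right"} || {};
    return { map { $_ => $group->{$_} } grep { $_ ne $mid } keys %$group };
}

my $mm = mismatch_ids($seq_for{101});
for my $base (sort keys %$mm) {
    print "$base: @{ $mm->{$base} }\n";
}
```

For the invented data, sequence 101 has central base A, so the loop lists ID 102 under C and ID 103 under T; no G variant exists in the sample, so nothing is reported for it. Building the index once and looking each pm sequence up in it avoids rescanning the full table for every pair.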
Replies are listed 'Best First'.
Re: DNA Pattern Matching
by AnomalousMonk (Archbishop) on Jul 19, 2017 at 22:27 UTC
by Speed_Freak (Sexton) on Jul 20, 2017 at 13:40 UTC
by AnomalousMonk (Archbishop) on Jul 20, 2017 at 16:46 UTC
by Sinistral (Monsignor) on Jul 20, 2017 at 15:22 UTC
by Speed_Freak (Sexton) on Jul 20, 2017 at 15:46 UTC
by AnomalousMonk (Archbishop) on Jul 21, 2017 at 18:48 UTC