
The basics: I'm new to this, and I've gotten myself so far down a rabbit hole that my mind is going blank. I have searched for the basics on pattern matching, phoned a friend, and tried venturing into the BioPerl world, but can't wrap my head around the solutions involving C, or using Hex. (Also, the BioPerl How To's seem to be broken?)

The layout: I have a database with multiple tables that I am working with. One of the tables contains a list of pairs (pairs of consensus sequences and their mismatch.) These pairs correspond to ID's in another table. And those ID's correspond to sequences in a third table.

So far, (with the extensive help of a colleague) I have a script that looks through the list of pairs and finds both sequences by their id and their corresponding mismatches in the ID table (four sequence id's in total.) The script then matches those ID's with their sequences in the sequence table. I have also gone as far as to identify the central base for each sequence. (Each sequence is 25 bases in length, and I have filtered the results to only find sequences that are exactly 25 bases long.)

The problem: For each sequence, there is only one potential mismatch listed in the database. I need to find out if the other two sequences exist in the database. If my (shortened for simplicity) sequence was ACTGCCT, I would call this my pm_seq. Its identified mismatch (mm_seq) might be ACTACCT. But the sequence table itself may contain up to two more mismatches. (ACTCCCT and ACTTCCT.) I need to search each of my identified pm_id's sequences against the entire sequence list and identify the id's of all three possible mismatches. One attempt at this has been to set a pattern for the sequence like so: my $pattern = ($ppi_pm_seq =~ /(CAGT{12})(CAGT{1})(CAGT{12})/). But I am lost when it comes to searching against my sequence table and identifying sequence ID's that have a sequence that matches $1 AND $3.
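For illustration, here is a minimal sketch of how the split into left flank, central base, and right flank might look, and how the three candidate mismatches could be generated from it. The sub name center_mismatches is hypothetical and not part of the existing script. One thing worth noting: without square brackets, CAGT{12} means "CAG" followed by twelve T's, so the sketch uses the character class [CAGT] instead.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): given a probe
# sequence of odd length, return the sequences that differ from it
# only at the central base.
sub center_mismatches {
    my ($seq) = @_;
    my $flank = int(length($seq) / 2);    # 12 for a 25-mer

    # [CAGT] is a character class matching any one base.
    # The anchors ^ and $ force the whole sequence to be consumed,
    # so an even-length or non-ACGT sequence yields no match.
    my ($left, $mid, $right) =
        $seq =~ /^([CAGT]{$flank})([CAGT])([CAGT]{$flank})$/
        or return;

    # Keep the flanks, substitute every base except the original one.
    return map { "$left$_$right" } grep { $_ ne $mid } qw(A C G T);
}

# Shortened example from the text above.
my @mm = center_mismatches('ACTGCCT');
print "$_\n" for @mm;    # ACTACCT, ACTCCCT, ACTTCCT
```

Generating the three candidates as plain strings means the later search can be an exact string comparison rather than a regex match against $1 and $3.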

Ultimately I am looking to generate an output that contains each sequence, and up to three mismatches that differ at the central base for a large list of target sequences. The code below is later in the script, but it fetches the sequences for the identified pairs. (one pm and one mm per pair.) But I still need to search the entire sequence set for the other two possible mm's. (I think it will be easier to search for all 3 possible mm's based on the pm sequence rather than identify the known mm and search for the other two.)

#perl 5.8.8
#code to get data from database not included

my $ppiquery1 = "select sequence from seq_table where seq_id = $ppi_pm_id AND LENGTH(seq) = 25";
my $ppiquery2 = "select sequence from seq_table where seq_id = $ppi_mm_id AND LENGTH(seq) = 25";
my $mpiquery1 = "select sequence from seq_table where seq_id = $mpi_pm_id AND LENGTH(seq) = 25";
my $mpiquery2 = "select sequence from seq_table where seq_id = $mpi_mm_id AND LENGTH(seq) = 25";
my @row3 = $dbh->selectrow_array($ppiquery1);
    unless (@row3) { die "No seq for $ppi_pm_id?\n"; }
my @row4 = $dbh->selectrow_array($ppiquery2);
    unless (@row4) { die "No seq for $ppi_mm_id?\n"; }
my @row5 = $dbh->selectrow_array($mpiquery1);
    unless (@row5) { die "No seq for $mpi_pm_id?\n"; }
my @row6 = $dbh->selectrow_array($mpiquery2);
    unless (@row6) { die "No seq for $mpi_mm_id?\n"; }
my ($ppi_pm_seq) = @row3;
my ($ppi_mm_seq) = @row4;
my ($mpi_pm_seq) = @row5;
my ($mpi_mm_seq) = @row6;

Beyond here, I've attempted to set the pattern with ($ppi_pm_seq =~ /(CAGT{12})(CAGT{1})(CAGT{12})/) and then use a series of if/elsif for each of the four possible combinations of $2 (A, C, T, G) to store the seq_id's for each group....to no avail. Any guidance would be greatly appreciated. I'd like to stay completely in perl if possible, mostly because I barely grasp this let alone something new!
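One direction for the search step, offered only as a sketch: instead of a chain of if/elsif on $2, build a hash that maps every sequence to its id, then look up each candidate mismatch directly. Everything here is an assumption for illustration. The sub name mismatch_ids, the %id_for hash, and the ids are made up, and the sketch assumes each sequence appears only once in the sequence table. In the real script the hash would presumably be filled once from a query over seq_table.

```perl
use strict;
use warnings;

# Hypothetical helper: $id_for is a hash ref mapping each sequence to
# its seq_id. Returns a hash ref of central base => seq_id for those
# central-base mismatches of $pm_seq that are present in $id_for.
sub mismatch_ids {
    my ($pm_seq, $id_for) = @_;
    my $mid_pos = int(length($pm_seq) / 2);    # index 12 in a 25-mer
    my $pm_base = substr($pm_seq, $mid_pos, 1);

    my %found;
    for my $base (grep { $_ ne $pm_base } qw(A C G T)) {
        my $candidate = $pm_seq;
        substr($candidate, $mid_pos, 1) = $base;    # swap the central base
        $found{$base} = $id_for->{$candidate}
            if exists $id_for->{$candidate};
    }
    return \%found;
}

# Made-up ids and shortened sequences, for illustration only.
my %id_for = (ACTGCCT => 101, ACTACCT => 102, ACTTCCT => 104);
my $hits = mismatch_ids('ACTGCCT', \%id_for);
print "$_ => $hits->{$_}\n" for sort keys %$hits;
```

A hash lookup is an exact match on the whole sequence, so both flanks are compared at once, and a mismatch that is absent from the table simply has no entry in the result.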

In reply to DNA Pattern Matching by Speed_Freak
