MadraghRua has asked for the wisdom of the Perl Monks concerning the following question:

Hello Chaps and Chapesses,

I'm interested in using weighted regular expressions in searching DNA/RNA sequences. So, say I take a set of information on important pieces of DNA. If I look at 12 pieces of DNA I might end up with the following matrix:

 A   T   G   C
25  25  25  25   nucleotide 1
10  15  50  25   nucleotide 2
 0  90   5   5   nucleotide 3
12  16  32  40   nucleotide 4

where each of the numbers refers to the percentage weight (the percentage of the time that I can expect to see that nucleotide at that particular position).
So for Nucleotide 1, a simple regex would be atgc as everything is equally weighted.
Nucleotide 3 would be tgc as there is no weight for A.

I would like to write something that would pay attention to the weights at each position, not just the presence of nucleotides.
So nucleotide 3 would be t(90%)g(5%)c(5%) or whatever the correct regex pattern is.

Is this possible? Can anyone give me an example to send me on my way? I have looked in the Friedl book, but I didn't find anything terribly obvious...

Thanks for any help. Go raibh maith agat,

MadraghRua
yet another biologist hacking perl....

Replies are listed 'Best First'.
RE: weighting regex patterns
by KM (Priest) on Aug 18, 2000 at 23:04 UTC
    This isn't an answer to your question per se, but I just wanted to make sure you are aware of Bioperl, which may or may not be useful for this (or other things you are doing).

    As for your question, what data structure do you have your 'matrix' in? How is it being represented in your program? I guess I don't understand your problem. I would think a hash would help you with this (but I may be wrong).

    my %dna = (
        'nucleotide1' => {A => 25, T => 25, G => 25, C => 25},
        'nucleotide2' => {A => 10, T => 15, G => 50, C => 25},
        # etc...
    );

    Then you could do:

    $dna{nucleotide2}->{T};

    Which would give you the weight (15 in this case).
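
    A minimal sketch of loading the matrix from the original question into that kind of structure. It assumes an array of hashes (one hash per position) rather than named keys, so the positions stay in order; the name parse_matrix is hypothetical and not from any module.

```perl
use strict;
use warnings;

# Hypothetical helper: turn the text matrix from the question into an
# array of hashes, one hash per position, keyed by nucleotide letter.
sub parse_matrix {
    my ($text) = @_;
    my @lines = grep { /\S/ } split /\n/, $text;
    my @bases = split ' ', shift @lines;     # header row: A T G C
    my @matrix;
    for my $line (@lines) {
        my @fields = split ' ', $line;
        my %row;
        # Only the first columns are weights; the trailing
        # "nucleotide N" label is ignored.
        @row{@bases} = @fields[0 .. $#bases];
        push @matrix, \%row;
    }
    return @matrix;
}

my $text = join "\n",
    'A T G C',
    '25 25 25 25 nucleotide 1',
    '10 15 50 25 nucleotide 2',
    '0 90 5 5 nucleotide 3',
    '12 16 32 40 nucleotide 4';

my @matrix = parse_matrix($text);

# Weight of T at position 2 (index 1).
print $matrix[1]{T}, "\n";
```

    The lookup is then $matrix[$position]{$nucleotide}, which is the same idea as the hash of hashes above with a numeric index.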

    Cheers,
    KM

RE: weighting regex patterns
by merlyn (Sage) on Aug 19, 2000 at 02:15 UTC
Re: weighting regex patterns
by MadraghRua (Vicar) on Aug 19, 2000 at 01:19 UTC
    Thanks for the answers. Basically I'm casting for ideas before trying to do this and your replies have been very helpful. I do know about BioPerl and this morning have been looking at the Bio::Tools::SeqPattern module as a partial solution.

    I basically would like to use a set of weight matrices from the TransFac database, a database of transcription factors. Each TransFac matrix contains three pieces of information:
    the length of the promoter sequence that the matrix represents;
    the consensus sequence that a matrix represents, eg CGCGTNSANNACAGCGTTT;
    and the percentage distribution of nucleotides at each position in the sequence that the matrix represents (such as in my original email).

    I naively thought that a regular expression search could be structured to contain both the consensus sequence information and the frequency information at each position within the consensus sequence. I now realize that this was a bit silly.

    From your replies this would fall into two steps:
    1. scan the input DNA sequence for the pattern represented by a particular matrix - I may be able to use Bio::Tools::SeqPattern for at least part of this;
    2. calculate how similar the newly matched sequence I've just found is to the pattern in the weight matrix - is it a good match or a weak match?
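
    The two steps above might be sketched roughly as follows. This is only an illustration under stated assumptions: the matrix is the four-position example from the original question, not a real TransFac entry; the score is a plain average of the percentage weights, which is an arbitrary choice made here; and matrix_to_regex and score_site are hypothetical names.

```perl
use strict;
use warnings;

# Position weights from the original question (percentages).
my @matrix = (
    { A => 25, T => 25, G => 25, C => 25 },
    { A => 10, T => 15, G => 50, C => 25 },
    { A =>  0, T => 90, G =>  5, C =>  5 },
    { A => 12, T => 16, G => 32, C => 40 },
);

# Step 1: build a pattern that allows, at each position, every
# nucleotide with a non-zero weight.
sub matrix_to_regex {
    my @rows = @_;
    my $pattern = '';
    for my $row (@rows) {
        my @allowed = grep { $row->{$_} > 0 } sort keys %$row;
        $pattern .= '[' . join('', @allowed) . ']';
    }
    return qr/$pattern/;
}

# Step 2: score a matched site as the average weight of the
# nucleotides actually seen (an assumed, simplistic measure).
sub score_site {
    my ($site, @rows) = @_;
    my @letters = split //, $site;
    my $total = 0;
    $total += $rows[$_]{ $letters[$_] } for 0 .. $#letters;
    return $total / @letters;
}

my $sequence = 'GGTCATGC';
my $re       = matrix_to_regex(@matrix);
my @hits;

# The lookahead lets overlapping sites be reported as well.
while ($sequence =~ /(?=($re))/g) {
    push @hits, { start => $-[1], site => $1,
                  score => score_site($1, @matrix) };
}

printf "%d %s %.2f\n", @{$_}{qw(start site score)} for @hits;
```

    A cutoff on the score could then be used to separate good matches from weak ones; where to put that cutoff is a separate question.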

    I would still like to take the value of individual nucleotide frequencies at a particular matrix position into account when scanning my sequence for these promoter sites - perhaps this might decrease false positives during my search.

    Looking at the above answers, KM's approach of transferring a matching matrix's information into a hash and then getting the values from the matrix for each pattern matched nucleotide appears to be the simplest to do. I'll probably try this approach first anyway.

    Thanks for your help.

    MadraghRua
    yet another biologist hacking perl....

RE (tilly) 1: weighting regex patterns
by tilly (Archbishop) on Aug 18, 2000 at 23:25 UTC
    Try String::Approx should you not find what you want in anything specific to bioperl.

    But this problem does not sound to me like a regex problem per se. I think you should back off from the code and explain your actual problem. For instance I suspect that something like the following untested code could be useful even though it sounds nothing like what you have written:

    # Takes a string, returns hash of chars in string and
    # how many times each appears.
    sub char_count {
        my $str = shift;
        my %count;
        while ($str =~ /(.)/gs) {
            ++$count{$1};
        }
        return %count;
    }

Re: weighting regex patterns
by athomason (Curate) on Aug 18, 2000 at 23:17 UTC
    I've read your question four times now, and I still don't understand what you're trying to do. It almost sounds like you want to match approximately (even nondeterministically?), a purpose for which regexes aren't suited. I'm also unsure of the purpose of your sample regexes: /atgc/ will match only that exact sequence, not combinations of the member letters. What are you matching with those, nucleotides or sequences? I guess I'm just generally confused what the "weighting" will be applied to, and generally what your goal is.
Re: weighting regex patterns
by fundflow (Chaplain) on Aug 18, 2000 at 23:20 UTC
    Your question isn't clear but it seems like what you want is something like:
    while (<>) {
        /(\d+) (\d+) (\d+) (\d+).*/
            and $dna = ($1>0 ? "a" : "")
                     . ($2>0 ? "t" : "")
                     . ($3>0 ? "g" : "")
                     . ($4>0 ? "c" : "");
    }
    You can of course replace the >0 with another decision, such as >($1+$2+$3+$4)/10.
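
    One way the threshold variant could look, as a sketch only. It assumes the rows are in A T G C order as in the question, and wraps the result in a character class so it can be dropped into a regex; row_to_class is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one matrix row into a character class that
# keeps only the nucleotides whose weight exceeds $fraction of the row
# total. Returns nothing if the line does not parse or no letter passes.
sub row_to_class {
    my ($line, $fraction) = @_;
    my @weights = $line =~ /^\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/
        or return;
    my $total = 0;
    $total += $_ for @weights;
    my @letters = qw(a t g c);
    my @keep = grep { $weights[$_] > $total * $fraction } 0 .. 3;
    return unless @keep;
    return '[' . join('', @letters[@keep]) . ']';
}

my @rows = (
    '25 25 25 25 nucleotide 1',
    '10 15 50 25 nucleotide 2',
    '0 90 5 5 nucleotide 3',
    '12 16 32 40 nucleotide 4',
);

# Keep letters that make up more than 10% of each row.
my $pattern = join '', map { row_to_class($_, 0.1) } @rows;
print "$pattern\n";
```

    Raising the fraction makes the pattern stricter, at the cost of missing sites that use the rarer nucleotides.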
RE: weighting regex patterns
by jlistf (Monk) on Aug 18, 2000 at 23:09 UTC
    forgive me if i don't understand... but you want to search through a string of DNA/RNA checking whether the ratio of nucleotides fits a specific ratio? i don't think a regular expression would be the best way to do this. it seems like you'd have to count up the number of each nucleotide in the string and compare the total numbers and see whether they're "close enough" to the ratio (which is a matter of statistical analysis which i won't talk about here). is that what you're asking for?

    jeff