Re: weighting regex patterns

Thanks for the answers. I'm casting around for ideas before trying to do this, and your replies have been very helpful. I do know about BioPerl, and this morning I have been looking at the Bio::Tools::SeqPattern module as a partial solution.

I would like to use a set of weight matrices from the TransFac database, a database of transcription factors. Each TransFac matrix contains three pieces of information:
the length of the promoter sequence that the matrix represents;
the consensus sequence that the matrix represents, e.g. CGCGTNSANNACAGCGTTT;
and the percentage distribution of nucleotides at each position in the sequence that the matrix represents (such as in my original email).
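For anyone following along, here is one way those three pieces might be held in a Perl structure. This is only a sketch: the matrix name, the short consensus and all the percentages are invented for illustration and are not TransFac data, and the key names (name, consensus, freq, length) are my own assumptions.

```perl
use strict;
use warnings;

# Hypothetical matrix: name and numbers are made up, NOT taken from TransFac.
my %matrix = (
    name      => 'EXAMPLE_01',
    consensus => 'ACNT',
    # One hash per position: percentage of A, C, G and T seen at that position.
    freq      => [
        { A => 80, C => 10, G =>  5, T =>  5 },
        { A =>  5, C => 85, G =>  5, T =>  5 },
        { A => 25, C => 25, G => 25, T => 25 },
        { A =>  0, C => 10, G => 10, T => 80 },
    ],
);

# The length can be derived from the consensus rather than stored by hand.
$matrix{length} = length $matrix{consensus};

# Sanity check: there should be one frequency column per consensus position.
die "matrix is inconsistent\n" unless @{ $matrix{freq} } == $matrix{length};

printf "%s: %d positions, consensus %s\n", @matrix{qw(name length consensus)};
```

Keeping the frequencies as an array of hashes means position $i and nucleotide $base can be looked up as $matrix{freq}[$i]{$base}.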

I naively thought that a regular expression search could be structured to contain both the consensus sequence information and the frequency information at each position within the consensus sequence. I now realize that this was a bit silly.

From your replies this would fall into two steps:
1. scan the input DNA sequence for the pattern represented by a particular matrix - I may be able to use Bio::Tools::SeqPattern for at least part of this;
2. calculate how similar each matched sequence is to the pattern in the weight matrix - is it a good match or a weak match?
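A rough sketch of step 1 in plain Perl, without Bio::Tools::SeqPattern: expand the IUPAC ambiguity codes in the consensus into character classes and scan with the resulting regex. The sub names (consensus_to_regex, find_sites) and the test DNA string are hypothetical; the IUPAC table is the standard one. The // operator needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Standard IUPAC nucleotide codes mapped to regex character classes.
my %iupac = (
    A => 'A', C => 'C', G => 'G', T => 'T',
    R => '[AG]',  Y => '[CT]',  S => '[CG]',  W => '[AT]',
    K => '[GT]',  M => '[AC]',
    B => '[CGT]', D => '[AGT]', H => '[ACT]', V => '[ACG]',
    N => '[ACGT]',
);

# Turn a consensus string such as CGCGTNSANNACAGCGTTT into a compiled regex.
sub consensus_to_regex {
    my ($consensus) = @_;
    my $pattern = '';
    for my $code (split //, uc $consensus) {
        my $class = $iupac{$code} // die "unknown IUPAC code '$code'\n";
        $pattern .= $class;
    }
    return qr/$pattern/;
}

# Return [start, matched_text] for every hit (0-based start offsets).
sub find_sites {
    my ($dna, $consensus) = @_;
    my $re = consensus_to_regex($consensus);
    my @hits;
    # The lookahead matches without consuming, so overlapping sites are found.
    while ($dna =~ /(?=($re))/g) {
        push @hits, [ pos($dna), $1 ];
    }
    return @hits;
}

# Made-up test sequence with one site embedded at offset 3.
my $dna = 'AATCGCGTAGACTACAGCGTTTGG';
for my $hit (find_sites($dna, 'CGCGTNSANNACAGCGTTT')) {
    print "hit at $hit->[0]: $hit->[1]\n";
}
```

This only scans the strand given; the reverse complement would need a second pass. Each hit can then be handed to whatever scoring is used for step 2.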

I would still like to take the individual nucleotide frequencies at each matrix position into account when scanning my sequence for these promoter sites - perhaps this would decrease false positives during my search.

Looking at the above answers, KM's approach of loading a matching matrix's information into a hash and then looking up the matrix value for each nucleotide in the matched sequence appears to be the simplest to do. I'll probably try this approach first anyway.
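Here is my guess at what that hash lookup might look like. It is a sketch of one simple scoring scheme, not necessarily what KM had in mind and not the scoring any TransFac tool uses: sum the percentage for the nucleotide seen at each position and divide by the best possible sum, so a perfect match scores 1. The percentages and the sub name score_site are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical percentages, made up for illustration (not TransFac data).
# One hash per matrix position, keyed by nucleotide.
my @freq = (
    { A => 80, C => 10, G =>  5, T =>  5 },
    { A =>  5, C => 85, G =>  5, T =>  5 },
    { A => 25, C => 25, G => 25, T => 25 },
    { A =>  0, C => 10, G => 10, T => 80 },
);

# Score a site as (sum of percentages for its nucleotides) divided by
# (sum of the highest percentage in each column).
sub score_site {
    my ($site, $freq) = @_;
    die "site and matrix differ in length\n" unless length($site) == @$freq;
    my ($sum, $best) = (0, 0);
    my @bases = split //, uc $site;
    for my $i (0 .. $#bases) {
        my $column = $freq->[$i];
        # Nucleotides missing from the column (e.g. N) count as 0.
        $sum += $column->{ $bases[$i] } // 0;
        my ($max) = sort { $b <=> $a } values %$column;
        $best += $max;
    }
    return $sum / $best;
}

printf "%s %.2f\n", $_, score_site($_, \@freq) for qw(ACGT AGGT TCGA);
```

With these made-up numbers ACGT scores 1.00, AGGT 0.70 and TCGA 0.43, so a cutoff on the score would separate good matches from weak ones. Where to put the cutoff is something I would have to find by experiment.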

Thanks for your help.

MadraghRua
yet another biologist hacking perl....
