Thanks for the answers. I'm casting about for ideas before trying this, and your replies have been very helpful. I do know about BioPerl, and this morning I have been looking at the Bio::Tools::SeqPattern module as a partial solution.

I would like to use a set of weight matrices from the TransFac database, a database of transcription factors. Each TransFac matrix contains three pieces of information:
the length of the promoter sequence that the matrix represents;
the consensus sequence that the matrix represents, e.g. CGCGTNSANNACAGCGTTT;
and the percentage distribution of nucleotides at each position in that sequence (as in my original email).

I naively thought that a single regular expression could be structured to contain both the consensus sequence and the frequency information at each position within it. I now realize that this was a bit silly.

From your replies this would fall into two steps:
1. scan the input DNA sequence for the pattern represented by a particular matrix - I may be able to use Bio::Tools::SeqPattern for at least part of this;
2. calculate how similar the newly matched sequence I've just found is to the pattern in the weight matrix - is it a good match or a weak match?
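
For step 1, here is a rough sketch in plain Perl of how a consensus string could be turned into a regex and used to scan a sequence. It does not use Bio::Tools::SeqPattern; the sub names consensus_to_regex and find_sites and the test sequence are made up for illustration, and the table is the standard IUPAC nucleotide ambiguity code.

```perl
use strict;
use warnings;

# Standard IUPAC nucleotide ambiguity codes as regex character classes.
my %IUPAC = (
    A => 'A', C => 'C', G => 'G', T => 'T',
    R => '[AG]', Y => '[CT]', S => '[CG]', W => '[AT]',
    K => '[GT]', M => '[AC]',
    B => '[CGT]', D => '[AGT]', H => '[ACT]', V => '[ACG]',
    N => '[ACGT]',
);

# Hypothetical helper: compile a consensus such as CGCGTNSANNACAGCGTTT.
sub consensus_to_regex {
    my ($consensus) = @_;
    my $pattern = '';
    for my $code (split //, uc $consensus) {
        die "Unknown IUPAC code '$code'\n" unless exists $IUPAC{$code};
        $pattern .= $IUPAC{$code};
    }
    return qr/$pattern/;
}

# Hypothetical helper: return [offset, matched text] for every hit.
# The lookahead lets overlapping sites be reported as well.
sub find_sites {
    my ($dna, $consensus) = @_;
    my $re = consensus_to_regex($consensus);
    $dna = uc $dna;
    my @hits;
    while ($dna =~ /(?=($re))/g) {
        push @hits, [ $-[1], $1 ];    # $-[1] is the 0-based start of the capture
    }
    return @hits;
}

my $dna = 'AAAACGCGTACAGGACAGCGTTTGG';    # made-up test sequence
for my $hit (find_sites($dna, 'CGCGTNSANNACAGCGTTT')) {
    printf "match at offset %d: %s\n", @$hit;
}
```

Each hit keeps the matched text, so it can be handed straight to whatever does step 2.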

I would still like to take the value of individual nucleotide frequencies at a particular matrix position into account when scanning my sequence for these promoter sites - perhaps this might decrease false positives during my search.
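
One possible way to let the frequencies influence the scan itself is to build the regex from the matrix rather than from the consensus, allowing at each position only the bases that reach some cutoff percentage. This is only a sketch: matrix_to_regex, the cutoff of 20 and the four-column matrix are all invented for illustration and are not taken from TransFac.

```perl
use strict;
use warnings;

# Made-up matrix: one hash of percentages per position.
my @freq = (
    { A => 10, C => 70, G =>  10, T => 10 },
    { A =>  0, C =>  0, G => 100, T =>  0 },
    { A => 25, C => 25, G =>  25, T => 25 },
    { A => 80, C =>  5, G =>   5, T => 10 },
);

# Hypothetical helper: at each position keep only the bases whose
# percentage is at least $min_percent, and join them into a regex.
sub matrix_to_regex {
    my ($matrix, $min_percent) = @_;
    my $pattern = '';
    for my $column (@$matrix) {
        my @allowed = grep { $column->{$_} >= $min_percent } sort keys %$column;
        die "No base reaches the cutoff of $min_percent percent\n" unless @allowed;
        $pattern .= @allowed == 1 ? $allowed[0] : '[' . join('', @allowed) . ']';
    }
    return qr/$pattern/;
}

my $strict_re = matrix_to_regex(\@freq, 20);
print "site found\n" if 'TTCGTATT' =~ $strict_re;
```

Raising the cutoff makes the pattern stricter, which should reduce false positives at the cost of missing weaker sites; where to set it is a judgment call.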

Looking at the answers above, KM's approach looks the simplest: load the matrix into a hash, then look up the matrix value for each nucleotide in a pattern-matched sequence. I'll probably try this approach first.
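
My reading of the hash approach, as a minimal sketch: store one hash of percentages per matrix position, look up each nucleotide of a matched site, and divide the total by the best total the matrix allows. The scoring formula is my own simple choice, not necessarily what KM or TransFac intend, and score_site and the numbers in the matrix are invented for illustration.

```perl
use strict;
use warnings;

# Made-up matrix: one hash of percentages per position.
my @matrix = (
    { A => 10, C => 70, G =>  10, T => 10 },
    { A =>  0, C =>  0, G => 100, T =>  0 },
    { A => 25, C => 25, G =>  25, T => 25 },
    { A => 80, C =>  5, G =>   5, T => 10 },
);

# Hypothetical helper: score a matched site between 0 and 1, where 1
# means every position carries the most frequent base in the matrix.
sub score_site {
    my ($site, $matrix) = @_;
    my @bases = split //, uc $site;
    die "Site length does not match matrix length\n" unless @bases == @$matrix;
    my ($observed, $best) = (0, 0);
    for my $i (0 .. $#bases) {
        my $column = $matrix->[$i];
        my ($max) = sort { $b <=> $a } values %$column;    # highest percentage
        $observed += $column->{ $bases[$i] } || 0;         # 0 for unknown bases
        $best     += $max;
    }
    return $best ? $observed / $best : 0;
}

printf "%s scores %.2f\n", $_, score_site($_, \@matrix) for qw(CGAA AGTT);
```

A cutoff on this score would then separate good matches from weak ones.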

Thanks for your help.

MadraghRua
yet another biologist hacking perl....


In reply to Re: weighting regex patterns by MadraghRua
in thread weighting regex patterns by MadraghRua
