Nora has asked for the wisdom of the Perl Monks concerning the following question:

Hi. I am new to Perl and trying to solve the following problem. I have a set of sequences/strings of equal length, like the following:

Sequence1: ACCGT
Sequence2: AGCCG
Sequence3: CATTC
Sequence4: GTAAA

Now I want to create a matrix that counts the frequency of each letter in each column. For example, in column 1 the frequency of A is 0.5, C is 0.25, G is 0.25, and T is 0. Basically, I want to print the frequency of the letters A, C, G, T in each column. Thanks in advance for your help.

Replies are listed 'Best First'.
Re: Creating Frequency Matrix
by dHarry (Abbot) on Mar 25, 2009 at 10:43 UTC

    Welcome to the Monastery! It seems you're doing DNA stuff? You'd better take a look at CPAN, since there are various BIO modules around to ease your life. A few suggestions: Bio::Matrix::PSM::SiteMatrix to work with position scoring/weight matrices, Bio::Align::DNAStatistics to calculate some statistics for a DNA alignment, and Bio::Tools::DNAGen for generating DNA sequences. There are many more. Even if a module does not do exactly what you want, it can be a useful starting point, i.e. look at how somebody else did it and learn from it ;)

    HTH
    dHarry

Re: Creating Frequency Matrix
by ELISHEVA (Prior) on Mar 25, 2009 at 11:39 UTC

    The first step in solving any programming problem has nothing to do with Perl (or any other language). From your description, it sounds like you might first need to think through the individual steps in more detail (or at least share that thinking with us). Those steps are:

    1. Gather the information needed to print out a matrix.
    2. Print out the matrix

    These are in fact two very different problems, but in both cases you get clarity on the details by working backwards from the goal.

    In the second part you are presenting data. To work this out, start with an illustration (on paper) of what you want your report to look like. Even if your goal is a matrix, there are obviously lots of different ways to draw a matrix. To plan out this step you'll need to think about whether each base is a row or a column; what you want for the column and row labels; the width of each column; whether or not you want lines dividing the columns, and so on.

    When you get ready to code, you may find perlform helpful. That document is a little confusing, so you might want to read through this plain-English introduction first: Perl format Primer. If your matrix layout is pretty simple and you are comfortable with sprintf and foreach loops, then you could just roll your own matrix layout instead. But I think it would be worth the time to at least get acquainted with Perl's built-in formatting tools.
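
    As an illustration of the roll-your-own option, here is a minimal sketch using sprintf and foreach-style loops. The layout (one row per base, one column per position), the column widths, and the sub name format_matrix are assumptions made for the example; the frequencies were worked out by hand from the four sequences in the question.

```perl
use strict;
use warnings;

# Frequencies per column, computed by hand from the question's sequences
# (ACCGT, AGCCG, CATTC, GTAAA). Hard-coded here only to show the layout step.
my %freq = (
    A => [ 0.50, 0.25, 0.25, 0.25, 0.25 ],
    C => [ 0.25, 0.25, 0.50, 0.25, 0.25 ],
    G => [ 0.25, 0.25, 0.00, 0.25, 0.25 ],
    T => [ 0.00, 0.25, 0.25, 0.25, 0.25 ],
);

# format_matrix is a hypothetical helper: it returns the report as a list
# of lines, one row per base and one fixed-width column per position.
sub format_matrix {
    my ( $freq, @bases ) = @_;
    my $width = @{ $freq->{ $bases[0] } };

    # Header row: a label column followed by the position numbers.
    my $header = sprintf '%-4s', 'Base';
    $header .= sprintf '%6d', $_ for 1 .. $width;

    my @lines = ($header);
    for my $base (@bases) {
        my $row = sprintf '%-4s', $base;
        $row .= sprintf '%6.2f', $_ for @{ $freq->{$base} };
        push @lines, $row;
    }
    return @lines;
}

print "$_\n" for format_matrix( \%freq, qw(A C G T) );
```

    Returning the lines instead of printing inside the sub keeps the layout decisions in one place, so it is easy to swap in perlform later.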

    The goal for the first part is gathering the information you'll need to display that matrix. So the first step in planning that phase is to list out all of the information that you will need. I see three bits of information:

    1. the number of occurrences of base A at position N. You want to express the frequency as a percentage, i.e. a count of occurrences as a percentage of some base count.
    2. the total number against which the percentage will be calculated. I would imagine that this is either the total number of occurrences of A, or the total number of items in position N, or even the total number of bases in all sequences - which exactly did you want?
    3. the length of the longest sequence - you'll need to know the number of rows and columns to print out a matrix report. The number of bases is always 4. That leaves the number of positions still unknown - unless of course you keep track of the longest sequence.

    For this part of the problem, you'll need to get familiar with arrays, hashes, and references. I would recommend storing your counts in a hash of arrays (HoA) rather than an array of arrays. For more information see perldata, perlref, and perldsc (Perl data structures).

    If your input file is very large, you may also want to learn something about hashes tied to random access files, but I'd leave that issue off the table unless you find your program failing due to memory constraints. If you do get to that point, search CPAN for "Tie::Hash" to see the various alternatives.

    Putting all this together (especially HoA) can be a bit confusing so I've included a bit of sample code that takes the base count at position N as a percentage of all bases found at position N. Please feel free to ask questions about any part you don't understand.
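
    A minimal sketch of that calculation follows. It is an illustration written for this thread, not the sample code referred to above, and the sub names count_bases and percentages are hypothetical. It assumes the sequences contain only letters and that the `//` operator is available (Perl 5.10 or later).

```perl
use strict;
use warnings;

# Count how often each base occurs at each position.
# Returns a reference to a hash of arrays (HoA):
# $counts->{A}[0] is the number of A's in column 1.
sub count_bases {
    my @sequences = @_;
    my %counts = map { $_ => [] } qw(A C G T);
    for my $seq (@sequences) {
        my @letters = split //, uc $seq;
        for my $pos ( 0 .. $#letters ) {
            $counts{ $letters[$pos] }[$pos]++;
        }
    }
    return \%counts;
}

# Express each count as a percentage of all bases found at the same position.
sub percentages {
    my ($counts) = @_;

    # Total number of bases seen in each column, summed over all bases.
    my @totals;
    for my $base ( keys %$counts ) {
        my $col = $counts->{$base};
        $totals[$_] += $col->[$_] // 0 for 0 .. $#$col;
    }

    my %pct;
    for my $base ( keys %$counts ) {
        $pct{$base} = [
            map { 100 * ( $counts->{$base}[$_] // 0 ) / $totals[$_] }
                0 .. $#totals
        ];
    }
    return \%pct;
}

my @sequences = qw(ACCGT AGCCG CATTC GTAAA);
my $pct = percentages( count_bases(@sequences) );
for my $base ( sort keys %$pct ) {
    printf "%s: %s\n", $base,
        join ' ', map { sprintf '%5.1f', $_ } @{ $pct->{$base} };
}
```

    Because the totals are kept per position, sequences of unequal length would also work: each column is divided by the number of sequences that actually reach it.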

    Best, beth

Re: Creating Frequency Matrix
by moritz (Cardinal) on Mar 25, 2009 at 07:19 UTC

    What have you tried so far?

    For starters, perlfaq4 and perlfaq5 contain examples of how to count substrings/characters, and perllol introduces you to arrays of arrays, which you'll probably need to create a matrix.
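
    To make that concrete, here is a rough sketch that combines a character-counting idiom (a global match forced into list context) with an array of arrays. The sub name frequency_matrix and the choice of one row per position are assumptions for the example; it also assumes a non-empty list of equal-length sequences, as in the question.

```perl
use strict;
use warnings;

my @bases = qw(A C G T);

# Returns an array of arrays: $matrix[$pos][$i] is the frequency of
# $bases[$i] in column $pos + 1.
sub frequency_matrix {
    my @sequences = @_;
    my $length    = length $sequences[0];    # equal lengths assumed
    my @matrix;
    for my $pos ( 0 .. $length - 1 ) {

        # Collect the letters of this column into one string.
        my $column = join '', map { substr( $_, $pos, 1 ) } @sequences;

        for my $i ( 0 .. $#bases ) {
            my $base = $bases[$i];

            # Count the matches by assigning the list of matches to ().
            my $count = () = $column =~ /\Q$base\E/g;
            $matrix[$pos][$i] = $count / @sequences;
        }
    }
    return @matrix;
}

my @matrix = frequency_matrix(qw(ACCGT AGCCG CATTC GTAAA));
print join( "\t", 'Pos', @bases ), "\n";
for my $pos ( 0 .. $#matrix ) {
    print join( "\t", $pos + 1, @{ $matrix[$pos] } ), "\n";
}
```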