The first step in solving any programming problem has nothing to do with Perl (or any other language). From your description, it sounds like you might need to first think through the individual steps in more detail (or at least share them with us). Those steps are:
- Gather the information needed to print out a matrix.
- Print out the matrix.
These are in fact two very different problems, but in both cases you get clarity on the details by working backwards from the goal.
In the second part you are presenting data. To work this out, start with an illustration (on paper) of what you want your report to look like. Even if your goal is a matrix, there are obviously lots of different ways to draw a matrix. To plan out this step you'll need to think about whether each base is a row or a column; what you want for the column and row labels; the width of each column; whether or not you want lines dividing the columns; and so on.
When you get ready to code, you may find perlform helpful. That document is a little confusing, so you might want to read through this plain English introduction first: Perl format Primer. If your matrix layout is pretty simple and you are comfortable with sprintf and foreach loops, then you could just roll your own matrix layout as well. But I think it would be worth the time to at least get acquainted with Perl's built-in formatting tools.
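As a rough illustration of the roll-your-own route, here is a minimal sketch using sprintf and foreach. It assumes one row per base, one column per position, and a fixed column width of 7; the `%percent` data and the `format_matrix` name are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical data: percentage of each base at positions 1..4.
my %percent = (
    A => [ 50.0, 25.0,  0.0, 100.0 ],
    C => [ 25.0, 25.0, 50.0,   0.0 ],
    G => [  0.0, 50.0, 25.0,   0.0 ],
    T => [ 25.0,  0.0, 25.0,   0.0 ],
);

# Build the report as a string: one row per base, one column per position.
sub format_matrix {
    my ($data, $positions) = @_;
    my $out = sprintf '%-4s', 'base';
    $out .= sprintf '%7d', $_ for 1 .. $positions;    # column labels
    $out .= "\n";
    foreach my $base (qw(A C G T)) {
        $out .= sprintf '%-4s', $base;                # row label
        $out .= sprintf '%7.1f', $data->{$base}[$_] for 0 .. $positions - 1;
        $out .= "\n";
    }
    return $out;
}

print format_matrix(\%percent, 4);
```

Swapping rows and columns, or changing the column width, only means changing the loops and the format strings, which is why it helps to settle the layout on paper first.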
The goal for the first part is gathering the information you'll need to display that matrix. So the first step in planning that phase is to list out all of the information that you will need. I see three bits of information:
- the number of occurrences of base A at position N. You want to express the frequency as a percentage, i.e. a count of occurrences as a percentage of some reference total.
- the total number against which the % will be calculated. I would imagine that this is either the total number of occurrences of A, the total number of items in position N, or even the total number of bases in all sequences. Which exactly did you want?
- the length of the longest sequence - you'll need to know the number of rows and columns to print out a matrix report. The number of bases is always 4. That leaves the number of positions still unknown - unless of course you keep track of the longest sequence.
For this part of the problem, you'll need to get familiar with arrays, hashes, and references. I would recommend storing your counts in a hash of arrays (HoA) rather than an array of arrays. For more information see perldata, perlref, and perldsc (Perl data structures).
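To make the hash-of-arrays idea concrete, here is a minimal sketch that counts bases by position and tracks the longest sequence along the way. It assumes the sequences are already in memory as plain strings and that anything other than A, C, G, or T should be skipped; `count_bases` is a name invented for the example.

```perl
use strict;
use warnings;

# Count bases by position into a hash of arrays: $count{$base}[$position].
# Returns a reference to the hash and the length of the longest sequence.
sub count_bases {
    my @sequences = @_;
    my %count = map { $_ => [] } qw(A C G T);
    my $longest = 0;
    foreach my $seq (@sequences) {
        my @bases = split //, uc $seq;
        $longest = @bases if @bases > $longest;
        foreach my $pos (0 .. $#bases) {
            next unless exists $count{ $bases[$pos] };   # skip N, gaps, etc.
            $count{ $bases[$pos] }[$pos]++;
        }
    }
    return (\%count, $longest);
}

my ($count, $longest) = count_bases(qw(ACGT AAGT ACG));
printf "A at position 1: %d\n", $count->{A}[0] // 0;
print  "longest sequence: $longest\n";
```

Positions where a base never occurs are left undefined in its array, so treat them as 0 (for example with `// 0`) when you read the counts back.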
If your input file is very large, you may eventually need to "tie" the hash to a random access file, but I'd leave that issue off the table unless you find your program failing due to memory constraints. To see the various alternatives, search CPAN for "Tie::Hash".
Putting all this together (especially HoA) can be a bit confusing, so I've included a bit of sample code that takes the base count at position N as a percentage of all bases found at position N. Please feel free to ask questions about any part you don't understand.
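A minimal sketch of that calculation, which may differ in detail from the sample mentioned above: it assumes the counts are already stored as `$count->{$base}[$pos]`, and the data and the `position_percentages` name are illustrative only.

```perl
use strict;
use warnings;

# Convert counts ($count->{$base}[$pos]) into percentages of all bases
# seen at each position. A position with no bases yields 0 for every base.
sub position_percentages {
    my ($count, $positions) = @_;
    my %percent;
    foreach my $pos (0 .. $positions - 1) {
        my $total = 0;
        $total += $count->{$_}[$pos] // 0 for qw(A C G T);
        foreach my $base (qw(A C G T)) {
            my $n = $count->{$base}[$pos] // 0;
            $percent{$base}[$pos] = $total ? 100 * $n / $total : 0;
        }
    }
    return \%percent;
}

# Hypothetical counts for three sequences of up to four positions.
my %counts = (
    A => [ 3, 1 ],
    C => [ undef, 2 ],
    G => [ undef, undef, 3 ],
    T => [ undef, undef, undef, 2 ],
);

my $pct = position_percentages(\%counts, 4);
printf "A at position 2: %.1f%%\n", $pct->{A}[1];
```

If you decide on a different denominator (all occurrences of the base, or all bases in all sequences), only the `$total` computation needs to change.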
Best, beth