in reply to Creating Frequency Matrix
The first step in solving any programming problem has nothing to do with Perl (or any other language). From your description, it sounds like you might need to first think through the individual steps in more detail (or at least share them with us). Those steps are:
These are in fact two very different problems, but in both cases you get clarity on the details by working backwards from the goal.
In the second part you are presenting data. To work this out, start with an illustration (on paper) of what you want your report to look like. Even if your goal is a matrix, there are obviously lots of different ways to draw a matrix. To plan out this step you'll need to think about whether each base is a row or a column; what you want for the column and row labels; the width of each column; whether or not you want lines dividing the columns, and so on.
When you get ready to code, you may find perlform helpful. That document is a little confusing, so you might want to read through this plain-English introduction first: Perl format Primer. If your matrix layout is pretty simple and you are comfortable with sprintf and foreach loops, then you could just roll your own matrix layout as well. But I think it would be worth the time to at least get acquainted with Perl's built-in formatting tools.
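To give a feel for what perlform offers, here is a minimal sketch of a format-based report. The data in %hPct is made up for illustration, and the three-column layout is just one assumption about what the matrix might look like; it is not meant as the one right way to draw it.

```perl
use strict;
use warnings;

# Hypothetical sample values: fraction of each base at positions 0..2
my %hPct = (
    A => [0.25, 0.00, 0.25],
    C => [0.50, 0.25, 0.50],
);

# variables named in a format must be declared before the format
our ($sBase, @aRow);

# header: printed automatically before the first row
format STDOUT_TOP =
Base |     0 |     1 |     2 |
---- | ----- | ----- | ----- |
.

# one row: @>>> is a right-justified text field,
# @#.## is a numeric field with two decimal places
format STDOUT =
@>>> | @#.## | @#.## | @#.## |
$sBase, $aRow[0], $aRow[1], $aRow[2]
.

foreach my $sKey (sort keys %hPct) {
    $sBase = $sKey;
    @aRow  = @{ $hPct{$sKey} };
    write;    # fills in the STDOUT format and prints it
}
```

The trade-off is that a format has a fixed number of columns, so if your sequences vary in length the sprintf-and-loop approach may turn out to be easier.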
The goal for the first part is gathering the information you'll need to display that matrix. So the first step in planning that phase is to list out all of the information that you will need. I see three bits of information:
For this part of the problem, you'll need to get familiar with arrays, hashes, and references. I would recommend storing your counts in a hash of arrays (HoA) rather than an array of arrays. For more information see perldata, perlref, and perldsc (Perl data structures).
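If hashes of arrays are new to you, this small sketch shows the basic moves: counting into one, pulling an array reference back out, and coping with positions that were never counted. The variable names are only illustrative.

```perl
use strict;
use warnings;

# hash of arrays: one key per base, one array slot per position
my %hCount;

# Perl creates the inner array for you (autovivification)
# the first time you touch $hCount{$sBase}[$iPos]
$hCount{A}[0]++;    # an A at position 0
$hCount{A}[0]++;    # another A at position 0
$hCount{C}[2]++;    # a C at position 2

# $hCount{A} is an array reference; @{ ... } gets at the array itself
my $aCountsForA = $hCount{A};
my $iSlotsForA  = scalar @{$aCountsForA};

# positions that were never counted are undef, so default them to 0
my $iCountC0 = defined($hCount{C}[0]) ? $hCount{C}[0] : 0;

print "A at position 0: $hCount{A}[0]\n";
print "C at position 0: $iCountC0\n";
```

The advantage over an array of arrays is that you look things up by the base's name rather than having to remember that, say, index 1 means C.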
If your input file is very large, you may eventually need to "tie" the hash to a random access file so that the counts live on disk rather than in memory. I'd leave that issue off the table unless you find your program failing due to memory constraints. If it does come to that, search CPAN for "Tie::Hash" to see the various alternatives.
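Just so you can see what tying looks like, here is a rough sketch using SDBM_File, which ships with Perl. Picking SDBM_File is my assumption for the example, not a recommendation over the other modules; the file name is hypothetical, and the "A:0" style key is one simple way around the fact that DBM files store plain strings rather than array references.

```perl
use strict;
use warnings;
use Fcntl;                     # O_RDWR, O_CREAT
use SDBM_File;                 # ships with Perl
use File::Temp qw(tempdir);

# hypothetical location for the on-disk counts
my $sDir  = tempdir(CLEANUP => 1);
my $sFile = "$sDir/base_counts";

tie(my %hCountOnDisk, 'SDBM_File', $sFile, O_RDWR | O_CREAT, 0666)
    or die "Cannot tie hash to $sFile: $!";

# DBM files hold strings, not references, so flatten the
# base/position pair into a single key such as "A:0"
foreach my $sSequence (qw(ACCGT AGCCG)) {
    my @aBases = split(//, $sSequence);
    for my $iPos (0 .. $#aBases) {
        $hCountOnDisk{"$aBases[$iPos]:$iPos"}++;
    }
}

# read a value back before letting go of the file
my $iCountA0 = $hCountOnDisk{'A:0'};
print "A at position 0: $iCountA0\n";

untie %hCountOnDisk;
```

Once the hash is tied, the counting code looks almost the same as the in-memory version; only the key layout changes.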
Putting all this together (especially the HoA) can be a bit confusing, so I've included some sample code that reports the count of each base at position N as a fraction of all bases found at position N (multiply by 100 if you want a true percentage). Please feel free to ask questions about any part you don't understand.
# always start your program with these two lines.
# They do a lot of your error checking for you and are
# more accurate than your eyeballs.
use strict;
use warnings;

#-----------------------------------------
# store counts in a hash of array references.
#  - There is one hash key for each DNA base
#  - The value assigned to each hash key is an array
#    reference. Its Nth element stores the number of times
#    that base appears at the Nth position.
#    (N=0 is first position)
#-----------------------------------------

my %hFrequency;
my @aTotalAtPos;
my $iMaxSequenceLength = 0;
my $sBase;
my $iPos;

# <DATA> reads in one line of data from the stream DATA.
# see below at __DATA__ for the actual data
while (my $sSequence = <DATA>) {

  # remove end of record marker, i.e. newline, from sequence
  chomp $sSequence;

  # regex // can be used to split a string into characters
  my @aBases = split(//, $sSequence);

  # keep track of maximum sequence length: we'll need it
  # later to print out the matrix
  my $iSequenceLength = scalar(@aBases);
  if ($iSequenceLength > $iMaxSequenceLength) {
    $iMaxSequenceLength = $iSequenceLength;
  }

  # up the count for each base/position found
  for ($iPos = 0; $iPos <= $#aBases; $iPos++) {
    $sBase = $aBases[$iPos];    # current char
    $hFrequency{$sBase}[$iPos]++;
    $aTotalAtPos[$iPos]++;
  }
}

#-----------------------------------------
# print the matrix
# done with sprintf, but you might prefer
# to use standard Perl formatting.
#-----------------------------------------

# use a constant so we make sure we have the
# same format for the base column each time
# we print out a row
my $BASE_FORMAT = "%4s |";

# use a constant so we make sure we have the
# same width each time we print out a row
my $POS_WIDTH = 5;

# print out header
print sprintf($BASE_FORMAT, 'Base');    # label row
for ($iPos = 0; $iPos < $iMaxSequenceLength; $iPos++) {
  print sprintf("%${POS_WIDTH}d |", $iPos);
}
print "\n";    # end row

# print out divider bar below header
print '---- |';    # label row
for ($iPos = 0; $iPos < $iMaxSequenceLength; $iPos++) {
  print (('-' x $POS_WIDTH) . ' |');
}
print "\n";    # end row

# print out one row for each base
foreach $sBase (sort keys %hFrequency) {
  my $aCounts = $hFrequency{$sBase};

  print sprintf($BASE_FORMAT, $sBase);    # label row

  # $aCounts is an array reference
  # $aCounts->[$iPos] fetches one element from it
  for ($iPos = 0; $iPos < $iMaxSequenceLength; $iPos++) {
    my $iCount = $aCounts->[$iPos];
    my $iTotal = $aTotalAtPos[$iPos];

    # a base never seen at this position has no count yet
    $iCount = 0 unless defined($iCount);

    # fraction of all bases at this position (0 to 1)
    my $dPct = $iTotal ? $iCount/$iTotal : 0;
    print sprintf("%-$POS_WIDTH.2f |", $dPct);
  }
  print "\n";    # end row
}

# This is a quick way to put in some test data
# To read it in you use DATA as a file handle
# see above for an example
__DATA__
ACCGT
AGCCG
CATTC
GTAAA
Best, beth