Re: Creating Frequency Matrix

The first step in solving any programming problem has nothing to do with Perl (or any other language). From your description, it sounds like you first need to think through the individual steps in more detail (or at least share them with us). Those steps are:

Gather the information needed to print out a matrix.
Print out the matrix.

These are in fact two very different problems, but in both cases you get clarity on the details by working backwards from the goal.

In the second part you are presenting data. To work this out, start with an illustration (on paper) of what you want your report to look like. Even if your goal is a matrix, there are obviously lots of different ways to draw a matrix. To plan out this step you'll need to think about whether each base is a row or a column; what you want for the column and row labels; the width of each column; whether or not you want lines dividing the columns, and so on.

When you get ready to code, you may find perlform helpful. That document is a little confusing, so you might want to read through this plain English introduction first: Perl format Primer. If your matrix layout is pretty simple and you are comfortable with sprintf and foreach loops, then you could just roll your own matrix layout as well. But I think it would be worth the time to at least get acquainted with Perl's built-in formatting tools.
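As a rough illustration of the built-in formatting tools, here is a minimal sketch that uses formline with a picture line (the same field syntax that perlform describes). The helper name format_row and the column widths are assumptions made for this example, not part of the sample program further down.

```perl
use strict;
use warnings;

# Hypothetical helper: format one matrix row from a picture line.
# '@>>>' is a 4-character right-justified text field;
# '@.##' is a numeric field with two decimal places.
sub format_row {
    my ($label, @values) = @_;

    # one label field, then one numeric field per position
    my $picture = '@>>> |' . (' @.## |' x scalar(@values)) . "\n";

    $^A = '';                             # clear the format accumulator
    formline($picture, $label, @values);  # formline appends to $^A
    my $row = $^A;
    $^A = '';
    return $row;
}

print format_row('A', 0.5, 0.25, 0, 0.25, 0.25);
```

Building the picture line at run time is one way to cope with a number of columns that is not known until the data has been read, which a fixed format declaration cannot do as easily.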

The goal for the first part is gathering the information you'll need to display that matrix. So the first step in planning that phase is to list out all of the information that you will need. I see three bits of information:

the number of occurrences of base A at position N. You want to express the frequency as a percentage, i.e. a count of occurrences as a percentage of some base count.
the total number against which the percentage will be calculated. I would imagine that this is either the total number of occurrences of A, the total number of items in position N, or even the total number of bases in all sequences. Which exactly did you want?
the length of the longest sequence - you'll need to know the number of rows and columns to print out a matrix report. The number of bases is always 4. That leaves the number of positions still unknown - unless of course you keep track of the longest sequence.
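To make the choice of denominator concrete, here is a small sketch that computes all three candidate totals for the four test sequences used later in this post. The sub name candidate_totals and its return layout are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: return the three candidate totals
# for a list of sequences.
sub candidate_totals {
    my @sequences = @_;
    my (%per_base, @per_position);
    my $overall = 0;

    for my $seq (@sequences) {
        my @bases = split //, $seq;
        for my $pos (0 .. $#bases) {
            $per_base{ $bases[$pos] }++;  # all occurrences of this base
            $per_position[$pos]++;        # all bases seen at this position
            $overall++;                   # every base in every sequence
        }
    }
    return (\%per_base, \@per_position, $overall);
}

my ($per_base, $per_position, $overall)
    = candidate_totals(qw(ACCGT AGCCG CATTC GTAAA));

# A appears at position 0 in two of the four sequences.
my $count = 2;
printf "of all A's:            %.1f%%\n", 100 * $count / $per_base->{A};
printf "of all bases at pos 0: %.1f%%\n", 100 * $count / $per_position->[0];
printf "of all bases overall:  %.1f%%\n", 100 * $count / $overall;
```

The same count gives three quite different percentages, which is why it is worth settling on the denominator before writing the report code.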

For this part of the problem, you'll need to get familiar with arrays, hashes, and references. I would recommend storing your counts in a hash of arrays (HoA) rather than an array of arrays. For more information see perldata, perlref, and perldsc (Perl data structures).
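If hashes of arrays are new to you, the following minimal sketch shows the basic moves: incrementing a nested element, letting Perl create the inner array on first use, and reading the counts back out. The helper name tally is made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: record one sighting of $base at position $pos.
sub tally {
    my ($counts, $base, $pos) = @_;
    $counts->{$base}[$pos]++;   # autovivifies the inner array on first use
    return;
}

my %counts;                     # base => reference to an array of counts
tally(\%counts, 'A', 0);
tally(\%counts, 'A', 0);
tally(\%counts, 'C', 2);

for my $base (sort keys %counts) {
    # Positions never seen for this base are undef, so default them to 0.
    my @row = map { $_ // 0 } @{ $counts{$base} };
    print "$base: @row\n";
}
```

Note that the inner arrays can have different lengths, and that unseen positions are undef rather than 0, so the reporting code has to supply the default itself.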

If your input file is very large, you may eventually need to "tie" the hash to a random access file, but I'd leave that issue off the table unless you find your program failing due to memory constraints. To see the various alternatives, search CPAN for "Tie::Hash".
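For the curious, here is a rough sketch of what a tied hash can look like, using SDBM_File, which ships with Perl. This is only one of several options and not necessarily the best for your data. DBM files store plain strings, so this sketch assumes a flat "base:position" key instead of a hash of arrays; the helper name count_into and the file name freq are made up for the example.

```perl
use strict;
use warnings;
use Fcntl;                    # O_RDWR, O_CREAT
use SDBM_File;                # ships with Perl
use File::Temp qw(tempdir);

# Hypothetical helper: tally bases into any hash, tied or not,
# using flat "base:position" keys.
sub count_into {
    my ($count, @sequences) = @_;
    for my $seq (@sequences) {
        my @bases = split //, $seq;
        for my $pos (0 .. $#bases) {
            $count->{ join ':', $bases[$pos], $pos }++;
        }
    }
    return;
}

# Keep the counts on disk rather than in memory.
my $dir = tempdir(CLEANUP => 1);
tie(my %count, 'SDBM_File', "$dir/freq", O_RDWR | O_CREAT, 0666)
    or die "Cannot tie $dir/freq: $!";

count_into(\%count, qw(ACCGT AGCCG CATTC GTAAA));
printf "A at position 0: %d\n", $count{'A:0'};

untie %count;
```

Because count_into only sees a hash reference, the same code works with an ordinary in-memory hash, so you can start simple and switch to a tied hash only if memory becomes a problem.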

Putting all this together (especially HoA) can be a bit confusing, so I've included a bit of sample code that takes the base count at position N as a proportion of all bases found at position N. It prints that proportion as a decimal fraction (0.00 to 1.00); multiply by 100 if you want a true percentage. Please feel free to ask questions about any part you don't understand.

# always start your program with these two lines.
# They do a lot of your error checking for you and are
# more accurate than your eyeballs.
use strict;
use warnings;

#-----------------------------------------
# store counts in a hash of array references.
# - There is one hash key for each DNA base
# - The value assigned to each hash key is an array
#   reference. Its Nth element stores the number of times
#   that base appears at the Nth position.
#   (N=0 is first position)
#-----------------------------------------
my %hFrequency;
my @aTotalAtPos;
my $iMaxSequenceLength = 0;
my $sBase;
my $iPos;

# <DATA> Reads in one line of data from the stream DATA.
# see below at __DATA__ for the actual data

while (my $sSequence = <DATA>){

  #remove end of record marker, i.e. newline, from sequence
  chomp $sSequence;

  # regex // can be used to split a string into characters
  my @aBases = split(//, $sSequence);

  # keep track of maximum sequence length: we'll need it
  # later to print out the matrix

  my $iSequenceLength = scalar(@aBases);
  if ($iSequenceLength > $iMaxSequenceLength) {
    $iMaxSequenceLength = $iSequenceLength;
  }

  # up the count for each base/position found
  for($iPos=0; $iPos <= $#aBases; $iPos++) {

    $sBase = $aBases[$iPos]; #current char
    $hFrequency{$sBase}[$iPos] ++;
    $aTotalAtPos[$iPos]++;
  }
}

#-----------------------------------------
# print the matrix
# done with sprintf, but you might prefer
# to use standard Perl formatting.
#-----------------------------------------

# use a constant so we make sure we have the
# same format for the base column each time
# we print out a row

my $BASE_FORMAT = "%4s |";

# use a constant so we make sure we have the
# same width each time we print out a row

my $POS_WIDTH = 5;

# print out header

print sprintf($BASE_FORMAT, 'Base');  #label row
for($iPos=0; $iPos < $iMaxSequenceLength; $iPos++) {
  print sprintf("%${POS_WIDTH}d |", $iPos);
}
print "\n";  #end row

# print out divider bar below header

print '---- |';  #label row
for($iPos=0; $iPos < $iMaxSequenceLength; $iPos++) {
  print (('-' x $POS_WIDTH) . ' |');
}
print "\n";  #end row

# print out one row for each base

foreach $sBase (sort keys %hFrequency) {
  my $aCounts = $hFrequency{$sBase};

  print sprintf($BASE_FORMAT, $sBase);  #label row
  # $aCounts is an array reference
  # @$aCounts extracts the array

  for($iPos=0; $iPos < $iMaxSequenceLength; $iPos++) {
    my $iCount = $aCounts->[$iPos];
    my $iTotal = $aTotalAtPos[$iPos];
    $iCount = 0 unless defined($iCount);
    my $dPct = $iTotal ? $iCount/$iTotal : 0;
    print sprintf("%${POS_WIDTH}.2f |", $dPct);  # right-aligned, like the header
  }
  print "\n";  #end row
}

# This is a quick way to put in some test data.
# To read it in, use DATA as a filehandle;
# see above for an example.

__DATA__
ACCGT
AGCCG
CATTC
GTAAA

Best, beth
