The first step in solving any programming problem has nothing to do with Perl (or any other language). From your description, it sounds like you first need to think through the individual steps in more detail (or at least share that thinking with us). Those steps are:

  1. Gather the information needed to print out a matrix.
  2. Print out the matrix.

These are in fact two very different problems, but in both cases you get clarity on the details by working backwards from the goal.

In the second part you are presenting data. To work this out, start with an illustration (on paper) of what you want your report to look like. Even if your goal is a matrix, there are obviously lots of different ways to draw a matrix. To plan out this step you'll need to think about whether each base is a row or a column; what you want for the column and row labels; the width of each column; whether or not you want lines dividing the columns, and so on.

When you get ready to code, you may find perlform helpful. That document is a little confusing, so you might want to read through this plain-English introduction first: Perl format Primer. If your matrix layout is pretty simple and you are comfortable with sprintf and foreach loops, then you could just roll your own matrix layout instead. But I think it would be worth the time to at least get acquainted with Perl's built-in formatting tools.
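As a minimal sketch of the roll-your-own approach: the helper below pads every cell to the same width with sprintf so the columns line up. The name format_row, the 4-character label column, and the " |" separators are all made up for illustration; adjust them to match whatever layout you drew on paper.

```perl
use strict;
use warnings;

# Hypothetical helper (not a Perl built-in): build one fixed-width row.
# $label fills the first column; each cell is right-aligned in $width chars.
sub format_row {
    my ($label, $width, @cells) = @_;
    my $row = sprintf('%-4s |', $label);
    foreach my $cell (@cells) {
        # %*s takes the field width from the argument list
        $row .= sprintf(' %*s |', $width, $cell);
    }
    return $row;
}

# header row with position numbers, then one data row
print format_row('Base', 5, 0 .. 3), "\n";
print format_row('A', 5, map { sprintf('%.2f', $_) } 0.25, 0, 0.5, 0.25), "\n";
```

Because the width is passed in as a parameter, the header row and every data row are guaranteed to use the same column size.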

The goal for the first part is gathering the information you'll need to display that matrix. So the first step in planning that phase is to list out all of the information that you will need. I see three bits of information:

  1. the number of occurrences of base A at position N. You want to express the frequency as a percentage, i.e. a count of occurrences as a percentage of some base count.
  2. the total number against which the % will be calculated. I would imagine that this is either the total number of occurrences of A, the total number of items in position N, or even the total number of bases in all sequences - which exactly did you want?
  3. the length of the longest sequence - you'll need to know the number of rows and columns to print out a matrix report. The number of bases is always 4. That leaves the number of positions still unknown - unless of course you keep track of the longest sequence.

For this part of the problem, you'll need to get familiar with arrays, hashes, and references. I would recommend storing your counts in a hash of arrays (HoA) rather than an array of arrays. For more information see perldata, perlref, and perldsc (Perl data structures).
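To give a feel for what a hash of arrays looks like in practice, here is a small sketch. The sub name count_bases and the two sample sequences are invented for illustration. The point is only to show how the nested structure is filled and read back.

```perl
use strict;
use warnings;

# Hypothetical helper: count bases by position in a hash of arrays.
# Returns a hash reference: base => [count at pos 0, count at pos 1, ...]
sub count_bases {
    my @sequences = @_;
    my %count;
    foreach my $seq (@sequences) {
        my @bases = split //, $seq;
        foreach my $pos (0 .. $#bases) {
            # Perl creates the inner array automatically on first use
            $count{ $bases[$pos] }[$pos]++;
        }
    }
    return \%count;
}

my $count = count_bases('ACG', 'AAT');

# $count->{A} is an array reference; @{ $count->{A} } is the array itself.
# Positions where a base never appeared are undef, so treat them as 0.
foreach my $base (sort keys %$count) {
    my @per_pos = map { defined $_ ? $_ : 0 } @{ $count->{$base} };
    print "$base: @per_pos\n";
}
```

Note that each inner array is only as long as the last position where that base was seen, which is one reason to keep track of the longest sequence separately.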

If your input file is very large, you may eventually need to "tie" the hash to a random access file so that the counts don't all have to live in memory. I'd leave that issue off the table unless you find your program failing due to memory constraints. If you do get to that point, search CPAN for "Tie::Hash" to see the various alternatives.
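Should you ever get to that point, a tied hash might look roughly like the sketch below. It assumes SDBM_File (which ships with Perl) as the backing store; that is just one of several options, not a recommendation. A DBM file stores only flat strings, so instead of a hash of arrays this sketch uses one flat key per base/position pair, such as "A:0". The key format and the temporary file location are assumptions for illustration.

```perl
use strict;
use warnings;
use Fcntl;                    # provides O_RDWR and O_CREAT
use SDBM_File;                # ships with Perl
use File::Temp qw(tempdir);

# keep the DBM files in a scratch directory that is removed on exit
my $dir = tempdir(CLEANUP => 1);

# after tie, reads and writes on %count go to the file, not to memory
tie(my %count, 'SDBM_File', "$dir/counts", O_RDWR | O_CREAT, 0666)
    or die "Cannot tie hash: $!";

foreach my $seq (qw(ACG AAT)) {
    my @bases = split //, $seq;
    foreach my $pos (0 .. $#bases) {
        my $key = join ':', $bases[$pos], $pos;   # e.g. "A:0"
        $count{$key}++;
    }
}

print "A at position 0: $count{'A:0'}\n";

# flush and close the file when done
untie %count;
```

The rest of the program can use %count like any other hash. Only the key layout changes.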

Putting all this together (especially HoA) can be a bit confusing, so I've included a bit of sample code that takes the base count at position N as a fraction of all bases found at position N (multiply by 100 if you want a percentage). Please feel free to ask questions about any part you don't understand.

```perl
# Always start your program with these two lines.
# They do a lot of your error checking for you and are
# more accurate than your eyeballs.
use strict;
use warnings;

#-----------------------------------------
# store counts in a hash of array references.
# - There is one hash key for each DNA base
# - The value assigned to each hash key is an array
#   reference. Its Nth element stores the number of times
#   that base appears at the Nth position.
#   (N=0 is first position)
#-----------------------------------------

my %hFrequency;
my @aTotalAtPos;
my $iMaxSequenceLength = 0;
my $sBase;
my $iPos;

# <DATA> reads in one line of data from the stream DATA.
# see below at __DATA__ for the actual data
while (my $sSequence = <DATA>) {

  # remove end of record marker, i.e. newline, from sequence
  chomp $sSequence;

  # splitting on the empty pattern // breaks a string into characters
  my @aBases = split(//, $sSequence);

  # keep track of maximum sequence length: we'll need it
  # later to print out the matrix
  my $iSequenceLength = scalar(@aBases);
  if ($iSequenceLength > $iMaxSequenceLength) {
    $iMaxSequenceLength = $iSequenceLength;
  }

  # up the count for each base/position found
  for ($iPos=0; $iPos <= $#aBases; $iPos++) {
    $sBase = $aBases[$iPos];  #current char
    $hFrequency{$sBase}[$iPos]++;
    $aTotalAtPos[$iPos]++;
  }
}

#-----------------------------------------
# print the matrix
# done with sprintf, but you might prefer
# to use standard Perl formatting.
#-----------------------------------------

# use a constant so we make sure we have the
# same format for the base column each time
# we print out a row
my $BASE_FORMAT = "%4s |";

# use a constant so we make sure we have the
# same width each time we print out a row
my $POS_WIDTH = 5;

# print out header
print sprintf($BASE_FORMAT, 'Base');  #label row
for ($iPos=0; $iPos < $iMaxSequenceLength; $iPos++) {
  print sprintf("%${POS_WIDTH}d |", $iPos);
}
print "\n";  #end row

# print out divider bar below header
print '---- |';  #label row
for ($iPos=0; $iPos < $iMaxSequenceLength; $iPos++) {
  print (('-' x $POS_WIDTH) . ' |');
}
print "\n";  #end row

# print out one row for each base
foreach $sBase (sort keys %hFrequency) {
  my $aCounts = $hFrequency{$sBase};
  print sprintf($BASE_FORMAT, $sBase);  #label row

  # $aCounts is an array reference
  # @$aCounts extracts the array
  for ($iPos=0; $iPos < $iMaxSequenceLength; $iPos++) {
    my $iCount = $aCounts->[$iPos];
    my $iTotal = $aTotalAtPos[$iPos];
    $iCount = 0 unless defined($iCount);

    # fraction (0 to 1) of all bases seen at this position
    my $dPct = $iTotal ? $iCount/$iTotal : 0;
    print sprintf("%-${POS_WIDTH}.2f |", $dPct);
  }
  print "\n";  #end row
}

# This is a quick way to put in some test data.
# To read it in you use DATA as a file handle;
# see above for an example.
__DATA__
ACCGT
AGCCG
CATTC
GTAAA
```

Best, beth


In reply to Re: Creating Frequency Matrix by ELISHEVA
in thread Creating Frequency Matrix by Nora
