buzzybeewhee has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I'm trying to match the frequency of occurrence for each month to its key. I have this Perl script, which builds a hash with one key and the frequencies of occurrence for the different months as its values, so multiple values map to one key.

my %mutperc;
while (<FILE>) {
    chomp;
    $countentry++;
    ($acc, $freq, $perc) = split ("\t");
    $accession = $acc;
    push @{$mutperc{$accession}}, $perc;
    $" = "\t";
}
close (FILE);
foreach my $accession (sort {$mutperc{$b} <=> $mutperc{$a}} (keys %mutperc)) {
    if(grep $_ > 10 , @{$mutperc{$accession}}) {
        print "$accession\t@{$mutperc{$accession}}\n";
        $countprint++;
    }
}
print "$countentry\t$countprint\n";

and right now, it will give me something like this...

T373I 24.3902439 16.36363636 7.142857143 9.090909091
V100I 36.58536585 61.81818182 85.71428571 96.66666667 100
L122Q 97.56097561 100 100 100

but then the problem right now is that I actually have 5 months, but the occurrence rate for some months is actually 0, so the perl script does not give me the values matched to the months. In other words, I'm looking for 5 columns (for my 5 months) and the frequency of occurrences for each month should be matched to the column number. (e.g. if the frequency of occurrence in April is 4.67%, 4.67 should appear in the 4th column) If the occurrence rate for that month is 0, a 0 should appear in that column, instead of moving the next few occurrence rates for the following months in front. The end product I'm looking for is:

T373I 24.3902439 16.36363636 7.142857143 9.090909091 0
V100I 36.58536585 61.81818182 85.71428571 96.66666667 100
L122Q 0 97.56097561 100 100 100

I'm so sorry that I'm asking so much >< but I truly need this script to be working...to save my life because I have around 250 such files to sort through and its almost impossible to do it manually! Please help!~ Many thanks and appreciation! :D Best Regards, Buzzybee

Replies are listed 'Best First'.
Re: Put multiple values as columns inside a hash
by spazm (Monk) on Jun 18, 2010 at 04:18 UTC
    Questions:
    1. what does your data file look like?
    2. What does the middle element, $freq, represent?
    3. What is in the data file when the occurrence rate is zero? row skipped or zero elements?
    4. What do you intend to happen with your grep? You want keys where any month is greater than 10?
    5. What do you intend to happen with your sort? $mutperc{$a} and $mutperc{$b} are array references, so you're comparing on some nonsensical pointer value. This was probably sorting by value before you turned the value into an array ref. $mutperc{$a}->[0] <=> $mutperc{$b}->[0] will sort by the first month value.
    Your code looks close to working, let's get it going.
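
To illustrate points 4 and 5, here is a minimal sketch using made-up in-memory data instead of a file. The values are copied from the sample output and input in this thread, and the names @sorted and @kept are only for illustration. It sorts by the first month value (descending, as the original script seems to intend) and keeps the keys where any month is above 10.

```perl
use strict;
use warnings;

# Sample data copied from the thread; a real script would fill this from the file.
my %mutperc = (
    T373I => [ 24.3902439, 16.36363636, 7.142857143, 9.090909091 ],
    V100I => [ 36.58536585, 61.81818182, 85.71428571, 96.66666667, 100 ],
    L122Q => [ 97.56097561, 100, 100, 100 ],
    G16D  => [ 7.317073171 ],
);

# Compare the first element of each array, not the array references themselves.
my @sorted = sort { $mutperc{$b}[0] <=> $mutperc{$a}[0] } keys %mutperc;

# Keep only accessions where at least one month is above 10.
my @kept = grep {
    my $acc = $_;
    scalar grep { $_ > 10 } @{ $mutperc{$acc} };
} @sorted;

print join("\t", $_, @{ $mutperc{$_} }), "\n" for @kept;
```

G16D is dropped by the filter because its only value is below 10.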
      hello!

      thanks for all the replies! :D

      To Almut:

      Actually, judging from what almut said...It may be kind of impossible to generate a script that is able to differentiate which columns have frequencies of 0...

      Because the input file is actually like that:

      L122Q 40 97.56097561
      V100I 15 36.58536585
      T373I 10 24.3902439
      V217I 4 9.756097561
      G16D 3 7.317073171
      L283P 3 7.317073171
      D21N 2 4.87804878
      S450N 2 4.87804878
      D53E 2 4.87804878
      I109V 2 4.87804878
      R236K 2 4.87804878
      K351R 2 4.87804878
      K470R 2 4.87804878
      K400R 2 4.87804878
      I61L 2 4.87804878
      V425I 2 4.87804878
      N309T 2 4.87804878
      M105V 2 4.87804878
      K452R 2 4.87804878
      E372D 2 4.87804878
      L456V 2 4.87804878
      M191L 2 4.87804878
      A190V 2 4.87804878
      V444I 2 4.87804878
      V313Y 2 4.87804878
      T373A 2 4.87804878
      M316I 2 4.87804878
      S498N 2 4.87804878
      R293K 2 4.87804878
      N433T 2 4.87804878
      (this is only 1 little portion)

      The system won't really know where each month start or ends right? ): So it also won't know when the frequency for each is 0... ...How about if I use separate files instead? Maybe 1 file for each month? Would it be possible to achieve my goal this way?

      Sorry for the pestering and I really appreciate you pointing out the fatal error in my question! :))

      To Spazm:

      1. I have given a sample of the data file already, thanks for taking the effort to really help me look at my script...although what I'm asking for is a bit illogical ):

      2. $freq represents the middle number in my data file...I'm still a newbie at this, so I may actually have a lot of extra steps that do not contribute to anything, but I don't really know how to make them more concise

      3. When the occurrence rate is zero, it just doesn't appear in the data file...

      4. My grep is to get values larger than 10, because I'm only looking at keys which have an occurrence rate of above 10 for any months

      5. Thanks a lot for this helpful input! :D I was wondering why the sort function no longer worked after I changed it into an array. This is really useful! :)

      Many thanks for helping me look at my script, I hope it'll start working too! :))

      Best Regards,

      Buzzybee

        How about if I use separate files instead? Maybe 1 file for each month?

        Yes, if that's possible you could do something like this:

        my %mutperc;
        my $mon_idx = 0;
        for my $fname ( qw(jan.dat feb.dat mar.dat apr.dat may.dat) ) {
            open my $fh, "<", $fname or die "Couldn't open '$fname': $!";
            while (<$fh>) {
                chomp;    # otherwise $perc keeps the trailing newline
                my ($acc, $freq, $perc) = split /\t/;
                # store at the month's index so the column position is fixed
                $mutperc{$acc}[$mon_idx] = $perc;
            }
            $mon_idx++;
        }
        for my $acc (sort keys %mutperc) {
            print $acc;
            for my $mon_idx (0..4) {
                my $perc = $mutperc{$acc}[$mon_idx];
                $perc = 0 unless defined $perc;    # missing month => 0
                print "\t$perc";
            }
            print "\n";
        }

        By using a fixed index ($mon_idx) instead of pushing onto the arrays, the values end up in the right place and you get undef for the months with missing values, which you can later turn into 0.
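
The difference between pushing and using a fixed index can be seen without any files. The following is only a sketch: the @months structure and its three months of data are made up for illustration, and the `//=` operator assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical per-month data: one hash (accession => percentage) per month.
my @months = (
    { T373I => 24.3902439 },                          # month 0
    { T373I => 16.36363636, L122Q => 97.56097561 },   # month 1
    { L122Q => 100 },                                 # month 2
);

my (%pushed, %indexed);
for my $mon_idx (0 .. $#months) {
    for my $acc (keys %{ $months[$mon_idx] }) {
        my $perc = $months[$mon_idx]{$acc};
        push @{ $pushed{$acc} }, $perc;      # position depends on earlier values
        $indexed{$acc}[$mon_idx] = $perc;    # position is the month itself
    }
}

# Turn gaps (undef) into 0 and pad every row to the full number of months.
for my $acc (keys %indexed) {
    $indexed{$acc}[$_] //= 0 for 0 .. $#months;
}

print join("\t", $_, @{ $indexed{$_} }), "\n" for sort keys %indexed;
```

With push, L122Q's first value lands in the first column even though it belongs to the second month; with the fixed index it stays in the second column and the first is filled with 0.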

Re: Put multiple values as columns inside a hash
by aquarium (Curate) on Jun 18, 2010 at 06:23 UTC
    looks like you're data hiding, i.e. using an implicit value of something to then decide what column/month it is. this is generally bad and no longer done in this day and age of cheap storage. so to cut to the chase, make the implicit explicit, i.e. store the column number (or even better month number and/or name) as part of the hash. then you adjust the code to stop relying on implicit order of data. it also opens up avenues of further flexibility, e.g. can summarize data by quarter etc etc.
    also, just in case the sums of these decimal point numbers really matter, consider using a non floating point math library. typical gotcha is doing a conditional on numbers supposed to add to 100 percent or similar, when in fact floating point math can produce something not quite 100 percent, and the conditional fails. btw for aesthetics, columns of decimal numbers are best decimal point justified..which is a fun little exercise in itself.
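
A small sketch of both points, under stated assumptions: it uses a tolerance check rather than the non floating point library suggested above (the 1e-9 tolerance is an arbitrary choice), and it uses a fixed-precision printf format to line up the decimal points.

```perl
use strict;
use warnings;

# Ten parts of 0.1 should add up to 1, but binary floating point is inexact.
my $sum = 0;
$sum += 0.1 for 1 .. 10;

my $exact = ($sum == 1) ? 1 : 0;                # exact comparison, may fail
my $close = (abs($sum - 1) < 1e-9) ? 1 : 0;     # comparison with a tolerance

print "exact: $exact, within tolerance: $close\n";

# A fixed width and precision keeps the decimal points in one column.
printf "%-6s %9.4f\n", 'T373I', 7.142857143;
printf "%-6s %9.4f\n", 'L122Q', 100;
```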
    cheers.
    the hardest line to type correctly is: stty erase ^H
      Hello

      I'm so sorry, but I'm pretty new to programming, so I'm not exactly sure I got your full idea...

      I'm not exactly letting the value decide what column/month it is...the value is actually the frequency of occurrence of the key in that month.

      And when you say store the column/month/name inside the hash, you mean as the value of the hash? Or the key? I apologize for my ignorance, but I haven't the foggiest idea of how to go about that ><, since they're not exactly the values I need (the values I need are the frequencies of occurrence)

      The last point about decimals is really interesting though :) In the end, I'm going to input the information into an excel file and plot graphs using the data, so this can come into good use :)

      Thank you so much for your helpful input! :D

      Best Regards

      Buzzybee

        sorry for getting back to you late. what i mean is that you need to make the month part of the hash, e.g. assign explicitly $mutperc{$accession}{'month'} = whichever_month_need_to_assign. then when you're doing the output you can easily detect missing month or implicitly show (by printing month). Hope this makes sense.
        in the past i've written a perl report that adds up summary data values, with many different slices of summary results required per month and quarter. rather than counting on particular month data being present, the final output routine was a for(1..12) that indexed into the convoluted hashes by month number. Whatever month indexed values didn't exist (no data) i made sure undef printed as zero.
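
One possible reading of this, as a sketch only: key the inner hash by month number and let the output loop run over the months, printing 0 where nothing was stored. The @records list is hypothetical (it assumes the month is known for each record, which the input file in this thread does not provide), and `//` assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical records: [accession, month number, percentage].
my @records = (
    [ 'T373I', 1, 24.3902439 ],
    [ 'T373I', 2, 16.36363636 ],
    [ 'L122Q', 2, 97.56097561 ],
    [ 'L122Q', 3, 100 ],
);

# The month is stored explicitly as the second-level hash key.
my %mutperc;
for my $rec (@records) {
    my ($acc, $month, $perc) = @$rec;
    $mutperc{$acc}{$month} = $perc;
}

# Loop over the months, not over whatever data happens to exist.
my @rows;
for my $acc (sort keys %mutperc) {
    my @cols = map { $mutperc{$acc}{$_} // 0 } 1 .. 5;
    push @rows, join("\t", $acc, @cols);
}
print "$_\n" for @rows;
```

Changing `1 .. 5` to `1 .. 12` gives the full-year version described above.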
        on something related in doing summary calculations, can't remember what perl does when you divide by zero...but it ends up being an error with live data when values are missing or contain zero. so always put in an explicit check for divide by zero, before it bites.
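
For reference, Perl dies with "Illegal division by zero" in that case. A minimal sketch of an explicit check follows; safe_percent is a hypothetical helper, and it assumes the percentage is count divided by total (which matches 40 out of 41 giving 97.56 in the sample data).

```perl
use strict;
use warnings;

# Hypothetical helper: percentage of $count out of $total,
# returning 0 when $total is missing or zero instead of dying.
sub safe_percent {
    my ($count, $total) = @_;
    return 0 unless defined $total && $total != 0;
    return 100 * $count / $total;
}

# Without the check, the division is a fatal error; eval catches it here.
my $zero   = 0;
my $result = eval { 40 / $zero };
my $error  = $@;

print "unguarded: $error";
printf "guarded: %.2f and %.2f\n", safe_percent(40, 41), safe_percent(3, 0);
```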
        have fun perl coding.
Re: Put multiple values as columns inside a hash
by almut (Canon) on Jun 18, 2010 at 05:28 UTC

    (update: sorry, it seems I misread your description — too early in the morning... :)


    but the occurrence rate for some months is actually 0, ...

    When you have an input line with fewer than 5 month columns like

    L122Q 97.56097561 100 100 100

    how do you want to tell if it actually should have been

    L122Q 0 97.56097561 100 100 100

    or

    L122Q 97.56097561 100 100 100 0

    or

    L122Q 97.56097561 0 100 100 100

    i.e. how would you disambiguate?