angerusso has asked for the wisdom of the Perl Monks concerning the following question:

I have a UPDATED datafile which has only "W" and "Ms" entries. As in the example, I want to count number of A's which have "M" appearing atleast once over unique column names. I want to sum over rows, not columns. As long as "M" appears in column once, I just count that row as 1.

Gname G1 G1 G1 G1 G2 G2 G3 A W W M W W W M A W W W W W W W A W W W W W W W B W W W W W M M B M W W W W M M C M M M W W W W C M W W M M W W The output should be: Gname G1 G2 G3 A 1 0 1 B 1 2 2 C 2 1 0

I have written the following code to write the header row but I am very confused how should I start counting over blocks/chunks of data like I want. Can anyone help?

#!/usr/bin/perl -w if (@ARGV != 1){ print "USAGE: ./parse-counts.pl file\n"; exit(-1); } $mutfile = $ARGV[0]; %hash = (); open(INPUTR,"<$mutfile") || die "Can't open \$mutfile for reading. \n" +; while($line=<INPUTR>){ chomp $line; @toks = split(/\t/,$line); if ($toks[0] =~ /^Gname/){ $k = 0; # loop over the header row to get the unique "Gname"s @header = split(/\t/,$line); for $j (1..@toks-2){ $i = $j+1; if ($header[$i] ne $header[$j]){ $k++; $name[$k] = $header[$j]; } } for $i (0..$k){ $hash{$toks[0]}{$name[$k]} = $name[$k]; } } else { $k = 0; for $j (1..@toks-2){ $i = $j+1; if ($header[$i] ne $header[$j]){ $k++; $hash{$toks[0]}{$name[$k]} = 0; if ($toks[$j] =~ /M/){ $hash{$toks[0]}{$name[$k]} = 1; } } } } } close(INPUTR); $outdata = $mutfile; $outdata =~ /(.+).txt/; $outdata = $1."-COUNTS.txt"; open(OUTD,">$outdata"); foreach $idname (sort keys %hash){ if ($idname =~ /^Gname/){ print OUTD $idname; foreach $gid (sort keys %{$hash{$idname}}){ print OUTD "\t".$hash{$idname}{$gid}; } print OUTD "\n"; } } foreach $idname (sort keys %hash){ if ($idname !~ /^Gname/){ print OUTD $idname; foreach $gid (sort keys %{$hash{$idname}}){ print OUTD "\t".$hash{$idname}{$gid}; } print OUTD "\n"; } } close(OUTD); print "Printing $outdata file done.\n";

Replies are listed 'Best First'.
Re: how to sum over rows based on column headings in perl
by roboticus (Chancellor) on Jul 29, 2015 at 20:49 UTC

    angerusso:

    I'd suggest building an array to map column number to the column name, which it looks like you may be doing. I'd also suggest using a hash to act as your accumulator: the key would be the column name (G1, G2, G3).

    Then the procedure would be something like this:

    • Create a hash to hold the totals
    • For each line in the file:
      • If it's a header
        • build your column number to name map
      • otherwise
        • Split line into columns
        • For each column:
          • Does the column contain a value I care to count? If not, go to the next column.
          • Look up the column name for the column
          • Add one to the hash{column name}
    • Print the report

    So lets see how it would work with this input data:

    Gname G1 G1 G2 G3 G3 A W M M W W A M W W M M B M W M M M

    So we first create a hash to hold your counters, starting like { }

    So we read the first line, and find it's a header, so we build a column number to name map, resulting in the map: 1-->G1, 2-->G1, 3-->G2, 4-->G3, 5-->G3, then go to the next line.

    For the second line, we split it into columns.

    The first column (column 0) will be the name of the counter in our hash. So let's scan through the columns for interesting values: Column 1 has 'W' so it's boring and we skip over it. Column 2 has an 'M', so we want to count it. So we consult our column to name table, and find that column 2 maps to G1. So we update the counter slot for Gname=A, Colheader G1 to 1. Column 3 also has a 'M', and column 3 maps to G2, so update counter for A/G2 to 1. The remaining columns are 'W', so we ignore them.

    We split the next line up and find that it's the same Gname (A). Working through the columns, we see that the interesting columns are 1, 4 and 5, which map to G1, G3 and G3. So the A/G1 counter increases to 2, the A/G2 counter increases to 2 also. The A/G3 counter is set to 1.

    The final line gives us a new Gname, B. It's interesting columns are 1, 3, 4 and 5, which map to G1, G2, G3 and G3 respectively. So we set both B/G1 and B/G2 to 1, and B/G3 to 2.

    So our final counter hash shows that for those three rows, we would have:

    Gname G1 G2 G3 A 2 2 1 B 1 1 2

    Update: Added the "Create a hash" line to show where it should be created, and then the worked out example.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

      Hey, thanks for your help. I have posted my complete code. The problem i am running into is to how to sum over rows, not columns. As long as column has "M" appearing once, I count it otherwise the count should be zero. However, I do care about if "M" appears in another row for the same name. So I have it sum it up. Am I being clear enough? I am getting stuck with how to add over rows which somehow has to be out of the for loop (over columns) but I am not able to figure out how.

        angerusso:

        The algorithm I presented should work, but I went back and clarified it and showed an example to show how it works.

        ...roboticus

        When your only tool is a hammer, all problems look like your thumb.

Re: how to sum over rows based on column headings in perl
by Not_a_Number (Prior) on Jul 29, 2015 at 22:25 UTC

    You have a problem in your data. Ignoring the first column, you have seven columns of headers but only six columns of actual data:

    G1 G1 G1 G1 G2 G2 G3 W W M W W M W W W W W W ...

    Update: I see now that anonymonk had already spotted this...

    Update2: Corrected miscount

      Sorry I have updated my example data matrix and the updated my complete code which is causing me problems as I don't know how to update the hash outside the for loop over columns as I want to sum over rows, not columns. Any additional help will be greatly appreciated.
Re: how to sum over rows based on column headings in perl
by Anonymous Monk on Jul 29, 2015 at 21:43 UTC

    Is there a mis-match between the data file header and the body of the data file?

Re: how to sum over rows based on column headings in perl
by Anonymous Monk on Jul 30, 2015 at 19:57 UTC
    #!/usr/bin/perl # http://perlmonks.org/?node_id=1136784 use strict; use warnings; my (%counts, %unique, %masks); my ($headname, @unique) = grep !$unique{$_}++, my @headers = split ' ' +, <DATA>; for my $label ( @unique ) { $masks{$label} = join ' ', map { $_ eq $label ? 'M' : '_' } @headers[1..$#headers]; } while(<DATA>) { s/\s+/ /g; # for uniform spacing my ($name, $letters) = split ' ', $_, 2; $counts{$name}{$_} += ($masks{$_} | $letters) =~ /M/ for @unique; } print "$headname @unique\n"; print "$_ @{ $counts{$_} }{@unique}\n" for sort keys %counts; # Gname G1 G2 G3 # A 1 0 1 # B 1 2 2 # C 2 1 0 __DATA__ Gname G1 G1 G1 G1 G2 G2 G3 A W W M W W W M A W W W W W W W A W W W W W W W B W W W W W M M B M W W W W M M C M M M W W W W C M W W M M W W
      Very grateful for your response! I will spend sometime understanding your elegant piece of code before I try to run it myself. Thanks so much! I need time to understand it.

      I am trying to understand your code by modifying it myself and trying to read the data through a file

      #!/usr/bin/perl -w use strict; use warnings; use autodie; if (@ARGV != 1){ print "USAGE: ./parse-counts.pl file\n"; exit(-1); } my $mutfile = $ARGV[0]; my (%counts, %unique, %masks); my ($headname, @unique) = grep !$unique{$_}++, my @headers = split ' ' +, <DATA>; for my $label ( @unique ) { $masks{$label} = join ' ', map { $_ eq $label ? 'M' : '_' } @headers[1..$#headers]; } my $line; open(INPUTR,"<$mutfile") || die "Can't open \$mutfile for reading. \n" +; while($line=<INPUTR>){ #while(<DATA>) { chomp $line; s/\s+/ /g; # for uniform spacing my ($name, $letters) = split ' ', $line, 2; $counts{$name}{$line} += ($masks{$line} | $letters) =~ /M/ for @un +ique; } print "$headname @unique\n"; print "$line @{ $counts{$_} }{@unique}\n" for sort keys %counts;

      I am sorry i don't understand what I did wrong as now I get error as:

      ./parse-counts.pl junk.txt Missing right curly or square bracket at ./parse-counts.pl line 36, at + end of line syntax error at ./parse-counts.pl line 36, at EOF Execution of ./parse-counts.pl aborted due to compilation errors.

        That error is pretty clear! You are missing the right curly bracket closing the while loop you opened on line 26. The end of line 36 is given as the location of the error because that's where Perl runs out of code in which to find the closing bracket.

        You have a stray opening bracket under the line you commented out at line 27. This is a good example of why you should have consistent syntax: your commented-out while statement has the bracket on the next line, while the new while statement has the bracket on the same line. Pick one and stick with it. In my view, putting the brackets on the same line reduces the chances of them going astray.

        Update: explain cause of error

        The way forward always starts with a minimal test.
        #!/usr/bin/perl use strict; use warnings; use autodie; if (@ARGV != 1){ print "USAGE: ./parse-counts.pl file\n"; exit(1); } my $mutfile = $ARGV[0]; open(INPUTR,"<$mutfile") or die "Can't open \$mutfile for reading. \n" +; my (%counts, %unique, %masks); my ($headname, @unique) = grep !$unique{$_}++, my @headers = split ' ' +, <INPUTR>; for my $label ( @unique ) { $masks{$label} = join ' ', map { $_ eq $label ? 'M' : '_' } @headers[1..$#headers]; } my $line; while($line=<INPUTR>) #while(<DATA>) { chomp $line; $line =~ s/\s+/ /g; # for uniform spacing my ($name, $letters) = split ' ', $line, 2; $counts{$name}{$_} += ($masks{$_} | $letters) =~ /M/ for @unique; } print "$headname @unique\n"; print "$_ @{ $counts{$_} }{@unique}\n" for sort keys %counts;
Re: how to sum over rows based on column headings in perl
by Anonymous Monk on Jul 30, 2015 at 22:15 UTC
    Man... like, is it really totally impossible for the original-source of that data file to ... like ... "run a different query, pretty-please?!" Because, if they could, somehow, be pressed into doing so, then you would probably have no program to write at all. It is certainly worth asking.