lomSpace has asked for the wisdom of the Perl Monks concerning the following question:

Hello,
I have created a hash after parsing a file in order to count the occurrences of values.
I am able to calculate and print the counts for the values, but each count is written to the output multiple times. How can I print each one only once?
use strict;

#open file
open(my $in,"/Users/mydir/Desktop/CCDS.current.txt") or die " Can't open file: $!";

#open out file
open(OUT, ">/Users/mydir/Desktop/genesperchrcnt.txt");

# initialize the hash
my %geneids=();

#open the file and push the info from the designated columns into it
# remove header
my $firstline = <$in>;
chomp $firstline;

while(<$in>){
    chomp; # remove the newline character
    my @fields = split (/\t/);

    #extract the columns that we are interested in.
    # Populate the key value pairs of the hash with $gene and $id
    $geneids{$fields[2]} = $fields[0];

    # initialize an array to store hash values
    my @chr;
    push @chr, $fields[0];

    #count chromosome number which is the value in the hash
    $geneids{$fields[0]}++;
    next if $geneids{$fields[0]} > 1;

    foreach my $values (sort values %geneids) {
        print OUT "Chromosome $values has $geneids{$values} genes\n";
    }
}
close($in);
close(OUT);

=cut

Output

Chromosome 1 has 1635 genes
Chromosome 1 has 1635 genes
Chromosome 1 has 1635 genes
Chromosome 1 has 1635 genes
Chromosome 3 has 778 genes
Chromosome 3 has 778 genes
Chromosome 3 has 778 genes
Chromosome 3 has 778 genes
Chromosome 4 has 518 genes
Chromosome 4 has 518 genes
Chromosome 4 has 518 genes
Chromosome 4 has 518 genes

I need each duplicate printed once. What's the best way to do this?
DeepSpace

Replies are listed 'Best First'.
Re: how can I print my hash values once?
by ikegami (Patriarch) on Mar 10, 2011 at 05:16 UTC

    You're iterating over the values, then using each value as if it were a key. I suspect you want

    foreach my $geneid ( sort { $geneids{$a} <=> $geneids{$b} } keys %geneids ) {
        print OUT "Chromosome $geneid has $geneids{$geneid} genes\n";
    }

    Also switched from cmp to <=> so that 10 comes after 2.
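
    To illustrate the difference, here is a minimal sketch. The numbers in @counts are made up for the example and are not taken from the CCDS file.

    ```perl
    use strict;
    use warnings;

    # Hypothetical counts, chosen only to show the ordering difference.
    my @counts = (10, 2, 33, 4);

    # Default sort compares as strings (cmp), so "10" sorts before "2".
    my @as_strings = sort @counts;

    # Numeric comparison with <=> gives the expected order.
    my @as_numbers = sort { $a <=> $b } @counts;

    print "string order:  @as_strings\n";   # 10 2 33 4
    print "numeric order: @as_numbers\n";   # 2 4 10 33
    ```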

    Oh! And you want to move the foreach outside of the while loop.
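
    A rough sketch of that structure: count inside the read loop, report once afterwards. The rows, the %count hash and the accession/gene placeholders are invented for illustration; the column layout (chromosome first, gene third) is assumed from the code in the question.

    ```perl
    use strict;
    use warnings;

    # Hypothetical tab-separated rows: chromosome, accession, gene.
    my @rows = (
        "1\tacc1\tgeneA",
        "2\tacc2\tgeneB",
        "1\tacc3\tgeneC",
        "X\tacc4\tgeneD",
        "1\tacc5\tgeneE",
        "X\tacc6\tgeneF",
    );

    my %count;                      # chromosome => number of rows seen
    for my $row (@rows) {
        my @fields = split /\t/, $row;
        $count{ $fields[0] }++;     # only count inside the loop
    }

    # Report once, after every row has been counted.
    my @report;
    for my $chr ( sort { $count{$a} <=> $count{$b} } keys %count ) {
        push @report, "Chromosome $chr has $count{$chr} genes";
    }
    print "$_\n" for @report;
    ```

    Because the foreach runs only after the last row has been read, each chromosome is reported exactly once.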

    Update: Added last paragraph.

      ikegami,
      Thanks, that was simple enough. I failed to mention that there are also counts for the X and Y chromosomes. Example:
      #!/usr/bin/perl -w
      use strict;

      #open file
      open(my $in,"/Users/mgavibrathwaite/Desktop/CCDS.current.txt") or die " Can't open file: $!";

      #open out file
      open(OUT, ">/Users/mgavibrathwaite/Desktop/genesperchrcnt.txt");

      # initialize the hash
      my %geneids=();

      #open the file and push the info from the designated columns into it
      # remove header
      my $firstline = <$in>;
      chomp $firstline;

      while(<$in>){
          chomp; # remove the newline character
          my @fields = split (/\t/);

          #extract the columns that we are interested in.
          # Populate the key value pairs of the hash with $gene and $id
          $geneids{$fields[2]} = $fields[0];

          # initialize an array to store hash values
          my @chr;
          push @chr, $fields[0];

          #count chromosome number which is the value in the hash
          $geneids{$fields[0]}++;
          next if $geneids{$fields[0]} > 1;
      }

      foreach my $geneid ( sort { $geneids{$a} <=> $geneids{$b} } keys %geneids ) {
          print OUT "Chromosome $geneid has $geneids{$geneid} genes\n";
      }
      close($in);
      close(OUT);

      =cut

      Output

      Chromosome has X genes
      Chromosome KLHL13 has X genes
      Chromosome UTY has Y genes
      Chromosome SPIN2B has X genes
      Chromosome PIR has X genes
      Chromosome ADRBK2 has 22 genes
      Chromosome SLC2A11 has 22 genes
      Chromosome SELO has 22 genes
      Chromosome PIK3IP1 has 22 genes
      Chromosome 21 has 323 genes
      Chromosome 18 has 358 genes
      Chromosome 13 has 402 genes
      Chromosome 22 has 553 genes
      Chromosome 20 has 724 genes
      Chromosome 15 has 733 genes
      Chromosome 14 has 772 genes
      Chromosome 8 has 827 genes
      Chromosome 4 has 922 genes
      Chromosome 9 has 982 genes
      Chromosome 10 has 1007 genes
      Chromosome 16 has 1009 genes
      Chromosome X has 1045 genes
      Chromosome 5 has 1054 genes
      Chromosome 7 has 1137 genes
      Chromosome 12 has 1283 genes
      Chromosome 6 has 1298 genes
      Chromosome 3 has 1354 genes
      Chromosome 17 has 1412 genes
      Chromosome 11 has 1543 genes
      Chromosome 2 has 1624 genes
      Chromosome 19 has 1660 genes
      Chromosome 1 has 2611 genes

      I am only interested in output lines of the form "Chromosome <num/X/Y> has <num> genes".
      How can I accomplish that?
      Thanks Ikegami!
      DeepSpace

        Just a quick note, you seem to be placing some garbage in your hash:

        Chromosome has X genes
        Chromosome KLHL13 has X genes
        Chromosome UTY has Y genes
        Chromosome SPIN2B has X genes
        Chromosome PIR has X genes

        The values in the hash should only be counts. Once you fix that bug, you can use

        foreach my $geneid ( sort { $geneids{$a} <=> $geneids{$b} } keys %geneids ) {
            if ($geneid =~ /^(?:[0-9]+|X|Y)\z/) {
                print OUT "Chromosome $geneid has $geneids{$geneid} genes\n";
            }
        }
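
        One possible way to keep the garbage out in the first place is to store the gene-to-chromosome lookup and the counts in two separate hashes. This is only a sketch: %chr_of, %count and the accession placeholders are hypothetical names, and the column layout is assumed from the code in the question.

        ```perl
        use strict;
        use warnings;

        # Hypothetical rows (chromosome, accession, gene).
        my @rows = (
            "X\tacc1\tKLHL13",
            "X\tacc2\tPIR",
            "22\tacc3\tSELO",
        );

        my %chr_of;   # gene name  => chromosome
        my %count;    # chromosome => number of genes

        for my $row (@rows) {
            my ( $chr, undef, $gene ) = split /\t/, $row;
            $chr_of{$gene} = $chr;   # lookup data goes in its own hash
            $count{$chr}++;          # this hash only ever holds counts
        }

        my @chromosomes = sort keys %count;
        print "Chromosome $_ has $count{$_} genes\n" for @chromosomes;
        ```

        With the data split this way, the keys of %count are chromosome names only, so gene names such as KLHL13 never reach the report.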
        Untested: You could add something like
        if ($geneid =~ /^(?:[0-9]+|X|Y)\z/ && $geneids{$geneid} =~ /^[0-9]+\z/) {
            print "...";
        }
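
        Anchoring matters for this kind of filter. The sketch below compares an unanchored alternation with the anchored pattern /^(?:[0-9]+|X|Y)\z/ on a few keys taken from the output earlier in the thread; @keys, @loose and @strict are names made up for the example.

        ```perl
        use strict;
        use warnings;

        # A few hash keys as they appear in the output, plus an empty key.
        my @keys = ( '1', '22', 'X', 'Y', 'KLHL13', 'UTY', '' );

        # Unanchored: matches any key that merely contains a digit, an X or a Y.
        my @loose = grep { /[0-9]+|X|Y/ } @keys;

        # Anchored: the whole key must be a number, X or Y.
        my @strict = grep { /^(?:[0-9]+|X|Y)\z/ } @keys;

        print "unanchored keeps: @loose\n";    # 1 22 X Y KLHL13 UTY
        print "anchored keeps:   @strict\n";   # 1 22 X Y
        ```

        With these inputs the unanchored pattern also keeps KLHL13 (it contains digits) and UTY (it contains a Y), while the anchored one keeps only 1, 22, X and Y.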
Re: how can I print my hash values once?
by roboticus (Chancellor) on Mar 10, 2011 at 13:52 UTC

    IomSpace:

    If the lines are truly duplicates, and you don't care about the ordering, then I think I'd leverage the sort utility to remove duplicates and just open[1] the file with:

    open(my $in, '-|', "sort -u /Users/mydir/Desktop/CCDS.current.txt") or die " Can't open file: $!";

    Then your code can be considerably shorter/simpler.

    Note: [1] From my reading, I think this is how to use an input pipe with the three-argument form of open, but I've not used it before.
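
    For reference, a small self-contained sketch of reading from a pipe with '-|'. It uses the list form of open, which passes the arguments to the command without going through a shell. It assumes a sort utility is available on the PATH, and the temporary file and its contents are invented for the example.

    ```perl
    use strict;
    use warnings;
    use File::Temp qw(tempfile);

    # Write a small hypothetical input file containing a duplicate line.
    my ( $tmp, $path ) = tempfile( UNLINK => 1 );
    print {$tmp} "b\n", "a\n", "b\n";
    close $tmp or die "Can't close $path: $!";

    # '-|' means "read from this command's output"; the list form
    # hands 'sort' its arguments directly, with no shell quoting issues.
    open( my $in, '-|', 'sort', '-u', $path )
        or die "Can't run sort: $!";
    chomp( my @lines = <$in> );
    close $in or die "sort failed: $! (exit status $?)";

    print "$_\n" for @lines;   # a, then b
    ```

    Checking the return value of close on a piped handle is worthwhile, since that is where a failure of the external command shows up.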

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.