There are a few gotchas in your code... let me modify it like this:
use strict;
use warnings;
my %GeneCount = ();
#open the textfile GeneType.txt
open (GENETYPE, "GeneType.txt") or die "Could not open file: '$!'";
my $header = <GENETYPE>; # read the header before entering the loop
while (<GENETYPE>) {
    chomp;
    my ($GeneName, $GeneType) = split (/\t/, $_);
    $GeneCount{$GeneType}++;
}
for my $type (sort keys %GeneCount) {
    print "$type: $GeneCount{$type}\n";
}
So what did I change?
- I started the program with use strict; and use warnings; which is a good habit and will save a lot of time in the long run. The only downside is that I now have to declare my %GeneCount = () before using it.
- In the open statement I included the reason for the failure ($!) in the error message. I could also have switched to the three-parameter form of open and a lexical file handle, but I let that pass, because your code is correct (just slightly out of fashion).
- Instead of removing the header on every iteration of the loop, I read the header once, before entering the loop.
- I added chomp, which removes the newline that would otherwise be at the end of every gene type you read.
- Most important for your logic: I changed the hash so that the types are the keys and the counts are the values.
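I seem to recall that older versions of Perl (I'm using 5.28) issued some warnings about an uninitialized $GeneCount{pseudogene}. As far as I can tell from perlsyn, ++ on an undefined value is exempt from that warning, so this should rarely be needed. If you do see such warnings, you can add the line no warnings "uninitialized"; before entering the loop.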
And that's it. The rest is just typing out the collected values.
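For completeness, here is a minimal sketch of the three-parameter open with a lexical file handle mentioned above. So that it runs without GeneType.txt being present, it reads from an in-memory string; the sample rows (geneA, geneB, geneC and their types) are made up for illustration and are not from your file.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for GeneType.txt
# (tab-separated, with a header line).
my $sample = "GeneName\tGeneType\n"
           . "geneA\tprotein_coding\n"
           . "geneB\tpseudogene\n"
           . "geneC\tprotein_coding\n";

# Three-argument open with a lexical file handle. For the real file:
#   open(my $fh, '<', 'GeneType.txt') or die "Could not open file: '$!'";
open(my $fh, '<', \$sample) or die "Could not open in-memory file: '$!'";

my $header = <$fh>;    # read the header before entering the loop
my %GeneCount;
while (my $line = <$fh>) {
    chomp $line;
    my ($GeneName, $GeneType) = split /\t/, $line;
    $GeneCount{$GeneType}++;
}
close $fh;

for my $type (sort keys %GeneCount) {
    print "$type: $GeneCount{$type}\n";
}
```

The lexical handle $fh goes out of scope like any other my variable, and the explicit '<' mode means a file name with odd leading characters cannot be misread as a mode.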
If you are a beginner in Perl, you might also check out the books listed at https://learn.perl.org/books/. They are fun to read.
That works perfectly, thanks!
The link provided by hippo seems a good place to start. The correspondence to your case can be seen by comparing these two lines:
($GeneName, $GeneType)= split (/\t/, $_); # your program
my ($ip, $size) = split /:/; # the other program
Once you practice building and searching the hash, consider this:
- In a hash the most efficient search is by its keys. If you need to search by value, consider redesigning your hash so that the values become the keys (of course this is not always possible, because the keys of a hash must be unique).
- In a situation where a key can be associated with multiple values (and must absolutely be used as a key, i.e. the hash can't be redesigned), we can use an array to hold all the values. Like: $hash{akey} = ['v1', 'v2', 'v3']; or even another hash like $hash{akey} = {'k1'=>['v1k1','v2k1'], 'k2' => ['v1k2','v2k2']};. There is also the possibility of arrays-of-hashes, arrays-of-arrays and so on. With nested data structures the possibilities are endless.
I mention this because in your case I think the key should be the gene type (and not the gene name) and the value should be an array of gene names.
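A minimal sketch of that layout, with the gene type as key and an array of gene names as value. The rows here are invented sample data standing in for what split /\t/ would give you, and %GenesByType is just a name I picked for illustration.

```perl
use strict;
use warnings;

# Hypothetical (gene name, gene type) pairs, as they might come out of split.
my @rows = (
    [ 'geneA', 'protein_coding' ],
    [ 'geneB', 'pseudogene'     ],
    [ 'geneC', 'protein_coding' ],
);

my %GenesByType;
for my $row (@rows) {
    my ($GeneName, $GeneType) = @$row;
    # Autovivification creates the array reference on first use.
    push @{ $GenesByType{$GeneType} }, $GeneName;
}

for my $type (sort keys %GenesByType) {
    my @names = @{ $GenesByType{$type} };
    printf "%s (%d): %s\n", $type, scalar @names, join(', ', @names);
}
```

With this structure the count per type is simply the length of each array, so you get both the names and the counts from one hash.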
Also, my $scalar = delete $GeneHash{GeneName}; will remove the gene name you just added to your hash! You will end up with nothing. You probably wanted to skip the first line of the file. Do that just once, i.e. before the loop, like open (GENETYPE, "GeneType.txt") or die "Could not open file"; my $header = <GENETYPE>; which reads the first line of the file and saves it in that variable, so the loop never sees it.
Finally, shouldn't print (each %GeneHash); be outside the loop that reads the file and builds the hash? This would do just fine: while( my ($k,$v) = each %GeneHash ){ print "$k=>$v\n"; }
Having said all this, and while you work through it to gain some experience, I want to mention the existence of BioPerl, which is designed specifically for bioinformatics and handles tasks like yours pretty well. Here is something relevant: https://bioperl.org/howtos/Beginners_HOWTO.html#item19 . But even with BioPerl you will need to know your hashes.
bw, bliako
my $count = grep /$specific/, values(%GeneHash);
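A small sketch of how that count behaves, assuming %GeneHash maps gene names to gene types. The three entries below are made up for illustration. Note that a regex match also counts values that merely contain the pattern, so an exact comparison with eq may be what you want.

```perl
use strict;
use warnings;

# Hypothetical hash: gene names as keys, gene types as values.
my %GeneHash = (
    geneA => 'protein_coding',
    geneB => 'pseudogene',
    geneC => 'processed_pseudogene',
);
my $specific = 'pseudogene';

# In scalar context grep returns the number of matching elements.
# Regex match: counts every value that contains the pattern anywhere.
# (In the expression form, grep needs a comma after the pattern.)
my $regex_count = grep { /\Q$specific\E/ } values %GeneHash;

# Exact match: counts only values equal to the string.
my $exact_count = grep { $_ eq $specific } values %GeneHash;

print "regex: $regex_count, exact: $exact_count\n";
```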