Tallying appearance of a unique string from hash keys

jack_j has asked for the wisdom of the Perl Monks concerning the following question:

Re: Tallying appearance of a unique string from hash keys by tilly (Archbishop) on Mar 26, 2009 at 05:42 UTC
Let me be sure I understand the question. We have a file with lines. Each line contains an ID followed by a tab followed by another ID. You want to know how many times each ID appears in the file; that's called the degree. And then you want to summarize how many times each degree appears. If so then this should solve the problem:

```perl
#! /local/bin/perl
use strict;
use warnings;

my %degree;
my $filename = "edges.txt";
open(my $fh, "<", $filename) or die "Can't open '$filename': $!";
while (<$fh>) {
    if (/(\S+)\t(\S+)/) {
        $degree{$1}++;
        $degree{$2}++;
    }
}

my %degree_distribution;
$degree_distribution{$_}++ for values %degree;

for my $id (sort keys %degree) {
    my $d = $degree{$id};
    my $freq = $degree_distribution{$d};
    print "$id has degree:\t$d\t(freq: $freq)\n";
}
```

If this is not the question you are asking, please clarify your question. Showing us a small example would be best. For example, give us 5-20 lines of input and what output you'd expect from that.
Re^2: Tallying appearance of a unique string from hash keys by jack_j (Initiate) on Mar 27, 2009 at 19:32 UTC
Hello, So I've progressed in my script (for a newbie), but still have some gaps that I could use some help on if anyone is able to help.

```perl
#! /local/bin/perl
use strict;
use warnings;

#Declare hash to pull IDs and corresponding degrees into from file
my %degree;
my %newdegree;
my @geneID_1;
my @geneID_2;

my $filename = "edges.txt";   #Set a variable for our file name
open(my $fh, "<", $filename) or die "Can't open file $filename.";   #Open the file edges.txt
while (<$fh>) {
    if ($_ =~ m/(\S+)\t(\S+)/) {   #Match the IDs in the file to $1 and $2
        $degree{$1}++;             #Count the appearance of each ID, and store this as
        $degree{$2}++;             #the value for that key (this will be the degree)
        push (@geneID_1, $1);
        push (@geneID_2, $2);
    }
}
close $fh;   #Close the file edges.txt

#Calling the following subroutines
DegreeDistribution();
RandomSequence();
DegreeRan();

#Subroutines can be found below
sub DegreeDistribution {   #Determines the degree distribution (i.e. frequency of each degree)
    my %degree_distribution;
    $degree_distribution{$_}++ for values %degree;   #Creates a hash with the keys as the degrees and the values as the frequencies
    for my $id (keys %degree) {                 #This corresponds each below to its key
        my $d = $degree{$id};                   #This is the degree value corresponding to its gene ID
        my $freq = $degree_distribution{$d};    #This is the frequency value corresponding to its gene ID???
        print "$id has degree:\t$d\t(freq: $freq)\n";
    }
}

sub RandomSequence {   #Generates a hash of random ID interactions and each IDs degree
    my $length = scalar (@geneID_1);
    for (my $i=0; $i<$length; $i++){
        my $ID = int(rand($length));
        my $new_id1 = $geneID_1[$ID];
        my $new_id2 = $geneID_2[$ID];
        $newdegree{$new_id1}++;
        $newdegree{$new_id2}++;
    }
}

sub DegreeRan {   #same as sub DegreeDistribution but for the random sequence
    my %degree_distribution;
    $degree_distribution{$_}++ for values %newdegree;   #Creates a hash with the keys as the degrees and the values as the frequencies
    for my $id (keys %newdegree) {              #This corresponds each below to its key
        my $d = $newdegree{$id};                #This is the degree value corresponding to its gene ID
        my $freq = $degree_distribution{$d};    #This is the frequency value corresponding to its gene ID???
        print "$id has degree:\t$d\t(freq: $freq)\n";
    }
}

exit;
```

I have a few concerns:

1. I want to create three random networks, not just one. I'm not sure how to make my subroutine so that it produces a different hash each time (i.e. named differently) that I can return to the main script. The return function didn't seem to work properly; it only returns one key-value pair, not the whole list.
2. Similar to one, I want to run each hash through the DegreeDistribution subroutine, because making a separate subroutine for each hash defeats the purpose of a subroutine.
3. Probably also very related, I want to return the final values to the program so I can use them all (from the original dataset and the three random datasets) in an Excel file.

Thank you to anyone that responds.
Re^3: Tallying appearance of a unique string from hash keys by tilly (Archbishop) on Mar 28, 2009 at 03:15 UTC
First of all how to pass a hash:

```perl
# Populate a hash.
my %result = some_function(%data);

sub some_function {
    my %passed = @_;
    my %to_return;
    # Do stuff with %passed here and populate %to_return
    return %to_return;
}
```

With this you can do things like this:

```perl
my %distribution = degree_distribution();
my %random_distribution_1 = random_distribution();
my %random_distribution_2 = random_distribution();
my %random_distribution_3 = random_distribution();

output(%distribution);
output(%random_distribution_1);
output(%random_distribution_2);
output(%random_distribution_3);
```

and so on. (I'm not suggesting that those be actual functions you use, but that gives you an idea.) Before long I predict that having to repetitively work with 3 random distributions will get very old. That's where you'll want to work with more complex data structures. For that read references quick reference and come back if you have any questions.
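For readers wondering what the "more complex data structures" step might look like, here is a minimal sketch using hash references, so that one subroutine can serve every network. The degree values and the names `%networks` and `%distribution_for` are made up for illustration; in a real script the degree hashes would come from the input file and from each randomized network.

```perl
use strict;
use warnings;

# Takes a reference to a degree hash (ID => degree) and returns a
# reference to a new hash (degree => number of IDs with that degree).
sub degree_distribution {
    my ($degree_ref) = @_;
    my %distribution;
    $distribution{$_}++ for values %$degree_ref;
    return \%distribution;
}

# Hypothetical degree hashes, keyed by a network name.
my %networks = (
    original => { g1 => 2, g2 => 1, g3 => 2, g4 => 1 },
    random_1 => { g1 => 1, g2 => 1, g3 => 3, g4 => 1 },
    random_2 => { g1 => 2, g2 => 2, g3 => 1, g4 => 1 },
    random_3 => { g1 => 3, g2 => 1, g3 => 1, g4 => 1 },
);

# The same subroutine handles every network; results are kept by name.
my %distribution_for;
for my $name (sort keys %networks) {
    $distribution_for{$name} = degree_distribution($networks{$name});
}

# Print each distribution with degrees in ascending numeric order.
for my $name (sort keys %distribution_for) {
    my $dist = $distribution_for{$name};
    for my $d (sort { $a <=> $b } keys %$dist) {
        print "$name\tdegree $d\tfreq $dist->{$d}\n";
    }
}
```

Because each result is stored under a name rather than in its own variable, adding a fourth or fifth random network only means adding another entry to `%networks`.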
Re^4: Tallying appearance of a unique string from hash keys by jack_j (Initiate) on Mar 28, 2009 at 19:17 UTC
Re^5: Tallying appearance of a unique string from hash keys by ig (Vicar) on Mar 28, 2009 at 20:45 UTC
Re^2: Tallying appearance of a unique string from hash keys by jack_j (Initiate) on Mar 26, 2009 at 06:26 UTC
Thank you very much! Let me just make sure it's doing what I want it to (being such a large dataset I can't confirm the output manually). The degree is the number of times each ID appears. The frequency is the number of times that degree appears in the dataset. I have another aspect to this I was wondering if you guys could help me out with. I have to create three random networks with the same number of IDs and interactions (but since they are random, the actual interactions will be different). I also need to do the same degree and frequency calculations (to compare to this network); a subroutine would be most useful. Lastly, I need to export this all to an Excel file using the Spreadsheet::WriteExcel module (with headers of the separate columns being the degrees (data in ascending order) and the corresponding degree frequencies for the given network and each of the three randomized networks). Thank you in advance to anyone that can help!
Re^3: Tallying appearance of a unique string from hash keys by tilly (Archbishop) on Mar 26, 2009 at 07:40 UTC
That is exactly what that snippet of code does. As for the rest of it, nuh uh, that isn't how it works. This is a place for people to learn about programming, not a place for people to get their projects done by others. If I think that I can give you a little help and you can try the rest, I may donate a little effort. If you want to demand a detailed spec and show no evidence of putting any of your own effort in, that sounds like real work. I'm perfectly willing to discuss my hourly rate with you, but I won't contribute anything more for free.
Re^4: Tallying appearance of a unique string from hash keys by jack_j (Initiate) on Mar 26, 2009 at 08:15 UTC
Re^3: Tallying appearance of a unique string from hash keys by planetscape (Chancellor) on Mar 26, 2009 at 07:28 UTC
PerlMonks is not a code-writing service. You must at least make some minimal effort on your own. Start by trying to write your own subroutine; if you get stuck, feel free to come back and ask more specific questions, and show your code! HTH, planetscape
Re^4: Tallying appearance of a unique string from hash keys by Anonymous Monk on Mar 27, 2009 at 18:29 UTC
Re: Tallying appearance of a unique string from hash keys by ELISHEVA (Prior) on Mar 26, 2009 at 05:13 UTC
Is the ID or the number of interactions of that ID called the "degree"? If ID=degree, this is quite a complicated problem. kyle's solution counts interactions and will get you part of the way. But counting occurrences of the ID from pairs is much more complicated. To see why consider these two star graphs. In graph I, node A has 1 occurrence and 4 interactions. In graph II, it has two occurrences and 8 interactions.

```
Graph I
-------
     X1
     |
X4 - A - X2
     |
     X3

Pairs:
X1,A
X2,A
X3,A
X4,A

Graph II
--------
     X1            Y1
     |             |
X4 - A - X2 - Y4 - A - Y2
     |             |
     X3            Y3

Pairs:
X1,A  Y1,A  X2,Y4
X2,A  Y2,A
X3,A  Y3,A
X4,A  Y4,A
```

Distinguishing between multiple paths to the same occurrence and multiple paths to two occurrences requires analysis of entire paths, not just pairs. What makes Graph II have 2 occurrences of A is the fact that there is no way to start at X1 and reach the edge nodes Y1, Y2, or Y3 without passing through a second occurrence of A.

Best, beth

Update: made explicit my original (mis?)reading of the question.
Re: Tallying appearance of a unique string from hash keys by kyle (Abbot) on Mar 26, 2009 at 04:33 UTC
If we take `%hash` as a given and the code is immutable, I think you can get statistics this way.

```perl
use English '-no_match_vars';

my %appearances_of;
$appearances_of{$_}++
    for map { split /\Q$SUBSCRIPT_SEPARATOR/ } keys %hash;
```

Easier than that, however, is to collect the numbers as you're collecting `%hash`.

```perl
$hash{$1,$3} = "$holder";

# add these lines:
$appearances_of{$1}++;
$appearances_of{$3}++;
```

Update: Come to think of it, those don't do the same thing at all. Now I'm not sure I answered the question asked in either case.
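A small self-contained sketch may help show why the two approaches can give different numbers. The pairs below are made-up sample data, and `%per_key` and `%per_line` are illustrative names; the assumption is that the original `%hash` was built with the `$hash{$a,$b}` form, whose keys are joined by the subscript separator `$;`.

```perl
use strict;
use warnings;

# Made-up sample pairs; the first pair appears twice on purpose.
my @pairs = ( [qw(g1 g2)], [qw(g1 g2)], [qw(g1 g3)], [qw(g2 g3)] );

my $sep = $;;    # same value as $SUBSCRIPT_SEPARATOR from English
my (%hash, %per_line);
for my $pair (@pairs) {
    my ($id1, $id2) = @$pair;
    $hash{$id1, $id2} = 1;    # a repeated pair overwrites the same key
    $per_line{$id1}++;        # counted once per input pair
    $per_line{$id2}++;
}

# Tally from the finished hash: each distinct key is counted once.
my %per_key;
$per_key{$_}++ for map { split /\Q$sep\E/ } keys %hash;

for my $id (sort keys %per_key) {
    print "$id: $per_key{$id} from keys, $per_line{$id} while collecting\n";
}
```

With this sample, g1 is counted 2 times from the keys but 3 times while collecting, because the repeated pair collapses into a single hash key. Which count is the right one depends on whether repeated lines should count as separate interactions.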
Re^2: Tallying appearance of a unique string from hash keys by jack_j (Initiate) on Mar 28, 2009 at 19:24 UTC
Hello, I've discovered that my random hashes based on my above code are not actually random, but are biased based on the original network. What I should do is outlined in my pseudocode, some elements of which I am having trouble with:

```
#read IDs into %edges using match operator /(\S+)(\t)(\S+)/ and special variables $1, $3 for each ID per line
#assign IDs as format $edge{ID1,ID2}
#skip edge assignment if ID2,ID1 already exists .. the network needs to be undirected such that ID2-ID1 is equivalent to ID1-ID2 and should therefore not be counted twice
#populate hash of unique IDs from %edges
#create array of unique IDs from hash of unique IDs .... @uniqIDs, $uniqIDs[0]=ID1, $uniqIDs[1]=ID2 etc .. how do i do this?
#initialize degree counter hash (IDs are keys, values are degrees) using @uniqIDs with foreach
#go through %edges and increment %counter for each ID
#generate random network with int(rand(scalar(@uniqIDs))), discard random picks if they already exist or represent undirected equivalent
#initialize random network degree counter hash, and then count random network degrees as before
```

I'm already stuck at how to skip an edge assignment if $2,$1 exists. Then, how can I populate a new hash of unique IDs? Again, the problem of excluding something if it already exists.

```perl
my $filename = "edges.txt";   #Set a variable for our file name
open(my $fh, "<", $filename) or die "Can't open file $filename.";   #Open the file edges.txt
while (<$fh>) {
    if ($_ =~ m/(\S+)\t(\S+)/) {   #Match the IDs in the file to $1 and $2
        $edge{$1,$2}= $holder;     #assign IDs as format $edge{ID1,ID2}
    }
}
close $fh;   #Close the file edges.txt

my @list = keys %edge;
print "@list\n";   #Prints the list of keys, but they are uni-directional (i.e. repeated for ID1-ID2 and ID2-ID1)
```

Once I have these steps, I can proceed to counting and using the unique IDs list to create my random networks.
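One possible way to handle the two sticking points above (reversed duplicates and the unique ID list) is sketched below; it is a rough illustration under stated assumptions, not a definitive answer. The input is in-memory sample data standing in for edges.txt, `count_degrees` and `random_network` are hypothetical helper names, and rejecting self-pairs (an ID paired with itself) is an assumption about what the random network should allow. Picking pairs uniformly at random keeps the number of IDs and edges the same but does not preserve the original degrees.

```perl
use strict;
use warnings;

# Hypothetical sample input standing in for edges.txt; the second line
# is the reverse of the first.
my $data = "g1\tg2\ng2\tg1\ng1\tg3\ng3\tg4\n";

my $sep = $;;    # subscript separator used by $edge{$a,$b} keys
my (%edge, %is_id);

open(my $fh, '<', \$data) or die "Can't read sample data: $!";
while (my $line = <$fh>) {
    next unless $line =~ /^(\S+)\t(\S+)/;
    # Sorting the pair gives ID1-ID2 and ID2-ID1 the same key, so the
    # reversed pair overwrites the existing entry instead of adding one.
    my ($id1, $id2) = sort { $a cmp $b } ($1, $2);
    $edge{$id1, $id2} = 1;
    $is_id{$_} = 1 for $id1, $id2;
}
close $fh;

# Hash keys are unique by construction, so the unique ID list is just
# the keys of %is_id.
my @uniq_ids = sort keys %is_id;

# Count degrees from a hash of edges; returns a hash reference.
sub count_degrees {
    my ($edge_ref, $ids_ref) = @_;
    my %degree = map { $_ => 0 } @$ids_ref;
    for my $key (keys %$edge_ref) {
        $degree{$_}++ for split /\Q$sep\E/, $key;
    }
    return \%degree;
}

# Hypothetical helper: draw $n_edges distinct undirected edges uniformly
# at random over the given IDs. Self-pairs are rejected (an assumption).
sub random_network {
    my ($ids_ref, $n_edges) = @_;
    my $n = scalar @$ids_ref;
    die "Too many edges requested" if $n_edges > $n * ($n - 1) / 2;
    my %random_edge;
    while (scalar(keys %random_edge) < $n_edges) {
        my ($id1, $id2) = sort { $a cmp $b }
            map { $ids_ref->[ int(rand($n)) ] } 1 .. 2;
        next if $id1 eq $id2;
        $random_edge{$id1, $id2} = 1;    # a repeat pick changes nothing
    }
    return \%random_edge;
}

my $degree      = count_degrees(\%edge, \@uniq_ids);
my $random_edge = random_network(\@uniq_ids, scalar(keys %edge));
my $random_deg  = count_degrees($random_edge, \@uniq_ids);

print "$_\t$degree->{$_}\t$random_deg->{$_}\n" for @uniq_ids;
```

The same "let the hash key do the deduplication" idea covers both questions: a sorted pair as the key removes reversed edges, and a plain ID as the key removes repeated IDs. Calling `random_network` three times would give three independent random networks to compare against the original.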