jonc has asked for the wisdom of the Perl Monks concerning the following question:

Okay, so I've been trying, for quite a while, to sort an AoH (array of hash references) or AoA (array of array references) by the frequency of duplicates, where only 3 of the 6 elements of each inner array (or values of each hash) count towards being a duplicate.

If that doesn't make sense, I'm trying to use the sort found under the "Simple Aggregate" heading in the Tutorial: The Uniqueness of hashes. But for a data structure.

Here is what I've (pitifully) gotten so far:

my @array1 = (1 .. 20);
my @array2 = (10 .. 30);
my @array3 = (19 .. 40);
push @all_arrays, \@array1, \@array2, \@array3; ##This is like what will be passed in my actual code

my %unique_descriptive = do {
    local %_;
    for (@{$all_arrays[0]}, @{$all_arrays[1]}, @{$all_arrays[2]}) {
        $_{$_}->{count}++ ;
        push @{$_{$_}->{values}},$_;
    }
    %_;
};

I need help with this step. It works as I would like it to (that is, it gives the number of times each number has been seen), but I can only apply it when I know @all_arrays is made up of exactly 3 array references. Also, I doubt that I actually understand what is going on with the de-referencing, so any help on that would be great.

I would like some help in doing what the above code does for an array that I don't know the size of (so I can't just list the de-referenced versions in the for loop). Sort of like the equivalent of:  @{@all_arrays}, which doesn't work.

I've tried looking at Sort::ArrayofArrays, but couldn't implement it. The tutorial says it's easy to apply to 2d data structures but... I failed.

Thanks a lot! Let me know if I need to be more clear.

Next steps (not being asked, but any tips are welcome): Actually sort by the frequency then make it sort by only the elements/values I want.

Replies are listed 'Best First'.
Re: How to (get started on) sort AoA or AoH by frequency
by GrandFather (Saint) on Jun 13, 2011 at 04:36 UTC

    Well, I couldn't say your problem description is crystal clear, but the following may get you started:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Data::Dump;

    srand 0;

    my @aoa = (
        [map {int rand 3} 1 .. 10],
        [map {int rand 6} 1 .. 10],
        [map {int rand 20} 1 .. 10],
    );
    my @aof = map {{asFreq (@$_)}} @aoa;

    print Data::Dump::dump (\@aof);

    sub asFreq {
        my @elements = @_;
        my %freqs;

        ++$freqs{$_} for @elements;
        return %freqs;
    }

    Prints:

    [
      { "0" => 6, "1" => 2, "2" => 2 },
      { "0" => 3, "1" => 2, "2" => 2, "3" => 1, "4" => 1, "5" => 1 },
      {
        "0"  => 1,
        "1"  => 1,
        "5"  => 1,
        "11" => 1,
        "12" => 1,
        "13" => 1,
        "15" => 1,
        "17" => 1,
        "19" => 2,
      },
    ]
    True laziness is hard work

      Sorry for the lack of clarity. I have edited the question to be clearer. I think I caused a misunderstanding the first time.

Re: How to (get started on) sort AoA or AoH by frequency
by Marshall (Canon) on Jun 13, 2011 at 08:03 UTC
    I took your question to mean: how do I make a sorted printout by frequency of the structure that my code builds? First, I found the OP's code to be confusing, so I recoded it.

    It is not possible to sort a hash, but it is possible to sort the keys of the hash into an array. Then use that array of keys to print the hash. Below, I used pp() to assist in the printing.

    In this example, the sub hash is actually not necessary. A HoA would have sufficed because the value of the array evaluated in a scalar context is the "count". Not quite sure what you mean in terms of AoA to sort.
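
    A minimal sketch of the HoA idea, assuming the same three sample rows. The names %values_for and count_of are made up for illustration. Each key holds an array of the values seen, and evaluating that array in scalar context gives the count, so no separate "count" field is needed:

    ```perl
    use strict;
    use warnings;

    my @all_arrays = ([1 .. 20], [10 .. 30], [19 .. 40]);

    # Hypothetical name: maps each number to an array of its occurrences.
    my %values_for;
    push @{ $values_for{$_} }, $_ for map { @$_ } @all_arrays;

    # Hypothetical helper: an array in scalar context yields its length.
    # The exists test avoids creating an entry for numbers never seen.
    sub count_of {
        my ($num) = @_;
        return exists $values_for{$num} ? scalar @{ $values_for{$num} } : 0;
    }

    print "$_ seen ", count_of($_), " time(s)\n" for (1, 10, 20);
    ```

    With the sample rows, 1 appears in one row, 10 in two and 20 in all three.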

    Update: As clarification to the OP, @all_arrays is an array of references to arrays. The map{@$_} takes each array reference and expands it into a list of numbers. So this answers one of your questions: you don't need to know that there are 3 rows. The code below "flattens" the whole structure into one long list of numbers no matter how many rows there are.
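
    As a small illustration of that point, here is a sketch with a made-up helper called flatten. The same map expression handles two rows or four rows, including an empty one, without any change:

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: expand every array reference into its elements.
    sub flatten {
        my @rows = @_;
        return map { @$_ } @rows;
    }

    my @two_rows  = ([1, 2], [3]);
    my @four_rows = ([1], [2, 3], [], [4, 5, 6]);

    print join(',', flatten(@two_rows)),  "\n";
    print join(',', flatten(@four_rows)), "\n";
    ```

    Inside the map block, $_ is one array reference at a time, and @$_ is the list of elements behind it.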

    #!/usr/bin/perl -w
    use strict;
    use Data::Dump qw(pp);

    my @all_arrays = ([1 .. 20], [10 .. 30], [19 .. 40], );

    my %unique_descriptive;

    foreach my $num (map{@$_}@all_arrays)
    {
        $unique_descriptive{$num}{count}++;
        push @{$unique_descriptive{$num}{values}}, $num;
    }
    #print pp(\%unique_descriptive);
    # example for num=10
    #10 => { count => 2, "values" => [10, 10] },

    my @sorted_keys = sort{ $unique_descriptive{$a}{count} <=> $unique_descriptive{$b}{count}
                            or
                            $a <=> $b
                          }keys %unique_descriptive;

    foreach my $key (@sorted_keys)
    {
        printf "%2d=", $key;  #make the print out look nice
        print pp($unique_descriptive{$key}),"\n";
    }
    __END__
    Program output:

    Update: Another set of code - probably is not what OP needs for AoA, but it does demo how to add a column and how to sort a 2-D array by different column positions...
    #!/usr/bin/perl -w
    use strict;
    use 5.010;  #for new //= operator
    use Data::Dump qw(pp);

    my @all_arrays = ([1 .. 20], [10 .. 30], [19 .. 40], );

    my @unique_descriptive;

    foreach my $num (map{@$_}@all_arrays)
    {
        $unique_descriptive[$num]++;  #simple peg counter
    }

    # add a column to the 2-D array with row number
    # undef counts as freq of zero, the //=0 does that
    my $i=0;
    @unique_descriptive = map{[$i++,$_//=0]}@unique_descriptive;

    @unique_descriptive = sort{ $a->[1] <=> $b->[1]  #by freq
                                or
                                $a->[0] <=> $b->[0]  #by peg number
                              }@unique_descriptive;

    foreach my $row (@unique_descriptive)
    {
        print "num = $row->[0] \tfreq=$row->[1]\n" if ($row->[1] > 0);
    }
    AoA output:

      Great! The output (more of the 2nd one) is what I was looking for. I guess I'll include it in the question next time.

      The HoA won't work for me, because in my actual code the "inner" arrays/hashes (the references) in the AoA/AoH each hold 6 strings, not numbers. The "outer" array is a list of all these "sets" of strings (which come from a search-engine type of code and need to be sorted).

      (How do you indent?) (This also means I'm going to create a more complex sorting method, where I sort the "outer" array by certain *values* of the hash, or by certain elements of the "inner" array.)

      Should I include this type of background in my questions?

      Thanks a lot, I'm going to try and understand these codes.

      Thanks for explaining map{@$_}. $_ really screws me up; I take it that one comes from the elements of @all_matches.

      Whew... This is intense.

      I'm sorry, but I've tried, and read some articles on hashes/data structures, but still need some assistance understanding this.

      I get the map statement. But:

      $unique_descriptive{$num}{count}++;
      push @{$unique_descriptive{$num}{values}}, $num;
      Is a little confusing.

      Here's what I got so far: The first statement increments the value of "count"(which I guess is a new key made then and there?). The value is in the HoH %unique_descriptive, at the key: That is the number, which is the element of the de-referenced array being looped through.

      Then the 2nd line is AoHoH?? But that array is never used later? The keys of the most inner hash are the values of something(what?). The end value of this is the number from the loop being pushed in. The 2nd inner hash is at the key of $num. Was the @ in front only necessary b/c push takes list context?

      Then the other problem is:

      my @sorted_keys = sort{ $unique_descriptive{$a}{count} <=> $unique_descriptive{$b}{count}
                              or
                              $a <=> $b
                            }keys %unique_descriptive;

      The numbers are being sorted based on count first (did you know to put $a where it is b/c $num was there before?) And if that is equal, the numbers themselves are compared. The keys are being sorted.

      Sorry for the trouble I'm having with this, I hope I was close/this makes sense to you.

        The first statement increments the value of "count"(which I guess is a new key made then and there?).

        Yes. Perl will "autovivify" a new entry if none already exists. This is pretty cool stuff. In other languages I would have had to size and initialize the structure. In Perl, I can just do that as I go.
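
        A short sketch of autovivification on its own, reusing the %unique_descriptive name and the key 10 purely as an example. Nothing is declared below the top-level hash, yet both the increment and the push work:

        ```perl
        use strict;
        use warnings;

        my %unique_descriptive;    # starts out completely empty

        # Neither the inner hash nor the "count" key exists yet;
        # Perl creates both on first use.
        $unique_descriptive{10}{count}++;
        $unique_descriptive{10}{count}++;

        # Likewise, the array behind "values" springs into existence on the first push.
        push @{ $unique_descriptive{10}{values} }, 10;

        print "count: $unique_descriptive{10}{count}\n";
        print "values: @{ $unique_descriptive{10}{values} }\n";
        ```

        The @{ ... } around the hash element tells Perl to treat that value as an array reference and hand the underlying array to push.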

        But that array is never used later?

        Correct. My code produces exactly the same structure as your code, but in a more understandable way (at least for me!). The "values" really aren't needed (and yes, this is an @array). I just did that because your code did it. The "values" will always be the same as the number key, repeated as many times as the frequency. That could easily be computed. So really all that is needed is a single-dimensional hash instead of two dimensions (see my code later in the thread).

        update:
        I'm not sure about your understanding of sort... Trying to further clarify... $a and $b are two hash keys that sort chooses for us; each time the block runs, they hold some pair of the numbers being sorted. We don't have to be concerned with the algorithm that sort uses, we just have to tell it how to compare $a and $b.
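
        A small sketch of that, using a made-up %count hash rather than the thread's data. The block never decides which pair it sees; it only answers how the current $a and $b compare:

        ```perl
        use strict;
        use warnings;

        # Made-up frequency data for illustration.
        my %count = (apple => 3, pear => 1, fig => 3, kiwi => 2);

        # sort picks the pairs; the block only compares the current $a and $b.
        my @by_count = sort {
            $count{$a} <=> $count{$b}    # lower count first
                or
            $a cmp $b                    # ties broken alphabetically
        } keys %count;

        print "@by_count\n";
        ```

        Here apple and fig share a count of 3, so the second comparison puts apple ahead of fig.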

Re: How to (get started on) sort AoA or AoH by frequency
by jonc (Beadle) on Jun 13, 2011 at 18:12 UTC

    Okay, So without using hashes, here is a sample showing a crude way of getting the sort I need (annotated for clarity):

    #!/usr/bin/perl
    use warnings;
    use strict;
    use Data::Dumper;

    my @results = (["chpt10_2", "sent. 2", "alice", "nsubj", "animals", "protect"],
                   ["chpt12_1", "sent. 54", "bob", "nsubj", "cells", "protect"],
                   ["chpt25_4", "sent. 47", "carol", "nsubj", "plants", "protect"],
                   ["chpt34_1", "sent. 1", "dave", "nsubj", "cells", "protect"],
                   ["chpt35_1", "sent. 2", "eli", "nsubj", "cells", "protect"],
                   ["chpt38_1", "sent. 1", "fred", "nsubj", "animals", "protect"],
                   ["chpt54_1", "sent. 1", "greg", "nsubj", "uticle", "protect"]
                  );

    my @sort_results = sort {lc $a->[4] cmp lc $b->[4]} @results; ##By alphabet of arg1

    my $last_word;
    my $current_word;
    my $word_count;

    $sort_results[-1][6] = 1; ##This weird step is b/c last element didn't get 7th column appended

    for my $j (0 .. $#sort_results) { ##[ROW][COLUMN]
        $current_word = $sort_results[$j][4]; ## current word is arg1 of whichever matchset is being looked at (alphabetical)
        if (lc $last_word eq lc $current_word) {
            $word_count++; ##If seen before, increment freq. count
        }
        else { ##new word
            if ($j != 0) ##unless it's the first row
            {
                for (my $k = 1; $k <= $word_count; $k++) {
                    ##make a new column with freq. Each of the previous seen word
                    ##will have to have the same freq. number so iterate back and
                    ##make them all the same word count
                    $sort_results[($j-$k)][6] = $word_count;
                }
            }
            ##Now set up for next iteration
            $last_word = $current_word;
            $word_count = 1;
        }
    }

    @sort_results = sort {$b->[6] <=> $a->[6]} @sort_results; ##Sort the results by the new 7th freq. column

    for my $i (0 .. $#sort_results) {
        print "$sort_results[$i][0], $sort_results[$i][1]: "; ##chptnum, sent num
        print "$sort_results[$i][2]\n\n"; ##sentence
        print "gramatical relation: $sort_results[$i][3]; argument: $sort_results[$i][4]; freq: $sort_results[$i][6]\n\n\n"; ##dependency args
    }

    I would appreciate either a new, better way to do this (I think hashes are the way to get it done), or just an improvement on this crude code. Thanks again for all your help!

      See attached code. I used the map trick again: foreach ( map{$_->[4]}@results ) iterates over all of the contents of column 4, and a freq hash is built. A list of references to rows is what goes into the map. The map then de-references and transforms this so that the output is a list of the column 4 contents of every row.

      The way sort works (output <--- sort{...} <--- input) is that what goes in is what comes out. What is coming in are references to rows of the @results array. What sort needs is a way to compare 2 rows: row A < row B, row A equal to row B, or row A > row B. The function that provides the comparison can be anything that you want as long as it produces a consistent result (reverses the answer if a and b are reversed).

      So I look up the value of col 4 for, say, row A, then I ask the frequency hash what the frequency is, and I compare that result with the same computation for row B. In the case of a tie, I use an alphabetic comparison of column 0. Note that I reversed a and b to get the highest frequency first, while column 0 is still sorted lowest first.

      The way that the sort decider function is written may appear a bit strange, but it is just returning a: -1, 0 or 1 depending upon how row A and row B compare.
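
      To make those return values visible, here is a sketch with a made-up routine, compare_rows, and made-up two-column rows. Because it is an ordinary sub using return, the tie-breaker is chained with || rather than or (with or, the return would happen before the second comparison was ever looked at):

      ```perl
      use strict;
      use warnings;

      # Hypothetical comparison routine: numeric compare on column 1 first,
      # string compare on column 0 when the numbers tie.
      sub compare_rows {
          my ($x, $y) = @_;
          return $x->[1] <=> $y->[1] || $x->[0] cmp $y->[0];
      }

      my @first  = ('alpha', 2);
      my @second = ('beta',  2);
      my @third  = ('gamma', 7);

      print compare_rows(\@first, \@third),  "\n";   # numeric part decides
      print compare_rows(\@third, \@first),  "\n";   # reversed arguments, reversed sign
      print compare_rows(\@first, \@second), "\n";   # tie on number, strings decide
      print compare_rows(\@first, \@first),  "\n";   # identical rows
      ```

      A zero from the first comparison is false, which is what lets the second comparison take over on ties.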

      It is completely legal to assign the sorted result set back to the input variable and I did that. To get your printout, just do the column 4 look up in the freq hash to get frequency. The order of my @results agrees with the order of your output.

      For printing, of course you can access each element as a 2-D coordinate, but usually better is to iterate over the rows with row reference like this:

      foreach my $row (@results)
      {
          print "$row->[0] $row->[1]\n";
      }
      I think the following code does what you want...
      #!/usr/bin/perl
      use warnings;
      use strict;
      use Data::Dumper;
      use Data::Dump qw(pp);

      my @results = (["chpt10_2", "sent. 2", "alice", "nsubj", "animals", "protect"],
                     ["chpt12_1", "sent. 54", "bob", "nsubj", "cells", "protect"],
                     ["chpt25_4", "sent. 47", "carol", "nsubj", "plants", "protect"],
                     ["chpt34_1", "sent. 1", "dave", "nsubj", "cells", "protect"],
                     ["chpt35_1", "sent. 2", "eli", "nsubj", "cells", "protect"],
                     ["chpt38_1", "sent. 1", "fred", "nsubj", "animals", "protect"],
                     ["chpt54_1", "sent. 1", "greg", "nsubj", "uticle", "protect"]
                    );

      my %freq;
      foreach ( map{$_->[4]}@results)  #feeds in list of animals, cells, uticle, etc.
      {
          $freq{lc $_}++;
      }

      @results = sort {$freq{lc $b->[4]} <=> $freq{lc $a->[4]}  #freq order
                       or
                       $a->[0] cmp $b->[0]                     #text col 0
                      } @results;

      print pp(\@results);
      __END__
      [
        ["chpt12_1", "sent. 54", "bob", "nsubj", "cells", "protect"],
        ["chpt34_1", "sent. 1", "dave", "nsubj", "cells", "protect"],
        ["chpt35_1", "sent. 2", "eli", "nsubj", "cells", "protect"],
        ["chpt10_2", "sent. 2", "alice", "nsubj", "animals", "protect"],
        ["chpt38_1", "sent. 1", "fred", "nsubj", "animals", "protect"],
        ["chpt25_4", "sent. 47", "carol", "nsubj", "plants", "protect"],
        ["chpt54_1", "sent. 1", "greg", "nsubj", "uticle", "protect"],
      ]

        Marshall,

        You are awesome! This is more than I ever could have asked for, you really helped me understand sorting and the power of hashes. I will try and award you whatever I can, since you've helped me out so much (when I get this rumoured vote fairy). Thanks for your time!

        p.s. I don't think you need to lc the 1st line in sort, since it's numbers... the 2nd line would need it
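
        A quick way to check that p.s., sketched with two made-up rows whose column 4 differs only in case. In the first sort line, lc is applied to the column 4 string before it is used as the hash key, not to the number that comes back, and %freq was built with lower-cased keys:

        ```perl
        use strict;
        use warnings;

        # Made-up rows; column 4 differs only in case.
        my @rows = (['r1', '', '', '', 'Cells'], ['r2', '', '', '', 'cells']);

        my %freq;
        $freq{lc $_->[4]}++ for @rows;    # keys are stored lower-cased

        my $with_lc    = $freq{lc $rows[0][4]};    # looks up "cells"
        my $without_lc = $freq{$rows[0][4]};       # looks up "Cells", which is absent

        print "with lc: $with_lc\n";
        print "without lc: ", defined $without_lc ? $without_lc : 'undef', "\n";
        ```

        So with data that is already all lower case the lc makes no difference, but with mixed case the lookup without it would come back undefined.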