Re: sorting arrays with common index

shebang:

You should really invest a little time to get used to hashes: The parallel-array approach is one way you can get the flexibility of a hash, but it's brittle and prone to errors. The hash syntax is just a little trickier than parallel arrays, but once you're used to it, you'll find it simpler overall. To illustrate a little of where I'm coming from, here's a simple program that will print a sorted list of people by age using parallel arrays *and* the same data in an array of hashes:

$ cat pm_1198871_a.pl
use strict;
use warnings;

# Parallel arrays:
my @first_name = ('Joe', 'Bob', 'Mary', 'Sue');
my @last_name = ('Smith', 'Jones', 'Blige', 'Parker');
my @age = (25, 43, 19, 57);

# An array of hashes:
my @people = (
    { first=>'Joe',  last=>'Smith',  age=>25 },
    { first=>'Bob',  last=>'Jones',  age=>43 },
    { first=>'Mary', last=>'Blige',  age=>19 },
    { first=>'Sue',  last=>'Parker', age=>57 },
);

# Parallel arrays: Make a sort-by-age list of indices:
my @indices = sort { $age[$a] <=> $age[$b] }  0 .. $#first_name;

print "Using parallel arrays, sorted by age\n";
for my $i (@indices) {
    print "$first_name[$i] $last_name[$i] $age[$i]\n";
}

# Array of hashes: Make a sort-by-age list of indices:
@indices = sort { $people[$a]{age} <=> $people[$b]{age} } 0 .. $#people;

print "\nUsing a hash, sorted by age\n";
for my $i (@indices) {
    print "$people[$i]{first} $people[$i]{last} $people[$i]{age}\n";
}

$ perl pm_1198871_a.pl
Using parallel arrays, sorted by age
Mary Blige 19
Joe Smith 25
Bob Jones 43
Sue Parker 57

Using a hash, sorted by age
Mary Blige 19
Joe Smith 25
Bob Jones 43
Sue Parker 57

In both cases, I just created a sorted list of the array indexes containing the data and printed the report from it. As you can see, the code is very similar in structure. I find that the data in the hash section is a lot easier to read because the related items are right next to each other. The sort statement is a little bit simpler in the parallel array section than the array of hashes, but that's an illusion!

There are several reasons that the simplicity of parallel arrays is an illusion. First, we just looked at a very simple case where we wanted to print out the data as a sorted report. But what happens if we really want to sort the data? Let's modify our code to put the data in the actual order we want:

# Parallel arrays: Sort our data by age
my @indices = sort { $age[$a] <=> $age[$b] }  0 .. $#first_name;
@first_name = @first_name[@indices];
@last_name  = @last_name[@indices];
@age        = @age[@indices];

print "Using parallel arrays, sorted by age\n";
for my $i (0 .. $#first_name) {
    print "$first_name[$i] $last_name[$i] $age[$i]\n";
}

In the parallel array version, we still resort to using a list of indices to sort on, then we have to rearrange all the parallel arrays. Immediately the code gets a bit longer. On the other hand, if we're sorting the array of hashes, we don't need to remember a list of indices: we can sort all the data in one step rather than four:

# Array of hashes: Sort our data by age
@people = sort { $a->{age} <=> $b->{age} } @people;

print "\nUsing a hash, sorted by age\n";
for my $hr (@people) {
    print "$hr->{first} $hr->{last} $hr->{age}\n";
}

Note that the code became smaller rather than larger. We don't need a list of indices, because we don't have to try to map the changes over multiple data structures. Instead, sort can directly rearrange the array for us.
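
As a rough sketch of how far that one-step sort stretches, here is the same array of hashes sorted by last name, with age as a tie-breaker. The data mirrors the records above, except for the second 'Jones' record (Ann), which is made up purely so the tie-breaker has something to do:

```perl
use strict;
use warnings;

my @people = (
    { first=>'Joe',  last=>'Smith',  age=>25 },
    { first=>'Bob',  last=>'Jones',  age=>43 },
    { first=>'Mary', last=>'Blige',  age=>19 },
    { first=>'Sue',  last=>'Parker', age=>57 },
    { first=>'Ann',  last=>'Jones',  age=>31 },   # hypothetical extra record
);

# Sort by last name (string compare); when the last names match,
# fall back to age (numeric compare).
my @by_name = sort {
    $a->{last} cmp $b->{last}
        or
    $a->{age} <=> $b->{age}
} @people;

for my $hr (@by_name) {
    print "$hr->{last}, $hr->{first} ($hr->{age})\n";
}
```

Changing the sort criteria only touches the sort block. With parallel arrays you would build the index list from two arrays and then re-slice every array again.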

This post is already going long and I'm getting short on time, so I'll be brief on the other reasons that the simplicity is an illusion:

Unstated Assumptions

Any time you have to manage multiple data structures in concert, you have to remember to do the appropriate actions *everywhere* relevant. The parallel array technique relies on two unstated assumptions:

All arrays are the same size. This way, you can use any of the arrays to generate a list of indexes for sorting.
All arrays are changed the same way in every location. Any time you change data (add/delete/replace) you must verify that you make the appropriate changes for all arrays.
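
To make those two assumptions concrete, here is a minimal sketch that removes one person (Bob) from both representations. The sanity check on the array sizes is my own addition for illustration, not something the parallel-array approach gives you for free:

```perl
use strict;
use warnings;

my @first_name = ('Joe', 'Bob', 'Mary', 'Sue');
my @last_name  = ('Smith', 'Jones', 'Blige', 'Parker');
my @age        = (25, 43, 19, 57);

my @people = (
    { first=>'Joe',  last=>'Smith',  age=>25 },
    { first=>'Bob',  last=>'Jones',  age=>43 },
    { first=>'Mary', last=>'Blige',  age=>19 },
    { first=>'Sue',  last=>'Parker', age=>57 },
);

# Parallel arrays: remove Bob (index 1). Every array needs the same splice;
# forgetting one silently shifts that column relative to the others.
my $i = 1;
splice @first_name, $i, 1;
splice @last_name,  $i, 1;
splice @age,        $i, 1;

# A cheap check of the first assumption (all arrays are the same size).
die "parallel arrays out of sync\n"
    unless @first_name == @last_name && @last_name == @age;

# Array of hashes: one operation removes the whole record.
@people = grep { $_->{first} ne 'Bob' } @people;

print "$first_name[$_] $last_name[$_] $age[$_]\n" for 0 .. $#first_name;
print "$_->{first} $_->{last} $_->{age}\n" for @people;
```

Note that the size check only catches a forgotten splice. It can't tell you if two arrays were spliced at different positions, which is the nastier failure.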

In small programs it's not a problem, but programs have a habit of becoming large. Suppose you wanted to add a person's favorite color to your data. For the hash version, there's no problem--each slot in your array is a bundle of data for a particular person. Adding a favorite color just adds a little information to a specific person. In the parallel array version though, you must add the favorite color array, and then go through your program and find every location where you're changing one of your parallel arrays and ensure that you perform the proper operations on your favorite color array. A mistake anywhere could cause your data to become mismatched and useless.
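
Here's a small sketch of the favorite color case for the array of hashes. The colors and the %color_for lookup are invented for the example:

```perl
use strict;
use warnings;

my @people = (
    { first=>'Joe',  last=>'Smith',  age=>25 },
    { first=>'Bob',  last=>'Jones',  age=>43 },
    { first=>'Mary', last=>'Blige',  age=>19 },
    { first=>'Sue',  last=>'Parker', age=>57 },
);

# Hypothetical favorite colors, keyed by first name, purely for illustration.
my %color_for = (Joe=>'green', Bob=>'blue', Mary=>'red', Sue=>'purple');

# Adding the new field touches each record, not the rest of the program.
$_->{favorite_color} = $color_for{ $_->{first} } for @people;

# The existing sort needs no changes; the color travels with the person.
@people = sort { $a->{age} <=> $b->{age} } @people;

for my $hr (@people) {
    print "$hr->{first} $hr->{last} $hr->{age} $hr->{favorite_color}\n";
}
```

The sort line is the same one used earlier. Nothing that already handled @people had to learn about the new field.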

Final Notes

When you begin with hashes, things will get a little sticky for a little while. But once you're accustomed to them, many things will suddenly get much easier. When you can just lump a complicated thing in a ball and forget about its internals, it opens up more of your brain to think about the larger problems in your programs. It also helps you make reusable chunks of code.

As an example, suppose you added addresses to your collection of "people", so you might have { first=>'Morticia', last=>'Addams', street_num=>131313, street_name=>'Mockingbird Lane', city=>'Perish', st=>'NY', zip=>13131, ... } and include it in your person data. Later, when someone asks to add buildings to your program and you notice that they also have addresses, you could split out your address information into a subhash and simply pass the address part when you call subroutines that deal with addresses. Then you could use the same subroutines with your buildings and not have to worry about sets of parallel arrays, or how to mix arrays containing people information with arrays containing building information:

my $ma = {
    first=>'Morticia', last=>'Addams', age=>undef, favorite_color=>'black',
    address=>{ street_num=>131313, street_name=>'Mockingbird Lane', 
               city=>'Perish', st=>'NY', zip=>13131 
    },
};

my $white_house = {
    class=>'GOVT', branch=>'Executive', usage=>'Presidents Residence', # ...
    address=>{ street_num=>1600, street_name=>'Pennsylvania Avenue, N.W.',
               city=>'Washington', st=>'DC', zip=>20500 
    },
};

print_address($ma->{address});
print_address($white_house->{address});

sub print_address {
    my $addr = shift;
    print "$addr->{street_num} $addr->{street_name}\n$addr->{city} $addr->{st} $addr->{zip}\n";
}

...roboticus

When your only tool is a hammer, all problems look like your thumb.
