kappa has asked for the wisdom of the Perl Monks concerning the following question:

Hello, fellow monks.

I have a kinda algorithmic question. Probably my Knuth-fu skills need a refresh or something.

What I've got is a large unordered array of opaque objects (refs, actually, so most operations are cheap). Each of them has a unique precalculated numeric key. I also have an ordered array of those keys (this is the index). Naturally, I'd like to end up with an array of the original objects ordered according to the index.

I do it in this way now:

my $uids = $self->sorted_uids;
my $msgs = $self->unsorted_messages;
my %uid2msg = map { $_->uid => $_ } @$msgs;
return [ map { $uid2msg{$_} } @$uids ];

It seems to me rather awkward and probably slow. I actually didn't do any benchmarks as there's nothing to compare it with. Directly sorting messages is an alternative but that makes other operations extremely... eh.. different. Although I'll probably try it later.

I'd like to add that creating a temp array $uid2msg[$_->uid] = $_ foreach @$msgs is not an option as those uids could be very large (and more important, both large and small in one message set).

Looks like this is the very kind of operation each sql server implementation performs when queried with a SELECT * FROM t ORDER BY column and there's an index on column. I failed to find anything about relevant algorithms on google, though.

Re: Ordering objects using external index
by fergal (Chaplain) on Sep 06, 2004 at 19:04 UTC
    You should look at maintaining the %uid2msg mapping as you go along. This shouldn't be too hard since $self appears to be an object. It just means you'll have to add some code to your insert and delete methods to keep $self->{uid2msg} up to date. This is exactly what a database does when you mark a column as indexed.

    In the benchmark below, this resulted in a speedup of about 9x. Of course you do have to pay a little for maintaining the %uid2msg index but I'm assuming in your case you do a lot more reading than inserting and deleting.
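    As a rough sketch of what keeping the mapping current could look like: the class name MessageStore and the methods add, remove and in_order are made up for illustration, and plain hash refs with a uid field stand in for the real message objects.

```perl
use strict;
use warnings;

# Hypothetical container; none of these names come from the original code.
package MessageStore;

sub new { return bless { msgs => [], uid2msg => {} }, shift }

# Insert: update the message list and the uid index together.
sub add {
    my ($self, $msg) = @_;
    push @{ $self->{msgs} }, $msg;
    $self->{uid2msg}{ $msg->{uid} } = $msg;
    return $msg;
}

# Delete: drop the message from both structures.
# The grep is linear in the list size; only the index update is constant time.
sub remove {
    my ($self, $uid) = @_;
    my $msg = delete $self->{uid2msg}{$uid} or return;
    @{ $self->{msgs} } = grep { $_ != $msg } @{ $self->{msgs} };
    return $msg;
}

# Read: a hash slice over the maintained index, with no per-call rebuild.
sub in_order {
    my ($self, $uids) = @_;
    return [ @{ $self->{uid2msg} }{@$uids} ];
}

package main;

my $store = MessageStore->new;
$store->add({ uid => $_ }) for 5674, 1, 4;

my $ordered = $store->in_order([ 4, 5674, 1 ]);
print join(' ', map { $_->{uid} } @$ordered), "\n";   # prints "4 5674 1"
```

    The cost moves from every read to every write, which only pays off if reads dominate.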

    Once you've done that, you can further speed things up with a hash slice in the return, changing

    return [ map { $uid2msg{$_} } @$uids ];
    to
    return [ @uid2msg{ @$uids } ];
    This doesn't make much of a difference in the original version, but it more than doubles performance when %uid2msg uses a precomputed index.

    Here are the results of a benchmark for 1000 msg objects

    Benchmark: running hashslice, hashslice_pre, original, original_pre for at least 5 CPU seconds...
        hashslice:  5 wallclock secs ( 5.29 usr +  0.00 sys =  5.29 CPU) @ 97.35/s (n=515)
    hashslice_pre:  5 wallclock secs ( 5.30 usr +  0.01 sys =  5.31 CPU) @ 1757.63/s (n=9333)
         original:  5 wallclock secs ( 5.30 usr +  0.00 sys =  5.30 CPU) @ 91.51/s (n=485)
     original_pre:  5 wallclock secs ( 5.33 usr +  0.00 sys =  5.33 CPU) @ 817.07/s (n=4355)
    and 10000 msg objects
    Benchmark: running hashslice, hashslice_pre, original, original_pre for at least 5 CPU seconds...
        hashslice:  5 wallclock secs ( 5.04 usr +  0.04 sys =  5.08 CPU) @ 7.28/s (n=37)
    hashslice_pre:  5 wallclock secs ( 5.27 usr +  0.01 sys =  5.28 CPU) @ 93.56/s (n=494)
         original:  5 wallclock secs ( 5.08 usr +  0.01 sys =  5.09 CPU) @ 6.68/s (n=34)
     original_pre:  6 wallclock secs ( 5.37 usr +  0.00 sys =  5.37 CPU) @ 46.93/s (n=252)
    the _pre versions are hugely faster. Code below
    use Benchmark;

    my $UID = 0;
    my $uids = [];
    my $msgs = [];

    for (1..10000) {
        UO->new;
    }

    my %pre_uid2msg = map { $_->uid => $_ } @$msgs;

    timethese(-5, {
        original => sub {
            my %uid2msg = map { $_->uid => $_ } @$msgs;
            return [ map { $uid2msg{$_} } @$uids ];
        },
        hashslice => sub {
            my %uid2msg = map { $_->uid => $_ } @$msgs;
            return [ @uid2msg{ @$uids }];
        },
        original_pre => sub {
            return [ map { $pre_uid2msg{$_} } @$uids ];
        },
        hashslice_pre => sub {
            return [ @pre_uid2msg{ @$uids }];
        }
    });

    package UO;

    sub new {
        $UID += rand(1000);
        my $self = bless {uid => $UID}, shift();
        push(@$uids, $UID);
        push(@$msgs, $self);
        return $self;
    }

    sub uid {
        my $self = shift;
        return $self->{uid};
    }


      Thanks for a comprehensive reply! I'll certainly incorporate some ideas as soon as I'm at work!

      Your main suggestion is to keep the hash always up-to-date as I do something on the messages array. That is actually my next big problem :)) You see, the messages can be sorted by different criteria. Currently, there're only eight. So, on each write operation on the messages array I will need to update eight indices. That looks weird.

      The main reason to separate sorting order into another array was to be able to save lots of presorted indices (currently they are in memcached) for a big message list and then quickly retrieve messages in the order I need. So the actual events that take place in the script are these: load big array, load indices, try to sort the array in less than n*log(n) ops using the indices. Hope this will clarify my intentions. I can probably try to save both $uids and %uid2msg for each criterion.

      Is there any other way to presort an array on different criteria and save the order for future reference? Seems like this is my real question :)

        I replied to this already but something seems to have gone wrong and the reply didn't make it. Basically, if you have 8 columns that you need to index, then you need 8 indexes. No way around it. If you are only retrieving the sorted list once and then forgetting about it forever, then maintaining the indices only slows you down and it's not worth it. However, if you are going to retrieve it even just a few times, then it's probably a win.
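        A minimal sketch of the "one index per criterion" idea, assuming plain hash refs as messages and two made-up sort fields (date and size); the helper ordered_by is hypothetical. Each index is just a presorted list of uids, and all of them share a single uid-to-message map.

```perl
use strict;
use warnings;

# Illustrative data: hash refs stand in for message objects,
# and the date/size fields are invented for this example.
my @msgs = (
    { uid => 5674, date => 3, size => 10 },
    { uid => 1,    date => 1, size => 30 },
    { uid => 4,    date => 2, size => 20 },
);

# One uid => message map shared by every ordering.
my %uid2msg = map { $_->{uid} => $_ } @msgs;

# One presorted uid list per criterion. Each costs one sort to build
# and could then be cached and reloaded instead of being rebuilt.
my %index;
for my $field (qw(date size)) {
    $index{$field} =
        [ map { $_->{uid} } sort { $a->{$field} <=> $b->{$field} } @msgs ];
}

# Retrieval in any stored order is a single hash slice,
# linear in the number of messages.
sub ordered_by {
    my ($field) = @_;
    return [ @uid2msg{ @{ $index{$field} } } ];
}

for my $field (qw(date size)) {
    print "$field: ", join(' ', map { $_->{uid} } @{ ordered_by($field) }), "\n";
}
# prints "date: 1 4 5674" and "size: 5674 4 1"
```

        Adding a message means inserting its uid into every list in %index, which is the write-time cost discussed above.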

        You could also try DB_File with its DB_BTREE functionality to handle the sorting and storing of the arrays. This effectively gives you a sorted hash that persists on disk between calls to your program. You would maintain 8 of these and whenever you add a message, you would do

        use DB_File;   # exports $DB_BTREE
        use Fcntl;     # exports O_RDWR and O_CREAT

        tie %index1, "DB_File", "index1", O_RDWR|O_CREAT, 0666, $DB_BTREE;
        tie %index2, "DB_File", "index2", O_RDWR|O_CREAT, 0666, $DB_BTREE;
        ...

        sub insert {
            my $msg = shift;
            $index1{$msg->key1} = $msg->uid;
            $index2{$msg->key2} = $msg->uid;
            ...
        }

        my @sorted_by_index1 = @uid2msg{values %index1};
        Unlike a normal hash, when you use DB_BTREE, values will give you the values back in the correct order (sorted by their keys).
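        One caveat worth checking before relying on that order: by default a DB_File BTREE compares keys as strings, so numeric keys come back in lexical order unless a custom compare routine is supplied or the keys are padded. The sketch below does not use DB_File at all; it uses Perl's default string sort to mimic that ordering, and the sample keys and the padding width of 10 are arbitrary assumptions.

```perl
use strict;
use warnings;

my @keys = (5674, 1, 4, 10);

# Default string comparison: 10 sorts before 4.
my @lexical = sort @keys;

# Zero-padding to a fixed width makes string order match numeric order.
my @padded  = map { sprintf '%010d', $_ } @keys;
my @sorted  = sort @padded;
my @numeric = map { $_ + 0 } @sorted;   # strip the padding again

print "lexical: @lexical\n";   # prints "lexical: 1 10 4 5674"
print "padded:  @numeric\n";   # prints "padded:  1 4 10 5674"
```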

        If you go down this route you are basically implementing your own database and you may want to look at just using DBD::SQLite which gives you a fast, direct to disk database.

Re: Ordering objects using external index
by saintmike (Vicar) on Sep 06, 2004 at 17:45 UTC
    Check out this thread, it seems like your requirements are very similar.

      Exactly. Thanks!

      Sadly enough, that discussion didn't end up in anything efficient either :((

Re: Ordering objects using external index
by Anonymous Monk on Sep 06, 2004 at 18:29 UTC
    return [ map { $uid2msg{$_} } @$uids ];

    Can be written as:

    return [ @uid2msg{@$uids} ];

    But I don't know if that's any faster...

Re: Ordering objects using external index
by BrowserUk (Patriarch) on Sep 06, 2004 at 19:23 UTC

    A few examples of the data might have clarified your question. Do you mean something like this?

    #! perl -slw
    use strict;

    my $uids = [ 9, 1, 4, 7, 2, 0, 3, 6, 8, 5 ];
    my $msgs = [ qw[ zero one two three four five six seven eight nine ] ];

    my @msgsByUid = @$msgs[ @$uids ];
    print for @msgsByUid;

    __END__
    P:\test>junk
    nine
    one
    four
    seven
    two
    zero
    three
    six
    eight
    five

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
      Aa, kind of. Look:
      my $uids = [ 5674, 1, 4 ];
      my $msgs = [ $msg1, $msg4, $msg5674 ];
      Provided $msgNN->uid == NN we'd like to have [ $msg5674, $msg1, $msg4 ].
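      For anyone who wants to run that example, here is a self-contained sketch. The Msg class is a made-up stand-in; only the uid accessor is taken from the thread.

```perl
use strict;
use warnings;

# Minimal stand-in for the message class (hypothetical).
package Msg;
sub new { my ($class, $uid) = @_; return bless { uid => $uid }, $class }
sub uid { return $_[0]{uid} }

package main;

my ($msg1, $msg4, $msg5674) = map { Msg->new($_) } 1, 4, 5674;

my $uids = [ 5674, 1, 4 ];                # the desired order
my $msgs = [ $msg1, $msg4, $msg5674 ];    # arbitrary order

# Build the uid => message map, then pull the messages out in index order.
my %uid2msg = map { $_->uid => $_ } @$msgs;
my $ordered = [ @uid2msg{@$uids} ];

print join(' ', map { $_->uid } @$ordered), "\n";   # prints "5674 1 4"
```

      The uids can be arbitrarily large and sparse here, since they are used as hash keys rather than array indices.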

        That makes it look like you're using symbolic references, i.e. variable names that are (partially) made up from other variable names, e.g.

        $uid = 5674; ${'msg' . $uid } = ...;

        In which case, you should be making that a hash directly:

        push @uid, 5674; $msgs{ $uid[ -1 ] } = ...;

        then you wouldn't be having the mapping problem later on. Producing your ordered array would then become a simple hash slice:

        @ordered = @msgs{ @uid };

        It's difficult to know without seeing how the variables and data in your snippets are being created.

