in reply to Re^8: In-place sort with order assignment
in thread In-place sort with order assignment

BrowserUk,
I have no idea how much additional memory Heap::Simple::XS uses under the covers, but it is dramatically faster than the splice-with-binary-search approach (above).
#!/usr/bin/perl
use strict;
use warnings;

use Heap::Simple::XS;
use Time::HiRes qw/gettimeofday tv_interval/;

my $items = $ARGV[0] || 100;

# Build a hash of $items keys ('a', 'b', ...), none of which has an
# order assigned yet.
my $str  = 'a';
my %hash = map {$str++ => undef} 1 .. $items;

# Work through the keys roughly 10% at a time; max_count keeps the
# heap from ever holding more than one batch.
my $at_once = int($items * .10);
my $heap    = Heap::Simple::XS->new(
    order     => "gt",
    elements  => "Scalar",
    max_count => $at_once,
);

my ($cnt, $beg, %known) = ($at_once, [gettimeofday], ());

while (1) {
    while (my ($key, $val) = each %hash) {
        next if defined $val;               # already ordered
        if (exists $known{$key}) {
            $hash{$key} = $known{$key};     # apply order learned last pass
            next;
        }
        $heap->insert($key);                # candidate for this pass
    }
    my $items = $heap->count;
    last if ! $items;                       # nothing left to order

    # Hand out the next block of order numbers to this pass's batch.
    %known = ();
    my $max = $cnt + $items;
    $known{$_} = $cnt-- for $heap->extract_all;
    $cnt = $max;
    $heap->clear;
}

my $elapsed = tv_interval($beg, [gettimeofday]);
my $per     = sprintf("%.7f", $elapsed / $items);
print "Took $elapsed seconds for $items items ($per per item)\n";

__DATA__
C:\tmp>perl buk2.pl 100
Took 0.001999 seconds for 100 items (0.0000200 per item)

C:\tmp>perl buk2.pl 1000
Took 0.021015 seconds for 1000 items (0.0000210 per item)

C:\tmp>perl buk2.pl 10000
Took 0.241327 seconds for 10000 items (0.0000241 per item)

C:\tmp>perl buk2.pl 100000
Took 3.375 seconds for 100000 items (0.0000338 per item)

C:\tmp>perl buk2.pl 1000000
Took 48.25 seconds for 1000000 items (0.0000483 per item)

Cheers - L~R

Re^10: In-place sort with order assignment
by BrowserUk (Patriarch) on Sep 20, 2010 at 07:40 UTC
    I have no idea how much additional memory Heap::Simple::XS uses under the covers,

    For 1e6 items, the memory usage grows from 145MB to over 200MB, which for 10e6 items is going to push a 32-bit machine into swapping.
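
    A minimal sketch of how such growth can be watched (an illustrative addition, not part of the original post; it assumes a Windows box, as in the transcripts above, with tasklist on the path, and polls this process's own working set):

    use strict;
    use warnings;

    # Pull this process's "Mem Usage" field out of tasklist's CSV output.
    sub mem_kb {
        my ($line) = grep /\S/, `tasklist /FI "PID eq $$" /FO CSV /NH`;
        return 0 unless defined $line;
        my ($kb) = $line =~ /"([\d,]+) K"/;
        $kb =~ s/,//g;
        return $kb;
    }

    printf "before: %d KB\n", mem_kb();
    my %hash = map { $_ => undef } 1 .. 1_000_000;
    printf "after:  %d KB\n", mem_kb();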

    That said, I think this memory usage may, in part at least, be due to a bug in this incarnation of the code.

    I cannot see what would prevent this loop from copying everything from %hash into both %known and the heap?

    while (my ($key, $val) = each %hash) {
        next if defined $val;
        if (exists $known{$key}) {
            $hash{$key} = $known{$key};
            next;
        }
        $heap->insert($key);
    }

    Overall, the approach used in the second snippet in Re^2: In-place sort with order assignment seems to be the best. It takes 8 seconds and very little extra memory for 1e6 items, versus 50 seconds and +25% memory for the heap. And it happily handles 10e6 items in 108 seconds and under 2GB.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      BrowserUk,
      I cannot see what would prevent this loop from copying everything from %hash into both %known and the heap?

      next if defined $val; will skip any keys from %hash that we have previously assigned a value to.

      $hash{$key} = $known{$key}; next; will assign any values we learned from the last run and then move on to the next record.

      $heap->insert($key); will only insert records into the heap for keys that we have not assigned a value to (either in a previous run or this run). Update: According to the documentation, max_count => $at_once will throw out items from the heap beyond that point. If that doesn't work as advertised, that may be the source of the additional memory.
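
      That is easy to check with a few lines (a sketch added for illustration, not from the original exchange; it uses only the insert/count/extract_all calls already seen above):

      use strict;
      use warnings;
      use Heap::Simple::XS;

      my $max  = 5;
      my $heap = Heap::Simple::XS->new(
          order     => "gt",
          elements  => "Scalar",
          max_count => $max,
      );

      # 26 inserts into a 5-slot heap: if max_count works as advertised,
      # count() never exceeds $max.
      $heap->insert($_) for 'a' .. 'z';

      printf "heap holds %d of 26 inserted items\n", $heap->count;
      print join(" ", $heap->extract_all), "\n";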

      Cheers - L~R

        I was thinking that on the first pass no values would be set, and therefore everything would end up in the heap. Everything does get added, but I was unaware that items beyond the specified maximum were discarded.
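
        To see exactly which items survive the discarding, a variation of the same check (again an illustrative sketch, not from the thread) inserts a known range in random order and prints what the bounded heap kept:

        use strict;
        use warnings;
        use List::Util qw/shuffle/;
        use Heap::Simple::XS;

        my $heap = Heap::Simple::XS->new(
            order     => "gt",
            elements  => "Scalar",
            max_count => 5,
        );

        # Insert the alphabet in random order; the five keys printed at
        # the end are the ones the bounded heap chose to keep.
        $heap->insert($_) for shuffle 'a' .. 'z';
        print join(" ", $heap->extract_all), "\n";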


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.