in reply to In-place sort with order assignment

BrowserUk,
In your example, you have created two large lists. Perhaps you could get away with just one:
    my $cnt = 0;
    for (sort keys %hash) {
        $hash{$_} = ++$cnt;
    }
Now, assuming that doesn't work, you can take a multi-pass approach using each (which consumes almost no memory). Essentially, you have two data structures designed to maintain order (pick your poison). On the first pass, you keep track of the bottom N elements, leaving the 2nd data structure empty. On the second pass, you start recording the next bottom N elements while assigning the values of the first N. On the 3rd pass, they swap functions. Wash, rinse, repeat.
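A minimal sketch of the multi-pass idea, simplified to a single ordered buffer: each pass collects the N smallest keys not yet ranked, and the ranks are assigned once the pass ends rather than during the following pass. The name rank_keys_multipass and the parameter $n are hypothetical, chosen only for illustration, and keys are assumed to sort as plain strings.

```perl
use strict;
use warnings;

# Assign each key its 1-based rank in string-sorted order while holding
# at most $n keys outside the hash at any one time.
sub rank_keys_multipass {
    my ($hash, $n) = @_;
    my $rank = 0;
    my $last;                        # largest key ranked so far

    while (1) {
        my @batch;                   # kept sorted, never more than $n keys
        keys %$hash;                 # reset the each() iterator
        while (defined(my $key = each %$hash)) {
            next if defined $last && $key le $last;       # ranked already
            next if @batch == $n && $key ge $batch[-1];   # too big for this pass

            # Binary search for the insertion point.
            my ($lo, $hi) = (0, scalar @batch);
            while ($lo < $hi) {
                my $mid = int(($lo + $hi) / 2);
                if ($batch[$mid] lt $key) { $lo = $mid + 1 }
                else                      { $hi = $mid }
            }
            splice @batch, $lo, 0, $key;
            pop @batch if @batch > $n;   # drop the largest when over capacity
        }
        last unless @batch;              # every key has been ranked

        # Only existing values change here, so the hash does not grow.
        $hash->{$_} = ++$rank for @batch;
        $last = $batch[-1];
    }
    return $rank;
}
```

With M keys this makes roughly M/N passes plus one final empty pass, so N trades memory directly against run time.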

Cheers - L~R

Re^2: In-place sort with order assignment
by BrowserUk (Patriarch) on Sep 19, 2010 at 14:30 UTC

    The problem is, the statement sort keys %hash has already created the two large lists: the input list and the output list. After that, the hash slice assignment is just reusing space acquired during the sort.

      It appears to me (perhaps wrongly) that the problem reduces to this: how do you print all the keys of a hash without Perl making a new list of all those keys in order for the print to work? The assumption is that doubling the storage for the keys will exceed physical memory.

      In a more general sense: how to call a sub for each hash entry as the hash table is traversed.
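One way to sketch this is with each, which hands back one entry at a time instead of building a list of all keys. The helper name for_each_key is hypothetical, used here only for illustration.

```perl
use strict;
use warnings;

# Call $callback once per key without materialising the full key list.
sub for_each_key {
    my ($hash, $callback) = @_;
    keys %$hash;                              # reset the each() iterator
    while (defined(my $key = each %$hash)) {  # scalar context: key only
        $callback->($key);
    }
    return;
}

# Example use: stream every key to a filehandle, one per line.
# for_each_key(\%hash, sub { print {$fh} $_[0], "\n" });
```

The callback must not add or delete keys while the traversal is running, since that can disturb the iterator.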

      If that can be done, then output all keys to a file (call a print routine for each entry). Call a system sort routine for that file. The Perl program may be paged out. Many files may be created and lots of memory may be consumed, but when that sort finishes, there is a file that is in hash key order and all the physical memory that system sort used is now free.

      Perl program reads that big file (millions of lines) and assigns sequential numbers to each entry.
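The file-based plan might look roughly like the sketch below. It assumes a Unix-style sort command on the PATH, keys that contain no newlines, and byte-wise ordering (hence LC_ALL=C). The function name rank_keys_via_external_sort is hypothetical.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

sub rank_keys_via_external_sort {
    my ($hash) = @_;

    # 1. Stream the keys to a file, one per line, using each().
    my ($out, $unsorted) = tempfile(UNLINK => 1);
    keys %$hash;                              # reset the iterator
    while (defined(my $key = each %$hash)) {
        print {$out} $key, "\n";
    }
    close $out or die "close $unsorted: $!";

    # 2. Let the system sort do the work outside the Perl process.
    my (undef, $sorted) = tempfile(UNLINK => 1);
    local $ENV{LC_ALL} = 'C';                 # byte order, like Perl's default sort
    system('sort', '-o', $sorted, $unsorted) == 0
        or die "sort failed: $?";

    # 3. Read the sorted file back and number the entries.
    open my $in, '<', $sorted or die "open $sorted: $!";
    my $rank = 0;
    while (my $key = <$in>) {
        chomp $key;
        $hash->{$key} = ++$rank;
    }
    close $in;
    return $rank;
}
```

Only one key is held in a Perl variable at a time; the cost is the extra disk traffic of writing, sorting, and re-reading the file.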

      Why wouldn't that work?

        I've never really tried to optimize code to minimize memory usage, so my thoughts here might be stupid and/or crazy. I'll go ahead and risk being ridiculed and toss out my ideas in case they might spark a better idea from a more experienced programmer.

        Marshall, your idea of sorting from a file is close to an idea that I had, but was very hesitant to put it in a post. However, it seems to me that sorting the file(s) as you suggest could potentially eat up a lot of memory. I admit that I could be dead wrong about that.

        Here's my stupid/crazy idea that's close to what Marshall suggested:

        • Loop through the unsorted keys of the hash.
        • For each key in the hash:
          • Open a file in in-place edit mode
          • Loop through each line and insert the new key in the proper line (one hash key per line) based on desired sort method
        • After doing this for each hash key, the file above should have the keys in sorted order. Reopen the file. While looping through that file, you'll be progressing through a sorted list of the keys.

        In other words, instead of doing the sorting after populating the file with all of the hash keys, do the sorting one element at a time as each hash key is added to the file.
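A rough sketch of that insertion approach follows. Instead of a true in-place edit it rewrites the file through a temporary copy and renames it, which is an assumption made for simplicity. The names insert_sorted and rank_keys_via_file_insertion are hypothetical, and keys are assumed to contain no newlines.

```perl
use strict;
use warnings;

# Insert $key into $file so that the file stays in string-sorted order.
sub insert_sorted {
    my ($file, $key) = @_;
    my $tmp = "$file.tmp";
    open my $out, '>', $tmp or die "open $tmp: $!";
    my $placed = 0;
    if (open my $in, '<', $file) {        # absent on the very first key
        while (my $line = <$in>) {
            chomp $line;
            if (!$placed && $key lt $line) {
                print {$out} $key, "\n";  # new key goes before the first larger one
                $placed = 1;
            }
            print {$out} $line, "\n";
        }
        close $in;
    }
    print {$out} $key, "\n" unless $placed;   # largest so far: append
    close $out or die "close $tmp: $!";
    rename $tmp, $file or die "rename $tmp: $!";
    return;
}

sub rank_keys_via_file_insertion {
    my ($hash, $file) = @_;
    keys %$hash;                              # reset the each() iterator
    while (defined(my $key = each %$hash)) {
        insert_sorted($file, $key);
    }
    open my $in, '<', $file or die "open $file: $!";
    my $rank = 0;
    while (my $key = <$in>) {
        chomp $key;
        $hash->{$key} = ++$rank;
    }
    close $in;
    return $rank;
}
```

Memory use stays at roughly one line at a time, but every insertion rewrites the whole file, so the total I/O grows quadratically with the number of keys.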

        I believe that this would sort the keys with minimal memory usage. However, execution time might be poor, or even unacceptably long. Since BrowserUk said that "Speed is not a huge priority here", this might be acceptable depending on how long it takes.

        As I said, I have no experience optimizing for minimal memory usage, which means that this could be a horrible idea. I'm open to constructive criticism on this idea, which will help me learn more about optimizing.

        Why wouldn't that work?

        It would.

        But whilst memory rather than speed was the focus, solutions that avoid writing millions of lines to a file, sorting them (which itself re-reads and re-writes all those lines, often several times), and then re-reading them all again are likely to be considerably faster. Hence my preference for an 'internal' solution.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
      BrowserUk,
      You didn't comment on my multi-pass each approach. It is an ugly solution but there is no reason I can think of that it wouldn't work.

      Cheers - L~R

        You didn't comment on my multi-pass each approach.

        Um...mostly, because I didn't understand it. Or at least, I'm still pondering how it might be implemented?