in reply to Saving an array to a disk file

Once you accept that you have to use external storage, a sort-based solution is no longer the most convenient one.

At least in theory, a hash with on-disk storage (e.g. DB_File) is a better approach, with O(N) cost:

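use DB_File;
use File::Temp qw(tempfile);

my @data = qw(foo bar foo bar bar doz);

# tie a hash to a temporary on-disk DB; the file is removed at exit
my ($fh, $fn) = tempfile(UNLINK => 1);
tie my %u, 'DB_File', $fn, O_RDWR, 0600
    or die "cannot tie $fn: $!";

$u{$_} = undef for @data;   # duplicate keys collapse into one entry
@data = keys %u;
print "@data\n";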

Re^2: Saving an array to a disk file
by BrowserUk (Patriarch) on May 26, 2006 at 14:23 UTC

    Using the default settings, this takes just over 8 minutes for 1e7 elements, compared to just over 3 minutes using sort. Berkeley DB is very fast once the data is in the DB, but getting it in takes time, which makes it less useful for disposable applications like this.

    Having been down this road before, I know there are myriad options that can be used to tune the insert performance, but most of them involve using extra memory to buffer the DB, and memory is the scarce resource here.
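
One such tuning option can be sketched as follows: DB_File accepts a DB_File::HASHINFO object whose cachesize field sets the size of the in-memory buffer. The 16 MB figure is an arbitrary illustration, not a measured or recommended value, and the stored value 1 is just a placeholder. How much this helps insert speed would have to be measured.

```perl
use strict;
use warnings;
use DB_File;
use File::Temp qw(tempfile);

my @data = qw(foo bar foo bar bar doz);

# Assumption: 16 MB is an arbitrary illustrative cache size, not a tuned value.
my $info = DB_File::HASHINFO->new;
$info->{cachesize} = 16 * 1024 * 1024;

# Temporary file for the DB; removed automatically at program exit.
my ($fh, $fn) = tempfile(UNLINK => 1);
tie my %u, 'DB_File', $fn, O_RDWR|O_CREAT, 0600, $info
    or die "cannot tie $fn: $!";

$u{$_} = 1 for @data;          # duplicate keys collapse into one entry
my @unique = sort keys %u;     # sorted only to make the output deterministic
untie %u;

print "@unique\n";             # prints "bar doz foo"
```

A larger cache trades RAM for fewer disk writes during the inserts, which is exactly the trade-off described above.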


Re^2: Saving an array to a disk file
by Anonymous Monk on May 27, 2006 at 14:30 UTC
    Thanks so much again for the reply, salva.

    BTW, suppose I have two or more scripts (each containing your snippet above) running at the same time.
    Will they conflict? If so, how can I avoid that?
      As you can see from BrowserUk's response to my previous post, it seems that in practice the sort solution performs better than the DB_File-based one. I guess that's because the sorting algorithm is more cache friendly (where cache = RAM).

      Anyway, to answer your question: yes, you can have several processes running the code in my post at the same time, because tempfile gives each process its own uniquely named file. It would probably be slower than running them in sequence, though.
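
A minimal sketch of why concurrent runs should not collide: each call to tempfile creates a new file with a unique name. The two calls below merely stand in for two separate processes, and the variable names are illustrative.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Each call creates and opens a fresh, uniquely named temporary file.
my ($fh1, $fn1) = tempfile(UNLINK => 1);
my ($fh2, $fn2) = tempfile(UNLINK => 1);

print $fn1 ne $fn2 ? "distinct temp files\n" : "name collision\n";
```

Since every process ties its hash to its own file, no locking between them is needed. They still compete for disk and memory, which is why running them in parallel may be slower overall.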