in reply to Saving an array to a disk file

Well, what have you tried as far as benchmarking is concerned? I actually would expect:

{ my $old = select $fh; local $\="\n"; print for @very_big_redundant_array; select $old; }
to be about as fast as it gets (i.e., not chewing up stupid amounts of memory if that array really is big, while still allowing your cache to save time). But I haven't benchmarked it, and computers can do really strange things that we humans don't expect.
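For context, here is a minimal, self-contained sketch of that one-liner; the filename ('out.txt') and the sample array contents are placeholders:

```perl
# Placeholder data standing in for the real (much bigger) array.
my @very_big_redundant_array = qw(b a b c a);

open my $fh, '>', 'out.txt' or die "open: $!";
{
    my $old = select $fh;   # make $fh the default output handle
    local $\ = "\n";        # auto-append a newline to every print
    print for @very_big_redundant_array;
    select $old;            # restore the previous default handle
}
close $fh or die "close: $!";
```

Setting `$\` (the output record separator) locally means each `print` appends the newline for you, and localizing it inside the block keeps the change from leaking into the rest of the program.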

That said, it's probably even faster not to save it at all, but to uniq-sort it in memory:

my @very_big_sorted_unique_array = do { my %seen; $seen{$_} = 1 for @very_big_redundant_array; sort keys %seen; };
By bypassing the disk, you can get huge improvements in speed. If you run out of memory, this will still swap to disk, but that shouldn't be slower than your method. Only if you run out of address space will you actually have problems (which could be 1.5GB, 2GB, 3.5GB, 3.75GB, or some number of TB or something, depending on OS and architecture) that using the disk manually would prevent.
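For clarity, here is a toy-sized version of the hash trick above; the sample data is just an illustration:

```perl
# Placeholder data standing in for the real array.
my @very_big_redundant_array = qw(pearl perl perl monk pearl);

my @very_big_sorted_unique_array = do {
    my %seen;                                    # hash keys are unique by construction
    $seen{$_} = 1 for @very_big_redundant_array; # one O(N) pass to deduplicate
    sort keys %seen;                             # default string sort over the unique keys
};
# @very_big_sorted_unique_array is now (monk, pearl, perl)
```

The duplicates disappear during the hash pass, so the sort only ever sees the unique values.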

Of course, if your intention is to have a reboot in the middle somewhere, then persistent storage is important - don't get me wrong, saving a huge amount of data as quickly as possible is still a worthwhile question. But I'm not sure it is necessarily an important question for you, without knowing that you need to load the data in another process.

Replies are listed 'Best First'.
Re^2: Saving an array to a disk file
by Anonymous Monk on May 26, 2006 at 05:18 UTC
    Dear Tanktalus,

    You have a great intuition! Actually this posting is a continuation of my earlier question:
    Howto Avoid Memory Problem in List::MoreUtils.
    My basic problem is an "Out of memory" error during the process of getting a unique array. Unfortunately, salva's suggestion also gives me the same memory problem at some point. So, following your comment:
    If you run out of memory, this will still swap to disk, but that shouldn't be slower than your method.
    It seems that your solution (the last snippet) is the best I can get: it avoids the out-of-memory problem yet is still about as fast as List::MoreUtils::uniq? Please correct me if I'm wrong..

      If you're running out of memory, increase your swap size.

      If you're on a unix/linux type of box, check your ulimit. Set it to unlimited (you may need superuser authority to do this). The only reason I can think of for a sysadmin to legitimately say no is that you're still in school and this is a school assignment. In that case, I'd suggest asking your professor for direction. Otherwise, it's either your machine at home (where you should already have superuser access - use it), or it's at work (where this is a work requirement, and if the sysadmin says "no" then you ask your manager for help in turning that "no" into a "yes").

      (You might be able to tell that I don't suffer fool admins well.)

      As for being as fast as List::MoreUtils::uniq - I had forgotten about that function. This solution probably won't be as fast if your array is already sorted. If your array is not sorted, then this solution removes the duplicates before sorting, meaning you have less to sort - that should make it faster: O(M log M) instead of O(N log N), where M is the number of unique values and N is the total number of values. (It's actually closer to M log M + N, but under normal order notation the N term is lower-order and thus discarded.)
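To illustrate that point, here is a sketch contrasting the two orders of operations; the random data is just an illustration, and `uniq` has shipped with List::Util since version 1.45:

```perl
use List::Util qw(uniq);

# N = 1000 values drawn from at most M = 10 distinct integers.
my @data = map { int rand 10 } 1 .. 1000;

# Sort first, then deduplicate: the sort touches all N elements, O(N log N).
my @slow = uniq sort { $a <=> $b } @data;

# Deduplicate first, then sort: one O(N) hash pass, then the sort
# only sees the M unique keys, O(M log M).
my %seen;
$seen{$_} = 1 for @data;
my @fast = sort { $a <=> $b } keys %seen;

# Both yield the same sorted list of unique values.
```

With M much smaller than N, as in a "very big redundant array", the second form does far less comparison work in the sort.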

        As a (potentially fool) admin... I can think of reasons to legitimately say "no" which are not (in any sane workplace) subject to managerial override. They mostly fall into the category of "You want me to let you suck up all the memory on $RANDOM_MULTIUSER_PRODUCTION_HOST and bring other, potentially business-critical, processes to their knees as they wait for the disk to finish thrashing? No!" But, then, in a sane workplace, this sort of development would most likely be taking place on the developer's private workstation, or at least on a dedicated shared development machine, so those sorts of reasons wouldn't apply.