Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
Suppose I have this array:
# after some tedious process I have @very_big_redundant_array;
My question is: how can I save that array to a disk file in an efficient way? Because later I want to perform an external Unix sort -u on that file, and then read it back into the array.

Replies are listed 'Best First'.
Re: Saving an array to a disk file
by Tanktalus (Canon) on May 26, 2006 at 04:17 UTC

    Well, what have you tried as far as benchmarking is concerned? I actually would expect:

    {
        my $old = select $fh;    # $fh is an already-opened output filehandle
        local $\ = "\n";         # add a newline after every print
        print for @very_big_redundant_array;
        select $old;
    }
    to be about as fast as it gets (i.e., not chewing up stupid amounts of memory if that array really is big, while still allowing your cache to save time). But I haven't benchmarked it, and computers can do really strange things that we humans don't expect.

    That said, even faster is probably to not save it, but to uniq-sort it in memory:

    my @very_big_sorted_unique_array = do {
        my %seen;
        $seen{$_} = 1 for @very_big_redundant_array;
        sort keys %seen;
    };
    By bypassing the disk, you can get huge improvements in speed. If you run out of memory, this will still swap to disk, but that shouldn't be slower than your method. Only if you run out of address space (which could be 1.5 GB, 2 GB, 3.5 GB, 3.75 GB, or some number of TB, depending on OS and architecture) will you actually have problems that using the disk manually would prevent.

    Of course, if your intention is to have a reboot in the middle somewhere, then persistent storage is important - don't get me wrong, saving a huge amount of data as quickly as possible is still a worthwhile question. But I'm not sure it is necessarily an important question for you without knowing that you need to load the data in another process.
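
    If loading the data in another process really is the goal, one common option - offered here only as a sketch, with a made-up filename - is Storable, which serializes the whole array in a single call and reads it back just as directly. Note that it writes a binary image of the array, not plain lines, so it does not help with feeding an external sort -u:

    use strict;
    use warnings;
    use Storable qw(nstore retrieve);

    my @very_big_redundant_array = map { "Test" . int rand 32767 } 1 .. 1000;

    # Write the whole array to disk in one call (network byte order, portable).
    nstore( \@very_big_redundant_array, 'array.sto' )
        or die "nstore failed: $!";

    # ... later, possibly in a different process ...
    my $aref = retrieve( 'array.sto' )
        or die "retrieve failed: $!";
    printf "Read back %d elements\n", scalar @$aref;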

      Dear Tanktalus,

      You have a great intuition! Actually this posting is a continuation of my earlier question:
      Howto Avoid Memory Problem in List::MoreUtils.
      My basic problem is an "Out of memory" error while building the unique array. Unfortunately, the suggestion by salva at some point also gave me the same memory problem. So, following your comment:
      If you run out of memory, this will still swap to disk, but that shouldn't be slower than your method.
      it seems that your solution (the last snippet) is the best I can get: it avoids the out-of-memory problem while still being nearly as fast as List::MoreUtils::uniq? Please correct me if I'm wrong.

        If you're running out of memory, increase your swap size.

        If you're on a unix/linux type of box, check your ulimit. Set it to unlimited (you may need superuser authority to do this). The only reason I can think of for a sysadmin to legitimately say no is that you're still in school, and this is a school assignment. In that case, I'd suggest asking your professor for direction. Otherwise, it's either your machine at home (where you should already have superuser access - use it), or it's at work (where this is a work requirement, and if the sysadmin says "no" then you ask your manager for help in turning that "no" into a "yes").

        (You might be able to tell that I don't suffer fool admins well.)

        As for being as fast as List::MoreUtils::uniq - I had forgotten about that function. This solution probably won't be as fast as that one if your array is already sorted. If your array is not sorted, then this solution removes the duplicates before sorting, meaning that you have less to sort - that should make it faster: O(M log M) instead of O(N log N), where M is the number of unique values and N is the total number of values. (It's actually closer to O(M log M + N), but under normal order notation the N term is treated as lower-order and discarded.)
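
        If you want to check that reasoning against your own data, a throwaway benchmark is easy to write. A rough sketch (it assumes List::MoreUtils is installed and uses deliberately redundant data, so M is much smaller than N; numbers will vary by machine):

        use strict;
        use warnings;
        use Benchmark qw(cmpthese);
        use List::MoreUtils qw(uniq);

        # 100_000 values drawn from at most 32_767 distinct strings.
        my @data = map { "Test" . int rand 32767 } 1 .. 100_000;

        cmpthese( -5, {
            uniq_then_sort => sub {
                my @u = sort { $a cmp $b } uniq @data;
            },
            hash_then_sort => sub {
                my %seen;
                $seen{$_} = 1 for @data;
                my @u = sort keys %seen;
            },
        } );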

Re: Saving an array to a disk file
by BrowserUk (Patriarch) on May 26, 2006 at 07:29 UTC

    Rather than writing the array to disk, having sort read it from disk and write the results back to disk, and then reading them in, you can save a couple of steps by piping the data directly to sort using a piped open, and letting it write the results to disk for you to read back. The following processes 10 million elements in around 3 minutes on my machine.

    #! perl -slw
    use strict;
    use Benchmark::Timer;

    our $N ||= 1e6;

    my $T = new Benchmark::Timer;

    my @array;
    push @array, sprintf "Test%07d", int rand 32767 for 1 .. $N;

    $T->start( 'sort -u' );

    open my $pipe, "| u:sort -u > temp.tmp" or die $!;
    print $pipe $_ for @array;
    close $pipe;

    open my $in, '<', 'temp.tmp' or die $!;
    my $i = 0;
    $array[ $i++ ] = $_ while <$in>;
    close $in;
    $#array = $i - 1;    ## Corrected, with thanks to [johngg]

    $T->stop( 'sort -u' );
    $T->report;

    printf "Array now contains %d uniq elements\n", scalar @array;

    __END__

    [ 8:17:47.37] c:\test>551753-2 -N=1e7
    1 trial of sort -u  (183.281s total)
    Array now contains 32768 uniq elements

    In theory, you could use IPC::Run3 to avoid hitting the disk at all, but then you have the problem of needing storage for both the input and output at the same time. The code above reads the data back into the original array and truncates the leftovers, so avoiding any extra memory growth.
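
    For anyone wanting to try that route, a minimal sketch of the IPC::Run3 variant follows; as noted, the input lines and the sorted output are both held in memory at the same time:

    use strict;
    use warnings;
    use IPC::Run3 qw(run3);

    my @array = map { sprintf "Test%07d\n", int rand 32767 } 1 .. 100_000;

    my @sorted;                                 # receives the uniqued, sorted lines
    run3 [ 'sort', '-u' ], \@array, \@sorted;   # stdin from @array, stdout to @sorted
    die "sort failed: exit status $?" if $?;
    chomp @sorted;

    printf "Array now contains %d uniq elements\n", scalar @sorted;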


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      I might be barking up the wrong tree here, but I think your

      $#array = $i;

      should be

      $#array = $i - 1;

      because the $i subscript has been post-incremented, ready to receive the next value, in the ... while <$in>; loop. I don't think you can get 32768 unique values when generating them with ... int rand 32767 ..., since int rand 32767 produces integers in the range 0 .. 32766, i.e. at most 32767 distinct values.

      Cheers,

      JohnGG

        Indeed you are absolutely correct! Updated, thankyou++.


      Dear BrowserUK,

      As I'm running on Linux, I encountered a problem running your script above. When I ran it, it gave this error:
      $ perl save_disk_to_array_browseruk.pl
      sh: u:sort: command not found
      Then when I removed the u: prefix, changing the line to
      open my $pipe, "| sort -u > temp.tmp" or die $!;
      It gave only:
      Array now contains 1 uniq elements
      Are there any changes that need to be made to your code above if I'm running on Unix? Sorry to trouble you; I'm new to Perl. Thanks so much, and I hope to hear from you again.

        Yup. Switching from the Win32 command to a Unix command in order to run the code on Unix is a good idea :)

        With that change made, the code should run correctly on any system that has a sort command that takes a -u option. If that is not the case on your system, you will need to investigate the syntax for the command options on your machine, verify that the command is being run correctly, and generally work out why your system is different.
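
        One thing worth checking on any system is whether the piped sort actually succeeded: close on a piped handle waits for the child and puts its exit status in $?, so a failing sort (for example, one that doesn't understand -u) won't slip by silently. A small standalone sketch of that check (the newline is added explicitly here because this snippet doesn't use the -l switch):

        use strict;
        use warnings;

        my @array = map { "Test" . int rand 32767 } 1 .. 1000;

        open my $pipe, '| sort -u > temp.tmp'
            or die "Cannot start sort: $!";
        print $pipe "$_\n" for @array;
        close $pipe
            or die $! ? "Error closing pipe to sort: $!"
                      : "sort exited with status " . ( $? >> 8 );

        print "sort -u completed; results are in temp.tmp\n";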


Re: Saving an array to a disk file
by Samy_rio (Vicar) on May 26, 2006 at 03:50 UTC

    Hi, try this:

    TIMTOWTDI

    use strict;
    use warnings;

    my @array = qw(foo bar baz);

    open OUT, ">filename.txt" or die "$!";
    $, = "\n";
    print OUT @array;
    close OUT;

    Regards,
    Velusamy R.


    eval"print uc\"\\c$_\""for split'','j)@,/6%@0%2,`e@3!-9v2)/@|6%,53!-9@2~j';

Re: Saving an array to a disk file
by GrandFather (Saint) on May 26, 2006 at 03:24 UTC

    Have you tried the simple approach:

    use strict;
    use warnings;

    my @array = (...);

    open OUTFILE, '>', 'filename' or die "Couldn't create filename: $!";
    print OUTFILE join "\n", @array;
    close OUTFILE;

    Update: fixed i/o direction and error string special variable!


    DWIM is Perl's answer to Gödel

      Just a quick question because I don't know if Perl would optimize that code but... would the join not be a bit onerous if the array is very large? (Also, cleaned up the code a little.)

      Perhaps a foreach loop?

      use strict;
      use warnings;

      my @array = ( 'array', 'stuff', 'here' );

      open OUTFILE, '>', 'filename' or die "Couldn't create filename: $!";
      print OUTFILE $_, "\n" foreach @array;
      close OUTFILE;

      Update: D'oh. Cleaned up my comment a little... would help if I could type

        Hard to be sure. The join generates a large string that is output as one large write (maybe). The foreach (possibly) generates a large number of small writes. For file I/O, one large write (the join) is generally much better than a large number of smaller writes (the foreach).

        On the other hand, if the string gets so large it starts swapping out to disk, then the join is likely to be much worse.
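
        If it matters, it is easy to measure rather than guess. A rough sketch using the core Benchmark and File::Temp modules (results will vary a lot with OS, filesystem and buffering):

        use strict;
        use warnings;
        use Benchmark qw(cmpthese);
        use File::Temp qw(tempfile);

        my @array = map { "Test" . int rand 32767 } 1 .. 100_000;

        cmpthese( 20, {
            one_big_join => sub {
                my ( $fh, $fn ) = tempfile( UNLINK => 1 );
                print {$fh} join( "\n", @array ), "\n";
                close $fh;
            },
            many_small_prints => sub {
                my ( $fh, $fn ) = tempfile( UNLINK => 1 );
                print {$fh} $_, "\n" for @array;
                close $fh;
            },
        } );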


Re: Saving an array to a disk file
by salva (Canon) on May 26, 2006 at 13:34 UTC
    Once you have accepted that you have to use external storage, a sort-based solution is not the most convenient any more.

    At least in theory, using a hash with on-disk storage (e.g. DB_File) is a better way, with O(N) cost:

    my @data = qw(foo bar foo bar bar doz);

    use DB_File;
    use File::Temp qw(tempfile);

    my ( $fh, $fn ) = tempfile( UNLINK => 1 );
    tie my %u, 'DB_File', $fn, O_RDWR, 0600;

    $u{$_} = undef for @data;
    @data = keys %u;

    print "@data\n";

      Using the default settings, this takes just over 8 minutes for 1e7 elements, compared to just over 3 minutes using sort. Berkeley DB is very fast once the data is in the DB, but getting it in takes time, which makes it not so useful for disposable applications like this.

      Having been down this road before, I know there are myriad options that can be used to tune the insert performance, but most of them involve using extra memory to buffer the DB, and memory is the premium item here.
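
      For the record, one of those knobs in DB_File is the cache attached to the database, set through the pre-defined $DB_HASH info object before tie-ing. A sketch only - the 16MB figure is an arbitrary assumption, and, as noted above, it spends exactly the memory we are trying to conserve:

      use strict;
      use warnings;
      use DB_File;
      use File::Temp qw(tempfile);

      # Give the underlying Berkeley DB hash a larger in-memory cache (in bytes).
      $DB_HASH->{cachesize} = 16 * 1024 * 1024;

      my ( $fh, $fn ) = tempfile( UNLINK => 1 );
      tie my %u, 'DB_File', $fn, O_RDWR, 0600, $DB_HASH
          or die "Cannot tie $fn: $!";

      $u{$_} = undef for qw(foo bar foo baz);
      printf "%d unique keys stored\n", scalar keys %u;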


      Thanks so much again for the reply, salva.

      BTW, suppose I have two or more programs (each containing your snippet above) running at the same time.
      Will they conflict? If so, how can I avoid that?
        As you can see from BrowserUk's response to my previous post, it seems that in practice the sort solution performs better than the DB_File-based one. I guess it's due to the sorting algorithm being more cache-friendly (where cache = RAM).

        Anyway, answering your question: yes, you can have several processes running the code in my post at the same time, though it would probably be slower than running them in sequence.