in reply to Re^4: Indexing two large text files
in thread Indexing two large text files

I feel like I'm missing something. Why would it take 52GB of memory to build a hash from 350MB of data? Does the hash overhead really take 150 times as much space as the data itself? I just wrote a little script that takes one of my httpd logs, splits each line on the first ", and uses those two sections as key and value of a hash. This log file is 27MB, and Devel::Size->total_size says the resulting hash is 38MB. That's 40% overhead, which seems much more reasonable, and would mean the original poster's 350MB might take up 500MB as a hash, still well within his limits.
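The split-on-the-first-quote step described above can be sketched in a few lines of core Perl. The log lines here are made-up stand-ins, not the actual httpd data; the point is the LIMIT argument to split, which stops after the first `"`:

```perl
use strict;
use warnings;

# Hypothetical stand-ins for httpd log lines; everything before the
# first double-quote becomes the key, everything after it the value.
my @lines = (
    '127.0.0.1 - - [10/Apr/2012:16:11:00] "GET /index.html HTTP/1.1" 200',
    '10.0.0.2 - - [10/Apr/2012:16:11:01] "POST /form HTTP/1.1" 302',
);

my %h;
for my $line (@lines) {
    # LIMIT of 2 => split only on the FIRST " in the line
    my ( $key, $value ) = split /"/, $line, 2;
    $h{$key} = $value;
}

print scalar( keys %h ), " keys\n";
```

Note that with real combined-format logs, lines sharing the same prefix would collide on the key and shrink the hash, which is why duplicate keys matter in this kind of measurement.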

Aaron B.
My Woefully Neglected Blog, where I occasionally mention Perl.

Re^6: Indexing two large text files
by BrowserUk (Patriarch) on Apr 10, 2012 at 16:11 UTC

    I did this:

    C:\test>p1
    $h{ $_ } = 'x'x50 for 1 .. 10e6;;
    print total_size( \%h );;
    1583106697

    print 1583106697 * 35;;
    55408734395

    Which, looking back at the OP, means I calculated the size for a 350 million record file instead of a 350MB file. My mistake.

    A more appropriate figure for the OP's 350MB file is 3.8GB:

    C:\test>dir file2x
    10/04/2012  17:27       369,499,228 file2x

    C:\test>perl -nle"($k,$v)=split '\*'; $h{$k}=$v }{ print 'Check mem'; <>" file2x
    Check mem
    3.8GB

    I did try to use the latest Devel::Size to do the measurement, but it pushed the memory usage over 8GB before crashing. Looks like it is time for a new release of my unauthorised version.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      Interesting. That's still more than 9 bytes of hash overhead for every byte of data, which seems like a lot. My own test script is below, following the results. It creates a 350MB file with random keys and values separated by a *, and then reads that file into a hash. I figured there was enough randomness to make duplicate keys (which would reduce the hash size) unlikely, but I added a check to be sure. In my test, running on 64-bit Linux, Devel::Size reports that the hash is just about 3 times the size of the file, or 2 bytes of overhead for each byte of data. A check on the memory size of the program after building the hash shows about 1.4GB in use, or close to 4 times the size of the file, so it might get killed after all on his system with a 1GB/process cap.

      That's still a far cry from your 3.8GB and 8GB+, though. Is Perl on Windows just that much less efficient with RAM for some reason? I realize that the shorter the keys and values, and thus the more of them there are in the file, the more overhead there is likely to be, but that's a big difference.

      bannor:~/work/perl/monks$ perl 964355.pl
      File size: 367001600
      keys: 6924700
      size: 1129106184
      Overhead: 67.50%
      abaugher 11340 96.6 33.9 1402520 1376916 pts/3 S+ 17:25 4:16 perl 964355.pl

      bannor:~/work/perl/monks$ cat 964355.pl
      #!/usr/bin/env perl
      use Modern::Perl;
      use Devel::Size qw(total_size);

      # create a 350MB file with a single * in each line
      # dividing keys and values of random lengths of 10..40 chars
      open my $out, '>', 'bigfile' or die $!;
      while( -s 'bigfile' < 350*1024*1024 ){
          my $part1 = join '', map { ('A'..'Z','a'..'z',0..9)[rand(62)] } (0..(rand(30)+10));
          my $part2 = join '', map { ('A'..'Z','a'..'z',0..9)[rand(62)] } (0..(rand(30)+10));
          print $out "$part1*$part2\n";
      }
      my $filesize = -s 'bigfile';
      say 'File size: ', $filesize;

      # now process the file into a hash and analyze the hash
      my %h;
      open my $in, '<', 'bigfile' or die $!;
      while(<$in>){
          chomp;
          my( $unus, $duo ) = split '\*';
          die "Duplicate key!" if $h{$unus};    # no duplicates
          $h{$unus} = $duo;
      }
      close $in;
      say 'keys: ', scalar keys %h;
      my $totalsize = total_size(\%h);
      say 'size: ', $totalsize;
      printf "Overhead: %.2f%%\n", ($totalsize - $filesize)*100/$totalsize;
      print `ps auxww | grep 964355.pl`;

      Aaron B.
      My Woefully Neglected Blog, where I occasionally mention Perl.

        Okay. The reasons for the discrepancy between your figures and mine:

        1. Your records are longer than mine, meaning fewer records in the same size file.

          Your records average 52 chars each: ( 10 + (0..30) ) * 2 + 2; giving 350MB / 52 ~ 7 million records.

          My records were 20 million x 17 chars:

          C:\test>perl -le"printf qq[key%07d*value\n], $_ for 1 .. 20e6" >bigfile

          C:\test>dir bigfile
          11/04/2012  04:34       370,000,001 bigfile

          The size of a hash is directly proportional to the number of key/value pairs, and far less influenced by the actual size of those keys & values:

          C:\test>perl -F\* -anle"$h{$F[0]}=$F[1] }{ print `tasklist /nh /fi \"pid eq $$\"`" bigfile
          perl.exe    3140 Console    1    4,509,624 K

          NOTE: That figure is 4,509,624 K, i.e. 4.3GB.
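The record-count arithmetic above can be checked with a few lines of core Perl. This is only a sketch: it assumes `0..30` averages to 15, hence the 52-char average record, and uses the 17-char fixed record from the `key%07d*value` generator:

```perl
use strict;
use warnings;

my $file_size = 350 * 1024 * 1024;    # 350MB, as in the OP's problem

# Random generator: two parts of 10 + rand(30) chars (average 25 each),
# plus the '*' separator and the newline => 52 chars per record on average.
my $avg_record = ( 10 + 30 / 2 ) * 2 + 2;
printf "average record: %d chars => ~%.1f million records\n",
    $avg_record, $file_size / $avg_record / 1e6;

# Fixed generator: "key%07d*value\n" => 17 chars per record,
# so the same 350MB holds roughly three times as many records.
my $fixed_record = length sprintf "key%07d*value\n", 1;
printf "fixed record: %d chars => ~%.1f million records\n",
    $fixed_record, $file_size / $fixed_record / 1e6;
```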

        2. You only measured the size of the final hash, not the total memory consumed in getting there.

          The 3.8GB/4.3GB numbers I measured are the memory acquired by the process in order to build the hash.

          As hashes fill, they reach a point where the current number of buckets is inadequate for the number of keys, so a new hash with double the number of buckets is created and the old hash is copied into it before the hash can continue expanding. This doubling happens many times when building a large hash. The point at which the hash doubles is hard to predict, as it depends not just on the number of keys contained, but also on the number of collisions in individual buckets.

          But for the sake of discussion, let's assume that the doubling occurs when the current bucket count -- always a power of 2 -- becomes 75% utilised. Holding 20 million key/value pairs then requires a bucket count of 2**25 = 33 million, and the last doubling happened when the previous bucket count of 2**24 = 16 million became inadequate. That means that in order to build the hash with 20 million keys, the process had to have memory allocated to hold 33 + 16 = 49 million key slots.
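Under that 75% assumption, the peak allocation can be sketched in a few lines. This is a model for discussion, matching the simplification above, not Perl's actual split policy:

```perl
use strict;
use warnings;

# Model: the bucket array doubles whenever the key count exceeds 75% of
# the current bucket count. (Perl's real split policy also considers
# per-bucket collisions; this is the simplified assumption above.)
my $keys     = 20_000_000;
my $buckets  = 8;           # some small initial power of two
my $exp      = 3;           # 2**3 == 8
my $previous = 0;

while ( $keys > 0.75 * $buckets ) {
    $previous = $buckets;   # the array we copy FROM still exists...
    $buckets *= 2;          # ...while the doubled array is allocated
    $exp++;
}

printf "final bucket count: %d (2**%d)\n", $buckets, $exp;
printf "peak slots during the last doubling: %d\n", $buckets + $previous;
```

Running it reproduces the 33 million + 16 million = ~49 million slots figure.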

          The effects of the doubling can be clearly seen in a time trace of CPU/memory and IO activity.

          By the time you measure the memory, the space allocated to the smaller hash has been returned to the runtime for reallocation -- but not to the OS. And not all of those key slots had values associated (in either hash), so they didn't use as much space as fully allocated key/value pair structures, but the bucket slots needed to be there anyway.
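One way to watch the process-level growth (rather than the structure-level figure Devel::Size reports) is to sample the process's resident set size before and after building the hash. This sketch assumes Linux's /proc filesystem, not the Windows `tasklist` used above:

```perl
use strict;
use warnings;

# Read this process's resident set size in KB from /proc
# (Linux-only; a stand-in for the `tasklist`/`ps` checks above).
sub rss_kb {
    open my $fh, '<', "/proc/$$/status" or return 0;
    while (<$fh>) { return $1 if /^VmRSS:\s+(\d+)/ }
    return 0;
}

my $before = rss_kb();

my %h;
$h{"key$_"} = 'value' for 1 .. 500_000;

my $after = rss_kb();
printf "%d pairs grew the process by ~%d KB\n",
    scalar( keys %h ), $after - $before;
```

Because this measures memory the process acquired from the OS, it includes the space used by the discarded smaller bucket arrays, which a post-hoc measurement of the final structure cannot see.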

          The upshot is that even if the OP had a more normal 32-bit memory limit of 2GB or 3GB, he wouldn't have enough memory to build the hash, even though the final structure might just squeeze into RAM.

        A few further comments:

        • I'll leave it to you to decide whether my dummy records or your dummy records (length) more closely resemble those shown by the OP.
        • The disparity in your example, between the size measured by Devel::Size and that shown by the ps output, is:

          in part due to the space required to construct the hash, as detailed above; and in part due to the memory used by Devel::Size itself when performing the measurement. Run your ps before, as well as after, using Devel::Size to see the memory the latter needs in order to calculate the size of the hash.

        • Finally (and more tongue-in-cheek than criticism):

          The method you used to construct your data file -- stat-ing the file after every write -- is sooo sloooow!

          A faster alternative would be:

          open my $out, '>', 'bigfile' or die $!;
          my $l = 0;
          while( $l < 350*1024*1024 ){
              my $part1 = join '', map { ('A'..'Z','a'..'z',0..9)[rand(62)] } (0..(rand(30)+10));
              my $part2 = join '', map { ('A'..'Z','a'..'z',0..9)[rand(62)] } (0..(rand(30)+10));
              my $output = "$part1*$part2\n";
              $l += length $output;
              print $out $output;
          }
          my $filesize = -s 'bigfile';
          say 'File size: ', $filesize;

          The first time I ran your code, I thought my disk must have died, it took so long. :)
