in reply to Re^7: Indexing two large text files
in thread Indexing two large text files

Okay. The reasons for the discrepancy between your figures and mine:

  1. Your records are longer than mine, meaning fewer for the same size file.

    Your records average 52 chars each: ( 10 + avg(0..30) ) * 2 + 2; giving 350MB / 52 ~ 7 million records.

    My records were 20 million x 17 chars:

    C:\test>perl -le"printf qq[key%07d*value\n], $_ for 1 .. 20e6" >bigfil +e C:\test>dir bigfile 11/04/2012 04:34 370,000,001 bigfile

    The size of a hash is directly proportional to the number of key/value pairs, and far less influenced by the actual size of those keys & values:

    C:\test>perl -F\* -anle"$h{$F[0]}=$F[1] }{ print `tasklist /nh /fi \"pid eq $$\"`" bigfile
    perl.exe                 3140 Console            1     4,509,624 K

    NOTE: That figure is 4,509,624 K, i.e. 4.3GB.

  2. You only measured the size of the final hash, not the total memory consumed in getting there.

    The 3.8GB/4.3GB numbers I measured are the memory acquired by the process in order to build the hash.

    As hashes fill, they reach a point where the current number of buckets is inadequate to contain the number of keys, so a new bucket array with double the number of buckets is allocated and the old one is copied across before the hash can continue expanding. This doubling happens many times when building a large hash. The exact point at which the hash doubles in size is complicated, as it depends not just upon the number of keys contained but also upon the number of collisions in individual buckets. (A small sketch below shows the bucket count, and the process's memory high-water mark, at each doubling.)

    But for the sake of discussion, let's assume that the doubling occurs when the current bucket count -- always a power of 2 -- becomes 75% utilised. Holding 20 million key/value pairs then needs 20 million / 0.75 ~ 27 million buckets, which is more than 2**24 = 16 million, so the bucket count has to reach 2**25 = 33 million. That means that in order to build the hash with 20 million keys, the process had to have memory allocated to hold 33 + 16 = 49 million bucket slots. (The same arithmetic appears as a couple of lines of Perl below.)

    The effects of the doubling can be clearly seen in a time trace of CPU, memory and IO activity.

    By the time you measure the memory, the space allocated to the smaller hash has been returned to the runtime for reallocation -- but not to the OS. And not all of those bucket slots had keys and values associated with them (in either hash), so they didn't use as much space as fully populated key/value pair structures, but the bucket slots needed to be there anyway.

    The upshot is that even if the OP had a more typical 32-bit memory limit of 2GB or 3GB, he wouldn't have enough memory to build the hash, even though the final structure might just squeeze into RAM.
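
    To make the doubling visible, here is a minimal sketch. It assumes Perl 5.26+ (for Hash::Util::bucket_ratio) and Linux (for /proc/self/status) -- on Windows the tasklist one-liner above serves the same purpose -- and it reports the bucket count and the process's memory high-water mark each time the bucket array doubles while the hash is built:

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Report the bucket count and the process's peak memory (VmPeak)
        # every time the bucket array doubles during the build.
        use Hash::Util qw(bucket_ratio);

        sub vm_peak {
            open my $fh, '<', '/proc/self/status' or return 'n/a';
            while ( <$fh> ) { return "$1 kB" if /^VmPeak:\s+(\d+)/ }
            return 'n/a';
        }

        my %h;
        my $buckets = 0;
        for my $i ( 1 .. 5_000_000 ) {
            $h{ sprintf 'key%07d', $i } = 'value';
            my ( $total ) = bucket_ratio( %h ) =~ m{/(\d+)$};
            if ( $total != $buckets ) {
                printf "%9d keys: %9d buckets, VmPeak %s\n", $i, $total, vm_peak();
                $buckets = $total;
            }
        }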
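
    And the back-of-the-envelope arithmetic above, as a couple of lines of Perl. The 75% split threshold is the simplifying assumption stated above, not a claim about perl's actual rule:

        use strict;
        use warnings;
        use POSIX qw(ceil);

        my $keys     = 20_000_000;
        my $load     = 0.75;                                          # assumed split threshold
        my $final    = 2 ** ceil( log( $keys / $load ) / log(2) );    # 2**25 = 33,554,432
        my $previous = $final / 2;                                    # 2**24 = 16,777,216
        printf "final %d + previous %d = %d bucket slots during the last doubling\n",
            $final, $previous, $final + $previous;                    # ~49 million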

A few further comments:


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

Re^9: Indexing two large text files
by aaron_baugher (Curate) on Apr 12, 2012 at 16:58 UTC

    Yes, I figured someone might comment on my slow-as-molasses file-building code, but I only needed to run that part once, so I didn't bother trying to speed it up. :-)

    Thanks for the back and forth on this; it's been very instructive. In nearly all cases, I'd say that "put your filtering file in a hash and process the other file against it" is such a superior algorithm that it's worth trying, even if you suspect it's going to force swapping to disk. But your solution for processing it in chunks was nice for situations where that just isn't possible.

    Aaron B.
    My Woefully Neglected Blog, where I occasionally mention Perl.

      In nearly all cases, I'd say that "put your filtering file in a hash and process the other file against it" is such a superior algorithm

      I agree with your thoughts on the use of hashes for filtering.

      that it's worth trying, even if you suspect it's going to force swapping to disk.

      The problem is that hashes are just about the ultimate in 'cache unfriendly data structures'.

      Once built, even if only a relatively small percentage of the total structure needs to be swapped out at any given time, their nature means that you will need to swap in the bit that is out (and therefore swap out a bit that is currently in) for nearly every lookup.

      In turn, that has a disastrous effect upon the usually good performance of serially reading the other file from disk.

      And if you start swapping during hash creation, prior to the final doubling of the buckets, then the copying of the keys from the last intermediate stage to the final hash just sends the disk head nuts. It can take forever.

      It's a great way to check if your disk is working properly :)
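
      To make the 'cache unfriendly' point concrete, here is a small sketch. It assumes Perl 5.26+ (for Hash::Util::hash_value and num_buckets); the slot numbers are only illustrative, since hash_value may not use the same randomised seed as the hash itself, and they will differ from run to run -- which rather reinforces the point. Keys that arrive in nice serial order are scattered right across the bucket array, i.e. right across memory:

          #!/usr/bin/perl
          use strict;
          use warnings;

          # Consecutive keys land in widely scattered slots of the bucket
          # array, so lookups touch memory essentially at random.
          use Hash::Util qw(hash_value num_buckets);

          my %h;
          $h{ sprintf 'key%07d', $_ } = 'value' for 1 .. 1_000;

          my $buckets = num_buckets( %h );
          for my $i ( 1 .. 10 ) {
              my $key = sprintf 'key%07d', $i;
              printf "%s -> slot %4u of %u\n", $key, hash_value( $key ) % $buckets, $buckets;
          }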

