in reply to Re^3: Indexing two large text files
in thread Indexing two large text files

A 350MB total-size file can simply fit in memory and be done with it. (I know that you have recently dealt with files that are several orders of magnitude larger.)

Slurped into a scalar, okay. But for the OP's purpose, he would need to build a hash from it, and that would require 52.5GB of RAM.

Not impossible, for sure, but it would (still, currently) take a machine a cut (or two) above the average commodity box, many of whose motherboards are still limited to 16 or 32GB.
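
For scale: a rough way to arrive at a figure like that is to build a small sample hash, measure it with Devel::Size, and multiply up to the full record count. A minimal sketch (the sample size, 50-byte values and 350-million-record target are purely illustrative; per-entry overhead varies with the perl build):

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Devel::Size qw(total_size);

    # Illustrative numbers only: measure a 1-million-entry sample hash
    # with 50-byte values, then scale up to 350 million records.
    my $sample  = 1_000_000;
    my $records = 350_000_000;

    my %h;
    $h{ $_ } = 'x' x 50 for 1 .. $sample;

    my $per_record = total_size( \%h ) / $sample;
    printf "~%.0f bytes/record  =>  ~%.1fGB for %d records\n",
        $per_record, $per_record * $records / 2**30, $records;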


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

Re^5: Indexing two large text files
by aaron_baugher (Curate) on Apr 10, 2012 at 15:15 UTC

    I feel like I'm missing something. Why would it take 52GB of memory to build a hash from 350MB of data? Does the hash overhead really take 150 times as much space as the data itself? I just wrote a little script that takes one of my httpd logs, splits each line on the first ", and uses those two sections as key and value of a hash. This log file is 27MB, and Devel::Size->total_size says the resulting hash is 38MB. That's 40% overhead, which seems much more reasonable, and would mean the original poster's 350MB might take up 500MB as a hash, still well within his limits.
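
    A minimal sketch of that kind of check (the log path here is just an example; it splits on the first " of each line, as described above):

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Devel::Size qw(total_size);

    # Example path only -- point this at any httpd access log.
    my $log = '/var/log/httpd/access_log';

    my %h;
    open my $in, '<', $log or die "$log: $!";
    while(<$in>){
        chomp;
        # split on the first " only: everything before it is the key,
        # everything after it is the value
        my( $key, $value ) = split /"/, $_, 2;
        next unless defined $value;   # skip lines without a quote
        $h{$key} = $value;
    }
    close $in;

    printf "log: %d bytes, hash: %d bytes\n", -s $log, total_size( \%h );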

    Aaron B.
    My Woefully Neglected Blog, where I occasionally mention Perl.

      I did this:

      C:\test>p1
      $h{ $_ } = 'x'x50 for 1 .. 10e6;;
      print total_size( \%h );;
      1583106697

      print 1583106697 * 35;;
      55408734395

      Which, looking back at the OP, means I calculated the size of a hash built from a 350-million-record file instead of a 350MB file. My mistake.

      A more appropriate figure for the OP's 350MB file is 3.8GB:

      C:\test>dir file2x
      10/04/2012  17:27       369,499,228 file2x

      C:\test>perl -nle"($k,$v)=split '\*'; $h{$k}=$v }{ print 'Check mem'; <>" file2x
      Check mem
      3.8GB

      I did try to use the latest Devel::Size to do the measurement, but it pushed the memory usage over 8GB before crashing. Looks like it is time for a new release of my unauthorised version.



        Interesting. That's still more than 9 bytes of hash overhead for every byte of data, which seems like a lot. My own test script is below, following the results. It creates a 350MB file with random keys and values separated by a *, and then reads that file into a hash. I figured there was enough randomness to make duplicate keys (which would reduce the hash size) unlikely, but I added a check to be sure. In my test, running on 64-bit Linux, Devel::Size reports that the hash is just about 3 times the size of the file, or 2 bytes of overhead for each byte of data. A check on the memory size of the program after building the hash shows about 1.4GB in use, or close to 4 times the size of the file, so it might get killed after all on his system with a 1GB/process cap.

        That's still a far cry from your 3.8GB and 8GB+, though. Is Perl on Windows just that much less efficient with RAM for some reason? I realize that the shorter the keys and values, and thus the more of them there are in the file, the more overhead there is likely to be, but that's a big difference.

        bannor:~/work/perl/monks$ perl 964355.pl
        File size: 367001600
        keys: 6924700
        size: 1129106184
        Overhead: 67.50%
        abaugher 11340 96.6 33.9 1402520 1376916 pts/3 S+ 17:25 4:16 perl 964355.pl

        bannor:~/work/perl/monks$ cat 964355.pl
        #!/usr/bin/env perl
        use Modern::Perl;
        use Devel::Size qw(total_size);

        # create a 350MB file with a single * in each line
        # dividing keys and values of random lengths of 10..40 chars
        open my $out, '>', 'bigfile' or die $!;
        while( -s 'bigfile' < 350*1024*1024 ){
            my $part1 = join '', map { ('A'..'Z','a'..'z',0..9)[rand(62)] } (0..(rand(30)+10));
            my $part2 = join '', map { ('A'..'Z','a'..'z',0..9)[rand(62)] } (0..(rand(30)+10));
            print $out "$part1*$part2\n";
        }
        my $filesize = -s 'bigfile';
        say 'File size: ', $filesize;

        # now process the file into a hash and analyze the hash
        my %h;
        open my $in, '<', 'bigfile' or die $!;
        while(<$in>){
            chomp;
            my($unus, $duo) = split '\*';
            die "Duplicate key!" if $h{$unus};   # no duplicates
            $h{$unus} = $duo;
        }
        close $in;
        say 'keys: ', scalar keys %h;
        my $totalsize = total_size(\%h);
        say 'size: ', $totalsize;
        printf "Overhead: %.2f%%\n", ($totalsize - $filesize)*100/$totalsize;
        print `ps auxww|grep 964355.pl`;

        Aaron B.
        My Woefully Neglected Blog, where I occasionally mention Perl.