Interesting. That's still more than 9 bytes of hash overhead for every byte of data, which seems like a lot. My own test script is below, following its results. It creates a 350MB file with random keys and values separated by a single *, then reads that file into a hash. I figured there was enough randomness to make duplicate keys (which would shrink the hash) unlikely, but I added a check to be sure. In my test, running on 64-bit Linux, Devel::Size reports the hash at just about 3 times the size of the file, or about 2 bytes of overhead for each byte of data. Checking the program's memory size after building the hash shows about 1.4GB in use, or close to 4 times the size of the file, so it might get killed after all on the OP's system with its 1GB-per-process cap.
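To see where those bytes go on a smaller scale, here's a minimal sketch using the same Devel::Size approach (the entry sizes here are arbitrary, just for illustration); it reports the average per-entry overhead for a small hash:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use feature 'say';
    use Devel::Size qw(total_size);

    # Fill a small hash with entries of known size, then compare the
    # raw payload bytes to what Devel::Size reports for the structure.
    my %h;
    my $payload = 0;
    for my $i (1 .. 1000) {
        my $k = sprintf 'key%06d', $i;   # 9-byte keys
        my $v = 'v' x 20;                # 20-byte values
        $h{$k} = $v;
        $payload += length($k) + length($v);
    }
    my $total = total_size(\%h);
    say "payload bytes: $payload";
    say "total_size:    $total";
    printf "overhead per entry: %.1f bytes\n", ($total - $payload) / 1000;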

That's still a far cry from your 3.8GB and 8GB+, though. Is Perl on Windows just that much less efficient with RAM for some reason? I realize that the shorter the keys and values, and thus the more of them there are in the file, the more overhead there is likely to be, but that's a big difference.
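If entry count really is the main driver, that's easy to test directly. This sketch (same Devel::Size approach; the ~1MB payload and the two lengths are just assumptions for illustration) stores the same total number of key+value bytes as many short entries and again as a few long ones:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Devel::Size qw(total_size);

    # Same total payload, two shapes: many short entries vs. few long
    # ones. Per-entry overhead is roughly fixed, so the short-entry
    # hash should blow up proportionally more.
    for my $len (10, 1000) {
        my %h;
        my $entries = int( 1_000_000 / (2 * $len) );
        $h{ sprintf '%0*d', $len, $_ } = 'x' x $len for 1 .. $entries;
        my $payload = $entries * 2 * $len;
        my $total   = total_size(\%h);
        printf "len %4d: %6d entries, payload %7d, total %9d (%.2fx)\n",
            $len, $entries, $payload, $total, $total / $payload;
    }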

bannor:~/work/perl/monks$ perl 964355.pl
File size: 367001600
keys: 6924700
size: 1129106184
Overhead: 67.50%
abaugher 11340 96.6 33.9 1402520 1376916 pts/3 S+ 17:25 4:16 perl 964355.pl
bannor:~/work/perl/monks$ cat 964355.pl
#!/usr/bin/env perl
use Modern::Perl;
use Devel::Size qw(total_size);

# create a 350MB file with a single * in each line
# dividing keys and values of random lengths of 10..40 chars
open my $out, '>', 'bigfile' or die $!;
while( -s 'bigfile' < 350*1024*1024 ){
    my $part1 = join '', map { ('A'..'Z','a'..'z',0..9)[rand(62)] } (0..(rand(30)+10));
    my $part2 = join '', map { ('A'..'Z','a'..'z',0..9)[rand(62)] } (0..(rand(30)+10));
    print $out "$part1*$part2\n";
}
my $filesize = -s 'bigfile';
say 'File size: ', $filesize;

# now process the file into a hash and analyze the hash
my %h;
open my $in, '<', 'bigfile' or die $!;
while(<$in>){
    chomp;
    my($unus, $duo) = split '\*';
    die "Duplicate key!" if $h{$unus};   # no duplicates
    $h{$unus} = $duo;
}
close $in;
say 'keys: ', scalar keys %h;
my $totalsize = total_size(\%h);
say 'size: ', $totalsize;
printf "Overhead: %.2f%%\n", ($totalsize - $filesize)*100/$totalsize;
print `ps auxww|grep 964355.pl`;
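As a cross-check on the ps numbers: on Linux the same figure can be read straight from /proc. A minimal sketch (Linux-only; vmrss_kb is a hypothetical helper, not part of the script above):

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use feature 'say';

    # Read this process's resident set size from /proc -- the same
    # kB figure ps shows in its RSS column. Linux-only.
    sub vmrss_kb {
        open my $fh, '<', '/proc/self/status' or die $!;
        while (<$fh>) {
            return $1 if /^VmRSS:\s+(\d+)\s+kB/;
        }
        return;
    }

    say 'RSS before: ', vmrss_kb(), ' kB';
    my %h = map { $_ => 'x' x 100 } 1 .. 100_000;
    say 'RSS after:  ', vmrss_kb(), ' kB';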

Aaron B.
My Woefully Neglected Blog, where I occasionally mention Perl.


In reply to Re^7: Indexing two large text files by aaron_baugher
in thread Indexing two large text files by never_more
