in reply to Re^3: Indexing two large text files
in thread Indexing two large text files

A 350MB total-size file can simply fit in memory and be done with it. (I know that you have recently dealt with files that are several orders of magnitude larger.)

Slurped into a scalar, okay. But for the OP's purpose, he would need to build a hash from it, and that would require 52.5GB of RAM.

Not impossible, for sure, but it would (still, currently) take a machine a cut (or two) above the average commodity box, many of whose motherboards are still limited to 16 or 32GB.
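
For scale: a rough way to arrive at a figure like that is to build a small sample hash, measure it with Devel::Size, and multiply up to the full record count. A minimal sketch (the sample size, 50-byte values and 350-million-record target are purely illustrative; per-entry overhead varies with the perl build):

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Devel::Size qw(total_size);

    # Illustrative numbers only: measure a 1-million-entry sample hash
    # with 50-byte values, then scale up to 350 million records.
    my $sample  = 1_000_000;
    my $records = 350_000_000;

    my %h;
    $h{ $_ } = 'x' x 50 for 1 .. $sample;

    my $per_record = total_size( \%h ) / $sample;
    printf "~%.0f bytes/record  =>  ~%.1fGB for %d records\n",
        $per_record, $per_record * $records / 2**30, $records;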


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

Re^5: Indexing two large text files
by aaron_baugher (Curate) on Apr 10, 2012 at 15:15 UTC

    I feel like I'm missing something. Why would it take 52GB of memory to build a hash from 350MB of data? Does the hash overhead really take 150 times as much space as the data itself? I just wrote a little script that takes one of my httpd logs, splits each line on the first ", and uses those two sections as key and value of a hash. This log file is 27MB, and Devel::Size->total_size says the resulting hash is 38MB. That's 40% overhead, which seems much more reasonable, and would mean the original poster's 350MB might take up 500MB as a hash, still well within his limits.
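
    A minimal sketch of that kind of check (the log path here is just an example; it splits on the first " of each line, as described above):

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Devel::Size qw(total_size);

    # Example path only -- point this at any httpd access log.
    my $log = '/var/log/httpd/access_log';

    my %h;
    open my $in, '<', $log or die "$log: $!";
    while(<$in>){
        chomp;
        # split on the first " only: everything before it is the key,
        # everything after it is the value
        my( $key, $value ) = split /"/, $_, 2;
        next unless defined $value;   # skip lines without a quote
        $h{$key} = $value;
    }
    close $in;

    printf "log: %d bytes, hash: %d bytes\n", -s $log, total_size( \%h );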

    Aaron B.
    My Woefully Neglected Blog, where I occasionally mention Perl.

      I did this:

      C:\test>p1
      $h{ $_ } = 'x'x50 for 1 .. 10e6;;
      print total_size( \%h );;
      1583106697

      print 1583106697 * 35;;
      55408734395

      Which, looking back at the OP, means I calculated the size of a hash built from a 350-million-record file instead of a 350MB file. My mistake.

      A more appropriate figure for the OP's 350MB file is 3.8GB:

      C:\test>dir file2x
      10/04/2012  17:27       369,499,228 file2x

      C:\test>perl -nle"($k,$v)=split '\*'; $h{$k}=$v }{ print 'Check mem'; <>" file2x
      Check mem
      3.8GB

      I did try to use the latest Devel::Size to do the measurement, but it pushed the memory usage over 8GB before crashing. Looks like it is time for a new release of my unauthorised version.



        Interesting. That's still more than 9 bytes of hash overhead for every byte of data, which seems like a lot. My own test script is below, following the results. It creates a 350MB file with random keys and values separated by a *, and then reads that file into a hash. I figured there was enough randomness to make duplicate keys (which would reduce the hash size) unlikely, but I added a check to be sure. In my test, running on 64-bit Linux, Devel::Size reports that the hash is just about 3 times the size of the file, or 2 bytes of overhead for each byte of data. A check on the memory size of the program after building the hash shows about 1.4GB in use, or close to 4 times the size of the file, so it might get killed after all on his system with a 1GB/process cap.

        That's still a far cry from your 3.8GB and 8GB+, though. Is Perl on Windows just that much less efficient with RAM for some reason? I realize that the shorter the keys and values, and thus the more of them there are in the file, the more overhead there is likely to be, but that's a big difference.

        bannor:~/work/perl/monks$ perl 964355.pl
        File size: 367001600
        keys: 6924700
        size: 1129106184
        Overhead: 67.50%
        abaugher 11340 96.6 33.9 1402520 1376916 pts/3 S+ 17:25 4:16 perl 964355.pl

        bannor:~/work/perl/monks$ cat 964355.pl
        #!/usr/bin/env perl
        use Modern::Perl;
        use Devel::Size qw(total_size);

        # create a 350MB file with a single * in each line
        # dividing keys and values of random lengths of 10..40 chars
        open my $out, '>', 'bigfile' or die $!;
        while( -s 'bigfile' < 350*1024*1024 ){
            my $part1 = join '', map { ('A'..'Z','a'..'z',0..9)[rand(62)] } (0..(rand(30)+10));
            my $part2 = join '', map { ('A'..'Z','a'..'z',0..9)[rand(62)] } (0..(rand(30)+10));
            print $out "$part1*$part2\n";
        }
        my $filesize = -s 'bigfile';
        say 'File size: ', $filesize;

        # now process the file into a hash and analyze the hash
        my %h;
        open my $in, '<', 'bigfile' or die $!;
        while(<$in>){
            chomp;
            my($unus, $duo) = split '\*';
            die "Duplicate key!" if $h{$unus};   # no duplicates
            $h{$unus} = $duo;
        }
        close $in;
        say 'keys: ', scalar keys %h;
        my $totalsize = total_size(\%h);
        say 'size: ', $totalsize;
        printf "Overhead: %.2f%%\n", ($totalsize - $filesize)*100/$totalsize;
        print `ps auxww|grep 964355.pl`;

        Aaron B.
        My Woefully Neglected Blog, where I occasionally mention Perl.