You can prepare the hash so that it contains, as the value associated with each key, the position of the start of the record in the file, and perhaps also the length of that record.
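For illustration, here is a minimal sketch of that offset-and-length approach in Perl. The filename, the choice of the first field as the key, and the pack format are my assumptions for the example, not anything from the original:

```perl
use strict;
use warnings;

# Create a small sample data file (purely illustrative).
open my $out, '>', 'records.txt' or die "open: $!";
binmode $out;   # avoid CRLF translation so tell/length stay consistent
print $out "alpha first record\n",
           "beta second record\n",
           "gamma third record\n";
close $out;

# Build the index: key => packed( offset, length ) for each record.
# The first whitespace-separated field serves as the key here.
my %index;
open my $fh, '<', 'records.txt' or die "open: $!";
binmode $fh;
while ( 1 ) {
    my $offset = tell $fh;
    my $line   = <$fh>;
    last unless defined $line;
    my( $key ) = split ' ', $line;
    # 'NN' packs offset and length into 8 bytes (32-bit each).
    $index{ $key } = pack 'NN', $offset, length $line;
}

# Later, fetch a record by seeking to its stored offset.
sub fetch {
    my $key = shift;
    my( $offset, $len ) = unpack 'NN', $index{ $key };
    seek $fh, $offset, 0 or die "seek: $!";
    read $fh, my( $record ), $len;
    return $record;
}

print fetch( 'beta' );   # beta second record
```

Note that the packed 8-byte value is already below the roughly 16-byte threshold discussed below, so the hash itself ends up no smaller than one storing short records directly.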
The problem with that is that very little of the size of a hash in memory is occupied by its values. The internal overhead associated with each value is such that until a value exceeds something like 16 bytes, there is no saving from storing a shorter one.
C:\test>p1
$h{ $_ } = 'xxxxxxxx' for 1 .. 1e6;;
print total_size( \%h );;
112277576

C:\test>p1
$h{ $_ } = 1 for 1 .. 1e6;;
print total_size( \%h );;
72277576
So unless the records are particularly long, there is little savings to be made by storing the file offset over storing the record itself.
I'm using a 64-bit perl, which means that integers occupy 8 bytes, which flatters my point somewhat; but even on a 32-bit perl, the savings are far from what you might at first expect. For example, the same 1-million-key hash with no values at all still occupies substantially the same space as one with values:
C:\test>p1
undef $h{ $_ } for 1 .. 1e6;;
print total_size( \%h );;
72277576
In summary: you can save some space by storing record positions instead of the records themselves, but with roughly 72 bytes of per-key overhead in the examples above, it doesn't buy you as much as you might at first think.
In reply to Re^2: Indexing two large text files
by BrowserUk
in thread Indexing two large text files
by never_more