in reply to huge file->hash

I'm not sure I fully understand your goal, but I think you do not want to try loading the path_n or path_p file into a hash.

If I read your post correctly, your script will have three inputs:

  1. dif_file -- the list of unique path strings to look for
  2. path_n -- one of the huge files
  3. path_p -- the other huge file

You didn't say exactly, but I assume that your goal is to find the records in each path file that match one of the unique path strings, and do something with the full content of those records -- maybe just output those records.

If I got that right, then the method you want is something like this:

  1. read dif_file into a hash, using each path name as a hash key; you don't need to worry about what the hash value is -- you could do $diff_hash{$path} = undef; (a short sketch of this step follows the list)
  2. open path_n, read it one line at a time, assign the path string to a variable, and see if %diff_hash contains an element with that path as the hash key; if so, print out the full record, otherwise go on to the next record
  3. open path_p, and do the same thing you did for path_n
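
For step 1, a minimal sketch, assuming dif_file holds one path per line (adjust the parsing if your dif_file marks its two lists differently):

  my %diff_hash;
  open DIF, 'dif_file' or die "dif_file: $!";
  while (<DIF>) {
      chomp;
      next unless /\S/;           # skip blank lines
      $diff_hash{$_} = undef;     # the path is the key; the value doesn't matter
  }
  close DIF;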

So the loop over the path_* files might look like this:

  for my $file ( qw/path_n path_p/ ) {
      open PATH, $file or die "$file: $!";
      while (<PATH>) {
          # use a regex match with parens to capture path string in $1
          # and test to see if the path string was in dif_file:
          if ( /^(\S+)/ and exists( $diff_hash{$1} )) {
              print;   # or whatever else you need to do with this record
          }
      }
      close PATH;
  }

Replies are listed 'Best First'.
Re^2: huge file->hash
by ISAI student (Scribe) on Apr 21, 2005 at 07:43 UTC
    Sorry about that. The answer is this:
    The dif_file contains two lists.
    1. paths to be looked up in file path_n
    2. paths to be looked up in file path_p
    I know how to map from each list to its file and vice versa.
    By reading the path data from one huge file at a time (a few lines for each path), I can go and look in its respective list. If it is there, I should do X and continue.
    My idea was to convert these two lists into hashes, and to keep deleting found paths from the hashes. It is relatively easy to generate the key, so using exists($hash{$key}) shouldn't be much of a problem.
    That way, when the two hashes are empty, I can simply write my final output and exit, and not read the two huge files for paths that aren't needed.
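
    A minimal sketch of that delete-as-you-go idea (the hash names %paths_n and %paths_p are made up here for the two lists loaded from dif_file, and each relevant line is assumed to start with its path -- adjust to the real record layout):

        # %paths_n and %paths_p: the two lists from dif_file, loaded as hashes (not shown)
        for my $pair ( [ 'path_n', \%paths_n ], [ 'path_p', \%paths_p ] ) {
            my ($file, $want) = @$pair;
            open PATH, $file or die "$file: $!";
            while (<PATH>) {
                last unless %$want;                           # nothing left to find in this file
                next unless /^(\S+)/ and exists $want->{$1};
                # ... do X with this path's data ...
                delete $want->{$1};                           # found it, stop looking for it
            }
            close PATH;
        }
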
    Is it more easily understood now?

      The size of your hashes in memory is going to be your limiting factor.

      You say each of your files is approx. 2GB in size, and each element is "several lines". You certainly won't be able to store a hash containing all the data from both files in memory on most normal PCs, which typically have a 2GB/process RAM limit.

      If your paths are an average of, say, 5 lines of 80 chars in length, then that gives approximately 5 million paths in each file. And if the keys to your paths were, say, 10 chars long, then you could represent each file in memory with a hash that has 5 million 10-char keys and a single integer value -- the byte offset into the file. That approximates to 250MB per file, which would theoretically allow you to index both your huge files in memory simultaneously, if the machine this is going to run on has a full complement of RAM.
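
      A rough sketch of that kind of byte-offset index, assuming (purely for illustration) a record layout where each record's first line starts with its path and the continuation lines are indented:

          my %index;                         # 10-char key => byte offset of the record
          open PATH, 'path_n' or die "path_n: $!";
          my $offset = tell PATH;
          while (<PATH>) {
              if ( /^(\S+)/ ) {              # a new record starts on this line
                  $index{ substr($1, 0, 10) } = $offset;
              }
              $offset = tell PATH;           # file position of the next line
          }
          # keep PATH open; later:  seek PATH, $index{$key}, 0;  then read that record's lines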

      However, if your paths are substantially shorter on average (giving more paths) or your keys are substantially longer, then the memory requirement for each hash grows and you start to move to the point where you are pushing the limits of what is possible.

      Your alternative in that case is to look at further reducing the memory requirements of storing the index, which basically means using (or writing) some less memory-hungry alternative to hashes. Tie::SubstrHash comes with (most) distributions of perl, and whilst it's a bit awkward to use and substantially slower than a real hash, it does reduce the memory requirements markedly.
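
      For what it's worth, a minimal sketch of the Tie::SubstrHash interface (the 10-byte keys, 4-byte packed offsets and 5-million-entry table are illustrative numbers only):

          use Tie::SubstrHash;

          my %index;
          tie %index, 'Tie::SubstrHash', 10, 4, 5_000_000;

          # keys must be exactly 10 bytes and values exactly 4 bytes,
          # so pad/truncate the key and pack the offset
          my $key = sprintf '%-10.10s', '/some/path';
          $index{$key} = pack 'N', 123_456;            # store an offset (files < 4GB)
          my $offset   = unpack 'N', $index{$key};     # fetch it back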

      Or, you could load your data into some flavour of DB.

      A Berkeley DB might fit the bill, but it requires moving your datafiles into the Berkeley format. This is a fairly slow process, and the resultant files will be much larger than the equivalent flat files. Once transferred, and with the optimal configuration, the actual lookups during processing will be reasonably fast -- say 4x to 10x slower than a normal hash.
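
      A sketch of that one-off conversion using DB_File (which fronts Berkeley DB and ships with most perls); the record-layout assumption is the same as in the sketch above:

          use DB_File;
          use Fcntl qw(O_RDWR O_CREAT);

          my %db;
          tie %db, 'DB_File', 'path_n.index', O_RDWR|O_CREAT, 0644, $DB_BTREE
              or die "path_n.index: $!";

          open PATH, 'path_n' or die "path_n: $!";
          my $offset = tell PATH;
          while (<PATH>) {
              $db{$1} = $offset if /^(\S+)/;   # path => byte offset, stored on disk
              $offset = tell PATH;
          }
          close PATH;
          untie %db;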

      Or you could use a general-purpose RDBMS. Again, the initial loading of the flat files into the DB will require a fair amount of time, and the disk space requirement climbs substantially; and since each lookup will require a separate SQL statement, with its communications overhead, the processing stage will be substantially slower than the Berkeley option.
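
      And a sketch of what the lookup side might look like through DBI -- SQLite is just one possible backend, and the records(path, data) table is made up for illustration:

          use DBI;

          my $dbh = DBI->connect('dbi:SQLite:dbname=paths.db', '', '',
                                 { RaiseError => 1 });
          my $sth = $dbh->prepare('SELECT data FROM records WHERE path = ?');

          my @wanted_paths = ('/some/path');    # stand-in for the paths from dif_file
          for my $path (@wanted_paths) {
              $sth->execute($path);
              while ( my ($data) = $sth->fetchrow_array ) {
                  print $data;                  # or whatever needs doing with the record
              }
          }
          $dbh->disconnect;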

      Which approach is best for your situation really depends on how often you would need to do the processing, how often you would need to reload the DB, and whether you could change the process that produces the two files to write the data directly to one form of DB or the other.

      You haven't provided any clues to the numbers of records (paths), or their sizes, or the key lengths, or the frequencies of the tasks, so the best that anyone can do is give you a very general reply.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco.
      Rule 1 has a caveat! -- Who broke the cabal?
      Is it more easily understood now? Um, sorry, but no, I'm afraid it is not.

      Go ahead and try something out. If what you try works to your satisfaction, great. If not, and if you would like us to help, show us the code you tried, along with a minimal sample of input data from each file.

      And if you can, try to state clearly what your "final output" is supposed to be.