in reply to Re^4: write to Disk instead of RAM without using modules
in thread write to Disk instead of RAM without using modules

The code you've posted is rubbish. I.e.:

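  1. If you need to slurp files, then you should be setting $/ = undef, not $/ = "".

    $/ = "" is paragraph mode: each read stops at the first run of blank lines. It only "works" for your files by blind luck, presumably because they contain no blank lines.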

  2. Having slurped the entire file into a string, you then chomp it.

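    Except chomp removes the current value of $/ from the end of the string. With $/ = "" (paragraph mode) that means it strips only the newlines at the very end of the slurped string; with a proper $/ = undef slurp it would remove nothing at all. Either way, it achieves nothing useful here.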

  3. You then do my ($key, $value) = split ('\t', $_);

    But the files do not contain any tabs, so the result is that you've copied the entire file into $key and set $value to undef.

  4. You then split the file to an array of lines in order to pick out the sequence that you use as your $key1...

    Laborious, but okay.

  5. Then you do $seen{$key1} //= [ $key ];.

    I.e. you store a string, containing the entire file contents, in an anonymous array, and store that as the value indexed by the sequence.

    Why? Why store the entire contents of all the files, when you could read them back from disk at any time?

  6. Then you do push (@{$seen{$key1}}, $value);.

    But as explained above, $value will always be undef.

    What you are doing is storing the contents of all your files, and using arrays of undefs as a mechanism to count how many of those files each sequence appears in.

    And you wonder why you are running out of space!
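
If you want to see points 1 to 3 for yourself, here is a minimal sketch. The sample data is made up (I don't have your files), and it reads from an in-memory filehandle rather than from disk, so treat it as an illustration only:

```perl
use strict;
use warnings;

# Hypothetical stand-in for one input file: two paragraphs, no tabs.
my $data = "ACGT\nline two\n\nsecond paragraph\n";

# Paragraph mode ($/ = ""): the read stops after the first blank line.
my $para = do {
    local $/ = "";
    open my $fh, '<', \$data or die "open: $!";
    scalar <$fh>;
};

# Slurp mode ($/ = undef): the read returns the whole file.
my $whole = do {
    local $/;
    open my $fh, '<', \$data or die "open: $!";
    scalar <$fh>;
};

# chomp in paragraph mode strips only the trailing newlines ...
my $chomped_para = $para;
{ local $/ = ""; chomp $chomped_para; }

# ... and in slurp mode it removes nothing at all.
my $chomped_whole = $whole;
{ local $/; chomp $chomped_whole; }

# No tabs in the data, so split returns a single field:
# $key gets everything and $value is left undefined.
my ($key, $value) = split "\t", $whole;

printf "paragraph read: %d chars, slurp read: %d chars\n",
    length $para, length $whole;
printf "value is %s\n", defined $value ? 'defined' : 'undef';
```

With data like this the paragraph-mode read returns only the first paragraph, which is why $/ = "" is the wrong tool for slurping.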

And now you want to write that hash to disk, to avoid running out of memory! That's a really silly idea when most of the contents of that hash are already stored on disk in the files you are reading!

Why not just store the name of the file, read it again when you need it, and increment an integer for each file containing the sequence?

That would reduce the memory requirements of your application to ~300 bytes per unique sequence, regardless of the number of files they are in. At that rate a typical 8 GB system could handle at least 20 million files (even if every sequence were unique), and any number of duplicates, without running out of memory.
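
Something along these lines would do it. This is only a rough sketch: extract_sequence and tally_sequences are names I've made up, and I am ASSUMING the sequence is the second line of each file, so adjust that to your real layout. No modules needed.

```perl
use strict;
use warnings;

# Hypothetical helper: ASSUMES the sequence is the second line of the file.
# Reads line by line, so the file is never held in memory as a whole.
sub extract_sequence {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my $line;
    for (1 .. 2) {
        $line = <$fh>;
        return unless defined $line;    # file has fewer than two lines
    }
    chomp $line;
    return $line;
}

# For each unique sequence keep just two small things:
# an integer count and the name of the first file it was seen in.
sub tally_sequences {
    my @files = @_;
    my (%count, %first_file);
    for my $file (@files) {
        my $seq = extract_sequence($file);
        next unless defined $seq;
        $first_file{$seq} //= $file;
        $count{$seq}++;
    }
    return (\%count, \%first_file);
}

# Usage: perl tally.pl file1 file2 ...
my ($count, $first_file) = tally_sequences(@ARGV);
for my $seq (sort keys %$count) {
    print join("\t", $seq, $count->{$seq}, $first_file->{$seq}), "\n";
}
```

When you need the full record for a sequence, reopen the file named in %first_file and read it then; the hash never has to hold file contents.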

All in all, the standard of the code you posted, and your desire to work around the self-inflicted problems it contains by writing your hash to disk -- *without using modules* -- is a pretty clear indication that you need to take a programming course or two, or find someone local to you to help you over the learning curve. Asking strangers on the internet to do your job for you isn't going to fly.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
In the absence of evidence, opinion is indistinguishable from prejudice.