in reply to write to Disk instead of RAM without using modules

You're looking for records common to a group of files.

Suppose you have one file containing the following words (one word per line): "two, four, six, seven", a second file containing "one, three, four, seven, eight", and a third file containing "two, four, seven, nine".

Read the first file and store each word in a hash with a value of 1. You get a hash looking like this:

( two => 1, four => 1, six => 1, seven => 1);
Read the second file line by line and, for each line, check whether the line is in the hash. If it isn't there, just discard the line: it wasn't in the first file, so it cannot be in all files. If it is in the hash, just increment the counter for it. You end up with something like this:
( two => 1, four => 2, six => 1, seven => 2);
Repeat the same process with the third file, and you get:
( two => 1, four => 3, six => 1, seven => 3);
Notice that your hash is not growing, even though you've read three files. Only one file has ever been loaded into memory. Now that you know the records common to all files are those whose value is 3, you can just print them or do whatever you want with them.
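The counting approach above can be sketched end to end in Perl. This is only an illustration: it recreates the three example files from this post so it can run on its own, and it assumes each word appears at most once per file (otherwise a duplicate within one file would inflate its counter).

```perl
use strict;
use warnings;

# Recreate the three example files from the post (one word per line).
my %contents = (
    'file1.txt' => [qw(two four six seven)],
    'file2.txt' => [qw(one three four seven eight)],
    'file3.txt' => [qw(two four seven nine)],
);
for my $name (keys %contents) {
    open my $out, '>', $name or die "cannot write $name: $!";
    print {$out} "$_\n" for @{ $contents{$name} };
    close $out;
}

my @files = ('file1.txt', 'file2.txt', 'file3.txt');

# Load the first file: every word starts with a count of 1.
my %seen;
open my $fh, '<', $files[0] or die "cannot open $files[0]: $!";
while (<$fh>) {
    chomp;
    $seen{$_} = 1;
}
close $fh;

# For the remaining files, bump the counter only for words already seen
# (assumes each word appears at most once per file).
for my $file (@files[1 .. $#files]) {
    open my $fh, '<', $file or die "cannot open $file: $!";
    while (<$fh>) {
        chomp;
        $seen{$_}++ if exists $seen{$_};
    }
    close $fh;
}

# Words common to all files are those counted once per file.
print "$_\n" for sort grep { $seen{$_} == @files } keys %seen;
```

With the example data, only "four" and "seven" reach a count of 3, so only they are printed.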

Replies are listed 'Best First'.
Re^2: write to Disk instead of RAM without using modules
by Laurent_R (Canon) on Oct 25, 2016 at 16:25 UTC
    This is quick Perlish pseudo-code for the process just explained in my post just above:
    my %seen;
    open my $FH, "<", "file1.txt" or die "cannot open ...";
    while (<$FH>) {
        chomp;
        $seen{$_} = 1;    # hash element, so braces, not parens
    }
    close $FH;
    for my $other_file (qw/ file2.txt file3.txt file4.txt /) {    # ... and so on
        open my $FH, "<", $other_file or die "cannot open ...";
        while (<$FH>) {
            chomp;
            $seen{$_}++ if exists $seen{$_};
        }
        close $FH;
    }
    Only the first file is ever loaded into memory. When you read the other files, you just update counters. Even if you have hundreds of files, it will work provided the first file can be loaded into the hash.

    Update: moved the second chomp line to the right place (within the while loop, not just before it).

      Your suggestion will discard those records which are file-specific, i.e. not present in the first file but possibly present in others. I want those records also to be printed with their values. For example:
      file1:
      aaa 1
      abc 1
      acb 2

      file2:
      aaa 2
      abb 1
      acb 1

      file3:
      acb 2
      aaa 3
      abc 1

      output:
      aaa 1 2 3
      abc 1 0 1
      acb 2 1 2
      abb 0 1 0
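One way to produce that layout is to record, for every key, its value in each file, then print 0 where a key is missing. This sketch hard-codes the example files above so it can run on its own; note that, unlike the memory-saving scheme discussed in this thread, it keeps every key in one hash, which is only fine for inputs that fit in memory.

```perl
use strict;
use warnings;

# Recreate the example files from the post ("key value" per line).
my %contents = (
    'file1.txt' => ["aaa 1", "abc 1", "acb 2"],
    'file2.txt' => ["aaa 2", "abb 1", "acb 1"],
    'file3.txt' => ["acb 2", "aaa 3", "abc 1"],
);
for my $name (keys %contents) {
    open my $out, '>', $name or die "cannot write $name: $!";
    print {$out} "$_\n" for @{ $contents{$name} };
    close $out;
}

my @files = ('file1.txt', 'file2.txt', 'file3.txt');

# $value{$key}[$i] holds the value of $key in file $i (undef if absent).
my %value;
for my $i (0 .. $#files) {
    open my $fh, '<', $files[$i] or die "cannot open $files[$i]: $!";
    while (<$fh>) {
        my ($key, $val) = split;
        $value{$key}[$i] = $val;
    }
    close $fh;
}

# Print every key with one value per file, 0 where it is missing.
for my $key (sort keys %value) {
    print join(' ', $key, map { $_ // 0 } @{ $value{$key} }[0 .. $#files]), "\n";
}
```

Keys come out sorted here (aaa, abb, abc, acb) rather than in the post's order, but each row carries the same per-file values.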
        That's exactly why I asked several times for a detailed explanation of what you need. From your last message describing your requirement, it sounded as if you were just looking for records common to all files in your collection. Now your need is different.

        The method I suggested is still possible, but with a slight modification. When you compare all your files with the first one, write to disk the records that were not found in the hash. You'll end up with versions of all the other files with the records from the first file filtered out. At this point, the original %seen hash is no longer needed. You can now compare the filtered file2 (presumably significantly smaller than the original) with file3, file4, etc. (also filtered and smaller), and so on. You end up with a situation where your input files get smaller and smaller and, at any given point in the process, you only have one file in memory.

        You write stuff to disk, but the amount of data you need to handle is shrinking at each step in the process.
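One filtering pass of that scheme might be sketched as follows. The sub name and the ".filtered" naming convention are hypothetical, not from the post: the first file is loaded into a hash, and each remaining file is streamed through it, with the records not found in the hash written to a new, smaller file for the next pass.

```perl
use strict;
use warnings;

# One filtering pass: load the reference file into a hash, then write the
# records of every other file that are NOT in that hash to a ".filtered"
# file (hypothetical naming). Returns the list of filtered file names.
sub filter_pass {
    my ($first, @others) = @_;

    my %seen;
    open my $fh, '<', $first or die "cannot open $first: $!";
    while (<$fh>) {
        chomp;
        $seen{$_}++;
    }
    close $fh;

    my @filtered;
    for my $file (@others) {
        open my $in,  '<', $file            or die "cannot open $file: $!";
        open my $out, '>', "$file.filtered" or die "cannot write $file.filtered: $!";
        while (<$in>) {
            chomp;
            # Keep only records absent from the reference file.
            print {$out} "$_\n" unless exists $seen{$_};
        }
        close $in;
        close $out;
        push @filtered, "$file.filtered";
    }
    return @filtered;
}
```

Repeating the pass with the first filtered file as the new reference shrinks the data at every step, so only one file's worth of records is ever in memory at a time.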

        I am very sorry, but I am still not getting your point.
        I want to compare all files with each other, not just the first file with all the others. I hope I am clear. Maybe the records are present in the second and third files but not in the first; I need those as well.
        I tried your solution but I am getting errors with the script and cannot solve the problem. Please help me with a short example. I have to get through this issue for further processing. Sorry for the inconvenience caused by the unclear details earlier.