bernanke01 has asked for the wisdom of the Perl Monks concerning the following question:

Howdy Fellow Monks,

I'm trying to pre-process some raw data so that I can use it for a variety of machine-learning analyses. The raw data itself is quite simply formatted:

ID\tID2\tID3\tID4\t.......value

Each ID is a character string between 3 and 7 characters long (median length = 5). Depending on the specific data, the number of IDs ranges between 3 and 50. I'm currently working on an intermediate case of 9. The values are floating point numbers. Here are a few example rows:

THRAP2 ICA1 TIP47 PCSK1 PTTG1 CIT CEACAM6 ADM RMB5 + 0.526
DOC_1R CPE GRB7 RELA PTGES PLOD2 SLC2A1 G22P1 GPR56 + 1.664
NICE_4 GPC3 SNRPB MST1 ITGA11 PHB STARD10 TYRP1 ARG1 + 0.579
THRAP2 GAPD KRT5 CEACAM6 COPEB CNN1 DSC2 LYPLA1 NAP1L1 + 1.568
ACTA2 CPE TIP47 ZYG SSBP1 CIT IGFBP3 CALB2 HIF1a + 0.742

I have 12 such files, and each file has approximately 5 million records. The files have the same rows (based on the IDs, and in the same order). Records are *not* guaranteed to be unique, but duplicate records can be discarded freely. I wish to merge these 12 files into a single file. This file would have the 9 IDs, followed by the 12 floats.

In theory this is really, really simple. You could just open all 12 file-handles, and move through them line-by-line. Unfortunately, my data is messy in two ways:

  1. any one of the 12 matching rows for a given set of IDs can be incomplete -- either IDs or values can be missing. In these cases, the entire row can be discarded from all 12 files
  2. any one of the 12 matching rows for a given set of IDs can be entirely missing. Not replaced by a blank row, just completely skipped.

So, how can I handle this? I'm trying to avoid loading all this into a database if possible, so a pure-perl solution would be very beneficial.

My initial approach was to simply brute-force it: create a hash keyed on the concatenated IDs, with an array of the values collected for each key. Here's the core of the code:

my $file_number = 0;

# loop through all file-handles
foreach my $file (@files) {
    # open the file
    open(my $fh, '<', $file) or croak "Unable to open file: $file";

    # count this file as being processed
    ++$file_number;

    # give the user a little bit of an update
    print "Now processing file $file_number\n";

    # completely process the file
    while (<$fh>) {
        # trim leading/trailing whitespace
        s/^\s+//;
        s/\s+$//;

        # tab-delimited rows with the stats value in the last column
        my @row = split("\t");
        my $val = pop(@row);

        # sort the row and concatenate it into a key
        @row = sort(@row);
        my $key = join("\t", @row);

        # skip rows with missing values
        next() if (scalar(@row) != $subset_num);

        # save this data point
        push @{ $data{$key} }, $val;
    }

    # look for any missing values
    foreach my $key (keys(%data)) {
        if ($file_number > scalar( @{ $data{$key} } )) {
            push @{ $data{$key} }, '';
        }
    }

    # close the filehandle
    close($fh);
}

Unfortunately this doesn't work for exactly the reason you'd expect:

genra@qcxxt[164] >> perl -w rewriter.pl n09 9
Now processing file 1
Now processing file 2
Now processing file 3
Now processing file 4
Now processing file 5
Now processing file 6
Now processing file 7
Now processing file 8
Now processing file 9
Now processing file 10
Out of memory!

I can try to squeeze down memory usage by replacing the IDs with some sort of briefer representation. However, even this won't be helpful when I move to records with 50 IDs.
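Just to illustrate what I mean by a briefer representation, something like this (untested sketch) would intern each ID as a small integer and key the hash on the packed integers instead of the joined strings:

# map each ID string to a small integer the first time it's seen,
# then build the hash key from the packed integers
my %id_index;
my $next_id = 0;

sub compact_key {
    my @ids  = @_;
    my @nums = map {
        exists $id_index{$_} ? $id_index{$_} : ( $id_index{$_} = $next_id++ )
    } @ids;
    return pack( 'N*', sort { $a <=> $b } @nums );   # 4 bytes per ID
}

# inside the while loop, instead of  my $key = join("\t", @row);
# my $key = compact_key(@row);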

I'm looking for some bright ideas on how to do this "database-less merge/join" as efficiently as possible. I'm okay eating up more CPU, but I need a way to squeeze down the space requirements as much as possible. My idea was to process in "batches" of a million records, but I was hoping for a cleaner/easier solution from ye wise monks!

Re: Merging/Joining Multiple Large Files
by Anno (Deacon) on Sep 24, 2007 at 20:47 UTC
    Try to tie the large hash to a disk file. Your program shouldn't need any changes.
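
    For instance, a minimal sketch using MLDBM over DB_File (one way to do the tie when the hash values are array refs; note that MLDBM can't see in-place changes to nested structures, so the push needs a small rewrite):

    use MLDBM qw(DB_File Storable);   # serialize the array-ref values to disk
    use Fcntl;

    tie my %data, 'MLDBM', 'merge_data.db', O_CREAT | O_RDWR, 0640
        or die "Cannot tie hash: $!";

    # inside the while loop, instead of  push @{ $data{$key} }, $val;
    my $vals = $data{$key} || [];
    push @$vals, $val;
    $data{$key} = $vals;              # store the whole array back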

    (Except, perhaps, that scalar is unnecessary in

    $file_number > scalar( @{ $data{$key} } )
    The numeric comparison already provides scalar context, so
    $file_number > @{ $data{$key} }
    is equivalent.)

    Anno

Re: Merging/Joining Multiple Large Files
by kyle (Abbot) on Sep 24, 2007 at 20:41 UTC

    If you're not going to keep it all in memory, you're going to write it all to disk. The question is how exactly to do that.

    Instead of a straight in-memory hash, you could use a DBM::Deep file. This would require very little modification of your current code.
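
    For example, a sketch of the tie (assuming DBM::Deep's tie interface; nested structures are handled transparently, so the existing push should work as-is):

    use DBM::Deep;

    # back the existing hash with a file instead of RAM
    tie my %data, 'DBM::Deep', 'merged.db';

    # ... the rest of the loop is unchanged, e.g.
    # push @{ $data{$key} }, $val;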

    Alternately a solution based on Cache::Cache (like Cache::FileCache) might serve well.

    A more crude solution might be to turn your hash keys into filenames and store each hash entry as a file with one line per value. You end up with millions of files, each 12 lines long. Then you can post-process those files into one big file pretty easily.
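
    A rough sketch of that idea (I've hashed the key into the filename rather than using it verbatim, since the key contains tabs, and spread the files over subdirectories; the helper name is made up):

    use Digest::MD5 qw(md5_hex);
    use File::Path qw(mkpath);

    # called in place of  push @{ $data{$key} }, $val;
    sub append_value {
        my ( $key, $val ) = @_;
        my $name = md5_hex($key);                      # tab-safe filename
        my $dir  = 'buckets/' . substr( $name, 0, 2 ); # spread over 256 dirs
        mkpath($dir) unless -d $dir;
        my $new = !-e "$dir/$name";
        open my $out, '>>', "$dir/$name" or die "Cannot append to $dir/$name: $!";
        print {$out} "$key\n" if $new;                 # keep the IDs recoverable
        print {$out} "$val\n";
        close $out;
    }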

    In all cases, it's probably going to be a lot slower than memory.

Re: Merging/Joining Multiple Large Files
by jdporter (Paladin) on Sep 24, 2007 at 20:36 UTC
    The files have the same rows (based on the IDs, and in the same order.

    This sounds to me like you're saying that the files correspond precisely by line number. If so, then I'd open 12 files for reading, and read one line from each. When all twelve "current" lines have been read in, look for reasons to discard, process accordingly, and then move on to the next line in each file.
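
    Something like this sketch, perhaps (it assumes the lines really do line up, and hard-codes 9 IDs per row for the current data):

    my $n_ids = 9;

    # open one handle per input file
    my @fhs;
    for my $file (@files) {
        open my $fh, '<', $file or die "Cannot open $file: $!";
        push @fhs, $fh;
    }

    RECORD:
    while (1) {
        my @rows;
        for my $fh (@fhs) {
            my $line = <$fh>;
            last RECORD unless defined $line;   # one of the files ran out
            $line =~ s/^\s+//;
            $line =~ s/\s+$//;
            push @rows, [ split /\t/, $line ];
        }

        # discard the record everywhere if any row is short (missing IDs or value)
        next RECORD if grep { @$_ != $n_ids + 1 } @rows;

        my @ids = @{ $rows[0] }[ 0 .. $n_ids - 1 ];
        print join( "\t", @ids, map { $_->[-1] } @rows ), "\n";
    }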

Re: Merging/Joining Multiple Large Files
by roboticus (Chancellor) on Sep 25, 2007 at 01:12 UTC
    bernanke01:

    If I understand your question correctly, then if you're on unix, or cygwin under windows, you could just:

    sort -u file1 file2 file3 | awk 'NF==10 {print}'
    The sort -u just concatenates all the files, sorts the result, and emits only the unique lines. The awk filter then selects only the lines with 10 fields (to eliminate rows with missing fields). Salt to taste...

    ...roboticus