bernanke01 has asked for the wisdom of the Perl Monks concerning the following question:

Howdy Fellow Monks,

I'm trying to pre-process some raw data so that I can use it for a variety of machine-learning analyses. The raw data itself is quite simply formatted:

ID\tID2\tID3\tID4\t.......value

Each ID is a character string between 3 and 7 characters long (median length = 5). Depending on the specific data, the number of IDs ranges between 3 and 50. I'm currently working on an intermediate case of 9. The values are floating point numbers. Here are a few example rows:

THRAP2 ICA1 TIP47 PCSK1 PTTG1 CIT CEACAM6 ADM RMB5 + 0.526
DOC_1R CPE GRB7 RELA PTGES PLOD2 SLC2A1 G22P1 GPR56 + 1.664
NICE_4 GPC3 SNRPB MST1 ITGA11 PHB STARD10 TYRP1 ARG1 + 0.579
THRAP2 GAPD KRT5 CEACAM6 COPEB CNN1 DSC2 LYPLA1 NAP1L1 + 1.568
ACTA2 CPE TIP47 ZYG SSBP1 CIT IGFBP3 CALB2 HIF1a + 0.742

I have 12 such files, and each file has approximately 5 million records. The files have the same rows (based on the IDs, and in the same order). Records are *not* guaranteed to be unique, but duplicate records can be discarded freely. I wish to merge these 12 files into a single file. This file would have the 9 IDs, followed by the 12 floats.

In theory this is really, really simple. You could just open all 12 file-handles, and move through them line-by-line. Unfortunately, my data is messy in two ways:

  1. any one of the 12 matching rows for a given set of IDs can be incomplete -- either IDs or values can be missing. In these cases, the entire row can be discarded from all 12 files
  2. any one of the 12 matching rows for a given set of IDs can be entirely missing. Not replaced by a blank row, just completely skipped.

So, how can I handle this? I'm trying to avoid loading all this into a database if possible, so a pure-perl solution would be very beneficial.

My initial approach was to simply brute-force it: create a hash keyed on the concatenated IDs, with an array of the values collected for each key. Here's the core of the code:

my $file_number = 0;

# loop through all file-handles
foreach my $file (@files) {
    # open the file
    open(my $fh, '<', $file) or croak "Unable to open file: $file";

    # count this file as being processed
    ++$file_number;

    # give the user a little bit of an update
    print "Now processing file $file_number\n";

    # completely process the file
    while (<$fh>) {
        # trim leading/trailing whitespace
        s/^\s+//;
        s/\s+$//;

        # tab-delimited rows with the stats value in the last column
        my @row = split("\t");
        my $val = pop(@row);

        # sort the row and concatenate it into a key
        @row = sort(@row);
        my $key = join("\t", @row);

        # skip rows with missing values
        next() if (scalar(@row) != $subset_num);

        # save this data point
        push @{ $data{$key} }, $val;
    }

    # look for any missing values
    foreach my $key (keys(%data)) {
        if ($file_number > scalar( @{ $data{$key} } )) {
            push @{ $data{$key} }, '';
        }
    }

    # close the filehandle
    close($fh);
}

Unfortunately this doesn't work for exactly the reason you'd expect:

genra@qcxxt[164] >> perl -w rewriter.pl n09 9
Now processing file 1
Now processing file 2
Now processing file 3
Now processing file 4
Now processing file 5
Now processing file 6
Now processing file 7
Now processing file 8
Now processing file 9
Now processing file 10
Out of memory!

I can try to squeeze down memory usage by replacing the IDs with some sort of briefer representation. However, even this won't be helpful when I move to records with 50 IDs.
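Just to illustrate what I mean by a briefer representation, something like this (untested sketch) would intern each ID as a small integer and key the hash on the packed integers instead of the joined strings:

# map each ID string to a small integer the first time it's seen,
# then build the hash key from the packed integers
my %id_index;
my $next_id = 0;

sub compact_key {
    my @ids  = @_;
    my @nums = map {
        exists $id_index{$_} ? $id_index{$_} : ( $id_index{$_} = $next_id++ )
    } @ids;
    return pack( 'N*', sort { $a <=> $b } @nums );   # 4 bytes per ID
}

# inside the while loop, instead of  my $key = join("\t", @row);
# my $key = compact_key(@row);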

I'm looking for some bright ideas on how to do this "database-less merge/join" as efficiently as possible. I'm okay eating up more CPU, but I need a way to squeeze down the space requirements as much as possible. My idea was to process in "batches" of a million records, but I was hoping for a cleaner/easier solution from ye wise monks!

Re: Merging/Joining Multiple Large Files
by Anno (Deacon) on Sep 24, 2007 at 20:47 UTC
    Try to tie the large hash to a disk file. Your program shouldn't need any changes.
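
    For instance, a minimal sketch using MLDBM over DB_File (one way to do the tie when the hash values are array refs; note that MLDBM can't see in-place changes to nested structures, so the push needs a small rewrite):

    use MLDBM qw(DB_File Storable);   # serialize the array-ref values to disk
    use Fcntl;

    tie my %data, 'MLDBM', 'merge_data.db', O_CREAT | O_RDWR, 0640
        or die "Cannot tie hash: $!";

    # inside the while loop, instead of  push @{ $data{$key} }, $val;
    my $vals = $data{$key} || [];
    push @$vals, $val;
    $data{$key} = $vals;              # store the whole array back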

    (Except, perhaps, that scalar is unnecessary in

    $file_number > scalar( @{ $data{$key} } )
    The numeric comparison already provides scalar context, so
    $file_number > @{ $data{$key} }
    is equivalent.)

    Anno

Re: Merging/Joining Multiple Large Files
by kyle (Abbot) on Sep 24, 2007 at 20:41 UTC

    If you're not going to keep it all in memory, you're going to write it all to disk. The question is how exactly to do that.

    Instead of a straight in-memory hash, you could use a DBM::Deep file. This would require very little modification of your current code.
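
    For example, a sketch of the tie (assuming DBM::Deep's tie interface; nested structures are handled transparently, so the existing push should work as-is):

    use DBM::Deep;

    # back the existing hash with a file instead of RAM
    tie my %data, 'DBM::Deep', 'merged.db';

    # ... the rest of the loop is unchanged, e.g.
    # push @{ $data{$key} }, $val;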

    Alternately a solution based on Cache::Cache (like Cache::FileCache) might serve well.

    A more crude solution might be to turn your hash keys into filenames and store each hash entry as a file with one line per value. You end up with millions of files, each 12 lines long. Then you can post-process those files into one big file pretty easily.
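
    A rough sketch of that idea (I've hashed the key into the filename rather than using it verbatim, since the key contains tabs, and spread the files over subdirectories; the helper name is made up):

    use Digest::MD5 qw(md5_hex);
    use File::Path qw(mkpath);

    # called in place of  push @{ $data{$key} }, $val;
    sub append_value {
        my ( $key, $val ) = @_;
        my $name = md5_hex($key);                      # tab-safe filename
        my $dir  = 'buckets/' . substr( $name, 0, 2 ); # spread over 256 dirs
        mkpath($dir) unless -d $dir;
        my $new = !-e "$dir/$name";
        open my $out, '>>', "$dir/$name" or die "Cannot append to $dir/$name: $!";
        print {$out} "$key\n" if $new;                 # keep the IDs recoverable
        print {$out} "$val\n";
        close $out;
    }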

    In all cases, it's probably going to be a lot slower than memory.

Re: Merging/Joining Multiple Large Files
by jdporter (Paladin) on Sep 24, 2007 at 20:36 UTC
    The files have the same rows (based on the IDs, and in the same order.

    This sounds to me like you're saying that the files correspond precisely by line number. If so, then I'd open 12 files for reading, and read one line from each. When all twelve "current" lines have been read in, look for reasons to discard, process accordingly, and then move on to the next line in each file.
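
    Something like this sketch, perhaps (it assumes the lines really do line up, and hard-codes 9 IDs per row for the current data):

    my $n_ids = 9;

    # open one handle per input file
    my @fhs;
    for my $file (@files) {
        open my $fh, '<', $file or die "Cannot open $file: $!";
        push @fhs, $fh;
    }

    RECORD:
    while (1) {
        my @rows;
        for my $fh (@fhs) {
            my $line = <$fh>;
            last RECORD unless defined $line;   # one of the files ran out
            $line =~ s/^\s+//;
            $line =~ s/\s+$//;
            push @rows, [ split /\t/, $line ];
        }

        # discard the record everywhere if any row is short (missing IDs or value)
        next RECORD if grep { @$_ != $n_ids + 1 } @rows;

        my @ids = @{ $rows[0] }[ 0 .. $n_ids - 1 ];
        print join( "\t", @ids, map { $_->[-1] } @rows ), "\n";
    }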

Re: Merging/Joining Multiple Large Files
by roboticus (Chancellor) on Sep 25, 2007 at 01:12 UTC
    bernanke01:

    If I understand your question correctly, then if you're on unix, or cygwin under windows, you could just:

    sort -u file1 file2 file3 | awk 'NF==10 {print}'
    The sort -u just concatenates all the files, sorts the result, and emits only the unique lines. The awk filter then selects only the lines with 10 fields (to eliminate rows with missing fields). Salt to taste...

    ...roboticus