Howdy Fellow Monks,

I'm trying to pre-process some raw data so that I can use it for a variety of machine-learning analyses. The raw data itself is quite simply formatted:

ID\tID2\tID3\tID4\t.......value

Each ID is a character string between 3 and 7 characters long (median length = 5). Depending on the specific data, the number of IDs ranges between 3 and 50. I'm currently working on an intermeidate case of 9. The values are floating point numbers. Here are a few example rows:

THRAP2 ICA1 TIP47 PCSK1 PTTG1 CIT CEACAM6 ADM RMB5 + 0.526 DOC_1R CPE GRB7 RELA PTGES PLOD2 SLC2A1 G22P1 GPR56 + 1.664 NICE_4 GPC3 SNRPB MST1 ITGA11 PHB STARD10 TYRP1 ARG1 + 0.579 THRAP2 GAPD KRT5 CEACAM6 COPEB CNN1 DSC2 LYPLA1 NAP1L1 + 1.568 ACTA2 CPE TIP47 ZYG SSBP1 CIT IGFBP3 CALB2 HIF1a + 0.742

I have 12 such files, and each file has approximately 5 million records. The files have the same rows (based on the IDs, and in the same order. Records are *not* guaranteed to be unique, but duplicate records can be discarded freely. I wish to merge these 12 files into a single file. This file would have the 9 IDs, followed by the 12 floats.

In theory this is really, really simple. You could just open all 12 file-handles, and move through them line-by-line. Unfortunately, my data is messy in two ways:

  1. any one of the 12 matching rows for a given set of IDs can be incomplete -- either IDs or values can be missing. In these cases, the entire row can be discarded from all 12 files
  2. any one of the 12 matching rows for a given set of IDs can be entirely missing. Not replaced by a blank row, just completely skipped.

So, how can I handle this? I'm trying to avoid loading all this into a database if possible, so a pure-perl solution would be very beneficial.

My initial approach was to simply brute-force it: create a hash with the IDs concatenated into a key, and an array of integers. Here's the core of the code:

my $file_number = 0; # loop through all file-handles foreach my $file (@files) { # open the file open(my $fh, '<', $file) or croak "Unable to open file: $file" +; # count this file as being processed ++$file_number; # give the user a little bit of an update print "Now processing file $file_number\n"; # completely process the file while (<$fh>) { # trim leading/trailing whitespace s/^\s+//; s/\s+$//; # tab-delimited rows with the stats value in the last +column my @row = split("\t"); my $val = pop(@row); # sort the row and concatenate it into a key @row = sort(@row); my $key = join("\t", @row); # skip rows with missing values next() if (scalar(@row) != $subset_num); # save this data point push @{ $data{$key} }, $val; } # look for any missing values foreach my $key (keys(%data)) { if ($file_number > scalar( @{ $data{$key} } )) { push @{ $data{$key} }, ''; } } # close the filehandle close($fh); }

Unfortunately this doesn't work for exactly the reason you'd expect:

genra@qcxxt[164] >> perl -w rewriter.pl n09 9 Now processing file 1 Now processing file 2 Now processing file 3 Now processing file 4 Now processing file 5 Now processing file 6 Now processing file 7 Now processing file 8 Now processing file 9 Now processing file 10 Out of memory!

I can try to squeeze down memory usage by replacing the IDs with some sort of briefer representation. However, even this won't be helpful when I move to records with 50 IDs.

I'm looking for some bright ideas on how to do this "database-less merge/join" as efficiently as possible. I'm okay eating up more CPU, but I need a way to squeeze down the space requirements as much as possible. My idea was to process in "batches" of a million records, but I was hoping for a cleaner/easier solution from ye wise monks!


In reply to Merging/Joining Multiple Large Files by bernanke01

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.