bernanke01 has asked for the wisdom of the Perl Monks concerning the following question:
Howdy Fellow Monks,
I'm trying to pre-process some raw data so that I can use it for a variety of machine-learning analyses. The raw data itself is quite simply formatted:
ID\tID2\tID3\tID4\t.......value
Each ID is a character string between 3 and 7 characters long (median length = 5). Depending on the specific data set, the number of IDs ranges between 3 and 50; I'm currently working on an intermediate case of 9. The values are floating-point numbers. Here are a few example rows:
THRAP2 ICA1 TIP47 PCSK1 PTTG1 CIT CEACAM6 ADM RMB5 + 0.526
DOC_1R CPE GRB7 RELA PTGES PLOD2 SLC2A1 G22P1 GPR56 + 1.664
NICE_4 GPC3 SNRPB MST1 ITGA11 PHB STARD10 TYRP1 ARG1 + 0.579
THRAP2 GAPD KRT5 CEACAM6 COPEB CNN1 DSC2 LYPLA1 NAP1L1 + 1.568
ACTA2 CPE TIP47 ZYG SSBP1 CIT IGFBP3 CALB2 HIF1a + 0.742
I have 12 such files, and each file has approximately 5 million records. The files have the same rows (based on the IDs) and in the same order. Records are *not* guaranteed to be unique, but duplicate records can be discarded freely. I wish to merge these 12 files into a single file. This file would have the 9 IDs, followed by the 12 floats.
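For concreteness, each merged record would be laid out like the input rows, just with one value column per file (schematic only, not real data):

ID\tID2\tID3\t.......ID9\tvalue_file1\tvalue_file2\t.......value_file12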
In theory this is really, really simple: you could just open all 12 file-handles and move through them line-by-line. Unfortunately, my data is messy in two ways: the IDs within a record are not always listed in the same order from one file to the next (hence the sort in the code below), and not every record appears in every file, so a straight line-by-line walk falls out of sync.
So, how can I handle this? I'm trying to avoid loading all of this into a database if possible, so a pure-Perl solution would be very beneficial.
My initial approach was to simply brute-force it: build a hash keyed on the concatenated IDs, with an array holding the value from each file. Here's the core of the code:
my $file_number = 0;

# loop through all file-handles
foreach my $file (@files) {
    # open the file
    open(my $fh, '<', $file) or croak "Unable to open file: $file";

    # count this file as being processed
    ++$file_number;

    # give the user a little bit of an update
    print "Now processing file $file_number\n";

    # completely process the file
    while (<$fh>) {
        # trim leading/trailing whitespace
        s/^\s+//;
        s/\s+$//;

        # tab-delimited rows with the stats value in the last column
        my @row = split("\t");
        my $val = pop(@row);

        # sort the row and concatenate it into a key
        @row = sort(@row);
        my $key = join("\t", @row);

        # skip rows with missing values
        next() if (scalar(@row) != $subset_num);

        # save this data point
        push @{ $data{$key} }, $val;
    }

    # look for any missing values
    foreach my $key (keys(%data)) {
        if ($file_number > scalar( @{ $data{$key} } )) {
            push @{ $data{$key} }, '';
        }
    }

    # close the filehandle
    close($fh);
}
Unfortunately this doesn't work for exactly the reason you'd expect:
genra@qcxxt[164] >> perl -w rewriter.pl n09 9
Now processing file 1
Now processing file 2
Now processing file 3
Now processing file 4
Now processing file 5
Now processing file 6
Now processing file 7
Now processing file 8
Now processing file 9
Now processing file 10
Out of memory!
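Back-of-the-envelope (the per-entry overheads are rough assumptions about a typical perl build, not measurements): each hash entry carries the roughly 50-byte text key plus hash-entry overhead, and the anonymous array of up to 12 value scalars adds a few hundred bytes more, so figure on the order of 0.5-1 KB per distinct record. At roughly 5 million records, that works out to somewhere around 2.5-5 GB, which is well past what the process can address.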
I can try to squeeze down memory usage by replacing the IDs with some sort of briefer representation. However, even that won't be enough once I move to records with 50 IDs.
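To make the "briefer representation" idea concrete, here is an untested sketch of one way to do it: intern each ID string as a small integer the first time it is seen, then pack the sorted codes into a fixed-width binary string and use that as the hash key. The names (compact_key, %id_code) are purely illustrative.

my %id_code;        # ID string -> small integer code
my $next_code = 0;

sub compact_key {
    my @ids = @_;

    # assign a new code the first time an ID is seen
    my @codes = map {
        defined $id_code{$_} ? $id_code{$_} : ($id_code{$_} = $next_code++)
    } @ids;

    # pack the sorted codes as 32-bit unsigned integers: 9 IDs shrink from
    # roughly 50 bytes of tab-joined text to 36 bytes, and 50 IDs to 200 bytes
    return pack('N*', sort { $a <=> $b } @codes);
}

# inside the existing while loop, in place of join("\t", sort(@row)):
#   my $key = compact_key(@row);

The %id_code table stays small (one entry per distinct ID), but every record still pays the hash-entry and array overhead, so this only stretches the memory budget rather than removing the ceiling.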
I'm looking for some bright ideas on how to do this "database-less merge/join" as efficiently as possible. I'm okay eating up more CPU, but I need a way to squeeze down the space requirements as much as possible. My idea was to process in "batches" of a million records, but I was hoping for a cleaner/easier solution from ye wise monks!
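One way to make the batching idea concrete is to lean on an external sort instead of an in-memory hash: normalize every record to a canonical key, let sort(1) group identical keys on disk, then merge in a single streaming pass. Below is an untested sketch along those lines; the file names, $subset_num, and output names are placeholders, and it assumes a Unix sort is on the PATH with enough temporary disk space for the combined data.

use strict;
use warnings;

my @files      = map { "file$_.txt" } 1 .. 12;   # placeholder names
my $subset_num = 9;

# Pass 1: rewrite every record as "canonical-key \t file-index \t value"
# so that an external sort can bring all copies of a record together.
open my $norm, '>', 'normalized.txt' or die "normalized.txt: $!";
for my $i (0 .. $#files) {
    open my $fh, '<', $files[$i] or die "$files[$i]: $!";
    while (<$fh>) {
        s/^\s+//;
        s/\s+$//;
        my @row = split /\t/;
        my $val = pop @row;
        next if @row != $subset_num;
        print {$norm} join("\t", sort(@row)), "\t$i\t$val\n";
    }
    close $fh;
}
close $norm;

# Let sort(1) do the heavy lifting on disk rather than in RAM.
# LC_ALL=C keeps the sort byte-wise so identical keys land adjacent.
$ENV{LC_ALL} = 'C';
system('sort', '-o', 'normalized.sorted.txt', 'normalized.txt') == 0
    or die "external sort failed: $?";

# Pass 2: stream through the sorted file; every line for a given key is
# now adjacent, so only one record's worth of values is held in memory.
open my $in,  '<', 'normalized.sorted.txt' or die "normalized.sorted.txt: $!";
open my $out, '>', 'merged.txt'            or die "merged.txt: $!";
my ($cur_key, @vals);
my $flush = sub {
    return unless defined $cur_key;
    print {$out} join("\t", $cur_key,
                      map { defined $_ ? $_ : '' } @vals[0 .. $#files]), "\n";
};
while (<$in>) {
    chomp;
    my @fields = split /\t/;
    my $val    = pop @fields;
    my $idx    = pop @fields;
    my $key    = join "\t", @fields;
    if (!defined $cur_key or $key ne $cur_key) {
        $flush->();
        ($cur_key, @vals) = ($key);
    }
    $vals[$idx] = $val;   # duplicate records within a file just overwrite
}
$flush->();
close $in;
close $out;

The normalized file is roughly the size of the 12 inputs combined, and sort(1) spills to temporary files as needed, so peak memory stays small regardless of how many IDs each record carries.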
Replies are listed 'Best First'.

Re: Merging/Joining Multiple Large Files
by Anno (Deacon) on Sep 24, 2007 at 20:47 UTC

Re: Merging/Joining Multiple Large Files
by kyle (Abbot) on Sep 24, 2007 at 20:41 UTC

Re: Merging/Joining Multiple Large Files
by jdporter (Paladin) on Sep 24, 2007 at 20:36 UTC

Re: Merging/Joining Multiple Large Files
by roboticus (Chancellor) on Sep 25, 2007 at 01:12 UTC