comment on

Howdy Fellow Monks,

I'm trying to pre-process some raw data so that I can use it for a variety of machine-learning analyses. The raw data itself is quite simply formatted:

ID\tID2\tID3\tID4\t.......value
[download]

Each ID is a character string between 3 and 7 characters long (median length = 5). Depending on the specific data, the number of IDs ranges between 3 and 50. I'm currently working on an intermeidate case of 9. The values are floating point numbers. Here are a few example rows:

THRAP2  ICA1    TIP47   PCSK1   PTTG1   CIT     CEACAM6 ADM     RMB5  
+  0.526
DOC_1R  CPE     GRB7    RELA    PTGES   PLOD2   SLC2A1  G22P1   GPR56 
+  1.664
NICE_4  GPC3    SNRPB   MST1    ITGA11  PHB     STARD10 TYRP1   ARG1  
+  0.579
THRAP2  GAPD    KRT5    CEACAM6 COPEB   CNN1    DSC2    LYPLA1  NAP1L1
+  1.568
ACTA2   CPE     TIP47   ZYG     SSBP1   CIT     IGFBP3  CALB2   HIF1a 
+  0.742
[download]

I have 12 such files, and each file has approximately 5 million records. The files have the same rows (based on the IDs, and in the same order. Records are *not* guaranteed to be unique, but duplicate records can be discarded freely. I wish to merge these 12 files into a single file. This file would have the 9 IDs, followed by the 12 floats.

In theory this is really, really simple. You could just open all 12 file-handles, and move through them line-by-line. Unfortunately, my data is messy in two ways:

any one of the 12 matching rows for a given set of IDs can be incomplete -- either IDs or values can be missing. In these cases, the entire row can be discarded from all 12 files
any one of the 12 matching rows for a given set of IDs can be entirely missing. Not replaced by a blank row, just completely skipped.

So, how can I handle this? I'm trying to avoid loading all this into a database if possible, so a pure-perl solution would be very beneficial.

My initial approach was to simply brute-force it: create a hash with the IDs concatenated into a key, and an array of integers. Here's the core of the code:

my $file_number = 0;

# loop through all file-handles
foreach my $file (@files) {

        # open the file
        open(my $fh, '<', $file) or croak "Unable to open file: $file"
+;

        # count this file as being processed
        ++$file_number;

        # give the user a little bit of an update
        print "Now processing file $file_number\n";

        # completely process the file
        while (<$fh>) {

                # trim leading/trailing whitespace
                s/^\s+//;
                s/\s+$//;

                # tab-delimited rows with the stats value in the last 
+column
                my @row = split("\t");
                my $val = pop(@row);

                # sort the row and concatenate it into a key
                @row = sort(@row);
                my $key = join("\t", @row);

                # skip rows with missing values
                next() if (scalar(@row) != $subset_num);

                # save this data point
                push @{ $data{$key} }, $val;

                }

        # look for any missing values
        foreach my $key (keys(%data)) {
                if ($file_number > scalar( @{ $data{$key} } )) {
                        push @{ $data{$key} }, '';
                        }
                }

        # close the filehandle
        close($fh);

        }
[download]

Unfortunately this doesn't work for exactly the reason you'd expect:

genra@qcxxt[164] >> perl -w rewriter.pl n09 9
Now processing file 1
Now processing file 2
Now processing file 3
Now processing file 4
Now processing file 5
Now processing file 6
Now processing file 7
Now processing file 8
Now processing file 9
Now processing file 10
Out of memory!
[download]

I can try to squeeze down memory usage by replacing the IDs with some sort of briefer representation. However, even this won't be helpful when I move to records with 50 IDs.

I'm looking for some bright ideas on how to do this "database-less merge/join" as efficiently as possible. I'm okay eating up more CPU, but I need a way to squeeze down the space requirements as much as possible. My idea was to process in "batches" of a million records, but I was hoping for a cleaner/easier solution from ye wise monks!

In reply to Merging/Joining Multiple Large Files by bernanke01

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.