in reply to transposing and matching large datasets

Just to confirm my understanding, you are hoping to create a file that has 1250 lines, with each line containing 500,010 space-delimited fields?

What are you going to do with that file afterwards? How are you going to iterate that data? For example, it might make a lot more sense to prepend or append the 10 columns from file2 to file1--but it depends upon how you need to subsequently access the data.
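For the record, here is a minimal sketch of that append approach, assuming the two files are in the same record order (the names file1.dat and file2.dat are placeholders, not your actual files); it tacks file2's 10 fields onto the end of each line of file1 and writes the result to stdout:

#! perl -slw
use strict;

## Placeholder filenames; assumes line N of file2.dat corresponds to line N of file1.dat.
open my $big,   '<', 'file1.dat' or die "file1.dat: $!";
open my $extra, '<', 'file2.dat' or die "file2.dat: $!";

while( defined( my $line = <$big> ) ) {
    chomp $line;
    defined( my $cols = <$extra> ) or die "file2.dat ran out of lines\n";
    chomp $cols;
    print "$line $cols";    ## Append the 10 extra fields to the record
}

Only one record from each file is held in memory at a time, however wide the lines are.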

If you are going to do further processing on that data--I assume this 500,010-column file isn't meant to be read by humans?--then just reading and splitting those long lines one at a time is going to be a slow, memory-intensive process.
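To put a rough number on that claim (just a sketch; the record here is synthetic and your timings and memory use will differ), you can time a single split of a 500,010-field line:

#! perl -slw
use strict;
use Time::HiRes qw( time );

## Build one synthetic 500,010-field record and time a single split of it.
my $line = join ' ', 1 .. 500_010;

my $start  = time;
my @fields = split ' ', $line;
printf "Splitting %d fields took %.3f seconds\n", scalar @fields, time - $start;

Multiply whatever you measure by the 1250 lines to get a feel for the read-side cost alone.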

It would be interesting to watch an RDBMS trying to do a join on a 500,010-column table :)

Update: Since you seem to be doing this regularly, you might find this useful. It will transpose any(*) space-delimited file using minimal memory. It works by splitting the lines one at a time and accumulating the fields in separate temporary files, one per field. It then rewinds the temporaries, reads them back, and outputs the transposed records in sequence. It's coded to act as a command-line filter, reading the file named on the command line and writing the transposed output to stdout. See after the __END__ tag for a usage example. A 5000-line/650-field file took ~22 seconds, so your 500,000-line file should take ~30 minutes. Once the file is transposed, merging it with your other file should be simple.

Update2: (*) Within the limits of your OS/CRT's ability to hold one filehandle per field open (max: 2043 for me).
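If you want to find your own ceiling before committing to a wide file, a throwaway sketch like this (it creates, then closes and deletes, small handle_test_*.tmp files in the current directory) reports how many handles it managed to hold open at once:

#! perl -slw
use strict;

## Open temp files until open() fails, then report how many handles we held.
my @fhs;
my $n = 0;
while( open my $fh, '+>', "handle_test_$n.tmp" ) {
    push @fhs, $fh;
    ++$n;
}
print "Opened $n filehandles before failing: $!";

close $_ for @fhs;
unlink "handle_test_$_.tmp" for 0 .. $n - 1;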

#! perl -slw
use strict;

my $tempdir = '/temp';

## Read the first line of the file to determine how many columns
my @fields = split ' ', <>;
seek *ARGV, 0, 0;    ## And rewind for the loop below.

my @fhs;
open $fhs[ $_ ], '+>', sprintf "$tempdir/%03d.tmp", $_
    or die "Failed to open temp file $_; you probably ran out of handles: $!\n"
    for 0 .. $#fields;

warn "opened intermediaries\n";

## Read each line, split it and append each field to its file
while( <> ) {
    printf STDERR "\r$.\t";    ## Activity indicator
    my @fields = split;
    printf { $fhs[ $_ ] } "%s ", $fields[ $_ ] for 0 .. $#fields;
}
warn "\nintermediaries written\n";

seek $_, 0, 0 for @fhs;
warn "intermediaries rewound\n";

## Read each tmp file back and write to stdout
print do{ local $/; readline( $_ ) } for @fhs;

## Just truncate the tmp files to zero bytes
warn "Done; truncating and closing intermediaries\n";
truncate( $_, 0 ) and close( $_ ) for @fhs;

__END__
[ 7:06:52.07] c:\test>631400 junk.dat >junk.xdat
opened intermediaries
5001
intermediaries written
intermediaries rewound
Done; truncating and closing intermediaries

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Re^2: transposing and matching large datasets
by Anonymous Monk on Aug 09, 2007 at 18:08 UTC
    Thank you! I was thinking about it overnight. Two different analyses need the data in different formats, but after thinking about the comments, I asked some people whether the dataset could be split in some way. It turns out that it can be split into 30 different files rather than one large one. But I'm going to give the code above a try (just for practice's sake). Thanks a bunch, and I will report back :)