in reply to transposing and matching large datasets
Just to confirm my understanding: you are hoping to create a file that has 1,250 lines, with each line containing 500,010 space-delimited fields?
What are you going to do with that file afterwards? How are you going to iterate over that data? For example, it might make a lot more sense to prepend or append the 10 columns from file2 to file1--but it depends upon how you need to access the data subsequently.
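If appending is the way you end up going, here is a minimal sketch of that idea. It assumes both files are space-delimited and row-aligned (same number of lines, in the same order), and it uses the hypothetical names file1.dat and file2.dat; none of that comes from the original question.

#! perl -w
## Sketch only: append each line of file2 to the corresponding line of file1
## and write the merged records to stdout.
use strict;

open my $big,   '<', 'file1.dat' or die "file1.dat: $!";
open my $small, '<', 'file2.dat' or die "file2.dat: $!";

while( defined( my $line = <$big> ) ) {
    my $extra = <$small>;
    die "file2.dat ran out of lines\n" unless defined $extra;
    chomp( $line, $extra );
    print "$line $extra\n";   ## file2's 10 columns appended to file1's record
}

close $big;
close $small;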
If you are going to do further processing on that data--I assume this 500,010-column file isn't meant to be read by humans?--then just reading and splitting those long lines one at a time is going to be a slow, memory-intensive process.
It would be interesting to watch an RDBMS trying to do a join on a 500,010-column table :)
Update: Since you seem to be doing this regularly, you might find this useful. It will transpose any(*) space-delimited file using minimal memory. It works by splitting the lines one at a time and accumulating the fields in separate temporary files, one per field. It then rewinds the temporaries, reads them back, and outputs the transposed records in sequence. It's coded to act as a command-line filter, reading from stdin and writing to stdout; see after the __END__ tag for a usage example. A 5,000-line/650-field file took ~22 seconds, so your 500,000-line file should take ~30 minutes. Once the file is transposed, merging it with your other file should be simple.
Update 2: (*) Within the limits of your OS/CRT being able to hold one filehandle open per field (max: 2043 for me).
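If you want to know your own limit before committing to a huge run, a quick probe along these lines (my sketch, not part of the script below) will tell you:

#! perl -w
## Sketch: keep opening anonymous temp files until open() fails, then report
## how many handles this perl/OS combination allowed at once.
use strict;

my @fhs;
while( open my $fh, '+>', undef ) {   ## undef filename = anonymous temp file
    push @fhs, $fh;
}
print scalar( @fhs ), " filehandles opened before failure\n";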
#! perl -slw
use strict;

my $tempdir = '/temp';

## Read the first line of the file to determine how many columns
my @fields = split ' ', <>;
seek *ARGV, 0, 0;   ## And rewind for the loop below.

my @fhs;
open $fhs[ $_ ], '+>', sprintf "$tempdir/%03d.tmp", $_
    or die "Failed to open temp file $_; you probably ran out of handles: $!\n"
    for 0 .. $#fields;

warn "opened intermediaries\n";

## Read each line, split it and append each field to its file
while( <> ) {
    printf STDERR "\r$.\t";   ## Activity indicator
    my @fields = split;
    printf { $fhs[ $_ ] } "%s ", $fields[ $_ ] for 0 .. $#fields;
}
warn "\nintermediaries written\n";

seek $_, 0, 0 for @fhs;
warn "intermediaries rewound\n";

## Read each tmp file back and write to stdout
print do{ local $/; readline( $_ ) } for @fhs;

## Just truncate the tmp files to zero bytes
warn "Done; truncating and closing intermediaries\n";
truncate( $_, 0 ) and close( $_ ) for @fhs;

__END__
[ 7:06:52.07] c:\test>631400 junk.dat >junk.xdat
opened intermediaries
5001
intermediaries written
intermediaries rewound
Done; truncating and closing intermediaries
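If your column count is over the handle limit mentioned in Update 2, one workaround is to transpose the columns in batches, re-reading the input once per batch. This is only a sketch of that idea, not something I've run at your scale; the batch size is a guess to tune, and the input now has to be a real, seekable file named on the command line rather than a pipe.

#! perl -w
## Sketch: same idea as the script above, but processes the columns in
## batches so the number of open temp files stays below the handle limit.
use strict;

my $file  = shift or die "usage: $0 file > transposed\n";
my $batch = 1000;   ## columns per pass; keep this under your handle limit

open my $in, '<', $file or die "$file: $!";
my @first = split ' ', scalar <$in>;
my $ncols = @first;

for( my $start = 0; $start < $ncols; $start += $batch ) {
    my $end = $start + $batch - 1;
    $end = $ncols - 1 if $end > $ncols - 1;

    seek $in, 0, 0 or die "seek: $!";

    ## One temp file per column in this batch
    my @fhs;
    open $fhs[ $_ - $start ], '+>', sprintf( "/temp/%06d.tmp", $_ )
        or die "temp file for column $_: $!"
        for $start .. $end;

    ## Append this batch's fields, one input line at a time
    while( <$in> ) {
        my @fields = split;
        printf { $fhs[ $_ - $start ] } "%s ", $fields[ $_ ] for $start .. $end;
    }

    ## Rewind and emit one transposed row per column in the batch
    for my $fh ( @fhs ) {
        seek $fh, 0, 0;
        print do{ local $/; readline( $fh ) }, "\n";
        truncate $fh, 0;
        close $fh;
    }
}
close $in;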