wannabemonk has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to transpose a very large dataset that is a little under 1.7GB. In addition to transposing it, I need to merge it with another dataset using a matching id. Basically I need to do something like:
dataset 1:

    vr  1     2     3     5
    o1  a     a     b     b
    o2  c     c     d     d
    o3  e     e     f     f

dataset 2:

    id  date1  age
    1   2005   30
    2   2006   25
    3   2005   22
    4   2004   23
    5   2006   25

merged/transposed dataset:

    id  date1  age  o1  o2  o3
    1   2005   30   a   c   e
    2   2006   25   a   c   e
    3   2004   22   b   d   f
    4   2004   23
    5   2006   25   b   d   f

I have tried to use a combination of unix and perl, but the main issue, it seems, is that I run out of memory (the script worked for a smaller dataset). dataset 1 is 650 columns by 500,000 rows; dataset 2 is 10 columns by 1,250 rows. Is there a more memory-efficient way to do this? Thanks in advance!

# header row of dataset1 holds the ids
$ids = `head -1 dataset1`;
@matchid = split(" ", $ids);
shift(@matchid);

open(IN, "dataset2") || die;
open(OUT, ">mergeddataset") || die;

$line = <IN>;
chomp $line;

# first column of dataset1 holds the row labels (o1, o2, ...)
$h = `cut -f1 dataset1`;
@header = split(/\n/, $h);
shift(@header);

print OUT $line . " " . join(" ", @header) . "\n";

while ($line = <IN>) {
    chomp $line;
    @match = split(" ", $line);
    foreach $id (@header) {
        if ($id eq $match[1]) {
            # find the column number for $id in dataset1's header,
            # then extract that whole column with cut
            $column = `head -1 dataset1 | tr -s "\t" "\n" | grep -n $id | cut -f1 -d":"`;
            chomp $column;
            $a = `cut -f$column dataset1 | tail +2 | tr -s '\n' ' '`;
            chomp $a;
            @b = split(/ +/, $a);
            print OUT $line . " " . join(" ", @b) . "\n";
        }
    }
}
close(OUT);
close(IN);

Replies are listed 'Best First'.
Re: transposing and matching large datasets
by BrowserUk (Patriarch) on Aug 08, 2007 at 22:14 UTC

    Just to confirm my understanding, you are hoping to create a file that has 1250 lines, with each line containing 500,010 space delimited fields?

    What are you going to do with that file afterwards? How are you going to iterate that data? For example, it might make a lot more sense to prepend or append the 10 columns from file2 to file1--but it depends upon how you need to subsequently access the data.
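
    To make that alternative concrete: if file1 keeps its current orientation (ids across the top), the 10 file2 attributes could simply become 10 extra rows under file1's header, one per attribute, with no transpose at all. Here is a rough, untested sketch of that idea; the file names and the 'NA' placeholder for ids that have no entry in dataset2 are my assumptions, not something from the original post.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Slurp the small file (dataset2) into a hash of hashes keyed by id.
    open my $d2, '<', 'dataset2' or die "dataset2: $!";
    chomp( my $h2 = <$d2> );
    my ( undef, @attrs ) = split ' ', $h2;          # e.g. date1, age, ...
    my %rec;
    while ( <$d2> ) {
        my ( $id, @vals ) = split;
        @{ $rec{ $id } }{ @attrs } = @vals;
    }
    close $d2;

    # Stream the big file (dataset1); after its header row, emit one extra
    # row per dataset2 attribute, aligned with dataset1's id columns.
    open my $d1, '<', 'dataset1' or die "dataset1: $!";
    chomp( my $h1 = <$d1> );
    my ( undef, @ids ) = split ' ', $h1;            # header: vr 1 2 3 5 ...
    print "$h1\n";
    for my $attr ( @attrs ) {
        print join( ' ', $attr,
            map { exists $rec{ $_ } ? $rec{ $_ }{ $attr } : 'NA' } @ids ), "\n";
    }
    print while <$d1>;                              # pass the rest through untouched
    close $d1;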

    If you are going to do further processing on that data--I assume this 500,010 column file isn't meant to be read by humans?--then just reading and splitting those long lines 1 at a time is going to be a slow, memory intensive process.

    It would be interesting watching an RDBMS trying to do a join on a 500,010-column table :)

    Update: Since you seem to be doing this regularly, you might find this useful. It will transpose any(*) space-delimited file using minimal memory. It works by splitting the lines one at a time and accumulating the fields in separate temporary files, one per field. It then rewinds the temporaries, reads them back, and outputs the transposed records in sequence. It's coded to act as a command line filter, reading from stdin and writing to stdout. See after the __END__ tag for a usage example. A 5,000 line/650 field file took ~22 seconds, so your 500,000 line file should take ~30 minutes. Once the file is transposed, merging it with your other file should be simple.

    Update2: (*) Within the limits of your OS/CRT to hold 1 filehandle per field open (max: 2043 for me).

    #! perl -slw
    use strict;

    my $tempdir = '/temp';

    ## Read the first line of the file to determine how many columns
    my @fields = split ' ', <>;
    seek *ARGV, 0, 0;    ## And rewind for the loop below.

    my @fhs;
    open $fhs[ $_ ], '+>', sprintf "$tempdir/%03d.tmp", $_
        or die "Failed to open temp file $_; you probably ran out of handles: $!\n"
        for 0 .. $#fields;
    warn "opened intermediaries\n";

    ## Read each line, split it and append each field to its file
    while( <> ) {
        printf STDERR "\r$.\t";    ## Activity indicator
        my @fields = split;
        printf { $fhs[ $_ ] } "%s ", $fields[ $_ ] for 0 .. $#fields;
    }
    warn "\nintermediaries written\n";

    seek $_, 0, 0 for @fhs;
    warn "intermediaries rewound\n";

    ## Read each tmp file back and write to stdout
    print do{ local $/; readline( $_ ) } for @fhs;

    ## Just truncate the tmp files to zero bytes
    warn "Done; truncating and closing intermediaries\n";
    truncate( $_, 0 ) and close( $_ ) for @fhs;

    __END__
    [ 7:06:52.07] c:\test>631400 junk.dat >junk.xdat
    opened intermediaries
    5001
    intermediaries written
    intermediaries rewound
    Done; truncating and closing intermediaries
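
    As for the merge step mentioned above, here is a minimal sketch along the same lines. It assumes the transposed output was saved as 'dataset1.x' (that name is mine), and that dataset2, at 1,250 rows, fits comfortably in a hash:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Load dataset2 (the small file) keyed by id.
    open my $d2, '<', 'dataset2' or die "dataset2: $!";
    chomp( my $head2 = <$d2> );
    my %row;                                        # id => rest of the dataset2 line
    while ( <$d2> ) {
        chomp;
        my ( $id, $rest ) = split ' ', $_, 2;
        $row{ $id } = $rest;
    }
    close $d2;

    # Walk the transposed file; the first field of each line is an id from
    # dataset1's original header, so the merge is a straight hash lookup.
    open my $tx, '<', 'dataset1.x' or die "dataset1.x: $!";
    chomp( my $head1 = <$tx> );                     # "vr o1 o2 o3 ..."
    my ( undef, @obs ) = split ' ', $head1;
    print join( ' ', split( ' ', $head2 ), @obs ), "\n";
    while ( <$tx> ) {
        chomp;
        my ( $id, @fields ) = split;
        next unless exists $row{ $id };             # skip ids absent from dataset2
        print join( ' ', $id, $row{ $id }, @fields ), "\n";
    }
    close $tx;

    Note that this keeps only ids present in both files; if rows like id 4 in the example (in dataset2 but with no column in dataset1) must also appear in the output, you would need to remember which ids were matched and print the leftovers from %row afterwards.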

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Thank you! I was thinking about it overnight. Two different analyses need the data in different formats, but after thinking about the comments, I asked some people whether the dataset could be split in some way. It turns out that it can be split into 30 different files rather than one large one. I'm still going to give the code above a try (just for practice's sake). Thanks a bunch, and I will report back :)
Re: transposing and matching large datasets
by GrandFather (Saint) on Aug 08, 2007 at 20:44 UTC

    If the record lines in the files are fixed column width you may be able to use DBD::AnyData with DBI.
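
    For what it's worth, the basic DBD::AnyData incantation looks roughly like the sketch below. This is from memory, so check the format name and the ad_catalog call against the module's docs; the table/file names are placeholders, the 'Tab' format assumes tab-delimited data, and there is also a 'Fixed' format for fixed-width records.

    use strict;
    use warnings;
    use DBI;

    # Register a flat file as a table, then ordinary DBI/SQL applies.
    my $dbh = DBI->connect( 'dbi:AnyData(RaiseError=>1):' );
    $dbh->func( 'dataset2', 'Tab', 'dataset2', 'ad_catalog' );

    my $rows = $dbh->selectall_arrayref(
        'SELECT id, date1, age FROM dataset2 WHERE age > 24'
    );
    print "@$_\n" for @$rows;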


    DWIM is Perl's answer to Gödel
      Ok I will look into it. Thank you.

      If they are not fixed column width, I'd still recommend using some form of database, because this kind of problem is what databases are good at.

        because this kind of problem is what databases are good at.

        If the data is already in the database maybe--but only maybe. It depends on whether it is indexed correctly for this particular operation.

        If the data is not in a database, then in the time you'd spend writing the script to load the data into one, the operation could already be completed using simple flat-file operations. And that's before you actually load the data, index it, perform the join, and export all the data back to a flat file.

        If there is an ongoing need for relational operations on the dataset, then the cost of importing it may be amortisable over the long term. But importing the data into a db just to join it and export it all again is complete nonsense.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.