I am trying to transpose a very large dataset (a little under 1.7 GB). In addition to transposing it, I need to merge it with another dataset on a matching id. Basically I need to do something like:
dataset 1: vr 1 2 3 5 o1 a a b b o2 c c d d o3 e e f f dataset 2: id date1 age 1 2005 30 2 2006 25 3 2005 22 4 2004 23 5 2006 25 merged/tranposed dataset: id date1 age o1 o2 o3 1 2005 30 a c e 2 2006 25 a c e 3 2004 22 b d f 4 2004 23 5 2006 25 b d f

I have tried a combination of Unix tools and Perl, but the main issue seems to be that I run out of memory (the script worked on a smaller dataset). Dataset 1 has 650 columns and 500,000 rows; dataset 2 has 10 columns and 1,250 rows. Is there a more memory-efficient way to do this? Thanks in advance!

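$ids = `head -1 dataset1`;
chomp $ids;
@matchid = split(" ", $ids);
shift(@matchid);    # drop the "vr" label, leaving only the ids

open(IN, "dataset2") || die "Cannot open dataset2: $!";
open(OUT, ">mergeddataset") || die "Cannot open mergeddataset: $!";

# Header line: dataset2 column names followed by the row labels of dataset1
$line = <IN>;
chomp $line;
$h = `cut -f1 dataset1`;
@header = split(/\n/, $h);
shift(@header);
print OUT $line." ".join(" ", @header)."\n";

while ($line = <IN>) {
    chomp $line;
    @match = split(" ", $line);
    $found = 0;
    foreach $id (@matchid) {
        if ($id eq $match[0]) {    # the id is the first field of dataset2
            # Locate the dataset1 column whose header is exactly this id
            $column = `head -1 dataset1 | tr -s "\t" "\n" | grep -nx $id | cut -f1 -d":"`;
            chomp $column;
            # Note: this re-reads all of dataset1 for every matching id
            $vals = `cut -f$column dataset1 | tail -n +2 | tr -s '\n' ' '`;
            chomp $vals;
            @vals = split(/ +/, $vals);
            print OUT $line." ".join(" ", @vals)."\n";
            $found = 1;
        }
    }
    print OUT $line."\n" unless $found;    # ids with no column in dataset1
}
close(OUT);
close(IN);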
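For comparison, here is a minimal sketch of one way the same result might be produced with a single pass over dataset1 and very little memory. It is only an illustration under several assumptions: both files are whitespace-separated, every row of dataset1 has a value for every column, there is enough free disk space for temporary files about the size of dataset1, and the per-process open-file limit is above the number of columns (650 here). The names `transpose_and_merge` and `slurp` are hypothetical helpers, not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: return the whole content of a file as one string.
sub slurp {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    local $/;                          # read everything at once
    my $content = <$fh>;
    close $fh;
    return defined $content ? $content : '';
}

# Hypothetical helper: spill each column of $data_file into its own temp file
# (one pass), then append the matching column to each row of $id_file.
sub transpose_and_merge {
    my ($data_file, $id_file, $out_file) = @_;
    my $dir = tempdir(CLEANUP => 1);   # removed when the program exits

    open(my $in, '<', $data_file) or die "Cannot open $data_file: $!";
    my $header = <$in>;
    chomp $header;
    my @ids = split ' ', $header;      # $ids[0] is the label column ("vr")

    # One write handle per column, plus a map from id to column index.
    my (@fh, %col_of);
    for my $i (0 .. $#ids) {
        open($fh[$i], '>', "$dir/col$i") or die "Cannot create temp file: $!";
        $col_of{ $ids[$i] } = $i if $i > 0;
    }

    # Single pass: only one row of dataset1 is held in memory at a time.
    while (my $line = <$in>) {
        chomp $line;
        my @f = split ' ', $line;
        for my $i (0 .. $#ids) {
            print { $fh[$i] } ' ', $f[$i] // '';
        }
    }
    close $_ for @fh;
    close $in;

    open(my $idin, '<', $id_file)  or die "Cannot open $id_file: $!";
    open(my $out,  '>', $out_file) or die "Cannot open $out_file: $!";

    # Header: dataset2 column names, then the row labels from column 0.
    my $head = <$idin>;
    chomp $head;
    print $out $head, slurp("$dir/col0"), "\n";

    while (my $line = <$idin>) {
        chomp $line;
        next unless $line =~ /\S/;     # skip blank lines
        my ($id) = split ' ', $line;
        print $out $line;
        print $out slurp("$dir/col$col_of{$id}") if exists $col_of{$id};
        print $out "\n";
    }
    close $out;
    close $idin;
}

# Usage: perl script.pl dataset1 dataset2 mergeddataset
transpose_and_merge(@ARGV) if @ARGV == 3;
```

The trade-off in this sketch is disk for memory: dataset1 is read once instead of once per id, and the largest thing held in memory is a single transposed column (one line of roughly 500,000 values).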

In reply to transposing and matching large datasets by wannabemonk
