I am trying to transpose a very large dataset that is a little under 1.7GB. In addition to transposing it, I need to merge it with another dataset using a matching id. Basically I need to do something like:
dataset 1:
vr 1 2 3 5
o1 a a b b
o2 c c d d
o3 e e f f
dataset 2:
id date1 age
1 2005 30
2 2006 25
3 2005 22
4 2004 23
5 2006 25
merged/transposed dataset:
id date1 age o1 o2 o3
1 2005 30 a c e
2 2006 25 a c e
3 2005 22 b d f
4 2004 23
5 2006 25 b d f
I have tried to use a combination of unix and perl, but the main issue seems to be that I run out of memory (the script worked for a smaller dataset). dataset 1 has 650 columns and 500,000 rows; dataset 2 has 10 columns and 1250 rows. Is there a more memory-efficient way to do this? Thanks in advance!
$ids = `head -1 dataset1`;
@matchid = split(" ", $ids);
shift(@matchid);
open(IN, "dataset2" ) || die;
open(OUT, ">mergeddataset") || die ;
$line=<IN>; chomp $line;
$h = `cut -f1 dataset1`;
@header = split(/\n/,$h);
shift(@header);
print OUT $line." ".join(" ",@header)."\n";
while ($line=<IN>)
{
chomp $line;
@match=split(" ",$line);
foreach $id (@matchid)
{
if ($id eq $match[0])
{
$column=`head -1 dataset1|tr -s "\t" "\n" | grep -n $id|cut -f1 -d":"`;
chomp $column;
$a = `cut -f$column dataset1| tail -n +2 | tr -s '\n' ' '`;
chomp $a;
@b = split(/ +/, $a);
print OUT $line." ".join(" ",@b)."\n";
}
}
}
close(OUT);
close(IN);
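For comparison, here is a minimal sketch of one way the same merge might be done in pure Perl with bounded memory. It is not a tested solution for the real files. The sub name merge_transposed and its batch parameter are hypothetical, and it assumes whitespace-delimited fields with no empty cells, ids in the first column of dataset2, and ids in the header line of dataset1. The idea is to keep the small dataset2 in memory and re-read dataset1 once per batch of ids, appending only the wanted columns to strings instead of shelling out to cut for every id.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: for each row of $d2, append the matching column of
# $d1 (transposed into a row). Only $batch columns are held in memory per pass.
sub merge_transposed {
    my ($d1, $d2, $out, $batch) = @_;
    $batch ||= 50;

    # dataset2 is small, so keep all of its rows in memory.
    open(my $in2, '<', $d2) or die "Cannot open $d2: $!";
    chomp(my $head2 = <$in2>);
    my @rows2;
    while (my $line = <$in2>) {
        chomp $line;
        push @rows2, $line if $line =~ /\S/;
    }
    close($in2);

    # Map each id in dataset1's header line to its column index.
    open(my $in1, '<', $d1) or die "Cannot open $d1: $!";
    chomp(my $head1 = <$in1>);
    my @ids = split ' ', $head1;
    my %col_of = map { $ids[$_] => $_ } 1 .. $#ids;

    # First pass: collect the row labels (o1, o2, ...) for the output header.
    my $labels = '';
    while (my $line = <$in1>) {
        my ($label) = split ' ', $line, 2;
        $labels .= " $label" if defined $label;
    }
    close($in1);

    open(my $fh_out, '>', $out) or die "Cannot open $out: $!";
    print {$fh_out} $head2, $labels, "\n";

    # One pass over dataset1 per batch of dataset2 rows.
    while (my @chunk = splice(@rows2, 0, $batch)) {
        my %acc;    # column index => transposed row built so far
        for my $row (@chunk) {
            my ($id) = split ' ', $row;
            $acc{ $col_of{$id} } = '' if exists $col_of{$id};
        }
        my @cols = keys %acc;
        if (@cols) {
            open($in1, '<', $d1) or die "Cannot open $d1: $!";
            scalar <$in1>;    # skip the header line
            while (my $line = <$in1>) {
                my @f = split ' ', $line;
                next unless @f;
                $acc{$_} .= ' ' . ($f[$_] // '') for @cols;
            }
            close($in1);
        }
        # Ids with no column in dataset1 are printed without extra fields.
        for my $row (@chunk) {
            my ($id) = split ' ', $row;
            my $c = $col_of{$id};
            print {$fh_out} $row, (defined $c ? $acc{$c} : ''), "\n";
        }
    }
    close($fh_out) or die "Cannot close $out: $!";
}

# Command-line use: script.pl dataset1 dataset2 mergeddataset [batch]
if (!caller && @ARGV >= 3) {
    merge_transposed(@ARGV);
}
```

The batch size trades memory for run time: each pass re-reads all of dataset1, so a small batch means many passes over the 1.7GB file, while a batch at least as large as the number of rows in dataset2 does the work in a single pass at the cost of holding every matched column in memory at once.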