Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,
Sorry if this question is too easy or basic, but I'm new to Perl and under a lot of pressure to find an answer. I usually would use Stata or SAS for this, but I don't have either of them available right now, so my hope is Perl...
I have two data sets:
d1.dat:
v1 v2 v3
880 670 xx
860 670 xx
870 320 xx
890 320 xx

d2.dat:
t1 t2 t3
880 670 yy
860 670 yy
555 667 yy
555 668 yy

What I need is: I need to add a fourth column to the first data set, taken from the second one. Where the values of v1 and v2 equal the values of t1 and t2, I need to add the column t3 to d1.dat. The result should be:
v1 v2 v3 t3
880 670 xx yy
860 670 xx yy

I tried to use awk as well, but it's too slow (both files are about 500 MB...)
Thanks so much!

Mark.

Re: merging to databases... Should be easy...
by gbarr (Monk) on Oct 03, 2001 at 20:02 UTC
    With the files at 500MB each, unless you have a machine with a lot of memory, that's a lot to read into a hash.

    If the rows are already in a known order you could exploit that. If not, first sort them (minus their header lines) with

    tail +2 d1.dat | sort > d1.sorted
    tail +2 d2.dat | sort > d2.sorted

    The tail +2 just removes the header line from each before sorting. You can then merge the sorted files with

    open(D1, "<d1.sorted");
    open(D2, "<d2.sorted");

    # Seed the buffers with the header fields so the first pass
    # prints the merged header line "v1 v2 v3 t3".
    my $d1 = "v1 v2 v3";
    my $d2 = "t1 t2 t3";
    my ($d1k, $d2k) = ('', '');

    for (;;) {
        # When the two keys compare equal, print the d1 line plus
        # the third field of the matching d2 line.
        my $c = $d1k cmp $d2k
            or print join(" ", split(/\s+/, $d1), (split(/\s+/, $d2))[2]), "\n";

        # Advance whichever file (or both) holds the smaller key.
        if ($c <= 0) {
            last unless defined($d1 = <D1>);
            $d1k = ($d1 =~ /^(\d+\s+\d+)/)[0];
        }
        if ($c >= 0) {
            last unless defined($d2 = <D2>);
            $d2k = ($d2 =~ /^(\d+\s+\d+)/)[0];
        }
    }
Re: merging to databases... Should be easy...
by dragonchild (Archbishop) on Oct 03, 2001 at 19:22 UTC
    This sounds like a good use for a hash of hashes. Note that this assumes every pair of keys is unique. If you can guarantee the keys will always be numbers, you could try a list of lists instead.

    It'll take a long time, but do something like

    # Opens added so the snippet runs as-is; input names are from the
    # question, the output name is assumed.
    open(FIRST_FILE,  "<d1.dat") or die "d1.dat: $!";
    open(SECOND_FILE, "<d2.dat") or die "d2.dat: $!";
    open(OUT_FILE,    ">merged") or die "merged: $!";

    # Index the first file by its first two columns.
    my %firstDataSet;
    while (my $line = <FIRST_FILE>) {
        my @data = split /\s+/, $line;
        $firstDataSet{$data[0]}{$data[1]} = $data[2];
    }

    # Print a merged row for each line of the second file whose
    # (t1, t2) key appeared in the first file.
    while (my $line = <SECOND_FILE>) {
        my @data = split /\s+/, $line;
        if ($firstDataSet{$data[0]}{$data[1]}) {
            print OUT_FILE "$data[0] $data[1] $firstDataSet{$data[0]}{$data[1]} $data[2]\n";
        }
    }

    ------
    We are the carpenters and bricklayers of the Information Age.

    Don't go borrowing trouble. For programmers, this means: worry only about what you need to implement.

      With 500MB files, I'd consider MLDBM if I were to go this route (a rough sketch follows the pipeline below). What I'd probably do (if it's an option for the OP) is stick with unix utilities (I'm assuming the fields are space-delimited, and you don't mind sorted results):
      sed 's/ /|/' file1 | sort > tmp1
      sed 's/ /|/' file2 | sort > tmp2
      join tmp1 tmp2 | sed 's/|/ /' > result
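      For the MLDBM route, a minimal sketch might look like this (untested, and the index filename is invented). Note that MLDBM does not write back changes made through a nested reference, hence the fetch-modify-store dance:

      use Fcntl;
      use MLDBM qw(DB_File Storable);

      # Keep the hash-of-hashes on disk instead of in memory.
      # "d1.index" is a made-up name for the on-disk file.
      my %db;
      tie %db, 'MLDBM', 'd1.index', O_RDWR|O_CREAT, 0644
          or die "tie d1.index: $!";

      open( FIRST_FILE, "d1.dat" ) or die "open d1.dat: $!";
      while ( my $line = <FIRST_FILE> ) {
          my @data  = split /\s+/, $line;
          my $inner = $db{ $data[0] } || {};   # fetch
          $inner->{ $data[1] } = $data[2];     # modify
          $db{ $data[0] }      = $inner;       # store back (serializes)
      }
      close( FIRST_FILE );

      Lookups against %db then work like the in-memory version, at the cost of a Storable round-trip per access.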
Re: merging to databases... Should be easy...
by Fletch (Bishop) on Oct 03, 2001 at 20:01 UTC

    If both files are 500M you probably don't want to read them all into memory to work on them (unless you've got over a gig of RAM and/or swap handy). If you know that lines in the different files are always going to match up (line 25 in file one always corresponds to line 25 in file two):

    open( ONE, "file1" )   or die "open file1: $!";
    open( TWO, "file2" )   or die "open file2: $!";
    open( OUT, ">merged" ) or die "open merged: $!";

    while ( <ONE> ) {
        chomp;
        my $two = <TWO>;
        # Append the last field of the corresponding line from file2.
        print OUT $_, " ", ( split( " ", $two ) )[-1], "\n";
    }

    close( ONE );
    close( TWO );
    close( OUT );

    If you can't guarantee a one-to-one mapping between lines, then you probably want to look into using DB_File or the like to generate an on-disk hash of the second file, keyed by the first two columns (assuming that's what relates the values from the two files). Then you'd read through the first file and pull the corresponding value from the hash.
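    A rough sketch of that approach, assuming space-delimited fields and making up the index filename (an illustration, not tested code):

    use Fcntl;
    use DB_File;

    # Pass 1: index d2.dat on disk, keyed by its first two columns.
    # "d2.index" is an invented name for the DB_File store.
    my %lookup;
    tie %lookup, 'DB_File', 'd2.index', O_RDWR|O_CREAT, 0644, $DB_HASH
        or die "tie d2.index: $!";

    open( TWO, "d2.dat" ) or die "open d2.dat: $!";
    while ( <TWO> ) {
        my ( $t1, $t2, $t3 ) = split " ";
        $lookup{"$t1 $t2"} = $t3;
    }
    close( TWO );

    # Pass 2: walk d1.dat and append the matching t3 value.
    # Header lines never share a key, so they simply drop out.
    open( ONE, "d1.dat" )  or die "open d1.dat: $!";
    open( OUT, ">merged" ) or die "open merged: $!";
    while ( <ONE> ) {
        chomp;
        my ( $v1, $v2 ) = split " ";
        my $t3 = $lookup{"$v1 $v2"};
        print OUT "$_ $t3\n" if defined $t3;
    }
    close( ONE );
    close( OUT );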

Re: merging to databases... Should be easy...
by tommyw (Hermit) on Oct 04, 2001 at 01:35 UTC

    Since you mention Stata, SAS, and awk, I take it a non-perl solution's acceptable? And since you've got awk, you're on a unix system? Are you aware of join?

    Caveats: the files must be sorted, and it'll only join on one field, so you'd need to prepare the files:

    tail +2 d1.dat | sort | sed 's/ /:/' > d1.sorted
    tail +2 d2.dat | sort | sed 's/ /:/' > d2.sorted
    join d1.sorted d2.sorted | sed 's/:/ /'
    The advantage of this over the perl version is that somebody else has already gone to the trouble of providing all those useful command-line options: joining on fields other than the first, outputting something other than the complete set of fields, or, possibly more useful in your case, including the lines which don't match.

    And anyway, it never hurts to know about something else perl can render obsolete ;-)