Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,
Sorry if this question is too easy or basic, but I'm new to Perl and under a lot of pressure to find an answer. I usually would use Stata or SAS for this, but I don't have either of them available right now, so my hope is Perl...
I have two data sets:
d1.dat:
v1 v2 v3
880 670 xx
860 670 xx
870 320 xx
890 320 xx

d2.dat:
t1 t2 t3
880 670 yy
860 670 yy
555 667 yy
555 668 yy

What I need is: I need to add a fourth column to the first data set, taken from the second one. Where the values of v1 and v2 equal the values of t1 and t2, I need to add the column t3 to d1.dat. The result should be:
v1 v2 v3 t3
880 670 xx yy
860 670 xx yy

I tried to use awk as well, but it's too slow (both files are about 500 MB...)
Thanks so much!

Mark.

Re: merging to databases... Should be easy...
by gbarr (Monk) on Oct 03, 2001 at 20:02 UTC
    With the files at 500MB each, unless you have a machine with a lot of memory, that's a lot to read into a hash.

    If the rows are already in a known order you could exploit that. If not, first sort them (minus their header lines) with

    tail +2 d1.dat | sort > d1.sorted
    tail +2 d2.dat | sort > d2.sorted

    The tail +2 just removes the header line from each before sorting. You can then merge the sorted files with

    open(D1, "<d1.sorted");
    open(D2, "<d2.sorted");

    # Seed the buffers with the header fields so the first pass
    # prints the merged header line "v1 v2 v3 t3".
    my $d1 = "v1 v2 v3";
    my $d2 = "t1 t2 t3";
    my ($d1k, $d2k) = ('', '');

    for (;;) {
        # When the two keys compare equal, print the d1 line plus
        # the third field of the matching d2 line.
        my $c = $d1k cmp $d2k
            or print join(" ", split(/\s+/, $d1), (split(/\s+/, $d2))[2]), "\n";

        # Advance whichever file (or both) holds the smaller key.
        if ($c <= 0) {
            last unless defined($d1 = <D1>);
            $d1k = ($d1 =~ /^(\d+\s+\d+)/)[0];
        }
        if ($c >= 0) {
            last unless defined($d2 = <D2>);
            $d2k = ($d2 =~ /^(\d+\s+\d+)/)[0];
        }
    }
Re: merging to databases... Should be easy...
by dragonchild (Archbishop) on Oct 03, 2001 at 19:22 UTC
    This sounds like a good use for a hash of hashes. Note that this assumes every pair of keys is unique. If you can guarantee the keys will always be numbers, you could try a list of lists instead.

    It'll take a long time, but do something like

    # Opens added so the snippet runs as-is; input names are from the
    # question, the output name is assumed.
    open(FIRST_FILE,  "<d1.dat") or die "d1.dat: $!";
    open(SECOND_FILE, "<d2.dat") or die "d2.dat: $!";
    open(OUT_FILE,    ">merged") or die "merged: $!";

    # Index the first file by its first two columns.
    my %firstDataSet;
    while (my $line = <FIRST_FILE>) {
        my @data = split /\s+/, $line;
        $firstDataSet{$data[0]}{$data[1]} = $data[2];
    }

    # Print a merged row for each line of the second file whose
    # (t1, t2) key appeared in the first file.
    while (my $line = <SECOND_FILE>) {
        my @data = split /\s+/, $line;
        if ($firstDataSet{$data[0]}{$data[1]}) {
            print OUT_FILE "$data[0] $data[1] $firstDataSet{$data[0]}{$data[1]} $data[2]\n";
        }
    }

    ------
    We are the carpenters and bricklayers of the Information Age.

    Don't go borrowing trouble. For programmers, this means: worry only about what you need to implement.

      With 500MB files, I'd consider MLDBM if I were to go this route (a rough sketch follows the pipeline below). What I'd probably do (if it's an option for the OP) is stick with unix utilities (I'm assuming the fields are space-delimited, and you don't mind sorted results):
      sed 's/ /|/' file1 | sort > tmp1
      sed 's/ /|/' file2 | sort > tmp2
      join tmp1 tmp2 | sed 's/|/ /' > result
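      For the MLDBM route, a minimal sketch might look like this (untested, and the index filename is invented). Note that MLDBM does not write back changes made through a nested reference, hence the fetch-modify-store dance:

      use Fcntl;
      use MLDBM qw(DB_File Storable);

      # Keep the hash-of-hashes on disk instead of in memory.
      # "d1.index" is a made-up name for the on-disk file.
      my %db;
      tie %db, 'MLDBM', 'd1.index', O_RDWR|O_CREAT, 0644
          or die "tie d1.index: $!";

      open( FIRST_FILE, "d1.dat" ) or die "open d1.dat: $!";
      while ( my $line = <FIRST_FILE> ) {
          my @data  = split /\s+/, $line;
          my $inner = $db{ $data[0] } || {};   # fetch
          $inner->{ $data[1] } = $data[2];     # modify
          $db{ $data[0] }      = $inner;       # store back (serializes)
      }
      close( FIRST_FILE );

      Lookups against %db then work like the in-memory version, at the cost of a Storable round-trip per access.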
Re: merging to databases... Should be easy...
by Fletch (Bishop) on Oct 03, 2001 at 20:01 UTC

    If both files are 500M you probably don't want to read them all into memory to work on them (unless you've got over a gig of RAM and/or swap handy). If you know that lines in the different files are always going to match up (line 25 in file one always corresponds to line 25 in file two):

    open( ONE, "file1" )   or die "open file1: $!";
    open( TWO, "file2" )   or die "open file2: $!";
    open( OUT, ">merged" ) or die "open merged: $!";

    while ( <ONE> ) {
        chomp;
        my $two = <TWO>;
        # Append the last field of the corresponding line from file2.
        print OUT $_, " ", ( split( " ", $two ) )[-1], "\n";
    }

    close( ONE );
    close( TWO );
    close( OUT );

    If you can't guarantee a one-to-one mapping between lines, then you probably want to look into using DB_File or the like to generate an on-disk hash of the second file, keyed by the first two columns (assuming that's what relates the values from the two files). Then you'd read through the first file and pull the corresponding value from the hash.
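    A rough sketch of that approach, assuming space-delimited fields and making up the index filename (an illustration, not tested code):

    use Fcntl;
    use DB_File;

    # Pass 1: index d2.dat on disk, keyed by its first two columns.
    # "d2.index" is an invented name for the DB_File store.
    my %lookup;
    tie %lookup, 'DB_File', 'd2.index', O_RDWR|O_CREAT, 0644, $DB_HASH
        or die "tie d2.index: $!";

    open( TWO, "d2.dat" ) or die "open d2.dat: $!";
    while ( <TWO> ) {
        my ( $t1, $t2, $t3 ) = split " ";
        $lookup{"$t1 $t2"} = $t3;
    }
    close( TWO );

    # Pass 2: walk d1.dat and append the matching t3 value.
    # Header lines never share a key, so they simply drop out.
    open( ONE, "d1.dat" )  or die "open d1.dat: $!";
    open( OUT, ">merged" ) or die "open merged: $!";
    while ( <ONE> ) {
        chomp;
        my ( $v1, $v2 ) = split " ";
        my $t3 = $lookup{"$v1 $v2"};
        print OUT "$_ $t3\n" if defined $t3;
    }
    close( ONE );
    close( OUT );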

Re: merging to databases... Should be easy...
by tommyw (Hermit) on Oct 04, 2001 at 01:35 UTC

    Since you mention Stata, SAS, and awk, I take it a non-perl solution's acceptable? And since you've got awk, you're on a unix system? Are you aware of join?

    Caveats: the files must be sorted, and it'll only join on one field, so you'd need to prepare the files:

    tail +2 d1.dat | sort | sed 's/ /:/' > d1.sorted
    tail +2 d2.dat | sort | sed 's/ /:/' > d2.sorted
    join d1.sorted d2.sorted | sed 's/:/ /'
    The advantage of this over the perl version is that somebody else has already gone to the trouble of providing all those useful command-line options: joining on fields other than the first, outputting something other than the complete set of fields, or, possibly more useful in your case, including the lines which don't match.

    And anyway, it never hurts to know about something else perl can render obsolete ;-)