Outer Join on 2 files

healingtao has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, I'm a perl newbie and have the following requirement: 1) I need to do an outer join on 2 files which have about 20 columns each. Both files have headers

2) The join needs to happen on key based on 3 columns from each file, so both have to be sorted. The key is from column 1,4,5 in both files but key column headers don't all match (can we match the key based on column index instead of header names?)

3) If the key matches, I need the flexibility to add specific columns from any of the files to the output file.

4) If there is no key match, take the existing key/data from file and add it to the output leaving the other columns blank (outer join).

5) Need to generate a separate output file leaving inputs intact.

6) The input/output files need to use '|' separators

Here is an example with 2 input files and an output file using only small sample of columns:

File_Deal - the key here is parent_cusp,deal,tranche

parent_cusp|cusp|isin|deal|tranche|det_date|col_type
38375U|36182D|36182D1|HMAG|HMBSWEE|20150416|mortgage
383333|361333|3618333|HABS|HABSDDE|20150330|mortgage2
[download]

File_ATT - the key here is Vendor, deal, tranche

Vendor|visp|barnembly|deal|tranche|Fund|subamt|colamt|basamt
38375U|3DD82D|36FF333|HMAG|HMBSWEE|9010|765423|364633|46566
38EE33|361DD3|36LLE33|H99S|HAOOODE|2330|377233|347433|34488
[download]

File_Output

parent_cusp|cusp|isin|deal|tranche|det_date|col_type|Fund|subamt|colam
+t|basamt
38375U|36182D|36182D1|HMAG|HMBSWEE|20150416|mortgage|9010|765423|36463
+3|46566
383333|361333|3618333|HABS|HABSDDE|20150330|mortgage2||||
38EE33|||H99S|HAOOODE|||2330|377233|347433|34488
[download]

7) The output needs to contain all the columns/data from file1 (File_Deal) and for file2 (File_ATT) join only starting from column 6 (Fund) until the last column (basamt).

8) based on the output - row 1 is a match so I just join from both files

9) row 2 is a mismatch from File_deal, but since it's outer join I just copy row 2 from file_deal and just add blanks (since it's missing from file_Att

10) row 2 is a mismatch from File_att but once again I need outer join. But here I need to copy the key from file_detail to output as well, so I just write 3 key columns into 1st, 4th, and 5th column(leave others blank), then attach data from File_ATT

Here is some of the code I'm starting with from sample I found but need to come up with a solution quickly as deadline is approaching, can you please help. Thanks in advance


#! /bin/env perl

my $File_Deal = $ARGV[0];
my $File_ATT = $ARGV[1];

open(F1, "<", $File_Deal);
open(F2, "<", $File_ATT);

my %hash = ();
while( <F1> ) {
  chomp;

  my($c, $c2, $c4, @val1) = split/,/, $_, -1;
$hash{$c1.$c2.$c4}[0] = $val1[0];
  $hash{$key}[1] = $val1[1];
  $hash{$key}[2] = $val1[2];

}
while( <F2> ) {
  chomp;

  my($c1,$c5, $c7, @val2) = split/,/, $_, -1;
$hash{$c1.$c5.$7}[3] = $val2[0];
  $hash{$key}[4] = $val2[1];
  $hash{$key}[5] = $val2[2];

}

for my $key (sort keys %hash) {
print "$key: $hash{$key}[0]:$hash{$key}[1]\n";   

}
[download]

Comment on Outer Join on 2 files Select or Download Code

Replies are listed 'Best First'.
Re: Outer Join on 2 files by Happy-the-monk (Canon) on Apr 18, 2015 at 07:56 UTC
It strikes me as odd that you split on comma (,) but in your data the delimiter is the pipe (\|) symbol. Maybe it would speed up things to use that instead. Otherwise, since your wording is SQL-speech, you might feel more at home using actual SQL on your CSV text files -- if that is what you prefer -- and go for DBD::CSV (and DBI) I for myself have been happiest with the modernised Text CSV approach found in Text::CSV_XS or Text::CSV. Cheers, Sören Créateur des bugs mobiles - let loose once, run everywhere. (hooked on the Perl Programming language)	[reply]
Re: Outer Join on 2 files by erix (Prior) on Apr 18, 2015 at 08:39 UTC
'outer' join is perhaps not the best term to use; what you describe is a full join (or full outer join; but the word 'outer' is redundant). The easiest way is probably indeed with DBI/DBD-CSV but I wanted to run this through postgres using foreign data wrapper 'file_fdw': cat pm-1123847.sh #!/bin/sh # # File_Deal - the key here is parent_cusp,deal,tranche # echo "parent_cusp\|cusp\|isin\|deal\|tranche\|det_date\|col_type 38375U\|36182D\|36182D1\|HMAG\|HMBSWEE\|20150416\|mortgage 383333\|361333\|3618333\|HABS\|HABSDDE\|20150330\|mortgage2" > /tmp/file_dea +l.txt # # File_ATT - the key here is Vendor, deal, tranche # echo "Vendor\|visp\|barnembly\|deal\|tranche\|Fund\|subamt\|colamt\|basamt 38375U\|3DD82D\|36FF333\|HMAG\|HMBSWEE\|9010\|765423\|364633\|46566 38EE33\|361DD3\|36LLE33\|H99S\|HAOOODE\|2330\|377233\|347433\|34488" > /tmp/fi +le_att.txt echo " drop foreign table if exists t1 cascade; drop foreign table if exists t2 cascade; drop server if exists s cascade; " \| psql -X echo " create server s foreign data wrapper file_fdw; create foreign table t1 ( parent_cusp text, cusp text, isin text, deal text, tranche text, det_date text, col_type text ) server s options (filename '/tmp/file_deal.txt', format 'csv', delimiter E'\|', +header 'true'); create foreign table t2 ( vendor text, visp text, barnembly text, deal text, tranche text, fund text, subamt text, colamt text, basamt text ) server s options (filename '/tmp/file_att.txt', format 'csv', delimiter E'\|', h +eader 'true'); select * from t1; select * from t2; select coalesce(t1.parent_cusp, t2.vendor) as parent_cusp , t1.cusp , t1.isin , coalesce(t1.deal, t2.deal) as deal , coalesce(t1.tranche, t2.tranche) as tranche , t1.det_date , t1.col_type , t2.fund , t2.subamt , t2.colamt , t2.basamt from t1 full join t2 on (t1.parent_cusp, t1.deal, t1.tranche) = (t2.vendor, t2.deal, t2.tranche) ; " \| psql -X echo "File_Output parent_cusp\|cusp\|isin\|deal\|tranche\|det_date\|col_type\|Fund\|subamt\|colam +t\|basamt 38375U\|36182D\|36182D1\|HMAG\|HMBSWEE\|20150416\|mortgage\|9010\|765423\|36463 +3\|46566 383333\|361333\|3618333\|HABS\|HABSDDE\|20150330\|mortgage2\|\|\|\| 38EE33\|\|\|H99S\|HAOOODE\|\|\|2330\|377233\|347433\|34488 " [download] $ ./pm-1123847.sh DROP FOREIGN TABLE DROP FOREIGN TABLE DROP SERVER CREATE SERVER CREATE FOREIGN TABLE CREATE FOREIGN TABLE parent_cusp \| cusp \| isin \| deal \| tranche \| det_date \| col_type -------------+--------+---------+------+---------+----------+--------- +-- 38375U \| 36182D \| 36182D1 \| HMAG \| HMBSWEE \| 20150416 \| mortgage 383333 \| 361333 \| 3618333 \| HABS \| HABSDDE \| 20150330 \| mortgage +2 (2 rows) vendor \| visp \| barnembly \| deal \| tranche \| fund \| subamt \| colamt + \| basamt --------+--------+-----------+------+---------+------+--------+------- +-+-------- 38375U \| 3DD82D \| 36FF333 \| HMAG \| HMBSWEE \| 9010 \| 765423 \| 364633 + \| 46566 38EE33 \| 361DD3 \| 36LLE33 \| H99S \| HAOOODE \| 2330 \| 377233 \| 347433 + \| 34488 (2 rows) parent_cusp \| cusp \| isin \| deal \| tranche \| det_date \| col_type + \| fund \| subamt \| colamt \| basamt -------------+--------+---------+------+---------+----------+--------- +--+------+--------+--------+-------- 38375U \| 36182D \| 36182D1 \| HMAG \| HMBSWEE \| 20150416 \| mortgage + \| 9010 \| 765423 \| 364633 \| 46566 383333 \| 361333 \| 3618333 \| HABS \| HABSDDE \| 20150330 \| mortgage +2 \| \| \| \| 38EE33 \| \| \| H99S \| HAOOODE \| \| + \| 2330 \| 377233 \| 347433 \| 34488 (3 rows) File_Output parent_cusp\|cusp\|isin\|deal\|tranche\|det_date\|col_type\|Fund\|subamt\|colam +t\|basamt 38375U\|36182D\|36182D1\|HMAG\|HMBSWEE\|20150416\|mortgage\|9010\|765423\|36463 +3\|46566 383333\|361333\|3618333\|HABS\|HABSDDE\|20150330\|mortgage2\|\|\|\| 38EE33\|\|\|H99S\|HAOOODE\|\|\|2330\|377233\|347433\|34488 [download] 'Export' is easy with `copy ( <some SQL> ) to '/tmp/file_output.txt' with (format csv, delimiter '\|', header true)` [download] I've not decided for myself how this compares to a variant with DBI. I guess I'll go make such a DBI/DBD-CSV version now... UPDATE: added the actual output file generation; some small text edits.	[reply] [d/l] [select]
Re: Outer Join on 2 files by hdb (Monsignor) on Apr 18, 2015 at 08:55 UTC
Could not resist, here is a column index driven approach. use strict; use warnings; use Data::Dumper; my $data1 = <<EO1; 38375U\|36182D\|36182D1\|HMAG\|HMBSWEE\|20150416\|mortgage 383333\|361333\|3618333\|HABS\|HABSDDE\|20150330\|mortgage2 EO1 my $data2 = <<EO2; 38375U\|3DD82D\|36FF333\|HMAG\|HMBSWEE\|9010\|765423\|364633\|46566 38EE33\|361DD3\|36LLE33\|H99S\|HAOOODE\|2330\|377233\|347433\|34488 EO2 sub process_file { my ( $file, $keys, $in, $out, $hash ) = @_; open my $fh, "<", $file; while(<$fh>) { chomp; my @attr = split /[\|]/; @{ $hash->{join '\|', @attr[@$keys]} }[ @$out ] = @attr[ @$in ]; } close $fh; } my %result; process_file( \$data1, [0,3,4], [0..6], [0..6], \%result ); process_file( \$data2, [0,3,4], [0,3,4,5..8], [0,3,4,7..10], \%result +); print Dumper \%result; [download] Dealing with ignoring the header line and printing is left as an exercise...	[reply] [d/l]
Re^2: Outer Join on 2 files by healingtao (Novice) on May 04, 2015 at 05:30 UTC
Thanks very much to all. I used some of your ideas and got it to work	[reply]
Re: Outer Join on 2 files by FreeBeerReekingMonk (Deacon) on Apr 18, 2015 at 08:12 UTC
Here is a method, if I understand your wording right, to use header TEXT indexes. use strict; use warnings; my $File_Deal = $ARGV[0]; my $File_ATT = $ARGV[1]; open(F1, "<", $File_Deal) or die $!; my $header_line; my @header_array; # File_Deal chomp( $header_line = <F1> ); @header_array = split('\\|', $header_line ); my %HEADER_File_Deal = map { $header_array[$_] => $_ } (0..@header_arr +ay-1); my %hash = (); while( <F1> ) { chomp; my @values = split(/\\|/, $_); # parent_cusp\|cusp\|isin\|deal\|tranche\|det_date\|col_type my $key = $values[$HEADER_File_Deal{"parent_cusp"}] . $values[$HEADER_File_Deal{"deal"}] . $values[$HEADER_File_Deal{"tranche"}]; print "key=$key is ( @values )\n"; } [download]	[reply] [d/l]
Re^2: Outer Join on 2 files by FreeBeerReekingMonk (Deacon) on Apr 18, 2015 at 08:40 UTC
...or not... here is the way to fetch it on index `use strict; use warnings; my $File_Deal = $ARGV[0]; my $File_ATT = $ARGV[1]; open(F1, "<", $File_Deal) or die $!; my $header_line; my @header_array; # File_Deal header, throw away chomp( $header_line = <F1> ); my %hash = (); while( <F1> ) { chomp; my @values = split(/\\|/, $_); # 0=parent_cusp\|1=cusp\|2=isin\|3=deal\|4=tranche\|5=det_date\|6=col_type my $key = $values[1-1] . $values[4-1] . $values[5-1]; print "key=$key is ( @values )\n"; }` [download]	[reply] [d/l]