Merge Columns of Multiple files based on Multiple Common Column

deepak2027 has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Merge Columns of Multiple files based on Multiple Common Column by Corion (Patriarch) on May 03, 2013 at 05:43 UTC
Have you seen join - join two files according to a common key?	[reply]
Re^2: Merge Columns of Multiple files based on Multiple Common Column by deepak2027 (Initiate) on May 03, 2013 at 06:57 UTC
Corion, thx for the response, but this join is not able to join based on multiple columns I am not sure how to assign the values to @left_key_colls and @right_key_colls	[reply]
Re^3: Merge Columns of Multiple files based on Multiple Common Column by Corion (Patriarch) on May 03, 2013 at 07:21 UTC
Either you hardcode them in the program, or you read Getopt::Long, which will explain to you how an option is handled if it is given multiple times. Also, as I now see, there is documentation at the bottom of the file which shows exactly your use case of specifying multiple columns. Maybe you should read that documentation first?	[reply]
Re^3: Merge Columns of Multiple files based on Multiple Common Column by hdb (Monsignor) on May 03, 2013 at 07:45 UTC
I have tried the following: (`join.pl` is Corion's script) C:\scripts>perl join.pl --null "--" --left 1,2,3 --right 1,2,3 --delim +iter " " --missing 1 --missing 2 file1.txt file2.txt > file12.txt C:\scripts>perl join.pl --null "--" --left 1,2,3 --right 1,2,3 --delim +iter " " --missing 1 --missing 2 file12.txt file3.txt ID NAME date Val1 Val2 Val3 Val4 Val101 Val102 Val103 Val104 Val101 Va +l102 Val103 Val104 1 ABC 04-05-2013 10 11 12 13 50 51 52 53 -- -- -- -- 2 XYZ 04-05-2013 21 22 23 24 -- -- -- -- -- -- -- -- 1 ABC 05-05-2013 30 31 32 33 -- -- -- -- 60 61 62 63 2 XYZ 05-05-2013 41 42 43 44 -- -- -- -- 71 72 73 74 [download] So nearly there...	[reply] [d/l] [select]
Re: Merge Columns of Multiple files based on Multiple Common Column by hdb (Monsignor) on May 03, 2013 at 07:07 UTC
Storing all data in a hash, so this might break down once your files become huge. There is absolutely no error handling in there, so if the file format varies or the entries have spaces in them, it will not work anymore. So please take it as proof of concept and not as production code. use strict; use warnings; sub readfile { my ( $filename, $hashref, $headref ) = @_; open my $fh, "<", $filename or die "Cannot open $filename!\n"; my $headers = <$fh>; my @h = split /\s/, $headers; $headref->{$_}++ for @h[3..$#h]; while( <$fh> ) { my @line = split /\s/; $hashref->{$line[0]}{$line[1]}{$line[2]}{$h[$_]} = $line[$_] f +or 3..$#h; } close $fh; } my %joined; my %headers; for my $file ( qw/file1.txt file2.txt file3.txt/ ) { readfile( $file, \%joined, \%headers ); } print "ID NAME date ", join( " ", sort keys %headers ), "\n"; for my $id ( sort keys %joined ) { for my $name ( sort keys %{$joined{$id}} ) { for my $date ( sort keys %{$joined{$id}{$name}} ) { print "$id $name $date "; print join " ", map { $joined{$id}{$name}{$date}{$_} // "- +-" } sort keys %headers; print "\n"; } } } [download]	[reply] [d/l]
Re^2: Merge Columns of Multiple files based on Multiple Common Column by karlgoethebier (Abbot) on May 03, 2013 at 10:55 UTC
ŤStoring all data in a hash, so this might break down once your files become huge.ť Mmh, why not (storing all data in a hash)? I take a look at my box @home. It has 8 GBytes RAM. Sometime ago something like this was a high-end server, short time later a high-end work station and now it is a standard desktop solution: `Karls-Mac-mini:monks karl$ top PhysMem: 833M wired, 1541M active, 260M inactive, 2634M used, 5558M fr +ee.` [download] So i don't have any pain in the ass to store my data in a big hash on my box. And if i have such huge files, i have another problem. Best regards, Karl ŤThe Crux of the Biscuit is the Apostropheť	[reply] [d/l]
Re: Merge Columns of Multiple files based on Multiple Common Column by Zzenmonk (Sexton) on May 03, 2013 at 15:30 UTC
Hi Whatever the above you use, you need to have a concept. Databases are doing this type of joins with two basic algorithms: sort merge or nested loop. The nested loop works the following way: Loop around one of the table. Set up the key made of your 3 columns. Loop around the second table and merge the rows each time the key match. Move to the next row of the first table and repeat the process. I do not recommend this algorithm because it has an awful performance impact. Actually it is one of two ways to take down a server for a couple of hours or weeks! The second way to do it is called sort merge: Sort file A on all 3 columns of your key in ascending way. Eventually use the sort command of the OS. It will use less RAM and be probably faster. Sort file B exactly the same way. Retrieve the first row of file A. Check if the keys of the first row of file B match. They should. Work yourself down file B as long as the keys match. When the keys in B do not match any more, move to the next row of A. Start at the same position in B and work yourself down as long as keys match. ... The above assumes one of both file has only one row per composite key ( your 3 columns). If this is the case you will process both files only once and your files are merged. If you have a so called many to many relation between the keys of your two files, you will have to work down file B several times. I recommend you start merging the two smallest files first, then you sort the result and add the next file with a new merge. Eventually search CPAN or the net for sort merge examples or modules. K The best medicine against depression is a cold beer!	[reply]