deepak2027 has asked for the wisdom of the Perl Monks concerning the following question:

Hi All, I am a new bie for perl and perl monks.. I had gone through many posts of perl monks but could not find the answer for a problem which i m having. I have a task of merging columns of multiple files(tab seperated) based on multiple common columns. e.g.

File1.txt
ID NAME date Val1 Val2 Val3 Val4
1 ABC 04-05-2013 10 11 12 13
2 XYZ 04-05-2013 21 22 23 24
1 ABC 05-05-2013 30 31 32 33
2 XYZ 05-05-2013 41 42 43 44


File2.txt
ID NAME date Val101 Val102 Val103 Val104
1 ABC 04-05-2013 50 51 52 53


File3.txt
ID NAME date Val101 Val102 Val103 Val104
1 ABC 05-05-2013 60 61 62 63
2 XYZ 05-05-2013 71 72 73 74


All these files should merge as below
result.txt
ID NAME date Val1 Val2 Val3 Val4 Val101 Val102 Val103 Val104
1 ABC 04-05-2013 10 11 12 13 50 51 52 53
2 XYZ 04-05-2013 21 22 23 24 "--" "--" "--" "--"
1 ABC 05-05-2013 30 31 32 33 60 61 62 63
2 XYZ 05-05-2013 41 42 43 44 71 72 73 74

The Data Files cand have rows in few thousands
  • Comment on Merge Columns of Multiple files based on Multiple Common Column

Replies are listed 'Best First'.
Re: Merge Columns of Multiple files based on Multiple Common Column
by Corion (Patriarch) on May 03, 2013 at 05:43 UTC

      Corion, thx for the response, but this join is not able
      to join based on multiple columns
      I am not sure how to assign the values to @left_key_colls and @right_key_colls

        Either you hardcode them in the program, or you read Getopt::Long, which will explain to you how an option is handled if it is given multiple times.

        Also, as I now see, there is documentation at the bottom of the file which shows exactly your use case of specifying multiple columns. Maybe you should read that documentation first?

        I have tried the following: (join.pl is Corion's script)

        C:\scripts>perl join.pl --null "--" --left 1,2,3 --right 1,2,3 --delim +iter " " --missing 1 --missing 2 file1.txt file2.txt > file12.txt C:\scripts>perl join.pl --null "--" --left 1,2,3 --right 1,2,3 --delim +iter " " --missing 1 --missing 2 file12.txt file3.txt ID NAME date Val1 Val2 Val3 Val4 Val101 Val102 Val103 Val104 Val101 Va +l102 Val103 Val104 1 ABC 04-05-2013 10 11 12 13 50 51 52 53 -- -- -- -- 2 XYZ 04-05-2013 21 22 23 24 -- -- -- -- -- -- -- -- 1 ABC 05-05-2013 30 31 32 33 -- -- -- -- 60 61 62 63 2 XYZ 05-05-2013 41 42 43 44 -- -- -- -- 71 72 73 74

        So nearly there...

Re: Merge Columns of Multiple files based on Multiple Common Column
by hdb (Monsignor) on May 03, 2013 at 07:07 UTC

    Storing all data in a hash, so this might break down once your files become huge. There is absolutely no error handling in there, so if the file format varies or the entries have spaces in them, it will not work anymore. So please take it as proof of concept and not as production code.

    use strict; use warnings; sub readfile { my ( $filename, $hashref, $headref ) = @_; open my $fh, "<", $filename or die "Cannot open $filename!\n"; my $headers = <$fh>; my @h = split /\s/, $headers; $headref->{$_}++ for @h[3..$#h]; while( <$fh> ) { my @line = split /\s/; $hashref->{$line[0]}{$line[1]}{$line[2]}{$h[$_]} = $line[$_] f +or 3..$#h; } close $fh; } my %joined; my %headers; for my $file ( qw/file1.txt file2.txt file3.txt/ ) { readfile( $file, \%joined, \%headers ); } print "ID NAME date ", join( " ", sort keys %headers ), "\n"; for my $id ( sort keys %joined ) { for my $name ( sort keys %{$joined{$id}} ) { for my $date ( sort keys %{$joined{$id}{$name}} ) { print "$id $name $date "; print join " ", map { $joined{$id}{$name}{$date}{$_} // "- +-" } sort keys %headers; print "\n"; } } }
      «Storing all data in a hash, so this might break down once your files become huge.»

      Mmh, why not (storing all data in a hash)?

      I take a look at my box @home. It has 8 GBytes RAM. Sometime ago something like this was a high-end server, short time later a high-end work station and now it is a standard desktop solution:

      Karls-Mac-mini:monks karl$ top PhysMem: 833M wired, 1541M active, 260M inactive, 2634M used, 5558M fr +ee.

      So i don't have any pain in the ass to store my data in a big hash on my box.

      And if i have such huge files, i have another problem.

      Best regards, Karl

      «The Crux of the Biscuit is the Apostrophe»

Re: Merge Columns of Multiple files based on Multiple Common Column
by Zzenmonk (Sexton) on May 03, 2013 at 15:30 UTC

    Hi

    Whatever the above you use, you need to have a concept. Databases are doing this type of joins with two basic algorithms: sort merge or nested loop.

    The nested loop works the following way:

    • Loop around one of the table.
    • Set up the key made of your 3 columns.
    • Loop around the second table and merge the rows each time the key match.
    • Move to the next row of the first table and repeat the process.

    I do not recommend this algorithm because it has an awful performance impact. Actually it is one of two ways to take down a server for a couple of hours or weeks!

    The second way to do it is called sort merge:

    • Sort file A on all 3 columns of your key in ascending way. Eventually use the sort command of the OS. It will use less RAM and be probably faster.
    • Sort file B exactly the same way.
    • Retrieve the first row of file A.
    • Check if the keys of the first row of file B match. They should.
    • Work yourself down file B as long as the keys match.
    • When the keys in B do not match any more, move to the next row of A.
    • Start at the same position in B and work yourself down as long as keys match.
    • ...

    The above assumes one of both file has only one row per composite key ( your 3 columns). If this is the case you will process both files only once and your files are merged. If you have a so called many to many relation between the keys of your two files, you will have to work down file B several times.

    I recommend you start merging the two smallest files first, then you sort the result and add the next file with a new merge.

    Eventually search CPAN or the net for sort merge examples or modules.

    K

    The best medicine against depression is a cold beer!