Matching two files based on one column common in each

bluray has asked for the wisdom of the Perl Monks concerning the following question:

Hi Perlmonks,

This post is bit of follow up to my previous post. I got a refined code from Marshall and it worked fine. But, now I am using the same code in another file which has several repeat IDs. I want to match this with another file which is a subset of the first file. Though, I was able to get the result, because of the repeat IDs, my output has less number of rows than the first file. How can I tweak the code below to resolve this repeat IDs.

#!/usr/bin/perl -w
use strict;
use warnings;
use Text::CSV_XS;

open (my $FILE1, '<', "File1.csv") or die "cannot open file1 $!\n";
open (my $FILE2, '<', "File2.csv") or die "cannot open file3 $!\n";
open (my $FILE3, '>', "OutputFile.csv") or die "cannot open file3 $!\n
+";

my $csv = Text::CSV_XS->new ({ binary => 1, eol => $/ , 
                               sep_char => "\t", always_quote =>1});

print $FILE3 "Match\t".<$FILE1>; # header for file3
<$FILE2>;                        # skip header line of file2

my %file1; 
my %file2;

while (my $row = $csv->getline($FILE1)) 
{
    chomp $row;
    my @fields = @$row;
    my $id = $fields[1];
    my $newfield=["",@fields];
    if (exists($file1{$id})){
    print "repeat $id\n";
    }
    $file1{$id}=$newfield;
    
}

while (my $row = $csv->getline($FILE2)) 
{
    chomp $row;
    my @fields = @$row;
    my $id = $fields[0];
    $file2{$id}=["","",@fields,"","","","","","","","","","","","","",
+"","",""];
}

foreach my $id1 (keys %file1)
{
    if (exists $file2{$id1})
    {
        $file1{$id1}[0] ="HK";             #both files
        $csv->print ($FILE3, $file1{$id1});
    }
    else
    {
        $file1{$id1}[0] ="NOT_HK";                #file1 only
        $csv->print ($FILE3, $file1{$id1});
    }
}
[download]

Comment on Matching two files based on one column common in each Download Code

Replies are listed 'Best First'.
Re: Matching two files based on one column common in each by moritz (Cardinal) on Sep 28, 2011 at 16:08 UTC
It's hard to help you without example inputs and expected output, so I can just give you some vague directions. It might work by reading the second file into a hash first, and then going through each row of the first file and print it to the third file, augmented with the information from the hash. Or phrased differently, you probably need just one hash, not two. Perl 6 - second systems done right	[reply]
Re: Matching two files based on one column common in each by Marshall (Canon) on Sep 28, 2011 at 16:05 UTC
Yes, I remember this code from: Issues with Column headings. In the original problem statement, there was a need to check whether some id exists in file2 that does not in file1. That is why %file1 was created. If you look at the code you posted, there are 3 main steps: (1) make the hash %file1 (ids in file1), (2) make %file2 (ids in file2), (3) process keys (all unique id's) in %file1. Step(4) process all unique ids in %file2 is not there anymore - so the data structure for it is not needed either. So, the %file1 hash is not needed. The idea is to combine step1 and step3 together as a new step(3) and get rid of step (1) altogether. Take out the step 1 code. And then modify step(3): instead of `foreach my $id1 (keys %file1){...}`, just use the first part of what was step(1): `while (my $row = $csv->getline($FILE1)) { # $row is a reference to a row my @fields = @$row; # this explicitly de-references my $id1 = $fields[1]; if (exists $file2{$id1}) { $csv->print ($FILE3, "HL", @fields); #both files } else { $csv->print ($FILE3,"NOT_HK", @fields); #file1 only } }` [download] I didn't test this, but that should give you a repeated line if an id in file1 repeats on a different line. I do not know why you added "chomp $row;". That's not needed. $row is a reference to an array that the csv module creates when it reads the line from the file. The program won't bomb, but this line doesn't do anything useful.	[reply] [d/l] [select]
Re^2: Matching two files based on one column common in each by bluray (Sexton) on Sep 28, 2011 at 21:09 UTC
Hi Marshall, Thanks for the code. I did changed the code a bit to make it work. Now, I am getting the correct output file. `while (my $row = $csv->getline($FILE1)) { # $row is a reference to a row my @fields = @$row; # this explicitly de-references my $id1 = $fields[1]; if (exists $file2{$id1}) { my $fields_ref= \@fields; unshift(@$fields_ref, "HK"); $csv->print ($FILE3, $fields_ref); # $csv->print ($FILE3, "HK", @fields); #both files } else { my $fields_ref = \@fields; unshift (@$fields_ref, "NOT_HK"); $csv->print ($FILE3, $fields_ref); } }` [download]	[reply] [d/l]