mao9856 has asked for the wisdom of the Perl Monks concerning the following question:

Hello! I want to compare two large data files and print only the records whose ids appear in both.

file1:

ABS0056

ABS0057

ABS0058

ABS0059

...........

file2:

id “ABS0056”; name “SAM”;

id “ABS0059”; name “JOE”;

id “ABS0060”; name “MARY”;

id “ABS0057”; name “BILL”;

id “ABS0057”; name “BILL”;

id “ABS0056”; name “SAM”;

id “ABS0065”; name “RONIE”;

id “ABS0061”; name “STEPHAN”

id “ABS0057”; name “BILL”;

id “ABS0056”; name “SAM”;

........

I used awk to remove the semicolons and inverted commas. The two columns are separated by a tab, so my file now looks like this:

file3:

ABS0056 SAM

ABS0059 JOE

ABS0060 MARY

ABS0057 BILL

ABS0057 BILL

ABS0056 SAM

ABS0065 RONIE

ABS0061 STEPHAN

ABS0057 BILL

ABS0056 SAM

..............

I want my output in the following format:

ABS0056 SAM

ABS0057 BILL

ABS0059 JOE

I have tried the code below:

#!/usr/bin/env perl
use strict;
use warnings;

open FILE1, "< file1" or die;
my $keyRef;
while (<FILE1>) {
    chomp;
    $keyRef->{$_} = 1;
}
close FILE1;

open FILE3, "< file3" or die;
while (<FILE2>) {
    chomp;
    my ($testKey, $name) = split("\t", $_);
    if (defined $keyRef->{$testKey}) {
        print STDOUT "$_\n";
    }
}
close FILE3;

Thanking in advance

Replies are listed 'Best First'.
Re: print data with matching id from two different files
by hippo (Archbishop) on Oct 31, 2017 at 11:16 UTC

    As you parse file2 you are neglecting to remove the quotes and the semi-colon, thus the key will never match.

    Update: You aren't removing the leading "id " either. Are you sure those are tabs in your data?

    Probably best to invoke item 2 from the Basic debugging checklist:

    my ($testKey, $name) = split("\t", $_);
    print "DEBUG: key is '$testKey', name is '$name'\n";
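If you would rather skip the awk step, the cleanup described above (dropping the leading "id ", the quotes and the semicolons) can be done while reading file2. This is only a minimal sketch: parse_record is a made-up helper name, and it assumes the real data uses plain ASCII double quotes rather than the curly quotes shown in the post.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): extract id and name from one
# raw file2 line. Assumes plain ASCII double quotes in the real data.
sub parse_record {
    my ($line) = @_;
    if ($line =~ /^\s*id\s+"([^"]+)"\s*;\s*name\s+"([^"]+)"/) {
        return ($1, $2);
    }
    return;    # empty list when the line does not look like a record
}

# The second sample line has no trailing semicolon, like one line in the post.
for my $line ('id "ABS0056"; name "SAM";', 'id "ABS0061"; name "STEPHAN"') {
    my ($id, $name) = parse_record($line);
    print "$id\t$name\n";
}
```

Returning an empty list on a failed match lets the caller write `my ($id, $name) = parse_record($_) or next;` inside a read loop.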

      Please excuse me, I forgot to mention that I removed the semicolons and inverted commas using awk, so my code is written for file3, which has neither. Yes, there are tabs in file3.

        In that case the likely problem is here:

        open FILE3, "< file3" or die;
        while (<FILE2>) {

        As you can clearly see, you open FILE3 but try to read from FILE2 which is not opened. Running your code gives you this warning:

        Name "main::FILE2" used only once: possible typo at 1202408.pl line 14.

        which should have alerted you to this. Always address the warnings.
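One way to make this kind of slip impossible to miss is to use lexical filehandles: under strict, reading from an undeclared `$fh2` is a compile-time error rather than a warning. The sketch below is an illustration, not the original poster's code; read_ids and filter_records are hypothetical names, and the demo reads from in-memory strings so that it runs without any files.

```perl
use strict;
use warnings;

# Read one id per line from an open filehandle into a lookup hash.
sub read_ids {
    my ($fh) = @_;
    my %ids;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line eq '';
        $ids{$line} = 1;
    }
    return \%ids;
}

# Return the tab-separated records whose first field is a known id.
sub filter_records {
    my ($fh, $ids) = @_;
    my @hits;
    while (my $line = <$fh>) {
        chomp $line;
        my ($key) = split /\t/, $line;
        push @hits, $line if defined $key && exists $ids->{$key};
    }
    return @hits;
}

# Demo data (assumed); with real files: open(my $fh1, '<', 'file1') or die "file1: $!";
my $file1_data = "ABS0056\nABS0057\n";
my $file3_data = "ABS0056\tSAM\nABS0060\tMARY\nABS0057\tBILL\n";

open(my $fh1, '<', \$file1_data) or die "Cannot open in-memory file1: $!";
my $ids = read_ids($fh1);
close $fh1;

open(my $fh3, '<', \$file3_data) or die "Cannot open in-memory file3: $!";
print "$_\n" for filter_records($fh3, $ids);
close $fh3;
```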

Re: print data with matching id from two different files
by thanos1983 (Parson) on Oct 31, 2017 at 10:53 UTC

    Hello mao9856,

    One possible way is to store the ids from the first file in an array as you read it, e.g. qw(ABS0056 ABS0057 ABS0058). While you read the second file, create a hash with the id as key and the name as value, e.g. $hash{"ABS0056"} = "SAM";

    Then simply compare the keys to find the common ones, iterate over the hash, remove the entries that are not common, and voila. :D

    Hope this helps you to proceed. Show us your effort at resolving the problem, not just "this is what I want, do it for me".

    Update: List::Compare is a module for comparing arrays.

    Update2: Maybe this small sample of code will point you in the right direction. This is more or less 3/4 of the solution. There are many ways to approach your problem; this is just one of them. Sample of code below:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Data::Dumper;

    sub cleanString {
        my ( $str ) = @_;
        my ( $first , $second ) = split( / / , $str );
        return substr( $second , 1 , length($second) -2 );
    }

    my $File1 = 'file1.txt';
    open (my $fh1, "<", $File1) or die "Error opening file1: $!\n";

    my @keys;
    my $keyRef;
    while (<$fh1>) {
        chomp;
        next unless $_ ne '';
        $keyRef->{$_} = 1;
        push @keys, $_;
    }
    close $fh1 or warn "Could not close file1: $!\n";

    # print Dumper $keyRef;
    # print Dumper \@keys;

    my $File2 = 'file2.txt';
    open (my $fh2, "<", $File2) or die "Error opening file2: $!\n";

    my %hash;
    while (<$fh2>) {
        chomp;
        next unless $_ ne '';
        my @elements = sort( map {
            s/^\s+//; # strip leading spaces
            s/\s+$//; # strip trailing spaces
            $_        # return the modified string
        } split ';', $_ );
        # print Dumper \@elements;
        $hash{cleanString($elements[0])} = cleanString($elements[1]);
    }
    close $fh2 or warn "Could not close file2: $!\n";

    print Dumper \%hash;

    __END__

    $ perl test.pl
    $VAR1 = {
              'ABS0059' => 'JOE',
              'ABS0060' => 'MARY',
              'ABS0057' => 'BILL',
              'ABS0061' => 'STEPHAN',
              'ABS0065' => 'RONIE',
              'ABS0056' => 'SAM'
            };
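The remaining quarter, picking out the ids common to both files, might look roughly like the sketch below. It assumes data shaped like @keys and %hash from the script above (sample values are hard-coded here so the snippet runs on its own), and common_ids is a hypothetical helper name.

```perl
use strict;
use warnings;

# Sample data shaped like @keys and %hash in the script above (assumed values).
my @keys = qw(ABS0056 ABS0057 ABS0058 ABS0059);
my %hash = (
    ABS0056 => 'SAM',
    ABS0057 => 'BILL',
    ABS0059 => 'JOE',
    ABS0060 => 'MARY',
);

# Hypothetical helper: ids from the list that also exist as hash keys,
# each reported once, in sorted order.
sub common_ids {
    my ($ids, $names) = @_;
    my %seen;
    my @common = grep { exists $names->{$_} && !$seen{$_}++ } @$ids;
    my @sorted = sort @common;
    return @sorted;
}

print "$_\t$hash{$_}\n" for common_ids(\@keys, \%hash);
```

Because the hash already holds each id only once, repeated rows in file2 do not produce repeated output lines.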

    BR / Thanos

    Seeking for Perl wisdom...on the process of learning...not there...yet!
Re: print data with matching id from two different files
by wjw (Priest) on Oct 31, 2017 at 13:42 UTC

    As I see it, there are two things that you need to do to solve a problem like this for yourself:

    • Use either the Perl debugger (perl -d myfile.pl) or, as mentioned, Dumper or plain print statements, so that you can see the results, or lack thereof, of what your code is doing.
    • Use the Perl documentation.

    You can check out the Perl debugger, and in particular "Stepping through code", which for the code you are writing should really be all you need. Doing those two things will save you a lot of time and effort when it comes to answering questions like this.
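As a small illustration of the Dumper approach, setting $Data::Dumper::Useqq makes invisible characters such as tabs and carriage returns show up as escapes, which helps when a key refuses to match. A minimal sketch with made-up sample lines; dump_fields is a hypothetical helper name.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Useqq = 1;    # print control characters as \t, \r, \n escapes

# Hypothetical helper: dump the tab-separated fields of one line.
sub dump_fields {
    my ($line) = @_;
    return Dumper([ split /\t/, $line, -1 ]);
}

print dump_fields("ABS0056\tSAM\r");   # a stray \r from a Windows file becomes visible
print dump_fields("ABS0056 SAM");      # space instead of tab: only one field
```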

    One other thing you might consider doing as you are starting out: literally write down everything you need to do to get from what you have (two files with data) to what you want to get out at the end of your program. I have been programming for years (not that I am all that great at it) and still find that I save myself a lot of confusion by taking the time to write it out in plain language first. Often it helps me predict exactly where I am likely to have questions, allowing me to do some research ahead of time, so that by the time I get to coding I have already addressed the areas in which I am unskilled or unfamiliar.

    Hope that is useful to you...

    ...the majority is always wrong, and always the last to know about it...

    A solution is nothing more than a clearly stated problem...

Re: print data with matching id from two different files
by mao9856 (Sexton) on Nov 02, 2017 at 11:05 UTC

    I tried new code and realized that file2 (and likewise file3) contains more than 18000 rows, with repeated ids and names.

    I tried the following code to get the desired output:

    #!/usr/bin/env perl
    #use strict;
    use warnings;

    open (FILE1, "< file1");
    while (!eof(FILE1)) {
        chomp ($element1 = (<FILE1>));
        open (FILE3, "< file3");
        while (!eof(FILE3)) {
            $element2 = (<FILE3>);
            if ($element2 =~ /$element1/) { print $element2; }
        }
    }

    This code matches, but it prints each value several times (for example, the output has ABS0057 BILL 3 times and ABS0056 SAM 3 times) because file3 itself contains repeated values. I want to print each match only once (say ABS0057 from file1 matches ABS0057 BILL in file3: the output should contain just one ABS0057 BILL).

      #use strict;

      Never do this. Never comment out strict. If your script only compiles without strict, put it back in and fix all the errors first before doing anything else. strict is there to stop you making mistakes.

        #!/usr/bin/env perl
        use strict;
        use warnings;

        open (FILE1, "< file1");
        while (!eof(FILE1)) {
            chomp (my $element1 = (<FILE1>));
            open (FILE3, "< file3");
            while (!eof(FILE3)) {
                my $element2 = (<FILE3>);
                if ($element2 =~ /$element1/) { print $element2; }
            }
        }
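With strict back in place, the repeated output can be avoided by reading each file once and remembering which lines have already been printed. The following is a sketch rather than a drop-in replacement: unique_matches is a hypothetical name, the demo data is made up from the samples in the question, and it uses in-memory filehandles so that it runs without files. It also uses an exact hash lookup on the id instead of the regex match, which would otherwise accept substrings as well.

```perl
use strict;
use warnings;

# Hypothetical helper: return each matching record once, sorted.
# $ids_fh yields one id per line; $records_fh yields "id<TAB>name" lines.
sub unique_matches {
    my ($ids_fh, $records_fh) = @_;

    my %wanted;
    while (my $id = <$ids_fh>) {
        chomp $id;
        $wanted{$id} = 1 if length $id;
    }

    my (%seen, @out);
    while (my $line = <$records_fh>) {
        chomp $line;
        my ($id) = split /\t/, $line;
        next unless defined $id && $wanted{$id};
        push @out, $line unless $seen{$line}++;    # skip repeats
    }

    my @sorted = sort @out;
    return @sorted;
}

# Demo data (assumed); with real files: open(my $fh, '<', 'file1') or die "file1: $!";
my $file1 = "ABS0056\nABS0057\nABS0058\nABS0059\n";
my $file3 = join "\n",
    "ABS0056\tSAM",  "ABS0059\tJOE",  "ABS0060\tMARY",
    "ABS0057\tBILL", "ABS0057\tBILL", "ABS0056\tSAM", "";

open(my $ids_fh, '<', \$file1) or die "Cannot open in-memory file1: $!";
open(my $rec_fh, '<', \$file3) or die "Cannot open in-memory file3: $!";
print "$_\n" for unique_matches($ids_fh, $rec_fh);
```

Each file is read only once here, whereas the nested loops reopen and rescan file3 for every line of file1, which gets slow with 18000 rows.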