mao9856 has asked for the wisdom of the Perl Monks concerning the following question:

I have two huge files. File1 has one column and File2 has two columns as follows:

File1

ABCD12

XYZ13

EFGT45

UVWZ34

TSR78

........

File2

ID121 ABC14

ID122 EFG87

ID145 XYZ43

ID157 TSR11

ID181 ABC31

ID962 YTS27

ID529 EFG56

ID684 TSR07

ID921 BAMD80

.............

I want to match the first three letters of each entry in File1 against the first three letters of the second column of File2, and print the File2 lines (with their IDs) that match.

Desired output:

ID121 ABC14

ID122 EFG87

ID145 XYZ43

ID157 TSR11

ID181 ABC31

ID529 EFG56

ID684 TSR07

I tried the following code:

#!/usr/bin/perl
use strict;
use warnings;

my ($f1,$f2,@patterns,%patts,$f2_rec,$f2_field);
$f1 = $ARGV[0];
$f2 = $ARGV[1];

open(PATT,"<", $f1) or die;
@patterns = <PATT>;
chomp(@patterns);
close(PATT) or die;

@patts{@patterns} = (1) x @patterns;

open(FILE,"<", $f2) or die;
while (defined ($f2_rec = <FILE>)) {
    chomp $f2_rec;
    $f2_field = (split(/ /,$f2_rec))[0];
    if(exists($patts{$f2_field})) {
        print "$f2_rec\n";
    }
}
close(FILE) or die;

I know it won't match on just the initial three letters, but it does match exact values, with the warning 'Use of uninitialized value' unless use warnings is disabled. Please help.


Replies are listed 'Best First'.
Re: print all data matching identical three alphabets from two different files
by toolic (Bishop) on Nov 15, 2017 at 14:29 UTC
    One way is to create the hash differently. Only keep the 1st 3 letters of File1 as the keys. Then, grab the 2nd column of File2 and, again, only use the 1st 3 letters.
    use strict;
    use warnings;

    my ($f1,$f2,%patts,$f2_rec,$f2_field);
    $f1 = $ARGV[0];
    $f2 = $ARGV[1];

    open(PATT,"<", $f1) or die;
    while (<PATT>) {
        chomp;
        $patts{substr $_, 0, 3} = 1;
    }
    close(PATT) or die;

    open(FILE,"<", $f2) or die;
    while (defined ($f2_rec = <FILE>)) {
        chomp $f2_rec;
        $f2_field = (split(/ /,$f2_rec))[1];
        $f2_field = substr $f2_field, 0, 3;
        if(exists($patts{$f2_field})) {
            print "$f2_rec\n";
        }
    }
    close(FILE) or die;
Re: print all data matching identical three alphabets from two different files
by thanos1983 (Parson) on Nov 15, 2017 at 16:06 UTC

    Hello mao9856,

    Just to add some minor ideas to the answer of fellow monk toolic. I would also add a regex to skip blank lines with next, since your data file(s) seem to contain them.
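    As a rough sketch of why the blank lines matter: on an empty line, split returns nothing, so the list slice yields undef, and that is presumably where the 'uninitialized value' warning comes from. The helper name prefix_of_second_column is made up for illustration and is not part of anyone's posted code.

```perl
use strict;
use warnings;

# Hypothetical helper (name invented for this sketch): return the first
# three characters of the second whitespace-separated column, or undef
# when the line is blank or has only one column.
sub prefix_of_second_column {
    my ($line) = @_;
    return if $line =~ /^\s*$/;          # blank line: nothing to look at
    my $field = (split ' ', $line)[1];   # second column, undef if missing
    return unless defined $field;        # guard before calling substr
    return substr $field, 0, 3;
}

for my $line ("ID121 ABC14\n", "\n", "ID122 EFG87\n") {
    my $prefix = prefix_of_second_column($line);
    print defined $prefix ? "$prefix\n" : "(skipped)\n";
}
```

    Checking with defined before using the field as a hash key keeps use warnings quiet without having to switch it off.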

    Also, I would change the die statements on close to warn. For me it is not necessary to stop the whole script when a file cannot be closed, but I would still like to know about it, which is why I would use warn and not die.

    I would also suggest that mao9856 read the article Don't Open Files in the old way.

    Sample of code including output based on all the minor modifications:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my (%patts, $f2_rec, $f2_field);
    my $f1 = $ARGV[0];
    my $f2 = $ARGV[1];

    open(my $fh1,"<", $f1) or die "Failed to open '$f1' $!";
    while (<$fh1>) {
        chomp;
        next if /^\s*$/;    # skip blank lines
        $patts{substr $_, 0, 3} = 1;
    }
    close($fh1) or warn "Failed to close '$f1' $!";

    open(my $fh2,"<", $f2) or die "Failed to open '$f2' $!";
    while (defined ($f2_rec = <$fh2>)) {
        chomp $f2_rec;
        next if $f2_rec =~ /^\s*$/;    # skip blank lines
        $f2_field = (split(/ /,$f2_rec))[1];
        $f2_field = substr $f2_field, 0, 3;
        if(exists($patts{$f2_field})) {
            print "$f2_rec\n";
        }
    }
    close($fh2) or warn "Failed to close '$f2' $!"; # update: changed open to close, thanks to Laurent_R for pointing out

    __END__

    $ perl test.pl file1.txt file2.txt
    ID121 ABC14
    ID122 EFG87
    ID145 XYZ43
    ID157 TSR11
    ID181 ABC31
    ID529 EFG56
    ID684 TSR07

    Hope this helps, BR.

    Update: Thanks to fellow monk Laurent_R for noticing a typo; I have updated the sample code.

    Seeking for Perl wisdom...on the process of learning...not there...yet!
Re: print all data matching identical three alphabets from two different files
by kcott (Archbishop) on Nov 15, 2017 at 23:22 UTC

    G'day mao9856,

    Here's another (less busy) way to do it.

    #!/usr/bin/env perl

    use strict;
    use warnings;

    use Inline::Files;

    my %match;
    ++$match{substr $_, 0, 3} while <MATCH_DATA>;

    while (<PARSE_DATA>) {
        print if $match{substr +(split)[1], 0, 3};
    }

    __MATCH_DATA__
    ABCD12
    XYZ13
    EFGT45
    UVWZ34
    TSR78
    __PARSE_DATA__
    ID121 ABC14
    ID122 EFG87
    ID145 XYZ43
    ID157 TSR11
    ID181 ABC31
    ID962 YTS27
    ID529 EFG56
    ID684 TSR07
    ID921 BAMD80

    Output:

    ID121 ABC14
    ID122 EFG87
    ID145 XYZ43
    ID157 TSR11
    ID181 ABC31
    ID529 EFG56
    ID684 TSR07

    I've used Inline::Files just to show the technique. It's good you've used the 3-argument form of open; but less good that you've used package variables for the filehandles — prefer lexical filehandles instead. Also, your error reporting (i.e. or die) is rubbish: either spend a lot more time on this tedious and error-prone task yourself, or just let Perl do it for you with the autodie pragma.
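    For reading two real files rather than Inline::Files sections, the same idea might look roughly like the sketch below, with lexical filehandles and autodie doing the error reporting. The subroutine name matching_lines and the blank-line handling are assumptions made for illustration, not part of the code posted above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;    # open and close now die with a descriptive message on failure

# Hypothetical helper: return the lines of $parse_file whose second column
# starts with the same three characters as some line of $match_file.
sub matching_lines {
    my ($match_file, $parse_file) = @_;

    my %match;
    open my $match_fh, '<', $match_file;     # lexical filehandle, no "or die"
    while (my $line = <$match_fh>) {
        next if $line =~ /^\s*$/;            # ignore blank lines
        ++$match{substr $line, 0, 3};
    }
    close $match_fh;

    my @found;
    open my $parse_fh, '<', $parse_file;
    while (my $line = <$parse_fh>) {
        my $field = (split ' ', $line)[1];
        next unless defined $field;          # blank or one-column line
        push @found, $line if $match{substr $field, 0, 3};
    }
    close $parse_fh;

    return @found;
}

# Usage: perl script.pl file1.txt file2.txt
if (@ARGV == 2) {
    print $_ for matching_lines(@ARGV);
}
```

    If either file cannot be opened, autodie raises an exception that names the file and the system error, so no hand-written messages are needed.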

    Please post data within <code>...</code> tags as you did with your code. This makes it a lot less work for you; your data isn't subject to HTML interpretation (e.g. special characters and whitespace compression); and it makes it a lot easier for us to paste it directly into any example code we might provide.

    — Ken

Re: print all data matching identical three alphabets from two different files
by mao9856 (Sexton) on Dec 22, 2017 at 14:07 UTC

    Thank you all for help:)