My code sucks, please help me understand why.

mirage4d has asked for the wisdom of the Perl Monks concerning the following question:

Greetings all.

So, I'm relatively new to Perl, and I've managed to screw up a Perl task that should be quite trivial. I have two files - one of ~16k records and another of ~240k records. I want to extract a string from each line of the first file(16k records), and subsequently check each record of the second file(~240k records) for a matching string. My take on it is below. Seems like it should work, but a couple of notes:

1. It doesn't work. My logic seems correct to me, but no match is found using an array of ids extracted from the 16k record file when looped against every element of the 240k record file. There should be ~16k matches. If I hard code one of the values from the array, a match is indeed found.

2. Even if it was working, it is no doubt horribly inefficient. I suspect I should use hashes somehow here, but I don't understand enough about those structures to implement them in this case.

So if anyone can shed some light as to where I have dropped the ball, I would really appreciate it. Any tips on how to make this script more efficient using hashes would be greatly appreciated as well. Style critique is appreciated also. Thanks!

#!/usr/bin/perl

use strict;
use warnings;

open (PIDS, $ARGV[0]);
open (FIDS, $ARGV[1]);
open (OUTPUT, ">$ARGV[2]");

my @pids = <PIDS>;
my @fids = <FIDS>;
my @pidCans;
my @fidCans;
my $result;
my $pidCan;
my $fid;
my $pid;

foreach $pid (@pids) {

   $result = '';
   $pid =~ /\|(.*)$/;
   $result = $1;
   push (@pidCans, $result);
}


foreach $pidCan (@pidCans)  {

  chomp $pidCan;

    foreach $fid (@fids)  {
      chomp $fid;
      print "Comparing $fid to $pidCan" . "\n";
         if ($fid =~ /$pidCan/) {
         print "FOUND A MATCH.\n";
         print OUTPUT $fid . "\n";
         }
   }
}

close PIDS;
close FIDS;
close OUTPUT;
[download]

Comment on My code sucks, please help me understand why. Download Code

Replies are listed 'Best First'.
Re: My code sucks, please help me understand why. by moritz (Cardinal) on Oct 14, 2009 at 17:20 UTC
It would help a lot to know how your input data looks like (a few lines of each file would be enough), and if PIDS or FIDS is the larger one. I guess your logic problem (or one of them, at least) is this line: `if ($fid =~ /$pidCan/) {` Which interprets `$pidCan` as a regex, not just as a constant string to search for. You can make that a literal search by writing `if ($fid =~ /\Q$pidCan\E/)` instead. And yes, chances are that your program can be made much faster, but for that it would really help to know the structure of your input data. There are also some stylistic weaknesses in your code - for example it's bad practice to declare all variables up front - that way they can be seen in the whole program. It's easier to avoid errors though if you keep the scope of each variable small. For example `$result` can and should be declared inside the loop: `foreach $pid (@pids) { my $result; $pid =~ /\\|(.)$/; $result = $1; push (@pidCans, $result); }` [download] Or even better avoided completely: `foreach $pid (@pids) { $pid =~ /\\|(.)$/; push @pidCans, "$1"; }` [download] Perl 6 - links to (nearly) everything that is Perl 6.	[reply] [d/l] [select]
Re^2: My code sucks, please help me understand why. by mirage4d (Novice) on Oct 14, 2009 at 17:39 UTC
Thanks for the quick response moritz. Sample lines listed below: Lines from PIDS (this is the smaller (~16k record) file: `dn: cn=***,ou=users,\|810221480 dn: cn=*,ou=users,\|810039655 dn: cn=*,ou=users,\|810086196 dn: cn=*,ou=users,\|810008482 dn: cn=**,ou=users,\|810224329` [download] Lines from FIDS (larger, ~240k records): `810000001;08/17/1957 810000002;12/02/1975 810000003;12/22/1982 810000004;11/01/1967 810000005;02/07/1981 810000006;12/27/1967 810000007;05/09/1981 810000008;05/14/1976 810000009;10/24/1981 810000010;11/17/1943` [download] I'm trying to match on the 810 ids. Thanks for your help.	[reply] [d/l] [select]
Re^3: My code sucks, please help me understand why. by moritz (Cardinal) on Oct 14, 2009 at 17:56 UTC
Since you can extract the IDs easily for each file, you can put the IDs from PIDS into a hash, and query that while iterating over FIDS: `my %ids; while (my $line = <PIDS>) { chomp $line; my (undef, $id) = split /\\|/, $line; $ids{$id} = 1; } while (my $line = <FIDS>) { my ($id, $rest) = split /;/, $line, 2; if ($ids{$id}) { print "Found id '$id' in line $line"; } }` [download] Perl 6 - links to (nearly) everything that is Perl 6.	[reply] [d/l]
Re^4: My code sucks, please help me understand why. by mirage4d (Novice) on Oct 15, 2009 at 15:32 UTC
Re^5: My code sucks, please help me understand why. by moritz (Cardinal) on Oct 15, 2009 at 17:09 UTC
Re^3: My code sucks, please help me understand why. (regex) by ikegami (Patriarch) on Oct 14, 2009 at 17:46 UTC
`my @pids; { open(my $pids_fh, '<', $pids_qfn) or die("Can't open PIDS file \"$pids_qfn\": $!\n"); push @pids, /\\|(\d+)$/ while <$pids_fh>; } my $pids_pat = map qr/$_/, join '\|', #map quotemeta, # We're only dealing with digits @pids; open(my $fids_fh, '<', $fids_qfn) or die("Can't open FIDS file \"$fids_qfn\": $!\n"); while (<$fids_fh>) { print if /^$pids_pat;/; }` [download] If you need extra speed and your Perl is older than 5.10, change `my $pids_pat = map qr/$_/, join '\|', #map quotemeta, # We're only dealing with digits @pids;` [download] to `use Regexp::List qw( ); my $pids_pat = Regexp::List->new()->list2re(@pids);` [download] 5.10 already does the optimisation Regexp::List does.	[reply] [d/l] [select]
Re^3: My code sucks, please help me understand why. (hash) by ikegami (Patriarch) on Oct 14, 2009 at 17:54 UTC
An alternative to me previous reply would be to use a hash instead of a regular expression `my %pids; { open(my $pids_fh, '<', $pids_qfn) or die("Can't open PIDS file \"$pids_qfn\": $!\n"); while (<$pids_fh>) { my ($pid) = /\\|(\d+)$/ or next; ++$pids{$pid}; } } open(my $fids_fh, '<', $fids_qfn) or die("Can't open FIDS file \"$fids_qfn\": $!\n"); while (<$fids_fh>) { my ($pid) = /^(\d+);/ or next; print if $pids{$pid}; }` [download]	[reply] [d/l]