mirage4d has asked for the wisdom of the Perl Monks concerning the following question:

Greetings all.

So, I'm relatively new to Perl, and I've managed to screw up a Perl task that should be quite trivial. I have two files - one of ~16k records and another of ~240k records. I want to extract a string from each line of the first file(16k records), and subsequently check each record of the second file(~240k records) for a matching string. My take on it is below. Seems like it should work, but a couple of notes:

1. It doesn't work. My logic seems correct to me, but no match is found using an array of ids extracted from the 16k record file when looped against every element of the 240k record file. There should be ~16k matches. If I hard code one of the values from the array, a match is indeed found.

2. Even if it was working, it is no doubt horribly inefficient. I suspect I should use hashes somehow here, but I don't understand enough about those structures to implement them in this case.

So if anyone can shed some light as to where I have dropped the ball, I would really appreciate it. Any tips on how to make this script more efficient using hashes would be greatly appreciated as well. Style critique is appreciated also. Thanks!

#!/usr/bin/perl use strict; use warnings; open (PIDS, $ARGV[0]); open (FIDS, $ARGV[1]); open (OUTPUT, ">$ARGV[2]"); my @pids = <PIDS>; my @fids = <FIDS>; my @pidCans; my @fidCans; my $result; my $pidCan; my $fid; my $pid; foreach $pid (@pids) { $result = ''; $pid =~ /\|(.*)$/; $result = $1; push (@pidCans, $result); } foreach $pidCan (@pidCans) { chomp $pidCan; foreach $fid (@fids) { chomp $fid; print "Comparing $fid to $pidCan" . "\n"; if ($fid =~ /$pidCan/) { print "FOUND A MATCH.\n"; print OUTPUT $fid . "\n"; } } } close PIDS; close FIDS; close OUTPUT;

Replies are listed 'Best First'.
Re: My code sucks, please help me understand why.
by moritz (Cardinal) on Oct 14, 2009 at 17:20 UTC

    It would help a lot to know how your input data looks like (a few lines of each file would be enough), and if PIDS or FIDS is the larger one.

    I guess your logic problem (or one of them, at least) is this line:

    if ($fid =~ /$pidCan/) {

    Which interprets $pidCan as a regex, not just as a constant string to search for.

    You can make that a literal search by writing if ($fid =~ /\Q$pidCan\E/) instead.

    And yes, chances are that your program can be made much faster, but for that it would really help to know the structure of your input data.

    There are also some stylistic weaknesses in your code - for example it's bad practice to declare all variables up front - that way they can be seen in the whole program. It's easier to avoid errors though if you keep the scope of each variable small. For example $result can and should be declared inside the loop:

    foreach $pid (@pids) { my $result; $pid =~ /\|(.*)$/; $result = $1; push (@pidCans, $result); }

    Or even better avoided completely:

    foreach $pid (@pids) { $pid =~ /\|(.*)$/; push @pidCans, "$1"; }
    Perl 6 - links to (nearly) everything that is Perl 6.

      Thanks for the quick response moritz. Sample lines listed below:

      Lines from PIDS (this is the smaller (~16k record) file:

      dn: cn=*****,ou=users,|810221480 dn: cn=*****,ou=users,|810039655 dn: cn=*****,ou=users,|810086196 dn: cn=*****,ou=users,|810008482 dn: cn=*****,ou=users,|810224329
      Lines from FIDS (larger, ~240k records):
      810000001;08/17/1957 810000002;12/02/1975 810000003;12/22/1982 810000004;11/01/1967 810000005;02/07/1981 810000006;12/27/1967 810000007;05/09/1981 810000008;05/14/1976 810000009;10/24/1981 810000010;11/17/1943

      I'm trying to match on the 810* ids. Thanks for your help.

        Since you can extract the IDs easily for each file, you can put the IDs from PIDS into a hash, and query that while iterating over FIDS:

        my %ids; while (my $line = <PIDS>) { chomp $line; my (undef, $id) = split /\|/, $line; $ids{$id} = 1; } while (my $line = <FIDS>) { my ($id, $rest) = split /;/, $line, 2; if ($ids{$id}) { print "Found id '$id' in line $line"; } }
        Perl 6 - links to (nearly) everything that is Perl 6.
        my @pids; { open(my $pids_fh, '<', $pids_qfn) or die("Can't open PIDS file \"$pids_qfn\": $!\n"); push @pids, /\|(\d+)$/ while <$pids_fh>; } my $pids_pat = map qr/$_/, join '|', #map quotemeta, # We're only dealing with digits @pids; open(my $fids_fh, '<', $fids_qfn) or die("Can't open FIDS file \"$fids_qfn\": $!\n"); while (<$fids_fh>) { print if /^$pids_pat;/; }

        If you need extra speed and your Perl is older than 5.10, change

        my $pids_pat = map qr/$_/, join '|', #map quotemeta, # We're only dealing with digits @pids;
        to
        use Regexp::List qw( ); my $pids_pat = Regexp::List->new()->list2re(@pids);

        5.10 already does the optimisation Regexp::List does.

        An alternative to me previous reply would be to use a hash instead of a regular expression
        my %pids; { open(my $pids_fh, '<', $pids_qfn) or die("Can't open PIDS file \"$pids_qfn\": $!\n"); while (<$pids_fh>) { my ($pid) = /\|(\d+)$/ or next; ++$pids{$pid}; } } open(my $fids_fh, '<', $fids_qfn) or die("Can't open FIDS file \"$fids_qfn\": $!\n"); while (<$fids_fh>) { my ($pid) = /^(\d+);/ or next; print if $pids{$pid}; }