Different counts between perl and grep

herda05 has asked for the wisdom of the Perl Monks concerning the following question:

Monks, I'm stumped on this one. I have two files: file1, the original, with 37621 names in it, one per line (I produced it using grep > file1), and file2, with 37585 names in it that are exactly the same as what is in file1, except that 36 are missing. I thought, well, I can use perl to get the differences. So off I went and wrote a small script:

    while (<FH1>) {
        chomp;
        my $line = $_;
        #print __LINE__ . " $_\n";
        my ($var1,$var2) = split(/:/,$line);
        $var2 = substr($var2,1); # remove the leading space left over from the split
        $diffHash1{$var2} = $var2;
    }
    close(FH1);
    getCount(\%diffHash1); # $scalar = keys(%hash)
    my $cnt1 = getCount(\%diffHash1,'1');
    
    print "Reading in $file2\n";
    open(FH2, "< $file2") or die "Could not open $file2: $!";
    while (<FH2>) {
        chomp;
        my $line = $_;
        my ($var1,$var2) = split(/:/,$line);
        $var2 = substr($var2,1);
        $diffHash2{$var2} = $var2;
    }
    close(FH2);
    print "Comparing $file1 and $file2\n";
    my $cnt2 = getCount(\%diffHash2,'1');
    print "Count in $file1: $cnt1\n";
    print "Count in $file2: $cnt2\n";
    
    if ($cnt1 > $cnt2) {
        while (my($k1,$v1) =  each(%diffHash1)) {
            my $line = $k1;
            if (!$diffHash2{$line}) {
                $resHash{$line} = $line;
            }
        }
    } else {
        while (my($k1,$v1) =  each(%diffHash2)) {
            my $line = $k1;
            if (!$diffHash1{$line}) {
                $resHash{$line} = $line;
            }
        }
    }

getCount just evaluates keys %hash in scalar context to get a count of keys. Problem is, perl thinks there are only 37585 lines in file1. Huh?
Since everything is in the format insert_job: <jn>, I can check perl's count with grep:

    grep insert_job | wc -l

which returns 37621. Alright, me thinks, maybe there are blank lines, or some other entries that I'm not coding for. So I sort the two files, then do a diff redirect and get 36 names that aren't in file2. Question is, where am I letting perl down in my code? How come perl isn't seeing all the jobs in file1?

Replies are listed 'Best First'.
Re: Different counts between perl and grep by Tanktalus (Canon) on Aug 29, 2009 at 05:13 UTC
Maybe your perl code is right. The perl code and the grep are doing two different things. The perl code is populating a hash, which means that you will get collisions if the same key (job number) is inserted twice. The grep, however, is less picky. Duplicates will get printed out.

    grep insert_job | sort -u | wc -l

That might print out something a bit closer to perl, assuming there is no other data on the line. Or, in perl, try this:

    my $dupes = 0;
    while (<FH1>) {
        chomp;
        my ($var1,$var2) = split(/:/,$_);
        $var2 = substr($var2,1); #remove 1st space results from substr
        # here is the important bit:
        if (exists $diffHash1{$var2}) {
            ++$dupes;
            print "Dupe on line $.: $var2\n";
        }
        $diffHash1{$var2} = $var2;
    }
    print "$dupes dupes found in file1.\n";

This will tell you about any dupes (not the original line, just the additional lines - we could add that, too, but I'll leave it to you if you decide you want to do that). And then, if you total the count in the file plus the dupes, you'll get what your original grep count is. That's not to say that the rest of your code is clean and doesn't require any stylistic changes, but we'll focus on the problem first, and worry about style later ;-)
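
For anyone who does want the original line reported along with the repeats, here is a minimal sketch of one way to do it. The helper name find_dupes and the sample data are made up for illustration; it assumes every line looks like "insert_job: <name>" and is not taken from anyone's actual script.

```perl
use strict;
use warnings;

# Hypothetical helper: takes a list of "insert_job: <name>" lines and returns
# a hash ref mapping each repeated name to every line number it appears on.
sub find_dupes {
    my @lines = @_;
    my %seen;    # name => [ line numbers ]
    my $n = 0;
    for my $line (@lines) {
        $n++;
        chomp(my $copy = $line);
        # Split on the first colon and swallow the whitespace after it.
        my (undef, $name) = split /:\s*/, $copy, 2;
        next unless defined $name;    # skip lines without a colon
        push @{ $seen{$name} }, $n;
    }
    # Keep only the names that were seen more than once.
    my %dupes = map { $_ => $seen{$_} } grep { @{ $seen{$_} } > 1 } keys %seen;
    return \%dupes;
}

# Made-up sample input.
my @sample = ("insert_job: alpha\n", "insert_job: beta\n", "insert_job: alpha\n");
my $dupes = find_dupes(@sample);
for my $name (sort keys %$dupes) {
    print "$name appears on lines @{ $dupes->{$name} }\n";
}
```

Storing line numbers in an array per name costs a little more memory than a plain counter, but it makes the first occurrence visible too.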
Re: Different counts between perl and grep by herda05 (Acolyte) on Aug 29, 2009 at 05:13 UTC
All, please ignore this. I revisited my initial assumptions (that all names in the original data were unique!) and discovered that perl, since it's so smart, is correct. Since hash keys automagically take care of duplicates (behavior that I'm relying on when analysis kicks in later!), they disappear in the hash but remain in the file. Please return to your regularly scheduled programming. Dan H.
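
The effect described here is easy to reproduce on a small scale. The snippet below is only an illustration with made-up names: the list length plays the role of the grep | wc -l count, and the number of hash keys plays the role of the count perl reported.

```perl
use strict;
use warnings;

# Made-up sample data standing in for the names in file1.
my @names = qw(job_a job_b job_a job_c job_b);

# Assigning to the same key twice overwrites the entry, so duplicates collapse.
my %unique;
$unique{$_} = 1 for @names;

my $line_count = scalar @names;          # one entry per line, duplicates included
my $key_count  = scalar keys %unique;    # one entry per distinct name

print "lines: $line_count, unique names: $key_count\n";
```

The difference between the two numbers is the number of duplicate lines, which matches the gap seen between grep and the hash.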
Re: Different counts between perl and grep by bv (Friar) on Aug 29, 2009 at 14:38 UTC
Since you already solved your problem, here's an alternate solution for getting the list out WITH duplicates. You are currently storing the same value as the key in each hash. Instead, store the count of how many times you saw that value, something like this:

    my ($count, %lines);
    while (<>) {
        chomp;    # strip the newline so it is not printed twice below
        $count++;
        $lines{$_}++;
    }
    print scalar(keys %lines), " unique lines seen\n";
    print "$count total lines\n";
    while ( my ($l, $c) = each %lines ) {
        for (1 .. $c) {
            print "$l\n";
        }
    }

This may only be useful in other cases than yours, but I thought I'd throw it out there.

`$,=' ';$\=',';$_=[qw,Just another Perl hacker,];print@$_;`
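
Counting occurrences also makes it possible to compare two lists without losing the duplicates. The following is a rough sketch under that assumption; the helper name surplus and the sample lists are hypothetical and not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: given two array refs of names, return a hash ref of the
# names that occur more often in the first list than in the second, mapped to
# how many extra occurrences the first list has.
sub surplus {
    my ($first, $second) = @_;
    my (%c1, %c2);
    $c1{$_}++ for @$first;
    $c2{$_}++ for @$second;
    my %extra;
    for my $name (keys %c1) {
        my $diff = $c1{$name} - ($c2{$name} || 0);
        $extra{$name} = $diff if $diff > 0;
    }
    return \%extra;
}

# Made-up sample data: the first list repeats job_a, the second does not.
my $extra = surplus([qw(job_a job_b job_a job_c)], [qw(job_a job_b job_c)]);
for my $name (sort keys %$extra) {
    print "$name: $extra->{$name} extra occurrence(s) in the first list\n";
}
```

With plain existence hashes the repeated job_a would be invisible; with counts it shows up as one extra occurrence.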