halfcountplus has asked for the wisdom of the Perl Monks concerning the following question:

I want to compare the number of times a string occurs in a set of files. I read each file into a single line using:
my $all = do {local $/; <IN>};
Then I got the number of instances using McDarren's code from "Counting in regular expressions" and did this:
#!/usr/bin/perl -w
use strict;

my $bit = "ace";
my (%hash);
while (<DATA>) {
    if ($_ =~ /$bit/) {
        my @ray = ($_ =~ /$bit/g);
        my $c = scalar @ray;
        $hash{$_} = $c;
    }
}
my @rank = sort { $hash{$b} <=> $hash{$a} } keys %hash;
print @rank;
__DATA__
In the wire game, a "mob" composed of dozens of grifters simulates a "wire store", i.e., a place where results from horse races are received by telegram and posted on a large board, while also being read aloud by an announcer. The griftee is given secret foreknowledge of the race results minutes before the race is broadcast, and is therefore able to place a sure bet at the wire store. In reality, of course, the con artists who set up the wire store are the providers of the inside information, and the mark eventually is led to place a large bet, thinking it to be a sure win. At this point, some mistake is made, which actually makes the bet a loss.
Is there a less troublesome way to do this? Even better, is there a way to match a hash key to an array value without "whiling" through, so I could look up the number of occurrences (from %hash) for each sorted array value (which would equal the name of a hash key)?

Replies are listed 'Best First'.
Re: ranking number of occurances
by ikegami (Patriarch) on Mar 16, 2008 at 22:37 UTC
    while (<DATA>) {
        my $c = () = /\Q$bit\E/g;
        $hash{$_} = $c if $c;
    }

    my @rank = sort { $hash{$b} <=> $hash{$a} } keys %hash;

    my $to_print = 10;
    $to_print = @rank if @rank < 10;
    print "$_: $hash{$_}\n" for @rank[0..$to_print-1];

    Avoided intermediary array. Converted $bit from plain text to a regexp using \Q..\E. Produced better output.
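    For readers unfamiliar with the two idioms mentioned above, here is a minimal sketch. The helper name count_literal and the sample strings are made up for illustration and do not come from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): count literal occurrences
# of $needle in $text.
sub count_literal {
    my ($needle, $text) = @_;
    # Assigning the match list to an empty list, then to a scalar,
    # yields the number of matches without a named temporary array.
    # \Q..\E quotes regex metacharacters so $needle is matched literally.
    my $count = () = $text =~ /\Q$needle\E/g;
    return $count;
}

print count_literal('ace', 'place a bet on the race'), "\n";  # prints 2
print count_literal('a.c', 'abc a.c'), "\n";                  # prints 1
```

    Without \Q..\E the second call would also match "abc", because an unquoted dot matches any character.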

    If there are two (or more) identical lines, they will only appear once in the hash, but that doesn't look like a problem.

    Update: I just noticed there's a question at the bottom. But I don't understand what you're asking anyway.

Re: ranking number of occurances
by FunkyMonk (Bishop) on Mar 16, 2008 at 22:45 UTC
    If you really want to "compare the number of times a string occurs in a set of files", as opposed to lines in which the pattern occurred, you'll need to add to the hash's value:
    $hash{$_} += $c;

    for your code, or

    $hash{$_} += $c if $c;

    for ikegami's

      I don't think so. Your code says that line/file "a48754a4397543a43753a" contains "a" 8 times. Even if $_ represents a file instead of a line, no change is needed.

      If he wants the total, then he'd need

      $hash{$_} = $c;   # Per line/file count
      $total += $c;     # Total count
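
      Putting the per-file count and the total together with the slurp from the original post might look like the sketch below. The helper count_in_handle, the file names, and their contents are all hypothetical; in-memory filehandles stand in for real files so the example runs on its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: slurp everything from a filehandle and count
# literal occurrences of $bit in it.
sub count_in_handle {
    my ($fh, $bit) = @_;
    my $all = do { local $/; <$fh> };  # slurp, as in the original post
    $all = '' unless defined $all;     # guard: undef if already at EOF
    my $c = () = $all =~ /\Q$bit\E/g;
    return $c;
}

# Made-up sample contents standing in for real files.
my %content = (
    'a.txt' => "ace of spaces\nno match here\n",
    'b.txt' => "place, race, face\n",
    'c.txt' => "nothing\n",
);

my %hash;
my $total = 0;
for my $name (sort keys %content) {
    open my $fh, '<', \$content{$name} or die "Cannot open $name: $!";
    my $c = count_in_handle($fh, 'ace');
    close $fh;
    $hash{$name} = $c if $c;  # Per-file count
    $total += $c;             # Total count
}

# The sorted array holds hash keys, so $hash{$_} looks up each count.
my @rank = sort { $hash{$b} <=> $hash{$a} || $a cmp $b } keys %hash;
print "$_: $hash{$_}\n" for @rank;
print "total: $total\n";
```

      Keying the hash by file name rather than by content also avoids two identical inputs collapsing into a single entry.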
Re: ranking number of occurances
by grizzley (Chaplain) on Mar 17, 2008 at 07:51 UTC

    I would do it this way:

    #!/usr/bin/perl -w
    use strict;

    my $bit = "ace";
    my (%hash);
    while (<DATA>) {
        while ($_ =~ /$bit/g) {
            $hash{$_}++;
        }
    }

    Then you can put it into an array, where position i holds a reference to an array of the lines having i occurrences of $bit:

    my $bit = "ace";
    my (%hash, @rank);
    while (<DATA>) {
        while ($_ =~ /$bit/g) {
            $hash{$_}++;
        }
        push @{$rank[$hash{$_}]}, $_;
    }

    # see what the structure looks like
    # use Data::Dumper;
    # print Dumper \@rank;

    for (@rank) {
        for (@$_) {
            print;
        }
    }
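
    A variation on the same bucket idea, as a rough sketch: it counts each line directly instead of going through %hash, and walks the buckets from the highest count down so the output is ranked best first. The sample lines are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $bit = 'ace';
# Made-up sample lines standing in for <DATA>.
my @lines = ("no match\n", "ace\n", "place race\n", "face\n");

my @rank;
for my $line (@lines) {
    my $c = () = $line =~ /\Q$bit\E/g;
    push @{ $rank[$c] }, $line;  # bucket $c holds lines with $c matches
}

# Walk buckets from the highest count down to 1, skipping bucket 0
# (lines without a match) and any counts that never occurred.
my @out;
for my $i (reverse 1 .. $#rank) {
    next unless $rank[$i];
    push @out, "$i: $_" for @{ $rank[$i] };
}
print @out;
```

    No sort is needed here because the array index itself is the count; the trade-off is that @rank grows as large as the highest count.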