in reply to some forking help

What you need is to not invoke an external script to do what Perl does quickly. {grin}
my %hash_one = ('string_one'   => 0,
                'string_two'   => 0,
                'string_three' => 0,
                'string_four'  => 0,
                'string_five'  => 0,
                'string_six'   => 0,
                'string_seven' => 0);
# first, create an array ref, element 0 is a qr// of the key, and element 1 is the count:
for (keys %hash_one) {
    $hash_one{$_} = [qr/$_/, 0];
}
# then walk the data, trying all the regexen:
@ARGV = qw(file.txt);
close ARGV;
while (<>) {
    for (keys %hash_one) {
        $hash_one{$_}[1]++ if qr/$hash_one{$_}[0]/;
    }
}
# finally, replace the arrayref with just the count:
$_ = $_->[1] for values %hash_one; # works in perl 5.5 and greater

-- Randal L. Schwartz, Perl hacker

Replies are listed 'Best First'.
Re: Re: some forking help
by mstone (Deacon) on Dec 24, 2001 at 23:21 UTC

    hmm.. I smell a chance to test my understanding. Creating and storing qr// expressions takes extra work, but beats this simpler form:

    my %hash_one = ('string_one'   => 0,
                    'string_two'   => 0,
                    'string_three' => 0);
    @ARGV = qw(file.txt);
    close ARGV;
    while (<>) {
        for my $key (keys %hash_one) {
            $hash_one{$key}++ if (/$key/);
        }
    }

    because qr// lets perl precompile the regexp. That would pay off in cases like this, where we'll be looping through the same set of regexps over and over again, yes?
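    That intuition can be checked with the core Benchmark module. A rough sketch with made-up data (pattern and line counts are arbitrary, chosen small so it runs quickly); the interpolated form forces a pattern recompile every time the key changes, while the precompiled form reuses the stored Regexp objects:

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Benchmark qw(timethese);

    # Hypothetical data, just to exercise both approaches.
    my @keys  = map { "string_$_" } 1 .. 20;
    my @lines = map { "noise $_ string_7 noise" } 1 .. 200;

    # Precompile each pattern once, outside the scanning loop.
    my %compiled = map { $_ => qr/\Q$_\E/ } @keys;

    sub count_interpolated {
        my $hits = 0;
        for my $line (@lines) {
            # pattern text changes each iteration, so perl recompiles it
            for my $k (@keys) { $hits++ if $line =~ /\Q$k\E/ }
        }
        return $hits;
    }

    sub count_precompiled {
        my $hits = 0;
        for my $line (@lines) {
            # reuses the already-compiled Regexp objects
            for my $k (@keys) { $hits++ if $line =~ $compiled{$k} }
        }
        return $hits;
    }

    timethese(10, {
        interpolated => \&count_interpolated,
        precompiled  => \&count_precompiled,
    });
    ```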

Re(2): some forking help
by dmmiller2k (Chaplain) on Dec 24, 2001 at 23:50 UTC

    With a 100Mb file and 50+ strings to search for, there could be some speed advantage to forking a separate process for each search string and letting them run in parallel, especially if the regexen are precompiled before forking.

    Of course, the sheer simplicity of merlyn's solution probably outweighs any overall time saved through parallelism, once you realize that the tricky task of gathering up the individual counts from each of the child processes is not as straightforward as it may at first appear.
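    A minimal sketch of that fork-per-pattern idea, with the count-gathering done over pipes (the file name, patterns, and sample data here are all made up; it writes a small sample file so it can run standalone):

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical file name and patterns -- illustration only.
    my $file     = 'file.txt';
    my @patterns = ('string_one', 'string_two', 'string_three');

    # Write a small sample file so the sketch is self-contained.
    open(my $out, '>', $file) or die "can't write $file: $!";
    print $out "string_one is here\n", "and string_two here\n", "nothing\n";
    close $out;

    my %fh_for;                          # pattern => read end of that child's pipe
    for my $pat (@patterns) {
        my $pid = open(my $fh, '-|');    # fork; the child's STDOUT feeds $fh
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {                 # child: scan the file for one pattern
            my $re    = qr/\Q$pat\E/;    # precompile before scanning
            my $count = 0;
            open(my $in, '<', $file) or exit 1;
            while (<$in>) { $count++ if /$re/ }
            print $count, "\n";          # report back through the pipe
            exit 0;
        }
        $fh_for{$pat} = $fh;             # parent: remember the pipe
    }

    # The tricky part: gathering the counts. Reading to EOF and
    # closing the pipe also reaps each child.
    my %count;
    for my $pat (@patterns) {
        my $fh = $fh_for{$pat};
        my $n  = <$fh>;
        if (defined $n) { chomp $n } else { $n = 0 }
        close $fh;
        $count{$pat} = $n;
    }
    unlink $file;
    print "$_ => $count{$_}\n" for sort keys %count;
    ```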

    dmm

    You can give a man a fish and feed him for a day ...
    Or, you can
    teach him to fish and feed him for a lifetime
      The only way a fork()ing solution would be faster than the solutions posted so far is on an MP machine, where each process could scan the file separately. This assumes the file fits within the buffer cache.

      Otherwise, the price of the context switches will make this solution run slower.

      Just my $0.02 :)

      Merry Christmas to all the fellow monks!

        I'm not sure. My gut feeling is that searching a file is fairly I/O bound, and therefore would involve a significant amount of waiting for the disk regardless; why not capitalize on that by waiting in parallel?

        dmm

        You can give a man a fish and feed him for a day ...
        Or, you can
        teach him to fish and feed him for a lifetime
Re: Re: some forking help
by blakem (Monsignor) on Jan 15, 2002 at 17:39 UTC
    merlyn, I hate to critique code that was written on Christmas Eve, but this looks to have three separate bugs.

    There are two major issues in the while(<>) loop. First, $_ plays a dual role in the inner for loop, with the looping value clobbering the data from the file. Adding an inner loop var (i.e. for my $key) will avoid clobbering $_.
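    A tiny illustration of the clobbering (the input line is made up):

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    $_ = "a line that contains string_two";   # stand-in for a line read via <>

    my @seen;
    for (qw(string_one string_two)) {
        push @seen, $_;    # inside this loop $_ is the key, not the input line
    }

    # With a named loop variable, the input line stays in $_ for the match:
    my $matches = 0;
    for my $key (qw(string_one string_two)) {
        $matches++ if /$key/;   # matches the input line against each key
    }
    print "matches: $matches\n";   # 1 -- only string_two is in the line
    ```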

    The second bug involves the if qr/$hash_one{$_}[0]/ construct. This doesn't execute the match; it just compiles the pattern (again??) and returns a Regexp object, which is always true. You can either drop the qr, leaving /$hash_one{$_}[0]/, explicitly bind it with $_ =~ qr/$hash_one{$_}[0]/, or simply use $_ =~ $hash_one{$_}[0].
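    The always-true behavior is easy to demonstrate (input text made up):

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    $_ = "no match here";

    my $always = 0;
    $always++ if qr/something_absent/;      # qr// yields a Regexp object: always true
    print "bare qr: $always\n";             # 1 -- counted even though nothing matched

    my $bound = 0;
    $bound++ if $_ =~ qr/something_absent/; # binding with =~ actually runs the match
    print "bound: $bound\n";                # 0
    ```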

    The third issue is more subtle, but still a bug. You aren't quoting special chars when compiling regexes for literal strings... qr/$_/ really should be qr/\Q$_\E/
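    To see why the quoting matters, here is what happens with a key containing metacharacters (key and input line made up, echoing the [[[string_three test below):

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $key  = '[[[string_three';              # literal string with regex metachars
    my $line = 'found [[[string_three here';

    my $raw = eval { $line =~ /$key/ };        # unmatched [ -- dies at run time
    print defined $raw ? "raw: $raw\n" : "raw pattern dies: $@";

    my $quoted = $line =~ /\Q$key\E/;          # \Q...\E quotes the metacharacters
    print "quoted: $quoted\n";                 # 1
    ```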

    With those three issues out of the way we have:

    #!/usr/bin/perl -wT
    use strict;
    my %hash_one = ('string_one'      => 0,
                    'string_two'      => 0,
                    '[[[string_three' => 0, # test special chars behavior
                    'string_four'     => 0,
                    'string_five'     => 0,
                    'string_six'      => 0,
                    'string_seven'    => 0);
    # first, create an array ref, element 0 is a qr// of the key, and element 1 is the count:
    for (keys %hash_one) {
        $hash_one{$_} = [qr/\Q$_\E/, 0];
    }
    # then walk the data, trying all the regexen:
    # Replaced with <DATA> - blakem
    # @ARGV = qw(file.txt);
    # close ARGV;
    while (<DATA>) {
        for my $key (keys %hash_one) {
            $hash_one{$key}[1]++ if $_ =~ $hash_one{$key}[0];
        }
    }
    # finally, replace the arrayref with just the count:
    $_ = $_->[1] for values %hash_one; # works in perl 5.5 and greater
    print "$_ => $hash_one{$_}\n" for keys %hash_one;
    __DATA__
    1 string_one
    string_two
    2 string_two
    [[[string_three
    [[[string_three
    3 [[[string_three
    string_four
    string_four
    string_four
    4 string_four
    doesn'tmatchanything
    Which works correctly and outputs:
    string_four => 4
    string_six => 0
    string_five => 0
    string_one => 1
    string_seven => 0
    [[[string_three => 3
    string_two => 2
    Those bugs make me think you coded that whole thing right here in the pm form box w/o running it through any sample data.... in a perverse sort of way, that's more impressive than if it had been totally clean the first time out. ;-)

    -Blake

      Those bugs make me think you coded that whole thing right here in the pm form box w/o running it through any sample data.... in a perverse sort of way, that's more impressive than if it had been totally clean the first time out. ;-)
      I've often written code for replies here and on Usenet without testing, right in the reply buffer. And I take my licks when I guess wrong. Thank you for debugging my code.

      -- Randal L. Schwartz, Perl hacker