JohnATmbd has asked for the wisdom of the Perl Monks concerning the following question:

Here's what I'm doing now:
my %hash_one = ('string_one'   => 0,
                'string_two'   => 0,
                'string_three' => 0,
                'string_four'  => 0,
                'string_five'  => 0,
                'string_six'   => 0,
                'string_seven' => 0);

foreach $i (keys %hash_one) {
    $hash_one{$i} = `grep -c '$i' file.txt`;
}
What I want to do is get rid of the loop and grep the file for each string all at once and have the values dumped into the right part of the hash.
There are >50 things I'm looking for and the file is about 100Mb.
I understand that using fork() may be my solution, and I've looked into it but I'm afraid I'm just not getting it.
Any help would be greatly appreciated
Thanks
John
ps. I don't need to have it in a hash, I just need to be able to hook up the right string to the right count later.

Re: some forking help
by merlyn (Sage) on Dec 24, 2001 at 22:16 UTC
    What you need is to not invoke an external script to do what Perl does quickly. {grin}
    my %hash_one = ('string_one'   => 0,
                    'string_two'   => 0,
                    'string_three' => 0,
                    'string_four'  => 0,
                    'string_five'  => 0,
                    'string_six'   => 0,
                    'string_seven' => 0);

    # first, create an array ref, element 0 is a qr// of the key,
    # and element 1 is the count:
    for (keys %hash_one) {
      $hash_one{$_} = [qr/$_/, 0];
    }

    # then walk the data, trying all the regexen:
    @ARGV = qw(file.txt);
    close ARGV;
    while (<>) {
      for (keys %hash_one) {
        $hash_one{$_}[1]++ if qr/$hash_one{$_}[0]/;
      }
    }

    # finally, replace the arrayref with just the count:
    $_ = $_->[1] for values %hash_one; # works in perl 5.5 and greater

    -- Randal L. Schwartz, Perl hacker

      hmm.. I smell a chance to test my understanding. Creating and storing qr// expressions takes extra work, but beats this simpler form:

      my %hash_one = ('string_one'   => 0,
                      'string_two'   => 0,
                      'string_three' => 0);
      @ARGV = qw(file.txt);
      close ARGV;
      while (<>) {
        for my $key (keys %hash_one) {
          $hash_one{$key}++ if /$key/;
        }
      }

      because qr// lets perl precompile the regexp. That would pay off in cases like this, where we'll be looping through the same set of regexps over and over again, yes?
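
      To check that intuition, here's a rough Benchmark sketch (my own, not from the thread; the data and strings are made up) comparing the interpolated form against patterns compiled once with qr//:

      #!/usr/bin/perl
      use strict;
      use Benchmark qw(cmpthese);

      # Made-up data: a few thousand lines scanned repeatedly.
      my @lines   = ("some text string_two more text\n") x 2000;
      my @strings = map { "string_$_" } qw(one two three four five);

      # Compile each pattern once, outside the scan loop.
      my @compiled = map { qr/\Q$_\E/ } @strings;

      cmpthese(-3, {
          interpolated => sub {
              my $n = 0;
              for my $line (@lines) {
                  for my $s (@strings) { $n++ if $line =~ /\Q$s\E/ }
              }
          },
          precompiled => sub {
              my $n = 0;
              for my $line (@lines) {
                  for my $re (@compiled) { $n++ if $line =~ $re }
              }
          },
      });

      On repeated scans the precompiled patterns should come out ahead, since /\Q$s\E/ has to be recompiled every time $s changes.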

      With a 100Mb file and 50+ strings to search for, there could be some speed advantage to forking a separate process for each search string and letting them run in parallel, especially if the regexen are precompiled before forking.

      Of course, the sheer simplicity of merlyn's solution probably more than compensates for any time saved through parallelism, once you realize that the tricky task of gathering up the individual counts from each of the child processes is not as straightforward as it first appears. One way to do it is sketched below.
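
      This is only an untested sketch of the idea; 'file.txt' and the strings are stand-ins from the thread. Each child gets its own pipe, scans the file for one string, and prints a single count back for the parent to collect:

      #!/usr/bin/perl
      use strict;
      use warnings;

      my @strings = ('string_one', 'string_two', 'string_three');
      my %count;
      my %reader;    # string => read end of that child's pipe

      for my $s (@strings) {
          pipe(my $r, my $w) or die "pipe: $!";
          my $pid = fork();
          die "fork: $!" unless defined $pid;
          if ($pid == 0) {           # child: scan the file, report one number
              close $r;
              my $re = qr/\Q$s\E/;
              my $n  = 0;
              open my $fh, '<', 'file.txt' or die "file.txt: $!";
              while (<$fh>) { $n++ if /$re/ }
              print $w "$n\n";
              exit 0;
          }
          close $w;                  # parent keeps only the read end
          $reader{$s} = $r;
      }

      # each read blocks until that child has finished and printed
      for my $s (@strings) {
          my $fh = $reader{$s};
          chomp( $count{$s} = <$fh> );
          close $fh;
      }
      wait() for @strings;           # reap the children

      print "$_ => $count{$_}\n" for @strings;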

      dmm

      You can give a man a fish and feed him for a day ...
      Or, you can
      teach him to fish and feed him for a lifetime
        The only way a fork()ing solution would be faster than the solutions posted so far is on a multiprocessor (MP) machine, where each process could scan the file separately. And that assumes the file fits within the buffer cache.

        Otherwise, the price of the context switches will make this solution run slower.

        Just my $0.02 :)

        Merry Christmas to all the fellow monks!

      merlyn, I hate to critique code that was written on Christmas Eve, but this looks to have three separate bugs.

      There are two major issues in the while(<>) loop. First, $_ plays a dual role in the inner for loop, with the looping value clobbering the data from the file. Adding an inner loop var (i.e. for my $key) will avoid clobbering $_.

      The second bug involves the if qr/$hash_one{$_}[0]/ construct. This doesn't seem to be executing the regex, just compiling it (again??) and returning a true value. You can either drop the qr, leaving /$hash_one{$_}[0]/, or explicitly bind it with $_ =~ qr/$hash_one{$_}[0]/, or perhaps just $_ =~ $hash_one{$_}[0].

      The third issue is more subtle, but still a bug. You aren't quoting special chars when compiling regexes for literal strings... qr/$_/ really should be qr/\Q$_\E/

      With those three issues out of the way we have:

      #!/usr/bin/perl -wT
      use strict;

      my %hash_one = ('string_one'      => 0,
                      'string_two'      => 0,
                      '[[[string_three' => 0, # test special chars behavior
                      'string_four'     => 0,
                      'string_five'     => 0,
                      'string_six'      => 0,
                      'string_seven'    => 0);

      # first, create an array ref, element 0 is a qr// of the key,
      # and element 1 is the count:
      for (keys %hash_one) {
        $hash_one{$_} = [qr/\Q$_\E/, 0];
      }

      # then walk the data, trying all the regexen:
      # Replaced with <DATA> - blakem
      # @ARGV = qw(file.txt);
      # close ARGV;
      while (<DATA>) {
        for my $key (keys %hash_one) {
          $hash_one{$key}[1]++ if $_ =~ $hash_one{$key}[0];
        }
      }

      # finally, replace the arrayref with just the count:
      $_ = $_->[1] for values %hash_one; # works in perl 5.5 and greater

      print "$_ => $hash_one{$_}\n" for keys %hash_one;

      __DATA__
      1 string_one
      string_two
      2 string_two
      [[[string_three
      [[[string_three
      3 [[[string_three
      string_four
      string_four
      string_four
      4 string_four
      doesn'tmatchanything
      Which works correctly and outputs:
      string_four => 4
      string_six => 0
      string_five => 0
      string_one => 1
      string_seven => 0
      [[[string_three => 3
      string_two => 2
      Those bugs make me think you coded that whole thing right here in the pm form box w/o running it through any sample data... in a perverse sort of way, that's more impressive than if it had been totally clean the first time out. ;-)

      -Blake

        Those bugs make me think you coded that whole thing right here in the pm form box w/o running it through any sample data... in a perverse sort of way, that's more impressive than if it had been totally clean the first time out. ;-)
        I've often written code for replies here and on Usenet without testing, right in the reply buffer. And I take my licks when I guess wrong. Thank you for debugging my code.

        -- Randal L. Schwartz, Perl hacker

Re: some forking help
by JohnATmbd (Initiate) on Dec 25, 2001 at 02:52 UTC
    Ok, I've tried different versions of this program:
    #!/usr/bin/perl
    use strict;

    print get_time() . "\n";

    my $count = 0;
    my $pr_regex = "program.jsp?id=1";
    $pr_regex = qr/\Q$pr_regex\E/oi;

    #open(LOGFILE,"file.txt");
    @ARGV = qw(file.txt);
    close ARGV;

    #while (<LOGFILE>) {
    while (<>) {
        $count++ if m/$pr_regex/oi;
    }

    print qq|$count\n|;
    print "\n" . get_time() . "\n";
    exit;

    sub get_time {
        my ($sec, $min, $hour, @junk) = localtime(time);
        $min = '0' . $min if ($min < 10);
        $sec = '0' . $sec if ($sec < 10);
        return qq|$hour:$min:$sec|;
    }

    and the output is:

    bash-2.03$ perl -w agrsel_mark3.cgi
    14:27:05
    203

    14:27:26

    so around 20 seconds to find one string. That's after a little tweaking to get it down from 26 seconds.
    Here's a version of my original (just looking for one string though):

    #!/usr/bin/perl
    use strict;

    print get_time() . "\n";

    my $count = 0;
    my $pr_regex = "program.jsp?id=1";
    $count = `grep -c '$pr_regex' file.txt`;

    print qq|$count\n|;
    print "\n" . get_time() . "\n";
    exit;

    sub get_time {
        my ($sec, $min, $hour, @junk) = localtime(time);
        $min = '0' . $min if ($min < 10);
        $sec = '0' . $sec if ($sec < 10);
        return qq|$hour:$min:$sec|;
    }


    and the output:
    bash-2.03$ perl -w agrsel_mark4.cgi
    14:27:34
    203


    14:27:40

    about 6 seconds.
    Actually running the full program, it takes about 8 seconds a string over the first 68 strings, not quite 9 minutes in all.
    And the regex version takes about 26 minutes to run the first 68 strings.
    A little quick math tells me I'm looking at 2 hours versus 6 hours when I start really using the program.
    I've tried the regex version a few different ways, but the time doesn't get any better than 20 seconds.
    Any ideas on how to speed this up a little?
    Thanks again
    John

      Try a test version that looks for more than one string. You'll have to run grep 50 times to find 50 strings, while a regexp loop will search each line for all 50.

      The regexp loop should scale better for large numbers of regexps, too. Iterating a loop and searching for a pattern match are relatively fast, compared to reading information from the disk or spawning a subshell.
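
      One further idea, offered only as an untested sketch (the file name and strings are stand-ins): fold all the search strings into a single alternation, so each line is scanned once and the capture tells you which string hit.

      #!/usr/bin/perl
      use strict;

      my %count = map { $_ => 0 }
                  ('string_one', 'string_two', 'program.jsp?id=1');

      # Longest strings first, so a string that is a prefix of another
      # can't steal its matches; quotemeta protects chars like '?'.
      my $any = join '|', map { quotemeta }
                sort { length $b <=> length $a } keys %count;
      my $re  = qr/($any)/;

      @ARGV = qw(file.txt);
      while (<>) {
          $count{$1}++ if /$re/;
      }

      print "$_ => $count{$_}\n" for sort keys %count;

      One caveat: this counts at most one string per line (the leftmost match). If several different strings can land on the same line and each should count, stick with the per-key loop, or use while (/$re/g), which counts every occurrence rather than matching lines the way grep -c does.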

Re: some forking help
by JohnATmbd (Initiate) on Dec 25, 2001 at 00:23 UTC
    Thanks for your input. I thought that I'd be able to save some time by running over 50 (it turns out to be 93) processes against a 100Mb file (just under a million lines).
    But you guys think precompiled regexes and a loop are a better solution, so I'll go that way. 'dmmiller2k' likes how simple it is, and so do I.
    Now... is it better to load the file into an array (@ARGV = qw(file.txt);) or to just go through it line by line
    (while (<FILENAME>) { blah blah blah })?
    Thanks again
    John
      You'll eat up a lot of memory by reading the file into an array, and there'll be a hit for the initial read. Going line by line is a little gentler, though it's hard to say whether the memory re-use will perform better. If you run out of swap space, you'll definitely have trouble. I nearly always use while.
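
      For what it's worth, the two styles look like this (a minimal sketch; 'file.txt' stands in for the real log):

      # Line by line: only one line lives in memory at a time.
      open my $fh, '<', 'file.txt' or die "file.txt: $!";
      while (my $line = <$fh>) {
          # ... test $line against each pattern here ...
      }
      close $fh;

      # Slurping pulls all ~1M lines in at once, the 100Mb of data plus
      # Perl's per-element overhead, so it's best avoided here:
      # my @lines = <$fh>;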