JohnATmbd has asked for the wisdom of the Perl Monks concerning the following question:

Here's what I'm doing now:
my %hash_one = ('string_one'   => 0,
                'string_two'   => 0,
                'string_three' => 0,
                'string_four'  => 0,
                'string_five'  => 0,
                'string_six'   => 0,
                'string_seven' => 0);

foreach $i (keys %hash_one) {
    $hash_one{$i} = `grep -c '$i' file.txt`;
}
What I want to do is get rid of the loop and grep the file for each string all at once and have the values dumped into the right part of the hash.
There are >50 things I'm looking for and the file is about 100Mb.
I understand that using fork() may be my solution, and I've looked into it but I'm afraid I'm just not getting it.
Any help would be greatly appreciated
Thanks
John
ps. I don't need to have it in a hash, I just need to be able to hook up the right string to the right count later.

Re: some forking help
by merlyn (Sage) on Dec 24, 2001 at 22:16 UTC
    What you need is to not invoke an external script to do what Perl does quickly. {grin}
    my %hash_one = ('string_one'   => 0,
                    'string_two'   => 0,
                    'string_three' => 0,
                    'string_four'  => 0,
                    'string_five'  => 0,
                    'string_six'   => 0,
                    'string_seven' => 0);

    # first, create an array ref, element 0 is a qr// of the key,
    # and element 1 is the count:
    for (keys %hash_one) {
      $hash_one{$_} = [qr/$_/, 0];
    }

    # then walk the data, trying all the regexen:
    @ARGV = qw(file.txt);
    close ARGV;
    while (<>) {
      for (keys %hash_one) {
        $hash_one{$_}[1]++ if qr/$hash_one{$_}[0]/;
      }
    }

    # finally, replace the arrayref with just the count:
    $_ = $_->[1] for values %hash_one; # works in perl 5.5 and greater

    -- Randal L. Schwartz, Perl hacker

      hmm.. I smell a chance to test my understanding. Creating and storing qr// expressions takes extra work, but beats this simpler form:

      my %hash_one = ('string_one'   => 0,
                      'string_two'   => 0,
                      'string_three' => 0);
      @ARGV = qw(file.txt);
      close ARGV;
      while (<>) {
        for my $key (keys %hash_one) {
          $hash_one{$key}++ if /$key/;
        }
      }

      because qr// lets perl precompile the regexp. That would pay off in cases like this, where we'll be looping through the same set of regexps over and over again, yes?
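
      To check that intuition, here's a rough Benchmark sketch (my own, not from the thread; the data and strings are made up) comparing the interpolated form against patterns compiled once with qr//:

      #!/usr/bin/perl
      use strict;
      use Benchmark qw(cmpthese);

      # Made-up data: a few thousand lines scanned repeatedly.
      my @lines   = ("some text string_two more text\n") x 2000;
      my @strings = map { "string_$_" } qw(one two three four five);

      # Compile each pattern once, outside the scan loop.
      my @compiled = map { qr/\Q$_\E/ } @strings;

      cmpthese(-3, {
          interpolated => sub {
              my $n = 0;
              for my $line (@lines) {
                  for my $s (@strings) { $n++ if $line =~ /\Q$s\E/ }
              }
          },
          precompiled => sub {
              my $n = 0;
              for my $line (@lines) {
                  for my $re (@compiled) { $n++ if $line =~ $re }
              }
          },
      });

      On repeated scans the precompiled patterns should come out ahead, since /\Q$s\E/ has to be recompiled every time $s changes.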

      With a 100Mb file and 50+ strings to search for, there could be some speed advantage to forking a separate process for each search string and letting them run in parallel, especially if the regexen are precompiled before forking.

      Of course, the sheer simplicity of merlyn's solution probably more than compensates for any time saved through parallelism, once you realize that the tricky task of gathering up the individual counts from each of the child processes is not as straightforward as it first appears. One way to do it is sketched below.
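
      This is only an untested sketch of the idea; 'file.txt' and the strings are stand-ins from the thread. Each child gets its own pipe, scans the file for one string, and prints a single count back for the parent to collect:

      #!/usr/bin/perl
      use strict;
      use warnings;

      my @strings = ('string_one', 'string_two', 'string_three');
      my %count;
      my %reader;    # string => read end of that child's pipe

      for my $s (@strings) {
          pipe(my $r, my $w) or die "pipe: $!";
          my $pid = fork();
          die "fork: $!" unless defined $pid;
          if ($pid == 0) {           # child: scan the file, report one number
              close $r;
              my $re = qr/\Q$s\E/;
              my $n  = 0;
              open my $fh, '<', 'file.txt' or die "file.txt: $!";
              while (<$fh>) { $n++ if /$re/ }
              print $w "$n\n";
              exit 0;
          }
          close $w;                  # parent keeps only the read end
          $reader{$s} = $r;
      }

      # each read blocks until that child has finished and printed
      for my $s (@strings) {
          my $fh = $reader{$s};
          chomp( $count{$s} = <$fh> );
          close $fh;
      }
      wait() for @strings;           # reap the children

      print "$_ => $count{$_}\n" for @strings;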

      dmm

      You can give a man a fish and feed him for a day ...
      Or, you can
      teach him to fish and feed him for a lifetime
        The only way a fork()ing solution would be faster than the solutions posted so far is on a multiprocessor (MP) machine, where each process could scan the file separately. And that assumes the file fits within the buffer cache.

        Otherwise, the price of the context switches will make this solution run slower.

        Just my $0.02 :)

        Merry Christmas to all the fellow monks!

      merlyn, I hate to critique code that was written on Christmas Eve, but this looks to have three separate bugs.

      There are two major issues in the while(<>) loop. First, $_ plays a dual role in the inner for loop, with the looping value clobbering the data from the file. Adding an inner loop var (i.e. for my $key) will avoid clobbering $_.

      The second bug involves the if qr/$hash_one{$_}[0]/ construct. This doesn't seem to be executing the regex, just compiling it (again??) and returning a true value. You can either drop the qr, leaving /$hash_one{$_}[0]/, or explicitly bind it with $_ =~ qr/$hash_one{$_}[0]/, or perhaps just $_ =~ $hash_one{$_}[0].

      The third issue is more subtle, but still a bug. You aren't quoting special chars when compiling regexes for literal strings... qr/$_/ really should be qr/\Q$_\E/

      With those three issues out of the way we have:

      #!/usr/bin/perl -wT
      use strict;

      my %hash_one = ('string_one'      => 0,
                      'string_two'      => 0,
                      '[[[string_three' => 0, # test special chars behavior
                      'string_four'     => 0,
                      'string_five'     => 0,
                      'string_six'      => 0,
                      'string_seven'    => 0);

      # first, create an array ref, element 0 is a qr// of the key,
      # and element 1 is the count:
      for (keys %hash_one) {
        $hash_one{$_} = [qr/\Q$_\E/, 0];
      }

      # then walk the data, trying all the regexen:
      # Replaced with <DATA> - blakem
      # @ARGV = qw(file.txt);
      # close ARGV;
      while (<DATA>) {
        for my $key (keys %hash_one) {
          $hash_one{$key}[1]++ if $_ =~ $hash_one{$key}[0];
        }
      }

      # finally, replace the arrayref with just the count:
      $_ = $_->[1] for values %hash_one; # works in perl 5.5 and greater

      print "$_ => $hash_one{$_}\n" for keys %hash_one;

      __DATA__
      1 string_one
      string_two
      2 string_two
      [[[string_three
      [[[string_three
      3 [[[string_three
      string_four
      string_four
      string_four
      4 string_four
      doesn'tmatchanything
      Which works correctly and outputs:
      string_four => 4
      string_six => 0
      string_five => 0
      string_one => 1
      string_seven => 0
      [[[string_three => 3
      string_two => 2
      Those bugs make me think you coded that whole thing right here in the pm form box w/o running it through any sample data... in a perverse sort of way, that's more impressive than if it had been totally clean the first time out. ;-)

      -Blake

        Those bugs make me think you coded that whole thing right here in the pm form box w/o running it through any sample data... in a perverse sort of way, that's more impressive than if it had been totally clean the first time out. ;-)
        I've often written code for replies here and on Usenet without testing, right in the reply buffer. And I take my licks when I guess wrong. Thank you for debugging my code.

        -- Randal L. Schwartz, Perl hacker

Re: some forking help
by JohnATmbd (Initiate) on Dec 25, 2001 at 02:52 UTC
    Ok, I've tried different versions of this program:
    #!/usr/bin/perl
    use strict;

    print get_time() . "\n";

    my $count = 0;
    my $pr_regex = "program.jsp?id=1";
    $pr_regex = qr/\Q$pr_regex\E/oi;

    #open(LOGFILE,"file.txt");
    @ARGV = qw(file.txt);
    close ARGV;

    #while (<LOGFILE>) {
    while (<>) {
        $count++ if m/$pr_regex/oi;
    }

    print qq|$count\n|;
    print "\n" . get_time() . "\n";
    exit;

    sub get_time {
        my ($sec, $min, $hour, @junk) = localtime(time);
        $min = '0' . $min if ($min < 10);
        $sec = '0' . $sec if ($sec < 10);
        return qq|$hour:$min:$sec|;
    }

    and the output is:

    bash-2.03$ perl -w agrsel_mark3.cgi
    14:27:05
    203

    14:27:26

    so around 20 seconds to find one string. That's after a little tweaking to get it down from 26 seconds.
    Here's a version of my original (just looking for one string though):

    #!/usr/bin/perl
    use strict;

    print get_time() . "\n";

    my $count = 0;
    my $pr_regex = "program.jsp?id=1";
    $count = `grep -c '$pr_regex' file.txt`;

    print qq|$count\n|;
    print "\n" . get_time() . "\n";
    exit;

    sub get_time {
        my ($sec, $min, $hour, @junk) = localtime(time);
        $min = '0' . $min if ($min < 10);
        $sec = '0' . $sec if ($sec < 10);
        return qq|$hour:$min:$sec|;
    }


    and the output:
    bash-2.03$ perl -w agrsel_mark4.cgi
    14:27:34
    203


    14:27:40

    about 6 seconds.
    Actually running the full program, it takes about 8 seconds a string over the first 68 strings, not quite 9 minutes in all.
    And the regex version takes about 26 minutes to run the first 68 strings.
    A little quick math tells me I'm looking at 2 hours versus 6 hours when I start really using the program.
    I've tried the regex version a few different ways, but the time doesn't get any better than 20 seconds.
    Any ideas on how to speed this up a little?
    Thanks again
    John

      Try a test version that looks for more than one string. You'll have to run grep 50 times to find 50 strings, while a regexp loop will search each line for all 50.

      The regexp loop should scale better for large numbers of regexps, too. Iterating a loop and searching for a pattern match are relatively fast, compared to reading information from the disk or spawning a subshell.
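
      One further idea, offered only as an untested sketch (the file name and strings are stand-ins): fold all the search strings into a single alternation, so each line is scanned once and the capture tells you which string hit.

      #!/usr/bin/perl
      use strict;

      my %count = map { $_ => 0 }
                  ('string_one', 'string_two', 'program.jsp?id=1');

      # Longest strings first, so a string that is a prefix of another
      # can't steal its matches; quotemeta protects chars like '?'.
      my $any = join '|', map { quotemeta }
                sort { length $b <=> length $a } keys %count;
      my $re  = qr/($any)/;

      @ARGV = qw(file.txt);
      while (<>) {
          $count{$1}++ if /$re/;
      }

      print "$_ => $count{$_}\n" for sort keys %count;

      One caveat: this counts at most one string per line (the leftmost match). If several different strings can land on the same line and each should count, stick with the per-key loop, or use while (/$re/g), which counts every occurrence rather than matching lines the way grep -c does.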

Re: some forking help
by JohnATmbd (Initiate) on Dec 25, 2001 at 00:23 UTC
    Thanks for your input. I thought that I'd be able to save some time by running over 50 (it turns out to be 93) processes against a 100Mb file (just under a million lines).
    But you guys think precompiled regexes and a loop are a better solution, so I'll go that way. 'dmmiller2k' likes how simple it is, and so do I.
    Now... is it better to load the file into an array (@ARGV = qw(file.txt);) or to just go through it line by line
    (while (<FILENAME>) { blah blah blah })?
    Thanks again
    John
      You'll eat up a lot of memory by reading the file into an array, and there'll be a hit for the initial read. Going line by line is a little gentler, though it's hard to say whether the memory re-use will perform better. If you run out of swap space, you'll definitely have trouble. I nearly always use while.
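
      For what it's worth, the two styles look like this (a minimal sketch; 'file.txt' stands in for the real log):

      # Line by line: only one line lives in memory at a time.
      open my $fh, '<', 'file.txt' or die "file.txt: $!";
      while (my $line = <$fh>) {
          # ... test $line against each pattern here ...
      }
      close $fh;

      # Slurping pulls all ~1M lines in at once, the 100Mb of data plus
      # Perl's per-element overhead, so it's best avoided here:
      # my @lines = <$fh>;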