Re: some forking help
by merlyn (Sage) on Dec 24, 2001 at 22:16 UTC
What you need is to not invoke an external script to do what Perl does quickly. {grin}
my %hash_one = ('string_one'   => 0,
                'string_two'   => 0,
                'string_three' => 0,
                'string_four'  => 0,
                'string_five'  => 0,
                'string_six'   => 0,
                'string_seven' => 0);
# first, create an array ref, element 0 is a qr// of the key, and element 1 is the count:
for (keys %hash_one) {
  $hash_one{$_} = [qr/$_/, 0];
}
# then walk the data, trying all the regexen:
@ARGV = qw(file.txt);
close ARGV;
while (<>) {
  for (keys %hash_one) {
    $hash_one{$_}[1]++ if qr/$hash_one{$_}[0]/;
  }
}
# finally, replace the arrayref with just the count:
$_ = $_->[1] for values %hash_one; # works in perl 5.5 and greater
-- Randal L. Schwartz, Perl hacker
my %hash_one = ('string_one'   => 0,
                'string_two'   => 0,
                'string_three' => 0);
@ARGV = qw(file.txt);
close ARGV;
while (<>) {
  for my $key (keys %hash_one) {
    $hash_one{$key}++ if /$key/;
  }
}
Presumably your version beats something plain like this because qr// lets perl precompile the regexp. That would pay off in a case like this, where we'll be looping through the same set of regexps over and over again, yes?
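A rough Benchmark sketch, with made-up sample data standing in for the real file, ought to show how much the precompile buys:

use strict;
use warnings;
use Benchmark qw(cmpthese);

my @lines    = ('a sample log line mentioning string_three and more') x 1000;
my @strings  = map { "string_$_" } qw(one two three four five);
my @compiled = map { qr/\Q$_\E/ } @strings;

cmpthese(-3, {
    # interpolate the raw string each time; perl recompiles the
    # pattern whenever the interpolated text differs from last time
    interpolated => sub {
        my $n = 0;
        for my $line (@lines) {
            for my $s (@strings) { $n++ if $line =~ /\Q$s\E/ }
        }
    },
    # match against qr// objects compiled once up front
    precompiled => sub {
        my $n = 0;
        for my $line (@lines) {
            for my $re (@compiled) { $n++ if $line =~ $re }
        }
    },
});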
With a 100Mb file and 50+ strings to search for, there could be some speed advantage to forking a separate process for each search string and letting them run in parallel, especially if the regexen are precompiled before forking.
Of course, the sheer simplicity of merlyn's solution probably more than compensates for any time saved through parallelism, once you realize that gathering up the individual counts from the child processes is trickier than it first appears.
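For the record, here's roughly what I have in mind: one pipe per child, with the parent reading a single count back from each. An untested sketch; the filename and strings are placeholders.

use strict;
use warnings;

my $file    = 'file.txt';    # placeholder filename
my @strings = ('string_one', 'string_two', 'string_three');
my %reader;                  # pid => [ string, read end of its pipe ]

for my $s (@strings) {
    pipe(my $rd, my $wr) or die "pipe: $!";
    defined(my $pid = fork) or die "fork: $!";
    if ($pid) {              # parent keeps the read end
        close $wr;
        $reader{$pid} = [$s, $rd];
        next;
    }
    # child: scan the whole file for its one string, report the count
    close $rd;
    my $re = qr/\Q$s\E/;
    my $n  = 0;
    open my $fh, '<', $file or die "$file: $!";
    while (<$fh>) { $n++ if /$re/ }
    print {$wr} "$n\n";
    exit 0;
}

# gather: one line of output per child
my %count;
for my $pid (keys %reader) {
    my ($s, $rd) = @{ $reader{$pid} };
    chomp(my $n = <$rd>);
    $count{$s} = $n;
    waitpid $pid, 0;
}
print "$_ => $count{$_}\n" for sort keys %count;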
dmm
You can give a man a fish and feed him for a day ...
Or, you can teach him to fish and feed him for a lifetime
The only way a fork()ing solution would be faster than the solutions posted so far is on a multi-processor machine, where each process could scan the file separately, and then only if the file fits within the buffer cache.
Otherwise, the price of the context switches will make this solution run slower.
Just my $0.02 :)
Merry Christmas to all the fellow monks!
merlyn, I hate to critique code that was written on Christmas Eve, but this looks to have three separate bugs.
There are two major issues in the while(<>) loop. First, $_ plays a dual role in the inner for loop, with the looping value clobbering the data from the file. Adding an inner loop var (i.e. for my $key) will avoid clobbering $_.
The second bug involves the if qr/$hash_one{$_}[0]/ construct. This doesn't seem to be executing the regex, just compiling it (again??) and returning a true value. You can either drop the qr, leaving /$hash_one{$_}[0]/, explicitly bind it with $_ =~ qr/$hash_one{$_}[0]/, or perhaps just use $_ =~ $hash_one{$_}[0].
The third issue is more subtle, but still a bug. You aren't quoting special chars when compiling regexes for literal strings... qr/$_/ really should be qr/\Q$_\E/.
With those three issues out of the way we have:
#!/usr/bin/perl -wT
use strict;
my %hash_one = ('string_one'      => 0,
                'string_two'      => 0,
                '[[[string_three' => 0, # test special chars behavior
                'string_four'     => 0,
                'string_five'     => 0,
                'string_six'      => 0,
                'string_seven'    => 0);
# first, create an array ref, element 0 is a qr// of the key, and element 1 is the count:
for (keys %hash_one) {
  $hash_one{$_} = [qr/\Q$_\E/, 0];
}
# then walk the data, trying all the regexen:
# Replaced with <DATA> - blakem
# @ARGV = qw(file.txt);
# close ARGV;
while (<DATA>) {
  for my $key (keys %hash_one) {
    $hash_one{$key}[1]++ if $_ =~ $hash_one{$key}[0];
  }
}
# finally, replace the arrayref with just the count:
$_ = $_->[1] for values %hash_one; # works in perl 5.5 and greater
print "$_ => $hash_one{$_}\n" for keys %hash_one;
__DATA__
1 string_one
string_two
2 string_two
[[[string_three
[[[string_three
3 [[[string_three
string_four
string_four
string_four
4 string_four
doesn'tmatchanything
Which works correctly and outputs:
string_four => 4
string_six => 0
string_five => 0
string_one => 1
string_seven => 0
[[[string_three => 3
string_two => 2
Those bugs make me think you coded that whole thing right here in the PM form box w/o running it through any sample data.... in a perverse sort of way, that's more impressive than if it had been totally clean the first time out. ;-)
-Blake
Re: some forking help
by JohnATmbd (Initiate) on Dec 25, 2001 at 02:52 UTC
OK, I've tried different versions of this program:
#!/usr/bin/perl
use strict;

print get_time() . "\n";

my $count = 0;
my $pr_regex = "program.jsp?id=1";
$pr_regex = qr/\Q$pr_regex\E/oi;   # precompile, with metacharacters quoted

#open(LOGFILE,"file.txt");
@ARGV = qw(file.txt);
close ARGV;
#while (<LOGFILE>) {
while (<>) {
    $count++ if m/$pr_regex/oi;
}
print qq|$count\n|;
print "\n" . get_time() . "\n";
exit;

sub get_time {
    my ($sec, $min, $hour, @junk) = localtime(time);
    $min = '0' . $min if $min < 10;    # zero-pad minutes and seconds
    $sec = '0' . $sec if $sec < 10;
    return qq|$hour:$min:$sec|;
}
and the output is:
bash-2.03$ perl -w agrsel_mark3.cgi
14:27:05
203
14:27:26
So it takes around 20 seconds to find one string. That's after a little tweaking to get it down from 26 seconds.
Here's a version of my original (just looking for one string though):
#!/usr/bin/perl
use strict;

print get_time() . "\n";

my $count = 0;
my $pr_regex = "program.jsp?id=1";
$count = `grep -c '$pr_regex' file.txt`;
print qq|$count\n|;
print "\n" . get_time() . "\n";
exit;

sub get_time {
    my ($sec, $min, $hour, @junk) = localtime(time);
    $min = '0' . $min if $min < 10;
    $sec = '0' . $sec if $sec < 10;
    return qq|$hour:$min:$sec|;
}
and the output:
bash-2.03$ perl -w agrsel_mark4.cgi
14:27:34
203
14:27:40
about 6 seconds.
Actually running the full program, it takes about 8 seconds per string over the first 68 strings, or not quite 9 minutes.
And the regex version takes about 26 minutes to run the first 68 strings, roughly 23 seconds per string.
A little quick math tells me I'm looking at 2 hours versus 6 hours when I start really using the program.
I've tried the regex version a few different ways, but the time doesn't get any better than 20 seconds.
Any ideas on how to jump this up a little?
Thanks again
John
Try a test version that looks for more than one string. You'll have to run grep 50 times to find 50 strings, while a regexp loop will search each line for all 50.
The regexp loop should scale better for large numbers of regexps, too. Iterating a loop and searching for a pattern match are relatively fast, compared to reading information from the disk or spawning a subshell.
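If you want to squeeze further, you can even fold all the strings into a single alternation, so each line is scanned just once no matter how many strings there are. A rough, untested sketch with placeholder strings; note that each match gets credited to only one string, longest candidate first:

use strict;
use warnings;

my @strings = ('string_one', 'string_two', 'program.jsp?id=1');
my %count   = map { $_ => 0 } @strings;

# longest strings first, so overlapping literals prefer the longer match
my $alt = join '|',
          map  { quotemeta }
          sort { length($b) <=> length($a) } @strings;
my $re  = qr/($alt)/;

@ARGV = ('file.txt');    # placeholder filename
while (<>) {
    $count{$1}++ while /$re/g;
}
print "$_ => $count{$_}\n" for sort keys %count;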
Re: some forking help
by JohnATmbd (Initiate) on Dec 25, 2001 at 00:23 UTC
Thanks for your input. I thought that I'd be able to save some time by running over 50 (turns out to be 93) processes against a 100 Mb file (just under a million lines).
But you guys think the precompiled regexes and a loop are the better solution, so I'll go that way. dmmiller2k likes how simple it is, and so do I.
Now... is it better to load the file into an array (@ARGV = qw(file.txt);) or to just go through it line by line (while (<FILEHANDLE>) { blah blah blah })?
Thanks again
John
You'll eat up a lot of memory by reading the file into an array, and there'll be a hit for the initial read. Going line by line is a little gentler, though it's hard to say whether the memory re-use will perform better. If you run out of swap space, you'll definitely have trouble. I nearly always use while.
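To make the two styles concrete, here's a small sketch, reusing the pattern from your test program, with file.txt standing in for the real log:

use strict;
use warnings;

my $re = qr/\Qprogram.jsp?id=1\E/;

# line by line: only the current line is held in memory
open my $fh, '<', 'file.txt' or die "file.txt: $!";
my $count = 0;
while (my $line = <$fh>) {
    $count++ if $line =~ $re;
}
close $fh;

# slurped: all ~1 million lines sit in memory at once
open $fh, '<', 'file.txt' or die "file.txt: $!";
my @lines = <$fh>;
close $fh;
my $slurp_count = grep { /$re/ } @lines;    # grep in scalar context counts matches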