Re: How to speed up multiple regex in a loop for a big data?
by Corion (Patriarch) on May 25, 2006 at 06:23 UTC
|
Your code has a problem if there is "overlap" between your source and target variable names. Instead of looping over your replacements for each line, you should build one large regular expression that searches for and replaces all the names in one go. That avoids circles/sequences of replacements, where a name produced by one substitution gets replaced again by a later one:
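# Build one alternation of all (escaped) names
my $re = join '|', map { quotemeta($_) } reverse sort keys %mapper;
while (<IN>) {
    # /i matches case-insensitively, so look the match up in lower case
    s/\b($re)\b/$mapper{lc $1}/gei;
}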
There are modules for conveniently building such regular expressions in a more optimal way, like Regex::PreSuf | [reply] [d/l] |
|
|
I thought the \b ... \b pairs would do the trick, since they match word boundaries. FYI, my mapping file has no redundant entries. Please correct me if I am wrong.
| [reply] |
|
|
%mapper = (
foo => 'zap',
zap => 'foo',
);
Here, foo will be replaced by "zap" in your loop and then again by "foo". But if that can't happen, all you'll gain with the large regex is speed ;) | [reply] [d/l] |
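To make the difference concrete, here is a small sketch comparing the two approaches on the swap mapping above. The sub names replace_in_loop and replace_in_one_pass are made up for this illustration, and escaping the keys with quotemeta and sorting them longest-first are extra precautions I am assuming, not something the code in this thread does.

```perl
use strict;
use warnings;

my %mapper = (
    foo => 'zap',
    zap => 'foo',
);

# One substitution per key: a later substitution can re-replace
# text that an earlier one has just produced.
sub replace_in_loop {
    my ($line) = @_;
    for my $key (sort keys %mapper) {
        $line =~ s/\b\Q$key\E\b/$mapper{$key}/g;
    }
    return $line;
}

# One combined alternation: each word in the line is matched at most once.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } keys %mapper;
my $re = qr/\b($alternation)\b/;

sub replace_in_one_pass {
    my ($line) = @_;
    $line =~ s/$re/$mapper{$1}/g;
    return $line;
}

print replace_in_loop("foo and zap"), "\n";       # foo and foo
print replace_in_one_pass("foo and zap"), "\n";   # zap and foo
```

The loop version turns both words into "foo" because the second substitution undoes the first; the single pass swaps them as intended.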
Re: How to speed up multiple regex in a loop for a big data?
by salva (Canon) on May 25, 2006 at 08:19 UTC
|
you can try generating a subroutine on the fly to perform all the substitutions:
open(MAP, "<$new_name_map_file");
while (<MAP>) {
chomp;
tr/A-Z/a-z/;
@map_line = split (/\t/);
$mapper{$map_line[0]} = $map_line[1];
}
close(MAP);
my $sub = "sub { ";
for my $name (sort keys %mapper) {
my $qname = quotemeta $name;
my $qrepl = quotemeta $mapper{$name};
$sub .= "s{\\b$qname\\b}{$qrepl}g; "; # \\b: a bare \b in a double-quoted string is a backspace
}
$sub .= "}";
$sub = eval $sub;
die $@ if $@;
open(IN, "<input_file");
open(OUT, ">input_file.new");
while (<IN>) {
print "%";
tr/A-Z/a-z/;
$sub->();
print OUT "$_";
}
close(IN);
close(OUT);
This way, the regular expressions are compiled just once, and the inner loop and the sort are removed from the while loop. | [reply] [d/l] |
|
|
Isn't the cost of calling a sub significantly higher than running the same code in the loop? It's always been my policy to eliminate subs where possible when absolute speed is required. (But I could be wrong).
| [reply] |
|
|
Well, calling subs in Perl is not as expensive as people usually think...
use Benchmark 'cmpthese';
my $a = 'foo bar doz' x 100;
$a .= ' hello '.$a;
my $sub = sub {
/\bhello\b/; /\bhello\b/; /\bhello\b/; /\bhello\b/;
/\bhello\b/; /\bhello\b/; /\bhello\b/; /\bhello\b/;
};
cmpthese(-3, { loop => sub {
for (($a) x 10) {
for my $i (1..8) {
/\bhello\b/;
}
}
},
sub => sub {
for (($a) x 10) {
$sub->()
}
},
inline => sub {
for (($a) x 10) {
/\bhello\b/; /\bhello\b/; /\bhello\b/; /\bhello\b/;
/\bhello\b/; /\bhello\b/; /\bhello\b/; /\bhello\b/;
}
}
});
outputs...
         Rate   loop    sub inline
loop   4157/s     --    -5%   -16%
sub    4363/s     5%     --   -12%
inline 4943/s    19%    13%     --
And anyway, it's easy to modify my code to remove the subroutine call from the loop, just by moving the loop inside the sub:
open(MAP, "<$new_name_map_file");
while (<MAP>) {
chomp;
tr/A-Z/a-z/;
@map_line = split (/\t/);
$mapper{$map_line[0]} = $map_line[1];
}
close(MAP);
my $sub = <<'EOS';
sub {
while (<IN>) {
print "%";
tr/A-Z/a-z/;
EOS
for my $name (sort keys %mapper) {
my $qname = quotemeta $name;
my $qrepl = quotemeta $mapper{$name};
$sub .= "s{\\b$qname\\b}{$qrepl}g;\n"; # \\b: a bare \b in a double-quoted string is a backspace
}
$sub .= <<'EOS';
print OUT $_;
}
}
EOS
$sub = eval $sub;
die $@ if $@;
open(IN, "<input_file");
open(OUT, ">input_file.new");
$sub->();
close(IN);
close(OUT);
| [reply] [d/l] [select] |
Re: How to speed up multiple regex in a loop for a big data?
by Samy_rio (Vicar) on May 25, 2006 at 06:17 UTC
|
Hi, If I understood your question correctly, try this & see the Benchmark,
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark 'cmpthese';
cmpthese(-1, {
method1 => '&yours',
method2 => '&new_one',
});
sub yours{
my %mapper;
open(MAP, "<map.ini");
while (<MAP>) {
chomp;
tr/A-Z/a-z/;
my @map_line = split (/\t/);
$mapper{$map_line[0]} = $map_line[1];
}
close(MAP);
open(IN, "<input1.txt");
open(OUT, ">input_new.txt");
while (<IN>) {
#print "%";
tr/A-Z/a-z/;
foreach my $key (sort keys %mapper) {
s/\b$key\b/$mapper{$key}/g;
}
print OUT "$_";
}
close(IN);
close(OUT);
}
sub new_one{
open(MAP, "map.ini");
my $map = do{local $/;<MAP>};
close(MAP);
my %mapper;
$map = lc($map);
%mapper = map{split/\t/,$_} split(/\n/, $map);
open(IN, "input1.txt");
my $input = do{local $/;<IN>};
close(IN);
open(OUT, ">input_new.txt");
$input = lc($input);
foreach my $key (sort keys %mapper) {
$input =~ s/\b$key\b/$mapper{$key}/g;
}
print OUT $input;
close(OUT);
}
__END__
Rate method1 method2
method1 908/s -- -13%
method2 1050/s 16% --
I have only checked this with a small file. You can check it with your own input.
Regards, Velusamy R. eval"print uc\"\\c$_\""for split'','j)@,/6%@0%2,`e@3!-9v2)/@|6%,53!-9@2~j';
| [reply] [d/l] [select] |
Re: How to speed up multiple regex in a loop for a big data?
by cdarke (Prior) on May 25, 2006 at 07:03 UTC
|
One thing that leaps out at me is that you are sorting the keys of %mapper each time you read a record from input_file. I don't know how big a key is, but it might be worth sorting the keys just once, before you read input_file. Since you don't appear to alter the hash, this should be safe. e.g.
my @map_keys = sort keys %mapper;
...
foreach my $key (@map_keys) {
...
| [reply] |
|
|
Come to think of it, why are you sorting the keys?
| [reply] |
|
|
Hey, you're right!
I was about to give an example of partial keys getting substituted, but then I realized the keys are wrapped in word boundaries, so there's no chance of partial key substitution, and no reason to sort.
Drop the sort and save some time.
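A quick sketch of why the order doesn't matter here, using made-up keys foo and foobar (hypothetical, not taken from the original map file). It assumes the keys consist only of word characters and that no replacement value is itself a key; if either assumption fails, \b behaves differently or the order starts to matter again.

```perl
use strict;
use warnings;

my %mapper = (
    foo    => 'one',
    foobar => 'two',
);

# Apply the substitutions in whatever key order the caller passes in.
sub apply_in_order {
    my ($line, @keys) = @_;
    for my $key (@keys) {
        $line =~ s/\b\Q$key\E\b/$mapper{$key}/g;
    }
    return $line;
}

my $text = "foo foobar foo_bar";

# \bfoo\b cannot match inside "foobar" or "foo_bar", because the character
# after "foo" is a word character there, so both orders give the same line.
my $forward  = apply_in_order($text, sort keys %mapper);
my $backward = apply_in_order($text, reverse sort keys %mapper);

print "$forward\n";
print "$backward\n";
```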
| [reply] |
|
|
You are right on unnecessary sorting. I copied over that part from somewhere and wasn't aware of what I was doing...
| [reply] |
Re: How to speed up multiple regex in a loop for a big data?
by wazzuteke (Hermit) on May 25, 2006 at 17:22 UTC
|
I don't think I saw this in any of the comments, but some ideas may be:
- study each line. Take a look at the perldoc about that one; it may or may not help depending on the number of patterns, the pattern, the line, etc...
- The '//s' modifier can sometimes help; it makes '.' match newlines as well, treating the string as a single line (which matters more if you read the entire file at once)
- Try pre-compiling or inline compiling the expressions:
- my $regex = qr/<--some regex-->/; $line =~ $regex; if they are constant-ish
- Or use the '//o' modifier so that a regex with interpolated variables is compiled only once, even inside a loop. (this one may not help you as much)
Just some more ideas. They may or may not work, though are nice goodies for the future if not.
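As a rough sketch of the pre-compiling idea: build one qr// object per key before the read loop, so the patterns should not need to be interpolated and recompiled for every line. The @rules array, the rewrite_line sub and the sample map below are hypothetical, chosen only for this example.

```perl
use strict;
use warnings;

# Sample data, standing in for the real mapping file.
my %mapper = (
    alpha => 'a',
    beta  => 'b',
);

# Compile each pattern once, up front, and pair it with its replacement.
my @rules = map { [ qr/\b\Q$_\E\b/, $mapper{$_} ] } keys %mapper;

sub rewrite_line {
    my ($line) = @_;
    for my $rule (@rules) {
        my ($pattern, $replacement) = @$rule;
        $line =~ s/$pattern/$replacement/g;
    }
    return $line;
}

print rewrite_line("alpha beta alphabet"), "\n";   # a b alphabet
```

Whether this beats a single combined alternation depends on the number of keys and the input, so it is worth benchmarking both on real data.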
print map{chr}(45,45,104,97,124,124,116,97,45,45);
... and I probably posted this while I was at work => whitepages.com | inc.
| [reply] [d/l] [select] |