How to speed up multiple regex in a loop for a big data?

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: How to speed up multiple regex in a loop for a big data? by Corion (Patriarch) on May 25, 2006 at 06:23 UTC
Your code has a problem if there is "overlap" between your source and target variable names. Instead of looping over your replacements for every line, you should create one large regular expression that will search and replace the names in one go, to avoid circles/sequences of replacing names: `my $re = join '\b|\b', reverse keys %mapper; while (<IN>) { s/\b($re)\b/$mapper{lc $1}/gei; }` There are modules for conveniently building such regular expressions in a more optimal way, like Regex::PreSuf.
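A minimal sketch of the single-regex idea above, under a few assumptions that are not in the original post: the mapping is a small hypothetical hash rather than one loaded from a file, every key is passed through quotemeta so that characters like `.` match literally, and longer keys are tried first. The helper name rename_line is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical mapping; in the thread this hash is loaded from a
# tab-separated file and the keys are lower-cased.
my %mapper = (
    'old_name' => 'new_name',
    'old.name' => 'dotted',
    'counter'  => 'cnt',
);

# Escape every key and join them into one alternation, longest keys first.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) or $a cmp $b }
    keys %mapper;

# Compile the combined pattern once, outside the read loop.
my $re = qr/\b($alternation)\b/i;

# Replace every mapped name in one pass over the line.
sub rename_line {
    my ($line) = @_;
    # lc($1) because the keys are lower case while /i may match any case.
    $line =~ s/$re/$mapper{ lc $1 }/ge;
    return $line;
}

print rename_line("OLD_NAME = counter + old.name;\n");
# prints "new_name = cnt + dotted;"
```

Each word in the line is looked up at most once, so a replacement can never be rewritten again by a later key.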
Re^2: How to speed up multiple regex in a loop for a big data? by MonkInPleasanton (Initiate) on May 25, 2006 at 06:31 UTC
I thought the \b ... \b pairs would do the trick, since they match word boundaries. FYI, my mapping file has no redundant entries. Please correct me if I am wrong.
Re^3: How to speed up multiple regex in a loop for a big data? by Corion (Patriarch) on May 25, 2006 at 09:51 UTC
You're correct about the \b word boundaries. The problem case I am thinking of is the following renaming setup: `%mapper = ( foo => 'zap', zap => 'foo', );` Here, foo will be replaced by "zap" in your loop and then again by "foo". But if that can't happen, all you'll gain with the large regex is speed ;)
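The foo/zap case can be reproduced with a short script. This is only an illustrative sketch: the two helper names, rename_sequential and rename_single_pass, are hypothetical, and the input is a literal string instead of a file.

```perl
use strict;
use warnings;

my %mapper = ( foo => 'zap', zap => 'foo' );

# One substitution per key, applied in turn (the approach from the question).
sub rename_sequential {
    my ($line) = @_;
    for my $key (sort keys %mapper) {
        $line =~ s/\b\Q$key\E\b/$mapper{$key}/g;
    }
    return $line;
}

# One combined pattern, so every word is looked up exactly once.
my $alternation = join '|', map { quotemeta($_) } sort keys %mapper;

sub rename_single_pass {
    my ($line) = @_;
    $line =~ s/\b($alternation)\b/$mapper{$1}/g;
    return $line;
}

print rename_sequential("foo zap"), "\n";    # prints "foo foo"
print rename_single_pass("foo zap"), "\n";   # prints "zap foo"
```

In the sequential version the first pass turns "foo" into "zap" and the second pass turns both "zap"s back into "foo", which is the circle described above.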
Re^4: How to speed up multiple regex in a loop for a big data? by MonkInPleasanton (Initiate) on May 25, 2006 at 14:28 UTC
Re^2: How to speed up multiple regex in a loop for a big data? by planetscape (Chancellor) on May 26, 2006 at 03:29 UTC
Or grinder's Regexp::Assemble. :-) HTH, planetscape
Re: How to speed up multiple regex in a loop for a big data? by salva (Canon) on May 25, 2006 at 08:19 UTC
You can try generating a subroutine on the fly to perform all the substitutions:

`# Load the tab-separated old-name/new-name pairs, lower-cased.
open(MAP, "<$new_name_map_file");
while (<MAP>) {
    chomp;
    tr/A-Z/a-z/;
    @map_line = split (/\t/);
    $mapper{$map_line[0]} = $map_line[1];
}
close(MAP);

# Build the source code of one sub that holds every substitution.
my $sub = "sub { ";
for my $name (sort keys %mapper) {
    my $qname = quotemeta $name;
    my $qrepl = quotemeta $mapper{$name};
    # "\\b" is needed here: a bare "\b" in a double-quoted string is a backspace.
    $sub .= "s{\\b$qname\\b}{$qrepl}g; ";
}
$sub .= "}";
$sub = eval $sub;
die if $@;

open(IN, "<input_file");
open(OUT, ">input_file.new");
while (<IN>) {
    print "%";
    tr/A-Z/a-z/;
    $sub->();
    print OUT "$_";
}
close(IN);
close(OUT);`

This way, the regular expressions are compiled just once, and the inner loop and the sort are removed from the while loop.
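For experimenting without the map and input files, the generated-subroutine approach can be reduced to a self-contained sketch. The sample mapping, the input string, and the wrapper rename_line are assumptions made for illustration; only the technique (building the substitutions as source text and compiling them once with a string eval) comes from the post above.

```perl
use strict;
use warnings;

# Hypothetical mapping with characters that need escaping on both sides.
my %mapper = ( 'foo' => 'bar', 'a.b' => 'c$d' );

# Build the source text of a sub with one s{}{}g statement per name.
my $source = "sub {\n";
for my $name (sort keys %mapper) {
    my $qname = quotemeta $name;
    my $qrepl = quotemeta $mapper{$name};
    # "\\b" puts a literal \b (word boundary) into the generated code.
    $source .= "    s{\\b$qname\\b}{$qrepl}g;\n";
}
$source .= "}\n";

# Compile the generated code once; the sub works on $_.
my $sub = eval $source or die "could not compile generated code: $@";

# Small wrapper so the sub can be applied to any string.
sub rename_line {
    local $_ = shift;
    $sub->();
    return $_;
}

print rename_line("foo a.b food\n");   # prints "bar c$d food"
```

Printing $source before the eval is a quick way to check what is actually being compiled.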
Re^2: How to speed up multiple regex in a loop for a big data? by ruzam (Curate) on May 25, 2006 at 15:55 UTC
Isn't the cost of calling a sub significantly higher than running the same code in the loop? It's always been my policy to eliminate subs where possible when absolute speed is required. (But I could be wrong.)
Re^3: How to speed up multiple regex in a loop for a big data? by salva (Canon) on May 25, 2006 at 16:36 UTC
Well, calling subs in Perl is not as expensive as people usually think...

`use Benchmark 'cmpthese';

my $a = 'foo bar doz' x 100;
$a .= ' hello '.$a;

my $sub = sub {
    /\bhello\b/; /\bhello\b/; /\bhello\b/; /\bhello\b/;
    /\bhello\b/; /\bhello\b/; /\bhello\b/; /\bhello\b/;
};

cmpthese(-3, {
    loop   => sub { for (($a) x 10) { for my $i (1..8) { /\bhello\b/; } } },
    sub    => sub { for (($a) x 10) { $sub->() } },
    inline => sub {
        for (($a) x 10) {
            /\bhello\b/; /\bhello\b/; /\bhello\b/; /\bhello\b/;
            /\bhello\b/; /\bhello\b/; /\bhello\b/; /\bhello\b/;
        }
    },
});`

outputs...

`         Rate   loop    sub inline
loop   4157/s     --    -5%   -16%
sub    4363/s     5%     --   -12%
inline 4943/s    19%    13%     --`

And anyway, it's easy to modify my code to remove the subroutine call from the loop, just by moving the loop inside the sub:

`open(MAP, "<$new_name_map_file");
while (<MAP>) {
    chomp;
    tr/A-Z/a-z/;
    @map_line = split (/\t/);
    $mapper{$map_line[0]} = $map_line[1];
}
close(MAP);

# The generated sub now contains the whole read loop.
my $sub = <<'EOS';
sub {
    while (<IN>) {
        print "%";
        tr/A-Z/a-z/;
EOS
for my $name (sort keys %mapper) {
    my $qname = quotemeta $name;
    my $qrepl = quotemeta $mapper{$name};
    # "\\b" is needed here: a bare "\b" in a double-quoted string is a backspace.
    $sub .= "s{\\b$qname\\b}{$qrepl}g;\n";
}
$sub .= <<'EOS';
        print OUT $_;
    }
}
EOS
$sub = eval $sub;
die if $@;

open(IN, "<input_file");
open(OUT, ">input_file.new");
$sub->();
close(IN);
close(OUT);`
Re^4: How to speed up multiple regex in a loop for a big data? by MonkInPleasanton (Initiate) on May 25, 2006 at 16:45 UTC
Re^5: How to speed up multiple regex in a loop for a big data? by salva (Canon) on May 25, 2006 at 16:58 UTC
Re^4: How to speed up multiple regex in a loop for a big data? by ruzam (Curate) on May 25, 2006 at 17:10 UTC
Re: How to speed up multiple regex in a loop for a big data? by Samy_rio (Vicar) on May 25, 2006 at 06:17 UTC
Hi, if I understood your question correctly, try this and see the benchmark:

`#!/usr/bin/perl
use strict;
use warnings;
use Benchmark 'cmpthese';

cmpthese(-1, {
    method1 => '&yours',
    method2 => '&new_one',
});

# Original approach: read line by line, substitute per key.
sub yours {
    my %mapper;
    open(MAP, "<map.ini");
    while (<MAP>) {
        chomp;
        tr/A-Z/a-z/;
        my @map_line = split (/\t/);
        $mapper{$map_line[0]} = $map_line[1];
    }
    close(MAP);
    open(IN, "<input1.txt");
    open(OUT, ">input_new.txt");
    while (<IN>) {
        #print "%";
        tr/A-Z/a-z/;
        foreach my $key (sort keys %mapper) {
            s/\b$key\b/$mapper{$key}/g;
        }
        print OUT "$_";
    }
    close(IN);
    close(OUT);
}

# Alternative: slurp both files and substitute over the whole input at once.
sub new_one {
    open(MAP, "map.ini");
    my $map = do { local $/; <MAP> };
    close(MAP);
    my %mapper;
    $map = lc($map);
    %mapper = map { split /\t/, $_ } split(/\n/, $map);
    open(IN, "input1.txt");
    my $input = do { local $/; <IN> };
    close(IN);
    open(OUT, ">input_new.txt");
    $input = lc($input);
    foreach my $key (sort keys %mapper) {
        $input =~ s/\b$key\b/$mapper{$key}/g;
    }
    print OUT $input;
    close(OUT);
}

__END__
          Rate method1 method2
method1  908/s      --    -13%
method2 1050/s     16%      --`

I have only checked this with a small file. You can check with your input.

Regards, Velusamy R.

eval"print uc\"\\c$_\""for split'','j)@,/6%@0%2,`e@3!-9v2)/@|6%,53!-9@2~j';
Re: How to speed up multiple regex in a loop for a big data? by cdarke (Prior) on May 25, 2006 at 07:03 UTC
One thing that leaps out at me is that you are sorting the keys of %mapper each time you read a record from input_file. I'm not sure how big a key is, but it might be worth sorting the keys just once before you read input_file; since you don't appear to alter the hash, this should be safe. E.g.: `my @map_keys = sort keys %mapper; ... foreach $key (@map_keys) { ...`
Re^2: How to speed up multiple regex in a loop for a big data? by cdarke (Prior) on May 25, 2006 at 07:06 UTC
Come to think of it, why are you sorting the keys?
Re^3: How to speed up multiple regex in a loop for a big data? by ruzam (Curate) on May 25, 2006 at 15:51 UTC
Hey, you're right! I was about to give an example of partial keys getting substituted, but then I realized the keys are wrapped in word boundaries, so there's no chance of partial key substitution and no reason to sort. Drop the sort and save some time.
Re^3: How to speed up multiple regex in a loop for a big data? by MonkInPleasanton (Initiate) on May 25, 2006 at 16:06 UTC
You are right about the unnecessary sorting. I copied that part over from somewhere and wasn't aware of what I was doing...
Re: How to speed up multiple regex in a loop for a big data? by wazzuteke (Hermit) on May 25, 2006 at 17:22 UTC
I don't think I saw this in any of the comments, but some ideas may be:

- study each line. Take a look at the perldoc about that one; it may or may not help depending on the number of patterns, the pattern, the line, etc.
- The '//s' modifier can sometimes help, treating each line as a single line (or the entire file?).
- Try pre-compiling or inline compiling the expressions, if they are constant-ish: `my $regex = qr/<--some regex-->/; $line =~ $regex;`
- Or use the '//o' modifier so that a regex used inside a loop is compiled only once (this one may not help you as much).

Just some more ideas. They may or may not work, but they are nice goodies for the future if not.

`print map{chr}(45,45,104,97,124,124,116,97,45,45);` ... and I probably posted this while I was at work => whitepages.com | inc.
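The pre-compiling suggestion can be sketched as follows. The mapping, the @rules array, and the rename_line helper are hypothetical names chosen for this example; the point is only that each qr// object is built once, before the input loop, and then reused for every line.

```perl
use strict;
use warnings;

# Hypothetical mapping standing in for the one loaded from the map file.
my %mapper = ( alpha => 'a', beta => 'b' );

# Compile each pattern once, before any input is read.
# \Q...\E escapes the key so it is matched literally.
my @rules = map { [ qr/\b\Q$_\E\b/, $mapper{$_} ] } keys %mapper;

# Apply every precompiled pattern to one line.
sub rename_line {
    my ($line) = @_;
    for my $rule (@rules) {
        my ($pattern, $replacement) = @$rule;
        $line =~ s/$pattern/$replacement/g;
    }
    return $line;
}

print rename_line("alpha beta alphabet\n");   # prints "a b alphabet"
```

Whether this is measurably faster than interpolating the key on each iteration depends on the Perl version and the number of patterns, so it is worth checking with Benchmark on real data.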