Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello fellow monks, I wrote a script to update variable names in another piece of code to follow new naming rules. It does the following:
read in a mapping table; read in an input file; replace some variable names with new ones (regex); write the result to an output file.
The code looks like:
open(MAP, "<$new_name_map_file");
while (<MAP>) {
    chomp;
    tr/A-Z/a-z/;
    @map_line = split(/\t/);
    $mapper{$map_line[0]} = $map_line[1];
}
close(MAP);

open(IN, "<input_file");
open(OUT, ">input_file.new");
while (<IN>) {
    print "%";    # progress indicator on the terminal
    tr/A-Z/a-z/;
    foreach $key (sort keys %mapper) {
        s/\b$key\b/$mapper{$key}/g;
    }
    print OUT "$_";
}
close(IN);
close(OUT);
I have ~15000 lines in the input file and ~8000 lines (entries) in the new naming mapping file. My script takes hours to finish. Please let me know if you know a better way to speed the code up. Please assume I can't buy any hardware :-( From a poor monk.

Replies are listed 'Best First'.
Re: How to speed up multiple regex in a loop for a big data?
by Corion (Patriarch) on May 25, 2006 at 06:23 UTC

    Your code contains a problem if you have "overlap" in your source or target variable names. Instead of looping over your replacements per line, you should create one large regular expression that will search and replace the names in one go, to avoid circles/sequences of replacements:

    my $re = join '\b|\b', reverse keys %mapper;
    while (<IN>) {
        s/\b($re)\b/$mapper{$1}/gei;
    }

    There are modules for conveniently building such regular expressions in a more optimal way, like Regex::PreSuf.
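    For instance (a sketch, assuming Regex::PreSuf's presuf() function, which builds one prefix/suffix-optimized pattern from a list of words):

    use Regex::PreSuf;

    # presuf() returns a single pattern string covering all the keys;
    # anchor it with \b on both sides as before.
    my $re = presuf(keys %mapper);
    while (<IN>) {
        s/\b($re)\b/$mapper{$1}/g;
    }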

      I thought the \b ... \b pairs would do the trick, since they match word boundaries. FYI, my mapping file has no redundant entries. Please correct me if I am wrong.

        You're correct with the \b word boundaries. The problem case I am thinking of is the following renaming setup:

        %mapper = (
            foo => 'zap',
            zap => 'foo',
        );

        Here, foo will first be replaced by "zap" in your loop, and a later iteration then replaces that "zap" with "foo" again. But if that can't happen, all you'll gain with the large regex is speed ;)
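        A minimal demonstration of the hazard (variable names made up for illustration):

        my %mapper = ( foo => 'zap', zap => 'foo' );

        # Looping applies the rules one after another:
        my $line = 'foo';
        $line =~ s/\b$_\b/$mapper{$_}/g for sort keys %mapper;
        print $line;    # prints "foo" - substituted twice, net effect lost

        # One combined pass replaces each name at most once:
        $line = 'foo';
        my $re = join '|', keys %mapper;
        $line =~ s/\b($re)\b/$mapper{$1}/g;
        print $line;    # prints "zap"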

Re: How to speed up multiple regex in a loop for a big data?
by salva (Canon) on May 25, 2006 at 08:19 UTC
    you can try generating a subroutine on the fly to perform all the substitutions:
    open(MAP, "<$new_name_map_file");
    while (<MAP>) {
        chomp;
        tr/A-Z/a-z/;
        @map_line = split(/\t/);
        $mapper{$map_line[0]} = $map_line[1];
    }
    close(MAP);

    my $sub = "sub { ";
    for my $name (sort keys %mapper) {
        my $qname = quotemeta $name;
        my $qrepl = quotemeta $mapper{$name};
        # note the \\b: the generated code must contain the regex \b,
        # not a literal backspace character
        $sub .= "s{\\b$qname\\b}{$qrepl}g; ";
    }
    $sub .= "}";
    $sub = eval $sub;
    die $@ if $@;

    open(IN, "<input_file");
    open(OUT, ">input_file.new");
    while (<IN>) {
        print "%";
        tr/A-Z/a-z/;
        $sub->();
        print OUT "$_";
    }
    close(IN);
    close(OUT);
    this way, the regular expressions are compiled just once, and both the inner loop and the sort are moved out of the while loop.
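    For illustration, with a two-entry mapping (names hypothetical) the generated source would look roughly like:

    sub {
        s{\bold_total\b}{new_total}g;
        s{\bold_count\b}{new_count}g;
    }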
      Isn't the cost of calling a sub significantly higher than running the same code in the loop? It's always been my policy to eliminate subs where possible when absolute speed is required. (But I could be wrong.)
        well, calling subs in Perl is not as expensive as people usually think...
        use Benchmark 'cmpthese';

        my $a = 'foo bar doz' x 100;
        $a .= ' hello ' . $a;

        my $sub = sub {
            /\bhello\b/; /\bhello\b/; /\bhello\b/; /\bhello\b/;
            /\bhello\b/; /\bhello\b/; /\bhello\b/; /\bhello\b/;
        };

        cmpthese(-3, {
            loop => sub {
                for (($a) x 10) {
                    for my $i (1 .. 8) { /\bhello\b/; }
                }
            },
            sub => sub {
                for (($a) x 10) { $sub->() }
            },
            inline => sub {
                for (($a) x 10) {
                    /\bhello\b/; /\bhello\b/; /\bhello\b/; /\bhello\b/;
                    /\bhello\b/; /\bhello\b/; /\bhello\b/; /\bhello\b/;
                }
            },
        });
        outputs...
                  Rate   loop    sub inline
        loop    4157/s     --    -5%   -16%
        sub     4363/s     5%     --   -12%
        inline  4943/s    19%    13%     --
        and anyway, it's easy to remove the subroutine call from the loop by just moving the loop inside the sub:
        open(MAP, "<$new_name_map_file");
        while (<MAP>) {
            chomp;
            tr/A-Z/a-z/;
            @map_line = split(/\t/);
            $mapper{$map_line[0]} = $map_line[1];
        }
        close(MAP);

        my $sub = <<'EOS';
        sub {
            while (<IN>) {
                print "%";
                tr/A-Z/a-z/;
        EOS
        for my $name (sort keys %mapper) {
            my $qname = quotemeta $name;
            my $qrepl = quotemeta $mapper{$name};
            $sub .= "s{\\b$qname\\b}{$qrepl}g;\n";
        }
        $sub .= <<'EOS';
                print OUT $_;
            }
        }
        EOS
        $sub = eval $sub;
        die $@ if $@;

        open(IN, "<input_file");
        open(OUT, ">input_file.new");
        $sub->();
        close(IN);
        close(OUT);
Re: How to speed up multiple regex in a loop for a big data?
by Samy_rio (Vicar) on May 25, 2006 at 06:17 UTC

    Hi, if I understood your question correctly, try this and see the benchmark:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Benchmark 'cmpthese';

    cmpthese(-1, {
        method1 => '&yours',
        method2 => '&new_one',
    });

    sub yours {
        my %mapper;
        open(MAP, "<map.ini");
        while (<MAP>) {
            chomp;
            tr/A-Z/a-z/;
            my @map_line = split(/\t/);
            $mapper{$map_line[0]} = $map_line[1];
        }
        close(MAP);
        open(IN, "<input1.txt");
        open(OUT, ">input_new.txt");
        while (<IN>) {
            #print "%";
            tr/A-Z/a-z/;
            foreach my $key (sort keys %mapper) {
                s/\b$key\b/$mapper{$key}/g;
            }
            print OUT "$_";
        }
        close(IN);
        close(OUT);
    }

    sub new_one {
        open(MAP, "map.ini");
        my $map = do { local $/; <MAP> };
        close(MAP);
        $map = lc($map);
        my %mapper = map { split /\t/, $_ } split(/\n/, $map);
        open(IN, "input1.txt");
        my $input = do { local $/; <IN> };
        close(IN);
        open(OUT, ">input_new.txt");
        $input = lc($input);
        foreach my $key (sort keys %mapper) {
            $input =~ s/\b$key\b/$mapper{$key}/g;
        }
        print OUT $input;
        close(OUT);
    }

    __END__
               Rate method1 method2
    method1   908/s      --    -13%
    method2  1050/s     16%      --

    I have checked with a file with only a small number of lines; you should check with your real input.

    Regards,
    Velusamy R.


    eval"print uc\"\\c$_\""for split'','j)@,/6%@0%2,`e@3!-9v2)/@|6%,53!-9@2~j';

Re: How to speed up multiple regex in a loop for a big data?
by cdarke (Prior) on May 25, 2006 at 07:03 UTC
    One thing that leaps out at me is that you are sorting the keys of %mapper each time you read a record from input_file. I don't know how big a key is, but it might be worth sorting the keys just once before you read input_file; since you don't appear to alter the hash, this should be safe. e.g.
    my @map_keys = sort keys %mapper;
    ...
    foreach $key (@map_keys) {
    ...
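    Applied to the original code, a minimal sketch of that change:

    # sort once, before the read loop
    my @map_keys = sort keys %mapper;

    while (<IN>) {
        print "%";
        tr/A-Z/a-z/;
        foreach $key (@map_keys) {
            s/\b$key\b/$mapper{$key}/g;
        }
        print OUT "$_";
    }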
      Come to think of it, why are you sorting the keys?
        Hey, you're right!

        I was about to give an example of a case where partial keys get substituted, but then I realized the keys are wrapped in word boundaries, so there's no chance of partial key substitution, and no reason to sort.

        Drop the sort and save some time.
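        A quick illustration of why the boundaries prevent partial substitution (variable names made up):

        my $line = 'my $foo_bar = 1;';

        # \b does not match between "foo" and "_", because "_" is a
        # word character, so a shorter key cannot clobber part of a
        # longer name:
        $line =~ s/\bfoo\b/zap/g;
        print $line;    # still "my $foo_bar = 1;"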
        You are right about the unnecessary sorting. I copied that part over from somewhere and wasn't aware of what I was doing...
Re: How to speed up multiple regex in a loop for a big data?
by wazzuteke (Hermit) on May 25, 2006 at 17:22 UTC
    I don't think I saw this in any of the comments, but some ideas may be:

  • study each line. Take a look at the perldoc about that one; it may or may not help depending on the number of patterns, the pattern, the line, etc...
  • The '//s' modifier can sometimes help; treating each line as a single line (or the entire file???)
  • Try pre-compiling or inline compiling the expressions:
    • my $regex = qr/<--some regex-->/; $line =~ $regex; if they are constant-ish (see the sketch below)
    • Or use the '//o' modifier so a regex used in a loop is compiled only once. (this one may not help you as much)
    Just some more ideas. They may or may not work, but they are nice goodies for the future if not.
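    A minimal sketch of the qr// idea applied to the mapping table, assuming the per-key loop is kept:

    # Precompile each key's pattern once, instead of recompiling
    # "\b$key\b" for every line of the input file:
    my %compiled = map { $_ => qr/\b\Q$_\E\b/ } keys %mapper;

    while (<IN>) {
        tr/A-Z/a-z/;
        for my $key (keys %mapper) {
            s/$compiled{$key}/$mapper{$key}/g;
        }
        print OUT $_;
    }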

    print map{chr}(45,45,104,97,124,124,116,97,45,45);
    ... and I probably posted this while I was at work => whitepages.com | inc.