hi,

Some time ago a monk asked how to do s/// using a hash for key/value substitution.

Knowing the keys of the hash, it would be easy to write:

s/(key1|key2|key3)/$hash{$1}/g;
In a more generic way, the RE can be built using qr:
my $re = join '|', keys %dict;
$re = qr|($re)|;
s/$re/$hash{$1}/g;
With a CPAN module, you can:
use Regexp::Assemble;
my $ra = Regexp::Assemble->new;
$ra->add($_) for keys %dict;
which returns a smarter pattern. Or:
s/(\w+)/$dict{$1}||$1/ge;
(Note some differences between those approaches. The last one also "updates" the string when it shouldn't, by replacing a word with the same text. This can be a problem if I want to match 'bcd' in 'abcde': in this case s/// replaces abcde with abcde and moves on, so it never checks bcd!)

I looked in perlre to find out whether we can do assertions in a RE. (?{code}) is an "always true" statement [1], so it's a pity we can't use it to stop the match. But after some digging I found that (?(?{cond})|(?!)) fails if cond isn't true, so:

s/(\w+)(?(?{$dict{$1}})|(?!))/$dict{$1}/g;
This way there is no need to build the RE beforehand, and it still matches bcd inside abcde.
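To make the difference concrete, here is a small sketch comparing the last two approaches on the 'bcd' in 'abcde' case. The dictionary contents and variable names are made up for illustration, and %dict is declared with our so the code block inside the regex looks at a package variable.

```perl
use strict;
use warnings;

# Hypothetical dictionary; "our" keeps it a package variable so the
# regex code block does not depend on a captured lexical.
our %dict = (bcd => 'XYZ');

# Approach 1: match every word and fall back to the word itself.
(my $by_word = 'abcde') =~ s/(\w+)/$dict{$1} || $1/ge;

# Approach 2: let the match fail unless the captured text is a key,
# so the engine backtracks and tries shorter and later substrings.
(my $by_cond = 'abcde') =~ s/(\w+)(?(?{ $dict{$1} })|(?!))/$dict{$1}/g;

print "$by_word\n";   # abcde
print "$by_cond\n";   # aXYZe
```

The first form consumes the whole word in one go, so 'bcd' is never tried. The second form rejects 'abcde', 'abcd', 'abc', 'ab' and 'a' before it reaches 'bcd' starting at the second character.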

This seems to be a misleading (yet useful) feature. It means that (?{code}) isn't an "always true" assertion. But is it reliable enough to be used?

[1] It seems this will be changed in Perl 6.

Update: corrected a missing ) in the last RE.

Re: regex s/// using hash
by ikegami (Patriarch) on Oct 16, 2007 at 13:40 UTC
    • This seems to be a misleading (yet useful) feature. It means that (?{code}) isn't an "always true" assertion. But is it reliable enough to be used?

      In (?(cond)|...), cond is not a regexp. While the syntax is similar, (?{code}) has a different meaning there than it has in regexes.

      In regexes, (?{code}) always matches. The return value is stored in $^R.

      In (?(?{code})|...), (?{code})'s return value is used to determine which sub-regex to use.

    • But is it reliable enough to be used?

      Yes, it's reliable. The only catch is that these blocks are closures, so you can run into problems when you use lexical (my) variables declared outside of the regex in these blocks. I always use package (our) variables. If you have a lexical variable you don't want to convert, you could always create an alias to it.

      our %pkg_hash;
      local *pkg_hash = \%lex_hash;
    • You seem to be undecided about whether the keys are strings or regexes. Either way, your snippets are buggy. If they're strings, your 2nd and 3rd snippets should be

      my $re = join '|', map quotemeta, keys %dict;
      s/($re)/$hash{$1}/g;

      use Regexp::List qw( );
      my $re = Regexp::List->new()->list2re(keys %dict);
      s/$re/$hash{$1}/g;

      quotemeta converts strings to regexes, and Regexp::List works with strings (while Regexp::Assemble works on regexes).

      If the keys are regexes, your snippets won't work except for the simplest of regexes (those without any metachars), because you won't be able to look up the appropriate hash entry in the replacement expression.
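The distinction made in the first point can be checked with a short sketch. The strings and variable names here are invented for the demonstration; $seen is a package variable so it can be set from inside the regex.

```perl
use strict;
use warnings;

our $seen;

# A plain code block always matches, even when it returns a false value.
my $plain = 'abc' =~ /b(?{ 0 })c/ ? 1 : 0;

# Its return value lands in $^R, which a later code block in the same
# regex can read.
my $matched = 'abc' =~ /b(?{ 42 })c(?{ $seen = $^R })/ ? 1 : 0;

# Used as the condition of (?(cond)yes|no), the return value picks a
# branch instead. The "yes" branch here is empty, and the "no" branch
# is (?!), which always fails.
my $cond_true  = 'abc' =~ /b(?(?{ 1 })|(?!))c/ ? 1 : 0;
my $cond_false = 'abc' =~ /b(?(?{ 0 })|(?!))c/ ? 1 : 0;

print "plain=$plain seen=$seen true=$cond_true false=$cond_false\n";
# plain=1 seen=42 true=1 false=0
```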

      I'm not using (?(cond)|..) but (? (?{code}) | (?!) )

      I'm also a bit confused about the closures: since those regexps are used within the scope and are not returned, I can't see what side effect using our or my would have. What am I missing?

      Nice, I was looking for Regexp::List but I found Regexp::Assemble; I thought I was misremembering. Oha

        I'm not using (?(cond)|..) but (? (?{code}) | (?!) )

        Adding whitespace doesn't change what it is.

        I'm also a bit confused about the closures: since those regexps are used within the scope and are not returned, I can't see what side effect using our or my would have. What am I missing?

        A closure occurs when a sub persists longer than the scope in which it was created. The code in (?{code}), (??{code}) and (?(?{code})|..) is compiled into a sub. The regex persists beyond the scope in which it is created (to avoid needless recompiling of the regex), causing a closure.

        sub func_lex {
            my ($var) = @_;
            '' =~ /(?{ print("$var\n"); })/;
        }

        sub func_pkg {
            our ($var) = @_;
            '' =~ /(?{ print("$var\n"); })/;
        }

        func_lex('foo');  # foo
        func_lex('bar');  # foo !!!

        func_pkg('foo');  # foo
        func_pkg('bar');  # bar

        Update: Added func_pkg to example.
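The aliasing trick mentioned earlier can be wrapped in a sub, so that a lexical hash passed in by the caller is visible to the code block under a package name. This is only a sketch: replace_words and %alias_dict are hypothetical names, not something from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: substitute dictionary keys found anywhere in $text.
sub replace_words {
    my ($text, $lex_dict) = @_;   # $lex_dict is a reference to the caller's hash

    # Alias a package hash to the caller's hash for the duration of this
    # call, so the code block never touches a lexical variable.
    our %alias_dict;
    local *alias_dict = $lex_dict;

    $text =~ s/(\w+)(?(?{ exists $alias_dict{$1} })|(?!))/$alias_dict{$1}/g;
    return $text;
}

my %first  = (bcd => 'X');
my %second = (abc => 'Y');

my $out1 = replace_words('abcde', \%first);
my $out2 = replace_words('abcde', \%second);

print "$out1\n";   # aXe
print "$out2\n";   # Yde
```

Because the package hash is looked up at run time, the second call sees the second dictionary rather than whatever was in scope when the regex was first compiled.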

Re: regex s/// using hash
by moritz (Cardinal) on Oct 16, 2007 at 14:17 UTC
    my $re = join '|', keys %dict;

    It's worth noting that the keys will be interpreted as regexes, which is not always what you want. If you don't want that, use

    my $re = join '|', map { quotemeta $_ } keys %dict;
    instead.
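A quick way to see the effect of quotemeta is to use keys that contain metacharacters. The dictionary and the sample text below are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical dictionary whose keys contain regex metacharacters.
my %dict = ('a.b' => 'X', 'c+' => 'Y');

# Unescaped, the key 'a.b' also matches 'axb', which is not a key,
# so a later $dict{$1} lookup would come back undefined.
my $raw = join '|', keys %dict;
my $raw_matches_axb = 'axb' =~ /^(?:$raw)$/ ? 1 : 0;

# Escaped, each alternative only matches the literal key.
my $re = join '|', map { quotemeta $_ } keys %dict;
(my $text = 'axb a.b c+ cc') =~ s/($re)/$dict{$1}/g;

print "$raw_matches_axb\n";   # 1
print "$text\n";              # axb X Y cc
```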
Re: regex s/// using hash (heavy backtracking)
by lodin (Hermit) on Oct 16, 2007 at 20:37 UTC

    I used this feature to improve kyle's solution in Re^2: Finding a random position within a long string (Activeperl Build 822). The pattern then looked like

    s/(?(?{rand() >= 0.1})(?!))./?/g
    and proved to be a lot faster than having the conditional in the replacement part of s///.

    However, your pattern will be very slow due to backtracking. Most of the time the $dict{$1} lookup will fail and the regex engine will backtrack. Each word in the string (as matched by /\w+/) will result in

    length * (1 + length) / 2
    attempted matches. For a short question here on PerlMonks, with say 130 words and an average word length of 4, you'll get 1300 attempted matches against the string. As you see, it quickly gets quite heavy.

    lodin
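The count can be observed by incrementing a counter inside the condition. This is a rough sketch with invented names; the dictionary is left empty on purpose so that every lookup fails, and the exact number of attempts is ultimately up to the regex engine.

```perl
use strict;
use warnings;

our %dict     = ();   # empty on purpose: every lookup fails
our $attempts = 0;    # bumped each time the condition is evaluated

my $word = 'abcd';
(my $copy = $word) =~ s/(\w+)(?(?{ $attempts++; $dict{$1} })|(?!))/$dict{$1}/g;

my $n       = length $word;
my $formula = $n * ($n + 1) / 2;

printf "attempts=%d formula=%d\n", $attempts, $formula;
```

For a 4-letter word the formula gives 4 + 3 + 2 + 1 = 10 substrings, one failed lookup each, which is where the 1300 for 130 such words comes from.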