scodes has asked for the wisdom of the Perl Monks concerning the following question:

Hey all.
Here is a small sample from a series of regex substitutions:
s/^\s+\(Text [aA].* (\d+:\d+ .*$)/\.T1 "$1/g;
s/^\s+\(Text [rR].* (\d+:\d+ .*$)/\.T2 "$1/g;

Here is also a hash that is heading in the right direction ... I think ;-)

%chartran = (
    agrave => "s/\\\xE0/\\\[agrave]/og",
    aacute => "s/\\\xE1/\\\[aacute]/og",
    acirc  => "s/\\\xE2/\\\[acirc]/og",
    auml   => "s/\\\xE4/\\\[auml]/og",
    Agrave => "s/\\\xC0/\\\[Agrave]/og",
    Aacute => "s/\\\xC1/\\\[Aacute]/og",
);
I cycle through this hash as follows. I know there is a simpler way to do this, but here is what I have so far:
while (<VRUN>) {
    foreach $testchar ( keys %chartran ) {
        if (eval ( "$chartran{$testchar}" )) {
            ... write out results ....
        }
    }
}

I know there must be a much simpler way to do all this. I looked at qr// and like the idea, and I think the solution involves an eval or the /ee modifier, but how, exactly? :)

Re: large hash of regex substitution strings
by ikegami (Patriarch) on Oct 06, 2007 at 01:11 UTC

    (Please put your code in <c>...</c> tags. It'll handle escaping the necessary characters and it'll place the line breaks for you.)

    There's a lot of needless work here. Perl code and regexes are being parsed and compiled over and over again. Also, there's no reason to use /o anymore. It does nothing more than complicate things.

    my %chartran = (
        "\xE1" => 'aacute',
        "\xE2" => 'acirc',
        "\xE4" => 'auml',
        "\xC0" => 'Agrave',
        "\xC1" => 'Aacute',
    );

    my $re = '[' . (join '', keys %chartran) . ']';
    $re = qr/$re/;

    while (<VRUN>) {
        s/($re)/$chartran{$1}/g;
        print;
    }
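A self-contained sketch of the same lookup-table idea, for readers who want something to run directly. It assumes the desired output is the `\[name]` form that the hash in the question was building, and the helper name `translate_chars` is made up for illustration.

```perl
use strict;
use warnings;

# Map each Latin-1 character to its name (same pairs as the question's hash).
my %chartran = (
    "\xE0" => 'agrave',
    "\xE1" => 'aacute',
    "\xE2" => 'acirc',
    "\xE4" => 'auml',
    "\xC0" => 'Agrave',
    "\xC1" => 'Aacute',
);

# Build one character class from the keys and compile it a single time.
my $class = join '', map { quotemeta($_) } keys %chartran;
my $re    = qr/[$class]/;

# Hypothetical helper: returns a copy of $text in which every mapped
# character is replaced by \[name]; unmapped characters are left alone.
sub translate_chars {
    my ($text) = @_;
    $text =~ s/($re)/\\[$chartran{$1}]/g;
    return $text;
}

print translate_chars("voil\xE0\n");
```

The pattern is compiled once, outside the loop, and the replacement is a plain hash lookup, so no string eval is involved.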

    Of course, you could simply use the encode_entities function from HTML::Entities (part of the HTML-Parser distribution on CPAN).

    use HTML::Entities qw( encode_entities );

    while (<VRUN>) {
        print encode_entities($_);
    }

      Hmmmph. I hadn't looked at HTML::Entities before. I'm already used to using CGI (or CGI::Pretty) and its escapeHTML function, which seems to do pretty much the same thing: take a string and substitute escaped HTML for the nonstandard characters. Is there an advantage to using HTML::Entities? Or is it just that it's a smaller standalone module?

      throop

        I never looked at CGI's escapeHTML, so I took a peek.

        escapeHTML/unescapeHTML only convert a few characters.
        That means you can't place Unicode characters in an iso-latin-1 document, only iso-latin-1 characters.
        That means all but a few entities won't be understood. For example, it's unable to unescape &eacute;, even if it maps to a character in the specified character set.

        HTML::Entities is familiar with all entities.
        HTML::Entities can numerically encode any range of characters.
        HTML::Entities can decode any range of characters.

        escapeHTML has some workarounds for browser issues and for &quot; being accidentally omitted from HTML 3.2.

      Thanks. What about the following example?
      s/^\s+\(Text [aA].* (\d+:\d+ .*$)/\.T1 "$1/g;
      I ask as I have about 50 regexes to work with.
      I could build this out like this:
      %search = (
          1 => "s/^\s+\(Text [aA].* (\d+:\d+ .*$)/",
          2 => .....
      );
      %replace = (
          1 => "/\.T1 \"$1/",
          2 => .....
      );
      Now I'd like to do something like this. I know qr// fits into the equation, I just don't know how ... yet :)
      while (<VRUN>) {
          s/$search/$replace/g;
      }
      Thanks for taking a further look at this. Thanks again.

        Is there a pattern between the different operations? If not, you might be stuck with

        my @ops = (
            sub { s/^\s+\(Text [aA].* (\d+:\d+ .*$)/\.T1 "$1/g; },
            ...
        );

        while (<$fh>) {
            foreach my $op (@ops) {
                $op->();
            }
        }

        The reason it can't be simplified much is the $1 in the replace expression. Often, when reaching this point, it's time to look into a templating system. It's hard to tell if that's the case here since I'm only getting a very small picture of what you are doing.
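If the operations do share a shape (one pattern, one replacement built from the captures), a table of pattern/closure pairs is one possible middle ground. This is only a sketch: the rule contents are copied from the question, while `@rules`, `apply_rules`, and the sample input are hypothetical names and data chosen for illustration.

```perl
use strict;
use warnings;

# Each rule pairs a precompiled pattern with a closure that builds the
# replacement text from the captured substring.
my @rules = (
    [ qr/^\s+\(Text [aA].* (\d+:\d+ .*$)/, sub { my ($rest) = @_; qq{.T1 "$rest} } ],
    [ qr/^\s+\(Text [rR].* (\d+:\d+ .*$)/, sub { my ($rest) = @_; qq{.T2 "$rest} } ],
);

# Hypothetical helper: apply every rule to one line and return the result.
sub apply_rules {
    my ($line) = @_;
    for my $rule (@rules) {
        my ($re, $build) = @$rule;
        # /e runs the closure call as code; $1 is passed in explicitly.
        $line =~ s/$re/$build->($1)/ge;
    }
    return $line;
}

print apply_rules(qq{   (Text A, see 12:30 Main hall\n});
```

Passing `$1` as an argument keeps each closure independent of the match variables, and each pattern is compiled only once by qr//.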

        Further to the above:
        %search = (
            R1 => "/^\s+\(Text [aA].* (\d+:\d+ .*$)/",
            R2 => .....
        );
        %replace = (
            R1 => "/\.T1 \"$1/",
            R2 => .....
        );
        while (<VRUN>) {
            foreach $rule ( keys %search ) {
                s/$search{$rule}/$replace{$rule}/g;
            }
        }
        My only concern is the substring match in R1/S1. Can I use qr// to make this more efficient, and if so, what is the correct syntax? Would qr// be required on both sides of the s///? Do I need to use an eval or an /ee modifier to get the substitution to happen? Thanks again all you p'gurus for your help on this :)
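A rough sketch of how the two-hash layout could be made to work, under some assumptions. qr// applies only to the pattern side; the replacement side is not a regex, so a replacement that mentions $1 has to be stored as Perl source text and evaluated at substitution time with /ee. The second R2 rule, the helper name `apply_string_rules`, and the sample input are illustrative guesses, not part of the original post. Because /ee performs a string eval, this is only reasonable when the replacement table comes from a trusted source.

```perl
use strict;
use warnings;

# Patterns are compiled once with qr//; only the search side needs it.
my %search = (
    R1 => qr/^\s+\(Text [aA].* (\d+:\d+ .*$)/,
    R2 => qr/^\s+\(Text [rR].* (\d+:\d+ .*$)/,
);

# Replacements are stored as Perl source for a double-quoted string.
# q{} keeps $1 from being interpolated while the hash is being built.
my %replace = (
    R1 => q{".T1 \"$1"},
    R2 => q{".T2 \"$1"},
);

# Hypothetical helper: apply each named rule to one line.
sub apply_string_rules {
    my ($line) = @_;
    for my $rule (sort keys %search) {
        # The first /e fetches the stored source text, the second /e
        # evals it, at which point $1 holds the current capture.
        $line =~ s/$search{$rule}/$replace{$rule}/gee;
    }
    return $line;
}

print apply_string_rules(qq{   (Text a 10:15 Room 2\n});
```

Sorting the keys fixes the order in which rules run, which plain `keys` does not guarantee.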