in reply to Re: Hex regex fails in subroutine
in thread Hex regex fails in subroutine

It's UTF-8. The data I receive is from a REST API, so CP1252 characters arrive UTF-8 encoded. This data is going into a database that will be used for the web with UTF-8 encoding, so I need to replace these characters accordingly.
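
To make the situation concrete, here is a minimal sketch of how one such character looks in each encoding. It assumes the core Encode module and uses the right single quotation mark as the sample; `hex_bytes` is a hypothetical helper added only for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# RIGHT SINGLE QUOTATION MARK: byte 0x92 in CP1252, code point U+2019 in Unicode.
my $char = "\x{2019}";

# The same character as raw bytes in each encoding.
my $cp1252_bytes = encode('cp1252', $char);
my $utf8_bytes   = encode('UTF-8',  $char);

# Hypothetical helper: hex dump of a byte string, e.g. "E2 80 99".
sub hex_bytes {
    return join ' ', map { sprintf '%02X', ord($_) } split //, $_[0];
}

printf "CP1252: %s\n", hex_bytes($cp1252_bytes);    # CP1252: 92
printf "UTF-8:  %s\n", hex_bytes($utf8_bytes);      # UTF-8:  E2 80 99

# Decoding the UTF-8 bytes gives back the single code point U+2019.
my $decoded = decode('UTF-8', $utf8_bytes);
printf "Decoded: U+%04X\n", ord $decoded;           # Decoded: U+2019
```

So a regex written against the undecoded data has to match the three-byte sequence, while a regex run after decoding can match the code point directly.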

Re^3: Hex regex fails in subroutine
by NERDVANA (Priest) on Sep 30, 2023 at 07:16 UTC
    If the data is always UTF-8 encoded, it might save some effort to decode that first and then look for the problem characters?

    BTW, you can do all this in a single pass, if performance matters.

    sub convert_to_html_entities {
        my $str = shift;
        utf8::decode($str);
        $str =~ s/[\x{201A}-\x{2122}]/ '&#'.ord($&).';' /ger;
    }
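
    A usage sketch for the sub above, assuming Perl 5.14 or later (needed for the /r modifier) and input that really is UTF-8 encoded bytes. The sample strings are made up for illustration; the sub is repeated only so the snippet runs on its own.

```perl
use strict;
use warnings;

# Repeated from the post so this snippet is self-contained.
sub convert_to_html_entities {
    my $str = shift;
    utf8::decode($str);    # UTF-8 bytes -> characters, in place
    $str =~ s/[\x{201A}-\x{2122}]/ '&#'.ord($&).';' /ger;
}

# "wait" followed by the UTF-8 bytes for U+2026 HORIZONTAL ELLIPSIS.
my $ellipsis = convert_to_html_entities("wait\xE2\x80\xA6");
print "$ellipsis\n";    # prints wait&#8230;

# U+2019 sits just below the start of the range, so it is decoded but not replaced.
my $apostrophe = convert_to_html_entities("it\xE2\x80\x99s");
print "U+2019 was left alone\n" if $apostrophe eq "it\x{2019}s";
```

    The second call is worth a look when adapting the range: several characters that CP1252 maps into Unicode (for example U+2013, U+2014, U+2018 and U+2019) have code points below U+201A, so the range may need widening depending on which characters show up in the data.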

    You could even just wholesale replace all non-ASCII characters to completely sidestep the encoding problem:

    sub convert_nonascii_to_html_entities {
        my $str = shift;
        utf8::decode($str);
        $str =~ s/[^\x20-\x7E]/ '&#'.ord($&).';' /ger;
    }
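
    A small sketch of what this version does to a sample string, again assuming UTF-8 encoded input and Perl 5.14 or later. One thing to be aware of: the class [^\x20-\x7E] also matches control characters such as tab and newline, not only non-ASCII. The second sub, `convert_nonascii_keep_whitespace`, is a hypothetical variation (not from the post) that leaves tab, LF and CR untouched.

```perl
use strict;
use warnings;

# Repeated from the post so this snippet is self-contained.
sub convert_nonascii_to_html_entities {
    my $str = shift;
    utf8::decode($str);
    $str =~ s/[^\x20-\x7E]/ '&#'.ord($&).';' /ger;
}

# Hypothetical variation: same idea, but tab, LF and CR pass through unchanged.
sub convert_nonascii_keep_whitespace {
    my $str = shift;
    utf8::decode($str);
    $str =~ s/[^\t\n\r\x20-\x7E]/ '&#'.ord($&).';' /ger;
}

# "caf" + UTF-8 bytes for U+00E9, followed by a newline.
my $bytes = "caf\xC3\xA9\n";

print convert_nonascii_to_html_entities($bytes), "\n";    # prints caf&#233;&#10;
print convert_nonascii_keep_whitespace($bytes);           # prints caf&#233; and keeps the newline
```

    Numeric references for newlines are harmless in most HTML contexts, so which behaviour is preferable depends on how the stored text is used later.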

      See also haukex's article on dynamic regex alternations.


      Give a man a fish:  <%-{-{-{-<

        Definitely a useful technique, but a single character class should perform much faster than an alternation list. Of course, you could use that technique to build the character class.
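
        Building the set of characters dynamically might look roughly like this. It is only a sketch: the list of code points in `@code_points` and the sub name `entities_for_listed_chars` are made up for illustration, and the real list would come from whatever characters need replacing.

```perl
use strict;
use warnings;

# Hypothetical list of code points to replace (curly quotes, dashes, ellipsis).
my @code_points = (0x2018, 0x2019, 0x201C, 0x201D, 0x2013, 0x2014, 0x2026);

# Build one bracketed character class such as [\x{2018}\x{2019}...] and compile it once.
my $class = '[' . join('', map { sprintf '\\x{%04X}', $_ } @code_points) . ']';
my $re    = qr/$class/;

# Decode the UTF-8 bytes, then replace every listed character in a single pass.
sub entities_for_listed_chars {
    my $str = shift;
    utf8::decode($str);
    return $str =~ s/($re)/'&#' . ord($1) . ';'/ger;
}

# UTF-8 bytes for U+201C, "hi", U+201D.
print entities_for_listed_chars("\xE2\x80\x9Chi\xE2\x80\x9D"), "\n";    # prints &#8220;hi&#8221;
```

        Because the class is generated from the list, adding or removing a character only means editing `@code_points`, and the regex engine still sees a single character class rather than an alternation.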