Substitute some Unicodes with their escapes

jjmoka has asked for the wisdom of the Perl Monks concerning the following question:

I've received some old code which was running on Perl 5.8 but stopped to work on 5.10. (both old, so Unicode support is anyhow limited, ie. no use feature ':5.12') as preamble)

This code should substitute every Unicode/UTF-8 in received (very long) strings (UTF-8) with some escapes listed in a separate file. This 'dictionary' file is sourced into a hash where the keys are the UTF-8 char while the values are the escaped strings:

     ex. $_table{'Ö'}; # gives '&Ouml;'
[download]

A builtin ? is returned for the Unicodes missing in that hash. The current implementation goes with the approach to convert the strings into bytes:

my $bytes = pack( "C*", unpack( "U0C*", $$sgml_r ));
[download]

move into the bytes world

   use bytes;
    ....         #all will happen wrapped here
   no bytes;
[download]

In this context, it searches for not ASCII bytes (out of the range space/tilde)

   $bytes =~ s/([^\ -\~]+.*)/$self->_fixup($1)/ego;
[download]

the _fixup function will perform a check on the length of the non-ASCII sequence (5,4,3,2 bytes) (yes, 6 not considered), looking up in the hash and RECURSIVELY goes through the remaining sequence of bytes. (All the used functions length/substr/concatenation(.) are then occurring in bytes context) I could try to see what exactly went wrong between v5.8 and v5.10, but I'm wondering if this approach isn't right in the first place. There are some good ideas probably, but I spot many steps which are considered and documented as bad practice when working with Unicode in Perl. I've successfully tested a simple solution which just checks every single Unicode character:

    $$sgml_r =~ s/(.)/$self->_mapchar($1)/eg;
[download]

where _mapchar is the function which performs the lookup with conversion for non-ASCII chars

sub _mapchar {
   my ($char) = @_;
   if ( $char !~ /[\r\n\s]/) {
        my $nbytes = length encode_utf8($char);
        if ($nbytes > 1) {
            $char = exists $_table{$char} ? $_table{$char} : '?';
        }
   }
   return $char;
}
[download]

Apart from further testing it, this solution is going to check EVERY char which really doesn't seem good practice too. The rate of strings to process is high and each isn't also that short. Moreover every string could not even have a Unicode out of the ASCII space, or maybe just a few.

Both solutions don't seem fine. Interested to any Perlish consideration from the experts, to better figure out what surely avoid

Comment on Substitute some Unicodes with their escapes Select or Download Code

Replies are listed 'Best First'.
Re: Substitute some Unicodes with their escapes by Corion (Patriarch) on Jun 09, 2020 at 13:56 UTC
Have you looked at HTML::Entities, which has `$encoded = encode_entities($input, '^\n\x20-\x25\x27-\x7e');` [download] ... which should be more or less what your table does. If you have other characters that should not be encoded, you can add them to the above range. Update: See also HTML::HTML5::Entities, which even allows you to bring your own entity tables.	[reply] [d/l]
Re^2: Substitute some Unicodes with their escapes by jjmoka (Beadle) on Jun 09, 2020 at 14:57 UTC
That's another good idea. Thank you	[reply]
Re^2: Substitute some Unicodes with their escapes by Anonymous Monk on Jun 09, 2020 at 15:05 UTC
Superficial glances at the source-code of each module indicates that both of them in their own way know about "&Ouml." I think you'll be in business with either one of them.	[reply]
Re: Substitute some Unicodes with their escapes by haukex (Archbishop) on Jun 09, 2020 at 11:07 UTC
Two things that are still unclear to me from your description are whether you've got `use utf8;` for code like `$_table{'Ö'}` (or if that was just an example here), and whether your strings containing Unicode are properly decoded or not - please show one of your strings using `Dump` from Devel::Peek, and show an SSCCE that reproduces your situation. In any case, a regex that matches non-ASCII characters (or bytes, depending on whether your string is decoded or not) is `/([^\x00-\x7F])/`. Minor edits.	[reply] [d/l] [select]
Re^2: Substitute some Unicodes with their escapes by jjmoka (Beadle) on Jun 09, 2020 at 11:31 UTC
Yes `$_table{'Ö'}` is an example. There is no `use utf8;`. No identifiers neither Unicode strings are builtin into the module. The hash is sourced via this function: `my %_table = (); sub load_map { open( UTF8, "<:encoding(utf8)", 'tab.bin' ) \|\| die "can't open t +ab.bin : $!"; while( <UTF8> ) { chomp; my $offset = index($_,' '); my $bin = substr($_,$offset+1); my $esc = substr($_,0,$offset); $_table{$bin} = $esc; } close( UTF8 ); }` [download] A `Dump` of a (short) received string is this: `SV = PVMG(0x12fca50) at 0x134f698 REFCNT = 4 FLAGS = (PADMY,POK,pPOK,UTF8) IV = 0 NV = 0 PV = 0x1361200 " <entry>R\303\226CHLING EN +GINEERING PLASTICS (UK) LIMITED</entry>"\0 [UTF8 " + <entry>R\x{d6}CHLING ENGINEERING PLASTICS (UK) LIMITED</entry +>"] CUR = 85 LEN = 88` [download] Going to think about that SSCCE. Thanks UPDATE: here the SSCCE: use strict; use utf8; # used only for this SSCCE to set scalar $SGML at line 4 +4 use Devel::Peek qw (Dump); use Encode qw(encode_utf8); binmode(STDOUT, ":utf8"); my %_table = (); # ----------------------------------- sub load_map { while( <DATA> ) { chomp; my $offset = index($_,' '); my $bin = substr($_,$offset+1); my $esc = substr($_,0,$offset); $_table{$bin} = $esc; } } # ----------------------------------- sub _mapchar { my ($char) = @_; if ( $char !~ /[\r\n\s]/) { my $nbytes = length encode_utf8($char); if ($nbytes > 1) { $char = exists $_table{$char} ? $_table{$char} : '?'; } } return $char; } # ----------------------------------- sub escapeUTF8 { my ( $sgml_r) = @_; Dump $$sgml_r; $$sgml_r =~ s/(.)/_mapchar($1)/eg; } load_map(); my $SGML='RÖCHLING'; print "1: $SGML\n"; escapeUTF8(\$SGML); print "2: $SGML\n"; __DATA__ &dollar; $ Ö Ö » » ~ [download] it works, but still the regex is on every char	[reply] [d/l] [select]
Re^3: Substitute some Unicodes with their escapes by choroba (Cardinal) on Jun 09, 2020 at 13:53 UTC
As noted before, you don't need to encode each character. Also, you can build a regex that matches all the keys of the hash, than you don't need to call any subroutine from /e. This way, you only replace the characters you know how to replace, so it will work even for the $ which is ASCII 36. `my $chars = join "", keys %_table; sub escapeUTF8 # Now needs a better name! { my ( $sgml_r) = @_; $$sgml_r =~ s/([\Q$chars\E])/$_table{$1}/g; }` [download] `map{substr$_->[0],$_->[1]\|\|0,1}[\\|\|{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^ARGV,3]`	[reply] [d/l] [select]
Re^4: Substitute some Unicodes with their escapes by jjmoka (Beadle) on Jun 09, 2020 at 14:54 UTC
Re^3: Substitute some Unicodes with their escapes by haukex (Archbishop) on Jun 09, 2020 at 23:49 UTC
UPDATE: here the SSCCE Thank you very much, that's very helpful! `use utf8; # used only for this SSCCE to set scalar $SGML at line 44` Just to nitpick this comment: the pragma is also necessary so that the `DATA` section is read as UTF-8 as well. it works, but still the regex is on every char Yes, that's true. There are a couple of different approaches on how to solve this - you could use the modules that Corion suggested (but that would replace the entire functionality of the code you inherited; you'd have to be sure that there isn't any tricky legacy behavior that you need to preserve), you could build a regex dynamically to match only those characters that have an entry in the hash (but in the root node you said "A builtin ? is returned for the Unicodes missing in that hash."), or my approach to answering this question so far has been to preserve as much of the original behavior as makes sense while still modernizing a bit. To that end, the regex that I suggested seems to work fine on this small bit of sample data. Also, note that in this case, the whole `if length encode_utf8($char) > 1` logic isn't needed, because in UTF-8, the bytes `0x00`-`0x7F` map 1:1 to ASCII and are always single bytes, while any characters `>= 0x80` are guaranteed to be multibyte. `if ( $char !~ /[\r\n\s]/ )` Note you have to be careful with this one: under Unicode matching rules, `\s` will match Unicode whitespace characters as well, so for example if you were to have a table entry ` ` , because of this regex it wouldn't be applied! You probably want the `/a` modifier, and the regex could be simplified to just `\s`. However, because `[^\x00-\x7F]` only matches on non-ASCII characters anyway, the `$char !~ /\s/a` test will always be true anyway, and so it can be omitted as well. In fact, in the below code I've inlined the entire `sub _mapchar`. By the way, in the root node you said you're using the bytes pragma, note that its documentation says "Use of this module for anything other than debugging purposes is strongly discouraged." Read more... code (2 kB)	[reply] [d/l] [select]
Re^4: Substitute some Unicodes with their escapes by jjmoka (Beadle) on Jun 10, 2020 at 00:16 UTC