UPDATE: here the SSCCE

Thank you very much, that's very helpful!

use utf8; # used only for this SSCCE to set scalar $SGML at line 44

Just to nitpick this comment: the pragma is also necessary so that the DATA section is read as UTF-8 as well.

it works, but still the regex is on every char

Yes, that's true. There are a few different ways to solve this. You could use the modules that Corion suggested, but that would replace the entire functionality of the code you inherited, so you'd have to be sure that there isn't any tricky legacy behavior that you need to preserve. You could build a regex dynamically to match only those characters that have an entry in the hash, but in the root node you said "A builtin ? is returned for the Unicodes missing in that hash." My approach to answering this question so far has been to preserve as much of the original behavior as makes sense while still modernizing a bit.
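For illustration, here is a minimal sketch of the "build a regex dynamically" idea. The table contents and the names %table and escape_known are made up for this example, and note that it leaves characters without a table entry untouched instead of turning them into ?, so it is not a drop-in replacement for the behavior described in the root node.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real table: character => entity
my %table = (
    "\x{D6}" => '&Ouml;',
    "\x{BB}" => '&raquo;',
    "\x{A0}" => '&nbsp;',
);

# Build one character class from the hash keys. quotemeta guards
# against keys that would be special inside a bracketed class.
my $class = join '', map { quotemeta } sort keys %table;
my $known = qr/[$class]/;

sub escape_known {
    my ($text) = @_;
    # Only characters that have a table entry are ever matched
    $text =~ s/($known)/$table{$1}/g;
    return $text;
}

print escape_known("R\x{D6}CHLING\x{A0}"), "\n";
```

The regex is compiled once, so the per-character hash lookup only happens for characters that are known to be in the table.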

To that end, the regex that I suggested seems to work fine on this small bit of sample data. Also, note that in this case, the whole if length encode_utf8($char) > 1 logic isn't needed, because in UTF-8, the bytes 0x00-0x7F map 1:1 to ASCII and are always single bytes, while any characters >= 0x80 are guaranteed to be multibyte.
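A quick way to convince yourself of the single-byte versus multibyte boundary is to encode a few code points and count the bytes. This is only a sketch; utf8_len is a helper name invented for this example.

```perl
use strict;
use warnings;
use Encode qw/encode_utf8/;

# Number of bytes a single character occupies once encoded as UTF-8
sub utf8_len { return length encode_utf8( $_[0] ) }

# Code points on both sides of the 0x7F / 0x80 boundary
for my $cp ( 0x24, 0x7F, 0x80, 0xD6, 0x20AC ) {
    printf "U+%04X -> %d byte(s)\n", $cp, utf8_len( chr $cp );
}
```

Everything up to U+007F reports one byte and everything from U+0080 upwards reports two or more, which is why [^\x00-\x7F] selects the same characters as the length test.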

if ( $char !~ /[\r\n\s]/ )

Note you have to be careful with this one: under Unicode matching rules, \s will match Unicode whitespace characters as well, so for example if you were to have a table entry for the non-breaking space (&nbsp;, U+00A0), because of this regex it wouldn't be applied! You probably want the /a modifier, and the regex could be simplified to just \s, since \s already covers \r and \n. However, because [^\x00-\x7F] only matches non-ASCII characters, the $char !~ /\s/a test will always be true, and so it can be omitted as well. In fact, in the code below I've inlined the entire sub _mapchar.
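The difference between the two matching modes can be checked directly. In this sketch the string is explicitly upgraded with utf8::upgrade, on the assumption that this mirrors text that was decoded from UTF-8 input; the variable names are made up for the example, and the /a modifier needs Perl 5.14 or later.

```perl
use strict;
use warnings;

# U+00A0 NO-BREAK SPACE, upgraded so that it behaves like a string
# decoded from UTF-8 input (Unicode matching rules apply).
my $nbsp = "\x{A0}";
utf8::upgrade($nbsp);

my $unicode_ws = $nbsp =~ /\s/  ? 1 : 0;   # Unicode rules: NBSP counts as whitespace
my $ascii_ws   = $nbsp =~ /\s/a ? 1 : 0;   # /a: only ASCII whitespace counts

print "matches \\s under Unicode rules: $unicode_ws\n";
print "matches \\s with /a:             $ascii_ws\n";
```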

By the way, in the root node you said you're using the bytes pragma. Note that its documentation says "Use of this module for anything other than debugging purposes is strongly discouraged."

use warnings;
use strict;
use utf8;
use open qw/:std :utf8/;
use Devel::Peek qw/Dump/;
use Data::Dump;

my %_table;
sub load_map {
    while (<DATA>) {
        chomp;
        my ($esc,$bin) = split / /, $_, 2;
        $_table{$bin} = $esc;
    }
}

sub escapeUTF8 {
    my ($sgml_r) = @_;
    $$sgml_r =~ s{([^\x00-\x7F])}
        { exists $_table{$1} ? $_table{$1} : '?' }eg;
}

load_map();
dd \%_table;

use Test::More tests=>1;
my $SGML = "RÖCHLING\x{A0}";  # ends with U+00A0 NO-BREAK SPACE
Dump $SGML;
escapeUTF8(\$SGML);
is $SGML, 'R&Ouml;CHLING&nbsp;';

__DATA__
&dollar; $
&Ouml; Ö
&raquo; »
&nbsp;

Output:

1..1
{
  "\$"   => "&dollar;",
  "\xA0" => "&nbsp;",
  "\xBB" => "&raquo;",
  "\xD6" => "&Ouml;",
}
SV = PV(0x12fca50) at 0x134f698
  REFCNT = 1
  FLAGS = (POK,IsCOW,pPOK,UTF8)
  PV = 0x1361200 "R\303\226CHLING\302\240"\0 [UTF8 "R\x{d6}CHLING\x{a0}"]
  CUR = 11
  LEN = 13
  COW_REFCNT = 1
ok 1

In reply to Re^3: Substitute some Unicodes with their escapes by haukex
in thread Substitute some Unicodes with their escapes by jjmoka
