I've received some old code which was running on Perl 5.8 but stopped working on 5.10 (both are old, so Unicode support is limited in any case, i.e. no use feature ':5.12'; preamble).

This code should substitute every non-ASCII Unicode character in the received (very long) UTF-8 strings with an escape listed in a separate file. This 'dictionary' file is sourced into a hash whose keys are the UTF-8 characters and whose values are the escape strings:

ex. $_table{'Ö'}; # gives the escape string for 'Ö', e.g. '&Ouml;'
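For context, %_table is built from that file at start-up, roughly along these lines (a minimal sketch only: the real file format isn't shown in the code I have, so the tab-separated layout and the $dict_file name are just assumptions for illustration):

use strict;
use warnings;

our %_table;
my $dict_file = 'escapes.txt';   # hypothetical file name

# Hypothetical loader: one "character <TAB> escape" pair per line.
open my $fh, '<:encoding(UTF-8)', $dict_file
    or die "Can't open $dict_file: $!";
while ( my $line = <$fh> ) {
    chomp $line;
    my ( $char, $escape ) = split /\t/, $line, 2;
    $_table{$char} = $escape if defined $escape;
}
close $fh;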
A literal '?' is returned for Unicode characters missing from that hash. The current implementation takes the approach of converting the strings into bytes:
my $bytes = pack( "C*", unpack( "U0C*", $$sgml_r ));   # copy of the string as raw UTF-8 octets
and then moves into the bytes world:

use bytes;
# ... everything else happens wrapped in here ...
no bytes;
In this context it searches for non-ASCII bytes (outside the space-to-tilde range):
$bytes =~ s/([^\ -\~]+.*)/$self->_fixup($1)/ego;
The _fixup function checks the length of the non-ASCII sequence (5, 4, 3 or 2 bytes; yes, 6-byte sequences are not considered), looks it up in the hash, and then RECURSIVELY works through the remaining sequence of bytes. (All the functions used there, length/substr/concatenation (.), are therefore operating in bytes context.)

I could try to find out exactly what changed between v5.8 and v5.10, but I'm wondering whether this approach is right in the first place. There are probably some good ideas in it, but I can spot several steps which are considered and documented as bad practice when working with Unicode in Perl.

I've successfully tested a simple solution which just checks every single character:
$$sgml_r =~ s/(.)/$self->_mapchar($1)/eg;
where _mapchar is the function which performs the lookup and conversion for non-ASCII characters:
use Encode qw(encode_utf8);

sub _mapchar {
    my ($self, $char) = @_;    # called as a method in the s///e above

    if ( $char !~ /[\r\n\s]/ ) {                 # leave whitespace alone
        my $nbytes = length encode_utf8($char);  # octets needed for this char
        if ( $nbytes > 1 ) {                     # non-ASCII: map it, or give '?'
            $char = exists $_table{$char} ? $_table{$char} : '?';
        }
    }
    return $char;
}
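Just to illustrate the intended behaviour (the 'Ö' entry is the hypothetical one from above):

# ASCII and whitespace come back untouched; multi-byte characters are
# looked up in %_table, or turned into '?' when missing.
print $self->_mapchar('a');    # prints 'a'
print $self->_mapchar('Ö');    # prints whatever $_table{'Ö'} holds, else '?'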
Apart from needing further testing, this solution checks EVERY character, which really doesn't seem like good practice either. The rate of strings to process is high, and each string isn't that short. Moreover, a given string may contain no characters outside the ASCII range at all, or only a few.
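One refinement that comes to mind is to restrict the match itself to non-ASCII characters, so the callback never fires in the common all-ASCII case; a minimal, untested sketch (same _mapchar as above):

# Only characters above \x7F reach _mapchar at all; a pure-ASCII string
# is scanned once by the regex engine and otherwise left alone.
$$sgml_r =~ s/([^\x00-\x7F])/$self->_mapchar($1)/eg;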

Neither solution seems right to me. I'd be interested in any Perlish considerations from the experts, to better figure out what to definitely avoid.


In reply to Substitute some Unicodes with their escapes by jjmoka
