jjmoka has asked for the wisdom of the Perl Monks concerning the following question:
This code should substitute every non-ASCII Unicode character in received (very long) UTF-8 strings with an escape listed in a separate file. This 'dictionary' file is sourced into a hash where the keys are the UTF-8 characters and the values are the escaped strings, e.g.:

    $_table{'Ö'};   # gives 'Ö'

A built-in '?' is returned for characters missing from that hash. The current implementation takes the approach of converting the strings into bytes:

    my $bytes = pack( "C*", unpack( "U0C*", $$sgml_r ));

It then moves into the bytes world:

    use bytes;
    ....            # everything happens wrapped in here
    no bytes;

In this context, it searches for non-ASCII bytes (outside the range space to tilde):

    $bytes =~ s/([^\ -\~]+.*)/$self->_fixup($1)/ego;

The _fixup function checks the length of the non-ASCII sequence (5, 4, 3, or 2 bytes; yes, 6 is not considered), looks it up in the hash, and RECURSIVELY goes through the remaining sequence of bytes. (All the functions used, length, substr, and concatenation (.), therefore operate in bytes context.)

I could try to see what exactly went wrong between v5.8 and v5.10, but I'm wondering whether this approach is right in the first place. There are probably some good ideas in it, but I spot many steps that are documented as bad practice when working with Unicode in Perl. I've successfully tested a simple solution which just checks every single character:

    $$sgml_r =~ s/(.)/$self->_mapchar($1)/eg;

where _mapchar is the function which performs the lookup and conversion for non-ASCII characters:

    sub _mapchar {
        my ($self, $char) = @_;    # invoked as a method, so $self comes first
        if ( $char !~ /[\r\n\s]/ ) {
            # a character that needs more than one UTF-8 byte is non-ASCII
            my $nbytes = length encode_utf8($char);
            if ( $nbytes > 1 ) {
                $char = exists $_table{$char} ? $_table{$char} : '?';
            }
        }
        return $char;
    }

Apart from needing further testing, this solution checks EVERY character, which doesn't seem like good practice either. The rate of strings to process is high, and each string is far from short. Moreover, a given string may contain no characters outside the ASCII range at all, or only a few.
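One way to avoid calling a function for every character is to let the regex engine skip the ASCII part and match only code points above 0x7F. The following is a minimal sketch, not the module's actual code: `%table` and `escape_non_ascii` are hypothetical names, the two table entries are made-up examples, and the input is assumed to be a decoded character string rather than raw UTF-8 bytes.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the dictionary hash (the real one is read from a file).
my %table = (
    "\x{D6}" => '&Ouml;',
    "\x{E9}" => '&eacute;',
);

# Replace each non-ASCII character with its escape, or '?' if it is not in the table.
# Assumes $str holds decoded characters, not UTF-8 bytes.
sub escape_non_ascii {
    my ($str) = @_;
    # Only code points above 0x7F match, so ASCII-only strings are left untouched.
    $str =~ s/([^\x00-\x7F])/exists $table{$1} ? $table{$1} : '?'/ge;
    return $str;
}

print escape_non_ascii("K\x{D6}ln \x{263A}"), "\n";   # prints: K&Ouml;ln ?
```

Because the character class excludes all of ASCII, the replacement code runs only for the few characters that need it, and whitespace and line endings never reach the lookup.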
Neither solution seems right. I'm interested in any Perlish considerations from the experts, to better figure out what to avoid for sure.
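As a point of comparison with the pack/unpack and `use bytes` steps, the usual alternative is to decode once at the input boundary and work with characters from then on. This sketch makes several assumptions that are not in the post: the dictionary is taken to be one "character, TAB, escape" pair per line, its lines are taken to be decoded already (for example read through an `<:encoding(UTF-8)` layer), and `parse_table` and `to_characters` are hypothetical helper names.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Build the lookup hash from already-decoded dictionary lines.
# Assumed line format (not from the post): "<character>\t<escape>".
sub parse_table {
    my (@lines) = @_;
    my %table;
    for my $line (@lines) {
        chomp $line;
        my ($char, $escape) = split /\t/, $line, 2;
        next unless defined $escape;    # skip lines without a TAB
        $table{$char} = $escape;
    }
    return \%table;
}

# Turn raw UTF-8 bytes into a character string, once, at the input boundary.
sub to_characters {
    my ($bytes) = @_;
    return decode('UTF-8', $bytes);
}
```

After decoding, `length` and `substr` count characters, so a multi-byte sequence such as "\xC3\x96" becomes a single character that can be used directly as a hash key, with no need to work out sequence lengths by hand.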
Re: Substitute some Unicodes with their escapes
by Corion (Patriarch) on Jun 09, 2020 at 13:56 UTC
by jjmoka (Beadle) on Jun 09, 2020 at 14:57 UTC
by Anonymous Monk on Jun 09, 2020 at 15:05 UTC

Re: Substitute some Unicodes with their escapes
by haukex (Archbishop) on Jun 09, 2020 at 11:07 UTC
by jjmoka (Beadle) on Jun 09, 2020 at 11:31 UTC
by choroba (Cardinal) on Jun 09, 2020 at 13:53 UTC
by jjmoka (Beadle) on Jun 09, 2020 at 14:54 UTC
by haukex (Archbishop) on Jun 09, 2020 at 23:49 UTC
by jjmoka (Beadle) on Jun 10, 2020 at 00:16 UTC