Hi Ken!
I tried to find a general solution to the problem reported in Re: uparse - Parse Unicode strings.
Short explanation of the problem:
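There are two basic ways to get correct Unicode input from the elements in @ARGV: decode them explicitly inside the script, or let perl decode them implicitly (for example with the -CA switch).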
A script that expects Unicode data from @ARGV cannot easily detect whether the implicit decoding is in effect, especially because -CAL makes the behaviour locale-dependent.
The best solution I could find is to check whether the data in question already carries the UTF8 flag. Encode::is_utf8 (or the equivalent utf8::is_utf8) tests this flag, which leads to a small modification to your script:
diff --git a/uparse b/uparse
index f5edb92..b05e12a 100755
--- a/uparse
+++ b/uparse
@@ -23,11 +23,11 @@ use constant {
     NO_PRINT => "\N{REPLACEMENT CHARACTER}",
 };
 
-use Encode 'decode';
+use Encode qw(decode is_utf8);
 use Unicode::UCD 'charinfo';
 
 for my $raw_str (@ARGV) {
-    my $str = decode('UTF-8', $raw_str);
+    my $str = is_utf8($raw_str) ? $raw_str : decode('UTF-8', $raw_str);
     print "\n", SEP1;
     print "String: '$str'\n";
     print SEP1;
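For anyone who wants to try the check outside of uparse, here is a minimal standalone sketch of the same idea. The helper name decode_arg and the output format are my own and purely illustrative; only decode and is_utf8 come from Encode. It assumes that arguments without the UTF8 flag are UTF-8 encoded bytes.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(decode is_utf8);

# Hypothetical helper (not part of uparse): return a character string
# regardless of whether perl has already decoded the argument.
sub decode_arg {
    my ($raw) = @_;

    # UTF8 flag set: assume the argument was already decoded implicitly
    # (e.g. via -CA or PERL_UNICODE) and pass it through unchanged.
    return $raw if is_utf8($raw);

    # No flag: assume UTF-8 encoded bytes and decode them explicitly.
    return decode('UTF-8', $raw);
}

binmode STDOUT, ':encoding(UTF-8)';

for my $arg (@ARGV) {
    my $str = decode_arg($arg);
    printf "'%s': %d character(s)\n", $str, length $str;
}
```

Saved under some name such as argcheck.pl (hypothetical), it should report the same character count for a non-ASCII argument with and without -CA, given a UTF-8 terminal.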
What do you think about this?
Greetings,
-jo
In reply to Decoding @ARGV [Was: uparse - Parse Unicode strings]
by jo37
in thread uparse - Parse Unicode strings
by kcott