Re: CPAN's URI.pm versus Japanese as Unicode?

I see two problems here. First, your source file is not declared as UTF-8 with use utf8;, which means that my $href="https://マリウス.com/"; actually gives you the byte string "https://\343\203\236\343\203\252\343\202\246\343\202\271.com/" rather than a string of Unicode characters. Second, URI encodes the host name with Punycode, which IMHO is one correct approach, as the URI documentation states that it works with URIs as per RFC 2396 and RFC 2732, which I think only support US-ASCII.
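To make the first point concrete, here is a minimal sketch (not from the original post) that compares the character string you get under use utf8; with its UTF-8 encoded bytes, which is what the literal holds without the pragma. It uses only core modules and assumes the script itself is saved as UTF-8; the variable names are just for illustration.

```perl
use strict;
use warnings;
use utf8;                # decode string literals in this file as characters
use Encode qw/encode/;

my $chars = "マリウス";                  # U+30DE U+30EA U+30A6 U+30B9
my $bytes = encode('UTF-8', $chars);     # same data as the literal without "use utf8"

# Render every byte as an octal escape, matching the notation used above.
my $octal = join '', map { sprintf '\\%03o', ord } split //, $bytes;

printf "characters: %d\n", length $chars;   # 4
printf "bytes:      %d\n", length $bytes;   # 12
print  "octal:      $octal\n";              # \343\203\236\343\203\252\343\202\246\343\202\271
```

Each of the four characters takes three bytes in UTF-8, so without the pragma URI sees twelve unrelated high-bit bytes instead of four Japanese characters.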

If you add the use utf8;, you get the output =xn--gckvb8fzb.com, which is the correct Punycode domain name of "マリウス.com" ("\x{30de}\x{30ea}\x{30a6}\x{30b9}.com").

What is unclear to me is what your goal is. Why do you (think you) need a URI object with Unicode characters in it?

Re^2: CPAN's URI.pm versus Japanese as Unicode? by mldvx4 (Hermit) on Dec 11, 2022 at 12:21 UTC
Thanks, though adding `use utf8` does not affect the result. Perhaps I need to convert from Punycode. Is there a module for converting from Punycode to Unicode? Working with the host names as Punycode is not really an option, as far as I can tell, because the host name needs to remain human-readable. The goal is to extract the host name from the URI, and the host name happens to be Japanese as Unicode, as is wont to happen.
Re^3: CPAN's URI.pm versus Japanese as Unicode? by haukex (Archbishop) on Dec 11, 2022 at 12:50 UTC
> Thanks, though adding use utf8 does not affect the result

Yes, it does.

> ... the host name needs to remain human-readable. The goal is to extract the host name from the URI and the host name happens to be Japanese as Unicode, ...

Corion already pointed you to Net::IDN::Encode as one possibility.

    use warnings;
    use strict;
    use utf8;
    use open qw/:std :encoding(UTF-8)/;
    use URI;
    use Net::IDN::Encode qw/domain_to_unicode/;

    my $href = "https://マリウス.com/";
    my $uri = URI->new($href);
    my $domain = domain_to_unicode($uri->host);
    print $domain, "\n"; # prints "マリウス.com"
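As a possible alternative that avoids the extra module, the URI documentation also describes an ihost method on server-based URIs, which returns the host in Unicode form. The following is only a sketch under that assumption: it was not part of the thread, and it requires a URI release that provides ihost (check perldoc URI for your installed version).

```perl
use strict;
use warnings;
use utf8;                              # the literal below contains Unicode characters
use open qw/:std :encoding(UTF-8)/;    # encode STDOUT as UTF-8
use URI;

my $uri = URI->new("https://マリウス.com/");

my $ascii_host   = $uri->host;     # ASCII (Punycode) form of the host
my $unicode_host = $uri->ihost;    # Unicode form, per the URI documentation

print "host:  $ascii_host\n";
print "ihost: $unicode_host\n";
```

Keeping the Punycode form for anything that goes over the wire and using the Unicode form only for display keeps the host name human-readable without changing what URI stores internally.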