vincentaxhe has asked for the wisdom of the Perl Monks concerning the following question:

use URL::Encode 'url_decode_utf8';
use utf8::all;
$string = '%E6%88%91%E7%88%B1c++';
print url_decode_utf8("$string");

It ends up converting '+' to a space. In '/usr/share/perl5/site_perl/URL/Encode/PP.pm' there is this code:

...
$EncodeMap{"\x20"} = '+';
...
$s =~ y/+/\x20/;

It seems intended to protect '+', but it fails to convert it back. Commenting out those two lines, on the other hand, works fine.

Why all this?

Re: why URL::Encode deliberately mistreat '+'
by cavac (Prior) on Jun 28, 2024 at 05:39 UTC

    URI Encoding is a strange beast. Different parts of Web-URLs have different encoding schemes:

    urischeme://my.domain.gov/myuripath?uriargument1&uriargument2

    1. The domain can have optional, special encoding for Unicode, especially Umlauts.
    2. The path can have plus signs, that are "encoded" as plus signs, whereas spaces are encoded as hex value %20
    3. The arguments (after the question mark) encode spaces as plus signs, and the plus sign is encoded as hex value %2B
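
    To make the difference between points 2 and 3 concrete, here is a minimal sketch using only core modules. The helper names (percent_encode, encode_for_path, encode_for_query) and the sets of characters left unescaped are assumptions made for illustration; they are not taken from URL::Encode or from any other module mentioned in this thread.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper: percent-encode every byte that is not listed in
# $keep ($keep is the inside of a regex bracket expression).
sub percent_encode {
    my ($text, $keep) = @_;
    my $bytes = encode_utf8($text);
    $bytes =~ s/([^$keep])/sprintf('%%%02X', ord($1))/ge;
    return $bytes;
}

# Path segment: '+' stays literal, a space becomes %20.
sub encode_for_path {
    my ($text) = @_;
    return percent_encode($text, 'A-Za-z0-9._~+-');
}

# Query argument (form style): '+' becomes %2B, a space becomes '+'.
sub encode_for_query {
    my ($text) = @_;
    my $encoded = percent_encode($text, 'A-Za-z0-9._~ -');
    $encoded =~ tr/ /+/;    # spaces were kept above, now turn them into '+'
    return $encoded;
}

print encode_for_path('a b+c'),  "\n";   # a%20b+c
print encode_for_query('a b+c'), "\n";   # a+b%2Bc
```

    The same input therefore has two different encoded forms, and a decoder has to know which part of the URL it is looking at to reverse the right one.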

    Hope I didn't mix this up. It's still very early in the morning for me. Here is the relevant code from my own webserver, URI.pm, which seems to work reasonably well:

    sub encode_uri($orig) {
        my @oparts = split/\//, $orig;
        my @eparts;
        foreach my $opart (@oparts) {
            push @eparts, encode_uri_part($opart);
        }
        return join('/', @eparts);
    }

    sub encode_uri_part($orig) {
        $orig = encode_utf8($orig);
        my $encoded = '';
        my @parts = split//, $orig;
        foreach my $part (@parts) {
            if($part =~ /^[a-zA-Z0-9\:\~]/) {
                $encoded .= $part;
            }elsif($part eq ' ') {
                $encoded .= '+';
            } else {
                $encoded .= '%' . uc(doFPad(sprintf("%x", ord($part)), 2));
            }
        }
        return $encoded;
    }

    sub encode_uri_path($orig, $encodeslashes = 0) {
        my $encoded = '';
        my @parts = split//, $orig;
        foreach my $part (@parts) {
            if($part =~ /^[a-zA-Z0-9\/\:\~]/) {
                if($encodeslashes && $part eq '/') {
                    $encoded .= '%' . uc(doFPad(sprintf("%x", ord($part)), 2));
                } else {
                    $encoded .= $part;
                }
            }elsif($part eq ' ') {
                $encoded .= '%20';
            } else {
                $encoded .= '%' . uc(doFPad(sprintf("%x", ord($part)), 2));
            }
        }
        return $encoded;
    }

    sub decode_uri($orig) {
        my @oparts = split/\//, $orig;
        my @dparts;
        foreach my $opart (@oparts) {
            push @dparts, decode_uri_part($opart);
        }
        return join('/', @dparts);
    }

    sub decode_uri_part($orig) {
        my $decoded = '';
        return $decoded unless defined($orig);
        my @parts = split//, $orig;
        while(scalar @parts) {
            my $part = shift @parts;
            if($part eq '+') {
                $decoded .= ' ';
            } elsif($part eq '%') {
                $decoded .= chr(hex(shift @parts) * 16 + hex(shift @parts));
            } else {
                $decoded .= $part;
            }
        }
        return $decoded;
    }

    # This is similar to decode_uri_part, but treats the plus sign literally instead of as space
    sub decode_uri_path($orig) {
        my $decoded = '';
        return $decoded unless defined($orig);
        my @parts = split//, $orig;
        while(scalar @parts) {
            my $part = shift @parts;
            if($part eq '%') {
                $decoded .= chr(hex(shift @parts) * 16 + hex(shift @parts));
            } else {
                $decoded .= $part;
            }
        }
        return $decoded;
    }

    Note: the *_part() functions handle the argument parts after the question mark; the *_path() functions handle the URI path.

    Note 2: The doFPad() just pads a string with leading zeros, it's from Padding.pm

    Note 3: Yes, this part of my codebase needs refactoring for speed and clarity. But it currently works for me and it's not used in any time-critical paths, so it's low priority ("never touch a running system").

    Note 4: The different encoding schemes come from the origins of the different parts (at least, those are the historical reasons): Domains need an encoding scheme that is compatible with older DNS servers. Paths need to be compatible with file systems. And arguments originate from HTML forms with a variety of character sets.

Re: why URL::Encode deliberately mistreat '+'
by ikegami (Patriarch) on Jun 28, 2024 at 18:33 UTC

    «+» is a reserved character.

    Of those, the spec says

    If data for a URI component would conflict with a reserved character's purpose as a delimiter, then the conflicting data must be percent-encoded before the URI is formed.

    Historically, «+» has been used as an encoding for spaces in the query portion of HTTP URIs. As such, it needs to be encoded in the query portion of HTTP URIs if nothing else.

    For example, when you submit a query of «c++» to Google using Firefox and Chrome, they encode the URL as https://www.google.com/search?q=c%2B%2B&....

    URL::Encode doesn't know whether its input is a path component (where «+» isn't special) or a query component (where it is), but it's usually used for query components, so it treats «+» as the encoding of a space, and thus decodes «c++» into «c␠␠».
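
    As a rough illustration of that decoding rule, here is a minimal sketch using only core modules. It is not URL::Encode's actual code; decode_query_value is a hypothetical name, and the sketch only assumes the query-style convention described above ('+' means space, %XX is a byte, bytes are UTF-8).

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper mirroring query-style decoding:
# turn '+' into a space first, then expand %XX escapes, then decode UTF-8.
sub decode_query_value {
    my ($encoded) = @_;
    (my $bytes = $encoded) =~ tr/+/ /;
    $bytes =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return decode('UTF-8', $bytes);
}

binmode STDOUT, ':encoding(UTF-8)';

# Unescaped '+' in the input: both plus signs come back as spaces.
print '[', decode_query_value('%E6%88%91%E7%88%B1c++'), "]\n";      # [我爱c  ]

# '+' escaped as %2B in the input: the plus signs survive decoding.
print '[', decode_query_value('%E6%88%91%E7%88%B1c%2B%2B'), "]\n";  # [我爱c++]
```

    Under this convention the string from the original question is simply not the encoded form of «我爱c++»; the encoded form carries %2B where a literal plus sign is meant.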

    URI, being more context aware, can provide a more appropriate decoding.

    $ perl -e'
       use v5.14;
       use URI qw( );
       my $uri = URI->new( "https://example.com/a+b?c+d=e+f" );
       say for $uri->path, $uri->query_form;
    '
    /a+b
    c d
    e f