URI encoding is a strange beast. Different parts of a web URL use different encoding schemes:

urischeme://my.domain.gov/myuripath?uriargument1&uriargument2

  1. The domain can have an optional, special encoding for Unicode (Punycode), which matters especially for umlauts.
  2. In the path, a plus sign is a literal plus sign (it is "encoded" as itself), whereas a space is encoded as the hex value %20.
  3. In the arguments (after the question mark), a space is encoded as a plus sign, and a literal plus sign is encoded as the hex value %2B.
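
As a minimal sketch of rules 2 and 3, the following encodes the same string once as a path and once as a query argument. The helper names demo_encode_path and demo_encode_query are made up for this illustration and are not part of URI.pm; the input is assumed to be a byte string already, and the set of characters left unescaped is an assumption modelled on the conservative one in the code further down.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration only; $str is assumed to be bytes.

# Path rule: '+' stays a literal plus sign, a space becomes %20.
sub demo_encode_path {
    my ($str) = @_;
    $str =~ s/([^A-Za-z0-9\/:~+])/sprintf('%%%02X', ord($1))/ge;
    return $str;
}

# Query rule: a space becomes '+', so a literal '+' has to become %2B.
sub demo_encode_query {
    my ($str) = @_;
    # Escape everything outside the safe set, but leave spaces alone for now.
    $str =~ s/([^A-Za-z0-9:~ ])/sprintf('%%%02X', ord($1))/ge;
    # Only now turn the remaining spaces into plus signs.
    $str =~ tr/ /+/;
    return $str;
}

print demo_encode_path('a b+c'), "\n";   # a%20b+c
print demo_encode_query('a b+c'), "\n";  # a+b%2Bc
```

The order in demo_encode_query matters: the literal plus signs have to be escaped before the spaces are turned into plus signs, otherwise the two can no longer be told apart.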

I hope I didn't mix this up; it's still very early in the morning for me. Here is the relevant code from my own webserver, URI.pm, which seems to work reasonably well:

    sub encode_uri($orig) {
        my @oparts = split/\//, $orig;
        my @eparts;
        foreach my $opart (@oparts) {
            push @eparts, encode_uri_part($opart);
        }
        return join('/', @eparts);
    }

    sub encode_uri_part($orig) {
        $orig = encode_utf8($orig);
        my $encoded = '';
        my @parts = split//, $orig;
        foreach my $part (@parts) {
            if($part =~ /^[a-zA-Z0-9\:\~]/) {
                $encoded .= $part;
            } elsif($part eq ' ') {
                $encoded .= '+';
            } else {
                $encoded .= '%' . uc(doFPad(sprintf("%x", ord($part)), 2));
            }
        }
        return $encoded;
    }

    sub encode_uri_path($orig, $encodeslashes = 0) {
        my $encoded = '';
        my @parts = split//, $orig;
        foreach my $part (@parts) {
            if($part =~ /^[a-zA-Z0-9\/\:\~]/) {
                if($encodeslashes && $part eq '/') {
                    $encoded .= '%' . uc(doFPad(sprintf("%x", ord($part)), 2));
                } else {
                    $encoded .= $part;
                }
            } elsif($part eq ' ') {
                $encoded .= '%20';
            } else {
                $encoded .= '%' . uc(doFPad(sprintf("%x", ord($part)), 2));
            }
        }
        return $encoded;
    }

    sub decode_uri($orig) {
        my @oparts = split/\//, $orig;
        my @dparts;
        foreach my $opart (@oparts) {
            push @dparts, decode_uri_part($opart);
        }
        return join('/', @dparts);
    }

    sub decode_uri_part($orig) {
        my $decoded = '';
        return $decoded unless defined($orig);
        my @parts = split//, $orig;
        while(scalar @parts) {
            my $part = shift @parts;
            if($part eq '+') {
                $decoded .= ' ';
            } elsif($part eq '%') {
                $decoded .= chr(hex(shift @parts) * 16 + hex(shift @parts));
            } else {
                $decoded .= $part;
            }
        }
        return $decoded;
    }

    # This is similar to decode_uri_part, but treats the plus sign literally instead of as space
    sub decode_uri_path($orig) {
        my $decoded = '';
        return $decoded unless defined($orig);
        my @parts = split//, $orig;
        while(scalar @parts) {
            my $part = shift @parts;
            if($part eq '%') {
                $decoded .= chr(hex(shift @parts) * 16 + hex(shift @parts));
            } else {
                $decoded .= $part;
            }
        }
        return $decoded;
    }

Note: the *_part() functions handle the argument parts after the question mark; the *_path() functions handle the URI path.
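
The difference between the two families shows up most clearly when decoding. Here is a small standalone sketch (the helper names demo_decode_query and demo_decode_path are hypothetical, and this is a regex-based variant rather than the character loop above) that decodes the same input both ways:

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration; not the functions from URI.pm.

# Query-argument decoding: '+' means space, then %XX escapes are resolved.
sub demo_decode_query {
    my ($str) = @_;
    $str =~ tr/+/ /;
    $str =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $str;
}

# Path decoding: only %XX escapes are resolved, '+' stays a plus sign.
sub demo_decode_path {
    my ($str) = @_;
    $str =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $str;
}

print demo_decode_query('a+b%2Bc'), "\n";  # a b+c
print demo_decode_path('a+b%2Bc'), "\n";   # a+b+c
```

One design difference worth knowing: this sketch leaves a malformed escape (for example a trailing % with fewer than two hex digits after it) untouched, because the regex simply does not match it.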

Note 2: doFPad() just pads a string with leading zeros. It's from Padding.pm.
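
Padding.pm itself is not shown here, so the following is only a guess at what such a helper does, under the assumption that it takes a value and a target length. The name demo_zero_pad is made up for this sketch. It also shows that a single sprintf with a %02X format produces the same two-digit uppercase escape:

```perl
use strict;
use warnings;

# Assumed behaviour of a zero-padding helper: left-pad $val with '0'
# until it is at least $len characters long. This is NOT the real doFPad().
sub demo_zero_pad {
    my ($val, $len) = @_;
    return $val if length($val) >= $len;
    return ('0' x ($len - length($val))) . $val;
}

# Escaping a newline (byte value 10, hex 0A) both ways.
my $via_pad     = '%' . uc(demo_zero_pad(sprintf('%x', 10), 2));
my $via_sprintf = sprintf('%%%02X', 10);

print "$via_pad $via_sprintf\n";   # %0A %0A
```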

Note 3: Yes, this part of my codebase needs refactoring for speed and clarity. But it currently works for me and it's not used in any time-critical paths, so it's low priority ("never touch a running system").

Note 4: The different encoding schemes come from the origins of the different parts (at least historically): Domains need an encoding scheme that is compatible with older DNS servers. Paths need to be compatible with file systems. And arguments originate from HTML forms with a variety of character sets.


In reply to Re: why URL::Encode deliberately mistreat '+' by cavac
in thread why URL::Encode deliberately mistreat '+' by vincentaxhe
