URI encoding is a strange beast. Different parts of a web URL use different encoding schemes:
urischeme://my.domain.gov/myuripath?uriargument1&uriargument2
Hope I didn't mix this up; it's still very early in the morning for me. Here is the relevant code from my own webserver, URI.pm, which seems to work reasonably well:
    sub encode_uri($orig) {
        my @oparts = split/\//, $orig;
        my @eparts;
        foreach my $opart (@oparts) {
            push @eparts, encode_uri_part($opart);
        }
        return join('/', @eparts);
    }

    sub encode_uri_part($orig) {
        $orig = encode_utf8($orig);
        my $encoded = '';
        my @parts = split//, $orig;
        foreach my $part (@parts) {
            if($part =~ /^[a-zA-Z0-9\:\~]/) {
                $encoded .= $part;
            } elsif($part eq ' ') {
                $encoded .= '+';
            } else {
                $encoded .= '%' . uc(doFPad(sprintf("%x", ord($part)), 2));
            }
        }
        return $encoded;
    }

    sub encode_uri_path($orig, $encodeslashes = 0) {
        my $encoded = '';
        my @parts = split//, $orig;
        foreach my $part (@parts) {
            if($part =~ /^[a-zA-Z0-9\/\:\~]/) {
                if($encodeslashes && $part eq '/') {
                    $encoded .= '%' . uc(doFPad(sprintf("%x", ord($part)), 2));
                } else {
                    $encoded .= $part;
                }
            } elsif($part eq ' ') {
                $encoded .= '%20';
            } else {
                $encoded .= '%' . uc(doFPad(sprintf("%x", ord($part)), 2));
            }
        }
        return $encoded;
    }

    sub decode_uri($orig) {
        my @oparts = split/\//, $orig;
        my @dparts;
        foreach my $opart (@oparts) {
            push @dparts, decode_uri_part($opart);
        }
        return join('/', @dparts);
    }

    sub decode_uri_part($orig) {
        my $decoded = '';
        return $decoded unless defined($orig);
        my @parts = split//, $orig;
        while(scalar @parts) {
            my $part = shift @parts;
            if($part eq '+') {
                $decoded .= ' ';
            } elsif($part eq '%') {
                $decoded .= chr(hex(shift @parts) * 16 + hex(shift @parts));
            } else {
                $decoded .= $part;
            }
        }
        return $decoded;
    }

    # This is similar to decode_uri_part, but treats the plus sign literally instead of as space
    sub decode_uri_path($orig) {
        my $decoded = '';
        return $decoded unless defined($orig);
        my @parts = split//, $orig;
        while(scalar @parts) {
            my $part = shift @parts;
            if($part eq '%') {
                $decoded .= chr(hex(shift @parts) * 16 + hex(shift @parts));
            } else {
                $decoded .= $part;
            }
        }
        return $decoded;
    }
Note: the *_part() functions handle the argument parts after the question mark; the *_path() functions handle the URI path.
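To make the difference concrete, here is a minimal, self-contained sketch of the two conventions. The helper names (query_escape, path_escape, query_unescape, path_unescape) are made up for this illustration and are not the functions from URI.pm; unlike encode_uri_path() above, the path helper here also runs encode_utf8() first, which is an assumption on my part.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper: escape one query-string value (space becomes '+').
sub query_escape {
    my ($text) = @_;
    my $bytes = encode_utf8($text);
    # Escape everything except the "safe" set and the space, then map space to '+'.
    $bytes =~ s/([^A-Za-z0-9:~ ])/sprintf('%%%02X', ord($1))/ge;
    $bytes =~ tr/ /+/;
    return $bytes;
}

# Hypothetical helper: escape a path (space becomes '%20', '/' is kept).
sub path_escape {
    my ($text) = @_;
    my $bytes = encode_utf8($text);
    $bytes =~ s/([^A-Za-z0-9\/:~])/sprintf('%%%02X', ord($1))/ge;
    return $bytes;
}

# Hypothetical helper: decode a query-string value ('+' means space).
sub query_unescape {
    my ($text) = @_;
    $text =~ tr/+/ /;    # must happen before %2B is turned back into '+'
    $text =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $text;
}

# Hypothetical helper: decode a path ('+' stays a literal plus sign).
sub path_unescape {
    my ($text) = @_;
    $text =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $text;
}

my $query = query_escape('a b+c');           # 'a+b%2Bc'
my $path  = path_escape('/my docs/a+b');     # '/my%20docs/a%2Bb'
my $utf8  = query_escape("caf\x{e9}");       # 'caf%C3%A9'

my $from_query = query_unescape('a+b%2Bc');  # 'a b+c'
my $from_path  = path_unescape('a+b%2Bc');   # 'a+b+c'
```

The same input string 'a+b%2Bc' decodes differently depending on which side of the question mark it came from, which is exactly why the '+' handling has to be split into two sets of functions.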
Note 2: doFPad() just pads a string with leading zeros; it's from Padding.pm.
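Since Padding.pm isn't shown here, the following is only a guess at what such a padding helper might look like. The name pad_with_zeros is hypothetical and stands in for doFPad(); the real implementation may differ. For the two-digit hex case, plain sprintf with a '%02X' format gives the same result in one step.

```perl
use strict;
use warnings;

# Hypothetical stand-in for doFPad(): left-pad $value with zeros up to $width.
sub pad_with_zeros {
    my ($value, $width) = @_;
    return $value if length($value) >= $width;
    return ('0' x ($width - length($value))) . $value;
}

# The way the encoder above builds a hex pair for a newline (byte 10):
my $padded = uc(pad_with_zeros(sprintf('%x', ord("\n")), 2));   # '0A'

# The same result straight from sprintf:
my $direct = sprintf('%02X', ord("\n"));                        # '0A'
```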
Note 3: Yes, this part of my codebase needs refactoring for speed and clarity. But it currently works for me and it's not used in any time-critical paths, so it's low priority ("never touch a running system").
Note 4: The different encoding schemes come from the origins of the different parts (at least historically): domains need an encoding scheme that is compatible with older DNS servers, paths need to be compatible with file systems, and arguments originate from HTML forms with a variety of character sets.
In reply to Re: why URL::Encode deliberately mistreat '+'
by cavac
in thread why URL::Encode deliberately mistreat '+'
by vincentaxhe