URI encoding is a strange beast. Different parts of a web URL use different encoding schemes:
urischeme://my.domain.gov/myuripath?uriargument1&uriargument2
- The domain can have an optional, special encoding for Unicode (Punycode), for example for umlauts.
- The path can contain plus signs, which are "encoded" as literal plus signs, whereas spaces are encoded as the hex value %20.
- The arguments (after the question mark) encode spaces as plus signs, and the plus sign itself is encoded as the hex value %2B.
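To make the path/argument difference concrete, here is a minimal sketch. The helper names sketch_encode_path() and sketch_encode_query() are made up for this illustration and are not part of URI.pm; the set of "safe" characters is also my assumption and is more permissive than the one in the code further down.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper (not from URI.pm): encode a string for use in a URI path.
sub sketch_encode_path {
    my ($text) = @_;
    my $bytes = encode_utf8($text);
    # In a path, '+' and '/' stay literal; a space becomes %20.
    $bytes =~ s/([^A-Za-z0-9\-._~\/+])/sprintf('%%%02X', ord($1))/ge;
    return $bytes;
}

# Hypothetical helper (not from URI.pm): encode a string as a query argument.
sub sketch_encode_query {
    my ($text) = @_;
    my $bytes = encode_utf8($text);
    # A literal '+' is percent-encoded first (spaces are left alone here) ...
    $bytes =~ s/([^A-Za-z0-9\-._~ ])/sprintf('%%%02X', ord($1))/ge;
    # ... so that '+' can then safely stand for a space.
    $bytes =~ tr/ /+/;
    return $bytes;
}

print sketch_encode_path('a b+c'), "\n";   # a%20b+c
print sketch_encode_query('a b+c'), "\n";  # a+b%2Bc
```

The same input string ends up with two different encodings depending on where in the URL it is placed.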
I hope I didn't mix this up. It's still very early in the morning for me. Here is the relevant code from my own webserver, URI.pm, which seems to work reasonably well:
sub encode_uri($orig) {
my @oparts = split/\//, $orig;
my @eparts;
foreach my $opart (@oparts) {
push @eparts, encode_uri_part($opart);
}
return join('/', @eparts);
}
sub encode_uri_part($orig) {
$orig = encode_utf8($orig);
my $encoded = '';
my @parts = split//, $orig;
foreach my $part (@parts) {
if($part =~ /^[a-zA-Z0-9\:\~]/) {
$encoded .= $part;
}elsif($part eq ' ') {
$encoded .= '+';
} else {
$encoded .= '%' . uc(doFPad(sprintf("%x", ord($part)), 2));
}
}
return $encoded;
}
sub encode_uri_path($orig, $encodeslashes = 0) {
my $encoded = '';
my @parts = split//, $orig;
foreach my $part (@parts) {
if($part =~ /^[a-zA-Z0-9\/\:\~]/) {
if($encodeslashes && $part eq '/') {
$encoded .= '%' . uc(doFPad(sprintf("%x", ord($part)), 2));
} else {
$encoded .= $part;
}
}elsif($part eq ' ') {
$encoded .= '%20';
} else {
$encoded .= '%' . uc(doFPad(sprintf("%x", ord($part)), 2));
}
}
return $encoded;
}
sub decode_uri($orig) {
my @oparts = split/\//, $orig;
my @dparts;
foreach my $opart (@oparts) {
push @dparts, decode_uri_part($opart);
}
return join('/', @dparts);
}
sub decode_uri_part($orig) {
my $decoded = '';
return $decoded unless defined($orig);
my @parts = split//, $orig;
while(scalar @parts) {
my $part = shift @parts;
if($part eq '+') {
$decoded .= ' ';
} elsif($part eq '%') {
$decoded .= chr(hex(shift @parts) * 16 + hex(shift @parts));
} else {
$decoded .= $part;
}
}
return $decoded;
}
# This is similar to decode_uri_part, but treats the plus sign literally instead of as a space
sub decode_uri_path($orig) {
my $decoded = '';
return $decoded unless defined($orig);
my @parts = split//, $orig;
while(scalar @parts) {
my $part = shift @parts;
if($part eq '%') {
$decoded .= chr(hex(shift @parts) * 16 + hex(shift @parts));
} else {
$decoded .= $part;
}
}
return $decoded;
}
Note: the *_part() functions handle the arguments after the question mark; the *_path() functions handle the URI path.
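The decoding difference between the two families can be boiled down to one flag. The following is only a sketch under my own assumptions: sketch_decode() and its $plus_is_space parameter are hypothetical names, and unlike the character-by-character loops above it uses a regex that only touches well-formed %XX escapes and leaves anything else alone.

```perl
use strict;
use warnings;

# Hypothetical helper (not from URI.pm): decode percent-escapes, optionally
# treating '+' as a space (argument style) or as a literal plus (path style).
sub sketch_decode {
    my ($text, $plus_is_space) = @_;
    return '' unless defined $text;
    # The plus signs must be translated BEFORE %2B is turned back into '+'.
    $text =~ tr/+/ / if $plus_is_space;
    $text =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $text;
}

print sketch_decode('a+b%2Bc', 1), "\n";  # argument style: a b+c
print sketch_decode('a%20b+c', 0), "\n";  # path style:     a b+c
```

Both calls yield the same decoded string, which shows that the two encodings of "a b+c" are only interchangeable if you know which part of the URL they came from.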
Note 2: doFPad() just pads a string with leading zeros; it comes from Padding.pm.
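Since Padding.pm isn't shown here, this is a guess at what such a helper might look like. pad_zeros() is a hypothetical stand-in, not the real doFPad(); the last line shows that plain sprintf() can do the same job in a single call.

```perl
use strict;
use warnings;

# Hypothetical stand-in for doFPad(): left-pad $value with zeros up to $width.
# Values that are already long enough are returned unchanged.
sub pad_zeros {
    my ($value, $width) = @_;
    my $missing = $width - length($value);
    return $missing > 0 ? ('0' x $missing) . $value : $value;
}

# Same construction as in the encoder above, with the stand-in helper:
print '%' . uc(pad_zeros(sprintf('%x', ord("\n")), 2)), "\n";  # %0A

# Equivalent one-liner: %02X pads to two digits and uses upper-case hex.
print sprintf('%%%02X', ord("\n")), "\n";                      # %0A
```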
Note 3: Yes, this part of my codebase needs refactoring for speed and clarity. But it currently works for me and it's not used in any time-critical paths, so it's low priority ("never touch a running system").
Note 4: The different encoding schemes stem from the historical origins of the different parts: Domains need an encoding scheme that is compatible with older DNS servers. Paths need to be compatible with file systems. And arguments originate from HTML forms with a variety of character sets.