Re: why URL::Encode deliberately mistreat '+'

URI Encoding is a strange beast. Different parts of Web-URLs have different encoding schemes:

urischeme://my.domain.gov/myuripath?uriargument1&uriargument2

The domain can use an optional, special encoding for Unicode characters such as umlauts.
The path can contain plus signs, which are "encoded" as literal plus signs, whereas spaces are encoded as the hex value %20.
The arguments (after the question mark) encode spaces as plus signs, and the plus sign itself is encoded as the hex value %2B.
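
To make the difference concrete, here is a minimal sketch of the two schemes. The helper names sketch_encode_path() and sketch_encode_query() are made up for this illustration and are not part of URI.pm; the sketch also assumes plain ASCII input, so it skips the UTF-8 step.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration only; not the functions from URI.pm.

# Path scheme: keep unreserved characters, '/' and '+';
# percent-encode everything else, so a space becomes %20.
sub sketch_encode_path {
    my ($s) = @_;
    $s =~ s/([^A-Za-z0-9\-._~\/+])/sprintf('%%%02X', ord($1))/ge;
    return $s;
}

# Query scheme: percent-encode everything except unreserved characters
# and spaces (so '+' becomes %2B), then turn the spaces into '+'.
sub sketch_encode_query {
    my ($s) = @_;
    $s =~ s/([^A-Za-z0-9\-._~ ])/sprintf('%%%02X', ord($1))/ge;
    $s =~ tr/ /+/;
    return $s;
}

print sketch_encode_path('/a b+c'), "\n";    # /a%20b+c
print sketch_encode_query('a b+c'), "\n";    # a+b%2Bc
```

The same input characters end up with opposite meanings: in the path the '+' survives and the space is hex-encoded, in the arguments it is the other way round.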

Hope I didn't mix this up. It's still very early in the morning for me. Here is the relevant code from my own webserver, URI.pm, which seems to work reasonably well:

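# Excerpt from URI.pm. It relies on subroutine signatures, on encode_utf8()
# from Encode, and on doFPad() from Padding.pm (see Note 2).
use feature 'signatures';
use Encode qw(encode_utf8);

sub encode_uri($orig) {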

    my @oparts = split/\//, $orig;
    my @eparts;
    foreach my $opart (@oparts) {
        push @eparts, encode_uri_part($opart);
    }

    return join('/', @eparts);
}

sub encode_uri_part($orig) {

    $orig = encode_utf8($orig);

    my $encoded = '';

    my @parts = split//, $orig;
    foreach my $part (@parts) {
        if($part =~ /^[a-zA-Z0-9\:\~]/) {
            $encoded .= $part;
        }elsif($part eq ' ') {
            $encoded .= '+';
        } else {
            $encoded .= '%' . uc(doFPad(sprintf("%x", ord($part)), 2));
        }
    }

    return $encoded;
}

sub encode_uri_path($orig, $encodeslashes = 0) {

    my $encoded = '';

    my @parts = split//, $orig;
    foreach my $part (@parts) {
        if($part =~ /^[a-zA-Z0-9\/\:\~]/) {
            if($encodeslashes && $part eq '/') {
                $encoded .= '%' . uc(doFPad(sprintf("%x", ord($part)), 2));
            } else {
                $encoded .= $part;
            }
        }elsif($part eq ' ') {
            $encoded .= '%20';
        } else {
            $encoded .= '%' . uc(doFPad(sprintf("%x", ord($part)), 2));
        }
    }

    return $encoded;
}

sub decode_uri($orig) {

    my @oparts = split/\//, $orig;
    my @dparts;
    foreach my $opart (@oparts) {
        push @dparts, decode_uri_part($opart);
    }

    return join('/', @dparts);
}

sub decode_uri_part($orig) {

    my $decoded = '';
    return $decoded unless defined($orig);
    my @parts = split//, $orig;
    while(scalar @parts) {
        my $part = shift @parts;
        if($part eq '+') {
            $decoded .= ' ';
        } elsif($part eq '%') {
            $decoded .= chr(hex(shift @parts) * 16 + hex(shift @parts));
        } else {
            $decoded .= $part;
        }
    }

    return $decoded;
}

# This is similar to decode_uri_part, but treats the plus sign literally
# instead of as a space
sub decode_uri_path($orig) {

    my $decoded = '';
    return $decoded unless defined($orig);
    my @parts = split//, $orig;
    while(scalar @parts) {
        my $part = shift @parts;
        if($part eq '%') {
            $decoded .= chr(hex(shift @parts) * 16 + hex(shift @parts));
        } else {
            $decoded .= $part;
        }
    }

    return $decoded;
}

Note: the *_part() functions handle the argument parts after the question mark; the *_path() functions handle the URI path.
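
As a rough illustration of why the two families have to stay separate, here is a regex-based sketch that decodes the same string both ways. The names sketch_decode_query() and sketch_decode_path() are hypothetical and only mirror the behaviour described above; they are not the code from URI.pm.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration only.

# Query-style decoding: '+' means space. The translation must happen
# before the percent-decoding, otherwise a decoded %2B would also be
# turned into a space.
sub sketch_decode_query {
    my ($s) = @_;
    $s =~ tr/+/ /;
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $s;
}

# Path-style decoding: '+' is an ordinary character.
sub sketch_decode_path {
    my ($s) = @_;
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $s;
}

print sketch_decode_query('a+b%2Bc'), "\n";    # a b+c
print sketch_decode_path('a+b%2Bc'), "\n";     # a+b+c
```

Running the wrong decoder on a component silently swaps spaces and plus signs, which is the kind of "mistreatment" this thread is about.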

Note 2: doFPad() just pads a string with leading zeros; it's from Padding.pm.
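
For readers without Padding.pm, a stand-in is easy to write. The pad_zeros() below is an assumption about what doFPad() does, based only on how it is called above; the real implementation may differ. The sketch also shows that sprintf's %02X format gives the same result in one step.

```perl
use strict;
use warnings;

# Stand-in for doFPad(): left-pad $val with zeros up to $len characters.
# This is a guess at the behaviour; the real code lives in Padding.pm.
sub pad_zeros {
    my ($val, $len) = @_;
    return $val if length($val) >= $len;
    return ('0' x ($len - length($val))) . $val;
}

# The way the listing builds an escape sequence ...
my $manual = '%' . uc(pad_zeros(sprintf('%x', ord("\n")), 2));

# ... and the equivalent using sprintf's own zero padding.
my $direct = sprintf('%%%02X', ord("\n"));

print "$manual $direct\n";    # %0A %0A
```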

Note 3: Yes, this part of my codebase needs refactoring for speed and clarity. But it currently works for me and it's not used in any time-critical paths, so it's low priority ("never touch a running system").

Note 4: The different encoding schemes come from the origins of the different parts (at least for historical reasons): domains need an encoding scheme that is compatible with older DNS servers. Paths need to be compatible with file systems. And arguments originate from HTML forms with a variety of character sets.

