r_ibsen has asked for the wisdom of the Perl Monks concerning the following question:

I am just starting out with Perl and am reading this code for obtaining the parameters from an HTTP request:
1.  read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'});
2.  @pairs = split(/&/, $buffer);
3.  foreach $pair (@pairs) {
4.      ($name, $value) = split(/=/, $pair);
5.      $value =~ tr/+/ /;
6.      $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
7.      $value =~ s/\s/ /g;
8.      $value =~ s/<([^>]|\n)*>//g;
9.      $value =~ s/<//g;
10.     $value =~ s/>//g;
11.     $FORM{$name} = $value;
12. }
Lines 1-4 I understand. I tried to look up the tr in line 5 in some documentation. It stated that tr/// is a translator and identical to y///. Looking up y/// I was informed that it is identical to tr///. For a novice this doesn't add much clarity ;-) Could somebody please explain what is going on in lines 5-11, i.e. what is the purpose of each of the patterns? I've figured out that the pack-thing in line 6 creates some sort of binary structure, but what info does it contain and what's the purpose? What are the origins of the $FORM variable? Any answers will be gratefully accepted :-)

Re: Explanation of regexps for obtaining POST params
by Corion (Patriarch) on Jul 04, 2002 at 20:50 UTC

    First of all, it's generally a bad idea to parse CGI yourself; use the CGI module via

    use CGI;
    my $req = CGI->new();
    print $req->header();
    print "<H1>Hello ";
    print $req->param('name'), "</H1>";

    But your question isn't so much about CGI as it is about regular expressions. Let's take a look at each line:

    5. $value =~ tr/+/ /;

    The tr operator translates one character at a time into another character. Here, it replaces every + with a space (undoing what the browser did before the parameter was sent to you).
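
    As a small sketch of what line 5 does (the sample string and the variable names $count and $same are made up for illustration, not taken from your script):

```perl
use strict;
use warnings;

# Hypothetical form value as it might arrive from a browser.
my $value = 'Perl+Monks+rule';

# tr returns the number of characters it translated.
my $count = ($value =~ tr/+/ /);

print "$value ($count replaced)\n";   # Perl Monks rule (2 replaced)

# y/// is only another spelling of the same operator.
(my $same = 'a+b') =~ y/+/ /;
print "$same\n";                      # a b
```

    Note that tr works on single characters only; nothing in it is a regular expression.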

    6. $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex ($1))/eg;

    Here, all other characters that were urlencoded are decoded. Characters that are unsuitable for URLs, like \0, newlines and other stuff, get encoded via %xx, where xx is the hexadecimal value of the character. The regular expression replaces every percent sign that is followed by two characters out of the set A-Fa-f0-9 (the hexadecimal digits) with the character that has the value of the number given by the hexadecimal digits. %00 would be the encoding for the character \0, a single newline has the number 10 and would be encoded as %0A.
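
    To see what the "pack-thing" produces, here is a minimal sketch with a made-up input string. For values 0 to 255, pack("C", $n) simply yields the one-character string whose character code is $n, so nothing more exotic than a single character is being built:

```perl
use strict;
use warnings;

# Hypothetical urlencoded value: "a&b=c" followed by a newline.
my $value = 'a%26b%3Dc%0A';

# hex("26") is 38, and pack("C", 38) is the one-character string "&".
$value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;

print $value;                                          # a&b=c plus a newline
print join(',', map { ord } split //, $value), "\n";   # 97,38,98,61,99,10
```

    The /e flag makes Perl evaluate the replacement as code, and /g repeats this for every %xx in the string.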

    7. $value =~ s/\s/ /g;

    Here, every whitespace character is converted to a blank. This is not necessarily a good idea and might come as a surprise: if you sent arbitrary urlencoded characters through the RE in line 6, such as a tab (value 9, sent as %09), they have now been replaced by a space.
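
    A quick illustration with a made-up value containing a tab and a CR/LF pair. Each whitespace character becomes exactly one space, so the two-character line ending turns into two spaces:

```perl
use strict;
use warnings;

my $value = "one\ttwo\r\nthree";   # hypothetical decoded value: a tab and CRLF
$value =~ s/\s/ /g;                # each whitespace character becomes one space
print "[$value]\n";                # [one two  three]
```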

    8. $value =~ s/<([^>]|\n)*>//g;

    Here, it looks like the processor is trying to strip all HTML tags from the values. The regular expression matches an opening bracket <, followed by any number of characters that are either "not a closing bracket >" or a newline, followed by the closing bracket >. The |\n alternative is redundant, because the character class [^>] already matches a newline.
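
    The following sketch runs line 8 on a made-up string, compares it with the same pattern minus the "|\n" alternative, and then applies lines 9 and 10. The names $with and $without are only for illustration:

```perl
use strict;
use warnings;

my $value = "Hello <b\nclass=x>world</b> 1 < 2";   # hypothetical input

(my $with    = $value) =~ s/<([^>]|\n)*>//g;   # line 8 as written
(my $without = $value) =~ s/<[^>]*>//g;        # same pattern minus "|\n"

print "$with\n";                                     # Hello world 1 < 2
print $with eq $without ? "same\n" : "different\n";  # same

# Lines 9 and 10 then delete any bracket that was not part of a tag.
$with =~ s/<//g;
$with =~ s/>//g;
print "$with\n";                                     # Hello world 1  2
```

    As the last line shows, a harmless "<" in ordinary text is lost as well, which is part of why this kind of stripping may not be what you want.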

    9. $value =~ s/<//g;

    Now, to be extra sure, all opening brackets are removed as well.

    10. $value =~ s/>//g;

    As are all closing brackets.

    11. $FORM{$name} = $value;

    And here, the %FORM hash is populated with the $name => $value pair. If you don't know about hashes, they are also called associative arrays, dictionaries or lookup tables, and if none of these words make sense, they are like arrays, except that the index is not a number but a string.
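
    A minimal sketch of how such a hash gets filled and read back, using made-up pairs. The explicit "my %FORM;" and the limit of 2 on split are my additions and are not in your script:

```perl
use strict;
use warnings;

my %FORM;                                # declared up front under "use strict"

# Hypothetical, already decoded name/value pairs.
for my $pair ('name=Ibsen', 'lang=Perl') {
    my ($name, $value) = split /=/, $pair, 2;   # limit 2 keeps any "=" inside the value
    $FORM{$name} = $value;
}

print "$FORM{name} writes $FORM{lang}\n";       # Ibsen writes Perl
print join(',', sort keys %FORM), "\n";         # lang,name
```

    Note the sigils: the whole hash is %FORM, while a single element is written $FORM{name}.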

    Except for the HTML stripping, which might or might not be what you wanted, CGI.pm does the decoding of CGI parameters already and is certainly worth a look.

    perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The
    $d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider
    ($c = $d->accept())->get_request(); $c->send_response( new #in the
    HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web
Re: Explanation of regexps for obtaining POST params
by screamingeagle (Curate) on Jul 04, 2002 at 20:39 UTC
    1) %FORM is a hash. Your example never declares it, so Perl creates it as a package global the first time it is used; under use strict it would have to be declared before line 3 as my %FORM;
    2) The tr/// operator (i.e. the transliteration operator) is used to search for and replace individual characters
    quoting from Perldoc :
    tr/SEARCHLIST/REPLACEMENTLIST/cds
    y/SEARCHLIST/REPLACEMENTLIST/cds

    Transliterates all occurrences of the characters found in the search list with the corresponding character in the replacement list. It returns the number of characters replaced or deleted. If no string is specified via the =~ or !~ operator, the $_ string is transliterated. (The string specified with =~ must be a scalar variable, an array element, a hash element, or an assignment to one of those, i.e., an lvalue.)

    A character range may be specified with a hyphen, so tr/A-J/0-9/ does the same replacement as tr/ACEGIBDFHJ/0246813579/. For sed devotees, y is provided as a synonym for tr. If the SEARCHLIST is delimited by bracketing quotes, the REPLACEMENTLIST has its own pair of quotes, which may or may not be bracketing quotes, e.g., tr[A-Z][a-z] or tr(+\-*/)/ABCD/.

    Note that tr does not do regular expression character classes such as \d or [:lower:]. The tr operator is not equivalent to the tr(1) utility. If you want to map strings between lower/upper cases, see perlfunc/lc and perlfunc/uc, and in general consider using the s operator if you need regular expressions.

    Note also that the whole range idea is rather unportable between character sets--and even within character sets they may cause results you probably didn't expect. A sound principle is to use only ranges that begin from and end at either alphabets of equal case (a-e, A-E), or digits (0-4). Anything else is unsafe. If in doubt, spell out the character sets in full.

    Options:
        c   Complement the SEARCHLIST.
        d   Delete found but unreplaced characters.
        s   Squash duplicate replaced characters.

    Line 5 replaces all "+" characters, which stand for spaces in a URL-encoded string, with real spaces.
    Line 6 decodes all URL-encoded (%xx) data to the characters it represents.
    Line 7 turns every whitespace character into a plain space, and lines 8-10 remove all HTML tags and any leftover "<" or ">" characters, if present.
    Finally, the name => value pair is stored as a key-value pair in the %FORM hash, so that the rest of the script can refer to the data as $FORM{key}
Re: Explanation of regexps for obtaining POST params
by Anonymous Monk on Jul 05, 2002 at 10:09 UTC
    Just to emphasize what Corion said, the common advice is to use CGI or die; You really do not want to do this processing yourself. Really.