Explanation of regexps for obtaining POST params

r_ibsen has asked for the wisdom of the Perl Monks concerning the following question:

Re: Explanation of regexps for obtaining POST params by Corion (Patriarch) on Jul 04, 2002 at 20:50 UTC
First of all, it's generally a bad idea to parse CGI yourself; use the CGI module instead: `use CGI; my $req = CGI->new(); print $req->header(); print "<H1>Hello "; print $req->param('name'), "</H1>";`

But your question isn't so much about CGI as it is about regular expressions. Let's take a look at each line:

Line 5: `$value =~ tr/+/ /;` The `tr` operator translates one character at a time into another character. Here, it replaces every `+` with a space (undoing what the browser did before the parameter was sent to you).

Line 6: `$value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex ($1))/eg;` Here, all other characters that were urlencoded are decoded. Characters that are unsuitable for URLs, such as \0 and newlines, get encoded as `%xx`, where `xx` is the hexadecimal value of the character. The regular expression replaces every percent sign that is followed by two hexadecimal digits (the set A-Fa-f0-9) with the character whose code is the number those digits spell. `%00` is the encoding of the character \0; a single newline has the code 10 and is encoded as `%0A`.

Line 7: `$value =~ s/\s/ /g;` Here, every whitespace character is converted to a blank. This is not necessarily a good idea, and might come as a surprise: if an arbitrary urlencoded character such as a tab (value 9, `%09`) was decoded in line 6, it is now replaced by a space.

Line 8: `$value =~ s/<([^>]|\n)*>//g;` Here, it looks like the processor is trying to strip all HTML from the values. The regular expression matches an opening bracket `<`, followed by any number of characters that are each either not a closing bracket `>` or a newline, followed by the closing bracket `>`. (The `|\n` alternative is redundant, because `[^>]` already matches a newline.)

Line 9: `$value =~ s/<//g;` Now, to be extra sure, all remaining opening brackets are removed as well.

Line 10: `$value =~ s/>//g;` As are all closing brackets.

Line 11: `$FORM{$name} = $value;` And here, the `%FORM` hash is populated with the `$name => $value` pair. If you don't know about hashes, they are also called associative arrays, dictionaries or lookup tables, and if none of these words make sense, they are like arrays, except that the index is not a number but a string.

Except for the HTML stripping, which might or might not be what you wanted, `CGI.pm` does the decoding of CGI parameters already and is certainly worth a look.

`perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The $d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider ($c = $d->accept())->get_request(); $c->send_response( new #in the HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web`
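To watch lines 5 to 11 work on a concrete value, the steps can be wrapped in a small sub. This is only a sketch for experimenting: the name `decode_value` and the sample input are made up for illustration and are not part of the original script, and the alternation in line 8 is written as a plain `|`.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): applies lines 5-10
# of the snippet under discussion to a single raw form value.
sub decode_value {
    my ($value) = @_;
    $value =~ tr/+/ /;                                             # line 5: '+' back to space
    $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;   # line 6: %xx to character
    $value =~ s/\s/ /g;                                            # line 7: any whitespace to blank
    $value =~ s/<([^>]|\n)*>//g;                                   # line 8: strip <...> tags
    $value =~ s/<//g;                                              # line 9: leftover '<'
    $value =~ s/>//g;                                              # line 10: leftover '>'
    return $value;
}

# Line 11: store the cleaned value in the hash under its name.
my %FORM;
$FORM{name} = decode_value('J%2E+R%2E%09<b>Ibsen</b>%0A');

print "[$FORM{name}]\n";    # prints "[J. R. Ibsen ]"
```

Note how the encoded tab (`%09`) and the trailing newline (`%0A`) both end up as plain spaces, which is the surprise mentioned for line 7.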
Re: Explanation of regexps for obtaining POST params by screamingeagle (Curate) on Jul 04, 2002 at 20:39 UTC
1) `%FORM` is a hash, which should have been declared before line 3 of your example as `my %FORM;`

2) The `tr//` operator (i.e. the transliteration operator) is used to search for and replace characters. Quoting from perldoc:

`tr/SEARCHLIST/REPLACEMENTLIST/cds`
`y/SEARCHLIST/REPLACEMENTLIST/cds`

Transliterates all occurrences of the characters found in the search list with the corresponding character in the replacement list. It returns the number of characters replaced or deleted. If no string is specified via the `=~` or `!~` operator, the `$_` string is transliterated. (The string specified with `=~` must be a scalar variable, an array element, a hash element, or an assignment to one of those, i.e., an lvalue.)

A character range may be specified with a hyphen, so `tr/A-J/0-9/` does the same replacement as `tr/ACEGIBDFHJ/0246813579/`. For sed devotees, `y` is provided as a synonym for `tr`. If the SEARCHLIST is delimited by bracketing quotes, the REPLACEMENTLIST has its own pair of quotes, which may or may not be bracketing quotes, e.g., `tr[A-Z][a-z]` or `tr(+\-*/)/ABCD/`.

Note that `tr` does not do regular expression character classes such as `\d` or `[:lower:]`. The `tr` operator is not equivalent to the tr(1) utility. If you want to map strings between lower/upper cases, see perlfunc/lc and perlfunc/uc, and in general consider using the `s` operator if you need regular expressions.

Note also that the whole range idea is rather unportable between character sets--and even within character sets they may cause results you probably didn't expect. A sound principle is to use only ranges that begin from and end at either alphabets of equal case (a-e, A-E), or digits (0-4). Anything else is unsafe. If in doubt, spell out the character sets in full.

Options:
`c` Complement the SEARCHLIST.
`d` Delete found but unreplaced characters.
`s` Squash duplicate replaced characters.

Line 5 replaces all "+" characters, which mean spaces in a URL-encoded string. Line 6 decodes all URL-encoded data to character strings. Line 7 turns all whitespace into blanks, and the remaining lines remove all HTML tags and tag characters, if present. Finally, the name => value pair is stored as a key-value pair in the `%FORM` hash, so that the rest of the script can refer to the data as `$FORM{key}`.
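A few of the points from the quoted documentation (the return value and the `c`, `d` and `s` options) are easier to see on a small example. The strings below are invented for illustration and do not come from the original script.

```perl
use strict;
use warnings;

my $value = 'one+two++three';

# tr returns the number of characters it translated.
my $count = ($value =~ tr/+/ /);
print "$count: $value\n";     # prints "3: one two  three"

# s: squash runs of the same replaced character into a single one.
(my $squashed = 'one+two++three') =~ tr/+/ /s;
print "$squashed\n";          # prints "one two three"

# d: delete SEARCHLIST characters that have no counterpart in REPLACEMENTLIST.
(my $deleted = '<b>bold</b>') =~ tr/<>//d;
print "$deleted\n";           # prints "bbold/b"

# c: complement the SEARCHLIST. With an empty REPLACEMENTLIST and no d,
# the string is left unchanged and tr only counts the characters that
# are NOT hex digits.
my $raw     = '%0A';
my $non_hex = ($raw =~ tr/a-fA-F0-9//c);
print "$non_hex\n";           # prints "1"
```

The `d` example also shows why stripping only the bracket characters, as lines 9 and 10 of the script do, leaves the tag names behind in the text.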
Re: Explanation of regexps for obtaining POST params by Anonymous Monk on Jul 05, 2002 at 10:09 UTC
Just to emphasize what Corion said, the common advice is to use CGI or die; You really do not want to do this processing yourself. Really.