Control Characters (\xNN) in HTML

garliqua has asked for the wisdom of the Perl Monks concerning the following question:

Wise monks, I've got one for you.

I'm developing a content management system where the data are all stored in XML files. Everything is groovy with one exception: if a user tries to submit a web page with control characters (such as \x92 for single right quote) in it, then the XML Parser (XML::Simple, which uses XML::Parser) coughs, sputters and dies.

So, what I'd like to do is have a single regexp just go through and change all the \xNN characters to their XHTML entity equivalent. For instance, the single character \x92 would become &#146;.

My problem is that I can't seem to get something along these lines to work:

s/\x(\d+)/'&#' . hex($1) . ';'/ge

I think I know why this doesn't work (because the \d+ is searching for multiple digit characters whereas what I want is to find the single character specified by an expression like \x92 or \x93).

If I can avoid doing it, I'd rather not do something like:

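for (127 .. 255) {
 # Build a substitution such as s/\x92/&#146;/g for each character code
 my $regexp = "s/\\x" . sprintf("%x", $_) . "/&#$_;/g";
 # Escape the sigil so eval sees the variable $block, not its contents
 eval "\$block =~ $regexp;";
}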

Perhaps there is a solution involving pack(), though it hasn't occurred to me yet.

Any ideas?

Thanks.

Replies are listed 'Best First'.
Re: Control Characters (\xNN) in HTML by blackmateria (Chaplain) on Oct 18, 2001 at 20:52 UTC
Like tommyw and scain said, you have to escape the backslash. Also, `\d` doesn't match the hex digits 'A'-'F'. If you need to match those (looks like you do from the sprintf), you can use `[[:xdigit:]]` instead: `s/\\x([[:xdigit:]]+)/'&#'.hex($1).';'/eg` Btw, if you only want 2 digits (so that stuff like "\x92Efficiency" doesn't confuse the regex), use `{2}` instead of `+`. If you want to match one or two, use `{1,2}`. You're probably better off matching an exact number rather than a range, though, if you can: `s/\\x([[:xdigit:]]{2})/'&#'.hex($1).';'/eg` Hope this helps! Update: Oops, I just read your reply to tommyw. All you need is a range, combined with ord (not hex): `s/([\x80-\xFF])/'&#'.ord($1).';'/eg`
Re: Re: Control Characters (\xNN) in HTML by tommyw (Hermit) on Oct 18, 2001 at 22:24 UTC
Pah! Updating your answer based on a reply to my message. Are there no depths of plagiarism to which people will stoop? :-) In an attempt to retaliate, allow me to offer `s/([^[:print:]])/'&#'.ord($1).';'/eg` in return.
Re: Re: Control Characters (\xNN) in HTML by garliqua (Novice) on Oct 21, 2001 at 02:51 UTC
Thanks to everyone who replied. I appreciate it. I ended up going with blackmateria's solution, `s/([\x80-\xFF])/'&#'.ord($1).';'/eg`, since it kept the scope of the substitutions neatly to just the things I wanted to replace (as opposed to tommyw's followup using `[:print:]`, which, as I understand it from page 80 of the owl book at least, would have performed replacements on tabs and other non-space whitespace as well). Thanks again to all y'all!
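For readers comparing the two substitutions discussed in this sub-thread, here is a minimal sketch of how they differ on a sample string. The helper names `escape_high_bytes` and `escape_non_printable` and the sample text are made up for illustration; only the two regexes come from the replies above. It assumes the input is a plain byte string, not decoded Unicode text.

```perl
use strict;
use warnings;

# Hypothetical helper: blackmateria's range-based substitution.
sub escape_high_bytes {
    my ($text) = @_;
    # Replace each character in 0x80-0xFF with a decimal character reference
    $text =~ s/([\x80-\xFF])/'&#' . ord($1) . ';'/eg;
    return $text;
}

# Hypothetical helper: tommyw's [:print:]-based substitution.
sub escape_non_printable {
    my ($text) = @_;
    # Replace every character outside the POSIX [:print:] class
    $text =~ s/([^[:print:]])/'&#' . ord($1) . ';'/eg;
    return $text;
}

# Assumed sample: one \x92 character and one tab
my $sample = "It\x92s\ta test";

print escape_high_bytes($sample), "\n";     # the tab is left untouched
print escape_non_printable($sample), "\n";  # the tab becomes &#9; as well
```

The range version only touches characters from 0x80 upward, which is why it leaves tabs and newlines alone, as garliqua observed.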
Re: Control Characters (\xNN) in HTML by tommyw (Hermit) on Oct 18, 2001 at 20:38 UTC
You need to escape the backslash character: `s/\\x(\d+)/'&#' . hex($1) . ';'/ge` works for me. Oddly, you have doubled it in the loop you posted, but not in the original regexp.
Re: Re: Control Characters (\xNN) in HTML by garliqua (Novice) on Oct 18, 2001 at 21:03 UTC
Thanks for your help, tommyw, but it still doesn't work for me. I should have been more clear throughout my original posting that what I am searching for (and trying to replace) is a single character that is expressed as `\xNN` (when working in `vi` for instance), where `NN` is the hex value of the character's position in the ASCII table. So your solution works if I'm searching for a string that looks like '\x92' but not if I'm searching for a single character (the right-single-quote character in this case) expressed as `\x92`. Sorry if I wasn't clear on that before. Thanks.
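The distinction garliqua draws here, between the four-character text `\x92` and the single character with code 0x92, can be sketched as follows. The variable names and sample strings are assumptions for illustration; the two regexes are the ones quoted elsewhere in the thread.

```perl
use strict;
use warnings;

# Single quotes: backslash, x, 9, 2 are four separate characters
my $literal = 'quote: \x92';
# Double quotes: \x92 is one character with code 0x92
my $single  = "quote: \x92";

printf "%d vs %d characters\n", length($literal), length($single);

# hex() converts the captured digits of the literal text into a number
(my $from_literal = $literal) =~ s/\\x([[:xdigit:]]{2})/'&#' . hex($1) . ';'/eg;

# ord() returns the code of the captured character itself
(my $from_single = $single) =~ s/([\x80-\xFF])/'&#' . ord($1) . ';'/eg;

print "$from_literal\n$from_single\n";
```

Both substitutions end up producing the same reference, but each one only matches its own kind of input, which is why the escaped-backslash version did nothing for the original poster's data.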
Re: Control Characters (\xNN) in HTML by scain (Curate) on Oct 18, 2001 at 20:42 UTC
The reason your substitution isn't working is that you aren't escaping the first backslash. `\x` probably has a special meaning, and even if it doesn't, `\x` means x (a backslashed `\>`, for example, means >). To fix the regex, rewrite it like this: `s/\\x(\d+)/'&#' . hex($1) . ';'/ge` The `\d+` will work as you think it should (that is, as you wrote it). Scott Update: I should have guessed, but `\x` followed by a two-digit hex number means to look for the character with that hex code, which is probably what confused you in the first place: you are looking for the text representation of a hex number (i.e., `\x92`), which is plain text, not the character itself. The Real Update: OK, you said "For instance, the single character \x92"; scain, read the problem (or, as my high school chemistry teacher wrote, RTGDP). `\x` followed by hex digits matches the character with that code. What I don't know is how to get it to match a range of characters. You might try several things, like `[\x90-\xFF]`, which, if Perl DWIMs, would match any character in that range. Also, since `\x` actually matches the character, what you really want to capture is the whole thing; so, assuming the range works as I have written it, rewrite the regex like this: `s/([\x90-\xFF])/'&#' . $1 . ';'/g` I am really fuzzy on whether that will work, but don't have an easy way to test it; let me know. Note that this way, you no longer need the e at the end of the regex.
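Since scain could not test the last substitution, here is a small sketch comparing it with a version that keeps the `/e` modifier and calls `ord`. The variable names and the sample string are assumptions for illustration. Note also that a range starting at `\x90` skips the characters 0x80 through 0x8F.

```perl
use strict;
use warnings;

# Assumed sample input containing one \x92 character
my $input = "It\x92s";

# Without /e the replacement is an interpolated string, so the quotes and
# dots are copied into the output literally around the matched character.
(my $no_e = $input) =~ s/([\x90-\xFF])/'&#' . $1 . ';'/g;

# With /e the replacement is evaluated as Perl code, and ord() supplies
# the decimal code that a numeric character reference needs.
(my $with_e = $input) =~ s/([\x90-\xFF])/'&#' . ord($1) . ';'/eg;

print "$with_e\n";
```

In the first form the raw character is still present in the result, so the XML parser would presumably still choke on it; the second form is the same idea as the substitution garliqua ended up using, apart from the narrower range.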

