Control Characters (\xNN) in HTML

by garliqua (Novice)
on Oct 18, 2001 at 20:15 UTC (id://119722)

garliqua has asked for the wisdom of the Perl Monks concerning the following question:

Wise monks, I've got one for you.

I'm developing a content management system where the data are all stored in XML files. Everything is groovy with one exception: if a user tries to submit a web page with control characters (such as \x92 for single right quote) in it, then the XML Parser (XML::Simple, which uses XML::Parser) coughs, sputters and dies.

So, what I'd like to do is have a single regexp just go through and change all the \xNN characters to their XHTML entity equivalent. For instance, the single character \x92 would become &#146;.

My problem is that I can't seem to get something along these lines to work:

s/\x(\d+)/'&#' . hex($1) . ';'/ge

I think I know why this doesn't work (because the \d+ is searching for multiple digit characters whereas what I want is to find the single character specified by an expression like \x92 or \x93).

If I can avoid doing it, I'd rather not do something like:

for (127 .. 255) {
    my $regexp = "s/\\x" . sprintf("%lx", $_) . "/&#$_;/g";
    eval "$block =~ $regexp;";
}

Perhaps there is a solution involving pack(), though it hasn't occurred to me yet.

Any ideas?

Thanks.

Replies are listed 'Best First'.
Re: Control Characters (\xNN) in HTML
by blackmateria (Chaplain) on Oct 18, 2001 at 20:52 UTC
    Like tommyw and scain said, you have to escape the backslash. Also, I don't think \d matches hex digits 'A'-'F'. If you need to match those (looks like you do from the sprintf), you can use [[:xdigit:]] instead.
    s/\\x([[:xdigit:]]+)/'&#'.hex($1).';'/eg
    Btw, if you only want 2 digits (so that stuff like "\x92Efficiency" doesn't confuse the regex), use {2} instead of +. If you want to match one or two, use {1,2}. You're probably better off matching an exact number rather than a range though if you can.
    s/\\x([[:xdigit:]]{2})/'&#'.hex($1).';'/eg
    Hope this helps!
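To make the difference between + and {2} concrete, here is a minimal sketch. It assumes the input really is the literal text \x92 (a backslash, an x, and two digits), and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Single quotes keep the backslash literal: this is the text \x92Efficiency,
# not a string containing the character 0x92.
my $text = '\x92Efficiency';

# With +, the capture runs on through "Eff", since E and f are hex digits too.
(my $greedy = $text) =~ s/\\x([[:xdigit:]]+)/'&#' . hex($1) . ';'/eg;

# With {2}, exactly two hex digits are consumed.
(my $exact = $text) =~ s/\\x([[:xdigit:]]{2})/'&#' . hex($1) . ';'/eg;

print "$greedy\n";
print "$exact\n";   # prints: &#146;Efficiency
```

The greedy version swallows the first three letters of "Efficiency" and produces a much larger number, which is why the fixed count is the safer choice.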

    Update: Oops, I just read your reply to tommyw above. All you need is a range, combined with ord (not hex).

    s/([\x80-\xFF])/'&#'.ord($1).';'/eg
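A small self-contained sketch of that substitution, assuming the input is a byte string in which 0x92 is the Windows-1252 right single quote. The sample text and the variable names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sample input: chr(0x92) stands in for the pasted "smart quote".
my $block = "It" . chr(0x92) . "s groovy";

# Replace every character in the 0x80-0xFF range with a numeric character reference.
# /e evaluates the replacement as Perl code; /g repeats it for every match.
(my $fixed = $block) =~ s/([\x80-\xFF])/'&#' . ord($1) . ';'/eg;

print "$fixed\n";   # prints: It&#146;s groovy
```

ord is the right function here because $1 holds the character itself, whereas hex expects a string of hex digits.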

      Pah! Updating your answer based on a reply to my message. Are there no depths to the plagiarism people will stoop to? :-)

      In an attempt to retaliate, allow me to offer:

      s/([^[:print:]])/'&#'.ord($1).';'/eg
      in return.

      Thanks to everyone who replied. I appreciate it.

      I ended up going with blackmateria's solution:

      s/([\x80-\xFF])/'&#'.ord($1).';'/eg

      Since it kept the scope of the substitutions neatly to just the things I wanted to replace. As I understand it (from page 80 of the owl book, at least), tommyw's followup using [:print:] would have performed replacements on tabs and other non-space-character whitespace as well.

      Thanks again to all y'all!
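The point about whitespace can be checked with a quick sketch. The input below is a made-up ASCII-only string containing a tab and a newline, so any change comes from the character class alone.

```perl
use strict;
use warnings;

# Hypothetical input containing only ASCII, including a tab and a newline.
my $block = "name:\tvalue\n";

# Negated POSIX class: tab and newline are not printable, so they get converted.
(my $by_print = $block) =~ s/([^[:print:]])/'&#' . ord($1) . ';'/eg;

# Explicit high-byte range: nothing here is in 0x80-0xFF, so the text is untouched.
(my $by_range = $block) =~ s/([\x80-\xFF])/'&#' . ord($1) . ';'/eg;

print "$by_print\n";   # prints: name:&#9;value&#10;
print $by_range;
```

So the explicit range leaves ordinary line structure alone, while the negated class rewrites it.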

Re: Control Characters (\xNN) in HTML
by tommyw (Hermit) on Oct 18, 2001 at 20:38 UTC

    You need to escape the backslash character: s/\\x(\d+)/'&#' . hex($1) . ';'/ge works for me.

    Oddly, you have doubled it in the loop you posted, but not in the original regexp.

      Thanks for your help, tommyw, but it still doesn't work for me. I should have been more clear throughout my original posting that what I am searching for (and trying to replace) is a single character that is expressed as \xNN (when working in vi for instance), where NN is the hex value of the character's position in the ASCII table.

      So your solution works if I'm searching for a string that looks like '\x92' but not if I'm searching for a single character (the right-single-quote character in this case) expressed as \x92.

      Sorry if I wasn't clear on that before.

      Thanks.
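The distinction between the four-character text and the single character can be seen in a short sketch. The variable names are hypothetical, and the pattern is the escaped one suggested above.

```perl
use strict;
use warnings;

my $literal = '\x92';   # single quotes: four characters, backslash x 9 2
my $char    = "\x92";   # double quotes: one character with code 0x92

printf "literal: length %d\n", length $literal;                     # 4
printf "char:    length %d, ord %d\n", length($char), ord($char);   # 1, 146

# The escaped pattern looks for the four-character text, so it only matches $literal.
my $pattern = qr/\\x(\d+)/;
print "literal matches\n"     if     $literal =~ $pattern;
print "char does not match\n" unless $char    =~ $pattern;
```

To catch the single character, the pattern has to describe the character itself, for example with a range in a character class.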

Re: Control Characters (\xNN) in HTML
by scain (Curate) on Oct 18, 2001 at 20:42 UTC
    The reason your substitution isn't working is that you aren't escaping the first backslash. \x probably has a special meaning, and even if it doesn't, \x would just mean x (a backslashed \>, for example, means >).

    To fix the regex, rewrite it like this:

    s/\\x(\d+)/'&#' . hex($1) . ';'/ge
    The \d+ will work as you think it should (that is, as you wrote it).

    Scott

    Update: I should have guessed, but the \x followed by a two-digit hex number means to look for a hex number, which is probably what confused you in the first place: you are looking for the text representation of a hex number (i.e., \x92), which is plain text, not the hex number itself.

    The Real Update: OK, you said "For instance, the single character \x92"; scain, read the problem (or, as my high school chemistry teacher wrote, RTGDP). \x followed by hex digits matches the character with that hex number. What I don't know is how to get it to match a range of numbers. You might try several things, like [\x90-\xFF], which, if Perl DWIMs, would match any character in that range. Also, since \x actually matches the character itself, what you really want to capture is the whole thing; so assuming the range works as I have written it, rewrite the regex like this:

    s/([\x90-\xFF])/'&#' . ord($1) . ';'/ge
    I am really fuzzy on whether that will work, but don't have an easy way to test it; let me know. Note that the e at the end is still needed so that the replacement is evaluated as code, and ord($1) is what turns the captured character into its number.
