in reply to Re^3: converting smart quotes
in thread converting smart quotes

In a regular expression, the "\xNN" escape takes at most two hexadecimal digits, so it can only match characters in the range "\x00" to "\xFF". Adding braces, as in "\x{1FFFFF}", allows an arbitrary number of hexadecimal digits (presumably limited only by your architecture's integer size). perlre should explain it; search it for "long hex char".
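
To make the difference concrete, here is a minimal sketch. The variable names and the sample string are my own, not taken from the thread; the string is built with chr() so the source stays pure ASCII.

```perl
use strict;
use warnings;

# Sample string containing U+2019 RIGHT SINGLE QUOTATION MARK (assumed example data).
my $text = "What" . chr(0x2019) . "s new";

# Braced form: all four hex digits are read, so this matches the curly apostrophe.
my $has_curly_apostrophe = $text =~ /\x{2019}/ ? 1 : 0;

# Unbraced form: only "20" is read as hex (a space); the "19" after it is literal text.
my $unbraced_matches         = $text =~ /\x2019/ ? 1 : 0;
my $unbraced_matches_literal = " 19" =~ /\x2019/ ? 1 : 0;

print "braced: $has_curly_apostrophe, unbraced: $unbraced_matches, ",
      "unbraced vs ' 19': $unbraced_matches_literal\n";
```

So without the braces, a pattern that looks like it targets U+2019 is really looking for a space followed by "19".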

Escapes like this also work in interpolated strings. e.g.

perl -Mutf8::all -E'say qq(\x{263a})'
perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'

Re^5: converting smart quotes
by slugger415 (Monk) on Mar 20, 2012 at 14:49 UTC

    Ok I think I've gotten this working in my own way (that my novice brain can understand):

    utf8::decode($content);
    while ($content =~ /([\x00-\x08\x0B\x0C\x0E-\x1F\x80-\x{1FFFFF}])/) {
        my $char = '&#' . ord($1) . ';';
        $content =~ s/([\x00-\x08\x0B\x0C\x0E-\x1F\x80-\x{1FFFFF}])/$char/;
    }

    I'm sure there's a more efficient way to do this, but the resulting &#xxxx; structure seems to work. And the important thing is the regex should be finding all those weirdo characters. Thank you!
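
One possibly more efficient variant, offered as a sketch rather than the definitive answer: a single substitution with the /g and /e modifiers handles every match in one pass, so no while loop is needed. The helper name encode_entities_numeric is hypothetical, and the sample string is assumed to already hold decoded characters (in the code above, that is what utf8::decode does).

```perl
use strict;
use warnings;

# Hypothetical helper; the character class is the one used in the loop above.
sub encode_entities_numeric {
    my ($content) = @_;
    # /g replaces every match in one pass; /e evaluates the replacement as Perl code.
    $content =~ s/([\x00-\x08\x0B\x0C\x0E-\x1F\x80-\x{1FFFFF}])/'&#' . ord($1) . ';'/ge;
    return $content;
}

# Assumed sample input: "What", U+2019, "s new".
my $sample  = "What" . chr(0x2019) . "s new";
my $encoded = encode_entities_numeric($sample);
print "$encoded\n";   # What&#8217;s new
```

The result should be the same decimal &#NNNN; form as the loop produces; it just avoids rescanning the string from the start after every replacement.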

Re^5: converting smart quotes
by slugger415 (Monk) on Mar 20, 2012 at 14:30 UTC

    Thanks -- so what does sprintf('[U+%04X]' do, and why is it coming out as What[U+2019]s new?

      It replaces every character in that character class (basically: the unusual ASCII control characters, plus anything outside ASCII) with:

      1. opening square bracket, followed by
      2. capital letter "U", followed by
      3. plus sign, followed by
      4. the numeric value of the character being replaced, in uppercase hexadecimal, padded to a minimum of four digits, followed by
      5. closing square bracket

      It's a regular expression I use quite... regularly. It makes any non-ASCII characters stick out like a sore thumb, so you can see exactly what characters are in your string.
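
A minimal sketch of that substitution, for reference. The exact original one-liner is not quoted in this part of the thread, so this is a reconstruction under assumptions: it reuses the character class shown above, and the helper name show_unusual_chars is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the substitution described in the list above.
sub show_unusual_chars {
    my ($text) = @_;
    # %04X: the code point in uppercase hex, zero-padded to at least four digits.
    $text =~ s/([\x00-\x08\x0B\x0C\x0E-\x1F\x80-\x{1FFFFF}])/sprintf('[U+%04X]', ord($1))/ge;
    return $text;
}

# Assumed sample input containing U+2019 RIGHT SINGLE QUOTATION MARK.
my $marked = show_unusual_chars("What" . chr(0x2019) . "s new");
print "$marked\n";   # What[U+2019]s new
```

That is why the output reads What[U+2019]s new: the apostrophe in the input is U+2019, a "smart" right single quote, not the ASCII apostrophe.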
