
My bad. Didn't find it because I didn't look closely enough... and when I used 'find' I used a common, straight single quote instead of a smartquote for the symbol. Duh! So, my apologies for that.

The regex is using a "character class" to match any single instance of a character in the range \x00 through \x08 or \x0c, \x0e through \x1f or ...

... well, at that point, I'm thoroughly puzzled. Curly bracket notation is usually used to specify ('quantify') the number of instances of the preceding character, so my first guess would be that the last element is a typo. Wiser heads may have another interpretation. I don't understand, and haven't yet found an explanation for, the use of {}s in \x{1FFFFF}.

As for learning more about regexen, see perlrequick, perlretut, and the invaluable "Mastering Regular Expressions" by Friedl (ca USD 30, last I looked). The book is where I'll look first to try to understand the use of curly brackets as something other than a mistake.

Re^4: converting smart quotes
by tobyink (Canon) on Mar 20, 2012 at 13:14 UTC

    In a regular expression, the "\xNN" escape takes at most two hexadecimal digits, so it can only match characters in the range "\x00" to "\xFF". Adding braces, as in "\x{1FFFFF}", allows an arbitrary number of hexadecimal digits (presumably limited only by your architecture's integer size). perlre should explain it - search it for "long hex char".
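A minimal sketch of the difference between the two forms, in case it helps. The sample string and variable names are made up for illustration; the escapes are written out so the source file itself stays plain ASCII.

```perl
use strict;
use warnings;

# "\xE9" is the short form: code point 0xE9.
# "\x{2019}" is the braced form: code point 0x2019 (right single quotation mark).
my $short = "\xE9";
my $long  = "\x{2019}";

printf "short: U+%04X\n", ord $short;   # short: U+00E9
printf "long:  U+%04X\n", ord $long;    # long:  U+2019

# Both forms can appear as endpoints of a range inside a character class.
# $text is a made-up sample string.
my $text  = "What\x{2019}s new?";
my $count = () = $text =~ /[\x80-\x{1FFFFF}]/g;
print "non-ASCII characters found: $count\n";   # non-ASCII characters found: 1
```

So the braces here are part of the escape itself, not a quantifier.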

    Escapes like this also work in interpolated strings. e.g.

    perl -Mutf8::all -E'say qq(\x{263a})'
    perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'

      Ok I think I've gotten this working in my own way (that my novice brain can understand):

      utf8::decode($content);
      while ($content =~ /([\x00-\x08\x0B\x0C\x0E-\x1F\x80-\x{1FFFFF}])/) {
          my $char = '&#' . ord($1) . ';';
          $content =~ s/([\x00-\x08\x0B\x0C\x0E-\x1F\x80-\x{1FFFFF}])/$char/;
      }

      I'm sure there's a more efficient way to do this, but the resulting &#xxxx; structure seems to work. And the important thing is the regex should be finding all those weirdo characters. Thank you!
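One possibly more efficient variant, offered as a sketch rather than the definitive way: a single substitution with the /g and /e modifiers should give the same result without rescanning the string from the start for every character. The helper name to_numeric_entities and the sample input are hypothetical, and the string is assumed to be decoded to characters already (for example with utf8::decode, as above).

```perl
use strict;
use warnings;

# to_numeric_entities() is a hypothetical helper name, not from the thread.
# It expects a string that has already been decoded to characters.
sub to_numeric_entities {
    my ($content) = @_;
    # /g replaces every match in one pass; /e evaluates the replacement as Perl code.
    $content =~ s/([\x00-\x08\x0B\x0C\x0E-\x1F\x80-\x{1FFFFF}])/'&#' . ord($1) . ';'/ge;
    return $content;
}

my $sample = "What\x{2019}s new?";          # made-up sample input
print to_numeric_entities($sample), "\n";   # What&#8217;s new?
```

The character class is the same one used in the loop; only the looping is handed over to the regex engine.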

      Thanks -- so what does sprintf('[U+%04X]' do, and why is it coming out as What[U+2019]s new?

        It replaces every character in that character class (basically: the unusual ASCII control characters, plus anything outside ASCII) with:

        1. opening square bracket, followed by
        2. capital letter "U", followed by
        3. plus sign, followed by
        4. the numeric value of the character being replaced, in uppercase hexadecimal, padded to a minimum of four digits, followed by
        5. closing square bracket

        It's a regular expression I use quite... regularly. It makes any non-ASCII characters stick out like a sore thumb, so you can see exactly what characters are in your string.
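The five steps above can be sketched roughly as follows. The exact substitution is an assumption pieced together from the sprintf format and the character class quoted elsewhere in this thread, and show_codepoints is a hypothetical name.

```perl
use strict;
use warnings;

# show_codepoints() is a hypothetical name for this sketch.
# Each matched character is replaced by "[U+XXXX]", where XXXX is its
# code point in uppercase hex, zero-padded to at least four digits.
sub show_codepoints {
    my ($string) = @_;
    $string =~ s/([\x00-\x08\x0B\x0C\x0E-\x1F\x80-\x{1FFFFF}])/sprintf('[U+%04X]', ord($1))/ge;
    return $string;
}

print show_codepoints("What\x{2019}s new?"), "\n";   # What[U+2019]s new?
print show_codepoints("caf\xE9"), "\n";              # caf[U+00E9]
print show_codepoints("\x{1F600}"), "\n";            # [U+1F600]
```

That is why the smart apostrophe in "What's new?" shows up as [U+2019]: it is the only character in the string that falls inside the class. Tab, newline and carriage return are left alone because \x09, \x0A and \x0D are not in it.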
