Re^3: converting smart quotes

My bad. Didn't find it because I didn't look closely enough... and when I used 'find' I used a common, straight single quote instead of a smartquote for the symbol. Duh! So, my apologies for that.

The regex uses a "character class" to match any single character in the range \x00 through \x08, or \x0c, or \x0e through \x1f, or ...

... well, at that point, I'm thoroughly puzzled. Curly brackets are usually used to specify ('quantify') the number of instances of the preceding character, but in this case my first guess would be that it's a typo. Wiser heads may have another interpretation. I don't understand, and haven't yet found an explanation for, the use of {}s in \x{1FFFFF}.

As for learning more about regexen, see perlrequick, perlretut, and the invaluable "Mastering Regular Expressions" by Friedl (ca USD 30, last I looked). The book is where I'll look first to try to understand the use of curly brackets as something other than a mistake.


Re^4: converting smart quotes by tobyink (Canon) on Mar 20, 2012 at 13:14 UTC
In a regular expression, the unbraced "\xNN" escape takes at most two hexadecimal digits, so it can only match characters in the range "\x00" to "\xFF". Adding braces, as in "\x{1FFFFF}", allows an arbitrary number of hexadecimal digits (presumably limited only by your architecture's integer size). perlre should explain it - search it for "long hex char". Escapes like this also work in interpolated strings, e.g. `perl -Mutf8::all -E'say qq(\x{263a})'`
`perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'`
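To make the difference between the two escape forms concrete, here is a minimal sketch. The variable names are only for illustration and do not come from the thread; the script prints code point numbers rather than the characters themselves, so no output encoding layer is needed.

```perl
use strict;
use warnings;

# Unbraced form: at most two hex digits, so code points up to 0xFF.
my $e_acute = "\xE9";        # U+00E9
# Braced form: as many hex digits as needed.
my $smiley  = "\x{263A}";    # U+263A WHITE SMILING FACE
my $rquote  = "\x{2019}";    # U+2019 RIGHT SINGLE QUOTATION MARK

# Print each code point in hex: U+00E9, U+263A, U+2019.
printf "U+%04X\n", ord $_ for $e_acute, $smiley, $rquote;

# Without braces only the first two hex digits belong to the escape:
# "\x263A" is "\x26" (an ampersand) followed by the literal text "3A".
print "\x263A" eq "&3A" ? "unbraced: split\n" : "unbraced: whole\n";

# The braced form works the same way inside a character class.
print "match\n" if $rquote =~ /[\x80-\x{1FFFFF}]/;
```

So the braces in `\x{1FFFFF}` are part of the escape itself, not a quantifier.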
Re^5: converting smart quotes by slugger415 (Monk) on Mar 20, 2012 at 14:49 UTC
OK, I think I've gotten this working in my own way (that my novice brain can understand): `utf8::decode($content); while($content =~ /([\x00-\x08\x0B\x0C\x0E-\x1F\x80-\x{1FFFFF}])/){ my $char = '&#' . ord($1) . ';'; $content =~ s/([\x00-\x08\x0B\x0C\x0E-\x1F\x80-\x{1FFFFF}])/$char/; }` I'm sure there's a more efficient way to do this, but the resulting `&#xxxx;` structure seems to work. And the important thing is the regex should be finding all those weirdo characters. Thank you!
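One possibly more efficient variant of the loop above is a single substitution with the /g and /e modifiers. This is only a sketch: the helper name `to_numeric_entities` is made up for illustration, and it assumes the input is already a decoded character string (for instance after `utf8::decode`, as in the post).

```perl
use strict;
use warnings;

# Hypothetical helper; the character class is the one quoted in the post.
sub to_numeric_entities {
    my ($text) = @_;
    # /g replaces every match in one pass over the string;
    # /e evaluates the replacement as Perl code for each match.
    $text =~ s/([\x00-\x08\x0B\x0C\x0E-\x1F\x80-\x{1FFFFF}])/'&#' . ord($1) . ';'/ge;
    return $text;
}

# U+2019 is 8217 in decimal, so this prints: What&#8217;s new
print to_numeric_entities("What\x{2019}s new"), "\n";
```

The while loop restarts the search from the beginning of the string after each replacement, whereas /g continues from where the previous match ended.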
Re^5: converting smart quotes by slugger415 (Monk) on Mar 20, 2012 at 14:30 UTC
Thanks -- so what does `sprintf('[U+%04X]'` do, and why is it coming out as `What[U+2019]s new`?
Re^6: converting smart quotes by tobyink (Canon) on Mar 20, 2012 at 15:27 UTC
It replaces every character in that character class (basically: the unusual ASCII control characters, plus anything outside ASCII) with: an opening square bracket, followed by a capital letter "U", followed by a plus sign, followed by the numeric value of the character being replaced, in uppercase hexadecimal, padded to a minimum of four digits, followed by a closing square bracket. It's a regular expression I use quite... regularly. It makes any non-ASCII characters stick out like a sore thumb, so you can see exactly what characters are in your string.
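As a rough illustration of that description, the substitution might look like the sketch below. The original one-liner is not quoted in this part of the thread, so the character class is assumed to be the same one slugger415 used, and the helper name `highlight_non_ascii` is invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper; assumes the character class quoted elsewhere in the thread.
sub highlight_non_ascii {
    my ($text) = @_;
    # %04X formats a number as uppercase hex, zero-padded to at least 4 digits.
    $text =~ s/([\x00-\x08\x0B\x0C\x0E-\x1F\x80-\x{1FFFFF}])/sprintf('[U+%04X]', ord $1)/ge;
    return $text;
}

# The curly apostrophe is U+2019, so this prints: What[U+2019]s new
print highlight_non_ascii("What\x{2019}s new"), "\n";

# Padding applies to short values only; longer ones are not truncated.
printf "[U+%04X]\n", 0xE9;       # prints [U+00E9]
printf "[U+%04X]\n", 0x1F600;    # prints [U+1F600]
```

That is why `What's new` with a smart apostrophe comes out as `What[U+2019]s new`: the apostrophe is the only character in the string that falls inside the class.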