in reply to converting smart quotes

There are some problems in the way you posted which make it very hard to know just how to help.

The chief issue is your link to an IBM page (where "this page" eq http://publib.boulder.ibm.com/infocenter/brjrules/v7r0m3/basic/tocView.jsp?toc=/com.ibm.websphere.ilog.jrules.doc/toc.xml) which leads me to a TOC where the phrase "What's New" is NOT found. As I'm sure you can imagine, even the Monks most generous with their time may consider that a war-stopper.

The second most serious shortcoming is the possibly unusual sense of your use of the phrase, "smart quotes." At least in the context of M$ Word, it refers to four very specific characters:

  1. (&#145; or 0x91)
  2. (&#146;)
  3. (&#147;)
          and
  4. (&#148;).

That leaves me completely at sixes and nines as to what you mean by your second paragraph. It would probably be better to actually insert the chars inside quotes or somesuch so we can see what's giving you grief... and thus be more likely able to help. (It would also be helpful were you to post a compilable snippet of your code [the bare minimum to show us how you're trying to deal with the non-ascii chars]).

Third, demoronizer is probably not quite up to the job, unless you make the same patch to your copy (assuming versions are the same) that derby provided in a reply in the thread you cited.

And, fourth, please use tags from the PM variant of HTML; especially, please use the [id://485212] method of creating links. If you link with a full a href..., your link will result in some significant fraction of the Monks who follow it finding themselves logged out. For further reference, see What shortcuts can I use for linking to other information?.

Re^2: converting smart quotes
by ikegami (Patriarch) on Mar 20, 2012 at 03:23 UTC

    The second most serious shortcoming is the possibly unusual sense of your use of the phrase, "smart quotes."

    MS smart quotes are 0x91 (‘) and 0x92 (’) in cp1252. They are U+2018 and U+2019, so they are actually written as &#x2018; and &#x2019; in HTML.

    &#145; and &#146; refer to other characters that aren't even present in cp1252.

    U+2018 and U+2019 are E2 80 98 and E2 80 99 when encoded as UTF-8, so the OP is indeed referring to smart quotes.
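
To make the three representations above concrete, here is a minimal sketch using the core Encode module. It assumes the input is the two cp1252 bytes 0x91 and 0x92; the variable names are purely illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The two single "smart quote" bytes as cp1252 stores them.
my $cp1252_bytes = "\x91\x92";

# Decode the bytes into Perl characters (Unicode code points).
my $chars = decode('cp1252', $cp1252_bytes);
my @code_points = map { sprintf('U+%04X', ord($_)) } split //, $chars;

# Encode each character as UTF-8 and show the resulting bytes in hex.
my @utf8_hex = map { uc(unpack('H*', encode('UTF-8', $_))) } split //, $chars;

print "code points: @code_points\n";   # U+2018 U+2019
print "UTF-8 bytes: @utf8_hex\n";      # E28098 E28099
```

The same character therefore shows up as one byte in a cp1252 file and as three bytes in a UTF-8 page, which is why it matters which encoding the fetched content is in before any substitution is attempted.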

Re^2: converting smart quotes
by slugger415 (Monk) on Mar 20, 2012 at 04:18 UTC

    Hi all, thank you so much for your comments and suggestions, and many apologies for my bad linking and explanations. Some responses:

    First off, I'm not sure why (ww) you don't see the What's new string. Perhaps these screen grabs will help describe what I'm talking about, from the above URL (and I hope I'm not breaking a rule here):

    pic 1

    pic 2

    2nd, I believe your example #2 is the smart quote I'm discussing, though it appears slightly differently in my text editor than it does in my browser. Here's a paste of the text here:

    What’s new

    As for my specific Perl code:

    my $browser = LWP::UserAgent->new;
    my $response = $browser->get( "http://publib.boulder.ibm.com/infocenter/brjrules/v7r0m3/basic/tocView.jsp?toc=/com.ibm.websphere.ilog.jrules.doc/toc.xml" );
    my $content = $$response{_content}; ## yes inefficient coding, but it works
    open(OUT, ">content.html");
    print OUT $content;
    close(OUT);

    Adding utf8::decode to that, as suggested:

    utf8::decode($content);
    $content =~ s { ([\x00-\x08\x0B\x0C\x0E-\x1F\x80-\x{1FFFFF}]) }
                  { sprintf('[U+%04X]', ord($1)) }gex;

    produces this:

    What[U+2019]s new

    At least it's finding it! But I confess I don't follow the regex there (I'm still learning...), and is there some shortcut in the code I'm missing?

    Sorry if I'm asking dumb questions here or just not getting it. And I would like to better understand that regex -- is there some place to learn more about that?

    Thank you all once again.

      tobyink's regular expression is making the character (and others if present) visible.
      To convert the specific character you mention to a normal ASCII single quote:
      $content =~ s/\x{2019}/'/g;
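
If the other Word-style quotes turn up as well, the same idea extends to all four with a single tr///. This is only a sketch: straighten_quotes is a hypothetical helper name, and it assumes the text has already been decoded into characters (for example with utf8::decode, as above).

```perl
use strict;
use warnings;

# Hypothetical helper: map the four curly quotes to their ASCII counterparts.
sub straighten_quotes {
    my ($text) = @_;
    # U+2018 and U+2019 become ', U+201C and U+201D become "
    $text =~ tr/\x{2018}\x{2019}\x{201C}\x{201D}/''""/;
    return $text;
}

my $decoded = "What\x{2019}s new";
print straighten_quotes($decoded), "\n";
```

On still-encoded UTF-8 bytes this would match nothing, since each quote is then three separate bytes rather than one character.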

      My bad. Didn't find it because I didn't look closely enough... and when I used 'find' I used a common, straight single quote instead of a smartquote for the symbol. Duh! So, my apologies for that.

      The regex is using a "character class" to match any single instance of a character in the range \x00 through \x08, or \x0B, \x0C, \x0E through \x1F, or ...

      ... well, at that point, I'm thoroughly puzzled. The curly bracket notation in the last element is usually used to specify ('quantify') the number of instances of a preceding character, but in this case, my first guess would be that it's a typo. Wiser heads may have another interpretation. I don't understand, and haven't yet found an explanation for, the use of {}s in \x{1FFFFF}.

      As for learning more about regexen, see perlrequick, perlretut, and the invaluable "Mastering Regular Expressions" by Friedl (ca USD 30, last I looked). The book is where I'll look first to try to understand the use of curly brackets as something other than a mistake.

        In a regular expression, the "\xNN" escape takes at most two hexadecimal digits, so it can only match characters in the range "\x00" to "\xFF". Adding braces like "\x{1FFFFF}" allows an arbitrary number of hexadecimal digits (presumably limited only by your architecture's integer size). perlre should explain it; search it for "long hex char".

        Escapes like this also work in interpolated strings. e.g.

        perl -Mutf8::all -E'say qq(\x{263a})'
        perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'
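
A small sketch of the difference between the braced and unbraced forms, plus a variant of the character class from earlier in the thread. Two assumptions to flag: the upper bound is capped at U+10FFFF (the largest Unicode code point) rather than the 1FFFFF used above, and the variable names are illustrative only.

```perl
use strict;
use warnings;

my $braced   = "\x{2019}";  # one character: U+2019
my $unbraced = "\x2019";    # "\x20" (a space) followed by the literal text "19"

printf "braced:   %d character(s)\n", length($braced);
printf "unbraced: %d character(s)\n", length($unbraced);

# Replace control characters and anything above 0x7F with a visible [U+XXXX] tag.
# /x lets the pattern be spaced out; /e evaluates the replacement as Perl code.
my $content = "What\x{2019}s new";
(my $visible = $content) =~ s{
    ( [\x00-\x08\x0B\x0C\x0E-\x1F\x80-\x{10FFFF}] )
}{ sprintf('[U+%04X]', ord($1)) }gex;

print "$visible\n";
```

Note that the class itself is written without internal spaces: plain /x ignores whitespace in the pattern but not inside a bracketed character class.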