PerlMonks  

Reg Ex to strip MS smart quotes

by freddo411 (Chaplain)
on Aug 19, 2005 at 17:01 UTC ( [id://485212]=perlquestion )

freddo411 has asked for the wisdom of the Perl Monks concerning the following question:

I'm in the position of needing to write a filter to change "smart quotes" and other MS characters into more friendly ASCII equivalents.

I've searched here (and found this suggestive node), on Google, and on CPAN, but I haven't found anything even close to what I need.

Can anyone provide a code snippet that would be helpful? I'm looking in particular for the list of MS "smart" chars (probably in hex) so that I can match them and convert them.

Also, where would you think it should live on CPAN? Regexp::Common, perhaps? Or somewhere else?

--Freddo411

-------------------------------------
Nothing is too wonderful to be true
-- Michael Faraday

Replies are listed 'Best First'.
Re: Reg Ex to strip MS smart quotes
by derby (Abbot) on Aug 19, 2005 at 17:31 UTC
      I don't think so (not the version I have, anyway, and I can't find any recent updates); I've been trying to extend that for some time, but it's a "free time" project and the "free" part is scarce...

      That said, demoronizer as it stands deserves ++, as does the name!

        Are you sure? What problems are you having? Here's the snippet from the code that translates smart-quotes:

        $s =~ s/\x93/"/g; $s =~ s/\x94/"/g;

        And here's how I've modified the core demoronise sub:

        sub de_cp1252 {
            my( $self, $s ) = @_;

            # Map incompatible CP-1252 characters
            $s =~ s/\x82/,/g;
            $s =~ s-\x83-<em>f</em>-g;
            $s =~ s/\x84/,,/g;
            $s =~ s/\x85/.../g;
            $s =~ s/\x88/^/g;
            $s =~ s-\x89- °/°°-g;
            $s =~ s/\x8B/</g;
            $s =~ s/\x8C/Oe/g;
            $s =~ s/\x91/'/g;
            $s =~ s/\x92/'/g;
            $s =~ s/\x93/"/g;
            $s =~ s/\x94/"/g;
            $s =~ s/\x95/*/g;
            $s =~ s/\x96/-/g;
            $s =~ s/\x97/--/g;
            $s =~ s-\x98-<sup>~</sup>-g;
            $s =~ s-\x99-<sup>TM</sup>-g;
            $s =~ s/\x9B/>/g;
            $s =~ s/\x9C/oe/g;

            # Now check for any remaining untranslated characters.
            if ($s =~ m/[\x00-\x08\x10-\x1F\x80-\x9F]/) {
                for( my $i = 0; $i < length($s); $i++) {
                    my $c = substr($s, $i, 1);
                    # Same class as the outer test, so tabs (\x09) are not reported
                    if ($c =~ m/[\x00-\x08\x10-\x1F\x80-\x9F]/) {
                        printf(STDERR "warning--untranslated character 0x%02X in input line %s\n",
                               unpack('C', $c), $s );
                    }
                }
            }
            return $s;
        }

        I didn't really care about the other stuff (such as bad HTML or Unicode) - just translating the known misplaced CP-1252 characters into something reasonable.

        -derby
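
The chain of substitutions in de_cp1252 can also be driven from a lookup table, which keeps the byte list in one place. The following is only a rough sketch, not derby's code: the names %ascii_for and strip_smart_chars are made up for illustration, and the table leaves out the entries whose replacements are HTML or non-ASCII.

```perl
use strict;
use warnings;

# Hypothetical lookup table: CP-1252 byte => ASCII stand-in.
# Values are copied from the substitutions in de_cp1252.
my %ascii_for = (
    "\x82" => ',',   "\x84" => ',,',  "\x85" => '...',
    "\x88" => '^',   "\x8B" => '<',   "\x8C" => 'Oe',
    "\x91" => "'",   "\x92" => "'",   "\x93" => '"',
    "\x94" => '"',   "\x95" => '*',   "\x96" => '-',
    "\x97" => '--',  "\x9B" => '>',   "\x9C" => 'oe',
);

# Build one character class such as [\x82\x84...] from the table keys.
my $class = join '', map { sprintf('\\x%02X', ord($_)) } keys %ascii_for;
my $re    = qr/([$class])/;

# Hypothetical helper: replace every mapped byte in a single pass.
sub strip_smart_chars {
    my ($s) = @_;
    $s =~ s/$re/$ascii_for{$1}/g;
    return $s;
}

print strip_smart_chars("\x93It\x92s here\x94 \x97 done\x85"), "\n";
# prints: "It's here" -- done...
```

Adding or changing a mapping then means editing the hash only; the regex is rebuilt from it.
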
Re: Reg Ex to strip MS smart quotes
by xdg (Monsignor) on Aug 19, 2005 at 17:34 UTC

    Read perlunicode. You can use names for Unicode characters and pick them out of a list to get the substitutions you want. The list also has the numeric codes if you prefer to do it that way.

    use charnames ":full";
    $string =~ s{
          \N{LEFT SINGLE QUOTATION MARK}
        | \N{RIGHT SINGLE QUOTATION MARK}
    }{\N{APOSTROPHE}}xg;   # /x applies to the pattern only, so no spaces in the replacement

    You may also get some mileage out of these properties, but the caveat from perlunicode would make me a bit nervous.

    Pi    InitialPunctuation    (may behave like Ps or Pe depending on usage)
    Pf    FinalPunctuation      (may behave like Ps or Pe depending on usage)

    (And yes, a module to do this translation automatically would be very nice!)

    -xdg

    Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.
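
The named-character approach extends naturally to the double quotes as well. A minimal sketch, assuming $string already holds decoded Unicode characters (not raw CP-1252 bytes); the sample text is invented for the example:

```perl
use strict;
use warnings;
use charnames ':full';

# Assumed input: a character string that has already been decoded to Unicode.
my $string = "\N{LEFT DOUBLE QUOTATION MARK}It"
           . "\N{RIGHT SINGLE QUOTATION MARK}s fine"
           . "\N{RIGHT DOUBLE QUOTATION MARK}";

# Curly single quotes become an ASCII apostrophe.
$string =~ s{ \N{LEFT SINGLE QUOTATION MARK} | \N{RIGHT SINGLE QUOTATION MARK} }{'}xg;

# Curly double quotes become an ASCII quotation mark.
$string =~ s{ \N{LEFT DOUBLE QUOTATION MARK} | \N{RIGHT DOUBLE QUOTATION MARK} }{"}xg;

print "$string\n";   # prints: "It's fine"
```
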

Re: Reg Ex to strip MS smart quotes
by wfsp (Abbot) on Aug 19, 2005 at 17:30 UTC
    The MS curly quotes are in the CP1252 0x80-0x9F range.
    Not every character in this range has a direct conversion to ASCII (or Latin-1). This chart shows the conversion to Unicode.

    Hope this helps, John
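
To see the CP1252-to-Unicode conversion from Perl itself, the core Encode module can decode the bytes. A small sketch, assuming the input really is CP1252 (the sample bytes are made up):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumed input: raw bytes in Windows CP1252.
my $bytes = "\x93quoted\x94";

# decode() turns the bytes into Perl characters (Unicode code points).
my $chars = decode('cp1252', $bytes);

# Show the code point of each character outside ASCII.
for my $c (split //, $chars) {
    printf "U+%04X\n", ord($c) if ord($c) > 0x7F;
}
# prints:
# U+201C
# U+201D
```

Once the text is decoded, the smart quotes can be matched by their Unicode names or code points rather than by raw byte values.
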

Re: Reg Ex to strip MS smart quotes
by planetscape (Chancellor) on Aug 19, 2005 at 21:47 UTC
    s/\x93|\x94/"/g;

    HTH,

    planetscape
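
When every replacement is a single character, tr/// is an alternative to the regex. This is just a sketch covering the four quote bytes; it cannot handle mappings such as \x85 to "...", which need s///.

```perl
use strict;
use warnings;

my $s = "\x91one\x92 and \x93two\x94";

# tr/// maps character for character: \x91 \x92 \x93 \x94 => ' ' " "
(my $clean = $s) =~ tr/\x91-\x94/''""/;

print "$clean\n";   # prints: 'one' and "two"
```
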
