in reply to Reg Ex to strip MS smart quotes

demoroniser will probably do what you want

-derby

Re^2: Reg Ex to strip MS smart quotes
by ww (Archbishop) on Aug 19, 2005 at 18:01 UTC
    I don't think so (not the version I have, anyway, and I can't find any recent updates). I've been trying to extend it for some time, but it's a "free time" project and the "free" part is scarce...

    That said, demoronizer as it stands deserves ++, as does the name!

      Are you sure? What problems are you having? Here's the snippet from the code that translates smart-quotes:

      $s =~ s/\x93/"/g; $s =~ s/\x94/"/g;

      And here's how I've modified the core demoronise sub:

      sub de_cp1252 {
          my( $self, $s ) = @_;

          # Map incompatible CP-1252 characters
          $s =~ s/\x82/,/g;
          $s =~ s-\x83-<em>f</em>-g;
          $s =~ s/\x84/,,/g;
          $s =~ s/\x85/.../g;

          $s =~ s/\x88/^/g;
          $s =~ s-\x89- °/°°-g;
          $s =~ s/\x8B/</g;
          $s =~ s/\x8C/Oe/g;

          $s =~ s/\x91/'/g;
          $s =~ s/\x92/'/g;
          $s =~ s/\x93/"/g;
          $s =~ s/\x94/"/g;
          $s =~ s/\x95/*/g;
          $s =~ s/\x96/-/g;
          $s =~ s/\x97/--/g;
          $s =~ s-\x98-<sup>~</sup>-g;
          $s =~ s-\x99-<sup>TM</sup>-g;
          $s =~ s/\x9B/>/g;
          $s =~ s/\x9C/oe/g;

          # Now check for any remaining untranslated characters.
          # (Both tests use the same character class.)
          if ($s =~ m/[\x00-\x08\x10-\x1F\x80-\x9F]/) {
              for( my $i = 0; $i < length($s); $i++) {
                  my $c = substr($s, $i, 1);
                  if ($c =~ m/[\x00-\x08\x10-\x1F\x80-\x9F]/) {
                      printf(STDERR
                             "warning--untranslated character 0x%02X in input line %s\n",
                             unpack('C', $c), $s );
                  }
              }
          }

          return $s;
      }

      I didn't really care about the other stuff (such as bad html or unicode) - just translating the known cp1252 misplaced characters into something reasonable.
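
      For comparison, the same mapping can be driven from a lookup table and a single substitution instead of one s/// per character. This is only a minimal sketch, not part of demoroniser: the names %CP1252_TO_ASCII and de_cp1252_table are made up for illustration, it covers only the plain-text replacements (the HTML-producing ones for \x83, \x89, \x98 and \x99 are left out), and it assumes the input is a byte string rather than decoded Unicode.

```perl
use strict;
use warnings;

# Hypothetical lookup table: CP-1252 byte => plain ASCII replacement.
my %CP1252_TO_ASCII = (
    "\x82" => ",",
    "\x84" => ",,",
    "\x85" => "...",
    "\x88" => "^",
    "\x8B" => "<",
    "\x8C" => "Oe",
    "\x91" => "'",
    "\x92" => "'",
    "\x93" => '"',
    "\x94" => '"',
    "\x95" => "*",
    "\x96" => "-",
    "\x97" => "--",
    "\x9B" => ">",
    "\x9C" => "oe",
);

# Build one character class (e.g. [\x82\x84...]) from the table keys.
my $class = join '', map { sprintf('\\x%02X', ord($_)) } keys %CP1252_TO_ASCII;
my $re    = qr/([$class])/;

# Replace every mapped byte in one pass; unmapped bytes are left alone.
sub de_cp1252_table {
    my ($s) = @_;
    $s =~ s/$re/$CP1252_TO_ASCII{$1}/g;
    return $s;
}

print de_cp1252_table("\x93Hello\x94 \x96 it\x92s done\x85"), "\n";
```

      Adding or changing a mapping then means editing one hash entry, and the string is scanned once rather than once per character.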

      -derby
        Bingo. That snippet is perfect.

        Interestingly, I had already found demoronizer, but I kept looking because I thought it only worked on HTML and only output HTML entities.

        Thanks again.

        -------------------------------------
        Nothing is too wonderful to be true
        -- Michael Faraday