in reply to Re^4: Encoding/decoding question
in thread Encoding/decoding question

Character codes 128 to 159 (U+0080 to U+009F) are not allowed in HTML; even if they were, they would likely be unprintable control characters. Tidy assumed you wanted to refer to a character with the same byte value in the specified encoding and replaced that reference with the Unicode equivalent.
That’s the C1 Control Set, which means this is surely an encoding error. I’ll bet anything you’re supplying the old Windows-1252 legacy encoding as input but telling something it is really ISO-8859-1 when it isn’t.
cp1252  0x80  ⇒  U+20AC  < € >  \N{EURO SIGN}
cp1252  0x81  ⇒  U+FFFD  < � >  \N{REPLACEMENT CHARACTER}
cp1252  0x82  ⇒  U+201A  < ‚ >  \N{SINGLE LOW-9 QUOTATION MARK}
cp1252  0x83  ⇒  U+0192  < ƒ >  \N{LATIN SMALL LETTER F WITH HOOK}
cp1252  0x84  ⇒  U+201E  < „ >  \N{DOUBLE LOW-9 QUOTATION MARK}
cp1252  0x85  ⇒  U+2026  < … >  \N{HORIZONTAL ELLIPSIS}
cp1252  0x86  ⇒  U+2020  < † >  \N{DAGGER}
cp1252  0x87  ⇒  U+2021  < ‡ >  \N{DOUBLE DAGGER}
cp1252  0x88  ⇒  U+02C6  < ˆ >  \N{MODIFIER LETTER CIRCUMFLEX ACCENT}
cp1252  0x89  ⇒  U+2030  < ‰ >  \N{PER MILLE SIGN}
cp1252  0x8A  ⇒  U+0160  < Š >  \N{LATIN CAPITAL LETTER S WITH CARON}
cp1252  0x8B  ⇒  U+2039  < ‹ >  \N{SINGLE LEFT-POINTING ANGLE QUOTATION MARK}
cp1252  0x8C  ⇒  U+0152  < Œ >  \N{LATIN CAPITAL LIGATURE OE}
cp1252  0x8D  ⇒  U+FFFD  < � >  \N{REPLACEMENT CHARACTER}
cp1252  0x8E  ⇒  U+017D  < Ž >  \N{LATIN CAPITAL LETTER Z WITH CARON}
cp1252  0x8F  ⇒  U+FFFD  < � >  \N{REPLACEMENT CHARACTER}
cp1252  0x90  ⇒  U+FFFD  < � >  \N{REPLACEMENT CHARACTER}
cp1252  0x91  ⇒  U+2018  < ‘ >  \N{LEFT SINGLE QUOTATION MARK}
cp1252  0x92  ⇒  U+2019  < ’ >  \N{RIGHT SINGLE QUOTATION MARK}
cp1252  0x93  ⇒  U+201C  < “ >  \N{LEFT DOUBLE QUOTATION MARK}
cp1252  0x94  ⇒  U+201D  < ” >  \N{RIGHT DOUBLE QUOTATION MARK}
cp1252  0x95  ⇒  U+2022  < • >  \N{BULLET}
cp1252  0x96  ⇒  U+2013  < – >  \N{EN DASH}
cp1252  0x97  ⇒  U+2014  < — >  \N{EM DASH}
cp1252  0x98  ⇒  U+02DC  < ˜ >  \N{SMALL TILDE}
cp1252  0x99  ⇒  U+2122  < ™ >  \N{TRADE MARK SIGN}
cp1252  0x9A  ⇒  U+0161  < š >  \N{LATIN SMALL LETTER S WITH CARON}
cp1252  0x9B  ⇒  U+203A  < › >  \N{SINGLE RIGHT-POINTING ANGLE QUOTATION MARK}
cp1252  0x9C  ⇒  U+0153  < œ >  \N{LATIN SMALL LIGATURE OE}
cp1252  0x9D  ⇒  U+FFFD  < � >  \N{REPLACEMENT CHARACTER}
cp1252  0x9E  ⇒  U+017E  < ž >  \N{LATIN SMALL LETTER Z WITH CARON}
cp1252  0x9F  ⇒  U+0178  < Ÿ >  \N{LATIN CAPITAL LETTER Y WITH DIAERESIS}
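
A table like the one above can be regenerated instead of typed by hand. This is only a minimal sketch, not necessarily how the table was produced: it assumes the core Encode and charnames modules, and `describe_byte` is a hypothetical helper name. With no CHECK argument, Encode decodes a byte that has no mapping to U+FFFD, which is what the table shows for 0x81 and the other unassigned slots.

```perl
use strict;
use warnings;
use Encode qw(decode);
use charnames ();

# Hypothetical helper: decode one byte under the named legacy encoding
# and format a row with the resulting code point, glyph, and name.
sub describe_byte {
    my ($encoding, $byte) = @_;
    my $char = decode($encoding, chr($byte));
    my $cp   = ord($char);
    my $name = charnames::viacode($cp) // "<unnamed>";
    return sprintf("%s  0x%02X  =>  U+%04X  < %s >  \\N{%s}",
                   $encoding, $byte, $cp, $char, $name);
}

binmode(STDOUT, ":encoding(UTF-8)");
for my $byte (0x80 .. 0x9F) {
    my $row = describe_byte("cp1252", $byte);
    print "$row\n";
}
```

Calling it with another Encode name, such as "MacRoman", gives the corresponding rows for that encoding.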

I suppose you might be using the old MacRoman legacy encoding, as that also uses code points in the C1 Control Set for alternate purposes:

MacRoman  0x80  ⇒  U+00C4  < Ä >  \N{LATIN CAPITAL LETTER A WITH DIAERESIS}
MacRoman  0x81  ⇒  U+00C5  < Å >  \N{LATIN CAPITAL LETTER A WITH RING ABOVE}
MacRoman  0x82  ⇒  U+00C7  < Ç >  \N{LATIN CAPITAL LETTER C WITH CEDILLA}
MacRoman  0x83  ⇒  U+00C9  < É >  \N{LATIN CAPITAL LETTER E WITH ACUTE}
MacRoman  0x84  ⇒  U+00D1  < Ñ >  \N{LATIN CAPITAL LETTER N WITH TILDE}
MacRoman  0x85  ⇒  U+00D6  < Ö >  \N{LATIN CAPITAL LETTER O WITH DIAERESIS}
MacRoman  0x86  ⇒  U+00DC  < Ü >  \N{LATIN CAPITAL LETTER U WITH DIAERESIS}
MacRoman  0x87  ⇒  U+00E1  < á >  \N{LATIN SMALL LETTER A WITH ACUTE}
MacRoman  0x88  ⇒  U+00E0  < à >  \N{LATIN SMALL LETTER A WITH GRAVE}
MacRoman  0x89  ⇒  U+00E2  < â >  \N{LATIN SMALL LETTER A WITH CIRCUMFLEX}
MacRoman  0x8A  ⇒  U+00E4  < ä >  \N{LATIN SMALL LETTER A WITH DIAERESIS}
MacRoman  0x8B  ⇒  U+00E3  < ã >  \N{LATIN SMALL LETTER A WITH TILDE}
MacRoman  0x8C  ⇒  U+00E5  < å >  \N{LATIN SMALL LETTER A WITH RING ABOVE}
MacRoman  0x8D  ⇒  U+00E7  < ç >  \N{LATIN SMALL LETTER C WITH CEDILLA}
MacRoman  0x8E  ⇒  U+00E9  < é >  \N{LATIN SMALL LETTER E WITH ACUTE}
MacRoman  0x8F  ⇒  U+00E8  < è >  \N{LATIN SMALL LETTER E WITH GRAVE}
MacRoman  0x90  ⇒  U+00EA  < ê >  \N{LATIN SMALL LETTER E WITH CIRCUMFLEX}
MacRoman  0x91  ⇒  U+00EB  < ë >  \N{LATIN SMALL LETTER E WITH DIAERESIS}
MacRoman  0x92  ⇒  U+00ED  < í >  \N{LATIN SMALL LETTER I WITH ACUTE}
MacRoman  0x93  ⇒  U+00EC  < ì >  \N{LATIN SMALL LETTER I WITH GRAVE}
MacRoman  0x94  ⇒  U+00EE  < î >  \N{LATIN SMALL LETTER I WITH CIRCUMFLEX}
MacRoman  0x95  ⇒  U+00EF  < ï >  \N{LATIN SMALL LETTER I WITH DIAERESIS}
MacRoman  0x96  ⇒  U+00F1  < ñ >  \N{LATIN SMALL LETTER N WITH TILDE}
MacRoman  0x97  ⇒  U+00F3  < ó >  \N{LATIN SMALL LETTER O WITH ACUTE}
MacRoman  0x98  ⇒  U+00F2  < ò >  \N{LATIN SMALL LETTER O WITH GRAVE}
MacRoman  0x99  ⇒  U+00F4  < ô >  \N{LATIN SMALL LETTER O WITH CIRCUMFLEX}
MacRoman  0x9A  ⇒  U+00F6  < ö >  \N{LATIN SMALL LETTER O WITH DIAERESIS}
MacRoman  0x9B  ⇒  U+00F5  < õ >  \N{LATIN SMALL LETTER O WITH TILDE}
MacRoman  0x9C  ⇒  U+00FA  < ú >  \N{LATIN SMALL LETTER U WITH ACUTE}
MacRoman  0x9D  ⇒  U+00F9  < ù >  \N{LATIN SMALL LETTER U WITH GRAVE}
MacRoman  0x9E  ⇒  U+00FB  < û >  \N{LATIN SMALL LETTER U WITH CIRCUMFLEX}
MacRoman  0x9F  ⇒  U+00FC  < ü >  \N{LATIN SMALL LETTER U WITH DIAERESIS}

If it were me, the way I would figure this out is like this:

$ perl -nle 'print if /\P{ASCII}/' inputfile | uniquote -vE cp1252
$ perl -nle 'print if /\P{ASCII}/' inputfile | uniquote -vE latin1
$ perl -nle 'print if /\P{ASCII}/' inputfile | uniquote -vE macroman

And then eyeball which of those looks like it has the right character names. Here’s a demo. We’ll create a CP1252 text file, then look at what happens if we specify the right vs wrong encoding:

$ perl -wle 'binmode(STDOUT, "encoding(cp1252)")||die; print "He said, \x{201C}I\x{2019}m r\x{E9}served.\x{201D}"' > sample

$ uniquote -vE latin1 sample
He said, \N{SET TRANSMIT STATE}I\N{PRIVATE USE TWO}m r\N{LATIN SMALL LETTER E WITH ACUTE}served.\N{CANCEL CHARACTER}

$ uniquote -vE cp1252 sample
He said, \N{LEFT DOUBLE QUOTATION MARK}I\N{RIGHT SINGLE QUOTATION MARK}m r\N{LATIN SMALL LETTER E WITH ACUTE}served.\N{RIGHT DOUBLE QUOTATION MARK}

$ uniquote -vE macroman sample
He said, \N{LATIN SMALL LETTER I WITH GRAVE}I\N{LATIN SMALL LETTER I WITH ACUTE}m r\N{LATIN CAPITAL LETTER E WITH GRAVE}served.\N{LATIN SMALL LETTER I WITH CIRCUMFLEX}

You should be able to eyeball those to see which ones have the right character names, but in case you find that harder than it should be, you can always make sure that the numbers you get out are right instead.

$ perl -wle 'binmode(STDOUT, "encoding(cp1252)")||die; print "He said, \x{201C}I\x{2019}m r\x{E9}served.\x{201D}"' | uniquote --encoding cp1252
He said, \N{U+201C}I\N{U+2019}m r\N{U+E9}served.\N{U+201D}

Again, that does require the uniquote tool, and Perl 5.10.1, to run.
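
If uniquote isn't at hand, the same comparison can be approximated with core modules only. This is a rough sketch and not a replacement for the tool: `name_non_ascii` and `name_for` are hypothetical helper names, and the exact names printed for C1 control characters can vary between Perl versions.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use charnames ();

# Hypothetical helper: \N{NAME} for one character, or \N{U+XXXX}
# when charnames has no name for it.
sub name_for {
    my ($char) = @_;
    my $cp   = ord($char);
    my $name = charnames::viacode($cp);
    return defined $name ? '\N{' . $name . '}' : sprintf('\N{U+%X}', $cp);
}

# Hypothetical helper: decode $octets under $encoding, then spell out
# every non-ASCII character by name so a wrong guess is easy to spot.
sub name_non_ascii {
    my ($encoding, $octets) = @_;
    my $text = decode($encoding, $octets);
    $text =~ s/(\P{ASCII})/name_for($1)/ge;
    return $text;
}

# Same sample sentence as in the demo above, stored as CP1252 bytes.
my $octets = encode("cp1252",
    "He said, \x{201C}I\x{2019}m r\x{E9}served.\x{201D}");

for my $guess ("iso-8859-1", "cp1252", "MacRoman") {
    my $named = name_non_ascii($guess, $octets);
    print "$guess: $named\n";
}
```

Only the guess that matches the real encoding yields quotation-mark names; the others show control characters or unrelated accented letters.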

Re^6: Encoding/decoding question
by slugger415 (Monk) on Sep 12, 2011 at 20:20 UTC

    heh - can't say I follow all that -- I save the FB page as HTML from Firefox, and run tidy on it to make it XHTML. I'm doing all this on Windows 7 so I have no idea how or where it's being encoded. Tidy does allow various encodings but I seem to be getting wonky results no matter what I set it at.

    Anyway I tried running uniquote on a text file (test.txt) containing only this string:

    sous réserve
    

    Here's what I got:

    > perl -nle 'print if /\P{ASCII}/' test.txt | uniquote.pl -vE cp1252
    Can't find string terminator "'" anywhere before EOF at -e line 1.
    

    Not sure what that means... appreciate the help...

        thanks --

        ok if I run uniquote, the first two seem correct:

        > perl -nle "print if /\P{ASCII}/" test.txt | uniquote.pl -vE cp1252
        sous r\N{LATIN SMALL LETTER E WITH ACUTE}serve
        
        > perl -nle "print if /\P{ASCII}/" test.txt | uniquote.pl -vE latin1
        sous r\N{LATIN SMALL LETTER E WITH ACUTE}serve
        
        
        > perl -nle "print if /\P{ASCII}/" test.txt | uniquote.pl -vE macroman
        sous r\N{LATIN CAPITAL LETTER E WITH GRAVE}serve
        

        So what's happening? (sorry, still clueless.)

Re^6: Encoding/decoding question
by slugger415 (Monk) on Sep 12, 2011 at 20:30 UTC
    If I set encoding to ASCII, it converts it to this:
    &#195;&#169;

    is that correct for the accented e? It still comes out wonky after my Perl script gets ahold of it:

    réserve
      The input is UTF-8, but you are treating it as Latin-1. You can’t do that. That is why you are getting that sort of output.
        Sorry to be dumb here, but where am I treating it as Latin-1? How do I change it? -- Scott
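
The UTF-8-read-as-Latin-1 mismatch mentioned above can be reproduced in a few lines. This is a sketch under assumptions: the script in question isn't shown, so the code only illustrates the general effect, and `code_points` is a hypothetical helper name. The two numbers in `&#195;&#169;` are 0xC3 and 0xA9, which are the two UTF-8 bytes of U+00E9.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: list a string's code points as "U+XXXX U+XXXX ...".
sub code_points {
    my ($string) = @_;
    return join " ", map { sprintf("U+%04X", ord($_)) } split(//, $string);
}

# The raw bytes behind "&#195;&#169;".
my $octets = "\xC3\xA9";

# Decoded with the wrong encoding: each byte becomes its own character.
my $as_latin1 = decode("iso-8859-1", $octets);

# Decoded with the right encoding: both bytes form one character.
my $as_utf8 = decode("UTF-8", $octets);

print "as Latin-1: ", code_points($as_latin1), "\n";
print "as UTF-8:   ", code_points($as_utf8), "\n";
```

The first line lists two characters (A with tilde, then the copyright sign), while the second lists the single accented e. In a script that reads a file, the decoding step is commonly done with an I/O layer such as `open(my $fh, "<:encoding(UTF-8)", $file)`; whether that applies here depends on how the script opens its input.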