Re^3: Encoding/decoding question
by ikegami (Patriarch) on Sep 11, 2011 at 20:26 UTC
|
I doubt that. I suspect the HTML was buggy too.
Could you show the HTML's HEAD element and the od -c output for réserve?
( Update: hum, .exe? You might not have od. Alternative: perl -nE"say unpack 'H*', $_ if /serv/;" file.html )
By the way, XML::LibXML has functions for parsing HTML.
| [reply] [d/l] [select] |
|
|
I doubt that. I suspect the HTML was buggy too.
Could you show the HTML's HEAD element and the od -c output for réserve?
( Update: hum, .exe? You might not have od. Alternative: perl -nE"say unpack 'H*', $_ if /serv/;" file.html )
I once again recommend the uniquote program for such things. It is really way better than od or cat -v or anything, because it actually shows you the proper characters.
$ perl -Mutf8 -CS -wle 'print "réserve"' | uniquote
r\N{U+E9}serve
$ perl -Mutf8 -CS -wle 'print "réserve"' | uniquote -x
r\x{E9}serve
$ perl -Mutf8 -CS -wle 'print "réserve"' | uniquote -v
r\N{LATIN SMALL LETTER E WITH ACUTE}serve
$ perl -Mutf8 -CS -wle 'print "réserve"' | uniquote -b
r\xC3\xA9serve
$ perl -Mutf8 -CS -wle 'print "réserve"' | uniquote --xml
réserve
$ perl -Mutf8 -CS -wle 'print "réserve"' | uniquote --html
réserve
$ perl -Mutf8 -CS -wle 'print "réserve"' | uniquote --html --verbose
réserve
$ perl -Mutf8 -CS -wle 'print "réserve"' | nfd | uniquote -v
re\N{COMBINING ACUTE ACCENT}serve
$ perl -Mutf8 -CS -wle 'print "réserve"' | iconv -f UTF-8 -t UTF-16 | uniquote --encoding=UTF-16 -x
r\x{E9}serve
$ perl -Mutf8 -CS -wle 'print "réserve"' | iconv -f UTF-8 -t UTF-16 | uniquote -b
\xFE\xFF\x00r\x00\xE9\x00s\x00e\x00r\x00v\x00e\x00
$ perl -Mutf8 -CS -wle 'print "réserve"' | iconv -f UTF-8 -t MacRoman | uniquote --encoding=MacRoman -x
r\x{E9}serve
$ perl -Mutf8 -CS -wle 'print "réserve"' | iconv -f UTF-8 -t MacRoman | uniquote -b
r\x8Eserve
$ perl -Mutf8 -CS -wle 'print "réserve"' > reserve.utf8
$ iconv -f UTF-8 -t MacRoman < reserve.utf8 > reserve.macroman
$ iconv -f UTF-8 -t UTF16-BE < reserve.utf8 > reserve.utf16be
$ uniwc reserve.{macroman,utf8,utf16be}
Paras Lines Words Graphs Chars Bytes File
0 1 1 8 8 8 reserve.macroman
0 1 1 8 8 9 reserve.utf8
0 1 1 8 8 16 reserve.utf16be
$ uniquote reserve.{macroman,utf8,utf16be}
r\N{U+E9}serve
r\N{U+E9}serve
r\N{U+E9}serve
$ uniquote -b reserve.{macroman,utf8,utf16be}
r\x8Eserve
r\xC3\xA9serve
\x00r\x00\xE9\x00s\x00e\x00r\x00v\x00e\x00
See how nifty that is?
| [reply] [d/l] |
|
|
Yes, uniquote -b produces output similar to od -c, but why would I have the user approximate what I want using a tool he doesn't have?
| [reply] [d/l] [select] |
That assumes the OP has a UTF-8-capable shell, doesn't it?
| [reply] |
|
|
I see tidy.exe is giving me this message:
Character codes 128 to 159 (U+0080 to U+009F) are not allowed in HTML;
even if they were, they would likely be unprintable control characters.
Tidy assumed you wanted to refer to a character with the same byte value in the
specified encoding and replaced that reference with the Unicode equivalent.
Here's the very top of the original (pre-tidy) HTML file (from our friend Facebook):
<!DOCTYPE HTML>
<html class=" videoCallEnabled" id="facebook" lang="en"><head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<meta charset="utf-8"><script>CavalryLogger=false;window._script_path = "\/home.php";window._EagleEyeSeed="Nq0j";</script><noscript> <meta http-equiv="refresh" content="0; URL=/?_fb_noscript=1" /> </noscript>
<meta name="robots" content="noodp,noydir">
... followed by loads of scripts and stylesheets.
The output from your command above, run on the html file, produced thousands of characters such as:
3c6c696e6b2068726566...
Not sure if you're looking for anything in particular. Thanks for your help, Scott
| [reply] |
|
|
Character codes 128 to 159 (U+0080 to U+009F) are not allowed in HTML;
even if they were, they would likely be unprintable control characters.
Tidy assumed you wanted to refer to a character with the same byte value in the
specified encoding and replaced that reference with the Unicode equivalent.
That’s the C1 Control Set, which means this is surely an encoding error. I’ll bet anything you’re supplying the old Windows-1252 legacy encoding as input but telling something it is really ISO-8859-1 when it isn’t.
cp1252 0x80 ⇒ U+20AC < € > \N{EURO SIGN}
cp1252 0x81 ⇒ U+FFFD < � > \N{REPLACEMENT CHARACTER}
cp1252 0x82 ⇒ U+201A < ‚ > \N{SINGLE LOW-9 QUOTATION MARK}
cp1252 0x83 ⇒ U+0192 < ƒ > \N{LATIN SMALL LETTER F WITH HOOK}
cp1252 0x84 ⇒ U+201E < „ > \N{DOUBLE LOW-9 QUOTATION MARK}
cp1252 0x85 ⇒ U+2026 < … > \N{HORIZONTAL ELLIPSIS}
cp1252 0x86 ⇒ U+2020 < † > \N{DAGGER}
cp1252 0x87 ⇒ U+2021 < ‡ > \N{DOUBLE DAGGER}
cp1252 0x88 ⇒ U+02C6 < ˆ > \N{MODIFIER LETTER CIRCUMFLEX ACCENT}
cp1252 0x89 ⇒ U+2030 < ‰ > \N{PER MILLE SIGN}
cp1252 0x8A ⇒ U+0160 < Š > \N{LATIN CAPITAL LETTER S WITH CARON}
cp1252 0x8B ⇒ U+2039 < ‹ > \N{SINGLE LEFT-POINTING ANGLE QUOTATION MARK}
cp1252 0x8C ⇒ U+0152 < Œ > \N{LATIN CAPITAL LIGATURE OE}
cp1252 0x8D ⇒ U+FFFD < � > \N{REPLACEMENT CHARACTER}
cp1252 0x8E ⇒ U+017D < Ž > \N{LATIN CAPITAL LETTER Z WITH CARON}
cp1252 0x8F ⇒ U+FFFD < � > \N{REPLACEMENT CHARACTER}
cp1252 0x90 ⇒ U+FFFD < � > \N{REPLACEMENT CHARACTER}
cp1252 0x91 ⇒ U+2018 < ‘ > \N{LEFT SINGLE QUOTATION MARK}
cp1252 0x92 ⇒ U+2019 < ’ > \N{RIGHT SINGLE QUOTATION MARK}
cp1252 0x93 ⇒ U+201C < “ > \N{LEFT DOUBLE QUOTATION MARK}
cp1252 0x94 ⇒ U+201D < ” > \N{RIGHT DOUBLE QUOTATION MARK}
cp1252 0x95 ⇒ U+2022 < • > \N{BULLET}
cp1252 0x96 ⇒ U+2013 < – > \N{EN DASH}
cp1252 0x97 ⇒ U+2014 < — > \N{EM DASH}
cp1252 0x98 ⇒ U+02DC < ˜ > \N{SMALL TILDE}
cp1252 0x99 ⇒ U+2122 < ™ > \N{TRADE MARK SIGN}
cp1252 0x9A ⇒ U+0161 < š > \N{LATIN SMALL LETTER S WITH CARON}
cp1252 0x9B ⇒ U+203A < › > \N{SINGLE RIGHT-POINTING ANGLE QUOTATION MARK}
cp1252 0x9C ⇒ U+0153 < œ > \N{LATIN SMALL LIGATURE OE}
cp1252 0x9D ⇒ U+FFFD < � > \N{REPLACEMENT CHARACTER}
cp1252 0x9E ⇒ U+017E < ž > \N{LATIN SMALL LETTER Z WITH CARON}
cp1252 0x9F ⇒ U+0178 < Ÿ > \N{LATIN CAPITAL LETTER Y WITH DIAERESIS}
I suppose you might be using the old MacRoman legacy encoding, as that also uses code points in the C1 Control Set for alternate purposes:
MacRoman 0x80 ⇒ U+00C4 < Ä > \N{LATIN CAPITAL LETTER A WITH DIAERESIS}
MacRoman 0x81 ⇒ U+00C5 < Å > \N{LATIN CAPITAL LETTER A WITH RING ABOVE}
MacRoman 0x82 ⇒ U+00C7 < Ç > \N{LATIN CAPITAL LETTER C WITH CEDILLA}
MacRoman 0x83 ⇒ U+00C9 < É > \N{LATIN CAPITAL LETTER E WITH ACUTE}
MacRoman 0x84 ⇒ U+00D1 < Ñ > \N{LATIN CAPITAL LETTER N WITH TILDE}
MacRoman 0x85 ⇒ U+00D6 < Ö > \N{LATIN CAPITAL LETTER O WITH DIAERESIS}
MacRoman 0x86 ⇒ U+00DC < Ü > \N{LATIN CAPITAL LETTER U WITH DIAERESIS}
MacRoman 0x87 ⇒ U+00E1 < á > \N{LATIN SMALL LETTER A WITH ACUTE}
MacRoman 0x88 ⇒ U+00E0 < à > \N{LATIN SMALL LETTER A WITH GRAVE}
MacRoman 0x89 ⇒ U+00E2 < â > \N{LATIN SMALL LETTER A WITH CIRCUMFLEX}
MacRoman 0x8A ⇒ U+00E4 < ä > \N{LATIN SMALL LETTER A WITH DIAERESIS}
MacRoman 0x8B ⇒ U+00E3 < ã > \N{LATIN SMALL LETTER A WITH TILDE}
MacRoman 0x8C ⇒ U+00E5 < å > \N{LATIN SMALL LETTER A WITH RING ABOVE}
MacRoman 0x8D ⇒ U+00E7 < ç > \N{LATIN SMALL LETTER C WITH CEDILLA}
MacRoman 0x8E ⇒ U+00E9 < é > \N{LATIN SMALL LETTER E WITH ACUTE}
MacRoman 0x8F ⇒ U+00E8 < è > \N{LATIN SMALL LETTER E WITH GRAVE}
MacRoman 0x90 ⇒ U+00EA < ê > \N{LATIN SMALL LETTER E WITH CIRCUMFLEX}
MacRoman 0x91 ⇒ U+00EB < ë > \N{LATIN SMALL LETTER E WITH DIAERESIS}
MacRoman 0x92 ⇒ U+00ED < í > \N{LATIN SMALL LETTER I WITH ACUTE}
MacRoman 0x93 ⇒ U+00EC < ì > \N{LATIN SMALL LETTER I WITH GRAVE}
MacRoman 0x94 ⇒ U+00EE < î > \N{LATIN SMALL LETTER I WITH CIRCUMFLEX}
MacRoman 0x95 ⇒ U+00EF < ï > \N{LATIN SMALL LETTER I WITH DIAERESIS}
MacRoman 0x96 ⇒ U+00F1 < ñ > \N{LATIN SMALL LETTER N WITH TILDE}
MacRoman 0x97 ⇒ U+00F3 < ó > \N{LATIN SMALL LETTER O WITH ACUTE}
MacRoman 0x98 ⇒ U+00F2 < ò > \N{LATIN SMALL LETTER O WITH GRAVE}
MacRoman 0x99 ⇒ U+00F4 < ô > \N{LATIN SMALL LETTER O WITH CIRCUMFLEX}
MacRoman 0x9A ⇒ U+00F6 < ö > \N{LATIN SMALL LETTER O WITH DIAERESIS}
MacRoman 0x9B ⇒ U+00F5 < õ > \N{LATIN SMALL LETTER O WITH TILDE}
MacRoman 0x9C ⇒ U+00FA < ú > \N{LATIN SMALL LETTER U WITH ACUTE}
MacRoman 0x9D ⇒ U+00F9 < ù > \N{LATIN SMALL LETTER U WITH GRAVE}
MacRoman 0x9E ⇒ U+00FB < û > \N{LATIN SMALL LETTER U WITH CIRCUMFLEX}
MacRoman 0x9F ⇒ U+00FC < ü > \N{LATIN SMALL LETTER U WITH DIAERESIS}
If it were me, the way I would figure this out is like this:
$ perl -nle 'print if /\P{ASCII}/' inputfile | uniquote -vE cp1252
$ perl -nle 'print if /\P{ASCII}/' inputfile | uniquote -vE latin1
$ perl -nle 'print if /\P{ASCII}/' inputfile | uniquote -vE macroman
And then eyeball which of those looks like it has the right character names.
Here’s a demo. We’ll create a CP1252 text file, then look at what happens if we specify the right vs wrong encoding:
$ perl -wle 'binmode(STDOUT, "encoding(cp1252)")||die; print "He said, \x{201C}I\x{2019}m r\x{E9}served.\x{201D}"' > sample
$ uniquote -vE latin1 sample
He said, \N{SET TRANSMIT STATE}I\N{PRIVATE USE TWO}m r\N{LATIN SMALL LETTER E WITH ACUTE}served.\N{CANCEL CHARACTER}
$ uniquote -vE cp1252 sample
He said, \N{LEFT DOUBLE QUOTATION MARK}I\N{RIGHT SINGLE QUOTATION MARK}m r\N{LATIN SMALL LETTER E WITH ACUTE}served.\N{RIGHT DOUBLE QUOTATION MARK}
$ uniquote -vE macroman sample
He said, \N{LATIN SMALL LETTER I WITH GRAVE}I\N{LATIN SMALL LETTER I WITH ACUTE}m r\N{LATIN CAPITAL LETTER E WITH GRAVE}served.\N{LATIN SMALL LETTER I WITH CIRCUMFLEX}
You should be able to eyeball those to see which ones have the right character names, but in case you find that harder than it should be, you can always make sure that the numbers you get out are right instead.
$ perl -wle 'binmode(STDOUT, "encoding(cp1252)")||die; print "He said, \x{201C}I\x{2019}m r\x{E9}served.\x{201D}"' | uniquote --encoding cp1252
He said, \N{U+201C}I\N{U+2019}m r\N{U+E9}served.\N{U+201D}
Again, that does require the uniquote tool, and Perl 5.10.1, to run.
| [reply] [d/l] [select] |
Re^3: Encoding/decoding question
by mirod (Canon) on Sep 13, 2011 at 08:36 UTC
|
You can use HTML::TreeBuilder to parse the HTML, then output it as XHTML using the as_XML method, which works most of the time. It may not help with the encoding problem though, especially if the HTML lies about its encoding. XML::Twig can do this for you, BTW, so you may not need tidy at all: install HTML::TreeBuilder, then use XML::Twig's parsefile_html to parse the HTML.
Also HTML::Tidy uses a fork of tidy, and may be worth a try.
| [reply] |