HTML::Entities and Unicode quotes

tod222 has asked for the wisdom of the Perl Monks concerning the following question:

Re: HTML::Entities and Unicode quotes by Your Mother (Archbishop) on Aug 20, 2011 at 02:06 UTC
This will, I hope, explain what’s going on:

    use warnings;
    use strict;
    use Encode;
    use HTML::Entities;

    my $str = "\xe2\x80\x9cquotes\xe2\x80\x9d";
    print "Is $str UTF-8? ",
        Encode::is_utf8($str) ? "Yes!\n" : "No...\n";

    $str = decode("UTF-8", $str, Encode::FB_CROAK);
    binmode STDOUT, ":encoding(UTF-8)";
    print "It's still $str... UTF-8 now? ",
        Encode::is_utf8($str) ? "Yes!\n" : "No...\n";

    my $wide_chars = "\x{201C}quotes\x{201D}";
    print "How about this version: $wide_chars? ",
        Encode::is_utf8($wide_chars) ? "Yes!\n" : "No...\n";

    print "Entities: ", encode_entities($str), $/;

    __END__
    Is “quotes” UTF-8? No...
    It's still “quotes”... UTF-8 now? Yes!
    How about this version: “quotes”? Yes!
    Entities: &ldquo;quotes&rdquo;

Update: changed `$non_combining` to `$wide_chars` as the name was misleading.
Re^2: HTML::Entities and Unicode quotes by ikegami (Patriarch) on Aug 20, 2011 at 06:22 UTC
The internal storage format (returned by `is_utf8`) has absolutely nothing to do with this.

    use Encode;
    use HTML::Entities;

    my $str = "\xe2\x80\x9cquotes\xe2\x80\x9d";

    utf8::downgrade($str);
    print Encode::is_utf8($str) ? 1 : 0, " ", encode_entities($str), "\n";

    utf8::upgrade($str);
    print Encode::is_utf8($str) ? 1 : 0, " ", encode_entities($str), "\n";

Output:

    0 &acirc;&#128;&#156;quotes&acirc;&#128;&#157;
    1 &acirc;&#128;&#156;quotes&acirc;&#128;&#157;
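A minimal sketch of the same point without HTML::Entities, assuming only core Perl: the variable names `$downgraded` and `$upgraded` are illustrative and not from the thread. It shows that flipping the internal storage format changes what `utf8::is_utf8` reports but not the characters the string contains.

```perl
use strict;
use warnings;

# Twelve characters, all with code points below 256
# (the undecoded UTF-8 bytes of the quoted word).
my $downgraded = "\xe2\x80\x9cquotes\xe2\x80\x9d";
my $upgraded   = $downgraded;

utf8::downgrade($downgraded);   # internal storage: one byte per character
utf8::upgrade($upgraded);       # internal storage: UTF-8

# The internal flag differs between the two copies...
printf "flag: %d vs %d\n",
    (utf8::is_utf8($downgraded) ? 1 : 0),
    (utf8::is_utf8($upgraded)   ? 1 : 0);

# ...but they are still the same sequence of twelve characters.
printf "equal: %s\n", ($downgraded eq $upgraded ? "yes" : "no");
printf "length: %d vs %d\n", length($downgraded), length($upgraded);
```

Since the two strings compare equal, any function that works on characters has to treat them identically, which is why the flag says nothing about whether the data has been decoded.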
Re^3: HTML::Entities and Unicode quotes by Your Mother (Archbishop) on Aug 20, 2011 at 16:41 UTC
It was just to show what perl assumed the natural state of the strings to be. You artificially flipped the switch on and off, with no decoding or encoding involved. You also showed code using the functions of utf8, which is probably a bad example to set. You know exactly what you’re doing, but when someone who doesn’t sees a top monk using them, they think: oh, that must be a good idea, I’ll use `upgrade` and `downgrade` to “fix” my encodings too.
Re^4: HTML::Entities and Unicode quotes by ikegami (Patriarch) on Aug 20, 2011 at 20:59 UTC
Re^5: HTML::Entities and Unicode quotes by Your Mother (Archbishop) on Aug 20, 2011 at 22:48 UTC
Re^2: HTML::Entities and Unicode quotes by tod222 (Pilgrim) on Aug 22, 2011 at 06:23 UTC
Thank you for this excellent response. I found it quite illuminating.

Before posting I'd spent about 30 minutes reading perlunifaq and searching here on PerlMonks without things getting much clearer. In fact, some of what I read here was a bit disconcerting; the complaints that Perl no longer 'just worked' seemed apropos.

One source of my original confusion was that I had a file containing \xe2\x80\x9c and \xe2\x80\x9d sequences when examined using 'od -t x1 foo2' which would display correctly on Ubuntu with 'cat' in gterm. Since the Unicode table I linked showed that the sequences were valid representations of “ and ” I wondered why HTML::Entities wasn't handling it correctly, particularly when cat could.

Thanks for pointing out Encode::is_utf8($str), as I'd been wondering if there was something like this.

A couple of things are still puzzling me, though. One is, the \xe2\x80\x9d sequence is in an encoding. What's it called?

The other is that I'd like for Perl to 'just work' to whatever extent possible. Is there something that can be set at the start of a script to have all Perl IO default to ":encoding(UTF-8)"?
Re^3: HTML::Entities and Unicode quotes by ikegami (Patriarch) on Aug 22, 2011 at 06:42 UTC
"Thanks for pointing out Encode::is_utf8($str), as I'd been wondering if there was something like this."

Ack! Please don't use that. It does NOT indicate whether something has been decoded or not. You have been misinformed.

"A couple of things are still puzzling me, though. One is, the \xe2\x80\x9d sequence is in an encoding. What's it called?"

It's the UTF-8 encoding of U+201D RIGHT DOUBLE QUOTATION MARK.

"Is there something that can be set at the start of a script to have all Perl IO default to ":encoding(UTF-8)"?"

There is the open pragma. It's not perfect, but it'll do a lot. It can handle STDIN, STDOUT and STDERR, and it can set the default for `open`.
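To make the relationship between the three bytes and U+201D concrete, here is a small sketch using `decode` and `encode` from Encode. The variable names are illustrative, and `Encode::LEAVE_SRC` is added here (it is not in the thread) so that `decode` does not modify `$bytes` while checking it.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The three bytes seen in the od -t x1 dump.
my $bytes = "\xe2\x80\x9d";

# Decode them as UTF-8; die on malformed input, and keep $bytes intact.
my $text = decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);

printf "bytes: %d, characters: %d, code point: U+%04X\n",
    length($bytes), length($text), ord($text);

# Encoding the character again gives back the original three bytes.
my $again = encode('UTF-8', $text);
print "round trip ok\n" if $again eq $bytes;
```

Decoding turns three bytes into one character, and that one character is what text-oriented functions such as `encode_entities` expect to receive.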
Re^4: HTML::Entities and Unicode quotes by tod222 (Pilgrim) on Aug 23, 2011 at 03:46 UTC
Re^5: HTML::Entities and Unicode quotes by ikegami (Patriarch) on Aug 23, 2011 at 06:10 UTC
Re^3: HTML::Entities and Unicode quotes by Anonymous Monk on Aug 22, 2011 at 06:51 UTC
See perlrun#-C [number/list] and open:

    use open              # make these handles
      ':std',             # STDIN/STDOUT/STDERR
      'IO',               # and any I open
      ':encoding(UTF-8)'; # use strict UTF-8

And don't use is_utf8 :) See perlunitut: Unicode in Perl#What about the UTF-8 flag?
Re^4: HTML::Entities and Unicode quotes by tod222 (Pilgrim) on Aug 23, 2011 at 03:54 UTC
Re: HTML::Entities and Unicode quotes by ikegami (Patriarch) on Aug 20, 2011 at 06:35 UTC
`encode_entities` expects a string of text as its argument. As far as it's concerned, you passed the following 12 characters:

    U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX
    U+0080 An unnamed control character
    U+009C STRING TERMINATOR, a control character
    q
    u
    o
    t
    e
    s
    U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX
    U+0080 An unnamed control character
    U+009D OPERATING SYSTEM COMMAND, a control character

You want to pass

    U+201C LEFT DOUBLE QUOTATION MARK
    q
    u
    o
    t
    e
    s
    U+201D RIGHT DOUBLE QUOTATION MARK

So instead of

    encode_entities("\xe2\x80\x9cquotes\xe2\x80\x9d")

you should have used

    encode_entities("\x{201C}quotes\x{201D}")

What you passed is the bytes resulting from encoding "\x{201C}quotes\x{201D}" using UTF-8, so you could also use

    encode_entities(decode("UTF-8", "\xe2\x80\x9cquotes\xe2\x80\x9d"))

`decode` comes from Encode.
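When the bytes come from a file rather than a string literal, the decoding can be done by an I/O layer as the data is read. The following is a hedged sketch, not code from the thread: it uses an in-memory filehandle over a scalar (`$file_bytes`, an illustrative stand-in for the real file) so that it is self-contained, and it assumes HTML::Entities is installed.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

# Stand-in for the file on disk: the same bytes a hex dump would show.
my $file_bytes = "\xe2\x80\x9cquotes\xe2\x80\x9d\n";

# Open an in-memory "file" over those bytes; the :encoding(UTF-8)
# layer decodes them into characters as each line is read.
open my $fh, '<:encoding(UTF-8)', \$file_bytes
    or die "Cannot open in-memory file: $!";
my $line = <$fh>;
close $fh;

chomp $line;
print length($line), " characters\n";   # 8: two quotation marks plus "quotes"
print encode_entities($line), "\n";     # &ldquo;quotes&rdquo;
```

With a real file, the same layer would go in the mode argument of `open` in place of the scalar reference, for example `open my $fh, '<:encoding(UTF-8)', $path`, where `$path` is whatever file name applies.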
Re: HTML::Entities and Unicode quotes by Anonymous Monk on Aug 20, 2011 at 02:08 UTC
"HTML::Entities isn't correctly handling quotes as defined in this Unicode table."

A poor workman blames his tools :)

    #!/usr/bin/perl --
    use strict;
    use warnings;
    use utf8;
    use HTML::Entities;

    binmode STDOUT, ':encoding(UTF-8)';

    {
        my $line = "This is a test of \xe2\x80\x9cquotes\xe2\x80\x9d\n";
        print encode_entities($line, "\200-\377");
        # looking for &ldquo; & &rdquo; in the output
        print $line;
    }
    print '#' x 11, "\n";
    {
        my $line = 'xThis is a test of '.chr(8220).'quotes'.chr(8222)."\n";
        print encode_entities($line, '\x{201c}\x{201e}');
        print $line;
    }
    print '#' x 11, "\n";
    {
        my $line = 'xThis is a test of '.chr(8220).'quotes'.chr(8222)."\n";
        print encode_entities($line, chr(8220).chr(8222));
        print $line;
    }

    __END__
    This is a test of &acirc;&#128;&#156;quotes&acirc;&#128;&#157;
    This is a test of âquotesâ
    ###########
    xThis is a test of &ldquo;quotes&bdquo;
    xThis is a test of “quotes„
    ###########
    xThis is a test of &ldquo;quotes&bdquo;
    xThis is a test of “quotes„